This article will give an opinionated view on establishing a cloud product development strategy.
by Adeel Ahmad
With the rise of platform teams and a paradigm shift towards a producer/consumer model, there needs to be a strategy formulated for product development.
Platform teams are challenged with growth - scaling up infrastructure on demand. Whilst this sounds easy, enterprise constraints make it very difficult to provide the on-demand experience in its vanilla form.
Whilst platform teams are adopting practices such as wrapping a cloud service with controls that pacify risk organisations, presenting such a centrally managed service as a product of the platform would be a pretence.
A service mindset creates an impediment to building scalable products.
Most marketing materials claim the single most significant difference between a product and service is tangible vs intangible. The critical difference most relevant to the platform engineering movement is the transfer of ownership.
“A service is an activity or benefit that one party can offer to another that does not result in the ownership of anything.”
Consider a scenario where a platform team has developed a Terraform module for a cloud-managed SQL instance. If the platform team is still responsible for the support, uptime, and ongoing maintenance of the provisioned instance, no transfer of ownership has occurred. Does this scenario describe a product or a (shared) service?
A service-only catalogue for the (enterprise) masses does not scale with an intentionally lean platform team. This very behaviour and engagement type is the ultimate hindrance to accelerating cloud consumption at the enterprise scale.
The rest of the article explains the steps and components needed to confidently build a shared responsibility model for a cloud product.
Understanding the fundamental difference between a product and a service changes the course of the product development process.
It is reasonable to declare that a central team assumes all responsibility for a shared service, whilst a platform-managed product requires establishing a shared responsibility model.
Developing a platform product that allows for autonomous consumption and ownership transfer provides true scalability in an enterprise of any size. With this core principle in mind, you have a new set of challenges that forces you to get creative in designing and packaging a product offering.
The essence of product development is to cultivate, maintain, and increase an organisation’s (internal) ‘market’ share (the competition would be shadow infrastructure) by satisfying consumer demand. Identifying the minimal scope and true nature of the consumer problem is critical - this is validated/falsified during the prototyping stage.
A product development runway should be a list of hypotheses to test, not requirements to build. It will help with reducing the scope to its essence.
Features delivered are not a measure of success; business outcomes are. The runway is a series of questions that product owners must test to reduce uncertainty and improve the developer experience.
The above is why the driving metric for product development should be time-to-learn - the time it takes to validate an idea (or perceived risks) with a consumer and better understand its value proposition.
With that said, let’s dig deeper into establishing a framework that repeatedly validates/falsifies hypotheses.
I’ve found that what I’ve been practising organically for several years is a well-known method: the scientific method. It is the process of objectively establishing facts through testing and experimentation. Sound familiar?
The scientific method involves making conjectures (hypothetical explanations), deriving predictions from the hypotheses as logical consequences, and conducting experiments or empirical observations based on those predictions.
The difference between expected and actual results indicates which hypothesis better explains the data from the experiment. If proven, a hypothesis can later become a fact.
The diagram below illustrates the process to follow to establish facts successfully.
So let’s put this into perspective as an example scenario:
Hypothesis: I hypothesise that a flat network subnet in the cloud isn’t genuinely flat. Since there is no broadcast domain in the cloud, each IP address in a given subnet is in its own domain.
Prediction: On this basis, I predict that two VMs provisioned in the same subnet cannot communicate unless an explicit (firewall) rule is implemented to allow communication between them.
Experiment: Develop a Terraform module for a VM placed on a ‘default’ subnet in the ‘default’ VPC. Invoke the module twice and test connectivity.
Conclusion: (on the basis this was proven) the Terraform module for VMs can be further simplified, avoiding conditional logic such as environment-based networking. A lean and scalable module is simple enough to delegate consumption directly to developers.
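The experiment might be sketched in Terraform as follows. This is a minimal sketch, assuming GCP; the resource names, zone, and image are illustrative, and it assumes a network without the auto-created allow-internal rules (as with a custom-mode VPC), so the prediction can be tested cleanly:

```hcl
# Two VMs on the same subnet. Prediction: they cannot reach each other
# until the explicit firewall rule below is added.
resource "google_compute_instance" "vm" {
  for_each     = toset(["vm-a", "vm-b"])
  name         = each.key
  machine_type = "e2-micro"
  zone         = "europe-west1-b"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }

  network_interface {
    subnetwork = "default" # both instances land on the same subnet
  }
}

# The explicit rule from the prediction: without it, even intra-subnet
# ICMP should fail.
resource "google_compute_firewall" "allow_internal_icmp" {
  name    = "allow-internal-icmp"
  network = "default"

  allow {
    protocol = "icmp"
  }

  source_ranges = ["10.128.0.0/9"] # illustrative internal range
}
```

Running a ping between the two instances before and after adding the firewall rule gives the expected-versus-actual comparison the experiment calls for.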
The scientific method forms a core part of this hypothesis validation framework. The proliferation of the assumptions and hypotheses brings out the value of this method. Another well-known practice that can help with this is Threat Modeling.
Threat modelling is a principle that can be applied to enterprises beyond technology. It helps enterprises identify the longer, well-lit walk. It assesses the impact on mission and business processes based on the probability and likelihood of a risk. More importantly, it mitigates what matters.
By definition, threat modelling is a conceptual exercise that aims at understanding what characteristics of a system’s design should be modified to minimise the security risk for the system's owners, users, and operators.
The target framework will conclude by developing a rationale for mitigation, ultimately contributing to the common language across the organisation.
For product development, the threat modelling practice should be less in-depth but greater in breadth. Since human errors are commonly considered threats/risks, people and processes inherently become contributing factors.
Having a cognitively diverse set of people involved accelerates this and lowers the barrier to entry to a shared and holistic understanding of the threat. The principal focus should be to build muscle memory. With maturity, there will be a rich set of sane defaults.
There are guidelines on how to establish a threat modelling exercise, as well as associated well-known frameworks that can be followed. Given that we are adapting threat modelling for product development, we must consider a framework conducive to lean methodology and focus on consumer outcomes.
PASTA is one such framework, heavily adopted by GitLab. PASTA has several advantages over other frameworks concerning lean product development:
Risk-centric approach: mitigating what matters.
Focus on the probability of risk, likelihood, inherent risk, and impact of a compromise.
Impact on mission and business processes.
Begin with stating the business objective of the project.
Practically everyone in your development team has a stake in a threat model:
Product Owners: have insights into consumer behaviour and business context that others may lack.
Architects: want to validate the design and produce repeatable blueprints from it.
Developers: want to both receive guidance and provide feedback on changes made to the design during implementation.
Engineering: use it for architectural review and security controls on deployment.
The diagram above shows the typical relationship in a threat modelling exercise. The critical part to focus on here is Threat-Agents, Threats and Vulnerabilities. Correlating this back to the scientific method - we can reasonably place them under hypothesis and predictions.
The critical piece to note here is to understand the threat's scope, which should always be based on risk factors, i.e. what is in scope for the MVP and consumption of the said product.
Let’s put this into perspective in the following example scenario:
The consumer is developing a marketing application.
Require a SQL database that will capture metadata.
CIA data classification is “public”.
The application process will be the only entity inputting/processing data.
Based on the above scenario, the scope of the threat should be minimal, so typical countermeasures like data encryption at rest aren't applicable. Neither is human authentication. The first iteration of Sentinel policies (Policy as Code, PaC) would leave out mandating customer-managed KMS keys to encrypt SQL.
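Under that minimal threat scope, a first-iteration module sketch might look like the following. This is a GCP example with hypothetical names; the point is what is deliberately left out:

```hcl
# SQL instance for "public"-classified metadata in a marketing app.
# Deliberately out of threat scope for this iteration:
#   - customer-managed encryption key (CMEK) for data at rest
#   - human database users (the application process is the only consumer)
resource "google_sql_database_instance" "marketing_metadata" {
  name             = "marketing-metadata"
  database_version = "POSTGRES_15"
  region           = "europe-west1"

  settings {
    tier = "db-f1-micro"
    # Intentionally omitted: settings.encryption_key_name (CMEK).
  }
}
```

Keeping the scope this lean is exactly what makes the module simple enough to hand to consumers directly; the countermeasures return to scope only when the data classification changes.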
The recommendation is to start threat modelling ‘little and often’, especially when the product undergoes multiple iterations of use cases and functional requirements. To start, I recommend a session length of 90 minutes. You must give the team the time and space to learn the structure and security concepts involved. The most impactful threat modelling session I participated in took less than 15 minutes. Short and snappy sessions are possible once the team has built 'muscle memory' with the practice.
Be guided by what has the most value right now.
You’ll see a common theme with both the scientific method and the threat modelling framework - both pursue acquiring knowledge of the unknown and establishing known unknowns. Both can be measured by the same metric: time-to-learn.
The practice of threat modelling can be encapsulated in the scientific method.
In the case of threat modelling, once perceived risks can be validated as actual risks, the next step is to identify countermeasures.
Traditionally, infrastructure deployments were performed through a UI-driven management interface. The RBAC model was coarse: either just enough permission to consume resources, or full-administrator privileges. This deployment and permissions model drove the use of detective controls as the principal countermeasure. Detective controls manifested as a human-reviewed change-control process, with an additional monitoring and logging system to capture missed risks.
Today, both the deployment and permissions models have changed. Infrastructure-As-Code principally drives the deployment model, and IAM permissions are much more granular, to the extent of defining permissions against explicit resources.
The change in the deployment model allows us to shift our principal control mode to be preventative rather than reactive or detective. Examples of preventive controls are:
Cloud-native IAM roles
Cloud-native Policy Controls
Policy as Code, such as HashiCorp Sentinel
Emphasising preventative over detective provides multiple benefits:
Improving security posture by shifting controls to the left.
Stronger auditability of secure and compliant deployment of infrastructure resources.
Increased confidence in infrastructure deployment.
Technical assurance opens opportunities to decentralise the deployment of infrastructure resources.
Let’s put this into perspective with an example scenario where:
SQL instance as the principal product to develop.
Given the data in question has a classification value of confidential, compliance requires that the data in the SQL instance is encrypted using a company-managed encryption key.
GDPR requirements mean that the SQL instance must be deployed in European regions.
In the above example, you have two options (or both):
CSP-native policy controls mandating that all SQL instances are:
Deployed in select regions only.
Encrypted using a pre-provisioned KMS key.
Define Sentinel policies that mandate:
If the SQL instance, as a Terraform resource, has the label data: confidential,
then the KMS key parameter must be filled with a valid value (e.g. an AWS ARN or a GCP self-link of the KMS key),
and the region specified as a parameter must be from a list of approved regions (i.e. within Europe).
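The first option might be sketched as GCP organisation policies managed via Terraform. The constraint names below are real GCP constraints, while the organisation ID and values are illustrative:

```hcl
# Restrict where any resource (including SQL instances) can be deployed.
resource "google_org_policy_policy" "resource_locations" {
  name   = "organizations/123456789/policies/gcp.resourceLocations"
  parent = "organizations/123456789"

  spec {
    rules {
      values {
        allowed_values = ["in:eu-locations"] # European locations only
      }
    }
  }
}

# Require customer-managed encryption keys (CMEK) for Cloud SQL.
resource "google_org_policy_policy" "restrict_non_cmek" {
  name   = "organizations/123456789/policies/gcp.restrictNonCmekServices"
  parent = "organizations/123456789"

  spec {
    rules {
      values {
        denied_values = ["sqladmin.googleapis.com"] # Cloud SQL must use CMEK
      }
    }
  }
}
```

Note that these CSP-native controls apply unconditionally at the organisation level; the conditional, label-driven behaviour belongs to the Sentinel option.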
These policies prevent a SQL instance from being deployed unless it meets the stated conditions. The CSP-based policies and PaC appear mutually exclusive; whether they actually are depends on how your organisation is set up.
If, for example, your organisation is European-based, all resources can only be deployed in Europe anyway - having a blanket policy defined at the CSP’s top-level construct is valid. Other reasons may have you lean towards PaC:
Peripheral resources defined in Terraform that fall outside the CSP platform require policing.
The organisation is set up for Multi-Cloud.
Policies need to be conditional and only enforced for matched scenarios.
Desire to shorten the feedback loop and error before APIs are called.
PaC is the most powerful and versatile of the preventative controls. However, preventative controls should be thought of holistically: spread the points of power to get maximum leverage.
There is still value to be gained from detective controls in combination with preventative controls. Duplicating effort is useless: manually defining the same checks as detective controls only to validate what preventative controls have already enforced adds nothing, since resources can only be deployed after passing the preventative controls. Detective controls become valuable when they manifest as a continuous scanning tool connected to an external database populated with the latest information on risks and vulnerabilities (e.g., CVEs). Prisma Cloud is a great example.
Detective controls can act as a feedback loop to preventative controls; as and when new vulnerabilities are discovered, this becomes a backlog to update existing preventive controls.
At this point, your product development flow using Terraform modules should look like the diagram below:
We’ve reached a significant milestone and are much closer to increasing our knowledge of a minimal viable product. What closes this knowledge gap is identifying the correct consumption patterns. There are three components to think about when developing a product (Terraform module):
Functional vs non-functional
Product vs service
Interactive vs non-interactive
It’s sufficient to say we’re all aligned on what functional requirements are. I’ll use a SQL instance as the example product, where the functional requirement is to stand up a SQL instance.
Non-functional requirements are broad in scope, but following the principles defined earlier, we only scope in the minimum required to get the functional product consumed in the right environment. We will also put the non-functional requirements into question, i.e. treat them as hypotheses: non-functional hypotheses.
The list of hypotheses will depend on context; the following may come into scope at the appropriate stage of development:
Security (various levels)
Reliability and Availability
Maintainability and Manageability
The non-functionals are considered part of the composite product, not the product itself. Therefore, in isolation, a non-functional can be viewed as a service, depending on how it is deployed. For example, if setting up a dynamic secrets management process for the given product requires an API call to a central system like HashiCorp Vault, then this is a service.
Since the principal consumption point is the SQL instance (in our example), we should minimise the target interactive pattern to just that: the deployment of the SQL instance. The rest of the non-functionals and services remain non-interactive.
Making a (non-interactive) service part of the composite product incorporates a value-add, providing an experience of a managed product whilst contributing to minimising the cognitive load of the respective developer.
Following this principle of non-interactive consumption patterns, the countermeasures identified as part of the threat model should be incorporated into (or context of) the composite product as non-interactive, providing a streamlined developer experience.
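As a sketch, the composite could wrap the functional module together with a non-interactive secrets service. The module path and its `private_ip` output are hypothetical; the Vault resource is from the hashicorp/vault Terraform provider, with illustrative values:

```hcl
# The consumer's interactive surface: only the functional inputs.
variable "name" {
  type = string
}

module "sql" {
  source = "./modules/sql-instance" # hypothetical functional module
  name   = var.name
}

# Non-interactive non-functional: dynamic database credentials via a
# central HashiCorp Vault, wired in by the composite, not the consumer.
resource "vault_database_secret_backend_connection" "sql" {
  backend       = "database"
  name          = var.name
  allowed_roles = ["${var.name}-app"]

  postgresql {
    # 'private_ip' is a hypothetical output of the sql-instance module.
    connection_url = "postgresql://{{username}}:{{password}}@${module.sql.private_ip}:5432/postgres"
  }
}
```

The consumer supplies one input and receives a working SQL instance; the secrets management service rides along invisibly, which is exactly the managed-product experience described above.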
When aggregating all the non-functionals into a single product composite, you end up with a product looking like the below:
At this crossroads, many enterprises begin to think about scaling this practice. It becomes clear that a set of guiding principles must be in front of the team to steer towards the target product manifestation. I work with leaders who realise a new constraint presents itself - how do you scale the workflow that ultimately creates a product with the correct consumption pattern in mind?
I work with organisations to set up a product portfolio practice that guides and coaches product owners and associates to embed guiding principles into their product development strategy.
The diagram below suggests how this practice would look:
In my experience, the need for a practice like a product portfolio dawned when the first opinionated MVP was pushed out for release. As non-functional requirements mature, the possibility of aggregating multiple operational disciplines into a single product is realised.
This manifestation becomes very powerful. It proves how non-functionals that were traditionally pursued as their own org-wide workstreams can be successfully realised in a single product composite.
The very definition of product versus managed service drives the pursuit of aggregating multiple operational disciplines into a single product composite: the realisation of a shared responsibility model for each product published in the product catalogue.
The flow to establish a product catalogue, not a service catalogue.