Security of Azure Container Registry Image Promotion Flow

In Microsoft Azure, the natural fit to host container images is the Azure Container Registry (ACR) service. Teams implementing immutable container images might stand up a single ACR instance as a central store for all environments. With this topology, it is easy to adopt a tagging and versioning strategy to indicate which image belongs in the development environment and which belongs in production. For example, a multi-stage release pipeline will manage the lifecycle of container image version v1 (Figure 1).

First, this image is released to DEV.
Deployment to TEST is delayed for approval (either manual or automated).
On approval, the deployment process updates TEST to v1.
And so on…

The process continues to iterate through as many environments as the team feels it needs. Version v1.1 might be deployed to DEV while testers continue to evaluate v1 in TEST. In this topology all container images are stored in the same registry, even though the business has different levels of trust for each version. Image v1 has gained some trust through manual experimentation and formal or informal integration in the DEV environment, and this trust is what allowed the business to approve the image to move one step closer to production through deployment to the TEST environment. On the other hand, v1.1 is brand new. It has a little trust since it made it through build and test in the CI/CD pipeline, but not as much trust as v1.
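The promotion flow described above can be modeled in a few lines of Python. This is only a minimal sketch for reasoning about the flow, not a real release tool: the class name `ReleasePipeline`, its methods, and the exact environment list are assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Assumed environment order; a team may use more or fewer stages.
ENVIRONMENTS = ["DEV", "TEST", "STAGE", "PROD"]


@dataclass
class ReleasePipeline:
    """Hypothetical model of a multi-stage release pipeline."""

    # Maps environment name -> version currently deployed there.
    deployed: dict = field(default_factory=dict)

    def release(self, version):
        """A new build always lands in the first environment."""
        self.deployed[ENVIRONMENTS[0]] = version

    def promote(self, version, target, approved):
        """Move a version one step forward, but only after approval."""
        index = ENVIRONMENTS.index(target)
        if index == 0:
            raise ValueError("use release() for the first environment")
        previous = ENVIRONMENTS[index - 1]
        if self.deployed.get(previous) != version:
            raise ValueError(f"{version} is not deployed to {previous}")
        if not approved:
            # The deployment stays pending until someone approves it.
            return False
        self.deployed[target] = version
        return True


pipeline = ReleasePipeline()
pipeline.release("v1")
pipeline.promote("v1", "TEST", approved=True)
pipeline.release("v1.1")  # DEV moves on while TEST still evaluates v1
```

Note that nothing in this model restricts which images an environment can pull. The approval gate in `promote` is the only control, which is exactly the weakness of the single-registry topology.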


Understanding the different levels of trust that these versions have earned is important, because they are all within the same ACR instance, and each environment has access to all of them. The release pipeline is the only point of control to ensure that only fully vetted images are deployed to production. Images may “fall out” of the promotion path and never be deemed trustworthy for production. Developers may reject images deployed to the DEV environment without ever attempting to get approval for TEST, the business may decide a feature needs rework before promoting to STAGE, or a feature that looks right may have a critical bug that goes unnoticed until STAGE. All of this history is available in the same ACR instance PROD draws upon for deployments.


Business owners need a way to mitigate the risk of accidental releases. At the same time, the business wants developers to have the flexibility to iterate quickly in a live environment during development and testing. A tagging-only strategy doesn’t provide sufficient controls to meet both needs. A second ACR instance enables another layer of control. Dedicate the first ACR to the development environment and let developers use it as a sandbox to test work in progress. Use the second ACR instance only for release candidates containing feature-complete code with business owner approval.


A topology with multiple ACR instances works hand in hand with a source code branching strategy which distinguishes between work ready to share within the team and work ready to expose to customers and other stakeholders. Developers use merges to the development branch to trigger a container build and push to the DEV ACR instance. Business owners request merges into the master branch when they are happy with a feature in DEV. This merge triggers a pipeline that creates an immutable release candidate by rebuilding the image with a new version number and pushing to the PROD ACR instance. This setup creates a boundary between DEV and PROD (Figure 2). That boundary allows additional operational and security controls. The developer registry is a sandbox, and the images there are not even release candidates. Only images resulting from business owner approvals become release candidates in the production registry.
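The branch-to-registry rule can be sketched as a small Python function. The registry host names, the repository name, and the tag formats below are hypothetical placeholders; a real pipeline would take them from its own configuration.

```python
# Hypothetical registry login servers; substitute your own.
DEV_REGISTRY = "contosodev.azurecr.io"
PROD_REGISTRY = "contosoprod.azurecr.io"


def image_reference(branch, repository, version):
    """Return the image reference a build of `branch` should push, or None.

    Merges to the development branch go to the DEV registry as sandbox
    builds. Merges to master are rebuilt and pushed to the PROD registry
    as release candidates. Builds from any other branch push nothing.
    """
    if branch == "development":
        # Assumed tag format for sandbox builds.
        return f"{DEV_REGISTRY}/{repository}:dev-{version}"
    if branch == "master":
        # The release candidate gets its own version number.
        return f"{PROD_REGISTRY}/{repository}:{version}"
    return None


print(image_reference("development", "orders-api", "42"))
print(image_reference("master", "orders-api", "1.0.0"))
```

Keeping this decision in one place makes the boundary easy to audit: the only code path that names the production registry is the one triggered by a merge to master.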

This second approach works reasonably well for small teams, but can create a problem for larger teams that need to work independently in parallel. Microservice teams that want independent release cycles should only depend on production versions of the APIs they integrate with. In the new container flow, the development ACR never contains release candidates, much less actual releases. Read-only access to the production registry could grant access to releases from other microservice teams, and in some organizations granting read-only access introduces no new risks.

Other teams have higher security requirements, and may have enabled the ACR’s firewall and virtual network integration to restrict network-level access to the registry’s data plane. For these teams, allowing even read-only access means allowing the untrusted DEV environment to access the firewalled ACR instance. Role-based access controls still provide protection against mistaken image pushes, but that’s not usually enough for teams that opted in to the firewall in the first place. These teams want multiple layers of protection, and opening the network for read access degrades that protection and introduces new factors to consider in the overall threat model.


To work with these kinds of constraints, adopt a principle from the programming world: inversion of control. The most obvious path for containers to follow is a linear promotion path from dev to prod. Each approval causes the container image to move to the next environment, so when we diagram container flow all arrows point the same way. Invert part of the flow and introduce one push “backward” from the production ACR instance into the development ACR instance (Figure 3). A release candidate approved for production should not only be deployed to production; the pipeline should also push a copy to the development ACR. After adding this step, the released image is available in all environments.
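One way to picture the inverted push is as three ordinary Docker commands run by the release pipeline. The sketch below only builds the command lines and does not execute them. It assumes a pipeline agent that is already logged in to both registries and can reach both over the network; the function name and the registry and repository names are hypothetical.

```python
def inverted_push_commands(repository, tag, prod_registry, dev_registry):
    """Build the commands that copy a released image from PROD back to DEV."""
    source = f"{prod_registry}/{repository}:{tag}"
    target = f"{dev_registry}/{repository}:{tag}"
    return [
        ["docker", "pull", source],         # fetch the approved image
        ["docker", "tag", source, target],  # re-label it for the DEV registry
        ["docker", "push", target],         # publish the copy to DEV
    ]


commands = inverted_push_commands(
    repository="orders-api",
    tag="1.0.0",
    prod_registry="contosoprod.azurecr.io",  # hypothetical name
    dev_registry="contosodev.azurecr.io",    # hypothetical name
)
for command in commands:
    print(" ".join(command))
```

Because the pipeline pushes outward from the production side, the DEV environment never needs network or read access to the production registry. ACR also provides an import operation (`az acr import`) that may serve the same purpose without a local pull, depending on the network rules in place.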

With separate ACR instances and one inverted push in the container flow, a team can maintain a security boundary between dev and prod. Developers can work against the latest released images, and the business knows there are multiple controls protecting against mistaken releases to production. Once code reaches the release candidate state, images remain immutable through to production. Even if an “accidental” push of an unblessed release candidate occurs in the inverted step, the movement is from production to development. Artifacts from a more trusted environment ending up in a less trusted environment are less concerning than artifacts from an untrusted environment ending up in production.
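The direction rule behind this argument can be stated as a tiny check. This is an illustrative sketch only: the numeric trust levels and the function name are assumptions, and a real pipeline would enforce the rule through registry permissions and network controls rather than application code.

```python
# Assumed trust ranking: a higher number means a more trusted registry.
TRUST_LEVEL = {"dev": 0, "prod": 1}


def copy_allowed(source, target):
    """Allow image copies only from a more trusted registry to a less trusted one.

    Images enter the production registry solely through the rebuild that a
    merge to master triggers, never through a copy from the dev registry.
    """
    return TRUST_LEVEL[source] > TRUST_LEVEL[target]


print(copy_allowed("prod", "dev"))  # the inverted push
print(copy_allowed("dev", "prod"))  # the path the boundary is meant to block
```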