CIEM Part 3: Mastering privilege management for developers

Feb 5, 2024

Finding the right balance between security and efficiency often takes years of training. The intuition of senior developers doesn't come just from reading documentation - it takes time. The good news: it can be trained. This article will sharpen your senses and enable you to judge the criticality of IAM roles and their policies. You will learn how to determine the right point in time to harden a role, which in turn tells you where to invest your time.

The blast radius and the influence of developers

As you may have found out, I am a visual guy - so I've tried to find a proper way to express the blast radius in a graph. The best way to show it seems to be a box plot:

Let me try to explain the different metrics relevant to a role's effective policy. The box plot shows a vector for the blast radius. The blast radius depends linearly on the top line, labeled "All privileges". Starting at the top, you may wonder why two forces compete against each other. One dimension is driven by the number of resources available to a role, which is directly connected to your account sizing strategy. As we learned in the first part of this CIEM series, an account forms a natural isolation layer in AWS. Each AWS account can be considered a sandbox with limited and well-defined interfaces to the outside world. These interfaces exist at the control (AWS API) layer or at the operational layer. Examples at the control layer are trust policies for incoming connections or resource policies, which can work in both directions. At the operational layer, we typically see network traffic via load balancers, API Gateway, EIPs, or a CDN like CloudFront, as well as AWS data API activity such as incoming or outgoing events. On the other side, we can see organizational measures decreasing the blast radius or effective permissions.

The core of the box plot is defined by the AWS identity policy (including session policy and permission boundary) and an architectural debt. The architectural debt has a fixed size per resource. What I mean by that is that an AWS API may not offer the granularity of control you want to achieve. One example: among the RDS services, you can expect more granular control over Aurora than over a Postgres database. This gap is what I call architectural debt.

The bottom is defined by the size of the task that is consuming your role. As a developer, you can influence the box plot with the following metrics:

Your account size is key to strengthening your security posture. This shows the importance of a good platform team operating a landing zone in your AWS organization which enables you to quickly request a new account.
Your application design and technology define the size of a task. My experience: The smaller a task is, the easier it is to build your policy with least privilege in mind. For bigger tasks, we tend to be generous and provide too many privileges.
Policy refining helps you reduce the gap between the needed privileges for a given task and the possible effective permissions you can have with a given role.

Developers' actions often scale horizontally with roles. Each role needs to have an assigned task and needs to have its policy refined. In comparison: organizational measures are broader in scope and scale horizontally across one or more accounts in your organization.

Having a deeper look into the policy box plot

So far we have identified 3 metrics that a developer can influence: account size, task size, and identity policy. Respecting those already helps us a lot, but it will give us no guidance on whether a role policy should be refined or not. So let us have a look and compare different setups:

Imagine you have a simple task and implement it via different means. The above examples give us some more context by introducing a caller type and placing different policies next to each other. As you can see the following additional metrics can be used to influence the need for policy refinement:

Caller Type: The second part of this series showed us the relation between a caller type and the probability of an attacker targeting a role. This needs to be taken into consideration when you build policies.
Environment: The same resource has different weights in different environments. Obfuscated data in a test account is definitely of less value to your company than your production data.
Number of (business-critical) resources in your account: The more resources, the bigger the gap between the privileges your task needs and all possible actions for a given role.

Lastly, we can put on our X-ray glasses and have a look at the different action types:

This graph shows a possible weighted version of the relationship between action type and potential damage if the action is misused. We can see that read and list actions in most cases are insignificant in comparison to write actions. This adds yet another metric that must be respected for our policy refinement: action type.
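To make the interplay of these metrics more tangible, here is a minimal sketch that turns them into a rough priority score. All category names and weights are hypothetical placeholders chosen for illustration; they are not taken from the graphs in this series and would need to be calibrated for your own organization.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical weights, purely illustrative. Higher means "more critical".
CALLER_WEIGHT = {"workload": 1, "human": 3}
ENVIRONMENT_WEIGHT = {"test": 1, "staging": 2, "prod": 3}
ACTION_WEIGHT = {"list": 1, "read": 2, "write": 5}


@dataclass
class Role:
    name: str
    caller_type: str          # key of CALLER_WEIGHT
    environment: str          # key of ENVIRONMENT_WEIGHT
    action_types: List[str]   # keys of ACTION_WEIGHT
    critical_resources: int   # business-critical resources in the account


def refinement_priority(role: Role) -> int:
    """Rough score: attack probability (caller type) times potential damage."""
    probability = CALLER_WEIGHT[role.caller_type]
    damage = (
        ENVIRONMENT_WEIGHT[role.environment]
        * max(ACTION_WEIGHT[a] for a in role.action_types)  # worst action type wins
        * (1 + role.critical_resources)
    )
    return probability * damage


roles = [
    Role("report-reader", "workload", "test", ["list", "read"], 0),
    Role("ops-admin", "human", "prod", ["read", "write"], 2),
]
# Highest score first: these roles deserve policy refinement first.
ranked = sorted(roles, key=refinement_priority, reverse=True)
print([(r.name, refinement_priority(r)) for r in ranked])
```

The absolute numbers mean nothing; only the resulting order is useful as a starting point for a discussion.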

Nice, but didn't you forget something?

Until now we've covered both elements of risk management: the probability of an attack and its potential damage. My equation adds one last element: the after-effect.

Some damage can be compensated or "recovered" after an attack. I've tried to give you an idea of what we are talking about with the following graph:

Please don't take the line in the graph as a given. This is just a theoretical example that is meant to illustrate the impact of IaC, resource policies, and backup in case of an attack.

The graph shows two metrics of damage over time. Let's assume an attack happened at time t=0 and the attacker was able to delete resources, causing an application to fail. We can observe an immediate loss of money through destroyed data. This amount can be reduced with different strategies:

Protect business-critical resources: Termination protection, write immutable backups, apply resource policies.
Do backups
Use IaC

After t=0, most products will experience an exponential increase in missed revenue. The earlier you can restore your app, the better for you. This is where Infrastructure as Code (IaC) plays a key role. Without IaC, you will not only be slower in restoring your application. You also run the risk that the integrity of your application's infrastructure gets compromised: you may end up with a different configuration than before, or you may have forgotten to deploy some resources. A backup can also help to restore persistent data needed for your app.

The second graph shows the impact of an outage on the value of your application or product. In short: reputation loss. As you can see, the value of your product recovers more slowly than your application is restored. This means that no matter how fast you recover from an outage, it will take you more time to get back to where you started. In fast-moving IT, this can decide the future of your company.

I know: You wanted guidance on how to invest your developer time and not another lecture

I do not want to torture you any longer :) Here are my best practices for working with IAM as a developer.

Rule 1: Define an account strategy

For me, this is a top priority. Simple to define, with a huge impact. Please keep in mind that this only scales if your organization can deliver accounts on time and can accept the running cost of an empty account. Typical running costs are backend connectivity or PrivateLink endpoints.

Rule 2: Find a proper size for tasks

If your task is overpowered it is dangerous by nature. Try to invest your time in a proper design rather than polishing a policy for a task that needs almost admin privileges to do its job.

Rule 3: Secure SSO Access

Come up with a plan for how to use SSO access. Think about the tasks human beings need to execute in an environment. Apply separation of concerns: IaC manages the infrastructure, while human beings interact with it and should only have the actions they need enabled. Least privilege is typically not a good practice here, as you will spend too much time rewriting your policies after each change. It's better to focus on role-based policies with a general scope. For privileged access, use temporary access rights like those described here: aws-samples/aws-iam-temporary-elevated-access-broker: Allow users to request temporary elevated access to your AWS environment (github.com)

Rule 4: Secure your account boundaries

Keep a special focus on IAM trust policies and be careful with "*" resource statements when using resource types that support resource policies. Also, be careful with IAM privileges that can be used to create new threats. Have a look at my second post in this series for more on this threat and toxic combinations. Tip: You can use IAM Access Analyzer's new feature IAM reasoning to detect any role that executes sensitive operations.

Rule 5: Protect your most important resources

Follow the money. Business-critical workloads and data must be protected with additional means. Use the features available to you and activate termination protection, make your backups immutable, and use resource policies. This is also critical when you use IaC. Typically your IaC Framework needs access rights to create, update, and destroy resources - including your business-critical ones. I strongly recommend adding a protection layer to prevent the deletion of business-critical resources due to IaC as well.
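One way to express such a protection layer is an explicit deny on delete actions for the resources that matter most, with an exception for a break-glass principal. The following is only a sketch: the account ID, resource names, and the break-glass role name are placeholders, and whether the statement lives in an SCP or in the policy of your IaC role depends on your setup.

```python
import json

# Placeholders, not real resources.
ACCOUNT_ID = "123456789012"
CRITICAL_RESOURCES = [
    f"arn:aws:dynamodb:eu-central-1:{ACCOUNT_ID}:table/orders",
    "arn:aws:s3:::orders-archive",
]
BREAK_GLASS_ROLE = f"arn:aws:iam::{ACCOUNT_ID}:role/break-glass"


def deny_deletion(resource_arns, delete_actions, exempt_role_arn):
    """Explicit deny that beats any allow, except for the exempted role."""
    return {
        "Sid": "DenyDeletionOfCriticalResources",
        "Effect": "Deny",
        "Action": list(delete_actions),
        "Resource": list(resource_arns),
        # The deny applies to every principal whose ARN is NOT the exempted role.
        "Condition": {"ArnNotLike": {"aws:PrincipalArn": exempt_role_arn}},
    }


statement = deny_deletion(
    CRITICAL_RESOURCES,
    ["dynamodb:DeleteTable", "s3:DeleteBucket"],
    BREAK_GLASS_ROLE,
)
print(json.dumps({"Version": "2012-10-17", "Statement": [statement]}, indent=2))
```

Because an explicit deny always wins over an allow, this keeps working even if the IaC role itself is granted broad delete permissions.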

Rule 6: Use IaC with CI/CD and forget about the Console in productive environments

IaC, especially when you use the AWS CDK, can become your friend in refining policies. The "create grant" function often helps to get an already refined policy for a task with little effort. In addition, IaC makes your deployment faster and builds a deterministic and standardized configuration. Tip: I lock all my deployments in production accounts against changes by human beings. Only the CI/CD pipeline and a break-glass principal can change the configuration of a deployment.

Rule 7: Backup your data and IaC

As Werner Vogels says: everything fails all the time. Be prepared and back up your data and code.

Rule 8: NEVER log credentials or key material

Unfortunately, we never seem to learn from our mistakes. It happens all the time that sensitive data ends up on GitHub or session data is leaked some other way.
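A small safety net can help here, although it never replaces the discipline of not logging secrets in the first place. The sketch below uses a standard Python logging filter to mask anything that looks like an AWS access key ID. The regular expression is an assumption based on the common AKIA/ASIA prefix followed by 16 characters; secret access keys, session tokens, and other key material need their own patterns.

```python
import logging
import re

# Assumption: access key IDs start with AKIA (long-term) or ASIA (temporary)
# followed by 16 upper-case letters or digits.
ACCESS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")


class RedactAccessKeys(logging.Filter):
    """Mask strings that look like AWS access key IDs before they are emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # message with its arguments merged in
        record.msg = ACCESS_KEY_ID.sub("[REDACTED]", message)
        record.args = ()  # arguments are already merged, do not format twice
        return True  # keep the record, only its text was changed


logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.addFilter(RedactAccessKeys())  # every record passing this handler is masked
logger.addHandler(handler)

# The key below is the well-known example key from the AWS documentation.
logger.warning("calling AWS with key %s", "AKIAIOSFODNN7EXAMPLE")
```

Attaching the filter to the handler rather than to a single logger means records from child loggers are masked as well.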

Following this ruleset should always be your first thought! With the knowledge above you should be able to judge the relevance of a role and its security needs at a given time. It's hard to provide "one procedure" for policy refinement, which is why I will provide my personal, opinionated view. I think it is a good mix between security and efficiency:

How to refine a policy

Writing policies can become time-consuming. Sometimes you run into issues and spend hours finding the right policy for a given task. Especially during development, this can become costly due to various reasons:

The design of your application may change during development. Maybe you run into issues that cause you to rethink the way you build your application. A task may be resized or you make changes to the services you interact with.
You may never use your role in production: Prototyping is a good example. You build a prototype to gain experience and not to run it in production. Typically a prototype gets destroyed after development.
You are stuck in your creativity: Thinking about policies will distract you from the task you want to implement. Especially when you program complex systems, you want to focus on the task first rather than being stuck with missing privileges and getting disrupted by modifying policies.
There is a behavior lock-in: Let's assume you run a Step Functions state machine with admin privileges for DynamoDB. In theory, the state machine can interact with all tables. However, the probability of this happening is pretty low. The behavior of your role is locked to the state machine definition. Only if an attacker can hijack a session or change the state machine definition will you face danger. I am not aware that the Step Functions service itself has ever been hacked.

So this means that we may want to adapt the way policies are refined. I would recommend defining a minimum standard for policies in your test/dev environment and another for your staging/production environment. The test environment is typically more generous: it ensures that your productivity isn't blocked while still reducing the impact of an attack dramatically in comparison to a plain admin role. The roles in production environments should undergo refinement.

Sometimes a task is trivial or you already have a suitable policy engineered. In such cases you don't have to follow the tips below ;D

Remember: The effective policy is just a snapshot. The criticality of loosely managed policies may change as time passes and more resources get deployed, the setup gets "copied" to a production environment, or the scope of a task suddenly changes.

Step 1: Identify Region, Services, and Resource Types

List the regions, services, and resource types your task will need to interact with. Tip: Don't forget "called via" actions. For example, your S3 bucket may depend on a KMS key.
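Such a "called via" dependency can be made explicit in the policy. The following is a minimal sketch, assuming a bucket that is encrypted with a customer managed KMS key; region, account ID, and key ID are placeholders. It uses the kms:ViaService condition key so that the key can only be used through requests made via S3.

```python
import json

# Placeholders, not real resources.
REGION = "eu-central-1"
KEY_ARN = f"arn:aws:kms:{REGION}:123456789012:key/11111111-2222-3333-4444-555555555555"


def kms_via_s3_statement(key_arn: str, region: str) -> dict:
    """Allow use of the key only when the request comes through S3 in this region."""
    return {
        "Sid": "UseKeyViaS3Only",
        "Effect": "Allow",
        "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
        "Resource": key_arn,
        "Condition": {
            "StringEquals": {"kms:ViaService": f"s3.{region}.amazonaws.com"}
        },
    }


statement = kms_via_s3_statement(KEY_ARN, REGION)
print(json.dumps(statement, indent=2))
```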

Step 2: Whitelist ListActions for the given services

If you are unsure whether list actions are needed go for the safe side and allow all list actions for the identified services. Do not go for resource types as most of the list actions must be applied to a "*" resource.
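As a sketch, the list statement for Step 2 could be generated from the service prefixes identified in Step 1. The three services used here are only an example.

```python
import json


def list_statement(service_prefixes) -> dict:
    """Allow all list actions of the given services on a '*' resource."""
    return {
        "Sid": "AllowListActions",
        "Effect": "Allow",
        "Action": [f"{prefix}:List*" for prefix in sorted(service_prefixes)],
        "Resource": "*",  # most list actions cannot be scoped to a resource
    }


policy = {
    "Version": "2012-10-17",
    "Statement": [list_statement({"s3", "dynamodb", "kms"})],
}
print(json.dumps(policy, indent=2))
```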

Step 3: Allow (management plane) read actions

With a "describe*" and "get*" most of the read actions can be covered. Using the knowledge of Step 1, the actions are locked to specific regions and resource types.
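A possible shape for such a statement is sketched below, using DynamoDB tables as the example resource type. The ARN pattern and region are placeholders. Note that in some services "Get*" also matches data-plane reads (for example dynamodb:GetItem or s3:GetObject), so it is worth checking what the wildcard expands to before relying on it.

```python
import json

# Placeholders: region and account ID from Step 1.
REGION = "eu-central-1"
TABLE_PATTERN = f"arn:aws:dynamodb:{REGION}:123456789012:table/*"


def read_statement(service: str, region: str, resource_arn_pattern: str) -> dict:
    """Allow describe/get actions, locked to one region and one resource type."""
    return {
        "Sid": "AllowManagementReads",
        "Effect": "Allow",
        "Action": [f"{service}:Describe*", f"{service}:Get*"],
        "Resource": resource_arn_pattern,
        # Global condition key that restricts the region of the API call.
        "Condition": {"StringEquals": {"aws:RequestedRegion": region}},
    }


statement = read_statement("dynamodb", REGION, TABLE_PATTERN)
print(json.dumps(statement, indent=2))
```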

Step 4: Allow (data plane) read actions and write actions

Now we are entering the "dangerous" zone. The remaining actions must not contain any wildcards, as they could cause harm to your application. All actions in steps one to three expose metadata that can be used for information gathering but isn't a real threat unless it is used for another attack.

The resource part of the policy can be built with wildcards depending on your specific need. If you already know the resources you interact with, I strongly recommend referencing them explicitly.
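The "no wildcards in actions" rule of Step 4 is easy to enforce with a small helper. This is just a sketch; the table ARN is a placeholder and the chosen actions are an example for a task that reads and writes single items.

```python
import json

# Placeholder resource.
TABLE_ARN = "arn:aws:dynamodb:eu-central-1:123456789012:table/orders"


def data_plane_statement(actions, resource_arns) -> dict:
    """Build an allow statement and refuse any wildcard in the action list."""
    wildcards = [a for a in actions if "*" in a or "?" in a]
    if wildcards:
        raise ValueError(f"wildcards are not allowed in data-plane actions: {wildcards}")
    return {
        "Sid": "AllowDataPlaneAccess",
        "Effect": "Allow",
        "Action": list(actions),
        "Resource": list(resource_arns),
    }


statement = data_plane_statement(
    ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:Query"],
    [TABLE_ARN],
)
print(json.dumps(statement, indent=2))
```

Calling the helper with something like "dynamodb:*" raises a ValueError, which makes it a convenient guard in a unit test or a CI check.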

With this, your basic policy is done. You've successfully created a policy that doesn't take a lot of time to write and minimizes the blast radius. This should be your minimal security standard, which needs to be enforced for all roles.

Step 5: Add conditions

Conditions can help you narrow down your permitted actions with metadata available to AWS. There are tons of use cases - and to be honest, I have the feeling I haven't even explored 1% of them. However, I recommend having a look into attribute-based access control for advanced use cases.

Pro tip: I have often run into problems with roles in principal blocks (either in a resource policy or a trust policy). During deployment, AWS often requires a hard dependency between policy and role. If the role doesn't exist, the policy cannot be created. If the role gets deleted after the policy was written, it gets replaced by the deleted role's principal ID (which looks like a random string). The only possible recovery is to rewrite the affected policies after the role has been (re)created. You may even lock yourself or your IaC framework out this way. To prevent such issues, I try to work with condition statements: allow access to the whole account and limit it to the chosen roles with a condition. This also makes the statement more flexible, as more operations such as wildcard matching are supported.
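The pattern from the pro tip could look like the following sketch of an S3 bucket policy. Bucket name, account ID, and the role name pattern are placeholders. The principal is the whole account, and the aws:PrincipalArn condition narrows it down to roles matching a name pattern, so the policy has no hard dependency on a concrete role.

```python
import json


def bucket_policy(bucket: str, account_id: str, role_name_pattern: str) -> dict:
    """Resource policy that names the account and filters roles via a condition."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowSelectedRolesViaCondition",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
                "Condition": {
                    # Wildcards are possible here, unlike in the Principal element.
                    "ArnLike": {
                        "aws:PrincipalArn": f"arn:aws:iam::{account_id}:role/{role_name_pattern}"
                    }
                },
            }
        ],
    }


policy = bucket_policy("orders-archive", "123456789012", "orders-service-*")
print(json.dumps(policy, indent=2))
```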

Step 6: Secure critical IAM Actions with permission boundaries and SCP

In my last post, I mentioned some dangerous operations. Especially when dealing with IAM, you need to take special care - every time.

IAM Create/Update/Delete Role/Policy: If you are allowed to execute these actions, you may be able to start a privilege escalation attack. The attack vector is immense: An attacker can add a trust policy pointing to one of the attacker's accounts outside of your AWS organization and modify the policy to have admin rights. IAM permission boundaries are a best practice to prevent escalation of privileges. However, they may not resolve the trust policy problem. Pro tip: Think again about your options. Critical tasks should be narrowed down to an absolute minimum. Make sure that your task is as small as possible. In addition: Permission boundaries have a 1:1 binding with your role - it is not possible to bind multiple boundaries to one role as of today. Make sure that newly created IAM roles are also required to carry the same or a less privileged permission boundary.

AssumeRole (sts:AssumeRole): This action indicates that your task is dependent on other accounts. Sometimes this is necessary - but I recommend always reviewing it and never allowing it on a * resource. In my experience, a lot of cross-account tasks violate best practices of DDD (domain-driven design): if you build microservices, you normally have well-defined interfaces/contracts like API Gateway, EventBridge, queues, or other network endpoints (ALB/NLB). Even though it may be more work to extend well-defined interfaces, it is worth your time, as your service then doesn't carry unnecessary dependencies outside of your domain's control.

IAM PassRole: This action is used behind the scenes whenever an AWS resource needs an execution role. For example: If you deploy a Lambda function, you can either provide an existing IAM role or create a new one. Have a look at the analysis from Ermetic (now Tenable) for further details: IAM PassRole: Auditing Least-Privilege - Tenable Cloud Security (ermetic.com)
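Two of the critical IAM actions above can be constrained with conditions. The sketch below shows one possible shape: role creation is only allowed when a given permission boundary is attached (iam:PermissionsBoundary), and PassRole is limited to one role and one target service (iam:PassedToService). Account ID, boundary name, role prefix, and role name are placeholders.

```python
import json

ACCOUNT_ID = "123456789012"  # placeholder


def create_role_with_boundary(account_id: str, boundary_name: str, role_prefix: str) -> dict:
    """Allow role management only if the named permission boundary is attached."""
    boundary_arn = f"arn:aws:iam::{account_id}:policy/{boundary_name}"
    return {
        "Sid": "ManageRolesOnlyWithBoundary",
        "Effect": "Allow",
        "Action": [
            "iam:CreateRole",
            "iam:PutRolePermissionsBoundary",
            "iam:AttachRolePolicy",
            "iam:PutRolePolicy",
        ],
        "Resource": f"arn:aws:iam::{account_id}:role/{role_prefix}*",
        "Condition": {"StringEquals": {"iam:PermissionsBoundary": boundary_arn}},
    }


def pass_role_to_lambda(account_id: str, role_name: str) -> dict:
    """Allow passing exactly one role, and only to the Lambda service."""
    return {
        "Sid": "PassExecutionRoleToLambdaOnly",
        "Effect": "Allow",
        "Action": "iam:PassRole",
        "Resource": f"arn:aws:iam::{account_id}:role/{role_name}",
        "Condition": {"StringEquals": {"iam:PassedToService": "lambda.amazonaws.com"}},
    }


statements = [
    create_role_with_boundary(ACCOUNT_ID, "developer-boundary", "app-"),
    pass_role_to_lambda(ACCOUNT_ID, "app-orders-execution"),
]
print(json.dumps({"Version": "2012-10-17", "Statement": statements}, indent=2))
```

In a real setup you would typically also deny removing or replacing the boundary; that part is left out here to keep the sketch short.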

Step 7: Reduce Actions, Resources, and conditions to an absolute minimum

The last step is to polish your role to reach least privilege. This includes reducing all actions to an absolute minimum and always assigning only the resources involved or necessary. Also, conditions can often be used to further reduce the attack surface. Sometimes it is also worth thinking about attribute-based access control, an approach that often scales better in practice, since policies are not hardwired to actual resource instances but to properties of deployed instances.
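As an illustration of attribute-based access control, the sketch below allows starting and stopping EC2 instances only when the instance carries the same tag value as the calling principal. The tag key "project" is a hypothetical choice, and not every service supports tag-based authorization for every action, so check the service documentation before relying on it.

```python
import json


def abac_statement(tag_key: str) -> dict:
    """Allow actions only on resources whose tag matches the principal's tag."""
    principal_tag = "${aws:PrincipalTag/" + tag_key + "}"  # IAM policy variable
    return {
        "Sid": "StartStopOwnProjectInstances",
        "Effect": "Allow",
        "Action": ["ec2:StartInstances", "ec2:StopInstances"],
        "Resource": "*",  # no hardwired instance IDs, the tag decides
        "Condition": {
            "StringEquals": {f"aws:ResourceTag/{tag_key}": principal_tag}
        },
    }


statement = abac_statement("project")
print(json.dumps(statement, indent=2))
```

New instances are covered automatically as soon as they are tagged, without touching the policy.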

Additional best practices applicable to a developer team:

The previous paragraph focused on the horizontal scaling of policy refinement. This applies to every role a developer builds. However, there are additional best practices which can be implemented in your development team:

Don't deploy anything by hand - use code to express your deployments.
Use IAM Reasoning in your CI/CD pipelines
Enforce code reviews (with a four-eyes principle for production deployments)
Use pre-commit hooks and static security checks for IaC (for example Trivy for Terraform)
Regularly run IAM Access Analyzer recommendations for your IAM roles
Lock role management for humans in production deployments. Do not apply any overpowered SSO roles in production environments.
Use findings from your security tooling (GuardDuty, Inspector, Security Hub or 3rd party) to strengthen your security posture
Delete unused roles and policies
Build boundaries via SCP like in one of my previous posts: Secure IT Infrastructure with ABAC and SCP (robertdemeyer.com) Pro tip: Do not use permission boundaries as they do not scale well if you are using out-of-the-box solutions or predefined deployments.

I am well aware that this list isn't complete, but it provides a good start and covers the best practices I personally consider most important.

Wrap up

In this blog post, you've learned about a developer-centric view of IAM policies. I've described how the blast radius of a given role is influenced by different metrics and which of those can be influenced by developers. In addition, we had a look at the importance of IaC and backup in a fictional post-attack analysis. A common set of general rules was derived to build the basics of a secure deployment in AWS. This was followed by a step-by-step guideline for refining policies and finally some best practices for developer teams in general. I really would appreciate it if you take the following paragraph with you and apply it in your daily work. Don't be shy to confront your managers or seniors, there's nothing to be afraid of:

If I give you 10 random roles and their application context, you should be able to sort them by relevance for your business with the knowledge gained in this post. Achieving 100% least privilege is an almost impossible task - even for experienced developers. Having a good sense of the criticality of a role is more valuable and efficient than just blindly accepting that you need to polish all policies. Sometimes it's better to rethink the architecture and split a task, or to deploy resources in a different account and decrease the blast radius.

This was the third part of my CIEM Series. In the fourth and last part, I will dive deep into the organizational view in terms of IAM Management.