
CIEM Part 1: How least privilege leads to a false sense of security

This is the start of a series about Cloud Infrastructure Entitlement Management (CIEM). I will explain in depth the challenges you will face when managing Identity and Access Management (IAM) in AWS. The first part is about least privilege.

Least privilege plays a crucial role in today's cloud security landscape, and each cloud provider tends to adopt its own opinionated view of the principle. Today, I will share my view on least privilege and explain why I think the way it is applied lulls us into a false sense of security. But before we dive in, here is the definition of least privilege from NIST:

Least privilege is the principle guiding the design of a security architecture, ensuring that each entity is granted only the minimum system resources and authorizations necessary to perform its function.

Many people associate the least privilege principle with a security goal. However, I believe that our security goal should not be to apply the least privilege principle everywhere. While this might be ideal in a theoretical world with endless resources, it doesn't align with what's achievable in real-world applications. Here is my interpretation of what we aim to achieve with least privilege:

A methodology that consolidates all measures related to permission optimization, with the goal of protecting business-critical assets while minimizing the impact on the productivity of the entity managing permissions.

Compared with the original definition, I am not convinced that it's necessary to aim for a minimum set of permissions for all of our assets, even though this should be our goal for production deployments. Doing so everywhere consumes too much time and doesn't reflect what the business wants: a trade-off between productivity and security. The cloud excels in generating value by enabling innovation through the rapid adoption of state-of-the-art technology. In a world where speed matters, being late to launch the latest features in a product can mean falling behind and losing the game. Moreover, as humans, we are prone to making mistakes. Imagine you are responsible for a mid-sized AWS environment with a few hundred accounts and thousands of active roles. Do you think that you will achieve least privilege for all these assets 24/7?

Nevertheless, thinking about least privilege is a good thing. The theoretical definition is more of a general statement applicable to many domains. So let us get an understanding of what least privilege looks like in AWS IAM.


Understanding effective policies

A simple abstraction of least privilege can be described as a contract between a principal and the resources the principal interacts with. Let us have a look at the simplest form of such a contract:



Every policy, at its core, can be broken down into a set of actions that are executed against a resource and can be called by one principal. The resource statement shouldn't be taken too strictly in AWS, as some actions are generic and may apply to a set of resources. Typically, "list" operations fall into this category. The core elements of policies are the following:


Principal: This is the entity trying to execute a given set of actions against one or more resources.

Action: An operation (Create, Read, Update, Delete, List) applied to one or more resources.

Resource: The set of assets where the action is allowed to be executed.


AWS extends this basic construct with the following elements:


Effect: Either "allow" or "deny". The next section will explain why this construct was introduced.

Condition (optional): A filter applied to the state or metadata of a principal or resource. This enables us to limit a given action even further. An example could be that the action can only be executed if the principal has authenticated via a multifactor authentication (MFA).


The extended AWS model adds a layer of complexity by introducing these new fields. AWS follows the philosophy that an explicit deny always wins: a matching deny statement overrides any allow statement. Combined with the condition field, this gives AWS a very powerful and fine-grained policy model.

I often have the feeling that the application of least privilege is reduced to a set of identity policies applied to a role or an IAM User (in our case a placeholder for the principal). However, what we can execute is determined by a combination of several policies that are connected via logical operations. The identity policy attached to a principal is just one part of the AWS policy evaluation. When I compared the original policy abstraction with the AWS policy, I found that both are necessary and useful. From now on, I will call the combination of all kinds of policies applied to a given role the effective AWS policy. If we convert the effective AWS policy to the original abstraction of a policy, I call the result the effective policy.
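To make the interplay of Effect and Condition more tangible, here is a minimal Python sketch of a single-policy evaluation. It is not AWS's implementation: the `Statement` class, the `is_allowed` function, and the `mfa` context key are hypothetical names chosen for illustration, the condition is modeled as a plain predicate instead of a real Condition block, and wildcard matching is approximated with `fnmatchcase` (IAM itself matches action names case-insensitively).

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Callable, Optional, Sequence


@dataclass
class Statement:
    effect: str                    # "Allow" or "Deny"
    actions: Sequence[str]         # action patterns, e.g. "s3:*"
    resources: Sequence[str]       # resource ARN patterns
    # Hypothetical stand-in for a Condition block: a predicate over request context.
    condition: Optional[Callable[[dict], bool]] = None


def matches(stmt: Statement, action: str, resource: str, context: dict) -> bool:
    """A statement applies if action, resource, and condition all match."""
    if not any(fnmatchcase(action, pattern) for pattern in stmt.actions):
        return False
    if not any(fnmatchcase(resource, pattern) for pattern in stmt.resources):
        return False
    return stmt.condition is None or stmt.condition(context)


def is_allowed(statements: Sequence[Statement], action: str, resource: str, context: dict) -> bool:
    hits = [s for s in statements if matches(s, action, resource, context)]
    if any(s.effect == "Deny" for s in hits):
        return False                                # an explicit deny always wins
    return any(s.effect == "Allow" for s in hits)   # nothing matched: implicit deny


policy = [
    Statement("Allow", ["s3:GetObject", "s3:PutObject"], ["*"]),
    Statement("Deny", ["s3:*"], ["arn:aws:s3:::restrictedbucket/*"]),
    # Deleting is only allowed when the (assumed) context says MFA was used.
    Statement("Allow", ["s3:DeleteObject"], ["*"], condition=lambda ctx: ctx.get("mfa") is True),
]

print(is_allowed(policy, "s3:GetObject", "arn:aws:s3:::openbucket/a.txt", {}))
print(is_allowed(policy, "s3:GetObject", "arn:aws:s3:::restrictedbucket/a.txt", {}))
print(is_allowed(policy, "s3:DeleteObject", "arn:aws:s3:::openbucket/a.txt", {"mfa": True}))
```

The three calls print `True`, `False`, and `True`: the wildcard allow applies to the open bucket, the deny overrides it for the restricted bucket, and the delete is only granted because the condition holds.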

To strengthen your understanding let me try to give you an example:

AWS policy for principal arn:aws:iam::999999999999:role/example

{
    "Action": [
        "s3:PutObject",
        "s3:GetObject"
    ],
    "Resource": "*",
    "Effect": "Allow"
}

Additional policies applied

{
    "Action": [
        "s3:PutObject",
        "s3:GetObject"
    ],
    "Resource": "arn:aws:s3:::restrictedbucket/*",
    "Effect": "Deny"
}

Effective AWS policy

[
    {
        "Action": [
            "s3:PutObject",
            "s3:GetObject"
        ],
        "Resource": "*",
        "Effect": "Allow"
    },
    {
        "Action": [
            "s3:PutObject",
            "s3:GetObject"
        ],
        "Resource": "arn:aws:s3:::restrictedbucket/*",
        "Effect": "Deny"
    }
]

Effective policy

[
    {
        "Principal": "arn:aws:iam::999999999999:role/example",
        "Action": "s3:PutObject",
        "Resource": [
            "arn:aws:s3:::openbucket/*"
        ]
    },
    {
        "Principal": "arn:aws:iam::999999999999:role/example",
        "Action": "s3:GetObject",
        "Resource": [
            "arn:aws:s3:::openbucket/sample_file.txt"
        ]
    }
]

Note that the effective AWS policy is decoupled from the actual deployment base. This means we aren't aware of the available resources and make use of wildcards in our policy. Even if resources are mentioned in a policy, it doesn't mean that the resource is currently deployed. In contrast, the effective policy (broken down to the original model) contains only the actions possible for the given principal and the resources available at runtime. Let's assume our current deployment base looks as follows:



Since write actions (s3:PutObject) apply to a bucket path rather than to an existing object, the effective policy still uses a wildcard for them. The effective policy may change over time as resources are created, modified, or deleted. To understand the impact of an attack, the so-called "blast radius", at a given time, we need to derive the effective policy. However, the effective AWS policy is also necessary to understand the potential impact in the future.
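The derivation from the effective AWS policy to the effective policy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the deployment base (`openbucket` holding `sample_file.txt`, and `restrictedbucket` holding a hypothetical `secret.txt`) is assumed, all function and variable names are made up, and write actions are represented by a per-bucket wildcard, which is only accurate when a deny covers a whole bucket.

```python
from fnmatch import fnmatchcase

ROLE = "arn:aws:iam::999999999999:role/example"

# Effective AWS policy from the example: (effect, actions, resource pattern).
EFFECTIVE_AWS_POLICY = [
    ("Allow", {"s3:PutObject", "s3:GetObject"}, "*"),
    ("Deny", {"s3:PutObject", "s3:GetObject"}, "arn:aws:s3:::restrictedbucket/*"),
]

# Assumed deployment base: bucket name -> object keys that currently exist.
DEPLOYMENT_BASE = {
    "openbucket": ["sample_file.txt"],
    "restrictedbucket": ["secret.txt"],
}

WRITE_ACTIONS = {"s3:PutObject"}


def candidates(action, deployment):
    """Resources an action could target: any key for writes, existing objects for reads."""
    for bucket, keys in deployment.items():
        if action in WRITE_ACTIONS:
            yield f"arn:aws:s3:::{bucket}/*"
        else:
            for key in keys:
                yield f"arn:aws:s3:::{bucket}/{key}"


def permitted(action, resource, statements):
    """Allowed only if some allow matches and no deny matches."""
    effects = {
        effect
        for effect, actions, pattern in statements
        if action in actions and fnmatchcase(resource, pattern)
    }
    return "Allow" in effects and "Deny" not in effects


def derive_effective_policy(principal, statements, deployment):
    all_actions = sorted({a for _, actions, _ in statements for a in actions})
    policy = []
    for action in all_actions:
        resources = [r for r in candidates(action, deployment) if permitted(action, r, statements)]
        if resources:
            policy.append({"Principal": principal, "Action": action, "Resource": resources})
    return policy


for statement in derive_effective_policy(ROLE, EFFECTIVE_AWS_POLICY, DEPLOYMENT_BASE):
    print(statement)
```

With the assumed deployment base, this yields `s3:GetObject` on `openbucket/sample_file.txt` and `s3:PutObject` on `openbucket/*`, matching the effective policy shown above. Adding or removing an object in `DEPLOYMENT_BASE` changes the result without touching the AWS policy, which is exactly the decoupling described here.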


Extending our policy model and understanding the effective policy

In the last section, we learned that an effective AWS policy consists of a set of AWS policies. Let me try to illustrate the full picture:



If you are new to AWS, you may be overwhelmed by the elements in the full picture. However, every element has its use cases and can be integrated into a solution design. Most of the elements reduce the number of actions granted by your identity policy. Let me summarize the most important facts without going into the details and semantics of each policy type:


  1. Every policy is restricted by an account boundary. Without any further action in a different account, it is not possible to break out of this boundary.

  2. The principal gets replaced by either an IAM Role or an IAM User. These two elements are the entities that are allowed to proxy commands against a given set of resources. The principal itself has no intelligence and is only used as a proxy.

  3. The real caller/principal can be seen on the left side of the picture. In general, we can differentiate between Machines, AWS Services, and Humans.

  4. A principal can be consumed either based on a shared secret via an IAM User or a (temporary) role session making use of the security token service. As a general best practice I recommend going for temporary sessions rather than permanent access.

  5. Some resource types like S3 or KMS support an "inverted" policy called a resource policy. This allows us to control access to a resource independently of the identity policy, which adds a layer of security for important resources.

  6. Resource policies can enable roles to break through account boundaries.

  7. The trust policy of a role determines which actual callers can assume a role. This policy also ignores account boundaries. A role can only be successfully assumed if the effective AWS policy of the caller role allows the sts:AssumeRole action on the destination role and the destination role has the caller role allowed with an explicit allow in the trust policy.

  8. The identity policy may be further restricted by a session policy, permission boundary, or service control policy. Each of those policies can be seen as a filter on the identity policy.
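The two-sided rule from item 7 can be written down as a small Python check. This is a sketch of the rule as stated in the list, not of the full AWS evaluation: the account IDs, role names, and the `can_assume` helper are hypothetical, and policies are reduced to simple lookups.

```python
from fnmatch import fnmatchcase


def can_assume(caller_arn, target_arn, assume_role_allows, trust_policies):
    """Both sides must agree before a role can be assumed.

    assume_role_allows: caller ARN -> role ARN patterns the caller's
                        effective AWS policy allows sts:AssumeRole on.
    trust_policies:     target role ARN -> caller ARNs explicitly allowed
                        by the target role's trust policy.
    """
    caller_side = any(
        fnmatchcase(target_arn, pattern)
        for pattern in assume_role_allows.get(caller_arn, ())
    )
    trust_side = caller_arn in trust_policies.get(target_arn, ())
    return caller_side and trust_side


# Hypothetical accounts and roles, for illustration only.
CI = "arn:aws:iam::111111111111:role/ci-runner"
DEV = "arn:aws:iam::111111111111:role/developer"
DEPLOY = "arn:aws:iam::222222222222:role/deployer"

ASSUME_ROLE_ALLOWS = {
    CI: ["arn:aws:iam::222222222222:role/*"],
    DEV: ["arn:aws:iam::222222222222:role/*"],
}
TRUST_POLICIES = {DEPLOY: {CI}}

print(can_assume(CI, DEPLOY, ASSUME_ROLE_ALLOWS, TRUST_POLICIES))   # True
print(can_assume(DEV, DEPLOY, ASSUME_ROLE_ALLOWS, TRUST_POLICIES))  # False
```

Both callers are allowed to call sts:AssumeRole on roles in the second account, but only the CI role is named in the trust policy of the deployer role, so only that call succeeds.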


In addition to this diagram, it's also useful to understand the evaluation logic. This is the official policy evaluation logic provided by AWS:



It is pretty useful when you want to validate a given action on a resource for a specific principal. Let us look at the evaluation logic from a mathematical point of view. This allows us to convert the sequential view of the policy evaluation rules into the effective AWS policy, which is independent of any sequence.



My example may seem different from the ones provided by AWS. I've tried my best to illustrate an example that represents a real-world use case considering all elements of policy evaluation. I've split the policy blocks into two parts: one for the local account and one for all accounts outside the runtime of our principal (IAM Role or User). All the boxes you see represent a summary of actions applied to resources. Again, since we are talking about AWS policies, we decouple the instantiation of resources from our policy. Our goal is to reduce all our AWS policies into a single AWS policy which can be used to derive an effective policy when combined with the actual deployment base. Note that my boxes seem to be well structured. Don't get distracted by the nested view - this was done to be able to focus on what's relevant to our use case.

Please be advised that only the identity policy is mandatory - all other policy types are optional. Let us take some time and interpret the figure. It already gives us hints regarding the semantics of each policy type. An SCP is typically an organizational policy that is maintained by a platform team and applies to a whole AWS organization, organizational unit, or account. Typically it focuses on governance at the organizational level and extends our model of an account boundary with the context of an organization. As you can see the applied policy is very generous. Permission boundaries are typically used when an organization delegates the management of IAM Roles from a centralized team to another team. The identity policy is the policy we intend to attach to a specific role. In special use cases we may want to define a generic identity policy and limit the scope even further by applying a session policy. These are advanced use cases and may only be applied rarely. The last policy in our set is the resource policy which kind of inverts our basic model. Instead of defining what a principal can do, we can define a policy where the resource itself can determine which actions can be executed by a given principal.

Now that we understand the semantics let us try to derive the effective AWS policy. This is pretty simple and can be achieved with the following steps:


  1. Separate deny and allow statements for each available policy type

  2. Get the most specific policy and collect all allow statements

  3. Apply a logical "AND" operation to all allow statements of all other policies except the resource policy. Tip: Always use the "more specific" description and discard the rest

  4. Read the documentation of the resource policy and apply a logical "OR" to the result of (3). This applies for example to an S3 bucket. If the resource is a KMS Key this step can be skipped

  5. Apply a logical "OR" to all deny statements. Eliminate any unnecessary statements that do not match the allowed actions determined in the previous steps

  6. For each resource outside of the local account apply a logical "AND" for the identity policy allow statements and the affected resource policy allow statement

  7. Collect the statements from (4),(5),(6) - you will get a reduced set of statements, our effective AWS Policy

  8. You can try to convert deny statements into allow statements by inverting actions or resources (using NotAction or NotResource). This would allow you to get a policy with only allow statements. However, try it out and see for yourself how difficult this can get :)


I recommend trying this out for some of your policies. I haven't had the time so far to automate this (it may be a future project XD). It is a good exercise to measure how well an identity policy applies the least privilege principle and how good your organizational measures are. The last step would be to run the policy against all resources in your account and the defined resources outside of your account. To find all relevant resources, I can recommend the solution built by my AWS Community Builder peer Michael: How to list all resources in your AWS account | by Michael Kirchner | AWS Tip
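The steps above can be sketched with plain set operations in Python. This mirrors the listed steps rather than AWS's complete evaluation logic, and it rests on strong simplifications: every policy is already expanded into concrete (action, resource) pairs, conditions are ignored, and all names, buckets, and the partner account resource are hypothetical.

```python
from functools import reduce
from itertools import product


def pairs(actions, resources):
    """Expand actions x resources into concrete (action, resource) permissions."""
    return set(product(actions, resources))


def effective_aws_policy(identity, filters, local_resource_policy,
                         external_resource_policy, denies, external_resources):
    # Split the identity policy allows into local and cross-account resources.
    local_identity = {p for p in identity if p[1] not in external_resources}
    external_identity = identity - local_identity

    # Step 3: logical AND of the identity policy with SCP, boundary, session policy.
    allowed = reduce(set.intersection, filters, local_identity)
    # Step 4: logical OR with the resource policy of the local account.
    allowed = allowed | local_resource_policy
    # Step 6: cross-account access needs identity policy AND resource policy.
    allowed = allowed | (external_identity & external_resource_policy)
    # Step 5: logical OR of all denies, keeping only the ones that matter.
    relevant_denies = set().union(*denies) & allowed
    # Step 7: the reduced allow and deny sets form the effective AWS policy.
    return allowed, relevant_denies


GET, PUT, DELETE = "s3:GetObject", "s3:PutObject", "s3:DeleteObject"
OPEN = "arn:aws:s3:::openbucket/*"
RESTRICTED = "arn:aws:s3:::restrictedbucket/*"
PARTNER = "arn:aws:s3:::partnerbucket/*"      # assumed to live in another account

everything = pairs({GET, PUT, DELETE}, {OPEN, RESTRICTED, PARTNER})
allows, denies = effective_aws_policy(
    identity=everything,
    filters=[everything,                                        # generous SCP
             pairs({GET, PUT}, {OPEN, RESTRICTED, PARTNER})],   # permission boundary
    local_resource_policy={(DELETE, OPEN)},
    external_resource_policy={(GET, PARTNER)},
    denies=[pairs({GET, PUT}, {RESTRICTED})],
    external_resources={PARTNER},
)

for permission in sorted(allows - denies):
    print(permission)
```

In this example the permission boundary removes `s3:DeleteObject`, the bucket policy of the open bucket adds it back for that bucket only, the partner bucket is reachable for reads because both sides allow it, and the deny removes everything on the restricted bucket. Step 8 is left out on purpose.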

Note: If conditions are applied you may need to also respect/assume runtime data (such as session tags, caller IP, etc.).


Why the least privilege principle is not enough

The previous section showed us the core elements of a policy and how they can be interpreted in the AWS cloud. We understand now that deriving the authorizations of a given entity includes a lot of work and effort. Let us think one step ahead and try to understand the real problem we want to solve in CIEM. This leads to a lot of open questions I will try to answer in additional blog posts inside this series.

In the end, what we want to achieve with the least privilege approach is to reduce our blast radius to an absolute minimum. However, what we are missing in this equation is the probability of an attack happening and other metrics besides the effective policy that can influence the blast radius. Here are some examples:


  • The trust policy determines the potential number of callers. This influences the probability of an attack.

  • The type of caller (human / machine / AWS Service) and the connection channel (IAM User, integration via a 3rd party identity provider like Azure Active Directory [now Microsoft Entra ID] or Amazon Cognito, native service integration) have different levels of security and scope. This introduces the need to measure both the probability of an attack and the blast radius.

  • There are domains besides Identity management: Network connectivity, Authorization, and Authentication in Applications hosted via AWS Services, and many more

  • Architecture: The right scope of a task is not easy to define and should be respected in your solution design

  • AWS Protection mechanisms lower the blast radius: Termination protection, Object Locks, and BackupVaults

  • Human factor: Mistakes in policies [unintentionally enabling more actions] or a wrong understanding of what kind of authorizations an entity needs lead to a bigger blast radius

  • Infrastructure as Code and CI/CD: A recovery of an Application will be faster than a manual redeployment. In my opinion, the amount of downtime of an Application should also be considered as an element of a blast radius

  • Just in Time access: Nowadays many solutions shut down principals when they aren't in use. This is especially useful when using high-privileged roles in combination with AWS IAM Identity Center.

The conclusion is that least privilege is only one element of a successful CIEM journey. I've tried my best to implement least privilege in real-world setups and found that this "theoretical" concept doesn't scale very well. Here is a list of reasons behind this statement:


  • Cloud-native application developer teams cannot be productive if the organization's platform team doesn't delegate the creation of IAM resources to them. This means we need to give up on the idea of controlling all IAM resources 24/7. The platform team must assume that developers make mistakes and do not apply least privilege to all of their IAM resources.

  • The scope of a task is defined by the architecture of an application and the teams implementing an application. By nature tasks have different ranges of action. For example, an IaC Role is overpowered by default. The underlying framework needs to be able to configure whole applications. Other tasks can be modified to reduce the blast radius. An example: Bigger tasks running inside an EKS cluster could be refactored to serverless which also opens a chance to split a big task into multiple smaller tasks (refactor a container to use intrinsic functions or lambda function).

  • Service-linked roles may have restricted behavior. Even if you grant generous access rights, there is almost no chance that all of them can be exploited. For example: You cannot change the way a step function works if the attached role cannot change the step function definition. The behavior is locked to what happens inside the step function. Any additional allowed actions are defined, but will never be executed.

  • If the number of accounts and roles grows, it is difficult to achieve a secure state for every role without making trade-offs. There is a need to constantly measure the state of all IAM Roles and Users. You need to react fast if there is a security risk. Each organization may want to enforce a given set of restrictions, which reduces productivity and may have an impact on the availability of systems.

  • Not every policy and action has the same weight: The overall time consumed to create a role with least privilege in mind is huge. Developers run the risk of wasting a lot of time fine-tuning policies during the development stage. Working on policies may affect your flow during development and distract you from being innovative.

Wrap up

In this post, I've attempted to find a way to derive a generic policy in AWS that helps us map the original definition of least privilege to AWS. I have defined two ways in which a policy can be described. The effective AWS Policy is more complex and is decoupled from the actual deployment base. On the other hand, the effective policy shows us exactly the blast radius at a given time.

The last part of this article focused on the challenge of implementing the idea of least privilege in AWS. We identified that two major parties are involved in policy management. First, the developer teams: distributed across the organization, with the responsibility to create, maintain, and modify policies. Second, a centralized platform team with the responsibility to protect organizational assets. This team must be able to identify potentially dangerous AWS roles or IAM Users by business criticality.


In the end, I've shown that least privilege is difficult to implement in bigger AWS environments. I've left you with a lot of open questions and problems. The next articles in this series will attempt to explain how you can classify and measure the probability of a potential attack via an IAM principal (AWS Role or IAM User). In addition, we will take a close look at the blast radius and methods to reduce the blast radius at scale. We will explore how to monitor your IAM infrastructure with AWS native integrations. This enables you to react fast to misconfigured roles without overloading your platform or SOC teams. I will also share best practices for developer teams on how to work effectively with IAM without running the risk of operating insecure applications or wasting too much time on policy definitions.


Reflect on the problems discussed in this article on your own and look forward to my next article in 2024.


