AWS has delivered a great solution for cost insights with its Cost Intelligence Dashboards (CID). However, it lacks visibility, especially for EC2. AWS is strong at breaking down cost by service consumption, but this doesn't necessarily match what we as customers need. To determine the total cost of ownership (TCO) of an EC2 machine, several factors must be considered. I've built a solution and want to share my findings with all FinOps-interested engineers in the community.
WHY? Clarifying the relevance of the use case
In my example, I am facing a setup with hundreds of AWS accounts in one AWS organization. Almost 70% of the cost is directly or indirectly caused by compute workloads, and customers are asking for a transparent cost overview. I was struggling with the existing tools provided by AWS, and building the desired overview was a time-consuming, semi-automated, and error-prone task.
When looking at cost allocation, two types of accounts need to be considered. In the first type, a customer owns an AWS account and charges all of its cost to a defined cost center. In the second type, another team owns and operates the EC2 workload on behalf of customers, so the cost needs to be allocated on a "per EC2 instance" model.

I could derive two stakeholder groups requiring different granularity in their cost reporting. The finance department was primarily interested in high-level numbers grouped by cost center. The second group, the account operators, focused on optimising their AWS accounts. They need a more profound understanding of the cost: they wanted to understand how cost is built up and distributed in order to optimize their setups. If you are looking for an open-source or free solution to this problem, AWS has a nice offering of dashboards and solutions. However, ...
Which problems weren't solved by AWS?
EC2-centric cost view: As customers, we see an EC2 instance as one entity. Its cost consists of compute, backup, storage, network, and operating costs.
Confusion around Cost Explorer: AWS Cost Explorer is widely used and a valuable place to analyze cost. With regard to TCO, however, Cost Explorer has too many variables and leaves room for misinterpretation. Examples: How should a customer know when to use unblended vs. amortized cost? How can a customer account for costs originating from other accounts (such as AWS Support or a centrally purchased savings plan)?
EC2 Other: What was AWS thinking when it introduced this service type? EC2 Other gathers costs for EBS, snapshots, and some (but not all) network-related cost. This cost cannot be directly associated with an EC2 instance.
Missing context: Account IDs and resource IDs are hard to understand. Enriching the data with information such as an account name, a resource name tag, or a cost center tag adds insight and helps customers understand the data more easily.
Fair distribution of savings plan savings: AWS has its own algorithm for distributing the savings from a savings plan. This can be an advantage or a disadvantage for a customer. In the past, I faced a lot of cost fluctuation during a bigger migration or a change of commitments and needed a way to normalize this distribution. My goal was that each EC2 machine profits from the savings plan on an equal basis (pro rata, based on the actual cost generated).
Centralized cost (AWS Support) distribution: Some spending will always be covered via your management account. AWS Support is a good example. In my case, the goal was to distribute AWS Support costs to the entities operating AWS accounts.
OPEX (employee/platform) cost distribution: Each AWS account and workload comes with employee cost. For example, there will always be someone who needs to manage your AWS landing zone and governance. Centralized solutions like organisational backup or delegated management of EC2 machines also generate costs.
Missing weight for operational cost and AWS Support: I like to support customers on their way to a secure environment. With the account boundary, AWS gives you a way to improve your security posture. However, once employee and AWS Support costs are taken into consideration, customers face a bigger cost disadvantage when they decide to go with a multi-account setup (assuming that OPEX gets distributed equally based on the number of accounts owned). Weighting AWS accounts makes it possible to correct these metrics and to motivate customers to use a multi-account environment by balancing different use cases. This enables innovators to start in AWS at low to zero cost, which encourages future deployments in the cloud.
Handling centralized cost
Before examining any architecture, it's important to understand the distribution details of the centralized elements in an AWS organization. Typically, they sum up to no less than 10% of your overall spending.
Savings plan:
The centrally purchased savings plan needs to be transformed into a customer-oriented solution. The problem with the AWS distribution is that the focus lies on cost optimisation for the organisation. Each machine running on AWS is a potential target to be covered by a savings plan. However, depending on your coverage and the instance type, not every machine will be chosen by the savings plan algorithm. In addition, the amount of savings per instance type varies.

The example assumes that three accounts, A, B, and C, have the same total spend on EC2. As you can see in the green boxes, the amount of savings can vary for each account. A customer-oriented solution gathers all organisational costs into one pot and redistributes them evenly: each machine in an AWS account profits from the savings plan on a pro-rata basis. The metric is the on-demand cost generated by the machine. This also allows us to respect the running hours: if a machine doesn't run, it will not benefit from the savings plan.
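To make the pro-rata idea concrete, here is a minimal Python sketch. It is an illustration under assumptions rather than the production code: the function name, the instance IDs, and the numbers are made up, and it assumes the total savings of the organisation are already known as a single amount.

```python
def redistribute_savings(on_demand_cost, total_savings):
    """Spread a pooled savings amount over instances, pro rata by on-demand cost.

    on_demand_cost: dict of instance ID -> on-demand cost in the period
    total_savings:  pooled savings plan savings for the whole organisation
    """
    total_on_demand = sum(on_demand_cost.values())
    if total_on_demand == 0:
        # Nothing ran, so there is nothing to distribute.
        return {instance_id: 0.0 for instance_id in on_demand_cost}
    return {
        instance_id: total_savings * cost / total_on_demand
        for instance_id, cost in on_demand_cost.items()
    }


# Hypothetical numbers: i-c generates twice the on-demand cost of i-a and i-b,
# and i-stopped did not run at all in the period.
on_demand = {"i-a": 100.0, "i-b": 100.0, "i-c": 200.0, "i-stopped": 0.0}
savings = redistribute_savings(on_demand, total_savings=80.0)
effective_cost = {i: on_demand[i] - savings[i] for i in on_demand}
```

Because the key is the on-demand cost, every running machine gets the same relative discount, and a machine without running hours gets none.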
Each customer has its own preferences and models for how to achieve and distribute savings in AWS. I implemented a solution that mirrors the customer's intent in a specific use case. There is no universal solution that can be applied everywhere - my example just shows that you are not necessarily bound by AWS Cost Explorer and can fine-tune FinOps towards your organizational needs.
AWS Support Cost and Enterprise Discount (EDP):
Considering that cost should be charged where it is consumed, the distribution of EDP and AWS Support cost also needs an upgrade. By default, AWS distributes the EDP discount proportionally (pro rata, based on the total on-demand cost of each AWS account) and applies it to each AWS account. In comparison, AWS Support cost is charged only to your management account; no distribution is applied. After reflecting on the purpose, I decided to merge EDP, AWS Support, and internal platform costs into one pot. This includes the costs of "base platform" accounts (i.e., AWS landing zone accounts, centralized networking infrastructure, and governance in AWS).

It was decided to keep a part of the EDP discount as a reserve to balance unexpected cost spikes throughout the year. This backfill can be used in case of a high demand for new platform features.
After calculating all centralized costs, the internal service cost (FTE) was added on top. The service cost was distributed equally, with the number of accounts as the distribution key. To finish the process, a weight was applied per account. The weight makes it possible to motivate people to innovate by lowering the barrier to entry into the cloud.
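The pot and the weighting can be sketched in a few lines of Python. This is a simplified illustration with several assumptions that are not spelled out above: the EDP discount is treated as a credit that reduces the pot, the reserve is expressed as a ratio of that discount, and the weight is applied by normalising over the sum of all weights (so equal weights give an equal split). All names and numbers are hypothetical.

```python
def central_cost_per_account(support_cost, platform_cost, edp_discount,
                             edp_reserve_ratio, service_cost, weights):
    """Distribute the centralized cost pot over accounts by weight.

    weights: dict of account name -> weight (1.0 = regular share,
             lower values reduce the share, 0.0 means no charge)
    """
    # Assumption: only the part of the EDP discount that is not held
    # back as reserve lowers the pot.
    edp_applied = edp_discount * (1 - edp_reserve_ratio)
    pot = support_cost + platform_cost - edp_applied + service_cost

    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("at least one account needs a non-zero weight")
    return {
        account: pot * weight / total_weight
        for account, weight in weights.items()
    }


# Hypothetical example: two regular accounts and a sandbox with a reduced weight.
shares = central_cost_per_account(
    support_cost=1000.0,
    platform_cost=2000.0,
    edp_discount=1500.0,
    edp_reserve_ratio=0.2,
    service_cost=3000.0,
    weights={"prod-a": 1.0, "prod-b": 1.0, "sandbox": 0.4},
)
```

With this normalisation, the shares always add up to the full pot, so lowering the weight of one account shifts its part of the cost to the others instead of leaving it unallocated.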
Solution Architecture
So far, the requirements and centralized distributions are covered nicely on paper. But talking about a solution approach is one thing; implementing it is the real challenge. The resulting architecture is built on top of data already gathered from AWS:

I used the Cost and Usage Report (CUR) data in combination with Athena to query daily consumption reports. It took me some hours to derive the right queries from the Cost Intelligence Dashboards views. I can only recommend doing this exercise on your own, as it gives you a better understanding of how AWS charges costs under the hood. In addition to the cost exports, I also needed data originating from AWS Config. Here I made use of an organisation-level AWS Config aggregator to export metadata about all EC2 instances. This includes the relationship between instance IDs and EBS volumes, which is necessary to link storage cost to an EC2 instance rather than just classifying it as "EC2 Other".
The core of this solution is built as a Step Function:
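The linking step can be illustrated with a small Python sketch. It assumes two already-prepared inputs that are simplifications of the real exports: cost rows with a resource ID and a cost, and a volume-to-instance mapping derived from the AWS Config relationships. The field names and the function name are hypothetical.

```python
from collections import defaultdict


def tco_per_instance(cost_rows, volume_to_instance):
    """Roll compute and EBS volume cost up to the owning EC2 instance.

    cost_rows:          iterable of {"resource_id": str, "cost": float}
    volume_to_instance: dict of volume ID -> instance ID (from AWS Config)
    Returns (per-instance totals, cost that could not be linked).
    """
    totals = defaultdict(lambda: {"compute": 0.0, "storage": 0.0})
    unallocated = 0.0
    for row in cost_rows:
        resource_id, cost = row["resource_id"], row["cost"]
        if resource_id.startswith("i-"):
            # The line item belongs to the instance itself.
            totals[resource_id]["compute"] += cost
        elif resource_id in volume_to_instance:
            # A volume that AWS Config reports as related to an instance.
            totals[volume_to_instance[resource_id]]["storage"] += cost
        else:
            # No known relationship, e.g. an unattached volume.
            unallocated += cost
    return dict(totals), unallocated


# Hypothetical sample data.
rows = [
    {"resource_id": "i-0aaa", "cost": 12.0},
    {"resource_id": "vol-0111", "cost": 3.0},
    {"resource_id": "vol-0222", "cost": 1.5},
    {"resource_id": "vol-0999", "cost": 0.5},
]
mapping = {"vol-0111": "i-0aaa", "vol-0222": "i-0aaa"}
per_instance, unallocated = tco_per_instance(rows, mapping)
```

Keeping the unlinked remainder as its own number is a deliberate choice in this sketch: it makes visible how much cost still cannot be attributed to a machine.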

The architecture is built in an idempotent way so that you can redrive daily reports in case of changes in your cost allocation. This helped me a lot to fine-tune the cost distribution when playing with different allocation keys.
It was also important for me to make the curated datasets available to other FinOps communities within the same organisation. By using Athena and Glue as engines, I was able to build the "management" overview in QuickSight and allow others to access the curated data via Athena.
The last missing part was a "correction" of the data at the end of the month. This was necessary because AWS Support costs only become available a few (2-3) days after the end of the month. The correction Step Function simply fetched the latest AWS Support cost and re-ran the core Step Function to fix the support cost for the previous month. In my case, the difference in support cost was less than 0.1% of the overall AWS spend.
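One common way to get this property is to write each daily result into its own date partition and to overwrite it on every run. The sketch below shows the idea on the local file system. The partition layout, file name, and function name are assumptions for illustration only; the storage layout of the actual solution is not described here.

```python
import json
import tempfile
from pathlib import Path


def write_daily_report(base_dir, report_date, rows):
    """Write the report for one day, replacing any earlier run for that day."""
    partition = Path(base_dir) / f"date={report_date}"
    partition.mkdir(parents=True, exist_ok=True)
    target = partition / "report.json"
    # Overwrite instead of append: re-running a day never duplicates rows.
    target.write_text(json.dumps(rows, sort_keys=True))
    return target


with tempfile.TemporaryDirectory() as base_dir:
    first = write_daily_report(
        base_dir, "2024-01-31", [{"instance_id": "i-a", "cost": 10.0}]
    )
    # Redrive the same day after changing an allocation key.
    second = write_daily_report(
        base_dir, "2024-01-31", [{"instance_id": "i-a", "cost": 12.5}]
    )
    stored = json.loads(second.read_text())
    partitions = sorted(p.name for p in Path(base_dir).iterdir())
```

After the second run there is still exactly one partition for that day, and it holds only the recalculated rows.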
End User View
This may be the most interesting part of this post: the end result.
I want to share four use cases that really helped me get better at my FinOps activities.
Top EC2 Instances (by account and environment)

This graph shows your "top" instances and the TCO in terms of EC2. It helps you to pinpoint where most of the cost is originating from: Compute, Storage or Backup. I was especially happy to implement the top line indicating the number of running hours of the instance. This line helps you to visualize if teams are actively shutting down workloads to save cost.
Identify unused EC2 Instances

Who hasn't run into this issue? You spin up an EC2 instance to test something out. The EC2 instance eventually gets stopped, but it never gets cleaned up after you're done with your tests. Most EBS reports will only show you volumes in an "unattached" state. This report also helps you identify EBS volumes and snapshots related to shut-down EC2 instances.
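The logic behind such a report can be sketched as follows. This is a simplified illustration: the record layout (plain dictionaries with `instance_id`, `state`, and `volume_id` fields) and the function name are assumptions, standing in for the metadata exported from AWS Config.

```python
def storage_of_stopped_instances(instances, volumes, snapshots):
    """Find volumes attached to stopped instances, plus their snapshots.

    instances: list of {"instance_id": str, "state": str}
    volumes:   list of {"volume_id": str, "instance_id": str or None}
    snapshots: list of {"snapshot_id": str, "volume_id": str}
    """
    stopped = {i["instance_id"] for i in instances if i["state"] == "stopped"}
    # Unattached volumes carry no instance ID and are skipped here;
    # they are already covered by the usual "unattached" reports.
    idle_volumes = [v for v in volumes if v.get("instance_id") in stopped]
    idle_volume_ids = {v["volume_id"] for v in idle_volumes}
    idle_snapshots = [s for s in snapshots if s["volume_id"] in idle_volume_ids]
    return idle_volumes, idle_snapshots


# Hypothetical sample data.
instances = [
    {"instance_id": "i-run", "state": "running"},
    {"instance_id": "i-test", "state": "stopped"},
]
volumes = [
    {"volume_id": "vol-1", "instance_id": "i-run"},
    {"volume_id": "vol-2", "instance_id": "i-test"},
    {"volume_id": "vol-3", "instance_id": None},
]
snapshots = [
    {"snapshot_id": "snap-1", "volume_id": "vol-1"},
    {"snapshot_id": "snap-2", "volume_id": "vol-2"},
]
idle_volumes, idle_snapshots = storage_of_stopped_instances(
    instances, volumes, snapshots
)
```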
Measure the velocity of EC2 Instances over time

"You’re saying that you’ve optimized the cost once more. So why does the bill keep increasing?" - This is a comment I frequently hear from management. Demonstrating the speed at which the number of instances changes can help you better clarify your cost trends (though I’ve come to understand that perhaps tracking the number of running hours might be more effective if you’re extensively using auto-scaling services).
Getting a more educated overall service cost view
This one isn't a big innovation, but it was nevertheless good to get better visibility into the cost distribution in EC2 by splitting "EC2 Other" into more meaningful metrics: "Backup", "Storage", and "the real other" (mostly shared networking cost).
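A split like this can be approximated by looking at the usage type of each line item. The following Python sketch uses simple substring rules; these rules, the function names, and the sample usage types are assumptions for illustration and would need to be checked against the usage types that actually appear in your own CUR data.

```python
from collections import defaultdict


def classify_ec2_other(usage_type):
    """Map an "EC2 Other" usage type to Backup, Storage, or Other."""
    if "EBS:SnapshotUsage" in usage_type:
        return "Backup"
    if "EBS:" in usage_type:
        return "Storage"
    # Everything else, e.g. NAT gateway hours or data transfer.
    return "Other"


def split_ec2_other(line_items):
    """Sum the cost of line items per category."""
    totals = defaultdict(float)
    for item in line_items:
        totals[classify_ec2_other(item["usage_type"])] += item["cost"]
    return dict(totals)


# Hypothetical sample line items.
line_items = [
    {"usage_type": "EUC1-EBS:VolumeUsage.gp3", "cost": 40.0},
    {"usage_type": "EUC1-EBS:SnapshotUsage", "cost": 15.0},
    {"usage_type": "EUC1-NatGateway-Hours", "cost": 30.0},
    {"usage_type": "EUC1-DataTransfer-Regional-Bytes", "cost": 5.0},
]
breakdown = split_ec2_other(line_items)
```

The snapshot rule is checked first because snapshot usage types also contain the `EBS:` prefix and would otherwise end up in "Storage".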
Wrap up
Working on this insights project independently was a fascinating experience. It allowed me to deepen my understanding of data engineering and refresh my skills with the pandas library, which is invaluable for transforming, merging, and grouping multiple datasets. Additionally, creating Athena queries from raw CUR data broadened my perspective on AWS's FinOps capabilities. The final results enabled me to lower overall costs and enhance visibility into EC2 expenses. Furthermore, I incorporated operational costs, providing the business with a clearer understanding of TCO.
If you are interested in more details - feel free to reach out to me on LinkedIn or via the Chat function in the blog.