This is a common situation. You look at your bill - or your credit card statement - and realise that you paid more than you expected.
It first happened to me in 2009, in the early days of AWS. I've seen it periodically ever since.
This happens to everyone. But what actually happened to cause it?
Follow this tried and tested process to find out.
We want to find out which services or resources changed in our AWS accounts to cause the bill to increase. Once we have these reasons, we can decide whether any of them are worrying.
The basic process:
- break up your spend by different dimensions; then
- analyse the differences between the dimensions.
We'll use a real-world example here as an illustration of the process.
Step 0: How Much Is The Change?
- Get this month's bill and last month's bill as CSVs. If you don't already have them, here's the AWS console billing link.
- Subtract the total last month from the total this month.
- The difference is the amount of change we're looking for.
- Keep this as a running total as you work.
In our example, it's:
Running Total = $74,343 - $60,095 = $14,248
So we're looking for over $14k of new AWS usage.
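If you'd rather script this step than do it by hand, here is a minimal Python sketch. It assumes each bill CSV has a numeric cost column, here called `TotalCost`. That column name, the `bill_total` helper and the sample data are assumptions for illustration, so check the headers in your own export. Real exports may also contain summary or total rows, which you'd need to filter out to avoid double counting.

```python
import csv
import io


def bill_total(csv_text, cost_column="TotalCost"):
    """Sum the cost column of one month's bill, skipping blank cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(float(row[cost_column]) for row in reader if row.get(cost_column))


# Tiny stand-in bills (hypothetical data, not the example used in this post).
last_month = "ProductName,TotalCost\nAmazon EC2,100.00\nAmazon RDS,50.00\n"
this_month = "ProductName,TotalCost\nAmazon EC2,130.00\nAmazon RDS,55.50\n"

# The change we are trying to explain.
running_total = bill_total(this_month) - bill_total(last_month)
print(f"Running Total = ${running_total:,.2f}")
```

With real files, you would read each CSV from disk and pass its text to the same helper.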
Step 1: Check The Calendar
Most AWS services charge per-hour or per-minute, so if you run them for longer, you pay more. This is obvious, right?
And some months have more hours than others. This is also obvious.
But that means March (31 days = 744 hours) has over 10% more hours than February (28 days = 672 hours). In a non-leap year we'd expect our March bill to be about 10% higher than February's - without even doing anything!
If you're like me, you won't always remember this. But this calendaring certainty doesn't mean your underlying usage has changed at all. So you might have nothing to worry about.
Check the trend. If your bill dropped 10% from Jan to Feb then increased 10% into March, that's natural variation.
In our example, both months have 31 days. So our running total doesn't change.
Running Total = $14,248
No, it's not always that simple. The variation in months can hide other changes. And this only applies to services billed hourly, which isn't all of them. But unless you've got infinite time, ignore that until next month. The trend always becomes clearer with more data.
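The calendar effect is easy to quantify. The sketch below uses Python's standard `calendar` module to count the hours in each month and work out how much bigger an hourly-billed cost would be from month length alone. The helper names are my own, and 2019 is just an arbitrary non-leap year.

```python
import calendar


def hours_in_month(year, month):
    """Number of hours in a calendar month (ignores daylight-saving shifts)."""
    days = calendar.monthrange(year, month)[1]
    return days * 24


def calendar_ratio(year_a, month_a, year_b, month_b):
    """How many times longer month B is than month A, measured in hours."""
    return hours_in_month(year_b, month_b) / hours_in_month(year_a, month_a)


feb = hours_in_month(2019, 2)  # 28-day February
mar = hours_in_month(2019, 3)  # 31-day March
ratio = calendar_ratio(2019, 2, 2019, 3)
print(f"February: {feb}h, March: {mar}h, expected increase: {ratio - 1:.1%}")
```

If the ratio is 1.0, as in the example where both months have 31 days, the calendar explains none of the change.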
Step 2: Whole Bill Charges
We're looking here for costs that AWS calculates as a percentage of your bill:
- AWS Support (Developer, Business, Enterprise) can cost up to 10% of your usage. See the AWS Support pricing docs for specifics.
- Tax (VAT, GST, Sales Tax, etc) may be a percentage of your bill. Talk to your government to understand how much.
Work these out, subtract them from your running total, then ignore them. They didn't cause the change in the bill; they're a consequence of usage increasing.
Here's our example:
Tax = $939
Support = $367
Running Total = $14,248 - ($939 + $367) = $12,942
One thing to note here: if you're paying for AWS Support but not using it, cancel it. That will save you 3% to 10% of your bill without any effort at all.
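The subtraction in this step can be written as a small helper, using the figures from the worked example. The function name is hypothetical; the point is only that percentage-based charges come off the running total before you look for causes.

```python
def remove_whole_bill_charges(running_total, charges):
    """Subtract percentage-based charges (tax, support) from the running total."""
    return running_total - sum(charges.values())


# Month-over-month changes in whole-bill charges, from the worked example.
charges = {"Tax": 939, "Support": 367}

running_total = remove_whole_bill_charges(14_248, charges)
print(f"Running Total = ${running_total:,}")
```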
Step 2.9: Get The Data Into Excel
- Jump into Excel with your CSVs.
- Copy paste the two months of data into a single sheet, so that you can work with it.
- Create a PivotTable.
- Add BillingPeriodStartDate to the Columns of the PivotTable.
- Save this as an XLSX, because the raw CSV format doesn't work with PivotTables.
I suppose you could use Google Sheets? I've never had much success with it for complex tasks though.
Step 3: Compare The Region Breakdown
- Add RegionCode to the Rows of the PivotTable.
This lets you see how each region has changed from month to month.
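If Excel isn't to hand, the same pivot can be approximated in plain Python. This sketch assumes the combined rows carry `BillingPeriodStartDate`, a dimension column such as `RegionCode`, and a cost column called `TotalCost`. The cost column name, the two helper functions and the sample rows are assumptions for illustration.

```python
from collections import defaultdict


def pivot_by(rows, dimension, cost_column="TotalCost"):
    """Sum cost per dimension value and billing period, like a PivotTable."""
    table = defaultdict(lambda: defaultdict(float))
    for row in rows:
        period = row["BillingPeriodStartDate"]
        table[row[dimension]][period] += float(row[cost_column])
    return table


def month_over_month(table, previous, current):
    """Difference per dimension value between two billing periods."""
    # A period with no rows for a given value counts as zero spend.
    return {key: periods[current] - periods[previous] for key, periods in table.items()}


# Hypothetical rows standing in for two months of combined CSV data.
rows = [
    {"BillingPeriodStartDate": "2019/07/01", "RegionCode": "ap-southeast-2", "TotalCost": "900.00"},
    {"BillingPeriodStartDate": "2019/08/01", "RegionCode": "ap-southeast-2", "TotalCost": "950.00"},
    {"BillingPeriodStartDate": "2019/08/01", "RegionCode": "us-east-1", "TotalCost": "120.00"},
]

table = pivot_by(rows, "RegionCode")
changes = month_over_month(table, "2019/07/01", "2019/08/01")

# Biggest movers first, whether up or down.
for region, delta in sorted(changes.items(), key=lambda item: -abs(item[1])):
    print(f"{region}: {delta:+,.2f}")
```

In this made-up data, the region that appears only in the second month is exactly the sort of anomaly to look for.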
Look for anomalies. You know your infrastructure, so you know where workloads should be running. Think about what happened this month - any changes to the regions should be in line with that.
Generally you won't find much here. But when you do it can be interesting. An example:
- Someone accidentally spins up a resource in a region you don't use.
- They forget it's there and spin up another in your normal region.
- The resource sits there accruing cost until you notice it.
If you find an anomaly, make a note of it, and subtract it from the running total.
In our example the infrastructure was all in ap-southeast-2 (Sydney), so we move onto the next step without changing the total:
Running Total = $12,942
This analysis isn't possible with the vanilla AWS data. They don't separate the region of a resource in the billing CSV. One of the reasons we built Stax was to help illuminate these sorts of questions.
Some teams find accidental use of regions to be an ongoing issue. In this case you can ask AWS Support to block access to other regions.
Step 4: Study The Product Breakdown
- Clear any Rows in the PivotTable.
- Add ProductName as a Row to the PivotTable.
This gives you a look at the per-service differences from month to month.
This information guides the rest of the analysis. These per-service differences add up to the Running Total. But this breakdown is too crude to be useful on its own.
- Note down the most significant per-service differences.
- "Significant" is everything with a difference over 2% of the total difference.
- Drill into each service as per the next step.
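The 2% rule is simple to apply in code. The sketch below keeps any change whose size is more than a given fraction of the total difference and sorts the biggest movers first. The function name and the service figures are hypothetical, not taken from the example data.

```python
def significant_changes(deltas, total_difference, threshold=0.02):
    """Keep changes whose size exceeds `threshold` of the total difference."""
    cutoff = abs(total_difference) * threshold
    kept = {name: delta for name, delta in deltas.items() if abs(delta) > cutoff}
    # Largest movers first, whether increases or decreases.
    return dict(sorted(kept.items(), key=lambda item: -abs(item[1])))


# Hypothetical per-service differences between two months.
deltas = {"Service A": 600.0, "Service B": -250.0, "Service C": 15.0}

significant = significant_changes(deltas, total_difference=1_000.0)
print(significant)
```

Decreases are kept as well as increases, because a large drop in one service can mask a large rise in another.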
In our example data, the most significant services are:
- EC2 - $3,425
- RDS - $3,232
- Connect - $2,030
- NAT Gateway - $1,311
- EBS - $734
- CloudFront - $621
- Elasticsearch - $318
- S3 - $192
The AWS-supplied billing CSV won't give you the level of granularity shown here. It uses cruder groupings. This is another reason we built Stax.
Step 5: Drill Into Each Significant Service
Each AWS service has different ways of billing. The task when drilling into them is to slice and dice these in a way that gets you an answer.
You'll break it up by a combination of these basic AWS dimensions:
- AWS Account Name (eg. project-firebird, master_account, etc)
- AWS Region (eg. ap-southeast-2, us-east-1, etc)
- Usage Type (eg. instance usage, storage, data transfer, etc)
- Resource Type (eg. m4.2xlarge, EBS snapshot or volume, etc)
- Resource Name (eg. S3 bucket name, RDS DB name, etc)
- Payment Option (eg. on demand, reserved, spot, etc)
If you have good tagging, you might also use derived dimensions:
- Environment (eg. prod, test, staging, etc)
- Cost Centre (eg. Retail, Back Office, R&D, etc)
- Project (eg. Firebird, Argon, Manhattan, etc)
- Team (eg. Data Science, SecOps, CRM, etc)
- Service (eg. Customer Identity, etc)
- Application (eg. CMS, BI, Public Website, etc)
- Owner (eg. Bob, Jill, Najla, Pierre, etc)
- whatever works for your organisation
These are powerful for cost management because they get you to an answer faster. If you don't have them set up right then that should be a priority. Stax helps with this.
The process is:
- Filter down your data to a particular service using PivotTable Filters.
- Add the appropriate dimension in PivotTable Rows.
- See if that surfaces any useful differences.
- "Useful" here means big enough to be notable.
- If you have useful differences, try adding another dimension.
- The more dimensions you can add together, the closer you get to the specific change.
- If you can't find useful differences, start with a different dimension.
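The drill-down loop above can be sketched in Python as a filter plus a group-by over one or more dimensions. The column names `UsageType`, `ResourceName` and `TotalCost`, the `drill_down` helper and all of the rows are assumptions for illustration; substitute whatever your data actually provides.

```python
from collections import defaultdict


def drill_down(rows, service, dimensions, previous, current):
    """Month-over-month change for one service, grouped by the given dimensions."""
    totals = defaultdict(lambda: {previous: 0.0, current: 0.0})
    for row in rows:
        period = row["BillingPeriodStartDate"]
        if row["ProductName"] != service or period not in (previous, current):
            continue
        key = tuple(row[d] for d in dimensions)
        totals[key][period] += float(row["TotalCost"])
    changes = {key: t[current] - t[previous] for key, t in totals.items()}
    # Biggest differences first so the useful ones surface at the top.
    return sorted(changes.items(), key=lambda item: -abs(item[1]))


# Hypothetical data: two months of RDS rows plus one unrelated EC2 row.
FIELDS = ("BillingPeriodStartDate", "ProductName", "UsageType", "ResourceName", "TotalCost")
raw = [
    ("2019/07/01", "RDS", "instance usage", "db-a", "100"),
    ("2019/08/01", "RDS", "instance usage", "db-a", "100"),
    ("2019/08/01", "RDS", "instance usage", "db-b", "80"),
    ("2019/07/01", "RDS", "storage", "db-a", "20"),
    ("2019/08/01", "RDS", "storage", "db-a", "25"),
    ("2019/08/01", "EC2", "instance usage", "i-123", "500"),
]
rows = [dict(zip(FIELDS, values)) for values in raw]

# First pass: one dimension. Second pass: add another to get more specific.
by_usage = drill_down(rows, "RDS", ["UsageType"], "2019/07/01", "2019/08/01")
by_resource = drill_down(rows, "RDS", ["UsageType", "ResourceName"], "2019/07/01", "2019/08/01")
print(by_usage)
print(by_resource)
```

In this made-up data the first pass only says that instance usage went up. Adding the resource name shows that the whole increase comes from one new instance.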
Here's an example from our test data, looking at RDS. The total difference here is $3,232.
Looking at it first by usage type, we see:
- instance usage = +$2,643
- storage = +$517
- provisioned IOPS = +$409
- data transfer = +$22
- IO requests = +$0
- replica usage = +$0
- backups = -$359
This doesn't give us much. The biggest item is instance usage, but that's not insightful. We'd expect increases in RDS spend to be instance-related.
So let's try breaking it down by instance name:
- bob_test_repl = +$800
- acme_replication_sandbox = +$440
- xxttest = +$428
- xxttest-restore = +$420
- db2-tuning = +$366
- xxt-firebird = +$335
- ...and lots of smaller differences.
OK, this is insightful. Now we can see specifically which instances have caused the cost to increase. The meaningful names give us enough information to talk to the responsible person. We can remove the total of this breakdown from the running total.
Keep drilling down until you find no more useful information. This is the case when no large, significant groups remain. The goal is to minimise the number of groups while maximising the specificity of each.
Step 5.5: Repeat Step 5
Repeat this process across each of the services in your list. Stop when your running total becomes insignificant. This gives you clear answers about what's changed. Now you know who to talk to and what you can do about the spend.
In our example, the results are:
- RDS - New Instances - $2,384
- Connect - Increased Usage - $2,031
- EC2 - Changed Usage in account YWQ-3 - $1,790
- NAT Gateway - Increased Usage in account SecRoot - $1,284
- RDS - Increased Usage - $1,181
- EBS - Increased Usage - $685
- CloudFront - Increased Usage - $621
- EC2 - Changed Usage in account SecRoot - $333
- Elasticsearch - Increased Usage on cluster app-logs - $320
Removing these from the running total, we're down to $391. This is small enough that I'm happy we're done.
Step 6: Work Out Reasons
Finally, as well as knowing where your costs changed, it's important to understand why. Common reasons include:
- A resource or stack was spun up for testing and not turned off.
- Your account has gone over the free tier for a service.
- Reserved instances/capacity have expired, so your usage costs more.
- There's more load (traffic/requests/etc) on a service, so it's using more infrastructure/data transfer.
- A workload has become misconfigured.
- Your architecture has changed to a more expensive setup.
- A service or app was migrated, so two copies of it were running for a period.
- Backups or file storage have continued to increase at their normal rate.
These are more or less worrying depending on your situation. Once you know what they are, you can assess whether you need to take action.