- Proactively reducing cloud costs can make data engineering teams look good and save money for the organization.
- Data engineering teams should determine the organization's focus on speed versus cost and put together cost calculations to present to the business for validation.
- Some low-hanging fruit for cost-cutting includes shutting down unused clusters or VM instances.
- Optimization strategies include reserving instances, using spot instances, compressing data, and using columnar file formats.
- Switching to usage-based managed services like BigQuery or Athena (which is built on Presto) can also save costs, but it may require more technical changes and careful consideration of timing and savings.
In my last post, I gave some general suggestions on how analytics and data engineering teams should be dealing with COVID-19. Now, I want to give specific advice on how data engineering teams can reduce their cloud costs.
Note: while this post focuses on cloud, similar approaches could be applied at companies with chargeback models. These cost optimizations could also serve as further reasons to move an organization’s clusters to the cloud.
Few things make a data engineering team look better than proactively addressing its own cloud costs, and few things make it look worse than failing to. If the CFO is looking around the organization for ways to cut costs, the data engineering team will look good if they’ve proactively reduced costs. When the meeting with the CFO comes, the data engineering team can talk about the ways they’ve already cut costs. This stands in contrast to data engineering teams who are reactive about costs and have to follow up or take action items out of a meeting. Reactive teams seem like they aren’t taking the initiative to save money, and that makes them look bad.
What to Cut?
Before COVID-19, data engineering teams were more focused on the speed of analytics than on costs. The big question for an organization is: what is its focus now? Is the organization still focused on speed over costs? Is the organization now willing to trade speed for lower costs? More often, I think organizations will try to strike a delicate balance of optimizing costs while not giving up too much speed.
By figuring out the organization’s focus, the data engineering team can start to put together cost figures to present to the product owner/business user/business sponsor. For example, the data engineering team could calculate that an average cluster costs $100 per day and it takes 10 minutes to run the average query. If the data engineering team were to cost-optimize the cluster, it would cost $75 per day and take 20 minutes on average per query.
By putting the numbers in front of the business, the business can make a more educated decision.
In this example, saving $25 per day but adding an extra 10 minutes per query may not be worth it. Coming at the problem from the business view, the person running the query is spending double the amount of time waiting for the query to finish. This cost-cutting could cut productivity in half and have the opposite effect of saving money. Once a person’s salary is taken into account, halving their productivity could more than negate any cost savings.
In another scenario with the same example, adding 10 minutes per query may not be an issue. This could be the case if the queries are more of an asynchronous report or another usage where a person’s productivity isn’t directly affected by the extra 10 minutes.
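The two scenarios above can be sketched as a small calculation. This is only an illustration: the $100 and $75 per day and the 10 extra minutes come from the example, while the number of queries per day, the hourly rate, and the `blocking_fraction` parameter are hypothetical values that a team would replace with its own figures.

```python
def net_daily_savings(current_cost, optimized_cost, extra_minutes_per_query,
                      queries_per_day, hourly_rate, blocking_fraction=1.0):
    """Infrastructure savings minus the cost of people waiting longer.

    blocking_fraction is the share of the extra wait during which a person
    is actually blocked (1.0 = interactive query, 0.0 = asynchronous report).
    """
    infra_savings = current_cost - optimized_cost
    extra_wait_hours = (extra_minutes_per_query * queries_per_day
                        * blocking_fraction / 60)
    return infra_savings - extra_wait_hours * hourly_rate


# Hypothetical: 6 interactive queries per day by a person costing $60/hour.
interactive = net_daily_savings(100, 75, 10, 6, 60)

# Same cluster change, but the queries are asynchronous reports.
asynchronous = net_daily_savings(100, 75, 10, 6, 60, blocking_fraction=0.0)

print(f"interactive: {interactive:+.2f} per day")
print(f"asynchronous: {asynchronous:+.2f} per day")
```

With these assumed inputs the interactive case loses money overall, while the asynchronous case keeps the full $25 per day.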
Before doing the actual cost-cutting, the data engineering team will want to work with the business to validate that the primary and secondary effects are worth it.
The Actual Cuts
It turns out that doing cost calculations and comparisons for data engineering is more difficult than for other IT spend. The difficulty comes from two things: the effect on speed and the possible data loss when removing nodes from a cluster. Some distributed systems make it difficult, if not impossible, to remove nodes from a cluster without a big risk of data loss.
Some cuts can be low-hanging fruit. Is the cluster running below 80-100% utilization? Is the cluster completely unused during the night or on the weekends? Are there VM instances that were spun up and no one knows what they’re used for? These are all good candidates for being shut down or having nodes removed to save money.
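As a rough sketch of how these questions could be turned into a first-pass report, the snippet below classifies clusters from hourly utilization averages. The `ClusterUsage` type, the thresholds, and the sample data are all hypothetical; real numbers would come from the cloud provider's monitoring tools.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterUsage:
    name: str
    # One average per hour of the day, each between 0.0 and 1.0 (hypothetical input).
    hourly_utilization: List[float]


def classify(cluster, busy_threshold=0.8, idle_threshold=0.05):
    """Suggest a cost action based on simple utilization thresholds."""
    usage = cluster.hourly_utilization
    average = sum(usage) / len(usage)
    idle_hours = [hour for hour, u in enumerate(usage) if u <= idle_threshold]
    if len(idle_hours) == len(usage):
        return "shut down: appears unused"
    if idle_hours:
        return "schedule off during idle hours"
    if average < busy_threshold:
        return "remove nodes: underutilized"
    return "keep as is"


clusters = [
    ClusterUsage("etl", [0.0] * 8 + [0.9] * 12 + [0.0] * 4),
    ClusterUsage("mystery-vm", [0.01] * 24),
    ClusterUsage("adhoc", [0.5] * 24),
    ClusterUsage("prod", [0.9] * 24),
]

for cluster in clusters:
    print(cluster.name, "->", classify(cluster))
```

A report like this is only a starting point; whether nodes can be removed safely still depends on the distributed system involved.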
Sometimes, the best strategy isn’t to remove nodes but to optimize how the cloud provider charges for the usage. All three big cloud providers give discounts for reserving instances (AWS, GCP, and Azure). Without any technical changes, the cluster could be running on reserved instances and saving 30-80%. The main tradeoff with reserved instances is the upfront commitment. Reserved instances can require an upfront payment and possibly monthly amounts on top of the hourly usage. This could force the business to dip into precious cash on hand, but it would benefit the business over the long term.
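To make the reserved-instance tradeoff concrete, here is a minimal break-even sketch. All prices are hypothetical, and the model assumes the reserved hourly rate is billed for every hour of a one-year term whether or not the instance is used; actual terms differ by provider and plan.

```python
HOURS_PER_YEAR = 8760


def on_demand_cost(hourly_rate, hours_used):
    """On-demand is only billed for the hours actually used."""
    return hourly_rate * hours_used


def reserved_cost(upfront, hourly_rate, term_hours=HOURS_PER_YEAR):
    """Assumed model: upfront payment plus the hourly rate for the whole term."""
    return upfront + hourly_rate * term_hours


def break_even_hours(on_demand_hourly, upfront, reserved_hourly,
                     term_hours=HOURS_PER_YEAR):
    """Hours of use per term above which reserving is cheaper."""
    return reserved_cost(upfront, reserved_hourly, term_hours) / on_demand_hourly


# Hypothetical prices: $1.00/hour on demand, $1,000 upfront plus $0.40/hour reserved.
hours = break_even_hours(1.00, 1000, 0.40)
always_on_savings = 1 - reserved_cost(1000, 0.40) / on_demand_cost(1.00, HOURS_PER_YEAR)

print(f"break-even at {hours:.0f} hours ({hours / HOURS_PER_YEAR:.0%} of the year)")
print(f"savings if running 24/7: {always_on_savings:.0%}")
```

Under these assumptions, a cluster that runs around the clock clearly benefits, while one that runs less than about half the year would be cheaper on demand.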
Other times, the query or processing needs to get done but there isn’t a huge time crunch for it. If the job fails several times, it can be restarted without any adverse effect on the business. These use cases can leverage a cloud provider’s spot instance market (AWS, GCP, and Azure). Spot instances can save as much as 80%. The tradeoff is that spot instances can be terminated and removed from the cluster with little to no notification.
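The "restart it if it fails" pattern could look something like the sketch below. The `Preempted` exception and the `run_with_restarts` helper are hypothetical names for illustration; a real job would detect termination through whatever signal the processing framework or cloud provider exposes.

```python
class Preempted(Exception):
    """Hypothetical signal that a spot instance was reclaimed mid-job."""


def run_with_restarts(job, max_attempts=5):
    """Run job(), restarting from scratch each time it is preempted.

    Returns (result, attempts_used). Only suitable for jobs that are safe
    to rerun and have no tight deadline.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job(), attempt
        except Preempted:
            continue  # the instance went away; try again
    raise RuntimeError(f"job still preempted after {max_attempts} attempts")


# Simulated job that is preempted twice before finishing.
calls = []

def flaky_job():
    calls.append(1)
    if len(calls) < 3:
        raise Preempted()
    return "done"


result, attempts = run_with_restarts(flaky_job)
print(result, "after", attempts, "attempts")
```

Capping the number of attempts keeps a job from retrying forever during a period when spot capacity is scarce.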
You might already know about spot instances, but you might not know that support for them is built into some of the managed services. GCP’s Dataproc can be configured to start with preemptible VMs. The workers running on preemptible VMs don’t run HDFS, which reduces the risk of data loss when an instance is reclaimed.
Some services are charged based on bytes stored and bytes processed. These services can be optimized by simply compressing the data. Compressing the data can be transparent to the rest of the application and be as simple as a configuration property or an API call.
An example of a managed service that takes advantage of compression is AWS Athena. All charges are purely based on the amount of data read by the Presto cluster. By compressing the data, the amount of data read in bytes goes down and the queries will be cheaper.
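A small sketch of the effect, using Python's standard `gzip` module on made-up CSV data. The price per terabyte is an assumption for illustration only; check the provider's current pricing before relying on any figure.

```python
import gzip

# Assumed price per TB scanned, for illustration only.
PRICE_PER_TB = 5.00
BYTES_PER_TB = 1024 ** 4


def scan_cost(num_bytes, price_per_tb=PRICE_PER_TB):
    """Cost of a query that reads num_bytes under bytes-read pricing."""
    return num_bytes / BYTES_PER_TB * price_per_tb


# Made-up, repetitive event data, similar in spirit to many log tables.
raw = "".join(
    f"{i},user_{i % 100},US,click\n" for i in range(50_000)
).encode("utf-8")
compressed = gzip.compress(raw)

ratio = len(compressed) / len(raw)
print(f"raw: {len(raw)} bytes, gzip: {len(compressed)} bytes ({ratio:.0%} of raw)")
print(f"relative scan cost after compression: {scan_cost(len(compressed)) / scan_cost(len(raw)):.0%}")
```

The achievable ratio depends heavily on the data and the codec, so it is worth measuring on a sample of real tables.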
You can achieve further optimization for managed services like AWS Athena by using columnar file formats. Athena, for example, supports Apache Parquet. Because the file is organized by column, a query can read individual columns instead of reading the entire file. Because Athena is charged on bytes read, you are only charged for the columns you read during a query and the costs are lower as a result. To get this benefit, existing queries, especially those written by less technical users, will need to be checked to verify they only select the columns they actually need.
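The idea can be illustrated with a toy layout comparison in plain Python. This is not Parquet and ignores encoding and compression; it only shows why reading one column touches fewer bytes than reading whole rows. The table and column names are made up.

```python
# Made-up table: (id, user, country, amount)
columns = ["id", "user", "country", "amount"]
rows = [(i, f"user_{i % 100}", "US", i * 1.5) for i in range(1000)]

# Row-oriented layout: every query has to read the whole blob.
row_blob = "\n".join(",".join(map(str, row)) for row in rows).encode("utf-8")

# Column-oriented layout: one blob per column, read independently.
column_blobs = {
    name: "\n".join(str(row[index]) for row in rows).encode("utf-8")
    for index, name in enumerate(columns)
}


def bytes_scanned_columnar(needed_columns):
    """Bytes read when only the listed columns are touched."""
    return sum(len(column_blobs[name]) for name in needed_columns)


full_scan = len(row_blob)
amount_only = bytes_scanned_columnar(["amount"])
select_all = bytes_scanned_columnar(columns)

print(f"row layout, any query: {full_scan} bytes")
print(f"columnar, amount only: {amount_only} bytes")
print(f"columnar, all columns: {select_all} bytes")
```

Note that selecting every column reads roughly as much as the row layout, which is why queries that select everything give up most of the savings.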
More Technical Optimizations
Other optimizations may take more technical changes or choosing completely new technologies.
Running a cluster represents a fixed cost. Switching to a usage-based managed service may be a way to cut costs. Two examples of these pay-per-use services are GCP’s BigQuery and AWS’s Athena, which is built on Presto. Using these services could offload some work or avoid spinning up a cluster and its associated costs. For example, a cluster might be unused overnight except for some end-of-day queries. These queries run on a Spark cluster that runs 24/7. These queries could be transitioned over to BigQuery/Athena and the Spark cluster could be turned off overnight. An automated script could turn the Spark cluster off overnight and on weekends, and back on for working hours.
Comparing costs when moving to a usage-based managed service may not be straightforward. This is because usage-based services are often charged on bytes processed while other services are charged on time. Another factor to compare is the increase or decrease in the amount of time each process takes, because this factors into the user’s time. As I discussed previously, there could be infrastructure savings that actually cost more by having people wait around for queries to finish. You will want to take both timing and savings into account when deciding whether to move to a usage-based managed service.
Some use cases lend themselves well to dynamic clusters. These are clusters that only spin up to run a specific job and then stop. This keeps cluster usage as close to 100% as possible. The cluster startup time needs to be amortized into the cost too. By keeping usage near 100%, the overall costs are lower because no money is wasted on idle clusters.
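The decision logic for such a script might be sketched as follows. The working hours are hypothetical, and the actual start and stop calls are left out because they depend on the cloud provider and how the cluster is managed.

```python
from datetime import datetime


def cluster_should_run(now, start_hour=7, stop_hour=19):
    """Return True during assumed working hours, Monday through Friday."""
    if now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    return start_hour <= now.hour < stop_hour


def desired_action(now, currently_running):
    """Decide what a scheduled job (for example, run hourly) should do."""
    should_run = cluster_should_run(now)
    if should_run and not currently_running:
        return "start"
    if not should_run and currently_running:
        return "stop"
    return "no change"


print(desired_action(datetime(2020, 4, 6, 10, 0), currently_running=False))  # Monday morning
print(desired_action(datetime(2020, 4, 6, 23, 0), currently_running=True))   # Monday night
print(desired_action(datetime(2020, 4, 4, 10, 0), currently_running=True))   # Saturday
```

Keeping the decision separate from the provider-specific calls makes the schedule easy to test and to adjust when working hours change.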
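One way to put the two pricing models on the same footing is to convert both to a daily cost and find the break-even data volume. The node count, hourly rate, and price per terabyte below are hypothetical, and the sketch deliberately leaves out the cost of people's waiting time, which should be added on each side.

```python
def cluster_daily_cost(nodes, hourly_rate_per_node, hours_per_day=24):
    """Time-based pricing: pay for the nodes whether or not queries run."""
    return nodes * hourly_rate_per_node * hours_per_day


def usage_daily_cost(tb_scanned_per_day, price_per_tb):
    """Usage-based pricing: pay for the bytes the queries process."""
    return tb_scanned_per_day * price_per_tb


def break_even_tb_per_day(nodes, hourly_rate_per_node, price_per_tb,
                          hours_per_day=24):
    """Daily scan volume at which both options cost the same."""
    return cluster_daily_cost(nodes, hourly_rate_per_node, hours_per_day) / price_per_tb


# Hypothetical: 10 nodes at $0.50/hour versus $5.00 per TB scanned.
cluster = cluster_daily_cost(10, 0.50)
usage = usage_daily_cost(8, 5.00)
break_even = break_even_tb_per_day(10, 0.50, 5.00)

print(f"cluster: ${cluster:.2f}/day, usage-based at 8 TB/day: ${usage:.2f}/day")
print(f"break-even: {break_even:.0f} TB scanned per day")
```

Below the break-even volume the usage-based service is cheaper on infrastructure alone; above it, the fixed-cost cluster wins.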
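The amortization of startup time can be sketched with a few lines of arithmetic. The job length, startup time, and hourly rate are hypothetical.

```python
def dynamic_cluster_daily_cost(jobs_per_day, job_minutes, startup_minutes,
                               cluster_hourly_rate):
    """Each job pays for its own runtime plus the cluster startup time."""
    billed_hours = jobs_per_day * (job_minutes + startup_minutes) / 60
    return billed_hours * cluster_hourly_rate


def useful_fraction(job_minutes, startup_minutes):
    """Share of billed time spent doing actual work."""
    return job_minutes / (job_minutes + startup_minutes)


# Hypothetical: 3 jobs per day, 45 minutes each, 5 minutes of startup, $10/hour.
dynamic = dynamic_cluster_daily_cost(3, 45, 5, 10)
always_on = 24 * 10
usage = useful_fraction(45, 5)

print(f"dynamic: ${dynamic:.2f}/day, always on: ${always_on:.2f}/day")
print(f"useful share of billed time: {usage:.0%}")
```

The shorter the job relative to the startup time, the lower the useful share, so very short jobs may be better served by a shared cluster or a usage-based service.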