My Big Data Journey

Jesse Anderson
June 22, 2016
Blog, Business, Data Engineering, Data Engineering is hard
No Comments

Blog Summary: (AI Summaries by Summarizes)

The author's background in creating distributed systems from scratch in Java gave them a leg up in learning Big Data technologies.
Multi-threading experience was also helpful in understanding how to move data around between threads and processes in Big Data frameworks.
The author worked at a mobile startup that would have had several Big Data problems, including a recommendation engine and massive geo-located lookups.
At Intuit, the author's team hit the boundaries of what their relational database could handle and denormalized the data to get better performance, but eventually realized they needed a NoSQL solution.
The author's Million Monkeys project using Hadoop went viral and gained them notoriety in the Big Data community.

Everyone’s Big Data journey starts somewhere. We’re often given stories of outright mastery, but I want to tell you how I got started with Big Data. Each of these stories about mastery forget or omit their humble beginnings. This is my story from my humble beginnings.

Distributed Systems

My specialty in programming has always been distributed systems. This has included writing three distributed systems from scratch in Java. They made extensive use of the java.util.concurrent and java.nio packages. These projects ranged from distributed storage systems to distributed processing engines.

This gave me a serious leg up in learning Big Data technologies. Big Data frameworks are distributed systems that deal with massive amounts of data. Many of the same theories and patterns still applied.

The projects also imbued a desire to not have to write my own distributed system again. They’re not only complicated to write, but they’re impossible to debug. Somewhere about 80% of code is dedicated to just handling errors. I’d much rather focus on writing a feature or analytic than debugging a distributed systems problem.

Multi-threading

Along with creating distributed systems, I wrote UI code. This code had to be multi-threaded. It had to be done correctly and highly performant. This UI code usually served as the front-end for a distributed system.

Having extensive multi-threading experience was another serious leg up. Big Data frameworks handle the threading for you, but you still need to understand how to move data around between threads and processes. With Big Data frameworks, you’re often both multi-threading and multi-processing. Understanding the limitations and patterns of multi-threading really helps in distributing data and the limitations of distributed systems.

Mobile Startup

I joined a mobile startup poised for world domination as their Lead Engineer. Most startups don’t have Big Data problems; they aspire to having enough customers to have Big Data problems. If it would have been successful, we would have suffered a deluge of data.

This startup would have had several Big Data problems. They were mostly thought experiments and knowing there would be some significant technical debt to pay down.

Our most complicated ambition was a recommendation engine. We wanted to give our users recommendations on which events that they should attend. It was made up of two really complicated problems. The first was the recommendation engine itself and the second was how to run that algorithm on many different computers at once. This feature never materialized because the recommendation engine was never written, but worse yet, we weren’t collecting information on user’s likes until I joined. Without good data and vast amounts of it, we could have never created an effective recommendation engine.

We faced another difficult problem once we hit scale. We needed to do massive geo-located lookups. To complicate it further, we needed to have a time dimension. In other words, we needed look up things happening at a given latitude/longitude and time. At small scales, like us, we could just use a relational database. At scale, our relational database would fall over and die. We knew we would need something better, but never hit the scale to need such a solution. Now, we could have used HBase and a geohash with time.

We did hit some scale issues on the analytics I wrote. I was doing both simple analytics, like counts and averages, and advanced analytics with windowing and sessionization. The sessionization is where it killed the memory. I needed to calculate usage over 14 day windows. I further broke those sessions down by how active a user they were during that time. All of these were awesome for the business and chart creation, but I worried how I’d continue to process this data at scale. There would come a time when a day’s worth of event wouldn’t fit in memory. Now, there are technologies like Apache Crunch and Apache Beam that make sessionization at scale possible and even a little easier.

Intuit

At Intuit we hitting the boundaries of what our relational database could handle. We had a DBA attached to our team just to apply the black magic of Oracle query optimization. We spent an inordinate amount of time dealing with database performance issues. We were simply outside the limits of what it could do.

We were hooked on the relational database and we couldn’t move off, at least that’s what we thought. We started denormalizing the data to get better performance and reduce the number of joins. We had the database shared by time, but at the scales we hit, that wasn’t enough. We were hitting the limits of relational databases that NoSQL databases solve.

We could have spent a million dollars on hardware and software upgrades. It would have only given us a 10% improvement. To this day, I think how crazy that is and how frightened we should have been as a business. No amount of amount of money thrown at the relational database problem would have solved it.

We had several mission critical jobs that ran at the end of every business day. It had to be babysat by an engineer every single day. Looking back, this babysitting showed our mountain of technical debt. Instead of paying the NoSQL piper, we kept on keeping our relational database on life support. Whenever there was a problem, we didn’t have a way to process the data any faster and we definitely didn’t have a plan to scale. To recover from an error, we had to start back at step one. There weren’t any built-in mechanisms to be fault tolerant. We had to restart the process and hope that it completed before the cutoff time.

The lack of data processing should have been a scary business issue. There were so many problems that prevented us from scaling. We had so much technical debt, we couldn’t solve the problems well and our constant production problems stopped us from paying down our technical debt.

Million Monkeys

I’m seeing all of these scale issues at Intuit and I started to read about Hadoop. Given my distributed systems background, I understood how most of it worked and I immediately realized that it would be a game changing technology. I liked how novel it solved the most difficult problems of distributed systems.

I read the books on Hadoop. They were great as reference books, but it was hard to get a real handle on the code. Hadoop and its ecosystem were just this massive amount to learn and understand what each part did or the subtle differences between them. I had learned many technologies through books, but Big Data proved more difficult.

I wanted to play with Hadoop, but I needed a project to use it with. I eventually settled on randomly recreating Shakespeare. The rest of the story is mostly history. Hadoop, with a novel idea, and a cool implementation led to the project going viral.

Cloudera

The worldwide success of my Million Monkeys project gained me notoriety. It also helped me start my phase in my Big Data journey. Cloudera was the leader in Hadoop. I thought Hadoop was the future and I wanted to join.

I joined Cloudera’s training team. On this team, I experienced how well-written curriculum and great instructor can make complicated concepts easier to learn and understand. A good instructor can answer any questions and explain any concepts a student doesn’t understand. Great materials can make the concepts and code understandable.

While I was at Cloudera, I started experimenting with novel ways to explain Hadoop. I started explaining these concepts with Legos and playing cards. This allowed me to show even non-technical people how Hadoop worked.

I also wrote and taught Cloudera’s most technically challenging courses. I wrote/co-wrote the HBase, Designing and Building Big Data Applications, Cloudera Manager, and Data Science courses.

While I was at Cloudera, I started to look past the current state of Big Data and into the future. I started to see the common issues holding teams back. I started to what individuals were finding difficult. I began to write about what I saw to help teams avoid the mistakes of other companies.

Final Thoughts

Distributed systems are hard. I had a good understanding of distributed systems before learning Hadoop. It still took me six months of concerted effort before I really felt comfortable with Hadoop and it ecosystem. There really is a significant amount to learn and internalize.

Before I learned Hadoop, I could learn most technologies by simply learning the API. I already knew most of the concepts and I could just read a book. Hadoop was very different. I needed a much deeper understanding of Hadoop to use it correctly. The difficultly of creating a solution with Hadoop is making the many different technologies work together correctly. This isn’t done with some simple glue code; it requires hardcore engineering.

Learning Big Data technologies really is a journey. It takes time and effort. It’s changing rapidly too. I encourage you to start your journey with me.

My Big Data Journey

Distributed Systems

Multi-threading

Mobile Startup

Intuit

Million Monkeys

Cloudera

Final Thoughts

Related Posts

Unapologetically Technical Episode 20 – Shane Murray

Unapologetically Technical Episode 19 – Jacopo Tagliabue

Unapologetically Technical Episode 18 – Adrian Woodhead

Unapologetically Technical Episode 17 – Semih Salihoglu

Unapologetically Technical Episode 16 – David Jayatillake

Unapologetically Technical Episode 15 – Frances Perry

Unapologetically Technical Episode 14 – Cliff Crosland

Data Teams Survey 2020-2024 Analysis

Data Teams Survey 2024 Results

Join the Newsletter