- Understanding your use case is critical for the success of your project, especially in Big Data.
- Small data use cases can often use the same technology stack, while Big Data use cases require different technology stacks depending on the use case.
- Use case factors heavily into the design of Big Data systems, while in small data, domain knowledge is the main impact on design.
- Scale and timing are critical in Big Data, and understanding the scale and future scale of a project is necessary to decide if it needs Big Data technologies.
- In Big Data, data is often treated as an end product and distributed using technologies like Apache Kafka or HDFS, which requires designing more general-purpose pipelines.
Sometimes I’ll write a post and the comments will say something to the effect of “this is useless.” Other times I’ll be finishing up a class and a student will ask me why I didn’t cover what they’re trying to do. I’ve written example code and people will ask me why I didn’t write it on something more specific.
In my book and classes, I make it very clear how absolutely critical it is to deeply understand your use case. The companies and individuals that don’t internalize this will often fail in their projects. Technologies and applications of technologies are entirely dependent on their use cases.
I want to share how I contrast small data use cases and Big Data use cases in my classes. Think about how you’ve been designing small data use cases:
- Could you almost always use the same stack (e.g. Java/MySQL/Tomcat) for virtually everything?
- Other than domain knowledge, how much did the use case factor into your design?
- Did you ever have to think about scales and timings in your design?
- Did you ever have to think about data as an end product and how to distribute it?
We’ll contrast that with Big Data. In my private classes, we’ll get into the company’s use case(s). While we’re talking about them, I’ll ask them the questions that the team should already know. You can view some of those questions in Chapter 8 “Steps to Creating Big Data Solutions.”
Let’s talk about the common answers to small data and then add the Big Data answers.
Could you almost always use the same stack (e.g. Java/MySQL/Tomcat) for virtually everything?
In small data, you use virtually the same stack for everything. For most of my career in small data, I used the same stack of technologies. It didn’t matter what the use case was; I could always use Java with MySQL, Oracle, or a similar database and run things on Tomcat. There were very few variables.
For Big Data, all that changes. There can be a different technology stack every time. The companies I teach at will often ask which technology they should use for everything. I reply that the question should be, “Which are the right technologies for the job?”
Choosing the wrong stack could make a use case very difficult or impossible given the technology’s constraints. As a direct result, technologies should not be chosen until you understand your use case (once again, see Chapter 8 “Steps to Creating Big Data Solutions” for more information).
To really drive this home, let me share some examples from conferences. Sometimes, companies in the same industry will give use case talks at the conference. They’ll actually use different technologies even though they’re in the same industry. The reason is that they have different use cases and those different use cases required different underlying technologies.
Other than domain knowledge, how much did the use case factor into your design?
In small data, use case has a minimal impact on design. Virtually every use case fits into the same technology stack. The main impact on design is the domain knowledge.
For Big Data, use case factors heavily into design. It completely drives your use of and choice of technologies.
When creating a system, there is more focus on how the data is interacted with. These interactions are entirely dependent on use case. For example, if you’re using a NoSQL database like HBase, you need to know how the data is going to be read and written. If you don’t know these answers ahead of time, you may have a poorly performing system or a system where it’s too difficult to access data.
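To make the HBase point concrete, here is a minimal sketch in plain Python rather than the HBase API. It assumes only that the store keeps rows sorted by row key and can scan a key range cheaply, which is the property that makes row key design matter. The `SortedStore` class, the key layout, and the sample readings are all hypothetical.

```python
import bisect


class SortedStore:
    """Toy stand-in for a key-sorted store (hypothetical, not the HBase API)."""

    def __init__(self):
        self._keys = []   # row keys, kept in sorted order
        self._rows = {}   # row key -> value

    def put(self, key, value):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def scan_prefix(self, prefix):
        """Return (key, value) pairs whose key starts with prefix.

        Jumps to the first candidate key and stops at the first
        non-matching one, so only the matching range is touched.
        """
        start = bisect.bisect_left(self._keys, prefix)
        results = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            results.append((key, self._rows[key]))
        return results


# Assumed use case: "show all readings for one device, in time order".
# Leading the row key with the device id makes that a single range scan.
store = SortedStore()
store.put("device-17|2024-01-01T00:00", 20.5)
store.put("device-42|2024-01-01T00:00", 18.0)
store.put("device-17|2024-01-01T00:05", 20.7)

device_17 = store.scan_prefix("device-17|")
```

If the use case were instead "all devices for one time window," this key layout would force a scan over every row, and a key that leads with the timestamp would likely fit better. The layout follows from the read and write patterns, so those have to be known first.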
In this vein, the technical success of your Big Data project comes down to your understanding of the use case as it’s reflected in the domain.
Did you ever have to think about scales and timings in your design?
In small data, scale is rarely taken into account. As long as a query doesn’t take too long to run, you’re doing good. If a query takes too long, you can look at adding an index.
In Big Data, knowing the scale and timing is critical.
First and foremost, you need to know the scale or future scale of a project. You need this information to decide if the project needs Big Data or will become Big Data. Using Big Data technologies for small data tasks isn’t just over-engineering or overkill; it adds an order of magnitude increase in complexity (see Chapter 2 “The Need for Data Engineering” for more information).
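A rough, back-of-the-envelope projection is often enough to start this conversation. The sketch below is only an illustration: the event counts, record size, growth rate, and the single-server limit are all made-up assumptions, and stored volume is just one signal among several. Swap in numbers from your own use case.

```python
def projected_bytes(events_per_day, bytes_per_event, yearly_growth, years):
    """Total bytes stored after `years`, with daily volume compounding yearly."""
    total = 0
    daily = events_per_day
    for _ in range(years):
        total += daily * 365 * bytes_per_event
        daily *= 1 + yearly_growth
    return total


# Hypothetical threshold: an assumed comfort limit for one database server (2 TB).
SINGLE_NODE_LIMIT = 2 * 10**12

# Two hypothetical projects with the same record size and growth rate.
small = projected_bytes(events_per_day=100_000, bytes_per_event=500,
                        yearly_growth=0.5, years=3)
large = projected_bytes(events_per_day=500_000_000, bytes_per_event=500,
                        yearly_growth=0.5, years=3)

for name, size in (("small", small), ("large", large)):
    verdict = "likely needs Big Data" if size > SINGLE_NODE_LIMIT else "fits small data"
    print(f"{name}: {size / 10**12:.2f} TB over 3 years -> {verdict}")
```

The arithmetic is trivial on purpose. The hard part is getting trustworthy inputs, and those only come from understanding the use case.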
Once you’ve decided your use case truly requires Big Data, you’ll need to know the scales and timings to choose technologies. For example, there is a big difference between querying and getting a response in <30ms versus 1 hour. Only by really understanding your use case can you answer these questions.
Did you ever have to think about data as an end product and how to distribute it?
In small data, you don’t really think about data as a product. The database is usually your data product. If you do distribute data, it’s through a REST call. This allows you to more tightly control who uses the data and how.
In Big Data, you do have data products that are exposed via a NoSQL database and/or a REST API.
However, you’ll start to use technologies like Apache Kafka to expose data. You won’t have direct control over who uses the data and how. This causes companies to have some difficulty transitioning over to a data pipeline.
Note: when I say “direct control,” I don’t mean that these technologies lack authorization mechanisms. I’m saying the team may or may not interface with the data engineering team when using a data pipeline.
Other times, you’ll be exposing data as a table or directory in HDFS. In a similar way, you do not have direct control over how the data is used.
This lack of control figures into the design of the data pipeline. You should be designing more general-purpose pipelines. That includes having fatter data payloads so you aren’t changing the data layout as new use cases arise.
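As a small illustration of what a fatter payload buys you, consider the sketch below. The order event, its field names, and the `units_per_sku` consumer are all hypothetical; the point is only that a later use case can be served without changing the data layout.

```python
import json

# Thin payload: shaped for the first consumer only (order totals).
thin_event = {"order_id": 1001, "total": 59.90}

# Fatter payload: carries the context the producer already has,
# even though no consumer asks for all of it yet.
fat_event = {
    "order_id": 1001,
    "total": 59.90,
    "currency": "USD",
    "customer_id": 77,
    "items": [
        {"sku": "A-1", "qty": 2, "unit_price": 19.95},
        {"sku": "B-7", "qty": 1, "unit_price": 20.00},
    ],
    "created_at": "2024-01-01T12:00:00Z",
}


def units_per_sku(event):
    """A later use case: units sold per SKU. Returns None if the data is missing."""
    items = event.get("items")
    if items is None:
        return None
    return {item["sku"]: item["qty"] for item in items}


# Serialize as a pipeline would, then read back as a downstream team would.
thin_result = units_per_sku(json.loads(json.dumps(thin_event)))
fat_result = units_per_sku(json.loads(json.dumps(fat_event)))
```

With the thin event, the new consumer has nothing to work with, and someone has to change the producer and the layout. With the fat event, the new team can build on the pipeline without ever talking to the data engineering team.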
There Isn’t a Cookbook
Sometimes I’ll get to the end of a class and a student will chafe at the lack of a cookbook. This cookbook would be the step-by-step process for any company and industry. The unfortunate thing is that these general cookbooks don’t exist. The closest equivalents are books for very specific parts of a use case like ETL, clickstream, or deduping.
Since there aren’t cookbooks, what do you do? You need qualified Data Engineers (see Chapter 4 “Data Engineers” for a definition of a qualified Data Engineer). The qualified Data Engineer will know how to get the right information about the use case to create the data pipeline.
This requirement of qualified people instead of cookie-cutter cookbooks places qualified Data Engineers in high demand. Managers need to take this into consideration as they create their data engineering team.