- SQL cannot handle all aspects of creating a Big Data data pipeline, and eventually, a programming language will be needed to fill in the gaps.
- Customization is a key factor in the need for programming, as companies often have custom use cases that require custom code.
- Integrating systems is a difficult part of the data engineering process, and SQL is not always sufficient for decoding binary data or expressing more advanced smoothing or normalizing algorithms.
- SQL is best used at the end of the pipeline to create a new and derivative data product, and data engineers must be able to choose the right language for the job.
- SQL is useful for exposing a data pipeline for consumption by less technical groups for ad hoc queries and report creation.
There is a common misunderstanding in data engineering that you can do everything you need to create a Big Data data pipeline with SQL. This notion is being promoted by some vendors and companies. They’re wrong and you can’t do all of your data engineering with SQL. You will eventually need a programming language to fill in the gaps. This goes well beyond adding UDFs (user defined functions) to accommodate custom functions.
This need for programming comes from companies expecting to do custom things. They’ve always had programmers on staff to create custom code for custom use cases. Given time and the additions of even more custom cases and you have something that can’t be handled by anything other than custom code. The older the company, the messier the integration problems and the more data engineering is needed.
The most common object for Big Data is to bring data from all over the company and into a single place. The integration of all of these systems is a difficult part of the data engineering process.
A common integration workflow is to start out with binary data. This binary data will come from some sort of legacy system or embedded device sending out data. How do you decode binary data with a SQL query? From there, the data needs to processed, enriched, and smoothed. I’ve seen companies try to do these steps in SQL. It leads to incomprehensible SQL queries that would have been better expressed with code. More advanced smoothing or normalizing algorithms can’t be expressed with SQL.
You might think this is just an enterprise or larger organization issue. If you’re a startup, you’ll likely have to bring in data from other sources. You’ll be lucky if these outside data sources are usable in their current format. This is because most companies fail at exposing data as a product.
A common outside data source workflow is to take the data and normalize it to your system. The incoming data is often a binary, custom, or quasi-standards compliant format. The SQL tools expect standards compliant JSON, XML, or other format. The incoming data doesn’t conform to the standard 100% because there are small but important differences. You are faced with how to change the data into something that works. You’re back to a task that can potentially be expressed with SQL, but would be so complex and incomprehensible that you’re better served with code. Once again, this assumes that problem is solvable with SQL. In my experience, the majority of problems can’t even be expressed with SQL.
If your company isn’t dealing with the many issues of integrating systems, count your lucky stars. You may hit it eventually.
Where Does SQL Go?
Basically at the end of the pipeline when the pipeline is close to the end product. You might even go so far as to say at the end of the data pipeline and you’d use SQL to create a new and derivative data product. SQL can only be done at the tail end of lots of data engineering work (that’s been coded).
Don’t get me wrong. SQL really has its place. If a team is limited to SQL as their only programming language, they will be severely limited or completely unable to create data pipelines. It’s imperative that data engineers be able to program. A data engineer must be able to choose the right language for the job – be that SQL, Java, Scala, or Python.
SQL augments a data engineer’s abilities. Sometimes SQL is just plain better at expressing a problem. A few common examples of this are
GROUP BY, or simple
WHERE clauses. Some of the better SQL integrations in Big Data frameworks allow data engineers to come in and out of SQL and code. This really gives the best of both worlds and allows data engineers to be even more productive.
SQL is one of the best ways to expose a data pipeline for consumption by other, less technical groups for ad hoc queries and report creation. When I’m evaluating technologies for a team, I’m always looking for SQL support. However, these are the systems receiving and serving up the previous work done in the data pipeline.
How do you test SQL pipelines? These tests can be unit, integration, or system tests.
I’ve had several clients who were transitioning off of RDBMS systems with heavy SQL and stored procedure usage. Their main difficulty was how to replicate the stored procedures and verify it. This made it really difficult to verify that the new code was working as expected.
Some of the real-time SQL processing technologies will have this issue. There will be no way to test the SQL that is being run or unit test against it.
Some issues with data quality will remain the same whether you’re coding or using SQL. I advocate a fail fast policy for data quality. You fail, log it, and move on. Make sure you run negative tests to verify you’re handling bad data correct.
Data Warehousing Teams
When I find a client is relying too much on SQL, I find their data engineering team is lacking in some key skills. This usually happens when a data warehousing team is tasked with big data.
When a team lacks the programming and distributed systems skills, they are headed for failure. I talk more about the required skills for a data engineering team in my Data Engineering Teams book.
A SQL-only focus brings an issue with the creation of systems and creation of systems is where SQL-focused people fall short. They’re used to single systems. Most tasks have a step-by-step cookbook of system creation. These cookbooks have taken the most common architectural patterns and spelled them out in detail. There is the ability to create a one-size-fits-all with small data. With Big Data, that isn’t possible because everything is so drive by the use case. This really the biggest sticking point I’ve found with SQL-focused people.
Data engineering is difficult because you’re pulling so many open source technologies together. You aren’t primarily working with a database and its smaller ecosystem. With Big Data, you’re tying 10-30 different pieces together. One of the reviewers of this post summed it up nicely by saying you need to “know when your problem is simple database problem, and when it is a problem requiring code.”
There are several parts that are absolutely critical to success with Big Data. You need data engineers and they’ll choose between code and SQL as it fits. This all starts with forming the right data engineering team with all of the required skills.