GPT and LLMs from a Data Engineering Perspective

Blog Summary: (AI Summaries by Summarizes)
  • LLMs (Large Language Models) are gaining popularity and excitement from the general public.
  • Google is integrating LLMs into Google Workspace, indicating their long-term presence.
  • LLMs will change certain areas dramatically, but their utility is more narrow compared to AI in general.
  • LLMs excel at summarizing text and can be used to generate code, but caution is needed as they may produce code with subtle bugs.
  • LLMs can be used by data engineers for processing unstructured data and complex language processing.

There has been quite a bit of writing covering GPT and LLMs from data science and business perspectives. I haven’t seen much from the data engineering side. Let me share my perspective, having been in data and AI for a while and using LLMs before they became popular.

It is interesting to see the general public now as excited as people in the LLM space were a year ago. Even at the height of the big data craze, the general public wasn’t this excited.

I want to answer the three most common questions I get and one bonus question.

Is it here to stay?

Yes, I think it is.

As of now, Google is integrating it into Google Workspace. This integration will make people interact with an LLM on a daily basis. Users won’t have to worry about starting GPT or another program to interact with an LLM.

The metric I use for technology adoption is, what would people say if it were to disappear tomorrow? As soon as people start using LLMs on a daily basis in Gmail and Google Docs, they’re going to expect it. Think of autocompletes. We expect them. Now, expand that to writing long-form content or email. People will throw an absolute fit the next time they have to write an email that isn’t expanded from a multi-sentence prompt.

By this metric, LLMs are here to stay and integrate into our daily lives.

What will it change?

It will change certain areas dramatically. The key issue is that many generalize and think LLMs will exhibit the same amount of progress and utility in all areas. The utility of LLMs is more narrow (as opposed to AI in general). It will change the use cases of human-generated text.

I like to say the dead center of LLMs is the summarization of text. I tested this by creating a WordPress plugin that summarizes blog posts using LLMs. You should have seen the summary of this post at the top. Having spent quite a bit of time testing this, I find that LLM summaries are excellent and can summarize concepts strewn across multiple sentences.

Another metric I use is talking to my friends in the industry. My non-scientific polls show them split 50/50 on LLMs’ staying power and overall change. On the upper end, they think it’s going to change the world as much as the internet. On the lower end, they think it will fizzle out after a few years, a la Clippy.

I believe that in the next 3-5 years, it will decimate some specific titles (e.g., paralegals) and increase the productivity of other titles. What makes this change different is how it will affect middle-class and knowledge-worker jobs. Previous automation cycles mostly hit lower-wage, lower-skill jobs. This cycle will be different because the AI can do smarter tasks and hit knowledge-worker jobs. Some are saying the same number of people will simply be more productive. If history is any indication, there will be fewer, more productive people rather than teams staying the same size.

The big step will be LLMs running on low-power devices that are so cheap that they’ll be disposable. Today, we mostly think in terms of interacting with LLMs on our phones or laptops. Low-power devices will change the latency and connectivity constraints and allow LLMs to be everywhere.

Is it going to destroy us?

AI will only be as good or evil as we are. Yes, man can be quite evil, but I’m hoping a more rational AI will look at most evil deeds as not worth the ROI. War, for example, rarely has an ROI, and I’m hoping that an AI will see this outcome more often.

I have another take that I haven’t seen anywhere else. I think LLMs will allow the majority to get stupider, like in Idiocracy, while a gradually decreasing population will benefit from and exploit AI (and technology in general). Think of the analogy of the calculator. Why should you learn any math if a calculator can do it? Now, expand that out to a much wider part of the human experience. Why should you learn proper grammar? Why should you memorize X? Why should you learn X? Why learn another language? If there’s a low-power LLM within my voice’s reach or implanted into me, I don’t need to know, understand, or learn how to do any of that.

All the while, there will be a portion of the population that creates more and more complex systems with fewer people. It will be like the Star Trek, Star Wars, or other sci-fi episodes where an incredibly advanced civilization has no one left to maintain or create its systems because no one learned enough to do it.

How can we use LLMs in data engineering?

Code generation is the lowest-hanging fruit of everything. Be careful. I’ve been experimenting with code generation. It generates code with some weird hallucinations (e.g., packages/libraries that don’t exist) and code with subtle bugs. These subtle bugs remove any time gains and then some. This will make senior engineers more productive. There isn’t a great integration in my IDEs, and having to switch windows to generate some code is too slow and flow-interrupting for me.
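One cheap guard against the hallucinated packages mentioned above is to check a generated snippet's imports before running it. Here is a minimal sketch using only the Python standard library. The function name `missing_imports` and the sample snippet are made up for illustration, and this only catches modules that can't be found in the current environment. It does nothing about subtle logic bugs.

```python
import ast
import importlib.util


def missing_imports(source: str) -> list[str]:
    """Return top-level modules imported by `source` that can't be found locally."""
    tree = ast.parse(source)
    top_level = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                top_level.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports such as "from . import x".
            if node.level == 0 and node.module:
                top_level.add(node.module.split(".")[0])

    missing = []
    for name in sorted(top_level):
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False
        if not found:
            missing.append(name)
    return missing


# Hypothetical LLM output: two real modules and one invented one.
generated = """
import json
from os import path
import totally_made_up_llm_pkg
"""

print(missing_imports(generated))
```

A module that isn't found may be hallucinated or may just not be installed yet, so treat the result as a prompt to go look rather than a verdict.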

I explain to non-technical people that code generation isn’t the most difficult part of software engineering. LLM code generation solves a boilerplate or class generation problem. Solving business/technical problems and debugging are the biggest parts, and I don’t see LLMs doing that anytime soon.

Using LLMs to process unstructured data is amazing. With the right prompts and code, you can do some serious data engineering work. Getting the prompts right and debugging them can be a time-consuming endeavor. Just remember that if you’re running this over a large number of records, the LLM latency and costs will become an issue. Using open source LLMs can be a way to improve this.
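To put the latency and cost warning in rough numbers, here is a back-of-the-envelope sketch. Every figure in it (record count, tokens per record, price per 1,000 tokens, seconds per call, parallelism) is a made-up assumption, not a quote from any vendor. Plug in your own.

```python
def estimate_batch(num_records, tokens_per_record, price_per_1k_tokens,
                   seconds_per_call, parallel_calls=1):
    """Rough cost (in dollars) and wall-clock time (in hours) for one LLM call per record."""
    total_tokens = num_records * tokens_per_record
    cost = total_tokens / 1000 * price_per_1k_tokens
    hours = num_records * seconds_per_call / parallel_calls / 3600
    return cost, hours


# Hypothetical workload: one million records, one call each.
cost, hours = estimate_batch(
    num_records=1_000_000,
    tokens_per_record=500,       # assumed prompt + response size
    price_per_1k_tokens=0.002,   # assumed price, in dollars
    seconds_per_call=2.0,        # assumed latency per call
    parallel_calls=10,           # assumed concurrency
)
print(f"~${cost:,.0f} and ~{hours:.1f} hours")
```

With these assumed numbers, a single pass over the data costs about $1,000 and takes more than two days of wall-clock time, which is why repeated runs while debugging prompts add up quickly.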

This sort of work used to be the domain of data scientists specializing in NLP (natural language processing).

LLMs have opened up the door for data engineers to start doing some complex language processing. I’ve been waiting to see what trickles out of data science and into data engineering. This is one good example.
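As a sketch of the kind of unstructured-data processing described above, here is one way to turn free text into a validated record. Everything in it is an assumption for illustration: the prompt, the field names, and the `complete` callable, which stands in for whatever LLM client you actually use. The `fake_llm` stub exists only so the example runs offline.

```python
import json
from typing import Callable, Optional

PROMPT_TEMPLATE = (
    "Extract the fields below from the support ticket and reply with JSON only.\n"
    "Fields: customer (string), product (string), "
    "sentiment (positive|neutral|negative)\n"
    "Ticket: {text}"
)
REQUIRED_FIELDS = ("customer", "product", "sentiment")
ALLOWED_SENTIMENTS = ("positive", "neutral", "negative")


def extract_record(text: str, complete: Callable[[str], str]) -> Optional[dict]:
    """Ask the model for JSON and return a validated dict, or None if it is unusable."""
    raw = complete(PROMPT_TEMPLATE.format(text=text))
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # The model replied with something other than JSON.
    if not isinstance(data, dict):
        return None
    if any(field not in data for field in REQUIRED_FIELDS):
        return None
    if data["sentiment"] not in ALLOWED_SENTIMENTS:
        return None
    # Keep only the expected fields so extra keys don't leak downstream.
    return {field: data[field] for field in REQUIRED_FIELDS}


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call; always returns the same answer.
    return '{"customer": "Acme", "product": "router", "sentiment": "negative"}'


ticket = "Acme here. The router you sold us drops the connection every hour."
print(extract_record(ticket, fake_llm))
```

Returning None instead of raising makes it easy to route bad responses to a retry or a dead-letter table, the same way you would handle any other malformed input in a pipeline.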

We’re going to need vector databases. There is a wide variety of vector databases out there. We’ll need a good place to store LLM logs/prompts and retrieve data to add to prompts. I think the data source technologies will be varied, so you’ll need RDBMS, NoSQL, and vector databases to get the right data from the right places. I suggest you start learning about vector databases, their usage, and potential vendors.
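To make "retrieve data to add to prompts" concrete, here is a toy, in-memory sketch of what a vector database does at its core: store text alongside an embedding vector and return the nearest matches to a query vector. The three-dimensional vectors and the stored sentences are invented for illustration. In practice the embeddings come from a model, and the search runs inside a real vector database rather than a Python list.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k stored texts whose vectors are most similar to query_vec."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Hypothetical (text, embedding) pairs; real embeddings have hundreds of dimensions.
store = [
    ("Orders table is partitioned by order_date.", [0.9, 0.1, 0.0]),
    ("Kafka topic 'clicks' retains data for 7 days.", [0.1, 0.9, 0.1]),
    ("Customer PII lives in the crm schema.", [0.0, 0.2, 0.9]),
]

query_vec = [0.8, 0.2, 0.1]  # assumed embedding of the question below
context = top_k(query_vec, store, k=1)
prompt = (
    "Context:\n" + "\n".join(context)
    + "\n\nQuestion: How is the orders table partitioned?"
)
print(prompt)
```

The brute-force scan here is fine for a handful of rows. Vector databases exist because doing this over millions of vectors needs indexing and approximate search.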

Overall, LLMs are here to stay, and this will change data engineering. I suggest you start learning about them now.
