GPT and LLMs from a Data Engineering Perspective

Jesse Anderson
September 14, 2023

Blog Summary: (AI Summaries by Summarizes)

GPT and LLMs are gaining attention from data science and business perspectives, but less so from the data engineering side.
Google is integrating LLMs into Google Workspace, making interaction with LLMs more seamless for users.
The widespread adoption of LLMs will lead people to expect their presence in daily tools like Gmail and Google Docs.
LLMs are likely here to stay and will become integrated into daily life, similar to how autocomplete features are now expected.
LLMs will have a significant impact on various areas, particularly in changing human-generated text use cases.

There has been quite a bit of writing covering GPT and LLMs from data science and business perspectives. I haven’t seen much from the data engineering side. Let me share my perspective, having been in data and AI for a while and using LLMs before they became popular.

It is interesting to see the general public just as excited about the LLM space as it was a year ago. Even at the height of the big data craze, the general public wasn’t this excited.

I want to answer the three most common questions I get and one bonus question.

Is it here to stay?

Yes, I think it is.

As of now, Google is integrating it into Google Workspace. This integration will make people interact with an LLM on a daily basis. Users won’t have to worry about starting GPT or another program to interact with an LLM.

The metric I use for technology adoption is, what would people say if it were to disappear tomorrow? As soon as people start using LLMs on a daily basis in Gmail and Google Docs, they’re going to expect it. Think of autocompletes. We expect them. Now, expand that to writing long-form content or email. People will throw an absolute fit the next time they have to write an email that isn’t expanded from a multi-sentence prompt.

By this metric, LLMs are here to stay and will become integrated into our daily lives.

What will it change?

It will change certain areas dramatically. The key issue is that many people generalize and think LLMs will exhibit the same amount of progress and utility in all areas. The utility of LLMs is narrower than that of AI in general. LLMs will change the use cases built around human-generated text.

I like to say the dead center of LLMs is the summarization of text. I tested this by creating a WordPress plugin called Summariz.es that summarizes blog posts using LLMs. You should have seen the summary of this post at the top of the post. Having spent quite a bit of time testing this, I find that LLM summaries are excellent and can summarize concepts strewn across multiple sentences.

Another metric I use is talking to my friends in the industry. My non-scientific polls show them split 50/50 on LLMs’ staying power and overall impact. On the upper end, they think it’s going to change the world as much as the internet did. On the lower end, they think it will fizzle out after a few years, a la Clippy.

I believe in the next 3-5 years, it will decimate some specific titles (e.g., paralegals) and increase the productivity of other titles. What makes this change different is how it will affect middle-class and knowledge-worker jobs. Previous automation cycles mostly hit lower-wage, lower-skill jobs. This cycle will be different because the AI can do smarter tasks and hit knowledge-worker jobs. Some are saying the same number of people will simply be more productive. If history is any indication, there will be fewer, more productive people rather than teams staying the same size.

The big step will be LLMs running on low-power devices that are so cheap that they’ll be disposable. Today, we mostly think of interacting with LLMs on our phones or laptops. Low-power devices will change the latency and connectivity constraints and allow LLMs to be everywhere.

Is it going to destroy us?

AI will only be as good or evil as we are. Yes, man can be quite evil, but I’m hoping a more rational AI will look at most evil deeds as not worth the ROI. War, for example, rarely has an ROI, and I’m hoping that an AI will see this outcome more often.

I have another take that I haven’t seen anywhere else. I think LLMs will allow the majority to get stupider, like in Idiocracy, while a gradually shrinking part of the population will benefit from and exploit AI (and technology in general). Think of the analogy of the calculator. Why should you learn any math if a calculator can do it? Now, expand that out to a much wider part of the human experience. Why should you learn proper grammar? Why should you memorize X? Why should you learn X? Why learn another language? If there’s a low-power LLM within my voice’s reach or implanted into me, I don’t need to know, understand, or learn how to do any of that.

All the while, there will be a portion of the population that creates more and more complex systems with fewer people. It will be like the Star Trek, Star Wars, or other sci-fi episodes where an incredibly advanced civilization has no one left to maintain or create its systems because no one learned enough to do it.

How can we use LLMs in data engineering?

Code generation is the lowest-hanging fruit. Be careful, though. I’ve been experimenting with code generation. It generates code with some weird hallucinations (e.g., packages/libraries that don’t exist) and code with subtle bugs. These subtle bugs remove any time gains and then some. This will make senior engineers more productive. There isn’t a great integration in my IDEs, and having to switch windows to generate some code is too slow and flow-interrupting for me.
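One cheap guard against hallucinated packages is to check that every import in a generated snippet resolves before anyone runs it. What follows is a minimal sketch using only the Python standard library. The function name missing_imports and the module name totally_made_up_etl_lib are made up for illustration, and the check only catches nonexistent modules, not the subtle bugs mentioned above.

```python
import ast
import importlib.util


def missing_imports(source: str) -> list:
    """Return top-level modules imported by `source` that are not installed here."""
    tree = ast.parse(source)
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # "import a.b.c" only needs the top-level package "a" to exist.
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports (level > 0); they depend on the local package.
            if node.level == 0 and node.module:
                modules.add(node.module.split(".")[0])
    # find_spec returns None when a top-level module cannot be located.
    return sorted(m for m in modules if importlib.util.find_spec(m) is None)


# Hypothetical LLM output: the last import is an invented library name.
generated = """
import json
from collections import OrderedDict
import totally_made_up_etl_lib
"""

print(missing_imports(generated))  # prints ['totally_made_up_etl_lib']
```

A check like this could run in a pre-commit hook or CI step. It only reflects the packages installed in the environment where it runs, so a real dependency that is simply not installed yet will also be flagged.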

I explain to non-technical people that code generation isn’t the most difficult part of software engineering. LLM code generation solves a boilerplate or class generation problem. Solving business/technical problems and debugging are the biggest parts, and I don’t see LLMs doing that anytime soon.

Using LLMs to process unstructured data is amazing. With the right prompts and code, you can do some serious data engineering work. Getting the prompts right and debugging them can be a time-consuming endeavor. Just remember that if you’re doing this a large number of times, the LLM latency and costs will become an issue. Using open source LLMs can be a way to improve this.
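To make the "right prompts and code" idea concrete, here is a rough sketch of turning free text into a structured record. Everything in it is an assumption for illustration: the support-ticket scenario, the field names, and the complete callable, which stands in for whatever LLM client you actually use. The fake_complete stub only exists so the example runs offline.

```python
import json
from typing import Callable, Optional

# Hypothetical prompt and schema, chosen only for this example.
PROMPT_TEMPLATE = (
    "Extract the fields below from the support ticket and reply with JSON only.\n"
    "Fields: customer (string), product (string), severity (low|medium|high)\n"
    "Ticket:\n{ticket}"
)
REQUIRED_FIELDS = {"customer", "product", "severity"}


def extract_record(ticket: str, complete: Callable[[str], str]) -> Optional[dict]:
    """Ask the model for JSON; return it only if it parses and has every field."""
    raw = complete(PROMPT_TEMPLATE.format(ticket=ticket))
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None  # The model replied with something that is not JSON.
    if not isinstance(record, dict) or not REQUIRED_FIELDS.issubset(record):
        return None  # Valid JSON, but not the shape the pipeline expects.
    return record


def fake_complete(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs without a network."""
    return '{"customer": "Acme", "product": "router", "severity": "high"}'


print(extract_record("Acme says the router keeps rebooting. Urgent!", fake_complete))
```

Treating the model's reply as untrusted input and validating it before it enters a pipeline is the main design choice here. Rejected replies are also where most of the prompt debugging time tends to go, so logging them is worth considering.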

This sort of work used to be the domain of data scientists specializing in NLP (natural language processing).

LLMs have opened up the door for data engineers to start doing some complex language processing. I’ve been waiting to see what trickles out of data science and into data engineering. This is one good example.

We’re going to need vector databases. There is a wide variety of vector databases out there. We’ll need a good place to store LLM logs/prompts and retrieve data to add to prompts. I think the data source technologies will be varied, so you’ll need RDBMS, NoSQL, and vector databases to get the right data from the right places. I suggest you start learning about vector databases, their usage, and potential vendors.
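For readers new to the idea, the core operation of a vector database can be sketched in a few lines: store text alongside a vector, then return the stored text whose vector is most similar to a query vector. The toy below is not how any particular vendor works. The ToyVectorStore class is hypothetical, and the three-number vectors are hand-written stand-ins for embeddings that a real system would get from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """In-memory stand-in for a vector database (brute-force search)."""

    def __init__(self):
        self._items = []  # list of (text, vector) pairs

    def add(self, text, vector):
        self._items.append((text, vector))

    def search(self, query_vector, k=1):
        """Return the k stored texts most similar to query_vector."""
        ranked = sorted(
            self._items,
            key=lambda item: cosine_similarity(query_vector, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore()
# Hand-made vectors; a real pipeline would compute these with an embedding model.
store.add("Orders table is partitioned by order_date.", [0.9, 0.1, 0.0])
store.add("Customer PII lives in the crm schema.", [0.1, 0.9, 0.1])
store.add("Clickstream events land in object storage hourly.", [0.0, 0.2, 0.9])

# Pretend this is the embedding of the user's question.
query_vector = [0.8, 0.2, 0.1]
context = store.search(query_vector, k=1)

# Retrieved text is added to the prompt before it goes to the LLM.
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: How is the orders table partitioned?"
print(prompt)
```

Real vector databases replace the brute-force sort with approximate nearest-neighbor indexes so that search stays fast over millions of vectors, which is a large part of what separates the vendors.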

Overall, LLMs are here to stay, and this will change data engineering. I suggest you start learning about them now.
