Creating a Data Engineering Culture

Jesse Anderson
November 7, 2018
Blog, Business, Data Engineering, Data Engineering is hard

Blog Summary:

Data engineering culture is often implicit and assumed in organizations.
Creating a data engineering culture involves recognizing the value and importance of data engineering at all levels of the organization.
The right ratio is generally two to five data engineers for every data scientist.
Failure in big data projects often occurs due to management failures in creating the right team.
Creating a data engineering culture can prevent teams from underperforming and hitting a wall in difficult data engineering problems.

At DataEngConf Barcelona, I premiered a new talk about the importance of creating a data engineering culture. I share what a data engineering culture is and what management needs to do to be successful with Big Data. You can download the slides from the talk here and the full video transcription is below.

Here is the video from the conference.

Full Video Transcription

So, let’s talk about data engineering culture. One of the things I want you to understand, as you’ve sat through some of these talks, is that their data engineering culture is oftentimes just implicit. It’s assumed. They don’t really say, hey, my data engineer did this for me; they just say, yes, this is what engineers do. And so, when I talk to new companies, companies that are just starting out in their big data journey or something like that, they don’t realize that they have this implicit assumption of data engineers doing all this. So, I’ve kind of set out to say, here’s how we create this data engineering culture so that we start doing that and we start working with that.

So, what we’re going to do is talk about what data engineering cultures are. I’ll tell you a few stories so that you can understand what’s happening there, and then we’re going to talk about how you should create your own data engineering culture and, finally, what some common reasons for failure are. This is actually important, because there’s a reason we want to talk about failure with big data, and that is that 85% of big data projects fail. I’m going to take a wild guess from looking at your faces that you didn’t know that, but here’s the issue: it is not talked about, quite frankly. The normal thing people say when a big data project fails is, well, the tech didn’t work. Well, here’s the thing. The tech actually works really well. Sometimes there are known limitations, there are known issues with the tech, but when I talk to the teams and I talk to the people, they didn’t hit any of them, and yet they blame the technology. So, what is the issue? Why are these projects failing? That was something I set out on personally, this is my personal quixotic adventure, to go through and figure out, alright, why are these teams failing so often? And let me tell you the opposite side. You’ve sat through this conference and you’ve seen a lot of people do some really cool things, and I completely agree. Those fifteen percent of projects that actually get through this, they do some really, really cool things. They can create some incredible value, but there’s this whole issue of 85% of these projects failing. So, that should really give you some pause. I can’t remember if gambling is legal here, but where I’m from, gambling is legal. You have better odds of putting it all on black, playing some blackjack, spinning the roulette wheel, and getting better ROI in some cases. So, what I’m going to try to do, in a pretty brief period, is share the things that you need to know so that you don’t fail.

So, let’s talk about what the data engineering culture is. This is a culture where the value and importance, this is key, value and importance of data engineering is recognized at all levels, and “all levels” is another important part. All levels means from the executive level all the way down to the individual contributor level. If this is not recognized, for example, if you have a VP of Engineering, a VP or CXO, something like that, a CTO who doesn’t recognize that value, then when the axe comes or when they start to fire people they’ll say, the engineering team, don’t need it, gone. This is part of that value, the recognition of value. So, it has to be an organization-wide realization that data science and big data require data engineering. At some point, I’m going to create this visualization. Have you ever heard of Atlas holding up the world? Or the sky, depending on which myth you read? So, Atlas was there holding up the sky or the world, and data engineering is that Atlas holding up the data science that is the world. Most teams, most executives are focused on that world. They’re focused on, how do I get that data science, how do I get this output, when the reality is they should be focused on Atlas and saying, how do I get Atlas? Then he’ll hold up the world, I’ll get those data scientists, and those data scientists will be able to do this.

One of the manifestations of whether this is correct or not is the right ratio of data scientists to data engineers. So, one wrong manifestation is, basically, zero or one to one ratio. That’s usually an incorrect ratio because that means that your data scientists are doing way more data engineering than they should be or if it’s a one to zero ratio, in other words there is one data scientist and zero data engineers, there’s a big problem and we’ll talk about some of those problems. Generally, you want to be in the two to five ratio. So, for every data scientist you would have two to five data engineers.

So, let’s take a look at another visualization of this. This is what I like to share with management, and I forgot to ask, how many of you are managers in this room, out of curiosity? Oh wow, a decent number of you. How many of you are team leads then? Decent number of you as well. Okay, good. So here is a breakdown of where I see failures happen. Going from left to right, you see team creation. The vast majority of failures are going to happen before that first release. So, that first release there in the middle, that’s when we’ve said, hey, this team is probably done right, this team is probably correctly assembled, correctly staffed, etc., etc. But up to that point, a failure means that management failed to create the right team. So, I can’t stress this enough. Early production, those 85% of teams with failing projects, that was specifically due to management failing to get the right team, and we’ll talk about what the right team is in a second. This is what I share to help people understand that it’s only after you’ve released your first code, your second release, your third release, your nth release, that the management issues go away, and now you have a team where you say, okay, I’ve created the right team, let me back off and let them be successful. Up to that point, it is up to management to make sure that they have the right team and the right people.

So, why would you want to go through that effort of creating a data engineering culture? This is because teams may not actually fail. They may simply vastly underperform. In fact, some of the clients I’ve worked with were vastly underperforming while in their minds they were thinking, hey, we’re doing really cool things with data science. The reality was that they were riding their trike, as it were, their tricycle, the three wheels. You know, they were there on that little trike, they were going very slowly, but they were moving, when they could have been in that Ferrari. But they were limited. They were limited by their ability to create the data engineering needed to create the data science. So what was happening at some of those companies is they would hit the wall, they would hit this wall of, this is a difficult data engineering problem, and they would just stop, and that’s the issue. That’s what underperforming will mean: they’ll get to the difficult part of data engineering and stop, so they can only get somewhat there. So, conferences. As you’ve been to conferences, you’ll notice they don’t really talk about their data engineering team. They don’t have credits at the end of their conference talk saying, here’s my data engineer, and here’s the data engineering manager. They don’t call it out, but sometimes they’ll mention it as an aside and say, I was helped out by the little people in the data engineering community. So, it’s definitely something that we need to have, but it’s kind of implicit, it’s kind of assumed.

So, let’s talk about some brief stories here. One of my clients was actually hitting an issue where they were stopping. So, I went through, I interviewed a bunch of their data scientists and started to ask questions. So, what happens, what happens, what happens, tell me more about what you’re trying to do, and one of the answers was very interesting. The business side thought that they were really doing well, and they weren’t talking to the data scientists. When I talked to the data scientists they said, oh yeah, we’d get this really cool thing that we were trying to do, but we would hit this wall of what we couldn’t do. We couldn’t do the data engineering anymore. The system just became too complex. So, let’s unpack that a little bit. Too complex for a data scientist: quite honestly, most data scientists are noob programmers. Don’t take this personally, most of you are frankly beginner programmers. So, you want to compare that with a data engineer, who should be a senior-level programmer, maybe junior, but mostly a senior-level software engineer. This is right down the line, this is exactly what they’re doing, this is what they do on a daily basis. It is well within their ability to do those things, and that’s what’s really key and important here. Data scientists are not programmers. They program, they’ve learned to program after a fashion, they can create these data engineering things after a fashion. They are not data engineers. Conversely, you’re not going to expect a data engineer to create a machine learning model. They’re two different people, and this is really key, one of the big thrusts of the talk I want you to understand. So, when data scientists do data engineering work, it isn’t just, it’s going to take me longer. It’s going to take significantly longer and, what’s worse, they’re going to quit. If you didn’t know this already, they will quit.
If you’re losing data scientists right now, you may want to go back and see whether they are leaving because they’re tired of doing data engineering work. So, that’s what we’re talking about: the personas, what people are doing. A data engineer loves to do this sort of thing, this is what they’ve been doing throughout their career. If you start forcing data scientists to do this, they will quit, and they quit after one to six months. In the US I actually have these conversations with data scientists, and they say, I’m tired of doing this, I give it one or two more months and I’m gone. This may be different for Europe, but this is exactly what may be happening; you just may have a longer lead time. So, why should you do this? It’s because once you have your data science and your data engineering right, you’re going to really accelerate things. This is what I’ve helped my clients do. Now they don’t really think about, can my data engineering do this, can I really do this. It’s accelerated to the point where they’re really working in unison, and now their data engineering team, their data infrastructure, is no longer the bottleneck. What the bottleneck means is, often whenever the data scientists thought about doing something, maybe those of you who are data scientists in the room, you’re sitting there thinking, I’d love to do x, but then in your mind you’re thinking, but that will be difficult, and it’s not the machine learning, it’s not the neural network that’s making it difficult. It’s thinking, well, how do I even do that in Spark, and how do I get that data right, and how do I do this? That’s what we as data engineers should be removing. This is the symbiotic relationship between data scientists and data engineers.

So, how would we go about creating this data engineering culture? So, data engineer. I’m not sure how many of you read O’Reilly’s data blog. I write for my blog, and I also write a lot for O’Reilly’s data blog, and in there I’ve set out to create definitions, and these definitions aren’t to put people in boxes. It’s more to give a definition for enterprises, for larger organizations, to say, here is what a data engineer is, here is what a data scientist is, so they can understand, this is a data scientist, I’m not going to put them in front of the keyboard to start programming. So, my one-sentence definition of a data engineer is that a data engineer is someone who has specialized their skills in creating software, they’re a programmer, around big data. Basically, they’ve focused in on big data for their software skills. Data scientists, they’re doing applied mathematics, and then there’s another one, DBAs. I group a lot of titles that specifically use SQL together under the term DBA, so those could be ETL developers, those could be SQL developers; you may have heard several different titles. I group them all together, and there’s an importance to this, because sometimes people, or managers, will think the data warehousing team is your data engineering team. That is wrong. They are two very different skill sets, and let’s unpack that a little bit. A DBA does not know how to program. They can only create SQL. That’s important because you cannot create, in my opinion, data pipelines solely with SQL. You know, count yourself lucky if you can. I have not met very many people doing that. It is very, very rare that people can do that.

So, you heard that they were talking about Flink on SQL yesterday and you know that there’s other ones. The key here is that a data engineer will be able to switch between them. They will know that something is better with SQL versus something is better written in code and they can choose between the two. If they only know SQL, they will use the worst tool for the job when something else is better. Very, very important there.

So, what sort of skills are needed? This is just kind of a laundry list. I’ve written a book about this, and you’ll see a link in a second, but on every single data engineering team there should be a person who understands distributed systems, a person who understands programming, and analysis. Analysis, this is not the level that a data scientist needs; this is analysis at the level of, perhaps, doing counts, doing some pretty rudimentary analysis usually. It’s so that they can keep track of what’s happening and keep an understanding of the data there. Visual communication, they need to be able to visually communicate what the data is saying. They also need verbal communication. This one’s important because I need to work with you as a data scientist. So, I see a lot of data scientists in the room and what’s going to happen is… Remember that symbiotic relationship that I was talking about? If the two can’t become symbiotic because they can’t communicate, that’s a problem. So, your data engineers and your data scientists will need to have very good communication skills between the two of them. Until you do that, you will have that disconnect, and that disconnect won’t ever go away, and you’ll think, why? It’s because you don’t have good verbal communication between them. There are a few other ones that are actually kind of weird here. Project veteran. You heard me talk about how data scientists are new programmers. Well, a lot of people are new within data engineering. They don’t have any specific production experience in getting these systems out. So, what’s key and important there is that you need somebody on your team to call out and say, hey, you are going to hit a wall at this, and hey, you’re going to hit this a year from now. Only a veteran on your team can tell you that.
Some of the worst designs I’ve seen are from very new people to big data and they will cause problems and these problems aren’t just, okay team we’re going to have to spend a week and pay down some technical debt. I’ve worked with teams where it’s been a year, they have dug such a hole for themselves that they had to spend an entire year to get themselves out of it, so significantly better to avoid that sort of issue there.

Finally Schema. We’re going to talk about this in a second. Schema is very, very important. You need to lay your data out correctly in a Schema evolutionary way. We’ll talk about this in a bit of a second. We also need domain knowledge. Some of your companies it may not be as important but if you’re in something like finance let’s say, it’s very, very key and important to understand that domain and until your data engineering team understands the domain, they won’t be able to create the data manufacture correctly.

So, now let’s talk about some common reasons for failure, and I’ll have to go through these relatively quickly. One of the biggest reasons for failure, and I apologize if any of you here or watching this are a DBA yourself, or identify as a DBA, but if you have a team that is all DBAs and you’re calling yourself a data engineering team and doing big data, you’re unfortunately incorrect. The issue there is that you more than likely will fail with big data, and I have the data and experience to back this up now: if you only have SQL behind you and only a SQL ability, you will fail in your task of trying to do this. So, a few things. There are links there, there are minified links there. One of those is about an ability gap. The issue there is that, in my work and research with teams, it is not simply a matter of time or effort for a DBA to learn big data and to create these data engineering systems. It is a nigh-on impossible sort of thing. It is an ability gap, it is not a skills gap. So, please do think of that. The data warehouse team is not your big data team. They need to be software engineers.

Another common one is that they’re set up for failure. What happens there is that the company will be circling the drain. What that means is, they’re about to go under, they’re about to go bankrupt, and they’ll say, hey, I heard about that big data thing, it’s going to save the company. I’ve had that before. I’ve had ones where the VP, the CXO, the CTO says, oh, big data, that’s just going to magically make us more revenue. Well, maybe, but having these unrealistic expectations and desires sets a team up for failure. Also, unrealistic time frames are incredibly damaging. One thing to know is that it takes up to, and possibly even more than, six months for a team to just feel comfortable with big data. This is actually key and important. If some of you are embarking on a big data project and you’re saying, we’re going to be 100% efficient and proficient from day one, I’m sorry, you’re not going to be. I’ve taught way too many teams, and you may think, but I’m smart. I’ve taught a lot of smart people. This isn’t simply an issue of being smart, it is the sheer complexity. So, in that link up there, that complexity, I wrote a post for O’Reilly called “On Complexity and Big Data”. In it, I argued that big data is ten times more complex than small data. Please do understand this and internalize that.

Next one is, no one understands Schema. So, you remember how I was talking about all DBAs being a recipe for failure? Conversely, having no DBAs or no one who understands Schema is another type of failure, and it’s a failure you’re not going to hit from day one. It’s going to be a failure you hit later on in that cycle, because the person who understands Schema on a team is usually not one of the software engineers. When I teach and I work with a team, I actually have a very consistent question that I ask them, and generally it is a question that is only answered by DBAs. The thing about that is, the DBAs have spent their entire careers being the person who is there for Schema. They know Schema, they know what things should look like, and they know how things should be laid out. So, in a very clear sense, it’s important to have somebody who understands Schema on the team, because they’re going to fight for certain things that a software engineer won’t and, for that matter, a data scientist won’t. So, it is actually kind of a weird thing. You need, in my opinion, at least one person who understands Schema. Now, that may be a software engineer, may, but in my experience it’s mostly a DBA. So, one thing I want to be clear on is that a data engineering team is actually multidisciplinary. It is not just a group of people with the title of data engineer. It may be a group of mostly data engineers, but also maybe a DBA or two and maybe some front-end or visualization developer.

So, veterans. This is super important: you need to have a veteran. I’ve taught at a lot of companies, and when I come in, sometimes those companies will be trying to leverage a bunch of junior engineers. They couldn’t find senior people, so what they’ll get is a gaggle or bunch of junior engineers, and they’ll put them all on that team. Now, the issue with junior engineers is the naivete. They are going to do something stupid, and something stupid in big data doesn’t mean it performs poorly, it means that it performs terribly and you’re going to have to spend months digging yourself out. So, I’ve worked with teams where one of the junior engineers or mid-level engineers will come up to me and say, hey, what do you think of this design, and within ten minutes I can save them a month. It’s that important. The designs can be that bad. So, it’s very, very key, very, very important that you get some kind of veteran skill on your team.

And finally, too ambitious. How many of you have a project where you’re going from zero to big data? The issue there is that you can’t really do that. I’ve talked with teams where they say, here’s my proposed architecture, and what they did is they took some conference talk of somebody who said, here’s our architecture, and they said, alright everybody, this is our architecture, this is what we’re going to do. The issue there is that they’re not telling you some things from the stage. Like, you didn’t raise your hand and say, how many iterations did it take you to get to that architecture? How many years did it take you to get to what you’re showing up on the screen? Conference talks do not show that, they don’t talk about the grueling path that they took to get there, and so what will happen is middle managers, sometimes architects, will take that and say, boom, I can go from zero to this. They did it, right? It is possible. So, the real key issue there for you as you come through and look at these talks is that the architecture that they showed you wasn’t version one, it’s version three, it’s version five, depending on how old the company is. They have actually iterated on that architecture several different times. So, you can cut that but you can’t go from nothing to that. This is how I see teams really fail: they say, here’s our architecture, and they go through the company and they show it to everybody, the VPs, and everybody buys in, and then a month later, two months later, six months later, the VP calls and says, hey, I want a demo, and they say, I can’t demo this, the demo is another two years out, look, you signed off on the Gantt chart, and he says, no, I want to see what you’re doing. So, when I work with teams I recommend that they do this in much smaller chunks. I believe that there’s a velocity to a data engineering team. I believe that a team has to gain velocity as they do it in small chunks.
So, you don’t do it all at once, here’s an entire data platform all ready to go. It’s piecing off and creating that data platform bit by bit, the way the teams in those conference talks actually got there. So, very, very key.

So, if this sounds like your team, you will need to take an honest look at your team. This is actually important. Honest looks are really difficult to do. This is something I can do as an outsider. I can look at a team and say, boom, boom, boom, you need to do this, you need to do that. Internally, sometimes that’s difficult, because maybe they’re your friends, you’ve known them, or maybe there’s some other reason, some kind of political reason. When I come in, I’m able to say, hey, you need to make these changes, and if you make these changes, you’ll be so much better off. You need to check and make sure your teams don’t have a skill gap or, worse yet, an ability gap. The issue with ability gaps is that usually they are not known. They’re lurking underneath the covers, but you do need to know, and you need to be watching for and actually dealing with these ability gaps.

So, this is the link to my book. I wrote a book about data engineering teams. That’s a minified URL to get there. It will explain some of these things I talked about in this talk about what should a data engineering team look like, how should you actually go through, how should a data engineering team actually work with a data science team. These are the sorts of things you can do and it has to be multidisciplinary.

Do make sure that you get help. This is actually an important thing, and I think there’s a big difference here between Europe and the US, because I deal a lot with both. I deal with companies around the world. What I’ve found is that in the US, showing any sort of need for help means that you’re weak and that you don’t know what you’re doing. I’ve found that Europe is a bit more open to reaching out, that they do realize their limits and that they will ask for help. So, what I would highly suggest is that if any of this rings true, if any of this does sound like your team, I would really strongly recommend you get help early. The reason for that is the ROI, return on investment. If you do this wrong, the wrongness doesn’t cost you a week or two like it does with small data. Doing this wrong will cost you six months, a year, and when you have that sort of cost, sometimes that will cost people their jobs, and that’s really something I want people to avoid, and that’s kind of why I set out personally on this journey. I set out on this personal journey of educating people on this because I saw these failures and I’m tired of seeing failures. I am sharing with you all the research I’ve done, because I really want to see you avoid this. Please do avail yourself of these things I’ve created so that you can avoid… I would love for every single one of you to be in that 15% of successes. Let’s push up that number of successes, because that’s pretty bad. I know all of us in this room, we don’t want to see an 85% failure rate, because that means that our industry isn’t going to grow, that those companies are going to can or fire that team, or they’re not going to invest in this because they’re not seeing value, and that’s a key issue that I’ve been dealing with personally.

So, when should you fix these problems? If this sounds like your team, if you’re a train heading towards that brick wall, the cheapest time to figure out if you’re doing something wrong, it’s up on a whiteboard before you’ve done that, but some of you may have already white boarded, some of you may have already put in months into the code. If you are headed towards that brick wall, it’s never too late to fix, it just costs more and more the further you go down that track because then you’re going to have to really bring it back, you’ll have to bring back even more and fix even more, there’s even higher technical debt there.

So, with those happy thoughts, let’s open it up for questions. I know you have some questions. I saw some shaking heads. Go ahead. You’ve got the mic right there. You’ve got the microphone?

Audience Question 1: At some point you mentioned that there was some skills that teams need. Could we consider having those skills in different people or is there a subset of skills that you know everyone should have as an engineer?

Jesse: That’s a really good question. In my book, I actually go through what’s called a gap analysis. Those of you who are data scientists are nodding your heads, oh yeah, gap analysis. Those of us on the data engineering side, we’re like, gap analysis, what’s that? And it’s basically what his question is about. No single person will have all of these skills. Let me backtrack to that slide just in case you don’t remember that long list. No one person will probably have all those skills. You’ve heard of unicorns? In the Bay Area, in Silicon Valley, we call them flying unicorns, where they’re even more difficult to find, the people who have all those skills. So, the direct answer to his question is, all of these have to appear in a team, not necessarily in a single individual, and so what I walk you through in the book is how to do a skills gap analysis: listing out your people and seeing what skills are missing. For certain skills it’s, hey, we can get by on that, and there may be other skills where it’s, stop, we need to stop right now, we need to find that skill. Two of those on a data engineering team are the top two there. If you don’t have distributed systems and you don’t have programming, that is the time to really evaluate and stop. That is a hard stop, end of story, I have the research on this. Other skills there are going to hit you later on. So, for example, that lack of a veteran. That’s really going to hit you not on that first release; it’s actually going to hit you on that second, that third release, because now you’re going to be iterating on something that’s already in production, and that veteran has to be there to stop you from doing something stupid. Answer your question? Very good. Very good question. Thank you.
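The skills gap analysis described in this answer can be sketched in a few lines of code. This is only an illustration, not the method from the book: the skill list is taken from the skills named earlier in the talk, while the roster, the names in it, and the function name `skills_gap` are made up for the example.

```python
# Skills named in the talk that should appear somewhere on the team.
REQUIRED_SKILLS = [
    "distributed systems",
    "programming",
    "analysis",
    "visual communication",
    "verbal communication",
    "project veteran",
    "schema",
    "domain knowledge",
]

# Per the talk, missing either of these two is a hard stop.
HARD_STOP_SKILLS = {"distributed systems", "programming"}


def skills_gap(team):
    """Return (missing, hard_stops): skills that nobody on the team covers.

    `team` maps a person to the set of skills that person has.
    """
    covered = set()
    for skills in team.values():
        covered.update(skills)
    missing = [s for s in REQUIRED_SKILLS if s not in covered]
    hard_stops = [s for s in missing if s in HARD_STOP_SKILLS]
    return missing, hard_stops


# Hypothetical roster, invented for illustration only.
team = {
    "engineer_a": {"programming", "analysis", "verbal communication"},
    "engineer_b": {"programming", "visual communication"},
    "dba_c": {"schema", "domain knowledge", "analysis"},
}

missing, hard_stops = skills_gap(team)
print("Missing skills:", missing)
print("Hard stops:", hard_stops)
```

The check is made against the team as a whole rather than against any individual, which matches the point that the skills have to appear in the team, not in one person.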

Audience Question 2: Hi. I have a question here.

Jesse: Yup.

Audience Question 2: I’m the CTO for a start-up that recently moved from doing small data into big data, and I’m currently suffering the consequences of not having the data engineers, exactly as you said. I could really identify. My question is about the DBA and the Schema, because I think it’s kind of a blind spot. Currently, my team is basically data engineers; we already hired a couple and they act as DBAs.

Jesse: So, the people that you’ve hired are DBAs or are not?

Audience Question 2: No, they’re data engineers that are acting as DBAs so far.

Jesse: Okay.

Audience Question 2: My question is, could you expand on what exactly you mean by Schema. I mean, I know what data Schema is but I don’t know if I exactly understood what you mean by Schema and could you also talk about the work flow around Schema, like who should find the Schema and who should maintain it and what happens when it changes, etcetera.

Jesse: Another good question. I just want to unpack one thing that you said, and thank you for mentioning that. He was saying, I’m moving from small data to big data and I’m hitting these issues that I’ve talked about. This would have saved you a decent amount of time, I’m sure. So, good. Thank you for calling that out. So, let’s talk about Schema. Maybe you heard the question I asked (Wes). That question I asked (Wes) about Apache Arrow was about Schema evolution, and that is a question that I would expect from the person who is handling the Schema part of your team; that’s the question that they would be asking. So, as you self-evaluate, were you asking that question or weren’t you? It’s not something bad about you, it just tells you whether you have that Schema skill or not, because you’re thinking about, okay, you’ve laid that data down, is that data going to have Schema evolution, and that’s a really key, important thing. Then, you heard the back and forth between Wes and me, where he was saying, this is all intermediate data, it’s temporary data, this isn’t long-term storage. So, that’s what that Schema person is. They’re thinking about, I’m going to lay down a petabyte of data and I can’t go back through and rewrite that petabyte of data every single time we make a Schema change. We need something like Apache Avro to handle that. So, there was the second part of your question: how is this sort of Schema evolution handled? This sort of Schema evolution is oftentimes handled as a business process. So, oftentimes it’s that person kind of being the Schema Nazi, as it were. They are there to make sure that the developers, software engineers… I’m a software engineer, we do stupid stuff sometimes, and they’re there to make sure that you don’t do something stupid with your Schema, that you don’t take a floating point and make it into an integer, for example.
That’s something that the Schema evolution can’t handle or better yet, they would have actually prevented you from doing that in the first place. They would have said, okay this is what the sort of data that you’re going to be laying down, I’m not going to allow you to do that. Kind of what your DBAs doing now, you know how you go to them know and you say, I want to make this Schema change to this table and they may fight you and they may say, no you should do that and there’s that back and forth. That’s kind of the back and forth you should be getting at your startup or frankly any company that kind of negotiation, that push back to say, are you using the right types, are you doing this for the right reasons, are you doing this, this and this. The other thing that that Schema person needs to know is the actual byte level representation of this and that’s because as we creep more and more complicated data pipelines, those data pipelines aren’t just going to be a bunch of files on disks, they’re actually going to be real time, they’re going to be here’s the real time representation of this data and Kafka for example, moving, moving, moving real time and then we go into our HDFS for long term storage, are we going to S3 for long term storage, we need to have that same Schema throughout and we need to do that. Another common thing here is the UniTest. You’ll want to make sure you have UniTest coverage of the full integration of your Schema, so starting from the very first supported version all the way to the current version. Can we do Schema evolution backwards and forwards? So, hopefully enough of something to give you a good handle. I saw another hand over here somewhere. Microphones behind you.
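As a rough illustration of the unit-test idea above, here is a minimal sketch in plain Python that checks every pair of schema versions in both directions. It is not the Avro library’s API: the simplified schema layout and the helper names `can_read` and `check_full_history` are assumptions made up for this example. The promotion table follows the type-promotion rules in the Avro specification, which is why a float-to-integer change is rejected.

```python
# Promotions a reader may apply to data written with an older type,
# following the type-promotion rules in the Avro specification.
PROMOTIONS = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
    "string": {"bytes"},
    "bytes": {"string"},
}


def can_read(writer, reader):
    """Return True if data written with `writer` can be read using `reader`.

    Schemas are simplified (hypothetical layout) to
    {field_name: {"type": ..., "default": ...}}.
    """
    for name, r_field in reader.items():
        w_field = writer.get(name)
        if w_field is None:
            # Field is new in the reader: old data lacks it, so a default is required.
            if "default" not in r_field:
                return False
            continue
        w_type, r_type = w_field["type"], r_field["type"]
        if w_type != r_type and r_type not in PROMOTIONS.get(w_type, set()):
            return False
    return True


def check_full_history(versions):
    """Check every ordered pair of versions; return (writer, reader) indexes that fail."""
    failures = []
    for i, writer in enumerate(versions):
        for j, reader in enumerate(versions):
            if i != j and not can_read(writer, reader):
                failures.append((i, j))
    return failures


# Hypothetical schema history, from the first supported version to the current one.
V1 = {"user_id": {"type": "long"}, "amount": {"type": "float"}}
V2 = {
    "user_id": {"type": "long"},
    "amount": {"type": "float"},
    "currency": {"type": "string", "default": "USD"},
}
# A proposed change that turns a floating point field into an integer.
BAD_V3 = dict(V2, amount={"type": "int"})

print(check_full_history([V1, V2]))          # → []
print(check_full_history([V1, V2, BAD_V3]))  # → [(0, 2), (1, 2)]
```

A check like this could run in the test suite on every proposed schema change, so the pushback described above happens before a petabyte of data has been laid down.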

Audience Question 3: It was about schema, so.

Jesse: Okay, good. Alright, looks... Got a question over here. Stage left.

Audience Question 4: Hi. So, you’ve covered a large portion of, like, building pipelines. I’m wondering about running workloads and the role of DevOps in all this. Do you envision, like, a DevOps part of the platform team to help them run their pipelines, or is there a separate organization that takes care of that part? Can you say a bit about it?

Jesse: Another really good question. Kind of to repeat his question: is data engineering DevOps? And this is another very common question. In my opinion, data engineering is separate from your DevOps team. That actually goes to another question I asked this morning, where the person was talking about how they were spinning up pipelines and I asked them, are you a DevOps team? It was to clarify some of the research I’m doing there. I just wrote another post for O’Reilly talking about this. So, in my opinion, we need a DataOps team, and a DataOps team would be separate from the data engineering team, because data engineers are programmers and programmers are not people that I want to put on production systems. I do not want to have my software engineers maintaining these production systems, because software engineers are like a bull in a china shop. They will break things. We’re used to, I’m going to play in my local database and I’m just going to blow it away, and everyone else is like, no, no, no, stop, because that’s the production database. This is not what a software engineer is usually good at.

So yes, you might call it Data DevOps, you might call it DataOps. I think the one big difference between DataOps and DevOps is an understanding of data. The operations team needs to know about the processes, they need to know about the issues and how to stand up a cluster, that sort of thing, but the reason I think DataOps should be separate is that now there are issues of data. So, is the problem due to a process failing, is the problem due to a disk failing, or is the problem due to bad data? The DataOps team will ideally be able to identify that, because it’s key. Otherwise, your data engineering team will constantly be getting pinged for things that aren’t really their problem, and it will just drag their productivity down too much. So, we need to have this DataOps team that kind of says, oh yes, this is a data issue, let me handle this. And the post is up on O’Reilly. It was a sponsored post, but it’s up there now. He’s going for round two.

Audience Question 5: Yeah. Hi. So, everybody’s data driven now. So, we have data teams everywhere doing everything, and sometimes the team is a bit far from the main company business. So, developers feel a bit frustrated about what they’re doing and why they’re doing it. So, what do you suggest doing in this case, how to boost motivation or something like that?

Jesse: So, let me restate your question, you tell me if this is the right question. You’re kind of asking me should a data engineering team be located with the business unit or as a separate team. Is that about your question?

Audience Question 5: Yeah, I mean, we apply data techniques on every side of the company for everything now.

Jesse: So, depending on the size of the company, I would say there are a couple of different routes I’ve worked with teams on. One is to have a centralized data engineering team that’s more consultative. I talk about that in my book: whenever a team wants to deal with something data related, they’ll come in and they’ll say, hey, I have this project, and the data engineering team will actually act kind of like a consulting arm and help them create that. There’s another part to what you were mentioning, of the teams being too isolated, too all over the place. There it becomes a hub and spoke model, where there’s a data engineer or data engineers located in the business unit to understand that domain, and so we have that domain knowledge. Maybe what you’re seeing is a manifestation of a lack of domain knowledge, or perhaps even the lack of an interesting domain. Let’s just kind of put it out there: sometimes these jobs are boring. Hey, let me ETL that for you. I’d rather have something more interesting than that. So, perhaps locate them within the business unit, so that they then have a strong relationship back to that hub of the data engineering team. That’s another route that I’ve worked with teams on. If you want to follow up on that, talk with me at the office hours. Do we have time for one more, or?

Host: Yeah, one more. If I could ask it and the rest we can take to office hours.

Jesse: No, we can’t let you ask that.

Host: What do you think? Should I ask?

Jesse: Go for it.

Audience Question 6: So, with the proliferation of SaaS tools and hosted data infrastructure, right? I think of something like Google Cloud Dataflow, which sort of simplifies a lot of things and abstracts away a lot of the previously necessary infrastructure. Do you think that’s changing the skills and the culture that are needed inside certain teams, where they don’t have to have such a depth of data engineering experience? Do you see a trend there?

Jesse: I don’t see this going down, and this is actually an interesting question. I’ll answer it with my general theory and then I’ll talk a little bit about it. I don’t believe that a general purpose big data system can be made simple or easy. It can be made easier, but it cannot be made simple. I think that’s only possible for specific use cases in very specific industries, where you can have a purpose-built system for this one thing. That can be easier. But when you’re dealing with these levels of complexity, I don’t think it’s bringing the bar down. I think it may be, just a little bit, but it’s not an appreciable amount where we can say, hey, I can hire Johnny off the street, a front end developer, and let’s get him up on this big data stuff. I guess the ideal I’d like to see is an eventuality where we don’t differentiate between big data and small data, kind of what Wes was saying. We have pandas for small data and we have this other thing for big data, we have PySpark. I’d like to see that, but I don’t think that the bar will be lowered. Put a different way, put more of a business way, that’s probably the reason why I started my business specifically in big data: I see the barrier to entry being pretty high. It’s pretty difficult to get to this level, and when people try to get to that level and aren’t at that level, it’s very, very apparent and you see it all over the place.
