Getting Your Programming Skills Ready for Data Engineering

Note: This post was guest written by John Desmond.

My preparation for the course began before I knew about the course, and before I realized that I wanted to specialize in data engineering. When I decided I wanted to learn programming, I hadn’t quite decided whether I wanted to be a programmer or a data scientist. My initial introduction to programming was through the book An Introduction to Statistical Learning with Applications in R. I read through the first few chapters and completed the exercises at the end of each chapter. This gave me an introduction to the ways in which programming can be used in a statistical context. Learning from this book allowed me to leave open the option of becoming either a programmer or a data scientist, as it taught skills that are useful in both disciplines.

After I told a friend who was already working as a data scientist that I was interested in programming, he recommended Learn Python the Hard Way by Zed Shaw. I bought the PDF version of the book for approximately thirty dollars and started working on the exercises. I was really happy with the book, and I highly recommend it to beginner programmers. It starts with the basics of printing and data types, moves on to user input, data structures, and control flow, and finishes with object-oriented programming (OOP), virtual machines, unit testing, and porting a game you design to a web application using HTML5, CSS, and Flask (a micro web framework). While completing the book, I took the concepts I learned and applied them to basic projects: a flash card app, a game that lets the player practice mental math in the context of playing darts, and a basic website for my band.

I still wasn’t entirely sure if I would go the data science or programming route, so I completed Dan Becker’s Machine Learning Tutorial and applied what I had learned from the tutorial to Kaggle’s Titanic Data Challenge, which is essentially the hello world for Kaggle Competitions. While completing the tutorial I realized that a lot of my time went into preparing the data to be used for analysis by the given algorithms. At this point I was beginning to realize that I liked the programming and data management aspect of data science more than the analytical facet.

While completing these projects and tutorials, I was working as a data analyst. In that role I was introduced to relational databases and given time to teach myself SQL to create reports for my team. One resource that helped me practice SQL queries was https://sqlzoo.net/. It starts with basic single-column queries and works up to more advanced topics like subqueries and recursive queries. Some of the tasks at my job were rather repetitive, so I decided to try automating portions of a Standard Operating Procedure that my team had to perform daily. A friend recommended using PowerShell scripts to do so. I took his advice and spent a few days learning the syntax. I then found a script online and modified it so that it automatically saved particular emails from my inbox to specific folders on my team’s shared drive.

As I continued to study in my free time, it became apparent that there were many paths within computer science/software engineering in which one could specialize. After watching a conversation between Sylvester Morgan and John Sonmez in which they discussed the importance of specializing, I found this post by John that lays out some of the potential paths one can follow within software development. Although data engineering isn’t one of the listed specializations, I realized that, given my background with databases, Python, and some knowledge of data science, specializing as a data engineer was something worth striving for.

At this point I started to google how to become a data engineer and found some great material right off the bat, in particular Pranav Dar’s Comprehensive List of Resources to Become a Data Engineer. From there I read Part I and Part II of Robert Chang’s A Beginner’s Guide to Data Engineering, which gives a background on the theory relevant to data engineering and delves into how Robert used data engineering at Airbnb. I continued to search for advice on what to learn to become a data engineer, and this time I turned to Reddit. There I found a few suggestions to pick up the book Designing Data-Intensive Applications by Martin Kleppmann, as well as a few references to Jesse Anderson’s Professional Data Engineering course.

I looked up Jesse’s course, and found the webpage that provides a free copy of the Ultimate Guide to Switching Careers to Big Data. I read through the guide, and was surprised by the advice that I should find an expert to guide me through the career switch. Throughout the guide Jesse made it clear that intermediate level Java skills were necessary to succeed as a data engineer, and before reading the guide I had yet to learn any Java. I emailed Jesse and he told me that the Java skills mentioned in the guide were a prerequisite to taking his course.

I had just put in my two weeks’ notice at my job so that I could study full time, and I decided to try to teach myself the basics of Java as quickly as possible. I started with Bryson Payne’s Learn Java the Easy Way on Udemy, as I had access to the course through a package deal from a website called Infostack. The course was useful, as it guides you through downloading the necessary JDKs, Eclipse (an IDE), and Android Studio. I completed the first app on the command line and quickly realized that I would need to take on more advanced projects to get my skills where they needed to be before starting the Data Engineering course.

I looked up how to learn a programming language as quickly as possible, given that I already had an intermediate grasp of another one. The advice I heard from the Tech Lead (Patrick Shyu) and John Sonmez was to re-write projects you have already created in one language in the new language you want to learn. Their point was that this lets you take the abstract concepts you learned previously and implement them using the new language. I did this with the mental math game I mentioned earlier and an RSS feed reader I had also written in Python.

As I began these projects, I mentioned to some friends who already knew Java that I was starting to learn the language, and they recommended Maven for project dependency management and Spring Boot as a web framework. I utilized Maven for both of my projects and I used Spring Boot as the web framework for my second project. I highly recommend becoming familiar with both of these technologies when starting out with Java, as they greatly simplify the process of creating Java applications. To get started with Maven I used the Command Line approach given here: Maven in Five Minutes (spoiler alert, this may take you longer than 5 minutes). To get started with Spring and Maven I recommend this link and this guide as well.

For the RSS feed reader I set up the HTML using Thymeleaf, the database using JPA/Hibernate/MariaDB, the web application using Spring Boot and the Java code I used touched on interfaces, inheritance, and threading. Throughout the process of completing the two projects I used a lot of blog posts and occasionally YouTube tutorials to answer some of the questions I had regarding Java. For questions regarding the Spring framework I found myself frequently on the https://www.baeldung.com/ and https://springframework.guru/ websites. For YouTube Java tutorials I recommend the Java Brains channel. For the basics of the Java syntax and for the core abstractions behind the language I recommend the Java tutorials within the Java Documentation website.  

I sent Jesse my two projects as an example of what I had learned over the past two and a half weeks, and he recommended that I gain a deeper understanding of threading in Java before beginning the course. I had one asynchronous thread that was running in my Spring Boot application, but I hadn’t taken the time to explore threading in detail.

To finish my preparation for the course, I read through and played with the code found in this great tutorial series, Rajeev Singh’s Java Concurrency/Multithreading tutorial. The final two posts of the tutorial were especially useful, as they answered some of the questions that I had concerning how to refactor one’s code such that multiple threads aren’t modifying the same variable or resource simultaneously.

To practice threading, I decided to try my hand at the producer-consumer problem. I made an application that abstractly cooks and delivers burgers, then checks to see whether the burgers have been cooked/delivered. Although my application could have been more efficient, I was able to alter the shared variables within the application in a synchronized manner using two threads and five tasks. To ensure that my application was thread-safe, I made use of the BlockingQueue interface, the volatile keyword, AtomicBoolean variables for the burger states, and synchronized lists to store the orders, i.e. the burgers before they were cooked and the burgers to be delivered. To schedule the tasks I used the ScheduledExecutorService interface, which allowed me to create a thread pool and then schedule each of my tasks periodically. Something I would do differently, now that I have created other threaded programs, would be to make a method for each Runnable class and submit the Runnable within that method; this would make the main method of the App class much more readable. Throughout the process, I made continual reference to the Rajeev Singh tutorial I mentioned earlier.
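
The shape of that producer-consumer setup can be sketched roughly as follows. This is a minimal illustration in the spirit of the project, not John's actual code: the class names (`BurgerShop`, `Burger`), the timings, and the counters are my own.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class BurgerShop {
    // A burger whose cooked state is an atomic flag, so the producer
    // and consumer threads can flip and read it without locking.
    static class Burger {
        final int id;
        final AtomicBoolean cooked = new AtomicBoolean(false);
        Burger(int id) { this.id = id; }
    }

    // "Cooks" `count` burgers on one scheduled task and "delivers" them
    // on another; returns how many burgers were delivered.
    static int run(int count) throws InterruptedException {
        // Orders flow from producer to consumer through a BlockingQueue,
        // which handles the thread-safe hand-off for us.
        BlockingQueue<Burger> orders = new LinkedBlockingQueue<>();
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(2);
        AtomicInteger nextId = new AtomicInteger();
        AtomicInteger delivered = new AtomicInteger();

        // Producer task: cook a burger every 20 ms and put it on the queue.
        pool.scheduleAtFixedRate(() -> {
            if (nextId.get() < count) {
                Burger b = new Burger(nextId.getAndIncrement());
                b.cooked.set(true);
                orders.offer(b);
            }
        }, 0, 20, TimeUnit.MILLISECONDS);

        // Consumer task: take cooked burgers off the queue and deliver them.
        pool.scheduleAtFixedRate(() -> {
            Burger b = orders.poll();
            if (b != null && b.cooked.get()) {
                delivered.incrementAndGet();
            }
        }, 10, 20, TimeUnit.MILLISECONDS);

        // Wait until everything is delivered (with a timeout), then shut down.
        long deadline = System.currentTimeMillis() + 2000;
        while (delivered.get() < count && System.currentTimeMillis() < deadline) {
            Thread.sleep(10);
        }
        pool.shutdownNow();
        return delivered.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Delivered " + run(5) + " burgers");
    }
}
```

The key point is that no explicit `synchronized` block appears anywhere: the queue, the atomic flags, and the scheduled thread pool each take care of their own piece of the coordination.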

If you look back at my earlier commits for the project, you’ll see that I was using the synchronized keyword on the producer and consumer methods and on the AtomicBurger variable’s set methods, and I was making all of my shared variables volatile. Although the application ran that way, I decided to use atomic variables instead of the synchronized keyword, and to limit my use of the volatile keyword, in order to make the application more efficient. When I instantiated the majority of the objects I needed in the main thread without the volatile keyword, my application ran much faster; the reason is that volatile forces every read and write of a variable to go through main memory and restricts instruction reordering around it, which rules out optimizations the compiler could otherwise apply. While the synchronized keyword and atomic variables achieve the same end of ensuring that a shared resource is altered by one thread at a time, atomic variables do so using compare-and-swap operations, which in many cases are more efficient than acquiring a lock. Although the application’s implementation and design were deliberately simple, the project was still challenging to complete given the complexity of threading.
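
The lock-versus-CAS trade-off can be illustrated with a small sketch. Both classes below implement the same "set this flag exactly once" behavior, one with a monitor lock and one with `AtomicBoolean.compareAndSet`; the class and method names are illustrative, not taken from the project.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class CasVsLock {
    // Lock-based flag: every call acquires and releases the object's monitor,
    // even when there is no contention at all.
    static class LockedFlag {
        private boolean value;
        synchronized boolean trySet() {
            if (value) return false;
            value = true;
            return true;
        }
    }

    // CAS-based flag: compareAndSet atomically flips false -> true only if
    // no other thread got there first -- no lock is ever taken.
    static class CasFlag {
        private final AtomicBoolean value = new AtomicBoolean(false);
        boolean trySet() {
            return value.compareAndSet(false, true);
        }
    }

    public static void main(String[] args) {
        CasFlag cooked = new CasFlag();
        System.out.println(cooked.trySet());  // first caller wins: true
        System.out.println(cooked.trySet());  // already set: false
    }
}
```

Under contention the CAS version simply retries (or fails fast, as here) instead of parking threads on a lock, which is where the efficiency gain typically comes from.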

I am currently a week into the course, and I feel that my preparation has been sufficient, allowing me to follow the technical details in the lectures and to complete the assigned exercises. I hope you have been able to learn from the path I took to get where I am today. If you are new to Java and don’t have an intermediate-level programming project from another language that you can port over, here are 13 project ideas for Python developers that can also be implemented in Java.

Full disclosure: John received a discount on my Professional Data Engineering course for writing this post.
