As people start with Big Data, they go through the list of necessary skills. One of those crucial skills is programming.

The question arises: how good do a person’s programming skills need to be? This is because programming skills are on a wide spectrum. There are people who are:

  • Brand new to programming and will have to learn how to program
  • Programming in a language other than Java/Scala/Python
  • Been programming in Java for many years

Another dimension is your role on the team. For example, a Data Engineer will need far better programming skills than a Data Analyst. My book The Ultimate Guide to Switching Careers to Big Data goes through the individual titles and specific recommendations in more depth.

Example Code

To give you an idea of what some Big Data code looks like, here is an example Mapper class from my Uno Example.

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CardMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Require at least one digit so lines like "Red King" don't match
    private static Pattern inputPattern = Pattern.compile("(.*) (\\d+)");

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String inputLine = value.toString();

        Matcher inputMatch = inputPattern.matcher(inputLine);

        // Use regex to throw out Jacks, Queens, Kings, Aces and Jokers
        if (inputMatch.matches()) {
            // Normalize inconsistent case for card suits
            String cardSuit = inputMatch.group(1).toLowerCase();
            int cardValue = Integer.parseInt(inputMatch.group(2));

            context.write(new Text(cardSuit), new IntWritable(cardValue));
        }
    }
}

You’ll notice a few things about this code.

The class is relatively small and self-contained. This is true for most Big Data code, even production code. This is because the framework, in this case Hadoop MapReduce, is doing all sorts of things behind the scenes for us.

We also see that we’re using regular expressions to parse incoming data. It’s very common to use regular expressions in processing data. They’re so necessary that I cover them in my Professional Data Engineering course.
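To show the same regex filtering outside of Hadoop, here is a small, standalone sketch (the class and method names are my own, for illustration) that applies the mapper’s pattern to a couple of sample lines. Cards with a numeric value parse cleanly; face cards and Jokers have no trailing digits, so the pattern rejects them.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CardFilter {
    // Same idea as the mapper's pattern: a suit, a space, then a numeric value
    private static final Pattern INPUT_PATTERN = Pattern.compile("(.*) (\\d+)");

    // Returns "suit=value" for numbered cards, or null for face cards/Jokers
    public static String parseCard(String line) {
        Matcher m = INPUT_PATTERN.matcher(line);
        if (!m.matches()) {
            return null; // "Red King", "Joker", etc. have no trailing digits
        }
        return m.group(1).toLowerCase() + "=" + Integer.parseInt(m.group(2));
    }

    public static void main(String[] args) {
        System.out.println(parseCard("Green 7"));  // green=7
        System.out.println(parseCard("Red King")); // null
    }
}
```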

You’ll also notice this class isn’t doing anything exotic. The code and syntax itself isn’t going to stress most people’s knowledge of Java. The closest thing to exotic is the occasional use of the transient keyword. In this sense, an intermediate knowledge of Java syntax is enough for Big Data.

As you just saw, the programming side is necessary, but not extremely difficult. You will need to know how to program. I’ve seen people come from other languages without significant difficulties.

What Is Difficult Then?

There are two main difficulties on the programming side: understanding the framework and understanding the algorithms you need to write when creating a distributed system.

The Framework

Looking back at the code above:

  • How does the map function get data?
  • Where does the key’s data come from?
  • Where does the value’s data come from?
  • What happens when you do a write?
  • What should you use for your output key and value?

Some of these questions are answered by knowing and understanding what the framework is doing for you. In this example code, Hadoop MapReduce is doing several things for you. What should come in and out of the map function is dependent on what you’re trying to do. At its core, you need to have a deep understanding of the framework before you can use it or code for it.
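To make the framework’s role concrete, here is a toy, pure-Java simulation of what a MapReduce-style framework does around your map function. This is my own illustration, not Hadoop’s actual code: the framework reads records, calls the map logic once per record, and groups the emitted (key, value) pairs by key before handing them to reducers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MiniFramework {
    // The framework, not your code, drives this loop. It answers the questions
    // above: it feeds each input record to the map logic and collects every
    // (key, value) pair the map logic emits.
    public static Map<String, List<Integer>> run(List<String> records) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (String record : records) {            // "How does map get data?"
            String[] parts = record.split(" ");    // parse a "suit value" line
            if (parts.length != 2 || !parts[1].matches("\\d+")) {
                continue;                          // skip face cards and Jokers
            }
            String suit = parts[0].toLowerCase();  // the emitted key
            int value = Integer.parseInt(parts[1]); // the emitted value
            // "What happens when you do a write?" The framework groups each
            // pair by key so a reducer later sees all values for one key.
            grouped.computeIfAbsent(suit, k -> new ArrayList<>()).add(value);
        }
        return grouped;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("Red 7", "red 3", "Blue 9")));
        // {red=[7, 3], blue=[9]}
    }
}
```

In real Hadoop, the grouping step is the shuffle, and it happens across many machines; the simulation only shows the contract your map code programs against.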

Not realizing where the difficulty lies is a common issue for people starting out with Big Data. They think they can use their existing small data background and not make a concerted effort to learn Big Data technologies. They’re dead wrong, and this thinking causes people to fail at switching careers to Big Data. I talk more about these necessary realizations and what to do in my book The Ultimate Guide to Switching Careers to Big Data.

The Algorithms

With Big Data, you’re doing things in parallel and across many different computers. This causes you to change the way you process and work with data.

As you saw in the code above, you will need to decide what should come in and out of your map function. But how do you do this in a distributed system?

A simple example of the difference can be shown with calculating an average. Let’s say we want to calculate the average of these numbers:

88 91 38 3 98 79 3 31 23 61

On a single computer, that’s easy: iterate through all 10 values, sum them, and divide by 10. The answer is 51.5.

Now let’s distribute out the data to 3 computers.

Computer 1: 88 91 38

Computer 2: 3 98 79

Computer 3: 3 31 23 61

Now, we run the averages on all 3 computers.

Computer 1: 72.3

Computer 2: 60

Computer 3: 29.5

But we don’t have the average of the dataset. We average out the results from all three computers to get 53.94. Now we’re off by 2.44. Why? Because an average of averages isn’t correct when the groups have different sizes: Computer 3 averaged four values, yet its result carries the same weight as the three-value groups.

In order to distribute out the data and run an algorithm in parallel, we need to change the way we calculate the average.
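One common fix is to have each computer emit a partial sum and a count instead of a local average, then combine the partials in a final step. Here is a plain-Java sketch of that idea (the class and method names are my own, not MapReduce code):

```java
import java.util.Arrays;
import java.util.List;

public class DistributedAverage {
    // Each computer returns (sum, count) instead of a local average
    static long[] partial(int[] values) {
        long sum = 0;
        for (int v : values) {
            sum += v;
        }
        return new long[] { sum, values.length };
    }

    // The final step combines the partials; no averaging of averages
    static double combine(List<long[]> partials) {
        long totalSum = 0;
        long totalCount = 0;
        for (long[] p : partials) {
            totalSum += p[0];
            totalCount += p[1];
        }
        return (double) totalSum / totalCount;
    }

    public static void main(String[] args) {
        List<long[]> partials = Arrays.asList(
            partial(new int[] { 88, 91, 38 }),     // Computer 1
            partial(new int[] { 3, 98, 79 }),      // Computer 2
            partial(new int[] { 3, 31, 23, 61 })); // Computer 3
        System.out.println(combine(partials));     // 51.5, the true average
    }
}
```

Because a sum of sums and a sum of counts don’t lose any information, the combined result matches the single-computer answer exactly, no matter how unevenly the data is split.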

Are Your Programming Skills Ready?

The answer comes down to your current programming skill in Java and what your position on the team will be. If you looked over the code and readily understood it, you’re probably ready. If you struggled to understand the code, you need to spend some time on your programming skills before embarking on your Big Data career.

Remember that programming is just one piece of the puzzle. You will need to learn and understand the Big Data frameworks. You’ll also need to understand how algorithms are done at scale. For this, you’ll need materials and help to learn.
