- Hadoop requires technical skills to get started with big data.
- For developers, required skills include intermediate to advanced Java knowledge and a general understanding of Linux.
- Recommended skills for developers include knowledge of Scala and SQL, as well as a background in distributed systems.
- For administrators, required skills include very good knowledge of Linux, especially from the command line.
- Recommended skills for administrators include knowledge of security topics such as authentication with Kerberos and encryption.
A common question beginners ask about Hadoop is which technical skills they need to get started. Answering it helps level-set what skills you need before you embark on a big data journey.
For both developers and administrators, I divide the skills into those that are required and those that are nice to have or recommended.
The majority of Hadoop and the Hadoop ecosystem is written in Java. You should have an intermediate to advanced level of Java knowledge. You should understand things like generics, inheritance, and abstract classes.
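As a rough self-check on those three Java topics, here is a minimal sketch that uses all of them together. The names `RecordMapper`, `LineLengthMapper`, and `SkillsCheck` are made up for illustration and are not part of Hadoop's API. If every line here reads naturally to you, you are likely at the level described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Entry point for the sketch (hypothetical class name).
class SkillsCheck {
    public static void main(String[] args) {
        // The variable is typed as the abstract parent; the object is the subclass.
        RecordMapper<String, Integer> mapper = new LineLengthMapper();
        System.out.println(mapper.mapAll(Arrays.asList("hadoop", "hive", "spark")));
        // prints [6, 4, 5]
    }
}

// Abstract class with two generic type parameters:
// I is the input record type, O is the output record type.
abstract class RecordMapper<I, O> {
    // Each subclass decides how one input becomes one output.
    abstract O map(I input);

    // Shared behavior that every subclass inherits unchanged.
    List<O> mapAll(List<I> inputs) {
        List<O> outputs = new ArrayList<>();
        for (I input : inputs) {
            outputs.add(map(input));
        }
        return outputs;
    }
}

// Concrete subclass: binds I to String and O to Integer.
class LineLengthMapper extends RecordMapper<String, Integer> {
    @Override
    Integer map(String input) {
        return input.length();
    }
}
```

The design mirrors a pattern you will see often in JVM data frameworks: the framework supplies a generic base class with the looping logic, and you extend it and fill in the per-record method.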
You should have a general understanding of Linux. You’ll need to be relatively familiar with using a Linux command line to issue commands. Most work can be done in the GUI, but some Hadoop-specific interactions are on the command line.
Hadoop and its ecosystem support other languages to varying extents. Since Scala is a JVM language, you can use it throughout. For technologies like Apache Spark, Scala is often used. Other languages will work with varying degrees of effectiveness and their own gotchas.
Another helpful language is SQL. Apache Hive uses a SQL-like language to process data and Spark has SQL built in.
More helpful, but not required, is a background in distributed systems. Big Data frameworks like Hadoop are distributed systems. These frameworks make it easier to work with distributed systems, but they don’t completely mask all of the complexity. At some point in your journey, you will need to learn these concepts to really master Big Data frameworks.
Hadoop runs on Linux. The majority of clusters run on RHEL/CentOS or Ubuntu. Big Data workloads will stress your systems in ways you might not have seen before and will expose weird problems you only see at scale. To diagnose and fix these issues, you’ll need to be very good with Linux, especially from the command line. Most of the computers in your Hadoop cluster will be sitting in a data center’s rack or in the cloud.
Some of the Hadoop companies, like Cloudera and Hortonworks, are making cluster administration easier with web-based GUIs. These help with monitoring Hadoop clusters and detecting problems in them. Despite these programs, you’ll still need to know how to troubleshoot a computer from a Linux command line.
If you’re planning on administering an enterprise cluster, you’ll probably be dealing with security. This covers everything from authentication with Kerberos to wire encryption to at-rest encryption. It’s the administrator’s job to set all of this up and keep things secure. Having some knowledge of these topics will give you an advantage when job seeking.