I had an interesting question at TDWI Boston that I haven’t been asked before:
If you absolutely had to do something with Hadoop and Big Data tomorrow, how would you do it?
I’ll answer this from a technical and then a management point of view.
I call Apache Hive the Big Data technology you already know. This is because most people already know SQL. Hive uses a SQL-like language, HiveQL, for its queries. Because you write SQL instead of custom code, it reduces the programming overhead of processing the data.
That works around the programming side, but doesn’t give you a cluster to run on. I’d turn to the cloud. Both Amazon Web Services (through Amazon EMR) and Google Cloud (through Dataproc) support spinning up a cluster with Hive.
Whichever cloud provider you choose, you’ll need to upload the data to it first. From there it’s a question of writing and running your SQL query. You’d be limited to whatever SQL can do, but you’d at least be processing data.
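To make that concrete, here is a minimal sketch of what "writing your SQL query" could look like. It is a small Python helper that only builds the HiveQL text; it doesn't talk to a cluster. The table name, the column layout, and the bucket path are all hypothetical, and I'm assuming the uploaded data is plain comma-separated text.

```python
from textwrap import dedent


def build_hive_script(table: str, location: str) -> str:
    """Return HiveQL that maps uploaded CSV files to a table and queries it.

    The column layout (user_id, url, view_time) is an assumption made
    purely for illustration.
    """
    return dedent(f"""\
        -- Point Hive at the files already uploaded to cloud storage.
        CREATE EXTERNAL TABLE IF NOT EXISTS {table} (
          user_id STRING,
          url STRING,
          view_time STRING
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        STORED AS TEXTFILE
        LOCATION '{location}';

        -- Ordinary SQL from here on: the ten most viewed URLs.
        SELECT url, COUNT(*) AS views
        FROM {table}
        GROUP BY url
        ORDER BY views DESC
        LIMIT 10;
        """)


if __name__ == "__main__":
    # Hypothetical table name and bucket path; substitute your own.
    print(build_hive_script("page_views", "s3://example-bucket/page_views/"))
```

If you save the output to a file, you can run it on the cluster with `hive -f` followed by the file name. An external table leaves the uploaded files where they are, which suits a one-time job like this.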
Hive has two ways to add new functionality: user-defined functions and transforms. I cover both of these programming interfaces in my Professional Data Engineering course.
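As a rough illustration of the transform route, here is a sketch of a script Hive could stream rows through. With `TRANSFORM`, Hive sends each row to the script's standard input as a tab-separated line and reads tab-separated lines back from its standard output. The script name, the table, and the columns are hypothetical, and this is not the example from the course.

```python
import sys
from urllib.parse import urlparse

# Hive writes SQL NULL to a transform script as the two characters \N.
NULL_MARKER = "\\N"


def transform_lines(lines):
    """Turn 'user_id<TAB>url' lines into 'user_id<TAB>domain' lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip rows that don't have the expected two columns
        user_id, url = fields
        if url == NULL_MARKER:
            domain = NULL_MARKER
        else:
            # Fall back to NULL when the value has no host part.
            domain = urlparse(url).netloc or NULL_MARKER
        yield f"{user_id}\t{domain}"


if __name__ == "__main__":
    # Hypothetical HiveQL that would invoke this script:
    #   ADD FILE extract_domain.py;
    #   SELECT TRANSFORM (user_id, url)
    #     USING 'python3 extract_domain.py'
    #     AS (user_id, domain)
    #   FROM page_views;
    for out in transform_lines(sys.stdin):
        print(out)
```

Keeping the row logic in its own function means you can test it on your laptop with a few sample lines before you pay for a cluster.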
Getting a cluster up and running could give the impression that Big Data is easy when it’s not. The method I described above doesn’t represent the work it takes to create a data pipeline. It’s more of a one-off and isn’t automated. It skips all of the planning and understanding that a real pipeline requires.
To do a data pipeline right, you will need to put in more effort. The team will also need training on the technologies. Without that training, the team will be very limited in what it can achieve.