- Apache Hive is a Big Data technology that uses a SQL-like language for its queries, reducing programming overhead to process data.
- To run Hive, you can turn to the cloud and use services like Amazon Web Services or Google Dataproc to spin up a cluster.
- Uploading data to the cloud provider is necessary, and then you can write and run your SQL query.
- Hive has two ways to add new functionality: user-defined functions and transforms.
- Creating a data pipeline requires more effort and understanding than just spinning up a cluster with Hive.
I had an interesting question at TDWI Boston that I haven’t been asked before:
If you absolutely had to do something with Hadoop and Big Data tomorrow, how would you do it?
I’ll answer this from a technical and then a management point of view.
I call Apache Hive the Big Data technology you already know. This is because most people already know SQL. Hive uses a SQL-like language for its queries. By using SQL, it reduces the programming overhead to process the data.
Whichever cloud provider you choose, you’ll need to upload the data. From there it’s a question of writing and running your SQL query. You’d be limited to whatever SQL can do, but at least you’d be processing data.
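To make the upload-then-query step concrete, here is a minimal sketch of the two HiveQL statements you might end up running: one that points an external table at the uploaded files, and one that queries it. The bucket path, table name, and columns are all hypothetical placeholders, and the sketch assumes the uploaded files are comma-separated text. It only builds and prints the statements; you would paste them into whatever Hive client your cluster provides.

```python
# Hypothetical names: the bucket path, table name, and columns below are
# placeholders for illustration, not taken from any real dataset.
DATA_LOCATION = "s3://example-bucket/page_views/"


def build_statements(table, location):
    """Return (ddl, query) HiveQL strings for CSV files stored at `location`."""
    # An external table leaves the uploaded files where they are and
    # only tells Hive how to read them.
    ddl = (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        "  user_id STRING,\n"
        "  url STRING,\n"
        "  view_time STRING\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{location}'"
    )
    # A plain SQL aggregation: the ten most viewed URLs.
    query = (
        "SELECT url, COUNT(*) AS views\n"
        f"FROM {table}\n"
        "GROUP BY url\n"
        "ORDER BY views DESC\n"
        "LIMIT 10"
    )
    return ddl, query


ddl, query = build_statements("page_views", DATA_LOCATION)
print(ddl + ";\n\n" + query + ";")
```

Nothing here goes beyond ordinary SQL, which is the point: if you can write the `SELECT`, you can process the data.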
Hive has two ways to add new functionality: user-defined functions and transforms. I cover both of these programming interfaces in my Professional Data Engineering course.
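As a rough illustration of the transform route, here is a minimal sketch of a script Hive could stream rows through. With `TRANSFORM`, Hive sends each row to the script's standard input as tab-separated text and reads tab-separated rows back from its standard output. The column names, the script name `extract_host.py`, and the host-extraction logic are assumptions made up for this example.

```python
import io
import sys


def transform_line(line):
    """Turn one tab-separated (user_id, url) row into a (user_id, host) row."""
    user_id, url = line.rstrip("\n").split("\t", 1)
    # Example logic only: pull the host out of a URL such as http://host/path.
    host = url.split("/")[2] if "://" in url else url
    return f"{user_id}\t{host}"


def run(stdin, stdout):
    """Read rows from stdin, write transformed rows to stdout."""
    for line in stdin:
        if line.strip():  # skip blank lines
            stdout.write(transform_line(line) + "\n")


# In a real transform script the last line would be: run(sys.stdin, sys.stdout)
# Hypothetical HiveQL that would call it (table and columns are placeholders):
#   ADD FILE extract_host.py;
#   SELECT TRANSFORM (user_id, url)
#     USING 'python extract_host.py' AS (user_id, host)
#   FROM page_views;

# Local demonstration with in-memory streams instead of a cluster.
sample = io.StringIO("u1\thttp://example.com/a\nu2\thttps://hive.apache.org/docs\n")
result = io.StringIO()
run(sample, result)
sys.stdout.write(result.getvalue())
```

Because the script only reads and writes text streams, you can test it on your laptop with a sample file before ever involving the cluster.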
Getting a cluster up and running could give the impression that Big Data is easy when it’s not. The method I described above doesn’t represent the work it takes to create a data pipeline. It is more of a one-off, and it isn’t automated. It skips all of the planning and understanding necessary.
To do a data pipeline right, you will need to put in more effort. The team will also need training on the technologies. Without proper training, the team will be very limited in what they can achieve.