Doing Big Data ASAP


I had an interesting question at TDWI Boston that I haven’t been asked before:

If you absolutely had to do something with Hadoop and Big Data tomorrow, how would you do it?

I’ll answer this from a technical and then a management point of view.

Technical

I call Apache Hive the Big Data technology you already know. This is because most people already know SQL. Hive uses a SQL-like language for its queries. By using SQL, it reduces the programming overhead to process the data.

That works around the programming side, but doesn’t give you a cluster to run on. For that, I’d turn to the cloud. Both Amazon Web Services (through Amazon EMR) and Google Cloud (through Dataproc) support spinning up a cluster with Hive.

Whichever cloud provider you choose, you’ll need to upload your data to it. From there, it’s a question of writing and running your SQL query. You’d be limited to whatever SQL can do, but you’d at least be processing data.
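To make the "write and run your SQL query" step concrete, here is a rough sketch of the two statements you would typically need: one that maps the uploaded files to a Hive table, and one that queries it. It assumes the data is comma-separated text sitting in a storage bucket. The table name, columns, and bucket path are all made up for illustration.

```python
# A minimal sketch that only builds HiveQL text; it does not talk to a cluster.
# Everything named here (page_views, its columns, the bucket path) is a
# hypothetical placeholder, not a real dataset.


def external_table_ddl(table, columns, location):
    """Build a HiveQL statement that maps uploaded CSV files to a table."""
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


def top_pages_query(table, limit=10):
    """Build a plain SQL aggregation: the most viewed pages."""
    return (
        "SELECT page, COUNT(*) AS views\n"
        f"FROM {table}\n"
        "GROUP BY page\n"
        "ORDER BY views DESC\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    ddl = external_table_ddl(
        "page_views",
        [("page", "STRING"), ("user_id", "STRING"), ("ts", "STRING")],
        "s3://example-bucket/page_views/",  # assumed upload location
    )
    print(ddl + ";")
    print(top_pages_query("page_views") + ";")
```

You could save the printed statements to a file and run them on the cluster with something like `hive -f queries.sql`. Building the text in Python is just a convenience for this sketch; typing the same statements into the Hive shell works equally well.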

Hive has two ways to add new functionality: user-defined functions and transforms. I cover both of these programming interfaces in my Professional Data Engineering course.
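As a small illustration of the transform route: Hive streams each row to an external script as a tab-separated line on standard input and reads tab-separated rows back from standard output. The sketch below is one possible script, assuming a two-column input of `page` and `user_id`. The column names, the script name, and the cleanup logic are invented for the example.

```python
import sys

# Hypothetical transform script, saved as normalize_page.py.
# Assumed usage from Hive (table and column names are placeholders):
#   ADD FILE normalize_page.py;
#   SELECT TRANSFORM(page, user_id)
#     USING 'python3 normalize_page.py'
#     AS (page, user_id)
#   FROM page_views;


def transform_line(line):
    """Turn one tab-separated input row into one tab-separated output row."""
    page, user_id = line.rstrip("\n").split("\t")
    # Example logic only: lowercase the path and drop a trailing slash,
    # keeping "/" for the root page.
    normalized = page.lower().rstrip("/") or "/"
    return f"{normalized}\t{user_id}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read rows from Hive on stdin and write transformed rows to stdout."""
    for line in stdin:
        if line.strip():  # skip blank lines
            stdout.write(transform_line(line) + "\n")


if __name__ == "__main__":
    main()
```

This assumes a Python 3 interpreter is available on the cluster nodes. A user-defined function does a similar job from inside Hive itself, while a transform lets you keep the logic in whatever scripting language your team already uses.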

Management

Getting a cluster up and running could give the impression that Big Data is easy when it’s not. The method I described above doesn’t represent the work it takes to create a data pipeline. It is more of a one-off and isn’t automated. It skips all of the planning and understanding that a real pipeline requires.

To do a data pipeline right, you will need to put in more effort. The team will also need training on the technologies. Without proper training, the team will be very limited in what it can achieve.
