Beam’s Pico WordCount


There’s this friendly game in Big Data frameworks: who can do WordCount in the fewest lines of code.

I’m a committer on Apache Beam, and most of my time is dedicated to making Beam easier for developers to use. I also help explain Beam in articles and in conference sessions.

One of my recent commits was to improve the regular expression handling in Beam. I wanted to make commonly used regular expression activities easier and built-in.

Here is the smallest WordCount I could create using Beam.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Regex;
import org.apache.beam.sdk.transforms.ToString;

public class PicoWordCount {
    public static void main(String[] args) {

        PipelineOptions options = PipelineOptionsFactory.create();
        Pipeline p = Pipeline.create(options);

        p
                // Read each line of the input file as a String element
                .apply(TextIO.Read.from("playing_cards.tsv"))
                // Split every line into words on runs of non-word characters
                .apply(Regex.split("\\W+"))
                // Count occurrences of each word, giving KV<String, Long> pairs
                .apply(Count.perElement())
                // Call toString() on each KV to get a PCollection<String>
                .apply(ToString.elements())
                // Write the results out with the file prefix "stringcounts"
                .apply(TextIO.Write.to("output/stringcounts"));

        p.run();
    }
}

The first step is to read in the file. In this case, the file is playing_cards.tsv.

After that, we use the new Regex transform. This is the transform I recently committed to make it easier to do regular expressions. If you think along the lines of Java’s String.split() method, that’s exactly what I’m doing here. Except this code is running in a distributed system across many different processes.
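To see what Regex.split("\\W+") does to a single element, here is a plain-Java sketch using String.split() on one made-up input line. This is local code for illustration, not the Beam transform itself, which applies the same splitting to every element of the PCollection across the cluster:

```java
import java.util.Arrays;
import java.util.List;

public class SplitSketch {
    public static void main(String[] args) {
        // One line of input, as Beam would see a single element.
        String line = "ace of spades\tace of hearts";

        // "\\W+" splits on any run of non-word characters
        // (spaces, tabs, punctuation), just like Regex.split("\\W+").
        List<String> words = Arrays.asList(line.split("\\W+"));

        System.out.println(words); // [ace, of, spades, ace, of, hearts]
    }
}
```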

Next, we count the words that we’ve split up.
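Count.perElement() groups identical elements and pairs each distinct element with its number of occurrences, yielding KV<String, Long> pairs. A plain-Java sketch of the same tally using a HashMap (again, local code standing in for the distributed transform):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CountSketch {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("ace", "of", "spades", "ace");

        // Tally each distinct word, as Count.perElement() does
        // across the whole PCollection.
        Map<String, Long> counts = new HashMap<>();
        for (String word : words) {
            counts.merge(word, 1L, Long::sum);
        }

        System.out.println(counts.get("ace")); // 2
        System.out.println(counts.get("of"));  // 1
    }
}
```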

The next section is something I was looking at when I first wrote this. There wasn’t a built-in way to take a KV object and make it a String. The original version of this code used MapElements.via() to manually take the KV<String, Long> and make it a String. Another way of saying this is that it took a PCollection<KV<String, Long>> and manually converted it to a PCollection<String>. This conversion is necessary because TextIO.Write.to() requires a PCollection<String>. As I said, look for some API improvements on this front.

Update for 0.6.0: We’ve added a class to make it easier to change objects to strings. The new ToString class does this. ToString.elements() runs each element’s toString() method and returns the result.
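A plain-Java sketch of what ToString.elements() does, using Map.Entry as a stand-in for Beam’s KV (note that Beam’s KV has its own toString() format, so the exact output text below is only illustrative):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ToStringSketch {
    public static void main(String[] args) {
        // Stand-ins for the KV<String, Long> pairs from the count step.
        List<Map.Entry<String, Long>> counts = Arrays.asList(
                new SimpleEntry<>("ace", 2L),
                new SimpleEntry<>("of", 1L));

        // ToString.elements() does the equivalent of this map step:
        // each element becomes its toString() text, which gives
        // TextIO.Write the collection of Strings it requires.
        List<String> lines = counts.stream()
                .map(Object::toString)
                .collect(Collectors.toList());

        System.out.println(lines); // [ace=2, of=1]
    }
}
```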

Finally, we write out the words and counts. These will be placed in the output directory with files prefixed as stringcounts.

Once we’ve set up all of the transforms, we can run the Pipeline.

If you’re interested in a course on Apache Beam, please sign up to be notified once it’s out.
