Beam’s Pico WordCount

Jesse Anderson
December 7, 2016
Blog, Business, Data Engineering
No Comments

Blog Summary: (AI Summaries by Summarizes)

The goal of the game in Big Data frameworks is to create the fewest lines of code for WordCount.
The author is a committer on Apache Beam and has recently improved the regular expression handling in Beam.
The smallest WordCount using Beam involves reading in a file, using the new Regex transform, counting the words, manually converting a PCollection> to a PCollection, and writing out the words and counts.
The new ToString class in version 0.6.0 makes it easier to change objects to strings.
The output of the WordCount will be placed in the "output" directory with files prefixed as "stringcounts".

There’s this friendly game in Big Data frameworks. It’s what’s the fewest lines of code it takes to do WordCount.

I’m a committer on Apache Beam and most of my time is dedicated to making things easier for developers to use Beam. I also help explain Beam in articles and in conference sessions.

One of my recent commits was to improve the regular expression handling in Beam. I wanted to make commonly used regular expression activities easier and built-in.

Here is the smallest WordCount I could create using Beam.

public class PicoWordCount {
    public static void main(String[] args) {

        PipelineOptions options = PipelineOptionsFactory.create();
        Pipeline p = Pipeline.create(options);

        p
                .apply(TextIO.Read.from("playing_cards.tsv"))
                .apply(Regex.split("\\W+"))
                .apply(Count.perElement())
                .apply(ToString.elements())
                .apply(TextIO.Write.to("output/stringcounts"));

        p.run();
    }
}

The first step is read in the file. In this case, the file is playing_cards.tsv.

After that, we use the new Regex transform. This is the transform I recently committed to make it easier to do regular expressions. If you think along the lines of Java’s String.split() method, that’s exactly what I’m doing here. Except this code is running in a distributed system across many different processes.

Next, we count the words that we’ve split up.

The next section is something that I’m looking at now. There isn’t a built-in way to take a KV object and make it a String. The MapElements.via() method is manually taking the KV<String, Long> and making it a String. Another way of saying this is that I’m taking a PCollection<KV<String, Long>> and manually converting it to a PCollection<String>. I have to do this because the TextIO.Write.to() requires a PCollection<String>. As I said, look for some API improvements on this front.

Update for 0.6.0: We’ve added a class to make it easier to change objects to strings. The new ToString class does this. Using the ToString.elements() runs the toString() method of the object and returns it.

Finally, we write out the words and counts. These will be placed in the output directory with files prefixed as stringcounts.

Once we’ve setup all of the transforms, we can run the Pipeline.

If you interested in a course on Apache Beam, please sign up to be notified once it’s out.

Beam’s Pico WordCount

Related Posts

Unapologetically Technical Episode 20 – Shane Murray

Unapologetically Technical Episode 19 – Jacopo Tagliabue

Unapologetically Technical Episode 18 – Adrian Woodhead

Unapologetically Technical Episode 17 – Semih Salihoglu

Unapologetically Technical Episode 16 – David Jayatillake

Unapologetically Technical Episode 15 – Frances Perry

Unapologetically Technical Episode 14 – Cliff Crosland

Data Teams Survey 2020-2024 Analysis

Data Teams Survey 2024 Results

Join the Newsletter