Beam’s Pico WordCount

Blog Summary: (AI Summaries by Summarizes)
  • The goal of the game in Big Data frameworks is to create the fewest lines of code for WordCount.
  • The author is a committer on Apache Beam and has recently improved the regular expression handling in Beam.
  • The smallest WordCount using Beam involves reading in a file, using the new Regex transform, counting the words, manually converting a PCollection<KV<String, Long>> to a PCollection<String>, and writing out the words and counts.
  • The new ToString class in version 0.6.0 makes it easier to change objects to strings.
  • The output of the WordCount will be placed in the "output" directory with files prefixed as "stringcounts".

There’s a friendly game in Big Data frameworks: what’s the fewest lines of code it takes to write WordCount?

I’m a committer on Apache Beam and most of my time is dedicated to making things easier for developers to use Beam. I also help explain Beam in articles and in conference sessions.

One of my recent commits was to improve the regular expression handling in Beam. I wanted to make commonly used regular expression activities easier and built-in.

Here is the smallest WordCount I could create using Beam.

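import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Regex;
import org.apache.beam.sdk.transforms.ToString;

public class PicoWordCount {
    public static void main(String[] args) {

        PipelineOptions options = PipelineOptionsFactory.create();
        Pipeline p = Pipeline.create(options);

        p
                // Read the input file line by line.
                .apply(TextIO.Read.from("playing_cards.tsv"))
                // Split each line on runs of non-word characters.
                .apply(Regex.split("\\W+"))
                // Count the occurrences of each word.
                .apply(Count.perElement())
                // Convert each KV<String, Long> to a String via toString().
                .apply(ToString.elements())
                // Write the results to files prefixed with "stringcounts".
                .apply(TextIO.Write.to("output/stringcounts"));

        p.run();
    }
}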

The first step is to read in the file. In this case, the file is playing_cards.tsv.

After that, we use the new Regex transform. This is the transform I recently committed to make regular expressions easier to use. If you think along the lines of Java’s String.split() method, that’s exactly what I’m doing here, except that this code runs in a distributed system across many different processes.
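To make the String.split() comparison concrete, here is a minimal plain-Java sketch of what the same pattern does to a single line on one machine. It is not Beam code: the class name SplitSketch, the helper splitWords, and the sample line are made up for illustration, and the real contents of playing_cards.tsv are not shown in this post.

```java
import java.util.Arrays;
import java.util.List;

// Local, single-process analogy for Regex.split("\\W+"); not Beam code.
class SplitSketch {
    // Splits one line on runs of non-word characters, the way String.split() does.
    static List<String> splitWords(String line) {
        return Arrays.asList(line.split("\\W+"));
    }

    public static void main(String[] args) {
        // Hypothetical sample line; tabs, commas, and spaces all act as separators.
        System.out.println(splitWords("Ace\tSpades, King\tHearts"));
        // prints [Ace, Spades, King, Hearts]
    }
}
```

One thing to watch with String.split(): a line that starts with a separator yields an empty string as its first element, so splitWords(" Ace") returns ["", "Ace"].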

Next, we count the words that we’ve split up.
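Conceptually, counting per element produces one (word, count) pair for each distinct word. A rough single-machine sketch of that idea in plain Java follows. Again, this is not Beam code: CountSketch and countPerElement are names I made up, and a Map stands in for the distributed collection of KV<String, Long> pairs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Local analogy for Count.perElement(); not Beam code.
class CountSketch {
    // Returns how many times each distinct word occurs in the input.
    static Map<String, Long> countPerElement(List<String> words) {
        Map<String, Long> counts = new TreeMap<>(); // sorted keys for stable output
        for (String word : words) {
            counts.merge(word, 1L, Long::sum); // start at 1, otherwise add 1
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countPerElement(Arrays.asList("Ace", "King", "Ace")));
        // prints {Ace=2, King=1}
    }
}
```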

The next step is something that I’m looking at now. When I first wrote this post, there wasn’t a built-in way to take a KV object and make it a String, so the code used MapElements.via() to manually take each KV<String, Long> and make it a String. Another way of saying this is that I was taking a PCollection<KV<String, Long>> and manually converting it to a PCollection<String>. I had to do this because TextIO.Write.to() requires a PCollection<String>. Look for some API improvements on this front.

Update for 0.6.0: We’ve added a class that makes it easier to convert objects to strings: the new ToString class. ToString.elements() calls the toString() method on each element and returns the result.
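The difference between the manual conversion and the toString() approach can be sketched in plain Java. This is only an analogy under stated assumptions: ToStringSketch, formatManually, and viaToString are hypothetical names, and java.util.Map.Entry stands in for Beam’s KV, whose toString() output has its own format that is not reproduced here.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Local analogy for converting key/value pairs to strings; not Beam code.
class ToStringSketch {
    // Manual conversion: we pick the output format ourselves (tab-separated here).
    static List<String> formatManually(List<Map.Entry<String, Long>> counts) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Long> entry : counts) {
            lines.add(entry.getKey() + "\t" + entry.getValue());
        }
        return lines;
    }

    // Generic conversion: defer to each element's own toString().
    static <T> List<String> viaToString(List<T> elements) {
        List<String> lines = new ArrayList<>();
        for (T element : elements) {
            lines.add(element.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> counts = Arrays.<Map.Entry<String, Long>>asList(
                new SimpleEntry<String, Long>("Ace", 4L),
                new SimpleEntry<String, Long>("King", 2L));
        System.out.println(formatManually(counts)); // tab-separated key and count
        System.out.println(viaToString(counts));    // prints [Ace=4, King=2]
    }
}
```

The trade-off is the usual one: the generic version is a single call, but the output format is whatever the element’s toString() produces, while the manual version takes more code and gives full control over each line.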

Finally, we write out the words and counts. These will be placed in the output directory with files prefixed as stringcounts.

Once we’ve set up all of the transforms, we can run the Pipeline.

If you are interested in a course on Apache Beam, please sign up to be notified once it’s out.
