Apache Beam Regex

Blog Summary:
  • The Regex class in Beam allows for distributed string processing.
  • The interface of the Regex class is designed to be familiar to Java developers.
  • The Regex.find() method can be used to filter a file based on a regular expression.
  • The Regex.find() method can also be used to extract specific groups from the regular expression.
  • The Regex.replaceFirst() method can be used to do a distributed search and replace on a dataset.

In a previous post, I showed how to use Beam’s Regex class to split up a string.

In this post, I’m going to show some other features of the Regex class.

The Regex class gives you a distributed way to work with strings. I tried to make the interface very familiar to Java developers. The Regex methods mimic the methods in the String class.

Here is a sample of the file this code is running against:

6 Diamond
3 Diamond
4 Club
4 Heart
3 Club
5 Spade

Let’s look at the code.

p.apply(TextIO.Read.from("playing_cards.tsv"))
  .apply(Regex.find("\\d*\\sHeart"))
  .apply(TextIO.Write.to("output/allheart"));

In this code snippet, we process the file and keep only the lines that match the regular expression. In this case, the regular expression looks for any run of digits followed by a whitespace character and the word Heart. The result is a set of output files containing only the Heart lines.
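To make the per-line behavior concrete, here is a small local sketch that uses plain java.util.regex rather than Beam. It assumes that the find step emits the matched text for each line where the pattern is found and drops the other lines. The class name FindSketch and the helper find are made up for illustration, and the sample lines are copied from the post with a single space as the separator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Local, single-machine sketch of the per-line find step. This is not Beam code.
final class FindSketch {

  // Returns the matched text for every line in which the pattern is found.
  static List<String> find(List<String> lines, String regex) {
    Pattern pattern = Pattern.compile(regex);
    List<String> out = new ArrayList<>();
    for (String line : lines) {
      Matcher m = pattern.matcher(line);
      if (m.find()) {
        out.add(m.group()); // group 0 is the whole match
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> cards =
        List.of("6 Diamond", "3 Diamond", "4 Club", "4 Heart", "3 Club", "5 Spade");
    System.out.println(find(cards, "\\d*\\sHeart")); // prints [4 Heart]
  }
}
```

Because \d* also matches zero digits, a line with whitespace directly before Heart and no number would match as well. Use \d+ if a number is required.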

p.apply(TextIO.Read.from("playing_cards.tsv"))
  .apply(Regex.find("(\\d*)\\sHeart", 1))
  .apply(TextIO.Write.to("output/allnumbers"));

Sometimes, you’ll want to get a specific group from the regular expression. This code snippet shows how to define a capture group in the regular expression and select it. Passing 1 as the second argument selects the first group, so the output contains only the digits in front of Heart.
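The group selection can be tried locally in the same way. This sketch again uses java.util.regex instead of Beam and assumes the group number has the same meaning as in Matcher.group(int). The names FindGroupSketch and findGroup are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Local sketch of selecting one capture group per matching line. This is not Beam code.
final class FindGroupSketch {

  // Returns the requested capture group for every line in which the pattern is found.
  static List<String> findGroup(List<String> lines, String regex, int group) {
    Pattern pattern = Pattern.compile(regex);
    List<String> out = new ArrayList<>();
    for (String line : lines) {
      Matcher m = pattern.matcher(line);
      if (m.find()) {
        out.add(m.group(group)); // group 1 is the first pair of parentheses
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> cards =
        List.of("6 Diamond", "3 Diamond", "4 Club", "4 Heart", "3 Club", "5 Spade");
    System.out.println(findGroup(cards, "(\\d*)\\sHeart", 1)); // prints [4]
  }
}
```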

p.apply(TextIO.Read.from("playing_cards.tsv"))
  .apply(Regex.replaceFirst("Heart", "Hearts"))
  .apply(TextIO.Write.to("output/allhearts"));

The final example shows how to do a distributed search and replace. The dataset uses the word Heart, and we want to change it to Hearts. The replaceFirst method takes a regular expression and the string to replace the first match in each line with. Lines that do not match pass through unchanged, so the result is the entire dataset looking like:

6 Diamond
3 Diamond
4 Club
4 Hearts
3 Club
5 Spade
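Here is a local sketch of that search and replace, written with java.util.regex rather than Beam. It assumes each line is handled independently and that only the first match in a line is replaced, as with Matcher.replaceFirst. The names ReplaceFirstSketch and replaceFirst are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Local sketch of a per-line replaceFirst. This is not Beam code.
final class ReplaceFirstSketch {

  // Replaces the first match in each line; lines without a match are kept as they are.
  static List<String> replaceFirst(List<String> lines, String regex, String replacement) {
    Pattern pattern = Pattern.compile(regex);
    List<String> out = new ArrayList<>();
    for (String line : lines) {
      out.add(pattern.matcher(line).replaceFirst(replacement));
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> cards =
        List.of("6 Diamond", "3 Diamond", "4 Club", "4 Heart", "3 Club", "5 Spade");
    // Prints one card per line, with "4 Heart" changed to "4 Hearts".
    replaceFirst(cards, "Heart", "Hearts").forEach(System.out::println);
  }
}
```

Unlike the find examples, every input line produces an output line. In java.util.regex the replacement string treats $ and \ as special characters, so escape them with Matcher.quoteReplacement if they should be taken literally.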

If you are interested in a course on Apache Beam, please sign up to be notified once it’s out.
