In a previous post, I showed how to use Beam’s Regex class to split up a string.
In this post, I’m going to show some other features of the Regex class, which gives you a distributed way to work with strings. I tried to make the interface very familiar to Java developers. The Regex methods mimic the methods in Java’s built-in regular expression classes.
Here is a sample of the file this code is running against:
6 Diamond
3 Diamond
4 Club
4 Heart
3 Club
5 Spade
Let’s look at the code.
p.apply(TextIO.Read.from("playing_cards.tsv"))
    .apply(Regex.find("\\d*\\sHeart"))
    .apply(TextIO.Write.to("output/allheart"));
In this code snippet, we’ll process the file and keep only the lines matching the regular expression. In this case, the regular expression looks for zero or more digits followed by a whitespace character and the word Heart. The result is a set of output files containing only the Heart lines.
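To make the per-line behavior concrete, here is a minimal plain-Java sketch of what I’d expect the find step to do to each line. It uses java.util.regex directly rather than Beam, so it is not Beam’s implementation. The LocalFind class and its find helper are hypothetical names for illustration, and the tab separator in the sample lines is an assumption based on the .tsv extension.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Beam.
final class LocalFind {
    // Same pattern as the pipeline: zero or more digits, one whitespace character, then "Heart".
    static final Pattern HEART = Pattern.compile("\\d*\\sHeart");

    // Keeps the matched text (group 0) of every line that contains a match
    // and drops the lines that do not match.
    static List<String> find(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            Matcher m = HEART.matcher(line);
            if (m.find()) {
                out.add(m.group());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed tab-separated sample lines.
        List<String> cards = Arrays.asList(
            "6\tDiamond", "3\tDiamond", "4\tClub", "4\tHeart", "3\tClub", "5\tSpade");
        System.out.println(find(cards));
    }
}
```

Because \d* also matches zero digits, a line with only whitespace before Heart would still match. If every card should carry a number, \d+ would be the stricter choice.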
p.apply(TextIO.Read.from("playing_cards.tsv"))
    .apply(Regex.find("(\\d*)\\sHeart", 1))
    .apply(TextIO.Write.to("output/allnumbers"));
Sometimes, you’ll want to get a specific group in the regular expression. This code snippet shows how to specify a group in the regular expression and choose it. By specifying the 1, you are choosing the first group, so the output contains only the number from each Heart line.
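The group selection can be sketched the same way in plain Java. Again, LocalFindGroup and findGroup are hypothetical names, the sample lines are assumed to be tab-separated, and this only approximates what the Beam transform does per element.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Beam.
final class LocalFindGroup {
    // The parentheses capture the digits as group 1.
    static final Pattern HEART_NUMBER = Pattern.compile("(\\d*)\\sHeart");

    // For every line that matches, keeps only the requested capture group.
    static List<String> findGroup(List<String> lines, int group) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            Matcher m = HEART_NUMBER.matcher(line);
            if (m.find()) {
                out.add(m.group(group));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed tab-separated sample lines.
        List<String> cards = Arrays.asList("6\tDiamond", "4\tHeart", "3\tClub");
        System.out.println(findGroup(cards, 1));
    }
}
```

Group 0 is always the whole match, so passing 0 here would give the same result as the first example.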
p.apply(TextIO.Read.from("playing_cards.tsv"))
    .apply(Regex.replaceFirst("Heart", "Hearts"))
    .apply(TextIO.Write.to("output/allhearts"));
The final example shows how to do a distributed search and replace. The dataset uses the word Heart. We want to change the word to Hearts. The replaceFirst method takes in a regular expression and the string to replace it with. Unlike find, it keeps every line, so the result is the entire dataset looking like:
6 Diamond
3 Diamond
4 Club
4 Hearts
3 Club
5 Spade
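Here is a rough plain-Java equivalent of the search and replace, applied line by line with java.util.regex. LocalReplace and its replaceFirst helper are hypothetical names, and the tab-separated input is an assumption; treat it as a sketch of the per-line behavior, not Beam’s code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Beam.
final class LocalReplace {
    // Replaces the first match of the regex in each line and keeps every line,
    // including the ones where nothing matched.
    static List<String> replaceFirst(List<String> lines, String regex, String replacement) {
        Pattern pattern = Pattern.compile(regex);
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            out.add(pattern.matcher(line).replaceFirst(replacement));
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed tab-separated sample lines.
        List<String> cards = Arrays.asList("6\tDiamond", "4\tHeart", "5\tSpade");
        System.out.println(replaceFirst(cards, "Heart", "Hearts"));
    }
}
```

One thing to watch for: the pattern Heart also matches inside Hearts, so running the replacement a second time over its own output would produce Heartss. Anchoring the pattern, for example Heart$, is one way to avoid that.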
If you are interested in a course on Apache Beam, please sign up to be notified once it’s out.