Regular expressions to make life more comfortable

Sachith Muhandiram
Feb 15, 2020


Recently I have been working with Apache NiFi at my workplace (VizuaMatix). I was given a large CSV data set and needed to extract a few columns, including a JSON object stored in one column. The task seemed simple until I got the real data, which was a mess. The JSON object was not properly structured: it arrived as a broken object spread across multiple lines.

So I spent almost a day trying to find a proper regex to extract this JSON. Last evening I watched TechLead's "7-Productive life hacks", where he says regex is a productive thing to learn because it can save a lot of time, especially when searching and coding. Then this morning YouTube suggested a video that made my day: "Regular Expressions (Regex) Tutorial: How to Match Any Pattern of Text".

He explains all the basics of regex you need to know. Thanks to Corey Schafer, I learned something that made my day. Here I will explain some basic samples I tested.

First, let's see the basic regex operators.

.  - Matches any character except a new line (0-9, a-z, A-Z, etc.)
\d - Digit
\D - Not a digit
\w - Word character (a-z, A-Z, 0-9, _)
\W - Not a word character
^  - Beginning of a line (or string)
$  - End of a line (or string)
\s - White space
\S - Not a white space
|  - Or
\b - Word boundary
\B - Not a word boundary
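
To see a few of these operators in action, here is a minimal sketch using Python's built-in re module. The sample text is made up for illustration; it is not from the data set I was working on.

```python
import re

# Hypothetical sample text, invented for this illustration.
sample = "Order 66 shipped\nto Room_9 at 10am"

digits = re.findall(r"\d", sample)           # each digit on its own
non_word = re.findall(r"\W", sample)         # the spaces and the newline
either = re.findall(r"Order|Room", sample)   # | picks either alternative

# With re.MULTILINE, ^ and $ anchor at every line, not only the whole string.
first_words = re.findall(r"^\w+", sample, flags=re.MULTILINE)
last_words = re.findall(r"\w+$", sample, flags=re.MULTILINE)

# . matches any character except a newline.
dots = re.findall(r".", "a\nb")

print(digits, non_word, either, first_words, last_words, dots)
```

Note that \w treats the underscore in Room_9 as a word character, which is why it is listed next to letters and digits above.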

The word boundary is a special kind of operator. Let's see how it works. Our sample text is Ha HaHa.

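\bHa     Ha HaHa   (matches the first Ha and the Ha at the start of HaHa)
\bHa\b   Ha HaHa   (matches only the first, standalone Ha)

A word boundary is matched only:

  • At the start or end of the text.
  • Between a word character and a non-word character, such as before and after a space.

These are the true boundaries of a word. Note that a new line is also a non-word character, so a word at the start or end of a line still sits on a word boundary.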
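
A quick way to check where \b actually matches is to print the match positions. This is a small sketch with Python's re module; the second string, with a newline in it, is my own addition to test how boundaries behave across lines.

```python
import re

sample = "Ha HaHa"

# \bHa: "Ha" preceded by a word boundary.
starts = [m.start() for m in re.finditer(r"\bHa", sample)]

# \bHa\b: "Ha" with a boundary on both sides, i.e. a standalone word.
whole = [m.start() for m in re.finditer(r"\bHa\b", sample)]

# A newline is a non-word character, so both words here are standalone.
across_lines = [m.start() for m in re.finditer(r"\bHa\b", "Ha\nHa")]

print(starts)        # [0, 3]
print(whole)         # [0]
print(across_lines)  # [0, 3]
```

The Ha at index 5 (the second half of HaHa) never matches, because there is a word character directly before it.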

Then there are two special constructs: character sets and groups.

[ ]  - Character set
( )  - Group
[^ ] - Matches characters not inside the set
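
Here is a short sketch of all three in Python. The words and the name are hypothetical examples chosen only to make the difference visible.

```python
import re

text = "cat bat rat mat"

# [cb] matches exactly one character: either c or b.
in_set = re.findall(r"[cb]at", text)

# [^cb] matches exactly one character that is neither c nor b.
not_in_set = re.findall(r"[^cb]at", text)

# Groups capture parts of a match so they can be read back separately.
match = re.search(r"(Mr|Ms)\. (\w+)", "Hello Ms. Perera")
title, name = match.groups()

print(in_set)       # ['cat', 'bat']
print(not_in_set)   # ['rat', 'mat']
print(title, name)  # Ms Perera
```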

Then there are Quantifiers.

*     - zero or more
+     - one or more
?     - zero or one
{n}   - exactly n occurrences
{n,m} - range (min, max)
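
The difference between the quantifiers is easiest to see by running each one against the same list of words. This is a minimal sketch; the helper name matching is my own, and re.fullmatch is used so the whole word has to fit the pattern.

```python
import re

samples = ["ct", "cat", "caat", "caaat", "caaaat"]

def matching(pattern):
    """Return the samples that the pattern matches completely."""
    return [s for s in samples if re.fullmatch(pattern, s)]

star = matching(r"ca*t")        # zero or more a
plus = matching(r"ca+t")        # one or more a
optional = matching(r"ca?t")    # zero or one a
exact = matching(r"ca{2}t")     # exactly two a
ranged = matching(r"ca{2,3}t")  # two or three a

print(star)      # ['ct', 'cat', 'caat', 'caaat', 'caaaat']
print(plus)      # ['cat', 'caat', 'caaat', 'caaaat']
print(optional)  # ['ct', 'cat']
print(exact)     # ['caat']
print(ranged)    # ['caat', 'caaat']
```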

Let's go through some examples:

First, I have used the following regex to validate an email address. At that time, I just googled it and took it from Stack Overflow.

^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}$

Let's break it down and see what it means.

[a-z0-9._%+-] is a character set.

^[a-z0-9._%+-] The match must start with this character set at the beginning of the line.

[a-z0-9._%+-]+ A character from this set must occur one or more times.

a-z0-9._%+- : The set allows any lowercase letter, any digit, or any of these characters: ., _, %, +, -.

@ The regex matches this sign immediately after the run of characters matched by the set above.

Then a similar character set, [a-z0-9.-], follows the @. It can occur one or more times and is followed by a literal dot (\.). Finally we have

[a-z]{2,4}$ : $ means this part must occur at the end of the line.

[a-z]{2,4} : the address must end with at least two and at most four lowercase letters.

I have placed some samples for validating this regex.
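
The same kind of check can be sketched in Python. The addresses below are made-up examples, not real ones, and the regex is used exactly as written above, so it accepts lowercase letters only.

```python
import re

EMAIL_RE = re.compile(r"^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}$")

# Hypothetical addresses chosen to exercise each part of the pattern.
candidates = [
    "john.doe@example.com",       # plain address
    "user+tag@mail.example.org",  # + in the local part, subdomain
    "no-at-sign.example.com",     # missing @
    "user@example",               # missing the final .xx part
    "User@Example.com",           # uppercase is outside [a-z]
    "user@example.museum",        # last part is longer than 4 letters
]

results = {addr: bool(EMAIL_RE.search(addr)) for addr in candidates}

for addr, ok in results.items():
    print(f"{addr:28} {'valid' if ok else 'invalid'}")
```

Only the first two come out as valid. If uppercase addresses should pass as well, compiling the pattern with re.IGNORECASE is one option.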

And for my original JSON object validation, I came up with the following regex.

\{("\w+":"\w+"[,]?)+[}]$

This pattern can occur anywhere in the file, but because of [}]$ it must have a } at the end of the line. Here \ is used as the escape character for the { at the beginning of the regex.

("\w+":"\w+"[,]?)+ This is a group that can occur one or more times. It contains \w+ (word characters) in double quotes, followed by :, and then another quoted run of word characters.

[,]? A comma can occur zero or one time at the end of this group. Sample for this regex.
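
As a rough sketch of how this behaves, here is the same regex applied line by line in Python. The lines are invented for illustration and are not the real data; the helper name extract is my own.

```python
import re

JSON_LINE_RE = re.compile(r'\{("\w+":"\w+"[,]?)+[}]$')

# Hypothetical lines, standing in for rows of the CSV.
lines = [
    'id42,{"name":"nifi","type":"csv"}',  # complete object at end of line
    '{"name":"nifi",',                    # broken: no closing }
    '{"name":"Apache NiFi"}',             # the space is not matched by \w
]

def extract(line):
    """Return the matched object text, or None when the line does not fit."""
    match = JSON_LINE_RE.search(line)
    return match.group(0) if match else None

extracted = [extract(line) for line in lines]
print(extracted)  # ['{"name":"nifi","type":"csv"}', None, None]
```

This only checks the shape of a flat object whose keys and values are word characters. It is not a full JSON validator, so the extracted text would still need to go through a proper JSON parser afterwards.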

These are just some basics of regex. The best way to learn is to keep practicing, and regex101 is a good place to get your hands dirty.

Thanks again Corey Schafer.
