Regex is one of the most powerful, flexible, and efficient text-processing approaches. Regex has its own terminology, conditions, and syntax; it is, in a sense, a mini programming language.
Regex can be used to add, remove, isolate, and manipulate all kinds of text and data. It can be used as a simple text editor command, e.g. search and replace, or as its own powerful text-processing language. Because of that, Regex has many applications in technology today, such as extracting useful information with web crawlers, data scraping and web scraping, data wrangling, and machine learning, namely natural language processing and speech recognition.
Regex is not specific to any one programming language; in fact, it can be used in virtually all popular programming languages today. Programming languages provide support for Regex, but all the magic and strength comes from Regex itself.
Using Regex can save the programmer precious time that would otherwise be wasted on mundane tasks, such as:
- looking for emails in a folder of files
- removing repetition from a bunch of text files
- analyzing the syntax of a specific language
- highlighting some context in a file
- and much much more
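As a taste of the first task in the list, here is a minimal Python sketch that pulls email-like strings out of a piece of text. The pattern and the `find_emails` helper are my own simplified assumptions for illustration; real email syntax is far more permissive than this pattern allows.

```python
import re

# Deliberately simplified, hypothetical pattern: it only covers the
# common "name@host.tld" shape, not the full email grammar.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def find_emails(text):
    """Return every email-like substring found in text."""
    return EMAIL_PATTERN.findall(text)

sample = "Contact alice@example.com or bob.smith@mail.example.org for details."
emails = find_emails(sample)
print(emails)
```

To scan a folder of files, the same function could be applied to the contents of each file in turn.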
[-a-z0-9]+(\[-a-z0-9]+)*
A full Regex is composed of two basic kinds of characters:
- metacharacters: the grammar of Regex
- literals: the words of the language
Types of metacharacters:
- Metacharacter Classes
- Quantifiers
- Position Metacharacters
Metacharacter classes are used to match single characters, and they all start with \ to distinguish them from literals. Here is a table of the six metacharacter classes:
METACHARACTER | NAME | WHAT IT MATCHES |
---|---|---|
\w | Word | Any word character: a-z, A-Z, digits 0-9, or the underscore |
\W | Non word | Any non-word character |
\d | Digit | Any digit between 0-9 |
\D | Non digit | Anything that is not a digit between 0-9 |
\s | Whitespace | Whitespace characters such as space, tab, and newline |
\S | Non whitespace | Non-whitespace characters |
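The six classes can be tried out with Python's built-in `re` module. This is only a small sketch; the sample text and variable names are made up for illustration.

```python
import re

text = "Room 42, Floor 7"

digits = re.findall(r"\d", text)            # every single digit
non_digit_runs = re.findall(r"\D+", text)   # runs of anything that is not a digit
words = re.findall(r"\w+", text)            # runs of word characters
non_words = re.findall(r"\W", text)         # every single non-word character
spaces = re.findall(r"\s", text)            # every whitespace character
non_space_runs = re.findall(r"\S+", text)   # runs of non-whitespace characters

print(digits, words, non_space_runs)
```

The `+` after some of the classes is a quantifier; without it, each class matches exactly one character at a time.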
Quantifiers are used to indicate the number of occurrences of a character in the pattern we are trying to match. Say we want to match both “Jessy” and “Jesy”; we would use one of the quantifiers to indicate that both options are acceptable. There are four types of quantifiers.
METACHARACTER | NAME | WHAT IT MATCHES |
---|---|---|
? | Question | The preceding character appearing zero or one time |
* | Star | The preceding character appearing zero or more times |
+ | Plus | The preceding character appearing one or more times |
{min,max} | Specific Range | The preceding character appearing between min and max times |
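The “Jessy”/“Jesy” case can be sketched in Python as follows. The candidate list and the `matching` helper are hypothetical, and `re.fullmatch` is used so that the whole name has to fit the pattern.

```python
import re

candidates = ["Jey", "Jesy", "Jessy", "Jesssy"]

def matching(pattern):
    """Return the candidates that the pattern matches in full."""
    return [name for name in candidates if re.fullmatch(pattern, name)]

optional_s = matching(r"Jess?y")     # second "s" zero or one time
any_s = matching(r"Jes*y")           # "s" zero or more times
at_least_one_s = matching(r"Jes+y")  # "s" one or more times
two_or_three_s = matching(r"Jes{2,3}y")  # "s" two to three times

print(optional_s)
```

Note that each quantifier applies only to the character right before it, here the letter s.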
Position metacharacters are used to indicate the location of the character we are looking for. Is it at the beginning of the text, at the end of the line, or at the start or end of a word? To get this specific, we use position metacharacters.
METACHARACTER | NAME | WHAT IT MATCHES |
---|---|---|
^ | Caret | A character at the start of the line |
$ | Dollar | A character at the end of the line |
\< | Upper word boundary | A character at the start of the word |
\> | Lower word boundary | A character at the end of the word |
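Here is a small Python sketch of the position metacharacters. One caveat: Python's `re` module does not support \< and \> (they come from tools such as GNU grep and Vim), so this sketch uses \b, which matches at either edge of a word. The sample text and the `positions` helper are assumptions made for illustration.

```python
import re

text = "cat concat\ncatalog cat"

def positions(pattern, flags=0):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text, flags)]

# With re.MULTILINE, ^ and $ anchor at the start and end of every line;
# without it, they anchor only at the start and end of the whole text.
line_starts = positions(r"^cat", re.MULTILINE)
line_ends = positions(r"cat$", re.MULTILINE)

word_starts = positions(r"\bcat")    # "cat" at the start of a word
word_ends = positions(r"cat\b")      # "cat" at the end of a word
whole_words = positions(r"\bcat\b")  # "cat" as a complete word

print(line_starts, line_ends, whole_words)
```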
That’s just what I call them, definitely not the official name: these are some extra metacharacters that are used to join other metacharacters and literals.
METACHARACTER | NAME | WHAT IT MATCHES |
---|---|---|
[] | Square Bracket | Any one character from a set of characters |
. | Dot | Any one character |
\| | Or Operator | One of two or more options |
() | Parentheses | Used to group part of a pattern, e.g. so that a quantifier applies to the whole group |
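The four extra metacharacters can be sketched in Python like this. The sample strings and variable names are invented for the example.

```python
import re

# [] matches any one character from the set: "a" or "e" here.
set_matches = re.findall(r"gr[ae]y", "gray grey groy")

# . matches any one character (except a newline, by default in Python).
dot_matches = re.findall(r"gr.y", "gray grey groy")

# | matches one of the alternatives on either side of it.
either = re.findall(r"cat|dog", "cat dog bird")

# () groups "ha" so that + repeats the whole group, not just the "a".
grouped = re.fullmatch(r"(ha)+", "hahaha") is not None
ungrouped = re.fullmatch(r"ha+", "hahaha") is not None

print(set_matches, dot_matches, either, grouped, ungrouped)
```

Without the parentheses, `ha+` means an h followed by one or more a characters, which is why it does not match "hahaha" in full.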
Literals are all words and characters that are not metacharacters. For example, “Automation”, “Regex”, and “Hello” are all literals.
A problem arises if I want to match one of the metacharacters themselves. For example, say I want to match the * or ^ characters; what should I do?
In this case, we use the escape character \ to tell Regex explicitly that we want to match that character. So we type \^ or \* instead of just ^ and *.
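In Python, escaping looks like the sketch below. It also uses `re.escape`, a standard-library helper that adds the backslashes for you. The sample text and variable names are assumptions for illustration.

```python
import re

text = "2^3 * 4"

carets = re.findall(r"\^", text)  # a literal ^, not "start of line"
stars = re.findall(r"\*", text)   # a literal *, not "zero or more"

# re.escape puts a backslash in front of every special character.
escaped = re.escape("^*")

# An unescaped * on its own is not a valid pattern: there is nothing to repeat.
try:
    re.compile("*")
    bare_star_is_error = False
except re.error:
    bare_star_is_error = True

print(carets, stars, escaped, bare_star_is_error)
```

Raw strings (the `r"..."` prefix) keep Python itself from interpreting the backslashes before the pattern reaches the regex engine.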
- Distro Enrollment
/(distroEnrollment Question )(\d{1,2})/g
distroEnrollment Question 2",
distroEnrollment Question 2
distroEnrollment Question 3
distroEnrollment Question 31
- This regex validates German vehicle registration numbers. It includes 'H' for oldtimers (historic) and 'E' for electric vehicles. Furthermore, it validates optional seasonal plates, for example for motorcycles or recreational vehicles.
/^([A-ZäÄÖÜ]{1,3})\-[ ]{0,1}([A-Z]{0,2})[ ]{0,1}([0-9]{1,4}[HE]{0,1})[ ]{0,1}([0-9]{0,2})[ ]{0,1}([0-9]{0,2})$/gm
ABC-DE 1234
ABC-DE 1234H
ABC-DE 1234E
ABC-DE 1234 04 10
- Retails CSV date-wise validator
/Retails_OB_(?<YYYY>\d{4})(?<MM>\d{2})(?<DD>\d{2})\.csv/gm
Retails OB.csv
Retails_OB_20200717.csv
Retails_OB_20200723.csv
Retails_OB_20200804.csv
Retails_OB_20200814.csv
Retails_OB_20200821.csv
Retails_OB_20200825.csv
Retails_OB_20200902.csv
Retails_OB_20200910.csv
Retails_OB_20200917.csv
Retails_OB_20200924.csv
Retails_OB_20200929.csv
Retails_OB_20201006.csv
Retails_OB_20201013.csv
Retails_OB_20201021.csv
Retails_OB_20201028.csv
Retails_OB_20201119.csv
Retails_OB_20201125.csv
Retails_OB_20201214.csv
- Extract image path from thumbnail
/(?:image:\/\/)(?<Thumnail>.+)\/transform\?size=thumb$/gm
image://%2fhome%2fkeana%2fPictures%2fWallpapers%2fpawel-czerwinski-6lQDFGOB1iw-unsplash.jpg/transform?size=thumb
- Validate a URI scheme (although this is not recommended) and separate the URI syntax into multiple groups.
/(?:(?<Protocol>https?):\/\/)?(?:(?<Subdomain>[\w\.]+)?\.)?(?<Hostname>\w+)\.(?<Domain>\w+)\:?(?<Port>\d+)?(?<Path>\/.*)?/g
https://regex101.com:8000/api?test
- Nums in filename
/(?<=_)(\d+)(?=\.jpg)/gm
tesT_3_1312.jpg
test3_v_32.jpg
v_32.jpg
wow_123_1234.jpg
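The patterns in this list are written in the /pattern/flags style. As a rough sketch, two of them could be ported to Python like this. Python's `re` module spells named groups as (?P<name>...) rather than (?<name>...), and the dot before csv is escaped here so that it only matches a literal dot. The helper names `retails_date` and `trailing_number` are hypothetical.

```python
import re
from datetime import date

# Named groups use Python's (?P<name>...) spelling.
RETAILS = re.compile(r"Retails_OB_(?P<YYYY>\d{4})(?P<MM>\d{2})(?P<DD>\d{2})\.csv")

# Lookbehind (?<=_) and lookahead (?=\.jpg) check the surroundings
# without including them in the match.
TRAILING_NUMBER = re.compile(r"(?<=_)(\d+)(?=\.jpg)")

def retails_date(filename):
    """Return the date encoded in a Retails file name, or None if it does not fit."""
    m = RETAILS.fullmatch(filename)
    if m is None:
        return None
    try:
        return date(int(m["YYYY"]), int(m["MM"]), int(m["DD"]))
    except ValueError:
        # Eight digits that do not form a real calendar date.
        return None

def trailing_number(filename):
    """Return the digits between the last underscore and ".jpg", or None."""
    m = TRAILING_NUMBER.search(filename)
    return m.group(1) if m else None

print(retails_date("Retails_OB_20200717.csv"))
print(trailing_number("wow_123_1234.jpg"))
```

Converting the captured groups to a `date` is an extra step on top of the original pattern; it rejects file names whose eight digits are not a valid calendar date.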