@kiranparajuli589
Created February 4, 2021 07:36
REGEX TRAINING 20210204

Regular Expression (Regex)

Regex is one of the most powerful, flexible, and efficient approaches to text processing. It has its own terminology, conditions, and syntax; it is, in a sense, a mini programming language.

Regex can be used to add, remove, isolate, and manipulate all kinds of text and data. It can be used as a simple text editor command, e.g. search and replace, or as its own powerful text-processing language. Because of that, Regex has many applications in technology today, such as extracting useful information with web crawlers, data scraping and web scraping, data wrangling, and machine learning, namely natural language processing and speech recognition.

Regex is not specific to one programming language; in fact, it can be used in nearly all programming languages today. Programming languages provide support for Regex, but all the magic and strength come from Regex itself.

Using Regex can save the programmer precious time that would otherwise be wasted on mundane tasks, such as:

  • looking for emails in a folder of files
  • removing repetition from a bunch of text files
  • analyzing the syntax of a specific language
  • highlighting some context in a file
  • and much much more
[-a-z0-9]+(\[-a-z0-9]+)*

Language Analogy

A full regex is composed of two basic kinds of characters:

  • metacharacters

    Grammar of Regex

  • literals

    Words of the language

Metacharacters

Types:

  • Metacharacters Class
  • Quantifiers
  • Position Metacharacters

Metacharacters Class

These metacharacters each match a single character, and they all start with \ to distinguish them from literals. Here is a table of the six metacharacter classes:

METACHARACTER   NAME             WHAT IT MATCHES
\w              Word             Any word character: a-z, A-Z, digits 0-9, or underscore
\W              Non word         Any non-word character
\d              Digit            Any digit between 0-9
\D              Non digit        Anything that is not a digit between 0-9
\s              Whitespace       Whitespace characters: space, tab, newline
\S              Non whitespace   Non-whitespace characters
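
As a rough illustration of the table above, here is a minimal Python sketch using the standard `re` module. The sample strings are made up for this example. Note that in Python, `\w` and `\d` on ordinary strings also match non-ASCII letters and digits unless the `re.ASCII` flag is set.

```python
import re

# \d matches a single digit; findall returns every non-overlapping match.
digits = re.findall(r"\d", "Room 42")

# \w matches letters, digits and the underscore.
word_chars = re.findall(r"\w", "a_1 !")

# \s matches whitespace such as space, tab and newline.
whitespace = re.findall(r"\s", "a b\tc\n")

# The uppercase forms match everything the lowercase forms do not.
non_digits = re.findall(r"\D", "a1b2")
non_word = re.findall(r"\W", "a_1 !")
non_space = re.findall(r"\S", "a b")

print(digits, word_chars, whitespace, non_digits, non_word, non_space)
```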

Quantifiers

These metacharacters indicate how many times a character may occur in the pattern we are trying to match. Say we want to match both “Jessy” and “Jesy”; we would use one of the quantifiers to indicate that both options are acceptable. There are four types of quantifiers.

METACHARACTER   NAME             WHAT IT MATCHES
?               Question         Characters appearing zero or one time only
*               Star             Characters appearing zero or more times
+               Plus             Characters appearing one or more times
{min,max}       Specific Range   Characters appearing a number of times within a range
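
The “Jessy” and “Jesy” case can be sketched in Python as follows. This is only an illustration with the standard `re` module; the extra test strings are invented for the example.

```python
import re

# "s?" makes the second "s" optional, so both spellings match.
name = re.compile(r"Jess?y")
name_results = [bool(name.fullmatch(s)) for s in ("Jessy", "Jesy", "Jesssy")]

# "*" allows zero or more repetitions, "+" requires at least one.
star_results = [bool(re.fullmatch(r"ab*", s)) for s in ("a", "ab", "abbb")]
plus_results = [bool(re.fullmatch(r"ab+", s)) for s in ("a", "ab", "abbb")]

# "{1,2}" allows one or two repetitions of the preceding item.
range_results = [bool(re.fullmatch(r"\d{1,2}", s)) for s in ("", "3", "31", "312")]

print(name_results, star_results, plus_results, range_results)
```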

Position Metacharacters

Position metacharacters are used to indicate the location of the character we are looking for. Is it at the beginning of the text, at the end of the line, or at the edge of a word? To get this specific, we use position metacharacters.

METACHARACTER   NAME                  WHAT IT MATCHES
^               Caret                 The start of the line
$               Dollar                The end of the line
\<              Upper word boundary   The start of a word
\>              Lower word boundary   The end of a word

Note: \< and \> are supported by tools such as GNU grep and Vim; many other engines use \b for both edges of a word.
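
A small Python sketch of the position metacharacters follows. Python's `re` module has no `\<` or `\>` anchors, so this example uses `\b` on the left or right side of the word instead; the sample text is made up.

```python
import re

text = "cat concat\ncatalog cat"

# Without re.MULTILINE, ^ only matches at the very start of the string.
string_start = re.findall(r"^cat", text)

# With re.MULTILINE, ^ and $ match at the start and end of every line.
line_starts = re.findall(r"^cat", text, re.MULTILINE)
line_ends = re.findall(r"cat$", text, re.MULTILINE)

# \b before the word anchors its start, \b after the word anchors its end.
word_starts = [m.start() for m in re.finditer(r"\bcat", text)]
word_ends = [m.start() for m in re.finditer(r"cat\b", text)]
whole_words = [m.start() for m in re.finditer(r"\bcat\b", text)]

print(string_start, line_starts, line_ends)
print(word_starts, word_ends, whole_words)
```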

Meta-extras

That’s just what I call them (definitely not the official name). These are some extra metacharacters that are used to join other metacharacters and literals.

METACHARACTER   NAME             WHAT IT MATCHES
[]              Square Bracket   Any one character from a set of characters
.               Dot              Any one character
|               Or Operator      One of two or more options
()              Parentheses      Groups part of a pattern, e.g. so a quantifier applies to the whole group
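
The four meta-extras can be tried out with a short Python sketch like the one below. The words used are arbitrary examples, not taken from the training material.

```python
import re

# [] matches one character out of a set: "gray" or "grey", but not "groy".
set_matches = re.findall(r"gr[ae]y", "gray grey groy")

# . matches any single character except a newline (by default).
dot_matches = re.findall(r"c.t", "cat cot c\nt cut")

# | chooses between alternatives; () limits how far the choice extends.
alt_matches = [m.group(0) for m in re.finditer(r"(cat|dog)s", "cats dogs birds")]

# () also lets a quantifier apply to a whole group instead of one character.
group_repeats = re.fullmatch(r"(ab)+", "ababab") is not None

print(set_matches, dot_matches, alt_matches, group_repeats)
```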

Literals

Literals are all words and characters that are not metacharacters. For example,

“Automation”, “Regex”, “Hello” all these are literals.

Problem

A problem arises if I want to match one of the metacharacters. For example, say I want to match the * and ^ characters; what should I do?

In this case, we use the escape character \ to tell Regex explicitly that we want to match that character. So we type \^ or \* instead of just ^ and *.
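
Here is a minimal Python sketch of escaping. It also shows `re.escape` from the standard library, which builds the escaped form of an arbitrary string; the sample text is invented for the example.

```python
import re

text = "2^10 * 3"

# Escaped with a backslash, ^ and * are treated as ordinary characters.
carets = re.findall(r"\^", text)
stars = re.findall(r"\*", text)

# re.escape adds the backslashes for us when the text to find is not known in advance.
needle = "^*"
escaped = re.escape(needle)
found = re.search(escaped, "start ^* end")

print(carets, stars, escaped, found.group(0))
```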

EXAMPLES:

  • Distro Enrollment

    /(distroEnrollment Question )(\d{1,2})/g

    distroEnrollment Question 2",
    distroEnrollment Question 2
    distroEnrollment Question 3
    distroEnrollment Question 31
  • This regex validates German vehicle registration numbers. It includes 'H' for Oldtimers (Historic) and 'E' for electric vehicles. Furthermore, it validates optional seasonal plates, for example for motorcycles or recreational vehicles.

    /^([A-ZäÄÖÜ]{1,3})\-[ ]{0,1}([A-Z]{0,2})[ ]{0,1}([0-9]{1,4}[HE]{0,1})[ ]{0,1}([0-9]{0,2})[ ]{0,1}([0-9]{0,2})$/gm
    
    ABC-DE 1234
    ABC-DE 1234H
    ABC-DE 1234E
    ABC-DE 1234 04 10
    
  • Retails CSV date-wise validator

    /Retails_OB_(?<YYYY>\d{4})(?<MM>\d{2})(?<DD>\d{2})\.csv/gm
    
    Retails OB.csv
    Retails_OB_20200717.csv
    Retails_OB_20200723.csv
    Retails_OB_20200804.csv
    Retails_OB_20200814.csv
    Retails_OB_20200821.csv
    Retails_OB_20200825.csv
    Retails_OB_20200902.csv
    Retails_OB_20200910.csv
    Retails_OB_20200917.csv
    Retails_OB_20200924.csv
    Retails_OB_20200929.csv
    Retails_OB_20201006.csv
    Retails_OB_20201013.csv
    Retails_OB_20201021.csv
    Retails_OB_20201028.csv
    Retails_OB_20201119.csv
    Retails_OB_20201125.csv
    Retails_OB_20201214.csv
    
  • Extract image path from thumbnail

    /(?:image:\/\/)(?<Thumnail>.+)\/transform\?size=thumb$/gm
    
    image://%2fhome%2fkeana%2fPictures%2fWallpapers%2fpawel-czerwinski-6lQDFGOB1iw-unsplash.jpg/transform?size=thumb
    
  • Validate the URI scheme (although this is not recommended) and separate the parts of the URI into multiple named groups.

    /(?:(?<Protocol>https?):\/\/)?(?:(?<Subdomain>[\w\.]+)?\.)?(?<Hostname>\w+)\.(?<Domain>\w+)\:?(?<Port>\d+)?(?<Path>\/.*)?/g
    
    https://regex101.com:8000/api?test
    
  • Numbers in filename

    /(?<=_)(\d+)(?=\.jpg)/gm
    
    tesT_3_1312.jpg
    test3_v_32.jpg
    v_32.jpg
    wow_123_1234.jpg
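
The patterns above are written in the /pattern/flags style used by tools such as regex101. As a hedged sketch, here is how two of them (the Retails file name pattern and the number-before-.jpg pattern) might be used from Python. Python spells named groups `(?P<name>...)`, and the dot before `csv` is escaped here. The helper names `retails_date` and `trailing_number` are hypothetical and not part of the original examples.

```python
import re
from datetime import date

# Named groups capture the year, month and day parts of the file name.
RETAILS = re.compile(r"Retails_OB_(?P<YYYY>\d{4})(?P<MM>\d{2})(?P<DD>\d{2})\.csv")

def retails_date(filename):
    """Return the date encoded in the file name, or None if it does not fit."""
    m = RETAILS.fullmatch(filename)
    if m is None:
        return None
    try:
        return date(int(m["YYYY"]), int(m["MM"]), int(m["DD"]))
    except ValueError:
        # The regex only checks digit counts, so month 13 would get this far.
        return None

# Lookbehind (?<=_) and lookahead (?=\.jpg) check the context without consuming it.
NUMS = re.compile(r"(?<=_)(\d+)(?=\.jpg)")

def trailing_number(filename):
    """Return the number between the last underscore and ".jpg", or None."""
    m = NUMS.search(filename)
    return int(m.group(1)) if m else None

print(retails_date("Retails_OB_20200717.csv"))
print(trailing_number("wow_123_1234.jpg"))
```

Converting the captured digits with `date(...)` is a design choice for this sketch: the pattern alone accepts any eight digits, so the calendar check is left to the standard library.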
    