Skip to content

Instantly share code, notes, and snippets.

@hayderimran7
Created April 2, 2023 18:49
Show Gist options
  • Save hayderimran7/0694e078ed531d74da98a5caac703d49 to your computer and use it in GitHub Desktop.
Save hayderimran7/0694e078ed531d74da98a5caac703d49 to your computer and use it in GitHub Desktop.
python regexes

python regex

chap1 : intro to regex

  • regex: Regular expressions are text patterns that define the form a text string should have.

    • useful for email checking patern
    • matching word "color" and "colour"
    • extra specific info like postal code LOL: "Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems."
  • How regex started (birth of grep) Ken Thompson's work didn't end in just writing a paper. He included support for these regular expressions in his version of QED. To search with a regular expression in QED, the following had to be written: g//p In the preceding line of code, g means global search and p means print. If, instead of writing regular expression, we write the short form re, we get g/re/p, and therefore, the beginnings of the venerable UNIX command-line tool grep

?: match single char (file?.xml matches file1.xml and file9.xml but not file99.xml)

  • : match any numder of char

in file?.xml: literals -> file and xml metacharacters -> ? (or '*' )

OUR FIRST REGEX

/a\w*/ ==> matches any word starting with 'a'

Escaping Metacharacters

Metachars can coexist but what if need to use metachar as luterals? 3 ways to do it:

  • escape the metachar by preceding with a backlash
  • in python , use "re.escape"
  • Quoting with \Q and \E: (not supported in Python)

There are 12 metachar that should be escaped when needed to use as char: \ backslash ^ Caret $ Dollar Sign . Dot | Pipe Symbol ? Question

  • Asterik
  • Plus sign ( ) [ {

Character class

Character classes allow us to define a char that will match if any of defined char on set is present

for example to match "license" and "licene" --> /licen[sc]e/ we can use range of chars [b-e] or num [2-9] Ranges can be combined : [0-9a-zA-z]

  • Negation of ranges [^0-9] match anything not a number but there has to be a char e.g. /hello[^0-9]/ wont match hello as there no char in its place

Predefined char class

| Element | Description | . | matches any char except newline | \d | matches any decimal , equivalent to [0-9] | \D | matches any non-digit , eq to [^0-9] | \s | matches any whitespace class: eq to [ \t\n\r\f\v ] | \S | matches non-whitespace , eq to [ ^ \t\n\r\f\v ] | \w | matches any alphanumeric eq to [0-9a-zA-Z_]

[^/] -> matches any char thats not a backslash or slash

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment