@pedrohenriqueromio
Created November 8, 2022 13:04
Common regexes used to extract data from HTML
Get Hexadecimal color code
#([a-fA-F0-9]{3,6})
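A minimal Python sketch of pulling hex colors out of a stylesheet. It assumes a slightly stricter variant of the pattern (exactly 3 or 6 hex digits, followed by a word boundary) and writes the quantifier without an inner space, since Python's `re` treats `{3, 6}` as literal text. The helper name `find_hex_colors` and the sample CSS are hypothetical.

```python
import re

# Assumption: accept only 3- or 6-digit codes; try the 6-digit form first.
HEX_COLOR = re.compile(r"#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\b")


def find_hex_colors(css: str) -> list:
    """Return every hex color code found in the given text."""
    return HEX_COLOR.findall(css)


sample = "a { color: #fff; background: #1A2b3C; border-color: #12; }"
print(find_hex_colors(sample))
```

The two-digit `#12` is skipped because it is neither a 3- nor a 6-digit code.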
Validate email address
/[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}/igm
IPv4 address
/\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b/
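As a rough illustration, the IPv4 pattern can be used to scan free text such as a log line. This is only a sketch: the helper name `find_ipv4` and the sample line are made up for the example.

```python
import re

# Same pattern as above, split over two string literals for readability.
IPV4 = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
    r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b"
)


def find_ipv4(text: str) -> list:
    """Return all dotted-quad addresses whose octets are in 0-255."""
    return IPV4.findall(text)


log = "ok 192.168.0.1, bad 256.1.1.1, edge 10.0.0.255"
print(find_ipv4(log))
```

The out-of-range `256.1.1.1` is rejected because no alternative in the octet group can match `256`.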
IPv6 address
(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))
Thousands separator
/\d{1,3}(?=(\d{3})+(?!\d))/g
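The thousands-separator pattern only finds the digit groups that need a comma after them; the comma itself is added in the replacement step. A small Python sketch, assuming the input is a plain integer string (the helper name `add_thousands_separators` is hypothetical):

```python
import re

# Match 1-3 digits that are followed by one or more complete groups of three.
THOUSANDS = re.compile(r"\d{1,3}(?=(\d{3})+(?!\d))")


def add_thousands_separators(digits: str) -> str:
    """Insert a comma after each matched group, e.g. 1234567 -> 1,234,567."""
    return THOUSANDS.sub(r"\g<0>,", digits)


print(add_thousands_separators("1234567"))
print(add_thousands_separators("999"))
```

Note that digits after a decimal point would be grouped as well, so the fractional part should be split off before applying this.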
Get domain from url
/https?:\/\/(?:[-\w]+\.)?([-\w]+)\.\w+(?:\.\w+)?\/?.*/i
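A short Python sketch of how the domain pattern behaves; the first capture group holds the name before the top-level domain. The helper name `extract_domain` and the example URLs are assumptions made for illustration.

```python
import re

DOMAIN = re.compile(
    r"https?:\/\/(?:[-\w]+\.)?([-\w]+)\.\w+(?:\.\w+)?\/?.*",
    re.IGNORECASE,
)


def extract_domain(url: str):
    """Return the captured domain label, or None if the URL does not match."""
    match = DOMAIN.search(url)
    return match.group(1) if match else None


print(extract_domain("https://www.example.com/path"))
print(extract_domain("http://example.org"))
print(extract_domain("https://example.co.uk"))
```

One limitation worth knowing: for a URL with no subdomain and a two-part suffix such as `example.co.uk`, the optional subdomain group consumes `example.` and the capture ends up being `co`.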
Sort keywords by word count
^[^\s]*$ matches exactly 1-word keyword
^[^\s]*\s[^\s]*$ matches exactly 2-word keyword
^[^\s]*\s[^\s]* matches keywords of at least 2 words (2 and more)
^([^\s]*\s){2}[^\s]*$ matches exactly 3-word keyword
^([^\s]*\s){4,}[^\s]*$ matches 5-words-and-more keywords (longtail)
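The word-count patterns can be combined into a simple bucketing step. The sketch below assumes one whitespace character between words and uses `{4,}` for the long-tail bucket so that keywords of five or more words match; the bucket labels and the function name `bucket` are hypothetical.

```python
import re

# Checked in order; the first pattern that matches decides the bucket.
BUCKETS = [
    ("1 word", re.compile(r"^[^\s]*$")),
    ("2 words", re.compile(r"^[^\s]*\s[^\s]*$")),
    ("3 words", re.compile(r"^([^\s]*\s){2}[^\s]*$")),
    ("5+ words", re.compile(r"^([^\s]*\s){4,}[^\s]*$")),
]


def bucket(keyword: str) -> str:
    """Return the label of the first matching pattern, or 'other'."""
    for label, pattern in BUCKETS:
        if pattern.match(keyword):
            return label
    return "other"


for kw in ["shoes", "red shoes", "cheap red shoes",
           "buy cheap red running shoes online"]:
    print(kw, "->", bucket(kw))
```

A four-word keyword falls into `other` here because the list above has no four-word pattern.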
Valid phone number
^\+?\d{1,3}?[- .]?\(?(?:\d{2,3})\)?[- .]?\d\d\d[- .]?\d\d\d\d$
Leading & trailing whitespace
^\s+|\s+$
Get img src
<\s*img\b[^>]*\bsrc\s*=\s*["']?([^"'\s>]*)
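A hedged Python sketch of extracting image sources. It assumes the tag and attribute names are written as literal text (`img`, `src`) rather than as bracketed character classes, since `[img]` would match a single letter instead of the word. The helper name `find_img_sources` and the sample markup are hypothetical.

```python
import re

# Literal "img" and "src"; the quote around the value is optional.
IMG_SRC = re.compile(r"""<\s*img\b[^>]*\bsrc\s*=\s*["']?([^"'\s>]*)""")


def find_img_sources(html: str) -> list:
    """Return the src value of every <img> tag in the text."""
    return IMG_SRC.findall(html)


html = (
    '<img alt="a" src="https://example.com/a.jpg">'
    "<p>text</p>"
    "<img src='b.png' />"
)
print(find_img_sources(html))
```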
Validate date in DD/MM/YYYY format
^(?:(?:31(\/|-|\.)(?:0?[13578]|1[02]))\1|(?:(?:29|30)(\/|-|\.)(?:0?[13-9]|1[0-2])\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:29(\/|-|\.)0?2\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\d|2[0-8])(\/|-|\.)(?:(?:0?[1-9])|(?:1[0-2]))\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
Valid ISBN
/\b(?:ISBN(?:: ?| ))?((?:97[89])?\d{9}[\dx])\b/i
Check zip code
^\d{5}(?:[-\s]\d{4})?$
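For illustration, the zip code check might be wrapped in a small validator like the one below. This is a sketch only; the function name `is_us_zip` is hypothetical.

```python
import re

# Five digits, optionally followed by a hyphen or whitespace and four digits.
ZIP = re.compile(r"^\d{5}(?:[-\s]\d{4})?$")


def is_us_zip(value: str) -> bool:
    """True for 5-digit codes and 5+4 codes such as 12345-6789."""
    return ZIP.match(value) is not None


for candidate in ["12345", "12345-6789", "1234", "123456"]:
    print(candidate, is_us_zip(candidate))
```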
Valid twitter username
/@([A-Za-z0-9_]{1,15})/
Credit card numbers
^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|6(?:011|5[0-9][0-9])[0-9]{12}|3[47][0-9]{13}|3(?:0[0-5]|[68][0-9])[0-9]{11}|(?:2131|1800|35\d{3})\d{11})$
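The credit card pattern expects digits only, so separators have to be removed first. A minimal Python sketch, assuming spaces and hyphens are the only separators; the function name `looks_like_card_number` is hypothetical and the numbers shown are well-known test values, not real cards.

```python
import re

CARD = re.compile(
    r"^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}"
    r"|6(?:011|5[0-9][0-9])[0-9]{12}|3[47][0-9]{13}"
    r"|3(?:0[0-5]|[68][0-9])[0-9]{11}|(?:2131|1800|35\d{3})\d{11})$"
)


def looks_like_card_number(raw: str) -> bool:
    """Strip spaces and hyphens, then check prefix and length by pattern."""
    digits = re.sub(r"[ -]", "", raw)
    return CARD.match(digits) is not None


print(looks_like_card_number("4111 1111 1111 1111"))
print(looks_like_card_number("1234567890123456"))
```

The pattern checks only the issuer prefix and the length; it does not verify the Luhn checksum.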
Find css attributes
^\s*[a-zA-Z\-]+\s*[:]{1}\s[a-zA-Z0-9\s.#]+[;]{1}
Strip html comments
<!--(.*?)-->
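A short Python sketch of stripping comments with this pattern. It assumes comments may span several lines, so the `re.DOTALL` flag is set to let `.` match newlines; the function name `strip_comments` is hypothetical.

```python
import re

# DOTALL lets the lazy (.*?) run across line breaks inside a comment.
COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)


def strip_comments(html: str) -> str:
    """Remove every <!-- ... --> block from the text."""
    return COMMENT.sub("", html)


html = "<p>a</p><!-- one -->\n<!-- two\nlines --><p>b</p>"
print(strip_comments(html))
```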
Facebook profile url
/(?:http:\/\/)?(?:www\.)?facebook\.com\/(?:(?:\w)*#!\/)?(?:pages\/)?(?:[\w\-]*\/)*([\w\-]*)/
Check IE version
^.*MSIE [5-8](?:\.[0-9]+)?(?!.*Trident\/[5-9]\.0).*$
Extract price
/(\$[0-9,]+(\.[0-9]{2})?)/
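As a rough example, the price pattern can be paired with a conversion step to get numeric values. The helper name `extract_prices` and the sample sentence are hypothetical.

```python
import re

PRICE = re.compile(r"(\$[0-9,]+(\.[0-9]{2})?)")


def extract_prices(text: str) -> list:
    """Return each matched price as a float, dropping '$' and commas."""
    prices = []
    for match in PRICE.finditer(text):
        raw = match.group(1)
        prices.append(float(raw.replace("$", "").replace(",", "")))
    return prices


print(extract_prices("Was $1,299.00 now $999 with coupon"))
```

Because the character class `[0-9,]` includes the comma, a price directly followed by a comma (as in `$999, today only`) would capture that trailing comma too.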
Parse email header
/\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}\b/i
Match a particular filetype (example: common image extensions)
/\.(?:jpe?g|png|gif)$/i
Match a url string
/[-a-zA-Z0-9@:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?/gi
Append rel="nofollow" to links
(<a\s*(?!.*\brel=)[^>]*)(href="https?://)((?!(?:(?:www\.)?'.implode('|(?:www\.)?', $follow_list).'))[^"]+)"((?!.*\brel=)[^>]*)(?:[^>]*)>
Media query match
/@media([^{]+)\{([\s\S]+?})\s*}/g
Match empty paragraph tags
<p>(\s|&nbsp;|<\/?\s?br\s?\/?>)*<\/?p>
Extract Heading(h1) tag
<h1>This is a heading</h1>
<h1>([^<]+)</h1>
Extract Hyperlink from (A) tag
<a href="https://www.datascraping.co">Typical Website Link</a>
<a href="([^"]+)">Typical Website Link</a>
Extract Hyperlink and Anchor Text from (A) tag
<a href="https://www.datascraping.co">Typical Website Link</a>
<a href="([^"]+)">([^<]+)</a>
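With two capture groups, each match yields a (URL, anchor text) pair. A minimal Python sketch using the sample link above; the function name `extract_links` is hypothetical.

```python
import re

LINK = re.compile(r'<a href="([^"]+)">([^<]+)</a>')


def extract_links(html: str) -> list:
    """Return (href, anchor text) tuples for each matching <a> tag."""
    return LINK.findall(html)


html = '<a href="https://www.datascraping.co">Typical Website Link</a>'
print(extract_links(html))
```

The pattern matches only anchors whose sole attribute is `href`; tags with extra attributes such as `class` or `target` would need `[^>]*` around it.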
Extract Image alt text and source from (IMG) tag
<img alt="screen scraping" src="https://cdn.datascraping.co/images/create-a-web-scraping-agent.jpg">
<img alt="([^"]+)" src="([^"]+)"\s*\/?>
Extract data attribute and price from (DIV) tag
<div data-id="17839" data-availability="InStock">USD 129.00</div>
<div data-id="(\d+)" data-availability="(\w+)">USD\s*([^<]+)<\/div>
Extract text from (STRONG) tag
<strong>My Bold text</strong>
<strong>([^<]+)</strong>
Extract text from (span) tag with some CSS class
<span class="some-css-class">My Favorite Data</span>
<span class="some-css-class">([^<]+)</span>
Extract META description content value
<meta name="description" content="the SEO description of web page in heading section" />
<meta name="description" content="([^"]+)" />
Scrape only the 1st 3 values in a table
<tr>\s*<td>([^<]+)</td>\s*<td>([^<]+)</td>\s*<td>([^<]+)</td>\s*</tr>
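A small Python sketch of the table pattern in use. The table contents and the function name `extract_rows` are made up for illustration.

```python
import re

ROW = re.compile(
    r"<tr>\s*<td>([^<]+)</td>\s*<td>([^<]+)</td>\s*<td>([^<]+)</td>\s*</tr>"
)


def extract_rows(html: str) -> list:
    """Return a 3-tuple of cell texts for each matching table row."""
    return ROW.findall(html)


table = """
<table>
<tr>
  <td>Alpha</td>
  <td>10</td>
  <td>USD</td>
</tr>
<tr><td>Beta</td><td>20</td><td>EUR</td></tr>
</table>
"""
print(extract_rows(table))
```

As written, the trailing `\s*</tr>` means a row matches only when it has exactly three plain `<td>` cells; dropping that part would capture the first three cells of longer rows.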
Extract paragraphs within a div
<div class=description>
<p> paragraph 1 </p>
<p> paragraph 2 </p>
<p> paragraph 3 </p>
</div>
<div class=description>([\s\S]*?)</div>
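One way to get at the individual paragraphs is a two-step extraction: capture the div body, then pull each `<p>` out of it. The sketch below assumes the div is closed with `</div>` and spans several lines, so it uses `[\s\S]*?` because `.` does not cross newlines by default. The inner paragraph pattern and the function name `paragraphs_in_description` are assumptions, not part of the list above.

```python
import re

# Step 1: body of the description div (may contain line breaks).
DIV = re.compile(r"<div class=description>([\s\S]*?)</div>")
# Step 2 (assumed helper pattern): text of each paragraph, trimmed.
PARA = re.compile(r"<p>\s*([\s\S]*?)\s*</p>")


def paragraphs_in_description(html: str) -> list:
    """Return the paragraph texts inside the first description div."""
    match = DIV.search(html)
    if match is None:
        return []
    return PARA.findall(match.group(1))


html = """<div class=description>
<p> paragraph 1 </p>
<p> paragraph 2 </p>
<p> paragraph 3 </p>
</div>"""
print(paragraphs_in_description(html))
```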
Ebates ----------------------------------------------------
Store Names: <a [^>]*>([^<]+)
PercentValue: <a [^>]*>\s*([\d.]+)
DollarValue: <a [^>]*>\s*\$([\d.]+)
UptoDollarValue: <a [^>]*>\s*Up\sto\s\$([\d.]+)
UptoPercentValue: <a [^>]*>\s*Up\sto\s([\d.]+)\%
NoDiscount: <a [^>]*>\s*(No Discount)
CouponsOnly: <a [^>]*>\s*(Coupons Only)
InStoreOnly: <a [^>]*>\s*(In-Store)
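Several of these patterns overlap (the plain percent pattern also fits text that starts with a number in other formats), so the order in which they are tried matters. A hedged Python sketch that tries the more specific "Up to" patterns first; the anchor markup in the examples and the function name `classify_offer` are hypothetical.

```python
import re

# Most specific patterns first, so "Up to $5" is not read as a plain value.
OFFER_PATTERNS = [
    ("UptoDollarValue", re.compile(r"<a [^>]*>\s*Up\sto\s\$([\d.]+)")),
    ("UptoPercentValue", re.compile(r"<a [^>]*>\s*Up\sto\s([\d.]+)%")),
    ("DollarValue", re.compile(r"<a [^>]*>\s*\$([\d.]+)")),
    ("PercentValue", re.compile(r"<a [^>]*>\s*([\d.]+)")),
]


def classify_offer(anchor_html: str):
    """Return (pattern name, captured value) or None if nothing matches."""
    for name, pattern in OFFER_PATTERNS:
        match = pattern.search(anchor_html)
        if match:
            return name, match.group(1)
    return None


print(classify_offer('<a href="/store">Up to 6.0% Cash Back</a>'))
print(classify_offer('<a href="/store">$5.00 Cash Back</a>'))
print(classify_offer('<a href="/store">2.5% Cash Back</a>'))
```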
BeFrugal ----------------------------------------------
StoreName: <tr>\s*<td><a [^>]*>([^<]+)
PercentValue: <td class="green"[^>]*>\s*([\d.]+)
DollarValue: <td class="green"[^>]*>\s*\$([\d.]+)
UptoDollarValue: <td class="green"[^>]*>\s*Up\sto\s\$([\d.]+)
UptoPercentValue: <td class="green"[^>]*>\s*Up\sto\s([\d.]+)\%
Extrabux ---------------------------------------------
StoreName: <a [^>]*>([^<]+)
PercentValue: <a class="cashBack transferLink"[^>]*>\s*([\d.]+)
DollarValue: <a class="cashBack transferLink"[^>]*>\s*\$([\d.]+)
UptoDollarValue: <a class="cashBack transferLink"[^>]*>\s*Up\sto\s\$([\d.]+)
UptoPercentValue: <a class="cashBack transferLink"[^>]*>\s*Up\sto\s([\d.]+)\%
CashbackBin ------------------
Store Name: <h1[^>]*>([^<]+)
Vendor: title="([^"]*)
Rate: <td class="l lo" style="text-align: center;">[^>]*>([^<]*)
Bonus: <td class="l bonus_amount">[\s\S]*?<span class="card_secondary_text">([^<]*)
CBM -------------------------
Match up to the first occurrence of %
[^%]*
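A tiny Python sketch of this pattern: anchored at the start of the string, the negated class consumes everything before the first `%`. The function name `text_before_percent` and the sample strings are hypothetical.

```python
import re

BEFORE_PERCENT = re.compile(r"[^%]*")


def text_before_percent(text: str) -> str:
    """Return the leading part of the text, up to the first '%'."""
    # match() anchors at position 0, and [^%]* can match the empty string,
    # so a match object is always returned.
    return BEFORE_PERCENT.match(text).group(0)


print(text_before_percent("Up to 7.5% cash back"))
```

If the text contains no `%` at all, the whole string is returned.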
Cb ----------------------------------------------------
https://www.couponbox.com/us/stores/a
https://www.couponbox.com/us/stores/b
https://www.couponbox.com/us/stores/c
https://www.couponbox.com/us/stores/d
https://www.couponbox.com/us/stores/e
https://www.couponbox.com/us/stores/f
https://www.couponbox.com/us/stores/g
https://www.couponbox.com/us/stores/h
https://www.couponbox.com/us/stores/i
https://www.couponbox.com/us/stores/j
https://www.couponbox.com/us/stores/k
https://www.couponbox.com/us/stores/l
https://www.couponbox.com/us/stores/m
https://www.couponbox.com/us/stores/n
https://www.couponbox.com/us/stores/o
https://www.couponbox.com/us/stores/p
https://www.couponbox.com/us/stores/q
https://www.couponbox.com/us/stores/r
https://www.couponbox.com/us/stores/s
https://www.couponbox.com/us/stores/t
https://www.couponbox.com/us/stores/u
https://www.couponbox.com/us/stores/v
https://www.couponbox.com/us/stores/w
https://www.couponbox.com/us/stores/x
https://www.couponbox.com/us/stores/y
https://www.couponbox.com/us/stores/z
href="https:\/\/www\.couponbox\.com\/us\/coupons\/([^"]*)
srcset="\/\/([^"]*)
href="([^"]*)
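The a to z index pages listed above follow one URL scheme, so they can be generated rather than typed out, and the coupon link pattern can then be applied to each page's HTML. The sketch below does no network access; the HTML fragment and the slug `example-store` are invented for illustration, and the function name `coupon_slugs` is hypothetical.

```python
import re
import string

BASE = "https://www.couponbox.com/us/stores/"

# One index URL per letter, a through z.
index_urls = [BASE + letter for letter in string.ascii_lowercase]

COUPON_LINK = re.compile(
    r'href="https:\/\/www\.couponbox\.com\/us\/coupons\/([^"]*)'
)


def coupon_slugs(html: str) -> list:
    """Return the path segment after /us/coupons/ for each matching link."""
    return COUPON_LINK.findall(html)


# Hypothetical fragment standing in for a downloaded index page.
page = '<a href="https://www.couponbox.com/us/coupons/example-store">Example</a>'
print(index_urls[0], index_urls[-1])
print(coupon_slugs(page))
```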