pedrohenriqueromio/RegEx on Html

## RegEx on Html
Get Hexadecimal color code
\#([a-fA-F]|[0-9]){3,6}
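
As a rough illustration, the hex color pattern can be used from Python's `re` module like this. The helper name `find_hex_colors` and the sample CSS string are made up for the example. Note that the quantifier has to be written `{3,6}` with no space inside the braces, because Python's `re` reads a spaced form as literal text.

```python
import re

# Same character set as the pattern above; quantifier written without a space.
HEX_COLOR = re.compile(r"#([a-fA-F]|[0-9]){3,6}")

def find_hex_colors(text):
    """Return every hex color code found in the text (hypothetical helper)."""
    return [m.group(0) for m in HEX_COLOR.finditer(text)]

# Hypothetical input, not taken from any real stylesheet.
sample_css = "a { color: #fff; background: #1A2b3C; }"
colors = find_hex_colors(sample_css)
print(colors)
```

The pattern accepts any run of 3 to 6 hex digits, so 4- and 5-digit runs also match; tighten it if only the 3- and 6-digit forms are wanted.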

Validate email address
/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}/igm

IPv4 address
/\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b/
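
A minimal sketch of validating a whole string against the IPv4 pattern, assuming Python's `re` module. The function name `is_ipv4` is an illustrative choice; `fullmatch` is used so that the entire input has to be an address rather than merely contain one.

```python
import re

# The IPv4 pattern from above, split across two string literals for readability.
IPV4 = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
    r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b"
)

def is_ipv4(text):
    """True if the whole string is a dotted-quad with each octet in 0-255."""
    return IPV4.fullmatch(text) is not None

print(is_ipv4("192.168.0.1"))
print(is_ipv4("256.1.1.1"))
```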

IPv6 address
(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))

Thousands separator
/\d{1,3}(?=(\d{3})+(?!\d))/g
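
The thousands-separator pattern matches each digit group that should be followed by a separator, so a substitution that appends the separator to every match does the formatting. A small sketch in Python, where the function name and the default `,` separator are assumptions for the example:

```python
import re

# Matches 1-3 digits that are followed by one or more complete groups of 3 digits.
THOUSANDS = re.compile(r"\d{1,3}(?=(\d{3})+(?!\d))")

def add_thousands_separators(text, sep=","):
    """Append `sep` after every digit group the pattern matches."""
    return THOUSANDS.sub(lambda m: m.group(0) + sep, text)

print(add_thousands_separators("1234567"))
print(add_thousands_separators("Total: 1000 items"))
```

Digits after a decimal point are grouped too, so apply it to the integer part only when formatting decimals.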

Get domain from url
/https?:\/\/(?:[-\w]+\.)?([-\w]+)\.\w+(?:\.\w+)?\/?.*/i
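
For illustration, here is how the first capture group of the domain pattern could be read in Python. The helper `domain_label` and the example URLs are hypothetical; the `i` flag of the original literal becomes `re.IGNORECASE`.

```python
import re

# The "get domain from url" pattern from above; group 1 is the domain label.
DOMAIN = re.compile(
    r"https?:\/\/(?:[-\w]+\.)?([-\w]+)\.\w+(?:\.\w+)?\/?.*",
    re.IGNORECASE,
)

def domain_label(url):
    """Return the captured domain label, or None when the URL does not match."""
    m = DOMAIN.search(url)
    return m.group(1) if m else None

print(domain_label("https://www.example.com/path"))
print(domain_label("http://example.org"))
```

The pattern returns only the label before the TLD (for example `example`), not the full host name.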

Sort keywords by word count
^[^\s]*$      matches exactly 1-word keyword
^[^\s]*\s[^\s]*$    matches exactly 2-word keyword
^[^\s]*\s[^\s]*     matches keywords of at least 2 words (2 and more)
^([^\s]*\s){2}[^\s]*$    matches exactly 3-word keyword
^([^\s]*\s){4}[^\s]*     matches 5-words-and-more keywords (longtail)
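
A possible way to apply these patterns is to test a keyword against each one in turn and return a bucket label. This is only a sketch: the function name and the labels are invented here, and the five-or-more pattern is used without a trailing `$` so that longer keywords also match.

```python
import re

ONE_WORD = re.compile(r"^[^\s]*$")
TWO_WORDS = re.compile(r"^[^\s]*\s[^\s]*$")
THREE_WORDS = re.compile(r"^([^\s]*\s){2}[^\s]*$")
# No trailing "$": anything with at least four separators counts as 5+ words.
FIVE_PLUS = re.compile(r"^([^\s]*\s){4}[^\s]*")

def word_bucket(keyword):
    """Return a word-count label for a keyword (labels are illustrative)."""
    if ONE_WORD.search(keyword):
        return "1"
    if TWO_WORDS.search(keyword):
        return "2"
    if THREE_WORDS.search(keyword):
        return "3"
    if FIVE_PLUS.search(keyword):
        return "5+"
    return "other"

for kw in ["shoes", "red shoes", "cheap red shoes",
           "buy cheap red running shoes online"]:
    print(kw, "->", word_bucket(kw))
```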

Valid phone number
^\+?\d{1,3}?[- .]?\(?(?:\d{2,3})\)?[- .]?\d\d\d[- .]?\d\d\d\d$

Leading & trailing whitespaces
^[ \s]+|[ \s]+$

Get img src
\< *img[^\>]*src *= *[\"\']{0,1}([^\"\'\ >]*)

Validate date in DD/MM/YYYY format
^(?:(?:31(\/|-|\.)(?:0?[13578]|1[02]))\1|(?:(?:29|30)(\/|-|\.)(?:0?[13-9]|1[0-2])\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:29(\/|-|\.)0?2\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\d|2[0-8])(\/|-|\.)(?:(?:0?[1-9])|(?:1[0-2]))\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$

Valid ISBN
/\b(?:ISBN(?:: ?| ))?((?:97[89])?\d{9}[\dx])\b/i

Check zip code
^\d{5}(?:[-\s]\d{4})?$
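
A short sketch of the zip code check in Python, assuming US-style codes (five digits with an optional four-digit extension). The function name `is_zip` is illustrative.

```python
import re

ZIP_CODE = re.compile(r"^\d{5}(?:[-\s]\d{4})?$")

def is_zip(text):
    """True for 5-digit codes, optionally followed by '-' or a space and 4 digits."""
    return ZIP_CODE.match(text) is not None

print(is_zip("12345"))
print(is_zip("12345-6789"))
print(is_zip("1234"))
```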

Valid twitter username
/@([A-Za-z0-9_]{1,15})/

Credit card numbers
^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|6(?:011|5[0-9][0-9])[0-9]{12}|3[47][0-9]{13}|3(?:0[0-5]|[68][0-9])[0-9]{11}|(?:2131|1800|35\d{3})\d{11})$

Find css attributes
^\s*[a-zA-Z\-]+\s*[:]{1}\s[a-zA-Z0-9\s.#]+[;]{1}

Strip html comments
<!--(.*?)-->
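
One way to use the comment pattern for stripping is a plain substitution. In this Python sketch the `re.DOTALL` flag is an addition, not part of the pattern above: without it `.` does not cross line breaks and multi-line comments are left in place. The sample markup is invented.

```python
import re

# DOTALL lets "." match newlines so multi-line comments are removed too.
HTML_COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)

def strip_comments(html):
    """Remove every HTML comment from the given markup."""
    return HTML_COMMENT.sub("", html)

sample = "<p>a</p><!-- note -->\n<p>b</p><!--\nmulti\n--><p>c</p>"
print(strip_comments(sample))
```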

Facebook profile url
/(?:https?:\/\/)?(?:www\.)?facebook\.com\/(?:(?:\w)*#!\/)?(?:pages\/)?(?:[\w\-]*\/)*([\w\-]*)/

Check IE version
^.*MSIE [5-8](?:\.[0-9]+)?(?!.*Trident\/[5-9]\.0).*$

Extract price
/(\$[0-9,]+(\.[0-9]{2})?)/
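
As an example, the price pattern can pull every dollar amount out of a piece of text. The helper name and the sample sentence below are hypothetical.

```python
import re

# Group 1 is the whole price; group 2 is the optional cents part.
PRICE = re.compile(r"(\$[0-9,]+(\.[0-9]{2})?)")

def extract_prices(text):
    """Return all dollar amounts found in the text, in order of appearance."""
    return [m.group(1) for m in PRICE.finditer(text)]

print(extract_prices("Was $1,299.99, now $999"))
```

`finditer` with `group(1)` is used instead of `findall`, because `findall` would return a tuple per match when the pattern has more than one group.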

Parse email header
/\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}\b/i

Match a particular filetype
/\.(?:htm|html|class|js)$/i

Match a url string
/[-a-zA-Z0-9@:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?/gi

Append rel="nofollow" to links
(<a\s*(?!.*\brel=)[^>]*)(href="https?://)((?!(?:(?:www\.)?'.implode('|(?:www\.)?', $follow_list).'))[^"]+)"((?!.*\brel=)[^>]*)(?:[^>]*)>

Media query match
/@media([^{]+)\{([\s\S]+?})\s*}/g


Match empty paragraph tags
<p>(\s|&nbsp;|<\/?\s?br\s?\/?>)*<\/?p>

Extract Heading(h1) tag
<h1>This is a heading</h1>
<h1>([^<]+)</h1>

Extract Hyperlink from (A) tag
<a href="https://www.datascraping.co">Typical Website Link</a>
<a href="([^"]+)">Typical Website Link</a>

Extract Hyperlink and Anchor Text from (A) tag
<a href="https://www.datascraping.co">Typical Website Link</a>
<a href="([^"]+)">([^<]+)</a>
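
A minimal sketch of running the two-group anchor pattern over the sample tag in Python. With two capture groups, `re.findall` returns one `(href, text)` tuple per link.

```python
import re

LINK = re.compile(r'<a href="([^"]+)">([^<]+)</a>')

# The sample tag from above.
html = '<a href="https://www.datascraping.co">Typical Website Link</a>'

links = LINK.findall(html)
for href, text in links:
    print(href, "|", text)
```

The pattern expects `href` to be the only attribute; tags with extra attributes need a looser pattern such as `<a [^>]*href="([^"]+)"[^>]*>`.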

Extract Image alt text and source from (IMG) tag
<img alt="screen scraping" src="https://cdn.datascraping.co/images/create-a-web-scraping-agent.jpg">
<img alt="([^"]+)" src="([^"]+)"\s*/?>

Extract data attribute and price from (DIV) tag
<div data-id="17839" data-availability="InStock">USD 129.00</div>
<div data-id="(\d+)" data-availability="(\w+)">USD\s*([^<]+)<\/div>

Extract text from (STRONG) tag
<strong>My Bold text</strong>
<strong>([^<]+)</strong>

Extract text from (span) tag with some CSS class
<span class="some-css-class">My Favorite Data</span>
<span class="some-css-class">([^<]+)</span>

Extract META description content value
<meta name="description" content="the SEO description of web page in heading section" />
<meta name="description" content="([^"]+)" />

Scrape only the 1st 3 values in a table
<tr>\s*<td>([^<]+)</td>\s*<td>([^<]+)</td>\s*<td>([^<]+)</td>\s*</tr>
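
To show what the table pattern captures, here is a small Python sketch run against a made-up table (the product names and numbers are invented for the example). Each row with exactly three plain-text cells yields one tuple.

```python
import re

ROW = re.compile(
    r"<tr>\s*<td>([^<]+)</td>\s*<td>([^<]+)</td>\s*<td>([^<]+)</td>\s*</tr>"
)

# Hypothetical markup: one row spread over several lines, one on a single line.
table_html = """<table>
<tr>
  <td>Widget</td>
  <td>4</td>
  <td>9.50</td>
</tr>
<tr><td>Gadget</td><td>1</td><td>20.00</td></tr>
</table>"""

rows = ROW.findall(table_html)
for name, qty, price in rows:
    print(name, qty, price)
```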

Extract paragraphs within a div
<div class=description>
<p> paragraph 1 </p>
<p> paragraph 2 </p>
<p> paragraph 3 </p>
</div>
<div class=description>([\s\S]*?)</div>

Ebates ----------------------------------------------------
Store Names: <a [^>]*>([^<]+)
PercentValue: <a [^>]*>\s*([\d.]+)
DollarValue: <a [^>]*>\s*\$([\d.]+)
UptoDollarValue: <a [^>]*>\s*Up\sto\s\$([\d.]+)
UptoPercentValue: <a [^>]*>\s*Up\sto\s([\d.]+)\%
NoDiscount: <a [^>]*>\s*(No Discount)
CouponsOnly: <a [^>]*>\s*(Coupons Only)
InStoreOnly: <a [^>]*>\s*(In-Store)
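
The value patterns in this list overlap (the percent pattern is satisfied by any leading number), so the order in which they are tried matters. The sketch below tries the most specific pattern first. The anchor snippets, the labels, and the function name are invented for illustration and are not taken from the real site.

```python
import re

# Patterns from the Ebates list above, ordered from most to least specific.
PATTERNS = [
    ("upto_dollar", re.compile(r"<a [^>]*>\s*Up\sto\s\$([\d.]+)")),
    ("upto_percent", re.compile(r"<a [^>]*>\s*Up\sto\s([\d.]+)\%")),
    ("dollar", re.compile(r"<a [^>]*>\s*\$([\d.]+)")),
    ("percent", re.compile(r"<a [^>]*>\s*([\d.]+)")),
]

def classify_offer(anchor_html):
    """Return (kind, value) for the first pattern that matches, else None."""
    for kind, pattern in PATTERNS:
        m = pattern.search(anchor_html)
        if m:
            return kind, float(m.group(1))
    return None

# Hypothetical anchor tags.
print(classify_offer('<a href="#">4.0% Cash Back</a>'))
print(classify_offer('<a href="#">Up to $25 Cash Back</a>'))
```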

BeFrugal ----------------------------------------------
StoreName: <tr>\s*<td><a [^>]*>([^<]+)
PercentValue: <td class="green"[^>]*>\s*([\d.]+)
DollarValue: <td class="green"[^>]*>\s*\$([\d.]+)
UptoDollarValue: <td class="green"[^>]*>\s*Up\sto\s\$([\d.]+)
UptoPercentValue: <td class="green"[^>]*>\s*Up\sto\s([\d.]+)\%

Extrabux ---------------------------------------------
StoreName: <a [^>]*>([^<]+)
PercentValue: <a class="cashBack transferLink"[^>]*>\s*([\d.]+)
DollarValue: <a class="cashBack transferLink"[^>]*>\s*\$([\d.]+)
UptoDollarValue: <a class="cashBack transferLink"[^>]*>\s*Up\sto\s\$([\d.]+)
UptoPercentValue: <a class="cashBack transferLink"[^>]*>\s*Up\sto\s([\d.]+)\%

CashbackBin ------------------
Store Name: <h1[^>]*>([^<]+)
Vendor: title="([^"]*)
Rate: <td class="l lo" style="text-align: center;">[^>]*>([^<]*)
Bonus: <td class="l bonus_amount">[\s\S]*?<span class="card_secondary_text">([^<]*)

CBM -------------------------
Match up to the first occurrence of %
[^%]*

Cb ----------------------------------------------------
https://www.couponbox.com/us/stores/a
https://www.couponbox.com/us/stores/b
https://www.couponbox.com/us/stores/c
https://www.couponbox.com/us/stores/d
https://www.couponbox.com/us/stores/e
https://www.couponbox.com/us/stores/f
https://www.couponbox.com/us/stores/g
https://www.couponbox.com/us/stores/h
https://www.couponbox.com/us/stores/i
https://www.couponbox.com/us/stores/j
https://www.couponbox.com/us/stores/k
https://www.couponbox.com/us/stores/l
https://www.couponbox.com/us/stores/m
https://www.couponbox.com/us/stores/n
https://www.couponbox.com/us/stores/o
https://www.couponbox.com/us/stores/p
https://www.couponbox.com/us/stores/q
https://www.couponbox.com/us/stores/r
https://www.couponbox.com/us/stores/s
https://www.couponbox.com/us/stores/t
https://www.couponbox.com/us/stores/u
https://www.couponbox.com/us/stores/v
https://www.couponbox.com/us/stores/w
https://www.couponbox.com/us/stores/x
https://www.couponbox.com/us/stores/y
https://www.couponbox.com/us/stores/z
href="https:\/\/www\.couponbox\.com\/us\/coupons\/([^"]*)
srcset="\/\/([^"]*)
href="([^"]*)