The Man in the High Castle Kindle Edition
by
Philip K. Dick (Author)
@sysnucleus
sysnucleus / gist:f91f8b8b84d918918b6857b3a7acff3a
Created June 22, 2018 04:47
WebHarvy Amazon extraction regular expressions
src="([^_]*)_[^\.]*\.([^"]*)
<div id="feature-bullets"[^>]*>([\s\S]*?)</div>
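As a rough illustration of how the two Amazon patterns above behave, the sketch below runs them in JavaScript against a made-up HTML fragment. The sample markup, the `example.com` URL, and the idea of joining the two capture groups to get an image URL without the size modifier are assumptions for illustration; the gist itself only provides the patterns.

```javascript
// Patterns copied verbatim from the gist; String.raw keeps the backslashes intact.
const imagePattern = new RegExp(String.raw`src="([^_]*)_[^\.]*\.([^"]*)`);
const bulletsPattern = new RegExp(String.raw`<div id="feature-bullets"[^>]*>([\s\S]*?)</div>`);

// Hypothetical sample markup (not taken from Amazon).
const amazonSampleHtml = `
<img src="https://example.com/images/I/51abc._SX300_.jpg">
<div id="feature-bullets" class="a-section">
  <ul><li>First bullet</li><li>Second bullet</li></ul>
</div>`;

// Group 1 is everything before the first underscore, group 2 is the file extension.
const imageMatch = amazonSampleHtml.match(imagePattern);
// Assumption: joining the groups drops the size modifier that sits between them.
const fullImageUrl = imageMatch ? imageMatch[1] + imageMatch[2] : null;

// The lazy [\s\S]*? stops at the first closing </div> after the opening tag.
const bulletsMatch = amazonSampleHtml.match(bulletsPattern);
const bulletsHtml = bulletsMatch ? bulletsMatch[1].trim() : null;

console.log(fullImageUrl);
console.log(bulletsHtml);
```

Because the second pattern ends at the first `</div>`, it would cut the capture short if the feature block contained nested `div` elements.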
sysnucleus / yellow pages egypt.js
Created October 15, 2018 05:08
RegEx strings to extract listing name, telephone, website and address
[\s]*(.*)
tel: ([^"]*)
title="Website"[\s\S]*?href="([^"]*)
class="col-md-9 company_address"[^>]*>([^<]*)
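The telephone, website, and address patterns above each have a single capture group, so they can be tried out with one small helper. The following is a minimal sketch: `firstGroup` is a hypothetical helper name and the listing markup is invented to fit the patterns, not copied from the real site.

```javascript
// Patterns copied verbatim from the gist above.
const egyptPatterns = {
  telephone: String.raw`tel: ([^"]*)`,
  website: String.raw`title="Website"[\s\S]*?href="([^"]*)`,
  address: String.raw`class="col-md-9 company_address"[^>]*>([^<]*)`,
};

// Hypothetical helper: returns the first capture group, or null when nothing matches.
function firstGroup(source, html) {
  const match = new RegExp(source).exec(html);
  return match ? match[1].trim() : null;
}

// Made-up listing markup, shaped only to fit the patterns.
const listingHtml = `
<a href="tel: 0123456789">Call</a>
<a title="Website" target="_blank" href="https://example.com">Website</a>
<div class="col-md-9 company_address"> 12 Example St, Cairo </div>`;

// Apply every pattern and collect the results into one record.
const listing = {};
for (const [field, source] of Object.entries(egyptPatterns)) {
  listing[field] = firstGroup(source, listingHtml);
}
console.log(listing);
```

Returning `null` for a missing field keeps one absent value (for example, a listing without a website) from aborting the whole record.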
<DATAFIELD>
<type>Text</type>
<name>Name</name>
<selector>
#hotellist_inner &gt; DIV:nth-of-type(02) &gt; DIV:nth-of-type(2) &gt; DIV:nth-of-type(1) &gt; DIV:nth-of-type(1) &gt; DIV:nth-of-type(1) &gt; H3 &gt; A &gt; SPAN:nth-of-type(1)
</selector>
<heading />
<pattern>true</pattern>
<regex />
</DATAFIELD>
sysnucleus / gist:84a0574cbf908813787d2d95b8a6c2ed
Created August 20, 2020 13:22
JS code to configure pagination (scroll) in WebHarvy for Twitter scraping
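// Climb from the first tweet (<article>) to the container that holds all tweets.
var groupEl = document.getElementsByTagName('article')[0].parentElement.parentElement.parentElement.parentElement;
// Scroll the last child of that container into view.
groupEl.children[groupEl.childElementCount - 1].scrollIntoView();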
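The two lines above throw if no `article` element exists yet or if the page nests tweets differently. A more defensive variant might look like the sketch below. The function name `scrollToLastTweet`, the `doc` parameter, and the boolean return value are assumptions added for illustration; only the four-level climb and the `scrollIntoView()` call come from the gist.

```javascript
// Hypothetical wrapper around the gist's two lines; `doc` defaults to the page document.
function scrollToLastTweet(doc = globalThis.document) {
  const firstArticle = doc.getElementsByTagName('article')[0];
  if (!firstArticle) return false; // no tweet rendered yet

  // Climb four levels, as the gist does, to reach the element that holds all tweets.
  let groupEl = firstArticle;
  for (let level = 0; level < 4; level++) {
    groupEl = groupEl.parentElement;
    if (!groupEl) return false; // structure differs from what the gist assumes
  }

  const lastChild = groupEl.children[groupEl.childElementCount - 1];
  if (!lastChild) return false;

  // Presumably, bringing the last item into view is what prompts more tweets to load.
  lastChild.scrollIntoView();
  return true;
}
```

The return value only reports whether a scroll was attempted; it does not tell whether new items actually loaded.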
sysnucleus / gist:436a2b0be80882f0ae61a391931abf5d
Created August 31, 2020 13:55
RegEx strings to extract email, phone, website and address from yellowpages.com.au
data-email="([^"]*)
tel:([^"]*)
title="([^\s]*)\s*\(opens in a new window\)
<p class="listing-address[^>]*>([^<]*)
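When a results page holds several listings, the same four patterns can be applied globally to collect every occurrence. The sketch below assumes a JavaScript runtime with `String.prototype.matchAll`; `captureAll` is a hypothetical helper, and the two sample listings are invented to fit the patterns.

```javascript
// Patterns copied verbatim from the gist above.
const auPatterns = {
  email: String.raw`data-email="([^"]*)`,
  phone: String.raw`tel:([^"]*)`,
  website: String.raw`title="([^\s]*)\s*\(opens in a new window\)`,
  address: String.raw`<p class="listing-address[^>]*>([^<]*)`,
};

// Hypothetical helper: collects capture group 1 from every match in the page.
function captureAll(source, html) {
  return Array.from(html.matchAll(new RegExp(source, 'g')), (m) => m[1]);
}

// Made-up markup for two listings; the second one has no website link.
const resultsHtml = `
<div class="listing">
  <a data-email="info@example.com">Email</a>
  <a href="tel:0291234567">Call</a>
  <a title="www.example.com (opens in a new window)">Website</a>
  <p class="listing-address mappable-address">1 Sample Rd, Sydney NSW 2000</p>
</div>
<div class="listing">
  <a data-email="sales@example.org">Email</a>
  <a href="tel:0398765432">Call</a>
  <p class="listing-address">2 Sample St, Melbourne VIC 3000</p>
</div>`;

const columns = Object.fromEntries(
  Object.entries(auPatterns).map(([field, source]) => [field, captureAll(source, resultsHtml)])
);
console.log(columns);
```

Note that the resulting arrays are not aligned per listing: a listing without a website simply contributes nothing to the `website` array, so matching within each listing's own HTML is safer when rows must line up.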
sysnucleus / tripadvisor
Created September 3, 2020 03:51
Code to extract reviewer-submitted images from TripAdvisor using WebHarvy
// RegEx to Follow links
href="([^"]*)
// More button click
document.getElementsByClassName('moreBtn')[0].click();
// Get images block
sysnucleus / ta-expand.js
Created April 5, 2021 09:25
Expand the 'Read more' links in Tripadvisor reviews
var els = document.getElementsByTagName('span');
for (var i = els.length - 1; i >= 0; i--) {
    if (els[i].innerText === 'Read more') {
        els[i].click();
    }
}
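The loop above can also be written as a reusable function that reports how many links it clicked. This is only a sketch: the name `expandReadMore`, its `doc` and `label` parameters, and the whitespace trimming are assumptions added for illustration, not part of the gist.

```javascript
// Hypothetical variant of the gist's loop; `doc` defaults to the page document.
function expandReadMore(doc = globalThis.document, label = 'Read more') {
  // Copy the live collection first so clicks that change the DOM cannot shift indices.
  const spans = Array.from(doc.getElementsByTagName('span'));
  let clicked = 0;
  for (const span of spans) {
    // trim() tolerates stray whitespace around the label (an assumption about the markup).
    if ((span.innerText || '').trim() === label) {
      span.click();
      clicked++;
    }
  }
  return clicked;
}
```

Snapshotting the collection serves the same purpose as iterating backwards in the original: both avoid skipping elements if a click alters the list of `span` elements.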
sysnucleus / WebHarvy XML Miner Options
Created September 9, 2021 02:45
WebHarvy Miner Options
<MinerOptions>
<MinLevelsUp>2</MinLevelsUp>
<MinChildCount>10</MinChildCount>
<SelAccuracy>2</SelAccuracy>
</MinerOptions>
sysnucleus / WebHarvy XML Version Info
Last active September 9, 2021 02:46
WebHarvy XML Version Info
<VersionInfo>6.3.0.189</VersionInfo>
<RegInfo>SysNucleus</RegInfo>