@redapple
Last active August 29, 2015 14:02
paul@paul-SATELLITE-R830:~/dev/scrapy$ scrapy shell "https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1"
2014-06-19 12:18:12+0200 [scrapy] INFO: Scrapy 0.22.2 started (bot: scrapybot)
2014-06-19 12:18:12+0200 [scrapy] INFO: Optional features available: ssl, http11, boto
2014-06-19 12:18:12+0200 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-06-19 12:18:12+0200 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-06-19 12:18:13+0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-06-19 12:18:13+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-06-19 12:18:13+0200 [scrapy] INFO: Enabled item pipelines:
2014-06-19 12:18:13+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-06-19 12:18:13+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-06-19 12:18:13+0200 [default] INFO: Spider opened
2014-06-19 12:18:14+0200 [default] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x7fc8b3452e10>
[s] item {}
[s] request <GET https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1>
[s] response <200 https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1>
[s] sel <Selector xpath=None data=u'<html xmlns="http://www.w3.org/1999/xhtm'>
[s] settings <CrawlerSettings module=None>
[s] spider <Spider 'default' at 0x7fc8b2bdc810>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
/usr/local/lib/python2.7/dist-packages/IPython/frontend.py:30: UserWarning: The top-level `frontend` package has been deprecated. All its subpackages have been moved to the top `IPython` level.
warn("The top-level `frontend` package has been deprecated. "
In [1]: sel.xpath('//iframe/@src').extract()
Out[1]: [u'https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1&in_iframe=1']
In [2]: fetch('https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1&in_iframe=1')
2014-06-19 12:19:16+0200 [default] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1&in_iframe=1> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x7fc8b3452e10>
[s] item {}
[s] request <GET https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1&in_iframe=1>
[s] response <200 https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1&in_iframe=1>
[s] sel <Selector xpath=None data=u'<html>\r\n<head><script type="text/javascr'>
[s] settings <CrawlerSettings module=None>
[s] spider <Spider 'default' at 0x7fc8b2bdc810>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
In [3]: sel.css('span.iCIMS_JobsTableHeader + a::attr(href)').extract()
Out[3]:
[u'https://careers-meridianhealth.icims.com/jobs/5516/environmental-service-aide/job?in_iframe=1',
u'https://careers-meridianhealth.icims.com/jobs/5507/resident-assistant/job?in_iframe=1',
u'https://careers-meridianhealth.icims.com/jobs/5489/registered-nurse---shore-rehab/job?in_iframe=1',
u'https://careers-meridianhealth.icims.com/jobs/5477/practice-manager/job?in_iframe=1',
u'https://careers-meridianhealth.icims.com/jobs/5476/medical-receptionist/job?in_iframe=1',
u'https://careers-meridianhealth.icims.com/jobs/5475/medical-receptionist/job?in_iframe=1',
u'https://careers-meridianhealth.icims.com/jobs/5474/medical-receptionist/job?in_iframe=1',
u'https://careers-meridianhealth.icims.com/jobs/5473/medical-receptionist/job?in_iframe=1',
u'https://careers-meridianhealth.icims.com/jobs/5472/certified-medical-assistant/job?in_iframe=1',
u'https://careers-meridianhealth.icims.com/jobs/5471/certified-medical-assistant/job?in_iframe=1',
u'https://careers-meridianhealth.icims.com/jobs/5470/certified-medical-assistant/job?in_iframe=1',
u'https://careers-meridianhealth.icims.com/jobs/5469/certified-medical-assistant/job?in_iframe=1',
u'https://careers-meridianhealth.icims.com/jobs/5468/certified-medical-assistant/job?in_iframe=1',
u'https://careers-meridianhealth.icims.com/jobs/5465/certified-medical-assistant/job?in_iframe=1',
u'https://careers-meridianhealth.icims.com/jobs/5463/medical-receptionist---jersey-shore-ob-gyn/job?in_iframe=1',
u'https://careers-meridianhealth.icims.com/jobs/5460/certified-medical-assistant-i/job?in_iframe=1',
u'https://careers-meridianhealth.icims.com/jobs/5458/asst-regional-practice-admin/job?in_iframe=1',
u'https://careers-meridianhealth.icims.com/jobs/5444/certified-medical-assistant-i-thoracic-surgery/job?in_iframe=1',
u'https://careers-meridianhealth.icims.com/jobs/5441/systems-operator-i/job?in_iframe=1',
u'https://careers-meridianhealth.icims.com/jobs/5439/echo-technologist-%5bcardiac-diagnostics%5d/job?in_iframe=1']
In [4]: fetch('https://careers-meridianhealth.icims.com/jobs/5516/environmental-service-aide/job?in_iframe=1')
2014-06-19 12:21:05+0200 [default] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/5516/environmental-service-aide/job?in_iframe=1> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x7fc8b3452e10>
[s] item {}
[s] request <GET https://careers-meridianhealth.icims.com/jobs/5516/environmental-service-aide/job?in_iframe=1>
[s] response <200 https://careers-meridianhealth.icims.com/jobs/5516/environmental-service-aide/job?in_iframe=1>
[s] sel <Selector xpath=None data=u'<html>\r\n<head><script type="text/javascr'>
[s] settings <CrawlerSettings module=None>
[s] spider <Spider 'default' at 0x7fc8b2bdc810>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
In [5]: sel.xpath('//div[2]/div[1]/div[2]/span/span/span[3]/text()').extract()
Out[5]: [u'Holmdel']
In [6]: sel.xpath('//div[2]/div[1]/div[2]/span/span/span[2]/text()').extract()
Out[6]: [u'NJ']
In [7]:
paul@paul-SATELLITE-R830:~/tmp/stackoverflow/24301376$ scrapy runspider 24301376.py
2014-06-19 17:56:07+0200 [scrapy] INFO: Scrapy 0.22.2 started (bot: scrapybot)
2014-06-19 17:56:07+0200 [scrapy] INFO: Optional features available: ssl, http11, boto
2014-06-19 17:56:07+0200 [scrapy] INFO: Overridden settings: {}
2014-06-19 17:56:07+0200 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-06-19 17:56:09+0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-06-19 17:56:09+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-06-19 17:56:09+0200 [scrapy] INFO: Enabled item pipelines:
2014-06-19 17:56:09+0200 [meridianhealth] INFO: Spider opened
2014-06-19 17:56:09+0200 [meridianhealth] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-06-19 17:56:09+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-06-19 17:56:09+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-06-19 17:56:09+0200 [meridianhealth] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1> (referer: None)
2014-06-19 17:56:10+0200 [meridianhealth] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1&in_iframe=1> (referer: https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1)
2014-06-19 17:56:10+0200 [meridianhealth] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/5471/certified-medical-assistant/job?in_iframe=1> (referer: https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1&in_iframe=1)
[u'CERTIFIED MEDICAL ASSISTANT']
[u'Forked River']
[u'NJ']
2014-06-19 17:56:11+0200 [meridianhealth] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/5473/medical-receptionist/job?in_iframe=1> (referer: https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1&in_iframe=1)
[u'MEDICAL RECEPTIONIST']
[u'Manahawkin']
[u'NJ']
2014-06-19 17:56:11+0200 [meridianhealth] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/5475/medical-receptionist/job?in_iframe=1> (referer: https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1&in_iframe=1)
[u'MEDICAL RECEPTIONIST']
[u'Manahawkin']
[u'NJ']
2014-06-19 17:56:11+0200 [meridianhealth] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/5474/medical-receptionist/job?in_iframe=1> (referer: https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1&in_iframe=1)
[u'MEDICAL RECEPTIONIST']
[u'Manahawkin']
[u'NJ']
2014-06-19 17:56:11+0200 [meridianhealth] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/5477/practice-manager/job?in_iframe=1> (referer: https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1&in_iframe=1)
2014-06-19 17:56:11+0200 [meridianhealth] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/5472/certified-medical-assistant/job?in_iframe=1> (referer: https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1&in_iframe=1)
[u'PRACTICE MANAGER']
[u'Forked River']
[u'NJ']
[u'CERTIFIED MEDICAL ASSISTANT']
[u'Brick']
[u'NJ']
2014-06-19 17:56:11+0200 [meridianhealth] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/5489/registered-nurse---shore-rehab/job?in_iframe=1> (referer: https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1&in_iframe=1)
[u'REGISTERED NURSE - SHORE REHAB']
[u'Brick']
[u'NJ']
2014-06-19 17:56:11+0200 [meridianhealth] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/5507/resident-assistant/job?in_iframe=1> (referer: https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1&in_iframe=1)
[u'RESIDENT ASSISTANT']
[u'Holmdel']
[u'NJ']
2014-06-19 17:56:11+0200 [meridianhealth] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/5476/medical-receptionist/job?in_iframe=1> (referer: https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1&in_iframe=1)
[u'MEDICAL RECEPTIONIST']
[u'Manahawkin']
[u'NJ']
2014-06-19 17:56:11+0200 [meridianhealth] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/5441/systems-operator-i/job?in_iframe=1> (referer: https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1&in_iframe=1)
[u'SYSTEMS OPERATOR I']
[u'Neptune']
[u'NJ']
2014-06-19 17:56:11+0200 [meridianhealth] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/5532/rehabilitation-aide/job?in_iframe=1> (referer: https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1&in_iframe=1)
[u'REHABILITATION AIDE']
[u'Shrewsbury']
[u'NJ']
2014-06-19 17:56:11+0200 [meridianhealth] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/5516/environmental-service-aide/job?in_iframe=1> (referer: https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1&in_iframe=1)
[u'ENVIRONMENTAL SERVICE AIDE']
[u'Holmdel']
[u'NJ']
2014-06-19 17:56:11+0200 [meridianhealth] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/5458/asst-regional-practice-admin/job?in_iframe=1> (referer: https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1&in_iframe=1)
[u'ASST REGIONAL PRACTICE ADMIN']
[u'Neptune']
[u'NJ']
2014-06-19 17:56:11+0200 [meridianhealth] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/5444/certified-medical-assistant-i-thoracic-surgery/job?in_iframe=1> (referer: https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1&in_iframe=1)
[u'CERTIFIED MEDICAL ASSISTANT I-THORACIC SURGERY']
[u'Neptune']
[u'NJ']
2014-06-19 17:56:11+0200 [meridianhealth] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/5460/certified-medical-assistant-i/job?in_iframe=1> (referer: https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1&in_iframe=1)
2014-06-19 17:56:11+0200 [meridianhealth] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/5463/medical-receptionist---jersey-shore-ob-gyn/job?in_iframe=1> (referer: https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1&in_iframe=1)
[u'CERTIFIED MEDICAL ASSISTANT I']
[u'Holmdel']
[u'NJ']
[u'MEDICAL RECEPTIONIST - Jersey Shore OB/GYN']
[u'Freehold']
[u'NJ']
2014-06-19 17:56:11+0200 [meridianhealth] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/5465/certified-medical-assistant/job?in_iframe=1> (referer: https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1&in_iframe=1)
[u'CERTIFIED MEDICAL ASSISTANT']
[u'Manahawkin']
[u'NJ']
2014-06-19 17:56:12+0200 [meridianhealth] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/5469/certified-medical-assistant/job?in_iframe=1> (referer: https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1&in_iframe=1)
[u'CERTIFIED MEDICAL ASSISTANT']
[u'Manahawkin']
[u'NJ']
2014-06-19 17:56:12+0200 [meridianhealth] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/5468/certified-medical-assistant/job?in_iframe=1> (referer: https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1&in_iframe=1)
[u'CERTIFIED MEDICAL ASSISTANT']
[u'Manahawkin']
[u'NJ']
2014-06-19 17:56:12+0200 [meridianhealth] DEBUG: Crawled (200) <GET https://careers-meridianhealth.icims.com/jobs/5470/certified-medical-assistant/job?in_iframe=1> (referer: https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1&in_iframe=1)
[u'CERTIFIED MEDICAL ASSISTANT']
[u'Manahawkin']
[u'NJ']
2014-06-19 17:56:12+0200 [meridianhealth] INFO: Closing spider (finished)
2014-06-19 17:56:12+0200 [meridianhealth] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 10391,
'downloader/request_count': 22,
'downloader/request_method_count/GET': 22,
'downloader/response_bytes': 342845,
'downloader/response_count': 22,
'downloader/response_status_count/200': 22,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 6, 19, 15, 56, 12, 295969),
'log_count/DEBUG': 24,
'log_count/INFO': 7,
'request_depth_max': 2,
'response_received_count': 22,
'scheduler/dequeued': 22,
'scheduler/dequeued/memory': 22,
'scheduler/enqueued': 22,
'scheduler/enqueued/memory': 22,
'start_time': datetime.datetime(2014, 6, 19, 15, 56, 9, 77102)}
2014-06-19 17:56:12+0200 [meridianhealth] INFO: Spider closed (finished)
paul@paul-SATELLITE-R830:~/tmp/stackoverflow/24301376$
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request


class MeridianhealthSpider(Spider):
    name = "meridianhealth"
    start_urls = [
        "https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1",
    ]

    def parse(self, response):
        # The search page wraps the job listing in an iframe; follow its src.
        selector = Selector(response)
        return Request(url=selector.xpath('//iframe/@src').extract()[0],
                       callback=self.parse_iframe)

    def parse_iframe(self, response):
        # Follow the <a> that directly follows each span.iCIMS_JobsTableHeader.
        selector = Selector(response)
        for posting_url in selector.css('span.iCIMS_JobsTableHeader + a::attr(href)').extract():
            yield Request(url=posting_url, callback=self.parse_job_posting)

    def parse_job_posting(self, response):
        # Print title, city, and state; the positional XPaths depend on the page layout.
        selector = Selector(response)
        print(selector.css('h1[itemprop="title"]::text').extract())
        print(selector.xpath('//div[2]/div[1]/div[2]/span/span/span[3]/text()').extract())
        print(selector.xpath('//div[2]/div[1]/div[2]/span/span/span[2]/text()').extract())
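
The spider above prints each field as a one-element list, such as [u'Holmdel'], because extract() always returns a list. A minimal sketch of how those lists could be collapsed into one flat record per posting follows. The helper names first_or_none and build_posting are hypothetical and not part of Scrapy or of the original spider; the sample values are copied from the run log above.

```python
def first_or_none(values):
    """Return the first non-empty, stripped string from an extract()-style list."""
    for value in values:
        text = value.strip()
        if text:
            return text
    # Nothing matched: signal the missing field explicitly.
    return None


def build_posting(title, city, state):
    """Combine three extract()-style lists into a single flat dict (hypothetical helper)."""
    return {
        "title": first_or_none(title),
        "city": first_or_none(city),
        "state": first_or_none(state),
    }


# Sample values taken from the runspider output for job 5516.
posting = build_posting([u"ENVIRONMENTAL SERVICE AIDE"], [u"Holmdel"], [u"NJ"])
print(posting)
```

Returning None for an empty list avoids the IndexError that indexing with [0] would raise when a selector matches nothing.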