@ruairif
Created August 18, 2015 14:33

Extract from even more sites using Portia

Today the latest version of Portia has been released, bringing with it the ability to crawl pages that require JavaScript. To celebrate this release, we are making Splash available as a free trial to all Portia users so you can try it out with your projects.

How to use it

If you would like to crawl using JavaScript in your project you can do so by:

  1. Navigating to your spider in Portia.
  2. Opening the Crawling tab.
  3. Clicking the Enable JS checkbox.

Once this checkbox is ticked, you can annotate pages that need JavaScript enabled in order to be crawled correctly.

After you enable JavaScript, you can limit which pages it is enabled for. You can restrict JavaScript to certain pages in the same way that you limit which pages Portia follows.

You may want to limit which pages load JavaScript because running it can increase the time your spider takes to run.
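As a rough illustration of limiting JavaScript by URL pattern, here is a minimal Python sketch. It is not Portia's implementation: the function name `should_run_js`, its parameters, and the rule that a disable pattern takes priority are all assumptions made for the example. Patterns are treated as regular expressions searched anywhere in the URL.

```python
import re


def should_run_js(url, js_enabled, enable_patterns=(), disable_patterns=()):
    """Decide whether JavaScript would be run for `url`.

    Hypothetical helper for illustration only; the precedence rules
    below are assumptions, not Portia's documented behaviour.
    """
    if not js_enabled:
        # The spider-wide switch is off, so no page gets JavaScript.
        return False
    # Assumption: a matching disable pattern always wins.
    if any(re.search(pattern, url) for pattern in disable_patterns):
        return False
    # Assumption: with no enable patterns, every page gets JavaScript.
    if not enable_patterns:
        return True
    return any(re.search(pattern, url) for pattern in enable_patterns)


if __name__ == "__main__":
    patterns = [r"/products/"]
    for url in ("http://example.com/products/1", "http://example.com/about"):
        print(url, should_run_js(url, True, patterns))
```

Restricting JavaScript to the few URL patterns that need it keeps the slower rendering step off the rest of the crawl.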

Once you have made these changes, you can publish your spider and it will crawl with JavaScript enabled on the pages you selected.

How do I know if I need it?

When you are creating a spider, you can use the show followed links checkbox to decide whether you need JavaScript enabled on a page. Showing followed links lets you see which links are followed only when JavaScript is enabled and which links are always followed.
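The same comparison can be sketched outside Portia. The Python example below assumes you already have two copies of a page's HTML, one fetched without JavaScript and one after rendering; how you obtain them is up to you. The names `LinkCollector`, `extract_links`, and `js_only_links` are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.add(value)


def extract_links(html):
    """Return the set of link targets found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def js_only_links(raw_html, rendered_html):
    """Links present only after JavaScript has run."""
    return extract_links(rendered_html) - extract_links(raw_html)


if __name__ == "__main__":
    raw = '<a href="/a">A</a>'
    rendered = '<a href="/a">A</a><a href="/b">B</a>'
    print(js_only_links(raw, rendered))
```

If the result is empty for the pages you care about, the links your spider needs are probably reachable without JavaScript.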

To decide whether you need JavaScript enabled to extract data, try creating a sample for the page you wish to extract data from. If JavaScript is not enabled for this page and you can see the data you wish to extract, then you don't need to change anything; your spider works! If you don't see the data you want, then you can:

  • Enable JavaScript for the spider as described above.
  • If you already have JavaScript enabled, add a pattern for this URL to the follow patterns.
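The check for extracted data follows the same logic: if the value you want is already in the page without JavaScript, nothing needs to change. A minimal Python sketch of that decision, assuming you have the page's HTML both with and without JavaScript and know a piece of text you expect to extract (the function name `data_needs_javascript` is hypothetical):

```python
def data_needs_javascript(raw_html, rendered_html, expected_text):
    """Return True if `expected_text` only appears once JavaScript has run.

    Illustrative helper: a plain substring check stands in for looking
    at the page in Portia and seeing whether the data is visible.
    """
    if expected_text in raw_html:
        # The data is visible without JavaScript; no change needed.
        return False
    if expected_text in rendered_html:
        # The data only shows up after rendering.
        return True
    raise ValueError("expected_text was not found in either version of the page")


if __name__ == "__main__":
    raw = "<div id='app'></div>"
    rendered = "<div id='app'>Price: 10</div>"
    print(data_needs_javascript(raw, rendered, "Price: 10"))
```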

This should be all you need to get started with JavaScript support in Portia. Happy Scraping!
