Today the latest version of Portia was released, bringing with it the ability to crawl pages that require JavaScript. To celebrate this release, we are making Splash available as a free trial to all Portia users so you can try it out with your projects.
If you would like to crawl using JavaScript in your project you can do so by:
- Navigating to your spider in Portia.
- Opening the Crawling tab.
- Clicking the Enable JS checkbox.
Once this checkbox is ticked, you can annotate pages that need JavaScript in order to be crawled correctly.
After you enable JavaScript, you can limit which pages it is enabled for. You can choose to run JavaScript only on certain pages, in the same way that you limit which pages Portia follows.
You may want to limit which pages load JavaScript because it can increase the time required to run your spider.
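To make the idea of limiting JavaScript to certain pages more concrete, here is a minimal Python sketch of pattern-based matching. It assumes the patterns are regular expressions tested against each URL; the function name `javascript_enabled_for`, its parameters, and the example URLs are hypothetical illustrations and not part of Portia's API.

```python
import re


def javascript_enabled_for(url, enable_patterns, exclude_patterns=()):
    """Decide whether a URL should be rendered with JavaScript.

    Hypothetical helper: an exclude pattern always wins; otherwise the
    URL needs to match at least one enable pattern.
    """
    if any(re.search(pattern, url) for pattern in exclude_patterns):
        return False
    return any(re.search(pattern, url) for pattern in enable_patterns)


if __name__ == "__main__":
    enable = [r"/products/\d+"]   # assumed: only product pages need JavaScript
    exclude = [r"/print$"]        # assumed: printer-friendly pages do not
    for url in (
        "http://example.com/products/42",
        "http://example.com/about",
        "http://example.com/products/42/print",
    ):
        print(url, javascript_enabled_for(url, enable, exclude))
```

Keeping the enable patterns narrow means only the pages that really need rendering pay the extra time cost.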
Once you have made these changes, you can publish your spider and it will crawl with JavaScript enabled for the matching pages.
When you are creating a spider, you can use the show followed links checkbox to decide whether you need JavaScript enabled on a page. Showing followed links lets you see which links are followed only if JavaScript is enabled and which links are always followed.
To decide if you need JavaScript enabled for extracting data, try to create a sample for the page that you wish to extract data from. If JavaScript is not enabled for this page and you can see the data you wish to extract, then you don't need to change anything; your spider works! If you don't see the data you want, then you can:
- Enable JavaScript for the spider as described above.
- If you already have JavaScript enabled, add a pattern for this URL to the follow patterns.
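The same check can be roughly approximated outside Portia: if the text you want is already present in the raw HTML, the page probably does not need JavaScript to expose it. The sketch below uses only Python's standard library; `visible_without_javascript` and the sample HTML strings are hypothetical, and the check is a heuristic rather than a description of how Portia itself decides.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_without_javascript(raw_html, expected_text):
    """Return True if expected_text appears in the page text before any
    JavaScript has run (hypothetical helper, heuristic only)."""
    parser = _TextCollector()
    parser.feed(raw_html)
    parser.close()
    # Collapse whitespace so line breaks in the markup do not hide a match.
    text = " ".join(" ".join(parser.chunks).split())
    return expected_text in text


if __name__ == "__main__":
    static_page = "<html><body><span class='price'>$9.99</span></body></html>"
    dynamic_page = (
        "<html><body><div id='app'></div>"
        "<script>render({price: '$9.99'})</script></body></html>"
    )
    print(visible_without_javascript(static_page, "$9.99"))
    print(visible_without_javascript(dynamic_page, "$9.99"))
```

In the second sample the price only exists inside a script, so a page like that would be a candidate for enabling JavaScript.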
This should be all you need to get started with JavaScript support in Portia. Happy Scraping!