Last active August 29, 2015 14:26

The Road to loading JavaScript in Portia

Support for pages that make heavy use of JavaScript has been a much-requested feature ever since Portia was released just over two years ago. The wait is nearly over and we are happy to be launching these changes in the very near future (if you are adventurous, you can try them out right now in the develop branch on GitHub). This post isn't an announcement of the release of JavaScript support in Portia; instead, we want to explain how we added this support.

The Plan

As with everything in software, we started by working out our requirements and investigating what others had done in this situation. We were looking for a solution that was reliable and would allow reproducible interaction with webpages.

Reliability: A solution that could render the pages in the same way during spider creation and crawling.

Interaction: A system that would allow us to record the user's actions so that they could be replayed while crawling.

The Investigation

The investigation produced some interesting and some crazy ideas. Here are the ones we probed further:

  1. Place the Portia UI inside a browser addon and, using the additional privileges gained from being an addon, read from and interact with the page.
  2. Place the Portia UI inside a bookmarklet and, after doing some post-processing of the page on our server, allow interaction with the page.
  3. Render a static screenshot of the page along with the coordinates of all of its elements and send them to the UI; every interaction requires re-rendering the whole screenshot.
  4. Render a tiled screenshot of the page along with coordinates; when an interaction event is detected, update the representation on the server, send the updated tiles to the UI to be rendered, and send the updated DOM to the user.
  5. Render the page in an iframe, but use a proxy to avoid cross-origin issues and to disable unwanted activity.
  6. Render the page on the server and send the HTML to the user. Whenever the user interacts with the page, the server sends the changes that resulted from the interaction back to the user.
  7. Build a desktop application using WebKit, gaining full control over the UI, page rendering and everything else we would need.
  8. Build an internal application using WebKit that runs on a server and is accessed through a web-based VNC.

We rejected 7 and 8 because they would increase the barrier to entry for Portia, which was not something we wanted. This method is used by some companies for their spider creation tools.

We rejected 1 and 2 because it would be hard to fit the whole Portia UI into an addon in an acceptable way (we may revisit these in the future). ParseHub and Kimono use this method to great effect.

We investigated 3 and 4 further, inspired by the work done by LibreOffice for their Android document editor. In the end, though, this approach proved clunky, and we could achieve better performance by sending DOM updates rather than image tiles.

The Solution

The solution we have built is a combination of 5 and 6. The most important aspect is the server-side browser. This browser provides a tab for each user, allowing the page to be loaded and interacted with in a controlled manner. We looked at using existing solutions including Selenium, PhantomJS and Splash, all of which are wrappers around WebKit providing domain-specific functionality. We use Splash for our browser, not because it is a Scrapinghub technology, but because it is designed for web crawling rather than automated testing, making it a better fit for our requirements.
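To make this concrete, here is a minimal sketch of driving a Splash instance over its HTTP API, which is how a server-side render can be requested. The `/render.html`, `url` and `wait` parameters are part of Splash's documented API; the host and port are assumptions for a locally running instance.

```python
# Sketch: building a request URL for Splash's /render.html endpoint,
# which returns the page's HTML after JavaScript has executed.
from urllib.parse import urlencode


def build_render_url(splash_host, target_url, wait=0.5):
    """Build a Splash /render.html request URL for a target page.

    `wait` is how many seconds Splash waits after load before
    returning, giving scripts time to run.
    """
    params = urlencode({"url": target_url, "wait": wait})
    return "{}/render.html?{}".format(splash_host.rstrip("/"), params)


render_url = build_render_url("http://localhost:8050", "http://example.com")
# Fetching render_url (e.g. with urllib.request) would return the
# post-JavaScript HTML -- it requires a running Splash instance.
print(render_url)
# -> http://localhost:8050/render.html?url=http%3A%2F%2Fexample.com&wait=0.5
```

In practice a crawler would fetch this URL instead of the original one, so the page it sees is the same fully rendered page the user saw while creating the spider.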

The server-side browser receives input from the user. Websockets are used to send events and DOM updates between the user and the server. Initially we looked at React's virtual DOM; while it worked, it wasn't perfect. Luckily, there is a built-in solution, available in most browsers released since 2012, called MutationObserver. This, in conjunction with the Mutation Summary library, allows us to update the page in the UI when the user interacts with it.
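Conceptually, the mutation records produced on the browser side are serialised and applied on the other end of the websocket. The sketch below is illustrative, not Portia's actual wire protocol: the field names, the flat `{node_id: attributes}` DOM model and the helper functions are all assumptions made to show the patch-and-apply idea.

```python
# Sketch of a DOM-diff protocol: mutation records travel as small JSON
# messages, and the receiver applies them to its copy of the page.
import json


def make_patch(node_id, change_type, payload):
    """Build one mutation record; the field names are assumptions."""
    return {"id": node_id, "type": change_type, "data": payload}


def apply_patch(dom, patch):
    """Apply a record to a toy DOM modelled as {node_id: attributes}."""
    if patch["type"] == "attributes":
        dom.setdefault(patch["id"], {}).update(patch["data"])
    elif patch["type"] == "remove":
        dom.pop(patch["id"], None)
    return dom


dom = {"n1": {"tag": "div", "class": "old"}}
# One side observes a change and sends it over the websocket...
msg = json.dumps(make_patch("n1", "attributes", {"class": "new"}))
# ...and the other side decodes and applies it to its copy of the DOM.
apply_patch(dom, json.loads(msg))
print(dom["n1"]["class"])  # -> new
```

Sending only these small records, rather than whole screenshots or full HTML, is what makes the UI feel responsive.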

We now proxy all of the resources that the page needs rather than loading them directly from the original host. This means we can serve resources from our server-side browser's cache or from the original host, and we can provide SSL protection for resources even when the host doesn't provide it.
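A simple way to picture this is URL rewriting: before the page reaches the UI, its resource URLs are redirected through the server. The sketch below assumes a hypothetical `/proxy` endpoint and uses a deliberately naive regex; a real implementation would rewrite URLs while parsing the document.

```python
# Sketch: rewriting a page's src/href attributes so every asset is
# fetched through a server-side proxy endpoint instead of directly
# from the original host.
import re
from urllib.parse import quote


def proxify(html, proxy_prefix="/proxy?url="):
    """Point absolute http(s) resource URLs at the proxy endpoint."""
    def repl(match):
        attr, url = match.group(1), match.group(2)
        return '{}="{}{}"'.format(attr, proxy_prefix, quote(url, safe=""))
    return re.sub(r'(src|href)="(https?://[^"]+)"', repl, html)


page = '<img src="http://example.com/logo.png">'
print(proxify(page))
# -> <img src="/proxy?url=http%3A%2F%2Fexample.com%2Flogo.png">
```

With every request flowing through the server, cached copies can be reused and mixed-content warnings avoided, since the browser only ever talks to one origin.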

The Future

JavaScript support makes data available that was not accessible before.

Before JS support (left), after JS support (right).

For now we are very happy with how it works, and hopefully it will help users extract the data they need. This initial release provides the means to crawl and extract pages that require JavaScript, but we want to make it better! We are now building a system that allows actions to be recorded and replayed on pages during crawling. We hope this feature will make filling out forms, pressing buttons and triggering infinite scrolling simple and easy. If you have any ideas for features you would like to see in Portia, leave a comment below!
