As with everything in software, we started by investigating our requirements and what others had done in this situation. We were looking for a solution that was reliable and would allow for reproducible interaction with webpages.
Reliability: A solution that could render the pages in the same way during spider creation and crawling.
Interaction: A system that would allow us to record the user's actions so that they could be replayed while crawling.
The investigation produced some interesting and some crazy ideas; here are the ones we probed further:
1. Place the Portia UI inside a browser addon and use the additional privileges gained from being an addon to read from and interact with the page.
2. Place the Portia UI inside a bookmarklet and, after doing some post-processing of the page on our server, allow interaction with the page.
3. Render a static screenshot of the page along with the co-ordinates of all of its elements and send them to the UI; interaction involves re-rendering the whole screenshot.
4. Render a tiled screenshot of the page along with co-ordinates; when an interaction event is detected, update the representation on the server, send the updated tiles to the UI to be rendered, and send the updated DOM to the user.
5. Render the page in an iframe, using a proxy to avoid cross-origin issues and disable unwanted activity.
6. Render the page on the server and send the HTML to the user. Whenever the user interacts with the page, the server sends the changes that have happened as a result of the interaction to the user.
7. Build a desktop application using WebKit and have full control over the UI, page rendering and everything else we would need.
8. Build an internal application using WebKit that runs on a server and is accessed through web-based VNC.
We rejected 7 and 8 because they would increase the barrier to entry for Portia, which was not something we wanted. This method is used by Import.io for their spider creation tool.
1 and 2 were rejected because it would be hard to fit the whole Portia UI into an addon in an acceptable way (we may revisit these in the future). ParseHub and Kimono use this method to great effect.
3 and 4 were investigated further, inspired by the work done by LibreOffice for their Android document editor. In the end, though, it was clunky and we could achieve better performance by sending DOM updates rather than image tiles.
The solution we have now built is a combination of 5 and 6. The most important aspect is the server-side browser. This browser provides a tab for each user, allowing the page to be loaded and interacted with in a controlled manner. We looked at using existing solutions including Selenium, PhantomJS and Splash. All of these technologies are wrappers around WebKit providing domain-specific functionality. We use Splash for our browser not because it is a Scrapinghub technology, but because it is designed for web crawling rather than automated testing, making it a better fit for our requirements.
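To make the server-side browser concrete, here is a minimal sketch of asking Splash for a rendered page via its HTTP API's `render.html` endpoint. It assumes a Splash instance listening on `localhost:8050` (the default); the helper function and its name are our own for illustration.

```javascript
// Build a Splash render.html request URL from a base address and parameters.
// `url` and `wait` are real Splash HTTP API parameters; `wait` gives the
// page's scripts time to run before the rendered HTML is captured.
function buildRenderUrl(splashBase, pageUrl, waitSeconds) {
  const params = new URLSearchParams({
    url: pageUrl,
    wait: String(waitSeconds),
  });
  return `${splashBase}/render.html?${params.toString()}`;
}

// Usage (requires a running Splash instance, so not executed here):
// fetch(buildRenderUrl('http://localhost:8050', 'http://example.com', 0.5))
//   .then((res) => res.text())
//   .then((html) => console.log(html.length));
```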
The server-side browser receives input from the user. Websockets are used to send events and DOM updates between the user and the server. Initially we looked at React's virtual DOM; while it worked, it wasn't perfect. Luckily, there is a built-in solution, available in most browsers released since 2012, called MutationObserver. This, in conjunction with the Mutation Summary library, allows us to update the page in the UI when the user interacts with it.
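The DOM-sync idea can be sketched as follows: watch a page for mutations and forward compact patch messages over a websocket. This is a sketch only; the patch shape, the `targetId` field (we assume nodes are tagged with stable ids), and the function names are our own, not Portia's actual protocol.

```javascript
// Turn a MutationRecord-like object into a small, JSON-friendly patch.
// `targetId` is a hypothetical stable node id, not a real MutationRecord field.
function summarizeMutation(record) {
  return {
    type: record.type,                        // 'childList' | 'attributes' | 'characterData'
    target: record.targetId,
    attribute: record.attributeName || null,
    added: record.addedNodes ? record.addedNodes.length : 0,
    removed: record.removedNodes ? record.removedNodes.length : 0,
  };
}

// Hook an observer up to a socket (browser-only; defined but not run here).
function syncDomOverSocket(rootNode, socket) {
  const observer = new MutationObserver((records) => {
    socket.send(JSON.stringify(records.map(summarizeMutation)));
  });
  observer.observe(rootNode, {
    childList: true,
    attributes: true,
    characterData: true,
    subtree: true,
  });
  return observer; // caller can disconnect() when the tab closes
}
```

In the real system the receiving end applies these patches to its copy of the page; the Mutation Summary library does the heavier lifting of coalescing raw records into net changes.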
We now proxy all of the resources that the page needs rather than loading them from the host. The advantage of this is that we can load resources from the cache in our server-side browser or from the original host, and provide SSL protection for resources even if the host doesn't already provide it.
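The rewriting step behind this can be sketched as a single URL transformation: every resource URL in the page is redirected through the proxy over HTTPS. The proxy address and query-string format below are hypothetical, purely to show the shape of the idea.

```javascript
// Rewrite a resource URL so the browser fetches it through our proxy.
// Serving everything via the proxy lets us hit the server-side browser's
// cache and add SSL even when the origin serves plain HTTP.
function proxyResourceUrl(resourceUrl, proxyBase) {
  return `${proxyBase}?target=${encodeURIComponent(resourceUrl)}`;
}

// Example:
// proxyResourceUrl('http://example.com/logo.png', 'https://proxy.example/fetch')
```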