ruairif/portia_js.rst Secret

The Road to loading JavaScript in Portia

Support for pages that rely heavily on JavaScript has been a much-requested
feature ever since Portia was released just over two years ago. The wait is
nearly over, and we are happy to be launching these changes in the very near
future (if you are adventurous, you can try it out right now in the develop
branch on GitHub). This post isn't an announcement of the release of
JavaScript support within Portia; instead, we want to explain how we added
this support.

The Plan

As with everything in software, we started out by investigating what our
requirements were and what others had done in this situation. We were looking
for a solution that was reliable and would allow for reproducible interaction
with web pages:

- Reliability: A solution that could render the pages in the same way during
  spider creation and crawling.
- Interaction: A system that would allow us to record the user's actions so
  that they could be replayed while crawling.

The Investigation

The investigation produced some interesting and some crazy ideas. Here are
the ones we probed further:

1. Place the Portia UI inside a browser addon and, using the additional
   privileges gained from being an addon, read from and interact with the
   page.
2. Place the Portia UI inside a bookmarklet and, after doing some
   post-processing of the page on our server, allow interaction with the
   page.
3. Render a static screenshot of the page along with the coordinates of all
   of its elements and send them to the UI. Every interaction involves
   re-rendering the whole screenshot.
4. Render a tiled screenshot of the page along with coordinates and, when an
   interaction event is detected, update the representation on the server,
   send the updated tiles to the UI to be rendered and send the updated DOM
   to the user.
5. Render the page in an iframe, but use a proxy to avoid cross-origin issues
   and disable unwanted activity.
6. Render the page on the server and send the HTML to the user. Whenever the
   user interacts with the page, the server sends the user the changes that
   have happened as a result of the interaction.
7. Build a desktop application using WebKit and have full control over the
   UI, page rendering and everything else we would need.
8. Build an internal application using WebKit that runs on a server and is
   accessed through a web-based VNC client.

We rejected 7 and 8 because they would raise the barrier to entry for
Portia, which was not something we wanted. This method is used by Import.io
for their spider creation tool.

1 and 2 were rejected because it would be hard to fit the whole Portia UI into
an addon in an acceptable way (we may revisit these in the future). ParseHub
and Kimono use this method to great effect.

3 and 4 were investigated further, inspired by the work done by LibreOffice
for their Android document editor. In the end, though, this approach was
clunky, and we could achieve better performance by sending DOM updates rather
than image tiles.

The Solution

The solution we have now built is a combination of 5 and 6. The most important
aspect is the server-side browser. This browser provides a tab for each user,
allowing the page to be loaded and interacted with in a controlled manner. We
looked at using existing solutions including Selenium, PhantomJS and Splash.
PhantomJS and Splash are wrappers around WebKit, while Selenium automates full
browsers; each provides domain-specific functionality. We use Splash for our
browser not because it is a Scrapinghub technology but because it is designed
for web crawling rather than automated testing, making it a better fit for our
requirements.

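
To make the role of Splash more concrete, here is a minimal sketch of asking a
Splash instance for the JavaScript-rendered HTML of a page through its HTTP
API. The `render.html` endpoint and its `url` and `wait` arguments belong to
Splash; the address `localhost:8050` (Splash's default port) and the helper
names are assumptions made for this example, and this is not necessarily how
Portia itself integrates Splash.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a Splash instance is listening on its default port.
SPLASH_BASE = "http://localhost:8050"


def splash_render_url(page_url, wait=0.5, base=SPLASH_BASE):
    """Build a request URL for Splash's render.html endpoint."""
    query = urlencode({"url": page_url, "wait": wait})
    return f"{base}/render.html?{query}"


def fetch_rendered_html(page_url, wait=0.5, timeout=30):
    """Return the page's HTML after Splash has run its JavaScript."""
    with urlopen(splash_render_url(page_url, wait), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling `fetch_rendered_html("http://example.com/")` requires a running Splash
instance; `splash_render_url` can be used on its own to inspect the request
that would be made.
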
The server-side browser gets input from the user. WebSockets are used to send
events and DOM updates between the user and the server. Initially we looked at
React's virtual DOM; while it worked, it wasn't perfect. Luckily, there is a
built-in solution, available in most browsers released since 2012, called
MutationObserver. This, in conjunction with the Mutation Summary library,
allows us to update the page in the UI for the user when they interact with it.

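
The update flow can be modelled in a few lines. This is only an illustrative
sketch: the message shape (`op`, `id`, `parent` and so on) and the
dictionary-based mirror of the page are invented for the example and are not
Portia's wire format or the Mutation Summary API. In the real system the
changes are observed in the server-side browser and applied to the DOM in the
user's browser.

```python
import json


def encode_changes(changes):
    """Serialise a batch of DOM changes as one WebSocket text frame."""
    return json.dumps({"type": "mutation", "changes": changes})


def apply_changes(nodes, frame):
    """Apply a batch of changes to a {node_id: node} mirror of the page."""
    message = json.loads(frame)
    for change in message["changes"]:
        node_id = change["id"]
        if change["op"] == "add":
            nodes[node_id] = {"tag": change["tag"], "attrs": {},
                              "parent": change["parent"]}
        elif change["op"] == "remove":
            nodes.pop(node_id, None)
        elif change["op"] == "attr":
            nodes[node_id]["attrs"][change["name"]] = change["value"]
    return nodes


# The user's copy of the page before an interaction (hypothetical data).
mirror = {1: {"tag": "ul", "attrs": {}, "parent": None}}

# Changes seen on the server after, say, a "load more" click.
frame = encode_changes([
    {"op": "add", "id": 2, "tag": "li", "parent": 1},
    {"op": "attr", "id": 2, "name": "class", "value": "item"},
])
apply_changes(mirror, frame)
```

Sending only the changes keeps each frame small compared with resending the
whole page after every interaction.
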
We now proxy all of the resources that the page needs rather than loading
them from the host. The advantage of this is that we can load resources from
the cache in our server-side browser or from the original host, and provide
SSL protection for the resources if the host doesn't already provide it.
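
As an illustration of the proxying step, the sketch below rewrites a resource
URL found in a page so that the user's browser requests it from our server
instead of the original host. The `/proxy` path, the `url` query parameter and
the `portia.example.com` host are hypothetical names chosen for this example.

```python
from urllib.parse import urlencode, urljoin

# Hypothetical proxy endpoint, served over HTTPS by our own server.
PROXY_ENDPOINT = "https://portia.example.com/proxy"


def proxied(resource_url, page_url, endpoint=PROXY_ENDPOINT):
    """Return a URL that loads the resource through the proxy."""
    # Resolve relative references against the page they came from.
    absolute = urljoin(page_url, resource_url)
    query = urlencode({"url": absolute})
    return f"{endpoint}?{query}"
```

For example, `proxied("img/logo.png", "http://shop.example.com/items/")`
points at the proxy over HTTPS even though the original image is only
available over plain HTTP.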

The Future


Before JS support (left), after JS support (right).

For now we are very happy with how it works, and hopefully it will help users
extract the data they need. This initial release will provide the means to
crawl and extract data from pages that require JavaScript, but we want to make
it better! We are now building a system to allow actions to be recorded and
replayed on pages during crawling. We hope this feature will make filling out
forms, pressing buttons and triggering infinite scrolling simple and easy. If
you have any ideas for features you would like to see in Portia, leave a
comment below!