Web scraping is the technique of extracting information from websites. For mobile applications, it should be considered a last resort. Instead, try to get access to the underlying data via a documented REST web service API.
However, you may find that a REST or SOAP API is not available, and you may need to web scrape to get the web site data into your mobile application.
If you are going to web scrape, then don't do it in the mobile app. Instead, use a microservices platform, like Arrow. If you implement the web scraping in an Arrow middle-tier server, then when the web site changes, you can change your scraping algorithm without needing to publish a new mobile application.
This blog post will show a simple example of using Arrow Builder to build an API that utilizes web scraping.
The basic steps are outlined below:
- Create an Arrow project
- Install the x-ray npm
- Create a custom API (i.e. not based on a connector or model)
- Follow the documentation at x-ray
Execute the following command from the command line:
appc new
In your project folder, execute the following command from the command line:
npm install x-ray --save
In the project /api/ folder, create a new file. My simple example, geta.js, is shown below:
var Arrow = require('arrow');
var Xray = require('x-ray');
var x = Xray();

var GetA = Arrow.API.extend({
    group: 'xrayapis',
    path: '/api/geta',
    method: 'GET',
    description: 'this is an api that shows how to web scrape using x-ray npm',
    parameters: {},
    action: function (req, resp, next) {
        // Collect the text of every <a> element on the page
        x('http://google.com', ['a'])(function (err, a) {
            if (err) {
                // Report a scraping failure as HTTP 500
                resp.response.status(500);
                resp.send({"error": "cannot reach url"});
                next(false);
            } else {
                // Return the scraped link text as a JSON array
                resp.send({data: a});
                next();
            }
        });
    }
});

module.exports = GetA;
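Because the action above calls the live web site, both branches are awkward to try out locally. One possible approach is to pass the scraper in as a function, so a stub can stand in for x-ray. The sketch below is only an illustration: `makeAction`, `fakeResp`, and the two stub scrapers are hypothetical names, not part of Arrow or x-ray, and the fake `resp` object models only the calls used in geta.js.

```javascript
// Hypothetical helper (not part of Arrow or x-ray): builds an action function
// from any callback-style scraper, so the scraper can be swapped for a stub.
function makeAction(scrape) {
  return function (req, resp, next) {
    scrape(function (err, data) {
      if (err) {
        // Same failure handling as geta.js
        resp.response.status(500);
        resp.send({ error: 'cannot reach url' });
        next(false);
      } else {
        resp.send({ data: data });
        next();
      }
    });
  };
}

// Minimal stand-in for the resp object, recording only what the action uses.
function fakeResp() {
  var resp = {
    statusCode: 200,
    body: null,
    response: {
      status: function (code) { resp.statusCode = code; }
    },
    send: function (body) { resp.body = body; }
  };
  return resp;
}

// Stub scrapers that call back immediately instead of fetching a page.
function okScraper(cb) { cb(null, ['Images', 'Maps']); }
function failingScraper(cb) { cb(new Error('network down')); }

var ok = fakeResp();
makeAction(okScraper)({}, ok, function () {});
console.log(ok.statusCode, JSON.stringify(ok.body));
// prints: 200 {"data":["Images","Maps"]}

var failed = fakeResp();
makeAction(failingScraper)({}, failed, function () {});
console.log(failed.statusCode, JSON.stringify(failed.body));
// prints: 500 {"error":"cannot reach url"}
```

In the real API, the action could then be written as `action: makeAction(x('http://google.com', ['a']))`, since the value returned by `x(...)` is itself called with a callback in geta.js.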
Follow the documentation at x-ray
In my example above, I am extracting the text of all 'a' tags (links) on the web site http://google.com using the following:
x('http://google.com', ['a'])
When I run this API, it returns the following:
{
    "data": [
        "Images",
        "Maps",
        "Play",
        "YouTube",
        "News",
        "Gmail",
        "Drive",
        "More »",
        "Web History",
        "Settings",
        "Sign in",
        "Get Google Chrome",
        "Advanced search",
        "Language tools",
        "Advertising Programs",
        "Business Solutions",
        "+Google",
        "About Google",
        "Privacy",
        "Terms"
    ]
}
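The response above is the raw link text exactly as scraped. If the mobile client only needs unique, non-empty labels, a small clean-up step could run in the API before `resp.send`. The following is a minimal sketch under that assumption. The helper name `cleanLabels` and the sample input are illustrative and do not come from the example above.

```javascript
// Hypothetical post-processing step: trim whitespace, drop empty strings,
// and remove duplicates while keeping the original order.
function cleanLabels(labels) {
  var seen = {};
  var result = [];
  labels.forEach(function (label) {
    var text = String(label).trim();
    if (text !== '' && !Object.prototype.hasOwnProperty.call(seen, text)) {
      seen[text] = true;
      result.push(text);
    }
  });
  return result;
}

var cleaned = cleanLabels(['Images', ' Maps ', '', 'Images', 'Sign in']);
console.log(cleaned); // [ 'Images', 'Maps', 'Sign in' ]
```

Doing this in the middle tier keeps the payload small and means the clean-up rules can change without touching the mobile app.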
In this example, we saw how we can leverage the x-ray npm and Arrow to perform web scraping and expose the data as a mobile-optimized REST API. Furthermore, when the web site changes, you can modify your web scraping logic in the Arrow API and avoid the need to publish a new version of the mobile application.