@lbrenman
Last active March 15, 2017 16:28
Arrow Web Scraping Example using x-ray

Arrow Web Scraping

Web scraping is a technique for extracting information from websites. For mobile applications, it should be considered a last resort. Instead, try to get access to the underlying data via a documented REST web service API.

However, you may find that a REST or SOAP API is not available, and you may need to web scrape in order to get the web site data into your mobile application.

If you are going to web scrape, don't do it in the mobile app. Instead, use a microservices platform like Arrow. With the scraping implemented in an Arrow middle-tier server, you can change your scraping algorithm when the web site changes without publishing a new mobile application.
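To make that separation concrete, here is a minimal sketch in plain JavaScript. The names `scrapeTarget` and `toClientPayload` are hypothetical and are not part of Arrow or x-ray; the point is only that the scraping details sit in one server-side place while the JSON shape seen by the mobile app stays fixed. The complete Arrow API appears later in this post.

```javascript
// Hypothetical sketch: these names are illustrative, not Arrow or x-ray APIs.

// Scraping details live only on the server. When the web site's markup
// changes, this is the only object that needs editing.
var scrapeTarget = {
  url: 'http://google.com',
  selector: ['a']
};

// The JSON shape the mobile app depends on. It stays the same no matter
// how scrapeTarget changes.
function toClientPayload(err, items) {
  if (err) {
    return { status: 500, body: { error: 'cannot reach url' } };
  }
  return { status: 200, body: { data: items } };
}
```

Because the mobile app only ever sees `body`, a change to `scrapeTarget.selector` is a server-side deployment and not an app store release.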

This blog post will show a simple example of using Arrow Builder to build an API that utilizes web scraping.

The basic steps are outlined below:

  1. Create an Arrow project
  2. Install the x-ray npm
  3. Create a custom API (i.e. not based on a connector or model)
  4. Follow the documentation at x-ray

Create an Arrow project

Execute the following command from the command line:

appc new

Install the x-ray npm

In your project folder, execute the following command from the command line:

npm install x-ray --save

Create a custom API

In the project's apis/ folder, create a new file. My simple example, geta.js, is shown below:

var Arrow = require('arrow');
var Xray = require('x-ray');
var x = Xray();

var GetA = Arrow.API.extend({
	group: 'xrayapis',
	path: '/api/geta',
	method: 'GET',
	description: 'this is an api that shows how to web scrape using x-ray npm',
	parameters: {},
	action: function (req, resp, next) {
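		// Scrape the text of every <a> element on the page
		x('http://google.com', ['a'])(function(err, a) {
			if(err) {
				// Scrape failed: report a server error to the caller
				resp.response.status(500);
				resp.send({"error": "cannot reach url"});
				next(false);
			} else {
				// Return the scraped link text as JSON
				resp.send({data: a});
				next();
			}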
		});
	}
});

module.exports = GetA;

Follow the documentation at x-ray

In my example above, I am extracting the text of every 'a' (anchor) element on the web site http://google.com using the following:

x('http://google.com', ['a'])
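The same call shape works with other selectors. As a minimal sketch, assuming x-ray's documented attribute syntax (`selector@attribute`), the helper below collects link targets instead of link text. The function name `scrapeLinkTargets` is hypothetical, and the x-ray instance is passed in as an argument so the helper can be exercised without a network connection.

```javascript
// Hypothetical helper, not part of geta.js.
// x    : an x-ray instance, i.e. the result of calling Xray()
// url  : the page to scrape
// done : Node-style callback, called as done(err, hrefs)
function scrapeLinkTargets(x, url, done) {
  // 'a@href' selects the href attribute of each anchor instead of its text;
  // wrapping the selector in an array collects every match, not just the first.
  x(url, ['a@href'])(done);
}
```

Inside the Arrow action, this could be called as `scrapeLinkTargets(x, 'http://google.com', callback)` with the `x` created at the top of geta.js. The rest of this post sticks with the plain `['a']` selector.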

When I run this API, it returns the following:

{
  "data": [
    "Images",
    "Maps",
    "Play",
    "YouTube",
    "News",
    "Gmail",
    "Drive",
    "More »",
    "Web History",
    "Settings",
    "Sign in",
    "Get Google Chrome",
    "Advanced search",
    "Language tools",
    "Advertising Programs",
    "Business Solutions",
    "+Google",
    "About Google",
    "Privacy",
    "Terms"
  ]
}
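Scraped text often needs a little tidying before it goes to a mobile client. The sketch below is one possible post-processing step that could run just before `resp.send({data: a})`; the helper name `tidyLinkText` is hypothetical and the sample input is made up for illustration. It trims whitespace, drops empty strings, and removes duplicates while keeping the original order.

```javascript
// Hypothetical helper, not part of geta.js: tidy the scraped link text
// before it is sent to the mobile client.
function tidyLinkText(items) {
  return (items || [])
    // Normalize every entry to a trimmed string
    .map(function (text) { return String(text).trim(); })
    // Keep non-empty entries, and only the first occurrence of each value
    .filter(function (text, index, all) {
      return text !== '' && all.indexOf(text) === index;
    });
}

// Made-up sample input for illustration
var sample = ['Images', ' Maps ', '', 'Images', 'More »'];
console.log(tidyLinkText(sample));
// logs the three unique, non-empty entries: Images, Maps, More »
```

Doing this on the server keeps the payload small and means the cleanup rules can change without touching the mobile app.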

Summary

In this example, we saw how to leverage the x-ray npm and Arrow to perform web scraping and expose the data as a mobile-optimized REST API. Furthermore, when the web site changes, you can modify your web scraping logic in the Arrow API and avoid publishing a new version of the mobile application.
