@joewiz · Last active March 9, 2023
Web Scraping with XQuery

Overview

Learning how to web scrape empowers you to apply your XQuery skills to any data residing on the web. You can fetch data from remote sites and services—for example, entire web pages or just the pieces of a page that matter to you. Once fetched, you can perform further analysis on the data, clean it up, mash it up with other data, transform it into different formats, etc.

Built-in functions for making HTTP requests

XPath-based languages like XQuery offer a standard function for accessing remote documents: the fn:doc() function. However, a limitation of this function is that it only works if the URI returns a well-formed XML document. In practice, much of the web is not well-formed XML. To illustrate this, try using fn:doc() on a page like https://www.kingjamesbibleonline.org/Genesis-Chapter-1_Original-1611-KJV/:

xquery version "3.1";

fn:doc("https://www.kingjamesbibleonline.org/Genesis-Chapter-1_Original-1611-KJV/")

This query will return an error like this:

err:FODC0005 exerr:ERROR:

An error occurred while parsing https://www.kingjamesbibleonline.org/Genesis-Chapter-1_Original-1611-KJV/:

Open quote is expected for attribute "lang" associated with an element type "html".

This "parsing" error occurs because the source of the HTML file begins with this line:

<html lang=en>

While this is valid HTML 5 syntax, it is not valid XML, which would require quotes around the attribute value, e.g.:

<html lang="en">

To test whether a URI can return a well-formed document, you can use the fn:doc-available() function:

xquery version "3.1";

fn:doc-available("https://www.kingjamesbibleonline.org/Genesis-Chapter-1_Original-1611-KJV/")

This query will return the boolean value false(), indicating that the URI does not return a well-formed XML document. Using fn:doc-available() first lets your XQuery code handle this problem gracefully, rather than halting with an error like the one above.
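For example, a minimal sketch of this defensive pattern (my own, not from the site) might look like this:

xquery version "3.1";

let $url := "https://www.kingjamesbibleonline.org/Genesis-Chapter-1_Original-1611-KJV/"
return
    if (fn:doc-available($url)) then
        fn:doc($url)
    else
        (: not well-formed XML; fall back to another approach, such as
           the EXPath HTTP Client introduced below :)
        ()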

That said, using the fn:doc() function alone won't get you very far in screen scraping the web, where you want to be able to access HTML documents even if they are not well-formed, not to mention other resources like text or binary files. The solution to this problem is the EXPath HTTP Client.

The EXPath HTTP Client

The EXPath HTTP Client is a specification for a module, implemented in several products (including BaseX, eXist, MarkLogic, and Saxon), for performing HTTP requests. The module provides a single function, hc:send-request(), for issuing HTTP requests. Conveniently for our purposes, and unlike the fn:doc() function, the EXPath HTTP Client repairs non-well-formed HTML into well-formed XML. This makes the resulting documents accessible to XPath and XQuery.

The hc:send-request() function is a little more verbose than the fn:doc() function. It supports all HTTP methods (also known as verbs): not just GET (which fn:doc() is limited to), but also POST, PUT, DELETE, etc. To make an HTTP request, you must construct the request in an <hc:request> element, using two attributes to specify the URL and the HTTP method:

  • @href: the URL being requested
  • @method: the HTTP method

Besides these two attributes, the <hc:request> element can contain a number of request headers, in the form of <hc:header> elements, each with @name and @value attributes. In the examples here, we will include a header that closes the HTTP connection after each request, freeing memory that would otherwise remain allocated and could eventually cause your system to run out of memory or exceed its limit on open files. (Additional parameters may be included for more complex requests; see the EXPath HTTP Client specification for more information.)

So, to perform an fn:doc()-style HTTP GET request, we would write the following query:

xquery version "3.1";

import module namespace hc = "http://expath.org/ns/http-client";

let $url := "https://www.kingjamesbibleonline.org/Genesis-Chapter-1_Original-1611-KJV/"
let $request := 
    <hc:request href="{$url}" method="GET">
        <hc:header name="Connection" value="close"/>    
    </hc:request>
return
    hc:send-request($request)

The hc:send-request() function returns two items:

  1. The remote server's response header, wrapped in an <hc:response> element
  2. The remote server's response body

The response header looks like this:

<hc:response xmlns:hc="http://expath.org/ns/http-client" status="200" message="OK" spent-millis="1154">
    <hc:header name="date" value="Sat, 18 Mar 2017 18:55:22 GMT"/>
    <hc:header name="server" value="Apache"/>
    <hc:header name="x-powered-by" value="PHP/5.4.45"/>
    <hc:header name="cache-control" value="max-age=9200, must-revalidate"/>
    <hc:header name="expires" value="Sun, 19 Mar 2017 18:55:22 GMT"/>
    <hc:header name="connection" value="close"/>
    <hc:header name="transfer-encoding" value="chunked"/>
    <hc:header name="content-type" value="text/html"/>
    <hc:body media-type="text/html"/>
</hc:response>

The header isn't terribly exciting, but it contains valuable troubleshooting information, telling you whether the request was successful and what format to expect in the response body:

  • The root element's @status and @message attributes tell you whether the request was successful. A @status of 404 or 500 would indicate, respectively, that the requested page could not be found or that there was a server error. There are many HTTP response codes; they are easily looked up online if you do not recognize them.
  • The content-type header is also useful for determining what kind of data is going to be returned in the response body. In the case of this request, the content-type is reported as text/html. There are many content-type values for different file formats; these, too, are easily looked up online. (The sketch after this list shows how to check both values in a query.)
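Here is a minimal sketch that checks these values before doing anything with the body (the variable names are my own):

xquery version "3.1";

import module namespace hc = "http://expath.org/ns/http-client";

let $url := "https://www.kingjamesbibleonline.org/Genesis-Chapter-1_Original-1611-KJV/"
let $request := 
    <hc:request href="{$url}" method="GET">
        <hc:header name="Connection" value="close"/>
    </hc:request>
let $response := hc:send-request($request)
let $head := $response[1]
return
    if ($head/@status eq "200") then
        "Success: response body is " || $head/hc:body/@media-type
    else
        "Request failed: " || $head/@status || " " || $head/@message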

If the response body is XML or can be repaired into well-formed XML, the body can be queried. In the case of this request, the non-well-formed HTML was repaired into well-formed XML:

<html xmlns="http://www.w3.org/1999/xhtml" xmlns:html="http://www.w3.org/1999/xhtml" lang="en">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
        <title>GENESIS CHAPTER 1 &#160;(ORIGINAL 1611 KJV)</title>
        <!-- snip -->
    </head>
    <body>
        <div class="wrapper">
            <!-- snip -->
            <section>
                <div class="main">
                    <!-- snip -->
                    <div class="dottedborder">
                        <!-- snip -->
                        <p>
                            <span style="color:#666666;font-size:16px;">CHAP. I.</span>
                        </p>
                        <p style="margin:0px;text-indent:0px;">
                            <span style="color:#666666;font-size:9px;">1 The creation of Heauen and
                                Earth, 3 of the light, 6 of the firmament, 9 of the earth separated
                                from the waters, 11 and made fruitfull, 14 of the Sunne, Moone, and
                                Starres, 20 of fish and fowle, 24 of beasts and cattell, 26 of Man
                                in the Image of God. 29 Also the appointment of food.</span>
                            <br clear="none" />
                            <br clear="none" />
                        </p>
                        <p class="indent">
                            <strong>
                                <a shape="rect" class="red" href="../1611_Genesis-1-1/"
                                    title="View more translations of Genesis 1:1">
                                    <sup class="red">1</sup>In the beginning God created the Heauen,
                                    and the Earth.</a>
                            </strong>
                            <a shape="rect" href="#1" style="color:#666666;font-size:8px;"
                                title="Reference: Psal.33.6. and 136.5. acts.14.15. and 17.24. Hebr.11.3.">
                                <sup style="color:#666666;font-size:9px;">1</sup>
                            </a>
                        </p>
                        <p class="indent">
                            <sup style="color:#666666;">2</sup>
                            <a shape="rect" href="../1611_Genesis-1-2/"
                                title="View more translations of Genesis 1:2">And the earth was
                                without forme, and voyd, and darkenesse was vpon the face of the
                                deepe: and the Spirit of God mooued vpon the face of the waters.</a>
                        </p>
                        <!-- snip -->

These paragraphs can be retrieved by a query like the following:

xquery version "3.1";

import module namespace hc = "http://expath.org/ns/http-client";

declare namespace html = "http://www.w3.org/1999/xhtml";

let $url := "https://www.kingjamesbibleonline.org/Genesis-Chapter-1_Original-1611-KJV/"
let $request := 
    <hc:request href="{$url}" method="GET">
        <hc:header name="Connection" value="close"/>    
    </hc:request>
let $response := hc:send-request($request)
let $response-head := $response[1]
let $response-body := $response[2]
return
    $response-body//html:p

Storing the results

Typically, when screen scraping, you will want to store the response on your local system. This lets you refine your queries against a single stored document before applying them to the entire corpus. The method for storing documents differs by XQuery implementation, so here we'll use eXist's, the xmldb:store() function. This function takes three parameters:

  • the name of the database collection where you plan to store the document
  • the name of the document
  • the content of the document

In the following query we store the original request and the full response. Why store more than just the response body? The request records exactly what request we made, and the response head gives us troubleshooting and other metadata about the response.

xquery version "3.1";

import module namespace hc = "http://expath.org/ns/http-client";

declare namespace html = "http://www.w3.org/1999/xhtml";

let $base-url := "https://www.kingjamesbibleonline.org/"
let $resource-name := "Genesis-Chapter-1_Original-1611-KJV"
let $url := $base-url || $resource-name || "/"
let $request := 
    <hc:request href="{$url}" method="GET">
        <hc:header name="Connection" value="close"/>    
    </hc:request>
let $response := hc:send-request($request)
let $record := 
    <record> 
        <date>{ current-dateTime() }</date>
        {
            $request,
            $response
        }
    </record>
return
    xmldb:store("/db/kjv1911", $resource-name || ".xml", $record)

This query will store the record in the eXist database as /db/kjv1911/Genesis-Chapter-1_Original-1611-KJV.xml.

Now we can query this document to our heart's content without making a request each time.

xquery version "3.1";

declare namespace hc = "http://expath.org/ns/http-client";
declare namespace html = "http://www.w3.org/1999/xhtml";

let $doc := doc("/db/kjv1911/Genesis-Chapter-1_Original-1611-KJV.xml")
return
    $doc//html:p

If all of the content you need is on a single page, then you've retrieved it and can use XPath and XQuery to query and/or clean up the contents of the page. For example, you might want to remove the headers and footers, strip extraneous markup, and convert the body of the document to TEI.
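As a taste of what such a conversion might look like, here is a minimal sketch (the TEI structure and the "indent" class test are illustrative assumptions based on the snippet above, not a complete conversion):

xquery version "3.1";

declare namespace html = "http://www.w3.org/1999/xhtml";

let $doc := doc("/db/kjv1911/Genesis-Chapter-1_Original-1611-KJV.xml")
return
    <text xmlns="http://www.tei-c.org/ns/1.0">
        <body>
            <div>
                {
                    (: keep just the text of each verse paragraph; a real
                       conversion would preserve verse numbers, notes, etc. :)
                    for $p in $doc//html:p[@class eq "indent"]
                    return <p>{ normalize-space($p) }</p>
                }
            </div>
        </body>
    </text>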

But if your content is spread across multiple pages, you can retrieve them one at a time using this technique; if that becomes cumbersome, you will need a method for retrieving multiple documents automatically.

Batch downloading

At a high level, batch downloading is as simple as performing an XQuery FLWOR expression: to fetch a collection of pages, you iterate over a sequence of URLs, making an HTTP request for each one and storing the result. But where does this sequence of URLs come from? How do you feed it into your FLWOR expression? There are a couple of possibilities; which you choose depends on the structure of the site.

Pre-generating the list of URLs

Look at the structure of the site you're trying to scrape. Do you notice a pattern in the URLs that you can leverage to construct them programmatically? For example, if the site's pages contain a sequential number, e.g., page-1.html, page-2.html, page-3.html, you can construct the full URLs with the simple map operator (!) or an equivalent FLWOR expression:

xquery version "3.1";

let $base-url := "http://www.example.com/"
let $urls := (1 to 10) ! ($base-url || "page-" || . || ".html")
return
    $urls

This query returns ten URLs, from page-1.html to page-10.html.
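Combining this with the request-and-store pattern from earlier, a batch download might look like the following sketch (the URLs and the /db/example collection are placeholders of my own):

xquery version "3.1";

import module namespace hc = "http://expath.org/ns/http-client";

let $base-url := "http://www.example.com/"
for $i in 1 to 10
let $resource-name := "page-" || $i || ".html"
let $request := 
    <hc:request href="{$base-url || $resource-name}" method="GET">
        <hc:header name="Connection" value="close"/>
    </hc:request>
let $response := hc:send-request($request)
let $record := 
    <record>
        <date>{ current-dateTime() }</date>
        { $request, $response }
    </record>
return
    (: store one record per page, named after the page it came from :)
    xmldb:store("/db/example", $resource-name || ".xml", $record)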

Querying for the URLs

On many sites, though, like the King James Bible site above, the URLs aren't quite so predictable. Here are the general approaches for when you cannot pre-generate the URLs:

  • Find a table of contents page, and request all of the links on that page (e.g., $page//html:a/@href).
  • Find a page, and request the link to the "next page" button; repeat until you've reached the last page (e.g., $page//html:a[. eq 'Next Page']/@href).
  • Some combination of these two.

For the King James Bible site above, you will probably need some combination of the two approaches.
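As an example of the first approach, a sketch like this could gather candidate URLs from a table of contents page (the table-of-contents URL here is hypothetical, and in practice you would filter the links down to just the chapters you want):

xquery version "3.1";

import module namespace hc = "http://expath.org/ns/http-client";

declare namespace html = "http://www.w3.org/1999/xhtml";

let $toc-url := "https://www.kingjamesbibleonline.org/1611-Bible/"
let $request := 
    <hc:request href="{$toc-url}" method="GET">
        <hc:header name="Connection" value="close"/>
    </hc:request>
let $response := hc:send-request($request)
let $body := $response[2]
return
    (: every link on the page; filter these before batch downloading :)
    distinct-values($body//html:a/@href)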

Requesting text and binary files

Besides downloading XML and turning HTML into XML, the EXPath HTTP Client can also download non-XML resources, like text files or binary files such as images. While you make requests the same way, the way you receive and process responses is different. First, you should look closely at the response head to see what content-type the remote server says the resource is. Let's take a look at the response head for an image on the page above, http://www.kingjamesbibleonline.org/1611-Bible-Original/Genesis-Chapter-1-1.jpg:

xquery version "3.1";

import module namespace hc = "http://expath.org/ns/http-client";

let $url := "http://www.kingjamesbibleonline.org/1611-Bible-Original/Genesis-Chapter-1-1.jpg"
let $request := 
    <hc:request href="{$url}" method="HEAD">
        <hc:header name="Connection" value="close"/>    
    </hc:request>
return
    hc:send-request($request)

Notice that here we issued a HEAD request, rather than a GET request. A HEAD request returns only the response head, so we don't waste bandwidth fetching the entire resource when all we want to inspect is its metadata. Here's the result:

<hc:response xmlns:hc="http://expath.org/ns/http-client" status="200" message="OK" spent-millis="140">
    <hc:header name="server" value="ApacheBooster/1.8"/>
    <hc:header name="date" value="Sun, 19 Mar 2017 19:35:58 GMT"/>
    <hc:header name="content-type" value="image/jpeg"/>
    <hc:header name="content-length" value="162102"/>
    <hc:header name="last-modified" value="Thu, 20 Jun 2013 21:17:59 GMT"/>
    <hc:header name="connection" value="close"/>
    <hc:header name="vary" value="Accept-Encoding"/>
    <hc:header name="etag" value=""51c37187-27936""/>
    <hc:header name="expires" value="Sun, 26 Mar 2017 19:35:58 GMT"/>
    <hc:header name="cache-control" value="max-age=604800"/>
    <hc:header name="x-cache" value="HIT from Backend"/>
    <hc:header name="accept-ranges" value="bytes"/>
</hc:response>

The content-type response header tells us that the file is of type image/jpeg.
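Knowing the content type, we can issue a GET request for the image itself and store the result. In eXist, xmldb:store() accepts a fourth parameter specifying the MIME type, which is needed for binary resources; here is a sketch (assuming the /db/kjv1911 collection from earlier):

xquery version "3.1";

import module namespace hc = "http://expath.org/ns/http-client";

let $url := "http://www.kingjamesbibleonline.org/1611-Bible-Original/Genesis-Chapter-1-1.jpg"
let $request := 
    <hc:request href="{$url}" method="GET">
        <hc:header name="Connection" value="close"/>
    </hc:request>
let $response := hc:send-request($request)
let $body := $response[2] (: for image/jpeg, an xs:base64Binary value :)
return
    xmldb:store("/db/kjv1911", "Genesis-Chapter-1-1.jpg", $body, "image/jpeg")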

More to come...

Conclusion

This tutorial covers the basics of using the EXPath HTTP Client to make HTTP GET requests for remote resources (whether well-formed XML pages, non-well-formed HTML pages, or non-XML text or binary resources), individually or in batches. Once you've got these sources on your system, you're ready to begin querying them and transforming them into the form you need.
