
How to JSONify a Mess of Legal Summaries With Node.js

Have you ever been to the Virginia judicial system's website for state supreme court opinions, found the index of slip opinions going back to 1995 on one long-scrolling page, and wondered, how could this be made more useful? Instead of politely asking the clerk of the court to set up a web API, let's see how long it takes to do it ourselves.

But why?

Maybe our legal reporter or scholar needs a better way to sort and track court opinions. Maybe we're building an app to aggregate legal abstracts in a friendly JSON format. Maybe we're looking for ways to turn a public or government document dumping ground into something more usable. Or maybe we want to practice web scraping and JSONifying something with Node.js.

Prerequisites

To follow along properly with this tutorial, we must know HyperText Markup Language (HTML) well enough to understand what's meant by parent and child elements and the document object model (DOM).

We must also be familiar with jQuery selectors and ways to manipulate strings with JavaScript. If you don't know how to begin doing such things, look them up, find a tutorial, practice, and come back.

We'll need to have Node.js installed on our computer. JavaScript is used by web browsers to manipulate HTML, but with Node we'll use JavaScript outside the web browser to request a webpage over the Internet, process the page, and send the results to our web browser for display.

We'll need to bring up our computer's command-line interface and be comfortable with running some basic commands to get our project up and running.

And of course, we should already be familiar with JavaScript Object Notation (JSON), a data-interchange format we'll use to compile and present the information we extract.

What we want

For each court opinion listed on the target webpage, we want to extract a case name, a docket number, a date, a summary, and links to the full-text PDF, and compile the gathered information into JSON. We'll specify these target data in more detail below.
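
Concretely, for each opinion we're after an object shaped roughly like this (the field names are our own choice, and this sample is built from one of the opinions we'll see in the raw HTML shortly):

{
  "case_name": "Butler v. Fairfax County School Board",
  "docket_number": "150150",
  "date": "12/17/2015",
  "summary": "Code § 22.1-296.1(A) requires, as a condition precedent to employment, that ...",
  "hrefs": [{
    "name": "150150",
    "href": "http://www.courts.state.va.us/opinions/opnscvwp/1150150.pdf"
  }]
}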

Can we do this?

Browse the source code for our target webpage containing an index of opinions delivered by the Virginia Supreme Court: http://www.courts.state.va.us/scndex.htm

Many government websites leave much to be desired when it comes to the presentation of information in a useful way. But here in the Commonwealth of Virginia, we are lucky the high court (or its content management system) writes HTML in a consistent (if not totally semantic) way. If it weren't for this, scraping and parsing this information into JSON might not be feasible.

Most of the opinion information is contained in HTML paragraph elements that read like this:

<p><a name="1150150_20151217"></a><a href="opinions/opnscvwp/1150150.pdf"> 150150</a> <b>Butler v. Fairfax County School Board</b>  12/17/2015
Code § 22.1-296.1(A) requires, as a condition precedent to employment, that every applicant for employment by a school board must [yada yada yada]. [There] was no showing that the teacher reasonably relied on representations of the board to her detriment and, since the contract was void ab initio, it cannot form the basis for a claim of estoppel. The judgment is affirmed.
</p>

Some of the paragraph elements differ in form slightly by not including a PDF link on the docket number because the opinion was combined with that for another case:

<p><a name="1141820_20151112"> 141820</a> <b>Commonwealth v. Chilton</b>  11/12/2015
In two appeals from decisions of the Court of Appeals of Virginia, it is held that the wounding or bodily injury element necessary to prove the crime of strangulation in violation of Code § 18.2-51.6 does not require that the victim experience any observable wounds, cuts, or breaking of the skin, broken bones or bruises. [Yada yada yada.] Proof of some form of bodily injury is required to support conviction under this statute. The judgments are affirmed.
 <a href="opinions/opnscvwp/1141650.pdf"> Combined case with Record No. 141650 </a><br></p>

That said, all the opinion information miraculously seems to be in one of those two formats and neatly contained by paragraph elements. Had the court used semantic markup to indicate each opinion's date and set apart the summary, our job here would be trivial, but what they do give us is something we can work with.

Prepping the machinery

Assuming we have Node.js and Node's package manager, npm, installed, let's set up a workspace (a folder or directory) with a descriptive name, initialize a Node project (using default settings), and install some of our extra tools: express, request, and cheerio. Press enter at each of npm init's prompts to elect a default setting. The --save flag for npm install automatically records these tools as dependencies in the Node project's package.json file, where we keep track of the project's meta information and other requirements. Chances are, you'll type in commands that look like this:

mkdir jsonify_va_court_opinions
cd jsonify_va_court_opinions
npm init
npm install express request cheerio --save

Express is a web framework that'll let us use our web browser as a user interface for displaying the results from our web scraping script. Express isn't necessary for web scraping, but it conveniently allows us to use a web browser to display our final JSON and the intermediary steps. Request allows our application to request a document over the Internet, and Cheerio will let us traverse a webpage's DOM with jQuery-style selectors and extract information.

We're going to create a file called index.js in our project's folder and make sure package.json references index.js in the main field. That lets Node know we'll be putting all our code in index.js and that code begins like this:

// index.js
var request = require('request'); // HTTP client used to request a webpage
var express = require('express'); // HTTP server - middle of the action
var cheerio = require('cheerio'); // power up jQuery on the server

var app = express(); // initialize our web application's server

app.listen( 4000, function() {
  console.log('Hit up port 4000 to make it happen.');
} );

Notice our dependencies -- libraries of outside code that we gathered and installed earlier to be used within our own code -- are declared and assigned a local name for our reference with a syntax like this: var moduleName = require('module-name'). If a dependency is not installed, Node will complain that a module couldn't be found.

We also assign the name app to our application, instantiate the app as an instance of an Express HTTP server (express()), and have our app listen for requests on our local computer's port 4000.

Next let's add a route for /scrape that triggers the app:

app.get( '/scrape', function( req, res ) {

  // target for our scraping, opinions from Virginia's highest state court
  var url = 'http://www.courts.state.va.us/scndex.htm';

  console.log('Scraping target:', url);

  // here we want to request the webpage at the given url

  // and then we want to craft JSON from the info we extract

} );

What we've set up is a way to activate our web scrape by visiting a page hosted by our app on our local machine. Time to test the code so far. Run node index.js or nodemon and direct your web browser to [http://localhost:4000/scrape](http://localhost:4000/scrape). The browser shouldn't render anything, but make sure the app logs a message about the scraping target to the console. Now we're ready to figure out how to get our target document into our app for scraping. (If we use nodemon, we won't have to keep killing and restarting the server after each change to the code.)
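
From the project folder, that's one of these two commands (nodemon is a separate, optional global install via npm install -g nodemon):

node index.js
nodemon index.js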

Making a request

First things first, we want to get an HTML document from the court's web server to our application, not via a web browser. That's where request comes in. We'll be using the "super simple to use" example from Request's documentation to fashion our HTTP request for the document containing the information we want to parse, but instead of rendering the page on the screen, we have other plans.

A request takes two arguments: the URL for the target page and a callback function invoked to deal with the result of our request. The callback function should handle an error if the request fails and should do something with whatever data it receives.

The callback function's error argument should be null unless the Virginia server is down or the URL is incorrect. If error is null, then response and body contain stuff we want to look at. If you've never used request or you're not sure what's going on here, put a request like the one below at the end of our function for the /scrape route and play around with it to see what response and body actually look like. If you want to see what an error looks like, temporarily put a typo in the URL.

request( url, function( error, response, body ) {

  console.error('error:', error);     // print `error` to terminal, may be null
  console.log('response:', response); // print response to terminal
  res.send(body);                     // send body of response to our web browser

} );

We are obviously not prone to making mistakes, but let's add error handling in case children are watching. Our web scrape can only work with the actual HTML of the webpage, and we only get that if the request is successful, which is when the request's error parameter is null and the status code of the request's response is 200 / OK. The request's response contains the raw data received from the remote server, such as the HTTP status code, while body is just the requested document (as enclosed in the response). Because the callback's body argument is the HTML document we want to parse, let's go ahead and rename body to html as a reminder to ourselves.

request( url, function( error, response, html ) {

  if ( !error && response.statusCode == 200 ) { // status code 200 means OK

    // and then we want to craft JSON from the info we extract

    res.send(html); // or, instead, just show us the HTML we got for now

  } else if (error) { // if we fail to get a response from the Virginia server
    console.error(error);
  } else { // if the status code of the response is anything other than OK
    console.log(response);
  }

} );

With our request looking like this, we should see a mirror of our target URL's content (less the images) when we hit our application with a web browser at [http://localhost:4000/scrape](http://localhost:4000/scrape). Here's a recap of how our components are working right now.

  1. We direct our web browser to get a webpage on our local machine.
  2. Our local server receives that request and sends its own request to the remote Virginia court server to get the court webpage.
  3. Our local server takes the HTML it got from Virginia and sends it to our web browser.

Now, instead of mirroring the webpage on our local machine, we want to extract the useful information into JSON. Everything we add from here will be inside the first if statement.

It's also worth noting here that with a request implemented, we have two response objects that we don't want to confuse. The first, res, is part of the callback function for the /scrape route on our local Express server and we'll use that to display our results. The second, response, is part of the callback function for our server's HTTP request and we won't do anything with it other than check its statusCode.
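
Stripped down to just the two nested callbacks (error handling omitted), the relationship looks like this:

app.get( '/scrape', function( req, res ) {            // res: our reply to the web browser
  request( url, function( error, response, html ) {   // response: the court server's reply to us
    res.send(html);                                    // use res to pass along what we received
  } );
} );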

Saying hi to cheerio

Cheerio is an implementation of jQuery's core API that will let our app traverse an HTML document. (If you don't know jQuery and the document object model, things are about to get a bit dicey.) Whereas jQuery works in a web browser to let us navigate and manipulate a webpage's HTML, Cheerio works on our server (not in our web browser) but navigates and manipulates HTML in the same way as jQuery, so it's very convenient if we're already familiar with jQuery or something similar, such as CSS selectors.

At this point, we want to have the HTML source code of our target webpage open for close reference. We want to grab all the paragraph elements that contain opinion information, which is all the p elements except the first two and the last one. We want to extract information from within each p element. And we want to compile an array of JSON objects for information extracted for each opinion.

Let's start with compiling an array of JSON elements that contain just the plain text inside each opinion's p element:

// Initialize Cheerio with requested HTML
var $ = cheerio.load(html); // load HTML for jQuery-esque DOM traversal

var $opinions = $('p').slice(2); // get all p elements, discarding the first two
$opinions = $opinions.slice(0, $opinions.length - 2); // drop the trailing non-opinion elements

var jsons = []; // initialize array for collecting all our JSON objects

for ( var p = 0; p < $opinions.length; p += 1 ) {
  var pContent = $opinions.eq(p).text();

  // crafting JSON from the info extracted from each element in $opinions
  // would happen here

  jsons.push({
    content: pContent // let's just push the contents of each p into JSON
  });
}

res.json(jsons); // respond with our compilation of JSON

The dollar sign ($) is a convention used by Cheerio and jQuery to distinguish their objects from other JavaScript objects. We declare a variable called $ (just a lone dollar sign) and assign it the HTML we load with Cheerio so that we can use Cheerio's selector methods on the HTML document. We also use a dollar sign in variable names to remind ourselves when those objects may work with Cheerio's methods.

Cheerio's slice() method (which works just like JavaScript's array slice) is used to discard the p elements that don't contain opinion information. We keep the p elements we want in a Cheerio collection called $opinions and loop through them, using the eq() method to get at each individual p element and the text() method to get the element's inner text.

We also create a JSON object (an attribute/value pair enclosed in { and }) and push it into our jsons array for display.

Reload [http://localhost:4000/scrape](http://localhost:4000/scrape) and the web browser should display JSON objects for each opinion with one attribute (content) and a value containing the plain text inside that paragraph. That's not quite what we want but it gives us something to look at while we consider how to extract the case information into neat attribute/value pairs for each JSON object. Specifically, we want to think of DOM selectors and JavaScript string manipulation techniques that can zero in on each piece of information:

  • case name (e.g. "Norfolk Southern Ry. v. E.A. Breeden, Inc."),
  • docket number (e.g. "131066"),
  • date (e.g. "04/17/2014"),
  • summary (e.g. "The circuit court did not err.... [Its judgment] is affirmed"), and
  • URL to the full-text PDF (e.g. "http://www.courts.state.va.us/opinions/opnscvwp/1131066.pdf").

Take a moment to think or try your own strategies to grab the relevant information from each paragraph. We have all the opinions in an array of Cheerio objects ($opinions) and we can refer to each by our loop's p index.

The case names and links are enclosed by HTML tags, which makes them easy to extract with Cheerio/jQuery selectors. For example, $element.find('b') would allow us to select a b element within $element and $element.find('a[href]') would allow us to select a elements with an href attribute. From there we could use $element.text() to get the text within the selected elements.
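
For example, inside our loop (where p is still the loop index), the selectors we'll lean on look like this:

var $opinion = $opinions.eq(p);             // one opinion's p element
var caseName = $opinion.find('b').text();   // e.g. "Butler v. Fairfax County School Board"
var $pdfLinks = $opinion.find('a[href]');   // only the a elements that actually link somewhere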

The rest of the information is a bit more difficult to extract, but it follows a consistent format: anchor or hyperlink a tags, the case name enclosed in b tags, a date in the mm/dd/yyyy format, a summary, and sometimes a hyperlink in the summary. If you pay close attention to the source code, the Virginia court's content management system consistently puts extra spaces here and there and one notable line break (\n) right before the summary begins. White space like that in HTML is ignored by web browsers, but in our app we'll have to either filter it out or, in the case of the line break, realize that it marks where the summary begins.

We won't consider using regular expressions here, but those so inclined could efficiently grab all the information we want without Cheerio and with very little code.

Walking through the text wrangling

One approach that we'll try out here makes use of Cheerio's find(), text(), and attr() methods for some of the easy-to-extract stuff. We also use JavaScript's trim() method to discard extra spaces and the split() method to carve out the summary and extract specific words.

The case name is the easiest item to pull out. A simple find() and text() lets us zero in on the contents of each paragraph's only b (bold) element.

The hypertext links to PDFs can be acquired in the same way, but because each opinion may be associated with zero, one, or more PDFs, we'll have to use another for loop (nested within our for loop for the p elements) to iterate through each of a p element's child a elements with an href attribute. Because the actual text that is part of the hyperlink to the PDF describes how the PDF relates to the opinion at issue, we'll grab that as well and nest that meta info inside the JSON as a description for the hyperlink.
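
On its own, that inner loop looks something like this (using the same loop index p as before; the relative href gets converted to an absolute URL in the next section, and everything gets folded into the full listing below):

var $pdfLinks = $opinions.eq(p).find('a[href]'); // every linked a element in this opinion
var hrefs = []; // one object per PDF link
for ( var a = 0; a < $pdfLinks.length; a += 1 ) {
  hrefs.push({
    name: $pdfLinks.eq(a).text().trim(), // the link text describes the PDF
    href: $pdfLinks.eq(a).attr('href')   // still a relative URL at this point
  });
}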

The rest of the opinion's details can be extracted by splitting each p element's text on the line break (\n) or on spaces. JavaScript's split() method returns an array of substrings cut from a string at each occurrence of the given separator. For example, if var abcd = 'a b c d', a string of four letters separated by spaces, were split on a space with abcd.split(' '), it would return ['a', 'b', 'c', 'd'], an array of every space-separated substring in the original string. If the string to be split had multiple adjacent spaces, split(' ') would return some empty substrings (['a', '', 'b', '', 'c', '', 'd']).
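
Here's that behavior in a couple of lines you can paste into the Node REPL:

var abcd = 'a b c d';
console.log(abcd.split(' '));       // [ 'a', 'b', 'c', 'd' ]
console.log('a  b  c'.split(' '));  // [ 'a', '', 'b', '', 'c' ] -- adjacent spaces yield empty strings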

The summary takes up the entire paragraph after the line break, so a simple split on the line break gets us an array of two substrings. The second one (or [1] with an array's zero index), after its trailing white space is trimmed, is the summary neatly extracted.

The first substring in that array is basically all the meta information for the opinion, within which the targets at hand are the docket number, the first word in our meta info substring, and the date, the last word in the meta info substring. We can split that substring again on a space and then we just have to pick out the first and last item in this new array. The first item, zero indexed, is [0] and the last item, zero indexed, is one less than the length of the meta info array. Remember that our split on the space will only work nicely if we trim extra whitespace from the beginning and end of the meta info substring first.
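
In code, the splits described in the last two paragraphs come out something like this (pContent is the paragraph text we grabbed earlier; the same logic reappears in the full listing below):

var parts = pContent.split('\n');       // [ meta info, summary ]
var summary = parts[1].trim();          // everything after the line break

var metaInfo = parts[0].trim();         // docket number, case name, date
var words = metaInfo.split(' ');
var docketNumber = words[0];            // first word is the docket number
var date = words[words.length - 1];     // last word is the date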

Finding a path

The PDF hyperlinks are relative to the Virginia webpage's URL (e.g. opinions/opnscvwp/1150150.pdf), which means they don't make sense in our JSON without their original context. So what we'll do is figure out each PDF's absolute URL by joining the Virginia webpage's URL with the PDF's relative hyperlink. Lucky for us, Node's path module can handle this. Add path as a dependency (var path = require('path')) so that we can use its join() and dirname() methods. Calling dirname() with the Virginia webpage's URL returns the path to the directory where that page lives. We then pass that path and the PDF's relative hyperlink to join(), which combines the two into an absolute hyperlink for the PDF that we can use outside the context of the original court webpage.

Here's an example of how that works:

  var targetWebpageUrl = 'http://www.courts.state.va.us/scndex.htm'; // has absolute path
  var pdfUrl = 'opinions/opnscvwp/1150150.pdf'; // has relative path

  var path = require('path'); // lets us use `path` module

  // assign the directory where the target webpage is located to variable
  var targetPath = path.dirname(targetWebpageUrl);
  console.log(targetPath); // prints 'http://www.courts.state.va.us'

  // joins webpage's path with PDF hyperlink's relative URL
  var absoluteUrl = path.join(targetPath, pdfUrl);
  console.log(absoluteUrl);
  // prints 'http:/www.courts.state.va.us/opinions/opnscvwp/1150150.pdf',
  //         which is the webpage's URL with the file name swapped out for
  //         the PDF's relative URL.
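
One caveat worth noting: path.join() normalizes repeated slashes, which is why the double slash after http: collapses to a single one in the output above. If we'd rather have fully well-formed URLs, Node's built-in url module resolves relative URLs properly; here's a minimal sketch (the require is renamed to nodeUrl so it won't clash with the url string variable inside our /scrape route):

  var nodeUrl = require('url'); // Node's built-in URL utilities

  // resolves the relative PDF link against the webpage's full URL
  var absoluteUrl = nodeUrl.resolve(targetWebpageUrl, pdfUrl);
  console.log(absoluteUrl);
  // prints 'http://www.courts.state.va.us/opinions/opnscvwp/1150150.pdf'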

Reeling it all in

If you've been following along, you should have something like this:

// Initialize Cheerio with requested HTML
var $ = cheerio.load(html); // allows DOM traversal jQuery-style on $

var $opinions = $('p').slice(2); // get all p elements, discard first two
$opinions = $opinions.slice(0, $opinions.length - 2); // discard footer

var jsons = []; // initialize array for collecting all our JSONs

for ( var p = 0; p < $opinions.length; p += 1 ) {

  // Grab all text inside p element
  var pContent = $opinions.eq(p).text();

  // Grab text inside b element (bold tags)
  var caseName = $opinions.eq(p).find('b').text();

  // Grab the first word in the text (and assume it's a docket)
  var docketNumber = pContent.trim().split(' ')[0];

  // Split text on line break and assume the second part is summary
  var summary = pContent.split('\n')[1].trim();

  // Split text on line break and assume first part is meta info
  var metaInfo = pContent.split('\n')[0].trim();
  // Grab the last word in meta info and assume it's the date
  var date = metaInfo.split(' ')[metaInfo.split(' ').length - 1];

  // Find all the a elements that have an href attribute
  var $aWithHrefs = $opinions.eq(p).find('a[href]');
  var hrefs = []; // compile the href json in this
  // iterate through all the a[href] elements and extract the link info
  for ( var a = 0; a < $aWithHrefs.length; a += 1 ) {
    hrefs.push({
      name: $aWithHrefs.eq(a).text().trim(),
      href: path.join( path.dirname(url), $aWithHrefs.eq(a).attr('href') )
    });
  }

  jsons.push({ // now let's put the extracted info into our collection
    case_name: caseName,
    docket_number: docketNumber,
    date: date,
    summary: summary,
    hrefs: hrefs,
  });
}

res.json(jsons); // respond with our compilation of JSON

The resulting JSON for each opinion looks like this:

{
  "case_name": "Norfolk Southern Ry. v. E.A. Breeden, Inc.",
  "docket_number": "131066",
  "date": "04/17/2014",
  "summary": "The circuit court did not err.... [Its judgement] is affirmed.",
  "hrefs": [{
    "name": "131066",
    "href": "http:/www.courts.state.va.us/opinions/opnscvwp/1131066.pdf"
    }]
}

And with that, we have over 2,000 legal summaries, case names, docket numbers, and hyperlinks converted to JSON on demand and ready to use.

Final thoughts

So what have we accomplished? We took a webpage containing information on court decisions going back two decades and parsed it into a more accessible JSON format we could use in another application. We used the Express framework to wrap our application in a local web server and show results in a web browser. We used Request to fetch the remote webpage, and Cheerio and basic JavaScript to extract and wrangle what we wanted from the webpage's DOM.

We didn't engineer a web crawler because all the information is contained on one page, but for websites where an index holds only basic information and additional details are hyperlinked on other pages, we may want to make multiple requests and compile additional information by crawling those pages. For example, in our scrape above, we only make one request for the court's index of opinions, but if we wanted to extract additional information from the PDFs, we'd add another request inside our loop over the p elements to fetch and parse something from each PDF.

From here, we could drop the information into a Mongo database and keep it updated every day. We could extract extra information, such as legal code references in the summaries, or filter the results to summaries that do or do not contain specified keywords. Had the information been more tabular in nature, we could extract the data into a spreadsheet or CSV file and pass it to our analyst for number crunching or data visualization. With little modification, our code could work for Virginia's appellate court opinions as well.

But the important takeaway here is that the government and public agencies often dump useful information on the web in a less than useful way. And while we push our governments to build their webpages with transparency, accessibility, and usability in mind, we needn't wait for them to do so if we can get what we want ourselves with Node.

See the code used for this tutorial on GitHub.
