Hello all. We must choose an extraction library for our new Goliath system to provide default values when no rules have been set. The choice has been narrowed down to Boilerpipe and Goose. They both have sub-par documentation (Boilerpipe, Goose), so I've dug around in the code to find the exact process by which they pull out data. Here I will compare them so we can choose one.

# Boilerpipe

This bad mamba jamba was developed by a Ph.D.-having guy who, along with some other folks, wrote a big fat academic paper around the algorithm it uses, which you can find in our Dropbox if you really want to read it. Basically, they use link density, text density, and number of words on a block-by-block basis to distinguish boilerplate blocks from content blocks. A block is simply a contiguous piece of text terminated by the start of a new tag; anchor tags don't terminate a block (their words are instead used to calculate link density). Based on empirical study, they found certain thresholds that work in the real world, so they simply check against those thresholds in a big nested if statement (there's a sketch of how the per-block features are computed right after the two trees below). Here's an extract from the paper:

Algorithm 1 Densitometric Classifier

  • curr_linkDensity <= 0.333333
    • prev_linkDensity <= 0.555556
      • curr_textDensity <= 9
        • next_textDensity <= 10
          • prev_textDensity <= 4: BOILERPLATE
          • prev_textDensity > 4: CONTENT
        • next_textDensity > 10: CONTENT
      • curr_textDensity > 9
        • next_textDensity = 0: BOILERPLATE
        • next_textDensity > 0: CONTENT
    • prev_linkDensity > 0.555556
      • next_textDensity <= 11: BOILERPLATE
      • next_textDensity > 11: CONTENT
  • curr_linkDensity > 0.333333: BOILERPLATE

Algorithm 2 Classifier based on Number of Words

  • curr_linkDensity <= 0.333333
    • prev_linkDensity <= 0.555556
      • curr_numWords <= 16
        • next_numWords <= 15
          • prev_numWords <= 4: BOILERPLATE
          • prev_numWords > 4: CONTENT
        • next_numWords > 15: CONTENT
      • curr_numWords > 16: CONTENT
    • prev_linkDensity > 0.555556
      • curr_numWords <= 40
        • next_numWords <= 17: BOILERPLATE
        • next_numWords > 17: CONTENT
      • curr_numWords > 40: CONTENT
  • curr_linkDensity > 0.333333: BOILERPLATE
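
Both trees branch on the same per-block features, so here's a quick sketch of how I understand those features to be computed. This is my paraphrase of the paper's definitions, not a copy of Boilerpipe's code; the 80-character wrap width is the one the paper uses, and the helper names are mine.

```java
// Sketch of the per-block features used in the decision trees above.
// Assumptions: wrap width of 80 chars and simplified last-line handling.
public final class BlockFeatures {

    // linkDensity: fraction of the block's words that sit inside anchor tags.
    static double linkDensity(int numWords, int numLinkedWords) {
        return numWords == 0 ? 0.0 : (double) numLinkedWords / numWords;
    }

    // textDensity: words per line after word-wrapping the block's text at a
    // fixed column width (the paper wraps at 80 characters).
    static double textDensity(String blockText, int wrapWidth) {
        String[] words = blockText.trim().split("\\s+");
        int lines = 1;
        int col = 0;
        for (String word : words) {
            if (col + word.length() > wrapWidth) {
                lines++;
                col = 0;
            }
            col += word.length() + 1; // +1 for the trailing space
        }
        return (double) words.length / lines;
    }

    public static void main(String[] args) {
        // Toy example: 3 of 9 words are inside links, short single-line block.
        System.out.println("linkDensity = " + linkDensity(9, 3));
        System.out.println("textDensity = " + textDensity(
                "A short block of text that fits on one wrapped line.", 80));
    }
}
```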

Now, here are some code snippets from the Boilerpipe library:

protected boolean classify(final TextBlock prev, final TextBlock curr,
        final TextBlock next) {
    final boolean isContent;

    if (curr.getLinkDensity() <= 0.333333) {
        if (prev.getLinkDensity() <= 0.555556) {
            if (curr.getTextDensity() <= 9) {
                if (next.getTextDensity() <= 10) {
                    if (prev.getTextDensity() <= 4) {
                        isContent = false;
                    } else {
                        isContent = true;
                    }
                } else {
                    isContent = true;
                }
            } else {
                if (next.getTextDensity() == 0) {
                    isContent = false;
                } else {
                    isContent = true;
                }
            }
        } else {
            if (next.getTextDensity() <= 11) {
                isContent = false;
            } else {
                isContent = true;
            }
        }
    } else {
        isContent = false;
    }

    return curr.setIsContent(isContent);
}

Then, algorithm 2 from the code:

protected boolean classify(final TextBlock prev, final TextBlock curr,
        final TextBlock next) {
    final boolean isContent;

    if (curr.getLinkDensity() <= 0.333333) {
        if (prev.getLinkDensity() <= 0.555556) {
            if (curr.getNumWords() <= 16) {
                if (next.getNumWords() <= 15) {
                    if (prev.getNumWords() <= 4) {
                        isContent = false;
                    } else {
                        isContent = true;
                    }
                } else {
                    isContent = true;
                }
            } else {
                isContent = true;
            }
        } else {
            if (curr.getNumWords() <= 40) {
                if (next.getNumWords() <= 17) {
                    isContent = false;
                } else {
                    isContent = true;
                }
            } else {
                isContent = true;
            }
        }
    } else {
        isContent = false;
    }

    return curr.setIsContent(isContent);
}
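
To see where classify() sits, here's a rough driver, assuming my reading of the library's plumbing is right: the classifiers implement Boilerpipe's filter interface and walk a parsed document's blocks in order, flipping each block's content flag. The class and method names below (BoilerpipeSAXInput, NumWordsRulesClassifier.INSTANCE, etc.) are from memory, so treat this as a sketch rather than gospel.

```java
import java.io.StringReader;

import org.xml.sax.InputSource;

import de.l3s.boilerpipe.document.TextBlock;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.filters.english.NumWordsRulesClassifier;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;

public class ClassifierDriver {
    public static void main(String[] args) throws Exception {
        String html = "<html><body><p>Some article text...</p></body></html>";

        // Parse the HTML into a TextDocument: an ordered list of TextBlocks.
        TextDocument doc = new BoilerpipeSAXInput(
                new InputSource(new StringReader(html))).getTextDocument();

        // Run Algorithm 2 (the word-count classifier) over every block.
        NumWordsRulesClassifier.INSTANCE.process(doc);

        // Each block now carries the content/boilerplate label it was given.
        for (TextBlock block : doc.getTextBlocks()) {
            String label = block.isContent() ? "CONTENT" : "BOILERPLATE";
            System.out.println(label + ": " + block.getText());
        }
    }
}
```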

That pretty much sums up Boilerpipe. It works quite well (95%-98% accuracy on varied datasets) at distinguishing between content and boilerplate. It'll take some digging into their structure to grab the raw HTML instead of the plain text, but I'm sure it can be done, since the creator does exactly that in his SaaS web app built on this library.
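
For reference, getting the plain text out is a one-liner. This is a minimal sketch of the call we'd make from Goliath, assuming ArticleExtractor (which chains classifiers like the ones above) is the entry point we want; I haven't wired this into anything yet:

```java
import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class BoilerpipeExample {
    public static void main(String[] args) throws Exception {
        // Raw HTML would come from our crawler; shortened here.
        String html = "<html><body><p>Article text...</p></body></html>";

        // ArticleExtractor runs its filter chain and returns the text of the
        // blocks it labeled as content.
        String text = ArticleExtractor.INSTANCE.getText(html);
        System.out.println(text);
    }
}
```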

# Goose

Goose is made by Gravity Labs, whom I'd never heard of before :~). It uses the number of stopwords in a chunk of text, the number of consecutive paragraphs, how high up the element is in the page (since comments are usually towards the bottom and should be scored lower), and link density to score the elements of a page on the content vs. boilerplate spectrum. Ultimately, the concept is the same as Boilerpipe: find some ratios that fingerprint content in the context of a page, then check those to exclude the boilerplate markup.

Goose simply uses different metrics, and uses them in a slightly different manner (scoring vs. straight-up conditionals that reach a decision quickly). The code is much longer and more all over the place, and it is a bit slower (//TODO I haven't timed the two yet; it just "seems" slower). It is written in Scala, which is nice. It also provides more fields (published date, author, and a few more) than Boilerpipe (title, content), but we could easily implement what they are doing for those fields, which comes down to grabbing metadata tags from the document. I encourage you to check out their code.
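
For comparison, here's roughly what calling Goose looks like. Goose is Scala but callable from Java like anything else on the JVM; the accessor names below (title(), publishDate(), cleanedArticleText()) are my assumption about how its Scala fields surface to Java, so take this as a sketch:

```java
import com.gravity.goose.Article;
import com.gravity.goose.Configuration;
import com.gravity.goose.Goose;

public class GooseExample {
    public static void main(String[] args) {
        Configuration config = new Configuration();
        Goose goose = new Goose(config);

        // extractContent fetches the page, scores the DOM nodes as described
        // above, and hands back the winning content plus metadata fields.
        Article article = goose.extractContent("http://example.com/some-article");

        System.out.println(article.title());
        System.out.println(article.publishDate());
        System.out.println(article.cleanedArticleText());
    }
}
```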
