Hello all. We must choose an extraction library for our new Goliath system to provide default values when no rules have been set. The choice has been narrowed down to Boilerpipe and Goose. They both have sub-par documentation (Boilerpipe, Goose), so I've dug around in the code to find the exact process by which they pull out data. Here I will compare them so we can choose one.

# Boilerpipe

This bad mamba jamba was developed by a Ph.D.-having guy who, along with some other folks, wrote a big fat academic paper around the algorithm it uses, which you can find in our Dropbox if you really want to read it. Basically, they use link density, text density, and number of words on a block-by-block basis to distinguish boilerplate blocks from content blocks. A block is simply a contiguous piece of text terminated by the start of a new tag; anchor tags don't terminate a block (their words are instead used to calculate link density). Based on empirical study, they found certain thresholds that work in the real world, so they simply check against those thresholds in a big nested if statement (there's a sketch of how the per-block features are computed right after the two trees below). Here's an extract from the paper:

Algorithm 1 Densitometric Classifier

  • curr_linkDensity <= 0.333333
    • prev_linkDensity <= 0.555556
      • curr_textDensity <= 9
        • next_textDensity <= 10
          • prev_textDensity <= 4: BOILERPLATE
          • prev_textDensity > 4: CONTENT
        • next_textDensity > 10: CONTENT
      • curr_textDensity > 9
        • next_textDensity = 0: BOILERPLATE
        • next_textDensity > 0: CONTENT
    • prev_linkDensity > 0.555556
      • next_textDensity <= 11: BOILERPLATE
      • next_textDensity > 11: CONTENT
  • curr_linkDensity > 0.333333: BOILERPLATE

Algorithm 2 Classifier based on Number of Words

  • curr_linkDensity <= 0.333333
    • prev_linkDensity <= 0.555556
      • curr_numWords <= 16
        • next_numWords <= 15
          • prev_numWords <= 4: BOILERPLATE
          • prev_numWords > 4: CONTENT
        • next_numWords > 15: CONTENT
      • curr_numWords > 16: CONTENT
    • prev_linkDensity > 0.555556
      • curr_numWords <= 40
        • next_numWords <= 17: BOILERPLATE
        • next_numWords > 17: CONTENT
      • curr_numWords > 40: CONTENT
  • curr_linkDensity > 0.333333: BOILERPLATE
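
Both trees branch on the same per-block features, so here's a quick sketch of how I understand those features to be computed. This is my paraphrase of the paper's definitions, not a copy of Boilerpipe's code; the 80-character wrap width is the one the paper uses, and the helper names are mine.

```java
// Sketch of the per-block features used in the decision trees above.
// Assumptions: wrap width of 80 chars and simplified last-line handling.
public final class BlockFeatures {

    // linkDensity: fraction of the block's words that sit inside anchor tags.
    static double linkDensity(int numWords, int numLinkedWords) {
        return numWords == 0 ? 0.0 : (double) numLinkedWords / numWords;
    }

    // textDensity: words per line after word-wrapping the block's text at a
    // fixed column width (the paper wraps at 80 characters).
    static double textDensity(String blockText, int wrapWidth) {
        String[] words = blockText.trim().split("\\s+");
        int lines = 1;
        int col = 0;
        for (String word : words) {
            if (col + word.length() > wrapWidth) {
                lines++;
                col = 0;
            }
            col += word.length() + 1; // +1 for the trailing space
        }
        return (double) words.length / lines;
    }

    public static void main(String[] args) {
        // Toy example: 3 of 9 words are inside links, short single-line block.
        System.out.println("linkDensity = " + linkDensity(9, 3));
        System.out.println("textDensity = " + textDensity(
                "A short block of text that fits on one wrapped line.", 80));
    }
}
```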

Now, here are some code snippets from the Boilerpipe library:

protected boolean classify(final TextBlock prev, final TextBlock curr,
        final TextBlock next) {
    final boolean isContent;

    if (curr.getLinkDensity() <= 0.333333) {
        if (prev.getLinkDensity() <= 0.555556) {
            if (curr.getTextDensity() <= 9) {
                if (next.getTextDensity() <= 10) {
                    if (prev.getTextDensity() <= 4) {
                        isContent = false;
                    } else {
                        isContent = true;
                    }
                } else {
                    isContent = true;
                }
            } else {
                if (next.getTextDensity() == 0) {
                    isContent = false;
                } else {
                    isContent = true;
                }
            }
        } else {
            if (next.getTextDensity() <= 11) {
                isContent = false;
            } else {
                isContent = true;
            }
        }
    } else {
        isContent = false;
    }

    return curr.setIsContent(isContent);
}

Then, algorithm 2 from the code:

protected boolean classify(final TextBlock prev, final TextBlock curr,
        final TextBlock next) {
    final boolean isContent;

    if (curr.getLinkDensity() <= 0.333333) {
        if (prev.getLinkDensity() <= 0.555556) {
            if (curr.getNumWords() <= 16) {
                if (next.getNumWords() <= 15) {
                    if (prev.getNumWords() <= 4) {
                        isContent = false;
                    } else {
                        isContent = true;
                    }
                } else {
                    isContent = true;
                }
            } else {
                isContent = true;
            }
        } else {
            if (curr.getNumWords() <= 40) {
                if (next.getNumWords() <= 17) {
                    isContent = false;
                } else {
                    isContent = true;
                }
            } else {
                isContent = true;
            }
        }
    } else {
        isContent = false;
    }

    return curr.setIsContent(isContent);
}
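
To see where classify() sits, here's a rough driver, assuming my reading of the library's plumbing is right: the classifiers implement Boilerpipe's filter interface and walk a parsed document's blocks in order, flipping each block's content flag. The class and method names below (BoilerpipeSAXInput, NumWordsRulesClassifier.INSTANCE, etc.) are from memory, so treat this as a sketch rather than gospel.

```java
import java.io.StringReader;

import org.xml.sax.InputSource;

import de.l3s.boilerpipe.document.TextBlock;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.filters.english.NumWordsRulesClassifier;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;

public class ClassifierDriver {
    public static void main(String[] args) throws Exception {
        String html = "<html><body><p>Some article text...</p></body></html>";

        // Parse the HTML into a TextDocument: an ordered list of TextBlocks.
        TextDocument doc = new BoilerpipeSAXInput(
                new InputSource(new StringReader(html))).getTextDocument();

        // Run Algorithm 2 (the word-count classifier) over every block.
        NumWordsRulesClassifier.INSTANCE.process(doc);

        // Each block now carries the content/boilerplate label it was given.
        for (TextBlock block : doc.getTextBlocks()) {
            String label = block.isContent() ? "CONTENT" : "BOILERPLATE";
            System.out.println(label + ": " + block.getText());
        }
    }
}
```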

That pretty much sums up Boilerpipe. It works quite well (95%-98% accuracy on varied datasets) at distinguishing between content and boilerplate. It'll take some digging into their structure to grab the raw HTML instead of the plain text, but I'm sure it can be done, since the creator does exactly that in his SaaS web app built on this library.
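
For reference, getting the plain text out is a one-liner. This is a minimal sketch of the call we'd make from Goliath, assuming ArticleExtractor (which chains classifiers like the ones above) is the entry point we want; I haven't wired this into anything yet:

```java
import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class BoilerpipeExample {
    public static void main(String[] args) throws Exception {
        // Raw HTML would come from our crawler; shortened here.
        String html = "<html><body><p>Article text...</p></body></html>";

        // ArticleExtractor runs its filter chain and returns the text of the
        // blocks it labeled as content.
        String text = ArticleExtractor.INSTANCE.getText(html);
        System.out.println(text);
    }
}
```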

# Goose

Goose is made by Gravity Labs, whom I'd never heard of before :~). It uses the number of stopwords in a chunk of text, the number of consecutive paragraphs, how high up the element is in the page (since comments are usually towards the bottom and should be scored lower), and link density to score the elements of a page on the content vs. boilerplate spectrum. Ultimately, the concept is the same as Boilerpipe: find some ratios that fingerprint content in the context of a page, then check those to exclude the boilerplate markup.

Goose simply uses different metrics, and uses them in a slightly different manner (scoring vs. straight-up conditionals that reach a decision quickly). The code is much longer and more all over the place, and it is a bit slower (//TODO I haven't timed the two yet; it just "seems" slower). It is written in Scala, which is nice. It also provides more fields (published date, author, and a few more) than Boilerpipe (title, content), but we could easily implement what they are doing for those fields, which comes down to grabbing metadata tags from the document. I encourage you to check out their code.
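
For comparison, here's roughly what calling Goose looks like. Goose is Scala but callable from Java like anything else on the JVM; the accessor names below (title(), publishDate(), cleanedArticleText()) are my assumption about how its Scala fields surface to Java, so take this as a sketch:

```java
import com.gravity.goose.Article;
import com.gravity.goose.Configuration;
import com.gravity.goose.Goose;

public class GooseExample {
    public static void main(String[] args) {
        Configuration config = new Configuration();
        Goose goose = new Goose(config);

        // extractContent fetches the page, scores the DOM nodes as described
        // above, and hands back the winning content plus metadata fields.
        Article article = goose.extractContent("http://example.com/some-article");

        System.out.println(article.title());
        System.out.println(article.publishDate());
        System.out.println(article.cleanedArticleText());
    }
}
```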
