This document describes how I acquire plaintext versions of the books and Wikipedia corpora.

How to: get books corpus and Wikipedia corpus

As specified by the authors, the books corpus needs to be downloaded from Smashwords. However, there is no easy download option; it seems the site needs to be scraped.

The Wikipedia dataset can be downloaded from Wikimedia but only as XML.

Hugging Face makes these datasets available, which makes them easier to acquire.

The steps are as follows:

Create conda environment

Create a conda env with spaCy, the Hugging Face nlp library (a precursor to the datasets library, but it seems to work more reliably with the Wikipedia dataset), and the unidecode library. We call this env getwb, for "Get Wikipedia & Books".

conda create --name getwb
conda activate getwb
pip install unidecode
pip install spacy

pip install nlp
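
To check that the environment behaves as expected, a quick sanity check like the following can be run (this snippet is my own addition; it assumes spaCy 3.x, whose string-name add_pipe API is used in the scripts below):

from spacy.lang.en import English
from unidecode import unidecode

nlp = English()
nlp.add_pipe("sentencizer")  # spaCy 3.x API, as used in the preprocessing scripts

doc = nlp(unidecode("Héllo world. Second sentence."))
print([sent.text for sent in doc.sents])
# expected output: ['Hello world.', 'Second sentence.']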

Get books dataset -- new method

Books can be scraped and preprocessed with this utility: https://github.com/soskek/bookcorpus. However, this takes a significant amount of time due to rate limiting by smashwords.com. @theshawwn on Twitter has done the work for us.

Some books contain HTML tags:

grep -H "<\!DOCTYPE html>" ./*

outputs the following file names:

a-curious-affair.epub.txt
living-the-write-life-tips-on-making-the-most-of-your-writin.epub.txt

angular-4-from-theory-to-practice.epub.txt
computing-without-compromise-love-letters-to-open-source.epub.txt
raspberry-pi-insider-guide.epub.txt
rys-git-tutorial.epub.txt

The first two can easily be fixed by hand; the rest should be removed from the corpus, as they contain a significant amount of computer code. The following files also consist mostly of non-English or otherwise non-natural-language content and should be removed (a removal sketch follows the list below):

1000-lines-magic-sequence.epub.txt
100-alphanumeric-crosswords.epub.txt
150-crosswords.epub.txt
find-how-many-in-each-crossword-abc-letters-are.epub.txt
find-how-many-in-each-crossword-abc-letters-are-volume-ii.epub.txt
find-how-many-in-each-crossword-abc-letters-are-volume-iii.epub.txt
find-how-many-in-each-crossword-abc-letters-are-volume-iv.epub.txt

200-most-frequently-used-esperanto-words-2000-example-senten.epub.txt
ekadashi-collection-of-texts-in-6-languages.epub.txt
el-espiritu-santo-en-la-iglesia.epub.txt
el-estado-de-la-inseguridad-alimentaria-en-el-mundo-2014-for.epub.txt
el-estado-de-los-bosques-del-mundo-2014.epub.txt
el-estado-mundial-de-la-agricultura-y-la-alimentacion-2013-s.epub.txt
el-estado-mundial-de-la-pesca-y-la-acuicultura-2014.epub.txt
el-triangulo-de-las-bermudas-el-encubrimiento-de-la-guerra-d.epub.txt
datos-de-composicion-de-alimentos.epub.txt
donnees-sur-la-composition-des-aliments.epub.txt
foundamentals-of-interpretation-rudimento-de-interpretacion-.epub.txt
guia-de-nutricion-de-la-familia.epub.txt
gw-basic-commands-analytical-metalexicon-logodynamics-under-.epub.txt
la-situation-mondiale-de-lalimentation-et-de-lagriculture-20.epub.txt
la-situation-mondiale-des-peches-t-de-laquaculture-2014.epub.txt
metalexicon-i-ching-logodynamics-of-849-basic-english-words.epub.txt
panorama-de-la-seguridad-alimentaria-y-nutricional-2013.epub.txt
polozenie-del-v-oblasti-prodovolstvia-i-selskogo-hozajstva-2.epub.txt
tempestade-de-guerra.epub.txt


advance-java-programming.epub.txt
all-prime-numbers-from-1-1000000-and-the-java-code-used-to-f.epub.txt
autohotkey-tricks-you-ought-to-do-with-windows.epub.txt
cobol-for-the-approved-workman.epub.txt
internet-programming-basics.epub.txt
most-essential-concepts-of-javascript.epub.txt
programming-without-codes.epub.txt
sqlite-database-programming-for-xamarin-cross-platform-c-dat.epub.txt
ultimate-guide-to-python-basics.epub.txt
visual-basic-for-the-approved-workman.epub.txt
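
A minimal sketch of how the flagged files can be dropped (my addition; it assumes the working directory is the one containing the .epub.txt files, and that to_remove is filled with the file names listed above):

from pathlib import Path

# file names flagged above; truncated here -- fill in the full lists
to_remove = [
    "angular-4-from-theory-to-practice.epub.txt",
    "computing-without-compromise-love-letters-to-open-source.epub.txt",
    "1000-lines-magic-sequence.epub.txt",
    # ...
]

for name in to_remove:
    # missing_ok requires Python 3.8+; it skips files that are already gone
    Path(name).unlink(missing_ok=True)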

There are likely other books that should be fixed or removed, but they are not immediately apparent.

The following script preprocesses the books corpus and outputs a single file with one sentence per line and books separated by one blank line.

import os
import re

from spacy.lang.en import English
from sys import argv
from unidecode import unidecode

in_dir = argv[1]
outfile = argv[2]

# emptying outfile if need be
with open(outfile, 'w') as f:
    pass

nlp = English()
nlp.add_pipe("sentencizer")

# This pattern identifies URLs. Shamelessly stolen from
# https://gist.github.com/Syncrossus/b4034d03d8f1e24bac804acefc917ff2
url_pattern = (r"\(?<?(https?\://(www\.)?([A-z]|[0-9]|\.|-|%)+\.[A-z]{2,6}"
               r"(/([A-z]|[0-9]|-|\.|_|#)+)*/?"
               r"(\?([A-z]|[0-9]|\.|-|%|=|&|_|#|\:|\+)+)?)>?\)?")
# See wikipedia preprocessing script for explanation on this pattern.
html_tag_pattern = r'<([^0-9]*|[a-zA-Z\s]+ ?= ?".*?)>'
# This pattern identifies decorative characters. Shamelessly stolen from
# https://gist.github.com/Syncrossus/b4034d03d8f1e24bac804acefc917ff2
garbage_pattern = r"((\||(--+)|(__+)|<|>|\+|\*|\^|#|=|~)+|(\\|_|/){2,})"
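# For example (illustration only): re.sub(garbage_pattern, '', "*** Chapter 1 ***")
# yields " Chapter 1 " -- runs of decorative symbols are stripped.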
email_pattern = r"(([A-z]|[0-9]|\.|\+)+@([A-z]|[0-9]|\.|-)+\.[A-z]{2,6})"

os.chdir(in_dir)
books = os.listdir()
for book in books:
    with open(book, 'r') as f:
        # the reason we only keep lines longer than 10 char is to filter out
        # empty lines or lines containing non-sentences such as layout stuff.
        # 10 is pretty arbitrary, but we'll always miss long lines of garbage,
        # and sentences < 10 char are few and unlikely to be very important.
        lines = [unidecode(line.strip())
                 for line in f.readlines()
                 if len(line) > 10]

        # Separating by sentence
        lines = [sent.text for line in lines for sent in nlp(line).sents]
        doc = '\n'.join(lines) + '\n'
        
        doc = re.sub(url_pattern, '', doc)
        doc = re.sub(html_tag_pattern, '', doc)
        doc = re.sub(garbage_pattern, '', doc)
        doc = re.sub(email_pattern, '', doc)

        # appending document to output file; out_f avoids shadowing the
        # input file handle f
        with open(outfile, 'a') as out_f:
            print(doc, file=out_f)
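
Assuming the script above is saved as preprocess_books.py (the name and the <books_dir> placeholder are mine), it takes the directory of plaintext books and the output file as arguments; the output path below matches the books.txt used in the concatenation step further down:

python preprocess_books.py <books_dir> /data/dataset/knowbert/retrain/plaintext/books.txt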

Get Wikipedia dataset -- new method

Download the Wikimedia XML dump. This may take a long time and can be done in the background. The following commands allow the download to persist after logging out of an SSH session:

curl -o /data/dataset/knowbert/wikipedia_xml/enwiki-latest-pages-articles.xml.bz2 https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 &
jobs -l
#> [1]+ 450143 Running                 curl -o ...
disown %1
logout

Execution time: ~2h

Install wikiextractor

conda activate getwb # if not already done
pip install wikiextractor

Launch the extraction using wikiextractor. Note that the command below points at the decompressed .xml file, so the .bz2 dump needs to be decompressed first (e.g. with bunzip2). If you have a computing cluster running SLURM as a job manager, you can run:

# 1 task, 64 cores allocated; -o is the output path, -b caps each output file at 1 GB, output as JSON, 64 worker processes, followed by the input dump
srun -n 1 -c 64 wikiextractor -o /data/dataset/knowbert/wikipedia_xml/output/ -b 1G --json --processes 64 /data/dataset/knowbert/wikipedia_xml/enwiki-latest-pages-articles.xml
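
Without SLURM, the same extraction can be run directly (my variant of the command above; adjust --processes to the local core count):

wikiextractor -o /data/dataset/knowbert/wikipedia_xml/output/ -b 1G --json --processes 8 /data/dataset/knowbert/wikipedia_xml/enwiki-latest-pages-articles.xml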

Execution time: ~1h

The corpus is now in JSON format, organized in subdirectories of 100 files each inside the output directory. It needs to be preprocessed:

import json
import os
import re
from sys import argv
from spacy.lang.en import English
from unidecode import unidecode

# go to wikipedia extractor output dir.
# '/data/dataset/knowbert/wikipedia_xml/output/'
os.chdir(argv[1])

# '/data/dataset/knowbert/full_wiki_plaintxt_from_xml.txt'
outfile = argv[2]
# emptying outfile if need be
with open(outfile, 'w') as f:
    pass

# Some special parenthesized text gets removed during extraction, such as
# phonetics for foreign names. When this happens, parentheses and punctuation
# tend to be left behind.
specialparen_pattern = r'\([ \.,;]*\) '
url_pattern = (r"\(?<?(https?\://(www\.)?([A-z]|[0-9]|\.|-|%)+\.[A-z]{2,6}"
               r"(/([A-z]|[0-9]|-|\.|_|#)+)*/?"
               r"(\?([A-z]|[0-9]|\.|-|%|=|&|_|#|\:|\+)+)?)>?\)?")
# Some HTML tags and entities get through the previous preprocessing steps, so
# we flag relevant documents with this regex. This may return false positives,
# but that doesn't matter as we handle each entity specifically.
entity_pattern = r'&.*?;'
# Most text contained between &lt; and &gt; are HTML tags. Exceptions are
# mostly things like "If a < b and c > d". In most of these exceptions,
# b and / or c are numbers. HTML tags that contain numbers typically do so by
# assigning a partially numeric value to a field in the tag, such as
# '<ref name="Wales2004/05">'. Therefore, a fairly good policy is to
# consider as HTML anything between &lt; and &gt; that contains either no
# numbers or something of the form fieldname="text containing numbers 0123".
tag_pattern = r'&lt;([^0-9]*|[a-zA-Z\s]+ ?= ?".*?)&gt;'
# /[^0-9]*/ matches any number of non-numeric characters up to a closing &gt;
# In the case that there are numeric characters, /[a-zA-Z\s]+ ?= ?".*?/
#   matches any string starting with alphabetical characters followed by an =
#   followed by any quoted text. The = can have a single space before and/or
#   after it, and the pattern will stop as soon as it matches a &gt;. The end
#   quote is omitted from the regex as it would complexify making sure that
#   the pattern stops at the first &gt;
def fix_html(text):
    text = re.sub(tag_pattern, '', text)  # removing HTML tags
    text = text.replace('&amp;', '&')
    text = text.replace('&lt;', '<')
    text = text.replace('&gt;', '>')
    text = text.replace('&quot;', '"')
    return text
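
# For example (illustration only):
#   fix_html('a &lt;ref name="x2004"&gt;note&lt;/ref&gt; b &amp; c')
# returns 'a note b & c': both ref tags are stripped and &amp; is unescaped.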

# os.walk('.') returns a generator of (directory, subdirectories, filenames)
# recursing through the current directory. We don't care about files in the
# current directory so we skip the first element, and then we flatten the
# the remaining files and subdirs into a list of file names. This needs to
# be done as wikiextractor organizes files in subdirectories.
filenames = [os.path.join(subdir, fname) 
             for (subdir, _, fnames) in list(os.walk('.'))[1:]
             for fname in fnames]

ndocs = 0
empty_docs = 0

nlp = English()
nlp.add_pipe("sentencizer")

for filename in filenames:
    # this buffer will contain the plain text document strings.
    # Using a buffer allows us to minimize disk access at the cost of RAM.
    buffer = []
    with open(filename, 'r') as f:
        wiki_shard = [json.loads(line) for line in f]
    for json_doc in wiki_shard:
        ndocs += 1
        # unidecode processes foreign characters and html special character
        # encodings and makes them ascii readable
        doc = unidecode(json_doc['text'])

        # cataloging empty documents, which for some reason occur in the
        # output of wikiextractor. It seems to happen when the name of the
        # article is malformed (case, spacing, etc.) and would cause wikipedia
        # to handle redirection.
        if doc == "":
            empty_docs += 1
            continue
        doc = re.sub(specialparen_pattern, '', doc)
        if re.search(entity_pattern, doc) is not None:
            doc = fix_html(doc)
        doc = re.sub(url_pattern, '', doc)

        # Removing empty lines, section headers, and generally
        # non-useful lines of text
        lines = doc.split('\n')
        lines = [line.strip() for line in lines 
                 if len(line.strip().split()) > 5]
        # sentence-tokenizing text with spacy
        lines = [sent.text for line in lines for sent in nlp(line).sents]
        # also adding a newline at the end of the text
        doc = '\n'.join(lines) + '\n'

        buffer.append(doc)

    # writing documents to output file
    with open(outfile, 'a') as f:
        for doc in buffer:
            print(doc, file=f)

print(f"Number of empty documents found: {empty_docs}/{ndocs}")

Execution time: ~1h

Concatenating Wikipedia & books and preprocessing for NSP (next sentence prediction)

cat /data/dataset/knowbert/retrain/plaintext/books.txt /data/dataset/knowbert/full_wiki_plaintxt_from_xml.txt > retrain/plaintext/fullcorpus_books_wiki.txt
# from inside the allenai/kb repository
python bin/create_pretraining_data_for_bert.py /data/dataset/knowbert/retrain/plaintext/fullcorpus_books_wiki.txt /data/dataset/knowbert/nsp_corpus/shard 18 18000000000

Get books dataset -- old method

Unfortunately, the Hugging Face version of the plaintext dataset does not have document delimiters. Please refer to the "Get books dataset -- new method" section above to properly pre-process the corpus. This section is kept for record-keeping purposes.

from nlp import load_dataset
cache_dir = '<path>'  # specify a directory to use as cache
# in my case, cache is /data/dataset/knowbert/books/
dataset = load_dataset("bookcorpus", cache_dir=cache_dir)

Once this is completed, <cache_dir>/bookcorpus/plain_text/1.0.0/ should contain the plain text files of the books corpus.
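
The loaded dataset can also be inspected directly (a quick check; I am assuming the usual 'train' split and 'text' column of the bookcorpus dataset):

print(len(dataset['train']))         # number of examples
print(dataset['train'][0]['text'])   # first line of the corpus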

Get Wikipedia dataset -- old method

Unfortunately, the Hugging Face version of the data is noisy and is missing some words. For instance, ", officially the , is a in the province of , . According to the , it has a population of people." is a sentence that appears in the corpus and is a corrupted version of the Wikipedia page for Manabo. Please refer to the "Get Wikipedia dataset -- new method" section above in order to acquire the Wikipedia corpus. This section is kept for record-keeping purposes.

The Wikipedia dataset is obtained in a similar fashion to the books dataset, but loading it does not produce a plaintext version on disk. Therefore, it must be processed explicitly.

from nlp import load_dataset
from unidecode import unidecode
import re

dataset = load_dataset("wikipedia", "20200501.en")
#, cache_dir="/data/dataset/knowbert/retrain/")
dataset = dataset['train']  # there is only a train split for this dataset

# in my case, outfile is /data/dataset/knowbert/retrain/wikiplain.txt
outfile = '/<path>/wikiplain.txt'
# emptying outfile if need be
with open(outfile, 'w') as f:
    pass

# we don't want to keep the "References" or "See Also" sections, so we detect
# them with regexes. Basically, any combination of newlines and spaces,
# followed by "References" or "See also", followed by a newline, and then
# anything else.
ref_pattern = r'(\n| )+References *\n(\n|.)+'
sa_pattern = r'(\n| )+See also *\n(\n|.)+'

for i in range(len(dataset)):
    # unidecode processes foreign characters and html special character
    # encodings and makes them ascii readable
    doc = unidecode(dataset[i]['text'])
    # removing references and see also
    doc = re.sub(ref_pattern, '\n', doc)
    doc = re.sub(sa_pattern, '\n', doc)

    # Removing empty lines, section headers, and generally
    # non-useful lines of text
    lines = doc.split('\n')
    lines = [line.strip() for line in lines if len(line.strip().split()) > 5]
    doc = '\n'.join(lines)

    # writing document to output file
    with open(outfile, 'a') as f:
        print(doc, file=f)