This gist describes the procedure for scraping CC0-licensed Belarusian texts for the purposes of Mozilla Common Voice, in addition to the existing collection of sentences exported from Wikipedia. For context, see this thread.
We assume that texts written by authors who died more than 70 years ago, and published during their lifetimes, are in the public domain. Belarusian legislation is even less restrictive (lifetime plus 50 years), but we adhere to the most widely adopted international practice. A list of relevant Belarusian authors can be retrieved from Wikidata using this query:
SELECT DISTINCT ?person ?personLabel ?died ?sitelinks
WHERE
{
  VALUES ?occupation { wd:Q36180 wd:Q12144794 wd:Q487596 wd:Q4853732 wd:Q4263842 }
  ?person wdt:P106 ?occupation; wdt:P6886 wd:Q9091; wdt:P570 ?died .
  FILTER (?died < "1951-06-24T00:00:00Z"^^xsd:dateTime && ?died >= "1900-01-01T00:00:00Z"^^xsd:dateTime)
  ?person wikibase:sitelinks ?sitelinks .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "be". }
} ORDER BY DESC(?sitelinks) LIMIT 100
We save the output as public-domain-authors-70.tsv. A large online library of Belarusian texts, knihi.com, provides texts by many of these authors. Links to the author pages on knihi.com were added manually to the column knihi_com_page. See below for the complete file.
Put public-domain-authors-70.tsv in your working directory and run:
mkdir index
cat public-domain-authors-70.tsv | cut -d$'\t' -f5 | grep https | sed 's@https://knihi.com/@@;s@/@@' | xargs -n 1 -I {} bash -c "wget https://knihi.com/{} -O index/{}.html; sleep 2"
We'd like to keep only those links that (1) point to texts written by these authors, (2) are not audio books, and (3) are not translations from Belarusian into other languages:
cat public-domain-authors-70.tsv | cut -d$'\t' -f5 | grep https | sed 's@https://knihi.com/@@;s@/@@' | tr $'\n' '|' | sed -r 's/\|$//' > author-regex
cat index/*.html | grep -Po 'href="[^"]+"' | grep -i htm | grep -iv audio | grep -Piv '\-[a-z]{3}\.htm' | sort | uniq | grep -P "/($(cat author-regex))/" | sed -r 's@^href="@https://knihi.com@;s/"$//' > text-links
Given the links, now retrieve the texts:
wget -U 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0' -v --no-cookies --content-on-error -o wget.log --no-check-certificate -e robots=off --timeout 3 --random-wait --tries 30 --waitretry 2 -x -i text-links
TODO: Understand why the specified timeout is not respected.
For many files provided by knihi.com, style and genre information is encoded in the metadata:
cd knihi.com
find . -type f | xargs grep "StyleGenre" | cut -d: -f2- | sort | uniq -c | less
We remove documents annotated as poetry:
find . -type f | xargs grep "StyleGenre" | grep -P 'верш|паэма' | cut -d: -f1 | xargs rm
We remove documents that don't contain any HTML-formatted text but instead refer to external files (PDF, DJVU, ZIP, etc.):
find . -type f | sort | xargs grep "BOOK_BEGIN" | cut -d: -f1 > ../files-with-text
comm -23 <(find . -type f | sort) ../files-with-text | xargs rm
We also remove documents that exhibit a large number of spellings in the classical orthography (taraškievica):
find . -type f | grep -P 'html$' | xargs grep -P '([нсц])ь\1' | grep -v text-center | grep -v HEADER_FIELD | cut -d: -f1 | uniq -c | sed -r 's/^\s+//' | grep -P '^[0-9]{2}' | cut -d' ' -f2 | xargs rm
In each HTML document on knihi.com, the text itself is located between the markers BOOK_BEGIN and BOOK_END, which are formatted as HTML comments. We drop everything before and after the text, creating .txt files (which are, in fact, still truncated HTML):
for filename in $(find . -type f | sort | sed -r 's/\.html$//'); do
cat $filename".html" | sed -ne '/BOOK_BEGIN/,/BOOK_END/ p' > $filename".txt"
done
Then put the script remove_markup.py (see below) in the working directory, one level above the knihi.com folder, and invoke it:
find . -type f | grep -P 'txt$' | python3 ../remove_markup.py -
This would yield .clean files, one for each original document.
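For illustration only, here is a rough sketch of what such a markup-removal step might look like; the actual remove_markup.py attached below may differ in its details. The sketch assumes the script reads .txt file paths from stdin and strips tags with Python's html.parser (the trailing "-" argument in the invocation is simply ignored here):
#!/usr/bin/env python3
# Illustrative sketch only; the actual remove_markup.py (attached below) may differ.
# Reads .txt file paths from stdin and writes a .clean file next to each of them,
# keeping only the text content of the truncated HTML.
import sys
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

for line in sys.stdin:
    path = line.strip()
    if not path:
        continue
    with open(path, encoding='utf-8') as f:
        parser = TextExtractor()
        parser.feed(f.read())
    with open(path + '.clean', 'w', encoding='utf-8') as f:
        f.write(' '.join(parser.chunks))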
Now put the script split_sentences.py (see below) in the working directory and invoke it:
find . -type f | grep -P 'clean$' | python3 ../split_sentences.py -
This would yield .split files, one for each original document. These files have one sentence per line.
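Again purely as an illustration, a naive sentence-splitting step could look roughly like the following; the actual split_sentences.py attached below may handle abbreviations and quotations more carefully. The sketch reads .clean file paths from stdin and splits on sentence-final punctuation:
#!/usr/bin/env python3
# Illustrative sketch only; the actual split_sentences.py (attached below) may differ.
# Reads .clean file paths from stdin and writes a .split file with one sentence per line.
import re
import sys

SENTENCE_END = re.compile(r'(?<=[.!?…])\s+')

for line in sys.stdin:
    path = line.strip()
    if not path:
        continue
    with open(path, encoding='utf-8') as f:
        text = ' '.join(f.read().split())  # collapse all whitespace
    sentences = [s.strip() for s in SENTENCE_END.split(text) if s.strip()]
    with open(path + '.split', 'w', encoding='utf-8') as f:
        f.write('\n'.join(sentences) + '\n')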
Here are a few remaining tweaks:
- Replace Latin characters with their Belarusian counterparts; specifically, put Belarusian і/І (U+0456, U+0406) in place of Latin i/I (U+0069, U+0049).
- Remove excessive whitespace.
- Normalize ,— into , —.
- Normalize the clitics -бы, -жа into бы, жа with a preceding space, as in the modern orthography.
- Normalize мо’, трэ’ into мо, трэ, as in the modern orthography.
- Drop lines with 1933 orthography (комуніст, большэвік, совецкі) and certain phrases, such as собственно in Paulinka and меджду протчым in Tutejshyja.
for filename in $(find . -type f | grep -P 'split$' | sort); do
cat $filename | sed -r 's/i/і/g;s/I/І/g;s/\s+/ /g;s/,([—‐–―−])/, \1/;s/([^жу])\-(бы|жа)([^а-зй-шы-яёіў])/\1 \2\3/g;s/’ / /g' | grep -Piv 'комун|большэ|совец|профэс|меджду|собственно|дабрудзею' > $filename".preproc"
done
This would yield .preproc files, one for each original document.
Some of the texts are available in multiple editions. In order to avoid extracting some sentences more than once, we'd like to identify duplicate files in each author's folder and keep only a single copy of each. Duplicate detection is based on a straightforward bag-of-words (BoW) model with pairwise cosine similarities.
Put the scripts find_duplicates.py and remove_duplicates.py (see below) in the working directory and invoke them:
cd ..
python3 find_duplicates.py | tee duplicates | python3 remove_duplicates.py -
This would remove the .preproc files detected as duplicates. Of each pair of duplicates, the smaller file is removed. We assume that duplicates come in pairs only, with no triples or larger groups. The duplicate pairs and their respective cosine similarities are written to the file duplicates.
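For illustration, a minimal sketch of the duplicate-detection part could look as follows; the actual find_duplicates.py attached below may differ, and the similarity threshold of 0.9 is an arbitrary assumption rather than the value used in the pipeline:
#!/usr/bin/env python3
# Illustrative sketch only; the actual find_duplicates.py (attached below) may differ.
# For every directory under knihi.com, build a bag-of-words vector for each
# .preproc file and print tab-separated pairs of files whose cosine similarity
# exceeds a threshold.
import itertools
import math
import os
from collections import Counter

THRESHOLD = 0.9  # hypothetical value, not the pipeline's setting

def bow(path):
    with open(path, encoding='utf-8') as f:
        return Counter(f.read().lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

for root, _, files in os.walk('knihi.com'):
    paths = sorted(os.path.join(root, f) for f in files if f.endswith('.preproc'))
    vectors = {p: bow(p) for p in paths}
    for p1, p2 in itertools.combinations(paths, 2):
        sim = cosine(vectors[p1], vectors[p2])
        if sim >= THRESHOLD:
            print(f'{p1}\t{p2}\t{sim:.3f}')
remove_duplicates.py would then read these tab-separated pairs from standard input and delete the smaller file of each pair.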
We'd like each of the .preproc files to be processed separately by the sentence extractor, in order to keep track of which sentence originates from which file. To that end, we make an appropriately structured copy of all .preproc files:
find . -type f | sort | grep -P 'preproc$' | xargs -n 1 -I {} echo "mkdir -p extraction-input/{}" | sed 's@\./knihi.com/@@' | rev | sed 's@/@__@' | rev | sed -r 's/\.txt\.clean\.split\.preproc$//' | sh
find . -type f | sort | grep -P 'preproc$' | xargs -n 1 -I {} echo "cp {} {}" | sed 's@\.preproc \./knihi.com/@.preproc extraction-input/@' | rev | sed 's@/@__@' | rev | sed -r 's@\.txt\.clean\.split\.preproc$@/file@' | sh
Assuming the sentence extractor repo is cloned in the working directory, set min_word_count = 4 in be.toml and finalize the procedure as follows:
mkdir extraction-output
cd cv-sentence-extractor
for dirname in $(ls ../extraction-input | sort); do
cargo run -- extract-file -l be -d ../extraction-input/$dirname >> ../extraction-output/$dirname.txt
done
This may take a couple of hours, as the extract-file mode appears to be rather slow.
To put all sentences in one file, concatenate the text files in extraction-output and remove the remaining duplicates. We also apply several final touches that haven't made their way into the preprocessing stage:
cd ..
cat extraction-output/*.txt | sed -r "s/^Ў/У/;s/,([а-зй-шы-яёіў])/, \1/ig;s/([Мм])о'/\1о /ig;s/ / /g;s/ч'а/ча/g;s/а'р/ар/g;s/\-([бж])([^а-зй-шы-яёіў])/ \1\2/g" | grep -Piv '(тудэма|вось-цо-да|бродзіе)' | sort | uniq | shuf > knihi.com-sampled-sentences.txt
Since the sentences are shuffled, we can take the first 4K as the random sample:
head -n 4000 knihi.com-sampled-sentences.txt > knihi.com-sampled-sentences-4K.txt
As advised by Yuras Hetsevich, we applied stricter filtering to the dataset by feeding it into an online spell checker of Belarusian. The spell checker returns a list of unrecognized tokens, one token per line. We filter out all sentences that contain at least one of these tokens. A straightforward, though algorithmically suboptimal, approach is to loop through the tokens, grep all sentences containing each of them, and then deduplicate the accumulated sentences. Assuming blacklist.txt (see below) is the spell checker's output, this is how we do it:
for line in $(cat blacklist.txt | sed -r 's/^у/[уў]/'); do
cat knihi.com-sampled-sentences.txt | grep -Pi "([^а-яёіў]|^)$line([^а-яёіў]|$)" >> accumulator;
done
cat accumulator | sort | uniq > knihi.com-sampled-sentences-blacklisted.txt
comm -23 <(sort knihi.com-sampled-sentences.txt) <(sort knihi.com-sampled-sentences-blacklisted.txt) | shuf > knihi.com-sampled-sentences-non-blacklisted.txt
head -n 4000 knihi.com-sampled-sentences-non-blacklisted.txt | sort > knihi.com-sampled-sentences-non-blacklisted-4K.txt
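As a side note, the same filtering could be done in a single pass over the sentences; the following Python sketch is not part of the original pipeline, and filter_blacklist.py is a hypothetical name. It keeps the blacklisted tokens in a set and checks each sentence once:
#!/usr/bin/env python3
# Illustrative alternative to the grep loop above, not part of the original pipeline.
# Reads sentences from stdin and prints only those containing no blacklisted token.
import re
import sys

# Mirror the sed rule above: a token starting with "у" also blacklists its "ў" variant.
blacklist = set()
with open('blacklist.txt', encoding='utf-8') as f:
    for token in f.read().split():
        token = token.lower()
        blacklist.add(token)
        if token.startswith('у'):
            blacklist.add('ў' + token[1:])

TOKEN = re.compile(r'[а-яёіў]+', re.IGNORECASE)

for sentence in sys.stdin:
    tokens = {t.lower() for t in TOKEN.findall(sentence)}
    if not tokens & blacklist:
        sys.stdout.write(sentence)
It could be used as a drop-in replacement for the loop above, e.g. cat knihi.com-sampled-sentences.txt | python3 filter_blacklist.py | shuf > knihi.com-sampled-sentences-non-blacklisted.txt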
Now knihi.com-sampled-sentences-non-blacklisted.txt is the final submission, and knihi.com-sampled-sentences-non-blacklisted-4K.txt is the 4K random sample, sorted lexicographically for convenience.