@sgraaf
Last active October 24, 2021 09:49
Simple bash script to extract and clean a Wikipedia dump. Adapted from: https://github.com/facebookresearch/XLM/blob/master/get-data-wiki.sh
#!/bin/sh
set -e

# input: the Wikipedia dump to process (a *-pages-articles.xml.bz2 file)
WIKI_DUMP_FILE_IN=$1
# output: the input file name with everything after the first dot replaced by .txt
WIKI_DUMP_FILE_OUT=${WIKI_DUMP_FILE_IN%%.*}.txt

# clone the WikiExtractor repository (skip if it is already present)
if [ ! -d wikiextractor ]; then
    git clone https://github.com/attardi/wikiextractor.git
fi

# extract and clean the chosen Wikipedia dump: WikiExtractor writes the article
# text to stdout (-o -), sed drops blank lines, and the two greps drop the
# <doc ...> / </doc> wrapper tags around each article.
# PYTHONPATH points at the freshly cloned repository so the wikiextractor
# package is importable without a separate pip install.
echo "Extracting and cleaning $WIKI_DUMP_FILE_IN to $WIKI_DUMP_FILE_OUT..."
PYTHONPATH=wikiextractor python3 -m wikiextractor.WikiExtractor "$WIKI_DUMP_FILE_IN" --processes 8 -q -o - \
| sed "/^\s*\$/d" \
| grep -v "^<doc id=" \
| grep -v "</doc>\$" \
> "$WIKI_DUMP_FILE_OUT"
echo "Successfully extracted and cleaned $WIKI_DUMP_FILE_IN to $WIKI_DUMP_FILE_OUT"
sgraaf commented Oct 24, 2021

As commented on my other Gist, the dump is a .xml.bz2 file. For a complete guide on how to download, extract, clean and pre-process a Wikipedia dump, see this Medium post.
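
For context, official dumps in that format can be fetched from https://dumps.wikimedia.org; a minimal sketch, assuming the "latest" English dump (the language code and dump name in the URL are illustrative):

# download the latest English Wikipedia pages-articles dump (illustrative URL; the file is large)
wget -c https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2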
