@sgraaf
Last active October 24, 2021 09:49
Simple bash script to extract and clean a Wikipedia dump. Adapted from: https://github.com/facebookresearch/XLM/blob/master/get-data-wiki.sh
#!/bin/sh
set -e
WIKI_DUMP_FILE_IN=$1
WIKI_DUMP_FILE_OUT=${WIKI_DUMP_FILE_IN%%.*}.txt
# clone the WikiExtractor repository
git clone https://github.com/attardi/wikiextractor.git
# extract and clean the chosen Wikipedia dump
echo "Extracting and cleaning $WIKI_DUMP_FILE_IN to $WIKI_DUMP_FILE_OUT..."
python3 -m wikiextractor.WikiExtractor "$WIKI_DUMP_FILE_IN" --processes 8 -q -o - \
| sed '/^\s*$/d' \
| grep -v '^<doc id=' \
| grep -v '</doc>$' \
> "$WIKI_DUMP_FILE_OUT"
echo "Successfully extracted and cleaned $WIKI_DUMP_FILE_IN to $WIKI_DUMP_FILE_OUT"
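To see what the filtering stage of the pipeline does, here is a minimal sketch that runs the same three filters over a small hand-written sample. The sample text is an assumption modelled on the `<doc id=...>` and `</doc>` wrapper lines that the grep patterns in the script target; it is not real WikiExtractor output. The sketch uses the POSIX class `[[:space:]]` in place of `\s` so that it does not depend on GNU sed.

```shell
#!/bin/sh
# Hypothetical sample imitating the wrapper lines that the script filters out.
sample='<doc id="1" url="https://example.org/1" title="Example">
Example

First paragraph of the article.

Second paragraph of the article.
</doc>'

# The same three filters as in the script:
#   1. drop empty and whitespace-only lines
#   2. drop the opening <doc id=...> line
#   3. drop the closing </doc> line
clean() {
    sed '/^[[:space:]]*$/d' \
    | grep -v '^<doc id=' \
    | grep -v '</doc>$'
}

printf '%s\n' "$sample" | clean
```

Only the title line and the two paragraph lines survive, which is why the resulting .txt file is plain text with one paragraph per line.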
@sgraaf
Author

sgraaf commented Dec 14, 2019

Hi @sgraaf. Thanks for the useful tutorial. I just wanted to let you know that a $ sign is missing before the WIKI_DUMP_FILE_OUT variable on line 16.

You're entirely right! Thanks for the catch, fixed the issue :)

@thibault-roux

Hello @sgraaf. I tried to run the script, but the file WikiExtractor.py wasn't found. I changed the path to wikiextractor/wikiextractor, where the file actually is, but it still doesn't work:

File "/home/troux/voy/wikiextractor/wikiextractor/WikiExtractor.py", line 66, in <module>
    from .extract import Extractor, ignoreTag, define_template, acceptedNamespaces
ImportError: attempted relative import with no known parent package

Do you have any idea how to solve this problem?
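The ImportError quoted above is what Python raises when a file that uses relative imports is executed directly as a script instead of as part of its package. The following is a minimal sketch of that behaviour, assuming `python3` is on the PATH; `demo_pkg`, `helper.py`, and `main.py` are hypothetical names created in a temporary directory and are not part of WikiExtractor.

```shell
#!/bin/sh
# Build a tiny throwaway package (hypothetical names) in a temp directory.
tmp=$(mktemp -d)
mkdir "$tmp/demo_pkg"
: > "$tmp/demo_pkg/__init__.py"
printf 'GREETING = "hello"\n' > "$tmp/demo_pkg/helper.py"
printf 'from .helper import GREETING\nprint(GREETING)\n' > "$tmp/demo_pkg/main.py"

# Running the file directly: Python knows no parent package for main.py,
# so the relative import fails.
if (cd "$tmp" && python3 demo_pkg/main.py) 2>/dev/null; then
    direct_status=ok
else
    direct_status=failed
fi

# Running it as a module from the parent directory: the package is known,
# so the relative import resolves.
module_output=$(cd "$tmp" && python3 -m demo_pkg.main)

echo "direct run: $direct_status"
echo "module run: $module_output"

# Remove the throwaway package again.
rm -rf "$tmp"
```

The script in this gist uses the module form (`python3 -m wikiextractor.WikiExtractor`), which is the invocation style that avoids this error.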

@sgraaf
Author

sgraaf commented Jun 13, 2021

I have updated the script. Could you try it again?

@thibault-roux

It seems to work. The extraction takes a long time, but that's expected. I will come back if I run into a problem; otherwise, consider it fixed!

Thanks.

@Alezas

Alezas commented Oct 23, 2021

What is the input for this script?

@sgraaf
Author

sgraaf commented Oct 24, 2021

As commented on my other Gist, it is a .xml.bz2 file. For a complete guide on how to download, extract, clean and pre-process a Wikipedia dump, see this Medium post.
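As a rough illustration of how the input name relates to the output name, here is a small sketch that checks for the .xml.bz2 suffix and derives the output file name with the same `${VAR%%.*}.txt` expansion the script uses. The helper names `is_wiki_dump` and `output_name` and the example file name are assumptions made for illustration; they do not appear in the script.

```shell
#!/bin/sh
# Derive the output name the way the script does: strip everything from
# the first dot onward, then append .txt.
output_name() {
    printf '%s\n' "${1%%.*}.txt"
}

# Hypothetical guard: accept only names ending in .xml.bz2.
is_wiki_dump() {
    case "$1" in
        *.xml.bz2) return 0 ;;
        *) return 1 ;;
    esac
}

dump="enwiki-latest-pages-articles.xml.bz2"   # example file name, assumed
if is_wiki_dump "$dump"; then
    echo "input:  $dump"
    echo "output: $(output_name "$dump")"
else
    echo "expected a .xml.bz2 dump, got: $dump" >&2
fi
```

Because `%%.*` cuts at the first dot in the whole string, a path with an earlier dot (for example one starting with `./`) would collapse to just `.txt`, so passing a bare file name from the current directory is the safer choice.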
