
@sgraaf
Last active October 23, 2021 21:28
Simple Python script to pre-process a Wikipedia dump
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
from pathlib import Path

from blingfire import text_to_sentences


def main():
    # Input path comes from the first command-line argument.
    wiki_dump_file_in = Path(sys.argv[1])
    # Output goes next to the input as <stem>_preprocessed<suffix>.
    wiki_dump_file_out = wiki_dump_file_in.parent / \
        f'{wiki_dump_file_in.stem}_preprocessed{wiki_dump_file_in.suffix}'

    print(f'Pre-processing {wiki_dump_file_in} to {wiki_dump_file_out}...')
    with open(wiki_dump_file_out, 'w', encoding='utf-8') as out_f:
        with open(wiki_dump_file_in, 'r', encoding='utf-8') as in_f:
            for line in in_f:
                # Split each input line into sentences, newline-separated.
                sentences = text_to_sentences(line)
                out_f.write(sentences + '\n')
    print(f'Successfully pre-processed {wiki_dump_file_in} to {wiki_dump_file_out}...')


if __name__ == '__main__':
    main()
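
To see what the script does to a file without installing blingfire, here is a minimal sketch of the same read-split-write loop. The `naive_split` function is a hypothetical stand-in for `text_to_sentences` (a crude regex, not blingfire's tokenizer), and `preprocess` and `demo` are illustrative names that are not part of the gist.

```python
import re
import tempfile
from pathlib import Path


def naive_split(text):
    """Hypothetical stand-in for blingfire.text_to_sentences.

    Returns the sentences of `text` joined by newlines. The regex is an
    assumption for illustration and is far cruder than blingfire.
    """
    # Split on whitespace that follows '.', '!' or '?'.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return '\n'.join(p for p in parts if p)


def preprocess(path_in, split=naive_split):
    """Write <stem>_preprocessed<suffix> next to path_in, one sentence per line."""
    path_in = Path(path_in)
    path_out = path_in.parent / f'{path_in.stem}_preprocessed{path_in.suffix}'
    with open(path_out, 'w', encoding='utf-8') as out_f, \
            open(path_in, 'r', encoding='utf-8') as in_f:
        for line in in_f:
            out_f.write(split(line) + '\n')
    return path_out


def demo():
    """Run preprocess on a tiny temporary file and return (output name, output text)."""
    with tempfile.TemporaryDirectory() as tmp:
        path_in = Path(tmp) / 'wiki.txt'
        path_in.write_text(
            'First sentence. Second one!\n\nNext article here.\n',
            encoding='utf-8',
        )
        path_out = preprocess(path_in)
        return path_out.name, path_out.read_text(encoding='utf-8')


if __name__ == '__main__':
    name, text = demo()
    print(name)
    print(text, end='')
```

With this stand-in splitter, a blank input line produces a blank output line, so paragraph or article boundaries in the input survive as empty lines. Passing `split=text_to_sentences` should give the gist's behavior, assuming blingfire is installed.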

Alezas commented Oct 23, 2021

Is the input to this program a wiki.bz2 or a .txt file?

sgraaf (author) commented Oct 23, 2021

It's for a .txt file. If you want to extract and clean a .xml.bz2 file, take a look at this script.
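
For context on the difference between the two formats, a `.bz2` file can be stream-decompressed with Python's standard `bz2` module, as in the sketch below. This only undoes the compression: the result of decompressing a `.xml.bz2` dump is still XML with wiki markup, so it does not replace the extraction and cleaning step mentioned above. The function names `decompress_bz2` and `demo` are hypothetical and chosen for illustration.

```python
import bz2
import tempfile
from pathlib import Path


def decompress_bz2(path_in):
    """Stream-decompress <name>.bz2 to <name> in the same directory."""
    path_in = Path(path_in)
    # with_suffix('') drops only the last suffix, so dump.xml.bz2 -> dump.xml
    path_out = path_in.with_suffix('')
    with bz2.open(path_in, 'rt', encoding='utf-8') as in_f, \
            open(path_out, 'w', encoding='utf-8') as out_f:
        # Iterating line by line avoids loading the whole dump into memory.
        for line in in_f:
            out_f.write(line)
    return path_out


def demo():
    """Compress a tiny sample, decompress it again, and return (name, text)."""
    with tempfile.TemporaryDirectory() as tmp:
        path_bz2 = Path(tmp) / 'dump.xml.bz2'
        with bz2.open(path_bz2, 'wt', encoding='utf-8') as f:
            f.write('<page>\n<text>Hello.</text>\n</page>\n')
        path_out = decompress_bz2(path_bz2)
        return path_out.name, path_out.read_text(encoding='utf-8')


if __name__ == '__main__':
    print(demo())
```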


Alezas commented Oct 23, 2021

Thank you!
