@Phil1108
Created April 22, 2021 11:47
GC4 Corpus Filtering Scripts
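import gzip
import pathlib
from ast import literal_eval

from tqdm import tqdm

if __name__ == '__main__':
    parent_dir = pathlib.Path("data_head_url")
    for file in tqdm(parent_dir.iterdir()):
        # Each input file holds its records on the first line,
        # as concatenated Python dict literals that start with {'url'.
        with gzip.open(file, 'rt') as f:
            line = f.readline()
        # Split on the leading key and restore it on every chunk.
        # The chunk before the first key is empty, so it is skipped.
        records = ["{'url'" + item for item in line.split("{'url'")[1:]]
        filtered = []
        for record in tqdm(records):
            try:
                if literal_eval(record)['language_score'] > 0.98:
                    filtered.append(record)
            except (ValueError, SyntaxError, KeyError, TypeError):
                # Skip chunks that are malformed or have no language_score.
                continue
        # Write only the records that passed the filter, one per line.
        # The output is gzip-compressed text despite the .tar.gz suffix.
        with gzip.open(f"{file.name}_filtered.tar.gz", 'wt') as file_new:
            for record in filtered:
                file_new.write(record + '\n')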
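
The split-and-filter step can be tried on an in-memory string, without any gzip files. The following is a minimal sketch, not part of the original gist: the function name `filter_records`, its `threshold` and `key` parameters, and the sample URLs are all assumptions made for illustration.

```python
from ast import literal_eval


def filter_records(line, threshold=0.98, key="{'url'"):
    """Return the record strings in `line` whose language_score exceeds `threshold`.

    Hypothetical helper: assumes `line` is a run of concatenated dict literals,
    each starting with `key`, as in the script above.
    """
    kept = []
    # Drop the text before the first key, then restore the key on each chunk.
    for chunk in line.split(key)[1:]:
        record = key + chunk
        try:
            if literal_eval(record)["language_score"] > threshold:
                kept.append(record)
        except (ValueError, SyntaxError, KeyError, TypeError):
            # Malformed chunk or missing score: skip it.
            continue
    return kept


# Made-up sample data: three records concatenated on one line.
sample = (
    "{'url': 'https://example.com/a', 'language_score': 0.99}"
    "{'url': 'https://example.com/b', 'language_score': 0.50}"
    "{'url': 'https://example.com/c', 'language_score': 0.995}"
)
kept = filter_records(sample)
```

Splitting on the literal `{'url'` only works while that exact text never appears inside a record's content. A record whose text contains it would be cut into pieces that fail to parse and are silently skipped.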
@PhilipMay

Thanks @sorgfresser
