@kasperjunge
Created July 26, 2022 07:19
Download n Ekstra Bladet news articles from the Danish mC4 dataset.
import pandas as pd
from tqdm import tqdm
from datasets import load_dataset


def download_n_eb_articles(n: int) -> pd.DataFrame:
    """Extract n Ekstra Bladet articles from the Danish subset
    of the mC4 dataset.

    Args:
        n (int): Number of articles to extract.

    Returns:
        pd.DataFrame: Ekstra Bladet articles.
    """
    # Stream the dataset so the full Danish subset is never downloaded.
    mc4 = load_dataset("mc4", "da", streaming=True)
    i, docs = 0, []
    with tqdm(total=n) as pbar:
        for doc in mc4["train"]:
            # Keep only documents whose URL mentions ekstrabladet.dk.
            if "ekstrabladet.dk" in doc["url"]:
                docs.append(doc)
                i += 1
                pbar.update(1)
                # Stop streaming once n articles have been collected.
                if i == n:
                    break
    return pd.DataFrame(docs)
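
The substring test above also matches URLs that merely mention ekstrabladet.dk in a path or query string. A minimal sketch of a stricter hostname check follows, assuming every URL carries a scheme such as https://. The names is_eb_url and take_matching are hypothetical helpers for illustration and are not part of the original gist.

```python
from urllib.parse import urlparse


def is_eb_url(url: str, domain: str = "ekstrabladet.dk") -> bool:
    """Return True if the URL's host is `domain` or one of its subdomains.

    Assumption: the URL includes a scheme; without one, urlparse finds
    no hostname and this returns False.
    """
    host = (urlparse(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)


def take_matching(docs, n: int) -> list:
    """Collect up to n docs whose "url" field points at the domain."""
    matches = []
    if n <= 0:
        # Nothing requested, so do not consume the iterable at all.
        return matches
    for doc in docs:
        if is_eb_url(doc["url"]):
            matches.append(doc)
            if len(matches) == n:
                break
    return matches
```

take_matching accepts any iterable of dicts with a "url" key, so it could be fed a streamed split in place of the inline loop, and the returned list can be passed to pd.DataFrame as before.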