@soldni

soldni/README.md Secret

Created September 14, 2023 00:48
Wikipedia Ablation Example (old code)

Wikipedia Ablation Example

Run all of the following commands from the root of this repository.

Step 1: Run Taggers

Install filter code:

# make sure to install conda first if on a bare machine
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh -b -p $HOME/miniforge

# if on linux, make sure gcc and protobuf are installed, e.g.
sudo apt install build-essential protobuf-compiler -y

# now install the filters
pip install pretrain_data/filters

# if on macOS, also run
python -m smashed.utils.install_blingfire_macos

Add tags:

ai2_llm_filters \
    -d wikipedia/v0 \
    -n abl0 \
    -t random_number_v1 \
        cld2_en_paragraph_with_doc_score_v2 \
        ft_lang_id_en_paragraph_with_doc_score_v2 \
        char_length_with_paragraphs_v1 \
        whitespace_tokenizer_with_paragraphs_v1 \
    -p 96   # run on 96 cores
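
The filters in Step 3 read these tags with expressions such as `...__document[0][2]`, which suggests that each attribute is stored as a list of `[start, end, score]` spans. The Python sketch below illustrates that shape for a whitespace word count. The function name `whitespace_count_span` and the exact span layout are assumptions made for illustration, not the tagger's actual implementation.

```python
def whitespace_count_span(text: str) -> list:
    """Return one document-level span: [start offset, end offset, word count]."""
    return [0, len(text), len(text.split())]


doc = {"id": "example", "text": "Wikipedia is a free online encyclopedia."}
attributes = {
    # Key name copied from the mixer config: <experiment>__<tagger>__<type>
    "abl0__whitespace_tokenizer_with_paragraphs_v1__document": [
        whitespace_count_span(doc["text"])
    ]
}
print(attributes)
```

Under this assumed layout, index `[0][2]` picks the third element (the score, here a word count) of the first span.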

Step 2: Deduplicate Against Perplexity Eval Set

Compile and install mixer/deduper:

cd pretrain_data/mixer
make build-tools    # installs Rust and the tools needed to build the mixer
make mixer          # builds the mixer; available at ./target/release/mixer
cd ../..            # return to the repository root for the remaining steps

Download the bloom filter for decontamination:

aws s3 cp \
    s3://ai2-llm/eval-data/perplexity/blocklists/eval_subset_v2/deduper_decontamination_lucas_20230525.bin \
    /tmp/decontamination/deduper_decontamination_lucas_20230525.bin

Now run the deduper:

DEDUPER_BIN="pretrain_data/mixer/target/release/deduper"
$DEDUPER_BIN \
    examples/wikipedia_ablation/deduper_config.json
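
As a rough illustration of what Bloom-filter decontamination involves, the Python sketch below marks paragraphs that already appear in an evaluation set. It is a toy: the `ToyBloomFilter` class, its hashing scheme, and the `[start, end, 1.0]` span output are assumptions for illustration and do not reflect the deduper's Rust implementation or its on-disk filter format.

```python
import hashlib


class ToyBloomFilter:
    """Minimal Bloom filter; a hypothetical stand-in for the real one."""

    def __init__(self, size_in_bits: int = 1 << 16, num_hashes: int = 4):
        self.size = size_in_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_in_bits // 8)

    def _positions(self, item: str):
        # Derive several bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def duplicate_paragraph_spans(text: str, bloom: ToyBloomFilter) -> list:
    """Return [start, end, 1.0] for each non-empty paragraph found in the filter."""
    spans, offset = [], 0
    for paragraph in text.split("\n"):
        if paragraph.strip() and paragraph in bloom:
            spans.append([offset, offset + len(paragraph), 1.0])
        offset += len(paragraph) + 1  # +1 for the newline removed by split
    return spans


bloom = ToyBloomFilter()
bloom.add("This sentence is in the eval set.")
doc = "A fresh paragraph.\nThis sentence is in the eval set."
print(duplicate_paragraph_spans(doc, bloom))
```

A Bloom filter can report false positives but never false negatives, so a paragraph from the evaluation set is always flagged, while an unrelated paragraph is occasionally flagged by mistake.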

Step 3: Run Mixer

Run mixer with mixer_config.json:

MIXER_BIN="pretrain_data/mixer/target/release/mixer"
$MIXER_BIN \
    examples/wikipedia_ablation/mixer_config.json

You can check out the mixer config to see how it works. In particular, it applies five operations:

  • Include all documents with length less than 100,000 whitespace-separated words:
    "include": [
        "$.attributes[?(@.abl0__whitespace_tokenizer_with_paragraphs_v1__document[0][2] < 100000)]"
    ]
  • Remove any document that is shorter than 50 words:
    "exclude": [
        "$.attributes[?(@.abl0__whitespace_tokenizer_with_paragraphs_v1__document[0][2] < 50)]",
        ...
    ]
  • Remove any document whose document-level English fastText language ID score is 0.5 or below:
    "exclude": [
        ...,
        "$.attributes[?(@.abl0__ft_lang_id_en_paragraph_with_doc_score_v2__doc_en[0][2] <= 0.5)]",
        ...
    ]
  • Replace paragraphs whose not-English cld2 score is above 0.1 (equivalently, whose English score is below 0.9) with an empty string:
    "span_replacement": [
        {
            "span": "$.attributes.abl0__cld2_en_paragraph_with_doc_score_v2__not_en",
            "min_score": 0.1,
            "replacement": ""
        },
        ...
    ]
  • Remove all documents that contain a paragraph that bff has tagged as a duplicate of the validation set:
    "exclude": [
        ...,
        "$@.attributes[?(@.bff_duplicate_paragraph_spans && @.bff_duplicate_paragraph_spans[0] && @.bff_duplicate_paragraph_spans[0][2] >= 1.0)]"
    ]
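
To make the combined effect of these rules concrete, here is a minimal Python sketch that applies the same thresholds to an in-memory document. It does not evaluate JSONPath and is not the mixer's implementation. The helper names (`first_score`, `keep_document`, `replace_spans`) and the idea that each attribute is a list of `[start, end, score]` spans are illustrative assumptions. The sketch further assumes that a document must match the include rule and none of the exclude rules, and that spans scoring at least `min_score` are replaced. All scores in the example are made up.

```python
PREFIX = "abl0__"
WORDS = PREFIX + "whitespace_tokenizer_with_paragraphs_v1__document"
DOC_EN = PREFIX + "ft_lang_id_en_paragraph_with_doc_score_v2__doc_en"
NOT_EN = PREFIX + "cld2_en_paragraph_with_doc_score_v2__not_en"
DUPES = "bff_duplicate_paragraph_spans"


def first_score(attributes: dict, key: str):
    """Score of the first span under `key`, or None if absent or empty."""
    spans = attributes.get(key) or []
    return spans[0][2] if spans else None


def keep_document(attributes: dict) -> bool:
    """Apply the include rule and the three exclude rules."""
    words = first_score(attributes, WORDS)
    doc_en = first_score(attributes, DOC_EN)
    dupe = first_score(attributes, DUPES)
    include = words is not None and words < 100_000
    exclude = (
        (words is not None and words < 50)
        or (doc_en is not None and doc_en <= 0.5)
        or (dupe is not None and dupe >= 1.0)
    )
    return include and not exclude


def replace_spans(text: str, spans: list, min_score: float, replacement: str = "") -> str:
    """Replace every span scoring at least `min_score`, working right to left."""
    for start, end, score in sorted(spans, reverse=True):
        if score >= min_score:
            text = text[:start] + replacement + text[end:]
    return text


text = "Hello world.\nBonjour le monde.\n"
attributes = {
    WORDS: [[0, len(text), 120]],   # made-up word count
    DOC_EN: [[0, len(text), 0.7]],  # made-up document-level English score
    NOT_EN: [[0, 13, 0.02], [13, 31, 0.95]],  # per-paragraph not-English scores
}
print(keep_document(attributes))
print(repr(replace_spans(text, attributes[NOT_EN], min_score=0.1)))
```

Replacing spans from right to left keeps the earlier offsets valid while the text shrinks.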

Note that the configuration runs the mixer on only 27 languages (28 Wikipedia editions, since Simple English is listed separately from English). With that selection and the filters above, the data shrinks from 27 GB to just over 8.4 GB.

Deduper config (deduper_config.json):

{
"documents": [
"pretraining-data/sources/wikipedia/v0/documents/lang=ady/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=af/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ak/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=als/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=am/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ami/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=an/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ang/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ar/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=arc/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ary/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=arz/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=as/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ast/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=atj/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=av/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=avk/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=awa/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ay/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=az/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=azb/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ba/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ban/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=bar/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=bat_smg/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=bcl/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=be/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=bg/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=bh/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=bi/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=bjn/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=blk/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=bm/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=bn/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=bo/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=bpy/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=br/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=bs/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=bug/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=bxr/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ca/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=cbk_zam/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=cdo/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ce/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ceb/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ch/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=chr/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=chy/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ckb/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=co/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=cr/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=crh/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=cs/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=csb/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=cu/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=cv/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=cy/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=da/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=dag/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=de/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=din/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=diq/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=dsb/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=dty/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=dv/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=dz/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ee/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=el/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=eml/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=en/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=eo/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=es/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=et/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=eu/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ext/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=fa/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ff/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=fi/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=fiu_vro/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=fj/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=fo/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=fr/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=frp/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=frr/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=fur/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=fy/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ga/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=gag/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=gan/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=gcr/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=gd/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=gl/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=glk/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=gn/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=gom/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=gor/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=got/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=gu/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=guw/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=gv/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ha/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=hak/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=haw/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=he/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=hi/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=hif/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=hr/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=hsb/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ht/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=hu/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=hy/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=hyw/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ia/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=id/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ie/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ig/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ik/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ilo/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=inh/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=io/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=is/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=it/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=iu/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ja/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=jam/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=jbo/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=jv/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ka/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=kaa/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=kab/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=kbd/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=kbp/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=kcg/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=kg/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ki/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=kk/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=kl/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=km/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=kn/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ko/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=koi/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=krc/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ks/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ksh/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ku/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=kv/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=kw/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ky/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=la/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=lad/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=lb/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=lbe/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=lez/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=lfn/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=lg/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=li/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=lij/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=lld/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=lmo/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ln/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=lo/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=lt/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ltg/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=lv/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=mad/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=mai/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=map_bms/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=mdf/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=mg/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=mhr/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=mi/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=min/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=mk/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ml/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=mn/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=mni/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=mnw/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=mr/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=mrj/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ms/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=mt/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=mwl/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=my/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=myv/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=mzn/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=na/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=nah/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=nap/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=nds/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=nds_nl/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ne/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=new/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=nia/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=nl/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=nn/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=no/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=nov/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=nqo/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=nrm/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=nso/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=nv/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ny/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=oc/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=olo/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=om/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=or/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=os/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=pa/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=pag/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=pam/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=pap/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=pcd/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=pcm/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=pdc/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=pfl/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=pi/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=pih/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=pl/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=pms/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=pnb/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=pnt/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ps/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=pt/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=pwn/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=qu/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=rm/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=rmy/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=rn/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ro/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=roa_tara/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ru/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=rue/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=rw/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=sa/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=sah/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=sat/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=sc/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=scn/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=sco/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=sd/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=se/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=sg/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=sh/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=shi/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=shn/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=si/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=simple/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=sk/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=skr/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=sl/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=sm/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=smn/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=sn/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=so/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=sq/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=sr/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=srn/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ss/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=st/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=stq/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=su/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=sv/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=sw/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=szl/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=szy/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ta/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=tay/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=tcy/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=te/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=tet/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=tg/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=th/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ti/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=tk/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=tl/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=tn/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=to/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=tpi/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=tr/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=trv/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ts/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=tt/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=tum/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=tw/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ty/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=tyv/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=udm/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ug/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=uk/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ur/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=uz/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=ve/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=vec/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=vep/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=vi/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=vls/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=vo/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=wa/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=war/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=wo/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=wuu/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=xal/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=xh/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=xmf/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=yi/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=yo/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=za/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=zea/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=zh/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=zh_classical/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=zh_min_nan/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=zh_yue/*.gz",
"pretraining-data/sources/wikipedia/v0/documents/lang=zu/*.gz"
],
"work_dir": {
"input": "/tmp/deduper/input",
"output": "/tmp/deduper/output"
},
"dedupe": {
"name": "decontamination",
"paragraphs": {
"attribute_name": "bff_duplicate_paragraph_spans"
},
"skip_empty": true
},
"bloom_filter": {
"file": "/tmp/decontamination/deduper_decontamination_lucas_20230525.bin",
"size_in_bytes": 8388608,
"read_only": true,
"estimated_doc_count": 3898706,
"desired_false_positive_rate": 0.001
},
"processes": 96
}
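
As a sanity check on the `bloom_filter` settings above, the textbook Bloom filter sizing formula m = -n ln(p) / (ln 2)^2 gives the number of bits needed for n items at false-positive rate p. The Python sketch below applies it to the values from the config. Whether the deduper uses exactly this formula, and whether `estimated_doc_count` counts documents or paragraphs, is not stated here, so treat the comparison as approximate.

```python
import math


def optimal_bloom_bits(n_items: int, false_positive_rate: float) -> int:
    """Bits needed by an optimally configured Bloom filter (textbook formula)."""
    return math.ceil(-n_items * math.log(false_positive_rate) / math.log(2) ** 2)


def optimal_num_hashes(n_bits: int, n_items: int) -> int:
    """Number of hash functions that minimizes the false-positive rate."""
    return max(1, round(n_bits / n_items * math.log(2)))


n, p = 3_898_706, 0.001      # estimated_doc_count, desired_false_positive_rate
configured_bytes = 8_388_608  # size_in_bytes (8 MiB)

bits = optimal_bloom_bits(n, p)
print(f"needed: {bits / 8 / 2**20:.2f} MiB; configured: {configured_bytes / 2**20:.0f} MiB")
print("hash functions:", optimal_num_hashes(bits, n))
```

By this estimate the filter needs a little under 7 MiB, so the configured 8 MiB leaves some headroom at the stated false-positive rate.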
Mixer config (mixer_config.json):

{
    "streams": [
        {
            "name": "example-wikipedia-ablation",
            "documents": [
                "pretraining-data/sources/wikipedia/v0/documents/lang=en/*.gz",
                "pretraining-data/sources/wikipedia/v0/documents/lang=de/*.gz",
                "pretraining-data/sources/wikipedia/v0/documents/lang=ja/*.gz",
                "pretraining-data/sources/wikipedia/v0/documents/lang=fr/*.gz",
                "pretraining-data/sources/wikipedia/v0/documents/lang=es/*.gz",
                "pretraining-data/sources/wikipedia/v0/documents/lang=ru/*.gz",
                "pretraining-data/sources/wikipedia/v0/documents/lang=it/*.gz",
                "pretraining-data/sources/wikipedia/v0/documents/lang=zh/*.gz",
                "pretraining-data/sources/wikipedia/v0/documents/lang=pt/*.gz",
                "pretraining-data/sources/wikipedia/v0/documents/lang=uk/*.gz",
                "pretraining-data/sources/wikipedia/v0/documents/lang=nl/*.gz",
                "pretraining-data/sources/wikipedia/v0/documents/lang=pl/*.gz",
                "pretraining-data/sources/wikipedia/v0/documents/lang=ca/*.gz",
                "pretraining-data/sources/wikipedia/v0/documents/lang=ar/*.gz",
                "pretraining-data/sources/wikipedia/v0/documents/lang=vi/*.gz",
                "pretraining-data/sources/wikipedia/v0/documents/lang=cs/*.gz",
                "pretraining-data/sources/wikipedia/v0/documents/lang=th/*.gz",
                "pretraining-data/sources/wikipedia/v0/documents/lang=he/*.gz",
                "pretraining-data/sources/wikipedia/v0/documents/lang=hu/*.gz",
                "pretraining-data/sources/wikipedia/v0/documents/lang=fa/*.gz",
                "pretraining-data/sources/wikipedia/v0/documents/lang=no/*.gz",
                "pretraining-data/sources/wikipedia/v0/documents/lang=id/*.gz",
                "pretraining-data/sources/wikipedia/v0/documents/lang=sr/*.gz",
                "pretraining-data/sources/wikipedia/v0/documents/lang=fi/*.gz",
                "pretraining-data/sources/wikipedia/v0/documents/lang=el/*.gz",
                "pretraining-data/sources/wikipedia/v0/documents/lang=ko/*.gz",
                "pretraining-data/sources/wikipedia/v0/documents/lang=ro/*.gz",
                "pretraining-data/sources/wikipedia/v0/documents/lang=simple/*.gz"
            ],
            "output": {
                "path": "pretraining-data/sources/wikipedia/a0/documents",
                "max_size_in_bytes": 1000000000
            },
            "attributes": [
                "abl0",
                "decontamination"
            ],
            "filter": {
                "include": [
                    "$.attributes[?(@.abl0__whitespace_tokenizer_with_paragraphs_v1__document[0][2] < 100000)]"
                ],
                "exclude": [
                    "$.attributes[?(@.abl0__whitespace_tokenizer_with_paragraphs_v1__document[0][2] < 50)]",
                    "$.attributes[?(@.abl0__ft_lang_id_en_paragraph_with_doc_score_v2__doc_en[0][2] <= 0.5)]",
                    "$@.attributes[?(@.bff_duplicate_paragraph_spans && @.bff_duplicate_paragraph_spans[0] && @.bff_duplicate_paragraph_spans[0][2] >= 1.0)]"
                ]
            },
            "span_replacement": [
                {
                    "span": "$.attributes.abl0__cld2_en_paragraph_with_doc_score_v2__not_en",
                    "min_score": 0.1,
                    "replacement": ""
                }
            ]
        }
    ],
    "work_dir": {
        "input": "/tmp/mixer/input",
        "output": "/tmp/mixer/output"
    },
    "processes": 96
}