https://wandb.ai/rom1504/dalle2_train_decoder/runs/mic5buox/files/decoder_config.json
get dalle2
get the config file
get these 2 .sh
run sbatch start_big.sh
var fs = require('fs');
var xml2js = require('xml2js');

var parser = new xml2js.Parser();
fs.readFile(__dirname + '/protocol.xml', function (err, data) {
  parser.parseString(data, function (err, result) {
    // dump the parsed XML as pretty-printed JSON, then read it back
    fs.writeFileSync('output.json', JSON.stringify(result, null, 2));
    var protocol = JSON.parse(fs.readFileSync(__dirname + '/output.json'));
  });
});
End goal: have a function that keeps only interesting video platform links. Having this would enable collecting billions of such links via cc2dataset.
This function is a binary classifier. To evaluate it, we need links that naturally occur in Common Crawl. Criteria:
To collect this eval set we can:
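A minimal sketch of such a filter (the platform list and function name are assumptions, not from the source):

import re

# hypothetical domain list; extend with whichever platforms count as "interesting"
VIDEO_PLATFORMS = re.compile(
    r"https?://(www\.)?(youtube\.com|youtu\.be|vimeo\.com|dailymotion\.com)/",
    re.IGNORECASE,
)

def is_video_platform_link(url):
    """Binary classifier: True if the url points to a known video platform."""
    return VIDEO_PLATFORMS.match(url) is not None

print(is_video_platform_link("https://www.youtube.com/watch?v=abc"))  # True
print(is_video_platform_link("https://example.com/page"))  # False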
import json
import pandas as pd
import subprocess
import sys


def get_msg(
    backticks=True  # whether to add backticks for Discord formatting or not
):
    """Gets a list of cluster usage from squeue and creates a text message from it."""
    a = json.loads(subprocess.check_output(["squeue", "--json"]).decode("utf8"))
    # sketch of the aggregation: field names assume Slurm's `squeue --json` schema
    jobs_per_user = {}
    for job in a["jobs"]:
        jobs_per_user[job["user_name"]] = jobs_per_user.get(job["user_name"], 0) + 1
    msg = "\n".join(f"{user}: {count} jobs" for user, count in sorted(jobs_per_user.items()))
    return f"```\n{msg}\n```" if backticks else msg
Steps:
(you can get https://huggingface.co/datasets/laion/laion-coco/resolve/main/part-00000-2256f782-126f-4dc6-b9c6-e6757637749d-c000.snappy.parquet as an example parquet)
See https://gist.github.com/rom1504/67ada3dedbecc113ae2dbdfd9c642d83 for a step-by-step guide to setting up the spark jars
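A minimal sketch for inspecting that example parquet with pyspark (assuming a local download; see the gist above for the jar setup needed for s3 paths):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("part-00000-2256f782-126f-4dc6-b9c6-e6757637749d-c000.snappy.parquet")
df.printSchema()
print(df.count())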
import wandb
import os
import numpy as np
import time
from os import listdir
import uuid
import sys

path = "/fsx/home-rom1504/"
""" | |
This is a deduplication method using pyspark. | |
input: table with id and 2 columns that contain float values | |
2 items are considered the same if the float values are equal with a threshold of 0.05 | |
algo: multiply the 2 columns by 1/0.05, resulting in 2 longs. Then use pyspark to perform exact dedup using these 2 columns | |
Pyspark does distributed sort then linear dedup, so this scales to 100B | |
""" |
""" | |
Can you improve it to avoid reading the whole tar file to count the number of samples? | |
""" | |
import json | |
import concurrent.futures | |
import tarfile | |
import fsspec | |
import io |
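A minimal sketch of one way to do it, assuming the tars live on a filesystem that supports range reads (s3, http, local); the url and extension filter are placeholders:

import tarfile
import fsspec

def count_samples(tar_url, ext=".jpg"):
    # fsspec returns a seekable file object, so tarfile only reads each
    # 512-byte member header and seeks past the member data instead of
    # downloading the whole archive
    with fsspec.open(tar_url, "rb") as f:
        with tarfile.open(fileobj=f, mode="r:") as tf:
            return sum(1 for member in tf if member.name.endswith(ext))

print(count_samples("s3://bucket/dataset/00000.tar"))  # placeholder url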