Romain Beaumont rom1504

@rom1504
rom1504 / distributed_dalle2_laion.md
Last active April 7, 2024 13:16
distributed dalle2 laion
@rom1504
rom1504 / mcpe_convert_protocol.js
Last active October 22, 2023 15:17
mcpe json conv
var fs = require('fs');
var xml2js = require('xml2js');
var parser = new xml2js.Parser();
fs.readFile(__dirname + '/protocol.xml', function(err, data) {
  parser.parseString(data, function (err, result) {
    fs.writeFileSync('output.json', JSON.stringify(result, null, 2));
    var protocol = JSON.parse(fs.readFileSync(__dirname + '/output.json'));
@rom1504
rom1504 / video_platform_filter.md
Last active October 20, 2023 15:51
Filtering URLs to keep only video platform links

End goal: a function that keeps only interesting video platform links. With it, billions of such links could be collected via cc2dataset.

This function is a binary classifier. To evaluate it, we need links that occur naturally in Common Crawl. Criteria:

  • Links that do not contain a video downloadable by yt-dlp should be discarded
  • "Bad" links (e.g. porn) should be discarded in the vast majority of cases
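
A minimal sketch of such a filter and its evaluation, assuming a simple domain allowlist. The domain list, the function names `is_video_platform_link` and `evaluate`, and the labeled examples are all hypothetical placeholders, not part of the original gist; a real filter would likely need many more platforms and a separate blocklist for "bad" links.

```python
from urllib.parse import urlparse

# Hypothetical allowlist for illustration only; the real set of platforms is not given in the text.
VIDEO_PLATFORM_DOMAINS = {"youtube.com", "youtu.be", "vimeo.com", "dailymotion.com"}


def is_video_platform_link(url):
    """Binary classifier: True if the URL's host is an allowlisted domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in VIDEO_PLATFORM_DOMAINS)


def evaluate(labeled_links):
    """Compute (precision, recall) over (url, should_keep) pairs."""
    tp = fp = fn = 0
    for url, should_keep in labeled_links:
        predicted = is_video_platform_link(url)
        if predicted and should_keep:
            tp += 1
        elif predicted and not should_keep:
            fp += 1
        elif not predicted and should_keep:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical hand-labeled eval set: the last link is on a video platform but has no video.
EXAMPLE_EVAL_SET = [
    ("https://www.youtube.com/watch?v=abc", True),
    ("https://example.com/page", False),
    ("https://vimeo.com/123", True),
    ("https://m.youtube.com/about", False),
]
```

Calling `evaluate(EXAMPLE_EVAL_SET)` shows why a domain check alone is not enough: the non-video page on an allowlisted domain counts as a false positive, which lowers precision while recall stays high.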

To collect this eval set we can:

@rom1504
rom1504 / slurm_stats.py
Last active August 7, 2023 02:05
slurm_users.py
import json
import pandas as pd
import subprocess
import sys
def get_msg(
    backticks=True  # whether to add backticks for Discord formatting or not
):
    "gets a list of cluster usage from squeue and creates a text message from it"
    a = json.loads(subprocess.check_output(['squeue','--json']).decode("utf8"))
@rom1504
rom1504 / spark_on_ssh.md
Last active August 7, 2023 02:03
spark_on_ssh.py
@rom1504
rom1504 / does_it_freeze.py
Last active August 7, 2023 02:03
does_it_freeze.py
import wandb
import os
import numpy as np
import time
from os import listdir
import uuid
import sys
path = "/fsx/home-rom1504/"
@rom1504
rom1504 / bucket_dedup.py
Created February 19, 2023 21:47
bucket_dedup.py
"""
This is a deduplication method using PySpark.
Input: a table with an id and 2 columns that contain float values.
2 items are considered the same if their float values are equal within a threshold of 0.05.
Algorithm: multiply the 2 columns by 1/0.05 and convert to integers, resulting in 2 longs. Then use PySpark to perform exact dedup on these 2 columns.
PySpark does a distributed sort then a linear dedup, so this scales to 100B.
"""
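
The bucketing idea described in the docstring can be sketched in plain Python, without Spark, to show the logic on a small list. This is an illustration under stated assumptions, not the gist's actual PySpark code: the names `bucket_key` and `bucket_dedup` are hypothetical, and flooring is assumed as the float-to-long conversion. Note that with this scheme two values closer than 0.05 can still land in different buckets when they straddle a bucket boundary.

```python
import math

THRESHOLD = 0.05  # threshold from the docstring


def bucket_key(a, b, threshold=THRESHOLD):
    """Scale both floats by 1/threshold and floor them to integers (assumed conversion)."""
    scale = 1 / threshold
    return (math.floor(a * scale), math.floor(b * scale))


def bucket_dedup(rows, threshold=THRESHOLD):
    """Sort rows by bucket key, then keep the first id of each bucket in one linear pass.

    rows: iterable of (row_id, float_a, float_b).
    """
    keyed = sorted((bucket_key(a, b, threshold), row_id) for row_id, a, b in rows)
    kept = []
    previous_key = None
    for key, row_id in keyed:
        if key != previous_key:  # first row seen for this bucket
            kept.append(row_id)
            previous_key = key
    return kept


# Hypothetical rows: "a" and "b" fall in the same bucket, "c" does not.
EXAMPLE_ROWS = [("a", 0.11, 0.21), ("b", 0.13, 0.23), ("c", 0.51, 0.91)]
```

The sort followed by a single pass mirrors the "distributed sort then linear dedup" structure the docstring attributes to PySpark, only on one machine.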
@rom1504
rom1504 / Streaming.py
Last active August 7, 2023 02:02
Count tar. Generated by gpt4
"""
Can you improve it to avoid reading the whole tar file to count the number of samples?
"""
import json
import concurrent.futures
import tarfile
import fsspec
import io
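
One possible way to count samples without reading file contents is to iterate over the tar member headers only; for an uncompressed, seekable file, `tarfile` skips over each payload rather than reading it. The sketch below is not the gist's code. It assumes a webdataset-style layout where files sharing the same name up to the first dot form one sample, and the helper names `count_samples` and `make_example_tar` are hypothetical.

```python
import io
import tarfile


def count_samples(fileobj):
    """Count distinct sample keys by walking tar headers only (payloads are not read)."""
    keys = set()
    # mode "r:" opens the archive strictly as uncompressed tar
    with tarfile.open(fileobj=fileobj, mode="r:") as tar:
        for member in tar:
            if not member.isfile():
                continue
            directory, _, basename = member.name.rpartition("/")
            # Assumed convention: the sample key is the file name up to its first dot
            keys.add(directory + "/" + basename.split(".", 1)[0])
    return len(keys)


def make_example_tar():
    """Build a small in-memory tar with 3 samples (hypothetical file names)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name in ["0001.jpg", "0001.txt", "0002.jpg", "0002.txt", "0003.jpg"]:
            payload = b"x" * 10
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer
```

Any seekable file object can be passed to `count_samples`, for example one returned by `fsspec.open(...)`; whether skipping payloads actually saves remote reads depends on the filesystem's caching behaviour.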
@rom1504
rom1504 / open_clip_slurm.md
Last active August 7, 2023 02:01
open clip at slurm

Install

git clone https://github.com/mlfoundations/open_clip.git
cd open_clip
python3.8 -m venv .env
source .env/bin/activate
pip install -U pip
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
pip install -e .