https://wandb.ai/rom1504/dalle2_train_decoder/runs/mic5buox/files/decoder_config.json
get dalle2
get the config file
get these 2 .sh
run sbatch start_big.sh
var fs = require('fs');
var xml2js = require('xml2js');

var parser = new xml2js.Parser();
fs.readFile(__dirname + '/protocol.xml', function (err, data) {
  parser.parseString(data, function (err, result) {
    // dump the parsed XML as pretty-printed JSON, then read it back
    fs.writeFileSync('output.json', JSON.stringify(result, null, 2));
    var protocol = JSON.parse(fs.readFileSync(__dirname + '/output.json'));
  });
});
End goal: have a function that keeps only interesting video platform links. Having this would enable collecting billions of such links via cc2dataset.
This function is a binary classifier. To evaluate it, we need links that naturally occur in Common Crawl. Criteria:
To collect this eval set we can:
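A minimal sketch of such a filter (the platform list and function name are assumptions, not from the source):

import re

# hypothetical domain list; extend with whichever platforms count as "interesting"
VIDEO_PLATFORMS = re.compile(
    r"https?://(www\.)?(youtube\.com|youtu\.be|vimeo\.com|dailymotion\.com)/",
    re.IGNORECASE,
)

def is_video_platform_link(url):
    """Binary classifier: True if the url points to a known video platform."""
    return VIDEO_PLATFORMS.match(url) is not None

print(is_video_platform_link("https://www.youtube.com/watch?v=abc"))  # True
print(is_video_platform_link("https://example.com/page"))  # False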
import json
import pandas as pd
import subprocess
import sys


def get_msg(
    backticks=True  # whether to add backticks for Discord formatting or not
):
    """Gets a list of cluster usage from squeue and creates a text message from it."""
    a = json.loads(subprocess.check_output(["squeue", "--json"]).decode("utf8"))
    # sketch of the aggregation: field names assume Slurm's `squeue --json` schema
    jobs_per_user = {}
    for job in a["jobs"]:
        jobs_per_user[job["user_name"]] = jobs_per_user.get(job["user_name"], 0) + 1
    msg = "\n".join(f"{user}: {count} jobs" for user, count in sorted(jobs_per_user.items()))
    return f"```\n{msg}\n```" if backticks else msg
Steps:
(you can get https://huggingface.co/datasets/laion/laion-coco/resolve/main/part-00000-2256f782-126f-4dc6-b9c6-e6757637749d-c000.snappy.parquet as an example parquet)
See https://gist.github.com/rom1504/67ada3dedbecc113ae2dbdfd9c642d83 for a step-by-step guide to setting up the spark jars
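A minimal sketch for inspecting that example parquet with pyspark (assuming a local download; see the gist above for the jar setup needed for s3 paths):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("part-00000-2256f782-126f-4dc6-b9c6-e6757637749d-c000.snappy.parquet")
df.printSchema()
print(df.count())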
import wandb
import os
import numpy as np
import time
from os import listdir
import uuid
import sys

path = "/fsx/home-rom1504/"
""" | |
This is a deduplication method using pyspark. | |
input: table with id and 2 columns that contain float values | |
2 items are considered the same if the float values are equal with a threshold of 0.05 | |
algo: multiply the 2 columns by 1/0.05, resulting in 2 longs. Then use pyspark to perform exact dedup using these 2 columns | |
Pyspark does distributed sort then linear dedup, so this scales to 100B | |
""" |
""" | |
Can you improve it to avoid reading the whole tar file to count the number of samples? | |
""" | |
import json | |
import concurrent.futures | |
import tarfile | |
import fsspec | |
import io |
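A minimal sketch of one way to do it, assuming the tars live on a filesystem that supports range reads (s3, http, local); the url and extension filter are placeholders:

import tarfile
import fsspec

def count_samples(tar_url, ext=".jpg"):
    # fsspec returns a seekable file object, so tarfile only reads each
    # 512-byte member header and seeks past the member data instead of
    # downloading the whole archive
    with fsspec.open(tar_url, "rb") as f:
        with tarfile.open(fileobj=f, mode="r:") as tf:
            return sum(1 for member in tf if member.name.endswith(ext))

print(count_samples("s3://bucket/dataset/00000.tar"))  # placeholder url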