Romain Beaumont rom1504

@rom1504
rom1504 / video_platform_filter.md
Last active October 20, 2023 15:51
Filtering URLs to keep only video platform links

End goal: a function that keeps only interesting video platform links. This would make it possible to collect billions of such links via cc2dataset.

This function is a binary classifier. To evaluate it we need links that naturally occur in Common Crawl. Criteria:

  • links that do not contain a video downloadable by yt-dlp should be discarded
  • the vast majority of "bad" links (e.g. porn) should be discarded
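
One possible shape for such a classifier is a domain allowlist combined with a blocklist. The sketch below is only an illustration under assumptions: the names `VIDEO_PLATFORM_DOMAINS`, `BLOCKED_DOMAINS`, and `keep_video_link` are hypothetical, the domain lists are placeholders, and it does not check whether yt-dlp can actually download the video.

```python
from urllib.parse import urlparse

# Hypothetical placeholder lists; real lists would be derived from the eval set.
VIDEO_PLATFORM_DOMAINS = {"youtube.com", "youtu.be", "vimeo.com", "dailymotion.com"}
BLOCKED_DOMAINS = {"blocked-example.com"}


def normalized_host(url: str) -> str:
    """Return the lowercased host without a leading 'www.', or '' if unparsable."""
    try:
        host = (urlparse(url).hostname or "").lower()
    except ValueError:
        return ""
    return host[4:] if host.startswith("www.") else host


def matches(host: str, domains: set) -> bool:
    """True if host equals one of the domains or is a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)


def keep_video_link(url: str) -> bool:
    """Binary classifier: keep only links hosted on an allowlisted video platform."""
    host = normalized_host(url)
    if not host or matches(host, BLOCKED_DOMAINS):
        return False
    return matches(host, VIDEO_PLATFORM_DOMAINS)
```

A domain check like this only approximates the first criterion; a stricter version could run the surviving links through yt-dlp on a sample to measure how many are really downloadable.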

To collect this eval set we can:

rom1504 / Streaming.py
Last active August 7, 2023 02:02
Count tar. Generated by gpt4
"""
Can you improve it to avoid reading the whole tar file to count the number of samples?
"""
import json
import concurrent.futures
import tarfile
import fsspec
import io
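
The gist preview is cut off after its imports, so the actual implementation is not visible here. As a rough sketch of the idea in the prompt (counting samples without reading file contents), one can walk only the tar headers on a seekable file object. The function name `count_samples` is hypothetical, and the sketch assumes a webdataset-style layout where all files of one sample share the part of the name before the first dot.

```python
import io
import tarfile


def count_samples(fileobj) -> int:
    """Count distinct sample keys by iterating tar headers only.

    Assumption: files belonging to one sample share the same directory and
    the same base name up to the first dot (e.g. 0001.jpg and 0001.txt).
    """
    keys = set()
    # Mode "r:" opens an uncompressed tar; member payloads are never read here.
    with tarfile.open(fileobj=fileobj, mode="r:") as tar:
        for member in tar:
            if member.isfile():
                directory, _, base = member.name.rpartition("/")
                keys.add((directory, base.split(".", 1)[0]))
    return len(keys)
```

Any seekable binary file object should work as `fileobj`, for example one returned by `open(path, "rb")`; whether this saves much I/O on remote storage depends on how the underlying file system handles seeks.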
rom1504 / bucket_dedup.py
Created February 19, 2023 21:47
bucket_dedup.py
"""
This is a deduplication method using pyspark.
input: table with an id and 2 columns that contain float values
Two items are considered duplicates if both float values match within a threshold of 0.05
algo: multiply the 2 columns by 1/0.05 and cast to long, resulting in 2 longs per row. Then use pyspark to perform exact dedup on these 2 columns
Pyspark does a distributed sort followed by a linear dedup, so this scales to 100B items
"""
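
The bucketing step described in the docstring can be illustrated in plain Python. This is a single-machine sketch, not the PySpark code from the gist: `bucket_key` and `bucket_dedup` are hypothetical names, and flooring is an assumption about how the scaled floats become longs.

```python
import math


def bucket_key(a: float, b: float, threshold: float = 0.05) -> tuple:
    """Scale both floats by 1/threshold and floor them to integers."""
    scale = 1.0 / threshold
    return (math.floor(a * scale), math.floor(b * scale))


def bucket_dedup(rows, threshold: float = 0.05) -> list:
    """Keep the id of the first row seen in each (bucket_a, bucket_b) cell.

    rows is an iterable of (id, a, b) tuples.
    """
    seen = set()
    kept = []
    for row_id, a, b in rows:
        key = bucket_key(a, b, threshold)
        if key not in seen:
            seen.add(key)
            kept.append(row_id)
    return kept


rows = [("a", 0.11, 0.52), ("b", 0.12, 0.53), ("c", 0.31, 0.52)]
print(bucket_dedup(rows))  # ['a', 'c']
```

Note that bucketing is an approximation of the threshold rule: two values closer than 0.05 can still fall on opposite sides of a bucket boundary and be kept as separate items. In PySpark, the same integer keys could presumably be passed to `DataFrame.dropDuplicates` as the subset columns.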
rom1504 / does_it_freeze.py
Last active August 7, 2023 02:03
does_it_freeze.py
import wandb
import os
import numpy as np
import time
from os import listdir
import uuid
import sys
path = "/fsx/home-rom1504/"
rom1504 / spark_session_aws.py
Last active June 26, 2023 21:40
spark_session_aws.py
from pyspark.sql import SparkSession
import os
import sys
from pyspark import SparkContext
from pyspark.sql.functions import rand
import random
import math
import time
import boto3
rom1504 / spark_on_ssh.md
Last active August 7, 2023 02:03
spark_on_ssh.py
rom1504 / phash.py
Created December 1, 2022 17:58
phash.py
import numpy as np
from scipy.fftpack import dct
def hash_algo(pil_img, size=10):
    """
    Get perceptual hash of the input image.
    Args:
        image_array: numpy array that corresponds to the image.
rom1504 / test_gpu.sh
Last active October 16, 2022 10:57
test_gpu.sh
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --job-name=gputest
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 8
#SBATCH --cpus-per-gpu=6
#SBATCH --gres=gpu:8
#SBATCH --nodelist gpu-st-p4d-24xlarge-42
#SBATCH --output=%x_%j.out
#SBATCH --open-mode=append
rom1504 / jax.sh
Last active October 2, 2022 21:41
jax gpu setup
python3.8 -m venv .env
source .env/bin/activate
pip install -U pip
pip install "jax[cuda11_cudnn82]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
python
import jax
jax.default_backend()
jax.devices()