Qingqing Cao csarron

import datetime
import json
import re
import string
import unicodedata
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModelForCausalLM
import torch
import time
import fire

Two main workarounds for mitigating the Hyak I/O issues:

  • Containerize the job environment. The Hyak team recommends Apptainer, both to speed up Python startup time and for reproducibility.

  • Copy frequently used data to the /tmp dir on the node. As described by the Hyak team, /tmp is around 400GB of isolated, fast SSD storage, so loading and saving data there won't affect others' jobs or slow down Hyak.
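The second workaround can be sketched in Python roughly as follows. This is a minimal sketch, not part of the original notes: the function name `stage_to_local` and the `/tmp/<user>/<name>` layout are assumptions made for illustration, and only the standard library is used.

```python
import os
import shutil
from pathlib import Path


def stage_to_local(src, local_root="/tmp"):
    """Copy `src` (a file or directory) under `local_root` once; return the local path.

    Hypothetical helper: the per-user subdirectory is an assumption,
    chosen to avoid name clashes with other users on the same node.
    """
    src = Path(src).resolve()
    user = os.environ.get("USER", "unknown")
    dst = Path(local_root) / user / src.name
    if dst.exists():
        # Already staged earlier on this node; reuse the local copy.
        return dst
    dst.parent.mkdir(parents=True, exist_ok=True)
    if src.is_dir():
        shutil.copytree(src, dst)
    else:
        shutil.copy2(src, dst)
    return dst
```

A job script would call something like `data_dir = stage_to_local("/path/on/shared/storage/dataset")` before loading, then read from `data_dir`. The existence check is deliberately simple and does not detect a partially finished copy, so an interrupted run may need the local copy removed by hand.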

Build a container image on the GPU node

Use salloc to create an interactive session, e.g. salloc -c 8 -p ckpt --time=5-00:00 -n 1 --mem=64G --gpus=a40:1

csarron / convert_pyarrow.py
Last active April 6, 2022 09:09
pip install pyarrow fire tqdm
"""
crawl images:
pip install img2dataset==1.11.0
img2dataset --url_list cc3m.tsv \
  --output_folder cc3m-img --input_format "tsv" \
  --url_col "url" --caption_col "caption" \
  --output_format files --resize_mode=no \
  --processes_count 10 --thread_count 64 --number_sample_per_shard 2000 \
  --enable_wandb True --save_metadata False
csarron / install_zsh_on_sherlock.sh
Created October 18, 2021 04:26 — forked from mgbckr/install_zsh_on_sherlock.sh
Compiling and installing Zsh without root privileges on Stanford's Sherlock (https://sherlock.stanford.edu) for use in tmux
# # Install Zsh on Sherlock
# Installs Zsh with Oh-My-Zsh without root privileges
# on Stanford's Sherlock (https://sherlock.stanford.edu) for use in tmux
#
# ## Instructions
# 1) bash install_zsh.sh
# 2) edit .zshrc (add the path to your Zsh binary to the PATH variable, etc.)
# 3) add `set-option -g default-shell <path to zsh>/bin/zsh` to `~/.tmux.conf`
# 4) also see comments for potential further notes
#
# 0. clone/fork repo,
# 1. add remote upstream
git remote add upstream https://github.com/xx/xx.git
# 2. checkout work branch
git checkout -b work
# If you are trying to "checkout" a new remote branch (that exists only on the remote, but not locally)
git fetch origin
csarron / calc_openness.py
Created July 22, 2020 19:56
Get openness statistics of conferences from DBLP
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Get openness statistics of top conferences, motivated by
http://s3.eurecom.fr/~balzarot/notes/inbreeding/inbreeding.html
install: pip install requests matplotlib
usage example:
csarron / optimize_bert.py
Created July 13, 2020 03:31 — forked from icemelon/optimize_bert.py
Optimize the BERT model on CPUs
import time
import argparse
import numpy as np
import mxnet as mx
import gluonnlp as nlp
import tvm
from tvm import relay
import tvm.contrib.graph_runtime as runtime
def timer(thunk, repeat=1, number=10, dryrun=3, min_repeat_ms=1000):
csarron / download_glue_data.py
Created December 2, 2019 01:17 — forked from W4ngatang/download_glue_data.py
Script for downloading data of the GLUE benchmark (gluebenchmark.com)
''' Script for downloading all GLUE data.
Note: for legal reasons, we are unable to host MRPC.
You can either use the version hosted by the SentEval team, which is already tokenized,
or you can download the original data from (https://download.microsoft.com/download/D/4/6/D46FF87A-F6B9-4252-AA8B-3604ED519838/MSRParaphraseCorpus.msi) and extract the data from it manually.
For Windows users, you can run the .msi file. For Mac and Linux users, consider an external library such as 'cabextract' (see below for an example).
You should then rename and place specific files in a folder (see below for an example).
mkdir MRPC
cabextract MSRParaphraseCorpus.msi -d MRPC