Thomas Wolf (thomwolf)

@lhoestq
lhoestq / en_wiki_length.py
Created June 15, 2020 19:08
English Wikipedia length
from nlp import load_dataset
from tqdm.auto import tqdm

wiki = load_dataset('wikipedia', '20200501.en', split="train")

batch_size = 1000
total_length = 0
for i in tqdm(range(0, len(wiki), batch_size)):  # loop takes ~1min to run
    batch = wiki[i:i + batch_size]
    total_length += sum(len(sample_text) for sample_text in batch["text"])

print(f"Total length of the 'text' column: {total_length} characters")
@thomwolf
thomwolf / loading_wikipedia.py
Last active January 18, 2024 14:04
Load the full English Wikipedia dataset with the HuggingFace nlp library (since renamed to datasets)
import os; import psutil; import timeit
from datasets import load_dataset

mem_before = psutil.Process(os.getpid()).memory_info().rss >> 20
wiki = load_dataset("wikipedia", "20200501.en", split='train')
mem_after = psutil.Process(os.getpid()).memory_info().rss >> 20
print(f"RAM memory used: {(mem_after - mem_before)} MB")

# time one full pass over the dataset in batches of 1000 examples
s = """batch_size = 1000
for i in range(0, len(wiki), batch_size):
    batch = wiki[i:i + batch_size]
"""
elapsed = timeit.timeit(stmt=s, number=1, globals=globals())
print(f"Time to iterate over the dataset: {elapsed:.1f} sec")
@W4ngatang
W4ngatang / download_glue_data.py
Last active April 16, 2024 06:10
Script for downloading data of the GLUE benchmark (gluebenchmark.com)
''' Script for downloading all GLUE data.
Note: for legal reasons, we are unable to host MRPC.
You can either use the version hosted by the SentEval team, which is already tokenized,
or you can download the original data from https://download.microsoft.com/download/D/4/6/D46FF87A-F6B9-4252-AA8B-3604ED519838/MSRParaphraseCorpus.msi and extract it manually.
On Windows you can simply run the .msi file; on Mac and Linux, consider an external tool such as 'cabextract' (see below for an example).
You should then rename and place specific files in a folder (see below for an example).
mkdir MRPC
cabextract MSRParaphraseCorpus.msi -d MRPC
@Tushar-N
Tushar-N / pad_packed_demo.py
Last active December 27, 2022 06:35
How to use pad_packed_sequence in pytorch<1.1.0
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
seqs = ['gigantic_string','tiny_str','medium_str']
# make <pad> idx 0
vocab = ['<pad>'] + sorted(set(''.join(seqs)))
# make model
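
The preview above stops after building the character vocabulary. Below is a minimal, self-contained sketch (not the gist's exact code) of the full pad, pack, RNN, unpack pattern; the embedding and LSTM sizes are arbitrary, and the sequences are sorted longest-first since PyTorch < 1.1.0 requires sorted lengths when packing.

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

seqs = ['gigantic_string', 'tiny_str', 'medium_str']
vocab = ['<pad>'] + sorted(set(''.join(seqs)))   # <pad> gets index 0
char2idx = {c: i for i, c in enumerate(vocab)}

# sort longest-first: required before packing in PyTorch < 1.1.0
seqs = sorted(seqs, key=len, reverse=True)
lengths = [len(s) for s in seqs]

# vectorize and pad with 0 (= <pad>) up to the longest sequence
padded = torch.zeros(len(seqs), max(lengths), dtype=torch.long)
for row, s in enumerate(seqs):
    padded[row, :len(s)] = torch.tensor([char2idx[c] for c in s])

embed = nn.Embedding(len(vocab), 10, padding_idx=0)   # arbitrary sizes
lstm = nn.LSTM(input_size=10, hidden_size=5, batch_first=True)

# pack the padded batch, run the LSTM, then unpack back to a padded tensor
packed = pack_padded_sequence(embed(padded), lengths, batch_first=True)
output, (h, c) = lstm(packed)
unpacked, unpacked_lengths = pad_packed_sequence(output, batch_first=True)

print(unpacked.shape)             # torch.Size([3, 15, 5]): (batch, max_len, hidden)
print(unpacked_lengths.tolist())  # [15, 10, 8]

Packing is what lets the LSTM skip padded positions, so hidden states are computed only over each sequence's real characters rather than over trailing <pad> tokens.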
# coding: utf-8
import logging
import re
from collections import Counter
import numpy as np
import torch
from sklearn.datasets import fetch_20newsgroups
from torch.autograd import Variable
from graphviz import Digraph
import torch
from torch.autograd import Variable, Function
def iter_graph(root, callback):
    # traverse the autograd graph starting from `root`, invoking `callback` on each function node
    queue = [root]
    seen = set()
    while queue:
        fn = queue.pop()
        if fn in seen:
            continue
@0xjac
0xjac / private_fork.md
Last active May 10, 2024 12:56
Create a private fork of a public repository

The repository for the assignment is public and GitHub does not allow the creation of private forks for public repositories.

The correct way of creating a private fork by duplicating the repo is documented here.

For this assignment the commands are:

  1. Create a bare clone of the repository. (This is temporary and will be removed so just do it wherever.)

git clone --bare git@github.com:usi-systems/easytrace.git

@GilLevi
GilLevi / README.md
Last active June 17, 2023 20:58
Emotion Recognition in the Wild via Convolutional Neural Networks and Mapped Binary Patterns

Gil Levi and Tal Hassner, Emotion Recognition in the Wild via Convolutional Neural Networks and Mapped Binary Patterns

Convolutional neural networks for emotion classification from facial images as described in the following work:

Gil Levi and Tal Hassner, Emotion Recognition in the Wild via Convolutional Neural Networks and Mapped Binary Patterns, Proc. ACM International Conference on Multimodal Interaction (ICMI), Seattle, Nov. 2015

Project page: http://www.openu.ac.il/home/hassner/projects/cnn_emotions/

If you find our models useful, please add a suitable reference to our paper in your work.

@maxim
maxim / gh-dl-release
Last active April 29, 2024 08:55
Download assets from private Github releases
#!/usr/bin/env bash
#
# gh-dl-release! It works!
#
# This script downloads an asset from latest or specific Github release of a
# private repo. Feel free to extract more of the variables into command line
# parameters.
#
# PREREQUISITES
#
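
The preview cuts off in the prerequisites section. For reference, here is a minimal Python sketch of the same idea (the gist itself is a bash script built on curl); the token, repo, tag, and asset name below are placeholders, not values from the gist.

import requests

# Placeholders: fill in for your own private repository
TOKEN = "<personal access token with repo scope>"
OWNER, REPO, TAG = "<owner>", "<repo>", "<tag>"
ASSET_NAME = "<asset file name>"

api = "https://api.github.com"
headers = {"Authorization": f"token {TOKEN}"}

# 1. Look up the release by tag and find the wanted asset's API URL
# (use f"{api}/repos/{OWNER}/{REPO}/releases/latest" for the latest release)
release = requests.get(f"{api}/repos/{OWNER}/{REPO}/releases/tags/{TAG}", headers=headers)
release.raise_for_status()
asset = next(a for a in release.json()["assets"] if a["name"] == ASSET_NAME)

# 2. Request the asset URL with the octet-stream Accept header to get the binary
data = requests.get(asset["url"], headers={**headers, "Accept": "application/octet-stream"})
data.raise_for_status()
with open(ASSET_NAME, "wb") as f:
    f.write(data.content)

The asset's API URL returns JSON metadata by default; the application/octet-stream Accept header is what makes GitHub redirect to the actual binary.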