Steven van de Graaf (sgraaf)

@sgraaf
sgraaf / http_user_agent_functions.php
Last active August 6, 2017 22:09
A collection of three useful HTTP_USER_AGENT functions for the client IP, OS, and browser
// function to get the client IP address
function get_ip() {
    $ipaddress = '';
    if (!empty($_SERVER['HTTP_CLIENT_IP']))
        $ipaddress = $_SERVER['HTTP_CLIENT_IP'];
    else if (!empty($_SERVER['HTTP_X_FORWARDED_FOR']))
        $ipaddress = $_SERVER['HTTP_X_FORWARDED_FOR'];
    else if (!empty($_SERVER['HTTP_X_FORWARDED']))
        $ipaddress = $_SERVER['HTTP_X_FORWARDED'];
    else if (!empty($_SERVER['HTTP_FORWARDED_FOR']))
        $ipaddress = $_SERVER['HTTP_FORWARDED_FOR'];
    // the proxy headers above are client-supplied and can be spoofed;
    // fall back to the address of the direct connection
    else
        $ipaddress = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : 'UNKNOWN';
    return $ipaddress;
}
@sgraaf
sgraaf / Bandcamp auto-renamer
Last active December 4, 2017 21:07
A simple script that auto-renames my chosen files and folders (after downloading them from Bandcamp) to standardize them for my music library
import os
import sys

import mutagen


# function to find the nth occurrence of needle in haystack (returns -1 if there is none)
def find_nth(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start + len(needle))
        n -= 1
    return start
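For instance, using the find_nth above (the file name below is purely illustrative, not taken from the gist):

# index of the 2nd " - " separator in an "Artist - Album - NN Title" style name
name = "Artist - Album - 01 Track.mp3"
second_sep = find_nth(name, " - ", 2)    # -> 14
print(name[second_sep + len(" - "):])    # -> "01 Track.mp3"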
import java.util.Arrays;
import java.util.Comparator;
import java.util.Properties;
import java.util.Random;

public class FitnessComparator implements Comparator<double[]> {
    @Override
    public int compare(double[] entry1, double[] entry2) {
        // the preview is cut off here; assuming the fitness value sits in the
        // first element of each entry
        return Double.compare(entry1[0], entry2[0]);
    }
}
{
"_brand": "Mammut",
"_color_nos": [
"4072",
"50134"
],
"_colors": [
"Olive",
"Poseidon"
],
@sgraaf
sgraaf / download_wiki_dump.sh
Last active October 26, 2022 04:02
Simple bash script to download the latest Wikipedia dump in the chosen language. Adapted from: https://github.com/facebookresearch/XLM/blob/master/get-data-wiki.sh
#!/bin/sh
set -e

LG=$1
WIKI_DUMP_NAME=${LG}wiki-latest-pages-articles.xml.bz2
WIKI_DUMP_DOWNLOAD_URL=https://dumps.wikimedia.org/${LG}wiki/latest/$WIKI_DUMP_NAME

# download latest Wikipedia dump in chosen language
echo "Downloading the latest $LG-language Wikipedia dump from $WIKI_DUMP_DOWNLOAD_URL..."
wget -c "$WIKI_DUMP_DOWNLOAD_URL"
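For example, ./download_wiki_dump.sh en fetches the latest English-language dump, enwiki-latest-pages-articles.xml.bz2; wget -c lets an interrupted download resume where it left off.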
@sgraaf
sgraaf / extract_and_clean_wiki_dump.sh
Last active October 24, 2021 09:49
Simple bash script to extract and clean a Wikipedia dump. Adapted from: https://github.com/facebookresearch/XLM/blob/master/get-data-wiki.sh
#!/bin/sh
set -e

WIKI_DUMP_FILE_IN=$1
WIKI_DUMP_FILE_OUT=${WIKI_DUMP_FILE_IN%%.*}.txt

# clone the WikiExtractor repository
git clone https://github.com/attardi/wikiextractor.git

# extract and clean the chosen Wikipedia dump
# (the preview is truncated here; the invocation below is a plausible reconstruction, not necessarily the gist's exact flags)
python3 wikiextractor/WikiExtractor.py "$WIKI_DUMP_FILE_IN" --processes 8 -q -o - \
    | sed '/^\s*$/d' | grep -v '^<doc id=' | grep -v '^</doc>' > "$WIKI_DUMP_FILE_OUT"
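Invoked as, e.g., ./extract_and_clean_wiki_dump.sh enwiki-latest-pages-articles.xml.bz2, the parameter expansion ${WIKI_DUMP_FILE_IN%%.*}.txt strips everything after the first dot, so the cleaned plain text ends up in enwiki-latest-pages-articles.txt.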
@sgraaf
sgraaf / preprocess_wiki_dump.py
Last active October 23, 2021 21:28
Simple python script to pre-process a Wikipedia dump
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
from pathlib import Path
from blingfire import text_to_sentences
def main():
wiki_dump_file_in = Path(sys.argv[1])
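The preview stops inside main(). A minimal sketch of how the rest of the script might look, assuming the goal is one sentence per line via blingfire's text_to_sentences (the output file name and the line-by-line loop are assumptions, not the gist's exact code):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
from pathlib import Path

from blingfire import text_to_sentences


def main():
    wiki_dump_file_in = Path(sys.argv[1])
    # assumption: write the sentence-split text next to the input file
    wiki_dump_file_out = wiki_dump_file_in.with_name(wiki_dump_file_in.stem + "_sentences.txt")

    with wiki_dump_file_in.open(encoding="utf-8") as fin, \
         wiki_dump_file_out.open("w", encoding="utf-8") as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue
            # text_to_sentences returns the detected sentences joined by newlines
            fout.write(text_to_sentences(line) + "\n")


if __name__ == "__main__":
    main()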
@sgraaf
sgraaf / transformers_vs_tokenizers.csv
Last active January 19, 2020 19:05
Tokenizers timing experiments: Transformers vs Tokenizers
implementation,mean execution time
transformers,6min 42s
tokenizers,45.6s
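The notebook behind these numbers is listed below but not rendered here. A single-run sketch of how such a comparison could be set up (the corpus and the local vocab file are placeholders, and the gist reports mean times over several runs):

import time

from tokenizers import BertWordPieceTokenizer   # Rust-backed tokenizer
from transformers import BertTokenizer          # Python ("slow") tokenizer

texts = ["some text to tokenize"] * 10_000      # placeholder corpus

slow = BertTokenizer.from_pretrained("bert-base-uncased")
start = time.perf_counter()
_ = [slow.encode(t) for t in texts]
print(f"transformers: {time.perf_counter() - start:.1f}s")

fast = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)
start = time.perf_counter()
_ = fast.encode_batch(texts)
print(f"tokenizers: {time.perf_counter() - start:.1f}s")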
@sgraaf
sgraaf / multithreaded.csv
Last active January 20, 2020 16:58
Tokenizers timing experiments: Multithreaded performance
implementation,mean execution time
submit,1min 8s
map,1min 9s
encode_batch,10.6s
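The row labels presumably refer to concurrent.futures.ThreadPoolExecutor.submit, ThreadPoolExecutor.map, and the tokenizers library's own encode_batch; a rough sketch of the three strategies under the same placeholder assumptions as above:

from concurrent.futures import ThreadPoolExecutor

from tokenizers import BertWordPieceTokenizer

texts = ["some text to tokenize"] * 10_000                         # placeholder corpus
tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt")  # placeholder vocab path

# "submit": schedule one future per text
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(tokenizer.encode, t) for t in texts]
    encodings = [f.result() for f in futures]

# "map": let the pool iterate over the corpus
with ThreadPoolExecutor() as pool:
    encodings = list(pool.map(tokenizer.encode, texts))

# "encode_batch": the tokenizer's own batched call, which handles parallelism internally
encodings = tokenizer.encode_batch(texts)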
@sgraaf
sgraaf / Tokenizers_timing_experiment.ipynb
Last active January 20, 2020 17:01
Tokenizers timing experiments: Jupyter Notebook