ellimilial

@ellimilial
ellimilial / string_to_deterministic_uniform_float.py
Last active January 21, 2020 09:29
Convert a string to a float in the range [0.0, 1.0), similar to random().
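import hashlib


def string_to_deterministic_float(s: str) -> float:
    """
    Given a string, deterministically map it to a uniformly distributed float in the range [0.0, 1.0).
    """
    b = bytes(s, encoding='utf-8')
    # SHA-256 hex digest: 64 hex characters, i.e. a 256-bit integer.
    h_dig = hashlib.sha256(b).hexdigest()
    # Divide by 16 ** 64 (the number of possible digests) to scale into [0.0, 1.0).
    return int(h_dig, base=16) / 16 ** len(h_dig)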
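A mapping like this can be used to assign items to stable buckets, for example a train/test split that stays the same across runs and machines. The following is only a minimal sketch under that assumption: `assign_split` and the `test_fraction` default of 0.2 are hypothetical and not part of the gist.

```python
import hashlib


def string_to_deterministic_float(s: str) -> float:
    """Map a string to a float in [0.0, 1.0) via its SHA-256 digest (same approach as the gist)."""
    h_dig = hashlib.sha256(s.encode("utf-8")).hexdigest()
    return int(h_dig, base=16) / 16 ** len(h_dig)


def assign_split(key: str, test_fraction: float = 0.2) -> str:
    """Hypothetical helper: a key lands in 'test' when its hash value is below test_fraction."""
    return "test" if string_to_deterministic_float(key) < test_fraction else "train"


if __name__ == "__main__":
    # The same key always gets the same split, regardless of process or machine.
    for user_id in ["alice", "bob", "carol"]:
        print(user_id, assign_split(user_id))
```

Unlike `random()`, no seed or shared state is needed: the assignment depends only on the key itself.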
@ellimilial
ellimilial / string_to_deterministic_float.py
Created January 20, 2020 13:03
Convert a string to a float in the range [0.0, 1.0), similar to random().
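import hashlib


def string_to_deterministic_float(s: str) -> float:
    """
    Given a string, deterministically convert it to a float in the range [0.0, 1.0).
    """
    b = bytes(s, encoding='utf-8')
    # SHA-256 hex digest: 64 hex characters, i.e. a 256-bit integer.
    h_dig = hashlib.sha256(b).hexdigest()
    # Divide by 16 ** 64 (the number of possible digests) to scale into [0.0, 1.0).
    return int(h_dig, base=16) / 16 ** len(h_dig)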
@ellimilial
ellimilial / blog_pl_scraper.py
Created January 7, 2018 20:02
Sample scraper for the blog.pl website, which is about to be decommissioned.
from collections import OrderedDict
import os
import bs4
import requests
import re
import json
from requests.adapters import HTTPAdapter
from urllib3 import Retry
@ellimilial
ellimilial / hadoop_benchmark_terasort_multiple_runs.sh
Last active December 23, 2015 14:48
Run the Hadoop TeraSort benchmark and average the run time for all stages. SSH friendly.
#!/bin/bash
# Script to run the hadoop terasort benchmark a specified number of times, getting the average runtime for all 3 stages.
#
# To run via ssh, say on Jenkins, wrap in:
# ssh namenode.server.com <<'ENDSSH'
# (... code ...)
# ENDSSH
readonly EXAMPLES_JAR="(...)/hadoop-mapreduce-examples.jar"
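The preview above is truncated, so the script's own averaging logic is not visible. As a rough illustration of the idea only (not the author's implementation), the Python sketch below times commands and averages per-stage run times across several runs. The stage names follow the three standard TeraSort phases (teragen, terasort, teravalidate); `time_command`, `average_stage_times`, and the sample numbers are hypothetical.

```python
import subprocess
import time
from statistics import mean
from typing import Dict, List, Sequence


def time_command(cmd: Sequence[str]) -> float:
    """Run a command and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def average_stage_times(runs: List[Dict[str, float]]) -> Dict[str, float]:
    """Average the duration of each stage over all runs.

    Each run maps a stage name to its duration in seconds; every run is
    assumed to contain the same stages as the first one.
    """
    return {stage: mean(run[stage] for run in runs) for stage in runs[0]}


if __name__ == "__main__":
    # Placeholder durations for illustration only; in practice each value
    # would come from time_command() wrapping the corresponding hadoop call.
    runs = [
        {"teragen": 10.0, "terasort": 20.0, "teravalidate": 4.0},
        {"teragen": 14.0, "terasort": 30.0, "teravalidate": 6.0},
    ]
    for stage, seconds in average_stage_times(runs).items():
        print(f"{stage}: {seconds:.1f}s average over {len(runs)} runs")
```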
@ellimilial
ellimilial / hadoop_benchmark_DFSIO_read_write_multiple_runs.sh
Last active December 23, 2015 12:00
A bash script to run Hadoop DFSIO tests multiple times with different file/batch sizes, averaging the results. Suitable for command line and SSH (Jenkins).
#!/bin/bash
#
# Run DFSIO write and read tests for multiple file size/count configurations. Get the average speed over RUNS_PER_CONFIG executions.
# The throughput calculation method assumes all tests are run on a single 'wave', i.e. BATCH_SIZE < total mapper task
#
# Replace hadoop/yarn in run commands as required.
#
# To run via ssh, say on Jenkins, wrap in:
# ssh namenode.server.com <<'ENDSSH'
@ellimilial
ellimilial / gist:5ef1d1917e00970d4457
Last active February 20, 2017 20:55
pip installation of custom gevent 1.1 repo - python 2.7.8+ problem with _ssl.sslwrap missing
sudo apt-get install python-dev cython git python-pip
sudo pip install git+git://github.com/ellimilial/gevent.git@master