Theodore Vasiloudis thvasilo

@thvasilo
thvasilo / script-template.sh
Created November 2, 2023 17:41 — forked from m-radzikowski/script-template.sh
Minimal safe Bash script template - see the article with full description: https://betterdev.blog/minimal-safe-bash-script-template/
#!/usr/bin/env bash
set -Eeuo pipefail
trap cleanup SIGINT SIGTERM ERR EXIT
script_dir=$(cd "$(dirname "${BASH_SOURCE[0]}")" &>/dev/null && pwd -P)
usage() {
cat <<EOF
Usage: $(basename "${BASH_SOURCE[0]}") [-h] [-v] [-f] -p param_value arg1 [arg2...]
# Script to set up the environment and files for training XGBoost jobs
# on the master of an MPI cluster created using AWS ParallelCluster
# Install personal choice packages
sudo apt install -y tmux emacs-nox htop parallel
# Needed for dmlc-core (?)
sudo apt install -y libcurl4-openssl-dev libssl-dev
# Parallel compress/decompress because we work with large bzipped files
thvasilo / parallel_file_process.py
Created October 18, 2018 16:23
An example script to process a text file in parallel using Python
import argparse
import multiprocessing as mp
import os
from operator import itemgetter
from collections import Counter
import functools
import json
def parse_args():
thvasilo / output.log
Last active August 17, 2018 01:20
Error output from datasketch-cpp
==18350== Invalid free() / delete / delete[] / realloc()
==18350== at 0x4C2F74B: operator delete[](void*) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==18350== by 0x469786: datasketches::kll_sketch<float>::~kll_sketch() (kll_sketch.hpp:173)
==18350== by 0x4A0056: void std::_Destroy<datasketches::kll_sketch<float> >(datasketches::kll_sketch<float>*) (stl_construct.h:93)
==18350== by 0x494DF3: void std::_Destroy_aux<false>::__destroy<datasketches::kll_sketch<float>*>(datasketches::kll_sketch<float>*, datasketches::kll_sketch<float>*) (stl_construct.h:103)
==18350== by 0x4890D4: void std::_Destroy<datasketches::kll_sketch<float>*>(datasketches::kll_sketch<float>*, datasketches::kll_sketch<float>*) (stl_construct.h:126)
==18350== by 0x479AE0: void std::_Destroy<datasketches::kll_sketch<float>*, datasketches::kll_sketch<float> >(datasketches::kll_sketch<float>*, datasketches::kll_sketch<float>*, std::allocator<datasketches::kll_sketch<float> >&) (stl_construct.h:151)
==18350== by
thvasilo / renaming_stuff.sh
Last active January 22, 2018 17:07
Using mmv and GNU parallel to easily rename batches of files under multiple dirs
# My situation: I have a bunch of experiments nested under parameter dirs
# 10/ 20/ 30/ ...
# Each parameter dir holds experiment files; a trailing _X marks the X-th repeat
# of the experiment on a specific dataset
# ls 10/
# dataset1_0.csv dataset2_0.csv dataset1_1.csv dataset2_1.csv
# Problem: I want to rename all the <datasetname>_1.csv files to <datasetname>_2.csv
# Solution: parallel & mmv!
# Use GNU parallel because it has a nicer syntax than bash for loops
parallel -j 2 -q mmv {1}/"*_1.csv" {1}/"#1_2.csv" ::: {10..100..10}
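For readers without mmv or GNU parallel installed, the same rename can be sketched with the Python standard library. This is a minimal illustration, not the gist author's method: the function name `rename_repeats` and its parameters are hypothetical, and it processes every direct subdirectory of the root rather than only `10/` through `100/`.

```python
import re
import tempfile
from pathlib import Path


def rename_repeats(root, old=1, new=2, dry_run=False):
    """Rename <name>_<old>.csv to <name>_<new>.csv in each direct subdirectory of root.

    Hypothetical helper: returns a list of (source, target) path pairs.
    """
    pattern = re.compile(rf"^(?P<stem>.+)_{old}\.csv$")
    renamed = []
    for path in sorted(Path(root).glob(f"*/*_{old}.csv")):
        match = pattern.match(path.name)
        if match is None:
            continue
        target = path.with_name(f"{match.group('stem')}_{new}.csv")
        # Refuse to clobber an existing file instead of silently overwriting it.
        if target.exists():
            raise FileExistsError(f"refusing to overwrite {target}")
        if not dry_run:
            path.rename(target)
        renamed.append((path, target))
    return renamed


if __name__ == "__main__":
    # Build a throwaway layout that mirrors the one described in the comments.
    with tempfile.TemporaryDirectory() as tmp:
        for param in ("10", "20"):
            exp_dir = Path(tmp) / param
            exp_dir.mkdir()
            for name in ("dataset1_0.csv", "dataset1_1.csv", "dataset2_1.csv"):
                (exp_dir / name).touch()
        for source, target in rename_repeats(tmp):
            print(f"{source.relative_to(tmp)} -> {target.relative_to(tmp)}")
```

Passing `dry_run=True` returns the planned renames without touching any files, which is a cheap way to check the pattern before committing to it.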
thvasilo / multiplytest_pytorch.py
Last active December 24, 2020 01:11
A matrix multiplication benchmark.
# Benchmark for measuring matrix multiplication speed, Martin Nilsson, Rise SICS
# relevant for certain Machine Learning tasks v1.0 2017-11-21
# v1.1 Theodore Vasiloudis (PyTorch solution)
# ====================================================
# Run by:
#
# python3 multiplytest.py 10000
#
# to measure squaring a 10000 x 10000 random matrix.
# Weirdly enough, K80 and Titan X get different results, probably something to do with numerical accuracy.
thvasilo / surv_1k.csv
Created December 2, 2016 11:47
A generated survival analysis file
ID EVENT TIME x x.1
1 1 110.443671250798 0 0.88954899716191
2 1 746.21020937277 1 0.85477636102587
3 1 249.656292624447 0 1.19875323530287
4 1 76.5375073833034 0 1.13521479736082
5 1 68.3884146201972 1 0.866565287671983
6 1 309.475210375677 0 0.832409728225321
7 1 19.2999312165329 1 1.0273647472728
8 0 1600.50948046765 1 0.750024672644213
9 1 524.368976549325 0 1.26851084339432
thvasilo / IncrementalSGD.java
Created November 21, 2016 10:55
A basic online SGD using the Flink stream API.
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*

Keybase proof

I hereby claim:

  • I am thvasilo on github.
  • I am tvas (https://keybase.io/tvas) on keybase.
  • I have a public key whose fingerprint is BD7D 432D 4124 630C A4F2 061E 4AA5 5B32 660B 2CB2

To claim this, I am signing this object:

thvasilo / TestNgrams.scala
Created March 23, 2015 11:09
Simple job to ensure LZO compressed Google Ngrams data can be read
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import scala.util.Random
import java.io._
import java.util.Properties
import org.apache.hadoop.fs._;
import org.apache.hadoop.conf._;
import org.apache.hadoop.io._;