Alexander Leibzon alexanderlz
@alexanderlz
alexanderlz / alluxio-cleanup.sh
Last active December 8, 2020 13:04
Alluxio - clean lost files in path
#!/bin/bash
ALLUXIO_CMD=/opt/alluxio/bin/alluxio
echo "using alluxio command at $ALLUXIO_CMD"
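# path to scan is the first argument; quoted so paths with spaces survive
SCAN_PATH="$1"
#for k in $($ALLUXIO_CMD fs ls -fR $SCAN_PATH | grep LOST | cut -f2 -d'%' | tr -d ' '); do echo "deleting $k"; $ALLUXIO_CMD fs rm $k;done
# list recursively, keep LOST entries, take the path after the '%' column, delete with up to 30 parallel workers
"$ALLUXIO_CMD" fs ls -fR "$SCAN_PATH" | grep LOST | cut -f2- -d'%' | xargs -P 30 -I{} "$ALLUXIO_CMD" fs rm '{}'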
alexanderlz / ubuntu-docking-station-monitor-switch.md
Last active July 31, 2016 12:26
Create a helper script that switches monitors when an Ubuntu laptop is docked or undocked
  1. Install arandr:

    sudo apt-get install arandr

  2. Save two configurations, one for the monitors attached through the docking station and one for the laptop screen only:

    ~/.screenlayout/dock.sh ~/.screenlayout/undock.sh

  3. Add a new udev rule:

alexanderlz / set_cover.py
Created January 21, 2014 14:19
Greedy set cover implementation (read more at http://en.wikipedia.org/wiki/Set_cover_problem). targetSet: the set you want to cover. subSets: dictionary of subsets used to cover the target set. Uncomment the "#result" lines if you also want to see the resulting set.
def greedySetCover(targetSet, subSets):
    if not subSets:
        return None
    probeKey = max(subSets, key=lambda x: len(targetSet.intersection(subSets[x])))
    probeSet = subSets[probeKey]
    #result = set()
    resKeys = set()
    while len(targetSet) > 0:
        #result.update(probeSet)
        targetSet = targetSet.difference(probeSet)
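The preview above stops partway through the loop. As a rough illustration of the greedy idea only (not the author's remaining code), here is a self-contained sketch. The name `greedy_set_cover`, the ordered list it returns, and the choice to return `None` when the target cannot be covered are all assumptions made for this example.

```python
def greedy_set_cover(target_set, sub_sets):
    """Greedily pick keys of sub_sets (a dict of sets) until target_set is covered.

    Returns the chosen keys in pick order, or None if full coverage is
    impossible. Hypothetical helper, written for illustration.
    """
    remaining = set(target_set)
    chosen = []
    while remaining:
        # Pick the subset that covers the most still-uncovered elements.
        best_key = max(
            sub_sets,
            key=lambda k: len(remaining & sub_sets[k]),
            default=None,
        )
        if best_key is None or not (remaining & sub_sets[best_key]):
            return None  # no subset makes progress, so no cover exists
        chosen.append(best_key)
        remaining -= sub_sets[best_key]
    return chosen


if __name__ == "__main__":
    subsets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5}, "d": {1, 5}}
    print(greedy_set_cover({1, 2, 3, 4, 5}, subsets))
```

The greedy choice does not guarantee a minimum cover; it is a standard approximation for the set cover problem.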
alexanderlz / col_desc.sql
Last active May 22, 2024 13:26
Get a column's description (comment) in PostgreSQL/Redshift
SELECT description FROM pg_catalog.pg_description WHERE objsubid =
(
    -- restrict to the same schema as below so the subquery returns a single row
    SELECT ordinal_position FROM information_schema.columns WHERE table_schema='public' AND table_name='YOUR_TABLE_NAME' AND column_name='YOUR_COLUMN_NAME'
)
AND objoid =
(
    SELECT oid FROM pg_class WHERE relname = 'YOUR_TABLE_NAME' AND relnamespace =
    (
        SELECT oid FROM pg_catalog.pg_namespace WHERE nspname = 'public'
    )
)
alexanderlz / lines.py
Created October 24, 2013 09:28
Get logical lines from a file using a separator other than '\n', with the restriction that the marker must appear at the beginning of a physical line. For example, given a file containing "[1]abcderf", "[2] this is some split line with newlines in it" and "[3] the third one", the method should yield only 3 lines if the "line" marker is '['.
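def fetchLinesByMarker(ffile, lineMarker):
    buff = []
    for ln in ffile:
        # a physical line starting with the marker closes the previous logical line
        if ln.startswith(lineMarker):
            if buff:
                yield(''.join(buff))
                del(buff[:])
        buff.append(ln)
    # flush the last logical line; skip it if the file was empty
    if buff:
        yield(''.join(buff))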
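A small usage sketch may make the behaviour concrete. It restates the generator under the hypothetical name `fetch_lines_by_marker` so the block runs on its own, and the exact line breaks in the sample text are an assumption, since the description does not show where the original file wraps.

```python
import io


def fetch_lines_by_marker(lines, line_marker):
    """Yield logical records that each begin at a line starting with line_marker."""
    buff = []
    for ln in lines:
        # A new marker closes the record collected so far.
        if ln.startswith(line_marker) and buff:
            yield "".join(buff)
            buff = []
        buff.append(ln)
    if buff:
        yield "".join(buff)


# Assumed layout of the example file from the description.
sample = io.StringIO(
    "[1]abcderf\n"
    "[2] this is some\n"
    "split line\n"
    "with newlines in it\n"
    "[3] the third one\n"
)
records = list(fetch_lines_by_marker(sample, "["))

if __name__ == "__main__":
    for record in records:
        print(repr(record))
```

Any iterable of strings works as input, so an open file handle can be passed directly in place of the `io.StringIO` object.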
alexanderlz / find_bad_file_by_block_hdfs.sh
Created February 20, 2013 17:28
Find a filename by block ID in HDFS (replace BAD_BLOCK_ID with the block ID you are looking for)
sudo -u hdfs hadoop fsck / -files -blocks | grep BAD_BLOCK_ID -B 5
alexanderlz / CryptoHash.java
Created January 7, 2013 14:33
Crypto hash UDF for Apache Hive. Allows users to hash values from HiveQL. Can be used to obfuscate data using MD5 or SHA-1.
package com.hiveextensions.udf;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import org.apache.hadoop.hive.ql.exec.UDF;
public final class CryptoHash extends UDF {
public String evaluate(final String s, final String algorithm) {
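The body of `evaluate` is cut off in this preview, so the following does not reproduce it. It is only a Python sketch of the general idea the description names: hashing a string with an algorithm chosen by name. The function name `crypto_hash`, the hex-string return value, the name normalisation, and returning `None` for a null input or an unknown algorithm are all assumptions.

```python
import hashlib


def crypto_hash(value, algorithm):
    """Return the hex digest of value under the named algorithm, or None.

    Hypothetical helper: accepts Java-style names such as "MD5" or "SHA-1"
    by normalising them to hashlib's "md5" / "sha1".
    """
    if value is None:
        return None
    name = algorithm.lower().replace("-", "")
    try:
        digest = hashlib.new(name)
    except ValueError:
        return None  # algorithm not known to hashlib
    digest.update(value.encode("utf-8"))
    return digest.hexdigest()


if __name__ == "__main__":
    print(crypto_hash("hello", "MD5"))
    print(crypto_hash("hello", "SHA-1"))
```

MD5 and SHA-1 are adequate for obfuscation as described, but neither is considered collision-resistant, so they should not be relied on for security-sensitive hashing.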
alexanderlz / log_diff.sh
Created July 9, 2012 14:51
One-liner to find which logs weren't updated in HDFS today
diff <(hadoop fs -ls /user/mapred/ | cut -f4 -d'/' | sort -u) <(hadoop fs -ls /user/mapred/*/*_data/$(date +%Y%m%d)* | cut -f4 -d'/' | sort -u) | grep '<' | cut -f2 -d'<'
alexanderlz / hdfs_list_running_jobs.sh
Created May 20, 2012 12:22
Hadoop CLI: one-liner to list running jobs with duration and slot usage
hadoop job -list | grep job_ | awk 'BEGIN{FS="\t";OFS=","};{print $1,strftime("%H:%M:%S", (systime()-int($3/1000)),1),"\""$4"\"","\""$6"\""}'