Tathagata tathagata

@tathagata
tathagata / modify-encoding.py
Created June 24, 2013 18:46
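Set Python 2's default encoding to UTF-8 before the site module loads
#!/usr/bin/python2.7 -S
# -S skips the automatic "import site", which would otherwise delete sys.setdefaultencoding.
import sys
sys.setdefaultencoding("utf-8")
import site  # load site manually now that the default encoding is set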
@tathagata
tathagata / default-system-encoding.py
Created June 24, 2013 18:45
Getting default encoding for python
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> sys.getfilesystemencoding()
'UTF-8'
@tathagata
tathagata / codecs-read.py
Last active December 18, 2015 22:09
Read file with Windows-1252 encoding.
import codecs

# 'file' holds the path of the word list; the codec name needs a plain ASCII hyphen.
corpus_words = set(map(lambda s: s.strip(),
    codecs.open(file, encoding='Windows-1252').readlines()))
for i in sorted(corpus_words):
    print i.encode("Windows-1252")
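The same read can be sketched for Python 3, where `open` takes an `encoding` argument directly. This is only an illustration: the function name `read_unique_words` and the two-word sample file are assumptions made up for the example, not part of the gist.

```python
import os
import tempfile


def read_unique_words(path, encoding="windows-1252"):
    """Return the set of stripped, non-empty lines decoded from `path`."""
    with open(path, encoding=encoding) as handle:
        return {line.strip() for line in handle if line.strip()}


# Hypothetical sample file: two accented words stored as Windows-1252 bytes,
# one of them repeated so that the set has something to deduplicate.
sample_path = os.path.join(tempfile.mkdtemp(), "sample_words.txt")
with open(sample_path, "wb") as handle:
    handle.write(u"caf\xe9\nna\xefve\ncaf\xe9\n".encode("windows-1252"))

words = read_unique_words(sample_path)
print(len(words))
```

Decoding happens once at the file boundary, so the words are ordinary text strings afterwards and need no per-line `encode` call for sorting or comparison.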
@tathagata
tathagata / detect_encoding.sh
Last active December 18, 2015 21:59
Using the file command [http://en.wikipedia.org/wiki/File_(command)] to detect encoding of a file
file -bi uniq_words_in_corpus.txt
#output: text/plain; charset=unknown-8bit
@tathagata
tathagata / chardet_test.py
Last active December 18, 2015 21:49
Bare minimum guesswork with Chardet
#file to parse: https://dl.dropbox.com/u/18146922/uniq_words_in_corpus.txt
def getEncoding(infile):
    import chardet
    # Read raw bytes; chardet works on undecoded data.
    rawdata = open(infile, "rb").read()
    result = chardet.detect(rawdata)
    charenc = result['encoding']
    print charenc
#output: ISO-8859-2.
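When chardet is not available, a much cruder fallback is to try a short list of candidate codecs and keep the first one that decodes cleanly. This is not how chardet works (it scores byte statistics); the function name `guess_encoding` and the candidate list are assumptions chosen for this sketch, written for Python 3.

```python
def guess_encoding(raw, candidates=("ascii", "utf-8", "windows-1252")):
    """Return the first candidate codec that decodes `raw` without error, else None."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


print(guess_encoding(b"plain words"))
print(guess_encoding(b"caf\xe9"))
```

Order matters: strict codecs go first, because a permissive single-byte codec accepts almost any input and would otherwise mask a valid UTF-8 file. A clean decode only shows that the bytes are legal in that codec, not that the text is meaningful.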
@tathagata
tathagata / identifierSplitting.py
Created April 19, 2013 02:56
Shows how to split identifiers in a directory full of Java files.
def identifierSplitByFolder(folderA,folderB):
    """ usage: identifierSplitByFolder(folderWithJavaFiles,folderWithJavaFilesIdentifierSplit ) """
    import re, string, os
    for root, directory, files in os.walk(folderA):
        for file in files:
            absfnA = os.path.join(folderA,file)
            absfnB = os.path.join(folderB,file)
            words=open(absfnA).read().replace("\r\n"," ").split(" ")
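The preview stops before the splitting step itself, so the author's actual rule is not visible. One common approach, shown here purely as an assumption, is a regular expression that breaks camelCase and underscore-separated names into tokens. The names `_TOKEN` and `split_identifier` are hypothetical.

```python
import re

# Three token shapes, tried in order:
#   an acronym (capitals not followed by a lowercase letter),
#   a word with an optional leading capital,
#   a run of digits.
_TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_identifier(identifier):
    """Split a Java-style identifier into its component tokens."""
    return _TOKEN.findall(identifier)


print(split_identifier("parseHTTPResponse"))
print(split_identifier("user_id2"))
```

Underscores and other punctuation match none of the alternatives, so they act as separators and never appear in the output.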
@tathagata
tathagata / syndict.py
Last active December 15, 2015 20:41
For Stackoverflow
def createSynsetDict():
    import pymysql
    conn = pymysql.connect(host='127.0.0.1', port=3306, user='root', passwd='', db='multiwordnet')
    cur = conn.cursor()
    syndict={}
    # Raw string keeps the backslashes in the Windows path literal.
    fp = open(r"C:\Users\Tathagata\projects\NewTracelabData\EX3\Albergate\AlbergateIdentifierJDKMethods201304040238SplitTransUniqWordsCopy.txt")
    content = fp.read()
    words = content.decode("utf-8").lower().split()
@tathagata
tathagata / last_updated_files.sh
Created February 27, 2013 05:31
Get the last updated files in a directory
find "$1" -type f -print0 | xargs -0 stat --format '%Y :%y %n' | sort -nr | cut -d: -f2- | head
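A rough Python counterpart of the pipeline above, for systems without GNU `stat`. The function name `latest_files` and the default limit of 10 (mirroring `head`) are assumptions for this sketch, and the demo directory is a throwaway created only to show the ordering.

```python
import os
import tempfile


def latest_files(folder, limit=10):
    """Return up to `limit` regular files under `folder`, newest first."""
    entries = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            # Skip symlinks so only regular files are listed, as with -type f.
            if os.path.isfile(path) and not os.path.islink(path):
                entries.append((os.path.getmtime(path), path))
    entries.sort(reverse=True)
    return [path for _mtime, path in entries[:limit]]


# Demo on a temporary directory with explicitly set modification times.
demo_dir = tempfile.mkdtemp()
for name, mtime in (("old.txt", 1000), ("new.txt", 2000)):
    demo_path = os.path.join(demo_dir, name)
    open(demo_path, "w").close()
    os.utime(demo_path, (mtime, mtime))

newest_first = [os.path.basename(p) for p in latest_files(demo_dir)]
print(newest_first)
```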
@tathagata
tathagata / random_word
Created February 27, 2013 05:29
random word in bash
sed `perl -e "print 1 + int rand(99999)"`"q;d" /usr/share/dict/words
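The one-liner above hard-codes an upper bound of 99999 lines. A Python sketch that needs no line count up front can use reservoir sampling, a different technique from the sed approach. The function name `random_line` and the three-word demo file are assumptions for illustration.

```python
import os
import random
import tempfile


def random_line(path, rng=random):
    """Pick one line uniformly at random in a single pass over the file."""
    chosen = None
    with open(path, encoding="utf-8", errors="replace") as handle:
        for count, line in enumerate(handle, start=1):
            # Keep the current line with probability 1/count.
            if rng.randrange(count) == 0:
                chosen = line.rstrip("\n")
    return chosen


# Hypothetical word list standing in for /usr/share/dict/words.
demo_words = os.path.join(tempfile.mkdtemp(), "words.txt")
with open(demo_words, "w", encoding="utf-8") as handle:
    handle.write("alpha\nbeta\ngamma\n")

picked = random_line(demo_words)
print(picked)
```

Replacing the kept line with probability 1/count at each step leaves every line equally likely by the end, whatever the file length.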
@tathagata
tathagata / dircompare.py
Created February 1, 2013 22:32
Quickly compare contents of folders. Ideal for replicating experiments
import os
folder_A=r'''path/to/folder/A'''
folder_B=r'''path/to/folder/B'''
for root_A, dirnames_A, filenames_A in os.walk(folder_A):
    for root_B, dirnames_B, filenames_B in os.walk(folder_B):
        print set(filenames_A) == set(filenames_B)
        print set(filenames_A) - set(filenames_B)
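The nested walks above compare every directory of one tree against every directory of the other. A variation, sketched here under the assumption that the two trees should match path for path, collects file paths relative to each root and compares the two sets once. The names `relative_files` and `compare_folders` and the demo folders are hypothetical.

```python
import os
import tempfile


def relative_files(folder):
    """Return the set of file paths under `folder`, relative to `folder`."""
    found = set()
    for root, _dirs, files in os.walk(folder):
        for name in files:
            found.add(os.path.relpath(os.path.join(root, name), folder))
    return found


def compare_folders(folder_a, folder_b):
    """Return (files only in A, files only in B), compared by relative path."""
    files_a = relative_files(folder_a)
    files_b = relative_files(folder_b)
    return files_a - files_b, files_b - files_a


# Demo with two temporary folders that share one file name.
demo_a, demo_b = tempfile.mkdtemp(), tempfile.mkdtemp()
for folder, names in ((demo_a, ["x.txt", "y.txt"]), (demo_b, ["y.txt", "z.txt"])):
    for name in names:
        open(os.path.join(folder, name), "w").close()

only_a, only_b = compare_folders(demo_a, demo_b)
print(sorted(only_a), sorted(only_b))
```

This checks names only; it does not compare file contents.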