Brendan O'Connor brendano

## md5sort.py
#!/usr/bin/env python
""" sorts lines (or tab-sep records) by md5.  (e.g. for train/test splits).
optionally prepends with the md5 id too.
brendan o'connor - anyall.org - gist.github.com/brendano """

import hashlib,sys,optparse
p = optparse.OptionParser()
p.add_option('-k',  type='int', default=False)
p.add_option('-p', action='store_true')
opts,args=p.parse_args()
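
The preview above stops after option parsing. As a rough sketch of the idea in the docstring (not the gist's actual code), the helpers below sort records by the MD5 digest of either the whole line or one tab-separated field. The names `md5_key` and `md5_sort`, and the reading of `-k` as a 0-based field index and `-p` as "prepend the digest", are assumptions made for illustration.

```python
import hashlib


def md5_key(line, key_field=None):
    """Return the hex MD5 digest of a record, or of one tab-separated field.

    key_field is a hypothetical 0-based field index; None hashes the whole line.
    """
    record = line.rstrip("\n")
    if key_field is not None:
        record = record.split("\t")[key_field]
    return hashlib.md5(record.encode("utf-8")).hexdigest()


def md5_sort(lines, key_field=None, prepend=False):
    """Sort lines by digest; optionally prepend the digest as a new first column."""
    keyed = sorted((md5_key(line, key_field), line.rstrip("\n")) for line in lines)
    if prepend:
        return [digest + "\t" + line for digest, line in keyed]
    return [line for _, line in keyed]


if __name__ == "__main__":
    for out in md5_sort(["b\tx\n", "a\ty\n", "c\tz\n"], key_field=0, prepend=True):
        print(out)
```

Because the digest depends only on the record's content, the resulting order is pseudo-random but reproducible, which is what makes it convenient for stable train/test splits.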

## xlsx2tsv.py
#!/usr/bin/env python
"""
xlsx2tsv  filename.xlsx  [sheet number or name]

Parse a .xlsx (Excel OOXML, which is not OpenOffice) into tab-separated values.
If it has multiple sheets, need to give a sheet number or name.
Outputs honest-to-goodness tsv, no quoting or embedded \\n\\r\\t.

One reason I wrote this is because Mac Excel 2008 export to csv or tsv messes
up encodings, converting everything to something that's not utf8 (macroman

## autolog.py
# Written by Brendan O'Connor, brenocon@gmail.com, www.anyall.org
#  * Originally written Aug. 2005
#  * Posted to gist.github.com/16173 on Oct. 2008

#   Copyright (c) 2003-2006 Open Source Applications Foundation
#
#   Licensed under the Apache License, Version 2.0 (the "License");
#   you may not use this file except in compliance with the License.
#   You may obtain a copy of the License at
#

## gist:39760
# Load the MNIST digit recognition dataset into R
# http://yann.lecun.com/exdb/mnist/
# assume you have all 4 files and gunzip'd them
# creates train$n, train$x, train$y  and test$n, test$x, test$y
# e.g. train$x is a 60000 x 784 matrix, each row is one digit (28x28)
# call:  show_digit(train$x[5,])   to see a digit.
# brendan o'connor - gist.github.com/39760 - anyall.org

load_mnist <- function() {
  load_image_file <- function(filename) {

## emoji.py
# -*- encoding: utf-8 -*-
# actually that encoding line is NOT important codewise. only for doc purposes.
"""
Detect emoji or other emoji-like things in Python.
The regular expressions here can be used to either identify emoji or to remove it.
The comments are written from the perspective of removing it.
The regexes get some stuff besides emoji.

by Brendan O'Connor (http://brenocon.com) 2016-10-20
originally written as part of https://arxiv.org/abs/1608.08868

## analysis.txt
How much text versus metadata is in a tweet?
Brendan O'Connor (brenocon.com), 2011-06-13
http://twitter.com/brendan642/status/80473880111742976

What's it mean to compare the amount of text versus metadata?
Let's start with raw size of the data that comes over the wire from Twitter.

## Get tweets out of a sample stream archive.
## (e.g. curl http://stream.twitter.com/1/statuses/sample.json)
% cat tweets.2011-05-19 | grep -P '"text":' | head -100000 > 100k_tweets
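
One way to make the "raw size" comparison concrete is sketched below, assuming each line of `100k_tweets` is a single JSON object with a `"text"` field. The function name `text_fraction` is hypothetical, and this is not necessarily how the original analysis computed its numbers.

```python
import json


def text_fraction(json_lines):
    """Fraction of raw UTF-8 bytes taken up by the tweet text itself."""
    text_bytes = 0
    total_bytes = 0
    for raw in json_lines:
        raw = raw.strip()
        if not raw:
            continue
        tweet = json.loads(raw)
        if "text" not in tweet:
            continue  # skip non-tweet messages such as delete notices
        total_bytes += len(raw.encode("utf-8"))
        text_bytes += len(tweet["text"].encode("utf-8"))
    return text_bytes / total_bytes if total_bytes else 0.0


if __name__ == "__main__":
    sample = ['{"text": "hello", "id": 1}']
    print(text_fraction(sample))
```

In practice this could be called as `text_fraction(open("100k_tweets", encoding="utf-8"))`.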

## morpha.py
"""
Wrapper around morpha from
http://www.informatics.sussex.ac.uk/research/groups/nlp/carroll/morph.html

Vaguely follows edu.stanford.nlp.Morphology except we implement with a pipe.
hacky.  Would be nice to use cython/swig/ctypes to directly embed morpha.yy.c
as a python extension.

TODO compare linguistic quality to lemmatizer in python's "pattern" package

## log_logistic.py.md

numerically stable implementation of the log-logistic function
Last active March 8, 2020

Binary case

This is just the middle section of Bob Carpenter's note on evaluating log-loss via the binary logistic function:
https://lingpipe-blog.com/2012/02/16/howprevent-overflow-underflow-logistic-regression/
The logp function calculates the negative cross-entropy:
    dotproduct( [y, 1-y],  [logP(y=1), logP(y=0)] )
where the input s is the scalar log-odds value beta'x.  The trick is to make this numerically stable for any choice of s and y.
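
A minimal sketch of one common way to get that stability is shown below; it is not necessarily identical to the code in the gist or in the linked note. It relies on the identities log σ(s) = -log(1 + exp(-s)) for s ≥ 0 and log σ(s) = s - log(1 + exp(s)) for s < 0, so `exp` is only ever called on a non-positive argument. The helper name `log_sigmoid` is an assumption; `logp`, `s`, and `y` follow the description above.

```python
import math


def log_sigmoid(s):
    """log(1 / (1 + exp(-s))), computed without overflowing exp()."""
    if s >= 0:
        return -math.log1p(math.exp(-s))
    return s - math.log1p(math.exp(s))


def logp(s, y):
    """Negative cross-entropy: y * log P(y=1) + (1 - y) * log P(y=0).

    s is the log-odds beta'x; y is the label in [0, 1].
    P(y=0) = 1 - sigmoid(s) = sigmoid(-s), so both terms reuse log_sigmoid.
    """
    return y * log_sigmoid(s) + (1.0 - y) * log_sigmoid(-s)


if __name__ == "__main__":
    for s in (-1000.0, -1.0, 0.0, 1.0, 1000.0):
        print(s, logp(s, 1), logp(s, 0))
```

A naive `math.log(1 / (1 + math.exp(-s)))` raises `OverflowError` for s = -1000, whereas the branch on the sign of `s` keeps every intermediate value finite.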

  
## .Rhistory
log(c(1.05,.6))
log(c(1.5,.6))
ifelse(runif(1000)>.5, 1.5, .6)
x=ifelse(runif(1000)>.5, 1.5, .6)
mean(x)
prod(x)
x=ifelse(runif(10)>.5, 1.5, .6)
x
y=replicate(100000,{x=ifelse(runif(10)>.5, 1.5, .6); prod(x)})
summary(y)

## make_views.py
#!/usr/bin/env python

# From your Zotero database and file storage,
# creates a simple HTML table, and directory full of symlinks,
# for quick-and-dirty web or Dropbox viewing.

# Installation: place in your Zotero folder
#   e.g. ~/Documents/zotero/
# And run it
#   e.g. python ~/Documents/zotero/make_views.py