@benmarwick
benmarwick / HTML2DTM.r
Created February 22, 2013 08:13
Take a folder of HTML files and convert them to a document term matrix for text mining. Includes removal of non-ASCII characters and iterative removal of stopwords.
# get data
setwd("C:/Downloads/html") # this folder has only the HTML files
html <- list.files()
# load packages
library(tm)
library(RCurl)
library(XML)
# get some code from github to convert HTML to text
writeChar(con="htmlToText.R", (getURL(ssl.verifypeer = FALSE, "https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/htmlToText/htmlToText.R")))
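The idea in this gist's description (strip non-ASCII characters, drop stopwords, then count terms per document) can be sketched in plain Python. This is a minimal illustration and not the gist's R/tm code: the `STOPWORDS` set and the function names `tokenize` and `document_term_matrix` are hypothetical, and stopwords are removed in a single pass rather than iteratively.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "and", "of", "to"}


def tokenize(text):
    """Drop non-ASCII characters, lowercase, split into words, remove stopwords."""
    ascii_only = text.encode("ascii", errors="ignore").decode("ascii")
    words = re.findall(r"[a-z]+", ascii_only.lower())
    return [w for w in words if w not in STOPWORDS]


def document_term_matrix(docs):
    """Return (vocabulary, matrix) with one row per document and one column per term."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


vocab, matrix = document_term_matrix(["The café and the cat", "A cat of cats"])
print(vocab)
print(matrix)
```

Each row of `matrix` lines up with one input document and each column with one entry of `vocab`, which is the same shape a document term matrix has in tm.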
@drjwbaker
drjwbaker / pastec-tutorial.md
Last active August 31, 2016 13:22
Getting Pastec up and running, 8 August 2016

Getting Pastec up and running

Pastec is an open source index and search engine for image recognition. This is how I got it working with lots of help from the hard work of Ryan Baumann, Shawn Graham and Matthew Lincoln.

Installation

Either install Ubuntu 14.04.5 as an operating system, or get a virtual machine from osboxes. Fire up with VirtualBox. Ensure VM is connected to the network (Settings>Network).

Install Pastec by following the documentation. Be sure to download and unzip visualWordsORB.dat into the build subdirectory of Pastec.

@duhaime
duhaime / classify_images.py
Last active July 13, 2018 12:15
Image to Vec
from __future__ import absolute_import, division, print_function
"""
This is a modification of the classify_images.py
script in Tensorflow. The original script produces
string labels for input images (e.g. you input a picture
of a cat and the script returns the string "cat"); this
modification reads in a directory of images and
generates a vector representation of the image using
@rccordell
rccordell / PoetryBot.rmd
Last active February 5, 2019 01:50 — forked from bmschmidt/words.R
---
title: "Programming Literary Bots"
author: "Ryan Cordell"
date: "3/12/2017"
output: html_document
---
## Acknowledgements
This version of my twitterbot assignment was adapted from [an original written in Python](https://www.dropbox.com/s/r1py3zazde2turk/Trendingmore.py?dl=0), which itself adapted code written by Mark Sample. That original bot tweeted (I've since stopped it) at [Quoth the Ravbot](https://twitter.com/Quoth__the). The current version owes much to advice and code borrowed from two colleagues at Northeastern University: Jonathan Fitzgerald and Benjamin Schmidt.
#!/usr/bin/python
## Split audio files into chunks
## Daniel Pett 1/5/2020
__author__ = 'portableant'
## Tested on Python 2.7.16 - yes I know I need to upgrade.
import argparse
import os
import speech_recognition as sr
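The preview above stops before any splitting logic, so the following is only a guess at one piece of such a script: computing the start and end offsets of fixed-length chunks. The function name `chunk_bounds` and the choice of fixed-length chunks measured in seconds are assumptions, not taken from the gist.

```python
def chunk_bounds(total_seconds, chunk_seconds):
    """Return (start, end) offsets in seconds that cover the whole recording.

    Hypothetical helper: the final chunk is shorter when the total length
    is not an exact multiple of chunk_seconds.
    """
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    bounds = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, float(total_seconds))
        bounds.append((start, end))
        start = end
    return bounds


print(chunk_bounds(10, 4))
```

The offsets could then be handed to whatever audio library does the actual cutting.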
@rccordell
rccordell / renderSite.R
Last active September 8, 2020 00:42
This script builds on Aleszu Bajak's excellent [tutorial on building a course website using R Markdown and Github pages](http://www.storybench.org/convert-google-doc-rmarkdown-publish-github-pages/). It automates the rendering of HTML files from RMD and automatically generates the page menu for the site, eliminating much duplicative work.
# This script builds on Aleszu Bajak's excellent
# [tutorial on building a course website using R Markdown and Github pages](http://www.storybench.org/convert-google-doc-rmarkdown-publish-github-pages/).
# I was excited about the concept but wanted to automate a few of the production steps: namely generating the HTML files
# for the site from the RMD pages (which Aleszu describes doing one-by-one) and generating the site navigation menu,
# which Aleszu handcodes in the `_site.yml` file. This script should automate both processes, though it may have some quirks
# unique to my setup that you'd want to tweak to fit your own. It's likely more loquacious than necessary as well, so feel free
# to condense as you can. Ideally, each time you make updates to your RMD files you can run this script to generate updated HTML
# pages and a new `_site.yml`. Then commit changes to Github and you're up and running!
# Once you've got everything configured for your own site below, you should be able to run `source('rend
@benmarwick
benmarwick / R2MALLET.r
Last active April 12, 2021 10:27
R code to operate MALLET entirely from within R. Set variables, send commands to Windows' command console and get MALLET's result back into R for further analysis.
# Set working directory
dir <- "C:\\" # adjust to suit
setwd(dir)
# configure variables and filenames for MALLET
## here using MALLET's built-in example data and
## variables from http://programminghistorian.org/lessons/topic-modeling-and-mallet
# folder containing txt files for MALLET to work on
importdir <- "C:\\mallet-2.0.7\\sample-data\\web\\en"
@cdiener
cdiener / asciinator.py
Created April 13, 2014 03:11
asciinator.py now with documentation
# This line imports the modules we will need. The first is the sys module used
# to read the command line arguments. Second the Python Imaging Library to read
# the image and third numpy, a linear algebra/vector/matrix module.
import sys; from PIL import Image; import numpy as np
# This is a list of characters from low to high "blackness" in order to map the
# intensities of the image to ascii characters
chars = np.asarray(list(' .,:;irsXA253hMHGS#9B&@'))
# Check whether all necessary command line arguments were given, if not exit and show a
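The mapping described in the comments (image intensities to characters ordered from low to high "blackness") can be sketched without PIL or numpy. This is a simplified illustration under stated assumptions: pixels are given as nested lists of grey values from 0 (black) to `max_value` (white), the scaling is linear, and the function name `to_ascii` is hypothetical. The gist's own scaling is not shown in the preview and may differ.

```python
# Same character ramp as in the gist, ordered from low to high "blackness".
CHARS = " .,:;irsXA253hMHGS#9B&@"


def to_ascii(pixels, max_value=255):
    """Map a 2D list of grey values (0 = black, max_value = white) to text."""
    rows = []
    for row in pixels:
        line = ""
        for value in row:
            blackness = 1.0 - value / max_value  # 0.0 for white, 1.0 for black
            index = int(round(blackness * (len(CHARS) - 1)))
            line += CHARS[index]
        rows.append(line)
    return "\n".join(rows)


print(to_ascii([[255, 128, 0], [0, 128, 255]]))
```

White pixels land on the space at the start of the ramp and black pixels on the `@` at the end, so dark regions of the image come out as dense characters.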
@ihercowitz
ihercowitz / image_resize.py
Created October 23, 2010 20:19
Python Script to resize all the images, on a given directory, to a 1024x768 JPEG format.
#!/usr/bin/env python
from PIL import Image  # the bare "import Image" form only works with the legacy PIL package
import os, sys
def resizeImage(infile, dir, output_dir="", size=(1024,768)):
    outfile = os.path.splitext(infile)[0]+"_resized"
    extension = os.path.splitext(infile)[1]
    if extension.lower()!= ".jpg":
@benmarwick
benmarwick / tweet-edits-to-archaeology-articles.R
Last active April 3, 2023 16:35
Using R with wikipedia for various things
# get recent changes from wikipedia
library(rvest)
n_changes <- 5000
recent_changes_url <- paste0("https://en.wikipedia.org/w/index.php?title=Special:RecentChanges&limit=", n_changes , "&days=1")
# connect to website
html <- read_html(recent_changes_url)
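For comparison, the URL that the R snippet assembles with `paste0` can be built in Python from a dictionary of query parameters. This is a sketch only: the function name `recent_changes_url` is hypothetical, nothing is fetched, and `urlencode` percent-encodes the colon in `Special:RecentChanges`, so the string is not byte-for-byte identical to the R one.

```python
from urllib.parse import urlencode


def recent_changes_url(limit=5000, days=1):
    """Build the Special:RecentChanges URL with the same parameters as the R code."""
    query = urlencode({
        "title": "Special:RecentChanges",
        "limit": limit,
        "days": days,
    })
    return "https://en.wikipedia.org/w/index.php?" + query


print(recent_changes_url())
```

Keeping the parameters in a dictionary makes it easy to change `limit` or `days` without editing the URL string by hand.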