Seb Bacon sebbacon

sebbacon / gist:178a8e3b5625f79919ad
Created June 20, 2014 12:58
bot with writeable data dir
require 'json'
require 'turbotlib'
Turbotlib.log("Starting scrape...")
state_file = open("#{Turbotlib.data_dir}/state.txt", "w")
(1...20).each do |n|
data = {
number: n,
message: "Hello #{n}",
@sebbacon
sebbacon / setup
Last active August 29, 2015 14:04
Demonstrating different behaviour between GET and POST for search queries. Hard to reproduce on a fresh install, as it relates to the size of the index. Reproducible on our production server, which is 10Gi with 13,286,926 items.
# Set up the index and type and mappings
curl -XPOST "http://localhost:9200/bork"
curl -XPOST "http://localhost:9200/bork/user/_mapping" -d '
{
"user":{
"properties":{
"data_type":{
"index":"not_analyzed",
"type":"string"
@sebbacon
sebbacon / gist:27f081763d19d86e93ca
Created December 12, 2014 09:45
Getting median dates for all jurisdictions via ElasticSearch
require 'elasticsearch'
client = Elasticsearch::Client.new(:host => "elasticsearch-lb")
results = client.search(
:index => 'openc_production',
:type => 'company',
:body => {
"query"=> {
"filtered"=> {
"query"=> {
"match_all"=> {}
@sebbacon
sebbacon / gist:4bef431144a569e868d5
Created January 30, 2015 08:16
ES query to find active companies
ElasticsearchClient::INDEX = "openc_production"
query = {
:type => "company",
:body => {
:filter => {
:bool => {
:should => [
{
:missing => {
@sebbacon
sebbacon / gist:0d80c2037c6ff1a77449
Last active August 29, 2015 14:14
bot to parse an XLS file
# -*- coding: utf-8 -*-
require 'json'
require 'turbotlib'
require 'mechanize'
Turbotlib.log("Starting run...") # optional debug logging
url = 'http://www.cnv.gov.ar/Infofinan/BLOB_Zip.asp?cod_doc=257519&error_page=Error.asp'
set -e
curl -XDELETE http://localhost:9200/data/test
curl -XPUT http://localhost:9200/data/_mapping/test -d '
{
"test" : {
"properties" : {
"License Status" : {"type": "string", "index" : "not_analyzed" },
"License Type" : {"type": "string", "index" : "not_analyzed" },
"License Expiration Date" : { "type" : "date", "format": "MM/dd/yyyy" }
}
# test with
#
# $ bundle exec hutch --require test.rb --verbose --mq-host rabbit1 --mq-api-port 55672
#
# Then publish a JSON message to the "testy" queue
class Test
include Hutch::Consumer
consume 'test.ack'
queue_name 'testy'
Hutch::Config.set(:mq_host, 'rabbit1')
Hutch::Config.set(:mq_api_host, 'rabbit1')
Hutch::Config.set(:mq_api_port, 55672)
Hutch.connect
10000.times do
Hutch.publish("angler.record", subject:'test')
end
@sebbacon
sebbacon / probablecompanies.sh
Created March 23, 2015 11:53
work out probable companies from raw data
for file in $(find . -name transformer.out); do
echo " $file";
jq -r 'select(.licence_holder.entity_type == "company" or (select(.licence_holder.entity_type == "unknown")|select(.licence_holder.entity_properties.name | test("\\b(LLC|LLLC|LLLP|LLC|INC|limited liability|incorporated|corporation|CORP|limited|LTD|PLLC|corp|llc|salon|sales|design|therapy|training|^Cnger|annual|institute|wine|beer|alcho|shop|[0-9#.]+|autos?|stores?|center|street|avenue|road|restaurant|east|west|north|south|central|club|pharmacy|carpet|electric|course|studio)\\b"; "i"))))|.permissions[].activity_name' $file >> /tmp/$(echo $file|cut -d "/" -f2).company.permissions;
done
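The selection logic in the jq filter above can be hard to read on one line, so here is a minimal Ruby sketch of the same idea: keep a record if its licence holder is a company, or if it is "unknown" but its name contains a company-like keyword, then collect the activity names. The names `COMPANY_HINT`, `probable_company?` and `company_activities` are hypothetical, the keyword list is an abridged subset of the original, and unlike the jq version this sketch tolerates records with no `permissions` key.

```ruby
require 'json'

# Abridged subset of the keyword list in the jq filter (assumption: the
# full list would be copied across in real use).
COMPANY_HINT = /\b(LLC|INC|limited liability|incorporated|corporation|CORP|limited|LTD|PLLC)\b/i

# True when the licence holder is a company, or is "unknown" but has a
# name containing one of the company-like keywords.
def probable_company?(record)
  holder = record.fetch('licence_holder', {})
  case holder['entity_type']
  when 'company' then true
  when 'unknown'
    name = holder.dig('entity_properties', 'name').to_s
    !!(name =~ COMPANY_HINT)
  else
    false
  end
end

# Activity names of every permission on records that look like companies.
# Expects one JSON document per input line.
def company_activities(json_lines)
  json_lines.flat_map do |line|
    record = JSON.parse(line)
    next [] unless probable_company?(record)
    record.fetch('permissions', []).map { |p| p['activity_name'] }
  end
end
```

Assuming each `transformer.out` holds one JSON record per line, `company_activities(File.foreach(path))` would yield the same kind of list that the shell loop appends to the per-directory files under `/tmp`.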
@sebbacon
sebbacon / gist:63d854d593f141336d1e
Created March 23, 2015 16:39
Load business licence transformers into elasticsearch
#!/bin/bash
curl -XDELETE 'http://elasticsearch-lb:9200/business_licences/'
curl -XPOST 'http://elasticsearch-lb:9200/business_licences/' -d '
{
"settings":{
"index":{
"analysis":{
"analyzer":{
"analyzer_keyword":{