@tadejsv
tadejsv / Instructions.md
Last active July 11, 2021 00:47
Jina 2.0 example

This script indexes ~800 poem verses from the Hugging Face poem_sentiment dataset, encodes them with a transformer model, and performs a KNN search over the embeddings using FAISS.

Before running, install all the requirements with these 3 commands:

conda create -n jina-2.0 -c conda-forge -c huggingface faiss-cpu datasets
conda activate jina-2.0
pip install jina sentence-transformers --pre
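
For reference, here is a minimal sketch of the encode-and-search flow described above, using sentence-transformers and FAISS directly rather than through Jina Executors; the model name and the top-k value are illustrative assumptions, not taken from the gist:

# Minimal sketch: encode the verses, build a FAISS index, run a KNN query.
# Model name and k are assumptions; the gist itself wires this through Jina.
import faiss
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

verses = [row["verse_text"] for row in load_dataset("poem_sentiment", split="train")]
model = SentenceTransformer("all-MiniLM-L6-v2")          # assumed model
embeddings = model.encode(verses, convert_to_numpy=True).astype("float32")

index = faiss.IndexFlatL2(embeddings.shape[1])           # exact L2 KNN index
index.add(embeddings)

query = model.encode(["the moon rises over the sea"], convert_to_numpy=True).astype("float32")
distances, ids = index.search(query, 5)                  # 5 nearest verses
for dist, i in zip(distances[0], ids[0]):
    print(f"{dist:.3f}  {verses[i]}")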
@semaperepelitsa
semaperepelitsa / bench_set_types.sql
Last active May 19, 2021 13:48
Postgres: array vs hstore vs jsonb (set operations)
create extension if not exists hstore;
create table if not exists test_array as
select id, '{"A","B","C"}'::varchar[] || id::varchar as codes from generate_series(1, 100000) id;
create table if not exists test_hstore as
select id, '"A"=>t,"B"=>t,"C"=>t'::hstore || hstore(id::varchar, 't') as codes from generate_series(1, 100000) id;
create table if not exists test_jsonb as
select id, '{"A":true,"B":true,"C":true}'::jsonb || jsonb_build_object(id::varchar, true) as codes from generate_series(1, 100000) id;
\timing on
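
The queries being timed are not shown above (the snippet ends at \timing on). As a purely hypothetical illustration of the kind of set-membership check such a benchmark compares, here is a small Python harness using psycopg2 and the @> containment operators; the gist itself runs its queries in psql:

# Hypothetical harness (psycopg2) timing a containment check against each table.
# The exact queries benchmarked in the gist are not shown above; these are guesses.
import time
import psycopg2

QUERIES = {
    "array":  "select count(*) from test_array  where codes @> array['A','B']::varchar[]",
    "hstore": "select count(*) from test_hstore where codes @> 'A=>t,B=>t'::hstore",
    "jsonb":  "select count(*) from test_jsonb  where codes @> '{\"A\": true, \"B\": true}'::jsonb",
}

conn = psycopg2.connect("dbname=postgres")  # assumed connection string
with conn, conn.cursor() as cur:
    for name, sql in QUERIES.items():
        start = time.perf_counter()
        cur.execute(sql)
        cur.fetchone()
        print(f"{name}: {time.perf_counter() - start:.4f}s")
conn.close()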
@shathor
shathor / FuzzySubstringSearch.java
Last active May 20, 2023 03:32
Performs a fuzzy substring search to find and return the portion of a string that matches another, given a maximum allowed distance between the two. It is a slightly modified Levenshtein distance. E.g. looking for 'abcd' in 'xyzabydxyz' with a maximum distance of 1 will return 'abyd'. Useful when trying to fuzzy-search within a sentence.
package ch.sgwerder.gist;
import org.junit.jupiter.api.Assertions; // only for testing, see below
import org.junit.jupiter.api.Test; // only for testing, see below
import java.util.stream.IntStream;
/**
* Copyright (C) 2017 Simon Gwerder.
* <p/>
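
The Java source above is cut off after the header. As a rough illustration of the technique the description explains (a Levenshtein-style dynamic program whose match may start anywhere in the text), here is a small Python sketch; it is not the gist's implementation:

# Illustrative Python sketch of a fuzzy substring search, not the gist's Java code.
# dp[i][j] = edit distance between needle[:i] and the best substring of haystack ending at j;
# the first row is all zeros, so the match may start at any position in the text.
def fuzzy_substring(needle, haystack, max_distance):
    n, m = len(needle), len(haystack)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if needle[i - 1] == haystack[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # needle char missing in text
                           dp[i][j - 1] + 1,         # extra char in text
                           dp[i - 1][j - 1] + cost)  # match or substitution
    end = min(range(m + 1), key=lambda j: dp[n][j])  # best position for the match to end
    if dp[n][end] > max_distance:
        return None
    # walk back through the table to find where the match starts
    i, start = n, end
    while i > 0:
        cost = 0 if start > 0 and needle[i - 1] == haystack[start - 1] else 1
        if start > 0 and dp[i][start] == dp[i - 1][start - 1] + cost:
            i, start = i - 1, start - 1
        elif start > 0 and dp[i][start] == dp[i][start - 1] + 1:
            start -= 1
        else:
            i -= 1
    return haystack[start:end]

assert fuzzy_substring("abcd", "xyzabydxyz", 1) == "abyd"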
@rvanbruggen
rvanbruggen / 1-importing_from_google_sheet.cql
Last active August 8, 2017 18:38
Importing and querying the web of Belgian public companies and their CEOs/chairmen
//Importing from the Google Spreadsheet
//import the Person nodes
load csv with headers from
"https://docs.google.com/spreadsheets/d/1_X628w_2Lx8ZAIPQQUAGhoDTuf31MRxY821E5D3u2Nc/export?format=csv&id=1_X628w_2Lx8ZAIPQQUAGhoDTuf31MRxY821E5D3u2Nc&gid=0" as persons
create (n:Node:Person)
set n = persons;
//import the Company nodes
load csv with headers from
"https://docs.google.com/spreadsheets/d/1_X628w_2Lx8ZAIPQQUAGhoDTuf31MRxY821E5D3u2Nc/export?format=csv&id=1_X628w_2Lx8ZAIPQQUAGhoDTuf31MRxY821E5D3u2Nc&gid=2040965723" as companies
@padajo
padajo / service-checklist.md
Last active April 25, 2023 13:34 — forked from acolyer/service-checklist.md
Internet Scale Services Checklist

A checklist for designing and developing internet scale services, inspired by James Hamilton's 2007 paper "On Designing and Deploying Internet-Scale Services."

An update by Paul Johnston (paul@roundaboutlabs.com) for a serverless architecture scenario. It assumes something akin to an AWS Lambda + API Gateway + DynamoDB (c. 2016) Function-as-a-Service (FaaS) solution as the basis for deployment, rather than the cloud-based virtual server approach the original paper was based upon. The FaaS solution implies that each function is separately scalable and that the database is inherently partitioned (assuming it is designed/built well).

If you agree/disagree, please fork and share with me on Twitter @pauldjohnston.

Vagrant.configure("2") do |config|
config.vm.box = "precise64"
config.vm.box_url = "http://files.vagrantup.com/precise64.box"
config.vm.network :private_network, ip: "192.168.33.101"
config.vm.synced_folder "./", "/vagrant", id: "vagrant-root"
end
cd ~
sudo apt-get update
sudo apt-get install openjdk-7-jre-headless -y
# Download the compiled elasticsearch rather than the source.
wget http://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.20.2.tar.gz -O elasticsearch.tar.gz
tar -xf elasticsearch.tar.gz
rm elasticsearch.tar.gz
sudo mv elasticsearch-* elasticsearch
sudo mv elasticsearch /usr/local/share
@clintongormley
clintongormley / gist:3888120
Created October 14, 2012 09:44
Upgrading a running elasticsearch cluster

Yesterday I upgraded our running elasticsearch cluster on a site that serves a few million search requests a day, with zero downtime. I've been asked to describe the process, hence this blog post.

To make it more complicated, the cluster was running elasticsearch version 0.17.8 (released 6 Oct 2011) and I upgraded it to the latest 0.19.10. There have been 21 releases between those two versions, with a lot of functional changes, so I needed to be ready to roll back if necessary.

Our setup:

  • elasticsearch

We run elasticsearch on two biggish boxes: 16 cores plus 32GB of RAM. All indices have 1 replica, so all data is stored on both boxes (about 45GB of data). The primary data for our main indices is also stored in our database. We have a few other indices whose data is stored only in elasticsearch, but those are updated only once a day. Finally, we store our sessions in elasticsearch, but active sessions are cached in memcached.

@karussell
karussell / elasticsearch-import-data
Last active October 30, 2023 16:14
ElasticSearch from SQL DB
Why is there no DataImportHandler-like thing in ElasticSearch? Uhm, well ... because:
1. You should really consider writing your own scripts (be it JVM-based, Perl, Ruby, PHP, or Node.js/JavaScript) to feed ElasticSearch via bulk indexing: http://www.elasticsearch.org/guide/reference/java-api/bulk.html
2. There are two projects doing it already:
* http://code.google.com/p/sql-to-nosql-importer/
* https://github.com/Aconex/scrutineer (keeps the DB in sync with ES or Solr!)
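
As a sketch of option 1 (feeding ElasticSearch with your own script via the bulk API), something like the following works in Python; the table, column names, and connection settings are made up for illustration:

# Hypothetical "roll your own importer" script: stream rows from a SQL database
# and index them through the bulk API. Table/column names and hosts are illustrative.
import sqlite3
from elasticsearch import Elasticsearch, helpers

def actions(db_path="app.db"):
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    for row in conn.execute("SELECT id, title, body FROM articles"):
        yield {
            "_index": "articles",
            "_id": row["id"],
            "_source": {"title": row["title"], "body": row["body"]},
        }

es = Elasticsearch("http://localhost:9200")
helpers.bulk(es, actions())  # batches the documents into bulk requests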
@nherment
nherment / backup.sh
Created February 29, 2012 10:42
Backup and restore an Elasticsearch index (shamelessly copied from http://tech.superhappykittymeow.com/?p=296)
#!/bin/bash
# herein we backup our indexes! this script should run at like 6pm or something, after logstash
# rotates to a new ES index and there's no new data coming in to the old one. we grab metadata,
# compress the data files, create a restore script, and push it all up to S3.
TODAY=`date +"%Y.%m.%d"`
INDEXNAME="logstash-$TODAY" # this had better match the index name in ES
INDEXDIR="/usr/local/elasticsearch/data/logstash/nodes/0/indices/"
BACKUPCMD="/usr/local/backupTools/s3cmd --config=/usr/local/backupTools/s3cfg put"
BACKUPDIR="/mnt/es-backups/"
YEARMONTH=`date +"%Y-%m"`