@tadejsv
tadejsv / Instructions.md
Last active July 11, 2021 00:47
Jina 2.0 example

This script indexes ~800 poem verses from the Hugging Face poem_sentiment dataset, encodes them with a transformer model, and performs a KNN search over the embeddings using FAISS.

Before running, install all the requirements with these 3 commands:

conda create -n jina-2.0 -c conda-forge -c huggingface faiss-cpu datasets
conda activate jina-2.0
pip install jina sentence-transformers --pre
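
For reference, here is a minimal sketch of the encode-and-search flow described above, using sentence-transformers and FAISS directly rather than through Jina Executors; the model name and the top-k value are illustrative assumptions, not taken from the gist:

# Minimal sketch: encode the verses, build a FAISS index, run a KNN query.
# Model name and k are assumptions; the gist itself wires this through Jina.
import faiss
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

verses = [row["verse_text"] for row in load_dataset("poem_sentiment", split="train")]
model = SentenceTransformer("all-MiniLM-L6-v2")          # assumed model
embeddings = model.encode(verses, convert_to_numpy=True).astype("float32")

index = faiss.IndexFlatL2(embeddings.shape[1])           # exact L2 KNN index
index.add(embeddings)

query = model.encode(["the moon rises over the sea"], convert_to_numpy=True).astype("float32")
distances, ids = index.search(query, 5)                  # 5 nearest verses
for dist, i in zip(distances[0], ids[0]):
    print(f"{dist:.3f}  {verses[i]}")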
@semaperepelitsa
semaperepelitsa / bench_set_types.sql
Last active May 19, 2021 13:48
Postgres: array vs hstore vs jsonb (set operations)
create extension if not exists hstore;
create table if not exists test_array as
select id, '{"A","B","C"}'::varchar[] || id::varchar as codes from generate_series(1, 100000) id;
create table if not exists test_hstore as
select id, '"A"=>t,"B"=>t,"C"=>t'::hstore || hstore(id::varchar, 't') as codes from generate_series(1, 100000) id;
create table if not exists test_jsonb as
select id, '{"A":true,"B":true,"C":true}'::jsonb || jsonb_build_object(id::varchar, true) as codes from generate_series(1, 100000) id;
\timing on
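
The queries being timed are not shown above (the snippet ends at \timing on). As a purely hypothetical illustration of the kind of set-membership check such a benchmark compares, here is a small Python harness using psycopg2 and the @> containment operators; the gist itself runs its queries in psql:

# Hypothetical harness (psycopg2) timing a containment check against each table.
# The exact queries benchmarked in the gist are not shown above; these are guesses.
import time
import psycopg2

QUERIES = {
    "array":  "select count(*) from test_array  where codes @> array['A','B']::varchar[]",
    "hstore": "select count(*) from test_hstore where codes @> 'A=>t,B=>t'::hstore",
    "jsonb":  "select count(*) from test_jsonb  where codes @> '{\"A\": true, \"B\": true}'::jsonb",
}

conn = psycopg2.connect("dbname=postgres")  # assumed connection string
with conn, conn.cursor() as cur:
    for name, sql in QUERIES.items():
        start = time.perf_counter()
        cur.execute(sql)
        cur.fetchone()
        print(f"{name}: {time.perf_counter() - start:.4f}s")
conn.close()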
@shathor
shathor / FuzzySubstringSearch.java
Last active May 20, 2023 03:32
Performs a fuzzy substring search to find and return the portion of a string that matches another, given a maximum allowed distance between the two. It is a slightly modified Levenshtein distance. E.g. looking for 'abcd' in 'xyzabydxyz' with a maximum distance of 1 will return 'abyd'. Useful when trying to fuzzy-search within a sentence.
package ch.sgwerder.gist;
import org.junit.jupiter.api.Assertions; // only for testing, see below
import org.junit.jupiter.api.Test; // only for testing, see below
import java.util.stream.IntStream;
/**
* Copyright (C) 2017 Simon Gwerder.
* <p/>
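
The Java source above is cut off after the header. As a rough illustration of the technique the description explains (a Levenshtein-style dynamic program whose match may start anywhere in the text), here is a small Python sketch; it is not the gist's implementation:

# Illustrative Python sketch of a fuzzy substring search, not the gist's Java code.
# dp[i][j] = edit distance between needle[:i] and the best substring of haystack ending at j;
# the first row is all zeros, so the match may start at any position in the text.
def fuzzy_substring(needle, haystack, max_distance):
    n, m = len(needle), len(haystack)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if needle[i - 1] == haystack[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # needle char missing in text
                           dp[i][j - 1] + 1,         # extra char in text
                           dp[i - 1][j - 1] + cost)  # match or substitution
    end = min(range(m + 1), key=lambda j: dp[n][j])  # best position for the match to end
    if dp[n][end] > max_distance:
        return None
    # walk back through the table to find where the match starts
    i, start = n, end
    while i > 0:
        cost = 0 if start > 0 and needle[i - 1] == haystack[start - 1] else 1
        if start > 0 and dp[i][start] == dp[i - 1][start - 1] + cost:
            i, start = i - 1, start - 1
        elif start > 0 and dp[i][start] == dp[i][start - 1] + 1:
            start -= 1
        else:
            i -= 1
    return haystack[start:end]

assert fuzzy_substring("abcd", "xyzabydxyz", 1) == "abyd"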
@rvanbruggen
rvanbruggen / 1-importing_from_google_sheet.cql
Last active August 8, 2017 18:38
Importing and querying the web of Belgian public companies and their CEOs/chairmen
//Importing from the Google Spreadsheet
//import the Person nodes
load csv with headers from
"https://docs.google.com/spreadsheets/d/1_X628w_2Lx8ZAIPQQUAGhoDTuf31MRxY821E5D3u2Nc/export?format=csv&id=1_X628w_2Lx8ZAIPQQUAGhoDTuf31MRxY821E5D3u2Nc&gid=0" as persons
create (n:Node:Person)
set n = persons;
//import the Company nodes
load csv with headers from
"https://docs.google.com/spreadsheets/d/1_X628w_2Lx8ZAIPQQUAGhoDTuf31MRxY821E5D3u2Nc/export?format=csv&id=1_X628w_2Lx8ZAIPQQUAGhoDTuf31MRxY821E5D3u2Nc&gid=2040965723" as companies
@padajo
padajo / service-checklist.md
Last active April 25, 2023 13:34 — forked from acolyer/service-checklist.md
Internet Scale Services Checklist

A checklist for designing and developing internet scale services, inspired by James Hamilton's 2007 paper "On Designing and Deploying Internet-Scale Services."

An update by Paul Johnston (paul@roundaboutlabs.com) for a serverless architecture scenario. It assumes something akin to an AWS Lambda + API Gateway + DynamoDB (c. 2016) Function-as-a-Service (FaaS) solution as the basis for deployment, rather than the cloud-based virtual server approach the original paper was based upon. The FaaS solution implies that each function is separately scalable and that the database is inherently partitioned (assuming it is designed/built well).

If you agree/disagree, please fork and share with me on Twitter @pauldjohnston.

Vagrant.configure("2") do |config|
config.vm.box = "precise64"
config.vm.box_url = "http://files.vagrantup.com/precise64.box"
config.vm.network :private_network, ip: "192.168.33.101"
config.vm.synced_folder "./", "/vagrant", id: "vagrant-root"
end
cd ~
sudo apt-get update
sudo apt-get install openjdk-7-jre-headless -y
# Download the compiled elasticsearch rather than the source.
wget http://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.20.2.tar.gz -O elasticsearch.tar.gz
tar -xf elasticsearch.tar.gz
rm elasticsearch.tar.gz
sudo mv elasticsearch-* elasticsearch
sudo mv elasticsearch /usr/local/share
@clintongormley
clintongormley / gist:3888120
Created October 14, 2012 09:44
Upgrading a running elasticsearch cluster

Yesterday I upgraded our running elasticsearch cluster on a site that serves a few million search requests a day, with zero downtime. I've been asked to describe the process, hence this blog post.

To make it more complicated, the cluster was running elasticsearch version 0.17.8 (released 6 Oct 2011) and I upgraded it to the latest 0.19.10. There have been 21 releases between those two versions, with a lot of functional changes, so I needed to be ready to roll back if necessary.

Our setup:

  • elasticsearch

We run elasticsearch on two biggish boxes: 16 cores plus 32GB of RAM. All indices have 1 replica, so all data is stored on both boxes (about 45GB of data). The primary data for our main indices is also stored in our database. We have a few other indices whose data is stored only in elasticsearch, but those are updated only once a day. Finally, we store our sessions in elasticsearch, but active sessions are cached in memcached.

@karussell
karussell / elasticsearch-import-data
Last active October 30, 2023 16:14
ElasticSearch from SQL DB
Why is there no DataImportHandler-like thing in ElasticSearch? Uhm, well ... because:
1. You should really consider writing your own scripts (be it JVM-based, Perl, Ruby, PHP, or Node.js/JavaScript) to feed ElasticSearch via bulk indexing: http://www.elasticsearch.org/guide/reference/java-api/bulk.html
2. There are two projects doing it already:
* http://code.google.com/p/sql-to-nosql-importer/
* https://github.com/Aconex/scrutineer (keeps the DB in sync with ES or Solr!)
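
As a sketch of option 1 (feeding ElasticSearch with your own script via the bulk API), something like the following works in Python; the table, column names, and connection settings are made up for illustration:

# Hypothetical "roll your own importer" script: stream rows from a SQL database
# and index them through the bulk API. Table/column names and hosts are illustrative.
import sqlite3
from elasticsearch import Elasticsearch, helpers

def actions(db_path="app.db"):
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    for row in conn.execute("SELECT id, title, body FROM articles"):
        yield {
            "_index": "articles",
            "_id": row["id"],
            "_source": {"title": row["title"], "body": row["body"]},
        }

es = Elasticsearch("http://localhost:9200")
helpers.bulk(es, actions())  # batches the documents into bulk requests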
@nherment
nherment / backup.sh
Created February 29, 2012 10:42
Backup and restore an Elasticsearch index (shamelessly copied from http://tech.superhappykittymeow.com/?p=296)
#!/bin/bash
# herein we backup our indexes! this script should run at like 6pm or something, after logstash
# rotates to a new ES index and there's no new data coming in to the old one. we grab metadata,
# compress the data files, create a restore script, and push it all up to S3.
TODAY=`date +"%Y.%m.%d"`
INDEXNAME="logstash-$TODAY" # this had better match the index name in ES
INDEXDIR="/usr/local/elasticsearch/data/logstash/nodes/0/indices/"
BACKUPCMD="/usr/local/backupTools/s3cmd --config=/usr/local/backupTools/s3cfg put"
BACKUPDIR="/mnt/es-backups/"
YEARMONTH=`date +"%Y-%m"`