
@semyont
semyont / 0_reuse_code.js
Created December 30, 2015 13:40
Here are some things you can do with Gists in GistBox.
// Use Gists to store code you would like to remember later on
console.log(window); // log the "window" object to the console
@semyont
semyont / regex.py
Created August 3, 2016 10:48
Regex for extracting log data
from pyspark.sql.functions import split, regexp_extract
split_df = base_df.select(
    regexp_extract('value', r'^([^\s]+)\s', 1).alias('host'),  # group excludes the trailing space
    regexp_extract('value', r'^.*\[(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})]', 1).alias('timestamp'),
    regexp_extract('value', r'^.*"\w+\s+([^\s]+)\s+HTTP.*"', 1).alias('path'),
    regexp_extract('value', r'^.*"\s+([^\s]+)', 1).cast('integer').alias('status'),
    regexp_extract('value', r'^.*\s+(\d+)$', 1).cast('integer').alias('content_size'))
split_df.show(truncate=False)
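The same patterns can be sanity-checked outside Spark with the stdlib `re` module. The sample Apache-style log line below is an assumption for illustration, not data from the gist:

```python
import re

# Hypothetical Apache common-log line, used only to exercise the patterns.
line = '127.0.0.1 - - [01/Aug/2016:10:05:03 -0800] "GET /images/logo.gif HTTP/1.0" 200 2326'

host = re.search(r'^([^\s]+)\s', line).group(1)           # group excludes the trailing space
timestamp = re.search(r'^.*\[(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})]', line).group(1)
path = re.search(r'^.*"\w+\s+([^\s]+)\s+HTTP.*"', line).group(1)
status = int(re.search(r'^.*"\s+([^\s]+)', line).group(1))
content_size = int(re.search(r'^.*\s+(\d+)$', line).group(1))
```

Each `re.search(...).group(1)` mirrors one `regexp_extract('value', pattern, 1)` call, so a pattern that breaks here will also extract nothing in Spark.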
@semyont
semyont / csv_pandas_stream_elastic_upsert.py
Last active July 20, 2023 16:13
Large time-series CSV streaming bulk upsert into Elasticsearch #index #pandas #bigdata #csv #upsert #elasticsearch #progressbar #example #bulk #stream #dataops #dataengineer #timeseries
import logging
import hashlib
from elasticsearch import Elasticsearch
from elasticsearch import helpers
from tqdm import tqdm
class Storage:
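The upsert idea behind this gist can be sketched without a live cluster: derive a stable `_id` from each row's natural key with `hashlib`, so re-running the stream updates existing documents instead of duplicating them. The field names, index name, and key layout below are assumptions, not taken from the gist body:

```python
import hashlib

def make_actions(rows, index='timeseries'):
    """Yield elasticsearch-py bulk actions with deterministic document ids."""
    for row in rows:
        # Hash the natural key so the same row always maps to the same _id.
        key = f"{row['timestamp']}|{row['sensor']}".encode()
        yield {
            '_op_type': 'update',
            '_index': index,
            '_id': hashlib.sha1(key).hexdigest(),
            'doc': row,
            'doc_as_upsert': True,   # insert if missing, update if present
        }

rows = [{'timestamp': '2023-07-20T16:13:00', 'sensor': 's1', 'value': 21.5}]
actions = list(make_actions(rows))
```

In the full gist these actions would be fed to `elasticsearch.helpers.streaming_bulk`, with the iterator wrapped in `tqdm` for the progress bar.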
@semyont
semyont / gevent_concurrency_redis.py
Created April 12, 2017 20:53
gevent based concurrency for redis-py
import logging
logging.basicConfig(
    format='%(asctime)s,%(msecs)05.1f (%(funcName)s) %(message)s',
    datefmt='%H:%M:%S')
log = logging.getLogger()
log.setLevel(logging.INFO)
import threading
import os
import time
@semyont
semyont / useful_pandas_snippets.py
Created April 12, 2017 20:55 — forked from bsweger/useful_pandas_snippets.md
Useful Pandas Snippets
# List unique values in a DataFrame column
df.column_name.unique()
# Convert Series datatype to numeric, coercing any non-numeric values to NaN
# (convert_objects() has been removed from pandas; pd.to_numeric is the replacement)
df['col'] = pd.to_numeric(df['col'], errors='coerce')
# Grab DataFrame rows where column has certain values
valuelist = ['value1', 'value2', 'value3']
df = df[df.column.isin(valuelist)]
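The snippets above can be exercised end to end on a tiny frame; the column names and values here are made up for the example:

```python
import pandas as pd

df = pd.DataFrame({'col': ['1', '2', 'oops'],
                   'column': ['value1', 'other', 'value2']})

# Coerce non-numeric strings to NaN (modern replacement for convert_objects)
df['col'] = pd.to_numeric(df['col'], errors='coerce')

# Keep only the rows whose 'column' is in the allow-list
valuelist = ['value1', 'value2']
filtered = df[df['column'].isin(valuelist)]
```

`errors='coerce'` turns `'oops'` into `NaN` rather than raising, which is usually what you want when cleaning scraped data.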
@semyont
semyont / wordpress-mysql-docker-compose.yml
Last active April 12, 2017 21:44
Wordpress MySQL Docker Compose
version: '2'
services:
  db:
    image: mysql:5.7
    volumes:
      - db_data:/var/lib/mysql
    restart: always
    environment:
      MYSQL_ROOT_PASSWORD: wordpress
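The preview stops at the `db` service. A typical companion `wordpress` service, sketched from the standard compose-file layout rather than the rest of the gist, would continue:

```yaml
  wordpress:
    depends_on:
      - db
    image: wordpress:latest
    ports:
      - "8000:80"
    restart: always
    environment:
      WORDPRESS_DB_HOST: db:3306
      WORDPRESS_DB_PASSWORD: wordpress
volumes:
  db_data:
```

The password values here mirror the `MYSQL_ROOT_PASSWORD` shown above and are placeholders, not production settings.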
# Convert wide format csv to long format csv
# Time Temp1 Temp2 Temp3 Temp4 Temp5
# 00 21 32 33 21 23
# 10 34 23 12 08 23
# 20 12 54 33 54 55
with open("in.csv") as f, open("out.csv", "w") as out:
    headers = next(f).split()[1:]  # drop the leading Time column, keep Temp1..Temp5
    for row in f:
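The preview cuts off inside the loop. A self-contained sketch of the same wide-to-long conversion, using `StringIO` in place of the files and assumed output column names, could read:

```python
from io import StringIO

wide = "Time Temp1 Temp2 Temp3\n00 21 32 33\n10 34 23 12\n"

f, out = StringIO(wide), StringIO()
headers = next(f).split()[1:]              # Temp1 Temp2 Temp3
out.write("Time,Sensor,Temp\n")
for row in f:
    time, *temps = row.split()
    # Emit one long-format row per (time, sensor) pair
    for sensor, temp in zip(headers, temps):
        out.write(f"{time},{sensor},{temp}\n")

lines = out.getvalue().splitlines()
```

Each wide row of N temperature columns becomes N long rows keyed by time and sensor name.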
# GET /_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "doc.title": "Search" }},
        { "match": { "doc.content": "Elasticsearch" }}
      ],
      "filter": [
        { "term": { "doc.status": "published" }},
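The preview truncates the request body mid-filter. Closed out, the bool query takes roughly this shape (the brackets and braces are the only additions; nothing new is queried):

```json
{
  "query": {
    "bool": {
      "must": [
        { "match": { "doc.title": "Search" }},
        { "match": { "doc.content": "Elasticsearch" }}
      ],
      "filter": [
        { "term": { "doc.status": "published" }}
      ]
    }
  }
}
```

`must` clauses contribute to scoring while `filter` clauses only include or exclude, which is why the status check sits in `filter`.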
@semyont
semyont / elasticsearch_term_nested_aggregation.json
Last active May 16, 2017 06:43
Elasticsearch collect mode for nested aggregations: when the top-hits size is bigger than the number of fields needed, the inner aggregations return unneeded fields to the upper aggregation layer; combining filtering/match with this will reduce the variance in fields
# use un-analyzed fields
{
  "aggs": {
    "domain": {
      "terms": {
        "field": "doc.domain.keyword",
        "size": 4,
        "collect_mode": "breadth_first"
      },
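Closed out with an illustrative inner aggregation (the `top_hits` sub-agg and its name are assumptions, not from the gist), the breadth-first request might look like:

```json
{
  "aggs": {
    "domain": {
      "terms": {
        "field": "doc.domain.keyword",
        "size": 4,
        "collect_mode": "breadth_first"
      },
      "aggs": {
        "top_docs": {
          "top_hits": { "size": 1 }
        }
      }
    }
  }
}
```

With `breadth_first`, Elasticsearch prunes the terms buckets before expanding the inner aggregations, which is the point of the gist: inner aggs are only computed for the surviving top buckets.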
@semyont
semyont / tornado_gevent_async.py
Created June 4, 2017 08:21
tornado blocking task gevent workers async example
# Do this as early as possible in your application:
from gevent import monkey; monkey.patch_all()
from tornado.web import RequestHandler, asynchronous
import gevent
class MyHandler(RequestHandler):
    @asynchronous
    def get(self, *args, **kwargs):
        def async_task():
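The preview stops inside the handler. The same offloading idea, shown with stdlib `concurrent.futures` instead of gevent/tornado so it runs standalone (the `blocking_task` body is a stand-in, not the gist's task):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def blocking_task(n):
    # Stand-in for a slow, blocking call the request handler must not run inline.
    time.sleep(0.01)
    return n * 2

executor = ThreadPoolExecutor(max_workers=4)

# submit() returns immediately; a handler would finish the response in a callback.
future = executor.submit(blocking_task, 21)
result = future.result()   # in Tornado you would chain a callback instead of blocking here
```

The gevent version achieves the same effect by monkey-patching blocking calls so greenlets yield instead of stalling the IOLoop; note that the `@asynchronous` decorator shown above was removed in Tornado 6.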