Hakan İlter (hakanilter), GitHub Gists
hakanilter / spark_weird_csv.scala
Last active May 24, 2019 12:18
Creating DataFrame from weird CSV files
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
def wcsv_to_df(
    fileName: String,
    tableName: String,
    columns: Array[String],
    fieldTerminator: String,
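The preview above is cut off mid-signature. As a rough illustration of the same idea in PySpark (a sketch of mine, not the gist's Scala implementation), the file is read through Hadoop's TextInputFormat with a custom record delimiter and then split into columns; the path, delimiters, and column names below are hypothetical.

# PySpark sketch (assumption): read records with a custom record delimiter and build a DataFrame
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

def weird_csv_to_df(file_name, table_name, columns, field_terminator, record_delimiter="\n"):
    # Use Hadoop's TextInputFormat so the record delimiter can be something other than a newline
    rdd = spark.sparkContext.newAPIHadoopFile(
        file_name,
        "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
        conf={"textinputformat.record.delimiter": record_delimiter})
    # Split each record into fields and register the result under the given table name
    rows = rdd.map(lambda kv: kv[1].split(field_terminator))
    df = spark.createDataFrame(rows, columns)
    df.createOrReplaceTempView(table_name)
    return df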
hakanilter / tensorflow_embeddings.py
Created April 5, 2019 09:32
Tensorflow Universal Sentence Encoder
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import os
import pandas as pd
from scipy import spatial
from operator import itemgetter
#module_url = "https://tfhub.dev/google/universal-sentence-encoder/2"
module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/3"
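The preview stops after choosing the module; continuing from the imports and module_url above, the standard TF1-style usage of hub.Module looks roughly like this (the example sentences are made up):

# Embed a few sentences with the Universal Sentence Encoder (TF1 / hub.Module API)
embed = hub.Module(module_url)
messages = ["How old are you?", "What is your age?"]  # hypothetical input

with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    message_embeddings = session.run(embed(messages))

# The embeddings are approximately unit-length, so the inner product behaves like cosine similarity
print(np.inner(message_embeddings[0], message_embeddings[1]))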
hakanilter / poor_mans_text_clustering.py
Last active April 5, 2019 09:30
Poor Man's text clustering using cosine similarity
from scipy import spatial

distances = spatial.distance.squareform(spatial.distance.pdist(message_embeddings, 'cosine'))

def progress(i):
    print('\r{} {}'.format('-\\|/'[i % 4], i), end='')

def cluster(items, distances, similarity_threshold=0.11):
    print('Clustering threshold:', similarity_threshold)
    clusters = list()
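The body of cluster() is truncated in the preview. A plausible sketch of this kind of greedy, threshold-based clustering over the precomputed cosine-distance matrix (my illustration, not necessarily the original logic) is:

# Greedy threshold clustering over a pairwise cosine-distance matrix (illustrative sketch)
def cluster_sketch(items, distances, similarity_threshold=0.11):
    clusters = []
    assigned = set()
    for i in range(len(items)):
        if i in assigned:
            continue
        # Seed a new cluster with item i and pull in every unassigned item close enough to it
        members = [i]
        assigned.add(i)
        for j in range(i + 1, len(items)):
            if j not in assigned and distances[i][j] < similarity_threshold:
                members.append(j)
                assigned.add(j)
        clusters.append([items[k] for k in members])
    return clusters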
hakanilter / wikipedia_category_to_es.py
Last active March 7, 2019 17:23
Saving Wikipedia Categories in ElasticSearch using PySpark
# Download required library
#cd /opt/conda/lib/python3.6/site-packages/pyspark-2.4.0-py3.6.egg/pyspark/jars/
#wget http://central.maven.org/maven2/org/elasticsearch/elasticsearch-spark-20_2.11/6.6.1/elasticsearch-spark-20_2.11-6.6.1.jar
#ls -l *elastic*
# Initialize Spark
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master("local[*]") \
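The preview ends before the Elasticsearch write. With the elasticsearch-spark connector downloaded above, writing a DataFrame generally looks like the sketch below; the host, index name, and the categories_df DataFrame are placeholders of mine.

# Sketch: write a DataFrame of Wikipedia categories to Elasticsearch
# (es.nodes, the index/type name and categories_df are hypothetical placeholders)
categories_df.write \
    .format("org.elasticsearch.spark.sql") \
    .option("es.nodes", "localhost") \
    .option("es.port", "9200") \
    .mode("overwrite") \
    .save("wikipedia/categories")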
hakanilter / wikipedia_sql_to_parquet.py
Created March 6, 2019 08:52
Convert Wikipedia Category SQL File to Parquet Files
from pyspark.sql import SparkSession
# init spark
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("anaconda") \
    .config("spark.sql.warehouse.dir", "file:///tmp/spark-warehouse") \
    .enableHiveSupport() \
    .getOrCreate()
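The rest of the gist is not shown. One way to turn the category SQL dump into Parquet, sketched under my own assumptions about the dump layout (INSERT INTO statements with (cat_id,'cat_title',cat_pages,cat_subcats,cat_files) tuples) and with made-up file paths:

import re

# Sketch: extract INSERT tuples from the Wikipedia category SQL dump and write them as Parquet
# (paths, regex and column names are assumptions; escaped quotes inside titles are not handled)
lines = spark.sparkContext.textFile("/data/enwiki-latest-category.sql")

rows = lines.filter(lambda l: l.startswith("INSERT INTO")) \
    .flatMap(lambda l: re.findall(r"\((\d+),'([^']*)',(\d+),(\d+),(\d+)\)", l)) \
    .map(lambda t: (int(t[0]), t[1], int(t[2]), int(t[3]), int(t[4])))

columns = ["cat_id", "cat_title", "cat_pages", "cat_subcats", "cat_files"]
spark.createDataFrame(rows, columns).write.mode("overwrite").parquet("/data/category.parquet")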
hakanilter / mongodb-setup.sh
Created February 27, 2019 10:22
Amazon Linux Single Node Simple MongoDB Setup
# Update packages
sudo yum update -y
# Mount EBS volume
sudo mkfs -t xfs /dev/xvdb
sudo mkdir /data
sudo mount /dev/xvdb /data
# Install MongoDB
echo '
hakanilter / ecs-run-and-wait.sh
Last active December 7, 2023 16:50
AWS ECS run task and wait for the result
# Requires JSON as the output format and the "jq" command-line tool
# If the task runs successfully, exits 0
run_result=$(aws ecs run-task \
    --cluster ${CLUSTER} \
    --task-definition ${TASK_DEFINITION} \
    --launch-type EC2 \
    --overrides "${OVERRIDES}")
echo ${run_result}
container_arn=$(echo $run_result | jq -r '.tasks[0].taskArn')
aws ecs wait tasks-stopped \
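The preview stops at the wait call. For anyone who prefers Python to the AWS CLI, a boto3 equivalent of the run-and-wait pattern looks roughly like this (cluster and task-definition names are placeholders):

import boto3

# Sketch: run an ECS task, block until it stops, then read the container exit code
# (cluster and task definition names are hypothetical placeholders)
ecs = boto3.client("ecs")

run_result = ecs.run_task(
    cluster="my-cluster",
    taskDefinition="my-task-definition",
    launchType="EC2")
task_arn = run_result["tasks"][0]["taskArn"]

ecs.get_waiter("tasks_stopped").wait(cluster="my-cluster", tasks=[task_arn])

described = ecs.describe_tasks(cluster="my-cluster", tasks=[task_arn])
exit_code = described["tasks"][0]["containers"][0].get("exitCode")
print("Task finished with exit code", exit_code)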
hakanilter / json_hive_definition.py
Last active November 28, 2018 16:41
Fastest way to get a Hive table definition for a given JSON file
def json_hive_def(path):
    # Infer the schema by reading the JSON, materialize an empty table, then dump its DDL
    spark.read.json(path).createOrReplaceTempView("temp_view")
    spark.sql("CREATE TABLE temp_table AS SELECT * FROM temp_view LIMIT 0")
    script = spark.sql("SHOW CREATE TABLE temp_table").take(1)[0].createtab_stmt.replace('\n', '')
    spark.sql("DROP TABLE temp_table")
    return script
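A quick usage example with a made-up S3 path:

# Hypothetical call; prints the CREATE TABLE statement inferred from the JSON file
print(json_hive_def("s3://my-bucket/events/sample.json"))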
hakanilter / awslogs-setup.sh
Last active November 22, 2018 10:44
Installing the AWS CloudWatch Logs agent on Debian
sudo su
apt-get install -y libyaml-dev python-dev python3-dev python3-pip
pip3 install awscli-cwlogs
if [ ! -d /var/awslogs/bin ] ; then
    mkdir -p /var/awslogs/bin
    ln -s /usr/local/bin/aws /var/awslogs/bin/aws
fi
mkdir /opt/awslogs
cd /opt/awslogs
curl https://s3.amazonaws.com/aws-cloudwatch/downloads/latest/awslogs-agent-setup.py -O
hakanilter / athena.sql
Created November 8, 2018 13:04
Athena CREATE TABLE AS SELECT (CTAS) query with an external location
CREATE TABLE sampledb.test_empty_array_parquet
WITH (
    format = 'PARQUET',
    external_location = 's3://somewhere'
)
AS SELECT *
FROM sampledb.test_empty_array;
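To submit such a CTAS statement from Python, a minimal boto3 sketch (database name, query string, and S3 output location below are placeholders) could look like:

import time
import boto3

# Sketch: run the CTAS statement on Athena and poll until it finishes
# (database and output location are hypothetical placeholders)
ctas_sql = """
CREATE TABLE sampledb.test_empty_array_parquet
WITH (format = 'PARQUET', external_location = 's3://somewhere')
AS SELECT * FROM sampledb.test_empty_array
"""

athena = boto3.client("athena")
response = athena.start_query_execution(
    QueryString=ctas_sql,
    QueryExecutionContext={"Database": "sampledb"},
    ResultConfiguration={"OutputLocation": "s3://somewhere/athena-results/"})

execution_id = response["QueryExecutionId"]
while True:
    status = athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]["Status"]["State"]
    if status in ("SUCCEEDED", "FAILED", "CANCELLED"):
        print("Query finished with state:", status)
        break
    time.sleep(2)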