Skip to content

Instantly share code, notes, and snippets.

View TrainTravel's full-sized avatar

Train Chen TrainTravel

View GitHub Profile
@TrainTravel
TrainTravel / BaseQueryValidator.java
Created January 3, 2025 17:38 — forked from piyusht007/BaseQueryValidator.java
SQL parsing in java using Apache-Calcite SQL parser.
public class BaseQueryValidator {
private static List<String> extractTableAliases(SqlNode node) {
final List<String> tables = new ArrayList<>();
// If order by comes in the query.
if (node.getKind().equals(SqlKind.ORDER_BY)) {
// Retrieve exact select.
node = ((SqlSelect) ((SqlOrderBy) node).query).getFrom();
} else {
node = ((SqlSelect) node).getFrom();
from pyspark.sql.functions import window
# configure reading from the stream
kinesis_df = spark.readStream.format("kinesis")
.option("streamName", KINESIS_STREAM_NAME)
.option("region", AWS_REGION)
.option("roleArn", KINESIS_ACCESS_ROLE_ARN
.option("initialPosition", "latest")
.load()
@TrainTravel
TrainTravel / collab_filter.py
Created May 15, 2024 10:00 — forked from ychennay/collab_filter.py
Collaborative Filter Example in Spark
import pandas as pd
from pyspark.mllib.recommendation import ALS, Rating
from pyspark.sql import SparkSession, SQLContext
from sklearn.metrics.pairwise import cosine_similarity
if __name__ == "__main__": # run this by typing "python collaborative_filter.py"
app_name = "collab_filter_example"
# create a Spark context
spark = SparkSession.builder.master("local").appName(app_name).getOrCreate()
@TrainTravel
TrainTravel / _file_formats.md
Created January 30, 2024 08:05 — forked from kzzzr/_file_formats.md
File formats comparison: CSV, JSON, Parquet, ORC

File formats comparison: CSV, JSON, Parquet, ORC

Key results

Whenever you need to store your data on S3 / Data Lake / External table choose file format wisely:

  • Parquet / ORC are the best options due to efficient data layout, compression, indexing capabilities
  • Columnar formats allow for column projection and partition pruning (reading only relevant data!)
  • Binary formats enable schema evolution which is very applicable for constantly changing business environment
@TrainTravel
TrainTravel / jq-cheetsheet.md
Created May 13, 2022 09:15 — forked from olih/jq-cheetsheet.md
jq Cheet Sheet

Processing JSON using jq

jq is useful to slice, filter, map and transform structured json data.

Installing jq

On Mac OS

brew install jq

@TrainTravel
TrainTravel / gist:05a234fe87db95355df746d73386e9e5
Last active October 22, 2021 01:14
PostgreSQL Statistics & Views
-- find out all the available catalog views
SELECT
n.nspname as "Schema",
c.relname as "Name",
CASE c.relkind WHEN 'r' THEN 'table' WHEN 'v' THEN 'view' WHEN 'i' THEN 'index' WHEN 'S' THEN 'sequence' WHEN 's' THEN 'special' END as "Type",
r.rolname as "Owner"
FROM pg_catalog.pg_class c
JOIN pg_catalog.pg_roles r ON r.oid = c.relowner
LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('v','')
@TrainTravel
TrainTravel / gist:5cd9df0f5ab05235f21af2ee620b72e3
Last active October 21, 2021 03:09
analyzing_shared_buffers_content_script.sql
-- createdb test_buffers;
-- pgbench -i -s 50 test_buffers;
-- psql test_buffers;
-- \dt+;
-- select * from pgbench_branches;
-- pgbench -S -c 8 -t 25000 test_buffers
@TrainTravel
TrainTravel / postgres_config.md
Created August 13, 2021 08:09 — forked from rgreenjr/postgres_config.md
PostgreSQL Configuration Optimization

PostgreSQL Configuration Optimization

Memory

Only four values really matter:

  • shared-buffers: below 2GB: set it to 20% of full memory; below 32GB: 25% of your full memory.
@TrainTravel
TrainTravel / postgres_queries_and_commands.sql
Created August 13, 2021 06:27 — forked from rgreenjr/postgres_queries_and_commands.sql
Useful PostgreSQL Queries and Commands
-- show running queries (pre 9.2)
SELECT procpid, age(clock_timestamp(), query_start), usename, current_query
FROM pg_stat_activity
WHERE current_query != '<IDLE>' AND current_query NOT ILIKE '%pg_stat_activity%'
ORDER BY query_start desc;
-- show running queries (9.2)
SELECT pid, age(clock_timestamp(), query_start), usename, query
FROM pg_stat_activity
WHERE query != '<IDLE>' AND query NOT ILIKE '%pg_stat_activity%'
$ ./ammonium
Loading...
Welcome to the Ammonite Repl 0.8.0
(Scala 2.11.8 Java 1.8.0_112)
@ import $ivy.`org.jupyter-scala::scio:0.4.0-RC1`
import $ivy.$
@ {
import jupyter.scio._