Sid Anand r39132

@r39132
r39132 / JsonToParquetConverter.java
Created May 23, 2023 01:16
ChatGPT Json-to-Parquet-Converter
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;
@r39132
r39132 / gist:3ef24b3ffc4ebf7f0509fc635165f3af
Created April 6, 2018 06:34
Running "java -jar ../rat/apache-rat/target/apache-rat-0.13-SNAPSHOT.jar -E ./.rat-excludes -d . > rat_report.txt" on Apache Airflow
Ignored 2 lines in your exclusion files as comments or empty lines.
*****************************************************
Summary
-------
Generated at: 2018-04-05T23:32:08-07:00
Notes: 5
Binaries: 36
Archives: 0
PayPal currently supports 4+ generations of software stacks in production and runs 2k+ distinct microservices, which together provide customers with the fast and seamless user experience they expect. To maintain high quality while keeping developers happy and productive in such an environment, self-service tools with a high degree of automation under the hood are paramount. In this talk, I will tell the story of how PayPal moved from a "PayPal on a box" test environment, to VM-based environments, and finally to a delivery pipeline leveraging our container platform. I will describe how our pipeline delivers containers to fly-away test environments for automated integration testing and how that paradigm shift impacted our engineering teams and their workflows. We will share our insights on what worked well for us and how those lessons can be applied at other companies.
from airflow import DAG, utils
from airflow.operators.dummy_operator import DummyOperator
from datetime import date, datetime, time, timedelta
today = datetime.today()
# Round to align with the schedule interval
START_DATE = today.replace(minute=0, second=0, microsecond=0)
DAG_NAME = 'clear_task_bug_dag_1.0'
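The preview ends before the DAG itself is declared. As a minimal sketch of how the rounded START_DATE and the imported DummyOperator could be wired up, assuming an hourly schedule (the owner, retry settings, and task id below are illustrative, not the gist's originals):

default_args = {
    'owner': 'airflow',        # assumed owner
    'start_date': START_DATE,  # hour-aligned start computed above
    'retries': 1,              # assumed retry policy
}

dag = DAG(DAG_NAME, default_args=default_args, schedule_interval=timedelta(hours=1))

# A single placeholder task is enough to exercise the scheduling behaviour
noop = DummyOperator(task_id='noop', dag=dag)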
"""
### Example HTTP operator and sensor
"""
from airflow import DAG
from airflow.operators.http_operator import SimpleHttpOperator
from airflow.operators.sensors import HttpSensor
from datetime import datetime, timedelta
import json
seven_days_ago = datetime.combine(datetime.today() - timedelta(7), datetime.min.time())
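The rest of the example is not shown above. A hedged sketch of how the imported SimpleHttpOperator and HttpSensor are typically used together in Airflow 1.x follows; the dag id, endpoint, payload, and connection id are assumptions:

args = {'owner': 'airflow', 'start_date': seven_days_ago}

dag = DAG('example_http_operator', default_args=args, schedule_interval='@once')

# POST a small JSON payload via the http_default connection
post_op = SimpleHttpOperator(
    task_id='post_op',
    endpoint='api/v1.0/nodes',                  # assumed endpoint
    data=json.dumps({'priority': 5}),
    headers={'Content-Type': 'application/json'},
    dag=dag)

# Poke the endpoint until it answers successfully before posting
http_sensor = HttpSensor(
    task_id='http_sensor_check',
    http_conn_id='http_default',
    endpoint='',
    poke_interval=5,
    dag=dag)

post_op.set_upstream(http_sensor)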
@r39132
r39132 / gist:30cc62c74b3ba23039a622c31016766f
Created February 11, 2017 01:41
ElasticSearch 2.3 --> 5.1 Migration : new IP fields do not support ipv6
I recently migrated from AWS ES 2.3 to 5.1.
I followed the instructions at [http://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-version-migration.html].
TL;DR: I snapshotted my 2.3 ES cluster to S3 and then restored the snapshot to the new 5.1 cluster.
However, I ran into a problem. I added an *ip* field to my indexes, which included indexes brought over from 2.3 as well as new indexes created on 5.1. Here's a sample mapping:
curl -XPUT "localhost:80/cars/_mapping/transactions" -d'
{
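The body of the mapping is cut off above. As a rough Python-client equivalent of that PUT, assuming a single field declared with the 5.x *ip* type (the index, type, and field names here are illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch(['localhost:80'])

# Hypothetical mapping: one field using the Elasticsearch 5.x "ip" type
mapping = {'properties': {'source_ip': {'type': 'ip'}}}

es.indices.put_mapping(index='cars', doc_type='transactions', body=mapping)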
sid-as-mbp:ep siddharth$ terraform plan --target=aws_kinesis_stream.scored_output
var.im_ami
Enter a value: 1
Refreshing Terraform state prior to plan...
aws_s3_bucket.agari_stage_ep_scored_output_firehose_bucket: Refreshing state... (ID: agari-stage-ep-scored-output-firehose)
aws_iam_role.firehose_role: Refreshing state... (ID: collector_ingest_firehose_role)
aws_kinesis_firehose_delivery_stream.scored_output_firehose: Refreshing state... (ID: arn:aws:firehose:us-west-2:118435376172:deliverystream/agari-stage-ep-scored-output-firehose)
aws_kinesis_stream.scored_output: Refreshing state... (ID: arn:aws:kinesis:us-west-2:118435376172:stream/agari-stage-ep-scored-output)
now = datetime.now()
now_to_the_hour = now.replace(minute=0, second=0, microsecond=0)
START_DATE = now_to_the_hour + timedelta(hours=-3)
DAG_NAME = 'ep_telemetry_v2'
ORG_IDS = get_active_org_ids_string()
default_args = {
    'owner': 'sanand',
    'depends_on_past': True,
    'pool': 'ep_data_pipeline',
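    # --- hedged continuation (assumed; not the gist's original values) ---
    'start_date': START_DATE,            # hour-aligned start computed above
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
}

# Attach the args to an hourly DAG. DAG is assumed to be imported above this
# preview; the schedule_interval here is an assumption.
dag = DAG(DAG_NAME, default_args=default_args, schedule_interval=timedelta(hours=1))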
from datetime import datetime
from airflow.models import DAG
from airflow.operators import BashOperator, ShortCircuitOperator
import logging
def skip_to_current_job(ds, **kwargs):
    now = datetime.now()
    left_window = kwargs['dag'].following_schedule(kwargs['execution_date'])
    right_window = kwargs['dag'].following_schedule(left_window)
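    # --- hedged continuation (assumed; the original callable is truncated above) ---
    # Treat this run as "current" only if now falls inside (left_window, right_window];
    # older backfill runs short-circuit their downstream tasks.
    logging.info('checking window (%s, %s]', left_window, right_window)
    return left_window < now <= right_window

# Illustrative wiring of the callable to a ShortCircuitOperator; the dag id,
# schedule, and downstream task are assumptions, not the gist's originals.
dag = DAG('skip_to_current_job_example',
          start_date=datetime(2016, 1, 1),
          schedule_interval='@hourly')

check_current = ShortCircuitOperator(
    task_id='skip_to_current_job',
    python_callable=skip_to_current_job,
    provide_context=True,   # so execution_date and dag arrive in kwargs
    dag=dag)

do_work = BashOperator(task_id='do_work', bash_command='echo working', dag=dag)
do_work.set_upstream(check_current)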
check process airflow-webserver with pidfile /home/deploy/airflow/pids/airflow-webserver.pid
group airflow
start program "/bin/sh -c '( HISTTIMEFORMAT="%d/%m/%y %T " TMP=/data/tmp AIRFLOW_HOME=/home/deploy/airflow PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin airflow webserver -p 8080 2>&1 & echo $! > /home/deploy/airflow/pids/airflow-webserver.pid ) | logger -p local7.info'"
as uid deploy and gid deploy
stop program "/bin/sh -c 'PATH=/bin:/sbin:/usr/bin:/usr/sbin pkill -TERM -P `cat /home/deploy/airflow/pids/airflow-webserver.pid` && rm -f /home/deploy/airflow/pids/airflow-webserver.pid'"
as uid deploy and gid deploy