João Carabetta JoaoCarabetta
JoaoCarabetta / EstadosBrasil.txt
Last active March 3, 2017 18:31
List of Brazilian states
"AC","AL","AP","AM","BA","CE","DF","ES","GO","MA","MT","MS","MG","PA","PB","PR","PE","PI","RJ","RN","RS","RO","RR","SC","SP","SE","TO"
JoaoCarabetta / headersTSE.csv
Last active March 28, 2017 19:11
TSE electoral data - CSV headers taken from LEIAME.pdf.
PERFIL_ELEITORADO,CONSULTA_CAND_2010,CONSULTA_CAND_2012,CONSULTA_CAND_2014,BEM_CANDIDATO,CONSULTA_LEGENDAS ,CONSULTA_VAGAS ,VOTACAO_CANDIDATO_MUN_ZONA_2012,VOTACAO_CANDIDATO_MUN_ZONA_2014,VOTACAO_PARTIDO_MUN_ZONA_2012,VOTACAO_PARTIDO_MUN_ZONA_2014,VOTO_SECAO ,DETALHE_VOTACAO_MUN_ZONA_2012,DETALHE_VOTACAO_MUN_ZONA_2014
PERIODO,DATA_GERACAO,DATA_GERACAO,DATA_GERACAO,DATA_GERACAO,DATA_GERACAO,DATA_GERACAO,DATA_GERACAO,DATA_GERACAO,DATA_GERACAO,DATA_GERACAO,DATA_GERACAO,DATA_GERACAO,DATA_GERACAO
UF,HORA_GERACAO,HORA_GERACAO,HORA_GERACAO,HORA_GERACAO,HORA_GERACAO,HORA_GERACAO,HORA_GERACAO,HORA_GERACAO,HORA_GERACAO,HORA_GERACAO,HORA_GERACAO,HORA_GERACAO,HORA_GERACAO
MUNICIPIO,ANO_ELEICAO,ANO_ELEICAO,ANO_ELEICAO,ANO_ELEICAO,ANO_ELEICAO,ANO_ELEICAO,ANO_ELEICAO,ANO_ELEICAO,ANO_ELEICAO,ANO_ELEICAO,ANO_ELEICAO,ANO_ELEICAO,ANO_ELEICAO
COD_MUNICIPIO_TSE,NUM_TURNO ,NUM_TURNO ,NUM_TURNO ,DESCRICAO_ELEICAO,NUM_TURNO,DESCRICAO_ELEICAO,NUM_TURNO,NUM_TURNO,NUM_TURNO,NUM_TURNO,NUM_TURNO,NUM_TURNO,NUM_TURNO
NR_ZONA,DESCRICAO_ELEI
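The header file above appears to be laid out column-wise: the first row names each TSE file and every following row holds one field name per file. Assuming that layout (it is inferred from the data, not stated in the gist), a minimal sketch for turning it into a mapping from file name to field list could look like this. The function name `headers_by_file` and the shortened sample are hypothetical.

```python
import csv
import io


def headers_by_file(csv_text):
    """Map each file name (first row) to its ordered list of field names."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return {}
    # Some names in the source carry trailing spaces, so strip them.
    file_names = [name.strip() for name in rows[0]]
    result = {name: [] for name in file_names}
    for row in rows[1:]:
        # zip stops at the shorter sequence, so short rows add fewer fields.
        for name, field in zip(file_names, row):
            field = field.strip()
            if field:
                result[name].append(field)
    return result


# Shortened, hypothetical sample in the same shape as headersTSE.csv.
sample = (
    "PERFIL_ELEITORADO,CONSULTA_CAND_2010,BEM_CANDIDATO\n"
    "PERIODO,DATA_GERACAO,DATA_GERACAO\n"
    "UF,HORA_GERACAO,HORA_GERACAO\n"
    "MUNICIPIO,ANO_ELEICAO\n"
)
headers = headers_by_file(sample)
```

Rows with fewer columns than the first row are tolerated rather than rejected, which matters because files with fewer fields leave their trailing cells empty or missing.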
JoaoCarabetta / split_list_to_row.py
Last active April 11, 2018 16:09
Split list values into rows in pandas, enforcing the output type
def split_data_frame_list(df,
                          target_column,
                          output_type=float):
    '''
    Accepts a column with multiple types and splits list variables to several rows.

    df: dataframe to split
    target_column: the column containing the values to split
    output_type: type of all outputs
    '''


def suffix(alist):
    if not len(alist):
        return [[]]
    else:
        return [alist] + suffix(alist[1:])


def preffix(alist):
    if not len(alist):
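The body of `split_data_frame_list` is not shown in this preview, so the following is only a sketch of the idea its docstring describes: list values in one column become one row each, and every value is cast to a single output type. It assumes rows are plain dicts instead of a pandas DataFrame, and the name `split_list_rows` is hypothetical. In pandas itself, `DataFrame.explode` followed by `astype` covers similar ground.

```python
def split_list_rows(rows, target_column, output_type=float):
    """Expand list values in `target_column` into one row per element.

    `rows` is a list of dicts (an assumption; the gist works on a DataFrame).
    Every output value in `target_column` is cast with `output_type`.
    """
    out = []
    for row in rows:
        value = row[target_column]
        # Treat scalars as one-element lists so both cases share one path.
        items = value if isinstance(value, (list, tuple)) else [value]
        for item in items:
            new_row = dict(row)  # copy so the input rows stay untouched
            new_row[target_column] = output_type(item)
            out.append(new_row)
    return out


rows = [{"id": 1, "v": [1, "2"]}, {"id": 2, "v": 3}]
result = split_list_rows(rows, "v")
```

In this sketch a row whose list is empty produces no output rows. Whether the original function drops or keeps such rows is not visible in the preview.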
JoaoCarabetta / create_waze_partitioned_table_athena.sql
Created January 15, 2019 16:41
Creates an Athena partitioned table for Waze data
DROP TABLE IF EXISTS main;
CREATE EXTERNAL TABLE main (
endTimeMillis BIGINT,
startTimeMillis BIGINT,
endTime STRING,
startTime STRING,
jams array<struct<
uuid: STRING,
pubMillis: BIGINT,
CREATE TABLE waze.polygons_geo
WITH (
external_location = 's3://...',
format = 'Parquet') AS
WITH dataset AS (
SELECT
polygons
FROM waze.polygons)
SELECT
pol.polygon,
JoaoCarabetta / linestring_to_geojson.sql
Created February 4, 2019 18:48
Convert a Waze linestring to GeoJSON in Athena
SELECT
'{"type":"LineString", "coordinates":' ||
'[' || array_join(transform(line, loc -> '[' || CAST(loc.x AS VARCHAR) || ',' || CAST(loc.y AS VARCHAR) || ']'), ',') || ']}'
FROM test.test
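For checking the query's output outside Athena, the same string can be built in Python. This is a minimal sketch that assumes each point in `line` is a mapping with `x` and `y` keys, mirroring `loc.x` and `loc.y` in the query. The function name and the sample coordinates are hypothetical.

```python
import json


def linestring_to_geojson(line):
    """Build a GeoJSON LineString from a sequence of {'x': ..., 'y': ...} points."""
    # GeoJSON positions are [x, y] pairs, the same order the query emits.
    coordinates = [[point["x"], point["y"]] for point in line]
    return json.dumps({"type": "LineString", "coordinates": coordinates})


geojson = linestring_to_geojson(
    [{"x": -46.63, "y": -23.55}, {"x": -46.64, "y": -23.56}]
)
```

Using `json.dumps` instead of string concatenation leaves number formatting and escaping to the library.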
JoaoCarabetta / update-function-code-aws-lambda.sh
Last active February 25, 2019 14:00
Update a Python function on AWS Lambda
# Usage: update-function-code-aws-lambda.sh <function path> <lambda-function-name>
# Make sure the aws-cli is configured for the function's region, with a user that has permission to update it.
rm -f temp.zip
echo "------ Zipping --------"
cp "$1" lambda_function.py
zip temp.zip lambda_function.py
rm lambda_function.py
echo "------ Updating Lambda ------"
aws lambda update-function-code --function-name "$2" --zip-file "fileb://temp.zip"
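The same two steps (zip the source as `lambda_function.py`, then upload it) can also be done from Python without temporary files. The sketch below is an alternative under stated assumptions, not the gist's method: `client` is expected to be a `boto3.client('lambda')` object passed in by the caller, and the helper names are hypothetical.

```python
import io
import zipfile


def build_lambda_zip(source_path):
    """Return zip bytes holding the source file renamed to lambda_function.py."""
    with open(source_path, "rb") as handle:
        source = handle.read()
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # The archive member name mirrors the `cp $1 lambda_function.py` step.
        archive.writestr("lambda_function.py", source)
    return buffer.getvalue()


def update_lambda(client, function_name, source_path):
    """Upload the zipped source; `client` is assumed to be boto3.client('lambda')."""
    return client.update_function_code(
        FunctionName=function_name,
        ZipFile=build_lambda_zip(source_path),
    )
```

Building the archive in memory avoids leaving `temp.zip` or a stray `lambda_function.py` behind if the upload fails.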
JoaoCarabetta / create_table_athena_dump.sql
Created March 4, 2019 15:02
Create table from Athena CSV dump
-- Delete *.csv.metadata
-- aws s3 rm s3://... --recursive --exclude '*.csv'
CREATE EXTERNAL TABLE `osm_pems_ids`(
`osm_id` bigint COMMENT '',
`sensor_id` string COMMENT '',
`neigh_level` bigint COMMENT '')
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
JoaoCarabetta / bulk_delete_dynamo_tables.py
Created March 21, 2019 16:53
Bulk delete AWS DynamoDB tables
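import boto3
from botocore.exceptions import ClientError

# Substring that marks the tables to delete.
table_filter = 'temp-cap'

dynamo = boto3.client('dynamodb')

# Note: a single list_tables call returns at most 100 table names.
tables = dynamo.list_tables()['TableNames']
tables = [t for t in tables if table_filter in t]

for t in tables:
    try:
        dynamo.delete_table(TableName=t)
    except ClientError:
        # Skip tables that cannot be deleted right now (e.g. still in use).
        continue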
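A single `list_tables` call returns at most 100 names, so accounts with more tables need pagination. The following sketch uses the client's `list_tables` paginator. It assumes `client` is a `boto3.client('dynamodb')` object supplied by the caller, the function name is hypothetical, and error handling is left out for brevity.

```python
def delete_tables_matching(client, table_filter):
    """Delete every table whose name contains `table_filter`.

    `client` is assumed to be boto3.client('dynamodb').
    Returns the names for which a delete was requested.
    """
    paginator = client.get_paginator("list_tables")
    # Collect all matching names first so deletes do not interleave with listing.
    names = [
        name
        for page in paginator.paginate()
        for name in page["TableNames"]
        if table_filter in name
    ]
    for name in names:
        client.delete_table(TableName=name)
    return names
```

Passing the client in as an argument keeps the function easy to exercise with a stub before pointing it at a real account.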