- API gateway - no longer (exclusively) at the edge, can set up links within vpcs - see here
- slide deck
- presentation
- extract to a "target" s3 bucket (archived and queryable), then converted to parquet with glue and landed in another bucket - their data lake - plus redshift (the two side by side)
- "target" bucket usually not queried directly though, query the parquet files in the lake bucket
- use lambda for many things here
- file from external partner lands in s3, sns topic triggers lambda and then new file is processed/loaded
- api gateway is hit, which triggers lambda, which then loads data onto kinesis, into s3, into dynamo, etc
- push for step functions for stateful apps - it remembers state for you
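- a minimal sketch of the first lambda pattern above (s3 event → sns → lambda), with made-up bucket/handler names:

```python
import json
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    """sns-triggered lambda: the s3 "object created" notification arrives
    wrapped in an sns message; fetch the new file and process/load it"""
    for record in event["Records"]:
        s3_event = json.loads(record["Sns"]["Message"])
        for s3_record in s3_event["Records"]:
            bucket = s3_record["s3"]["bucket"]["name"]
            key = s3_record["s3"]["object"]["key"]
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            load_file(bucket, key, body)  # hypothetical loader

def load_file(bucket, key, body):
    print(f"processing s3://{bucket}/{key} ({len(body)} bytes)")
```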
- slide deck
- presentation
- Merck presentation
- How they build out their warehouse
- dynamic frames (sketch at the end of these glue notes)
- designed for semi-structured data
- much more performant than standard json
- much more performant than vanilla spark
- mostly in the case of many small files - think streaming firehoses
- variety of transforms available
- has job bookmarks, remembers where it left off
- can persist state of transforms, sinks, sources, etc
- pause option built in - can start from last state, but not advance
- used for debugging/testing
- scala support is coming
- very quick to add new data sources
- reference arch - use glue to crawl for sources
- rds, local dbs, streams, s3 buckets
- this fills the glue catalog
- can transform data with outputs to s3, redshift, etc
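- roughly what one of these glue jobs looks like with dynamic frames (database/table/bucket names are made up; bookmarks are turned on in the job config, and the transformation_ctx tags are what they key off):

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# read a catalog table (filled in by a crawler); the bookmark
# remembers which files this source has already seen
events = glue_context.create_dynamic_frame.from_catalog(
    database="partner_feeds",    # hypothetical catalog database
    table_name="raw_events",     # hypothetical crawled table
    transformation_ctx="events",
)

# one of the built-in transforms: rename/retype fields
mapped = ApplyMapping.apply(
    frame=events,
    mappings=[("event_id", "string", "event_id", "string"),
              ("ts", "string", "event_time", "timestamp")],
    transformation_ctx="mapped",
)

# land it in the lake bucket as parquet
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-lake-bucket/events/"},  # hypothetical
    format="parquet",
    transformation_ctx="sink",
)

job.commit()  # advances the bookmark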
- Merck stuff was more about how they integrated this at the front end with their proprietary enterprise software backend
- slide deck
- presentation
- parquet - ingest by using spectrum, loading into redshift with "insert into table x select y from z" (sketch after this list)
- parquet direct load coming Q1 2018
- upsert - use our solution 😛. (I.e. we're doing things right)
- use "backup no" for some loads/etls (worth checking our transforms here)
- keep 20% free space or 3x your largest table (rule of thumb)
- spectrum scales based on number of slices (i.e. linearly with num of nodes/cpu units)
- materialize frequently filtered columns from dimension tables to fact tables
- likewise, materialize frequently calc'd values to tables
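- rough sketch of the spectrum ingest and the staging-table upsert (schema/table names, dates, and the iam role are all made up), run here through psycopg2:

```python
import psycopg2

# hypothetical cluster/credentials
conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                        dbname="analytics", user="etl", password="secret")
cur = conn.cursor()

# ingest parquet via spectrum: the external table points at parquet in s3,
# and the insert ... select pulls it into redshift proper
cur.execute("""
    INSERT INTO analytics.events
    SELECT event_id, event_time, payload
    FROM spectrum_schema.events_parquet
    WHERE event_date = '2017-11-29'
""")

# the classic redshift upsert: load a staging table, then delete+insert
# in one transaction (redshift has no native upsert)
cur.execute("CREATE TEMP TABLE events_stage (LIKE analytics.events)")
cur.execute("""
    COPY events_stage FROM 's3://my-bucket/incoming/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/etl'
    FORMAT AS JSON 'auto'
""")
cur.execute("""
    DELETE FROM analytics.events
    USING events_stage
    WHERE analytics.events.event_id = events_stage.event_id
""")
cur.execute("INSERT INTO analytics.events SELECT * FROM events_stage")
conn.commit()
```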
ABD217 - From Batch to Streaming: How Amazon Flex Uses Real-time Analytics to Deliver Packages on Time
- slide deck
- presentation
- Flex did an interesting thing where they listen to their "schema.rb" equivalent and automatically migrate those changes over to redshift
- idea of hot/warm/cold data -
- hot is in kinesis, elastic search, etc
- warm - redshift
- cold - s3
- Presentation showed how their arch evolved over time, leading through 5 iterations
- iteration 5 showed how they listened to the source schema and updated the RS/elasticsearch schema in near real time using protocol buffers
- workshop presentation
- wildrydes workshop
- takeaways
- pretty easy to set up kinesis streams, start querying in sql
- both in athena with the batched files from firehose
- and directly on the kinesis stream itself
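- e.g. the athena side is a single boto3 call (database/table/bucket names made up):

```python
import boto3

athena = boto3.client("athena")

# query the files firehose batched into s3 (table defined over that prefix)
resp = athena.start_query_execution(
    QueryString="SELECT status, count(*) FROM wildrydes.rides GROUP BY status",
    QueryExecutionContext={"Database": "wildrydes"},                # hypothetical
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(resp["QueryExecutionId"])  # poll get_query_execution / get_query_results
```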
- slides
- presentation
- built out an ML algo to predict which students were likely to be in trouble by week 2 of the semester
- 83% accurate
- demographics are NOT included in their model, as they cannot change those...
- (i have some questions about this, but didn't have a chance to ask them)
- they were able to reach out to those students, and help prevent dropouts
- estimate they were able to prevent 3100 dropouts
- random note - in 7 cases, students had their power shut off, and IvyTech was able to get the students some help
- Some redshift tidbits, most of which are known to our team
- slides
- Pretty fun - built out a little device that dispenses drinks
- built out a little website where we could send credits to classmates ($0.25ea); when you have $1 you can dispense a drink
- got to program the microcontroller directly with mongoose os
- pretty cool to see how this could tie into the aws ecosystem, and how you can manage these devices at scale
- kinda reminded me of working with particle (née spark) devices
- full keynote
- note - the keynote stream dropped out around 9:30-10am PST for about 20 min (was in an overflow room)
- ecs for k8s (EKS)
- auto deploy across multiple AZ
- HA
- can auto upgrade, but can control when you want
- no need to manage servers
- no clusters to manage, manages infra for you
- auto scale across multiple AZ
- "hands off the wheel"
- aurora multi-master - preview
- zero downtime
- multi-region coming 2018
- multi-az now
- aurora serverless...
- on demand, serverless
- no provisioning of db instances
- auto scales for you
- shuts down when not in use
- multi-master, multi-region, fully managed
- Managed neo4j?
- not sure, but managed for you; support for open protocols (sparql etc)
- Neptune
- really just a - hey this stuff is cool. use it
- S3 select
- only pull data you need?
- much better perf
- filter data within objects
- Glacier select
- run queries directly on data in glacier
- make glacier part of data lake
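- what s3 select looks like from boto3 (bucket/key and csv layout are assumptions; glacier select takes the same sql but runs through a glacier job):

```python
import boto3

s3 = boto3.client("s3")

# push the filter down to s3 - only matching rows come back over the wire
resp = s3.select_object_content(
    Bucket="my-lake-bucket",             # hypothetical
    Key="events/2017/11/29/events.csv",  # hypothetical
    ExpressionType="SQL",
    Expression="SELECT s.event_id, s.status FROM s3object s WHERE s.status = 'FAILED'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)
for event in resp["Payload"]:  # the response is an event stream
    if "Records" in event:
        print(event["Records"]["Payload"].decode())
```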
- Sagemaker
- Databricks alternative it appears
- can run tensorflow on sagemaker
- ingest video/audio, other time encoded data
- transcribe long form audio into text
- presentation
- slides
- brief overview of what elasticsearch is, how aws helps
- standard - within vpc and what not for the service
- replica shards are how you get parallelism with ES
- lucene - maps shards in memory; benefits from addl memory
- how to use lambda to deliver data to ES (sketch at the end of these notes)
- some recs for sizing, what instances to choose, etc
- probably worth looking over the slides if nothing else
- had to leave for the ML session however....didn't get to see much with analytics use
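- the lambda-to-ES piece, roughly (domain endpoint/index are made up; this assumes the domain's access policy allows the function's role - otherwise sign requests, e.g. with requests_aws4auth):

```python
import base64
import json
import requests  # bundled into the deployment package

ES_URL = "https://search-mydomain.us-east-1.es.amazonaws.com"  # hypothetical

def handler(event, context):
    """kinesis-triggered lambda that bulk-indexes records into elasticsearch"""
    lines = []
    for record in event["Records"]:
        doc = json.loads(base64.b64decode(record["kinesis"]["data"]))
        lines.append(json.dumps({"index": {"_index": "events", "_type": "event"}}))
        lines.append(json.dumps(doc))
    body = "\n".join(lines) + "\n"
    resp = requests.post(ES_URL + "/_bulk", data=body,
                         headers={"Content-Type": "application/x-ndjson"})
    resp.raise_for_status()
```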
- I'll be honest, I struggled to keep up with this one
- very deep on Neural Network deep learning specifics
- slide deck
- Ran through a "handwritten digits" learning algo in Jupyter on a G2 GPU instance (erm - i think that's right)
- dropout - helps prevent overfitting
- force it to ignore a certain fraction of units; forces the layer to learn better
- 0.5 usually a good number
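- for illustration only (keras here, whatever the session's notebooks actually used):

```python
from tensorflow import keras

# dropout randomly zeroes half the activations during training,
# forcing the layer to learn redundant, more robust features
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),    # handwritten-digit images
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dropout(0.5),                     # the "0.5 is usually good" number
    keras.layers.Dense(10, activation="softmax"),  # one output per digit
])
```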
- deep copy/deep paste
- copy paste formula from paper 😛
- transfer learning (sketch after this list)
- "pretrained" model
- use new input layer (cat faces), but keep weights the same
- output is completely diff
- edge detection already there on layer 1
- layer 2 - probably okay, basic face shape
- layer 3 does need to change
- LSTM - RNN (recurrent neural networks)
- multiple digit numbers - makes sense to use an RNN, as it's a sequence
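- sketch of the transfer learning idea (again keras just for illustration): keep the pretrained weights, swap the head

```python
from tensorflow import keras

# pretrained convnet minus its original classification head; the layer 1
# edge detectors and layer 2 shape detectors come along for free
base = keras.applications.VGG16(weights="imagenet", include_top=False,
                                input_shape=(224, 224, 3))
base.trainable = False  # keep the transferred weights fixed

# new head for the new task (e.g. cat faces vs. not)
model = keras.Sequential([
    base,
    keras.layers.Flatten(),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```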
- presentation
- small ops team (basically 3 people)
- how do they scale to that many events with such a small team?
- Kinesis streams basically
- need as many managed services as possible for such a small ops team
- rule of thumb - if managed service is available from aws, use it
- Legacy system on Apache Flume for logs
- flume imported to s3 (and then to redshift) and ES
- Standardized a lot of boilerplate in lambda functions with an open source project called bender
- config driven
- Kinesis for data ingestion, lambda for execution, kinesis firehose for data aggregation, S3 for storage (producer-side sketch below)
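- producer side of that pipeline, roughly (stream/delivery-stream names made up):

```python
import json
import boto3

kinesis = boto3.client("kinesis")
firehose = boto3.client("firehose")

event = {"user_id": "123", "action": "page_view"}

# ingestion: producers write to a kinesis stream (lambda consumes it)
kinesis.put_record(StreamName="events",                 # hypothetical
                   Data=json.dumps(event),
                   PartitionKey=event["user_id"])

# aggregation: firehose batches records and lands them in s3
firehose.put_record(DeliveryStreamName="events-to-s3",  # hypothetical
                    Record={"Data": json.dumps(event) + "\n"})
```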
- much less intense than Wednesday
- Looks pretty cool for dev collaboration
- IDE, can debug various aws services from it
- can be used like google docs (multiple devs working in same IDE, see what lines they're on, etc)
- 3 9s with 2 AZs and manual failover
- 4 9s with 3 AZs in one region and auto failover
- 5 9s with 3 AZs in each region, and 2 regions
- argues for additional "chaos testing" phase in addition to unit/integration testing
- start with graceful restarts, then targeted chaos, then failure injection
- her job - to automate chaos at netflix
- Principles of Chaos Engineering
- for lambda
- has concurrency controls
- api gateway - no longer (exclusively) at the edge, can set up links within vpcs - see here
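- setting the concurrency cap is one api call (function name made up):

```python
import boto3

# cap a function's concurrent executions so a hot event source
# can't swamp whatever sits downstream
boto3.client("lambda").put_function_concurrency(
    FunctionName="process-events",      # hypothetical
    ReservedConcurrentExecutions=50,
)
```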
- session lab link
- workshop
- Chance to work with Glue as an ETL tool
- very slick, easy to use - simple to get going, add new sources, add simple conversions
- surprised by lag for some operations (spark jobs basically)
- Looks like it has to spin up an individual EMR cluster for each job, so can take a while if one's not hot
- Used quicksight a bit
- not bad. can do some decent graphs (including some geo ones) with basic sql
- Added some real-time data via kinesis
- had sentiment analysis in source data (are zombies happy? no?)
- Probably the most useful workshop for me personally
- Probably didn't need to do this one really, but it was pretty cool
- Chance to play with Deeplens
- Full blown computer (atom based) with camera and aws greengrass on device
- does some things locally, uploads others to the cloud
- built a hot dog detector (yes like in Silicon Valley)
- Def a lot of potential with this
- Doesn't require super deep ML knowledge (more helps, but an average dev can probably work with it pretty well)
- GH link
- Had some trouble getting this running, but picked up a few things
- another link
- showed how to run SAM Local (run/test lambda functions locally)
- mostly showed aws services for CI/CD
- X-ray was pretty cool to see though, been wondering how to work with this. Looks like good potential with lambda functions
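- wiring x-ray into a lambda function with the python sdk looks roughly like this (function body is a made-up example; tracing also has to be enabled on the function itself):

```python
from aws_xray_sdk.core import patch_all, xray_recorder

patch_all()  # auto-instrument supported libraries (boto3, requests, ...)

@xray_recorder.capture("load_record")  # shows up as a subsegment in the trace
def load_record(record):
    print("loading", record)

def handler(event, context):
    for record in event["Records"]:
        load_record(record)
```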