@p5k6
Last active October 7, 2019 00:50
Notes from 2017 reinvent

re:invent

  • API gateway - no longer (exclusively) at the edge, can set up links within vpcs - see here

Monday


ABD202 - Best Practices for Building Serverless Big Data Applications

  • slide deck
  • presentation
  • extract to target s3 bucket, archived and queryable, then converted to parquet with glue and landed in another bucket - their data lake, and redshift (two side by side)
  • "target" bucket usually not queried directly though, query the parquet files in the lake bucket
  • use lambda for many things here
    • file from external partner lands in s3, sns topic triggers lambda and then new file is processed/loaded
    • api gateway is hit, which triggers lambda, which then loads data onto kinesis, into s3, into dynamo, etc
    • push for step functions to keep track of state in stateful apps
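
A minimal sketch of the "file lands in s3, sns topic triggers lambda" flow noted above. It assumes the S3 notification is delivered through SNS (so the S3 event arrives as a JSON string inside `Sns.Message`). The function names are made up for illustration, and the actual load step is left out.

```python
import json
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs from an SNS-delivered S3 event notification."""
    objects = []
    for sns_record in event.get("Records", []):
        # SNS carries the S3 notification as a JSON string in Sns.Message.
        s3_event = json.loads(sns_record["Sns"]["Message"])
        for s3_record in s3_event.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded (a space shows up as '+').
            key = unquote_plus(s3_record["s3"]["object"]["key"])
            objects.append((bucket, key))
    return objects


def handler(event, context):
    """Lambda entry point: collect the new files that would be processed."""
    loaded = []
    for bucket, key in extract_s3_objects(event):
        # A real function would process/load the file here (hypothetical step).
        loaded.append(f"s3://{bucket}/{key}")
    return {"loaded": loaded}
```

Keeping the event parsing separate from the handler makes it easy to test locally with a hand-built event dict.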

ABD315 - Building Serverless ETL Pipelines with AWS Glue

  • slide deck
  • presentation
  • Merck presentation
  • How they build out their warehouse
  • dynamic frames
    • designed for semi-structured data
    • much more performant than standard json
    • much more performant than vanilla spark
    • mostly in the case of many small files - think streaming firehoses
    • variety of transforms available
  • has job bookmarks, remembers where it left off
  • can persist state of transforms, sinks, sources, etc
  • pause option built in - can start from last state, but not advance
    • used for debugging/testing
  • scala support is coming
  • very quick to add new data sources
  • reference arch - use glue to crawl for sources
    • rds, local dbs, streams, s3 buckets
    • this fills the glue catalog
    • can transform data with outputs to s3, redshift, etc
  • Merck stuff was more about how they integrated this at the front end to work with their proprietary ent software backend
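
To make the job bookmark and pause notes above concrete, here is a toy illustration in plain Python. This is not Glue's API: the `Bookmark` class, its JSON state file, and the `advance` flag are all invented for the sketch. The idea is only that processed inputs are remembered between runs, and that a "paused" run starts from the last state without moving it forward.

```python
import json
import os
import tempfile


class Bookmark:
    """Toy stand-in for a job bookmark: remembers which inputs were processed."""

    def __init__(self, path):
        self.path = path
        self.seen = set()
        if os.path.exists(path):
            with open(path) as fh:
                self.seen = set(json.load(fh))

    def run(self, inputs, process, advance=True):
        """Process only unseen inputs; advance=False mimics the pause option."""
        new_items = [item for item in inputs if item not in self.seen]
        results = [process(item) for item in new_items]
        if advance:
            # Persist the new state so the next run skips these inputs.
            self.seen.update(new_items)
            with open(self.path, "w") as fh:
                json.dump(sorted(self.seen), fh)
        return results


if __name__ == "__main__":
    state_file = os.path.join(tempfile.mkdtemp(), "bookmark.json")
    job = Bookmark(state_file)
    print(job.run(["a.json", "b.json"], str.upper, advance=False))  # test run
    print(job.run(["a.json", "b.json"], str.upper))                 # real run
    print(Bookmark(state_file).run(["a.json", "b.json", "c.json"], str.upper))
```

The paused run is what makes debugging repeatable: the same inputs get reprocessed until the bookmark is allowed to advance.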

ABD304 - Best Practices for Data Warehousing with Amazon Redshift & Redshift Spectrum

  • slide deck
  • presentation
  • parquet - ingest by using spectrum, and loading into redshift with "insert into table x select y from z"
    • parquet direct load coming Q1 2018
  • upsert - use our solution 😛. (I.e. we're doing things right)
  • use "backup no" for some loads/etls (worth checking our transforms here)
  • keep 20% free space or 3x your largest table (rule of thumb)
  • spectrum scales based on number of slices (i.e. linearly with num of nodes/cpu units)
  • materialize frequently filtered columns from dimension tables to fact tables
  • likewise, materialize frequently calc'd values to tables

Tuesday


ABD217 - From Batch to Streaming: How Amazon Flex Uses Real-time Analytics to Deliver Packages on Time

  • slide deck
  • presentation
  • Flex did an interesting thing where they listen on their "schema.rb" equivalent, and automatically migrated these changes over to redshift
  • idea of hot/warm/cold data -
    • hot is in kinesis, elastic search, etc
    • warm - redshift
    • cold - s3
  • Presentation showed how their arch evolved over time, leading through 5 iterations
    • iteration 5 showed how they listened to the source schema and updated the RS/elasticsearch schema in near real time using protocol buffers

SRV332 - Building Serverless Real-Time Data Processing

  • workshop presentation
  • wildrydes workshop
  • takeaways
    • pretty easy to set up kinesis streams, start querying in sql
    • both in athena with the batched files from firehose
    • and directly on the kinesis stream itself

ABD205 - Taking a Page Out of Ivy Tech’s Book: Using Data for Student Success

  • slides
  • presentation
  • built out a ML algo to predict which students were likely to be in trouble by week 2 of semester
    • 83% accurate
    • demographics are NOT included in their model, as they cannot change those...
    • (i have some questions about this, but didn't have a chance to ask them)
    • they were able to reach out to those students, and help prevent dropouts
    • estimate they were able to prevent 3100 dropouts
    • random note - in 7 cases, students had their power shut off, and IvyTech was able to get the students some help
  • Some redshift tidbits, most of which are known to our team

IOT328 - Building an AWS IoT-Enabled Drink Dispenser

  • slides
  • Pretty fun - built out a little device that dispenses drinks
  • built out a little website where we could send credits to classmates ($0.25ea); when you have $1 you can dispense a drink
  • got to program the microcontroller directly with mongoose os
  • pretty cool to see how this could tie into the aws ecosystem, and how you can manage these devices at scale
  • kinda reminded me of working with particle (née spark) devices

Wednesday


Keynote - Andy Jassy

  • full keynote
  • note - the keynote dropped out around 930-10am PST for about 20 min (was in an overflow room)

k8s

  • ecs for k8s (EKS)
  • auto deploy across multiple AZ
  • HA
  • can auto upgrade, but can control when you want

AWS Fargate

  • no need to manage servers
  • no clusters to manage, manages infra for you
  • auto scale across multiple AZ
  • "hands off the wheel"

Aurora

  • multi master - preview
  • zero downtime
  • multi-region coming 2018
  • multi-az now
  • aurora serverless...
    • on demand, serverless
    • no provisioning of db instances
    • auto scales for you
    • shuts down when not in use

DynamoDb global tables

  • multi-master, multi-region, fully managed

Graph Db

  • Managed neo4j?
    • not sure, but managed for you; support for open protocols (SPARQL etc)
  • Neptune

data lakes

  • really just a - hey this stuff is cool. use it
  • S3 select
    • only pull data you need?
    • much better perf
    • filter data within objects
  • Glacier select
    • run queries directly on data in glacier
    • make glacier part of data lake
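
A rough sketch of the "only pull data you need" idea with S3 Select, using boto3's `select_object_content`. The bucket, key, and column names in the usage note are placeholders, and the sketch assumes a CSV object with a header row. Running `select_rows` needs boto3, credentials, and an account where S3 Select is available.

```python
def build_select_request(bucket, key, sql):
    """Build keyword arguments for the S3 client's select_object_content call."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": sql,
        # Assumes a CSV object whose first line is a header row.
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
        "OutputSerialization": {"CSV": {}},
    }


def select_rows(bucket, key, sql):
    """Run the query and return matching rows as text (needs boto3 + credentials)."""
    import boto3  # imported lazily so the request builder works without it

    s3 = boto3.client("s3")
    response = s3.select_object_content(**build_select_request(bucket, key, sql))
    chunks = []
    for event in response["Payload"]:
        # The event stream mixes Records events with Stats/End events.
        if "Records" in event:
            chunks.append(event["Records"]["Payload"].decode("utf-8"))
    return "".join(chunks)
```

Something like `select_rows("my-lake", "events/2017.csv", "SELECT s.city FROM S3Object s WHERE s.state = 'NV'")` would return only the filtered column rather than the whole object.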

ML

  • Sagemaker
  • Databricks alternative it appears
  • can run tensorflow on sagemaker

Kinesis Video

  • ingest video/audio, other time encoded data

transcribe

  • transcribe long form audio into text

Sessions

ABD302 - Real-Time Data Exploration and Analytics with Amazon Elasticsearch Service and Kibana

  • presentation
  • slides
  • brief overview of what elasticsearch is, how aws helps
  • standard - within vpc and what not for the service
  • replica shards are how you get parallelism with ES
  • lucene - maps shards in memory; benefits from addl memory
  • how to use lambda to deliver data to ES
  • some recs for sizing, what instances to choose, etc
  • probably worth looking over the slides if nothing else
  • had to leave for the ML session however....didn't get to see much with analytics use

MCL333 - Building Deep Learning Applications with TensorFlow on AWS

  • I'll be honest, I struggled to keep up with this one
  • very deep on Neural Network deep learning specifics
  • slide deck
  • Ran through a "handwritten digits" learning algo in Jupyter on a G2 GPU instance (erm - i think that's right)
  • dropout - helps prevent overfitting
    • force it to ignore a certain amount; force layer to learn better
    • 0.5 usually a good number
  • deep copy/deep paste
    • copy paste formula from paper 😛
  • transfer learning
    • "pretrained" model
    • use new input layer (cat faces), but keep weights the same
    • output is completely diff
    • edge detection already there on layer 1
    • layer 2 - probably okay, basic face shape
    • layer 3 does need to change
  • LSTM - RNN (recurrent neural networks)
    • multiple digit numbers - makes sense to use RNN, as it's a sequence
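
To make the dropout note above concrete, a framework-free sketch of "inverted" dropout at the 0.5 rate mentioned. This is one common variant, not the session's TensorFlow code: each activation is zeroed with probability `rate` during training, and the survivors are scaled up so the expected sum stays the same.

```python
import random


def dropout(activations, rate=0.5, rng=None):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    rng = rng or random.Random()
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else a * keep_scale for a in activations]
```

At inference time the layer is simply skipped. The rescaling during training is what lets that work without adjusting the weights.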

SRV319 - How Nextdoor Built a Scalable, Serverless Data Pipeline for Billions of Events per Day

  • presentation
  • small ops team (basically 3 people)
  • how do they scale to that many events with such a small team?
    • Kinesis streams basically
    • need as many managed services as possible for such a small ops team
    • rule of thumb - if managed service is available from aws, use it
  • Legacy system on Apache Flume for logs
  • flume imported to s3 (and then to redshift) and ES
  • Standardized a lot of boilerplate in lambda functions with open source project called bender
    • config driven
  • Kinesis for data ingestion, lambda for execution, kinesis firehose for data aggregation, S3 for storage
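
A small sketch of the "kinesis for ingestion, lambda for execution" step in plain Python. This is not Bender (that project is a separate framework). It only shows the kind of boilerplate such a function carries: Kinesis hands Lambda base64-encoded record data, and the sketch assumes each payload is a JSON object.

```python
import base64
import json


def decode_kinesis_records(event):
    """Decode the base64 payloads in a Lambda Kinesis event into dicts."""
    decoded = []
    for record in event.get("Records", []):
        raw = base64.b64decode(record["kinesis"]["data"])
        decoded.append(json.loads(raw))
    return decoded


def handler(event, context):
    """Lambda entry point: decode the batch and report how many events arrived."""
    events = decode_kinesis_records(event)
    # Forwarding to Firehose/S3 would happen here; omitted in this sketch.
    return {"received": len(events)}
```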

Thursday


Keynote - Werner Vogels

  • much less intense than Wednesday

Cloud9

  • Looks pretty cool for dev collaboration
  • IDE, can debug various aws services from it
  • can be used like google docs (multiple devs working in same IDE, see what lines they're on, etc)

9s rule of thumb

  • 3 9s with 2 AZs and manual failover
  • 4 9s with 3 AZs in one region and auto failover
  • 5 9s with 3 AZs in each region, and 2 regions

AWS Services designed for Availability

Chaos engineering - Nora Jones

  • argues for additional "chaos testing" phase in addition to unit/integration testing
  • start with graceful restarts, then targeted chaos, then failure injection
  • her job - to automate chaos at netflix
  • Principles of Chaos Engineering
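
A toy version of the last phase listed above (failure injection), to show what a chaos test exercises. Nothing here is Netflix's tooling. The decorator, exception, and fallback helper are hypothetical names for illustration.

```python
import functools
import random


class InjectedFailure(RuntimeError):
    """Raised by the wrapper to simulate a dependency failing."""


def inject_failures(probability, rng=None):
    """Decorator that makes the wrapped call fail with the given probability."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < probability:
                raise InjectedFailure(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


def fetch_with_fallback(fetch, fallback):
    """Code under test: should degrade gracefully when `fetch` fails."""
    try:
        return fetch()
    except InjectedFailure:
        return fallback
```

The point of the test is the second function: with failures injected, the caller should still return something sensible instead of crashing.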

API Gateway w/VPC integration

  • for lambda
  • has concurrency controls
  • no longer (exclusively) at the edge, can set up links within vpcs - see here

New serverless app repo announced

Sessions

ABD328 - Zombie Annihilation Using AWS Big Data: Turning a Data Swamp into a Data Lake

  • session lab link
  • workshop
  • Chance to work with Glue as an ETL tool
  • very slick, easy to use - simple to get going, add new sources, add simple conversions
  • surprised by lag for some operations (spark jobs basically)
    • Looks like it has to spin up an individual EMR cluster for each job, so can take a while if one's not hot
  • Used quicksight a bit
    • not bad. can do some decent graphs (including some geo ones) with basic sql
  • Added some real-time data via kinesis
    • had sentiment analysis in source data (are zombies happy? no?)
  • Probably the most useful workshop for me personally

MCL212 - AWS DeepLens workshop: Building Computer Vision Applications

  • Probably didn't need to do this one really, but it was pretty cool
  • Chance to play with Deeplens
  • Full blown computer (atom based) with camera and aws greengrass on device
  • does some things locally, uploads others to the cloud
  • built a hot dog detector (yes like in Silicon Valley)
  • Def a lot of potential with this
  • Doesn't require super high level knowledge of ML (more helps, but average dev can probably work pretty well with it)

Friday


SRV330 Serverless DevOps to the Rescue

  • GH link
  • Had some trouble getting this running, but picked up a few things
  • another link
  • showed how to run SAM Local (run/test lambda functions locally)
  • mostly showed aws services for CI/CD
  • X-ray was pretty cool to see though, been wondering how to work with this. Looks like good potential with lambda functions