- API gateway - no longer (exclusively) at the edge, can set up links within vpcs - see here
- slide deck
- presentation
- extract to a "target" s3 bucket (archived and queryable), then converted to parquet with glue and landed in another bucket - their data lake - plus redshift (the two side by side)
- "target" bucket usually not queried directly though, query the parquet files in the lake bucket
- use lambda for many things here
- file from external partner lands in s3, sns topic triggers lambda and then new file is processed/loaded
- api gateway is hit, which triggers lambda, which then loads data onto kinesis, into s3, into dynamo, etc
- push for step functions for stateful apps - it remembers state for you
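- a minimal sketch of the first lambda pattern above (s3 event → sns → lambda), with made-up bucket/handler names:

```python
import json
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    """sns-triggered lambda: the s3 "object created" notification arrives
    wrapped in an sns message; fetch the new file and process/load it"""
    for record in event["Records"]:
        s3_event = json.loads(record["Sns"]["Message"])
        for s3_record in s3_event["Records"]:
            bucket = s3_record["s3"]["bucket"]["name"]
            key = s3_record["s3"]["object"]["key"]
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            load_file(bucket, key, body)  # hypothetical loader

def load_file(bucket, key, body):
    print(f"processing s3://{bucket}/{key} ({len(body)} bytes)")
```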
- slide deck
- presentation
- Merck presentation
- How they build out their warehouse
- dynamic frames (sketch at the end of these glue notes)
- designed for semi-structured data
- much more performant than standard json
- much more performant than vanilla spark
- mostly in the case of many small files - think streaming firehoses
- variety of transforms available
- has job bookmarks, remembers where it left off
- can persist state of transforms, sinks, sources, etc
- pause option built in - can start from last state, but not advance
- used for debugging/testing
- scala support is coming
- very quick to add new data sources
- reference arch - use glue to crawl for sources
- rds, local dbs, streams, s3 buckets
- this fills the glue catalog
- can transform data with outputs to s3, redshift, etc
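- roughly what one of these glue jobs looks like with dynamic frames (database/table/bucket names are made up; bookmarks are turned on in the job config, and the transformation_ctx tags are what they key off):

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# read a catalog table (filled in by a crawler); the bookmark
# remembers which files this source has already seen
events = glue_context.create_dynamic_frame.from_catalog(
    database="partner_feeds",    # hypothetical catalog database
    table_name="raw_events",     # hypothetical crawled table
    transformation_ctx="events",
)

# one of the built-in transforms: rename/retype fields
mapped = ApplyMapping.apply(
    frame=events,
    mappings=[("event_id", "string", "event_id", "string"),
              ("ts", "string", "event_time", "timestamp")],
    transformation_ctx="mapped",
)

# land it in the lake bucket as parquet
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-lake-bucket/events/"},  # hypothetical
    format="parquet",
    transformation_ctx="sink",
)

job.commit()  # advances the bookmark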
- Merck stuff was more about how they integrated this at the front end with their proprietary enterprise software backend
- slide deck
- presentation
- parquet - ingest by using spectrum, loading into redshift with "insert into table x select y from z" (sketch after this list)
- parquet direct load coming Q1 2018
- upsert - use our solution 😛. (I.e. we're doing things right)
- use "backup no" for some loads/etls (worth checking our transforms here)
- keep 20% free space or 3x your largest table (rule of thumb)
- spectrum scales based on number of slices (i.e. linearly with num of nodes/cpu units)
- materialize frequently filtered columns from dimension tables to fact tables
- likewise, materialize frequently calc'd values to tables
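- rough sketch of the spectrum ingest and the staging-table upsert (schema/table names, dates, and the iam role are all made up), run here through psycopg2:

```python
import psycopg2

# hypothetical cluster/credentials
conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                        dbname="analytics", user="etl", password="secret")
cur = conn.cursor()

# ingest parquet via spectrum: the external table points at parquet in s3,
# and the insert ... select pulls it into redshift proper
cur.execute("""
    INSERT INTO analytics.events
    SELECT event_id, event_time, payload
    FROM spectrum_schema.events_parquet
    WHERE event_date = '2017-11-29'
""")

# the classic redshift upsert: load a staging table, then delete+insert
# in one transaction (redshift has no native upsert)
cur.execute("CREATE TEMP TABLE events_stage (LIKE analytics.events)")
cur.execute("""
    COPY events_stage FROM 's3://my-bucket/incoming/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/etl'
    FORMAT AS JSON 'auto'
""")
cur.execute("""
    DELETE FROM analytics.events
    USING events_stage
    WHERE analytics.events.event_id = events_stage.event_id
""")
cur.execute("INSERT INTO analytics.events SELECT * FROM events_stage")
conn.commit()
```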
ABD217 - From Batch to Streaming: How Amazon Flex Uses Real-time Analytics to Deliver Packages on Time
- slide deck
- presentation
- Flex did an interesting thing where they listen to their "schema.rb" equivalent and automatically migrate those changes over to redshift
- idea of hot/warm/cold data -
- hot is in kinesis, elastic search, etc
- warm - redshift
- cold - s3
- Presentation showed how their arch evolved over time, leading through 5 iterations
- iteration 5 showed how they listened to the source schema and updated the RS/elasticsearch schema in near real time using protocol buffers
- workshop presentation
- wildrydes workshop
- takeaways
- pretty easy to set up kinesis streams, start querying in sql
- both in athena with the batched files from firehose
- and directly on the kinesis stream itself
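- e.g. the athena side is a single boto3 call (database/table/bucket names made up):

```python
import boto3

athena = boto3.client("athena")

# query the files firehose batched into s3 (table defined over that prefix)
resp = athena.start_query_execution(
    QueryString="SELECT status, count(*) FROM wildrydes.rides GROUP BY status",
    QueryExecutionContext={"Database": "wildrydes"},                # hypothetical
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(resp["QueryExecutionId"])  # poll get_query_execution / get_query_results
```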
- slides
- presentation
- built out an ML algo to predict which students were likely to be in trouble by week 2 of the semester
- 83% accurate
- demographics are NOT included in their model, as they cannot change those...
- (i have some questions about this, but didn't have a chance to ask them)
- they were able to reach out to those students, and help prevent dropouts
- estimate they were able to prevent 3100 dropouts
- random note - in 7 cases, students had their power shut off, and IvyTech was able to get the students some help
- Some redshift tidbits, most of which are known to our team
- slides
- Pretty fun - built out a little device that dispenses drinks
- built out a little website where we could send credits to classmates ($0.25ea); when you have $1 you can dispense a drink
- got to program the microcontroller directly with mongoose os
- pretty cool to see how this could tie into the aws ecosystem, and how you can manage these devices at scale
- kinda reminded me of working with particle (née spark) devices
- full keynote
- note - the keynote stream dropped out around 9:30-10am PST for about 20 min (was in an overflow room)
- ecs for k8s (EKS)
- auto deploy across multiple AZ
- HA
- can auto upgrade, but can control when you want
- no need to manage servers
- no clusters to manage, manages infra for you
- auto scale across multiple AZ
- "hands off the wheel"
- aurora multi-master - preview
- zero downtime
- multi-region coming 2018
- multi-az now
- aurora serverless...
- on demand, serverless
- no provisioning of db instances
- auto scales for you
- shuts down when not in use
- multi-master, multi-region, fully managed
- Managed neo4j?
- not sure, but managed for you; support for open protocols (sparql etc)
- Neptune
- really just a - hey this stuff is cool. use it
- S3 select
- only pull data you need?
- much better perf
- filter data within objects
- Glacier select
- run queries directly on data in glacier
- make glacier part of data lake
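- what s3 select looks like from boto3 (bucket/key and csv layout are assumptions; glacier select takes the same sql but runs through a glacier job):

```python
import boto3

s3 = boto3.client("s3")

# push the filter down to s3 - only matching rows come back over the wire
resp = s3.select_object_content(
    Bucket="my-lake-bucket",             # hypothetical
    Key="events/2017/11/29/events.csv",  # hypothetical
    ExpressionType="SQL",
    Expression="SELECT s.event_id, s.status FROM s3object s WHERE s.status = 'FAILED'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)
for event in resp["Payload"]:  # the response is an event stream
    if "Records" in event:
        print(event["Records"]["Payload"].decode())
```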
- Sagemaker
- Databricks alternative it appears
- can run tensorflow on sagemaker
- ingest video/audio, other time encoded data
- transcribe long form audio into text
- presentation
- slides
- brief overview of what elasticsearch is, how aws helps
- standard - within vpc and what not for the service
- replica shards are how you get parallelism with ES
- lucene - maps shards in memory; benefits from addl memory
- how to use lambda to deliver data to ES (sketch at the end of these notes)
- some recs for sizing, what instances to choose, etc
- probably worth looking over the slides if nothing else
- had to leave for the ML session however....didn't get to see much with analytics use
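- the lambda-to-ES piece, roughly (domain endpoint/index are made up; this assumes the domain's access policy allows the function's role - otherwise sign requests, e.g. with requests_aws4auth):

```python
import base64
import json
import requests  # bundled into the deployment package

ES_URL = "https://search-mydomain.us-east-1.es.amazonaws.com"  # hypothetical

def handler(event, context):
    """kinesis-triggered lambda that bulk-indexes records into elasticsearch"""
    lines = []
    for record in event["Records"]:
        doc = json.loads(base64.b64decode(record["kinesis"]["data"]))
        lines.append(json.dumps({"index": {"_index": "events", "_type": "event"}}))
        lines.append(json.dumps(doc))
    body = "\n".join(lines) + "\n"
    resp = requests.post(ES_URL + "/_bulk", data=body,
                         headers={"Content-Type": "application/x-ndjson"})
    resp.raise_for_status()
```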
- I'll be honest, I struggled to keep up with this one
- very deep on Neural Network deep learning specifics
- slide deck
- Ran through a "handwritten digits" learning algo in Jupyter on a G2 GPU instance (erm - i think that's right)
- dropout - helps prevent overfitting
- force it to ignore a certain fraction of units; forces the layer to learn better
- 0.5 usually a good number
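- for illustration only (keras here, whatever the session's notebooks actually used):

```python
from tensorflow import keras

# dropout randomly zeroes half the activations during training,
# forcing the layer to learn redundant, more robust features
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),    # handwritten-digit images
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dropout(0.5),                     # the "0.5 is usually good" number
    keras.layers.Dense(10, activation="softmax"),  # one output per digit
])
```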
- deep copy/deep paste
- copy paste formula from paper 😛
- transfer learning (sketch after this list)
- "pretrained" model
- use new input layer (cat faces), but keep weights the same
- output is completely diff
- edge detection already there on layer 1
- layer 2 - probably okay, basic face shape
- layer 3 does need to change
- LSTM - RNN (recurrent neural networks)
- multiple digit numbers - makes sense to use an RNN, as it's a sequence
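- sketch of the transfer learning idea (again keras just for illustration): keep the pretrained weights, swap the head

```python
from tensorflow import keras

# pretrained convnet minus its original classification head; the layer 1
# edge detectors and layer 2 shape detectors come along for free
base = keras.applications.VGG16(weights="imagenet", include_top=False,
                                input_shape=(224, 224, 3))
base.trainable = False  # keep the transferred weights fixed

# new head for the new task (e.g. cat faces vs. not)
model = keras.Sequential([
    base,
    keras.layers.Flatten(),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```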
- presentation
- small ops team (basically 3 people)
- how do they scale to that many events with such a small team?
- Kinesis streams basically
- need as many managed services as possible for such a small ops team
- rule of thumb - if managed service is available from aws, use it
- Legacy system on Apache Flume for logs
- flume imported to s3 (and then to redshift) and ES
- Standardized a lot of boilerplate in lambda functions with an open source project called bender
- config driven
- Kinesis for data ingestion, lambda for execution, kinesis firehose for data aggregation, S3 for storage (producer-side sketch below)
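- producer side of that pipeline, roughly (stream/delivery-stream names made up):

```python
import json
import boto3

kinesis = boto3.client("kinesis")
firehose = boto3.client("firehose")

event = {"user_id": "123", "action": "page_view"}

# ingestion: producers write to a kinesis stream (lambda consumes it)
kinesis.put_record(StreamName="events",                 # hypothetical
                   Data=json.dumps(event),
                   PartitionKey=event["user_id"])

# aggregation: firehose batches records and lands them in s3
firehose.put_record(DeliveryStreamName="events-to-s3",  # hypothetical
                    Record={"Data": json.dumps(event) + "\n"})
```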
- much less intense than Wednesday
- Looks pretty cool for dev collaboration
- IDE, can debug various aws services from it
- can be used like google docs (multiple devs working in same IDE, see what lines they're on, etc)
- 3 9s with 2 AZs and manual failover
- 4 9s with 3 AZs in one region and auto failover
- 5 9s with 3 AZs in each region, and 2 regions
- argues for additional "chaos testing" phase in addition to unit/integration testing
- start with graceful restarts, then targeted chaos, then failure injection
- her job - to automate chaos at netflix
- Principles of Chaos Engineering
- for lambda
- has concurrency controls
- api gateway - no longer (exclusively) at the edge, can set up links within vpcs - see here
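- setting the concurrency cap is one api call (function name made up):

```python
import boto3

# cap a function's concurrent executions so a hot event source
# can't swamp whatever sits downstream
boto3.client("lambda").put_function_concurrency(
    FunctionName="process-events",      # hypothetical
    ReservedConcurrentExecutions=50,
)
```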
- session lab link
- workshop
- Chance to work with Glue as an ETL tool
- very slick, easy to use - simple to get going, add new sources, add simple conversions
- surprised by lag for some operations (spark jobs basically)
- Looks like it has to spin up an individual EMR cluster for each job, so can take a while if one's not hot
- Used quicksight a bit
- not bad. can do some decent graphs (including some geo ones) with basic sql
- Added some real-time data via kinesis
- had sentiment analysis in source data (are zombies happy? no?)
- Probably the most useful workshop for me personally
- Probably didn't need to do this one really, but it was pretty cool
- Chance to play with Deeplens
- Full blown computer (atom based) with camera and aws greengrass on device
- does some things locally, uploads others to the cloud
- built a hot dog detector (yes like in Silicon Valley)
- Def a lot of potential with this
- Doesn't require super deep ML knowledge (more helps, but an average dev can probably work with it pretty well)
- GH link
- Had some trouble getting this running, but picked up a few things
- another link
- showed how to run SAM Local (run/test lambda functions locally)
- mostly showed aws services for CI/CD
- X-ray was pretty cool to see though, been wondering how to work with this. Looks like good potential with lambda functions
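- wiring x-ray into a lambda function with the python sdk looks roughly like this (function body is a made-up example; tracing also has to be enabled on the function itself):

```python
from aws_xray_sdk.core import patch_all, xray_recorder

patch_all()  # auto-instrument supported libraries (boto3, requests, ...)

@xray_recorder.capture("load_record")  # shows up as a subsegment in the trace
def load_record(record):
    print("loading", record)

def handler(event, context):
    for record in event["Records"]:
        load_record(record)
```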