- Give developers access to the services they build.
- Have developers "take the phone" (support their services)
- What tools do they need to do that?
- Version the CF templates with the code that uses them
- Split CF stacks based on stateful and stateless resources
- Eliminate duplicated template code/resources
- Can use tools like Troposphere (Python) to take it to a higher level of abstraction
- They build 2 AMIs: 1 with the OS+RPMs+configuration and 1 with OS+RPMs (no config)
- Allows them to re-bake the latter image for a different environment without changing the application bits
- They update the image ID specified in their auto-scaling groups to deploy a new build
- ASG update policy dictates the resulting behavior
- They use different AWS accounts to isolate resource limits (e.g. dev allocations can't impact production).
- Ensure timeouts are in sync across the system
- Need to know the sequence of calls (Dapper/Salp)
- They use Zuul for traffic shaping and routing
- Need to consider the combinatorial complexity of the RPCs as well as availability. If you want 4 nines of availability from a service, each of the services it depends on needs >4 nines, since availabilities multiply (S1 x S2 x S3)
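The availability arithmetic can be sketched as below. This is a rough illustration only: it assumes every dependency is required on the call path and that failures are independent, and the function names are made up for the example.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a call path that needs every dependency to succeed.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(availabilities)


def required_per_dependency(target, count):
    """Availability each of `count` equally reliable dependencies must reach
    for the whole path to meet `target`."""
    return target ** (1.0 / count)


# Three dependencies at exactly 4 nines each fall short of 4 nines overall.
three_deps = serial_availability([0.9999, 0.9999, 0.9999])

# What each of the three would need to hit for the path to reach 4 nines.
needed = required_per_dependency(0.9999, 3)

print(f"three 4-nines dependencies in series: {three_deps:.6f}")
print(f"needed per dependency for 4 nines overall: {needed:.6f}")
```

The more services a request fans out to, the higher the bar each one has to clear.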
- Can have multiple sets of credentials in `~/.aws/credentials`: `[default] ... [production] ...`
- Use ProfileCredentialsProvider and specify the profile name (e.g. 'production') when calling the constructor
- BP: in production, use the InstanceProfileCredentialsProvider and use an IAM role assigned to the EC2 instance
- Can use a credentials provider chain; if you specify the right chain you won't need to change code in production (e.g. ProfileCredentialsProvider first, and then InstanceProfileCredentialsProvider second)
- Can enable client-side metrics, which appear to be reported to CloudWatch
- Look for AwsSdkMetrics class methods or enable it via JMX or a system property
- Take a look at the new "resource objects" support that's currently in developer preview on GitHub. It will help reduce the amount of boilerplate code.
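A minimal sketch of the provider-chain idea, written in Python rather than against the Java SDK. It only mimics the pattern: the credentials text is a made-up sample in the profile file format, and `profile_provider`, `stub_instance_role_provider`, and `first_available` are hypothetical names, not SDK APIs.

```python
import configparser

# Made-up sample in the ~/.aws/credentials layout (one section per profile).
SAMPLE_CREDENTIALS = """\
[default]
aws_access_key_id = EXAMPLE_DEV_KEY
aws_secret_access_key = example-dev-secret

[production]
aws_access_key_id = EXAMPLE_PROD_KEY
aws_secret_access_key = example-prod-secret
"""


def profile_provider(credentials_text, profile):
    """Build a provider that reads one named profile, or yields None."""
    def provide():
        parser = configparser.ConfigParser()
        parser.read_string(credentials_text)
        if not parser.has_section(profile):
            return None
        section = parser[profile]
        return {
            "access_key": section["aws_access_key_id"],
            "secret_key": section["aws_secret_access_key"],
        }
    return provide


def stub_instance_role_provider():
    """Stand-in for an instance-role lookup (hypothetical; no network call)."""
    return {"access_key": "EXAMPLE_ROLE_KEY", "secret_key": "example-role-secret"}


def first_available(providers):
    """Try each provider in order and return the first credentials found."""
    for provide in providers:
        credentials = provide()
        if credentials is not None:
            return credentials
    raise LookupError("no provider in the chain returned credentials")


# Developer machine: the profile exists, so the chain stops at the first link.
on_laptop = first_available(
    [profile_provider(SAMPLE_CREDENTIALS, "production"), stub_instance_role_provider]
)

# Instance with no credentials file: the chain falls through to the role.
on_instance = first_available(
    [profile_provider("", "production"), stub_instance_role_provider]
)
```

Same code in both places; only the environment decides which link in the chain answers.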
- Placement group: lower round-trip time (RTT)
- Enhanced networking (SR-IOV): c4, c3, r3, i2 types support it
- Use i2 types for MongoDB
- Specify the 'cluster' strategy for a placement group; it will place instances physically close to one another
- Only certain instance types are allowed in placement groups (basically same set as enhanced networking)
- Placement groups are local to an availability zone
- BP: only add instances to a placement group when it's initially created and add all members at one time. Will fail quite often if you attempt to add instances long after the PG was created (as physical space around the existing members may be limited/non-existent).
- PGs not suitable for horizontally scalable tiers because of this
- BP: homogeneous instance types
- To check if SR-IOV is enabled run `ethtool` and check the driver type (`vif`: no, `ixgbevf`: yes)
- Can use `ec2-describe-instance-attribute` with `sriovNetSupport` as the attribute to see if EC2 thinks it's supported (reference).
- You can modify the attribute once if you manually add SR-IOV support to a supported instance type
- Cannot go back once you convert/enable an image!
- Always do the instance half (i.e. driver install/setup) first, lest you lose network access to the guest after enabling
- If you register a custom AMI with SR-IOV support all instances created from it will automatically have it enabled
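The driver check could be scripted along these lines. This is a sketch that parses text shaped like the output of `ethtool -i <interface>`; the two sample outputs are illustrative (only the `driver:` line matters here), and the function names are made up.

```python
# Illustrative samples; in practice this text would come from `ethtool -i eth0`.
SAMPLE_ENHANCED = """\
driver: ixgbevf
version: 2.14.2
bus-info: 0000:00:03.0
"""

SAMPLE_PARAVIRTUAL = """\
driver: vif
version:
bus-info: vif-0
"""


def driver_name(ethtool_output):
    """Return the value of the 'driver:' line, or None if it is missing."""
    for line in ethtool_output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "driver":
            return value.strip()
    return None


def sriov_active(ethtool_output):
    """True when the interface uses the ixgbevf driver (per the notes above)."""
    return driver_name(ethtool_output) == "ixgbevf"


print(sriov_active(SAMPLE_ENHANCED))
print(sriov_active(SAMPLE_PARAVIRTUAL))
```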
PFC304: Effective Interprocess Communications in the Cloud: The Pros and Cons of Microservices Architectures
- The tipping point: organizational growth (multiple teams) + diverse functionality + bottleneck in monolithic stack
- Need structure when adopting microservices, lest chaos ensue
- Polyglot is okay, but ensure there are standards for how things work/are operated
- S3 uses gossip to do discovery
- Gossip protocols are not consistent: members will each have a different view based on the gossip they've heard
- An alternative is to use a metadata store/consensus (e.g. ZooKeeper)
- BP: build an API for the metadata store so that the internal structure can evolve w/o breaking clients
- Failure detection: there's no way to determine if someone is dead or just silent
- You can detect liveness
- Don't let components go silent; have them report heartbeats at a minimum
- Use leases instead of locks
- They implemented a workflow model that is very much like Quartz: stateless actions that query the metadata store
- Idempotent actions and workflows are the key
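A toy illustration of "leases instead of locks": a lease expires on its own unless the holder keeps renewing it, so a silent or dead holder can't block everyone forever. This is an in-process sketch only; a real system would presumably keep the lease in the metadata store, and the `Lease` class here is hypothetical.

```python
import time


class Lease:
    """A time-limited claim that lapses unless the holder renews it."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._holder = None
        self._expires_at = float("-inf")

    def holder(self):
        """Current holder, or None if the lease is free or has expired."""
        if self._clock() >= self._expires_at:
            return None
        return self._holder

    def acquire(self, owner):
        """Take the lease if it is free, expired, or already ours."""
        if self.holder() not in (None, owner):
            return False
        self._holder = owner
        self._expires_at = self._clock() + self._ttl
        return True

    def renew(self, owner):
        """Extend the lease; fails once it has lapsed or belongs to another."""
        if self.holder() != owner:
            return False
        self._expires_at = self._clock() + self._ttl
        return True
```

A worker that stops renewing (the heartbeat going silent) simply loses the lease after the TTL, and another worker can pick up the work. That only works safely if the actions are idempotent, since the original holder may still be running.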
- Favorite interview questions:
- How to achieve consensus
- Biggest challenge of distributed systems: partial failures
- Three types of distributed systems: those with SPoFs, those with Paxos at the bottom, and broken distributed systems
- First attempted Paxos as a library, limited adoption
- Then implemented Paxos as a service, which boosted adoption
- 3rd attempt: Paxos as a primitive: transaction journal (since folks want order + consistency typically)
- Can write logs for multiple accounts or regions to a single S3 bucket
- CloudTrail Processing Library in Java does all the CT work for you; you just write the business logic to react to events
- Organize CF stacks by layers and environment
- e.g.: identity, base networking, shared, backend, frontend
- Use input/output parameters to express dependencies
- Use nested stacks for reusability
- CF now lets you strongly type parameters (!!!)
- Can signal completion back to CF if using user data (!!!)
- Can use CloudWatch logs to stream logs out of an instance
- Flow updates through CF only
- Sounds like there's a WIP preview feature (!!!)
- Can use CloudFormer to dump a CF template snippet for a resource to compare drift
- Use ASGs to do rolling updates
- Can extend CF with stack events (!!!)
- Have CF send notifications to custom extensions using SNS
- Custom resources must understand create, update, rollback, delete events
- Custom extension signals CF when done
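A rough sketch of what a custom resource handler has to do: dispatch on the request type and build the response that signals CF. The field names follow the custom resource request/response format as I understand it; the handler functions and the sample event are hypothetical, and rollback is assumed to arrive as ordinary Update/Delete requests rather than as its own type. Actually sending the response to the pre-signed `ResponseURL` is left out.

```python
def build_response(event, handlers):
    """Run the handler for event['RequestType'] and build the CF response."""
    response = {
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
    }
    try:
        physical_id, data = handlers[event["RequestType"]](event)
        response.update(Status="SUCCESS", PhysicalResourceId=physical_id, Data=data)
    except Exception as exc:  # report any failure back instead of going silent
        response.update(
            Status="FAILED",
            Reason=str(exc),
            PhysicalResourceId=event.get("PhysicalResourceId", "unknown"),
        )
    return response


# Hypothetical handlers; each returns (physical resource id, output data).
def on_create(event):
    name = event["ResourceProperties"]["Name"]
    return f"thing-{name}", {"Name": name}


def on_update(event):
    return event["PhysicalResourceId"], {"Name": event["ResourceProperties"]["Name"]}


def on_delete(event):
    return event["PhysicalResourceId"], {}


HANDLERS = {"Create": on_create, "Update": on_update, "Delete": on_delete}

sample_event = {
    "RequestType": "Create",
    "StackId": "example-stack-id",
    "RequestId": "example-request-id",
    "LogicalResourceId": "MyThing",
    "ResourceProperties": {"Name": "demo"},
}
created = build_response(sample_event, HANDLERS)
```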
- Use the 'NoEcho' option on a parameter to keep sensitive info from being displayed/logged by CF (!!!)
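Several of the CF points above (typed parameters, NoEcho, outputs that another stack can consume as inputs) fit in one small template. The sketch below builds it as a Python dict and dumps it to JSON; the parameter, resource, and output names are made up for illustration.

```python
import json

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Illustrative layer stack: typed inputs, one output",
    "Parameters": {
        # Strongly typed: CF validates these against real resources.
        "VpcId": {"Type": "AWS::EC2::VPC::Id"},
        "KeyName": {"Type": "AWS::EC2::KeyPair::KeyName"},
        # NoEcho masks the value wherever CF would otherwise display it.
        "DbPassword": {"Type": "String", "NoEcho": True, "MinLength": 8},
    },
    "Resources": {
        "AppSecurityGroup": {
            "Type": "AWS::EC2::SecurityGroup",
            "Properties": {
                "GroupDescription": "App tier",
                "VpcId": {"Ref": "VpcId"},
            },
        }
    },
    "Outputs": {
        # A higher layer's stack can take this value as an input parameter.
        "AppSecurityGroupId": {
            "Value": {"Fn::GetAtt": ["AppSecurityGroup", "GroupId"]}
        }
    },
}

rendered = json.dumps(template, indent=2)
print(rendered)
```

Generating the template from code like this is also one way to cut duplicated template snippets, in the same spirit as Troposphere.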
- AWS Cost Explorer can slice things by tags