@randylayman
Last active April 14, 2016 17:57
OReilly Software Architecture 2016 Notes

#OReilly Software Architecture 2016
April 11-13, 2016
###Venue
Hilton Midtown, Manhattan, NY

##Day 1 (April 11)
####Microservices Tutorial - Cassandra Shum (ThoughtWorks), Maria Gomez (ThoughtWorks)
The instructors provided a VirtualBox machine that contained scripts we ran to start and stop virtual machines. Converting monoliths to microservices meant uncommenting one invocation and invoking another. The session was mostly "do this, do that" with not a lot of discussion.

The session used the ThoughtWorks GoCD tool for managing the build pipelines. The interface seems pretty straightforward, but the speed of picking up changes from GitHub pushes wasn't great. That might be from running in a smallish virtual machine.

Did an exercise about bounded contexts to design the model for an airline, including systems for payments and rewards programs. The point of the exercise was to not model a user, because the business didn't talk about a user/customer/account. Instead, model the rewards, the payments, and the commonality into an application (because that's what a user did to create the rewards program).

The session included a deeper dive into Consumer-Driven Contracts (CDC). The concept is that consumers of an API provide test cases to the producer repository. These test cases assert that the behavior the particular consumer expects remains working. They showed an open source project, Pact (https://github.com/realestate-com-au/pact), that can capture the test cases on the client side and produce an artifact for the server repository, where it is replayed to verify the contract is fulfilled.
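The CDC idea can be boiled down to a few lines. Here is a minimal sketch in plain Python, not Pact's actual API; the endpoint and payload are made up. The consumer records the interaction it depends on, and the provider's build replays it:

```python
# Contract the consumer team checks into the provider's repository.
# The flight endpoint and fields here are hypothetical.
contract = {
    "request": {"method": "GET", "path": "/flights/42"},
    "expected_response": {"status": 200, "body": {"id": 42, "origin": "JFK"}},
}

def provider_handle(method, path):
    # Stand-in for the provider's real request handler.
    if method == "GET" and path == "/flights/42":
        return {"status": 200, "body": {"id": 42, "origin": "JFK"}}
    return {"status": 404, "body": None}

def verify(contract):
    # Replay the recorded interaction against the provider implementation.
    req = contract["request"]
    actual = provider_handle(req["method"], req["path"])
    return actual == contract["expected_response"]

print(verify(contract))  # True while the provider honors the contract
```

Pact automates the two halves of this: recording the interaction in the consumer's tests and replaying the resulting artifact in the provider's build.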

###The Zen of Architecture - Juval Löwy (IDesign Inc.)
The speaker for this session had a very large ego and a very strong opinion of his topic. I think for all the bluster and banter there is a kernel of truth in what he said: when designing a system, be aware of where volatility (likelihood of change) is, and make sure there is a box around it so things can be changed out with no impact to the larger system. This was a bit of a tease for his training classes, and the last ten minutes were a sales pitch for attending.

What follows are my notes during the session, largely copying text from slides.

  • Avoid functional decomposition

    • leads to duplicating behaviors across service boundaries
    • discourages reuse
    • couples services to order and current use cases
    • clients stitch services together, bloating the clients
    • a simple client call might invoke a chain, but each step in the chain costs more than the one below it because of message passing
  • Domain Decomposition

    • functional decomposition in disguise
  • Functional design is not testable

    • Can't do regression testing because of unusual conditions (???)

Chunky is the opposite of chatty (fewer large messages versus more small messages)

###The IDesign Method

  • Decompose based on volatility

    • Identify areas of potential change (can be functional but not domain functional)
      • Encapsulate in services
      • Milestones based on integration, not features
    • Implement behavior as interaction between services or subsystems
  • Universal principle of good design - encapsulate change
    • insulate, do not resonate with change
    • functional decomposition maximizes the impact of change

  • features aren't in services. Features are the integrations/interactions between the services

  • volatility is often not self-evident

  • Dunning-Kruger effect

  • Axes of volatility

    • At the same customer over time
    • At the same time across different customers
    • axes should be independent
      • the same axis probably means functional decomposition
  • Decomposition and business

    • avoid encapsulating changes to the nature of the business
      • hint - very rare indeed
      • hint - each done poorly
      • the speculating design trap
  • Decomposition and Longevity

    • Estimate when change happens with simple heuristic: ability of organization/customer/market to instigate or absorb a change is constant due to nature of the business
  • Layered approach

    • typical layers
      • Client (Presentation)
        • User or another system
        • advocate single point of entry to system
      • Business
        • managers encapsulate sequence in use cases and workflow
        • each manager is collection of related use cases
        • engines encapsulate business rules and activities
        • a manager may use zero or more engines
        • engines may be shared between managers
        • engine == activity (how to do something)
      • resource access
        • encapsulate resource access
        • may call resource stored in other services
        • can be shared across engines and managers
        • expose the lowest possible business context around resources - business verbs vs CRUD/IO
      • resources
        • physical resources
      • utilities
        • common infrastructure to all services
  • A cohesive usage of manager, engine, and resources is a service to the world

  • Strive for the minimal set of interacting services that satisfies the use case

  • Features are always and everywhere aspects of integration not implementation

  • Observation - volatility decreases top-down, reusability increases top-down

    • Managers are almost expendable
    • Deviation may indicate functional decomposition (or unripe decomposition)
  • Observation (about interaction diagrams)

    • functional decomposition yields forks or staircases
    • volatility looks like a stick figure (maybe a bit like a tree)
  • http://www.idesign.net/Download/OSA

##Day 2 (April 12)
###Keynote #1 blah blah Microservices blah blah - Jonas Bonér (Lightbend)

  • synchronous is strong coupling (in time)

Attributes of microservices:
  • isolation
    • most important trait
    • bulkheads against failures (prevents cascading failures)
    • provide ability to have resilience
  • autonomous. Work on their own
  • single responsibility pattern. Do one thing and do it well
  • exclusive state, including persistence
  • ability to move services
    • addressable via stable and static address
    • provides location transparency

Microservices are building blocks of systems

Systems need to exploit reality

  • reality is not consistent
  • information has latency

Minimize coupling and communication

What about transactions?

  • within services are fine with strong consistency
  • Between services implement Saga Pattern
    • For each transaction, also create a reversing transaction within the microservice. How to coordinate them was left unspecified.
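The Saga idea above can be sketched in a few lines of Python; the booking/payment step names are hypothetical. Run forward steps, and on failure run the compensations for the completed steps in reverse order:

```python
def run_saga(steps, state):
    # steps: (action, compensation) pairs; both mutate the shared state.
    done = []
    try:
        for action, compensation in steps:
            action(state)
            done.append(compensation)
    except Exception:
        # Undo completed steps in reverse order.
        for compensation in reversed(done):
            compensation(state)
        return False
    return True

def reserve(s): s["seats_reserved"] = 1
def cancel_reservation(s): s["seats_reserved"] = 0
def charge(s): raise RuntimeError("payment declined")
def refund(s): s["charged"] = 0

state = {"seats_reserved": 0, "charged": 0}
ok = run_saga([(reserve, cancel_reservation), (charge, refund)], state)
print(ok)                       # False: the charge step failed
print(state["seats_reserved"])  # 0: the reservation was reversed
```

The open question from the talk remains: in a distributed setting something still has to coordinate which compensations ran, which is where the unspecified complexity lives.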

###Keynote #2 Evolution of Evolutionary Architecture - Rebecca Parsons (ThoughtWorks)
Evolutionary architecture is the ability to evolve over time. Emergent design, design patterns, and CI are the safety net that makes it possible. Continuous delivery enabled creative thinking:

  • lower risks for production deployments
  • deploy when functionality dictates, removes risk from the consideration

Principles of Evolutionary Architecture

  • Evolvability belongs at the same level as reliability, scalability, etc.
  • Evolve in technical, data, security, and architecture dimensions

###Keynote #3 Conversational Commerce - Stewart Nickolas (IBM)
Demo of a system from IBM using voice on Apple TV, Amazon Echo, and Siri to work with Watson, showing how it could help with a large consumer purchase, such as remodeling a kitchen.

###Keynote #4 What I learned about architecture while running marathons - Ted Malaska (Cloudera)
States of meditation allow problems to be solved (TV, running, etc.)

  • if stuck > 30 minutes, go for a run
  • most challenges solved while running

Efficiency of motion

  • SQL isn't necessarily right
  • horsepower isn't the best answer; efficiency is
  • do more with less

###Session #1 Don't build a Death Star of Security - David Strauss (Pantheon)

  • Defense in depth

  • Don't overly obsess about the edge

  • For this session, assume the 1st level is bypassed

  • Authentication Security

    • Common 2nd level attack
    • Capture of username/password in flight
  • Password hashing

    • Best: Password + Salt (Random, stored with database) + Pepper (maybe static, separate from storage, possibly in source)
    • Use HMAC SHA512 if speed is an issue
    • Use Password stretching like PBKDF2 (100 rounds) to make it take more time. Also scrypt/bcrypt
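The salt + pepper + stretching recipe can be sketched with only Python's stdlib; the pepper value and the 100,000-iteration count below are illustrative, not the speaker's numbers:

```python
import hashlib, hmac, os

PEPPER = b"app-wide-secret"  # hypothetical: kept in config/source, never in the DB

def hash_password(password, salt=None):
    # Random salt, stored alongside the hash in the database.
    salt = salt if salt is not None else os.urandom(16)
    # Apply the pepper via HMAC-SHA512 before stretching.
    peppered = hmac.new(PEPPER, password.encode(), hashlib.sha512).digest()
    # PBKDF2 stretching; the iteration count is illustrative.
    digest = hashlib.pbkdf2_hmac("sha512", peppered, salt, 100_000)
    return salt, digest

def check_password(password, salt, digest):
    return hmac.compare_digest(hash_password(password, salt)[1], digest)

salt, digest = hash_password("hunter2")
print(check_password("hunter2", salt, digest))  # True
print(check_password("wrong", salt, digest))    # False
```

`hmac.compare_digest` avoids timing leaks on comparison; scrypt/bcrypt would replace the `pbkdf2_hmac` call.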
  • Use different complexity based on length

    • Based on recommendation from Stanford(?)
    • 8-11 characters, require Aa1$
    • 12-15 characters, require Aa1
    • 16-19 characters, require Aa
    • 20+ characters, require a
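The length-based schedule above as a small policy function; the thresholds mirror the slide, while the function name and rule encoding are my own:

```python
import re

RULES = [  # (min_length, required character-class patterns)
    (20, [r"[a-z]"]),                                   # 20+: just a
    (16, [r"[A-Z]", r"[a-z]"]),                         # 16-19: Aa
    (12, [r"[A-Z]", r"[a-z]", r"[0-9]"]),               # 12-15: Aa1
    (8,  [r"[A-Z]", r"[a-z]", r"[0-9]", r"[^A-Za-z0-9]"]),  # 8-11: Aa1$
]

def acceptable(password):
    for min_len, patterns in RULES:
        if len(password) >= min_len:
            return all(re.search(p, password) for p in patterns)
    return False  # shorter than 8 characters

print(acceptable("Tr0ub4dor&3"))                   # True: 11 chars with Aa1$
print(acceptable("correct horse battery staple"))  # True: 28 chars, lowercase ok
print(acceptable("short"))                         # False: under 8
```

Checking rules from the longest threshold down means a long password only has to satisfy the relaxed requirement.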
  • Recommend multifactor authentication. Almost table stakes

  • Federated Authentication, such as OpenID Connect

  • Authentication before application

    • Makes it impossible to exploit application vulnerabilities without credentials
    • Wrap app in something like Apache with SAML
  • Confused Deputy Problem

    • Internal system too trusting
  • Challenge: Ambient Authority and Confused Deputies

    • Virtue of who you are gets you authority
    • Pattern: Capability-Based Security
    • Pattern: Mandatory Access Control, like selinux
    • Anti-pattern: Mandatory Access Control as an afterthought. Complex rules makes it difficult to debug
    • Better: Boundaries first, container style. Docker on CentOS works well with selinux
  • Staying Hands Off Sensitive Data

    • Pattern: Delegated Handling of Sensitive Data (think payment gateways)
    • Pattern: Black Hole API (write only API to receive sensitive data so receiver doesn't persist it)
  • Managing App Data (mobile, laptops, etc)

    • Pattern: Key management
    • Pattern: Anonymize Data
    • Pattern: Physical security and device encryption
    • Pattern: Smart cards and hardware tokens (i.e. U2F)
  • Preventing a breach from spreading

    • Pattern: System isolation (firewall or more)
    • Challenge: Shared secrets (i.e. passwords)
    • Anti-pattern: Security through obscurity
    • Pattern: Public Key Infrastructure.
      • Encrypt all connections, no trust
      • CloudFlare has an OSS CA written in Go for issuing certs for new hosts
      • MySQL can support PKI/certs for logins. Can we use this for QA access to staging database?
      • JWT with asymmetric mode. Doesn't require private secret on verifier
  • Mitigate TLS overhead with persistent connections and session caching

###Let's not rewrite it all - Michelle Brush (Cerner Corporation)
Purpose, not modernization, drives change.

Types of Rewrites:

  1. Code
  2. Design
  3. API/Frameworks
  4. Architecture
  5. User Interface
  6. Language

Change gets more difficult as you go down the list. Oftentimes change bleeds through the layers.

Rule 1: Minimize the scope. Define and stick to a small set of objectives.

Rule 2: Create a technical vision. Make everyone understand it. Enforce it (i.e. build checks for disallowed components/libraries)

Rule 3: Reduce Complexity. Look for ways to reduce scope/effort

Rule 4: Work in thin slices.

  • Driver for which slice is what is most likely to change.
  • Plan to build adapters and abstractions
  • Accept you will have throwaway work
  • Risk -- won't finish migration.
    • Reason - people don't understand/not bought into reason for change

Rule 5: Build your tests first

  • High level integration/system tests
  • Try to not mess with interfaces
  • Don't fix every bug, since fixing one changes the interface. Document and fix afterwards (or fix in the base and rebuild interface assumptions/tests)

Rule 6: Embrace Operations Early. Document operational risks as you go.

Rule 7: Invest in learning.

  • Create materials as you go
  • Document guiding principles
  • Create learning materials regularly

Rule 8: Make doing the right thing easy

  • Make doing the wrong thing hard

###Consistent hashing, shuffle sharding, and copysets: Practical tools for controlling failure - Wes Chow (Chartbeat)

Controlling (limiting) failure through sharding (applies to load balancing)

Pattern: Hashing mod

Pattern: Hashing nodes

  • Hash node name, plot to line
  • Hash keys; a key is stored in the next highest node

Pattern: Consistent Hashing

  • For every node, generate M virtual nodes. Repeat the Hashing Nodes approach
  • Adding nodes results in about 1/N data getting shifted to new node
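A minimal consistent-hashing sketch with virtual nodes (the M here is the knob the speaker set to 160); the ~1/N data-movement claim can be checked directly. Node names and counts are arbitrary:

```python
import bisect, hashlib

def h(s):
    # Deterministic integer hash for ring placement.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=160):
        # Each node contributes `vnodes` points on the ring.
        self.ring = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def lookup(self, key):
        # Binary search for the next point clockwise: O(log n) per lookup.
        i = bisect.bisect(self.points, h(key)) % len(self.ring)
        return self.ring[i][1]

ring = Ring(["a", "b", "c"])
before = {k: ring.lookup(k) for k in map(str, range(1000))}
bigger = Ring(["a", "b", "c", "d"])
moved = sum(before[k] != bigger.lookup(k) for k in before)
print(moved / 1000)  # roughly 1/4 of keys move to the new node
```

The virtual nodes are what make the load even; with one point per node, adding a node would steal keys from only a single neighbor.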

Pattern: Node Sets/Rendezvous Hashing

  • For all servers, hash server+key. Server that produced the lowest hash stores the key
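Rendezvous hashing fits in one function, following the "lowest hash wins" rule from the notes. A useful property falls out for free: removing a server only remaps the keys that server owned. This is a sketch with arbitrary server names:

```python
import hashlib

def owner(servers, key):
    # Hash server+key for every server; the lowest hash wins. O(n) per lookup.
    return min(servers, key=lambda s: hashlib.md5(f"{s}:{key}".encode()).hexdigest())

servers = ["a", "b", "c"]
print(owner(servers, "user-42") in servers)  # True

# Keys not owned by "c" keep the same owner after "c" is removed,
# because the winning (lowest) hash among the survivors is unchanged.
survivor_keys = [k for k in map(str, range(100)) if owner(servers, k) != "c"]
assert all(owner(["a", "b"], k) == owner(servers, k) for k in survivor_keys)
```

No ring to build, so there is zero startup cost; the trade-off versus consistent hashing is the linear scan per lookup.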

Consistent Hash versus Rendezvous Hash

Consistent hashing does lots of work at startup to generate the virtual nodes (the speaker uses 160 and couldn't explain why when asked); lookup is O(ln(n)). Rendezvous has no work at startup; lookup is O(n).

Poison Pills

  • One request that does bad things (expensive query, buggy handler, unhandled input)
  • Results in crash. If retries, can take out pods
  • Hash requests into sets of nodes. The more sets = smaller blast radius

Shuffle Sharding

  • For each customer, assign to a set of servers. Change servers for each set. Can get pretty low blast radius with medium numbers
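A sketch of shuffle sharding: seed a PRNG with the customer id so each customer gets a stable, small, effectively random subset of servers. The customer names and shard size below are arbitrary:

```python
import random

def shard_for(customer, servers, shard_size=2):
    # Seeding with the customer id makes the assignment deterministic:
    # the same customer always lands on the same small set of servers.
    rng = random.Random(customer)
    return sorted(rng.sample(servers, shard_size))

servers = [f"server-{i}" for i in range(8)]
print(shard_for("acme", servers) == shard_for("acme", servers))  # True: stable
print(len(shard_for("acme", servers)))  # 2
```

With 8 servers and shards of 2 there are 28 possible shards, so a poison pill from one customer takes out at most its own 2 servers, and few other customers share exactly that pair.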

###It Probably Works - Tyler McMullen (Fastly)
Probabilistic Algorithms:

  • Element of random
  • Used to do things more efficiently than otherwise possible
  • Not guessing
  • Provably bound error rates

Join-Shortest-Queue

  • Request latency is log-normal distribution for almost all systems
  • Balls and Bins Problem
  • As load goes up, random load balancing gets worse
  • Join-Shortest-Queue is Better
  • Distributed Random is the same as single-server Random
  • Distributed Shortest Queue is difficult
    • Naive is almost same as Random
    • Oscillations from nodes making same decision
  • Better choice is "Power of 2"
    • Pick 2 random servers, use lowest of 2
    • Gives exponential improvement in variance
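A quick simulation of that claim, under simplifying assumptions (uniform arrivals, no departures): compare pure random placement against picking the less-loaded of two random servers.

```python
import random

random.seed(1)
N, REQUESTS = 10, 10_000
rand_q = [0] * N   # queue depths under pure random assignment
p2_q = [0] * N     # queue depths under power-of-two-choices

for _ in range(REQUESTS):
    rand_q[random.randrange(N)] += 1          # random: pick any server
    a, b = random.sample(range(N), 2)          # power of 2: sample two servers,
    p2_q[a if p2_q[a] <= p2_q[b] else b] += 1  # send to the shorter queue

# Spread (max - min) is far smaller with power of two choices.
print(max(p2_q) - min(p2_q) < max(rand_q) - min(rand_q))  # True
```

The point of the talk: you get most of the benefit of true shortest-queue without global coordination, because sampling two servers is enough to break the herd behavior.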

Count Distinct

  • ex. count unique # of IPs across fleet
  • naive solution doesn't scale well
  • distributed is hard for data transmission
  • HyperLogLog generates estimate
    • Extension of LogLog
    • Hash input to a bit string. We expect the max run of leading zero bits to be approximately log2(unique items)
    • Improve accuracy by using more buckets and taking mean
    • Union of HyperLogLog is max of every bucket for distributed systems
    • Error rate is ~ 2%
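A bare-bones HyperLogLog-style estimator to make those bullets concrete: bucket by the low hash bits, keep the longest leading-zero run per bucket, and combine with a harmonic mean. The constants follow the published algorithm, but this is an illustration, not a tuned implementation:

```python
import hashlib

M = 256  # number of buckets (a power of two)

def rho(x, bits):
    # 1-based position of the first 1-bit in a bits-wide integer.
    return bits - x.bit_length() + 1

def hll_count(items):
    maxima = [0] * M
    for item in items:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16) & ((1 << 64) - 1)
        bucket = h & (M - 1)          # low 8 bits choose the bucket
        maxima[bucket] = max(maxima[bucket], rho(h >> 8, 56))
    alpha = 0.7213 / (1 + 1.079 / M)  # bias correction for large M
    return alpha * M * M / sum(2.0 ** -m for m in maxima)

est = hll_count(range(10_000))
print(round(est))  # close to the true count of 10,000
```

The distributed-union bullet corresponds to taking the element-wise max of two `maxima` arrays, which is why fleets can merge sketches cheaply.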

Reliable Broadcast

  • Every message gets to all clients
  • ex. CDN purge across network
  • Not the same as Atomic Broadcast
  • Gossip Protocols

Good Systems are Boring

###App Security and Microservices - Sam Newman (ThoughtWorks)
App security is like hand washing: it doesn't require a specialist for most parts.

Microservices increase the attack surface but reduce the scope of failure.

Prevention    ==>    Detection
  |                       /\
  \/                      |
Recovery      <==     Response

Try to examine realistic and holistic threat models. A good example is Attack Trees by Bruce Schneier.

  • Use HTTPS within the network
    • LetsEncrypt makes this easy with automation
    • Lemur might help with client certs for servers. From Netflix
  • Confused Deputy problem
    • Solution: Nested SAML assertions - painful
    • OpenID Connect. OpenAM supports it
  • Encrypt at Rest
  • Patch every week
    • Patch verification is hard
    • Same tools for docker
    • Single OS can help
    • Weekly probably good enough

Detection
  • Service like Qualys. CVEs have fingerprints to look for in logs
  • Aggregate logs and store
  • ModSecurity
  • Polyglot = more stuff to track. More things to break? Yes, but solvable with great automation

Response

  • How you speak to your users
  • Denials can make it worse
  • Good example is the Tylenol response in physical stores in the '80s. Clear, open, quick. See the Wikipedia article
  • Be empathic
  • Game day exercises should include comms

Recovery

  • Side bar - use time limited API Keys with AWS
  • Backups
    • Burn it all down and rebuild it
    • Harder with microservices

##Day 3 (April 13)
###Keynote #1 Evolving toward microservices: How HomeDepot.com made the transition - Christopher Grant (Home Depot)

  • Evolutionary Architecture
    • Have a vision
    • Understand your (business) objectives
    • Choose what matters
    • Implement in stages
  • Decompose and Defer Boundaries
    • Understand products and domains
    • Look for the hard things
    • Review data models
  • Architect for change
    • Prepare for change
    • Be proactive, not reactive
    • Delay until the last responsible moment
    • Expect but minimize future work
  • Implement Safety
    • Automate early and often
    • Consider in place and greenfield
    • Utilize feature switches and traffic throttles
    • Encourage separation and independence

###Keynote #2 Going cloud native: It takes a platform - Chip Childers (Cloud Foundry Foundation)
Why does cloud native matter?

  • Disruption, platform economics

Simple patterns. Highly automated. Scaled with ease.

Industrializing the craft. Doing artisanal at scale.

Focus on Takt Time (yes, Takt)

  • Desired time between units of production output
  • Time between two features reaching production

Emergent engineering principles

  • Learned the hard way
  • Starting to understand
  • 12 Factors Website from Heroku provides good information
    • Declarative formats for automation
    • Clean contract with OS
    • Suitable for deployments on modern cloud platforms
    • Minimize divergence dev -> test -> prod
  • You're going to need a platform
    • Simple platform is a ticketing system (UI, more sophisticated with APIs)
    • Platforms make promises (software optional)
    • Constraints are the contracts that allow platforms to keep promises
    • The right constraints free us to be creative where it matters

Here is my source code
Run it on the cloud for me
I do not care how

(Cloud Foundry haiku)

###Keynote #3 From static to future-proof: Enterprise architectures in the age of the customer - Thomas Cozzolino (Salesforce)
To future proof, 3 demands:

  • Metadata (really data driven)
  • Composability (UI components, integration/workflow)
  • A bigger universe (multi-disciplinary/a lens for diversity)

###Keynote #4 Let's make the pain visible - Janelle Klein (New Iron)

  • Stages of escalating project risk
    • Deferring problems
    • Painful releases
    • Thrashing
    • Project meltdown
  • What's wrong with the current strategy?
    • Book: Good Strategy Bad Strategy
    • We don't have a strategy
  • Obstacle #1: Managers and developers are speaking different languages
  • Obstacle #2: We spend tons of time working on improvements that don't actually make improvements
  • Root: Lack of Visibility
  • Need to optimize for Idea Flow, not Code
  • The hard part is identifying the problem
  • Idea Flow is a universal definition for effective practice
  • OpenMastery
    • Take Responsibility
    • Learn how to get there
    • Then teach the industry to succeed

Idea Flow is a book

###Session #1: The architect's clue bucket - Ruth Malan (Bredemeyer Consulting)
Note: This was a really bad session. The speaker threw up bunches of slides with quotes that were read to the attendees and didn't provide much of a narrative to connect the quotes together.

  • Architects have structural oversight
  • Makes tradeoffs
  • Thinks about why
  • Technical Debt ties systems to their past

Architect SCARS - Grady Booch

  • Separation of Concerns
  • crisp, resilient Abstractions
  • balanced distribution of Responsibilities
  • strive to Simplify

How to kill a system -- Strangler Application by Martin Fowler

Twitter account @papers_we_love

Rule of Three - Gerald Weinberg

Book Empathic Technical Leadership by Alex Harms

###Session #2 Apache Kafka and the stream data platform - Jay Kreps (Confluent)
Went through the history while at LinkedIn.

Philosophy - everything is an event. Tried ActiveMQ and RabbitMQ but had problems with throughput, persistence, partitioning, and ordering. Kafka Connectors do the hard work of clients (config management, distributed connections, ...)

###Session #3 Microservices: What's missing... - Adrian Cockcroft (Battery Ventures)
Java world - Spring Cloud. Netflix and Pivotal are now part of that ecosystem. Hailo and SoundCloud are working on tools in the Go world.

Talk covers How to avoid fragile microservices (bold is each section)

Failure Injection Testing

  • Simian Army. Trust with Verification
    • Chaos Monkey verifies stateless servers
    • Chaos Gorilla verifies server failover properly (run once a quarter as a Game Day exercise)
    • Chaos Kong verifies routing around region failures (run once a month)
    • Evident.io has a stronger security monkey if you want to pay for it

Version Aware Routing

  • Immediately and safely introduce a new version
  • Clients routed to right server
  • Change 1 thing at a time, client OR server
  • Eventually remove old versions (from code base)
  • Version # ==> Interface.Feature.BugFix
  • Deployment for types of changes:
    • Bug fix. Canary test and remove old version
    • Feature. Canary.
      • If backwards compatible, remove old versions
      • If not, run side-by-side then upgrade clients
    • Interface. Run side-by-side then upgrade clients.

Protocols

  • Measure serialization, transmission, and deserialization costs
  • Public APIs use JSON
  • Private (internal) APIs: consider Thrift, ProtoBuf, gRPC, Avro, SBE (SBE is the fastest by far but has limitations)

Interfaces

  • Build reference implementation for client
  • Use client in stress test harnesses
  • Each service should have distinct object model to reduce coupling
  • Version dependency interfaces with strong version pinning

Timeouts and Retries

  • Connection versus Request timeouts
    • Question: How to handle ultimate failure?
    • Use persistent connections
    • Connection timeout should be slightly larger than network latency
    • Request timeout should be based on logic of request
    • Use cascading time budget
      • Edge has large timeout
      • Each service deeper is smaller
      • With small systems, set statically
      • With large/dynamic systems, set dynamically and pass in headers. Reduce at each step. Fail when can't fulfill within time
    • Instrument retries
    • Never retry on same connection
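The cascading budget can be sketched as follows: the edge sets a deadline, each hop checks the remaining budget, shrinks it, and passes it downstream in a header. The header name and margin below are hypothetical:

```python
import time

HOP_MARGIN = 0.050  # seconds reserved per hop for its own work and the network

class BudgetExceeded(Exception):
    pass

def call_downstream(headers, work):
    deadline = float(headers["x-deadline"])
    remaining = deadline - time.monotonic()
    if remaining <= HOP_MARGIN:
        # Fail fast instead of starting work that cannot finish in time.
        raise BudgetExceeded("not enough budget left to attempt the call")
    # Shrink the budget before handing it to the next hop.
    child_headers = dict(headers, **{"x-deadline": str(deadline - HOP_MARGIN)})
    return work(child_headers, timeout=remaining - HOP_MARGIN)

def leaf(headers, timeout):
    # Stand-in for a real downstream request made with the given timeout.
    return "ok"

edge_headers = {"x-deadline": str(time.monotonic() + 1.0)}  # 1s total budget
print(call_downstream(edge_headers, leaf))  # ok
```

This matches the bullets above: static budgets for small systems, and for large ones the deadline rides in a header and shrinks at each step until a hop fails fast.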

Manage Inconsistency

  • Manage in the app because it's the only place it can be completely handled

Denormalized Data Models

  • Many databases => Inconsistencies exist
  • Antipattern: Shared schema between services
  • Build custom cross-data source check/repair process

Cloud Native Monitoring

  • High rate of change, per minute doesn't cut it
  • Ephemeral Configurations
  • Want data per second
  • SaaS solutions do well (DataDog on his list)

Managing Scale

  • Flow - AppDynamics, New Relic
  • Doesn't scale to 100s of services
  • GetGuestimate.com
    • Monte Carlo simulator/modeler for response time
  • Latency doesn't follow a normal distribution

Look at go-kit/kit/metrics for metric tracking in Go

###Session 4: Designing a reactive data platform: Challenges, patterns, and antipatterns - Alex Silva (Pluralsight)

Responsive   ==>    Elastic
  /\                  ||
  ||                  \/
Message Driven =>  Resilient
  • Responsiveness
    • Goal is constant time response under varying load
  • Elasticity
    • Scalability on demand
    • Required for reactive
    • Needs:
      • Asynchronous
      • Share nothing (single responsibility pattern)
      • Divide and Conquer
      • Location Transparency
  • Resiliency
    • The ability to return to its original shape
    • In the face of problems, still works @ normal performance
  • Message Driven
    • Not events
    • Messages go to someone/something
    • Events go to a bucket/topic. Facts/past.
  • Their platform
    • Akka, Kafka, Spark
  • Pattern: Decentralize processing of key messages
  • Pattern: Design incremental communication protocol
  • Pattern: Hide an elastic pool of resources behind its owner (see Akka router)