Rodrigue Schaefer: "From monolith to microservices", a talk about some of the challenges of Zalando's transition from a monolith to microservices (microXchg 2016)
- 2008: started with a POC on Magento => started fine but did not scale very well
- 2010: Magento couldn't handle the rising load and traffic
- => so in 3 months they built their own system, based on Java, Spring and a Postgres DB (a monolithic application)
Their main focus was to build a system as efficient, stable and fast as possible
- => business logic in the DB (stored procedures)
- => all access via stored procedures (no direct access to the DB)
The system eventually became hard to maintain: adding a new feature was getting harder and harder
- huge number of devs working on the same codebase
- lots of dependencies between teams => a lot of coordination needed between teams to develop and release a feature (=> "release train" model)
- extra coordination => lower productivity
- as code size increases
- => bug density increases and
- => system complexity increases
- higher complexity => rigid processes were adopted (everything tightly controlled to reduce variance as much as possible) => this kills innovation (everything on the same tech stack, etc.)
- Old platform with rigid processes
- => hiring problems, retention issues: difficult to find people who want to work on this old platform
- => not attractive to young talent
- => slow onboarding and fear of changing anything
Then came their radical organizational change!
Zalando wanted:
- Autonomous teams to deliver amazing products efficiently at scale
- Give a team independence (no time lost to the fear of breaking someone else's part)
Based on 3 principles, glued together by Trust:
- Autonomy: the team can act on its own, define its delivery process and its technological stack, have an idea, then develop, deploy and operate it
- Purpose: strategic alignment, so that everyone is on the same page
- Alignment is achieved with OKRs: the company defines its objectives for the year, then each department takes these objectives and comes up with its own, then the teams do the same
- Mastery: give support to engineers to get better at what they want to be good at. We want excellent engineers, so we need to help them develop and grow.
- => Positive psychology is helping here. Old psychology: helping sick people. Positive psychology: making normal people happier (see the Six-factor Model of Psychological Well-being)
Conway's law applied in reverse: we changed the organization, and suddenly the old technological landscape no longer fit.
Organization side: "a purpose-driven organization composed of autonomous teams which deliver clearly defined products"
this, mapped to the technological side, means:
Technological side: "a service-oriented architecture composed of loosely coupled elements that have bounded contexts" (Adrian Cockcroft's definition of microservices)
Organization side + Technological side => Radical Agility
- Rapid provisioning
- Basic Monitoring
- Rapid application deployment
These are the three prerequisites listed in "MicroservicePrerequisites" by Martin Fowler.
To adopt a microservice architecture, you need to be very good at operations, because when you migrate a monolith to a microservice architecture you push the complexity down to the infrastructure level.
AWS + Docker + app monitoring + Stups.io (open-source platform developed by Zalando). PS: Stups is now on hold at Zalando.
- Expect failure: expect other systems to fail, so:
- build resilient systems
- avoid domino effects (using tools like Hystrix)
- this is something most engineers don't know or are not used to
- we help all teams to have this mindset (even one team without this mindset could "ruin" the whole system)
- End-to-end responsibility
- cross-functional teams responsible for everything: dev, test, operations
- teams should think like a small startup: nobody else cares about your stuff, so you have to care about your idea yourself
- Software as a service
- teams have to see their products as "software as a service", and see other teams as their customers
- a big mindset change for a lot of developers
- API first
- try to keep all teams aligned on how to design and share APIs
- there's an "API guild" which reviews the APIs made by teams and helps create coherent APIs across the whole organization
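The "expect failure" principle above (avoiding domino effects with tools like Hystrix) is commonly implemented with a circuit breaker. The following is a minimal sketch of that idea, not Hystrix's API and not Zalando's code: the class name `CircuitBreaker` and the parameters `max_failures`, `reset_after` and `clock` are assumptions made for illustration.

```python
import time


class CircuitBreaker:
    """Illustrative circuit breaker; names and defaults are hypothetical."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds while open
        self.clock = clock                # injectable so the sketch is testable
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: skip the remote call so a failing dependency
                # is not hammered and callers do not pile up behind it.
                return fallback()
            # Cool-down elapsed: allow one trial call (half-open).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result
```

A caller wraps each remote call, for example `breaker.call(fetch_recommendations, lambda: [])` (where `fetch_recommendations` is a hypothetical function), so that a failing dependency degrades one feature instead of cascading through the system.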
With ~70 teams => how do you make sure everything fits well together?!
In order to handle this, Zalando has:
- Rules of Play - define a vision of the architecture we want to have: loosely coupled services, resilience, REST as the main API design style, ... => written down in a booklet and given to everybody
- Peer Reviews: get feedback and opinions by others
- Tech Radars: looking at new technologies, technologies we don't want to see, experimental stuff, etc. Share this knowledge publicly
- Shared Concept of Core Business Entities (aka "prototype architecture"): we took some of the best engineers to work together on a blueprint of what the new microservices platform could look like (they created a prototype of the core domain functionality with a dozen microservices, message queues, ...). This is no ivory-tower architecture, just an idea presented to the teams so they can take what they find useful and go from there.
Zalando is a public company: auditors should be happy :D
What Zalando does:
- 4-eyes principle
- Audit trails
- Identity and access management (a big project, they developed an internal tool to handle this, based on OAuth)
- Data protection agreement
Many examples of microservices architecture are basically back-end platforms. What about the presentation?
- How do you split the single page of a frontend application in parts?
- How do you enable independent teams to work on different parts of the page?
Zalando has 12 teams on the fashion shop: each team owns a different "fragment" (a component or widget on the page, e.g. the recommendation box). Each fragment is a piece of markup served by a specific webapp operated by a team. Each webapp sits on top of a set of collaborating microservices. The "Layout Service" puts all these "puzzle pieces" together, using a template (chosen based on the URL of the page) and a context (e.g. the current user).
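The composition step described above can be sketched roughly as follows. This is an illustration only, not Tailor's actual API: the `{{name}}` slot syntax, the `render_page` function and the sample templates and fragments are all made up, and each fragment function stands in for an HTTP call to the owning team's webapp.

```python
import re

# Slots look like {{fragment_name}}; this syntax is invented for the sketch.
SLOT = re.compile(r"\{\{(\w+)\}\}")


def render_page(path, context, templates, fragments):
    """Fill every slot of the template selected by URL path with fragment markup."""
    def fill(match):
        fetch = fragments.get(match.group(1))
        if fetch is None:
            return ""
        try:
            return fetch(context)
        except Exception:
            # A broken fragment leaves its slot empty instead of failing the page.
            return ""
    return SLOT.sub(fill, templates[path])


def broken_fragment(context):
    raise RuntimeError("fragment webapp is down")


# Hypothetical data: one template per URL, one provider per fragment.
templates = {
    "/product": "<main>{{details}}</main><aside>{{recommendations}}</aside><footer>{{reviews}}</footer>",
}
fragments = {
    "details": lambda ctx: "Red sneaker",
    "recommendations": lambda ctx: f"Picks for {ctx['user']}",
    "reviews": broken_fragment,
}

page = render_page("/product", {"user": "ana"}, templates, fragments)
print(page)
```

Leaving a failed slot empty is one possible design choice in the spirit of "expect failure"; the notes do not say how the real layout service handles a failing fragment.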
A Router sits in front of the layout service, used to proxy the requests coming from the users to the new and to the old (monolithic) shop. This way Zalando can still use "legacy functionalities" served by the monolithic application, and migrate smoothly and incrementally to a new microservice architecture. They started migrating the checkout. With this router they can do A/B testing or canary releases. The routes can be changed dynamically via API, at runtime!
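A rough sketch of how such a router could split traffic between the old shop and a new service, with routes changeable at runtime. This is not Skipper's API: the `Router` class, its method names, the backend names and the weight-based canary logic are assumptions made for illustration.

```python
import random


class Router:
    """Illustrative router; every name here is hypothetical."""

    def __init__(self, default_backend, rng=None):
        self.default_backend = default_backend  # e.g. the monolithic shop
        self.routes = {}  # path prefix -> (new backend, share of traffic 0..1)
        self.rng = rng or random.Random()

    def set_route(self, prefix, backend, weight=1.0):
        """Add or update a route at runtime; weight < 1.0 gives a canary release."""
        if not 0.0 <= weight <= 1.0:
            raise ValueError("weight must be between 0 and 1")
        self.routes[prefix] = (backend, weight)

    def remove_route(self, prefix):
        """Send this prefix back to the default backend."""
        self.routes.pop(prefix, None)

    def pick_backend(self, path):
        matches = [p for p in self.routes if path.startswith(p)]
        if not matches:
            return self.default_backend
        # The longest matching prefix wins.
        backend, weight = self.routes[max(matches, key=len)]
        # Only a `weight` share of requests goes to the new backend.
        return backend if self.rng.random() < weight else self.default_backend
```

With something like `router.set_route("/checkout", "new-checkout", weight=0.1)`, about 10% of checkout requests would reach the new service while the rest still hit the monolith. Raising the weight step by step, or calling `remove_route` to roll back, needs no redeployment.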
Open-sourced many of these tools (see https://opensource.zalando.com/):
- Skipper: the router
- Innkeeper: the routes storage
- Tailor: the layout service
- Quilt: the templates storage
Benefits:
- able to inject new features (at runtime!)
- faster feedback loop: able to try out new ideas without waiting for the next "release train" to come...
- tech agnostic => can try out new technologies
- autonomous teams with full control, they can define their processes tailored to their needs
- true lean and agile processes
- continuous delivery (not all teams are doing this, each one can decide)
- teams are now independent of each other
- smaller codebases => faster onboarding of new members
- up-to-date technologies attract young talent
- easier to spin-off new teams
It's a long long journey, which will take years!
- Q: How do you manage the balance between team independence and teams overlapping on the same things, maybe solving the same problems (and maybe with different technologies)?
- A: It happens sometimes. To avoid it, in each team there's a "delivery lead" role, sort of a coach for the team; he/she also has a bigger picture of what the company is trying to achieve, can connect teams, and check that communication and alignment are in place. But it happens!
- Q: What issues do you have with the IAM solution?
- A: The current challenge for our IAM is scaling to handle all the requests. We have lots of requests from the customers:
- => which means lots of calls on the layout service
- => which means lots of calls to endpoints
- => which means lots of calls to microservices
- => which means each service has to authenticate via IAM
- scaling is really hard
In the backend, teams are really autonomous in choosing whatever they like, because each service is independent. The problem is in the frontend, where each team could start using fancy new JS frameworks:
- this ends up in the user's browser
- slows down the page load time
- may create conflicts and incompatibilities between adopted frameworks
=> we came up with a list of allowed frameworks (together with all the engineers) => we use the tech radar to share knowledge and create alignment between teams
Checkout: they keep both versions deployed and running in parallel, using the old version as a fallback in case of issues with the new one.
Convert vs rewrite of functionality as microservices: they chose to rewrite because converting the functionality would have been too expensive
Pushing on open source => to let Zalando be seen as a tech company
Size of teams? They follow Amazon's "two-pizza rule" => size between 2 and 12 people, most teams are around 6 people
Common shared libraries between services? We don't use shared libraries. The only exception is when a team publishes the lib as open source
Who handles support? A 1st-level support team handles incidents => but they're moving to teams being responsible for their own systems
How to handle performance testing on the platform and security related stuff?
There is a business-assurance unit, which coaches teams on how to do performance testing. So there is a horizontal unit doing this.
What happens when a team is getting too big?
If your team is getting too big, chances are that
- your service is too big
- you have too many services
it's a good idea to split the team or split up the service
Is the team owning an old functionality also responsible for its migration?
Yes, the team responsible for the old monolithic functionality is also responsible for rebuilding it as microservices.
Mobile apps?
The mobile apps currently still use the API exposed by the monolithic platform