Microreboot notes

Microreboots

  • restart fine-grained components “with a clean slate”
  • take only a fraction of the time needed for a full system reboot
  • separate data recovery from application recovery

Goals of confining the reboot

  • Reduce the amount of time it takes for the system to return to service
  • Minimize the failure's disruption to the system and its users
  • Preserve as much in-memory application data as possible.

Designing a Microrebootable System

  • Fine-grained components
  • State segregation
  • Decoupling
  • Retryable requests
  • Leases

Preconditions for a Microrebootable System

  • Component isolation
  • Fine-grained workload
  • Resource management

Research Questions

  • Are μRBs effective in recovering from failures?
  • Are μRBs any better than JVM restarts?
  • Are μRBs useful in clusters?
  • Do μRB-friendly architectures incur a performance overhead?

Fault Detection

  • If a client encounters a network-level error (e.g., cannot connect to the server) or an HTTP 4xx or 5xx error, it flags the response as faulty. If no such error occurs, the received HTML is searched for keywords indicative of failure (e.g., “exception,” “failed,” “error”).
  • The second fault detector submits in parallel each request to the application instance we are injecting faults into, as well as to a separate, known-good instance on another machine. It then compares the result of the former to the “truth” provided by the latter, flagging any differences as failures. This detector is the only one able to identify complex failures, such as the surreptitious corruption of the dollar amount in a bid.
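A minimal sketch of the first, end-to-end detector described above, assuming a plain HTTP client; the class name, keyword list, and error handling are illustrative rather than taken from the paper's harness.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

/** Client-side fault detector: flags network-level errors, HTTP 4xx/5xx
 *  responses, and failure keywords in the returned HTML. Illustrative sketch. */
public class EndToEndFaultDetector {
    private static final List<String> FAILURE_KEYWORDS =
            List.of("exception", "failed", "error");

    private final HttpClient client = HttpClient.newHttpClient();

    /** Returns true if the response to the given URL looks faulty. */
    public boolean isFaulty(String url) {
        try {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());

            // HTTP-level failure: any 4xx or 5xx status code.
            if (response.statusCode() >= 400) {
                return true;
            }
            // Content-level failure: keywords indicative of an error page.
            String body = response.body().toLowerCase();
            return FAILURE_KEYWORDS.stream().anyMatch(body::contains);
        } catch (Exception e) {
            // Network-level failure (e.g., cannot connect to the server).
            return true;
        }
    }
}
```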

Recovery Manager

  • A recovery manager (RM) performs simple failure diagnosis and recovers by microrebooting EJBs, the WAR, or all of eBid; restarting the JVM that runs JBoss (and thus eBid as well); or rebooting the operating system.
  • The RM maintains a score for each component in the system, which is incremented every time the component lies on the path originating at a failed URL. The RM decides what and when to (micro)reboot based on hand-tuned thresholds.
  • The RM uses a simple recursive recovery policy based on the principle of trying the cheapest recovery first. If this does not help, the RM reboots progressively larger subsets of components: it first microreboots EJBs, then eBid’s WAR, then the entire eBid application, then the JVM running the JBoss application server, and finally reboots the OS; if none of these actions cure the failure symptoms, the RM notifies a human administrator.
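A hedged sketch of the recursive "cheapest recovery first" policy; the escalation order comes from the bullets above, while the class shape, score bookkeeping, and stubs are illustrative.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of the recursive recovery policy: try the cheapest action first and
 *  escalate to progressively larger reboots if failure symptoms persist. */
public class RecoveryManager {
    /** Per-component fault score, incremented whenever the component lies on
     *  the call path originating at a failed URL. */
    private final Map<String, Integer> scores = new ConcurrentHashMap<>();

    /** Recovery actions ordered from cheapest to most expensive. */
    private final List<RecoveryAction> escalation = List.of(
            new RecoveryAction("microreboot suspect EJBs"),
            new RecoveryAction("microreboot eBid WAR"),
            new RecoveryAction("microreboot entire eBid application"),
            new RecoveryAction("restart JVM running JBoss"),
            new RecoveryAction("reboot operating system"));

    /** Fault detectors report the components on the path of a failed URL. */
    public void recordFailure(List<String> componentsOnFailedPath) {
        for (String component : componentsOnFailedPath) {
            scores.merge(component, 1, Integer::sum);
        }
    }

    /** Invoked once some component's score exceeds its hand-tuned threshold. */
    public void recover() {
        for (RecoveryAction action : escalation) {
            action.execute();
            if (failureSymptomsGone()) {
                return;                     // cheapest action that worked wins
            }
        }
        notifyHumanAdministrator();         // no reboot level cured the symptoms
    }

    private boolean failureSymptomsGone() { return false; /* re-run fault detectors */ }
    private void notifyHumanAdministrator() { /* page the operator */ }

    /** Placeholder for a concrete reboot/restart action. */
    record RecoveryAction(String description) {
        void execute() { System.out.println("Executing: " + description); }
    }
}
```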

Performance

Throughput varies less than 2% between the various configurations, which is within the margin of error. Latency, however, increases by 70-90% when using SSM, because moving state between JBoss and a remote session state store requires the session object to be marshalled, sent over the network, then unmarshalled; this consumes more CPU than if the object were kept inside the JVM.

Limitations

  • Shared state - If updates aren't atomic, then microrebooting components can leave shared state inconsistent. A full JVM restart reboots all the components, so nobody sees the inconsistent state; moreover, it discards the volatile shared state regardless of whether it is inconsistent, whereas μRBs allow that state to persist. In a crash-only system, state that survives the recovery of components resides in a state store that assumes responsibility for data consistency. To accomplish this, dedicated state repositories need APIs that are sufficiently high-level to allow the repository to repair the objects it manages, or at the very least to detect corruption (see the state-store sketch after this list). Otherwise, faults and inconsistencies perpetuate; this is why application-generic checkpoint-based recovery in Unix was found not to work well.

  • External resources - If an application allocates external resources without going through the application server, then microrebooting can leak resources, because the application server isn't aware of them.

  • Full reboot - When a full JVM restart is required, poor failure diagnosis may result in one or more ineffectual component-level μRBs.

  • Resource management - In Java, resources are handled by the JVM's garbage collector, but in non-GC languages resource management can be difficult in a microrebootable system. Even in the JVM, there are issues related to PermGen.

  • eBid has only one recovery group (dependency closure), containing only 5 EJBs. Real-life applications tend to have more than one such recovery group, and each group will typically have more members.

  • Microreboots can hide real bugs.

  • The safety requirements of microreboots can further limit their applicability.
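A hedged sketch of the kind of state-store API the shared-state limitation calls for: the repository, not the component, owns consistency, so at a minimum it must store whole objects atomically and be able to detect corruption. The class name and checksum scheme are illustrative, not the paper's SSM.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.zip.CRC32;

/** Illustrative crash-only state store: whole-object, atomic writes plus a
 *  checksum so the store can at least detect corruption on read. */
public class CheckedSessionStore {
    private record Entry(byte[] data, long checksum) { }

    private final Map<String, Entry> store = new ConcurrentHashMap<>();

    /** Atomically replaces the whole session object; partial updates cannot occur. */
    public void put(String sessionId, byte[] serializedSession) {
        store.put(sessionId, new Entry(serializedSession.clone(), crc(serializedSession)));
    }

    /** Returns the session object, or null if it is missing or detectably corrupted. */
    public byte[] get(String sessionId) {
        Entry entry = store.get(sessionId);
        if (entry == null || crc(entry.data()) != entry.checksum()) {
            return null;   // treat corruption like a miss; the caller recreates the session
        }
        return entry.data().clone();
    }

    private static long crc(byte[] bytes) {
        CRC32 crc = new CRC32();
        crc.update(bytes);
        return crc.getValue();
    }
}
```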

Design of a Microrebootable System

  • Isolation - Dependencies between components need to be minimized, because a dense dependency graph increases the size of recovery groups, making μRBs take longer and be more disruptive.
  • Fine-grained workload - Microreboots thrive on workloads consisting of fine-grain, independent requests; if a system is faced with long-running operations, then individual components could be periodically microcheckpointed to keep the cost of μRBs low, keeping in mind the associated risk of persistent faults.
  • Resources - Efficient support for microreboots requires a nearly constant-time resource reclamation mechanism, to allow microreboots to synchronously clean up resources.
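A minimal sketch of the per-component resource tracking implied by the last bullet, assuming the application server hands out resources through a registry it controls; all names are illustrative.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch: the application server allocates resources through a per-component
 *  registry, so a microreboot can synchronously reclaim everything the
 *  component holds without scanning for leaks. Illustrative only. */
public class ComponentResourceRegistry {
    private final Deque<AutoCloseable> owned = new ArrayDeque<>();

    /** All allocations go through the registry, so nothing can leak past a μRB. */
    public <T extends AutoCloseable> T register(T resource) {
        owned.push(resource);
        return resource;
    }

    /** Called by the application server while microrebooting the component. */
    public void releaseAll() {
        while (!owned.isEmpty()) {
            try {
                owned.pop().close();
            } catch (Exception e) {
                // Best effort: keep reclaiming even if one resource fails to close.
            }
        }
    }
}
```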

Misc.

  • Crash-only software refers to computer programs that handle failures by simply restarting, without attempting any sophisticated recovery.
  • During failover, those requests that do not require session state, such as searching or browsing, will be successfully served by the good nodes; requests that require session state will fail.
  • In reporting the results, we differentiate between resuscitation, or restoring the system to a point from which it can resume the serving of requests for all users, without necessarily having fixed the resulting database corruption, and recovery – bringing the system to a state where it functions with a 100% correct database. Financial institutions often aim for resuscitation, applying compensating transactions at the end of the business day to repair database inconsistencies. A ≈ sign in the rightmost column indicates that additional manual database repair actions were required to achieve correct recovery after resuscitation.
  • Visually, the impact of a failure and recovery event can be estimated by the area of the corresponding dip in T_aw, with larger dips indicating higher service disruption. The area of a T_aw dip is determined by its width (i.e., time to recover) and depth (i.e., the throughput of requests turned away during recovery).
  • Another way to mitigate the coarseness of node-level failover is to use component-level failover; having reduced the cost of a reboot by making it finer-grain, microfailover seems a natural solution. Load balancers would have to be augmented with the ability to fail over only those requests that would touch the component(s) known to be recovering. There is no use in failing over any other requests. Microfailover accompanied by microreboot can reduce recovery-induced failures even further. Microfailover, however, requires the load balancer to have a thorough understanding of application dependencies, which might make it impractical for real Internet services.
  • If recovery is sufficiently non-intrusive, then we can use low-level retry mechanisms to hide failure and recovery from callers – if it is brief, they won’t notice. Fortunately, the HTTP/1.1 specification offers return code 503 for indicating that a Web server is temporarily unable to handle a request (typically due to overload or maintenance). This code is accompanied by a Retry-After header containing the time after which the Web client can retry (a client-side sketch follows this list).
  • A server-side rejuvenation service periodically checks the amount of memory available in the JVM; if it drops below M_alarm bytes, then the recovery service microreboots components in a rolling fashion until available memory exceeds a threshold M_sufficient; if all EJBs are microrebooted and M_sufficient has not been reached, the whole JVM is restarted. Production systems could monitor a number of additional system parameters, such as number of file descriptors, CPU utilization, lock graphs for identifying deadlocks, etc.
  • The rejuvenation service does not have any knowledge of which components need to be microrebooted in order to reclaim memory. Thus, it builds a list of all components; as components are microrebooted, the service remembers how much memory was released by each one’s μRB. The list is kept sorted in descending order by released memory and, the next time memory runs low, the rejuvenation service microrejuvenates components expected to release most memory, re-sorting the list as needed.
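A hedged sketch of the rejuvenation loop described in the last two bullets; the thresholds, the memory probe, and the microreboot hook are illustrative placeholders.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Sketch of the memory-driven rejuvenation service: when free JVM memory drops
 *  below M_alarm, microreboot components (largest past memory release first)
 *  until free memory exceeds M_sufficient, else restart the whole JVM. */
public class RejuvenationService {
    private static final long M_ALARM = 64L * 1024 * 1024;       // illustrative threshold
    private static final long M_SUFFICIENT = 128L * 1024 * 1024; // illustrative threshold

    /** A component paired with how much memory its last μRB released. */
    private record Candidate(String component, long releasedBytes) { }

    private final List<Candidate> candidates = new ArrayList<>();  // one entry per component

    public void checkAndRejuvenate() {
        if (freeMemory() >= M_ALARM) {
            return;                                    // enough memory, nothing to do
        }
        // Microreboot components expected to release the most memory first.
        candidates.sort(Comparator.comparingLong(Candidate::releasedBytes).reversed());
        for (int i = 0; i < candidates.size(); i++) {
            long before = freeMemory();
            microreboot(candidates.get(i).component());
            // Remember how much this component's μRB released, for next time.
            long released = Math.max(0, freeMemory() - before);
            candidates.set(i, new Candidate(candidates.get(i).component(), released));
            if (freeMemory() >= M_SUFFICIENT) {
                return;                                // reclaimed enough memory
            }
        }
        restartJvm();   // every component microrebooted, still below M_sufficient
    }

    private long freeMemory() { return Runtime.getRuntime().freeMemory(); }
    private void microreboot(String component) { /* application-server hook */ }
    private void restartJvm() { /* e.g., exit and let a supervisor relaunch */ }
}
```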
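Returning to the HTTP 503 / Retry-After bullet above, a minimal client-side sketch of hiding a brief recovery from callers by honoring Retry-After; the retry cap is an illustrative choice, and only the delta-seconds form of the header is parsed.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Sketch: retry a request while the server answers 503 with a Retry-After
 *  header, so a brief microreboot stays invisible to the caller. */
public class RetryingClient {
    private static final int MAX_ATTEMPTS = 3;   // illustrative cap

    private final HttpClient client = HttpClient.newHttpClient();

    public HttpResponse<String> get(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response = null;
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            response = client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 503) {
                return response;                 // served, or failed for another reason
            }
            // Wait as instructed by the server before retrying (delta-seconds form only).
            long delaySeconds = response.headers()
                    .firstValue("Retry-After")
                    .map(Long::parseLong)
                    .orElse(1L);
            Thread.sleep(delaySeconds * 1000);
        }
        return response;                         // still 503 after all retries
    }
}
```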