@busbey
Created April 3, 2015 17:07
Problem Statement

Currently, Hadoop exposes downstream clients to a variety of third-party libraries. As our code base grows and matures, we increase the set of libraries we rely on. At the same time, as our user base grows, we increase the likelihood that some downstream project will run into a conflict while attempting to use a different version of some library we depend on. While a few hot-button third-party libraries drive most of the development and support issues (e.g. Guava, Apache Commons, and Jackson), a coherent general practice will ensure that we avoid future complications.

Simply attempting to coordinate library versions among Hadoop and various downstream projects is untenable, because each project has its own release schedule and often attempts to support multiple versions of other ecosystem projects. Furthermore, our current conservative approach to dependency updates leads to reliance on stale versions of everything, including libraries with known security and usability issues. When we do change a dependency, it causes sufficient pain to downstream users that upgrading even minor versions of Hadoop is seen as costly. While solving the general problem of exposing our internal dependencies to downstream clients will reduce what can break because of a Hadoop upgrade, an ideal solution should also have an opt-in path so that it can be included in branch-2 without further disrupting folks.

Use Cases

Motivating use cases:

  • Users building applications that talk to FileSystem based storage systems (e.g. HDFS) should be able to use arbitrary libraries to implement their application. Where the storage system happens to use the same library internally, the runtime version of that library selected by the application should not cause the client-side FileSystem implementation to change behavior.
  • Users building applications that are managed by YARN should be able to use arbitrary libraries to implement their application. Where YARN happens to use the same library internally, the runtime version of the library selected by the application should not change the behavior of the framework. Where other applications are also being managed by YARN, the library choices of one application should not impact any other.
  • Users writing analytic jobs in MapReduce should be able to use arbitrary libraries to implement their business logic. Using these libraries at runtime should not require special handling beyond ensuring the list of dependencies is provided to job submission. Where the framework running the job uses the same library internally, user supplied jobs using new or older versions of the same library should not cause the framework to change behavior.
  • Contributors updating a Hadoop component to rely on a different version of a third party dependency should be able to do so without impacting downstream users. The same should be true for adding a new third party dependency or ceasing to rely on a third party dependency.

Technical Approach

This approach is split into two parts according to whether the code is expected to run under the control of downstream users ("client side") or under the control of Hadoop ("framework side"). The former should cover both applications that talk directly to HDFS as well as the driver that initially submits a YARN application.

Client Side

Providing downstream clients with access to resources in a Hadoop cluster without exposing them to our internal dependencies can be done entirely in packaging through Maven. This allows us to continue development in-place across the Hadoop projects while relying on the extant InterfaceAudience annotations to drive what is exposed to downstream clients.

Currently Hadoop offers a module named "hadoop-client". This module is an aggregator of the different Hadoop modules needed by downstream clients that talk to HDFS, create a YARN application, or submit a MapReduce job. It also attempts to exclude any of the dependencies of those Hadoop modules that aren't needed at runtime by downstream clients. For this solution, we'll leave the pom dependencies and associated artifacts from this module as-is, ensuring that downstream clients who rely on it aren't broken by this change set.

Current project module structure (subset) for hadoop-client:

hadoop (top level project)
  |
  \---- hadoop-client

In order to keep all the downstream client changes together we'll add a new module to the top level project, tentatively named "hadoop-client-reactor", and move the current hadoop-client module to be a sub-module. In addition, we'll add three new modules as explained below.

Proposed project structure:

hadoop (top level project)
 |
 \---- hadoop-client-reactor
          |
          |--- hadoop-client
          |--- hadoop-client-api
          |--- hadoop-client-runtime
          \--- hadoop-client-invariants

The hadoop-client-api module will have hadoop-client as a dependency. It will then use the Maven Shade Plugin with a custom filter to create an artifact that contains just those classes needed to compile against InterfaceAudience.Public classes. The artifact will have no transitive dependencies. This essentially allows downstream clients to ensure they are building only against the public API by adding hadoop-client-api as a dependency at compile scope. It's expected that over the course of implementation, the community may need to expand which classes are InterfaceAudience.Public or that exceptions to this enforcement may be needed.
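As a rough sketch, the shading step for hadoop-client-api might look like the pom fragment below. Note that the Maven Shade Plugin's built-in filters select by artifact and path pattern; selecting by the InterfaceAudience.Public annotation, as described above, would require the custom filter code mentioned in the text. The include pattern here is an illustrative stand-in, not the actual selection logic.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <!-- Illustrative only: the real selection would be driven by a
             custom filter that keeps just InterfaceAudience.Public classes. -->
        <filters>
          <filter>
            <artifact>org.apache.hadoop:*</artifact>
            <includes>
              <include>org/apache/hadoop/fs/**</include>
            </includes>
          </filter>
        </filters>
      </configuration>
    </execution>
  </executions>
</plugin>
```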

RISK NOTE Classes labeled InterfaceAudience.Public might inadvertently leak references to third-party libraries. Any such leaks will substantially complicate isolating things in hadoop-client-api. As a part of implementation we'll quantify the exposure, and then the community can decide whether it's better to expose a relocated version for those who opt in, to avoid exposing such classes altogether in branch-2, to fix them with compatible changes in branch-2, or to fix them via a breaking change in Hadoop 3. The trade-offs for any of those choices will depend on the specifics of the classes involved and so aren't addressed directly here.

The hadoop-client-runtime module will also have hadoop-client as a dependency. It will use the Maven Shade Plugin with a custom filter to ensure none of the classes found in hadoop-client-api are included. It will also use the Maven Shade Plugin to include the remainder of the Hadoop and third party dependencies needed at runtime. This allows downstream clients to add hadoop-client-runtime as a dependency at runtime scope. Existing Maven Shade Plugin class and resource relocation capabilities should be sufficient to allow relocating everything under the package org.apache.hadoop.shaded. It is expected that during implementation, several resource files related to the Java Services API will need to be processed to ensure relocated classes still load. The artifact will have no transitive dependencies.
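The relocation and services-file handling described above can be sketched with the Shade Plugin's existing relocation support plus its ServicesResourceTransformer, which rewrites META-INF/services entries to match relocated class names. The Guava pattern below is just one example dependency:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <!-- Example: relocate Guava under the shaded namespace. -->
        <pattern>com.google.common</pattern>
        <shadedPattern>org.apache.hadoop.shaded.com.google.common</shadedPattern>
      </relocation>
    </relocations>
    <transformers>
      <!-- Rewrites META-INF/services files so relocated implementations
           still load via the Java Services API. -->
      <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
    </transformers>
  </configuration>
</plugin>
```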

The hadoop-client-invariants module will be configured to ensure the automation for hadoop-client-api and hadoop-client-runtime continues to function as the various Hadoop components change over time. It will use the Maven Enforcer Plugin to ensure that there are no transitive dependencies when either of the new client modules are used. It will also use two custom Maven Enforcer rules to ensure that a) hadoop-client-api contains only classes marked InterfaceAudience.Public (with exceptions as needed) and b) that both modules only contain classes in packages under org.apache.hadoop. The module essentially provides a build-time sanity check that any changes to our downstream client surface have been considered.
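The enforcer wiring could look roughly like the fragment below. The banTransitiveDependencies rule is a built-in Maven Enforcer rule; the two custom rule implementation names are placeholders for the new rules described above, not existing code.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-enforcer-plugin</artifactId>
  <executions>
    <execution>
      <id>enforce-client-invariants</id>
      <goals>
        <goal>enforce</goal>
      </goals>
      <configuration>
        <rules>
          <!-- Built-in rule: fail if the shaded client artifacts
               expose any transitive dependencies. -->
          <banTransitiveDependencies/>
          <!-- Hypothetical custom rules for checks (a) and (b) above. -->
          <publicApiOnly implementation="org.apache.hadoop.enforcer.PublicApiOnly"/>
          <relocatedPackagesOnly implementation="org.apache.hadoop.enforcer.RelocatedPackagesOnly"/>
        </rules>
      </configuration>
    </execution>
  </executions>
</plugin>
```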

A possible optimization for the hadoop-client-runtime artifact is to use the Maven Shade Plugin's "minimize jar" feature to only include those parts of transitive dependencies that are actually referenced in project files. Because the solution proposed here is done only with new Maven build steps rather than code reorganizing, using this feature will require providing a custom class for the Maven Shade Plugin so that it can properly identify "project files." As an approximation of savings (using the trunk branch), a proof-of-concept shading that puts all of org.apache.hadoop into hadoop-client-api and all third party dependencies into hadoop-client-runtime results in a hadoop-client-api artifact that is 18MB and a hadoop-client-runtime artifact that is 21MB. The disadvantage of this optimization is that any downstream clients who want to harmonize versions with the Hadoop project would not be able to reuse the relocated versions in the runtime artifact, unless they happened to only use a subset of classes covered by Hadoop's use.
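Enabling that feature is a one-line addition to the shade configuration; the custom "project files" handling described above would still be separate work:

```xml
<configuration>
  <!-- Drop classes from transitive dependencies that nothing
       in the project actually references. -->
  <minimizeJar>true</minimizeJar>
</configuration>
```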

As a possible alternative approach to the hadoop-client-runtime we could instead make a relocated version of each individual dependency needed rather than a single monolith (e.g. we'd make a hadoop-commons-math, hadoop-guava, hadoop-jackson, etc). The advantage of such an approach is that downstream clients who wanted to harmonize versions with the Hadoop project could more easily rely on just those third-party dependencies they need rather than all of the ones offered by Hadoop. Additionally, adding future third party dependencies that get used client-side would require the work of creating an appropriate relocated dependency, which should prevent inadvertent dependency bloat. The obvious downside of this approach is that such work would always be required.

As previously mentioned, hadoop-client's published artifacts (including the pom) should remain unchanged. Thus it should continue to have the same behavior for downstream clients of branch-2. If desired, we could encourage client migration in the 3.0 branch by moving the logic for which modules and dependencies are needed into the hadoop-client-api and hadoop-client-runtime modules (or renaming the hadoop-client module to e.g. hadoop-client-impl).
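Under this proposal, a downstream project's dependency declarations would look roughly like the following (the version property is illustrative):

```xml
<!-- Compile only against the public Hadoop API... -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client-api</artifactId>
  <version>${hadoop.version}</version>
</dependency>
<!-- ...and pick up the shaded third-party classes only at runtime. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client-runtime</artifactId>
  <version>${hadoop.version}</version>
  <scope>runtime</scope>
</dependency>
```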

Framework Side

The crux of the framework side approach is to move from maintaining our own webapp-style classloader to relying instead on embedding an OSGi container implementation. While the following covers YARN applications, the same approach should be usable within MapReduce to provide isolation for MapReduce jobs.

Currently, applications loaded into YARN are optionally placed into a child classloader that falls back to the same class loader used for the framework. This additional classloader allows the application to rely on alternative library versions without impacting the proper running of the YARN framework. Unfortunately, it doesn't provide much upgrade help for applications that rely on the classes found in the fallback case.

For this solution, we'll remove the classloader code; instead, on YARN container creation we'll spin up an embedded OSGi container (preferably using the Apache Felix project) and export into it the public API for YARN. The submitted user code will be isolated within this container, ensuring that it is unaffected by changes to the dependencies used in YARN's implementation.

RISK NOTE Downstream user-provided code must not be required to be an OSGi bundle. The goal of this approach is to isolate hosted applications with a minimal transition cost. While exposing the use of OSGi might provide operational benefit (e.g. allowing service discovery or shared bundle-based dependencies), that transition would be very disruptive and should be handled as opt-in by follow-on work. I am still working through what is needed to automatically wrap user-provided jars into the YARN-hosted OSGi container. It may be necessary to rely on a different OSGi implementation or additional third-party libraries. Even if this is the case, the remaining implementation details involving parts of the Apache Felix project would remain the same.

To maintain backwards compatibility, we'll add a module that makes use of the Maven Shade Plugin and the Apache Felix Maven Bundle Plugin to create an OSGi bundle that includes all of the dependencies exposed by the Hadoop 2.6.0 classpath. By default in branch-2, this bundle will be exported to the embedded OSGi container, causing applications to see the same classpath for future branch-2 releases. On the 3.0 branch, the default will be to not include this bundle in the container.
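Building such a compatibility bundle could lean on the Bundle Plugin's embed instructions, sketched below; the symbolic name is a placeholder:

```xml
<plugin>
  <groupId>org.apache.felix</groupId>
  <artifactId>maven-bundle-plugin</artifactId>
  <extensions>true</extensions>
  <configuration>
    <instructions>
      <Bundle-SymbolicName>org.apache.hadoop.compat_2_6_0</Bundle-SymbolicName>
      <!-- Pull the 2.6.0-era dependencies into the bundle itself. -->
      <Embed-Dependency>*;scope=compile|runtime</Embed-Dependency>
      <Embed-Transitive>true</Embed-Transitive>
      <Export-Package>*</Export-Package>
    </instructions>
  </configuration>
</plugin>
```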

Additionally, we can add a similar module that creates a bundle for the dependencies that were present for each of the previous GA branch-2 releases (e.g. 2.2, 2.3, etc). We can then add a command line flag for job launching that allows users to state a preference (e.g. yarn job --hadoop2.2-dependencies). This should make migrating applications from earlier versions easier, whether to a new branch-2 release or to Hadoop 3.
