@rmannibucau
Last active March 23, 2018 17:32
  • No clear or consistent lifecycle is defined across all the potential atomic elements (sources, sinks, (S)DoFn; transforms are excluded since they are aggregates)

→ the minimum being: a) when the element is instantiated ("new instance created"), i.e. @Setup, and b) when the element is destroyed, i.e. @Teardown

→ SDF covers most of it, but sources don’t have it, which leads to weird patterns in sources: creating far more connections than actually needed, and not being able to release the instance properly, so the same data keeps being recomputed even when the runner reuses the same instance twice in a row; with a managed instance lifecycle you could track whether something (like the estimated size) has already been computed (see the DoFn sketch below)
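
A minimal sketch of the lifecycle a DoFn already gets and sources lack: @Setup/@Teardown/@ProcessElement are existing Beam annotations, while the JDBC connection handling and names are only an illustration.

    import org.apache.beam.sdk.transforms.DoFn;

    // Illustration only: a DoFn holding an expensive resource across elements.
    class ReadRowsFn extends DoFn<String, String> {
        private transient java.sql.Connection connection;

        @Setup
        public void setup() throws Exception {
            // called once when the runner creates the instance
            connection = java.sql.DriverManager.getConnection("jdbc:..."); // placeholder URL
        }

        @ProcessElement
        public void processElement(ProcessContext context) throws Exception {
            // the connection is reused instead of being re-created per element
            context.output(query(context.element()));
        }

        @Teardown
        public void teardown() throws Exception {
            // called when the instance is discarded: release the resource
            if (connection != null) {
                connection.close();
            }
        }

        private String query(String key) {
            return key; // hypothetical query helper, kept trivial for the sketch
        }
    }

A Source has no equivalent hooks today, which is exactly the gap described above.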

  • sdk-core (the most shared module) has its classpath built on libraries that conflict easily

→ jackson, snappy, joda, avro (and its legacy stack)

→ concretely, we should extract the user-facing API into an sdk-core-api module (I don’t care much about the name) and load a provider which implements the primitives as needed (see the provider sketch below). This refactoring would also allow moving features unrelated to the SDK itself, like some IOs or transforms, into their own modules to respect a clear separation of concerns.
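
One possible shape for that split, as a sketch only (BeamPrimitiveProvider and the module names are assumptions, not existing Beam types): the api module stays dependency free and the implementation is discovered at runtime, for instance through the standard ServiceLoader.

    import java.util.Iterator;
    import java.util.ServiceLoader;

    // Hypothetical SPI living in sdk-core-api: user facing, no jackson/avro/joda on its classpath.
    interface BeamPrimitiveProvider {
        <T> T createPrimitive(Class<T> primitiveType); // hypothetical factory for runtime primitives
    }

    final class BeamPrimitiveProviders {
        static BeamPrimitiveProvider load() {
            // picks whichever sdk-core-impl is present at runtime (compile scope stays clean)
            Iterator<BeamPrimitiveProvider> providers =
                    ServiceLoader.load(BeamPrimitiveProvider.class).iterator();
            if (!providers.hasNext()) {
                throw new IllegalStateException("No sdk-core implementation found on the classpath");
            }
            return providers.next();
        }
    }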

  • Serialization of the atomic elements relies on Java serialization, which is hard to replace with custom logic; it probably requires a Coder or something covering the same scope.

→ the immediate use case is to isolate the primitives from the Beam SDK classloader, and therefore to provide a way to find the right classloader to deserialize the instance properly (see the sketch below)
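
To make the classloader point concrete, here is a plain-Java sketch (not Beam API) of deserialization that resolves classes against a chosen classloader instead of the default one:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.ObjectInputStream;
    import java.io.ObjectStreamClass;

    // Sketch: an ObjectInputStream bound to the classloader owning the primitives.
    class ClassLoaderAwareObjectInputStream extends ObjectInputStream {
        private final ClassLoader loader;

        ClassLoaderAwareObjectInputStream(InputStream in, ClassLoader loader) throws IOException {
            super(in);
            this.loader = loader;
        }

        @Override
        protected Class<?> resolveClass(ObjectStreamClass desc) throws IOException, ClassNotFoundException {
            // resolve against the isolated primitive classloader, not the SDK one
            return Class.forName(desc.getName(), false, loader);
        }
    }

A Coder-like contract for serializing DoFn/source instances would let this kind of logic be plugged in without patching Java serialization everywhere.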

  • The DAG API mixes runtime and definition concerns and is built on top of static utilities and internals, making it hard to instrument or use

→ a static DAG based on vertices and edges, 100% about definition and not runtime, would be beneficial here; this is close to the portable API, but at the Java level it must stay Java friendly with an empty stack (no protobuf or any serialization-detail dependency), as in the sketch below
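
A sketch of what such a definition-only graph could look like; all names are hypothetical and nothing here is existing Beam API.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    final class Vertex {
        final String id;
        final String transform; // which transform definition it points to, nothing runtime

        Vertex(String id, String transform) {
            this.id = id;
            this.transform = transform;
        }
    }

    final class Edge {
        final String from; // producing vertex id
        final String to;   // consuming vertex id

        Edge(String from, String to) {
            this.from = from;
            this.to = to;
        }
    }

    final class PipelineGraph {
        private final List<Vertex> vertices = new ArrayList<>();
        private final List<Edge> edges = new ArrayList<>();

        PipelineGraph addVertex(Vertex vertex) { vertices.add(vertex); return this; }
        PipelineGraph addEdge(Edge edge) { edges.add(edge); return this; }

        List<Vertex> vertices() { return Collections.unmodifiableList(vertices); }
        List<Edge> edges() { return Collections.unmodifiableList(edges); }
    }

Nothing in this model knows how to run; a runner would translate the graph into its own execution plan.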

  • Runners don’t really have an API to implement (it is all based on replacements), which means you can’t really use a runner to build features on top of the primitives and decorate them to change some behavior

→ a trivial example is the metrics implementation: it should mainly be a matter of rewriting the DAG a bit through a runner decoration (so it is not done on the user-code side) to simply add the metrics to the execution plan (~the DAG on the runner side), as sketched below

→ another implication is that you need to modify all runners to support the portable API (languages), whereas the portable API should just be an implementation on top of runners (and not inside them); the portable API would then be implemented only once (per language, if we have runners in multiple languages) and wouldn’t pollute a "pure" stack (like Java at the moment)
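
Building on the hypothetical PipelineGraph sketch above, a runner decoration for metrics could look roughly like this; GraphRunner and all the rewrite details are assumptions, not Beam API.

    // Hypothetical runner contract plus a decoration adding metrics by rewriting the DAG,
    // so the feature is written once instead of once per runner.
    interface GraphRunner {
        void run(PipelineGraph graph);
    }

    final class MetricsDecoratingRunner implements GraphRunner {
        private final GraphRunner delegate;

        MetricsDecoratingRunner(GraphRunner delegate) {
            this.delegate = delegate;
        }

        @Override
        public void run(PipelineGraph graph) {
            delegate.run(withMetrics(graph));
        }

        private PipelineGraph withMetrics(PipelineGraph graph) {
            PipelineGraph rewritten = new PipelineGraph();
            for (Vertex vertex : graph.vertices()) {
                rewritten.addVertex(vertex);
                // hypothetical rewrite: pair each vertex with a metrics-collecting vertex
                rewritten.addVertex(new Vertex(vertex.id + "#metrics", "CollectMetrics"));
                rewritten.addEdge(new Edge(vertex.id, vertex.id + "#metrics"));
            }
            for (Edge edge : graph.edges()) {
                rewritten.addEdge(edge);
            }
            return rewritten;
        }
    }

The portable API could then be one more translation layer written on top of such a runner contract instead of a change inside every runner.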

  • Configuration should be doable globally, as today with pipeline options, but also per transform by design, with a natural way to reference any element of the DAG (see the sketch below)
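
A rough sketch of the idea (nothing here is existing Beam API): global options stay, but any DAG element can be addressed by name to override them.

    import java.util.Collections;
    import java.util.Map;

    final class TransformConfig {
        private final Map<String, String> globalOptions;                    // ~PipelineOptions today
        private final Map<String, Map<String, String>> perTransformOptions; // keyed by DAG element name

        TransformConfig(Map<String, String> globalOptions,
                        Map<String, Map<String, String>> perTransformOptions) {
            this.globalOptions = globalOptions;
            this.perTransformOptions = perTransformOptions;
        }

        String get(String transformName, String key) {
            // a per-transform value overrides the global pipeline option
            Map<String, String> local =
                    perTransformOptions.getOrDefault(transformName, Collections.emptyMap());
            return local.getOrDefault(key, globalOptions.get(key));
        }
        // e.g. config.get("ReadFromJdbc", "fetchSize") would return the transform-specific
        // value if set, otherwise the global one (hypothetical names).
    }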

  • Any part of Beam should be contextual, which means there shouldn’t be any global cache preventing the use of Beam in an EE/OSGi/custom environment (see the sketch below)
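
As an illustration of "contextual" state, a registry owned by its environment rather than a static, JVM-wide cache; the names are hypothetical.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Hypothetical contextual registry: each container/application creates its own
    // instance, so undeploying the application also releases the cached state.
    final class CoderRegistryContext {
        private final ConcurrentMap<Class<?>, Object> coders = new ConcurrentHashMap<>();

        void register(Class<?> type, Object coder) {
            coders.put(type, coder);
        }

        Object lookup(Class<?> type) {
            return coders.get(type);
        }
    }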

  • Coders should support versioning (see the sketch below)
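
A minimal sketch of what coder versioning could mean, wrapping real Beam coder types (CustomCoder, StringUtf8Coder and CoderException exist; VersionedPayloadCoder and the version handling are assumptions):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import org.apache.beam.sdk.coders.CoderException;
    import org.apache.beam.sdk.coders.CustomCoder;
    import org.apache.beam.sdk.coders.StringUtf8Coder;

    // Sketch: a version marker is written before the payload so decode can dispatch on it.
    class VersionedPayloadCoder extends CustomCoder<String> {
        private static final int CURRENT_VERSION = 2;

        @Override
        public void encode(String value, OutputStream out) throws CoderException, IOException {
            out.write(CURRENT_VERSION); // version marker first
            StringUtf8Coder.of().encode(value, out);
        }

        @Override
        public String decode(InputStream in) throws CoderException, IOException {
            int version = in.read();
            // older encodings could be migrated here instead of breaking pipeline updates
            if (version != CURRENT_VERSION && version != 1) {
                throw new CoderException("Unknown payload version: " + version);
            }
            return StringUtf8Coder.of().decode(in);
        }
    }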

  • General style (not blocking, but bothersome each time you open Beam)

→ @AutoValue: no real gain compared to a POJO, and a competing "practice" next to the pipeline configuration, which is PipelineOptions based; both should be merged (probably into a POJO-based solution, which doesn’t prevent using fluent setters for instance, as in the sketch below)

→ stop using abstract classes everywhere: being annotation based or, worst case, interface based avoids hiding all that boilerplate in abstract classes, which ends up exposed to users while they develop and creates a lot of noise when you just write a pipeline or an IO

--→ side note here: sdk-api would be in compile scope, and sdk-core-impl, which would bring the current "abstract" part, would be in runtime scope (or test scope for an IO)
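
For the @AutoValue point, a plain POJO with fluent setters of the kind suggested; the class and its fields are illustrative only.

    // Illustrative POJO configuration: no code generation, debugger friendly,
    // and still fluent to configure.
    class JdbcReadConfig {
        private String query;
        private int fetchSize = 1000;

        public JdbcReadConfig withQuery(String query) {
            this.query = query;
            return this;
        }

        public JdbcReadConfig withFetchSize(int fetchSize) {
            this.fetchSize = fetchSize;
            return this;
        }

        public String getQuery() {
            return query;
        }

        public int getFetchSize() {
            return fetchSize;
        }
    }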

  • The generic record representation of Beam (SQL, etc.) shouldn’t force Beam to host an implementation and should just allow manipulating data with a minimum of structure, which doesn’t prevent adding some contextual metadata to records if needed, like KafkaIO does

→ I proposed JsonObject since it doesn’t impose any API; Avro would have been an option if it didn’t drag in Jackson 1 and other noisy libraries we want to avoid being enforced (see the sketch below)
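
A sketch of the JsonObject idea using the standard JSON-P API (javax.json); the record content and the shape of the contextual metadata are made up for the example.

    import javax.json.Json;
    import javax.json.JsonObject;

    final class Records {
        static JsonObject sampleRecord() {
            return Json.createObjectBuilder()
                    .add("id", 42)
                    .add("name", "beam")
                    // contextual metadata can ride along when an IO needs it (as KafkaIO does)
                    .add("meta", Json.createObjectBuilder()
                            .add("partition", 3)
                            .add("offset", 128L))
                    .build();
        }
    }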

  • Beam should embrace reactive programming
