@SemanticBeeng
Last active February 8, 2020 09:17
# Cross language/framework/platform data fabric
## Requirements / Goals
1. #DataSchema abstracts over data types, from simple tabular ("data frame") to multi-dimensional tensors/arrays, graphs, etc. (see HDF5)
2. #DataSchema specifiable through a functional / declarative language (like Kotlingrad + Petastorm/UniSchema)
3. #DataSchema with bindings to languages (Scala, Python) and frameworks (Parquet, ApacheHudi, Tensorflow, ApacheSpark, PyTorch)
4. #DataSchema to define both the in-memory #DataFabric and the schema for data at rest (Parquet, ApacheHudi, Petastorm, etc.)
5. Runtime derived from the "shared runtime" paradigm of #ApacheArrow (no conversions, zero-copy, JVM off-heap)
6. Runtime treats IO/persistence as a separate effect (abstracted away from algo/application logic)
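
To make goals 1 and 2 concrete, here is a minimal sketch of a declarative schema that covers both scalar columns and tensor-valued fields. It assumes nothing about Petastorm/UniSchema or Arrow; `Field`, `Schema`, `trips` and `images` are hypothetical names invented for illustration, and only the Python standard library is used.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Field:
    """One named field; shape == () means a scalar column, as in a data frame."""
    name: str
    dtype: str                              # e.g. "int64", "float64", "uint8"
    shape: Tuple[Optional[int], ...] = ()   # None marks a variable-length dimension


@dataclass(frozen=True)
class Schema:
    """Immutable, declarative description of a data set (hypothetical)."""
    name: str
    fields: Tuple[Field, ...]

    def is_tabular(self) -> bool:
        # A schema is a plain "data frame" when every field is a scalar.
        return all(f.shape == () for f in self.fields)

    def field(self, name: str) -> Field:
        for f in self.fields:
            if f.name == name:
                return f
        raise KeyError(name)


# A purely tabular data set and one carrying an image tensor per row.
trips = Schema("trips", (Field("id", "int64"), Field("fare", "float64")))
images = Schema("images", (Field("id", "int64"),
                           Field("pixels", "uint8", (None, None, 3))))
```

Because the schema is plain immutable data, bindings to a storage format or a framework could in principle be derived from it by separate functions, rather than being baked into the definition.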
## Use cases
1. Define data sets under management in a #DataLake / #FeatureStore in a unified way (not just in some Python or SQL code)
2. Do not mandate remote calls or persistence just because we need to combine two frameworks / technologies (no PySpark sockets, for example)
3. Compose algorithms / ML models expressed (as much as possible) as pure functions with #ModelSignature-s à la Tensorflow (https://www.tensorflow.org/tfx/serving/signature_defs)
4. Unify #ProgrammingModel with a #FunctionalProgramming / #DSL mindset and (run) away from the "data pipeline" mentality (à la Emma language http://emma-language.org/)
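
Use case 3 can be sketched as follows: models are pure functions paired with a signature, and composition is checked against those signatures. This is an illustration only, not the Tensorflow SignatureDef API; `ModelSignature`, `Model`, `compose`, `scale` and `shift` are hypothetical names, and batches are modeled as plain dictionaries of lists.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Batch = Dict[str, List[float]]


@dataclass(frozen=True)
class ModelSignature:
    """Named inputs and outputs with their dtypes (hypothetical)."""
    inputs: Dict[str, str]
    outputs: Dict[str, str]


@dataclass(frozen=True)
class Model:
    signature: ModelSignature
    fn: Callable[[Batch], Batch]   # expected to be pure: no IO, no hidden state


def compose(first: Model, second: Model) -> Model:
    # Composition is allowed only when first's outputs match second's inputs.
    if first.signature.outputs != second.signature.inputs:
        raise TypeError("signature mismatch")
    return Model(
        ModelSignature(first.signature.inputs, second.signature.outputs),
        lambda batch: second.fn(first.fn(batch)),
    )


scale = Model(ModelSignature({"x": "float64"}, {"y": "float64"}),
              lambda b: {"y": [2.0 * v for v in b["x"]]})
shift = Model(ModelSignature({"y": "float64"}, {"z": "float64"}),
              lambda b: {"z": [v + 1.0 for v in b["y"]]})

combined = compose(scale, shift)
result = combined.fn({"x": [1.0, 2.0]})
```

Nothing here reads or writes data; loading a batch and persisting `result` would sit outside the composed function, which is the separation that goal 6 asks for.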
## Resources
1. https://twitter.com/semanticbeeng/status/1119581463278772224
2. https://twitter.com/semanticbeeng/status/1117415216969584640
3. https://twitter.com/semanticbeeng/status/1146141244042686465
4. https://twitter.com/semanticbeeng/status/1145334581903728640
5. https://twitter.com/semanticbeeng/status/1144675483960913920
6. https://twitter.com/semanticbeeng/status/1144657475460878336
7. https://twitter.com/semanticbeeng/status/1144557723926847488
8. https://twitter.com/semanticbeeng/status/1142400720324431873 - Petastorm
9. https://twitter.com/semanticbeeng/status/1139814984521699328
10. https://twitter.com/semanticbeeng/status/1139794053199990785 **
11. https://twitter.com/semanticbeeng/status/1139789288856571904 - Apache Arrow **
12. https://twitter.com/semanticbeeng/status/1147069429542531072
13. https://twitter.com/semanticbeeng/status/1131887704529100800
14. https://twitter.com/semanticbeeng/status/1130389796038352896
15. https://twitter.com/semanticbeeng/status/1128170662269468672
-
17. https://twitter.com/semanticbeeng/status/1144944281234411520
18. https://twitter.com/semanticbeeng/status/1147174912232251393
19. https://twitter.com/semanticbeeng/status/1139794053199990785
20. https://twitter.com/semanticbeeng/status/1139501979384913920
21. https://twitter.com/semanticbeeng/status/1145334581903728640
22. https://github.com/higherkindness/skeuomorph/issues/91#issuecomment-495475543 - skeuomorph
23. https://twitter.com/semanticbeeng/status/1131583712796266498
24. StructTensor https://github.com/tensorflow/community/blob/master/rfcs/20190910-struct-tensor.md
https://twitter.com/semanticbeeng/status/1192708092326219776 **
25. RelayIR https://twitter.com/semanticbeeng/status/1193572920699867137
26. Presto types from UDFs: https://prestodb.io/docs/current/develop/functions.html
27. Avro2TF https://engineering.linkedin.com/blog/2019/04/avro2tf--an-open-source-feature-transformation-engine-for-tensor
28. https://gist.github.com/SemanticBeeng/b3102567b1a566fe0b2eb99edae9409c - structured numpy arrays **