idan/flat.md

## flat.md

      
Wandering the halls of the Internet today, it's hard to miss a certain kind of discourse on data. Big data, bigger data, biggest data. A million rows aren't cool. You know what's cool? A billion rows. Distributed data systems that slip the surly bonds of any one machine. Techniques for sampling and transforming data while it moves. Strategies for contending with a deluge of events from chatty devices. Directing those data tributaries into undifferentiated data lakes so that we may someday pose different queries of that data. It has become almost impossible to talk about this stuff without abusing metaphors way past their safe design limits.
As a developer, all those bits occupying the proverbial lake/warehouse/refinery are as immediately useful as a grape seed is to a winery. Locality of data isn't some abstract concept when you're trying to build things on top of that data — it's the leading term of developer experiences. If I have the data, I can load it, and get to work. If I don't have the data, then it doesn't matter if the data is cleaned, filtered, and sorted. If I don't have the data, then either I change my application logic to work with pieces of it over the wire, or I figure out how to bring a working set to my local environment.
As a result, there's an entire industry of tools that get data into the right place, in the right format, at the right time. Data architectures vary wildly, so these solutions have a wide range of ambition (what they attempt to do) and complexity (what mess they attempt to conceal). This is in contrast to the application/compute space, where pithy, prescriptive manifestos like 12factor offer a great lens through which to think: "Do it this way, and thou shalt scale." There is no equivalent for data because there are so many different approaches; there are only so many theses that can be nailed to the doors of a church before you run out of room to describe more best practices.
It's easy to get dazzled by the complexity and diversity of data tooling, but we're not the first profession to invent hyper-specialized implements. Surgeons have trays full of weird tweezer-like things whose shapes differ but whose ultimate purpose is the same: grabbing hold of stuff that might be difficult to grab with the little sticks on the ends of our arms. Surgery on data also requires specialized tools that offer unusual capabilities or guarantee certain behaviors, but most medicine is not surgery. We do ourselves a disservice if we accept complexity into every situation involving data. Even surgeons pick up the standard scalpel for a lot of their work.
Flat Data is an exploration from GitHub's Office of the CTO which aims to make everyday data work easier by bringing the data you need right to your repo. It runs on GitHub Actions, so there's no infrastructure to provision and monitor. Each Flat action has exactly two phases: fetching, and an optional postprocessing step. That's it! No DAGs, no orchestrator, no dependencies, no fancy mental model to ingest and grasp.
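To make the two-phase model concrete, here is a minimal Python sketch of the idea. It is not Flat's implementation (Flat itself runs as a GitHub Action), and every name in it is hypothetical: `run_flat_step`, `fake_fetch`, and `keep_warm_cities` exist only for illustration, and the fetch is a stand-in that returns canned JSON instead of calling a real endpoint.

```python
import json
from pathlib import Path
from typing import Callable, Optional


def run_flat_step(
    fetch: Callable[[], bytes],
    destination: Path,
    postprocess: Optional[Callable[[Path], None]] = None,
) -> Path:
    """Hypothetical sketch of the two phases: fetch, then optionally postprocess."""
    # Phase 1: fetch the data and write it into the working tree.
    destination.parent.mkdir(parents=True, exist_ok=True)
    destination.write_bytes(fetch())

    # Phase 2 (optional): hand the downloaded file to a postprocessing step.
    if postprocess is not None:
        postprocess(destination)
    return destination


def fake_fetch() -> bytes:
    # Assumption: stands in for an HTTP request; the rows are made up.
    rows = [{"city": "Oslo", "temp_c": 4}, {"city": "Lima", "temp_c": 19}]
    return json.dumps(rows).encode("utf-8")


def keep_warm_cities(path: Path) -> None:
    # Example postprocessing: derive a filtered file next to the raw one.
    rows = json.loads(path.read_text(encoding="utf-8"))
    warm = [row for row in rows if row["temp_c"] >= 10]
    path.with_name("warm.json").write_text(json.dumps(warm, indent=2), encoding="utf-8")


if __name__ == "__main__":
    written = run_flat_step(fake_fetch, Path("data/cities.json"), keep_warm_cities)
    print(f"wrote {written} and {written.with_name('warm.json')}")
```

Passing the fetch and postprocess steps in as plain callables keeps the sketch free of any scheduler or dependency graph: one function call, at most two steps, and the result is just files sitting where the rest of the code expects them.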
In fact, it's the simple mental model Flat encourages that we find ourselves valuing most. It is liberating to think about our code and assume that the data it needs will just be there. There's no deeper technological conceit, no theorem discovered, and no massive codebase released. Flat isn't trying to replace the zoo of data tools out there, and it isn't going to be the right tool for every team and situation. Still, we find it to be surprisingly flexible: even when it does not solve a problem outright, getting data into a repository is a great first step that other tools can then build upon.
Got feedback? Don't be shy! @githubOCTO or octo-devex@github.com