@robertzk
Created May 31, 2015 15:28
Initial Syberia FAQ
Can I use CSV files? What about streaming production data? So I have this SAS file...
You can use any kind of data ingest your heart desires. The built-in import stage comes with support for many common formats (link), but it is easy to add more (link).
If you wish to use your live production data, write a package and add an import adapter (link).
What is a mungebit and why are you making up words?
A mungebit is what we believe to be the right mathematical abstraction for wrangling a data set: you write the wrangling once, and you won't have to bug a software or data engineer to make it "production ready" or move it into "the data pipeline." The goal is to turn the 90% of their time that data scientists spend on data wrangling into 10%.
Michael Spivak, the mathematician, wrote in his book on Advanced Calculus, "there is a reason why the definitions are hard and the theorems are easy." Data wrangling isn't hard or annoying because of some inherent property of data wrangling, but because the correct abstractions have not yet been discovered. Mungebits are an attempt at such an abstraction.
Read up more on mungebits here.
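One way to picture the idea is the rough sketch below, in base R. It assumes a mungebit can be thought of as a transformation that records what it learned on the training data and replays it on new data. The helper make_imputer and the column name age are hypothetical, made up for illustration; this is not the mungebits package API.

```r
# Hypothetical sketch in base R; NOT the mungebits package API.
# The returned closure remembers what it learned on its first (training) run.
make_imputer <- function(column) {
  learned_mean <- NULL  # state captured during training
  function(df) {
    if (is.null(learned_mean)) {
      # Training run: compute the statistic from the training data.
      learned_mean <<- mean(df[[column]], na.rm = TRUE)
    }
    # Both training and prediction runs fill gaps with the stored value.
    df[[column]][is.na(df[[column]])] <- learned_mean
    df
  }
}

impute_age <- make_imputer("age")

train   <- data.frame(age = c(20, NA, 40))
trained <- impute_age(train)        # learns the mean and fills the gap

live   <- data.frame(age = NA_real_)  # a single raw production row
scored <- impute_age(live)            # reuses the stored mean, recomputes nothing
```

The point of the sketch is that the same function object handles both the training set and a one-row production request, so nothing has to be reimplemented for deployment.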
How do I deploy my Syberia model?
You have a model object that can generate predictions on arbitrary raw non-wrangled production data. Ask your data engineer to embed it into your web server or some other magic. (Rserve?)
At Avant, we use Syberia for many of our model deployments, but the tools are not quite ready for the public. Contact Rob K directly if you want help deploying your model. Or wait for Syberia 2.0.
Why do you hate PMML?
No one likes XML, and besides, it boxes you in and tells you: you can perform this kind of data preparation and run this kind of model and nothing else. We prefer freedom.
A tundraContainer is a pretty simple alternative: just keep the data preparation and the model together as a native R object, so you can perform any data wrangling and use any kind of statistical model. It works for most data sets and models, and if it doesn't, just downsample intelligently or wait for Syberia 2.0.
Read up more on tundraContainers here.
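As a rough illustration of "keep all that as a native R object," here is a minimal sketch in base R. The names make_container and add_log_income and the columns income and spend are hypothetical; this is not the tundra package API, only an assumption about the general shape of such an object.

```r
# Hypothetical sketch in base R; NOT the tundra package API.
# The container bundles the wrangling steps with the fitted model,
# so prediction can start from raw, non-wrangled data.
make_container <- function(munge_steps, fit_fn, predict_fn) {
  model <- NULL
  # Apply each wrangling step in order.
  munge <- function(df) Reduce(function(d, step) step(d), munge_steps, df)
  list(
    train   = function(df) { model <<- fit_fn(munge(df)); invisible(NULL) },
    predict = function(raw_df) predict_fn(model, munge(raw_df))
  )
}

# One illustrative wrangling step: derive a feature from a raw column.
add_log_income <- function(df) {
  df$log_income <- log(df$income)
  df
}

container <- make_container(
  munge_steps = list(add_log_income),
  fit_fn      = function(df) lm(spend ~ log_income, data = df),
  predict_fn  = function(model, df) unname(predict(model, newdata = df))
)

train <- data.frame(income = c(10, 100, 1000), spend = c(1, 2, 3))
container$train(train)

# The caller passes only the raw column; wrangling happens inside the container.
pred <- container$predict(data.frame(income = 10000))
```

Because the steps and the model are ordinary R functions and objects, nothing restricts which transformations or which model family can go inside, which is the freedom the answer above is arguing for.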
How big can my data sets be?
As big as you want. If you think your problem is the size of your data set, you are probably taking the wrong approach.
The current release, Syberia 1.0, supports in-memory data manipulation and model training, but the abstractions it encourages are language-agnostic and easily generalize to tools like Spark or Hadoop (although we like Haskell). We are developing these capabilities internally at Avant, but you are welcome to help with Syberia 2.0 if you can't wait.
Will Syberia allow me to do reproducible research?
Yes (link to model card readme)
Is there anything I can't do with Syberia?
No. As you dig in, you will eventually realize that unlike more "rigid" frameworks like Rails, Angular, or Django, Syberia is a *meta*-framework that allows you to define your own conventions as you realize you need more "stuff." The only real abstraction is manipulation of the file system (director), but even that can be replaced if you need something more intelligent or powerful. And if you can't do it in R, just write it in C and provide hooks via an R package and a Syberia engine.
Help! Something broke...
Check the troubleshooting page, search Stack Overflow, ask on IRC, file a GitHub issue, or email Rob K, in that order.
@sparuchuri

How about a question related to "How do I contribute" or "Who built this"? After all, this is at least some % recruiting :)
