@davexunit
Created December 29, 2015 15:36

In order to get into a trustworthy state, the [big data] toolchain needs to:

  • Consolidate. There are too many tools for every job. There are even too many tools to manage your too many tools, and frontends for your frontends.

  • Lose weight. Every project depends on way too many other projects, each of which only contributes a tiny fragment for a very specific use case. Get rid of most dependencies!

  • Modularize. If you can't get rid of a dependency, but it is still only of interest to a small group of users, make it an optional extension module that the user only has to install if he needs this particular functionality.

  • Be buildable. Make sure that everybody can build everything from scratch, without having to rely on Maven or Ivy or SBT downloading something automagically in the background. Test your builds offline, with a clean build directory, and document them! Everything must be rebuildable by any sysadmin in a reproducible way, so he can ensure a bug fix is really applied.

  • Distribute. Do not rely on binary downloads from your CDN as sole distribution channel. Instead, encourage and support alternate means of distribution, such as the proper integration in existing and trusted Linux distributions.

  • Maintain compatibility. Successful big data projects will not be fire-and-forget. Eventually, they will need to go into production and then it will be necessary to run them over years. It will be necessary to migrate them to newer, larger clusters. And you must not lose all the data while doing so.

  • Sign. Code needs to be signed, end-of-story.

  • Authenticate. All downloads need to come with a way of checking that the downloaded files agree with what you uploaded.

  • Integrate. The key feature that makes Linux systems so very good at servers is the all-round integrated software management. When you tell the system to update (and you have different update channels available, such as a more conservative "stable/LTS" channel, a channel that gets you the latest version after basic QA, and a channel that gives you the latest versions shortly after their upload to help with QA), it covers almost all software on your system, so it does not matter whether the security fix is in your kernel, web server, library, auxiliary service, extension module, scripting language, etc.; it will pull this fix and update you in no time.

Source: http://www.vitavonni.de/blog/201504/2015042601-big-data-toolchains-are-a-security-risk.html
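
As a rough illustration of the "Authenticate" point, here is a minimal Python sketch of checking a downloaded file against a checksum the publisher announced. The function names and the choice of SHA-256 are assumptions made for this example, not something the quoted post prescribes. A checksum only shows that the file matches what was published; it is only as trustworthy as the channel the checksum itself came from, which is why the "Sign" point still matters.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large downloads do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, published_hex):
    """Compare a file's digest with the hex checksum the publisher announced.

    `published_hex` is assumed to have been obtained separately from the
    download itself (hypothetical input for this sketch).
    """
    actual = sha256_of_file(path)
    expected = published_hex.strip().lower()
    return hmac.compare_digest(actual, expected)
```

A caller would pass the path of the downloaded archive and the checksum string copied from the project's release notes, and refuse to install if the function returns False.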
