Using Big Data approaches in Science
Understanding complicated systems requires the ability to freely explore and analyze them. With smaller datasets this can be done with basic tools on a single machine. As datasets grow in both complexity and size, single computers and these tools quickly become insufficient. Furthermore, it becomes a real challenge for a single individual to maintain an overview and fully understand the data being examined. The term Big Data has been applied to the paradigm shift in approaching problems of this scale, as well as to the slew of new tools, many of them open source, for attacking them. Of particular interest are the developments around the Apache Spark project, which provides not only a robust framework for scalable computation but also a blossoming ecosystem of graph, machine learning, and streaming tools.
For us in imaging science, this means viewing, filtering, and transforming terabyte-sized datasets in real time. With such vast amounts of data, it is neither possible nor practical to examine each image manually. Furthermore, for all of the virtues of the human visual system in removing noise and identifying patterns and structures, it is a biased, poorly scalable instrument that is ill-suited for comparing many images and extracting quantitative information. While thousands of specialized image processing tools exist, they are typically difficult to scale up to cluster and cloud environments and offer very little (if any) support for machine learning, streaming, and graph analysis.
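To make the idea of filtering and transforming many images concrete, here is a minimal sketch in plain Python of the map, filter, and reduce style of pipeline that frameworks like Spark distribute across a cluster. It is not our actual analysis code: the synthetic tiles, the helper names (`make_tile`, `threshold`, `foreground_count`), and the cutoff of 128 are all assumptions chosen only for illustration.

```python
from functools import reduce


def make_tile(tile_id, size=4):
    """Build a small synthetic image tile (hypothetical stand-in for real data)."""
    pixels = [
        [(tile_id * 7 + row * size + col) % 256 for col in range(size)]
        for row in range(size)
    ]
    return {"id": tile_id, "pixels": pixels}


def threshold(tile, cutoff=128):
    """Transform step: turn a grayscale tile into a binary foreground mask."""
    mask = [[1 if p >= cutoff else 0 for p in row] for row in tile["pixels"]]
    return {"id": tile["id"], "mask": mask}


def foreground_count(tile):
    """Count the foreground pixels in a thresholded tile."""
    return sum(sum(row) for row in tile["mask"])


tiles = [make_tile(i) for i in range(100)]

# Each step works on one tile at a time, so the tiles could be
# processed independently on different machines.
segmented = map(threshold, tiles)
nonempty = filter(lambda t: foreground_count(t) > 0, segmented)
total_foreground = reduce(
    lambda acc, t: acc + foreground_count(t), nonempty, 0
)

print("foreground pixels across all tiles:", total_foreground)
```

Because no step depends on more than one tile at a time, the same pipeline shape carries over from a single machine to a cluster, which is the property that makes a framework-first approach attractive.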
Rather than taking the traditional approach of focusing on a specialized tool, we chose the framework first. Because Apache Spark provided tight Amazon EC2 cloud integration, fault tolerance, scalability, and record-breaking performance, it was a clear choice for our analysis. Furthermore, adopting a cloud-based approach has allowed us to scale the computing power dynamically as needed and to keep all the data stored on a single filesystem rather than dealing with a multitude of disks, backups, and data management issues. From a cost perspective, as shown by Novartis, an analysis which would have required a $44 million data center can be done for $5,000 using cloud resources. While other articles have covered machine learning in greater depth, for us the ability to quickly apply the latest generation of algorithms to our terabyte datasets with hardly any additional coding has meant we can replace and improve many challenging manual tasks, like segmentation and labeling, with SVM and deep-learning approaches.
The next era of software development is very promising, and in the Spark environment in particular there are a few developments of special interest to us. Approximate computing is being continuously developed and shows severalfold speedups for common operations. This means we can get approximate results very quickly, which typically suffices for exploration and quick hypothesis testing. Streaming and real-time processing, while presently too slow for many tasks, could soon reach a point where much of the analysis is performed immediately and feedback is available during experiments to improve experimental design and data collection.
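One common form of approximate computing is to answer a query from a random sample instead of the full dataset and to report an error margin alongside the result. The following is a minimal sketch of that idea in plain Python, not the Spark implementation: the function name `approximate_mean`, the synthetic Gaussian "pixel" values, the sample size, and the normal-approximation 95% margin are all assumptions made for illustration.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, seed=0):
    """Estimate the mean from a random sample.

    Returns the estimate and an approximate 95% error margin
    (1.96 standard errors, assuming a roughly normal sample mean).
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / math.sqrt(sample_size)
    return mean, 1.96 * stderr


# Synthetic stand-in for a large set of pixel intensities.
data_rng = random.Random(42)
pixels = [data_rng.gauss(100.0, 15.0) for _ in range(200_000)]

exact = statistics.fmean(pixels)
estimate, margin = approximate_mean(pixels, sample_size=2_000)

print(f"exact mean:       {exact:.3f}")
print(f"approximate mean: {estimate:.3f} +/- {margin:.3f}")
```

Here the estimate reads only 1% of the values, which is the trade that makes this style of query attractive for exploration: a small, quantified loss of accuracy in exchange for touching far less data.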