Democratizing Big Data with Azure HDInsight
by Saptak Sen
Azure HDInsight is an enterprise-grade cloud platform for the industry's leading open source big data technologies.
The best way to explain big data is to look at how customers are leveraging big data to be more productive on Azure HDInsight.
AccuWeather is a global technology firm that uses the Microsoft cloud to build predictive analytics into its solutions. With the power of the Microsoft cloud and Azure HDInsight, AccuWeather has been able to scale to billions of requests a day and to petabytes of data.
The best way to look at big data is in terms of how data changes in volume, velocity, and variety. AccuWeather has been able to collate data from different sources, store it in HDInsight, process the data, apply machine learning models, and predict the outcome of weather patterns.
With this combination, AccuWeather has been able to work with the Union Pacific Railroad, where it predicted a tornado that was going to arrive within 30 minutes. Because of this prediction, the railroad stopped about eight trains, saving many lives and millions of dollars.
Once an organization starts going down the route of big data and advanced analytics, many new scenarios come up.
AccuWeather has partnered with Starbucks, for example, to improve the Starbucks supply chain. They've been able to look at seasonal variations in temperature and help Starbucks optimize its pipeline of ice and water cups. So big data involves a lot of scenarios.
It involves real time and batch analytics. It involves machine learning. It involves advanced analytics. The best place to do all this open source analytics is on HDInsight.
This journey was not easy for AccuWeather, though. Big data is hard: the skills are very hard to obtain, and it's very hard to customize an environment.
Enterprise grade Big Data Analytics
Hadoop and Spark are big open source projects. However, how do you install them? How do you optimize them? How do you make sure that they're running reliably? How do you get value out of such a complicated system? And then how do you integrate Hadoop and Spark with existing IT investments?
Say you've set up your authentication using Active Directory: how do you use the same authentication with your Hadoop deployments?
Azure HDInsight is a cloud Spark and Hadoop service for the enterprise. It's a reliable service with an enterprise-grade SLA of three nines (99.9%). This SLA covers the entire cluster, including all of its nodes. HDInsight is enterprise-ready, with features such as enterprise security and monitoring, so you can leverage Azure Active Directory for authentication.
You can also define a fine-grained range of policies. So as an enterprise IT team, you can easily integrate HDInsight with your solution.
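To put the three-nines SLA in concrete terms, the short Python sketch below converts an availability percentage into a downtime budget. It assumes a 30-day month and treats the SLA as a plain availability percentage; the function name is illustrative, and the actual SLA terms define how downtime is measured.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Return the downtime budget, in minutes, for a given availability SLA.

    Assumption: the period is `days` long and availability is measured
    over wall-clock minutes.
    """
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


# Three nines over a 30-day month leaves roughly 43 minutes of downtime.
print(round(allowed_downtime_minutes(99.9), 1))  # 43.2
```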
It's also the best productivity platform for developers and data scientists. We have a rich ecosystem of tools around Visual Studio, IntelliJ, and Eclipse.
We also have rich notebook integration with Jupyter and Zeppelin that data scientists and data developers can use to be more productive.
Economy of scale
HDInsight is cost effective in the cloud. When you start an HDInsight cluster, we separate the compute and storage components of Hadoop. That means your data can live in Azure Blob storage or Azure Data Lake Store, which are designed for scale, and you can spin up and delete clusters without losing your data. Over time, as more enterprises bet on the cloud, this dynamic way of deploying Hadoop lowers the total cost of ownership by 60%, because you no longer have to keep clusters running all the time.
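The effect of running clusters only when they are needed can be illustrated with a little arithmetic. The Python sketch below compares an always-on cluster with one that is created for each working day and deleted afterwards. The node count, hourly price, and working hours are hypothetical values chosen for illustration, not HDInsight pricing, and storage cost is left out because the data stays in the storage account in both cases.

```python
def monthly_compute_cost(nodes: int, price_per_node_hour: float, hours: float) -> float:
    """Compute cost for a cluster that runs `hours` hours in a month."""
    return nodes * price_per_node_hour * hours


# Hypothetical inputs: 10 nodes at 0.50 per node-hour.
always_on = monthly_compute_cost(nodes=10, price_per_node_hour=0.50, hours=24 * 30)
# Cluster exists only 8 hours a day on 22 working days.
on_demand = monthly_compute_cost(nodes=10, price_per_node_hour=0.50, hours=8 * 22)

savings = 1 - on_demand / always_on
print(always_on, on_demand)  # 3600.0 880.0
print(f"{savings:.0%}")      # 76%
```

The exact saving depends entirely on how many hours the cluster really needs to run, so this is a way to estimate your own number rather than a reproduction of the figure quoted above.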
Also, as part of our open source analytics effort, HDInsight integrates with the industry's leading ISV applications.
Many of these ISV applications build on top of Hadoop to cover a variety of scenarios, all the way from ingestion to storing, processing, analyzing, and visualizing. They make it much easier to get insights from data faster.
It is very easy for administrators to manage an HDInsight cluster. You don't have to know Hadoop. It's a fully managed offering where Microsoft manages the uptime of the cluster, and you, as an organization, can spend your time and investment on your own IP.
Microsoft HDInsight has been rated in the top right corner by Forrester in the Big Data Hadoop Cloud Wave. Microsoft and Hortonworks have invested heavily in making Hadoop easier and more reliable in the cloud.
Last month at the DataWorks Summit, Microsoft was pleased to announce general availability of Azure HDInsight 3.6, backed by an enterprise-grade SLA. HDInsight 3.6 brings updates to various open source components in the Apache Hadoop and Spark ecosystem to the cloud, allowing customers to deploy them easily and run them reliably on an enterprise-grade platform.
What’s new in Azure HDInsight 3.6
Azure HDInsight 3.6 is a major update to the core Apache Hadoop and Spark platform as well as to various open source components. HDInsight 3.6 is built on the latest Hortonworks Data Platform (HDP) 2.6, a collaborative effort between Microsoft and Hortonworks to bring HDP to market cloud-first.
HDInsight 3.6 GA also builds upon the public preview of 3.6, which included Apache Spark 2.1.
Apache Spark 2.1 is now generally available, backed by the existing SLA. Microsoft is introducing capabilities to support real-time streaming solutions, with Spark integration for Azure Event Hubs and the structured streaming connector for Kafka on HDInsight. This will allow customers to use Spark to analyze millions of real-time events ingested into these Azure services, enabling IoT and other real-time scenarios. HDInsight 3.6 supports only Apache Spark 2.1 and above; older versions such as 2.0.2 are not supported. Learn more on how to get started with Spark on HDInsight.
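As a rough illustration of the Kafka structured streaming scenario, the PySpark sketch below reads events from a Kafka topic and counts them per one-minute window. It is a minimal sketch, not the HDInsight reference setup: the helper names, broker address, and topic name are hypothetical, and it assumes the Spark Kafka source package (spark-sql-kafka-0-10) is available on the cluster.

```python
def kafka_source_options(bootstrap_servers: str, topic: str) -> dict:
    """Options for Spark's built-in Kafka structured streaming source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def count_events_per_minute(spark, options: dict):
    """Return a streaming DataFrame with event counts per 1-minute window.

    `spark` is an existing SparkSession. The import is local so the
    option helper above can be used without PySpark installed.
    """
    from pyspark.sql.functions import col, window

    events = spark.readStream.format("kafka").options(**options).load()
    return (
        events
        # The Kafka source exposes the payload as bytes in `value`.
        .select(col("timestamp"), col("value").cast("string").alias("body"))
        .groupBy(window(col("timestamp"), "1 minute"))
        .count()
    )


# Hypothetical broker and topic names, for illustration only.
options = kafka_source_options("broker-1:9092", "sensor-events")
```

On a cluster, the result could be started with something like `count_events_per_minute(spark, options).writeStream.outputMode("complete").format("console").start()` to print the running counts.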
Apache Hive 2.1 enables ~2X faster ETL with robust SQL standard ACID merge support and many more improvements. This release also includes an updated preview of Interactive Hive using LLAP (Live Long and Process), which enables 25x faster queries. With the new version of Hive, customers can expect sub-second performance, enabling enterprise data warehouse scenarios without the need for data movement. Learn more on how to get started with Interactive Hive on HDInsight.
This release also includes new Hive views (Hive view 2.0), which provide an easy-to-use graphical user interface for developers to get started with Hadoop. Developers can use Hive views 2.0 to easily upload data to HDInsight, define tables, write queries, and get insights from data faster. The following screenshot shows the new Hive views 2.0 interface.
Microsoft is expanding its interactive data analysis by including the Apache Zeppelin notebook alongside Jupyter. The Zeppelin notebook is pre-installed when you use HDInsight 3.6, and you can easily launch it from the portal. The following screenshot shows the Zeppelin notebook interface.
Getting started with Azure HDInsight 3.6
It is very simple to get started with Azure HDInsight 3.6: go to the Microsoft Azure portal and create an Azure HDInsight service.
Once you’ve selected HDInsight, you can pick the specific version and workload based on your desired scenario. Azure HDInsight supports a wide range of scenarios and workloads such as Hive, Spark, Interactive Hive (Preview), HBase, Kafka (Preview), Storm, and R Server as options you can select from. Learn more on creating clusters in HDInsight.
Once you’ve completed the wizard, the appropriate cluster will be created. Apart from the Azure portal, you can also automate creation of the HDInsight service using the Command Line Interface (CLI). Learn more on how to create a cluster using the CLI.
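For automation, the cluster definition can be kept in a script. The Python sketch below assembles a command line for the current Azure CLI, assuming its `az hdinsight create` command and the flags shown; this is an assumption about the tool rather than something taken from this article, so check the flags against `az hdinsight create --help` before relying on them. All names and values are hypothetical, and credential flags are deliberately left out.

```python
import shlex


def build_create_command(name: str, resource_group: str, cluster_type: str,
                         version: str, storage_account: str,
                         worker_nodes: int) -> list:
    """Assemble (but do not run) an Azure CLI command for a new cluster.

    Assumption: flag names match the installed `az hdinsight create`.
    """
    return [
        "az", "hdinsight", "create",
        "--name", name,
        "--resource-group", resource_group,
        "--type", cluster_type,
        "--version", version,
        "--storage-account", storage_account,
        "--workernode-count", str(worker_nodes),
    ]


# Hypothetical names, for illustration only.
cmd = build_create_command("demo-spark", "demo-rg", "spark", "3.6",
                           "demostorage", 4)
# Quote each argument so the line is safe to paste into a shell.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Building the argument list separately from running it makes the command easy to review or log before passing it to `subprocess.run(cmd, check=True)`.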