Introduction to Big Data

What's in Big Data Applications and Systems?

Introduction

So we will start by introducing you to where big data comes from and what kinds of things you can do with it.

We'll also provide an overview of some of the key characteristics of big data and a short summary of the data science process to get value out of big data.

Finally, we'll summarize the components of Hadoop for big data, and provide some hands on activities to make yourself familiar with some of these components.

By the end of this course you will be able to

  • Describe the Big Data landscape including examples of real world big data problems and approaches.

  • Identify the high level components in the data science lifecycle and associated data flow.

  • Explain the V’s of Big Data and why each impacts the collection, monitoring, storage, analysis and reporting, including their impact in the presence of multiple V’s.

  • Identify big data problems and be able to recast problems as data science questions.

  • Summarize the features and significance of the HDFS file system and the MapReduce programming model and how they relate to working with Big Data.

What makes big data valuable

  • Personalized marketing

  • Recommendation Engines

  • Sentiment Analysis (Product review)

    • Natural language processing
  • Mobile Advertising

    • Customer Profile + Recent purchases

    • Geolocation - spatial big data

  • Consumer Growth to Guide Product Growth

    • Collective Consumer Behavior
  • Biomedical Applications

    • Personalized Medicine

      • Personalized Cancer Treatment
  • Big Data-Driven Cities

    • Smart City

What application area interests you?

Saving lives with Big Data

Example: wildfire

Prediction and Response

  • People
  • Sensors
  • Organizations

Using Big Data to Help Patients

Where Does Big Data Come From?

  • Machines, People and Organizations
  • Sensors
  • Social media data (tweets, photos)

Machine-Generated Data: It's Everywhere and There's a Lot!

Smart devices

  • Connect to other devices/networks
  • Collect and analyze data autonomously
  • Provide environmental contexts

Machine-Generated Data: Advantages

Big Data Generated By People: The Unstructured Challenge

  • Social media (Facebook, Instagram, YouTube)
  • Blogs
  • Mobile SMS
  • Email

No defined data model (unstructured)

Big Data Generated By People: How Is It Being Used?

Hadoop is designed to support the processing of large data sets in a distributed computing environment.

Spark and Storm are open-source frameworks that handle real-time data generated at a fast rate.

ETL: Extract Transform Load

NoSQL Data Storage in the Cloud

Organization-Generated Data: Structured but often siloed

Organization-Generated Data: Benefits Come From Combining With Other Data Types

Characteristics Of Big Data

Big data is usually characterized using a number of V's.

The three most important are Volume, Velocity and Variety.

Volume

Refers to vast amounts of data that is generated every second.

Volume == Size

Emails, photos, videos

There are a number of challenges related to the massive volumes of Big Data

Storage -> Distribution -> Processing

Data Acquisition

Retrieval

The challenges include cost, scalability and performance related to their storage, access and processing.

Variety

Refers to the ever increasing different forms that data can come in such as text, images and geospatial data.

Variety == Complexity

Variety is a form of scalability.

Today data are more heterogeneous:

Structural Variety: formats and models

Media Variety: medium in which data get delivered

Semantic Variety: how to interpret and operate on data

Availability Variation: real-time? intermittent?

Velocity

Refers to the speed at which data is being generated.

Velocity == Speed
  • Speed of creating data
  • Speed of storing data
  • Speed of analyzing data

Batch Processing (Incomplete)

Collect Data -> Clean Data -> Feed in chunks -> Wait -> Act 

Real-Time Processing (Fast)

Instantly capture streaming data -> Feed in real time to machines -> Process in real time -> Act
  • Faster decisions
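
To make the contrast concrete, here is a minimal Python sketch (not from the course) comparing a batch computation, which waits until all data is collected, with a streaming computation that updates a result and acts as each reading arrives; the sensor readings and the alert threshold are made up.

```python
import random

def sensor_stream(n=5):
    """Simulate a stream of sensor readings arriving one at a time."""
    for _ in range(n):
        yield random.uniform(20.0, 30.0)  # hypothetical temperature readings

# Batch processing: collect everything first, analyze afterwards.
batch = list(sensor_stream())                 # collect -> clean -> feed in chunks -> wait
print(f"batch average: {sum(batch) / len(batch):.2f}")

# Real-time processing: update the result and act as each reading arrives.
count, running_sum = 0, 0.0
for reading in sensor_stream():
    count += 1
    running_sum += reading
    print(f"running average after {count} readings: {running_sum / count:.2f}")
    if reading > 28.0:                        # act on the insight right away
        print(f"alert: high reading {reading:.2f}")
```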

Veracity

Refers to biases, noise and abnormalities in data.

Veracity == Quality

Validity and Volatility

  • Accuracy of data
  • Reliability of data source
  • Context within analysis

Unstructured data from the internet is imprecise and uncertain.

Example of Google Flu Trends: Uncertain, Provenance

Valence

Refers to the connectedness of big data in the form of graphs.

Valence == Connectedness

Measure of connectivity

  • Data Connectivity: Two data items are connected when they are related to each other
  • Valence: Fraction of data items that are connected out of the total number of possible connections
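
Read as a simple graph measure, valence can be sketched as the number of observed connections divided by the number of possible pairwise connections. A minimal Python illustration with made-up items:

```python
from itertools import combinations

# Hypothetical data items and the connections observed between them.
items = ["A", "B", "C", "D", "E"]
connections = {("A", "B"), ("B", "C"), ("C", "D")}

possible = len(list(combinations(items, 2)))   # n*(n-1)/2 possible pairs
valence = len(connections) / possible
print(f"valence = {len(connections)}/{possible} = {valence:.2f}")
```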

Challenges:

  • More complex data exploration algorithms
  • Modeling and predicting valence changes
  • Group event detection
  • Emergent behavior analysis

Value

Heart of Big Data challenge

Starting to generate value from Big Data

Data source:

  • Machine - User activity logs
  • People - Twitter conversations
  • Organization - User demographic info/Game stats

A “Small” Definition of Big Data

The term ‘big data’ seems to be popping up everywhere these days. And there seems to be as many uses of this term as there are contexts in which you find it: ‘big data’ is often used to refer to any dataset that is difficult to manage using traditional database systems; it is also used as a catch-all term for any collection of data that is too large to process on a single server; yet others use the term to simply mean “a lot of data”; sometimes it turns out it doesn’t even have to be large. So what exactly is big data?

A precise specification of ‘big’ is elusive. What is considered big for one organization may be small for another. What is large-scale today will likely seem small-scale in the near future; petabyte is the new terabyte. Thus, size alone cannot specify big data. The complexity of the data is an important factor that must also be considered.

Most now agree with the characterization of big data using the 3 V’s coined by Doug Laney of Gartner:

· Volume: This refers to the vast amounts of data that is generated every second/minute/hour/day in our digitized world.

· Velocity: This refers to the speed at which data is being generated and the pace at which data moves from one point to the next.

· Variety: This refers to the ever-increasing different forms that data can come in, e.g., text, images, voice, geospatial.

A fourth V is now also sometimes added:

· Veracity: This refers to the quality of the data, which can vary greatly.

There are many other V's that get added to these depending on the context. For our specialization, we will add:

· Valence: This refers to how big data can bond with each other, forming connections between otherwise disparate datasets.

The above V’s are the dimensions that characterize big data, and also embody its challenges: We have huge amounts of data, in different formats and varying quality, that must be processed quickly.

It is important to note that the goal of processing big data is to gain insight to support decision-making. It is not sufficient to just be able to capture and store the data. The point of collecting and processing volumes of complex data is to understand trends, uncover hidden patterns, detect anomalies, etc. so that you have a better understanding of the problem being analyzed and can make more informed, data-driven decisions. In fact, many consider value as the sixth V of big data:

· Value: Processing big data must bring about value from insights gained.

To address the challenges of big data, innovative technologies are needed. Parallel, distributed computing paradigms, scalable machine learning algorithms, and real-time querying are key to analysis of big data. Distributed file systems, computing clusters, cloud computing, and data stores supporting data variety and agility are also necessary to provide the infrastructure for processing of big data. Workflows provide an intuitive, reusable, scalable and reproducible way to process big data to gain verifiable value from it and enable application of the same methods to different datasets.

With all the data generated from social media, smart sensors, satellites, surveillance cameras, the Internet, and countless other devices, big data is all around us. The endeavor to make sense out of that data brings about exciting opportunities indeed!

Quiz

  1. Amazon has been collecting review data for a particular product. They have realized that almost 90% of the reviews were mostly a 5/5 rating. However, of the 90%, they realized that 50% of them were customers who did not have proof of purchase or customers who did not post serious reviews about the product. Of the following, which is true about the review data collected in this situation?

    Low Veracity

  2. As mentioned in the slides, what are the challenges to data with high valence?

    Complex Data Exploration Algorithms

  3. Which of the following is NOT one of the 6 V's in big data?

    Vision

  4. What is the veracity of big data?

    The abnormality or uncertainties of data.

  5. What are the challenges of data with high variety?

    Hard to integrate

  6. Which of the following is the best way to describe why it is crucial to process data in real-time?

    Prevents missed opportunities.

  7. What are the challenges with big data that has high volume?

    Cost, Scalability, and Performance

Getting Value out of Big Data

Explain why data science is the key to getting value out of Big Data.

List the right set of skills for a data scientist to fit your organization.

Big Data - Insight - Action

Insight -> Data Product

Big Data + Analysis + Questions -> Insights

Data science is not static (example: recommendation systems)

Historical Data + Near real-time data -> Prediction

Data Science is Team Work

Computer Science + Mathematics + Business Expertise
Technical Skills + Business Skills + Soft Skills
  • Have passion for data
  • Relate problems to analytics
  • Care about engineering solutions
  • Exhibit curiosity
  • Communicate with teammates

Building a Big Data Strategy

Strategy: Aim - Policy - Plan - Action

A big data strategy starts with big objectives.

What data to collect?

Business Objectives

  • Commitment
  • Sponsorship
  • Communication

Build diverse team

  • Diverse team
  • Deliver as a team

Training

Share data

Sharing data is key to any big data initiative.

Define big data policies

Cultivate an analytics-driven culture

Communicate goals -> Build teams -> Share data -> Adapt to new situations -> Integrate analytics

How does big data science happen?: Five Components of Data Science

  • People
  • Purpose
  • Process
  • Platforms
  • Programmability

Big Data Engineering + Computational Big Data Science

Acquire -> Prepare -> Analyze -> Report -> Act

Wildfire example: predict rate of spread and direction

Build | Explore | Report - Act Scale

Hadoop Platform

Process: Build metrics for accountability

  • Cost
  • Timeline
  • Planning of deliverables
  • Expectations
  • Purpose

We define data science as a multidisciplinary craft that combines people, process, computational and Big Data platforms, application-specific purpose and programmability.

Asking the Right Questions

  • Define the Problem
  • Assess the situation
    • Risk
    • Benefits
    • Contingencies
    • Regulations
    • Resources
    • Requirements
  • Define goals
    • Objectives
    • Criteria
  • Formulate the Questions

Steps in the Data Science Process

Step 1: Acquiring Data

The first step is to determine what data is available.

Scripting Languages

  • Traditional Databases
  • Web Services
  • Text Files
  • NoSQL Storage
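
As a rough illustration of acquisition from several of these sources, the sketch below uses a throwaway in-memory SQLite database, a CSV file, and a JSON web service; the file path and URL are placeholders, not real endpoints.

```python
import csv
import json
import sqlite3
from urllib.request import urlopen

# Traditional database (an in-memory SQLite DB stands in for a real one).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
conn.execute("INSERT INTO sales VALUES ('widget', 9.99)")
rows = conn.execute("SELECT product, amount FROM sales").fetchall()

# Text/CSV file ('sales.csv' is a placeholder path).
with open("sales.csv", newline="") as f:
    records = list(csv.DictReader(f))

# Web service returning JSON (the URL is a placeholder, not a real endpoint).
with urlopen("https://example.com/api/sales") as response:
    payload = json.load(response)
```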

Step 2: Prepare

Step 2-A: Exploring Data

Goal: Understand your data.

Look for correlations between variables.

Visualize your data:

  • Histogram
  • Line graphs
  • Heat maps
  • Scatter plots
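
A minimal matplotlib sketch of two of these views (histogram and scatter plot) over made-up data:

```python
import random
import matplotlib.pyplot as plt

# Made-up sample data standing in for an exploratory dataset.
ages = [random.gauss(35, 10) for _ in range(500)]
incomes = [1000 + a * 50 + random.gauss(0, 300) for a in ages]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(ages, bins=20)          # histogram: distribution of one variable
ax1.set_title("Age distribution")
ax2.scatter(ages, incomes, s=8)  # scatter plot: relationship between two variables
ax2.set_title("Age vs. income")
plt.show()
```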

Step 2-B: Pre-processing Data

Clean + Transform

  • Inconsistent values
  • Duplicate records
  • Missing values
  • Invalid data
  • Outliers
  1. Addressing Data Quality Issues

    • Remove data with missing values
    • Merge duplicated records
    • Generate best estimate for invalid values
    • Remove outliers
  2. Getting Data in Shape

    • Data manipulation/Data preprocessing/Data wrangling
  3. Feature Selection

    • Remove feature
    • Combine features
    • Add feature
  4. Dimensionality Reduction
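
The cleaning steps above can be sketched with pandas; the records, threshold, and constructed feature below are hypothetical and chosen only to show each fix once.

```python
import pandas as pd

# Hypothetical raw records exhibiting typical quality issues.
raw = pd.DataFrame({
    "customer": ["ana", "ana", "bob", "carla", None],   # a duplicate and a missing value
    "age":      [34,    34,    -1,    29,      23],     # -1 is an invalid value
    "spend":    [120.0, 120.0, 80.0,  5000.0,  60.0],   # 5000.0 looks like an outlier
})

clean = raw.drop_duplicates().copy()                      # merge duplicated records
clean = clean.dropna(subset=["customer"])                 # remove data with missing values
valid_ages = clean.loc[clean["age"] >= 0, "age"]
clean.loc[clean["age"] < 0, "age"] = valid_ages.median()  # best estimate for invalid values
clean = clean[clean["spend"] < 1000]                      # remove outliers (arbitrary threshold)
clean["spend_per_year"] = clean["spend"] / clean["age"]   # combine features into a new one
print(clean)
```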

Step 3: Analyzing Data

Build a model from your data

Input Data -> Analysis Technique -> Model -> Model Output

Categories of analysis techniques:

  • Classification: Predict category
  • Regression: Predict numeric value
  • Clustering: Organize similar items into groups
  • Graph analysis: Use graph structures to find connections between entities
  • Association analysis: Find rules to capture associations between items

Modeling:

  • Select technique
  • Build model
  • Validate model
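
A minimal scikit-learn sketch of the select/build/validate loop, using classification (predict a category) on the bundled iris dataset; the choice of a decision tree is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out part of the data so the model can be validated on unseen examples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier()       # select technique
model.fit(X_train, y_train)            # build model
predictions = model.predict(X_test)    # model output
print("validation accuracy:", accuracy_score(y_test, predictions))  # validate model
```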

Step 4: Communicating Results

What to present? How to present?

D3.js Leaflet.js

Step 5: Turning Insights into Action

What is a distributed file system (DFS)?

  • Data Partitioning
  • Data Replication

A DFS provides:

  • Data Scalability
  • Fault Tolerance
  • High Concurrency
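
A purely conceptual Python sketch of the two ideas behind a DFS, partitioning a file into blocks and replicating each block across nodes; block size, replication factor, and placement policy are made up for illustration.

```python
# Conceptual sketch only: how a distributed file system splits a file into
# blocks and replicates each block across several nodes.
BLOCK_SIZE = 8          # bytes per block (real systems use e.g. 64-128 MB)
REPLICATION = 3         # copies of each block
NODES = ["node1", "node2", "node3", "node4"]

data = b"a large file that will not fit on a single machine"
blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

placement = {}
for block_id, _ in enumerate(blocks):
    # Place each replica on a different node (round-robin, purely illustrative).
    placement[block_id] = [NODES[(block_id + r) % len(NODES)] for r in range(REPLICATION)]

for block_id, nodes in placement.items():
    print(f"block {block_id}: stored on {nodes}")
```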

Scalable Computing over the Internet

Single compute node vs. parallel computer

Commodity Cluster

  • Affordable

  • Less-specialized

  • Distributed Computing

  • Data-parallelism

  • Fault-tolerance

  • Redundant data storage + data-parallel job restart

Programming Models for Big Data

Data-parallel scalability on commodity clusters

Programming model = abstractions

Runtime libraries + programming languages

  1. Support Big Data Operations
    • Split volumes of data
    • Access Data Fast
    • Distribute Computations to Nodes
  2. Handle Fault Tolerance
    • Replicate data partitions
    • Recover files when needed
  3. Enable Adding More Racks
  4. Optimized for specific data types

MapReduce -> a programming model for Big Data -> many implementations
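
A minimal word-count sketch of the MapReduce model in plain Python (map emits key/value pairs, reduce aggregates per key); this simulates the model in a single process rather than using an actual Hadoop implementation.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit (word, 1) for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Reduce: sum the counts for each word (the shuffle groups keys together)."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

documents = ["big data is big", "data about data"]
print(reduce_phase(map_phase(documents)))
# {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```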

Foundations For Big Data Quiz

  1. Which of the following is the best description of why it is important to learn about the foundations in big data? Foundations stand the test of time.
  2. What is the benefit of a commodity cluster? Enables fault tolerance.
  3. What is a way to enable fault tolerance? Redundant Data Storage
  4. What is NOT a benefit specific to a distributed file system? Large Storage
  5. Which of the following is NOT a general requirement for a programming language in order to support big data models? Utilize Map Reduction Methods

Getting Started with Hadoop

  1. Enable Scalability
  2. Handle Fault Tolerance
  3. Optimized for a Variety of Data Types
  4. Facilitate a Shared Environment
  5. Provide value

Hadoop: Why, Where and Who?

What's in the ecosystem?

Why is it beneficial?

Where is it used?

Who uses it?

How do these tools work?

The Hadoop Ecosystem: Welcome to the zoo!

Layer Diagram

  • Distributed file system as foundation
  • Flexible scheduling and resource management (YARN)
  • Simplified programming model
    • Map -> apply()
    • Reduce -> summarize()
  • Higher-level programming models
    • Pig = dataflow scripting
    • Hive = SQL-like queries
  • Specialized models for graph processing
    • Giraph = process large-scale graphs
  • Real-time and in-memory processing
    • Storm, Spark and Flink
  • Zookeeper for management

The Hadoop Distributed File System: A Storage System for Big Data

HDFS = foundation of the Hadoop ecosystem

  • Scalability
  • Reliability

Store massively large data sets

Replication for fault tolerance

Two key components of HDFS

  1. NameNode for metadata (Usually one per cluster)
  2. DataNode for block storage (Usually one per machine)

YARN: A Resource Manager for Hadoop
