Introduction to Big Data

What's in Big Data Applications and Systems?

Introduction

So we will start by introducing you to where big data comes from and what kinds of things you can do with it.

We'll also provide an overview of some of the key characteristics of big data and a short summary of the data science process to get value out of big data.

Finally, we'll summarize the components of Hadoop for big data, and provide some hands on activities to make yourself familiar with some of these components.

By the end of this course you will be able to

  • Describe the Big Data landscape including examples of real world big data problems and approaches.

  • Identify the high level components in the data science lifecycle and associated data flow.

  • Explain the V’s of Big Data and why each impacts the collection, monitoring, storage, analysis and reporting, including their impact in the presence of multiple V’s.

  • Identify big data problems and be able to recast problems as data science questions.

  • Summarize the features and significance of the HDFS file system and the MapReduce programming model and how they relate to working with Big Data.

What makes big data valuable

  • Personalized marketing

  • Recommendation Engines

  • Sentiment Analysis (Product review)

    • Natural language processing
  • Mobile Advertising

    • Customer Profile + Recent purchases

    • Geolocation - spatial big data

  • Consumer Growth to Guide Product Growth

    • Collective Consumer Behavior
  • Biomedical Applications

    • Personalized Medicine

      • Personalized Cancer Treatment
  • Big Data-Driven Cities

    • Smart City

What application area interests you?

Saving lives with Big Data

Example: wildfire

Prediction and Response

  • People
  • Sensors
  • Organizations

Using Big Data to Help Patients

Where Does Big Data Come From?

  • Machines, People and Organizations
  • Sensors
  • Social media data (tweets, photos)

Machine-Generated Data: It's Everywhere and There's a Lot!

Smart devices

  • Connect to other devices/networks
  • Collect and analyze data autonomously
  • Provide environmental contexts

Machine-Generated Data: Advantages

Big Data Generated By People: The Unstructured Challenge

  • Social media (Facebook, Instagram, YouTube)
  • Blogs
  • Mobile SMS
  • Email

No defined data model (unstructured)

Big Data Generated By People: How Is It Being Used?

Hadoop is designed to support the processing of large data sets in a distributed computing environment.

Spark and Storm are open-source frameworks that handle real-time data generated at a fast rate.

ETL: Extract Transform Load

NoSQL Data Storage in the Cloud

Organization-Generated Data: Structured but often siloed

Organization-Generated Data: Benefits Come From Combining With Other Data Types

Characteristics Of Big Data

Big data is usually characterized using a number of V's.

The three most important are Volume, Velocity and Variety.

Volume

Refers to vast amounts of data that is generated every second.

Volume == Size

Emails, photos, videos

There are a number of challenges related to the massive volumes of Big Data

Storage -> Distribution -> Processing

Data Acquisition

Retrieval

The challenges include cost, scalability and performance related to their storage, access and processing.

Variety

Refers to the ever increasing different forms that data can come in such as text, images and geospatial data.

Variety == Complexity

Variety is a form of scalability.

Today data are more heterogeneous:

Structural Variety: formats and models

Media Variety: medium in which data get delivered

Semantic Variety: how to interpret and operate on data

Availability Variation: real-time? intermittent?

Velocity

Refers to the speed at which data is being generated.

Velocity == Speed
  • Speed of creating data
  • Speed of storing data
  • Speed of analyzing data

Batch Processing (Incomplete)

Collect Data -> Clean Data -> Feed in chunks -> Wait -> Act 

Real-Time Processing (Fast)

Instantly capture streaming data -> Feed in real time to machines -> Process in real time -> Act
  • Faster decisions
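
To make the contrast concrete, here is a minimal Python sketch (not from the course) comparing a batch computation, which waits until all data is collected, with a streaming computation that updates a result and acts as each reading arrives; the sensor readings and the alert threshold are made up.

```python
import random

def sensor_stream(n=5):
    """Simulate a stream of sensor readings arriving one at a time."""
    for _ in range(n):
        yield random.uniform(20.0, 30.0)  # hypothetical temperature readings

# Batch processing: collect everything first, analyze afterwards.
batch = list(sensor_stream())                 # collect -> clean -> feed in chunks -> wait
print(f"batch average: {sum(batch) / len(batch):.2f}")

# Real-time processing: update the result and act as each reading arrives.
count, running_sum = 0, 0.0
for reading in sensor_stream():
    count += 1
    running_sum += reading
    print(f"running average after {count} readings: {running_sum / count:.2f}")
    if reading > 28.0:                        # act on the insight right away
        print(f"alert: high reading {reading:.2f}")
```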

Veracity

Refers to biases, noise and abnormalities in data.

Veracity == Quality

Validity and Volatility

  • Accuracy of data
  • Reliability of data source
  • Context within analysis

Unstructured data from the internet is imprecise and uncertain.

Example of Google Flu Trends: Uncertain, Provenance

Valence

Refers to the connectedness of big data in the form of graphs.

Valence == Connectedness

Measure of connectivity

  • Data Connectivity: Two data items are connected when they are related to each other
  • Valence: Fraction of data items that are connected out of the total number of possible connections
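
Read as a simple graph measure, valence can be sketched as the number of observed connections divided by the number of possible pairwise connections. A minimal Python illustration with made-up items:

```python
from itertools import combinations

# Hypothetical data items and the connections observed between them.
items = ["A", "B", "C", "D", "E"]
connections = {("A", "B"), ("B", "C"), ("C", "D")}

possible = len(list(combinations(items, 2)))   # n*(n-1)/2 possible pairs
valence = len(connections) / possible
print(f"valence = {len(connections)}/{possible} = {valence:.2f}")
```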

Challenges:

  • More complex data exploration algorithms
  • Modeling and predicting valence changes
  • Group event detection
  • Emergent behavior analysis

Value

Heart of Big Data challenge

Starting to generate value from Big Data

Data source:

  • Machine - User activity logs
  • People - Twitter conversations
  • Organization - User demographic info/Game stats

A “Small” Definition of Big Data

The term ‘big data’ seems to be popping up everywhere these days. And there seems to be as many uses of this term as there are contexts in which you find it: ‘big data’ is often used to refer to any dataset that is difficult to manage using traditional database systems; it is also used as a catch-all term for any collection of data that is too large to process on a single server; yet others use the term to simply mean “a lot of data”; sometimes it turns out it doesn’t even have to be large. So what exactly is big data?

A precise specification of ‘big’ is elusive. What is considered big for one organization may be small for another. What is large-scale today will likely seem small-scale in the near future; petabyte is the new terabyte. Thus, size alone cannot specify big data. The complexity of the data is an important factor that must also be considered.

Most now agree with the characterization of big data using the 3 V’s coined by Doug Laney of Gartner:

· Volume: This refers to the vast amounts of data that is generated every second/minute/hour/day in our digitized world.

· Velocity: This refers to the speed at which data is being generated and the pace at which data moves from one point to the next.

· Variety: This refers to the ever-increasing different forms that data can come in, e.g., text, images, voice, geospatial.

A fourth V is now also sometimes added:

· Veracity: This refers to the quality of the data, which can vary greatly.

There are many other V's that get added to these depending on the context. For our specialization, we will add:

· Valence: This refers to how big data can bond with each other, forming connections between otherwise disparate datasets.

The above V’s are the dimensions that characterize big data, and also embody its challenges: We have huge amounts of data, in different formats and varying quality, that must be processed quickly.

It is important to note that the goal of processing big data is to gain insight to support decision-making. It is not sufficient to just be able to capture and store the data. The point of collecting and processing volumes of complex data is to understand trends, uncover hidden patterns, detect anomalies, etc. so that you have a better understanding of the problem being analyzed and can make more informed, data-driven decisions. In fact, many consider value as the sixth V of big data:

· Value: Processing big data must bring about value from insights gained.

To address the challenges of big data, innovative technologies are needed. Parallel, distributed computing paradigms, scalable machine learning algorithms, and real-time querying are key to analysis of big data. Distributed file systems, computing clusters, cloud computing, and data stores supporting data variety and agility are also necessary to provide the infrastructure for processing of big data. Workflows provide an intuitive, reusable, scalable and reproducible way to process big data to gain verifiable value from it and enable application of the same methods to different datasets.

With all the data generated from social media, smart sensors, satellites, surveillance cameras, the Internet, and countless other devices, big data is all around us. The endeavor to make sense out of that data brings about exciting opportunities indeed!

Quiz

  1. Amazon has been collecting review data for a particular product. They have realized that almost 90% of the reviews were mostly a 5/5 rating. However, of the 90%, they realized that 50% of them were customers who did not have proof of purchase or customers who did not post serious reviews about the product. Of the following, which is true about the review data collected in this situation?

    Low Veracity

  2. As mentioned in the slides, what are the challenges to data with high valence?

    Complex Data Exploration Algorithms

  3. Which of the following is NOT one of the 6 V's in big data?

    Vision

  4. What is the veracity of big data?

    The abnormality or uncertainties of data.

  5. What are the challenges of data with high variety?

    Hard to integrate

  6. Which of the following is the best way to describe why it is crucial to process data in real-time?

    Prevents missed opportunities.

  7. What are the challenges with big data that has high volume?

    Cost, Scalability, and Performance

Getting Value out of Big Data

Explain why data science is the key to getting value out of Big Data.

List the right set of skills for a data scientist to fit your organization.

Big Data - Insight - Action

Insight -> Data Product

Big Data + Analysis + Questions -> Insights

Data science is not static (example: recommendation systems)

Historical Data + Near real-time data -> Prediction

Data Science is Team Work

Computer Science + Mathematics + Business Expertise
Technical Skills + Business Skills + Soft Skills
  • Have passion for data
  • Relate problems to analytics
  • Care about engineering solutions
  • Exhibit curiosity
  • Communicate with teammates

Building a Big Data Strategy

Strategy: Aim - Policy - Plan - Action

A big data strategy starts with big objectives.

What data to collect?

Business Objectives

  • Commitment
  • Sponsorship
  • Communication

Build diverse team

  • Diverse team
  • Deliver as a team

Training

Share data

Sharing data is key to any big data initiative.

Define big data policies

Cultivate an analytics-driven culture

Communicate goals -> Build teams -> Share data -> Adapt to new situations -> Integrate analytics

How does big data science happen?: Five Components of Data Science

  • People
  • Purpose
  • Process
  • Platforms
  • Programmability

Big Data Engineering + Computational Big Data Science

Acquire -> Prepare -> Analyze -> Report -> Act

Wildfire example: predict rate of spread and direction

Build | Explore | Report - Act Scale

Hadoop Platform

Process: Build metrics for accountability

  • Cost
  • Timeline
  • Planning of deliverables
  • Expectations
  • Purpose

We define data science as a multidisciplinary craft that combines people, process, computational and Big Data platforms, application-specific purpose and programmability.

Asking the Right Questions

  • Define the Problem
  • Assess the situation
    • Risk
    • Benefits
    • Contingencies
    • Regulations
    • Resources
    • Requirements
  • Define goals
    • Objectives
    • Criteria
  • Formulate the Questions

Steps in the Data Science Process

Step 1: Acquiring Data

The first step is to determine what data is available.

Scripting Languages

  • Traditional Databases
  • Web Services
  • Text Files
  • NoSQL Storage
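
As a rough illustration of acquisition from several of these sources, the sketch below uses a throwaway in-memory SQLite database, a CSV file, and a JSON web service; the file path and URL are placeholders, not real endpoints.

```python
import csv
import json
import sqlite3
from urllib.request import urlopen

# Traditional database (an in-memory SQLite DB stands in for a real one).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
conn.execute("INSERT INTO sales VALUES ('widget', 9.99)")
rows = conn.execute("SELECT product, amount FROM sales").fetchall()

# Text/CSV file ('sales.csv' is a placeholder path).
with open("sales.csv", newline="") as f:
    records = list(csv.DictReader(f))

# Web service returning JSON (the URL is a placeholder, not a real endpoint).
with urlopen("https://example.com/api/sales") as response:
    payload = json.load(response)
```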

Step 2: Prepare

Step 2-A: Exploring Data

Goal: Understand your data.

Look for correlations between variables.

Visualize your data:

  • Histogram
  • Line graphs
  • Heat maps
  • Scatter plots
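
A minimal matplotlib sketch of two of these views (histogram and scatter plot) over made-up data:

```python
import random
import matplotlib.pyplot as plt

# Made-up sample data standing in for an exploratory dataset.
ages = [random.gauss(35, 10) for _ in range(500)]
incomes = [1000 + a * 50 + random.gauss(0, 300) for a in ages]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(ages, bins=20)          # histogram: distribution of one variable
ax1.set_title("Age distribution")
ax2.scatter(ages, incomes, s=8)  # scatter plot: relationship between two variables
ax2.set_title("Age vs. income")
plt.show()
```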

Step 2-B: Pre-processing Data

Clean + Transform

  • Inconsistent values
  • Duplicate records
  • Missing values
  • Invalid data
  • Outliers
  1. Addressing Data Quality Issues

    • Remove data with missing values
    • Merge duplicated records
    • Generate best estimate for invalid values
    • Remove outliers
  2. Getting Data in Shape

    • Data manipulation/Data preprocessing/Data wrangling
  3. Feature Selection

    • Remove feature
    • Combine features
    • Add feature
  4. Dimensionality Reduction
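
The cleaning steps above can be sketched with pandas; the records, threshold, and constructed feature below are hypothetical and chosen only to show each fix once.

```python
import pandas as pd

# Hypothetical raw records exhibiting typical quality issues.
raw = pd.DataFrame({
    "customer": ["ana", "ana", "bob", "carla", None],   # a duplicate and a missing value
    "age":      [34,    34,    -1,    29,      23],     # -1 is an invalid value
    "spend":    [120.0, 120.0, 80.0,  5000.0,  60.0],   # 5000.0 looks like an outlier
})

clean = raw.drop_duplicates().copy()                      # merge duplicated records
clean = clean.dropna(subset=["customer"])                 # remove data with missing values
valid_ages = clean.loc[clean["age"] >= 0, "age"]
clean.loc[clean["age"] < 0, "age"] = valid_ages.median()  # best estimate for invalid values
clean = clean[clean["spend"] < 1000]                      # remove outliers (arbitrary threshold)
clean["spend_per_year"] = clean["spend"] / clean["age"]   # combine features into a new one
print(clean)
```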

Step 3: Analyzing Data

Build a model from your data

Input Data -> Analysis Technique -> Model -> Model Output

Categories of analysis techniques:

  • Classification: Predict category
  • Regression: Predict numeric value
  • Clustering: Organize similar items into groups
  • Graph analysis: Use graph structures to find connections between entities
  • Association analysis: Find rules to capture associations between items

Modeling:

  • Select technique
  • Build model
  • Validate model
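
A minimal scikit-learn sketch of the select/build/validate loop, using classification (predict a category) on the bundled iris dataset; the choice of a decision tree is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out part of the data so the model can be validated on unseen examples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier()       # select technique
model.fit(X_train, y_train)            # build model
predictions = model.predict(X_test)    # model output
print("validation accuracy:", accuracy_score(y_test, predictions))  # validate model
```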

Step 4: Communicating Results

What to present? How to present?

D3.js Leaflet.js

Step 5: Turning Insights into Action

What is a distributed file system (DFS)?

  • Data Partitioning
  • Data Replication

A DFS provides:

  • Data Scalability
  • Fault Tolerance
  • High Concurrency
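
A purely conceptual Python sketch of the two ideas behind a DFS, partitioning a file into blocks and replicating each block across nodes; block size, replication factor, and placement policy are made up for illustration.

```python
# Conceptual sketch only: how a distributed file system splits a file into
# blocks and replicates each block across several nodes.
BLOCK_SIZE = 8          # bytes per block (real systems use e.g. 64-128 MB)
REPLICATION = 3         # copies of each block
NODES = ["node1", "node2", "node3", "node4"]

data = b"a large file that will not fit on a single machine"
blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

placement = {}
for block_id, _ in enumerate(blocks):
    # Place each replica on a different node (round-robin, purely illustrative).
    placement[block_id] = [NODES[(block_id + r) % len(NODES)] for r in range(REPLICATION)]

for block_id, nodes in placement.items():
    print(f"block {block_id}: stored on {nodes}")
```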

Scalable Computing over the Internet

Single compute node vs. parallel computer

Commodity Cluster

  • Affordable

  • Less-specialized

  • Distributed Computing

  • Data-parallelism

  • Fault-tolerance

  • Redundant data storage + data-parallel job restart

Programming Models for Big Data

Data-parallel scalability on commodity clusters

Programming model = abstractions

Runtime libraries + programming languages

  1. Support Big Data Operations
    • Split volumes of data
    • Access Data Fast
    • Distribute Computations to Nodes
  2. Handle Fault Tolerance
    • Replicate data partitions
    • Recover files when needed
  3. Enable Adding More Racks
  4. Optimized for specific data types

MapReduce -> a programming model for Big Data -> many implementations
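
A minimal word-count sketch of the MapReduce model in plain Python (map emits key/value pairs, reduce aggregates per key); this simulates the model in a single process rather than using an actual Hadoop implementation.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit (word, 1) for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Reduce: sum the counts for each word (the shuffle groups keys together)."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

documents = ["big data is big", "data about data"]
print(reduce_phase(map_phase(documents)))
# {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```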

Foundations For Big Data Quiz

  1. Which of the following is the best description of why it is important to learn about the foundations in big data? Foundations stand the test of time.
  2. What is the benefit of a commodity cluster? Enables fault tolerance.
  3. What is a way to enable fault tolerance? Redundant Data Storage
  4. What is NOT a benefit specific to a distributed file system? Large Storage
  5. Which of the following is NOT a general requirement for a programming language in order to support big data models? Utilize Map Reduction Methods

Getting Started with Hadoop

  1. Enable Scalability
  2. Handle Fault Tolerance
  3. Optimized for a Variety of Data Types
  4. Facilitate a Shared Environment
  5. Provide value

Hadoop: Why, Where and Who?

What's in the ecosystem?

Why is it beneficial?

Where is it used?

Who uses it?

How do these tools work?

The Hadoop Ecosystem: Welcome to the zoo!

Layer Diagram

  • Distributed file system as foundation
  • Flexible scheduling and resource management (YARN)
  • Simplified programming model
    • Map -> apply()
    • Reduce -> summarize()
  • Higher-level programming models
    • Pig = dataflow scripting
    • Hive = SQL-like queries
  • Specialized models for graph processing
    • Giraph = process large-scale graphs
  • Real-time and in-memory processing
    • Storm, Spark and Flink
  • Zookeeper for management

The Hadoop Distributed File System: A Storage System for Big Data

HDFS = foundation of the Hadoop ecosystem

  • Scalability
  • Reliability

Store massively large data sets

Replication for fault tolerance

Two key components of HDFS

  1. NameNode for metadata (Usually one per cluster)
  2. DataNode for block storage (Usually one per machine)

YARN: A Resource Manager for Hadoop
