Characteristics Of Big Data
Big data is usually characterized using a number of V's.
The three most important are Volume, Velocity, and Variety.
Volume refers to the vast amounts of data generated every second.
Volume == Size
Examples: email, photos, videos
The massive volumes of big data bring a number of challenges, including cost, scalability, and performance related to storage, access, and processing.
Variety refers to the ever-increasing range of forms that data can come in, such as text, images, and geospatial data.
Variety == Complexity
Variety is a form of scalability.
Today, data are more heterogeneous (see the sketch after this list):
Structural Variety: formats and models
Media Variety: medium in which data get delivered
Semantic Variety: how to interpret and operate on data
Availability Variation: real-time? intermittent?
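As a rough illustration of structural variety, a minimal sketch that normalizes two hypothetical files, users.csv and users.json, describing the same entities in different formats (the file names and fields are made up):

```python
import csv
import json

# Hypothetical inputs, written here so the sketch is self-contained.
with open("users.csv", "w", newline="") as f:
    f.write("name,city\nada,london\n")
with open("users.json", "w") as f:
    json.dump([{"name": "bob", "city": "paris"}], f)

def load_users_csv(path):
    # Tabular format: one row per user, columns as fields.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def load_users_json(path):
    # Nested format: a JSON list of objects.
    with open(path) as f:
        return json.load(f)

# Normalizing both sources into one list of dicts is a first step
# toward coping with structural variety.
users = load_users_csv("users.csv") + load_users_json("users.json")
print(users)
```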
Velocity refers to the speed at which data is being generated.
Velocity == Speed
- Speed of creating data
- Speed of storing data
- Speed of analyzing data
Batch Processing (Slow)
Collect Data -> Clean Data -> Feed in chunks -> Wait -> Act
Real-Time Processing (Fast)
Instantly capture streaming data -> Feed to machines in real time -> Process in real time -> Act
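A minimal sketch of the contrast above, in plain Python with no particular streaming framework assumed; the readings and the running average are illustrative only:

```python
def process_batch(records):
    # Batch: collect everything first, clean it, then act once on the whole chunk.
    cleaned = [r for r in records if r is not None]
    return sum(cleaned) / len(cleaned)

def process_stream(record, state):
    # Real time: update a running result as each record arrives, act immediately.
    state["count"] += 1
    state["total"] += record
    return state["total"] / state["count"]

print(process_batch([3.0, 4.0, None, 5.0]))   # one answer, after the whole batch

state = {"count": 0, "total": 0.0}
for reading in [3.0, 4.0, 5.0]:               # stands in for an incoming stream
    print(process_stream(reading, state))     # an up-to-date answer per record
```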
Veracity refers to the biases, noise, and abnormalities in data.
Veracity == Quality
Related dimensions: validity and volatility
- Accuracy of data
- Reliability of data source
- Context within analysis
Unstructured data from the internet is imprecise and uncertain.
Example: Google Flu Trends (uncertainty and provenance issues)
Valence refers to the connectedness of big data, in the form of graphs.
Valence == Connectedness
Measure of connectivity
- Data Connectivity: two data items are connected when they are related to each other
- Valence: the fraction of data items that are connected, out of the total number of possible connections (see the sketch after this list)
- More complex data exploration algorithms
- Modeling and predicting valence changes
- Group event detection
- Emergent behavior analysis
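A minimal sketch of the valence measure defined above, with a made-up set of items and connections:

```python
from itertools import combinations

items = ["a", "b", "c", "d"]
connections = {("a", "b"), ("b", "c")}   # hypothetical related pairs

possible = list(combinations(items, 2))              # n*(n-1)/2 possible pairs
connected = [p for p in possible if p in connections]
valence = len(connected) / len(possible)
print(valence)   # 2 of 6 possible pairs are connected -> 0.33...
```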
Value: generating value from big data is the heart of the big data challenge.
A “Small” Definition of Big Data
- Machine - User activity logs
- People - Twitter conversations
- Organization - User demographic info/Game stats
The term ‘big data’ seems to be popping up everywhere these days. And there seem to be as many uses of this term as there are contexts in which you find it: ‘big data’ is often used to refer to any dataset that is difficult to manage using traditional database systems; it is also used as a catch-all term for any collection of data that is too large to process on a single server; yet others use the term to simply mean “a lot of data”; sometimes it turns out it doesn’t even have to be large. So what exactly is big data?
A precise specification of ‘big’ is elusive. What is considered big for one organization may be small for another. What is large-scale today will likely seem small-scale in the near future; petabyte is the new terabyte. Thus, size alone cannot specify big data. The complexity of the data is an important factor that must also be considered.
Most now agree with the characterization of big data using the 3 V’s coined by Doug Laney of Gartner:
· Volume: This refers to the vast amounts of data that is generated every second/minute/hour/day in our digitized world.
· Velocity: This refers to the speed at which data is being generated and the pace at which data moves from one point to the next.
· Variety: This refers to the ever-increasing different forms that data can come in, e.g., text, images, voice, geospatial.
A fourth V is now also sometimes added:
· Veracity: This refers to the quality of the data, which can vary greatly.
There are many other V's that get added to these depending on the context. For our specialization, we will add:
· Valence: This refers to how big data can bond with each other, forming connections between otherwise disparate datasets.
The above V’s are the dimensions that characterize big data, and also embody its challenges: We have huge amounts of data, in different formats and varying quality, that must be processed quickly.
It is important to note that the goal of processing big data is to gain insight to support decision-making. It is not sufficient to just be able to capture and store the data. The point of collecting and processing volumes of complex data is to understand trends, uncover hidden patterns, detect anomalies, etc. so that you have a better understanding of the problem being analyzed and can make more informed, data-driven decisions. In fact, many consider value as the sixth V of big data:
· Value: Processing big data must bring about value from insights gained.
To address the challenges of big data, innovative technologies are needed. Parallel, distributed computing paradigms, scalable machine learning algorithms, and real-time querying are key to analysis of big data. Distributed file systems, computing clusters, cloud computing, and data stores supporting data variety and agility are also necessary to provide the infrastructure for processing of big data. Workflows provide an intuitive, reusable, scalable and reproducible way to process big data to gain verifiable value from it and enable application of the same methods to different datasets.
With all the data generated from social media, smart sensors, satellites, surveillance cameras, the Internet, and countless other devices, big data is all around us. The endeavor to make sense out of that data brings about exciting opportunities indeed!
Getting Value out of Big Data
Amazon has been collecting review data for a particular product. They have realized that almost 90% of the reviews gave a 5/5 rating. However, of that 90%, half came from customers who did not have proof of purchase or who did not post serious reviews about the product. Which of the following is true about the review data collected in this situation?
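As a rough illustration of the veracity issue in this scenario, the sketch below filters five-star reviews down to credible ones; the records and field names are hypothetical:

```python
reviews = [
    {"rating": 5, "verified_purchase": True,  "serious": True},
    {"rating": 5, "verified_purchase": False, "serious": True},
    {"rating": 5, "verified_purchase": True,  "serious": False},
    {"rating": 4, "verified_purchase": True,  "serious": True},
]

five_star = [r for r in reviews if r["rating"] == 5]
credible = [r for r in five_star if r["verified_purchase"] and r["serious"]]
print(f"{len(credible)} of {len(five_star)} five-star reviews look credible")
```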
As mentioned in the slides, what are the challenges of data with high valence?
Complex Data Exploration Algorithms
Which of the following is NOT one of the 6 V's of big data?
What is the veracity of big data?
The abnormalities or uncertainties of data.
What are the challenges of data with high variety?
Hard to integrate
Which of the following is the best way to describe why it is crucial to process data in real-time?
Prevents missed opportunities.
What are the challenges with big data that has high volume?
Cost, Scalability, and Performance
Explain why data science is the key to getting value out of Big Data.
List the right set of skills for a data scientist to fit your organization.
Big Data - Insight - Action
Insight -> Data Product
Big Data + Analysis + Questions -> Insights
Data Science is not static (example: recommendation systems)
Historical Data + Near real-time data -> Prediction
Data Science is Team Work
Computer Science + Mathematics + Business Expertise
Technical Skills + Business Skills + Soft Skills
Building a Big Data Strategy
- Have passion for data
- Relate problems to analytics
- Care about engineering solutions
- Exhibit curiosity
- Communicate with teammates
Strategy: Aim - Policy - Plan - Action
A big data strategy starts with big objectives.
What data to collect?
Build a diverse team:
- Recruit for diverse skills
- Deliver as a team
A diverse team is key to any big data initiative.
Define big data policies.
Cultivate an analytics-driven culture.
How does big data science happen? Five Components of Data Science
Communicate goals -> Build teams -> Share data -> Adapt for new situations -> Integrate analytics
Big Data Engineering + Computational Big Data Science
Acquire -> Prepare -> Analyze -> Report -> Act
[Figure: example pipeline output, e.g., rate of spread and direction, at the Report and Act stages]
Process: Build metrics for accountability
- Planning of deliverables
We define data science as a multidisciplinary craft that combines people, process, computational and Big Data platforms, application-specific purpose and programmability.
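A minimal skeleton of the acquire -> prepare -> analyze -> report -> act process named above; every stage here is a stub, with made-up data and a made-up decision rule:

```python
def acquire():
    return [1, 2, 2, None, 100]               # stands in for raw data from any source

def prepare(raw):
    return [x for x in raw if x is not None]  # clean: drop missing values

def analyze(data):
    return sum(data) / len(data)              # a trivial "model": the mean

def report(result):
    print(f"mean = {result:.2f}")

def act(result):
    if result > 10:                           # made-up decision rule
        print("investigate the unusually large values")

result = analyze(prepare(acquire()))
report(result)
act(result)
```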
Asking the Right Questions
- Define the problem
- Assess the situation
- Define goals
- Formulate the questions

Steps in the Data Science Process

Step 1: Acquiring Data
The first step is to determine what data is available. Sources include (see the sketch after this list):
- Traditional databases
- Text files
- NoSQL storage
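A minimal sketch of acquiring data from two of these source types, a traditional (relational) database and a text file; the table, columns, and file contents are hypothetical:

```python
import sqlite3

# Traditional database: query rows with SQL (an in-memory DB stands in
# for a real one).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (name TEXT, score INTEGER)")
conn.execute("INSERT INTO players VALUES ('ada', 10), ('bob', 7)")
rows = conn.execute("SELECT name, score FROM players").fetchall()
conn.close()
print(rows)

# Text file: one record per line (written here so the sketch runs as-is).
with open("scores.txt", "w") as f:
    f.write("ada,10\nbob,7\n")
with open("scores.txt") as f:
    records = [line.strip().split(",") for line in f]
print(records)
```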
Step 2: Preparing Data
Step 2-A: Exploring Data
Goal: Understand your data.
Look for correlations in the data.
Visualize your data (see the sketch after this list):
- Line graphs
- Heat maps
- Scatter plots
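A minimal sketch of exploring and visualizing data with pandas and matplotlib; the temp/humidity readings are a made-up stand-in for a real dataset:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"temp": [20.1, 21.3, 22.8, 21.9],
                   "humidity": [55, 52, 48, 50]})

print(df.describe())   # summary statistics per column
print(df.corr())       # look for correlated columns

df.plot(y="temp")                          # line graph
df.plot.scatter(x="temp", y="humidity")    # scatter plot
plt.show()
```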
Step 2-B: Pre-processing Data
Clean + Transform:
- Inconsistent values
- Duplicate records
- Missing values
- Invalid data
Addressing Data Quality Issues
- Remove data with missing values
- Merge duplicate records
- Generate best estimates for invalid values
- Remove outliers

Getting Data in Shape (see the sketch after this list)
- Data manipulation / data preprocessing / data wrangling
- Remove features
- Combine features
- Add features
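A minimal sketch of these cleaning and reshaping steps with pandas; the data, the outlier rule, and the heat_index combination are all made up for illustration:

```python
import pandas as pd

# A made-up stand-in for a real dataset, with typical quality problems.
df = pd.DataFrame({"temp": [20.1, 20.1, None, 22.8, 99.9],
                   "humidity": [55, 55, 52, None, 48]})

df = df.drop_duplicates()              # merge/remove duplicate records
df = df.dropna(subset=["temp"])        # remove rows missing the key value
df["humidity"] = df["humidity"].fillna(df["humidity"].mean())  # best estimate

# Remove outliers: drop readings outside a plausible range (a domain rule).
df = df[df["temp"].between(-30, 50)]

# Getting data in shape: add a combined feature, drop an unneeded one.
df["heat_index"] = 0.7 * df["temp"] + 0.3 * df["humidity"]  # made-up formula
df = df.drop(columns=["humidity"])
print(df)
```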
Step 3: Analyzing Data
Build a model from your data:
Input Data -> Analysis Technique -> Model -> Model Output
Categories of analyzing data (see the sketch after this list):
- Classification: Predict category
- Regression: Predict numeric value
- Clustering: Organize similar items into groups
- Graph analysis: Use graph structures to find connections between entities
- Association analysis: Find rules to capture associations between items
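A minimal sketch of one of these categories, clustering, using scikit-learn's KMeans to organize similar items into groups; the points are made up:

```python
from sklearn.cluster import KMeans

points = [[1, 1], [1.5, 2], [8, 8], [8, 9], [0.5, 1.5], [9, 8.5]]
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)   # which group each point was assigned to
```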
- Select technique
- Build model
- Validate model (see the sketch after this list)
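A minimal sketch of "build and validate": hold out test data and measure accuracy on it, using scikit-learn's bundled iris dataset purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)   # build
print(accuracy_score(y_test, model.predict(X_test)))     # validate
```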
Step 4: Communicating Results
What to present?
How to present?
Step 5: Turning Insights into Action