Created
April 4, 2024 06:26
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "ce6ad075",
"metadata": {},
"outputs": [],
"source": [
"# MAZAM GANENDRA\n",
"# 21181192\n",
"\n",
"# SLIDE 8A ( Data and data-something )\n",
"\n",
"# -- DATA AND INFORMATION --\n",
"\n",
"# DATA\n",
"# •The term data is defined as a collection of individual facts or statistics (singular form: datum).\n",
"# •Data can come in the form of text, observations, figures, images, numbers, graphs, or symbols.\n",
"# •Data is a raw form of knowledge and, on its own, doesn’t carry any significance or purpose.\n",
"# •Data can be simple, and may even seem useless, until it is analyzed, organized, and interpreted.\n",
"\n",
"# INFORMATION\n",
"# •The term information is defined as knowledge gained through study, communication, research, or\n",
"# instruction.\n",
"# •Essentially, information is the result of analyzing and interpreting pieces of data.\n",
"# •Whereas data is the individual figures, numbers, or graphs, information is the perception of those\n",
"# pieces of knowledge.\n",
"\n",
"# Key differences between them\n",
"# •Data is a collection of facts, while information puts those facts into context.\n",
"# •While data is raw and unorganized, information is organized.\n",
"# •Data points are individual and sometimes unrelated. Information maps out that data to provide a\n",
"# big-picture view of how it all fits together.\n",
"# •Data, on its own, is meaningless. When it’s analyzed and interpreted, it becomes meaningful information.\n",
"# •Data does not depend on information; however, information depends on data.\n",
"# •Data typically comes in the form of graphs, numbers, figures, or statistics, while information is\n",
"# typically presented through words, language, thoughts, and ideas.\n",
"# •Data isn’t sufficient for decision-making, but you can make decisions based on information."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "a64919e7",
"metadata": {},
"outputs": [],
"source": [
"# -- Dataset and database --\n",
"\n",
"# Data, dataset, database\n",
"# •The data are observations or measurements (unprocessed or processed) represented as text, numbers,\n",
"# or multimedia.\n",
"# •A dataset is a structured collection of data generally associated with a unique body of work.\n",
"# •A database is an organized collection of data stored as multiple datasets, where those datasets are\n",
"# generally stored and accessed electronically from a computer system that allows the data to be easily\n",
"# accessed, manipulated, and updated.\n",
"# •A dataset is a structured collection of data organized and stored together for analysis or processing,\n",
"# that can include many different types of data, from numerical values to text, images, or audio recordings.\n",
"# •The data within a dataset can typically be accessed individually, in combination, or managed as a whole\n",
"# entity.\n",
"# •A database (relational, document, or key-value type) is an organized collection of data stored as\n",
"# multiple datasets.\n",
"\n",
"# Some use cases for the 6 popular schemas\n",
"# •Flat model: Best for small, simple applications.\n",
"# •Hierarchical model: For nested data, like XML or JSON.\n",
"# •Network model: Useful for mapping and spatial data, also for depicting workflows.\n",
"# •Relational model: Best reflects Object-Oriented Programming applications.\n",
"# •Star model: For analyzing large, one-dimensional datasets.\n",
"# •Snowflake model: For analyzing large and complex datasets.\n",
"\n",
"# Dataset and database\n",
"# •A dataset is a collection of related data, often in a table or spreadsheet format, used primarily for\n",
"# analysis.\n",
"# •A database, by contrast, is a structured system for storing, managing, and retrieving data, often used in\n",
"# applications and software systems."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "8f7a6014",
"metadata": {},
"outputs": [],
"source": [
"# -- Data warehouse and data lake --\n",
"\n",
"# Data warehouse\n",
"# •A data warehouse is a type of data management system that is designed to enable and support business\n",
"# intelligence (BI) activities, especially analytics.\n",
"# •Data warehouses are solely intended to perform queries and analysis and often contain large amounts\n",
"# of historical data.\n",
"# •The data within a data warehouse is usually derived from a wide range of sources such as application\n",
"# log files and transaction applications.\n",
"# •A data warehouse centralizes and consolidates large amounts of data from multiple sources.\n",
"# •Its analytical capabilities allow organizations to derive valuable business insights from their\n",
"# data to improve decision-making.\n",
"\n",
"# Data lake\n",
"# •A data lake is a storage repository that holds a vast amount of raw data in its native format until\n",
"# it is needed for analytics applications.\n",
"# •While a traditional data warehouse stores data in hierarchical dimensions and tables, a data lake\n",
"# uses a flat architecture to store data, primarily in files or object storage.\n",
"# •That gives users more flexibility in data management, storage, and usage.\n",
"\n",
"# Metadata in a data lake\n",
"# •Metadata describes the data stored in the data lake, providing details such as its source, its\n",
"# structure, its meaning, its relationships with other data, and its usage.\n",
"# •This makes it easier for users to discover relevant data in the vast amounts of data stored in the\n",
"# data lake.\n",
"\n",
"# Data swamps\n",
"# •One of the biggest challenges is preventing a data lake from turning into a data swamp.\n",
"# •If it isn't set up and managed properly, the data lake can become a messy dumping ground for data.\n",
"# •Users may not find what they need, and data managers may lose track of data that's stored in the data\n",
"# lake, even as more pours in.\n",
"\n",
"# Technology overload\n",
"# •The wide variety of technologies that can be used in data lakes also complicates deployments.\n",
"# •First, organizations must find the right combination of technologies to meet their particular data\n",
"# management and analytics needs.\n",
"# •Then they need to install them, although the growing use of the cloud has made that step easier.\n",
"\n",
"# Unexpected costs\n",
"# •While the upfront technology costs may not be excessive, that can change if organizations don't\n",
"# carefully manage data lake environments.\n",
"# •For example, companies may get surprise bills for cloud-based data lakes if they're used more than\n",
"# expected.\n",
"# •The need to scale up data lakes to meet workload demands also increases costs.\n",
"\n",
"# Data governance\n",
"# •One of the purposes of a data lake is to store raw data as-is for various analytics uses.\n",
"# •But without effective governance of data lakes, organizations may be hit with data quality,\n",
"# consistency, and reliability issues.\n",
"# •Those problems can hamper analytics applications and produce flawed results that lead to bad business\n",
"# decisions."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "77a7421c",
"metadata": {},
"outputs": [],
"source": [
"# -- Data lakehouse --\n",
"\n",
"# Data lakehouse #1\n",
"# •A data lakehouse, as the name suggests, is a new data architecture that merges a data warehouse and\n",
"# a data lake into a single whole, with the purpose of addressing each one’s limitations.\n",
"# •In a nutshell, the lakehouse system leverages low-cost storage to keep large volumes of data in their\n",
"# raw formats, just like data lakes.\n",
"# •At the same time, it brings structure to data and enables data management features similar to those\n",
"# in data warehouses by implementing a metadata layer on top of the store.\n",
"# •This enables different teams to use a single system to access all of the enterprise data for a range\n",
"# of projects, including data science, machine learning, and business intelligence.\n",
"\n",
"# Data lakehouse #2\n",
"# •It uses an open-source data lake table format, which allows it to work with the features of a data\n",
"# warehouse, such as standardized data structures and data management capabilities.\n",
"# •The lakehouse architecture also makes it easier to use analytics, data science, and machine learning.\n",
"# •This is because all of the data is stored in a single place, which makes it easier to access and\n",
"# analyze at scale across the entire organization.\n",
"\n",
"# Problems faced by data lake architecture\n",
"# •Inconsistent data quality without schema enforcement: Data lakes are a great way to store large amounts\n",
"# of data from different sources. Because they’re so big and unstructured, it can be hard to keep track\n",
"# of the quality (correctness) of the data.\n",
"# •Handling data today, combining batch and streaming data: Today, data needs to be fast. Data lakes need\n",
"# to be able to handle both batch (historical) data and streaming (live) data, especially with the\n",
"# ever-growing volume of data generated and collected.\n",
"# •Overhead for time and money: Managing data warehouse and data lake architectures can be a technical\n",
"# challenge. Data warehouses are powerful, but they’re expensive to set up and maintain. Data lakes are\n",
"# more cost-effective, but they do not inherently structure your data for fast querying speeds.\n",
"# Organizations need to figure out which data is most critical for their day-to-day analysis and keep\n",
"# that in the data warehouse. Other less urgent data can stay in the data lake.\n",
"\n",
"# Solution: Delta Lake\n",
"# •It is one of the table formats that enable data lakehouses.\n",
"# •It’s an open-source data management and governance layer that sits on top of a data lake.\n",
"# •Delta Lake gives data lakes the structure of a data warehouse, while still letting them be used\n",
"# for the broad range of use cases that data lakes are typically used for."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "8ae2a55c",
"metadata": {},
"outputs": [],
"source": [
"# -- DataFrame --\n",
"\n",
"# DataFrame\n",
"# •A DataFrame is a data structure that organizes data into a 2-dimensional table of rows and columns,\n",
"# much like a spreadsheet.\n",
"# •DataFrames are one of the most common data structures used in modern data analytics because they are\n",
"# a flexible and intuitive way of storing and working with data.\n",
"# •Every DataFrame contains a blueprint, known as a schema, that defines the name and data type of each\n",
"# column.\n",
"\n",
"# Python Pandas DataFrame\n",
"# •In Python Pandas, a DataFrame is a data structure constructed with rows and columns, similar to a\n",
"# database or Excel spreadsheet.\n",
"# •It consists of a dictionary of lists in which each list has its own identifier or key,\n",
"# such as “last name” or “food group.”"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "1b032249",
"metadata": {},
"outputs": [],
"source": [
"# -- Dataset --\n",
"\n",
"# Definition\n",
"# •A dataset, or data set, is a collection of data related to a particular topic, theme, or industry.\n",
"# •Datasets include different types of information, such as numbers, text, images, videos, and audio,\n",
"# and can be stored in various formats, such as CSV, JSON, or SQL.\n",
"# •So, a dataset typically involves structured data for a specific purpose and is related to the same\n",
"# subject.\n",
"\n",
"# Dataset vs database\n",
"# •While a dataset is a collection of data, often in a tabular form such as a CSV or Excel file, focused\n",
"# on a specific topic or analysis, a database is a structured set of data held in a computer, typically\n",
"# a server, that provides more complex functionality for data storage, management, and retrieval.\n",
"# •Databases are designed to handle large volumes of data and support concurrent access by multiple users,\n",
"# with robust querying capabilities through languages like SQL.\n",
"# •Databases maintain data integrity and are essential for applications that require regular data\n",
"# updates and transactions, such as customer relationship management systems or online retail sites.\n",
"# •On the other hand, datasets are typically static, used for analysis, and do not facilitate\n",
"# real-time data manipulation or complex transaction processing.\n",
"\n",
"# Types of datasets\n",
"# •Based on the data type\n",
"# •Based on data structure\n",
"# •In statistics\n",
"# •Machine learning\n",
"\n",
"# Types of datasets (based on the data type)\n",
"# •Numerical datasets: Contain numbers and are used for quantitative analysis.\n",
"# •Text datasets: Contain posts, text messages, and documents.\n",
"# •Multimedia datasets: Contain images, videos, and audio files.\n",
"# •Time-series datasets: Contain data collected over time to analyze trends and patterns.\n",
"# •Spatial datasets: Contain geographically referenced information, such as GPS data.\n",
"# •Structured datasets: Organized in specific structures to make it easier to query and analyze data.\n",
"# •Unstructured datasets: Don’t have a well-defined schema. They can include a variety of types of data.\n",
"# •Hybrid datasets: Include both structured and unstructured data."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "e7307962",
"metadata": {},
"outputs": [],
"source": [
"# -- Not so related: Apache Spark --\n",
"\n",
"# Confusing terms: DataFrame and dataset\n",
"# •The terms DataFrame and dataset can be confused with the API types of the same names provided by\n",
"# Apache Spark.\n",
"# •This part is intended to show the API types.\n",
"\n",
"# Apache Spark\n",
"# •Apache Spark is an open-source, distributed processing system used for big data workloads.\n",
"# •It utilizes in-memory caching and optimized query execution for fast queries against data of any size.\n",
"# •Simply put, Spark is a fast and general engine for large-scale data processing.\n",
"\n",
"# Features\n",
"# •Fast processing: The most important feature of Apache Spark that has made the big data world choose\n",
"# this technology over others is its speed.\n",
"# •Flexibility: Apache Spark supports multiple languages and allows developers to write applications\n",
"# in Java, Scala, R, or Python.\n",
"# •In-memory computing: Spark stores the data in the RAM of servers, which allows quick access and in turn\n",
"# accelerates the speed of analytics.\n",
"# •Real-time processing: Spark is able to process real-time streaming data. Unlike MapReduce, which\n",
"# processes only stored data, Spark can process real-time data and is therefore able to produce\n",
"# instant outcomes.\n",
"# •Better analytics: In contrast to MapReduce, which offers only Map and Reduce functions, Apache Spark\n",
"# includes a rich set of SQL queries, machine learning algorithms, complex analytics, etc. With all these\n",
"# functionalities, analytics can be performed more effectively with Spark.\n",
"\n",
"# APIs\n",
"# •Apache Spark provides three different APIs for working with big data: RDD, Dataset, DataFrame.\n",
"# •The Spark platform provides functions to change between the three data formats quickly.\n",
"# •Each API has advantages as well as cases when it is most beneficial to use it."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "2a07f9b4",
"metadata": {},
"outputs": [],
"source": [
"# -- Data source --\n",
"\n",
"# Data source\n",
"# •Data sources are the places where you can obtain data for analysis.\n",
"# •They come in various forms, such as data sets, APIs, software, and providers.\n",
"# •A dataset's quality and reliability depend heavily on the source from which it is obtained.\n",
"# •Understanding data sources is essential for data analysis.\n",
"\n",
"# Data source examples\n",
"# •Public health data is used to monitor the spread of diseases and predict future threats.\n",
"# •Most businesses use Google Analytics to track website traffic and user behavior.\n",
"# •LinkedIn provides data on user behavior, job market trends, and professional connections.\n",
"\n",
"# Types and formats\n",
"# •Data sources can be categorized based on the structure of the data they provide.\n",
"# •There are three primary types of data sources: structured, unstructured, and semi-structured.\n",
"\n",
"# Structured data\n",
"# •Structured data refers to data with a specific structure, typically organized in a table format.\n",
"# Relational databases are a common source of structured data, since they contain tables consisting\n",
"# of columns and rows.\n",
"# •SQL is a programming language used to manage and manipulate structured data.\n",
"# •Structured data is widely used in finance, healthcare, and retail industries.\n",
"\n",
"# Unstructured data\n",
"# •Unstructured data refers to data that doesn't have a specific structure, making it more challenging to\n",
"# analyze.\n",
"# •Examples of unstructured data include text, images, and video; some example sources are government\n",
"# databases, news articles, and social media.\n",
"# •Machine learning is often used to analyze unstructured data since ML can use algorithms to identify\n",
"# patterns and relationships.\n",
"\n",
"# Semi-structured data formats\n",
"# •Semi-structured data is a combination of structured and unstructured data.\n",
"# •It has some structure but is also flexible, allowing for changes as needed.\n",
"# •Examples of some popular semi-structured data formats are XML, JSON, CSV."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "ce182bff",
"metadata": {},
"outputs": [],
"source": [
"# SLIDE 8B ( Source of data )\n",
"\n",
"# -- Data source --\n",
"\n",
"# Data source #1\n",
"# •A data source is the physical or digital location where the data comes from in various forms.\n",
"# •The data source can be both the place where the data was originally created and the place where it was\n",
"# added, where the latter applies when data is digitized.\n",
"# •Data sources can be digital (for the most part) or paper-based.\n",
"# •The idea is to enable users to access and exploit the data from this source.\n",
"# •The data source can take different forms, such as a database, a flat file, an inventory table, web\n",
"# scraping, streaming data, physical archives, etc.\n",
"# •With the development of Big Data and new technologies, these different formats are constantly evolving,\n",
"# making data sources ever more complex.\n",
"# •The challenge for organisations is to simplify them as much as possible.\n",
"\n",
"# Data source #2\n",
"# •A data source is simply the source of the data.\n",
"# •It can be a file, a particular database on a DBMS, or even a live data feed.\n",
"# •The data might be located on the same computer as the program, or on another computer somewhere on\n",
"# a network.\n",
"\n",
"# Others\n",
"# •Sensors: raw data for physical quantities, e.g. temperature, humidity, light intensity, etc.\n",
"# •Simulation: random data leading to a meaning after analysis, e.g. Monte Carlo simulation."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "9273f4bb",
"metadata": {},
"outputs": [],
"source": [
"# -- Data type --\n",
"\n",
"# Types of data\n",
"# •There are two types of data: Qualitative and Quantitative data.\n",
"# •They are further classified into four categories: Nominal data, Ordinal data, Discrete data,\n",
"# Continuous data.\n",
"\n",
"# Qualitative or Categorical Data\n",
"# •Qualitative or Categorical Data is data that can’t be measured or counted in the form of numbers.\n",
"# •These types of data are sorted by category, not by number.\n",
"# •These data consist of audio, images, symbols, or text.\n",
"# •The gender of a person, i.e., male, female, or others, is qualitative data.\n",
"# •Qualitative data tells about the perception of people.\n",
"\n",
"# Nominal data\n",
"# •Nominal Data is used to label variables without any order or quantitative value.\n",
"# •The color of hair can be considered nominal data, as one color can’t be compared with another color.\n",
"\n",
"# Ordinal data\n",
"# •Ordinal data have a natural ordering, in which values are ranked by their position\n",
"# on a scale.\n",
"# •These data are used for observations like customer satisfaction, happiness, etc., but we can’t do any\n",
"# arithmetical tasks on them.\n",
"# •Ordinal data is qualitative data whose values have some kind of relative position.\n",
"\n",
"# Quantitative data\n",
"# •Quantitative data can be expressed in numerical values, making it countable and suitable for statistical\n",
"# data analysis.\n",
"# •These data can be represented on a wide variety of graphs and charts, such as bar graphs, histograms,\n",
"# scatter plots, boxplots, pie charts, line graphs, etc.\n",
"\n",
"# Discrete data\n",
"# •Discrete data contain values that fall under integers or whole numbers.\n",
"# •The total number of students in a class is an example of discrete data.\n",
"# •These data can’t be broken into decimal or fraction values.\n",
"\n",
"# Continuous data\n",
"# •Continuous data are in the form of fractional numbers.\n",
"# •It can be the version of an Android phone, the height of a person, the length of an object, etc.\n",
"# •Continuous data represents information that can be divided into smaller levels.\n",
"# •A continuous variable can take any value within a range.\n",
"\n",
"# Notes\n",
"# •Different types of data are used in research, analysis, statistical analysis, data visualization, and\n",
"# data science.\n",
"# •Working on data is crucial because we need to figure out what kind of data it is and how to use it to\n",
"# get valuable output out of it.\n",
"# •Working with data requires good data science skills and a deep understanding of different types of data\n",
"# and how to work with them."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
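
The DataFrame notes in the notebook describe a table as a dictionary of lists, where each list is a column with its own key, plus a schema that records each column's name and data type. The sketch below illustrates that layout in plain Python rather than with pandas itself. The column names, the values, and the helpers `infer_schema` and `rows` are all hypothetical and do not come from the notebook; it also shows, in the spirit of the data versus information notes, how a raw table becomes a single interpretable figure.

```python
from statistics import mean

# Hypothetical column-oriented table: each key names a column,
# and each list holds that column's values (one entry per row).
table = {
    "last_name": ["Ito", "Perez", "Khan"],
    "food_group": ["fruit", "dairy", "fruit"],
    "servings": [2, 1, 3],           # discrete: whole numbers
    "weight_kg": [0.25, 0.5, 0.75],  # continuous: any value in a range
}


def infer_schema(columns):
    """Map each column name to the Python type name of its first value."""
    return {name: type(values[0]).__name__ for name, values in columns.items()}


def rows(columns):
    """Yield one dict per row by zipping the column lists together."""
    names = list(columns)
    for values in zip(*(columns[name] for name in names)):
        yield dict(zip(names, values))


schema = infer_schema(table)
records = list(rows(table))

# Data -> information: average weight of the rows in the "fruit" group.
fruit_mean_weight = mean(
    r["weight_kg"] for r in records if r["food_group"] == "fruit"
)

print(schema)
print(records[0])
print(fruit_mean_weight)
```

With pandas installed, passing the same dictionary to `pandas.DataFrame(table)` would build the equivalent two-dimensional table; the plain-Python version is used here only to keep the sketch dependency-free.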