{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "ce6ad075",
"metadata": {},
"outputs": [],
"source": [
"# MAZAM GANENDRA\n",
"# 21181192\n",
"\n",
"# SLIDE 8A ( Data and data-something )\n",
"\n",
"# -- DATA AND INFROMATION --\n",
"\n",
"# DATA\n",
"# •The term datais defined as a collection of individual facts or statistics (singular form: datum).\n",
"# •Data can come in the form of text, observations, figures, images, numbers, graphs, or symbols.\n",
"# •Data is a raw form of knowledge and, on its own, doesn’t carry any significance or purpose.\n",
"# •Data can be simple—and may even seem useless until it is analyzed, organized, and interpreted.\n",
"\n",
"# INFORMATION\n",
"# •The term informationis defined as knowledge gained through study, communication, research, or \n",
"# instruction.\n",
"# •Essentially, information is the result of analyzing and interpreting pieces of data.\n",
"# •Whereas data is the individual figures, numbers, or graphs, information is the perception of those \n",
"# pieces of knowledge.\n",
"\n",
"# Key differences between them\n",
"# •Data is a collection of facts, while information puts those facts into context.\n",
"# •While data is raw and unorganized, information is organized.\n",
"# •Data points are individualand sometimes unrelated. Infor-mation maps out that data to provide a \n",
"# big-picture viewof how it all fits together.\n",
"# •Data, on its own, is meaningless. When it’s analyzedand inter-preted, it becomes meaningfulinformation.\n",
"# •Data does not dependon information; however, information dependson data.•Data typically comes in the \n",
"# form of graphs, numbers, figures, or statistics,while information is typically presented through words, \n",
"# language, thoughts, and ideas.\n",
"# •Data isn’t sufficientfor decision-making, but you can make decisionsbased on information."
]
},
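{
"cell_type": "code",
"execution_count": null,
"id": "ex01datainfo",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch (not from the slides): data versus information.\n",
"# The temperature readings are made-up values, assumed only for this example.\n",
"from statistics import mean\n",
"\n",
"# Data: individual raw facts with no context.\n",
"readings = [21.5, 22.0, 23.5, 25.0]\n",
"\n",
"# Information: the same facts, organized and interpreted.\n",
"summary = {\n",
"    'count': len(readings),\n",
"    'mean': mean(readings),\n",
"    'max': max(readings),\n",
"}\n",
"print(summary)"
]
},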
{
"cell_type": "code",
"execution_count": 2,
"id": "a64919e7",
"metadata": {},
"outputs": [],
"source": [
"# -- Dataset and database --\n",
"\n",
"# Data, dataset, database\n",
"# •The dataare observations or measurements (unprocessed or processed) represented as text, numbers, \n",
"# or multimedia.\n",
"# •A datasetis a structured collection of data generally asso-ciated with a unique body of work.\n",
"# •A databaseis an organized collection of data stored as multi-ple datasets, where those datasets are \n",
"# generally stored and accessed electronically from a computer system that allows the data to be easily \n",
"# accessed, manipulated, and updated.\n",
"# •A datasetis a structured collection of data organized and stored together for analysis or processing, \n",
"# that can include many different types of data, from numerical values to text, images or audio recordings.\n",
"# •The datawithin a dataset can typically be accessed indivi-dually, in combination or managed as a whole\n",
"# entity.\n",
"# •A database(relational, document, or key-valuetype) is an organized collection of data stored as \n",
"# multiple datasets.\n",
"\n",
"# Some use cases for the 6 popular schemas\n",
"# •Flat model: Best model is for small, simple applications.\n",
"# •Hierarchical model: For nested data, like XML or JSON.•Network model: Useful for mapping and spatial \n",
"# data, also for depicting workflows.\n",
"# •Relational model: Best reflects Object-Oriented Programming applications.\n",
"# •Star model: For analyzing large, one-dimensional datasets.\n",
"# •Snowflake model: For analyzing large and complex datasets.\n",
"\n",
"# Dataset and database\n",
"# •A datasetis a collection of related data often in a table or spreadsheet format, used primarily for \n",
"# analysis.\n",
"# •Whereas databaseis a structured system for storing, managing, and retrieving data, often used in \n",
"# applications and software systems."
]
},
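{
"cell_type": "code",
"execution_count": null,
"id": "ex02datasetdb",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch (not from the slides): a dataset versus a database.\n",
"# The table name, column names, and scores are hypothetical values chosen for this example.\n",
"import sqlite3\n",
"\n",
"# Dataset: a small, static collection of related records.\n",
"dataset = [('Ana', 82), ('Budi', 91), ('Citra', 77)]\n",
"\n",
"# Database: a system that stores the records and supports updates and queries.\n",
"conn = sqlite3.connect(':memory:')\n",
"conn.execute('CREATE TABLE scores (name TEXT, score INTEGER)')\n",
"conn.executemany('INSERT INTO scores VALUES (?, ?)', dataset)\n",
"\n",
"# Update one record in place, then query with SQL.\n",
"conn.execute('UPDATE scores SET score = 85 WHERE name = ?', ('Ana',))\n",
"rows = conn.execute(\n",
"    'SELECT name, score FROM scores WHERE score >= 80 ORDER BY name'\n",
").fetchall()\n",
"print(rows)\n",
"conn.close()"
]
},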
{
"cell_type": "code",
"execution_count": 3,
"id": "8f7a6014",
"metadata": {},
"outputs": [],
"source": [
"# -- Data warehouse and data lake --\n",
"\n",
"# Data warehouse\n",
"# •A data warehouseis a type of data management system that is designed to enable and support business \n",
"# intelligence (BI) activities, especially analytics.\n",
"# •Data warehouses are solely intended to perform queries and analysis and often contain large amounts \n",
"# of historical data. \n",
"# •The data within a data warehouse is usually derived from a wide range of sources such as application \n",
"# log files and transaction applications.\n",
"# •A data warehouse centralizes and consolidates large amounts of data from multiple sources.\n",
"# •Its analytical capabilities allow organizations to derive valu-able business insights from their \n",
"# data to improve decision-making.\n",
"\n",
"# Data lake\n",
"# •A data lake is a storage repository that holds a vast amount of raw data in its native format until \n",
"# it is needed for analytics applications.\n",
"# •While a traditional data warehouse stores data in hierarchical dimensions and tables, a data lake \n",
"# uses a flat architecture to store data, primarily in files or object storage.\n",
"# •That gives users more flexibility on data management, storage and usage.\n",
"\n",
"# Meta data in data lake\n",
"# •Metadata describes the data stored in the data lake, providing details such as its source, its \n",
"# structure, its meaning, its relationships with other data, and its usage.\n",
"# •This makes it easier for users to discover relevant data in the vast amounts of data stored in the \n",
"# data lake.\n",
"\n",
"# Data swamps\n",
"# •One of the biggest challenges is preventing a data lake from turning into a data swamp.\n",
"# •If it isn't set up and managed properly, the data lake can beco-me a messy dumping ground for data.\n",
"# •Users may not find what they need, and data managers may lose track of data that's stored in the data \n",
"# lake, even as more pours in.\n",
"\n",
"# Technology overload\n",
"# •The wide variety of technologies that can be used in data la-kes also complicates deployments.\n",
"# •First, organizations must find the right combination of techno-logies to meet their particular data \n",
"# management and analytics needs.\n",
"# •Then they need to install them, although the growing use of the cloud has made that step easier.\n",
"\n",
"# Unexpected costs\n",
"# •While the upfront technology costs may not be excessive, that can change if organizations don't \n",
"# carefully manage data lake environments.\n",
"# •For example, companies may get surprise bills for cloud-based data lakes if they're used more than \n",
"# expected.\n",
"# •The need to scale up data lakes to meet workload demands also increases costs.\n",
"\n",
"# Data governance\n",
"# •One of the purposes of a data lake is to store raw data as-is for various analytics uses.\n",
"# •But without effective governance of data lakes, organizations may be hit with data quality,\n",
"# consistency and reliability issues.\n",
"# •Those problems can hamper analytics applications and produce flawed results that lead to bad business \n",
"# decisions."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "77a7421c",
"metadata": {},
"outputs": [],
"source": [
"# -- Data lakehouse --\n",
"\n",
"# Data lakehouse #1\n",
"# •A data lakehouse, as the name suggests, is a new data archi-tecture that merges a data warehouse and\n",
"# a data lake into a single whole, with the purpose of addressing each one’s limitations.\n",
"# •In a nutshell, the lakehouse system leverages low-cost storage to keep large volumes of data in its \n",
"# raw formats just like data lakes.\n",
"# •At the same time, it brings structure to data and empowers data management features similar to those \n",
"# in data warehouses by implementing the metadata layer on top of the store.\n",
"# •This enables different teams to use a single system to access all of the enterprise data for a range \n",
"# of projects, including data science, machine learning, and business intelligence.\n",
"\n",
"# Data lakehouse #2\n",
"# •It uses an open-source data lake table format, which allows it to work with the features of a data \n",
"# warehouse, such as stan-dardized data structures and data management capabilities.\n",
"# •The lakehouse architecture also makes it easier to use analy-tics, data science, and machine learning.\n",
"# •This is because all of the data is stored in a single place, which makes it easier to access and \n",
"# analyze at scale across the entire organization.\n",
"\n",
"# Problems faced by data lake architecture\n",
"# •Inconsistent Data Quality without Schema Enforcement: Data lakes are a great way to store large amounts\n",
"# of data from dif-ferent sources. Because they’re so big and unstructured, it can be hard to keep track\n",
"# of the quality (to correct) of the data.\n",
"# •Handling data today —combining batch and streaming data: Today, data needs to be fast. Data lakes need\n",
"# to be able to handle both batch (historical) data as well as streaming (live) data, especially with the\n",
"# ever-growing volume of data gene-rated and collected.\n",
"# •Overhead for time and money: Managing data warehouse and data lake architectures can be a technical \n",
"# challenge. Data warehouses are powerful, but they’re expensive to set up and maintain. Data lakes are\n",
"# more cost-effective, but they do not inherently come structure your data for fast querying speeds.\n",
"# Organizations need to figure out which data is most critical for their day-to-day analysis and keep \n",
"# that in the data warehouse. Other less urgent data can stay in the data lake.\n",
"\n",
"# Solution: Delta lake\n",
"# •It is one of the table formats that enable data lakehouses.\n",
"# •It’s an open-source data management and governance layer that sits on top of a data lake.\n",
"# •Delta Lake gives data lakes the structure of a data warehouse, while still letting them be used \n",
"# for the broad range of use ca-ses that data lakes are typically used for"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "8ae2a55c",
"metadata": {},
"outputs": [],
"source": [
"# -- DataFrame --\n",
"\n",
"# DataFrame\n",
"# •A DataFrameis a data structure that organizes data into a 2-dimensional table of rows and columns, \n",
"# much like a spread-sheet.\n",
"# •DataFrames are one of the most common data structures used in modern data analytics because they are \n",
"# a flexibleand intuitive way of storing and working with data.\n",
"# •Every DataFrame contains a blueprint, known as a schema, that defines the name and data type of each \n",
"# column.\n",
"\n",
"# Python Pandas DataFrame\n",
"# •In Python Pandas, a dataframeis a data structure constructed with rows and columns, similar to a \n",
"# database or Excel spreadsheet.\n",
"# •It consists of a dictionary of lists in which the list each have their own identifiers or keys, \n",
"# such as “last name” or “food group.“"
]
},
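{
"cell_type": "code",
"execution_count": null,
"id": "ex03dataframe",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch (not from the slides): building a Pandas DataFrame from a dictionary of lists.\n",
"# Assumes the pandas package is installed. The column names and values are made up for this example.\n",
"import pandas as pd\n",
"\n",
"# Each key becomes a column name; each list holds that column's values.\n",
"data = {\n",
"    'last_name': ['Sari', 'Putra', 'Wijaya'],\n",
"    'food_group': ['fruit', 'grain', 'dairy'],\n",
"    'servings': [2, 3, 1],\n",
"}\n",
"df = pd.DataFrame(data)\n",
"\n",
"# The table itself: rows and columns, much like a spreadsheet.\n",
"print(df)\n",
"\n",
"# The schema: the name and data type of each column.\n",
"print(df.dtypes)"
]
},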
{
"cell_type": "code",
"execution_count": 7,
"id": "1b032249",
"metadata": {},
"outputs": [],
"source": [
"# -- Dataset --\n",
"\n",
"# Definition\n",
"# •A dataset, or data set, is a collection of data related to a parti-cular topic, theme, or industry.\n",
"# •Datasets include different types of information, such as num-bers, text, images, videos, and audio, \n",
"# and can be stored in various formats, such as CSV, JSON, or SQL.\n",
"# •So, a dataset typically involves structured data for a specific purpose and is related to the same \n",
"# subject.\n",
"\n",
"# Dataset vs database\n",
"# •While a dataset is a collection of data, often in a tabular form such as a CSV or Excel file, focused \n",
"# on a specific topic or analy-sis, a database is a structured set of data held in a computer, typically\n",
"# a server, that provides more complex functionality for data storage, management, and retrieval.\n",
"# •Databases are designed to handle large volumes of data and support concurrent access by multiple users,\n",
"# with robust querying capabilities through languages like SQL.\n",
"# •Databases maintain data integrityand are essential for appli-cations that require regular data \n",
"# updatesand transactions, such as customer relationship management systems or online retail sites.\n",
"# •On the other hand, datasets are typically static, used for ana-lysis, and do not facilitate \n",
"# real-timedata manipulation or complex transaction processing.\n",
"\n",
"# Types of datasets\n",
"# •Based on the data type\n",
"# •Based on data structure\n",
"# •In statistics\n",
"# •Machine learning\n",
"\n",
"# Types of datasets (based on the data type)\n",
"# •Numerical datasets: Contain numbers and are used for quan-titative analysis.\n",
"# •Text datasets: Contain posts, text messages, and documents.\n",
"# •Multimedia datasets: Contain images, videos, and audio files.\n",
"# •Time-series datasets: Contain data collected over time to ana-lyze trends and patterns.\n",
"# •Spatial dataset: Contain geographically referenced informa-tion, such as GPS data.\n",
"# •Structured datasets: Organized in specific structures to make it easier to query and analyze data.\n",
"# •Unstructured datasets: Don’t have a well-defined schema. They can include a variety of types of data.\n",
"# •Hybrid datasets: Include both structured and unstructured data"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "e7307962",
"metadata": {},
"outputs": [],
"source": [
"# -- Not so related: Apache Spark --\n",
"\n",
"# Confusing terms: DataFrame and dataset\n",
"# •DataFrame and dataset can be misleading to types of APIs provided by Apache Spark.\n",
"# •This part is intended to show the API types.\n",
"\n",
"# Apache Spark\n",
"# •Apache Spark is an open-source, distributed processing system used for big data workloads.\n",
"# •It utilizes in-memory caching and optimized query execution for fast queries against data of any size.\n",
"# •Simply put, Spark is a fast and general engine for large-scale data processing.\n",
"\n",
"# Features\n",
"# •Fast processing–The most important feature of Apache Spark that has made the big data world choose \n",
"# this technology over others is its speed.\n",
"# •Flexibility–Apache Spark supports multiple languages and allows the developers to write applications\n",
"# in Java, Scala, R, or Python.\n",
"# •In-memory computing–Spark stores the data in the RAM of servers which allows quick access and in turn\n",
"# accelerates the speed of analytics.\n",
"# •Real-time processing–Spark is able to process real-time streaming data. Unlike MapReduce which \n",
"# processes only stored data, Spark is able to process real-time data and is, therefore, able to produce\n",
"# instant outcomes.\n",
"# •Better analytics–In contrast to MapReduce that includes Map and Reduce functions, Apache Spark consists\n",
"# of a rich set of SQL queries, machine learning algorithms, complex analytics, etc, where with all these \n",
"# functionalities, analytics can be performed in a better fashion with the help of Spark.\n",
"\n",
"# APIs\n",
"# •Apache Spark provides three different APIs for working with big data: RDD, Dataset, DataFrame.\n",
"# •The Spark platform provides functions to change between the three data formats quickly.\n",
"# •Each API has advantages as well as cases when it is most beneficial to use them"
]
},
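{
"cell_type": "code",
"execution_count": null,
"id": "ex04sparkapi",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch (not from the slides): moving between the RDD and DataFrame APIs in PySpark.\n",
"# Assumes the pyspark package and a Java runtime are installed. The typed Dataset API exists only in\n",
"# Scala and Java, so it is not shown here. The names and ages are made-up values.\n",
"from pyspark.sql import SparkSession\n",
"\n",
"spark = SparkSession.builder.master('local[1]').appName('api-sketch').getOrCreate()\n",
"\n",
"# RDD: a low-level distributed collection of Python objects.\n",
"rdd = spark.sparkContext.parallelize([('Ana', 34), ('Budi', 28)])\n",
"\n",
"# RDD -> DataFrame: attach column names so Spark can optimize queries.\n",
"df = spark.createDataFrame(rdd, ['name', 'age'])\n",
"df.filter(df.age > 30).show()\n",
"\n",
"# DataFrame -> RDD: each element is a Row object.\n",
"print(df.rdd.map(lambda row: row.name).collect())\n",
"\n",
"spark.stop()"
]
},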
{
"cell_type": "code",
"execution_count": 9,
"id": "2a07f9b4",
"metadata": {},
"outputs": [],
"source": [
"# -- Data source --\n",
"\n",
"# Data source\n",
"# •The data sourcesare the places where you can obtain data for analysis.\n",
"# •They come in various forms, such as data sets, APIs, software, and providers.\n",
"# •A dataset's quality and reliability heavily depends on the source from which it is obtained.\n",
"# •Understanding data sources is essential for data analysis.\n",
"\n",
"# Data source examples\n",
"# •Public health data is used to monitor the spread of diseases and predict future threats.\n",
"# •Most businesses use Google Analytics to track website traffic and user behavior.\n",
"# •LinkedIn provides data on user behavior, job market trends, and professional connections.\n",
"\n",
"# Types and formats\n",
"# •Data sources can be categorized based on the structure of the data they provide.\n",
"# •There are three primary types of data sources: structured, unstructured, and semi-structured.\n",
"\n",
"# Structured data\n",
"# •Structured data refers to data with a specific structure, typi-cally organized in a table format, \n",
"# where relational databases are a common source of structured data, since they contain tables consisting\n",
"# of columns and rows.\n",
"# •SQL is a programming language used to manage and mani-pulate structured data.\n",
"# •Structured data is widely used in finance, healthcare, and retail industries.\n",
"\n",
"# Unstructured data\n",
"# •Unstructured data refers to data that doesn't have a specific structure, making it more challenging to \n",
"# analyze.\n",
"# •Examples of unstructured data include text, images, and video, where some examples are government \n",
"# databases, news articles, and social media.\n",
"# •Machine learning is often used to analyze unstructured data since ML can use algorithms to identify \n",
"# patterns and relation-ships.\n",
"\n",
"# Semi-structured data formats\n",
"# •Semi-structured data is a combination of structured and unstructured data.\n",
"# •It has some structure but is also flexible, allowing for changes as needed.\n",
"# •Examples of some popular semi-structured data are XML, JSON, CSV."
]
},
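{
"cell_type": "code",
"execution_count": null,
"id": "ex05semistruct",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch (not from the slides): semi-structured JSON flattened into structured rows.\n",
"# The records and field names are hypothetical values chosen for this example.\n",
"import json\n",
"\n",
"records = [\n",
"    {'id': 1, 'name': 'Ana', 'tags': ['new', 'vip']},\n",
"    {'id': 2, 'name': 'Budi'},  # no 'tags' field: the structure is flexible\n",
"]\n",
"\n",
"# Serialize to JSON text, as it might arrive from a file or an API.\n",
"text = json.dumps(records)\n",
"\n",
"# Parse the text and flatten it into fixed-width rows (id, name, tags).\n",
"parsed = json.loads(text)\n",
"rows = [(r['id'], r['name'], ', '.join(r.get('tags', []))) for r in parsed]\n",
"print(rows)"
]
},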
{
"cell_type": "code",
"execution_count": 10,
"id": "ce182bff",
"metadata": {},
"outputs": [],
"source": [
"# SLIDE 8B ( Source of data )\n",
"\n",
"# -- Datasource --\n",
"\n",
"#Data source #1\n",
"# •A data source is the physical or digital location where the data comes from in various forms.\n",
"# •The data source can be both the place where the data was originally created and the place where it was\n",
"# added, where the last is for data digitizing.\n",
"# •Data sources can be digital (for the most part) or paper-based.\n",
"# •The idea is to enable users to access and exploit the data from this source.\n",
"# •The data source can take different forms, such as a database, a flat file, an inventory table, web \n",
"# scraping, streaming data, physical archives, etc.\n",
"# •With the development of Big Data and new technologies, these different formats are constantly evolving,\n",
"# making data sources ever more complex.\n",
"# •The challenge for organisations is to simplify them as much as possible.\n",
"\n",
"#Data source #2\n",
"# •A data source is simply the source of the data.\n",
"# •It can be a file, a particular database on a DBMS, or even a live data feed.\n",
"# •The data might be located on the same computer as the program, or on another computer somewhere on \n",
"# a network.\n",
"\n",
"# Others\n",
"# •Sensors: raw data for physical data, e.g. temperature, humidity, light intensity, ect.\n",
"# •Simulation:randomdataleadingtoameaningafteranalysis,e.g.MonteCarlosimulation."
]
},
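{
"cell_type": "code",
"execution_count": null,
"id": "ex06montecarlo",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch (not from the slides): simulation as a data source.\n",
"# A Monte Carlo estimate of pi: random points are meaningless alone, but their aggregate is informative.\n",
"# The sample size and seed are arbitrary choices for this example.\n",
"import random\n",
"\n",
"random.seed(0)  # fixed seed so the run is repeatable\n",
"n = 100_000\n",
"\n",
"# Count random points in the unit square that fall inside the quarter circle.\n",
"inside = 0\n",
"for _ in range(n):\n",
"    x = random.random()\n",
"    y = random.random()\n",
"    if x * x + y * y <= 1.0:\n",
"        inside += 1\n",
"\n",
"# The fraction inside approximates pi / 4.\n",
"pi_estimate = 4 * inside / n\n",
"print(pi_estimate)"
]
},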
{
"cell_type": "code",
"execution_count": 11,
"id": "9273f4bb",
"metadata": {},
"outputs": [],
"source": [
"# -- Datatype --\n",
"\n",
"# Types of data\n",
"# •There are two typesof data: Qualitativeand Quantitativedata.\n",
"# •They are further clas-sified into four cate-gories: Nominal data,Ordinal data, Discrete data, \n",
"# Continuous data.\n",
"\n",
"# Qualitative or Categorical Data\n",
"# •Qualitative or Categorical Data is data that can’t be measured or counted in the form of numbers.\n",
"# •These types of data are sorted by category, not by number.\n",
"# •These data consist of audio, images, symbols, or text.\n",
"# •The gender of a person, i.e., male, female, or others, is qualitative data.\n",
"# •Qualitative data tells aboutthe perception of people.\n",
"\n",
"# Nominal data\n",
"# •Nominal Data is used to label variables without any order or quantitative value.\n",
"# •The color of hair can be considered nominal data, as one color can’t be compared with another color.\n",
"\n",
"# Ordinal data\n",
"# •Ordinal data have natural ordering where a number is present in some kind of order by their position \n",
"# on the scale.\n",
"# •These data are used for observation like customer satisfaction, happiness, etc., but we can’t do any \n",
"# arithmetical tasks on them.\n",
"# •Ordinal data is qualitative data forwhich their values have somekind of relative position.\n",
"\n",
"# Quantitative data\n",
"# •Quantitative data can be expressed in numerical values, making it countable and including statistical \n",
"# data analysis. \n",
"# •These data can be represented on a wide variety of graphs and charts, such as bar graphs, histograms, \n",
"# scatter plots, boxplots, pie charts, line graphs, etc.\n",
"\n",
"# Discrete data\n",
"# •The discrete data contain the values that fall under integers or whole numbers.\n",
"# •The total number of students in a class is an example of discrete data.\n",
"# •These data can’t bebroken into decimalor fraction values.\n",
"\n",
"# Continuous data\n",
"# •Continuous data are in the form of fractional numbers.•It can be the version of an android phone, \n",
"# the height of a person, the length of an object, etc.\n",
"# •Continuous data represents information that can be divided into smaller levels.\n",
"# •The continuous variable can takeany value within a range.\n",
"\n",
"# Notes\n",
"# •Different types of data are used in research, analysis, statistical analysis, data visualization, and \n",
"# data science.\n",
"# •Working on data is crucial because we need to figure out what kind of data it is and how to use it to\n",
"# get valuable output out of it.\n",
"# •Working with data requires good data science skills and a deep understanding of different types of data \n",
"# and how to work with them."
]
}
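,
{
"cell_type": "code",
"execution_count": null,
"id": "ex07datatypes",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch (not from the slides): the four data categories as Pandas column types.\n",
"# Assumes the pandas package is installed. The column names and values are made up for this example.\n",
"import pandas as pd\n",
"\n",
"df = pd.DataFrame({\n",
"    'hair_color': ['black', 'brown', 'black'],      # nominal: labels with no order\n",
"    'satisfaction': ['low', 'high', 'medium'],      # ordinal: labels with an order\n",
"    'students': [30, 28, 32],                       # discrete: whole numbers\n",
"    'height_cm': [160.5, 172.3, 168.0],             # continuous: any value in a range\n",
"})\n",
"\n",
"# Nominal data: an unordered categorical column.\n",
"df['hair_color'] = df['hair_color'].astype('category')\n",
"\n",
"# Ordinal data: an ordered categorical column, so comparisons are defined but arithmetic is not.\n",
"df['satisfaction'] = pd.Categorical(\n",
"    df['satisfaction'], categories=['low', 'medium', 'high'], ordered=True\n",
")\n",
"\n",
"print(df.dtypes)\n",
"print(df['satisfaction'].max())   # highest level according to the declared order\n",
"print(df['height_cm'].mean())     # arithmetic is meaningful for quantitative data"
]
}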
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}