MazamGanendra/Resume Session 8 Mazam Ganendra.ipynb

## Resume Session 8 Mazam Ganendra.ipynb
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "ce6ad075",
   "metadata": {},
   "outputs": [],
   "source": [
    "# MAZAM GANENDRA\n",
    "# 21181192\n",
    "\n",
    "# SLIDE 8A ( Data and data-something )\n",
    "\n",
    "#  -- DATA AND INFROMATION --\n",
    "\n",
    "# DATA\n",
    "# •The term datais defined as a collection of individual facts or statistics (singular form: datum).\n",
    "# •Data can come in the form of text, observations, figures, images, numbers, graphs, or symbols.\n",
    "# •Data is a raw form of knowledge and, on its own, doesn’t carry any significance or purpose.\n",
    "# •Data can be simple—and may even seem useless until it is analyzed, organized, and interpreted.\n",
    "\n",
    "# INFORMATION\n",
    "# •The term informationis defined as knowledge gained through study, communication, research, or \n",
    "#  instruction.\n",
    "# •Essentially, information is the result of analyzing and interpreting pieces of data.\n",
    "# •Whereas data is the individual figures, numbers, or graphs, information is the perception of those \n",
    "#  pieces of knowledge.\n",
    "\n",
    "# Key differences between them\n",
    "# •Data is a collection of facts, while information puts those facts into context.\n",
    "# •While data is raw and unorganized, information is organized.\n",
    "# •Data points are individualand sometimes unrelated. Infor-mation maps out that data to provide a \n",
    "#  big-picture viewof how it all fits together.\n",
    "# •Data, on its own, is meaningless. When it’s analyzedand inter-preted, it becomes meaningfulinformation.\n",
    "# •Data does not dependon information; however, information dependson data.•Data typically comes in the \n",
    "#  form of graphs, numbers, figures, or statistics,while information is typically presented through words, \n",
    "#  language, thoughts, and ideas.\n",
    "# •Data isn’t sufficientfor decision-making, but you can make decisionsbased on information."
   ]
  },
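  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ex-data-to-info",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch (not from the slides): turning raw data into information.\n",
    "# The readings are made-up values and the variable names are assumptions;\n",
    "# a summary like this is only one possible way to put data into context.\n",
    "from statistics import mean\n",
    "\n",
    "# Data: individual raw facts with no context.\n",
    "readings_celsius = [21.5, 22.0, 23.5, 22.5, 24.0]\n",
    "\n",
    "# Information: the same facts analyzed and interpreted.\n",
    "average = mean(readings_celsius)\n",
    "trend = 'rising' if readings_celsius[-1] > readings_celsius[0] else 'not rising'\n",
    "print(f'Average temperature: {average:.1f} C, overall trend: {trend}')"
   ]
  },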
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "a64919e7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# -- Dataset and database --\n",
    "\n",
    "# Data, dataset, database\n",
    "# •The dataare observations or measurements (unprocessed or processed) represented as text, numbers, \n",
    "#  or multimedia.\n",
    "# •A datasetis a structured collection of data generally asso-ciated with a unique body of work.\n",
    "# •A databaseis an organized collection of data stored as multi-ple datasets, where those datasets are \n",
    "#  generally stored and accessed electronically from a computer system that allows the data to be easily \n",
    "#  accessed, manipulated, and updated.\n",
    "# •A datasetis a structured collection of data organized and stored together for analysis or processing, \n",
    "#  that can include many different types of data, from numerical values to text, images or audio recordings.\n",
    "# •The datawithin a dataset can typically be accessed indivi-dually, in combination or managed as a whole\n",
    "#  entity.\n",
    "# •A database(relational, document, or key-valuetype) is an organized collection of data stored as \n",
    "#  multiple datasets.\n",
    "\n",
    "# Some use cases for the 6 popular schemas\n",
    "# •Flat model: Best model is for small, simple applications.\n",
    "# •Hierarchical model: For nested data, like XML or JSON.•Network model: Useful for mapping and spatial \n",
    "#  data, also for depicting workflows.\n",
    "# •Relational model: Best reflects Object-Oriented Programming applications.\n",
    "# •Star model: For analyzing large, one-dimensional datasets.\n",
    "# •Snowflake model: For analyzing large and complex datasets.\n",
    "\n",
    "# Dataset and database\n",
    "# •A datasetis a collection of related data often in a table or spreadsheet format, used primarily for \n",
    "#  analysis.\n",
    "# •Whereas databaseis a structured system for storing, managing, and retrieving data, often used in \n",
    "#  applications and software systems."
   ]
  },
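  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ex-dataset-vs-database",
   "metadata": {},
   "outputs": [],
   "source": [
    "# A minimal sketch of the dataset/database distinction described above.\n",
    "# Assumptions: Python's built-in sqlite3 stands in for a database system;\n",
    "# the table name 'people' and the rows are hypothetical.\n",
    "import sqlite3\n",
    "\n",
    "# Dataset: a small, static collection of related records.\n",
    "dataset = [('Alice', 30), ('Bob', 25), ('Carol', 35)]\n",
    "\n",
    "# Database: a system that stores the records and lets us update and query them.\n",
    "conn = sqlite3.connect(':memory:')\n",
    "conn.execute('CREATE TABLE people (name TEXT, age INTEGER)')\n",
    "conn.executemany('INSERT INTO people VALUES (?, ?)', dataset)\n",
    "conn.execute('UPDATE people SET age = age + 1 WHERE name = ?', ('Bob',))\n",
    "rows = conn.execute('SELECT name, age FROM people WHERE age > 28 ORDER BY name').fetchall()\n",
    "print(rows)\n",
    "conn.close()"
   ]
  },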
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "8f7a6014",
   "metadata": {},
   "outputs": [],
   "source": [
    "# -- Data warehouse and data lake --\n",
    "\n",
    "# Data warehouse\n",
    "# •A data warehouseis a type of data management system that is designed to enable and support business \n",
    "#  intelligence (BI) activities, especially analytics.\n",
    "# •Data warehouses are solely intended to perform queries and analysis and often contain large amounts \n",
    "#  of historical data. \n",
    "# •The data within a data warehouse is usually derived from a wide range of sources such as application \n",
    "#  log files and transaction applications.\n",
    "# •A data warehouse centralizes and consolidates large amounts of data from multiple sources.\n",
    "# •Its analytical capabilities allow organizations to derive valu-able business insights from their \n",
    "# data to improve decision-making.\n",
    "\n",
    "# Data lake\n",
    "# •A data lake is a storage repository that holds a vast amount of raw data in its native format until \n",
    "#  it is needed for analytics applications.\n",
    "# •While a traditional data warehouse stores data in hierarchical dimensions and tables, a data lake \n",
    "#  uses a flat architecture to store data, primarily in files or object storage.\n",
    "# •That gives users more flexibility on data management, storage and usage.\n",
    "\n",
    "# Meta data in data lake\n",
    "# •Metadata describes the data stored in the data lake, providing details such as its source, its \n",
    "#  structure, its meaning, its relationships with other data, and its usage.\n",
    "# •This makes it easier for users to discover relevant data in the vast amounts of data stored in the \n",
    "#  data lake.\n",
    "\n",
    "# Data swamps\n",
    "# •One of the biggest challenges is preventing a data lake from turning into a data swamp.\n",
    "# •If it isn't set up and managed properly, the data lake can beco-me a messy dumping ground for data.\n",
    "# •Users may not find what they need, and data managers may lose track of data that's stored in the data \n",
    "# lake, even as more pours in.\n",
    "\n",
    "# Technology overload\n",
    "# •The wide variety of technologies that can be used in data la-kes also complicates deployments.\n",
    "# •First, organizations must find the right combination of techno-logies to meet their particular data \n",
    "#  management and analytics needs.\n",
    "# •Then they need to install them, although the growing use of the cloud has made that step easier.\n",
    "\n",
    "# Unexpected costs\n",
    "# •While the upfront technology costs may not be excessive, that can change if organizations don't \n",
    "#  carefully manage data lake environments.\n",
    "# •For example, companies may get surprise bills for cloud-based data lakes if they're used more than \n",
    "#  expected.\n",
    "# •The need to scale up data lakes to meet workload demands also increases costs.\n",
    "\n",
    "# Data governance\n",
    "# •One of the purposes of a data lake is to store raw data as-is for various analytics uses.\n",
    "# •But without effective governance of data lakes, organizations may be hit with data quality,\n",
    "#  consistency and reliability issues.\n",
    "# •Those problems can hamper analytics applications and produce flawed results that lead to bad business \n",
    "#  decisions."
   ]
  },
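  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ex-lake-metadata",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical sketch of the metadata idea from the notes above: a tiny catalog\n",
    "# that describes files in a data lake so users can discover relevant data.\n",
    "# All paths, fields, and the find_by_tag helper are made up for illustration;\n",
    "# real data lakes rely on dedicated catalog tools.\n",
    "catalog = [\n",
    "    {'path': 'raw/sales/2024.csv', 'source': 'point-of-sale system',\n",
    "     'format': 'csv', 'tags': ['sales', 'transactions']},\n",
    "    {'path': 'raw/logs/app.json', 'source': 'application logs',\n",
    "     'format': 'json', 'tags': ['logs']},\n",
    "]\n",
    "\n",
    "def find_by_tag(entries, tag):\n",
    "    '''Return the paths of catalog entries that carry the given tag.'''\n",
    "    return [entry['path'] for entry in entries if tag in entry['tags']]\n",
    "\n",
    "print(find_by_tag(catalog, 'sales'))"
   ]
  },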
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "77a7421c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# -- Data lakehouse --\n",
    "\n",
    "# Data lakehouse #1\n",
    "# •A data lakehouse, as the name suggests, is a new data archi-tecture that merges a data warehouse and\n",
    "#  a data lake into a single whole, with the purpose of addressing each one’s limitations.\n",
    "# •In a nutshell, the lakehouse system leverages low-cost storage to keep large volumes of data in its \n",
    "#  raw formats just like data lakes.\n",
    "# •At the same time, it brings structure to data and empowers data management features similar to those \n",
    "#  in data warehouses by implementing the metadata layer on top of the store.\n",
    "# •This enables different teams to use a single system to access all of the enterprise data for a range \n",
    "#  of projects, including data science, machine learning, and business intelligence.\n",
    "\n",
    "# Data lakehouse #2\n",
    "# •It uses an open-source data lake table format, which allows it to work with the features of a data \n",
    "#  warehouse, such as stan-dardized data structures and data management capabilities.\n",
    "# •The lakehouse architecture also makes it easier to use analy-tics, data science, and machine learning.\n",
    "# •This is because all of the data is stored in a single place, which makes it easier to access and \n",
    "#  analyze at scale across the entire organization.\n",
    "\n",
    "# Problems faced by data lake architecture\n",
    "# •Inconsistent Data Quality without Schema Enforcement: Data lakes are a great way to store large amounts\n",
    "#  of data from dif-ferent sources. Because they’re so big and unstructured, it can be hard to keep track\n",
    "#  of the quality (to correct) of the data.\n",
    "# •Handling data today —combining batch and streaming data: Today, data needs to be fast. Data lakes need\n",
    "#  to be able to handle both batch (historical) data as well as streaming (live) data, especially with the\n",
    "#  ever-growing volume of data gene-rated and collected.\n",
    "# •Overhead for time and money: Managing data warehouse and data lake architectures can be a technical \n",
    "#  challenge. Data warehouses are powerful, but they’re expensive to set up and maintain. Data lakes are\n",
    "#  more cost-effective, but they do not inherently come structure your data for fast querying speeds.\n",
    "#  Organizations need to figure out which data is most critical for their day-to-day analysis and keep \n",
    "#  that in the data warehouse. Other less urgent data can stay in the data lake.\n",
    "\n",
    "# Solution: Delta lake\n",
    "# •It is one of the table formats that enable data lakehouses.\n",
    "# •It’s an open-source data management and governance layer that sits on top of a data lake.\n",
    "# •Delta Lake gives data lakes the structure of a data warehouse, while still letting them be used \n",
    "#  for the broad range of use ca-ses that data lakes are typically used for"
   ]
  },
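  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ex-schema-enforcement",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch of the schema-enforcement idea from the notes above, in plain Python.\n",
    "# This is NOT Delta Lake's API: the schema, the records, and the\n",
    "# append_with_schema helper are hypothetical and only illustrate the concept\n",
    "# of rejecting writes that do not match the expected structure.\n",
    "schema = {'order_id': int, 'amount': float}\n",
    "\n",
    "def append_with_schema(table, record):\n",
    "    '''Append record to table only if it matches the expected schema.'''\n",
    "    if set(record) != set(schema):\n",
    "        raise ValueError(f'unexpected columns: {sorted(record)}')\n",
    "    for column, expected_type in schema.items():\n",
    "        if not isinstance(record[column], expected_type):\n",
    "            raise TypeError(f'{column} must be {expected_type.__name__}')\n",
    "    table.append(record)\n",
    "\n",
    "table = []\n",
    "append_with_schema(table, {'order_id': 1, 'amount': 9.99})\n",
    "try:\n",
    "    # A malformed record is rejected instead of silently polluting the table.\n",
    "    append_with_schema(table, {'order_id': 'two', 'amount': 5.0})\n",
    "except TypeError as error:\n",
    "    print('rejected:', error)\n",
    "print(table)"
   ]
  },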
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "8ae2a55c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# -- DataFrame --\n",
    "\n",
    "# DataFrame\n",
    "# •A DataFrameis a data structure that organizes data into a 2-dimensional table of rows and columns, \n",
    "#  much like a spread-sheet.\n",
    "# •DataFrames are one of the most common data structures used in modern data analytics because they are \n",
    "#  a flexibleand intuitive way of storing and working with data.\n",
    "# •Every DataFrame contains a blueprint, known as a schema, that defines the name and data type of each \n",
    "#  column.\n",
    "\n",
    "# Python Pandas DataFrame\n",
    "# •In Python Pandas, a dataframeis a data structure constructed with rows and columns, similar to a \n",
    "#  database or Excel spreadsheet.\n",
    "# •It consists of a dictionary of lists in which the list each have their own identifiers or keys, \n",
    "#  such as “last name” or “food group.“"
   ]
  },
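  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ex-dataframe-dict",
   "metadata": {},
   "outputs": [],
   "source": [
    "# A minimal sketch of the DataFrame notes above, assuming pandas is installed.\n",
    "# The column names and values are made up for illustration.\n",
    "import pandas as pd\n",
    "\n",
    "# A dictionary of lists: each key becomes a column name, each list a column.\n",
    "data = {\n",
    "    'last_name': ['Smith', 'Lee', 'Khan'],\n",
    "    'food_group': ['fruit', 'grain', 'dairy'],\n",
    "    'servings': [2, 3, 1],\n",
    "}\n",
    "df = pd.DataFrame(data)\n",
    "\n",
    "print(df)\n",
    "print(df.dtypes)  # the schema: name and data type of each column\n",
    "print(df.shape)   # (number of rows, number of columns)"
   ]
  },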
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "1b032249",
   "metadata": {},
   "outputs": [],
   "source": [
    "# -- Dataset --\n",
    "\n",
    "# Definition\n",
    "# •A dataset, or data set, is a collection of data related to a parti-cular topic, theme, or industry.\n",
    "# •Datasets include different types of information, such as num-bers, text, images, videos, and audio, \n",
    "#  and can be stored in various formats, such as CSV, JSON, or SQL.\n",
    "# •So, a dataset typically involves structured data for a specific purpose and is related to the same \n",
    "#  subject.\n",
    "\n",
    "# Dataset vs database\n",
    "# •While a dataset is a collection of data, often in a tabular form such as a CSV or Excel file, focused \n",
    "#  on a specific topic or analy-sis, a database is a structured set of data held in a computer, typically\n",
    "#  a server, that provides more complex functionality for data storage, management, and retrieval.\n",
    "# •Databases are designed to handle large volumes of data and support concurrent access by multiple users,\n",
    "#  with robust querying capabilities through languages like SQL.\n",
    "# •Databases maintain data integrityand are essential for appli-cations that require regular data \n",
    "#  updatesand transactions, such as customer relationship management systems or online retail sites.\n",
    "# •On the other hand, datasets are typically static, used for ana-lysis, and do not facilitate \n",
    "#  real-timedata manipulation or complex transaction processing.\n",
    "\n",
    "# Types of datasets\n",
    "# •Based on the data type\n",
    "# •Based on data structure\n",
    "# •In statistics\n",
    "# •Machine learning\n",
    "\n",
    "# Types of datasets (based on the data type)\n",
    "# •Numerical datasets: Contain numbers and are used for quan-titative analysis.\n",
    "# •Text datasets: Contain posts, text messages, and documents.\n",
    "# •Multimedia datasets: Contain images, videos, and audio files.\n",
    "# •Time-series datasets: Contain data collected over time to ana-lyze trends and patterns.\n",
    "# •Spatial dataset: Contain geographically referenced informa-tion, such as GPS data.\n",
    "# •Structured datasets: Organized in specific structures to make it easier to query and analyze data.\n",
    "# •Unstructured datasets: Don’t have a well-defined schema. They can include a variety of types of data.\n",
    "# •Hybrid datasets: Include both structured and unstructured data"
   ]
  },
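  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ex-dataset-formats",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch for the note that one dataset can be stored in various formats.\n",
    "# Assumptions: the records are made-up values, and only the standard\n",
    "# library (csv, io, json) is used; SQL storage is not shown here.\n",
    "import csv\n",
    "import io\n",
    "import json\n",
    "\n",
    "records = [\n",
    "    {'city': 'Jakarta', 'temp_c': 31},\n",
    "    {'city': 'Bandung', 'temp_c': 24},\n",
    "]\n",
    "\n",
    "# The dataset as CSV text (written to an in-memory buffer instead of a file).\n",
    "buffer = io.StringIO()\n",
    "writer = csv.DictWriter(buffer, fieldnames=['city', 'temp_c'])\n",
    "writer.writeheader()\n",
    "writer.writerows(records)\n",
    "print(buffer.getvalue())\n",
    "\n",
    "# The same dataset as JSON text.\n",
    "print(json.dumps(records, indent=2))"
   ]
  },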
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "e7307962",
   "metadata": {},
   "outputs": [],
   "source": [
    "# -- Not so related: Apache Spark --\n",
    "\n",
    "# Confusing terms: DataFrame and dataset\n",
    "# •DataFrame and dataset can be misleading to types of APIs provided by Apache Spark.\n",
    "# •This part is intended to show the API types.\n",
    "\n",
    "# Apache Spark\n",
    "# •Apache Spark is an open-source, distributed processing system used for big data workloads.\n",
    "# •It utilizes in-memory caching and optimized query execution for fast queries against data of any size.\n",
    "# •Simply put, Spark is a fast and general engine for large-scale data processing.\n",
    "\n",
    "# Features\n",
    "# •Fast processing–The most important feature of Apache Spark that has made the big data world choose \n",
    "#  this technology over others is its speed.\n",
    "# •Flexibility–Apache Spark supports multiple languages and allows the developers to write applications\n",
    "#  in Java, Scala, R, or Python.\n",
    "# •In-memory computing–Spark stores the data in the RAM of servers which allows quick access and in turn\n",
    "#  accelerates the speed of analytics.\n",
    "# •Real-time processing–Spark is able to process real-time streaming data. Unlike MapReduce which \n",
    "#  processes only stored data, Spark is able to process real-time data and is, therefore, able to produce\n",
    "#  instant outcomes.\n",
    "# •Better analytics–In contrast to MapReduce that includes Map and Reduce functions, Apache Spark consists\n",
    "#  of a rich set of SQL queries, machine learning algorithms, complex analytics, etc, where with all these \n",
    "#  functionalities, analytics can be performed in a better fashion with the help of Spark.\n",
    "\n",
    "# APIs\n",
    "# •Apache Spark provides three different APIs for working with big data: RDD, Dataset, DataFrame.\n",
    "# •The Spark platform provides functions to change between the three data formats quickly.\n",
    "# •Each API has advantages as well as cases when it is most beneficial to use them"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "2a07f9b4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# -- Data source --\n",
    "\n",
    "# Data source\n",
    "# •The data sourcesare the places where you can obtain data for analysis.\n",
    "# •They come in various forms, such as data sets, APIs, software, and providers.\n",
    "# •A dataset's quality and reliability heavily depends on the source from which it is obtained.\n",
    "# •Understanding data sources is essential for data analysis.\n",
    "\n",
    "# Data source examples\n",
    "# •Public health data is used to monitor the spread of diseases and predict future threats.\n",
    "# •Most businesses use Google Analytics to track website traffic and user behavior.\n",
    "# •LinkedIn provides data on user behavior, job market trends, and professional connections.\n",
    "\n",
    "# Types and formats\n",
    "# •Data sources can be categorized based on the structure of the data they provide.\n",
    "# •There are three primary types of data sources: structured, unstructured, and semi-structured.\n",
    "\n",
    "# Structured data\n",
    "# •Structured data refers to data with a specific structure, typi-cally organized in a table format, \n",
    "#  where relational databases are a common source of structured data, since they contain tables consisting\n",
    "#  of columns and rows.\n",
    "# •SQL is a programming language used to manage and mani-pulate structured data.\n",
    "# •Structured data is widely used in finance, healthcare, and retail industries.\n",
    "\n",
    "# Unstructured data\n",
    "# •Unstructured data refers to data that doesn't have a specific structure, making it more challenging to \n",
    "#  analyze.\n",
    "# •Examples of unstructured data include text, images, and video, where some examples are government \n",
    "#  databases, news articles, and social media.\n",
    "# •Machine learning is often used to analyze unstructured data since ML can use algorithms to identify \n",
    "#  patterns and relation-ships.\n",
    "\n",
    "# Semi-structured data formats\n",
    "# •Semi-structured data is a combination of structured and unstructured data.\n",
    "# •It has some structure but is also flexible, allowing for changes as needed.\n",
    "# •Examples of some popular semi-structured data are XML, JSON, CSV."
   ]
  },
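  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ex-semi-structured",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch of the semi-structured idea from the notes above: JSON records carry\n",
    "# labeled fields, but the fields may differ from record to record.\n",
    "# The records and the chosen column list are assumptions for illustration.\n",
    "import json\n",
    "\n",
    "records = [\n",
    "    {'name': 'Alice', 'email': 'alice@example.com'},\n",
    "    {'name': 'Bob', 'phones': ['555-0100', '555-0101']},\n",
    "]\n",
    "text = json.dumps(records)   # serialize to JSON text\n",
    "parsed = json.loads(text)    # parse the text back into Python objects\n",
    "\n",
    "# Imposing a fixed, table-like structure: one row per record,\n",
    "# with missing fields filled in as None.\n",
    "columns = ['name', 'email', 'phones']\n",
    "rows = [[record.get(column) for column in columns] for record in parsed]\n",
    "for row in rows:\n",
    "    print(row)"
   ]
  },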
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "ce182bff",
   "metadata": {},
   "outputs": [],
   "source": [
    "# SLIDE 8B ( Source of data )\n",
    "\n",
    "# -- Datasource --\n",
    "\n",
    "#Data source #1\n",
    "# •A data source is the physical or digital location where the data comes from in various forms.\n",
    "# •The data source can be both the place where the data was originally created and the place where it was\n",
    "#  added, where the last is for data digitizing.\n",
    "# •Data sources can be digital (for the most part) or paper-based.\n",
    "# •The idea is to enable users to access and exploit the data from this source.\n",
    "# •The data source can take different forms, such as a database, a flat file, an inventory table, web \n",
    "#  scraping, streaming data, physical archives, etc.\n",
    "# •With the development of Big Data and new technologies, these different formats are constantly evolving,\n",
    "#  making data sources ever more complex.\n",
    "# •The challenge for organisations is to simplify them as much as possible.\n",
    "\n",
    "#Data source #2\n",
    "# •A data source is simply the source of the data.\n",
    "# •It can be a file, a particular database on a DBMS, or even a live data feed.\n",
    "# •The data might be located on the same computer as the program, or on another computer somewhere on \n",
    "#  a network.\n",
    "\n",
    "# Others\n",
    "# •Sensors: raw data for physical data, e.g. temperature, humidity, light intensity, ect.\n",
    "# •Simulation:randomdataleadingtoameaningafteranalysis,e.g.MonteCarlosimulation."
   ]
  },
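  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ex-monte-carlo",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch of simulation as a data source, using the Monte Carlo example named above.\n",
    "# Random points gain meaning after analysis: the fraction that falls inside a\n",
    "# quarter circle approximates pi / 4. The function name, the sample count, and\n",
    "# the seed are assumptions chosen for illustration.\n",
    "import random\n",
    "\n",
    "def estimate_pi(samples, seed=0):\n",
    "    '''Estimate pi from uniformly random points in the unit square.'''\n",
    "    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate\n",
    "    inside = 0\n",
    "    for _ in range(samples):\n",
    "        x, y = rng.random(), rng.random()\n",
    "        if x * x + y * y <= 1.0:\n",
    "            inside += 1\n",
    "    return 4 * inside / samples\n",
    "\n",
    "print(estimate_pi(100_000))"
   ]
  },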
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "9273f4bb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# -- Datatype --\n",
    "\n",
    "# Types of data\n",
    "# •There are two typesof data: Qualitativeand Quantitativedata.\n",
    "# •They are further clas-sified into four cate-gories: Nominal data,Ordinal data, Discrete data, \n",
    "#  Continuous data.\n",
    "\n",
    "# Qualitative or Categorical Data\n",
    "# •Qualitative or Categorical Data is data that can’t be measured or counted in the form of numbers.\n",
    "# •These types of data are sorted by category, not by number.\n",
    "# •These data consist of audio, images, symbols, or text.\n",
    "# •The gender of a person, i.e., male, female, or others, is qualitative data.\n",
    "# •Qualitative data tells aboutthe perception of people.\n",
    "\n",
    "# Nominal data\n",
    "# •Nominal Data is used to label variables without any order or quantitative value.\n",
    "# •The color of hair can be considered nominal data, as one color can’t be compared with another color.\n",
    "\n",
    "# Ordinal data\n",
    "# •Ordinal data have natural ordering where a number is present in some kind of order by their position \n",
    "#  on the scale.\n",
    "# •These data are used for observation like customer satisfaction, happiness, etc., but we can’t do any \n",
    "#  arithmetical tasks on them.\n",
    "# •Ordinal data is qualitative data forwhich their values have somekind of relative position.\n",
    "\n",
    "# Quantitative data\n",
    "# •Quantitative data can be expressed in numerical values, making it countable and including statistical \n",
    "#  data analysis. \n",
    "# •These data can be represented on a wide variety of graphs and charts, such as bar graphs, histograms, \n",
    "#  scatter plots, boxplots, pie charts, line graphs, etc.\n",
    "\n",
    "# Discrete data\n",
    "# •The discrete data contain the values that fall under integers or whole numbers.\n",
    "# •The total number of students in a class is an example of discrete data.\n",
    "# •These data can’t bebroken into decimalor fraction values.\n",
    "\n",
    "# Continuous data\n",
    "# •Continuous data are in the form of fractional numbers.•It can be the version of an android phone, \n",
    "#  the height of a person, the length of an object, etc.\n",
    "# •Continuous data represents information that can be divided into smaller levels.\n",
    "# •The continuous variable can takeany value within a range.\n",
    "\n",
    "# Notes\n",
    "# •Different types of data are used in research, analysis, statistical analysis, data visualization, and \n",
    "#  data science.\n",
    "# •Working on data is crucial because we need to figure out what kind of data it is and how to use it to\n",
    "#  get valuable output out of it.\n",
    "# •Working with data requires good data science skills and a deep understanding of different types of data \n",
    "#  and how to work with them."
   ]
  }
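  ,
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ex-data-types",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch of the four data categories from the notes above, assuming pandas is installed.\n",
    "# The survey values and column names are made up for illustration.\n",
    "import pandas as pd\n",
    "\n",
    "survey = pd.DataFrame({\n",
    "    'hair_color': ['black', 'brown', 'blonde'],   # nominal: labels without order\n",
    "    'satisfaction': ['low', 'high', 'medium'],    # ordinal: labels with an order\n",
    "    'children': [0, 2, 1],                        # discrete: whole numbers\n",
    "    'height_cm': [170.2, 158.9, 181.5],           # continuous: any value in a range\n",
    "})\n",
    "\n",
    "# Tell pandas about the ordering so that ordinal values can be ranked and sorted.\n",
    "survey['satisfaction'] = pd.Categorical(\n",
    "    survey['satisfaction'], categories=['low', 'medium', 'high'], ordered=True\n",
    ")\n",
    "\n",
    "print(survey.dtypes)\n",
    "print(survey['satisfaction'].max())        # highest level on the ordinal scale\n",
    "print(survey.sort_values('satisfaction'))  # rows ordered from low to high"
   ]
  }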
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}