ayarayenima/Issues8.ipynb

## Issues8.ipynb
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "161f1a89",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Data\n",
    "# The term data is defined as a collection of individual facts or statistics (singular form: datum). Data can come in the form of text, observations, figures, images, numbers, graphs, or symbols."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "9c20d908",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Information\n",
    "# The term information is defined as knowledge gained through study, communication, research, or instruction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "9b2b4f90",
   "metadata": {},
   "outputs": [],
   "source": [
    "# The difference between data and information :\n",
    "# Data is a collection of facts, while information puts those facts into context.\n",
    "# Data points are individual and sometimes unrelated. Information maps out that data to provide a big-picture view of how it all fits together.\n",
    "# Data typically comes in the form of graphs, numbers, figures, or statistics, while information is typically presented through words, language, thoughts, and ideas."
   ]
  },
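  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b7e4c1a2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# A minimal sketch of the distinction above, not a formal definition.\n",
    "# Assumption: the temperature readings are made-up values used only for illustration.\n",
    "# The raw list is data; the summary sentence derived from it is information.\n",
    "from statistics import mean\n",
    "\n",
    "readings_celsius = [21.5, 22.0, 23.5, 25.0]  # hypothetical raw data points\n",
    "\n",
    "# Putting the facts into context: an average and a simple trend check.\n",
    "average = mean(readings_celsius)\n",
    "is_rising = readings_celsius[-1] > readings_celsius[0]\n",
    "\n",
    "print(f'Average temperature: {average:.1f} C; rising over the period: {is_rising}')"
   ]
  },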
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "6a9137a4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# A dataset is a structured collection of data, generally associated with a unique body of work.\n",
    "# A database is an organized collection of data stored as multiple datasets, generally stored and accessed electronically through a computer system that allows the data to be easily accessed, manipulated, and updated."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "f0b7e22d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Some use cases for the 6 popular schemas :\n",
    "#• Flat model: Best for small, simple applications.\n",
    "#• Hierarchical model: For nested data, like XML or JSON.\n",
    "#• Network model: Useful for mapping and spatial data, also for depicting workflows.\n",
    "#• Relational model: Best reflects Object-Oriented Programming applications.\n",
    "#• Star model: For analyzing large, one-dimensional datasets.\n",
    "#• Snowflake model: For analyzing large and complex datasets.\n"
   ]
  },
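  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c3a9d5f4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# A rough sketch of the star model listed above: one fact table that references a dimension table.\n",
    "# Assumptions: the table names (dim_product, fact_sales), columns, and rows are hypothetical,\n",
    "# and an in-memory SQLite database stands in for a real analytics store.\n",
    "import sqlite3\n",
    "\n",
    "conn = sqlite3.connect(':memory:')\n",
    "conn.executescript('''\n",
    "    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);\n",
    "    CREATE TABLE fact_sales (\n",
    "        sale_id INTEGER PRIMARY KEY,\n",
    "        product_id INTEGER REFERENCES dim_product(product_id),\n",
    "        amount REAL\n",
    "    );\n",
    "    INSERT INTO dim_product VALUES (1, 'pen'), (2, 'notebook');\n",
    "    INSERT INTO fact_sales VALUES (1, 1, 2.5), (2, 2, 4.0), (3, 1, 2.5);\n",
    "''')\n",
    "\n",
    "# Analytical query: join the fact table to its dimension and aggregate per product.\n",
    "totals = conn.execute('''\n",
    "    SELECT p.name, SUM(s.amount)\n",
    "    FROM fact_sales AS s\n",
    "    JOIN dim_product AS p ON p.product_id = s.product_id\n",
    "    GROUP BY p.name\n",
    "    ORDER BY p.name\n",
    "''').fetchall()\n",
    "conn.close()\n",
    "\n",
    "print(totals)"
   ]
  },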
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "dde198d4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# A data warehouse is a type of data management system that is designed to enable and support business intelligence (BI) activities, especially analytics.\n",
    "# A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed for analytics applications.\n",
    "# Metadata describes the data stored in the data lake, providing details such as its source, its structure, its meaning, its relationships with other data, and its usage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "f70f1c07",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Data swamps :\n",
    "#• One of the biggest challenges is preventing a data lake from turning into a data swamp.\n",
    "#• If it isn't set up and managed properly, the data lake can become a messy dumping ground for data.\n",
    "#• Users may not find what they need, and data managers may lose track of data that's stored in the data lake, even as more pours in."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "42134a60",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Data governance :\n",
    "#• One of the purposes of a data lake is to store raw data as-is for various analytics uses.\n",
    "#• But without effective governance of data lakes, organizations may be hit with data quality, consistency and reliability issues.\n",
    "#• Those problems can hamper analytics applications and produce flawed results that lead to bad business decisions.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "2ece68de",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Data lakehouse : \n",
    "# • A data lakehouse, as the name suggests, is a new data architecture that merges a data warehouse and a data lake into a single whole, with the purpose of addressing each one’s limitations.\n",
    "# • In a nutshell, the lakehouse system leverages low-cost storage to keep large volumes of data in raw format, just as data lakes do."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "3145a1c0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# DataFrame :\n",
    "# • A DataFrame is a data structure that organizes data into a 2-dimensional table of rows and columns, much like a spreadsheet.\n",
    "# • DataFrames are one of the most common data structures used in modern data analytics because they are a flexible and intuitive way of storing and working with data.\n",
    "# • Every DataFrame contains a blueprint, known as a schema, that defines the name and data type of each column.\n",
    "# • In pandas (Python), a DataFrame is a data structure constructed with rows and columns, similar to a database table or an Excel spreadsheet."
   ]
  },
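  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d8f2b6e1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# A small sketch of the DataFrame description above, assuming pandas is installed.\n",
    "# Assumption: the column names and sample records are hypothetical and purely illustrative.\n",
    "import pandas as pd\n",
    "\n",
    "df = pd.DataFrame({\n",
    "    'name': ['Ada', 'Linus'],\n",
    "    'age': [36, 28],\n",
    "    'score': [91.5, 88.0],\n",
    "})\n",
    "\n",
    "print(df.shape)   # (number of rows, number of columns)\n",
    "print(df.dtypes)  # the schema: one data type per column\n",
    "\n",
    "# Rows can be filtered with a boolean mask, much like filtering a spreadsheet.\n",
    "print(df[df['age'] > 30])"
   ]
  },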
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "8b7f4d1e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Types of datasets (based on the data type) :\n",
    "#• Numerical datasets: Contain numbers and are used for quantitative analysis.\n",
    "#• Text datasets: Contain posts, text messages, and documents.\n",
    "#• Multimedia datasets: Contain images, videos, and audio files.\n",
    "#• Time-series datasets: Contain data collected over time to analyze trends and patterns.\n",
    "#• Spatial datasets: Contain geographically referenced information, such as GPS data.\n",
    "\n",
    "\n",
    "# Types of datasets (based on the data structure) :\n",
    "#• Structured datasets: Organized in specific structures to make it easier to query and analyze data.\n",
    "#• Unstructured datasets: Don’t have a well-defined schema. They can include a variety of types of data.\n",
    "#• Hybrid datasets: Include both structured and unstructured data.\n",
    "\n",
    "# Types of datasets (in statistics) : \n",
    "#• Numerical datasets: Involve only numbers.\n",
    "#• Bivariate datasets: Involve two data variables.\n",
    "#• Multivariate datasets: Involve three or more data variables.\n",
    "#• Categorical datasets: Consist of categorical variables that can take only a limited set of values.\n",
    "#• Correlation datasets: Contain data variables that relate to each other.\n",
    "\n",
    "# Types of datasets (machine learning) :\n",
    "#• Datasets for training: Used to train the model.\n",
    "#• Datasets for validation: Used to tune the model and detect overfitting during training.\n",
    "#• Datasets for testing: Used to evaluate the final model on unseen data and confirm its accuracy."
   ]
  },
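  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e5a1c7b9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# A minimal sketch of the training / validation / testing split described above.\n",
    "# Assumptions: split_dataset is a hypothetical helper, and the 70/15/15 proportions\n",
    "# are a common but arbitrary choice, not something prescribed by these notes.\n",
    "import random\n",
    "\n",
    "\n",
    "def split_dataset(records, train_frac=0.7, val_frac=0.15, seed=0):\n",
    "    # Shuffle a copy so the caller's data is left untouched; a fixed seed keeps it reproducible.\n",
    "    shuffled = list(records)\n",
    "    random.Random(seed).shuffle(shuffled)\n",
    "    n_train = int(len(shuffled) * train_frac)\n",
    "    n_val = int(len(shuffled) * val_frac)\n",
    "    train = shuffled[:n_train]\n",
    "    validation = shuffled[n_train:n_train + n_val]\n",
    "    test = shuffled[n_train + n_val:]  # everything left over\n",
    "    return train, validation, test\n",
    "\n",
    "\n",
    "train, validation, test = split_dataset(range(100))\n",
    "print(len(train), len(validation), len(test))"
   ]
  },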
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "5512beb3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Data Source :\n",
    "#• A data source is the physical or digital location where the data comes from in various forms.\n",
    "#• A data source can be either the place where the data was originally created or the place where it was later added, as when physical records are digitized.\n",
    "#• Data sources can be digital (for the most part) or paper-based.\n",
    "#• The idea is to enable users to access and exploit the data from this source.\n",
    "#• The data source can take different forms, such as a database, a flat file, an inventory table, web scraping, streaming data, physical archives, etc.\n",
    "#• With the development of Big Data and new technologies, these different formats are constantly evolving, making data sources ever more complex.\n",
    "#• The challenge for organisations is to simplify them as much as possible.\n",
    "#• A data source is simply the source of the data.\n",
    "#• It can be a file, a particular database on a DBMS, or even a live data feed.\n",
    "#• The data might be located on the same computer as the program, or on another computer somewhere on a network.\n",
    "\n"
   ]
  },
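  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f6b3d8a0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# A brief sketch of reading from two of the data source forms listed above: a flat file and a database.\n",
    "# Assumptions: a list of CSV lines stands in for a file on disk, an in-memory SQLite\n",
    "# database stands in for a DBMS, and all names and figures are made-up placeholders.\n",
    "import csv\n",
    "import sqlite3\n",
    "\n",
    "# Source 1: flat file (csv.DictReader accepts any iterable of text lines).\n",
    "csv_lines = ['city,population', 'Oslo,700000', 'Bergen,290000']\n",
    "rows_from_file = list(csv.DictReader(csv_lines))\n",
    "\n",
    "# Source 2: a database queried with SQL.\n",
    "conn = sqlite3.connect(':memory:')\n",
    "conn.execute('CREATE TABLE city (name TEXT, population INTEGER)')\n",
    "conn.execute('INSERT INTO city VALUES (?, ?)', ('Trondheim', 210000))\n",
    "rows_from_db = conn.execute('SELECT name, population FROM city').fetchall()\n",
    "conn.close()\n",
    "\n",
    "print(rows_from_file)\n",
    "print(rows_from_db)"
   ]
  },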
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "84f443d0",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}