@ayarayenima
Created May 7, 2024 13:10
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "161f1a89",
"metadata": {},
"outputs": [],
"source": [
"# Data\n",
"# The term data is defined as a collection of individual facts or statistics (singular form: datum). Data can come in the form of text, observations, figures, images, numbers, graphs, or symbols."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "9c20d908",
"metadata": {},
"outputs": [],
"source": [
"# Information\n",
"# The term information is defined as knowledge gained through study, communication, research, or instruction."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "9b2b4f90",
"metadata": {},
"outputs": [],
"source": [
"# The difference between data and information :\n",
"# Data is a collection of facts, while information puts those facts into context.\n",
"# Data points are individual and sometimes unrelated. Information maps out that data to provide a big-picture view of how it all fits together.\n",
"# Data typically comes in the form of graphs, numbers, figures, or statistics, while information is typically presented through words, language, thoughts, and ideas."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "6a9137a4",
"metadata": {},
"outputs": [],
"source": [
"# A dataset is a structured collection of data generally associated with a unique body of work.\n",
"# A database is an organized collection of data stored as multiple datasets. It is generally stored and accessed electronically from a computer system, so the data can be easily accessed, manipulated, and updated."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "f0b7e22d",
"metadata": {},
"outputs": [],
"source": [
"# Some use cases for the 6 popular database schema models :\n",
"#• Flat model: Best for small, simple applications.\n",
"#• Hierarchical model: For nested data, like XML or JSON.\n",
"#• Network model: Useful for mapping and spatial data, also for depicting workflows.\n",
"#• Relational model: Best reflects Object-Oriented Programming applications.\n",
"#• Star model: For analyzing large datasets whose dimension tables are a single level deep (not further normalized).\n",
"#• Snowflake model: For analyzing large and complex datasets whose dimensions are normalized into sub-dimension tables.\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "dde198d4",
"metadata": {},
"outputs": [],
"source": [
"# A data warehouse is a type of data management system that is designed to enable and support business intelligence (BI) activities, especially analytics.\n",
"# A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed for analytics applications.\n",
"# Metadata describes the data stored in the data lake, providing details such as its source, its structure, its meaning, its relationships with other data, and its usage."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "f70f1c07",
"metadata": {},
"outputs": [],
"source": [
"# Data swamps :\n",
"#• One of the biggest challenges is preventing a data lake from turning into a data swamp.\n",
"#• If it isn't set up and managed properly, the data lake can become a messy dumping ground for data.\n",
"#• Users may not find what they need, and data managers may lose track of data that's stored in the data lake, even as more pours in."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "42134a60",
"metadata": {},
"outputs": [],
"source": [
"# Data governance :\n",
"#• One of the purposes of a data lake is to store raw data as-is for various analytics uses.\n",
"#• But without effective governance of data lakes, organizations may be hit with data quality, consistency and reliability issues.\n",
"#• Those problems can hamper analytics applications and produce flawed results that lead to bad business decisions.\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "2ece68de",
"metadata": {},
"outputs": [],
"source": [
"# Data lakehouse : \n",
"# • A data lakehouse, as the name suggests, is a new data architecture that merges a data warehouse and a data lake into a single whole, with the purpose of addressing each one’s limitations.\n",
"# • In a nutshell, the lakehouse system leverages low-cost storage to keep large volumes of data in raw format, just like a data lake."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "3145a1c0",
"metadata": {},
"outputs": [],
"source": [
"# DataFrame :\n",
"# • A DataFrame is a data structure that organizes data into a 2-dimensional table of rows and columns, much like a spreadsheet.\n",
"# • DataFrames are one of the most common data structures used in modern data analytics because they are a flexible and intuitive way of storing and working with data.\n",
"# • Every DataFrame contains a blueprint, known as a schema, that defines the name and data type of each column.\n",
"# • In Python pandas, a DataFrame is a data structure constructed with rows and columns, similar to a database table or an Excel spreadsheet."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "8b7f4d1e",
"metadata": {},
"outputs": [],
"source": [
"# Types of datasets (based on the data type) :\n",
"#• Numerical datasets: Contain numbers and are used for quantitative analysis.\n",
"#• Text datasets: Contain posts, text messages, and documents.\n",
"#• Multimedia datasets: Contain images, videos, and audio files.\n",
"#• Time-series datasets: Contain data collected over time to analyze trends and patterns.\n",
"#• Spatial datasets: Contain geographically referenced information, such as GPS data.\n",
"\n",
"\n",
"# Types of datasets (based on the data structure) :\n",
"#• Structured datasets: Organized in specific structures to make it easier to query and analyze data.\n",
"#• Unstructured datasets: Don’t have a well-defined schema. They can include a variety of types of data.\n",
"#• Hybrid datasets: Include both structured and unstructured data.\n",
"\n",
"# Types of datasets (in statistics) : \n",
"#• Numerical datasets: Involve only numbers.\n",
"#• Bivariate datasets: Involve two data variables.\n",
"#• Multivariate datasets: Involve three or more data variables.\n",
"#• Categorical datasets: Consist of categorical variables that can take only a limited set of values.\n",
"#• Correlation datasets: Contain data variables that relate to each other.\n",
"\n",
"# Types of datasets (Machine learning) :\n",
"#• Datasets for training ML: Used to train the model.\n",
"#• Datasets for validation: Used to tune the model and detect overfitting during training, which helps it generalize.\n",
"#• Datasets for testing: Used for testing the final output of the model to confirm its accuracy."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "5512beb3",
"metadata": {},
"outputs": [],
"source": [
"# Data Source :\n",
"#• A data source is the physical or digital location where the data comes from in various forms.\n",
"#• The data source can be either the place where the data was originally created or the place where it was later added; the latter applies when data is digitized.\n",
"#• Data sources can be digital (for the most part) or paper-based.\n",
"#• The idea is to enable users to access and exploit the data from this source.\n",
"#• The data source can take different forms, such as a database, a flat file, an inventory table, web scraping, streaming data, physical archives, etc.\n",
"#• With the development of Big Data and new technologies, these different formats are constantly evolving, making data sources ever more complex.\n",
"#• The challenge for organisations is to simplify them as much as possible.\n",
"#• A data source is simply the source of the data.\n",
"#• It can be a file, a particular database on a DBMS, or even a live data feed.\n",
"#• The data might be located on the same computer as the program, or on another computer somewhere on a network.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "84f443d0",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
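
The notebook's machine-learning dataset types (training, validation, testing) can be illustrated with a small sketch. It is not part of the original notebook. The helper name `split_dataset`, the 70/15/15 proportions, and the fixed seed are all assumptions chosen for illustration, and only the Python standard library is used.

```python
import random


def split_dataset(rows, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle rows and split them into train, validation, and test lists.

    Hypothetical helper: the name and default fractions are illustrative.
    """
    if not 0 < train_frac < 1 or not 0 <= val_frac < 1 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a test set")

    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible

    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)

    train = shuffled[:n_train]                      # used to fit the model
    validation = shuffled[n_train:n_train + n_val]  # used to tune and check for overfitting
    test = shuffled[n_train + n_val:]               # held out for the final accuracy check
    return train, validation, test


# Example: 100 placeholder records standing in for a real dataset.
rows = list(range(100))
train, validation, test = split_dataset(rows)
print(len(train), len(validation), len(test))
```

Each record lands in exactly one of the three subsets, so the test set stays unseen until the final evaluation. A real project would more likely use a library routine for this, but the idea is the same.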