Created
April 4, 2024 05:53
Data And Data Source
{ | |
"cells": [ | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"id": "161f1a89", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Data\n", | |
"# The term data is defined as a collection of individual facts or statistics (singular form: datum). Data can come in the form of text, observations, figures, images, numbers, graphs, or symbols." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"id": "9c20d908", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Information\n", | |
"# The term information is defined as knowledge gained through study, communication, research, or instruction." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"id": "9b2b4f90", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# The difference between data and information :\n", | |
"# Data is a collection of facts, while information puts those facts into context.\n", | |
"# Data points are individual and sometimes unrelated. Information maps out that data to provide a big-picture view of how it all fits together.\n", | |
"# Data typically comes in the form of graphs, numbers, figures, or statistics, while information is typically presented through words, language, thoughts, and ideas." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"id": "6a9137a4", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# A dataset is a structured collection of data generally associated with a unique body of work.\n", | |
"# A database is an organized collection of data stored as multiple datasets. Those datasets are generally stored electronically in a computer system that allows the data to be easily accessed, manipulated, and updated." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"id": "f0b7e22d", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Some use cases for six popular database schema models :\n", | |
"#• Flat model: Best for small, simple applications.\n", | |
"#• Hierarchical model: For nested data, like XML or JSON.\n", | |
"#• Network model: Useful for mapping and spatial data, also for depicting workflows.\n", | |
"#• Relational model: Best reflects Object-Oriented Programming applications.\n", | |
"#• Star model: For analyzing large, one-dimensional datasets.\n", | |
"#• Snowflake model: For analyzing large and complex datasets.\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"id": "dde198d4", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# A data warehouse is a type of data management system that is designed to enable and support business intelligence (BI) activities, especially analytics.\n", | |
"# A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed for analytics applications.\n", | |
"# Metadata describes the data stored in the data lake, providing details such as its source, its structure, its meaning, its relationships with other data, and its usage." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"id": "f70f1c07", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Data swamps :\n", | |
"#• One of the biggest challenges is preventing a data lake from turning into a data swamp.\n", | |
"#• If it isn't set up and managed properly, the data lake can become a messy dumping ground for data.\n", | |
"#• Users may not find what they need, and data managers may lose track of data that's stored in the data lake, even as more pours in." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"id": "42134a60", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Data governance :\n", | |
"#• One of the purposes of a data lake is to store raw data as-is for various analytics uses.\n", | |
"#• But without effective governance of data lakes, organizations may be hit with data quality, consistency and reliability issues.\n", | |
"#• Those problems can hamper analytics applications and produce flawed results that lead to bad business decisions.\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"id": "2ece68de", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Data lakehouse :\n", | |
"# • A data lakehouse, as the name suggests, is a data architecture that merges a data warehouse and a data lake into a single whole, with the purpose of addressing the limitations of each.\n", | |
"# • In a nutshell, the lakehouse system leverages low-cost storage to keep large volumes of data in raw format, just like a data lake." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"id": "3145a1c0", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# DataFrame :\n", | |
"# • A DataFrame is a data structure that organizes data into a 2-dimensional table of rows and columns, much like a spreadsheet.\n", | |
"# • DataFrames are one of the most common data structures used in modern data analytics because they are a flexible and intuitive way of storing and working with data.\n", | |
"# • Every DataFrame contains a blueprint, known as a schema, that defines the name and data type of each column.\n", | |
"# • In Python Pandas, a DataFrame is a data structure constructed with rows and columns, similar to a database table or an Excel spreadsheet." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"id": "8b7f4d1e", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Types of datasets (based on the data type) :\n", | |
"#• Numerical datasets: Contain numbers and are used for quantitative analysis.\n", | |
"#• Text datasets: Contain posts, text messages, and documents.\n", | |
"#• Multimedia datasets: Contain images, videos, and audio files.\n", | |
"#• Time-series datasets: Contain data collected over time to analyze trends and patterns.\n", | |
"#• Spatial datasets: Contain geographically referenced information, such as GPS data.\n", | |
"\n", | |
"\n", | |
"# Types of datasets (based on the data structure) :\n", | |
"#• Structured datasets: Organized in specific structures to make it easier to query and analyze data.\n", | |
"#• Unstructured datasets: Don't have a well-defined schema. They can include a variety of types of data.\n", | |
"#• Hybrid datasets: Include both structured and unstructured data.\n", | |
"\n", | |
"# Types of datasets (in statistics) :\n", | |
"#• Numerical datasets: Involve only numbers.\n", | |
"#• Bivariate datasets: Involve two data variables.\n", | |
"#• Multivariate datasets: Involve three or more data variables.\n", | |
"#• Categorical datasets: Consist of categorical variables that can take only a limited set of values.\n", | |
"#• Correlation datasets: Contain data variables that relate to each other.\n", | |
"\n", | |
"# Types of datasets (in machine learning) :\n", | |
"#• Datasets for training: Used to train the model.\n", | |
"#• Datasets for validation: Used during training to tune the model and detect overfitting.\n", | |
"#• Datasets for testing: Used to evaluate the final model on unseen data and confirm its accuracy." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"id": "5512beb3", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Data Source :\n", | |
"#• A data source is the physical or digital location where the data comes from in various forms.\n", | |
"#• The data source can be either the place where the data was originally created or the place where it was later added; the latter applies when data is digitized.\n", | |
"#• Data sources can be digital (for the most part) or paper-based.\n", | |
"#• The idea is to enable users to access and exploit the data from this source.\n", | |
"#• The data source can take different forms, such as a database, a flat file, an inventory table, web scraping, streaming data, physical archives, etc.\n", | |
"#• With the development of Big Data and new technologies, these different formats are constantly evolving, making data sources ever more complex.\n", | |
"#• The challenge for organisations is to simplify them as much as possible.\n", | |
"#• A data source is simply the source of the data.\n", | |
"#• It can be a file, a particular database on a DBMS, or even a live data feed.\n", | |
"#• The data might be located on the same computer as the program, or on another computer somewhere on a network.\n", | |
"\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "84f443d0", | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3 (ipykernel)", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.11.4" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 5 | |
} |
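The notes on data sources, databases, and the difference between data and information can be illustrated with a short sketch. It is not part of the original notebook. It assumes a made-up set of temperature readings (the column names `city` and `temperature_c` and all values are hypothetical), reads them from a flat-file source (CSV text standing in for a file on disk), loads the same records into an in-memory SQLite database, and then reduces the individual data points to one piece of information, the average. Only the Python standard library is used.

```python
import csv
import io
import sqlite3

# Hypothetical sample data: column names and values are invented for illustration.
CSV_TEXT = "city,temperature_c\nOslo,4.5\nCairo,30.0\nLima,19.5\n"


def rows_from_flat_file(text):
    """Read records from a flat-file data source (CSV text stands in for a file)."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["city"], float(row["temperature_c"])) for row in reader]


def rows_from_database(rows):
    """Load the records into an in-memory SQLite database and read them back."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE readings (city TEXT, temperature_c REAL)")
        conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
        return conn.execute(
            "SELECT city, temperature_c FROM readings ORDER BY city"
        ).fetchall()
    finally:
        conn.close()


def average_temperature(rows):
    """Turn individual data points into one piece of information: the mean."""
    return sum(temp for _, temp in rows) / len(rows)


file_rows = rows_from_flat_file(CSV_TEXT)  # data source 1: a flat file
db_rows = rows_from_database(file_rows)    # data source 2: a database
print(db_rows)
print(average_temperature(db_rows))
```

The same records are reachable from two different data sources, and the program that consumes them does not need to care which one it used. The final average is the "information" step: a single figure that puts the individual readings into context.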