@Antara000
Created April 4, 2024 06:51
{
"cells": [
{
"cell_type": "markdown",
"id": "77a28a6c",
"metadata": {},
"source": [
"***What is a Data Source?***\n",
"\n",
"Imagine a data source as a big bucket where your information comes from. This bucket can be physical, like a box of files, or digital, like a computer database.\n",
"\n",
"1. There are many different types of data sources, including:\n",
"\n",
"+ Databases: Organized collections of data\n",
"+ Flat files: Simple spreadsheets or text files\n",
"+ Web scraping: Extracting data from websites\n",
"+ Sensors: Devices that collect physical measurements (temperature, humidity)\n",
"+ Simulations: Creating data through computer models\n",
"\n",
"***Understanding Data Types***\n",
"\n",
"1. Data comes in two main flavors:\n",
"\n",
"+ Qualitative (descriptive): Describes things and can't be easily measured with numbers (favorite color, gender)\n",
"+ Quantitative (numerical): Represented by numbers and can be used for calculations (weight, temperature)\n",
"\n",
"2. There are different subtypes within each category to further classify your data:\n",
"\n",
"+ Nominal: Labels with no order (hair color, blood type)\n",
"+ Ordinal: Ranked data with positions (customer satisfaction, letter grades)\n",
"+ Discrete: Whole numbers you can count (number of students, days in a week)\n",
"+ Continuous: Measurable data with decimals (height, weight, temperature)\n",
"\n",
"***Why Data Sources and Types Matter?***\n",
"\n",
"Knowing your data source and type is crucial for using it effectively. For example, you can't average qualitative data like favorite colors, but you can calculate the average weight of students (quantitative data).\n",
"\n",
"1. By understanding your data, you can:\n",
"\n",
"+ Analyze it properly\n",
"+ Draw meaningful conclusions\n",
"+ Make better decisions\n"
]
},
{
"cell_type": "markdown",
"id": "3c82aa0e",
"metadata": {},
"source": [
"===\n",
"\n",
"Data Source #1\n",
"\n",
"- A data source is the physical or digital location where the data comes from in various forms.\n",
"- The data source can be both the place where the data was originally created and the place where it was added, wherethe last is for data digitizing.\n",
"- Data sources can be digital (for the most part) or paper-based.\n",
"- The idea is to enable users to access and exploit the data fromthis source.\n",
"- The data source can take different forms, such as a database, a flat file, an inventory table, web scraping, streaming data, physical archives, etc.\n",
"- With the development of Big Data and new technologies, these different formats are constantly evolving, making data sources ever more complex.\n",
"- The challenge for organisations is to simplify them as much as possible.\n",
"\n",
"Data Source #2\n",
"\n",
"- A data source is simply the source of the data.\n",
"- It can be a file, a particular database on a DBMS, or even a live data feed.\n",
"- The data might be located on the same computer as the program, or on another computer somewhere on a network.\n",
"\n",
"Others:\n",
"- Sensors: raw data for physical data, e.g. temperature, humidity, light intensity, ect.\n",
"- Simulation: random data leading to a meaning after analysis, e.g. Monte Carlo simulation.\n",
"\n",
"Data Type\n",
"\n",
"Type of Data:\n",
"- There are two types of data: Qualitative and Quantitative data.\n",
"- They are further classified into four categories: Nominal data, Ordinal data, Discrete data, Continuous data.\n",
"\n",
"Qualitatvie or Categorical Data :\n",
"- Qualitative or Categorical Data is data that can’t be measured or counted in the form of numbers.\n",
"- These types of data are sorted by category, not by number.\n",
"- These data consist of audio, images, symbols, or text.\n",
"- The gender of a person, i.e., male, female, or others, is qualitative data.\n",
"- Qualitative data tells about the perception of people.\n",
"EX :\n",
"* what language do you speak\n",
"* Favorite Holiday destination\n",
"* Opinion on something (agree, disagree, or neutral)\n",
"* Colors\n",
"\n",
"\n",
"Nominal Data\n",
"- Nominal Data is used to label variables without any order or quantitative value.\n",
"- The color of hair can be considered nominal data, as one color can’t be compared with another color.\n",
"EX :\n",
"* Colour of hari (Blonde, red, brown, black, etc)\n",
"* Martial Status (Single, widowed, Married)\n",
"* Nationality (Indian, German, American)\n",
"* Gender (Male, Female, Others)\n",
"* Eye Color (black,Brown, Etc)\n",
"\n",
"Ordinal Data\n",
"- Ordinal data have natural ordering where a number is present in some kind of order by their position on the scale.\n",
"- These data are used for observation like customer satisfaction, happiness, etc., but we can’t do any arithmetical tasks on them.\n",
"- Ordinal data is qualitative data for which their values have some kind of relative position.\n",
"EX :\n",
"* Letter Grades in the exam (A, B, C, D, etc.)\n",
"* Ranking of people ia a competition (First, Second, Third, Etc)\n",
"* Economic Status (High, Medium, and Low)\n",
"* Educaation Level (higher, Secondary, Primary)\n",
"\n",
"Discrete Data\n",
"- Countable and finite\n",
"- Whole numbers or integers\n",
"- Represented mainly by bar graphs\n",
"- Values cannot be divided into subdivisions into smaller pieces\n",
"- Have spaces between the values\n",
"EX :\n",
"* Total students in a class\n",
"* number of days in a week\n",
"* size of a shoe\n",
"\n",
"Continuous Data\n",
"- Measurable\n",
"- Fractions or decimals\n",
"- Represented in the form of a histogram\n",
"- Values can be divided into subdivisions into smaller pieces\n",
"- In the form of a continuous sequence\n",
"EX :\n",
"* Temperature of room\n",
"* the weight of a person\n",
"* length of an object\n",
"\n",
"Notes\n",
"- Different types of data are used in research, analysis, statistical analysis, data visualization, and data science.\n",
"- Working on data is crucial because we need to figure out what kind of data it is and how to use it to get valuable output out of it.\n",
"- Working with data requires good data science skills and a deep understanding of different types of data and how to work with them."
]
},
{
"cell_type": "markdown",
"id": "4d6390bb",
"metadata": {},
"source": [
"===\n",
"***Database***\n",
"\n",
"A database is an organized collection of data stored as multiple datasets, typically electronically accessed from a computer system that allows the data to be easily accessed, manipulated, and updated. Databases can be relational, document-based, or key-value types.\n",
"\n",
"1. There are six popular schemas commonly used in database design:\n",
"\n",
"+ Flat Model: Best suited for small and simple applications where data is stored in a single table with all necessary information.\n",
"+ Hierarchical Model: Suitable for nested data such as XML or JSON, organized in a tree structure with a parent entity having multiple child entities.\n",
"+ Network Model: Useful for mapping and spatial data, as well as depicting workflow. Data is stored in a complex structure with complex relationships between entities.\n",
"+ Relational Model: Reflects Object-Oriented Programming applications well. Data is stored in related tables with foreign and primary keys.\n",
"+ Star Model: Used for analyzing large datasets with a one-dimensional nature, consisting of a fact table in the center connected to dimension tables.\n",
"+ Snowflake Model: Used for analyzing large and complex datasets. Similar to the star model, but dimension tables are divided into smaller tables to reduce redundancy and enhance normalization.\n",
"\n",
"***DataWarehouse***\n",
"\n",
"A Data Warehouse is a central repository for structured data processed from various sources for business analysis, while a Data Lake is a data storage allowing for storage of raw and structured data as well as unstructured data in large volumes. \n",
"Data Warehouses are suitable for business analysis requiring structured and integrated data, while Data Lakes are suitable for storing raw data in large volumes without prior data modeling.\n",
"\n",
"***Dataframe***\n",
"\n",
"A DataFrame is a two-dimensional tabular data structure used in programming and data analysis. Similar to a database table or spreadsheet, data is stored in rows and columns. DataFrames are commonly used in programming languages such as Python (pandas), R, and Spark for easy data manipulation and analysis, allowing users to perform various operations such as filtering, sorting, grouping, and joining data efficiently.\n",
"\n",
"1. There are several types of datasets based on the type of data they contain:\n",
"\n",
"+ Numerical Datasets: Contain numerical values for quantitative analysis.\n",
"+ Text Datasets: Contain text such as messages, documents, and other text content.\n",
"+ Multimedia Datasets: Contain images, videos, and audio for multimedia applications.\n",
"+ Time-Series Datasets: Data collected sequentially for trend and pattern analysis.\n",
"+ Spatial Datasets: Contain geographic information such as GPS data for spatial analysis.\n",
"\n",
"2. There are three main types of datasets based on data structure:\n",
"\n",
"+ Structured Datasets: Data organized in a specific structure for easy querying and analysis.\n",
"+ Unstructured Datasets: Data without well-defined schemas, encompassing various data types.\n",
"+ Hybrid Datasets: Combination of structured and unstructured data in one dataset.\n",
"\n",
"3. In statistics, there are several commonly used dataset types:\n",
"\n",
"+ Numerical Datasets: Numeric datasets consisting solely of numerical values.\n",
"+ Bivariate Datasets: Bivariate datasets involve two data variables.\n",
"+ Multivariate Datasets: Multivariate datasets involve three or more data variables.\n",
"+ Categorical Datasets: Categorical datasets consist of categorical variables with limited values.\n",
"+ Correlation Datasets: Correlation datasets contain interrelated data variables.\n",
"\n",
"4. In machine learning, there are three commonly used dataset types:\n",
"\n",
"+ Datasets for Training ML: Used to train machine learning models.\n",
"+ Datasets for Validation: Used to reduce overfitting and improve model accuracy.\n",
"+ Datasets for Testing: Used to test the final output of the model and ensure its accuracy.\n",
"\n",
"***Apache***\n",
"\n",
"Apache Spark is an open-source distributed processing system used for big data workloads. Apache Spark utilizes in-memory caching techniques and optimized query execution for fast querying of data at any scale. Simply put, Spark is a fast and general engine for processing data at large scale.\n",
"\n",
"1. Some key features of Apache Spark include:\n",
"\n",
"+ Fast Processing: Known for its speed in data processing, making it a choice for many organizations for big data workloads.\n",
"+ Flexibility: Apache Spark supports multiple programming languages such as Java, Scala, R, and Python, giving developers flexibility in writing applications.\n",
"+ In-Memory Processing: Spark stores data in server RAM, allowing for fast access and speeding up analysis.\n",
"\n",
"***DataSource***\n",
"\n",
"A data source refers to the places where data can be obtained for analysis. Data sources can come in various forms such as datasets, APIs, software, and data providers. The quality and reliability of a dataset depend greatly on the source from which the data is obtained. Understanding data sources is crucial in data analysis as it can affect the final outcomes of the data analysis process.\n",
"\n",
"1. Examples of data sources include:\n",
"\n",
"+ Public Health Data: Used to monitor disease spread and predict future threats.\n",
"+ Google Analytics: Used by most businesses to track website traffic and user behavior.\n",
"+ LinkedIn: Provides data on user behavior, job market trends, and professional connections.\n",
"\n",
"2. Data sources can be categorized based on the data structure they provide. There are three main types of data sources:\n",
"\n",
"+ Structured Data: Refers to data with a specific structure, often organized in tabular format, as found in relational databases.\n",
"+ Unstructured Data: Data that lacks well-organized structure, such as free-form text, images, or videos.\n",
"+ Semi-Structured Data: Data that has loosely defined structure, such as data in JSON or XML format.\n",
"\n",
"\n",
"There are two main types of data: Qualitative and Quantitative. Qualitative data is data that cannot be measured or counted in numerical form. It is sorted by category rather than by number and includes audio, images, symbols, or text. Quantitative data can be expressed in numerical values, making it countable and included in statistical data analysis. It can be represented in various graphs and charts, such as bar graphs, histograms, scatter plots, and others. Additionally, this data can be further categorized into Nominal, Ordinal, Discrete, and Continuous data. Nominal data is used to label variables without order or quantitative value, while Ordinal data has a natural order with values falling into some sort of order based on their position on a scale. Discrete data consists of values that are integers or whole numbers, while Continuous data is in fractional numbers and can be divided into smaller levels."
]
},
{
"cell_type": "markdown",
"id": "18ca1eb2",
"metadata": {},
"source": [
"<h2>Question</h2>"
]
},
{
"cell_type": "markdown",
"id": "52c22148",
"metadata": {},
"source": [
"1. What are ETL (extract, transform, load) between structured data and data warehouse? Explain in brief.\n",
"2. Why extract and load (EL) are separated from transform (T)?\n",
"- raw data → EL → data lake\n",
"- data lake → T → end use\n",
"3. What are batch and streaming data? What are the differences between these data?\n",
"- There is also ELT. What is the reason to use ELT instead of ETL?\n",
"- Explain about sensors as data sources."
]
},
{
"cell_type": "markdown",
"id": "932c0896",
"metadata": {},
"source": [
"<h2>Answer</h2>"
]
},
{
"cell_type": "markdown",
"id": "805d499a",
"metadata": {},
"source": [
"+ ETL (Extract, Transform, Load) is a process used to transfer data from external sources into a data warehouse. Extraction involves retrieving raw data from the source, transformation involves manipulating and processing data to fit requirements, while loading involves storing processed data into the data warehouse.\n",
"\n",
"+ Extract and Load (EL) are separated from Transform (T) to divide the data retrieval process from the data processing and transformation process. By separating these, you can manage dependencies between data retrieval and data transformation more efficiently.\n",
"\n",
"+ Batch data is processed in groups, taken from the source and processed at regular intervals. Streaming data, on the other hand, is continuously processed as it arrives, without having to wait for entire batches of data to be available. The main difference between them lies in how they are processed: batch processing periodically, while streaming is real-time.\n",
"\n",
"+ ELT (Extract, Load, Transform) is an approach where data is extracted from the source, loaded into the data warehouse, and then processed and transformed within the data warehouse itself. Reasons for using ELT instead of ETL may include the modern data warehouse's ability to handle and process large volumes of data, as well as the flexibility to perform complex data transformations directly within the data warehouse.\n",
"\n",
"+ Sensors are electronic devices used to detect and measure changes in the physical environment or behavior. As a data source, sensors can generate diverse data such as temperature, humidity, pressure, or motion, which can then be utilized for various applications such as environmental monitoring, security surveillance, or machine performance analysis."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e24d6cfd",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}