@afrozhie
Created April 4, 2024 06:28
ipynb for issues 8 session 8
{
"cells": [
{
"cell_type": "markdown",
"id": "1dd63331-758b-4ac7-acc2-b2eb2cdeab84",
"metadata": {},
"source": [
"#### Resume Session 8\n",
"#### Firman Andrian\n",
"#### 21181094"
]
},
{
"cell_type": "markdown",
"id": "b7230ed9-632a-4299-b491-06b50330b9ec",
"metadata": {},
"source": [
"## Data Source\n",
"\n",
"A data source is the physical or digital location where the data comes from in various forms, can be both the place where the data was originally created and the place where it was added, where the last is for data digitizing. The data source can take different forms, such as a database, a flat file, an inventory table, web scraping, streaming data, \n",
"physical archives, etc. With the development of Big Data and new technologies, these different formats are constantly evolving, making data sources ever more complex."
]
},
{
"cell_type": "markdown",
"id": "95aa57d3-a38a-49fb-a677-e6385031867f",
"metadata": {},
"source": [
"## Data Type\n",
"\n",
"There are two types of data: Qualitative (Nominal data, Ordinal data)\n",
" and Quantitative (Discrete data, Continuous data)\n",
" \n",
"### Qualitative or Categorical Data\n",
" • is data that can’t be measured or counted in the form of numbers.\n",
" • These types of data are sorted by category, not by number.\n",
" • These data consist of audio, images, symbols, or text.\n",
" • The gender of a person, i.e., male, female, or others, is qualitative data.\n",
" • Qualitative data tells about the perception of people.\n",
"#### Nominal data\n",
"Example : Hair Colour (Blonde, Brown, Black), Gender ( Male, Female)\n",
" • Nominal Data is used to label variables without any order or \n",
"quantitative value.\n",
" • The color of hair can be considered nominal data, as one color can’t be compared with another color.\n",
"#### Ordinal data \n",
"Example : A, B, C, D, First, Second, Third, High, Medium, Low.\n",
" • Ordinal data have natural ordering where a number is present in some kind of order by their position on the scale.\n",
" • These data are used for observation like customer satisfaction, happiness, etc., but we can’t do any arithmetical tasks on them.\n",
" • Ordinal data is qualitative data for which their values have some kind of relative position\n",
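"\n",
"The nominal/ordinal distinction can be sketched in code. This is a minimal illustration, assuming pandas is installed; the variable names and sample values are made up for the example and do not come from the session material.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Nominal: labels with no inherent order (hypothetical sample values)\n",
"hair = pd.Categorical(['Brown', 'Black', 'Blonde', 'Brown'])\n",
"\n",
"# Ordinal: labels with a declared order, Low < Medium < High\n",
"level = pd.Categorical(\n",
"    ['High', 'Low', 'Medium', 'Low'],\n",
"    categories=['Low', 'Medium', 'High'],\n",
"    ordered=True,\n",
")\n",
"\n",
"print(hair.ordered)   # False: nominal categories cannot be ranked\n",
"print(level.ordered)  # True: ordinal categories can be ranked\n",
"print(level.min(), level.max())  # Low High\n",
"```\n",
"\n",
"Sorting or taking `min`/`max` is meaningful for the ordinal variable, but arithmetic such as a mean is still undefined for both.\n",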
"\n",
"### Quantitative data\n",
" • Quantitative data can be expressed in numerical values, making it countable and including statistical data analysis. \n",
" • These data can be represented on a wide variety of graphs and charts, such as bar graphs, histograms, scatter plots, boxplots, pie charts, line graphs, etc.\n",
"#### Discrete data\n",
"Example : Jumlah, Biaya, jumlah peserta\n",
" • The discrete data contain the values that fall under integers or whole numbers.\n",
" • The total number of students in a class is an example of discrete data.\n",
" • These data can’t be broken into decimal or fraction values.\n",
"#### Continuous data\n",
"Example : Tinggi seseorang, Kecepatan Mobil, Frekuensi wi-fi\n",
" • Continuous data are in the form of fractional numbers.\n",
" • It can be the version of an android phone, the height of a person, the length of an object, etc.\n",
" • Continuous data represents information that can be divided into smaller levels.\n",
" • The continuous variable can take any value within a range."
]
},
{
"cell_type": "markdown",
"id": "c357072a-b313-4077-a531-45e7862f1a3a",
"metadata": {},
"source": [
"## Data and data-something\n",
"### Data and information\n",
"#### Data\n",
"- The term data is defined as a collection of individual facts or statistics (singular form: datum).\n",
"- Data can come in the form of text, observations, figures, images, numbers, graphs, or symbols.\n",
"#### Information\n",
"- The term information is defined as knowledge gained through study, communication, research, or instruction.\n",
"- Essentially, information is the result of analyzing and interpreting pieces of data.\n",
"### Dataset and database\n",
"- The data are observations or measurements (unprocessed or \n",
"processed) represented as text, numbers, or multimedia.\n",
"- A dataset is a structured collection of data generally asso\n",
"ciated with a unique body of work.\n",
"- A database is an organized collection of data stored as multi\n",
"ple datasets, where those datasets are generally stored and \n",
"accessed electronically from a computer system that allows \n",
"the data to be easily accessed, manipulated, and updated.\n",
"#### Some use cases for the 6 popular schemas\n",
"- Flat model: Best model is for small, simple applications.\n",
"- Hierarchical model: For nested data, like XML or JSON.\n",
"- Network model: Useful for mapping and spatial data, also for \n",
"depicting workflows.\n",
"- Relational model: Best reflects Object-Oriented Programming \n",
"applications.\n",
"- Star model: For analyzing large, one-dimensional datasets.\n",
"- Snowflake model: For analyzing large and complex datasets.\n",
"### Data warehouse and data lake\n",
"#### Data warehouse\n",
"- A data warehouse is a type of data management system that \n",
"is designed to enable and support business intelligence (BI) \n",
"activities, especially analytics.\n",
"- Data warehouses are solely intended to perform queries and \n",
"analysis and often contain large amounts of historical data.\n",
"#### Data lake\n",
"- A data lake is a storage repository that holds a vast amount of \n",
"raw data in its native format until it is needed for analytics \n",
"applications.\n",
"- While a traditional data warehouse stores data in hierarchical \n",
"dimensions and tables, a data lake uses a flat architecture to \n",
"store data, primarily in files or object storage.\n",
"- That gives users more flexibility on data management, storage \n",
"and usage.\n",
"##### Challenges on Data Lake\n",
"- Data swamps, In a data swamp it is very \n",
"hard to find the data what \n",
"we need.\n",
"- Technology overload, Combination of available \n",
"technologies might \n",
"complicates deployments.\n",
"- Unexpected costs, Upfront technologies might \n",
"cost more than expected or \n",
"planned.\n",
"- Data governance, Storing raw data as-is still \n",
"requires effective governan\n",
"ce.\n",
"### Data lakehouse\n",
"- A data lakehouse, as the name suggests, is a new data archi\n",
"tecture that merges a data warehouse and a data lake into a \n",
"single whole, with the purpose of addressing each one’s \n",
"limitations.\n",
"- In a nutshell, the lakehouse system leverages low-cost storage \n",
"to keep large volumes of data in its raw formats just like data \n",
"lakes\n",
"##### Problems faced by data lake architecture\n",
"- Inconsistent Data Quality without Schema Enforcement\n",
"- Handling data today — combining batch and streaming data\n",
"- Overhead for time and money\n",
"### DataFrame\n",
"- A DataFrame is a data structure that organizes data into a 2\n",
"dimensional table of rows and columns, much like a spread\n",
"sheet.\n",
"- DataFrames are one of the most common data structures \n",
"used in modern data analytics because they are a flexible\n",
" and intuitive way of storing and working with data.\n",
"- Every DataFrame contains a blueprint, known as a schema, \n",
"that defines the name and data type of each column.\n",
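"\n",
"As a rough sketch of the points above, assuming pandas as the DataFrame library (the session notes do not name one) and using invented column names and values:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# A small 2-dimensional table: rows are records, columns are fields\n",
"df = pd.DataFrame({\n",
"    'name': ['Ana', 'Budi', 'Citra'],\n",
"    'age': [21, 23, 22],\n",
"    'height_m': [1.62, 1.75, 1.68],\n",
"})\n",
"\n",
"print(df.shape)   # (rows, columns)\n",
"print(df.dtypes)  # column names with their data types, i.e. the schema\n",
"```\n",
"\n",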
"### Dataset\n",
"- A dataset, or data set, is a collection of data related to a parti\n",
"cular topic, theme, or industry.\n",
"- Datasets include different types of information, such as num\n",
"bers, text, images, videos, and audio, and can be stored in \n",
"various formats, such as CSV, JSON, or SQL.\n",
"- So, a dataset typically involves structured data for a specific \n",
"purpose and is related to the same subject\n",
"#### Types of datasets (based on the data type)\n",
"- Numerical datasets: Contain numbers and are used for quan\n",
"titative analysis.\n",
"- Text datasets: Contain posts, text messages, and documents.\n",
"- Multimedia datasets: Contain images, videos, and audio files.\n",
"- Time-series datasets: Contain data collected over time to ana\n",
"lyze trends and patterns.\n",
"- Spatial dataset: Contain geographically referenced informa\n",
"tion, such as GPS data.\n",
"#### Types of datasets (based on the data structure)\n",
"- Structured datasets: Organized in specific structures to make \n",
"it easier to query and analyze data.\n",
"- Unstructured datasets: Don’t have a well-defined schema. \n",
"They can include a variety of types of data.\n",
"- Hybrid datasets: Include both structured and unstructured \n",
"data\n",
"#### Types of datasets (in statistics)\n",
"- Numerical datasets: Involve only numbers.\n",
"- Bivariate datasets: Involve two data variables.\n",
"- Multivariate datasets: Involve three or more data variables.\n",
"- Categorical datasets: Consist of categorical variables that can \n",
"take only a limited set of values.\n",
"- Correlation datasets: Contain data variables that relate to \n",
"each other\n",
"#### Types of datasets (Machine learning)\n",
"- Datasets for training ML: Used to train the model.\n",
"- Datasets for validation: Used to reduce overfitting and make \n",
"the model more accurate.\n",
"- Dataset for testing: Used for testing the final output of the \n",
"model to confirm its accuracy\n",
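"\n",
"The three roles can be illustrated with a simple split. This is only a sketch using the Python standard library; the 70/15/15 ratio, the seed, and the placeholder data are assumptions chosen for the example, not values from the session.\n",
"\n",
"```python\n",
"import random\n",
"\n",
"# Hypothetical dataset: 100 sample indices standing in for real records\n",
"samples = list(range(100))\n",
"\n",
"# Shuffle with a fixed seed so the split is reproducible\n",
"rng = random.Random(42)\n",
"rng.shuffle(samples)\n",
"\n",
"# Assumed 70/15/15 split, computed with integer arithmetic\n",
"n_train = len(samples) * 70 // 100\n",
"n_val = len(samples) * 15 // 100\n",
"\n",
"train = samples[:n_train]                      # used to fit the model\n",
"validation = samples[n_train:n_train + n_val]  # used to tune and check overfitting\n",
"test = samples[n_train + n_val:]               # held out for the final evaluation\n",
"\n",
"print(len(train), len(validation), len(test))  # 70 15 15\n",
"```\n",
"\n",
"The three subsets do not overlap, so the test set stays unseen until the final evaluation.\n",
"\n",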
"### Apache Spark\n",
"- Apache Spark is an open-source, distributed processing \n",
"system used for big data workloads.\n",
"- It utilizes in-memory caching and optimized query execution \n",
"for fast queries against data of any size.\n",
"- Feature : Fast Processing, Flexibility, In-memory Computing, Real time processing, Better analytics"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "16f89a86-0218-4ea2-b11b-682fd4cf08a0",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}