Assignment 8a and 8b
DATA:
A collection of raw and unorganized facts. May or may not be informative; may or may not be processed. Typically held in a database.
Metadata:
Data about data. Always informative; always processed. Typically held in a data dictionary.
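To make the distinction concrete, here is a minimal Python sketch; the fields and descriptions are invented for illustration. The rows are raw data, while the data dictionary describing each field is metadata.

```python
# Raw data: individual facts, not meaningful on their own.
data = [
    (1, "Alice", 3.5),
    (2, "Bob",   2.9),
]

# Metadata ("data about data"), kept as a simple data dictionary that
# documents what each field means. All names here are illustrative.
data_dictionary = {
    "id":   {"type": "integer", "description": "unique row identifier"},
    "name": {"type": "text",    "description": "student's given name"},
    "gpa":  {"type": "float",   "description": "grade point average, 0.0-4.0"},
}

for field, meta in data_dictionary.items():
    print(f"{field}: {meta['type']} - {meta['description']}")
```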
@@ Data To Information @@
-DATA:
•The term data is defined as a collection of individual facts or statistics (singular form: datum).
•Data can come in the form of text, observations, figures, images, numbers, graphs, or symbols.
•Data is a raw form of knowledge and, on its own, doesn’t carry any significance or purpose.
•Data can be simple, and may even seem useless until it is analyzed, organized, and interpreted.
-INFORMATION:
•The term information is defined as knowledge gained through study, communication, research, or instruction.
•Essentially, information is the result of analyzing and interpreting pieces of data.
•Whereas data is the individual figures, numbers, or graphs, information is the perception of those pieces of knowledge.
-Differences between them:
•Data is a collection of facts, while information puts those facts into context.
•While data is raw and unorganized, information is organized.
•Data points are individual and sometimes unrelated. Information maps out that data to provide a big-picture view of how it all fits together.
•Data, on its own, is meaningless. When it’s analyzed and interpreted, it becomes meaningful information.
@@ DATASET and DATABASE @@
Dataset, data, database:
•A dataset is a structured collection of data organized and stored together for analysis or processing; it can include many different types of data, from numerical values to text, images, or audio recordings.
•The data within a dataset can typically be accessed individually, in combination, or managed as a whole entity.
•A database (relational, document, or key-value type) is an organized collection of data stored as multiple datasets.
*SQL Database*
• Relational
• Analytical(OLAP)
*No SQL Database*
• Column-Family
• Graph
• Document
• Key-Value
-6 Database Schema Designs:
a. Flat File Model : Best for small, simple applications.
b. Hierarchical Model : For nested data, like XML or JSON.
c. Network Model : Useful for mapping and spatial data, also for depicting workflows.
d. Relational Model : Best reflects Object-Oriented Programming applications.
e. Star Schema : For analyzing large, one-dimensional datasets (a sketch follows this list).
f. Snowflake Schema : For analyzing large and complex datasets.
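To make designs e and f concrete, here is a minimal star-schema sketch using Python's built-in sqlite3; the table and column names are assumptions for the example, not from the notes. One central fact table references the surrounding dimension tables, which is what keeps analytical (OLAP) queries simple.

```python
# A minimal star schema in SQLite: a central fact table (fact_sales)
# referencing dimension tables (dim_date, dim_product). Illustrative names.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, month TEXT, year INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales  (
    sale_id    INTEGER PRIMARY KEY,
    date_id    INTEGER REFERENCES dim_date(date_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    quantity   INTEGER,
    revenue    REAL
);
""")
conn.execute("INSERT INTO dim_date VALUES (1, 'Jan', 2024)")
conn.execute("INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware')")
conn.execute("INSERT INTO fact_sales VALUES (1, 1, 1, 3, 29.97)")

# A typical analytical query joins the fact table to its dimensions:
for row in conn.execute("""
    SELECT p.category, d.year, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_date d    ON f.date_id = d.date_id
    JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY p.category, d.year
"""):
    print(row)   # ('Hardware', 2024, 29.97)
```

A snowflake schema would further normalize the dimension tables, e.g. splitting category out of dim_product into its own table.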
@@ QUESTION @@
1. What are ETL (extract, transform, load) between structured data and data warehouse? Explain in brief.
answer:
*Extract, transform, and load (ETL) is the process of combining data from multiple sources into a large, central repository called a data warehouse. ETL uses a set of business rules to clean and organize raw data and prepare it for storage, data analytics, and machine learning (ML). You can address specific business intelligence needs through data analytics (such as predicting the outcome of business decisions, generating reports and dashboards, reducing operational inefficiency, and more).*
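A toy ETL sketch in Python, with SQLite standing in for the data warehouse and made-up source records; real pipelines use dedicated tooling, but the shape is the same: business rules clean the data before it is loaded.

```python
# Toy ETL: extract raw records, transform them in flight, load the cleaned
# result into a warehouse table. All names and records are illustrative.
import sqlite3

raw_source = [                      # Extract: data as it arrives from a source
    {"name": " alice ", "amount": "100"},
    {"name": "BOB",     "amount": "250"},
]

def transform(record):
    # Business rules applied *before* loading: trim, normalize case, cast types.
    return (record["name"].strip().title(), int(record["amount"]))

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (customer TEXT, amount INTEGER)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?)",   # Load
                      (transform(r) for r in raw_source))

print(warehouse.execute("SELECT * FROM sales").fetchall())
# [('Alice', 100), ('Bob', 250)] -- only cleaned data reaches the warehouse
```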
2. Why are extract and load (EL) separated from transform (T)? (raw data → EL → data lake; data lake → T → end use)
answer:
There are several reasons why Extract, Load (EL) and Transform (T) are separated in the ETL (Extract, Transform, Load) process, even though they deal with data movement:
Modularity: Separating EL and T allows for independent development and maintenance. Changes to how data is extracted or loaded (e.g., a new data source) won't affect the transformation logic (e.g., data cleaning, formatting).
Reusability: The extracted data in the data lake can be used for various purposes beyond the initial transformation. Separating EL allows the same raw data to be transformed for different analytical needs.
Scalability: EL processes often deal with high-volume data ingestion. Separating it from transformation allows for independent scaling. You can optimize data extraction for speed while transformation focuses on accuracy.
Data Integrity: The data lake serves as a staging area for raw data. Keeping the raw data untouched in the lake ensures a reliable source for future transformations, even if the transformation logic changes.
Security: Separation allows different security controls at each stage. For example, access to the transformation logic might be more tightly restricted than access to the raw data in the lake.
Overall, separating EL and T promotes a more robust, flexible, and secure data pipeline.
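A minimal sketch of that separation, with a local JSON file standing in for the data lake (paths and field names are invented): EL lands the raw data untouched, and T is a later step that can be rerun or rewritten without re-extracting anything.

```python
# EL and T as independent steps; a directory plays the role of the data lake.
import json, pathlib

lake = pathlib.Path("lake")
lake.mkdir(exist_ok=True)

# --- EL: ingest raw data exactly as received, with no cleaning ---
raw = [{"temp_c": "21.7"}, {"temp_c": "bad-reading"}, {"temp_c": "19.2"}]
(lake / "sensor_2024-04-04.json").write_text(json.dumps(raw))

# --- T: a separate step that reads from the lake; changing this logic
# never touches the ingestion code or the raw file ---
records = json.loads((lake / "sensor_2024-04-04.json").read_text())
clean = [float(r["temp_c"]) for r in records
         if r["temp_c"].replace(".", "", 1).isdigit()]   # drop malformed readings
print(clean)   # [21.7, 19.2] -- the raw file in the lake stays intact
```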
3. What are batch and streaming data? What are the differences between these data?
answer:
Batch and streaming data are two ways data is handled based on how it arrives and is processed. Here's a breakdown of the key differences:
--Batch Data:
•Processing: Processed in large chunks, or batches, at specific intervals. Imagine waiting for a full basket of laundry before washing it all at once.
•Data size: Typically large and finite (known amount beforehand).
•Latency: Lower priority on real-time results. Processing can take time depending on the batch size.
•Use cases: Well-suited for historical analysis, reports (monthly sales figures), and non-real-time tasks like payroll processing.
--Streaming Data:
•Processing: Analyzed on-the-fly, in real-time or near real-time, as soon as it arrives. Think of watching a live stream where data unfolds continuously.
•Data size: Arrives continuously and is potentially unbounded; the total volume isn't known upfront.
•Latency: Critical for real-time insights and actions. Needs to be processed quickly with minimal delay.
•Use cases: Ideal for fraud detection, sensor data analysis, stock market monitoring, and other applications requiring immediate response.
*Here's an analogy: Batch data is like a daily newspaper, delivered all at once with a summary of past events. Streaming data is like a live news feed, constantly updating with the latest happenings.*
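The contrast fits in a few lines of Python; the event source is simulated and the numbers are arbitrary. The batch path waits for the whole finite dataset, while the streaming path updates its result as each event arrives.

```python
import time

events = [3, 1, 4, 1, 5]          # pretend these are sales amounts

# --- Batch: known, finite input processed in one go at a scheduled time ---
print("batch total:", sum(events))

# --- Streaming: potentially unbounded input, incremental low-latency results ---
def event_stream():
    for e in events:              # in reality this may never end (sensors, clicks)
        time.sleep(0.1)           # events trickle in over time
        yield e

running_total = 0
for e in event_stream():
    running_total += e
    print("running total so far:", running_total)   # react to each event at once
```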
4. There is also ELT. What is the reason to use ELT instead of ETL?
answer:
ELT, standing for Extract, Load, Transform, differs from ETL (Extract, Transform, Load) in the order of operations for data processing. Here's why you might choose ELT over ETL:
--Advantages of ELT:
•Speed: ELT can be faster, especially for large datasets. Since data is loaded directly into the data warehouse before transformation, there's no separate transformation step that can become a bottleneck.
•Flexibility: The raw data in the data warehouse allows for more flexible exploration and analysis. Analysts can define transformations on the fly without needing to modify the ETL pipeline.
•Scalability: Modern data warehouses are often designed to handle complex transformations. ELT can leverage this processing power to scale efficiently with growing data volumes.
•Unstructured data: ELT is better suited for handling unstructured data formats (like images, sensor data) that might require transformations within the data warehouse itself.
--When to consider ELT:
•You have a large data warehouse with strong processing capabilities.
•Real-time or near real-time analysis isn't a critical requirement.
•You need flexibility to explore and analyze data in various ways.
•Your data includes unstructured formats.
However, ELT also has drawbacks:
•Data quality: Since transformations happen after loading, identifying and cleaning bad data in the warehouse can be more complex.
•Complexity: Debugging transformation logic within the warehouse can be more challenging compared to a separate ETL process.
•Security: Depending on the implementation, ensuring data security during transformations within the warehouse might require additional considerations.
Choosing between ETL and ELT depends on your specific needs. If data quality, security, and clear lineage are top priorities, ETL might be preferable. If speed, flexibility, and handling large or unstructured data are more important, ELT could be a better fit.
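For contrast with the ETL sketch under question 1, here is a toy ELT version, again with SQLite standing in for the warehouse and invented records: raw strings are loaded first, and the transformation runs later, inside the warehouse, as SQL.

```python
# Toy ELT: load raw data into a staging table, then transform in-warehouse.
import sqlite3

warehouse = sqlite3.connect(":memory:")

# --- E + L: raw, untransformed data goes straight into staging ---
warehouse.execute("CREATE TABLE raw_sales (customer TEXT, amount TEXT)")
warehouse.executemany("INSERT INTO raw_sales VALUES (?, ?)",
                      [(" alice ", "100"), ("BOB", "250")])

# --- T: runs inside the warehouse and can be redefined at any time
# against the same raw table, without touching the ingestion step ---
warehouse.execute("""
    CREATE TABLE sales AS
    SELECT upper(trim(customer)) AS customer,
           CAST(amount AS INTEGER) AS amount
    FROM raw_sales
""")
print(warehouse.execute("SELECT * FROM sales").fetchall())
# [('ALICE', 100), ('BOB', 250)]
```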
5. Explain sensors as data sources.
answer:
Sensors are fantastic data sources for a wide variety of applications because they continuously measure and collect data about the physical world around us. They act like electronic eyes and ears, capturing real-time or near real-time data on various physical properties. Here's a breakdown of how sensors function as data sources:
Types of Sensors:
Sensors come in all shapes and sizes, measuring a vast array of phenomena. Some common types include:
•Temperature sensors: Measure heat or coolness (thermostats, weather stations)
•Pressure sensors: Track air, water, or gas pressure (car tires, industrial equipment)
•Image sensors: Capture visual data (cameras)
•Motion sensors: Detect movement (security systems, fitness trackers)
•Acoustic sensors: Pick up sound waves (microphones)
--Data Collected by Sensors:
The specific data a sensor collects depends on its type. Here are some examples:
•A temperature sensor might record temperature readings in degrees Celsius at specific intervals.
•A pressure sensor might track pressure fluctuations over time in units like pounds per square inch (psi).
•An image sensor might capture visual data as digital images at a certain resolution.
--Applications of Sensor Data:
Sensor data has a wide range of applications across various industries. Here are a few examples:
•Environmental monitoring: Track air quality, temperature, and humidity for pollution control or climate studies.
•Manufacturing: Monitor machine performance, identify potential equipment failures for predictive maintenance.
•Agriculture: Measure soil moisture, temperature, and light levels to optimize crop yields.
•Smart homes: Control thermostats, lighting, and security systems based on sensor data.
•Wearable technology: Track fitness metrics like heart rate, steps taken, and sleep patterns.
--Advantages of Sensor Data:
•Real-time or near real-time: Sensors provide continuous updates, allowing for immediate analysis and response.
•Granular data: Sensors can capture highly detailed measurements, providing a rich picture of the monitored environment.
•Remote monitoring: Sensors can be deployed in difficult-to-reach locations, enabling data collection from anywhere.
Extracting Data from Sensors:
The process of extracting data from sensors typically involves the following (a toy simulation follows this list):
•Sensor hardware: The physical sensor that detects the physical quantity of interest.
•Analog-to-digital converter (ADC): Converts the sensor's analog signal (e.g., voltage) into a digital format for processing.
•Data transmission: The digital data is transmitted via cables, wireless networks, or other methods to a data collection system.
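Here is that path simulated end to end; the reference voltage, bit depth, sensor range, and payload format are all assumptions made for the example.

```python
# Sensor -> ADC -> transmission, simulated end to end.
import json, random

V_REF = 3.3             # assumed ADC reference voltage
ADC_LEVELS = 2 ** 10    # assumed 10-bit converter -> 1024 discrete levels

def read_sensor():
    """Simulate the analog output of a temperature sensor, in volts."""
    return random.uniform(0.5, 1.5)

def adc(voltage):
    """Analog-to-digital conversion: map a voltage to an integer code."""
    return round(voltage / V_REF * (ADC_LEVELS - 1))

def transmit(code):
    """Package the digital reading for the network hop (here: JSON)."""
    return json.dumps({"sensor": "temp-01", "adc_code": code})

for _ in range(3):
    print(transmit(adc(read_sensor())))
```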
Overall, sensors are powerful tools for gathering valuable data from the physical world. This data plays a critical role in various applications, driving advancements in science, industry, and everyday life.
**SOURCE OF DATA**
*type, limitation, validation, etc.*
--Data source #1--
•A data source is the physical or digital location where the data comes from in various forms.
•The data source can be both the place where the data was originally created and the place where it was added; the latter applies to digitized data.
•Data sources can be digital (for the most part) or paper-based.
•The idea is to enable users to access and exploit the data from this source.
•The data source can take different forms, such as a database, a flat file, an inventory table, web scraping, streaming data, physical archives, etc.
•With the development of Big Data and new technologies, these different formats are constantly evolving, making data sources ever more complex.
•The challenge for organisations is to simplify them as much as possible.
--Data source #2--
•A data source is simply the source of the data.
•It can be a file, a particular database on a DBMS, or even a live data feed.
•The data might be located on the same computer as the program, or on another computer somewhere on a network.
--Others--
•Sensors: raw data about physical quantities, e.g. temperature, humidity, light intensity, etc.
•Simulation: random data leading to a meaning after analysis, e.g. Monte Carlo simulation (see the sketch below).
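A classic Monte Carlo sketch for the simulation bullet: each random point is meaningless on its own, but after analysis the collection yields a meaningful estimate of π.

```python
# Estimate pi from random points: the fraction landing inside the unit
# quarter-circle approaches pi/4 as the sample grows.
import random

N = 100_000
inside = sum(1 for _ in range(N)
             if random.random() ** 2 + random.random() ** 2 <= 1.0)
print("pi is approximately", 4 * inside / N)
```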
**--Types of data--**
•There are two types of data: Qualitative and Quantitative data.
•They are further classified into four categories: Nominal data, Ordinal data, Discrete data, Continuous data.
*--Qualitative or Categorical Data--*
•Qualitative or Categorical Data is data that can’t be measured or counted in the form of numbers.
•These types of data are sorted by category, not by number.
•These data consist of audio, images, symbols, or text.
•The gender of a person, i.e., male, female, or others, is qualitative data.
•Qualitative data tells about the perception of people.
*--Nominal data--*
•Nominal Data is used to label variables without any order or quantitative value.
•The color of hair can be considered nominal data, as one color can’t be compared with another color.
*--Ordinal data--*
•Ordinal data have natural ordering where a number is present in some kind of order by their position on the scale.
•These data are used for observation like customer satisfaction, happiness, etc., but we can’t do any arithmetical tasks on them.
•Ordinal data is qualitative data for which the values have some kind of relative position.
*--Quantitative data--*
•Quantitative data can be expressed in numerical values, making it countable and including statistical data analysis.
•These data can be represented on a wide variety of graphs and charts, such as bar graphs, histograms, scatter plots, boxplots, pie charts, line graphs, etc.
*--Discrete data--*
•The discrete data contain the values that fall under integers or whole numbers.
•The total number of students in a class is an example of discrete data.
•These data can’t be broken into decimal or fraction values.
*--Continuous data--*
•Continuous data are in the form of fractional numbers.
•It can be the version of an Android phone, the height of a person, the length of an object, etc.
•Continuous data represents information that can be divided into smaller levels.
•The continuous variable can take any value within a range.
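A small sketch mapping the four categories onto concrete values (reusing the notes' examples where possible); the ordered/unordered distinction separates ordinal from nominal data, and arithmetic only makes sense for the quantitative kinds.

```python
nominal    = ["black", "brown", "red"]        # labels only: no order, no arithmetic
ordinal    = ["unhappy", "neutral", "happy"]  # ordered positions, undefined gaps
discrete   = [28, 31, 30]                     # whole-number counts (students per class)
continuous = [172.4, 168.05, 181.2]           # any value in a range (height in cm)

# Order is meaningful for ordinal data even though arithmetic is not:
scale = {"unhappy": 0, "neutral": 1, "happy": 2}
print(sorted(ordinal, key=scale.get))         # ['unhappy', 'neutral', 'happy']

# Arithmetic is meaningful only for quantitative data:
print(sum(discrete))                          # total students across classes
print(sum(continuous) / len(continuous))      # mean height
```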
**NOTES**
•Different types of data are used in research, analysis, statistical analysis, data visualization, and data science.
•Working on data is crucial because we need to figure out what kind of data it is and how to use it to get valuable output out of it.
•Working with data requires good data science skills and a deep understanding of different types of data and how to work with them.