show dbs
Here is a list of some terms associated with Hadoop. You'll learn more about these terms and how they relate to Spark in the rest of the lesson.
- Hadoop - an ecosystem of tools for big data storage and data analysis. Hadoop is older than Spark but is still used by many companies. The major difference between Spark and Hadoop is how they use memory: Hadoop MapReduce writes intermediate results to disk, whereas Spark tries to keep data in memory whenever possible. This makes Spark faster for many use cases.
- Hadoop MapReduce - a system for processing and analyzing large data sets in parallel.
- Hadoop YARN - a resource manager that schedules jobs across a cluster. The manager keeps track of what computer resources are available and then assigns those resources to specific tasks.
- Hadoop Distributed File System (HDFS) - a big data storage system that splits data into chunks and stores the chunks across a cluster of computers.
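To make the MapReduce idea in the list above more concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce steps as a word count. It does not use Hadoop at all; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample lines are assumptions made for illustration. On a real cluster the map and reduce steps would run in parallel on different machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input, standing in for chunks of a large file.
sample_lines = [
    "spark keeps data in memory",
    "hadoop writes data to disk",
]
counts = word_count(sample_lines)
print(counts["data"])
```

Here the grouped pairs are held in memory; in Hadoop MapReduce the intermediate results are written to disk between steps, which is the difference from Spark noted above.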
As Hadoop matured, other tools were developed t
import requests
from bs4 import BeautifulSoup
from csv import writer
response = requests.get('http://codedemos.com/sampleblog/')
soup = BeautifulSoup(response.text, 'html.parser')
posts = soup.find_all(class_='post-preview')
Text Document -> Text pre-processing -> Text parsing & Exploratory Data Analysis -> Text Representation & Feature Engineering -> Modeling and/or Pattern Mining -> Evaluation & Deployment
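The first stages of the pipeline above (pre-processing, parsing, and text representation) could be sketched roughly as follows, using only the standard library. The stop-word set, the helper names, and the two sample documents are all assumptions chosen for illustration, and bag-of-words counting is just one of many possible text representations.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for this sketch.
STOP_WORDS = {"a", "an", "the", "is", "and", "of", "to"}


def preprocess(text):
    """Pre-processing: lowercase, tokenize on letters, drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def build_vocabulary(token_lists):
    """Collect the sorted set of distinct tokens across all documents."""
    return sorted({token for tokens in token_lists for token in tokens})


def to_bag_of_words(tokens, vocabulary):
    """Representation: count how often each vocabulary term occurs."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


# Hypothetical sample documents.
documents = [
    "The movie is great",
    "The plot is a mess and the acting is a mess",
]
token_lists = [preprocess(doc) for doc in documents]
vocabulary = build_vocabulary(token_lists)
vectors = [to_bag_of_words(tokens, vocabulary) for tokens in token_lists]

print(vocabulary)
print(vectors)
```

The resulting count vectors are the kind of input a modeling step would consume; the later stages (modeling, evaluation, deployment) are outside the scope of this sketch.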
- Machine Translation
- Speech Recognition
- Sentiment Analysis
- pyspark.sql module
- pyspark.streaming module
- pyspark.ml package
- pyspark.mllib package
- pyspark.sql.SparkSession: Main entry point for DataFrame and SQL functionality.
- pyspark.sql.DataFrame: A distributed collection of data grouped into named columns.
- pyspark.sql.Column: A column expression in a DataFrame.
• anecdotal evidence: Evidence, often personal, that is collected casually rather than by a well-designed study.
• population: A group we are interested in studying. “Population” often refers to a group of people, but the term is used for other subjects, too.
• cross-sectional study: A study that collects data about a population at a particular point in time.
• cycle: In a repeated cross-sectional study, each repetition of the study is called a cycle.