- According to Apache, Spark is "a fast and general engine for large-scale data processing"
- Since it runs on a cluster, it is very scalable
- It has a built in cluster manager, but it can also run on top of a Hadoop cluster, which would then use YARN
- According to Apache, Spark can "run programs up to 100x faster than Hadoop
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
from pathlib import Path | |
from sxl import Workbook | |
from datetime import datetime | |
from datetime import timedelta | |
# Get today's date, end date, and initilize dictionaries | |
today = datetime.today().replace(hour=0, minute=0, second=0, microsecond=0) | |
end_date = today + timedelta(days=3) | |
sap_index = {} | |
data_dict = {} |
What is Hadoop? "Hadoop is an open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware" - Hortonworks
- Distributed storage: stores data across many hard drives & has backup copies