@andrew-curthoys
andrew-curthoys / udemy_hadoop_training_course_notes.md
Last active January 23, 2023 01:57
Udemy Hadoop Training Course Notes

Udemy - "The Ultimate Hands-On Hadoop - Tame Your Big Data!" Course Notes

What is Hadoop? "Hadoop is an open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware" - Hortonworks

Features

  • Distributed storage: stores data across the hard drives of many machines and keeps backup copies, so a single drive failure does not lose data
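
The idea of spreading data across drives with backup copies can be illustrated with a toy model. This is only a sketch: the round-robin placement, the node names, and the helper functions `place_blocks` and `surviving_blocks` are assumptions made for illustration, and real HDFS uses a more involved, rack-aware placement policy.

```python
def place_blocks(data: bytes, block_size: int, nodes: list[str],
                 replication: int = 3) -> dict[int, list[str]]:
    """Split data into fixed-size blocks and assign each block to
    `replication` distinct nodes using simple round-robin (hypothetical policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    n_blocks = (len(data) + block_size - 1) // block_size  # ceiling division
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(n_blocks)
    }


def surviving_blocks(placement: dict[int, list[str]], failed_node: str) -> set[int]:
    """Return the ids of blocks that still have a copy on a healthy node."""
    return {
        block_id
        for block_id, holders in placement.items()
        if any(node != failed_node for node in holders)
    }


# Hypothetical four-node cluster storing 1000 bytes in 256-byte blocks.
nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_blocks(b"x" * 1000, block_size=256, nodes=nodes, replication=3)

for block_id, holders in placement.items():
    print(block_id, holders)

# Every block is still readable after losing one node.
print(surviving_blocks(placement, "node-a") == set(placement))
```

Because each block lives on three different nodes, losing any single node still leaves at least two copies of every block.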
andrew-curthoys / udemy_apache_spark_training_course_notes.md
Last active January 29, 2020 15:37
Udemy Apache Spark Course Notes

Udemy - "Taming Big Data with Apache Spark 3 and Python - Hands On!" Course Notes

Introduction to Spark

  • According to Apache, Spark is "a fast and general engine for large-scale data processing"
  • Because it runs on a cluster, it scales out by adding more machines
  • It has a built-in cluster manager, but it can also run on top of a Hadoop cluster, in which case YARN manages its resources
  • According to Apache, Spark can "run programs up to 100x faster than Hadoop
from pathlib import Path
from sxl import Workbook
from datetime import datetime, timedelta
# Get today's date (at midnight) and the end date, and initialize dictionaries
today = datetime.today().replace(hour=0, minute=0, second=0, microsecond=0)
end_date = today + timedelta(days=3)
sap_index = {}
data_dict = {}
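
The script above normalizes `today` to midnight before computing `end_date`. A minimal sketch of why that matters, assuming the dates it is later compared against are date-only values (midnight timestamps); the helpers `midnight` and `in_window` and the sample dates are hypothetical and not part of the original script:

```python
from datetime import datetime, timedelta


def midnight(moment: datetime) -> datetime:
    """Drop the time-of-day part, as the script does for `today`."""
    return moment.replace(hour=0, minute=0, second=0, microsecond=0)


def in_window(candidate: datetime, start: datetime, days: int = 3) -> bool:
    """Return True if candidate lies in [start, start + days], inclusive."""
    return start <= candidate <= start + timedelta(days=days)


# Fixed timestamp instead of datetime.today() so the result is reproducible.
now = datetime(2023, 1, 23, 14, 30)
start = midnight(now)
same_day_entry = datetime(2023, 1, 23)  # date-only value, i.e. midnight

print(in_window(same_day_entry, now))           # False: 00:00 is before 14:30
print(in_window(same_day_entry, start))         # True: window starts at 00:00
print(in_window(datetime(2023, 1, 27), start))  # False: past the 3-day window
```

Without the normalization, an entry dated today would fall outside the window whenever the script runs after midnight.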