andrew-curthoys

## Excel parser
from pathlib import Path
from sxl import Workbook
from datetime import datetime, timedelta

# Get today's date (at midnight) and the end date, and initialize dictionaries
today = datetime.today().replace(hour=0, minute=0, second=0, microsecond=0)
end_date = today + timedelta(days=3)
sap_index = {}
data_dict = {}
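
The preview above stops once the date window is set up. As a minimal sketch of how such a window could be used, assuming the parser goes on to filter rows by a date value, the hypothetical helper `in_window` below checks whether a date falls between `today` and `end_date`. The helper name and the sample dates are illustrative only and are not part of the original gist.

```python
from datetime import datetime, timedelta


def in_window(value, start, end):
    """Return True if value falls within [start, end], inclusive."""
    return start <= value <= end


# Same window construction as the gist: midnight today through three days out.
today = datetime.today().replace(hour=0, minute=0, second=0, microsecond=0)
end_date = today + timedelta(days=3)

# Hypothetical sample dates, standing in for values read from a spreadsheet.
samples = [
    today - timedelta(days=1),
    today + timedelta(days=2),
    today + timedelta(days=5),
]
upcoming = [d for d in samples if in_window(d, today, end_date)]
```

Truncating `today` to midnight is what makes the inclusive comparison behave predictably: a date-only cell parsed as midnight on the current day still counts as inside the window.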

## udemy_apache_spark_training_course_notes.md

      
Last active January 29, 2020 15:37

Udemy - "Taming Big Data with Apache Spark 3 and Python - Hands On!" Course Notes

Introduction to Spark


- According to Apache, Spark is "a fast and general engine for large-scale data processing".
- Since it runs on a cluster, it is very scalable.
- It has a built-in cluster manager, but it can also run on top of a Hadoop cluster, in which case it uses YARN.
- According to Apache, Spark can "run programs up to 100x faster than Hadoop


## udemy_hadoop_training_course_notes.md

      
Last active January 23, 2023 01:57

Udemy - "The Ultimate Hands-On Hadoop - Tame Your Big Data!" Course Notes

What is Hadoop? "Hadoop is an open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware" (Hortonworks)

Features

- Distributed storage: stores data across many hard drives and keeps backup copies
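
To make the "backup copies" idea concrete, here is a toy sketch in plain Python. It is not HDFS's actual placement policy: a file is cut into fixed-size blocks and each block is assigned to several distinct nodes, so losing one node still leaves a copy elsewhere. The function name `place_blocks`, the node names, and the round-robin placement are all assumptions made for illustration.

```python
def place_blocks(data, block_size, nodes, replicas=3):
    """Split data into blocks and assign each block to `replicas` distinct nodes.

    Toy round-robin placement for illustration only; real HDFS placement
    is rack-aware and more involved.
    """
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    placement = {}
    for index in range(len(blocks)):
        # Start at a different node for each block, then take the next ones in order.
        placement[index] = [nodes[(index + r) % len(nodes)] for r in range(replicas)]
    return blocks, placement


# Hypothetical four-node cluster storing a ten-byte "file" in four-byte blocks.
blocks, placement = place_blocks(
    b"abcdefghij", block_size=4, nodes=["n1", "n2", "n3", "n4"]
)
```

With three replicas per block, any single node can fail and every block is still readable from at least two other nodes.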