Data Engineer Take Home Test
Every month New York City releases data about trips taken in NYC taxis. A new ride sharing company is your client. They are looking to start up a rival service to Uber or Lyft in NYC and are interested in using the Taxi Data to get insight into the transportation sector in New York. They have tasked you to prepare the datasets and perform an initial exploratory analysis to get to know the data at hand.The client is launching a new ride sharing program in New York similar to Uber or Lyft.
You have three tasks in this project:
Write an ETL process able to do the following:
a. Download Taxi data from our own Google Cloud Storage bucket.
b. Transform the data (if applicable) in a way that Google BigQuery is able to ingest it.
c. Upload data to Google BigQuery.
d. Split the resulting table into data and geometries (data and geometries should be joinable by a common key)
a. Assess the quality of the data. Identify and document any issues with the data.
Write the SQL queries that help you answer the following questions:
a. What is the average fare per mile?
b. Which are the 10 pickup taxi zones with the highest average tip?
Both the taxi data and data dictionaries can be found in this Google Cloud bucket ( gs://data_eng_test/). You can find the following files:
- NYC taxi data: gs://data_eng_test/
- Data dictionary of the NYC taxi data: gs://data_eng_test/data_dictionary_trip_records_yellow.pdf
Feel free to use any libraries of your preference, but the programming language employed in the exercise should be Python. You can use any library, but an ETL system suited for automation and scheduling such as Apache Airflow is preferred. Results should be delivered either as a Github repo with the code and results (preferred option) or a zipped directory of code and results.
You can check here how to create and start using a Google BigQuery with a trial account.
Try to document your thought process as much as possible.
Create a brief report that summarizes the core features of your ETL. The audience for this should be other data engineers who need to understand the process you followed to build the ETL. Please try to highlight the following points:
- Document your ETL and how can this be scaled for larger data
- What issues you found in the data and how you would fix them