CARTO Data Engineering test (Data team)

Data Engineer Take Home Test

Every month, New York City releases data about trips taken in NYC taxis. Your client is a new ride-sharing company looking to launch a rival service to Uber or Lyft in NYC, and they are interested in using the taxi data to gain insight into the transportation sector in New York. They have tasked you with preparing the datasets and performing an initial exploratory analysis to get to know the data at hand.

You have three tasks in this project:

  1. Write an ETL process able to do the following (a minimal sketch follows this list):

    a. Download Taxi data from our own Google Cloud Storage bucket.

    b. Transform the data (if applicable) in a way that Google BigQuery is able to ingest it.

    c. Upload data to Google BigQuery.

    d. Split the resulting table into data and geometries (the data and geometry tables should be joinable by a common key).

  2. Data quality:

    a. Assess the quality of the data. Identify and document any issues with the data (see the probes sketched after this list).

  3. Write the SQL queries that help you answer the following questions (sketched after this list):

    a. What is the average fare per mile?

    b. Which are the 10 pickup taxi zones with the highest average tip?
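
For task 1, a minimal ETL sketch in Python is shown below. Everything beyond the bucket name is an assumption: it presumes the bucket holds CSV trip files, that a BigQuery dataset named `taxi` already exists, and that the raw table carries geometry columns named `pickup_geom` and `dropoff_geom` — all of which must be verified against the actual files and data dictionary.

```python
# Minimal ETL sketch for task 1. Assumptions (not given in the brief):
# the bucket holds CSV trip files, a BigQuery dataset named "taxi" already
# exists, and the raw table has geometry columns pickup_geom/dropoff_geom.
from google.cloud import bigquery, storage

BUCKET = "data_eng_test"
DATASET = "taxi"  # assumed to exist


def download_files(prefix: str = "") -> list[str]:
    """Download every CSV object under the prefix from GCS to local disk."""
    client = storage.Client()
    paths = []
    for blob in client.list_blobs(BUCKET, prefix=prefix):
        if not blob.name.endswith(".csv"):
            continue  # skip e.g. the PDF data dictionary
        local_path = blob.name.replace("/", "_")
        blob.download_to_filename(local_path)
        paths.append(local_path)
    return paths


def load_to_bigquery(path: str, table: str = "trips_raw") -> None:
    """Load one local CSV into BigQuery, letting it autodetect the schema."""
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition="WRITE_APPEND",
    )
    with open(path, "rb") as f:
        client.load_table_from_file(
            f, f"{DATASET}.{table}", job_config=job_config
        ).result()


def split_data_and_geometries() -> None:
    """Split the raw table into a data table and a geometry table that
    share a surrogate key, so they stay joinable."""
    client = bigquery.Client()
    statements = [
        # Add a stable key once, then derive both tables from the keyed
        # copy so the key values match across them.
        f"""CREATE OR REPLACE TABLE {DATASET}.trips_keyed AS
            SELECT GENERATE_UUID() AS trip_id, * FROM {DATASET}.trips_raw""",
        f"""CREATE OR REPLACE TABLE {DATASET}.trips_data AS
            SELECT * EXCEPT (pickup_geom, dropoff_geom)
            FROM {DATASET}.trips_keyed""",
        f"""CREATE OR REPLACE TABLE {DATASET}.trips_geom AS
            SELECT trip_id, pickup_geom, dropoff_geom
            FROM {DATASET}.trips_keyed""",
    ]
    for sql in statements:
        client.query(sql).result()
```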
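
For task 2, a few illustrative data-quality probes follow, run against the hypothetical `taxi.trips_data` table from the sketch above. The column names (`fare_amount`, `trip_distance`, `tpep_pickup_datetime`, …) follow the TLC yellow-taxi data dictionary and are assumptions until checked against the actual schema.

```python
# Illustrative data-quality probes for task 2. Column names follow the
# TLC yellow-taxi data dictionary and should be verified against the
# actual schema produced by the ETL.
from google.cloud import bigquery

CHECKS = {
    "negative_fares": "fare_amount < 0",
    "zero_or_negative_distance": "trip_distance <= 0",
    "missing_pickup_time": "tpep_pickup_datetime IS NULL",
    "dropoff_before_pickup": "tpep_dropoff_datetime < tpep_pickup_datetime",
}


def run_checks(table: str = "taxi.trips_data") -> None:
    """Count offending rows for each predicate and print a small report."""
    client = bigquery.Client()
    for name, predicate in CHECKS.items():
        sql = f"SELECT COUNT(*) AS n FROM {table} WHERE {predicate}"
        n = next(iter(client.query(sql).result())).n
        print(f"{name}: {n} rows")
```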
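
For task 3, hedged sketches of the two queries are given below as Python strings ready to pass to the BigQuery client. The table names (`taxi.trips_data`, `taxi.taxi_zones`) and columns (`fare_amount`, `trip_distance`, `tip_amount`, `PULocationID`, `zone`) are assumptions based on the standard TLC schema and the hypothetical ETL output above.

```python
# Sketches of the two analysis queries for task 3. All table and column
# names are assumptions to verify against the ETL output.

# 3a. Average fare per mile. SUM/SUM gives a distance-weighted average;
# AVG(fare_amount / trip_distance) would instead weight every trip equally.
AVG_FARE_PER_MILE = """
SELECT SUM(fare_amount) / SUM(trip_distance) AS avg_fare_per_mile
FROM taxi.trips_data
WHERE trip_distance > 0 AND fare_amount >= 0
"""

# 3b. Ten pickup zones with the highest average tip.
TOP_TIP_ZONES = """
SELECT z.zone, AVG(t.tip_amount) AS avg_tip
FROM taxi.trips_data AS t
JOIN taxi.taxi_zones AS z
  ON t.PULocationID = z.LocationID
GROUP BY z.zone
ORDER BY avg_tip DESC
LIMIT 10
"""
```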

Data:

Both the taxi data and its data dictionary can be found in the Google Cloud Storage bucket gs://data_eng_test/, which contains the following files:

  • NYC taxi data: gs://data_eng_test/
  • Data dictionary of the NYC taxi data: gs://data_eng_test/data_dictionary_trip_records_yellow.pdf

Guidelines:

The programming language employed in the exercise should be Python, but feel free to use any libraries you prefer. An ETL system suited for automation and scheduling, such as Apache Airflow, is preferred (see the DAG skeleton below). Results should be delivered either as a GitHub repo with the code and results (the preferred option) or as a zipped directory of code and results.
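
For instance, a minimal Airflow DAG skeleton could wire the ETL steps sketched earlier into a monthly schedule. The `etl_tasks` module is hypothetical and stands in for wherever the helper functions are actually defined.

```python
# Minimal Airflow DAG skeleton scheduling the ETL monthly. The etl_tasks
# module is hypothetical: it stands in for wherever the helpers from the
# earlier sketch live.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from etl_tasks import (  # hypothetical module
    download_files,
    load_to_bigquery,
    split_data_and_geometries,
)


def _extract_and_load() -> None:
    """Download every file and append it to the raw BigQuery table."""
    for path in download_files():
        load_to_bigquery(path)


with DAG(
    dag_id="nyc_taxi_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    extract_load = PythonOperator(
        task_id="extract_and_load", python_callable=_extract_and_load
    )
    split = PythonOperator(
        task_id="split_data_and_geometries",
        python_callable=split_data_and_geometries,
    )

    extract_load >> split
```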

You can check here how to create and start using a Google BigQuery project with a trial account.

Try to document your thought process as much as possible.

Create a brief report that summarizes the core features of your ETL. The audience for this should be other data engineers who need to understand the process you followed to build the ETL. Please try to highlight the following points:

  • Document your ETL and how it can be scaled to larger data volumes
  • What issues you found in the data and how you would fix them