jmontess/ds_challenge_2023.md

## ds_challenge_2023.md

      
Data Science Challenge

The purpose of this challenge is to help us evaluate candidates for a role in our Data Science team. We only send this challenge to candidates who we feel have a solid background and could be a good fit for our team. We appreciate you taking the time to help ensure we are a good match for each other.
Tips

Include code, graphics and text in a combined output. Tell a story, and let us understand as clearly as possible your thoughts and analytical process.
Part 1: Experiment design

Background

As a ride-hailing company, Cabify operates as a two-sided marketplace that serves the needs of both riders (passengers) and drivers. Both parties are free to use our services or other apps, and both want a smooth experience in which as little time as possible is wasted waiting before the actual trip begins. This means that we need to be efficient at quickly finding drivers close to the requested start location.
Cabify's basic approach is similar to other ride-hailing services: a rider requests a journey through the app, and then the system calculates a price and sends an offer to the closest driver. If the driver accepts the offer, they pick up the rider and take them to their destination.
But sometimes, drivers may not accept offers due to a variety of reasons. For instance, they may find the price or location of the trip unsatisfactory, or they may not be able to provide the requested service. When this occurs, we must locate another driver, which can result in a delay for the rider and a potentially negative experience. Additionally, if the platform fails to find a new driver, the rider may ultimately have to cancel the trip. Riders may also cancel their request if they think the assigned driver is too far away.
Proposed solution

The product team came up with an alternative idea to let the supply and demand meet.
Instead of finding the closest car ourselves, we would present drivers with a list of all nearby journeys in their app, with information about their price and distance, and let them select one to their liking. We call this feature the trip selector. This way, we would avoid the risk of going through several drivers before finding an interested one. Here is a sample screenshot of what this new feature would look like in the driver's app:

The hypothesis is that we should be quicker in finding a driver willing to serve the journey, even if they are not the closest to the passenger. However, there is still the possibility that longer pick-up times will trigger more cancellations.
Challenge

Design a set of experiments to validate the benefits of the trip selector. When doing so, please take these aspects into account:

We are interested in understanding the impact of this new feature on our drivers, i.e. how/if it benefits them (earnings, idle times, etc.).
We are also interested in understanding the impact of this feature on the entire marketplace (number of trips we can serve, company earnings, etc.).
The trip selector is already implemented but not yet deployed. The engineering team has the technical capacity to activate it when and where it is required.
You can assume we can collect any information technically possible, like user actions in the app, prices, service times, car locations, etc.

With all this in mind, provide details about how you would run experiments to assess the impact of the trip selector on both drivers in particular and the marketplace in general. Make sure to clearly explain your methodology, your key metrics, and how you will ensure the experiments are valid and unbiased.
Part 2: Result analysis

Background

A ride-hailing app currently assigns new incoming trips to the closest available vehicle. The app identifies it by computing the Haversine distance between the pickup point and the available vehicles. We refer to this distance as linear.
However, the expected time to travel between two points in a city is not fully determined by the Haversine distance. Cities concentrate a large amount of transport infrastructure (roads, highways, bridges, tunnels) that is deployed to increase capacity and reduce average travel time. This heavy investment in infrastructure also means that straight-line ("as the crow flies") distance is a poor proxy for travel duration. The isochrones for travel time from a given location differ drastically from the perfect circle defined by straight-line distance, as we can see in this example from CDMX, where the blue area represents what is reachable with a 10 min drive.

In addition, travel times can be severely affected by traffic, accidents, road work, etc. So even if a driver is only 300 m away, they might need to drive for 10 min because of road work on a bridge.
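For reference, the linear (Haversine) distance described above can be sketched as follows. This is a minimal illustration, not the app's actual implementation: the function names, the mean Earth radius constant and the `vehicles` mapping are all assumptions introduced here.

```python
import math

# Assumption: mean Earth radius in meters, a common approximation.
EARTH_RADIUS_M = 6_371_000


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(min(1.0, a)))


def closest_vehicle(pickup, vehicles):
    """Return the id of the vehicle with the smallest Haversine distance.

    pickup:   (lat, lon) tuple in degrees
    vehicles: hypothetical mapping of vehicle id -> (lat, lon) in degrees
    """
    return min(vehicles, key=lambda vid: haversine_m(*pickup, *vehicles[vid]))
```

Note that this measure ignores the road network entirely, which is exactly the limitation discussed above: two vehicles at the same Haversine distance can have very different driving times to the pickup point.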
Proposal

In order to optimize operations, the engineering team has suggested they could query an external real-time maps API that not only considers roads but also knows real-time traffic information. We refer to this distance as road distance.
In principle, this assignment is more efficient and should outperform linear. However, the queries to the maps API have a certain cost (per query), and they increase the complexity and reduce the reliability of a critical system within the company. So the Data Science team has designed an experiment to help the engineering department decide.
Experimental design

The designed experiment is very simple. For a period of 5 days, all trips in 3 cities (Bravos, Pentos and Volantis) have been randomly assigned using linear or road distance:

Trips whose trip_id starts with digits 0-8 were assigned using road distance
Trips whose trip_id starts with digits 9-f were assigned using linear distance
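The assignment rule above can be recovered from the data with a small helper. This is a sketch: the function name and the group labels `"road"` and `"linear"` are chosen here for illustration, and `trip_id` is assumed to be a hexadecimal string, as in the example record.

```python
ROAD_PREFIXES = set("012345678")   # first hex digit 0-8 -> road distance
LINEAR_PREFIXES = set("9abcdef")   # first hex digit 9-f -> linear distance


def assignment_group(trip_id: str) -> str:
    """Return 'road' or 'linear' based on the first character of trip_id."""
    if not trip_id:
        raise ValueError("empty trip_id")
    first = trip_id[0].lower()
    if first in ROAD_PREFIXES:
        return "road"
    if first in LINEAR_PREFIXES:
        return "linear"
    raise ValueError(f"trip_id does not start with a hex digit: {trip_id!r}")
```

One thing worth keeping in mind: if the first hex digit is uniformly distributed (an assumption that should be checked against the data), this rule puts 9 of 16 prefixes in the road group and 7 of 16 in the linear group, so the expected split is about 56% / 44% rather than 50% / 50%.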

Data description

The collected data is available at this link. Each object represents a vehicle_interval that contains the following attributes:

type: can be going_to_pickup, waiting_for_rider or driving_to_destination
trip_id: uniquely identifies the trip
duration: how long the interval lasted, in seconds
distance: how far the vehicle moved in this interval, in meters
city_id: one of bravos, pentos or volantis
started_at: when the interval started, UTC Time
vehicle_id: uniquely identifies the vehicle
rider_id: uniquely identifies the rider

Example

{
  "duration": 857,
  "distance": 5384,
  "started_at": 1475499600.287,
  "trip_id": "c00cee6963e0dc66e50e271239426914",
  "vehicle_id": "52d38cf1a3240d5cbdcf730f2d9a47d6",
  "city_id": "pentos",
  "type": "driving_to_destination"
}
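A minimal sketch of how records like the one above might be loaded and summarized per trip. It assumes the data arrives as one JSON object per line and that `started_at` is a Unix timestamp in seconds; neither is stated in the description, so adjust to the actual file. All function names are hypothetical.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone


def load_intervals(lines):
    """Parse an iterable of JSON strings (assumed: one object per line)."""
    return [json.loads(line) for line in lines if line.strip()]


def started_at_utc(interval):
    """Convert started_at to an aware UTC datetime (assumed: epoch seconds)."""
    return datetime.fromtimestamp(interval["started_at"], tz=timezone.utc)


def durations_by_trip(intervals):
    """Sum duration in seconds per trip_id and interval type."""
    totals = defaultdict(lambda: defaultdict(float))
    for interval in intervals:
        totals[interval["trip_id"]][interval["type"]] += interval["duration"]
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {trip: dict(by_type) for trip, by_type in totals.items()}
```

Grouping by `trip_id` first makes it straightforward to compare, for example, the `going_to_pickup` duration between trips assigned with road distance and trips assigned with linear distance.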

Challenge

Try to answer the following questions:
Should the company move towards road distance? What is the maximum price it would make sense to pay per query? (Make all the assumptions you need, and make them explicit.)
How would you improve the experimental design? Would you collect any additional data?