@gregakespret
Last active June 27, 2018 10:09
Data Scientist Assignment

Celtra Data Scientist Assignment

First of all, thank you for taking the time to do this assignment.

There are many possible ways to solve this data problem. Your solution will help us gain insight into how you think, what tools and technologies you like to use and how you use them. Hopefully, we may be able to learn something from you, as well :)

As you will notice, not every detail is clearly defined. You have the freedom to make your own choices where you see fit. But you can also ask questions, of course.

Please e-mail your solution to the person that gave it to you within the agreed time.

Description

We will give you the link to a gzipped file that contains an (anonymized) data set that we gathered when serving ads between 2015-04-16 10:00 and 2015-04-16 20:00 UTC. While the ads were live, several different events were tracked. For example:

  • adRequested, when the ad was requested
  • screenShown, at the time when some part of the ad was shown to the user
  • interaction | firstInteraction, when the user interacted with the ad
  • userError, when an error occurred
  • and many others that you won't need in this assignment

The provided dataset is just a sample of the original one. We did not include the entire set of data, to save you some time downloading it and computing results, but we still expect you to write a solution that would work efficiently even if the uncompressed file were as large as 50 GB.
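
One way to respect the 50 GB constraint is to stream the file line by line instead of loading it into memory. The following is only a minimal sketch: the function name `iter_events` is hypothetical, and skipping malformed lines is an assumption rather than a stated requirement.

```python
import gzip
import json
from typing import Iterator


def iter_events(path: str) -> Iterator[dict]:
    """Yield one event dict per line without loading the whole file."""
    # gzip.open in text mode decompresses lazily as lines are read.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # assumption: malformed lines are skipped
            if isinstance(event, dict):
                yield event
```

Because the function is a generator, downstream aggregations can consume events one at a time; memory use then depends on what is aggregated, not on the file size.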

Each line in the file is a JSON object representing one event. It can have the following attributes:

  • sessionId: Globally unique impression identifier. Use it to group/join different events together.
  • name: Name of the event. One of adRequested, screenShown, firstInteraction, userError, ...
  • clientTimestamp: Timestamp measured on user’s device.
  • timestamp: Timestamp measured on server.
  • purpose: Include only sessions with purpose “live”.
  • sdk: SDK in which the ad was trafficked.
  • objectClazz: Class of the object to which the event relates, e.g. an interaction on a Button, Video, ...
  • index: Sequential index of event. Note that some events are not indexed, but you can always use time if you want to order them.

Attributes name and timestamp are present for all events; sessionId is present for all events except conversions; other attributes are available only for certain events. You should discard sessions with no adRequested event.
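
The filtering rules above (only "live" sessions, only sessions with an adRequested event) can be combined with the engagement definition in a single pass. This is a sketch under stated assumptions: `engagement_rate` is a hypothetical name, and a session is treated as live if any of its events carries purpose "live", since the text says purpose is available only for certain events. It keeps session IDs in memory, so for much larger inputs a disk-backed or distributed grouping would be needed.

```python
from typing import Iterable

INTERACTION_NAMES = {"interaction", "firstInteraction"}


def engagement_rate(events: Iterable[dict]) -> float:
    """Share of valid sessions with at least one interaction event."""
    requested = set()  # sessions that have an adRequested event
    live = set()       # sessions with at least one purpose == "live" event
    engaged = set()    # sessions with an interaction or firstInteraction
    for event in events:
        session_id = event.get("sessionId")
        if session_id is None:
            continue  # e.g. conversions carry no sessionId
        name = event.get("name")
        if name == "adRequested":
            requested.add(session_id)
        elif name in INTERACTION_NAMES:
            engaged.add(session_id)
        if event.get("purpose") == "live":  # assumption: any event may mark it
            live.add(session_id)
    valid = requested & live
    if not valid:
        return float("nan")
    return len(engaged & valid) / len(valid)
```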

Objective

Your tasks are:

  1. Calculate the effective ad engagement rate, that is, the percentage of sessions that contain an interaction or firstInteraction event.
  2. We would like to know how much time users usually need to start interacting with the ad. This is how much time passes between the first screenShown and the first interaction in a session. Can you provide an estimate? You should use clientTimestamp for the calculation, but note that we occasionally encounter very skewed values. The end result should not be skewed by a few incorrectly measured client timestamps.
  3. We suspect the ad engagement rate has changed at some point in the day.
     3.1. Can you find that point and check whether the difference is statistically significant?
     3.2. Furthermore, can you check whether this change occurred only on some specific combination of attributes (e.g. only for impressions with one specific object or one specific sdk or some combination of object and sdk, ...)?
     3.3. Please write a (simple) algorithm that detects changes on some specific combination of attributes (3.2.) and provides some explanation.
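
For task 2, one simple way to keep a few bad client timestamps from skewing the result is to report a median and discard implausible gaps. The sketch below is only one possible approach: `median_time_to_interaction` and its `max_delta` cutoff are hypothetical, clientTimestamp is assumed to be numeric with a consistent unit, and the session filtering described earlier is omitted for brevity.

```python
import statistics
from typing import Iterable, Optional

INTERACTION_NAMES = {"interaction", "firstInteraction"}


def median_time_to_interaction(
    events: Iterable[dict], max_delta: float
) -> Optional[float]:
    """Median gap between first screenShown and first interaction."""
    first_shown = {}
    first_interaction = {}
    for event in events:
        session_id = event.get("sessionId")
        ts = event.get("clientTimestamp")
        if session_id is None or not isinstance(ts, (int, float)):
            continue
        name = event.get("name")
        if name == "screenShown":
            target = first_shown
        elif name in INTERACTION_NAMES:
            target = first_interaction
        else:
            continue
        # Keep the earliest timestamp seen for this session and event kind.
        if session_id not in target or ts < target[session_id]:
            target[session_id] = ts
    deltas = [
        first_interaction[s] - first_shown[s]
        for s in first_interaction
        if s in first_shown
    ]
    # Assumption: negative or very large gaps are measurement errors.
    plausible = [d for d in deltas if 0 <= d <= max_delta]
    return statistics.median(plausible) if plausible else None
```

The median alone is already fairly robust to outliers; the explicit cutoff mainly guards against a large share of corrupted values and should be chosen after looking at the distribution.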

For each task describe the methods you chose and the reason why you chose these methods over the others. Provide a nice human readable presentation of results, just like you would in a real case.
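
As an illustration of a method that is easy to describe for task 3.1, the sketch below scans candidate split points over time buckets and compares the engagement rate before and after each with a two-proportion z-test. The names `two_proportion_z` and `best_split` are hypothetical, and the bucketing (for example, sessions per 10 minutes of server timestamp) is an assumption. Because the split with the largest statistic is selected from many candidates, its p-value is optimistic; a correction or a permutation test would be needed before claiming significance.

```python
import math
from typing import Sequence, Tuple


def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for H0: both rates are equal."""
    pooled = (x1 + x2) / (n1 + n2)
    variance = pooled * (1 - pooled) * (1 / n1 + 1 / n2)
    if variance == 0:
        return 0.0, 1.0
    z = (x1 / n1 - x2 / n2) / math.sqrt(variance)
    # Two-sided tail probability of the standard normal distribution.
    return z, math.erfc(abs(z) / math.sqrt(2))


def best_split(buckets: Sequence[Tuple[int, int]]) -> Tuple[int, float, float]:
    """buckets: (sessions, engaged_sessions) per time bucket, in time order.

    Returns (index of the first bucket after the change, z, p).
    """
    total_n = sum(n for n, _ in buckets)
    total_x = sum(x for _, x in buckets)
    best = (0, 0.0, 1.0)
    n1 = x1 = 0
    for i, (n, x) in enumerate(buckets[:-1]):
        n1 += n
        x1 += x
        n2, x2 = total_n - n1, total_x - x1
        if n1 == 0 or n2 == 0:
            continue
        z, p = two_proportion_z(x1, n1, x2, n2)
        if abs(z) > abs(best[1]):
            best = (i + 1, z, p)
    return best
```

Running the same scan separately per sdk or objectClazz value would be one simple starting point for 3.2 and 3.3, again with the multiple-comparison caveat in mind.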
