Skip to content

Instantly share code, notes, and snippets.

@victorddiniz
Created January 17, 2017 08:55
Show Gist options
  • Save victorddiniz/1bb90eb8d96e676b6074988b54ee25ce to your computer and use it in GitHub Desktop.
Save victorddiniz/1bb90eb8d96e676b6074988b54ee25ce to your computer and use it in GitHub Desktop.
Campaign Visits Analytics

When working with data storage and manipulation, it is necessary to answer three questions: where is the data going to be stored, how is the data stored and why those decisions are proper for the data manipulation.

To decide where the data is going to be stored, it is necessary to think what is the purpose of the database. As the main application for this database is to analyze campaign visits through its records in a fast way, it is reasonable to choose a database that can compute the queries fast. For that, a key-value storage would be very efficient as, first, the inputs are already in some sort of tuple, and second because with the right schema, all the data needed can be accessed with only one key, as it will be shown with the proposal schema. Redis is the choice in this case because it is very simple to setup and is the leading of the Key-Value storages in the Market.

The first module in the schema is the Tailer, this module will simply decide, depending on the data's type, what is going to happen with it. This module has three actions, one for each of the types on input. The first is the AdAction tuple (userId, action, adId, campaignId, timestamp). When the data is of that type, then a new value defined by (adId, campaignId, action) is added in the list of actions in key (userId, day) inside the AdActions Table, where day is calculated from timestamp value. The meaning of this is to say that at that day this user had an interaction of type action with the ad adId related to the campaign campaignId. Second type of data is the GeoLocation tuple (name, latitude, longitude, adId, campaignId). When this type is seem by the Tailer, then a new key (adId, campaignId) is created and stored with value (name, latitude, longitude) in the GeoLocation Table. The meaning of this key-value is that a certain advertisement adId for the campaign campaignId has the geolocation coordenates (latitude, longitude) named name. These two types are important to speed up the queries which will be explained shortly.

The third and last type is the LocationRecord tuple (userId, latitude, longitude, timestamp). This tuple will not be stored directly but is the trigger responsable to generate the data to answer the queries. With the userId and the timestamp, it is possible to create the tuple (userId, day), using the timestamp to generate the day, and then it is possible to retrieve all the actions a certain user had, if any, that day from the AdActions Table. Now with the list of tuples (adId, campaignId, action). For each of this tuples, it is going to be retrieved the coordinates of the tuple (adId, campaignId) from the GeoLocation Table and then a new data is going to be inserted in CampaignLog Table. This table will have as key a tuple (campaignId, action, day) and as value a set of userId meaning that for that campaign in that day that set of users had that action in the campaign. So, for each tuple (adId, campaignId, action) retrieved from AdAction Table, and acessing the campaign location through the GeoLocation Table, it is checked in the module Observer if the LocationRecord tuple from input has the coordinates inside 100m radius of the campaign's coordinates, and if that is true, a new user is add to the set in key (campaignId, action, day).

A simple draw of this schema can be seem here.

To answer the queries proposed in the test is very simple now. To retrieve the the number of visits that were generated through clicks/views per campaign per day is to simple do a query to the CampaignLog Table with tuple (campaignId, action, day) which will retrieve all unique users that have visited that campaign in that day after that action. Furthermore, to get informaiton for either clicks or views then it is just the union of both sets retrieved with two separete queries, one for views and one for clicks.

To output the application's results it is suggested to develop an HTTP API. This is a good way because the application can be used in any other software that needs this data. The development of this API can be done with a simple Django or Flask server, for example, and there is a Python interface for Redis database. The HTTP GET parameters could be start_date, end_date, action and campaign, which would define for each day between start_date and end_date, how many visitors were generated through action in that campaign.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment