Broadcast Code Readme

This project is Python 2-compatible and has issues when run with Python 3, so please use Python 2. (We used Anaconda's CPython distribution; PyPy might be faster, but we have not checked whether the code can be moved to PyPy.)

1. Data Access

All data-access-related classes live in the data package. We assume there are two SQLite databases, db.sqlite3 and links.sqlite3, stored in shared memory (/dev/shm) for faster access (copy them there first and keep the originals elsewhere, since /dev/shm does not survive a reboot). The first DB holds the tweets table with the following schema:

CREATE TABLE tweets (user_id integer, tweet_time integer);
CREATE INDEX idx1 on tweets (user_id);

and the second one's schema is:

CREATE TABLE links (ida integer, idb integer);
CREATE INDEX idx_ida on links (ida);
CREATE INDEX idx_idb on links (idb);

The SQLite database wrapper is in db_connector.py.
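
As a rough illustration, a query against this schema might look like the sketch below. The function name is hypothetical; the real wrapper lives in db_connector.py.

```python
import sqlite3

import numpy as np


# Hypothetical helper, only to illustrate the schema above;
# the real wrapper lives in db_connector.py.
def fetch_tweet_times(db_path, user_id):
    """Return one user's tweet timestamps as a sorted numpy array."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT tweet_time FROM tweets WHERE user_id = ? ORDER BY tweet_time",
            (user_id,))
        return np.array([r[0] for r in rows], dtype=np.int64)
    finally:
        conn.close()


# e.g. fetch_tweet_times('/dev/shm/db.sqlite3', 123456)
```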

Later in the project we found it better to hold tweets in HDF5 format, so there is a tweets_all.h5 file (replacing db.sqlite3). Each user's tweet list is stored as a numpy array. To access a user's data, e.g. user #123456, look up the dataset at /0012/3456/tweets (see hdfs.py).
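
A minimal sketch of reading that layout with h5py, assuming the key is the 8-digit zero-padded user ID split into two 4-digit groups (the real reader is hdfs.py):

```python
import h5py


def load_tweets_h5(h5_path, user_id):
    # e.g. user 123456 -> '/0012/3456/tweets'
    key = '/%04d/%04d/tweets' % (user_id // 10000, user_id % 10000)
    with h5py.File(h5_path, 'r') as f:
        return f[key][:]  # materialize the stored numpy array
```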

A good abstraction for all kinds of data access needed throughout the code is UserRepository. We have implemented SQLiteUserRepository, HDFSUserRepository, and a hybrid version used in the experiments, HDFSSQLiteUserRepository. For other storage back-ends, just implement this interface (see user_repo.py).
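
The interface looks roughly like this (method names are illustrative; see user_repo.py for the real one):

```python
class UserRepository(object):
    """Abstract data-access interface; concrete back-ends implement these."""

    def get_user_tweets(self, user_id):
        """Return the user's tweet times (numpy array of UTC timestamps)."""
        raise NotImplementedError

    def get_user_followers(self, user_id):
        """Return the ids of the users who follow user_id."""
        raise NotImplementedError

    def get_user_followees(self, user_id):
        """Return the ids of the users that user_id follows."""
        raise NotImplementedError
```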

2. Models

These models reside in data/models.py and data/user.py.

TweetList

The first main model in our code is TweetList, which holds all tweets in a time interval (as a numpy array) and can perform basic operations such as computing the periodic intensity and the connection probability (see also the Cythonized Codes section). For performance reasons there are two kinds of lists: a TweetList, which holds the complete data, and a TweetListView, which acts as a sub-list of a TweetList or of another TweetListView. Data is therefore never copied when sub-listing, which is a very frequent operation throughout the code. For the sake of uniformity, the rest of the code works with ITweetList, the abstraction over both tweet list types.
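
A stripped-down sketch of the list/view idea (the real classes are in data/models.py): the view only stores a reference plus index bounds, so sub-listing never copies data.

```python
import numpy as np


class TweetList(object):
    def __init__(self, times):
        self.times = np.asarray(times)  # sorted UTC timestamps

    def sublist(self, start, end):
        lo, hi = np.searchsorted(self.times, [start, end])
        return TweetListView(self, lo, hi)


class TweetListView(object):
    def __init__(self, parent, lo, hi):
        self.parent, self.lo, self.hi = parent, lo, hi

    @property
    def times(self):
        # a numpy slice is itself a view, so still no copy
        return self.parent.times[self.lo:self.hi]

    def sublist(self, start, end):
        lo, hi = np.searchsorted(self.times, [start, end])
        return TweetListView(self.parent, self.lo + lo, self.lo + hi)
```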

User

User is another abstraction for our operations. A user is identified by its user_id (the unique Twitter ID). You only have to give a user its data access object (a UserRepository), and it will lazy-load whatever data it needs (see the @cache_enabled decorator in utils/decorators.py).
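
A minimal sketch of what such lazy loading can look like (the real decorator is @cache_enabled in utils/decorators.py; this version simply caches the first result on the instance and ignores arguments):

```python
import functools


def cache_enabled(method):
    attr = '_cached_' + method.__name__

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        if not hasattr(self, attr):
            setattr(self, attr, method(self, *args, **kwargs))
        return getattr(self, attr)
    return wrapper


class User(object):
    def __init__(self, user_id, repository):
        self.user_id = user_id
        self.repository = repository

    @cache_enabled
    def tweet_list(self):
        # loaded from the repository on first access, cached afterwards
        return self.repository.get_user_tweets(self.user_id)
```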

3. Optimization

Because this part is the most involved, I will explain it in more detail. You can find the optimization code in the opt package.

The main functionality lives in the learn_and_optimize function. We will walk through this function line by line.

user_tl = user.tweet_list().sublist(learn_start_date, learn_end_date)
oi = user_tl.get_periodic_intensity(period_length, learn_start_date, learn_end_date)[start_hour:end_hour]

The first line extracts the user's tweet list and trims it to the learning interval. The second line computes the periodic intensity (default: weekly) and keeps only the first day. The default budget is the sum of the intensity over this day (the average number of tweets in the learning interval that fall on the first day of the week).
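
In code, the default budget is simply the total of that one-day intensity (a sketch; variable names follow the snippet above):

```python
import numpy as np

budget = np.sum(oi)  # expected number of tweets during the selected day
```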

Note: One could use the full weekly intensity, but that results in a 168-variable problem, which is quite hard to optimize when the gradients are not analytic. See the Gradients section for more details.

Note: The first 24 hours of the week is always a Thursday (Unix time 0 falls on a Thursday, and all of our operations are based on UTC Unix timestamps).

Caveat! DON'T USE datetime.fromtimestamp! Use unix_timestamp implemented in util/cal.py. The former is timezone-dependent, while the latter works in UTC.
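
A small example of the difference: calendar.timegm and datetime.utcfromtimestamp stay in UTC, while datetime.fromtimestamp applies the machine's local timezone (unix_timestamp in util/cal.py is presumably the UTC variant).

```python
import calendar
from datetime import datetime

dt = datetime(2016, 4, 27, 12, 0, 0)
ts = calendar.timegm(dt.timetuple())   # UTC, timezone-independent

print(datetime.utcfromtimestamp(ts))   # 2016-04-27 12:00:00, everywhere
print(datetime.fromtimestamp(ts))      # shifted by the local timezone
```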

Then it is time to calculate the upper bounds, which is done by calculate_upper_bounds. The current version works like this: for each follower, find the maximum tolerance they have, then take the weighted average of these tolerances over all followers (sketched below).
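
Read literally, that rule could be sketched like this; the shapes and the meaning of "tolerance" (here, the maximum of each follower's wall intensity) are assumptions, and the authoritative version is calculate_upper_bounds.

```python
import numpy as np


def upper_bound_sketch(followers_wall_intensities, followers_weights):
    # followers_wall_intensities: shape (n_followers, n_hours)
    tolerances = followers_wall_intensities.max(axis=1)       # max per follower
    return np.average(tolerances, weights=followers_weights)  # weighted average
```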

Note: This part is time-consuming, because all of the followers' tweet lists have to be loaded into RAM and their periodic intensities have to be computed. But this work has to be done anyway, since the optimization part needs it, and thanks to caching it is not a serious problem. When experimenting over a lot of users, a pool that holds all this data could help, but this should not be a concern in the WebApp.

Next, the follower weights and connection probabilities are fetched from the user and dumped into plain (naive) arrays (see Cythonized Codes for the reason).

Since we want the optimization code to know nothing about domain-specific objects (and, mainly, we want it to be fast and Cython-compatible), we build wrappers for our optimization problem. Mathematically, our utility function takes a vector x and returns the utility, without knowing anything about the rest of the domain. So we build such a function in our code:

def _util(x):
    # closure over the follower data, so the optimizer only sees a function of x
    return util(x, followers_wall_intensities, followers_conn_prob, followers_weights, *extra_opt)

Here util is a fast, preferably C-compatible function passed to learn_and_optimize. *extra_opt holds extra arguments that the utility may take in addition to the parameters given explicitly (e.g. k for top-k visibility).

A similar wrapper is built for the gradient. Then we invoke the main optimization routine, optimize, with the initial point set to zero.

The optimize function first builds a proj function, which acts as a projector (a fairly fast quadratic program), and checks for the degenerate case in which the budget exceeds the sum of the upper bounds (then you should simply tweet as much as you can). It then invokes optimize_base, which implements the projected gradient descent algorithm.
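
For intuition, the projected gradient scheme under these constraints can be sketched as below. This is not the repo's implementation: it steps along the gradient of the utility (equivalently, descent on its negative) and projects onto {0 <= x <= ub, sum(x) = budget} by bisection on the dual variable, whereas the real proj is a small QP.

```python
import numpy as np


def project(y, ub, budget, iters=60):
    """Euclidean projection onto {0 <= x <= ub, sum(x) = budget} via bisection."""
    lo, hi = np.min(y - ub), np.max(y)
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        if np.clip(y - theta, 0.0, ub).sum() > budget:
            lo = theta
        else:
            hi = theta
    return np.clip(y - 0.5 * (lo + hi), 0.0, ub)


def projected_gradient(grad, x0, ub, budget, step=0.1, n_iter=500):
    """Maximize the utility: take a gradient step, then project back."""
    x = project(np.asarray(x0, dtype=float), ub, budget)
    for _ in range(n_iter):
        x = project(x + step * grad(x), ub, budget)
    return x
```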

Caveat! There is an unresolved issue: in rare cases the algorithm does not converge (the step size is too big and should be reduced, but this does not happen). The issue shows up when you use the min-max utility. TODO: check the validity of the code in those cases.

Utility functions

Each utility/gradient function has the same signature: lambda1, lambda2_list, conn_probs, weights, *args. Because the most important (and most time-consuming) part of the optimization phase is the utility, it is coded in Cython. We have analytic, very fast forms for f (the probability of being in the top k), expected_f_top_k (the integral of f, also in analytic form, using a self-implemented incomplete Gamma function for integer arguments), expected_f_top_one (old, no longer used) and gradient_top_one (the gradient of expected_f_top_one).

Gradients

There is a numerical way to calculate the gradient for the top-k problem, but it is very slow: it needs n evaluations of the utility, so it is typically 24 times slower than top-1, and 168 times slower if the period is set to a week. TODO: plug in the analytic gradient formula (which does not exist yet).
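
The numerical gradient referred to above is just a finite difference, which is why its cost scales with the number of variables (a sketch, not the repo's exact code):

```python
import numpy as np


def numerical_gradient(util, x, eps=1e-6):
    """Forward-difference gradient: one extra utility evaluation per coordinate."""
    g = np.zeros(len(x))
    f0 = util(x)
    for i in range(len(x)):
        xp = np.array(x, dtype=float)
        xp[i] += eps
        g[i] = (util(xp) - f0) / eps
    return g
```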

Cythonized Codes

During development there were issues with building the Cython files that are still not fully resolved. One approach is a setup.py configured to build the .pyx files into .so files (such a file actually exists now). Another is building on demand with pyximport, which is what we currently use. The first approach did not work properly, especially with IPython Notebook's %autoreload. For production, though, one should write a proper setup.py for the whole project, including the other dependencies.
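
For reference, a minimal setup.py for the ahead-of-time route could look like the sketch below (only data/helper.pyx is a real path from this repo; the rest is a placeholder). The on-demand route simply calls pyximport.install() before importing the .pyx modules.

```python
# Minimal sketch of a setup.py for ahead-of-time builds
# (run with: python setup.py build_ext --inplace).
from distutils.core import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize([
        'data/helper.pyx',
        # ... plus the utility/gradient .pyx files in the opt package
    ]),
)
```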

Intensity and Connection Probability

Since this is another time-consuming computation (in the long run), we have implemented these two functions in Cython. You can find them in data/helper.pyx (it might be better to move them to another package!).
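
A pure-Python sketch of what the periodic intensity computation amounts to (an hour-of-week average; the exact binning and units in data/helper.pyx may differ):

```python
import numpy as np


def periodic_intensity_sketch(tweet_times, start, end, period_length=24 * 7):
    """Average tweets per hour for each hour slot of the period."""
    hours = (((np.asarray(tweet_times) - start) // 3600) % period_length).astype(int)
    counts = np.bincount(hours, minlength=period_length)
    n_periods = (end - start) / (3600.0 * period_length)
    return counts / n_periods
```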

Utilities

All utilities are in Cython for obvious reasons. It might be a good idea to move all of the optimization process to Cython for optimal performance.
