@fawce
Last active April 21, 2017 23:03
Draft documentation for Quantopian fetcher 2.0

Fetcher 2.0

Fetcher - Load any CSV file

Quantopian provides an 11-year history of US equity market data in minute and daily bars. The US market data provides a backbone for financial analysis, but some of the most promising areas of research involve finding signals in non-market data. Fetcher provides your algorithm with access to external time series data. Any time series that can be retrieved as a csv file via http or https can be incorporated into a Quantopian algorithm.

Fetcher lets you load csv files over http. To use it, invoke fetch_csv(url) in your initialize method. fetch_csv will retrieve the text of the csv file using the (wonderful) requests module. The csv is parsed into a pandas dataframe using pandas.io.parsers.read_csv. You may then specify your own methods to modify the entire dataframe prior to the start of the simulation. During simulation, the rows of the csv/dataframe are streamed to your algorithm's handle_data method as additional properties of the data parameter.
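The mechanics reduce to "download text, parse rows against the header row". Here is a minimal stand-in sketch using only the standard library csv module (the real implementation uses requests and pandas.io.parsers.read_csv, as noted above); the file contents are inlined for illustration:

```python
import csv
import io

# Inline stand-in for the text that fetch_csv would download over http.
csv_text = """date,symbol,short_interest
2/28/2013,IBM,2.841933
2/28/2013,MSFT,3.197315
"""

# Parse the csv text into a list of row dicts, one per data line,
# keyed by the header row - analogous to a DataFrame's records.
rows = list(csv.DictReader(io.StringIO(csv_text)))

print(rows[0]['symbol'])          # IBM
print(rows[1]['short_interest'])  # 3.197315
```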

Best of all, your Fetcher data will play nicely with Quantopian's other data features:

  • Your data will be streamed to your algo without look-ahead bias.
  • Use record to plot a time series of your fetcher data.
  • Call history and retrieve a trailing window of your fetcher data.

Fetcher supports two kinds of time series:

  • Security Information: data that is about individual securities, such as short interest for a stock, or a signal value for each stock.
  • Indicators: data that stands alone, such as the Consumer Price Index, or the spot price of palladium

Fetching Security Information

Your csv file must adhere to a simple standard to be interpreted and ingested by fetcher as security information:

  • there must be a header row, with a date column and a symbol column (both are case sensitive). See the reference section for more advanced handling.
  • for each row, the symbol must be correct on the date of the row. This is so we can convert the symbol to a Quantopian security identifier (sid) and map your fetcher data onto the same identifiers used for trade data.
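To see why the symbol must be correct as of the row's date, note that tickers get reused over time (for example, "S" belonged to Sears before Sprint). A toy sketch of a point-in-time lookup follows; the table, dates, and sids here are purely illustrative, since Quantopian's actual security master is internal:

```python
from datetime import date

# Hypothetical point-in-time symbol table: (symbol, start, end, sid).
# Because tickers are reused, the mapping must be as-of the row's date.
SYMBOL_HISTORY = [
    ('AAPL', date(1980, 12, 12), date(9999, 1, 1),  24),
    ('S',    date(1993, 1, 1),   date(2005, 8, 11), 6666),   # e.g. Sears
    ('S',    date(2005, 8, 12),  date(9999, 1, 1),  21371),  # then Sprint
]

def symbol_to_sid(symbol, as_of):
    """Resolve a ticker to a sid as of a given date, or None."""
    for sym, start, end, sid in SYMBOL_HISTORY:
        if sym == symbol and start <= as_of <= end:
            return sid
    return None

print(symbol_to_sid('S', date(2004, 6, 1)))   # 6666
print(symbol_to_sid('S', date(2013, 2, 28)))  # 21371
```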

Here is an example csv file containing security information:

date,symbol,short_interest
2/28/2013,IBM,2.841933
2/28/2013,MSFT,3.197315
2/28/2013,S,4.465144
2/28/2013,AAPL,3.600895
2/28/2013,YHOO,5.115798
2/28/2013,ADVS,6.172108
2/28/2013,C,4.933204
3/29/2012,IBM,5.204008
3/29/2012,MSFT,3.707489
3/29/2012,S,4.323507
3/29/2012,AAPL,6.762691
3/29/2012,ADVS,3.660425
3/29/2012,C,3.245739
3/29/2012,NFLX,7.573924

As you can see, you can have many symbols in a single csv file. If a row's date and symbol do not match our stock history, the row is ignored. Here is an example algorithm that pulls the data file above into an algorithm:

def initialize(context):
    # fetch the data
    fetch_csv('http://yourserver.com/shortinterest.csv')

def handle_data(context, data):
    # because the csv file had a column named 'short_interest', each
    # stock covered by the csv file will have a 'short_interest'
    # property
    short_interest = data[sid(24)]['short_interest']

    # plot it
    record(aapl_short_interest=short_interest)

Fetcher and the tradeable universe

Often, you will be fetching data in order to choose which stocks to trade. In addition to adding new security information to your simulation, Fetcher also manages the universe of stocks. Practically speaking, being in the universe means that a particular stock will be present in the data parameter sent to handle_data.

Universe selection by fetcher is a daily process, and stocks can be both added and removed. On each trading day, fetcher will look at the unique sids referenced by your csv file's symbol column on that trading day. The first 150 sids referenced on a given day in your csv file will be present in the data parameter during simulation on that day. Any sids previously referenced, but not referenced on the current day, will be removed from the universe.

Often, a csv file will have data on a quarterly, monthly, or weekly frequency. In the case that there are zero records in your csv file for a given day, the last calculated universe will be used.
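The daily-universe rule above can be sketched in plain Python. This is an illustrative model, not the actual implementation; `rows_by_day` stands in for the grouping of your csv's rows by trading day:

```python
# Sketch of the daily universe rule: take the first 150 unique symbols
# referenced on each day, and carry the previous universe forward on
# days with no rows. (150 is the documented per-day cap.)
MAX_UNIVERSE = 150

def daily_universe(rows_by_day, trading_days):
    """rows_by_day: dict mapping day -> list of symbols referenced."""
    universes = {}
    current = []
    for day in trading_days:
        symbols = rows_by_day.get(day, [])
        if symbols:
            seen = []
            for sym in symbols:
                if sym not in seen:
                    seen.append(sym)
                if len(seen) == MAX_UNIVERSE:
                    break
            current = seen
        universes[day] = list(current)
    return universes

rows = {'2013-02-28': ['IBM', 'MSFT', 'YHOO'],
        '2013-03-28': ['IBM', 'MSFT', 'NFLX']}
days = ['2013-02-28', '2013-03-01', '2013-03-28']
u = daily_universe(rows, days)
print(u['2013-03-01'])  # ['IBM', 'MSFT', 'YHOO'] - carried forward
print(u['2013-03-28'])  # ['IBM', 'MSFT', 'NFLX'] - YHOO dropped
```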

Using the same csv file above, here is an algorithm that prints the current universe:

def initialize(context):
    # fetch the data
    fetch_csv('http://yourserver.com/shortinterest.csv')

def handle_data(context, data):
    # print the stocks in the universe today
    print sorted([stk.symbol for stk in data.keys()])

Here is the logging output for that algorithm. Notice that YHOO drops out of the universe and NFLX enters.

2013-02-28 PRINT [u'AAPL', u'ADVS', u'C', u'IBM', u'MSFT', u'S', u'YHOO']
2013-03-01 PRINT [u'AAPL', u'ADVS', u'C', u'IBM', u'MSFT', u'S', u'YHOO']
...
...
2013-03-28 PRINT [u'AAPL', u'ADVS', u'C', u'IBM', u'MSFT', u'NFLX', u'S']
2013-04-01 PRINT [u'AAPL', u'ADVS', u'C', u'IBM', u'MSFT', u'NFLX', u'S']
2013-04-02 PRINT [u'AAPL', u'ADVS', u'C', u'IBM', u'MSFT', u'NFLX', u'S']
2013-04-03 PRINT [u'AAPL', u'ADVS', u'C', u'IBM', u'MSFT', u'NFLX', u'S']

If you make multiple fetch_csv calls, the daily universe will be the union of the daily universes of all fetcher calls.

If you explicitly reference a stock using the sid function in your algorithm, e.g. sid(24), the sid will be added to the fetcher universe every day.

If you make a set_universe call as well as fetch_csv call(s), the universe will be determined by set_universe and fetcher data will be applied only to stocks in the universe.

Universe membership is projected onto trailing windows for history, so you will retrieve history for fetcher data for all current universe members.

Fetching Indicator Information

Fetcher can also handle data that does not reference individual stocks. In that case, you use fetch_indicator, which will skip the step of assigning a sid to each row in the file. Here's a file with the Consumer Price Index (CPI) and Initial Jobless Claims (IJC) values over time:

date,cpi,ijc
2007-12-01,211.445,123456
2007-11-01,210.834,123456
2007-10-01,209.19,123456
2007-09-01,208.547,123456
2007-08-01,207.667,123456
2007-07-01,207.603,123456
2007-06-01,207.234,123456
2007-05-01,206.755,123456
2007-04-01,205.904,123456
2007-03-01,205.288,123456
2007-02-01,204.226,123456
2007-01-01,203.437,123456
2006-12-01,203.1,123456
2006-11-01,202.0,123456
2006-10-01,201.9,123456
2006-09-01,202.8,123456
2006-08-01,203.8,123456
2006-07-01,202.9,123456
2006-06-01,201.8,123456
2006-05-01,201.3,123456
2006-04-01,200.7,123456
2006-03-01,199.7,123456
2006-02-01,199.4,123456
2006-01-01,199.3,123456

Notice that there is no symbol column - this is because the data does not map onto securities. To let fetcher know that you're not working with stock data, you call fetch_indicator. To make the data accessible to your algorithm, each column header will be added as a key in the data parameter to handle_data. Here is an algorithm that plots the cpi from the above csv:

def initialize(context):
    fetch_indicator('http://yourserver.com/cpi.csv')

def handle_data(context, data):
    # get the cpi and ijc values for this date
    current_cpi = data['cpi']
    current_ijc = data['ijc']
    price_of_aapl = data[sid(24)].price

    # plot it
    record(cpi=current_cpi)

Data Manipulation with Fetcher

As noted above, fetcher expects data to adhere to a simple format. If you produce the csv, it is relatively easy to put the data into a good format for Fetcher. First decide if your file should be an indicator or security info source, then build your columns accordingly.

However, you may not always have control over the csv data file. It may be maintained by someone else, or you may be using a service that dynamically generates the csv. quandl.com, for example, provides a REST API to access many curated datasets as csv. While you could download the csv and modify them before using them in Fetcher, you would lose the benefit of the nightly data updates. In most cases it's better to request fresh files directly from the source.

Fetcher lets you massage the csv data programmatically before sending it to your algorithm. To understand how this works, it is helpful to explain a little about the internals of fetcher. Here is how a data file is parsed and added to the data stream for your algorithm:

1. The file is retrieved from the url you provided.
2. The file is ingested using pandas - specifically pandas.read_csv(your_csv_data_file) - and a `pandas.DataFrame` is created from your csv.
3. If provided, your `pre_func` is called on the dataframe from the prior step. The output of this step must be a DataFrame that matches the required format described for the csv file above.
4. The symbol column is converted into a sid column.
5. The date column is parsed using pandas.to_datetime.
6. If provided, your `post_func` is called on the dataframe from the prior step.
7. Each row is sorted into the data event stream for your algorithm so that it arrives in chronological order, and is merged by sid into the data parameter for handle_data.
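The steps above can be sketched outside Quantopian with plain pandas. Step 4 is omitted here since symbol-to-sid mapping needs Quantopian's security master, and the `pipeline` function is illustrative, not the actual implementation:

```python
import io
import pandas as pd

def pipeline(csv_text, pre_func=None, post_func=None):
    """Rough sketch of the fetcher parsing steps listed above."""
    # step 2: ingest with pandas.read_csv
    df = pd.read_csv(io.StringIO(csv_text))
    # step 3: optional user hook on the raw frame
    if pre_func is not None:
        df = pre_func(df)
    # step 4 (symbol -> sid) skipped: needs the security master
    # step 5: parse the date column
    df['date'] = pd.to_datetime(df['date'])
    # step 6: optional user hook on the parsed frame
    if post_func is not None:
        df = post_func(df)
    # step 7: sort chronologically before streaming
    return df.sort_values('date')

csv_text = "date,symbol,short_interest\n3/29/2012,IBM,5.2\n2/28/2013,IBM,2.8\n"
df = pipeline(csv_text)
print(df.iloc[0]['date'])  # the 2012 row comes first after sorting
```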

Because fetcher uses pandas DataFrames to represent the csv data internally, all the pandas functionality for data massaging is available to your algorithm.

pre_func specifies the method you want to run on the pandas dataframe containing the CSV immediately after it is fetched from the remote server. Your method can rename columns, reformat dates, slice or select data, or even reshape the csv data. For example, imagine you have a csv in an "observation" style layout with a column for each security (vs. the "record" layout of the required csv file format):

date,GPS,HON,NFLX
2002-02-15,128.00,128.18,127.41
2002-02-18,129.39,129.65,128.95
2002-02-19,128.73,129.37,128.52
2002-02-20,129.57,129.70,128.54
2002-02-21,128.64,129.05,127.72

While this is a perfectly accurate way to represent the data, the layout doesn't conform to the csv file format described above in the Fetching Security Information section. To use the data in an algorithm, we need to reshape the dataframe to match the expected format. Here is an algorithm that uses a pre_func to reshape the data and use it in the algorithm:

import numpy as np
from pandas import DataFrame

def unpivot(frame):
    """
    Convert observation-style data into record format.
    Adapted from the unpivot function at:
    http://pandas.pydata.org/pandas-docs/dev/reshaping.html
    """
    frame = frame.set_index('date')
    N, K = frame.shape
    data = {'value': frame.values.ravel('F'),
            'symbol': np.asarray(frame.columns).repeat(N),
            'date': np.tile(np.asarray(frame.index), K)}
    return DataFrame(data, columns=['date', 'symbol', 'value'])

def initialize(context):
    fetch_csv('http://yourserver.com/sample.csv', pre_func=unpivot)

def handle_data(context, data):
    print data[sid(25090)]['value']

post_func is called after Fetcher has sorted the data based on your given date column. This method is intended for time series calculations you want to do on the entire dataset, such as timeshifting, calculating rolling statistics, or adding derived columns to your dataframe. Again, your method should take a dataframe and return a dataframe.

Consider the file for short interest:

date,symbol,short_interest
2/28/2013,IBM,2.841933
2/28/2013,MSFT,3.197315
2/28/2013,S,4.465144
2/28/2013,AAPL,3.600895
2/28/2013,YHOO,5.115798
2/28/2013,ADVS,6.172108
2/28/2013,C,4.933204
3/29/2012,IBM,5.204008
3/29/2012,MSFT,3.707489
3/29/2012,S,4.323507
3/29/2012,AAPL,6.762691
3/29/2012,ADVS,3.660425
3/29/2012,C,3.245739
3/29/2012,NFLX,7.573924

The short interest for a stock is reported by the exchanges, but with an 8 business day lag. The date in the extract is usually the date the short interest was calculated, even though it is reported 8 days later. To avoid lookahead bias, we need to lag the data by 8 days. Here's an algorithm that uses a post_func to lag the data:

def time_lag(df):
    # data from the exchanges is delayed by 8 business days
    # using pandas' tshift to lag the index by 8 business days
    df = df.tshift(8, freq='b')
    return df

def initialize(context):
    # TODO: this is pulling a static scrape of the data,
    # need to figure out how to pull a large universe from
    # quandl
    fetch_csv(
        'https://s3.amazonaws.com/quantopian-production/data/dtoc_history.csv',
        post_func=time_lag
    )

def handle_data(context, data):
    pass
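The lagging step can be tried on its own outside Quantopian. Note that `tshift` was removed in pandas 2.0; `shift` with a `freq` argument is the modern spelling of the same operation:

```python
import pandas as pd

# Standalone demo of the 8-business-day lag applied by time_lag above.
df = pd.DataFrame(
    {'short_interest': [2.84]},
    index=pd.to_datetime(['2013-02-28']),
)

# Shift the index forward by 8 business days without touching values.
lagged = df.shift(8, freq='B')

print(lagged.index[0].date())  # 2013-03-12: 8 business days later
```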

Fetcher is limited to daily data

In Live Trading, an algorithm's initialize method is called once before market open. Fetcher can only run in initialize, and as a result requests for the source csv files happen once per day.

With Fetcher, the csv file's date/time information can have a different frequency than daily. In these cases, Fetcher will take the most recent historical values in the file and report those alongside the securities at the open. The data in the file will be cut off at 2am US/Eastern of the current trading day.
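The "most recent historical values" behavior amounts to a last-observation-carried-forward lookup. Here is an illustrative sketch using values from the CPI file above (not the actual fetcher internals):

```python
from datetime import date

# Monthly observations, as in the CPI file above, in ascending order.
observations = [
    (date(2007, 10, 1), 209.19),
    (date(2007, 11, 1), 210.834),
    (date(2007, 12, 1), 211.445),
]

def value_as_of(obs, day):
    """Return the most recent observation at or before `day`."""
    latest = None
    for obs_date, value in obs:
        if obs_date <= day:
            latest = value
    return latest

# Every trading day in November sees the November 1 value.
print(value_as_of(observations, date(2007, 11, 15)))  # 210.834
```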

Jagged starts for fetcher data and trade data

If your backtesting starts before the first rows of data in your csv file, you will have a few invocations to handle_data before fetcher data is available.

You will need some guard code in your handle_data method to avoid KeyError problems when accessing data properties. In general, check that a sid is in data before you access it, and that a property is available on a sid before you access it:

def initialize(context):
    # an indicator Fetcher
    fetch_csv('http://priceoftea.com/', symbol='tea')
    # a security info Fetcher
    fetch_csv('http://insiderselling.com')

def handle_data(context, data):
    # guard against being called before the first trade of a security
    if sid(123) in data:
        # guard against trades happening before the first insider selling event
        if 'insider' in data[sid(123)]:
            if data[sid(123)]['insider'] > 10.0:
                order(sid(123), -100)

    # indicator data will pass a blank place holder if the first event has not been sent yet.
    # So, you can just guard against missing properties
    if 'price' in data['tea']:
        record(price_of_tea=data['tea']['price'])

Fetcher and Live Trading

Algorithms are launched each day at 2am US/Eastern. The initialize method is invoked each day, which in turn will trigger any fetcher calls. As a result, csv source files can be edited up until 1:59am US/Eastern. As with backtesting, the data in the file will be cut off at 2am US/Eastern of the current trading day.

Further reading

For more information about Fetcher, go to the API documentation or look at the sample algorithms.

API Reference

fetch_csv(url, pre_func=None, post_func=None, date_column='date',
          timezone='UTC', symbol=None, dayfirst=False, **kwargs)
Loads the given CSV file (specified by url) to be used in a backtest.
Parameters
    url: A well-formed http or https url pointing to a csv file that has a header, a date column, and a symbol column (unless symbol param is specified).
    pre_func: (optional) A function that takes a pandas dataframe parameter (the result of pandas.io.parsers.read_csv) and returns another pandas dataframe.
    post_func: (optional) A function that takes a pandas dataframe parameter and returns a dataframe.
    date_column: (optional) A string identifying the column in the csv file's header row that holds the parseable dates.
    timezone: (optional) Either a pytz timezone object or a string conforming to the pytz timezone database.
    **kwargs: (optional) Additional keyword arguments that are passed to the requests.get and pandas read_csv calls. See the requests and pandas documentation for the valid arguments.

deprecated and ignored

  • date_format: (optional) A string defining the format of the date/time information held in the date_column.
    • we now use pandas to_datetime to parse the date internally.
  • symbol: (optional) If specified, the fetcher data will be treated as an indicator source, and all of the data in each row will be added to the data parameter of the handle_data method. You can access all the csv data from this source as data['symbol'].
    • use fetch_indicator instead of fetch_csv to retrieve data not tied to stocks