@chriswhong · Last active June 21, 2017
Logging Maryland MTA Real-time bus data

Technical considerations for logging real-time bus data

Overview

The Maryland MTA publishes real-time bus location data in GTFS-RT format at http://mta.maryland.gov/content/developer-resources. Various stakeholders wish to use historical real-time data to assess the timeliness of the bus system. Successful analysis requires a consistent dataset of archived real-time locations.

Database vs Static Files

A major consideration is whether to use a database to store the data, or to store it in a well-organized hierarchical tree of static files.

Storing the data in a database means a slightly more complex hosting environment. New data are committed to the database, but would not be easily accessible without further work to expose certain queries as bulk downloads, or to let users build their own queries and retrieve data on demand (essentially, a web API must be built on top of the database to make it accessible).
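As an illustration, each decoded entity could be committed with a parameterized insert. The table name, columns, and SQL below are assumptions for the sketch, not an existing schema:

```javascript
// Sketch only: the table and columns are assumptions, e.g.
//   CREATE TABLE vehicle_positions
//     (entity_id text, trip_id text, route_id text, lat real, lon real, recorded_at timestamptz);

var insertSql =
  'INSERT INTO vehicle_positions (entity_id, trip_id, route_id, lat, lon, recorded_at) ' +
  'VALUES ($1, $2, $3, $4, $5, $6)';

// flatten one decoded GTFS-RT vehicle entity into parameter values for the insert
// (field names as defined in the gtfs-realtime proto)
function entityToRow(entity) {
  var v = entity.vehicle;
  return [
    entity.id,
    v.trip.trip_id,
    v.trip.route_id,
    v.position.latitude,
    v.position.longitude,
    new Date(v.timestamp * 1000).toISOString()
  ];
}

// with a client from a driver such as node-postgres, each entity would then be
// committed with client.query(insertSql, entityToRow(entity), callback)
```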

With a static file approach, each periodic set of real-time data is stored as a single file, and the entire directory tree can be easily hosted by a traditional web server. The path could look like realtime/2017/07/21/{timestamp}.csv. If real-time data are pulled every minute, this results in 1,440 files in each daily directory, or roughly 43,000 files per month. A downside to this approach is that the data are only accessible as individual files, so more work is required of anyone who wants to use them: consumers could script the download of all the files they need, but would then need further scripting to import them into a database or to merge the loose flat files into one.

The script

The basic idea is to write a script that can be called by a cron job. When the script runs, it downloads the current GTFS-RT pbf from the MTA, decodes the data, and:

  1. commits the data to a database or a flat file as described above
  2. writes to a log file indicating the timestamp of the request, how many records were retrieved, and whether the save to database or file was successful

This can be done easily using node.js; here is a snippet:

//based on https://github.com/yuningalexliu/mta-realtime, this app exposes the MTA GTFS-realtime 'entities' as JSON

var http = require('http')
var ProtoBuf = require('protobufjs')

//options to be used in http request.  See config.sample
var options = {
  host:'gtfsrt.mta.maryland.gov',
  port: 8888,
  path:'/TMGTFSRealTimeWebService/Vehicle/VehiclePositions.pb'
}

var transit = ProtoBuf
  .loadProtoFile("gtfs-realtime.proto.txt")
  .build("transit_realtime");

//function to process the response; once the full body has arrived, decode it and save
function processBuffers(response) {
	var data = [];
	response.on('data', function (chunk) {
		data.push(chunk);
	});

	response.on('end', function () {
		data = Buffer.concat(data);
		var decodedFeedMessage = transit.FeedMessage.decode(data);
		var tripData = decodedFeedMessage.entity;
		// write to database or file
		saveSomewhere(tripData);
	});
}

// get the real-time data
http.request(options, processBuffers).end();

Other Notes

There are technologies that quickly expose a web API for a PostgreSQL database. PostgREST and the CartoDB-SQL-API come to mind as options that would allow exposing a data API without too much effort.
