@manuelkiessling
Thoughts and questions about JourneyMonitor Analytics architecture re Cassandra/Spark

JourneyMonitor (see https://github.com/journeymonitor and http://journeymonitor.com/) allows users to upload a Selenium test script and have this script run regularly in a headless Firefox.

The service notifies the user if a script run fails, and additionally collects HAR data for each run, like this one: http://codebeautify.org/jsonviewer/c6b9d9

Next, we would like to give the user more insight into the performance metrics that can be derived from the collected HAR data. For each testcase, we would like to present a summary like this:

  • Average overall load time of pages
  • Average load time of all CSS assets (or JS assets, image assets, etc.)
  • Average number of 404s when requesting assets

This summary would take into account all test runs executed since the testcase was created.
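
For one run, extracting these numbers from a parsed HAR document could look roughly like the following sketch (the field names follow the HAR 1.2 spec; the function name and everything else is illustrative, not actual JourneyMonitor code):

```python
# Minimal sketch: per-run metrics from one parsed HAR document.
# HAR 1.2 fields used: log.pages[].pageTimings.onLoad,
# log.entries[].time / .response.status / .response.content.mimeType.

def metrics_from_har(har):
    pages = har["log"]["pages"]
    entries = har["log"]["entries"]

    # Overall load time: onLoad timing of each page (-1 means "not available").
    on_loads = [p["pageTimings"]["onLoad"] for p in pages
                if p["pageTimings"].get("onLoad", -1) >= 0]

    # Load times of CSS assets only, filtered by response MIME type.
    css_times = [e["time"] for e in entries
                 if "css" in e["response"]["content"].get("mimeType", "")]

    # Number of asset requests that came back as 404.
    num_404s = sum(1 for e in entries if e["response"]["status"] == 404)

    avg = lambda xs: sum(xs) / len(xs) if xs else None
    return {
        "avg_page_load_ms": avg(on_loads),
        "avg_css_load_ms": avg(css_times),
        "num_404s": num_404s,
    }
```

Averaging these per-run values over all runs of a testcase then yields the summary above.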

Furthermore, we might want to provide the same information broken down over time - e.g., the average load time of all CSS assets over the past 7 days, per hour (and over the past 30 days, per day, etc.).
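
The time-related breakdown then only needs a bucket key derived from each run's timestamp; a tiny sketch, assuming the run's start time is available as a datetime:

```python
from datetime import datetime

def bucket_keys(run_started_at: datetime):
    # Hour bucket for the 7-day view, day bucket for the 30-day view.
    return (run_started_at.strftime("%Y-%m-%d %H:00"),
            run_started_at.strftime("%Y-%m-%d"))
```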

I'm currently looking for an excuse to use C*/Spark, and wonder if it makes sense here.

If I wanted to start providing this feature now, but also wanted to make this info available for all the test runs that happened in the past (and for which the HAR data is available), I'm pretty sure it would make sense to use Spark: I would use it to go over all the HAR files I have collected and to extract and calculate the needed info from them. Doing this in parallel with Spark is certainly far more efficient than doing it serially with a script that iterates over the files.
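
Such a back-fill could look roughly like this sketch (not the actual JourneyMonitor code): it assumes one HAR JSON document per file under a made-up path whose directory encodes the testcase ID, reuses the hypothetical metrics_from_har() from above, and uses Spark's plain RDD API.

```python
import json
from pyspark import SparkContext

sc = SparkContext(appName="har-backfill")

# Assumption: .../<testcase-id>/<run-id>.har, one HAR document per file.
har_files = sc.wholeTextFiles("hdfs:///journeymonitor/hars/*/*.har")

def to_page_load(path_and_content):
    path, content = path_and_content
    testcase_id = path.split("/")[-2]
    m = metrics_from_har(json.loads(content))
    return (testcase_id, m["avg_page_load_ms"])

# Average overall page load time per testcase, across all historical runs.
# Keying by (testcase_id, hour_bucket) instead would yield the per-hour
# rollups described above.
averages = (har_files
            .map(to_page_load)
            .filter(lambda kv: kv[1] is not None)
            .mapValues(lambda v: (v, 1))
            .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
            .mapValues(lambda s: s[0] / s[1]))

averages.saveAsTextFile("hdfs:///journeymonitor/metrics/page-load-avg")
```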

What's more interesting is whether Spark is still a good choice for the HAR data generated by new test runs once this feature is in place. A test run is a cronjob that executes the test script and generates the raw HAR data. After a test run finishes, I could immediately parse the HAR and calculate my metrics. Or, I could store the HAR and have Spark do the parsing and calculating afterwards.

This way, my (potentially many) test run jobs would just do their job of running the test and creating the HAR, would dump these HARs somewhere (Cassandra?), and let Spark do the heavy lifting of analyzing the HAR JSON. Sounds enticing, but I'm not sure.
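
Dumping the HARs into Cassandra could be as simple as one partition per testcase; a sketch using the Python cassandra-driver, where the keyspace, table, and schema are all made up:

```python
from datetime import datetime
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("journeymonitor")

# Hypothetical table:
#   CREATE TABLE hars (
#     testcase_id text, run_at timestamp, har text,
#     PRIMARY KEY (testcase_id, run_at)
#   );
# Partitioning by testcase keeps all runs of one testcase together, which
# matches the "data is only relevant per testcase" access pattern below.

def store_har(testcase_id, har_json_string):
    session.execute(
        "INSERT INTO hars (testcase_id, run_at, har) VALUES (%s, %s, %s)",
        (testcase_id, datetime.utcnow(), har_json_string))
```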

The test run jobs are already distributed by nature (they are simply cronjobs that run on N hosts), and I don't have to combine data from multiple test run jobs - i.e., I don't need to analyze questions like "What is the average load time of ALL CSS assets over ALL testcases?"

The data generated from the jobs of one testcase is only relevant in the scope of this one testcase:

                    A user
                      |
   -------------------------------------------
   Testcase    Testcase    Testcase    Testcase
      |
   ------------------------------------
   Results    Results    Results    Results
      \          |          /          /
       \         |         /          /
             Averaged Metrics

Hypothesis: If I wanted to get my averaged metrics long after collecting the results, it would make sense to set up a Spark run. But if I already know which metrics I want to present to the user, it is more efficient to update the existing metrics right in the job that creates a new result.
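
The "update right in the job" variant boils down to maintaining a running mean, so no old HARs ever need to be re-read. A storage-agnostic sketch, where load_metrics/save_metrics are placeholders:

```python
def fold_in(avg, count, new_value):
    # Incremental (running) mean: fold one new observation into the
    # stored average without touching earlier results.
    count += 1
    avg += (new_value - avg) / count
    return avg, count

# Hypothetical usage right after a test run produced its HAR:
#   m = metrics_from_har(har)
#   avg, count = load_metrics(testcase_id)                  # placeholder
#   save_metrics(testcase_id, *fold_in(avg, count, m["avg_page_load_ms"]))
```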

@pmcfadin

I know a little about HAR files, but how big would they be on average? Not sure if it makes sense to just store a byte blob in a column for cold storage in Cassandra. Analyzing the files with Spark makes a lot of sense. These would be easily parsed, and you could do a lot of interesting stat rollups.

I think your final hypothesis is correct. Long-term analysis is all Spark. Use an API to parse the short-term stats and push them into Cassandra.
