
@sulrich
Last active September 20, 2022 12:06

NANOG data analysis fuzzy matching

notes

  • Analysis of Presentations - Google Sheets: this is the original data set (aka raw speaker data or RSD).

    i lightly refactored this to generate nanog-merge - Google Sheets (RAW tab), renaming a number of the fields along the way.

    of note:

    • all fields are SINGLE_NO_SPACE names
    • there's a standalone NANOG DATE field
    • there's a standalone LOCATION field
    • normalized ORIGIN (effectively s/^found on//g)
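
the ORIGIN normalization above can be sketched in python as follows. this is a minimal illustration, not the script that was actually used: the function name `normalize_origin` and the sample values are hypothetical, and stripping the whitespace left behind after the prefix is an assumption.

```python
import re


def normalize_origin(value: str) -> str:
    """Strip a leading 'found on' prefix, mirroring s/^found on//."""
    # the trailing \s* also removes the space left after the prefix
    # (an assumption; the sed expression in the notes does not do this)
    return re.sub(r"^found on\s*", "", value.strip())


# hypothetical sample values, not taken from the real data set
print(normalize_origin("found on archive.nanog.org"))
print(normalize_origin("youtube"))
```

values that do not start with the prefix pass through unchanged, so the function is safe to run over the whole column.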

merge methodology

  • NANOGs across the data sets are not uniform. there are effectively 3 sets of data:

    • RSD elements that do not exist in the scraped space
    • scraped speaker data (aka SSD) which is the result of scraping https://archive.nanog.org
    • overlapping elements - these exist in both the RSD and SSD data sets and require merging
  • the standalone data sets were merged with all of their respective fields intact

  • the overlapping fields had the SSD data overlaid on the RSD data. an exact regex match on the NANOG, SPEAKER, and TITLE fields was attempted first; if the exact match failed, a fuzzy match was executed on the SPEAKER and TITLE fields. anything that didn't match on these fields was set aside as "unmatched".
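
the two-pass match described above might look roughly like the sketch below. several things here are assumptions rather than the method actually used: the notes don't say which fuzzy matcher was chosen, so this uses `difflib.SequenceMatcher` from the standard library; normalized string equality stands in for the exact regex match; the fuzzy pass is limited to rows with the same NANOG number; the `FUZZY_THRESHOLD` value is arbitrary; and non-empty SSD values are assumed to win during the overlay. all function names are hypothetical.

```python
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 0.85  # assumed cutoff, not from the original notes


def norm(text):
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, norm(a), norm(b)).ratio()


def find_match(ssd_row, rsd_rows):
    """Return the RSD row matching ssd_row, or None if there is none."""
    same_nanog = [r for r in rsd_rows if r["NANOG"] == ssd_row["NANOG"]]
    # pass 1: exact match on SPEAKER and TITLE (NANOG already matches)
    for r in same_nanog:
        if (norm(r["SPEAKER"]) == norm(ssd_row["SPEAKER"])
                and norm(r["TITLE"]) == norm(ssd_row["TITLE"])):
            return r
    # pass 2: fuzzy match; both fields must clear the threshold
    best, best_score = None, 0.0
    for r in same_nanog:
        score = min(similarity(r["SPEAKER"], ssd_row["SPEAKER"]),
                    similarity(r["TITLE"], ssd_row["TITLE"]))
        if score > best_score:
            best, best_score = r, score
    return best if best_score >= FUZZY_THRESHOLD else None


def merge(rsd_rows, ssd_rows):
    """Overlay SSD rows on matching RSD rows; collect the rest as unmatched."""
    merged, unmatched = [], []
    for s in ssd_rows:
        r = find_match(s, rsd_rows)
        if r is None:
            unmatched.append(s)
        else:
            # non-empty SSD values overwrite the RSD values
            merged.append({**r, **{k: v for k, v in s.items() if v}})
    return merged, unmatched
```

taking the minimum of the two similarity scores means a near-identical title can't paper over a completely different speaker, which keeps false positives down at the cost of a larger "unmatched" pile.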

merged data set

  • 20220821-merge - Google Sheets

    • 20220821-merged-entries tab: this is effectively the superset of content that we have across the original (aka raw) dataset as well as the scraped elements (up to NANOG 70)
    • 20220821-unmatched entries tab: these are scraped entries for which no reasonable fuzzy match was found in the original data set. random spot checks seem to indicate that these items simply do not exist in the RSD dataset.

misc. TODOs

  • providing a lookup for location and date for the scraped data seems like a reasonable thing to add
  • there's more that can be filtered out in the scraped data sets
  • scraping of the data from NANOG 71-76 should be undertaken to provide additional coverage
  • we should probably come up with a consistent set of AFFILIATION values for popular companies to make the data set a little cleaner
  • for SSD it might be useful to infer the TALK_TYPE from the title in some cases.
  • for SSD it might also be useful to correlate the number of speakers in the scrape with the number of presentations and split these on a common index. this would seem to do the right thing in the majority of cases.
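
as a rough sketch of the last item, pairing speakers with presentations on a common index could look like this. it assumes a scraped entry has already been parsed into a list of speakers and a list of titles, which the notes don't confirm; the function name `split_entry` and the "; " join used for the fallback are hypothetical.

```python
def split_entry(speakers, titles):
    """Pair speakers with titles by position when the counts line up."""
    if len(speakers) == len(titles):
        # same count: assume speaker i gave presentation i
        return list(zip(speakers, titles))
    # counts differ: keep the entry together rather than guess a pairing
    return [("; ".join(speakers), "; ".join(titles))]


# hypothetical scraped entry
print(split_entry(["A. Speaker", "B. Speaker"], ["talk one", "talk two"]))
```

entries where the counts differ are left intact, so they can be reviewed by hand instead of being split incorrectly.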

schema

fields in order:

  • NANOG (int) - converted in merge
  • DATE (date) - not converted in merge
  • LOCATION (string)
  • TALK_ORDER (int) - not converted in merge
  • SPEAKER (string)
  • AFFILIATION (string)
  • TITLE (string)
  • TALK_TYPE (string)
  • YOUTUBE (string)
  • PRESO_FILES (string)
  • DURATION_MIN (int) - not converted in merge
  • TAGS (string)
  • KEYWORDS (string)
  • ORIGIN (string)
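
for anyone loading the merged sheet, the schema above could be applied along these lines. this is only a sketch: the names `FIELDS` and `load_row` are hypothetical, the sample row is made up, and it assumes rows arrive as lists of strings in schema order. only NANOG is converted, since the notes say the other typed fields were not converted in the merge.

```python
FIELDS = [
    "NANOG", "DATE", "LOCATION", "TALK_ORDER", "SPEAKER", "AFFILIATION",
    "TITLE", "TALK_TYPE", "YOUTUBE", "PRESO_FILES", "DURATION_MIN",
    "TAGS", "KEYWORDS", "ORIGIN",
]


def load_row(values):
    """Map a row of strings onto the schema, converting NANOG to int."""
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    row = dict(zip(FIELDS, values))
    # DATE, TALK_ORDER and DURATION_MIN are left as strings on purpose
    row["NANOG"] = int(row["NANOG"])
    return row


# hypothetical row, not taken from the real data set
sample = ["42", "", "Example City", "3", "A. Speaker", "Example Corp",
          "an example talk", "", "", "", "30", "", "", "scrape"]
print(load_row(sample)["NANOG"])
```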