sulrich/20220821-NANOG-agenda merge-notes.md

## 20220821-NANOG-agenda merge-notes.md

      
            
          
# NANOG data analysis fuzzy matching

## notes

- "Analysis of Presentations - Google Sheets": this is the original data set (aka raw speaker data or RSD).
- i did some refactoring of this to generate "nanog-merge - Google Sheets". i renamed a number of the fields and did some light cleanup (RAW tab). of note:
  - all fields are SINGLE_NO_SPACE names
  - there's a standalone NANOG DATE field
  - there's a standalone LOCATION field
  - normalized ORIGIN (effectively s/^found on//g)
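
a minimal sketch of the two cleanups listed above: turning spreadsheet headers into SINGLE_NO_SPACE names and stripping the "found on" prefix from ORIGIN. the function names and the sample header text are hypothetical, and this is not necessarily how the RAW tab was actually produced.

```python
import re


def normalize_field_name(name: str) -> str:
    """Turn a header such as 'Talk Type' into 'TALK_TYPE' (hypothetical helper)."""
    # collapse every run of non-alphanumeric characters into one underscore
    collapsed = re.sub(r"[^A-Za-z0-9]+", "_", name.strip())
    return collapsed.strip("_").upper()


def normalize_origin(origin: str) -> str:
    """Drop a leading 'found on', roughly mirroring s/^found on//."""
    # assumption: trailing whitespace after the prefix is also unwanted
    return re.sub(r"^found on\s*", "", origin.strip())
```

anything that doesn't start with "found on" passes through `normalize_origin` unchanged.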


## merge methodology

NANOG coverage across the data sets is not uniform. there are effectively 3 sets of data:

- RSD elements that do not exist in the scraped space
- scraped speaker data (aka SSD), which is the result of scraping https://archive.nanog.org
- overlapping elements: these exist in both data sets (RSD and SSD) and require merging

the standalone data sets were merged with all of their respective fields intact.

for the overlapping elements, the SSD data was overlaid on the RSD data. an exact regex match on the NANOG, SPEAKER, and TITLE fields was attempted first; if the exact match failed, a fuzzy match was executed on the SPEAKER and TITLE fields. anything that didn't match on these fields was set aside as "unmatched".
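
the two-pass match described above can be sketched roughly as follows. this is an illustration under assumptions, not the script that produced the merge: it uses `difflib.SequenceMatcher` from the standard library for the fuzzy pass, a normalized string comparison stands in for the exact regex match, and the 0.85 threshold and the function names are made up for the example.

```python
from difflib import SequenceMatcher


def norm(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return " ".join(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, norm(a), norm(b)).ratio()


def find_match(ssd: dict, rsd_rows: list, threshold: float = 0.85):
    """Return the RSD row matching an SSD row, or None if nothing is close enough."""
    # pass 1: exact match on NANOG, SPEAKER and TITLE (after normalization)
    for rsd in rsd_rows:
        if (rsd["NANOG"] == ssd["NANOG"]
                and norm(rsd["SPEAKER"]) == norm(ssd["SPEAKER"])
                and norm(rsd["TITLE"]) == norm(ssd["TITLE"])):
            return rsd
    # pass 2: fuzzy match on SPEAKER and TITLE only; both must clear the threshold
    best, best_score = None, 0.0
    for rsd in rsd_rows:
        score = min(similarity(rsd["SPEAKER"], ssd["SPEAKER"]),
                    similarity(rsd["TITLE"], ssd["TITLE"]))
        if score > best_score:
            best, best_score = rsd, score
    return best if best_score >= threshold else None


def merge(rsd_rows: list, ssd_rows: list, threshold: float = 0.85):
    """Overlay SSD rows on their RSD matches; collect the rest as unmatched."""
    merged, unmatched = [], []
    for ssd in ssd_rows:
        rsd = find_match(ssd, rsd_rows, threshold)
        if rsd is None:
            unmatched.append(ssd)
        else:
            # non-empty SSD values win over the RSD values for the same field
            overlay = {k: v for k, v in ssd.items() if v not in ("", None)}
            merged.append({**rsd, **overlay})
    return merged, unmatched
```

taking the minimum of the two scores means a row only counts as a fuzzy match when both the speaker and the title are close. restricting the fuzzy candidates to the same NANOG would be a reasonable tightening, but the notes above don't say whether that was done.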


## merged data set

"20220821-merge - Google Sheets"

- 20220821-merged-entries tab: this is effectively the superset of content that we have across the original (aka raw) data set as well as the scraped elements (up to NANOG 70)
- 20220821-unmatched entries tab: this is what was scraped but had no reasonable fuzzy match in the original data set. random spot checks seem to indicate that these items simply do not exist in the RSD data set.


## misc. TODOs

- providing a lookup for location and date for the scraped data seems like a reasonable thing to add
- there's more that can be filtered out in the scraped data sets
- scraping of the data from NANOG 71-76 should be undertaken to provide additional coverage
- we should probably come up with a consistent set of AFFILIATION values for popular companies to make the data set a little cleaner
- for SSD it might be useful to infer the TALK_TYPE from the title in some cases.
- for SSD it might also be useful to correlate the number of speakers in the scrape with the number of presentations and split these on a common index. this would seem to do the right thing in the majority of cases.
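
the last item could look something like the sketch below. it assumes that a scraped row carries several speakers and several presentations joined by a separator, and that the presentations live in PRESO_FILES. the `;` separator, the choice of field, and the function name are all assumptions for illustration, not something the scrape is known to do.

```python
def split_entry(row: dict, talk_field: str = "PRESO_FILES", sep: str = ";") -> list:
    """Split one scraped row into one row per speaker when the counts line up."""
    speakers = [s.strip() for s in row["SPEAKER"].split(sep) if s.strip()]
    talks = [t.strip() for t in row[talk_field].split(sep) if t.strip()]
    # only split when there is more than one speaker and the counts agree;
    # otherwise leave the row alone for manual review
    if len(speakers) < 2 or len(speakers) != len(talks):
        return [row]
    # pair speaker i with presentation i (the "common index")
    return [{**row, "SPEAKER": s, talk_field: t} for s, t in zip(speakers, talks)]
```

rows where the counts differ come back untouched, which keeps the ambiguous cases visible rather than guessing at a pairing.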

## schema

fields in order:

1. NANOG (int) - converted in merge
2. DATE (date) - not converted in merge
3. LOCATION (string)
4. TALK_ORDER (int) - not converted in merge
5. SPEAKER (string)
6. AFFILIATION (string)
7. TITLE (string)
8. TALK_TYPE (string)
9. YOUTUBE (string)
10. PRESO_FILES (string)
11. DURATION_MIN (int) - not converted in merge
12. TAGS (string)
13. KEYWORDS (string)
14. ORIGIN (string)
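
a small sketch of reading one row against this schema, assuming the row arrives as a list of 14 strings in the order above (e.g. from a CSV export). only NANOG is converted to an int, matching the notes; DATE, TALK_ORDER and DURATION_MIN are left as strings. `FIELD_ORDER` and `convert_row` are hypothetical names.

```python
FIELD_ORDER = [
    "NANOG", "DATE", "LOCATION", "TALK_ORDER", "SPEAKER", "AFFILIATION",
    "TITLE", "TALK_TYPE", "YOUTUBE", "PRESO_FILES", "DURATION_MIN",
    "TAGS", "KEYWORDS", "ORIGIN",
]


def convert_row(values: list) -> dict:
    """Map a list of cell values onto the schema, converting NANOG to int."""
    if len(values) != len(FIELD_ORDER):
        raise ValueError(f"expected {len(FIELD_ORDER)} fields, got {len(values)}")
    row = dict(zip(FIELD_ORDER, values))
    # assumption: the NANOG cell holds just the meeting number, e.g. "60"
    row["NANOG"] = int(str(row["NANOG"]).strip())
    return row
```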