# A2 Debrief Notes
January 28, 2017
- Opened for general thoughts and questions from Researchers and Harvesters:
  - Researchers raised the concern of overall **usefulness to** harvesters; key to have validation, and especially to have helpful details cycle back
    - Mention that there is a lot of potential noise in the data
  - Need better translation of the goals of what to archive, and a way to determine relevancy
    - e.g. is tabular data preferred? What counts as high-value data (upstream datasets)?
    - Can we make a list of formats?!? (e.g. .csv, PDFs?)
    - What level of separation of context from data is allowed?
  - Frequent sense of "**are we getting everything?**" and of where to stop when scoping datasets. However, once scoped, the solution was relatively approachable.
  - Ambiguity in instructions:
    - What is the range of acceptable formats?
    - For preserving metadata, what about HTML vs. WARC?
  - Metadata questions (see the manifest sketch after this list):
    - How detailed should the metadata description be? (In the instructions, be clear about how to generate those files: controlled vocabulary? range of inputs?)
    - In the JSON manifest, "Federal agency data acquired from" <--- what is this? What about sub-organizations or regional agencies?
    - Think about metadata automation (version tracking/PM tools: JIRA, Phabricator, but other options?)
  - Most important to have some advance thought so time is only spent getting the 'right data', as de-duplicating is a pain (_note: many uncrawlable but scrapable interfaces had data accessible via FTP, e.g. the National Buoy Database_)
  - There is going to be some level of redundancy; accept that, but work to minimize it
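Since the JSON manifest came up more than once above, here is a minimal sketch in Python of what a per-dataset manifest could look like. Apart from "Federal agency data acquired from", which is quoted from the notes, every field name and value below is a hypothetical placeholder, not a confirmed schema.

```python
import json

# Minimal manifest sketch. Only "Federal agency data acquired from" is taken
# from the notes above; every other field is a hypothetical placeholder.
manifest = {
    "Federal agency data acquired from": "NOAA",
    "sub_organization": "National Data Buoy Center",  # hypothetical: addresses the sub-org question
    "source_urls": ["https://www.ndbc.noaa.gov/"],    # hypothetical: records where the data came from
    "file_formats": [".csv", ".pdf"],                 # hypothetical: feeds the "list of formats" ask
    "description": "Historical buoy observations",    # free text; controlled vocabulary still open
}

with open("manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```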
- More **research**, and more targeted research, would be extremely helpful:
  - What about priorities for downloading? -- key to have subject-matter experts **working with** harvesters and researchers
  - Working in **groups** for harvesting was helpful, with one person in a bit more of a leadership role
  - A URL-by-URL approach is not ideal. Some **holistic research** to assess sites and how to begin harvesting is crucial (and if it happens with experts, it would guide priorities)
  - Even just **GROUP RELATED DOMAINS**: grouping and sorting (even within a spreadsheet) -- basically getting related data together
  - Also have a triage of the level of difficulty of issues: coder triage? (Level 1, Level 2, Level 3) **NEED THESE IN ADVANCE**
  - Think about assessing "domain topology issues" -- mapping TLDs and sub-domains to get a handle on where data lives (see the grouping sketch after this list)
  - Is there a better way to carve up data harvesting? And who deals with what types of data at an agency?
  - Have key contacts for QAing/researching and grouping together datasets?
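As a sketch of the "group related domains" idea: given a flat list of seed URLs (in effect, the spreadsheet column), a few lines of Python can bucket them by registered domain so sub-domains of one agency land together. The seed URLs below are hypothetical examples.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical seed list -- in practice this would come from the nomination spreadsheet.
seeds = [
    "https://www.ndbc.noaa.gov/data/historical/",
    "https://tidesandcurrents.noaa.gov/stations.html",
    "https://www.epa.gov/ghgreporting",
]

groups = defaultdict(list)
for url in seeds:
    host = urlparse(url).netloc
    # Crude heuristic: keep the last two labels (e.g. "noaa.gov") so
    # sub-domains of the same agency group together.
    registered = ".".join(host.split(".")[-2:])
    groups[registered].append(url)

for domain, urls in sorted(groups.items()):
    print(f"{domain}: {len(urls)} seed(s)")
    for u in urls:
        print("   ", u)
```

A real pass would want a public-suffix-aware library, since the two-label heuristic fails for hosts like example.co.uk, but even this crude grouping makes the "domain topology" visible.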
- Tools:
  - Using Slack for small-group work in a private message channel worked well
  - Be more clear about standards for downloading, e.g.:
    - Should the URLs that files were downloaded from be included?
    - What about logs of downloads? (see the sketch after this list)
  - Better task and project management:
    - A spreadsheet works for quick and dirty, but there is too much overloading
    - Some issue-tracking/PM solution would provide a channel for researchers and harvesters to cross-talk (e.g. have an issue per dataset, almost like a checklist for evaluation? labels? -- people good at identifying JavaScript just do that, people with domain expertise just evaluate importance)
  - Questions about what got nominated were difficult to answer (inconsistency in the level of importance of what gets seeded)
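A minimal sketch of one possible download standard, assuming a Python-based fetch step: every file gets an appended JSON-lines log record with its source URL, timestamp, checksum, and local path. The field names and the `fetch_and_log` helper are hypothetical, not an agreed convention.

```python
import hashlib
import json
import time
from pathlib import Path
from urllib.parse import urlparse

import requests


def fetch_and_log(url, dest_dir, log_path="downloads.log.jsonl"):
    """Download one file and append a record of where it came from."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / Path(urlparse(url).path).name

    resp = requests.get(url, timeout=60)
    resp.raise_for_status()
    dest.write_bytes(resp.content)

    record = {
        "source_url": url,  # answers "should the URLs files came from be included?"
        "retrieved_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "sha256": hashlib.sha256(resp.content).hexdigest(),  # cheap de-duplication later
        "local_path": str(dest),
    }
    with open(log_path, "a") as log:
        log.write(json.dumps(record) + "\n")
    return dest
```

The checksum field is included because de-duplication was called out as a pain above: identical files surface immediately when their sha256 values collide.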
- Event planning thoughts:
  - Make clear whether it is a "show up" event versus a more structured (all-day) one
  - Think through **onboarding**:
    - Stagger it (people can't just walk up; bring them in every 30 mins or 1 hr), with a staging area beforehand (aka seeding?)
    - Consider a crash course on specific research skills and a walkthrough example (maybe a screencast)? The overall workflow? **Recognizing [code smells](https://en.wikipedia.org/wiki/Code_smell)** and using **Chrome Developer Tools**
    - **What about starting everyone on seeding first (15-45 mins)?**
    - There will be those who want to stay in one section all day (e.g., a harvester not interested in research) -- maybe that works for people who get there at the beginning of the day?
  - Print out support materials to minimize conflicting instructions
    - e.g. 5 pointers for crawlability that EVERYONE SEES, not just researchers
  - Consider grouping harvesters and researchers in pods (e.g. 2 harvesters / 2 researchers)
  - Instead of just Harvesting and Research, have a couple of stages with people working across them?
    1. Initial evaluation
    2. Triage
    3. Implementation
  - Make sure you have pod/group leads and take advantage of delegation!!
  - Have check-ins, an intro, and a debrief (especially for two-day events :))
  - Clearer description of what will happen (aka -- we don't need your hard drives)
  - Provide a link to materials (especially what a web crawler is, and those 5 pointers for crawlability)
  - **OUTSTANDING: Clear sense of what will be crawlable via IA FTP**
- Additional comments:
  - Consider having a staging area for uploads? (i.e., handing off scripts to a downloads team at a certain point?)