# A2 Debrief Notes
January 28, 2017
- Opened for general thoughts and questions from Researchers and Harvesters:
  - Researchers raised the concern of overall **usefulness to** harvesters; key to have validation, and especially to have helpful details cycle back
    - Mention that there is a lot of potential noise in the data
  - Need better translation of the goals of what to archive, and a way to determine relevancy
    - e.g. is tabular data preferred? What counts as high-value data (upstream datasets)?
    - Can we make a list of formats?!? (e.g. .csv, PDFs?)
    - What level of separation of context from data is allowed?
  - Frequent sense of "**are we getting everything?**" and of where to stop when scoping datasets. However, once scoped, the solution was relatively approachable.
  - Ambiguity in instructions:
    - What is the range of acceptable formats?
    - For preserving metadata, what about HTML vs. WARC?
  - Metadata questions (see the manifest sketch after this list):
    - How detailed should the metadata description be? (In the instructions, be clear about how to generate those files: controlled vocabulary? range of inputs?)
    - In the JSON manifest, "Federal agency data acquired from" <--- what is this? What about sub-organizations or regional agencies?
    - Think about metadata automation (version tracking/PM tools: JIRA, Phabricator, but other options?)
  - Most important to have some advance thought so time is only spent getting the 'right data', as de-duplicating is a pain (_note: many uncrawlable but scrapable interfaces had data accessible via FTP, e.g. the National Buoy Database_)
  - There is going to be some level of redundancy; accept that, but work to minimize it
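Since the JSON manifest came up more than once above, here is a minimal sketch in Python of what a per-dataset manifest could look like. Apart from "Federal agency data acquired from", which is quoted from the notes, every field name and value below is a hypothetical placeholder, not a confirmed schema.

```python
import json

# Minimal manifest sketch. Only "Federal agency data acquired from" is taken
# from the notes above; every other field is a hypothetical placeholder.
manifest = {
    "Federal agency data acquired from": "NOAA",
    "sub_organization": "National Data Buoy Center",  # hypothetical: addresses the sub-org question
    "source_urls": ["https://www.ndbc.noaa.gov/"],    # hypothetical: records where the data came from
    "file_formats": [".csv", ".pdf"],                 # hypothetical: feeds the "list of formats" ask
    "description": "Historical buoy observations",    # free text; controlled vocabulary still open
}

with open("manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```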
- More **research**, and more targeted research, would be extremely helpful:
  - What about priorities for downloading? -- key to have subject-matter experts **working with** harvesters and researchers
  - Working in **groups** for harvesting was helpful, with one person in a bit more of a leadership role
  - A URL-by-URL approach is not ideal. Some **holistic research** to assess sites and how to begin harvesting is crucial (and if it happens with experts, it would guide priorities)
  - Even just **GROUP RELATED DOMAINS**: grouping and sorting (even within a spreadsheet) -- basically getting related data together
  - Also have a triage of the level of difficulty of issues: coder triage? (Level 1, Level 2, Level 3) **NEED THESE IN ADVANCE**
  - Think about assessing "domain topology issues" -- mapping TLDs and sub-domains to get a handle on where data lives (see the grouping sketch after this list)
  - Is there a better way to carve up data harvesting? And who deals with what types of data at an agency?
  - Have key contacts for QAing/researching and grouping together datasets?
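As a sketch of the "group related domains" idea: given a flat list of seed URLs (in effect, the spreadsheet column), a few lines of Python can bucket them by registered domain so sub-domains of one agency land together. The seed URLs below are hypothetical examples.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical seed list -- in practice this would come from the nomination spreadsheet.
seeds = [
    "https://www.ndbc.noaa.gov/data/historical/",
    "https://tidesandcurrents.noaa.gov/stations.html",
    "https://www.epa.gov/ghgreporting",
]

groups = defaultdict(list)
for url in seeds:
    host = urlparse(url).netloc
    # Crude heuristic: keep the last two labels (e.g. "noaa.gov") so
    # sub-domains of the same agency group together.
    registered = ".".join(host.split(".")[-2:])
    groups[registered].append(url)

for domain, urls in sorted(groups.items()):
    print(f"{domain}: {len(urls)} seed(s)")
    for u in urls:
        print("   ", u)
```

A real pass would want a public-suffix-aware library, since the two-label heuristic fails for hosts like example.co.uk, but even this crude grouping makes the "domain topology" visible.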
- Tools:
  - Using Slack for small-group work in a private message channel worked well
  - Be more clear about standards for downloading, e.g.:
    - Should the URLs that files were downloaded from be included?
    - What about logs of downloads? (see the sketch after this list)
  - Better task and project management:
    - A spreadsheet works for quick and dirty, but there is too much overloading
    - Some issue-tracking/PM solution would provide a channel for researchers and harvesters to cross-talk (e.g. have an issue per dataset, almost like a checklist for evaluation? labels? -- people good at identifying JavaScript just do that, people with domain expertise just evaluate importance)
  - Questions about what got nominated were difficult to answer (inconsistency in the level of importance of what gets seeded)
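A minimal sketch of one possible download standard, assuming a Python-based fetch step: every file gets an appended JSON-lines log record with its source URL, timestamp, checksum, and local path. The field names and the `fetch_and_log` helper are hypothetical, not an agreed convention.

```python
import hashlib
import json
import time
from pathlib import Path
from urllib.parse import urlparse

import requests


def fetch_and_log(url, dest_dir, log_path="downloads.log.jsonl"):
    """Download one file and append a record of where it came from."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / Path(urlparse(url).path).name

    resp = requests.get(url, timeout=60)
    resp.raise_for_status()
    dest.write_bytes(resp.content)

    record = {
        "source_url": url,  # answers "should the URLs files came from be included?"
        "retrieved_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "sha256": hashlib.sha256(resp.content).hexdigest(),  # cheap de-duplication later
        "local_path": str(dest),
    }
    with open(log_path, "a") as log:
        log.write(json.dumps(record) + "\n")
    return dest
```

The checksum field is included because de-duplication was called out as a pain above: identical files surface immediately when their sha256 values collide.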
- Event planning thoughts:
  - Make clear whether it is a "show up" event versus a more structured (all-day) one
  - Think through **onboarding**:
    - Stagger it (people can't just walk up; bring them in every 30 mins or 1 hr), with a staging area beforehand (aka seeding?)
    - Consider a crash course on specific research skills and a walkthrough example (maybe a screencast)? The overall workflow? **Recognizing [code smells](https://en.wikipedia.org/wiki/Code_smell)** and using **Chrome Developer Tools**
    - **What about starting everyone on seeding first (15-45 mins)?**
    - There will be those who want to stay in one section all day (e.g., a harvester not interested in research) -- maybe that works for people who get there at the beginning of the day?
  - Print out support materials to minimize conflicting instructions
    - e.g. 5 pointers for crawlability that EVERYONE SEES, not just researchers
  - Consider grouping harvesters and researchers in pods (e.g. 2 harvesters / 2 researchers)
  - Instead of just Harvesting and Research, have a couple of stages with people working across them?
    1. Initial evaluation
    2. Triage
    3. Implementation
  - Make sure you have pod/group leads and take advantage of delegation!!
  - Have check-ins, an intro, and a debrief (especially for two-day events :))
  - Clearer description of what will happen (aka -- we don't need your hard drives)
  - Provide a link to materials (especially what a web crawler is, and those 5 pointers for crawlability)
  - **OUTSTANDING: Clear sense of what will be crawlable via IA FTP**
- Additional comments:
  - Consider having a staging area for uploads? (i.e., handing off scripts to a downloads team at a certain point?)