# A2 Debrief Notes
Debrief on January 28, 2017 with Researchers and Harvesters
- Opened with general thoughts and questions from Researchers and Harvesters:
- Researchers raised the concern of overall **usefulness to** harvesters; key to have validation, and especially to have helpful details cycle back
- Mention that there is a lot of potential noise in the data
- Need better translation of the goals of what to archive, and a way to determine relevancy
- e.g. is tabular data preferred? What is high-value data (upstream datasets)?
- Can we make a list of formats?!? (e.g. .csv, PDFs? -- see the filter sketch after this list)
- What level of separation of context from data is allowed?
- Frequent sense of "**are we getting everything?**" and where to stop when scoping datasets. However, once scoped, the solution was relatively approachable.
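One way to make the format question concrete is a shared allowlist that harvesters check candidate files against. A minimal sketch, assuming a hypothetical set of agreed extensions (the actual list was still an open question above):

```python
# Hypothetical format allowlist; the extensions here are assumptions,
# not an agreed standard from the event.
from pathlib import Path

ALLOWED_FORMATS = {".csv", ".json", ".xml", ".pdf", ".xls", ".xlsx", ".zip"}

def is_wanted(path: str) -> bool:
    """Return True if the file extension is on the allowlist."""
    return Path(path).suffix.lower() in ALLOWED_FORMATS

candidates = ["buoy_readings.csv", "press_release.html", "station_map.pdf"]
print([c for c in candidates if is_wanted(c)])
# -> ['buoy_readings.csv', 'station_map.pdf']
```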
- Ambiguity in instructions:
- What is the range of acceptable formats?
- For preserving metadata, what about HTML vs. WARC?
- Metadata questions:
- How detailed should the metadata description be? (In the instructions, be clear how to generate those files -- controlled vocab? Range of inputs?)
- In the JSON manifest, "Federal agency data acquired from" <--- what is this? What about sub-organizations or regional agencies? (see the manifest sketch below)
- Think about metadata automation (version tracking/PM tools: JIRA, Phabricator, but other options?)
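To ground the manifest questions, here is a hedged sketch of generating the per-dataset JSON manifest programmatically. Only the "Federal agency data acquired from" field is quoted from the notes; every other field name and value is an assumption for illustration:

```python
# Hypothetical manifest generator; all field names except
# "Federal agency data acquired from" are assumptions, not the real schema.
import json
from datetime import datetime, timezone

manifest = {
    "Federal agency data acquired from": "NOAA",  # parent agency, sub-org, or regional office?
    "sub_organization": "",  # assumed field, prompted by the sub-org question above
    "source_urls": ["https://www.example.gov/data/"],  # placeholder URL
    "acquired_at": datetime.now(timezone.utc).isoformat(),
    "description": "",  # controlled vocabulary or free text? unresolved above
}

with open("manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```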
- Most important is some advance thought so time is spent only on getting the 'right' data, as de-duplicating is a pain (_note: many uncrawlable but scrapable interfaces had data accessible via FTP, e.g. the National Buoy Database -- see the FTP sketch below_)
- There is going to be some level of redundancy; accept that, but work to minimize it
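For interfaces that are scrapable but not crawlable, the FTP route mentioned above can be scripted directly. A minimal sketch using Python's standard ftplib; the host and directory are placeholders, not the actual National Buoy Database layout:

```python
# Sketch of mirroring CSV files from an FTP server; host/path are placeholders.
from ftplib import FTP

with FTP("ftp.example.gov") as ftp:  # assumed host
    ftp.login()                      # anonymous login
    ftp.cwd("/pub/data")             # assumed directory
    for name in ftp.nlst():
        if name.endswith(".csv"):
            with open(name, "wb") as f:
                ftp.retrbinary(f"RETR {name}", f.write)
```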
- More **research**, and more targeted research, would be extremely helpful:
- What about priorities for downloading? -- key to have subject matter experts **working with** harvesters and researchers
- Working in **groups** for harvesting was helpful, with one person in a bit more of a leadership role
- A URL-by-URL approach is not ideal. Some **holistic research** to assess sites and how to begin harvesting is crucial (and if it happens with experts, it would guide priorities)
- Even just **GROUP RELATED DOMAINS**: grouping and sorting (even within a spreadsheet) -- basically getting related data together (see the grouping sketch after this list)
- Also have a triage of the level of difficulty of issues: coder triage? (Level 1, Level 2, Level 3) **NEED THESE IN ADVANCE**
- Think about assessing "domain topology issues" -- mapping TLDs and sub-domains to get a handle on where data lives
- Is there a better way to carve up data harvesting? And who deals with which types of data at an agency?
- Have key contacts for QAing/researching and grouping together datasets?
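The domain-grouping and topology-mapping ideas above can start very simply. A rough sketch that buckets seed URLs by an approximate registered domain; the URLs are illustrative, and real use would want proper public-suffix handling:

```python
# Bucket seed URLs by a crude registered-domain guess so related data
# lands with the same pod. Seed URLs here are examples only.
from collections import defaultdict
from urllib.parse import urlparse

seeds = [
    "https://www.ndbc.noaa.gov/data/realtime2/",
    "https://tidesandcurrents.noaa.gov/stations.html",
    "https://www.epa.gov/ghgreporting/data-sets",
]

by_domain = defaultdict(list)
for url in seeds:
    host = urlparse(url).hostname or ""
    # crude: last two labels, no public-suffix list
    domain = ".".join(host.split(".")[-2:])
    by_domain[domain].append(url)

for domain, urls in sorted(by_domain.items()):
    print(domain, urls)
```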
- Tools:
- Using Slack for small-group work in a private message channel worked well
- Be clearer about some standards for downloading (see the logging sketch after this list), e.g.:
- Should the URLs that files were downloaded from be included?
- What about logs of downloads?
- Better task and project management
- A spreadsheet works for quick and dirty, but there is too much overloading
- Some issue-tracking/PM solutions would provide a channel for researchers and harvesters to cross-talk (e.g. have an issue per dataset and almost like a checklist for evaluation? Labels? -- people good at identifying JavaScript just do that; people with domain expertise just evaluate importance)
- Questions about what got nominated were difficult to answer (inconsistency in the level of importance of what gets seeded)
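One candidate standard that would answer both download questions above is to log provenance at download time. A minimal sketch, assuming a flat CSV log; the URL and filenames are placeholders:

```python
# Fetch a file and append its source URL, timestamp, and checksum to a log,
# so provenance questions can be answered later. Paths/URL are placeholders.
import csv
import hashlib
import urllib.request
from datetime import datetime, timezone

def fetch_and_log(url: str, dest: str, log_path: str = "download_log.csv") -> None:
    data = urllib.request.urlopen(url).read()
    with open(dest, "wb") as f:
        f.write(data)
    with open(log_path, "a", newline="") as log:
        csv.writer(log).writerow([
            url,
            dest,
            datetime.now(timezone.utc).isoformat(),
            hashlib.sha256(data).hexdigest(),
        ])

fetch_and_log("https://www.example.gov/data/file.csv", "file.csv")
```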
- Event planning thoughts:
- Make clear whether it is a "show up" event versus a more structured (all-day) one
- Think through **Onboarding**
- Stagger it (people can't just walk up; bring people in every 30 min or 1 hr). Have a staging area before (aka seeding?)
- Consider a crash course on specific research skills and a walkthrough example (maybe a screencast)? Overall workflow? **Recognizing [code smells](https://en.wikipedia.org/wiki/Code_smell)** and using **Chrome Developer Tools**
- **What about starting everyone on seeding first (15-45 mins)?**
- There will be those who want to be in one section all day (e.g., a harvester not interested in research) -- maybe that works for people who get there at the beginning of the day?
- Print out support materials to minimize conflicting instructions
- e.g. 5 pointers for crawlability that EVERYONE SEES, not just researchers
- Consider grouping harvesters and researchers in pods (e.g. 2 harvesters / 2 researchers)
- Instead of just Harvesting and Research, have a couple of stages with people working across them:
1. Initial evaluation
2. Triage
3. Implementation
- Make sure you have pod/group leads and take advantage of delegation!!
- Have check-ins, an intro, and a debrief (especially for two-day events :))
- Clearer description of what will happen (aka -- we don't need your hard drives)
- Provide a link to materials (especially what a web crawler is, and those 5 pointers for crawlability)
- **OUTSTANDING: Clear sense of what will be crawlable via IA FTP**
- Additional comments:
- Consider having a staging area for uploads? (i.e., handing off scripts to a downloads team at a certain point?)