@ganesh-srinivas
Last active November 24, 2022 15:12
Proposal for Dark Data Extraction Research

This document tracks progress, ideas and source code for dark data extraction systems. These systems use statistical inference to extract, integrate and clean data from unstructured ("dark") sources such as forum posts and webpages. Data programming is the predominant paradigm for dark data extraction: noisy, possibly conflicting user-defined labelling functions are supplied to a generative model, which can recover the parameters of the labelling process. Wherever possible, my projects are based on Snorkel/DeepDive.
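
To make the data programming idea concrete, here is a minimal sketch in plain Python. It does not use the Snorkel or DeepDive APIs. The three labelling functions, their keyword rules and the example posts are all hypothetical, and a simple majority vote stands in for the generative model, which would instead estimate how accurate each function is and weight its votes accordingly.

```python
# Votes a labelling function can cast. ABSTAIN means "no opinion".
ABSTAIN, NEGATIVE, POSITIVE = 0, -1, 1


# Hypothetical labelling functions: cheap, noisy keyword heuristics.
def lf_mentions_helped(post):
    return POSITIVE if "helped" in post.lower() else ABSTAIN


def lf_mentions_no_effect(post):
    return NEGATIVE if "no effect" in post.lower() else ABSTAIN


def lf_mentions_unexpectedly(post):
    return POSITIVE if "unexpectedly" in post.lower() else ABSTAIN


LABELING_FUNCTIONS = [
    lf_mentions_helped,
    lf_mentions_no_effect,
    lf_mentions_unexpectedly,
]


def label_matrix(posts, lfs):
    """One row per post, one column per labelling function."""
    return [[lf(post) for lf in lfs] for post in posts]


def majority_vote(row):
    """Stand-in for the generative model: unweighted vote, ties abstain."""
    total = sum(row)
    if total > 0:
        return POSITIVE
    if total < 0:
        return NEGATIVE
    return ABSTAIN


# Hypothetical forum posts, invented for illustration only.
posts = [
    "Metformin unexpectedly helped my acne.",
    "Took it for a month, no effect at all.",
    "It helped at first but then had no effect.",
    "Does anyone know where to buy this?",
]

matrix = label_matrix(posts, LABELING_FUNCTIONS)
labels = [majority_vote(row) for row in matrix]
```

The third post shows why conflicts matter: two functions disagree, so the unweighted vote abstains, whereas a model that has learned which function is more reliable could still commit to a label.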

Ideas (Extensions for the system):

  • As far as I know, there isn't any work on domain-specific primitives (DSPs) for audio data. Pre-trained audio models (e.g. VGGish) can serve as feature extractors for high-level concepts such as emotion, accent and personality in speech data (the WaveNet paper mentions that these are possible), musical genre (Sander Dieleman's Spotify CNN blog post), etc.
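
One way such an audio primitive could look, as a rough sketch only: assume each clip has already been turned into a fixed-length embedding by a pre-trained model (the model is not called here, and the 3-dimensional vectors below are toy values). The primitive compares a clip's embedding with the mean embedding of a few examples of the concept and votes when they are similar enough. The names `make_concept_primitive` and `is_concept`, and the threshold of 0.9, are hypothetical.

```python
import math

ABSTAIN, POSITIVE = 0, 1


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def make_concept_primitive(example_embeddings, threshold=0.9):
    """Build a primitive that votes POSITIVE for clips close to the concept.

    example_embeddings: embeddings of clips known to show the concept,
    assumed to come from a pre-trained audio model.
    """
    prototype = mean_vector(example_embeddings)

    def primitive(embedding):
        similarity = cosine_similarity(embedding, prototype)
        return POSITIVE if similarity >= threshold else ABSTAIN

    return primitive


# Toy embeddings standing in for real model output.
concept_examples = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
is_concept = make_concept_primitive(concept_examples)

vote_similar = is_concept([1.0, 0.1, 0.0])     # close to the prototype
vote_dissimilar = is_concept([0.0, 0.0, 1.0])  # orthogonal to the prototype
```

A primitive like this can be used in the same way as a text-based labelling function: it abstains when unsure, and its votes are combined with those of other noisy sources.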

Ideas (Applications):

  • Ecological/Environmental monitoring: use audio DSPs for building models of migration, logging/poaching, etc.
  • Digital humanities: understudied history and archaeology archives. Concrete problem: discover trading
  • Drug repurposing: build a database of serendipitous drug interactions from mentions on internet discussion forums.
  • Macro-economic indicators (like the Michigan PhD thesis on labor market flows from Twitter data).
@Ashiya96

Hi. I am also working on dark data. Could you please help me figure out how to collect data for my thesis? I thought of the university domain, but there might not be much dark data there. Then I thought of the business domain, but I still don't know where to get data from. Do you have any suggestions on which area I should study? Thank you.
