This document will document progress, ideas and source code for dark data extraction systems. These systems use statistical inference to perform data extraction, integration and cleaning from unstructured/"dark" sources (forum posts, webpages, etc.). Data programming is the predominant paradigm for dark data extraction: noisy/conflicting user-defined functions are supplied to a generative model, which can recover the parameters of labelling process. Wherever possible, my projects are based on Snorkel/DeepDive.
Ideas (Extensions for the system):
- There isn't any work on Domain Specific primitives (DSPs) for audio data. Pre-trained audio models (VGGish) can serve as feature extractors for high-level concepts like emotion, accent and personality for speech data(WaveNet paper mentions that these are possible), musical genre (Sander Dieleman's Spotify CNN blog post), etc.
Ideas (Applications):
- Ecological/Environmental monitoring: use audio DSPs for building models of migration, logging/poaching, etc.