@numberwhun
Created May 13, 2016 17:07
Unstructured Data Mining Primer
Borrowed from: https://icrunchdatanews.com/unstructured-data-mining-primer/
--------------------------------------------------------------------------------------------------
Though it has been practiced for some years, the mining of unstructured data has recently attracted quite a bit of attention. Most stored data is unstructured and contains a great deal of relevant information. Meanwhile, the available structured data is already being exploited; hence the rising interest in unstructured data.
Most often, what is meant by “unstructured data” is natural language text, but there are other types, such as link data, digital audio recordings, images and video. Each of these represents a very diverse set of potential data sources, such as:
Text:
internal company emails
business news feeds
customer complaints
annual shareholder reports
computer logs
social media status updates
Link Data:
cellphone customers who have called each other
products which have been purchased together
legislators who have voted similarly
medical patients who have come in contact with each other
Audio Recordings:
customer service calls
military audio surveillance (for passing vehicles, etc.)
music files
digitized recordings of engine noise
Images:
agricultural inspection images
x-rays of airport luggage
satellite weather images
drone photographs of wildlife
medical CT scans
Video:
security camera footage
traffic monitoring video
Unstructured data is frequently composed of mixed types – text documents with embedded images or free-form text fields in relational databases, for instance. Separating the various components of unstructured data is sometimes a technically challenging task.
What is termed “unstructured data” actually contains a great deal of structure, but this structure does not conform to the most common data types, which are arranged in regular rows and columns (lists, tables, matrices, etc.), or the typical data manipulations (sorting, summing, indexing, etc.). Even when unstructured data is stored in regular arrays, such as pixels in the rows and columns of a digital photograph, the underlying structure rarely aligns with those dimensions.
Tools which analyze structured data, such as the predictive modeling tools used in data mining, are relatively well-developed and have been highly effective. Tools which deal directly with unstructured data are much less well developed and have more of a mixed track record. Not surprisingly then, a common approach to dealing with unstructured data is to extract structured information as familiar feature vectors, which are then fed to structured analytical tools.
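As a rough illustration of that extraction step applied to link data, the sketch below turns a list of phone calls into one fixed-length row of counts per customer. The call records, the customer names and the function name link_features are all hypothetical, chosen only for this example.

```python
from collections import defaultdict

# Hypothetical call records: (caller, callee) pairs, invented for illustration.
CALLS = [
    ("alice", "bob"),
    ("alice", "carol"),
    ("bob", "alice"),
    ("dave", "alice"),
]


def link_features(calls):
    """Summarize each customer's links as a fixed-length feature row."""
    outgoing = defaultdict(int)   # calls made per customer
    incoming = defaultdict(int)   # calls received per customer
    contacts = defaultdict(set)   # distinct people linked to each customer
    for caller, callee in calls:
        outgoing[caller] += 1
        incoming[callee] += 1
        contacts[caller].add(callee)
        contacts[callee].add(caller)
    return {
        customer: {
            "calls_made": outgoing[customer],
            "calls_received": incoming[customer],
            "distinct_contacts": len(neighbors),
        }
        for customer, neighbors in contacts.items()
    }


table = link_features(CALLS)
for customer in sorted(table):
    print(customer, table[customer])
```

Each resulting row has the same three columns, so the output can be treated like any other table of structured data.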
Text mining in particular very often uses this strategy, broadly following these steps:
Acquire text data from the source
Convert to a common format (for example, HTML, Word documents and PDFs to plain ASCII text)
Delete or re-direct extraneous material (embedded tables, charts, pictures, etc.)
Eliminate noise words (“of,” “a,” “the,” etc.)
Reduce words to their stems: “lending” becomes “lend,” “defaulted” becomes “default,” etc.
Consolidate synonyms
Extract features – often these are simple statistical summaries, such as counts or percentages of terms from special lists (such as “positive” or “negative” words for sentiment analysis)
Proceed as usual using structured data analysis tools
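The middle steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the word lists are tiny, made-up examples rather than real lexicons, crude_stem is a toy suffix stripper standing in for a proper stemming algorithm, and the function and feature names are assumptions made for this example. It starts from text that has already been converted to plain ASCII.

```python
import re
from collections import Counter

# All word lists below are tiny, hypothetical examples, not real lexicons.
NOISE_WORDS = {"of", "a", "the", "on", "and", "was", "is", "to"}
SYNONYMS = {"borrower": "customer", "client": "customer"}
POSITIVE = {"repay", "good"}
NEGATIVE = {"default", "complaint", "late"}


def crude_stem(word):
    """Strip a few common suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_features(text):
    """Turn plain text into a small dictionary of numeric features."""
    tokens = re.findall(r"[a-z]+", text.lower())           # split into words
    tokens = [t for t in tokens if t not in NOISE_WORDS]   # drop noise words
    tokens = [crude_stem(t) for t in tokens]               # reduce to stems
    tokens = [SYNONYMS.get(t, t) for t in tokens]          # consolidate synonyms
    counts = Counter(tokens)
    total = len(tokens)
    positive = sum(counts[w] for w in POSITIVE)
    negative = sum(counts[w] for w in NEGATIVE)
    return {
        "n_terms": total,
        "positive_count": positive,
        "negative_count": negative,
        "negative_share": negative / total if total else 0.0,
    }


features = extract_features("The borrower defaulted on the lending agreement.")
print(features)
```

The returned dictionary is one row of structured data per document; a collection of such rows can then be handed to whatever structured analysis tool is already in use.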
Note that most text mining solutions do not try to get the computer to “understand” the complete meaning of sentences and documents; the computer does not syntactically “read” the text. Often, comparatively simple summaries or data representations are used to good effect in this field.
Organizations already collect a substantial amount of unstructured (or semi-structured) data from customers, partners and suppliers, and yet more is available through the media and the Internet – especially social media. Any of these unstructured data sources might be analyzed for correlation to business metrics of interest. Organizations in many fields are profitably exploiting unstructured data today, often with surprisingly simple tools.
Late in 2015, Harrisburg University of Science and Technology hosted Data Analytics Summit II, an analytics conference with a theme of unstructured data. Speakers came from a mixture of backgrounds and presented information on a variety of types of unstructured data. To the best of my knowledge, neither paper nor electronic copies of the presentation materials are being distributed, but video of the presentations may be of interest and can be found at the following Web links:
Presentation by IBM Data & Analytics ( http://livestream.com/accounts/13547584/datasummit )
Presentation by QwikIntelligence, Inc. ( http://livestream.com/accounts/13547584/events/4575259 )
Presentation by WildFig Data ( http://livestream.com/accounts/13547584/events/4575262 )