## Unstructured Data Mining Primer
Borrowed from:  https://icrunchdatanews.com/unstructured-data-mining-primer/

--------------------------------------------------------------------------------------------------

Though it has been practiced for some years, the mining of unstructured data has recently attracted quite a bit of attention. Most stored data is unstructured and contains a great deal of relevant information. Meanwhile, the available structured data is already being exploited; hence the rising interest in unstructured data.

Most often, what is meant by “unstructured data” is natural language text, but there are other types, such as link data, digital audio recordings, images and video. Each of these covers a diverse set of potential data sources, such as:

Text:

    internal company emails
    business news feeds
    customer complaints
    annual shareholder reports
    computer logs
    social media status updates

Link Data:

    cellphone customers who have called each other
    products which have been purchased together
    legislators who have voted similarly
    medical patients who have come in contact with each other

Audio Recordings:

    customer service calls
    military audio surveillance (for passing vehicles, etc.)
    music files
    digitized recordings of engine noise

Images:

    agricultural inspection images
    x-rays of airport luggage
    satellite weather images
    drone photographs of wildlife
    medical CT scans

Video:

    security camera footage
    traffic monitoring video

Unstructured data is frequently composed of mixed types – text documents with embedded images or free-form text fields in relational databases, for instance. Separating the various components of unstructured data is sometimes a technically challenging task.

What is termed “unstructured data” actually contains a great deal of structure, but this structure does not conform to the most common data types, which are arranged in regular rows and columns (lists, tables, matrices, etc.), or the typical data manipulations (sorting, summing, indexing, etc.). Even when unstructured data is stored in regular arrays, such as pixels in the rows and columns of a digital photograph, the underlying structure rarely aligns with those dimensions.
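As a small illustration of that last point, consider a toy 5x5 "image" containing a single diagonal line. The grid, its size and the variable names are invented for this sketch and do not come from any real data set. Summaries taken along rows and columns, the natural operations on a table, all look alike, while a summary taken along the line itself picks it up at once.

```python
# A 5x5 "image" holding a diagonal line: 1 = dark pixel, 0 = background.
# (Toy data invented for illustration.)
SIZE = 5
image = [[1 if row == col else 0 for col in range(SIZE)] for row in range(SIZE)]

# Row-wise and column-wise summaries, the usual operations on tabular data.
row_sums = [sum(row) for row in image]
col_sums = [sum(col) for col in zip(*image)]

# A summary that follows the line itself: the sum along the main diagonal.
diagonal_sum = sum(image[i][i] for i in range(SIZE))

print(row_sums)      # [1, 1, 1, 1, 1]
print(col_sums)      # [1, 1, 1, 1, 1]
print(diagonal_sum)  # 5
```

Every row and every column sums to 1, so those summaries cannot tell this picture apart from one with the same five pixels scattered one per row and column. The line only shows up when the data is examined along a direction that the storage layout knows nothing about.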

Tools which analyze structured data, such as the predictive modeling tools used in data mining, are relatively well developed and have been highly effective. Tools which deal directly with unstructured data are much less well developed and have more of a mixed track record. Not surprisingly, then, a common approach to dealing with unstructured data is to extract structured information as familiar feature vectors, which are then fed to structured analytical tools.

Text mining in particular very often uses this strategy, broadly following these steps:

    Acquire text data from the source
    Convert to a common format (HTML, Word documents, PDFs, etc. to plain ASCII)
    Delete or re-direct extraneous material (embedded tables, charts, pictures, etc.)
    Eliminate noise words (“of,” “a,” “the,” etc.)
    Reduce words to their stems: “lending” becomes “lend,” “defaulted” becomes “default,” etc.
    Consolidate synonyms
    Extract features – often these are simple statistical summaries, such as counts or percentages of terms from special lists (such as “positive” or “negative” words for sentiment analysis)
    Proceed as usual using structured data analysis tools
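The later steps in this list can be sketched in a few lines of Python. This is a minimal illustration rather than a production pipeline. It assumes the text has already been acquired and converted to plain text. The word lists, the `stem` helper and the `extract_features` function are hypothetical names invented for the example, and the suffix stripping is a crude stand-in for a real stemming algorithm.

```python
import re
from collections import Counter

# All word lists below are tiny, hypothetical examples, not real lexicons.
STOP_WORDS = {"of", "a", "the", "on", "and", "was", "is", "to", "but"}
SYNONYMS = {"loan": "lend", "great": "good", "terrible": "bad"}
POSITIVE = {"good", "helpful"}
NEGATIVE = {"bad", "default", "late"}
SUFFIXES = ("ing", "ed", "s")  # crude stand-in for a real stemmer


def stem(word):
    """Strip one common suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_features(text):
    """Turn free text into a small, fixed set of numeric features."""
    tokens = re.findall(r"[a-z]+", text.lower())          # split into words
    tokens = [t for t in tokens if t not in STOP_WORDS]   # drop noise words
    tokens = [stem(t) for t in tokens]                    # reduce to stems
    tokens = [SYNONYMS.get(t, t) for t in tokens]         # consolidate synonyms
    counts = Counter(tokens)
    total = len(tokens)
    positive = sum(counts[w] for w in POSITIVE)
    negative = sum(counts[w] for w in NEGATIVE)
    return {
        "n_terms": total,
        "positive_count": positive,
        "negative_count": negative,
        "negative_share": negative / total if total else 0.0,
    }


complaint = "The lending officer was helpful, but the customer defaulted on the loans."
features = extract_features(complaint)
print(features)
```

For the sample sentence, six terms survive the noise-word filter, one of them counts as positive ("helpful") and one as negative ("default"). The resulting dictionary is an ordinary row of numbers, which is exactly what structured analysis tools expect.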

Note that most text mining solutions do not try to get the computer to “understand” the complete meaning of sentences and documents; the computer does not syntactically “read” the text. Often, comparatively simple summaries or data representations are used to good effect in this field.

Organizations already collect a substantial amount of unstructured (or semi-structured) data from customers, partners and suppliers, and yet more is available through the media and the Internet – especially social media. Any of these unstructured data sources might be analyzed for correlation to business metrics of interest. Organizations in many fields are profitably exploiting unstructured data today, often with surprisingly simple tools.
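A check for correlation with a business metric can be equally simple. The sketch below computes a Pearson correlation between a text-derived feature and a metric. The weekly figures are made up purely for illustration, and the names `negative_share` and `refund_requests` are assumptions rather than anything from a real organization.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up weekly values: share of negative words in customer complaints,
# and the number of refund requests received in the same week.
negative_share = [0.10, 0.15, 0.12, 0.30, 0.25]
refund_requests = [12, 18, 14, 35, 28]

r = pearson(negative_share, refund_requests)
print(round(r, 3))
```

A value near 1 for these invented numbers only means the two series rise and fall together. With real data, a strong correlation would be a starting point for further investigation, not proof of cause.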

Late in 2015, Harrisburg University of Science and Technology hosted Data Analytics Summit II, an analytics conference with a theme of unstructured data. Speakers came from a mixture of backgrounds and presented information on a variety of types of unstructured data. To the best of my knowledge, neither paper nor electronic copies of the presentation materials are being distributed, but video of the presentations may be of interest and can be found at the following Web links:


    Presentation by IBM Data & Analytics ( http://livestream.com/accounts/13547584/datasummit )
    Presentation by QwikIntelligence, Inc. ( http://livestream.com/accounts/13547584/events/4575259 )
    Presentation by WildFig Data ( http://livestream.com/accounts/13547584/events/4575262 )