@tgalery
Last active August 29, 2015 14:01
Notes on extracting NVM topics

Preliminary notes on NVM transcript data

Intro:

Looking at the data from New Virgin Media, many conversations lack appropriate topics. This is due to a number of reasons, such as:

1. Calls are not answered, so we can't extract much.

Looking at the distribution of the sample handed in:

| Call Type  | N. Missed Calls | N. Conversations | Total |
|------------|-----------------|------------------|-------|
| UK Archive | 25              | 24               | 49    |
| US Archive | 39              | 6                | 49    |

In short, 51% of the calls are missed for the UK data and 79% for the US data. Since most transcripts of missed calls contain pre-recorded messages, they offer little insight even when concept extraction succeeds. We could filter these out by:

  1. Filtering them out using metadata from NVM or the call center.
  2. Filtering them out by building a classifier (details below).

2. Topics are extracted from the agent's speech, not the customer's, so the user profile might not be accurate.

It could be the case that extracting concepts from the agent's speech is misleading when building a semantic profile of the customer. We could distinguish the participants in the conversation by:

  1. Using metadata from NVM or the call center.
  2. Using discourse parsing, discourse plan analysis, or stream analysis.
  3. Adding anaphora resolution so we don't throw the baby out with the bathwater. (This might make the problem a little trickier.)

3. Not extracting enough topics

Now, even for successful conversations, the analysers sometimes struggle to extract relevant concepts, mainly because they are tailored towards specific concepts rather than general ones. Here are a few proposals:

  1. Edit the Spotlight model / use a different version of Spotlight. (Have we done this?)
  2. Complement concept extraction with keyword extraction. (David is working on this; the technical spec is below.)
  3. Move away from concept extraction and assign a main category to the text.

Technical specifications for possible solutions.

Complement concept extraction with keyword extraction.

We can design a custom spotter that extracts keywords and maps them to WordNet synsets (so they can be re-mapped to DBpedia/Freebase concepts), or else extracts keyword and context pairs and passes them to a custom Spotlight endpoint for disambiguation. Relevant keywords that would serve as input to this process but were missed by the analysers include: statistics, menu, group, invoice, billing. Here is a sample pipeline:

  1. Extract relevant keywords [libs: topia (Python), JATE toolkit (Java)]
  2. Map keywords to synsets, or do classic NER classification
  3. Map synsets to DBpedia or Freebase concepts [some mappings exist in Freebase; others are provided by the University of Darmstadt]
  4. Find some way to score the new entities by re-using Spotlight's context store, or some other WordNet metric.
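A minimal sketch of steps 1-3, with a naive frequency counter standing in for topia/JATE and made-up keyword-to-synset and synset-to-DBpedia tables (the real mappings would come from WordNet and the Freebase/Darmstadt resources):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "to", "of", "and", "is", "my", "i",
             "you", "can", "for", "on", "in", "it", "that", "about"}

def extract_keywords(text, top_n=5):
    """Step 1: naive frequency-based keyword extraction
    (a stand-in for topia / the JATE toolkit)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return [word for word, _ in counts.most_common(top_n)]

# Steps 2-3: hypothetical fragments of the keyword -> synset and
# synset -> DBpedia tables; the real ones would come from WordNet
# and the Freebase / Darmstadt mappings.
KEYWORD_TO_SYNSET = {"invoice": "invoice.n.01", "billing": "bill.n.01"}
SYNSET_TO_DBPEDIA = {"invoice.n.01": "dbpedia:Invoice",
                     "bill.n.01": "dbpedia:Invoice"}

def to_concepts(keywords):
    """Map keywords to DBpedia concepts via their synset ids,
    dropping anything we have no mapping for."""
    return {kw: SYNSET_TO_DBPEDIA[KEYWORD_TO_SYNSET[kw]]
            for kw in keywords if kw in KEYWORD_TO_SYNSET}
```

Step 4 (scoring against Spotlight's context store) is deliberately left out here, since it depends on Spotlight internals.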

Filter missed calls by using a classifier.

To rule out bad cases of missed calls, we could use some of the data as training and test sets for a statistical model.

  1. Determine good features for the data (filtered n-grams, perhaps, or find a good dataset on the web)
  2. Try different models for the classification (Naive Bayes, random forests; scikit-learn in Python)
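A pure-Python sketch of the classifier idea: a hand-rolled multinomial Naive Bayes with add-one smoothing over bag-of-words features (so no scikit-learn dependency in the sketch; the training lines and labels below are made-up toy data, not real transcripts):

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    """Minimal multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values()
                      for w in counts}
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_label, best_lp = None, -math.inf
        for label in self.class_counts:
            # log prior + sum of smoothed log likelihoods
            lp = math.log(self.class_counts[label] / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in tokenize(text):
                lp += math.log((self.word_counts[label][word] + 1) / denom)
            if lp > best_lp:
                best_label, best_lp = label, lp
        return best_label

# Toy training data -- invented lines standing in for real transcripts.
TEXTS = [
    "please leave a message after the tone",
    "the person you are calling is not available please leave a message",
    "hi i'd like to ask about my invoice",
    "can you help me with my billing account",
]
LABELS = ["missed", "missed", "conversation", "conversation"]
```

In practice we would swap this for scikit-learn's `MultinomialNB` once the real feature set (filtered n-grams, etc.) is settled.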

Segment participants text.

This is by far the trickiest of the options: even though it makes sense to distinguish conversational participants, conversations sometimes have interesting anaphoric dependencies across turns (e.g. A: Do you like sports channels? B: Yeah, I like it.) Options:

  1. Stream analysis
  2. Discourse Analysis + anaphora resolution (Libs: Dynamic Syntax Parser, Discourse Representation Theory Parser)
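Before anything as heavy as discourse parsing, a baseline sketch: this assumes each transcript line starts with a "Speaker:" turn label (an assumption about the NVM format — it may instead require call-center metadata), and keeps only the customer's side for concept extraction. It does nothing about cross-turn anaphora.

```python
import re

# Assumes a "Speaker: utterance" line format -- an assumption about
# the NVM transcripts, not a confirmed property of them.
TURN = re.compile(r"^(?P<speaker>[A-Za-z ]+):\s*(?P<utterance>.*)$")

def split_turns(transcript):
    """Split a labelled transcript into (speaker, utterance) pairs."""
    turns = []
    for line in transcript.splitlines():
        match = TURN.match(line.strip())
        if match:
            turns.append((match.group("speaker").strip(),
                          match.group("utterance")))
    return turns

def customer_text(turns, agent_labels=("Agent",)):
    """Concatenate everything not said by an agent, ready for extraction."""
    return " ".join(utt for speaker, utt in turns
                    if speaker not in agent_labels)
```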

Improving recall by tweaking the models.

  1. Surface forms which can't be captured because their annotation probability is not high enough:
[
billing,
invoice,
menu,
statistics,
headset,
debit,
credit,
bank,
cost,
job title,
buffet,
company,
table,
family residence,
food,
Parcelforce,
transaction,
fee,
parcel,
payment,
money,
account,
customer,
technology,
Direct debit,
font,
telephony,
supervisor,
profile,
report,
IT,
Help Desk,
driving school,
shim,
log out,
click,
chat,
engineer,
file,
price,
appraisal,
access code,
pound,
hash,
pound sign,
hash sign,
implementation,
integration,
the internet,
reliability,
caller,
student,
information,
data,
development,
pricing,
cell,
text,
developer,
contact center,
manager,
meeting,
voice mail,
cloud,
telephone system,
phone system,
portal,
demo,
lead,
screen,
call,
consultant,
multi-tenant,
multi tenant,
platform,
backup,
software,
training,
storage room,
interrogation,
investment,
DDR,
officer,
US media,
laboratory
]
  2. Associations that need to be made:
pre sales /Presales
pre-sales /Presales
the register /The_Register
situation publishing /The_Register
Situation Publishing /The_Register
quality of the call /Digital_call_quality
Salesforce /Salesforce
sales force /Salesforce
Sales force /Salesforce
click to dial /Click-to-call
click-to-dial /Click-to-call
click to call /Click-to-call
card number /Bank_card_number
dream force /Salesforce
Cisco system /Cisco_system
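A sketch of how an association file like the one above could be consumed: parse the "surface form /DBpedia_title" lines into a lookup table, then run a naive case-insensitive substring spotter over a transcript. (Real integration would go through Spotlight's surface-form store; the parsing format is just the one used above, and `ASSOCIATIONS` below is a subset of those lines.)

```python
# A subset of the association lines above, in the same format.
ASSOCIATIONS = """\
pre sales /Presales
pre-sales /Presales
the register /The_Register
click to dial /Click-to-call
card number /Bank_card_number
"""

def parse_associations(raw):
    """Parse 'surface form /DBpedia_title' lines into a lookup dict."""
    mapping = {}
    for line in raw.splitlines():
        line = line.strip()
        if " /" not in line:
            continue  # skip blanks or malformed lines
        surface, _, title = line.rpartition(" /")
        mapping[surface] = title
    return mapping

def spot(text, mapping):
    """Naive substring spotter: return DBpedia titles whose surface
    forms occur in the text (case-insensitive, deduplicated)."""
    lowered = text.lower()
    found = []
    for surface, title in mapping.items():
        if surface.lower() in lowered and title not in found:
            found.append(title)
    return found
```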

With these files added, we could extract extra topics. A summary using articles that are not missed calls can be found here: https://gist.github.com/tgalery/9c4614fd9d4434acbf82 .

@dav009

dav009 commented May 21, 2014

not throwing the baby with the bath water. :D
