@yuriybash
Created January 16, 2019 21:05
text classification question
let’s say you have two text columns: Title and URL. your data looks like this:
```
Title,URL,NonEng
slack vs microsoft teams vs google hangouts,techcrunch,0
the costs of universal healthcare,nytimes,1
...
```
and you are trying to create a binary classifier (the “NonEng” column)
you fit a CountVectorizer on the concatenation of both columns (i.e. a new feature Title_URL - or is this already wrong?), which comes up with a `vocabulary_` that looks something like this:
```
{
'slack': 0,
'vs': 1,
'microsoft': 2,
'teams': 3,
'google': 4,
'hangouts': 5,
'techcrunch': 6,
'the': 7,
'costs': 8,
'of': 9,
'universal': 10,
'healthcare': 11,
'nytimes': 12
}
```
and then apply that to the concatenated first row - “slack vs microsoft teams vs google hangouts techcrunch”
you then get a 13-dimensional (sparse) row vector with the value:
`[1, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]`
(stop me if this is incorrect)
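for reference, here is a minimal sketch of the concatenation approach on the two example rows. the variable names (`titles`, `urls`, `title_url`, `concat_vectorizer`) are made up for illustration. one caveat: as far as i can tell, sklearn assigns `vocabulary_` indices in alphabetical order of the tokens, so the actual indices will differ from the hand-written vocabulary above, although the counts per token are the same.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The two example rows from the question.
titles = [
    "slack vs microsoft teams vs google hangouts",
    "the costs of universal healthcare",
]
urls = ["techcrunch", "nytimes"]

# Method 1: join Title and URL into one string per row, then vectorize.
title_url = [f"{t} {u}" for t, u in zip(titles, urls)]

concat_vectorizer = CountVectorizer()
X_concat = concat_vectorizer.fit_transform(title_url)  # scipy sparse matrix

# Look counts up by token rather than by position, because the
# fitted vocabulary is not in order of first appearance.
vocab = concat_vectorizer.vocabulary_
first_row = X_concat[0].toarray().ravel()

print(X_concat.shape)          # rows x vocabulary size
print(first_row[vocab["vs"]])  # how often "vs" appears in row 0
```

note that after concatenation the model can no longer tell whether a token came from Title or from URL: a title containing the word "techcrunch" would hit the same column as the URL value.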
the estimator we use in sklearn ultimately needs its training input data (X) to be a single matrix of shape [number_rows x number_features]. given that this is the case, how is it possible to keep these two features (Title and URL) “separate”, so to speak, when training the model?
-------------------
i can think of one way to do this differently -
instead of concatenating the two columns together and _then_ converting the concatenated string to this vector form, I could apply a vectorizer to each column *separately*. so you first fit it on the “Title” feature and get a vocabulary that looks like:
```
{
'slack': 0,
'vs': 1,
'microsoft': 2,
'teams': 3,
'google': 4,
'hangouts': 5,
'the': 6,
'costs': 7,
'of': 8,
'universal': 9,
'healthcare': 10,
}
```
and when you apply it to the “Title” value for the first row (“slack vs microsoft teams vs google hangouts”), you get:
`[1, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0]`
i.e. an 11-dimensional sparse vector.
_then_, you separately create a vocabulary for the “URL” column, and end up with:
```
{
"techcrunch": 0,
"nytimes": 1
}
```
you then convert the “URL” value for the first row, which becomes a 2-dimensional sparse row vector with the value `[1, 0]`
_then_, you concatenate the two row vectors side by side, i.e. along the feature (column) axis, so `+` here means concatenation and not element-wise addition:
`[1, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0] + [1, 0]`
and get your final input for the first row:
`[1, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0]`
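a rough sketch of this second method, assuming one `CountVectorizer` per column and `scipy.sparse.hstack` to join the two matrices column-wise. the names `title_vectorizer`, `url_vectorizer` and `X_separate` are made up for illustration. again, sklearn orders each vocabulary alphabetically, so the URL part of the first row should come out as `[0, 1]` ("nytimes" before "techcrunch") rather than the `[1, 0]` written above; the idea is the same.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

titles = [
    "slack vs microsoft teams vs google hangouts",
    "the costs of universal healthcare",
]
urls = ["techcrunch", "nytimes"]

# Method 2: one vectorizer (and therefore one vocabulary) per column.
title_vectorizer = CountVectorizer()
url_vectorizer = CountVectorizer()

X_title = title_vectorizer.fit_transform(titles)  # 11 title columns
X_url = url_vectorizer.fit_transform(urls)        # 2 URL columns

# Stack the two sparse matrices side by side: the title columns come
# first, the URL columns are appended after them.
X_separate = hstack([X_title, X_url]).tocsr()

print(X_separate.shape)
print(X_separate[0].toarray().ravel())
```

the result is still a single [number_rows x number_features] matrix, so any estimator can consume it, but each column now belongs to exactly one source field.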
So using the first method (concatenation), your first input row is:
`[1, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]`
and with the second method (separate vectorization), your first input row is:
`[1, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0]`
which of these is correct? or is there another method i’m not considering?
it’s strange, i’ve googled a ton, but all the examples with text vectorization use only *one* input field (i.e. only Title), rather than multiple ones.