Created January 16, 2019 21:05
text classification question
let’s say you have two text columns: Title and URL. your data looks like this:
```
Title,URL,NonEng
slack vs microsoft teams vs google hangouts,techcrunch,0
the costs of universal healthcare,nytimes,1
...
```
and you are trying to create a binary classifier (the “NonEng” column)
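for concreteness, here is a minimal sketch that loads this toy data using only the standard library (the column names come from the CSV header; `pandas.read_csv` would work just as well):

```python
import csv
import io

# the toy dataset from above, inlined so the sketch is self-contained
data = """Title,URL,NonEng
slack vs microsoft teams vs google hangouts,techcrunch,0
the costs of universal healthcare,nytimes,1
"""

rows = list(csv.DictReader(io.StringIO(data)))
titles = [r["Title"] for r in rows]   # first text feature
urls = [r["URL"] for r in rows]       # second text feature
labels = [int(r["NonEng"]) for r in rows]  # binary target
```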
you instantiate a CountVectorizer applied to the concatenation of both columns (i.e. a new feature Title_URL - or is this already wrong?), which comes up with a `vocabulary_` that looks something like this:
```
{
    'slack': 0,
    'vs': 1,
    'microsoft': 2,
    'teams': 3,
    'google': 4,
    'hangouts': 5,
    'techcrunch': 6,
    'the': 7,
    'costs': 8,
    'of': 9,
    'universal': 10,
    'healthcare': 11,
    'nytimes': 12
}
```
and then apply that to the concatenated first sentence - “slack vs microsoft teams vs google hangouts techcrunch” | |
you then get a 13-dimensional (sparse) row vector that has the value:
`[1, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]` | |
(stop me if this is incorrect) | |
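the steps above can be sketched in plain Python. one assumption for readability: the vocabulary indices below follow first-seen order, whereas sklearn's actual CountVectorizer assigns indices in alphabetical order, so the real column positions would differ (the counts themselves are the same):

```python
# the concatenated Title + URL text for each row
docs = [
    "slack vs microsoft teams vs google hangouts techcrunch",
    "the costs of universal healthcare nytimes",
]

# build the vocabulary: token -> column index, in first-seen order
vocab = {}
for doc in docs:
    for tok in doc.split():
        if tok not in vocab:
            vocab[tok] = len(vocab)

# vectorize: one row per document, one count column per vocabulary entry
def vectorize(doc):
    row = [0] * len(vocab)
    for tok in doc.split():
        row[vocab[tok]] += 1
    return row

X = [vectorize(d) for d in docs]
print(X[0])  # [1, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
```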
the estimator we use in sklearn ultimately needs its training input data (X) to be a matrix of size [number_rows X number_features]. given that this is the case, how is it possible to keep these two features (Title and URL) “separate”, so to speak, when training the model? | |
------------------- | |
i can think of one way to do this differently - | |
instead of concatenating the two columns and _then_ converting the concatenated string to this vector form, I could apply a vectorizer to each column *separately*. so you first run it on the “Title” feature and get a vocabulary that looks like:
```
{
    'slack': 0,
    'vs': 1,
    'microsoft': 2,
    'teams': 3,
    'google': 4,
    'hangouts': 5,
    'the': 6,
    'costs': 7,
    'of': 8,
    'universal': 9,
    'healthcare': 10
}
```
and when you apply it to the “Title” value for the first row (“slack vs microsoft teams vs google hangouts”), you get: | |
`[1, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0]` | |
i.e. an 11-dimensional sparse vector. | |
_then_, you separately create a vocabulary for the “URL” column, and end up with: | |
```
{
    "techcrunch": 0,
    "nytimes": 1
}
```
you then convert the “URL” value for the first row, which becomes a 2-dimensional sparse row vector with the value `[1, 0]`
_then_, you combine the two row vectors horizontally (along the feature axis):
`[1, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0] + [1, 0]` | |
and get your final input for the first row: | |
`[1, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0]` | |
So using the first method (concatenation), your first input row is: | |
`[1, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]` | |
and with the second method (separate vectorization), your first input row is: | |
`[1, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0]` | |
which of these is correct? or is there another method i’m not considering? | |
it’s strange, i’ve googled a ton, but all the examples with text vectorization use only *one* input field (i.e. only Title), rather than multiple ones.