yuriybash/gist:864760c83adae94885b5407c995e0fa1

## gistfile1.txt
let’s say you have two text columns: Title and URL. your data looks like this:

```
Title,URL,NonEng
slack vs microsoft teams vs google hangouts,techcrunch,0
the costs of universal healthcare,nytimes,1
...
```

and you are trying to train a binary classifier that predicts the “NonEng” column.


you fit a CountVectorizer on the concatenation of both columns (i.e. a new feature Title_URL - or is this already wrong?), which produces a `vocabulary_` that looks something like this:
```
{
    'slack': 0,
    'vs': 1,
    'microsoft': 2,
    'teams': 3,
    'google': 4,
    'hangouts': 5,
    'techcrunch': 6,
    'the': 7,
    'costs': 8,
    'of': 9,
    'universal': 10,
    'healthcare': 11,
    'nytimes': 12
}
```
and then apply that to the concatenated first row - “slack vs microsoft teams vs google hangouts techcrunch”

you then get a 13-dimensional (sparse) row vector with the value:

`[1, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]`

(stop me if this is incorrect)
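a minimal sketch of this first method, assuming the two sample rows above are the whole dataset and the default `CountVectorizer` settings are used. one thing it shows: sklearn assigns vocabulary indices in alphabetical order rather than order of appearance, so the counts are the same as in the hand-written vector but they land in different positions. the variable names (`titles`, `urls`, `title_url`, `X_concat`) are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# the two sample rows from the data above
titles = [
    "slack vs microsoft teams vs google hangouts",
    "the costs of universal healthcare",
]
urls = ["techcrunch", "nytimes"]

# method 1: join both columns into a single string per row
title_url = [f"{t} {u}" for t, u in zip(titles, urls)]

vectorizer = CountVectorizer()
X_concat = vectorizer.fit_transform(title_url)  # sparse matrix, one row per sample

# the vocabulary is sorted alphabetically: 'costs' -> 0, ..., 'vs' -> 12
print(vectorizer.vocabulary_["costs"], vectorizer.vocabulary_["vs"])  # 0 12
print(X_concat.shape)  # (2, 13)

first_row = X_concat.toarray()[0].tolist()
print(first_row)  # [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 2]
```

with this approach, a word that appears in both Title and URL would share a single column, so the model cannot tell which field it came from.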

the estimator we use in sklearn ultimately needs its training input data (X) to be a matrix of shape [number_rows x number_features]. given that, how is it possible to keep these two features (Title and URL) “separate”, so to speak, when training the model?

-------------------

i can think of one way to do this differently -

instead of concatenating the two columns and _then_ converting the concatenated string to this vector form, I could apply a vectorizer to each column *separately*. so you first run it on the “Title” feature and get a vocabulary that looks like:

```
{
    'slack': 0,
    'vs': 1,
    'microsoft': 2,
    'teams': 3,
    'google': 4,
    'hangouts': 5,
    'the': 6,
    'costs': 7,
    'of': 8,
    'universal': 9,
    'healthcare': 10,
}
```
and when you apply it to the “Title” value for the first row (“slack vs microsoft teams vs google hangouts”), you get:

`[1, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0]`

i.e. an 11-dimensional sparse vector.

_then_, you separately create a vocabulary for the “URL” column, and end up with:
```
{
    "techcrunch": 0,
    "nytimes": 1
}
```

you then convert the “URL” value for the first row, which becomes a 2-dimensional sparse row vector with the value `[1, 0]`

_then_, you combine the two row vectors side by side (i.e. along the column/feature axis):

`[1, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0] + [1, 0]`

and get your final input for the first row:


`[1, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0]`
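a rough sketch of this second method, assuming one `CountVectorizer` per column and `scipy.sparse.hstack` to glue the two sparse matrices together column-wise. the names `title_vec`, `url_vec` and `X_sep` are hypothetical. as before, sklearn orders each vocabulary alphabetically, so the positions differ from the hand-written vectors even though the counts agree.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

# the two sample rows from the data above
titles = [
    "slack vs microsoft teams vs google hangouts",
    "the costs of universal healthcare",
]
urls = ["techcrunch", "nytimes"]

# method 2: one vectorizer (and one vocabulary) per column
title_vec = CountVectorizer()
url_vec = CountVectorizer()
X_title = title_vec.fit_transform(titles)  # shape (2, 11)
X_url = url_vec.fit_transform(urls)        # shape (2, 2)

# stack the two blocks side by side: title columns first, then URL columns
X_sep = hstack([X_title, X_url]).tocsr()
print(X_sep.shape)  # (2, 13)

first_row_sep = X_sep.toarray()[0].tolist()
print(first_row_sep)  # [0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 2, 0, 1]
```

here the last two columns belong only to the URL vocabulary (`nytimes`, `techcrunch`), so a token seen in Title and the same token seen in URL would get different columns.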


So using the first method (concatenation), your first input row is:

`[1, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]`

and with the second method (separate vectorization), your first input row is:

`[1, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0]`


which of these is correct? or is there another method i’m not considering?

it’s strange, i’ve googled a ton, but all the examples with text vectorization use only *one* input field (i.e. only Title), rather than multiple ones.