There are three options for converting a categorical feature to numerical ones:

  • Use OneHotEncoder. This transforms the categorical feature into one new column per category (four here), where each row has a single 1 and the rest 0. The problem is that the difference between "morning" and "afternoon" is the same as the difference between "morning" and "evening".

  • Use OrdinalEncoder. This transforms the categorical feature into a single column: "morning" to 1, "afternoon" to 2, and so on. The difference between "morning" and "afternoon" will be smaller than the difference between "morning" and "evening", which is good, but the difference between "morning" and "night" will be the greatest, which might not be what you want since night comes right before morning.

  • Use a transformation that I call two_hot_encoder. It is similar to OneHotEncoder, except that each row has two 1s. The difference between "morning" and "afternoon" will be the same as the difference between "morning" and "night", and it will be smaller than the difference between "morning" and "evening". I think this is the best solution. Check the code.

Code:

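import numpy as np

def two_hot(x):
    # Each column is 1 for two adjacent times of day; the last column wraps around.
    return np.concatenate([
        (x == "morning") | (x == "afternoon"),
        (x == "afternoon") | (x == "evening"),
        (x == "evening") | (x == "night"),
        (x == "night") | (x == "morning"),
    ], axis=1).astype(int)

# Column vector of shape (4, 1), one category per row
x = np.array([["morning", "afternoon", "evening", "night"]]).T
print(x)
x = two_hot(x)
print(x)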

Output:

[['morning']
 ['afternoon']
 ['evening']
 ['night']]
[[1 0 0 1]
 [1 1 0 0]
 [0 1 1 0]
 [0 0 1 1]]

Then we can measure the distances:

from sklearn.metrics.pairwise import euclidean_distances
euclidean_distances(x)

Output:

array([[0.        , 1.41421356, 2.        , 1.41421356],
       [1.41421356, 0.        , 1.41421356, 2.        ],
       [2.        , 1.41421356, 0.        , 1.41421356],
       [1.41421356, 2.        , 1.41421356, 0.        ]])
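
For comparison, the same distance check can be run on the other two encodings from the list above. This is only a rough sketch: the one-hot and ordinal encodings are built by hand with NumPy instead of calling sklearn's OneHotEncoder and OrdinalEncoder, and `pairwise_euclidean` is a hypothetical helper name, not something from the original code.

```python
import numpy as np

categories = ["morning", "afternoon", "evening", "night"]
x = np.array(categories)

def pairwise_euclidean(m):
    # Distance between every pair of rows, computed via broadcasting.
    diff = m[:, None, :] - m[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# One-hot, built by hand: one column per category, a single 1 per row.
one_hot = (x[:, None] == np.array(categories)[None, :]).astype(float)

# Ordinal, built by hand: morning=1, afternoon=2, evening=3, night=4.
ordinal = np.array([[categories.index(v) + 1] for v in x], dtype=float)

d_one_hot = pairwise_euclidean(one_hot)
d_ordinal = pairwise_euclidean(ordinal)

print(d_one_hot.round(3))  # every off-diagonal entry is equal
print(d_ordinal)           # morning vs. night is the largest entry
```

With one-hot, all pairs of different categories end up equally far apart; with ordinal, "morning" and "night" end up farthest apart. The two-hot output above avoids both.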

@duttashi
Author

Code

import numpy as np

def two_hot(x):
    # Each column is 1 for two adjacent times of day; the last column wraps around.
    return np.concatenate([
        (x == "morning") | (x == "afternoon"),
        (x == "afternoon") | (x == "evening"),
        (x == "evening") | (x == "night"),
        (x == "night") | (x == "morning"),
    ], axis=1).astype(int)

x = np.array([["morning", "afternoon", "evening", "night"]]).T
print(x)
x = two_hot(x)
print(x)

Measure distance

from sklearn.metrics.pairwise import euclidean_distances
euclidean_distances(x)
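
The two_hot function above hard-codes the four times of day. A possible generalization to any list of categories given in cyclic order might look like the sketch below. The name `cyclic_two_hot` and its `categories` parameter are assumptions for illustration, not part of the original code, and the sketch assumes at least three categories.

```python
import numpy as np

def cyclic_two_hot(x, categories):
    # x: column array of shape (n_samples, 1).
    # categories: labels in cyclic order (hypothetical parameter).
    n = len(categories)
    columns = [
        # Column i is 1 for category i and its successor, wrapping at the end.
        (x == categories[i]) | (x == categories[(i + 1) % n])
        for i in range(n)
    ]
    return np.concatenate(columns, axis=1).astype(int)

times = ["morning", "afternoon", "evening", "night"]
x = np.array([times]).T
encoded = cyclic_two_hot(x, times)
print(encoded)
```

For the four times of day this should produce the same matrix as two_hot.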
