Fixes: scikit-learn/scikit-learn#12470
Title: Fix `OneHotEncoder` to Safely Handle String Categories for `ignore` Unknown Strategy
Problem:
`OneHotEncoder` from scikit-learn raises a `ValueError` during the `transform` method when `handle_unknown='ignore'` is set and the categories are strings. This occurs when the array being transformed contains an unknown category and its fixed-width string dtype is too short to hold `OneHotEncoder.categories_[i][0]` (the first category). That category is used to replace unknown entries, so if it is longer than the target array's dtype allows, it is silently truncated. The truncated value no longer matches any fitted category, and the subsequent array operations fail.
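The truncation described above can be reproduced with NumPy alone. The following is a minimal sketch, not the encoder's actual code path: the sample strings and the variable names (`categories`, `X_test`, `X_masked`) are made up for illustration, and `np.isin` stands in for the encoder's internal membership check.

```python
import numpy as np

# Hypothetical fitted categories; sorted so the longest string comes first.
categories = np.array(sorted(["22", "333", "4444", "11111111"]))  # dtype <U8

# Hypothetical data to transform: short strings, one of them unknown.
X_test = np.array(["55555", "22"])  # dtype <U5

# Mark which entries were seen during fitting.
known = np.isin(X_test, categories)

# Replace unknown entries with the first category, as described above.
# The target array can only hold 5 characters, so NumPy silently
# truncates the 8-character replacement.
X_masked = X_test.copy()
X_masked[~known] = categories[0]

print(X_masked.dtype)                 # fixed-width dtype of the target array
print(X_masked[0])                    # the truncated replacement value
print(np.isin(X_masked, categories))  # the truncated value is still "unknown"
```

Because the truncated replacement is itself not a fitted category, a later lookup against the fitted categories treats it as an unseen label, which is consistent with the `ValueError` reported in the issue.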
Analysis:
The root cause is how NumPy stores strings: a string array has a fixed-width dtype sized to its longest element, so assigning a longer string into it truncates the value. Specifically, when the `handle_unknown='ignore'` option is used, unknown categories are replaced by a known category from the `categories_