@tkazusa
Created December 2, 2018 08:53
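For categories that appear only rarely, the target encoding becomes less reliable. To reflect this, the encoded value is discounted by a confidence factor: the log of the category's occurrence count (view_cat) divided by log(n), with for example n = 100000, and capped at 1. The choice of n is arbitrary. https://www.kaggle.com/nanomathias/feature-engineering-importance-testing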
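To get a feel for how strong the discount is, the confidence factor can be computed on its own. This is a minimal sketch, assuming n = 100000 as in the note above; the helper name `confidence` is made up for illustration and is not part of the original code.

```python
import numpy as np

N = 100000  # arbitrary choice of n, taken from the note above


def confidence(count, n=N):
    """Confidence in [0, 1]: grows with log(count) and saturates once count >= n."""
    return float(min(1.0, np.log(count) / np.log(n)))


# Print the confidence for a few occurrence counts
for count in (1, 100, 1000, 100000, 1000000):
    print(count, round(confidence(count), 2))
```

With this n, a category seen 100 times keeps about 40% of its raw rate, one seen 1000 times keeps about 60%, and a category seen only once is discounted to zero.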
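import numpy as np

# n is arbitrary; the note above uses n = 100000 (assumed to be what log_group holds)
log_group = np.log(100000)

for cols in ATTRIBUTION_CATEGORIES:
    # Assumed setup, not shown in the original snippet: the feature name is
    # illustrative, and the groups are taken over the current column combination.
    new_feature = '_'.join(cols) + '_confRate'
    group_object = X_train.groupby(cols)

    # Aggregation function
    def rate_calculation(x):
        """Calculate the attributed rate. Scale by confidence"""
        rate = x.sum() / float(x.count())
        # Confidence grows with log(group size) and is capped at 1
        conf = np.min([1, np.log(x.count()) / log_group])
        return rate * conf

    # Perform the merge: attach the confidence-weighted rate to every row of its group
    X_train = X_train.merge(
        group_object['is_attributed']
        .apply(rate_calculation)
        .reset_index()
        .rename(
            index=str,
            columns={'is_attributed': new_feature}
        )[cols + [new_feature]],
        on=cols, how='left'
    )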
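The snippet above depends on `X_train` and `ATTRIBUTION_CATEGORIES` being defined elsewhere. As a rough, self-contained sketch of the same idea, the following runs the confidence-weighted encoding on a toy frame. The column `app`, the data values, and the feature name `app_confRate` are assumptions made up for this example, and `log_group` is assumed to be log(100000).

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): app 1 appears 4 times, app 2 only once
X_train = pd.DataFrame({
    'app': [1, 1, 1, 1, 2],
    'is_attributed': [1, 0, 1, 0, 1],
})
log_group = np.log(100000)  # assumed value of log(n)


def rate_calculation(x):
    """Attributed rate of the group, scaled by a log-count confidence."""
    rate = x.sum() / float(x.count())
    conf = np.min([1, np.log(x.count()) / log_group])
    return rate * conf


cols = ['app']
new_feature = 'app_confRate'  # illustrative name

# One confidence-weighted rate per group, then joined back onto the rows
rates = (
    X_train.groupby(cols)['is_attributed']
    .apply(rate_calculation)
    .reset_index()
    .rename(columns={'is_attributed': new_feature})
)
X_train = X_train.merge(rates, on=cols, how='left')
print(X_train)
```

Here app 2 has a raw rate of 1.0 from a single observation, but since log(1) = 0 its encoded value is 0, while app 1 keeps a small fraction of its raw rate of 0.5.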