@michelkana
Created July 26, 2019 15:08
# Longest word length in characters and in UTF-8 bytes
max_word_len = df.yb.str.len().max()
max_word_len_utf8 = df.yb_utf8.str.len().max()
# Number of distinct word-type labels and total number of words
nb_labels = len(df.word_type.unique())
nb_words = df.shape[0]
print("Number of words: ", nb_words)
print("Number of labels: ", nb_labels)
print("Max word length: {} characters and {} bytes".format(max_word_len, max_word_len_utf8))
@pancodia

pancodia commented Nov 8, 2021

I am following this article. When I execute model_lstm.fit to train the LSTM model, the following error occurred:

ValueError: logits and labels must have the same shape ((None, 10) vs (None, 12))

While debugging, I found:

In [89]: Y_train.shape
Out[89]: (2869, 12)

In [90]: nb_labels
Out[90]: 10

In [91]: df.word_type.max()
Out[91]: 11

In [92]: df.groupby('word_type').count().iloc[:, 0]
Out[92]: 
word_type
0     1901
1     1420
2      141
3       14
4       36
5       48
6       10
7       10
8        1
11       6
Name: en, dtype: int64

The training labels are created by Y = to_categorical(Y), which converts the integer labels to one-hot encoding. Because the maximum label index in the input dataset is 11, the one-hot encoding has dimension 12. However, the dataset contains only 10 unique label indices, so nb_labels is 10.
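A minimal sketch of why the shapes disagree: to_categorical sizes its output from the largest index it sees (like np.eye(max_label + 1)), not from the number of distinct labels present. The helper below is a toy stand-in for keras.utils.to_categorical, and the label list is hypothetical data mirroring the gaps above (no 9 or 10):

```python
import numpy as np

def to_categorical_sketch(y, num_classes=None):
    """Mimic keras.utils.to_categorical: one-hot encode integer labels.
    The output width defaults to max(y) + 1, not the number of
    distinct labels actually present."""
    y = np.asarray(y, dtype=int)
    if num_classes is None:
        num_classes = y.max() + 1
    return np.eye(num_classes)[y]

# Labels 9 and 10 are missing, as in the dataset above (toy data)
labels = [0, 1, 2, 3, 4, 5, 6, 7, 8, 11]
Y = to_categorical_sketch(labels)
print(Y.shape)            # (10, 12): 12 one-hot columns
print(len(set(labels)))   # 10: what nb_labels reports, hence the mismatch
```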

In order to resolve the issue, should we instead calculate nb_labels as follows?

nb_labels = df.word_type.max() - df.word_type.min() + 1
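As a quick check of that formula on a hypothetical word_type column with the same gaps as above:

```python
import pandas as pd

# Hypothetical word_type column: labels 9 and 10 are absent, minimum is 0
df = pd.DataFrame({"word_type": [0, 1, 2, 3, 4, 5, 6, 7, 8, 11]})

nb_unique = len(df.word_type.unique())                   # 10: misses the gap labels
nb_labels = df.word_type.max() - df.word_type.min() + 1  # 12: matches the one-hot width
print(nb_unique, nb_labels)
```

Note that to_categorical always sizes its output as max + 1, so df.word_type.max() + 1 is the safer form if the minimum label were ever nonzero; alternatively, passing num_classes to to_categorical pins the width explicitly.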

@pancodia

pancodia commented Nov 8, 2021

Further, may I ask how one usually handles the situation where the training data is missing some label categories that could still appear in production?

@michelkana
Author

@pancodia thanks for getting back to me. Sorry for the late reply; I was traveling. Did you find a fix? If yes, can you share it, or do you still need help?
