Skip to content

Instantly share code, notes, and snippets.

View hanneshapke's full-sized avatar

Hannes Hapke hanneshapke

View GitHub Profile
input_type_ids = tf.zeros_like(input_mask)
def preprocessing_fn(inputs):
def tokenize_text(text, sequence_length=MAX_SEQ_LEN):
...
return tf.reshape(tokens, [-1, sequence_length])
def preprocess_bert_input(text, segment_id=0):
input_word_ids = tokenize_text(text)
...
return (
@hanneshapke
hanneshapke / adding_of_CLS_and_SEP_tokens.py
Created March 9, 2020 17:47
adding_of_CLS_and_SEP_tokens
CLS_ID = tf.constant(101, dtype=tf.int64)
SEP_ID = tf.constant(102, dtype=tf.int64)
start_tokens = tf.fill([tf.shape(text)[0], 1], CLS_ID)
end_tokens = tf.fill([tf.shape(text)[0], 1], SEP_ID)
tokens = tokens[:, :sequence_length - 2]
tokens = tf.concat([start_tokens, tokens, end_tokens], axis=1)
tokens = bert_tokenizer.tokenize(text)
bert_tokenizer = text.BertTokenizer(
vocab_lookup_table=vocab_file_path,
token_out_type=tf.int64,
lower_case=do_lower_case
)
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
We can make this file beautiful and searchable if this error is corrected: No commas found in this CSV file in line 0.
‘This is the best movie I have ever seen ...’ -> 1
‘Probably the worst movie produced in 2019 ...’ -> 0
‘Tom Hank\’s performance turns this movie into ...’ -> ?
import tensorflow_text as text
vocab_file_path = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
bert_tokenizer = text.BertTokenizer(
vocab_lookup_table=vocab_file_path,
token_out_type=tf.int64,
lower_case=do_lower_case
)
[
[[b'clara'], [b'is'], [b'playing'], [b'the'], [b'piano'], [b'.']],
[[b'maria'], [b'likes'], [b'to'], [b'play'], [b'soccer'], [b'.']],
[[b'hi'], [b'tom'], [b'!']]
]
[
"Clara is playing the piano."
"Maria likes to play soccer.",
"Hi Tom!"
]