1. Import the necessary libraries and add five sentences from the data that will be used for training data. Show the following:
   a. Imported libraries
   b. Five sentences added as training data
2. Perform an initial tokenization of the five sentences and print the word index. Then turn the sentences into sequences and print the sequences.
   a. Show the code used to tokenize the data and print the word index.
   b. Scroll to the end of the word index output. How many words are in the tokenized list?
   c. Show the code used to turn the sentences into sequences and print the sequences.
   d. Show a screenshot of a snippet of the output showing at least two sequences.
3. Add three new sentences from the article as test data. Tokenize the test data into sequences using the tokens created with the training data. Print the sequences for the test data.
   a. Show the code used to add the test data, tokenize the test data into sequences, and print the test data sequences.
   b. Show the output of the test sequences.
   c. Indicate where you notice missing words in the sequences due to out-of-vocabulary words in the test data.
4. Add an out-of-vocabulary (OOV) token and padding (either pre- or post-padding is acceptable) to the training data. Print the updated word index and padded sequences of the training data.
   a. Show the code used to add the OOV token and padding, and to print the updated word index and padded sequences.
   b. Show output that includes the updated word index with the OOV token and at least three words with their tokens.
   c. Show output that includes one of the updated sequences with padding.
5. Load the training split from the 'movie_rationales' TFDS and remove a pre-defined set of stopwords. Then print the word index to demonstrate the stopwords have been removed.
   a. Show the code used to load the data, remove the stopwords, and print the word index. Note: you can create a custom list of stopwords that should include a minimum of 20 stopwords.
   b. Show the last five words in the word index so you can see how many words have been tokenized.
NLP gist for tokenizing and parsing sentences
# 1. Import the necessary libraries and add five sentences from the data that will be used for training data.
# Use the `Tokenizer` class from `keras.preprocessing.text` for tokenizing the sentences.
import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import tensorflow_datasets as tfds
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Download the NLTK resources used below (no-ops if already present)
nltk.download('punkt')
nltk.download('stopwords')

# Load the document
with open(file='../data/article.csv', mode='r', encoding='utf-8') as f:
    document = f.read()

# Convert text to lowercase
document = document.lower()

# Split the document into sentences first (sentence boundaries need the
# punctuation), then keep the first five as the training data — any five
# sentences from the article would do
sentences = nltk.sent_tokenize(document)[:5]

# Remove punctuation from each sentence
translator = str.maketrans('', '', string.punctuation)
sentences = [s.translate(translator) for s in sentences]

# Remove stop words from each sentence
stop_words = set(stopwords.words('english'))
sentences = [
    ' '.join(w for w in word_tokenize(s) if w not in stop_words)
    for s in sentences
]
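# Deliverable 1b: print the five training sentences. A minimal sketch — the
# actual sentences depend on the contents of ../data/article.csv.
for i, sentence in enumerate(sentences, start=1):
    print(f"Sentence {i}: {sentence}")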
# 2. Perform an initial tokenization of the five sentences and print the word index. Then turn the sentences into sequences and print the sequences.
# For tokenization, we will use the `Tokenizer` class and its `fit_on_texts` method. The `word_index` attribute gives us the mapping of words to their integer tokens. The `texts_to_sequences` method converts the sentences into sequences of tokens.
# Initialize a tokenizer
tokenizer = Tokenizer()
# Fit the tokenizer on the five training sentences
tokenizer.fit_on_texts(sentences)
# Print the word index
print(f"Word index: {tokenizer.word_index}")
# Convert sentences to sequences
sequences = tokenizer.texts_to_sequences(sentences)
# Print the sequences
print(f"Sequences: {sequences}")
# 3. Add three new sentences from the article as test data. Tokenize the test data into sequences using the tokens created with the training data.
# We'll add three new sentences and tokenize them using the same tokenizer. Note that if there are new words in these sentences that were not in the training data, they will be ignored, because our tokenizer only recognizes words it was trained on.
# Three new sentences for testing
test_sentences = [
    "Artificial intelligence is transforming the world.",
    "I enjoy reading books on philosophy.",
    "Cats and dogs are popular pets."
]
# Convert test sentences to sequences
test_sequences = tokenizer.texts_to_sequences(test_sentences)
# Print the test sequences
print(f"Test sequences: {test_sequences}")
# 4. Add an out-of-vocabulary (OOV) token and padding (either pre- or post-padding is acceptable) to the training data.
# Create a new tokenizer with an out-of-vocabulary (OOV) token. This token will be used for words in the test data that the tokenizer has not seen before. We will also pad the sequences using `pad_sequences` from `keras.preprocessing.sequence`.
# Initialize a new tokenizer with an OOV token
tokenizer = Tokenizer(oov_token="<OOV>")
# Fit the tokenizer on the five training sentences
tokenizer.fit_on_texts(sentences)
# Print the updated word index (the OOV token is assigned index 1)
print(f"Updated word index: {tokenizer.word_index}")
# Convert sentences to sequences
sequences = tokenizer.texts_to_sequences(sentences)
# Pad the sequences (pre-padding by default)
padded_sequences = pad_sequences(sequences)
# Print the padded sequences
print(f"Padded sequences:\n{padded_sequences}")
# 5. Load the training split from the 'movie_rationales' TFDS and remove a pre-defined set of stopwords.
# Load the 'movie_rationales' dataset from `tensorflow_datasets`. After loading the data, remove the stopwords from each review, then print the word index. The NLTK English stopword list used here contains well over the required minimum of 20 stopwords.
# Load the training split of the 'movie_rationales' dataset
dataset, info = tfds.load('movie_rationales', with_info=True, split='train')
# Extract reviews from the dataset and remove stopwords
clean_reviews = []
for example in dataset:
    review = example['review'].numpy().decode('utf-8')
    tokens = review.split()
    clean_tokens = [token for token in tokens if token not in stop_words]
    clean_reviews.append(' '.join(clean_tokens))
# Initialize a new tokenizer
tokenizer = Tokenizer()
# Fit the tokenizer on the cleaned reviews
tokenizer.fit_on_texts(clean_reviews)
# Print the word index
print(f"Word index: {tokenizer.word_index}")
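# Deliverable 5b: show the last five entries in the word index to see how many
# words have been tokenized. This assumes the Keras word index is built in
# descending frequency order, so the final entries carry the highest token ids.
last_five = list(tokenizer.word_index.items())[-5:]
print(f"Last five words in the word index: {last_five}")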