NLP gist for tokenizing and parsing sentences
# 1. Import the necessary libraries and add five sentences from the data that will be used for training data.
# Use the `Tokenizer` class from `keras.preprocessing.text` to tokenize the sentences.
import string
import spacy
import gensim
import nltk
from nltk import ngrams
from nltk.corpus import stopwords
from nltk.tokenize import PunktSentenceTokenizer, word_tokenize
import tensorflow_datasets as tfds
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from collections import Counter
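# Note: `word_tokenize` and the stopword list rely on NLTK's 'punkt' and 'stopwords'
# data packages; if they are not already present locally, they can be fetched once
# with the calls below (newer NLTK releases may additionally ask for 'punkt_tab').
nltk.download('punkt')
nltk.download('stopwords')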
# Load the document
with open(file='../data/article.csv', mode='r', encoding='utf-8') as f:
    document = f.read()
# Convert text to lowercase
document = document.lower()
# Split the document into sentences and keep the first five as training data
sent_tokenizer = PunktSentenceTokenizer()
train_sentences = sent_tokenizer.tokenize(document)[:5]
print(f"Training sentences: {train_sentences}")
# Remove punctuation from the full document
translator = str.maketrans('', '', string.punctuation)
document = document.translate(translator)
# Tokenize the full document into words
words = word_tokenize(document)
# Remove stop words (stop_words is reused in step 5)
stop_words = set(stopwords.words('english'))
words = [word for word in words if word not in stop_words]
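# Optional sanity check: the imported Counter can show which non-stopword tokens
# appear most often in the article (purely illustrative).
print(f"Most common tokens: {Counter(words).most_common(10)}")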
# 2. Perform an initial tokenization of the five sentences and print the word index. Then turn the sentences into sequences and print the sequences.
# For tokenization, we will use the `Tokenizer` class and its `fit_on_texts` method. The `word_index` attribute will give us the mapping of words to their integer tokens. The `texts_to_sequences` method will convert the sentences into sequences of tokens.
# Initialize a tokenizer
tokenizer = Tokenizer()
# Fit the tokenizer on the training sentences
tokenizer.fit_on_texts(train_sentences)
# Print the word index
print(f"Word index: {tokenizer.word_index}")
# Convert the training sentences to sequences
sequences = tokenizer.texts_to_sequences(train_sentences)
# Print the sequences
print(f"Sequences: {sequences}")
# 3. Add three new sentences from the article as test data. Tokenize the test data into sequences using the tokens created with the training data.
# We'll add three new sentences and tokenize them using the same tokenizer. Note that if there are new words in these sentences that were not in the training data, they will be ignored because our tokenizer only recognizes words it was trained on.
# Three new sentences for testing
test_sentences = [
    "Artificial intelligence is transforming the world.",
    "I enjoy reading books on philosophy.",
    "Cats and dogs are popular pets."
]
# Convert test sentences to sequences
test_sequences = tokenizer.texts_to_sequences(test_sentences)
# Print the test sequences
print(f"Test sequences: {test_sequences}")
# 4. Add an out-of-vocabulary (OOV) token and padding (either pre- or post-padding is acceptable) to the training data. Print the updated word index and padded sequences of the training data.
# Create a new tokenizer with an out-of-vocabulary (OOV) token. This token will be used for words in the test data that the tokenizer has not seen before. We will also pad the sequences using `pad_sequences` from `keras.preprocessing.sequence`.
# Initialize a new tokenizer with an OOV token
tokenizer = Tokenizer(oov_token="<OOV>")
# Fit the tokenizer on the training sentences
tokenizer.fit_on_texts(train_sentences)
# Print the updated word index (the OOV token is assigned index 1)
print(f"Word index: {tokenizer.word_index}")
# Convert the training sentences to sequences
sequences = tokenizer.texts_to_sequences(train_sentences)
# Pad the sequences (pre-padding by default)
padded_sequences = pad_sequences(sequences)
# Print the padded sequences
print(f"Padded sequences:\n{padded_sequences}")
# 5. Load the training split from the 'movie_rationales' TFDS and remove a pre-defined set of stopwords.
# Import and load the 'movie_rationales' dataset from `tensorflow_datasets`. After loading the data, remove the stopwords from each review, tokenize the cleaned texts, and then print the word index.
# Load the 'movie_rationales' dataset
dataset, info = tfds.load('movie_rationales', with_info=True, split='train')
# Extract reviews from the dataset and remove stopwords
clean_reviews = []
for example in dataset:
    review = example['review'].numpy().decode('utf-8')
    review_words = review.split()
    clean_words = [word for word in review_words if word not in stop_words]
    clean_reviews.append(' '.join(clean_words))
# Initialize a new tokenizer
tokenizer = Tokenizer()
# Fit the tokenizer on the cleaned reviews
tokenizer.fit_on_texts(clean_reviews)
# Print the word index
print(f"Word index: {tokenizer.word_index}")
  1. Import the necessary libraries and add five sentences from the data that will be used for training data. Show the following:
     a. Imported libraries
     b. Five sentences added as training data

  2. Perform an initial tokenization of the five sentences and print the word index. Then turn the sentences into sequences and print the sequences.
     a. Show code used to tokenize the data and print the word index.
     c. Scroll to the end of the word index output. How many words are in the tokenized list?
     d. Show code used to turn the sentences into sequences and print the sequences.
     e. Show a screenshot of a snippet of the output showing at least two sequences.

  3. Add three new sentences from the article as test data. Tokenize the test data into sequences using the tokens created with the training data. Print the sequences for the test data.
     a. Show code used to add the test data, tokenize the test data into sequences, and print the test data sequences.
     b. Show output of the test sequences.
     c. Indicate where you notice missing words in the sequences due to out-of-vocabulary words in the test data.

  4. Add an out-of-vocabulary (OOV) token and padding (either pre- or post-padding is acceptable) to the training data. Print the updated word index and padded sequences of the training data.
     a. Show code used to add the OOV token and padding, and to print the updated word index and padded sequences.
     b. Show output that includes the updated word index showing the OOV token and at least three words with their tokens.
     c. Show output that includes one of the updated sequences with padding.

  5. Load the training split from the ‘movie_rationales’ TFDS and remove a pre-defined set of stopwords. Then print the word index to demonstrate the stopwords have been removed. For the final question, you will be using a TFDS called ‘movie_rationales’.
     a. Show the code used to load the data, remove the stopwords, and print the word index. Note: You can create a custom list of stopwords that should include a minimum of 20 stopwords.
     b. Show the last five words in the word index so that you can see how many words have been tokenized.
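For the custom stopword list mentioned in item 5a, a minimal sketch could be a plain Python set of 20 common English words (the exact words are your choice, and `review` below is just a hypothetical string holding one review's text):

custom_stop_words = {
    "a", "an", "the", "and", "or", "but", "if", "then", "of", "in",
    "on", "at", "to", "for", "with", "is", "are", "was", "were", "it"
}
# Filter any list of word tokens against the custom set, e.g.:
clean_words = [w for w in review.split() if w not in custom_stop_words]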
