@etherealxx
Created January 9, 2023 03:43
Cleans an MS Word document containing a long ChatGPT session by removing excess empty paragraphs

About

When you have a long session in ChatGPT, you may want to save it locally. I tried some of the Chrome extensions out there, but sadly they can't handle long sessions with hundreds of paragraphs, so I usually copy all the text into MS Word. The problem is that when you select the text to copy, your profile picture is also selected, and it turns into your email address in Word. The other problem is that, in the copied document, each question-and-answer pair is separated from the next by 4 lines.

This script uses python-docx to iterate through every paragraph of a Word document and remove all occurrences of the email address and the excess empty lines. There's also an option to remove every line that starts with >, which I use personally.
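
The cleanup rules can be sketched on a plain list of strings, without python-docx. This is only an illustration: `clean_lines` and its parameters are made-up names, and the sketch assumes that a removed `>` line should not count as the "previous" line when collapsing empty lines.

```python
def clean_lines(lines, email, drop_quoted=False):
    """Apply the three cleanup rules to a list of strings (hypothetical helper)."""
    cleaned = []
    prev = ''  # starts empty, so leading blank lines are dropped as well
    for line in lines:
        # Rule 1: strip every occurrence of the email address
        line = line.replace(email, '')
        # Rule 2 (optional): drop lines that start with '>'
        if drop_quoted and line.startswith('>'):
            continue
        # Rule 3: drop an empty line that follows another empty line
        if prev == '' and line == '':
            continue
        cleaned.append(line)
        prev = line
    return cleaned


sample = ["me@example.com", "Hello", "", "", "", "", "> note", "Answer"]
print(clean_lines(sample, "me@example.com", drop_quoted=True))
# -> ['Hello', '', 'Answer']
```

A run of four empty lines collapses to a single one, which is the effect the script aims for between each question-and-answer pair.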

Don't forget to install python-docx before using the script:

pip install python-docx

import docx

# Counter for the number of empty paragraphs removed
empty_count = 0
# Counter for the number of instances of the author email removed
author_count = 0
# Counter for the number of lines starting with '>' removed
greaterthan_count = 0
# Keep track of the text of the previous paragraph that was kept
prev_text = ''

# A dragged-and-dropped path may arrive wrapped in quotes, so strip them
file_path = input("input the docx file path (drag and drop the file here): ").strip().strip('"\'')
author_name = input("input the author email: ")
remove_greaterthan = input("Remove every line that starts with '>'? [y/n] ").strip().lower() == 'y'

# Build the output path: <name>-cleaned.<extension>
name_parts = file_path.rsplit('.', 1)
path_cleaned = name_parts[0] + '-cleaned'
file_save = path_cleaned + '.' + name_parts[1]

# Open the Word document
document = docx.Document(file_path)


def delete_paragraph(paragraph):
    # Detach the paragraph's underlying XML element from its parent
    p = paragraph._element
    p.getparent().remove(p)


# Iterate through the paragraphs in the document.
# document.paragraphs returns a list, so deleting while looping is safe.
for i, para in enumerate(document.paragraphs):
    # Count the number of instances of the author email in the paragraph text
    count = para.text.count(author_name) if author_name else 0
    if count > 0:
        # Replace all instances of the author email with an empty string.
        # Assigning para.text resets run-level formatting (bold, italic, ...),
        # so only do it for paragraphs that actually contain the email.
        para.text = para.text.replace(author_name, '')
        author_count += count
        print(f'Instance {author_count} of "{author_name}" removed at line {i+1}')

    if remove_greaterthan and para.text.startswith('>'):
        delete_paragraph(para)
        greaterthan_count += 1
        print(f'Removed a line that starts with ">" at line {i+1}')
        # A deleted line must not count as the previous paragraph
        continue

    # If this paragraph and the previous one are both empty, delete this one
    if prev_text == '' and para.text == '':
        delete_paragraph(para)
        empty_count += 1
        print(f'Removed empty paragraph at line {i+1}')
    prev_text = para.text

# Print a summary of what was removed
print(f'Removed {empty_count} empty paragraphs')
print(f'Removed {author_count} author names')
if remove_greaterthan:
    print(f'Removed {greaterthan_count} lines that start with ">"')

# Save the modified document next to the original
document.save(file_save)
print(f'File saved at {file_save}')
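
The output file name is built by splitting the path on its last dot, which raises an error when the path has no extension at all. A possible alternative, sketched here with the standard-library `pathlib`, also strips the quotes that some terminals put around a dragged-and-dropped path. `cleaned_path` is a hypothetical helper name, not part of the script.

```python
from pathlib import Path


def cleaned_path(raw_input):
    """Return the '-cleaned' output path for a user-supplied file path."""
    # Remove surrounding whitespace and any wrapping single or double quotes
    path = Path(raw_input.strip().strip('"\''))
    # stem is the file name without its last suffix; suffix may be empty
    return path.with_name(f"{path.stem}-cleaned{path.suffix}")


print(cleaned_path('"chat session.docx"').name)
# -> chat session-cleaned.docx
```

The result can be passed to `document.save()` as `str(cleaned_path(file_path))`.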