@thesanjeetc
Last active July 11, 2021 15:19
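import re

# Read the raw transcript.
with open('transcriptsOriginal.txt', 'r', encoding="utf8") as file:
    data = file.read()

# Strip bracketed and parenthesised notes, "PART ...]" markers, Jamie/Jaime
# asides, ellipses, question marks, and double dashes (raw string for the regex).
pattern = r'\(.*\)\n|\[.*]|\(.*\)|PART.*]|…|Jamie:.*(.|\?)|[\?]|--|\.\.\.\s|Jaime(\n|.)*?\].'
result = re.sub(pattern, ' ', data)

# Put each speaker label in upper case on its own line.
result = result.replace('Elon Musk: ', 'ELON MUSK:\n')
result = result.replace('Joe Rogan: ', 'JOE ROGAN:\n')
result = result.replace(' Joe Rogan', '\nJOE ROGAN:')
result = result.replace(' Elon Musk', '\nELON MUSK:')
# Kept from the original; the regex above already replaces '--', so this should match nothing.
result = result.replace('--', '-')

# Write the cleaned transcript.
with open('transcriptsFormatted.txt', 'w', encoding="utf8") as file:
    file.write(result)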
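
The cleanup above can be tried on an in-memory string before running it over the whole transcript file. The following is a minimal sketch, not part of the original gist: the function name `clean_transcript` and the sample text are hypothetical, and the pattern and the first four replacements are copied from the script unchanged.

```python
import re

# Same pattern as the script above, written as a raw string.
PATTERN = r'\(.*\)\n|\[.*]|\(.*\)|PART.*]|…|Jamie:.*(.|\?)|[\?]|--|\.\.\.\s|Jaime(\n|.)*?\].'


def clean_transcript(text):
    """Hypothetical helper: apply the script's regex and speaker-label rewrites to a string."""
    result = re.sub(PATTERN, ' ', text)
    result = result.replace('Elon Musk: ', 'ELON MUSK:\n')
    result = result.replace('Joe Rogan: ', 'JOE ROGAN:\n')
    result = result.replace(' Joe Rogan', '\nJOE ROGAN:')
    result = result.replace(' Elon Musk', '\nELON MUSK:')
    return result


# Made-up sample input, for illustration only.
sample = (
    "Joe Rogan: Welcome [laughter] back.\n"
    "Elon Musk: Thanks (smiles) for having me.\n"
)

cleaned = clean_transcript(sample)
print(cleaned)
```

Each match is replaced with a single space rather than removed, so runs of spaces can remain where a note was stripped. Collapsing them would be an extra step that the original script does not perform.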