@Kouhei-Takagi
Created February 19, 2023 12:31
Convert an XML backup file from Blogger into a word cloud
import re
import xml.etree.ElementTree as ET

import matplotlib.pyplot as plt
from wordcloud import WordCloud
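# Parse the Blogger XML backup and flatten it to plain text
tree = ET.parse("./blog-01-31-2023 copy.xml")
root = tree.getroot()
notags = ET.tostring(root, method="text")  # returns bytes
notags = notags.decode("utf8")
# Post bodies are stored as escaped HTML, so strip leftover tags and entities
notags = re.sub(r"<.+?>", "", notags)
notags = re.sub(r"&.+?;", "", notags)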
# Common words and URL fragments to leave out of the cloud
stop_words = [
    "the", "i", "com", "http", "https", "www", "blogger", "blog", "profile",
    "is", "to", "and", "for", "in", "of", "it", "this", "that", "be", "will",
    "are", "post", "comtag",
]
# font_path points at a macOS system font; change it on other platforms
wordcloud = WordCloud(
    font_path="/Library/Fonts/Georgia.ttf",
    background_color="black",
    collocations=False,
    random_state=42,
    max_words=50,
    stopwords=stop_words,
    colormap="gnuplot2",
    width=1000,
    height=400,
).generate(notags)
plt.figure(figsize=(15,12))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
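
The two `re.sub` calls above delete tags and entities outright, and the lazy `&.+?;` pattern can also swallow ordinary text that sits between a stray `&` and the next `;`. A possible alternative, sketched here with only the standard library, decodes entities with `html.unescape` instead of deleting them and counts the remaining words so the stop-word list can be checked before rendering. The helper names `clean_blogger_text` and `top_words` are hypothetical and not part of the original script, and the sketch assumes post bodies are stored as escaped HTML inside the XML.

```python
import html
import re
import xml.etree.ElementTree as ET
from collections import Counter


def clean_blogger_text(xml_source: str) -> str:
    """Return the text of an XML export with HTML tags removed and entities decoded."""
    root = ET.fromstring(xml_source)
    # encoding="unicode" makes tostring return str, so no decode step is needed
    text = ET.tostring(root, encoding="unicode", method="text")
    # Escaped HTML in post bodies shows up as literal tags; replace them with spaces
    text = re.sub(r"<[^>]+>", " ", text)
    # Decode entities such as &amp; and &nbsp;, then turn non-breaking spaces into spaces
    text = html.unescape(text).replace("\xa0", " ")
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()


def top_words(text: str, stop_words, n: int = 50):
    """Return the n most frequent lowercase words that are not in stop_words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stop_words).most_common(n)
```

The string returned by `clean_blogger_text` could be passed to `WordCloud(...).generate()` in place of `notags`, and `top_words(text, set(stop_words))` gives a quick preview of which words would dominate the image.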