@cvorland
Last active July 19, 2023 16:01
Some ugly code to scrape blog posts and count how many there are, their total word count, and the number of researchblogging.org references. The HTML attributes to match will differ depending on the theme and platform used (this was written for a WordPress blog).
# Python 2 script; requires mechanize and BeautifulSoup 3
# (the old "from BeautifulSoup import BeautifulSoup" API).
from BeautifulSoup import BeautifulSoup
import mechanize
import time
import re

pagenum = 1
url = "http://www.bloghomepage.com"
browser = mechanize.Browser()
page = browser.open(url)
postcount = 0
totalwordcount = 0
totalrefcount = 0
stop = False
while not stop:
    soup = BeautifulSoup(page)
    # "Older posts" pagination link; the class and link text probably need
    # to be changed depending on the wordpress theme used
    link = soup.find("span", {"class": "next"})
    if "Older" in str(link):
        pagenum += 1
        print pagenum
    else:
        stop = True
    # each post title on an index page is an <h1 class="title"><a href=...>
    for title in soup.findAll('h1', {'class': 'title'}):
        links = title.findAll('a')
        if "href" in str(links):
            # pull the post URL out of the anchor tag
            posturl = re.search(r'(?P<url>https?://[^>"\s]+)', str(links)).group("url")
            print posturl
            site = browser.open(posturl).read()
            postsoup = BeautifulSoup(site)
            content = postsoup.find("div", {"class": "content entry-content"})
            # dump the post body to a scratch file, then strip the markup
            fp = open('C:/filename.html', 'w')
            fp.write(str(content))
            fp.close()
            f = open('C:/filename.html', 'r')
            readit = f.read()
            f.close()
            onlytext = ''.join(BeautifulSoup(readit).findAll(text=True))
            fp2 = open('C:/filename.html', 'w')
            fp2.write(onlytext.encode('ascii', 'ignore'))
            fp2.close()
            # word count = number of \w+ tokens in the stripped text
            words = re.findall(r'\w+', open('C:/filename.html').read().lower())
            wordcount = len(words)
            print wordcount
            totalwordcount += wordcount
            print totalwordcount
            # researchblogging.org citations show up as COinS <span class="Z3988"> elements
            refcount = len(postsoup.findAll('span', {'class': 'Z3988'}))
            print refcount
            totalrefcount += refcount
            print totalrefcount
            postcount += 1
            print postcount
            #time.sleep(1)  # uncomment if you don't want to hit the server too fast
        else:
            print "not a link"
    url = "http://www.bloghomepage.com/page/" + str(pagenum)
    page = browser.open(url)
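
For anyone running this under Python 3 (e.g. in Google Colab, as in the comment below), the BeautifulSoup 3 import above won't work. Here is a minimal sketch of the same per-post counting using requests and BeautifulSoup 4 instead; the URL and the CSS classes are placeholders and will differ by theme:

# Python 3 sketch of the per-post counting, using requests and
# BeautifulSoup 4 instead of mechanize and BeautifulSoup 3.
# The URL and the CSS classes are placeholders and will differ by theme.
import re

import requests
from bs4 import BeautifulSoup

post_url = "http://www.bloghomepage.com/some-post/"  # placeholder URL
html = requests.get(post_url).text
soup = BeautifulSoup(html, "html.parser")

# post body; the class name is theme-dependent
content = soup.find("div", {"class": "entry-content"})
text = content.get_text() if content else ""

word_count = len(re.findall(r"\w+", text.lower()))
# researchblogging.org citations are embedded as COinS <span class="Z3988"> elements
ref_count = len(soup.find_all("span", {"class": "Z3988"}))

print(word_count, ref_count)

The pagination part translates the same way: follow the theme's "Older posts" link and loop over the /page/N index pages, as the script above does.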
@trangnhungpham

Hi, thank you for sharing the code. I have tried it (with some edits) on Google Colab and it ran well, but I couldn't find the results. I guess it might have something to do with the C:/filename.html location, but I am not sure how to fix this. Can you kindly help me with this?
FYI, the link to the edited code: https://colab.research.google.com/drive/1AZQlnWbW5INNWP6l4i_eWSp3L5pXSuh7?usp=sharing

@cvorland (Author) commented Jul 19, 2023

@trangnhungpham are you using Google Colab in a hosted runtime? If so, you'll need to save the file to Google Drive, something like this:

from google.colab import drive
drive.mount('/content/drive')
fp2 = open('/content/drive/My Drive/filename.html', 'w')
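
and replace every 'C:/filename.html' in the script with that Drive path. A quick sketch (the path is just an example):

scratch = '/content/drive/My Drive/filename.html'  # example Drive path
fp = open(scratch, 'w')  # and likewise for the other open() calls

Anything written under /content/drive/My Drive/ should then show up in your Google Drive after the run.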
