# Scraping words from a URL

Gist by @DavidHoenisch, created June 28, 2023.
The following code scrapes content from a website and extracts just the words. This is useful for feeding web content into other processes.

This can be accomplished in three steps:

  1. Get the raw HTML content using the `requests` library.
  2. Feed the `.text` result of step one into BeautifulSoup and extract the text with `.get_text()`. This strips all the HTML from the content and returns an unstructured string.
  3. The string that is returned needs some heavy sanitization:
     1. Strip the blank lines with a Python filter:

        ```python
        lines = filter(lambda x: x.strip(), text.splitlines())
        ```

     2. Remove all leading and trailing whitespace:

        ```python
        words = map(lambda x: x.strip(), lines)
        ```

     3. Remove any special characters that could mess up future models:

        ```python
        words = map(lambda x: x.replace(".", "")
                    .replace(",", "")
                    .replace("'", "")
                    .replace(":", "")
                    .replace("?", "")
                    .replace('“', "")
                    .replace('”', "")
                    .replace("!", "")
                    .replace(";", "")
                    .replace("——", "")
                    .replace("_", "")
                    .replace("(", "")
                    .replace(")", "")
                    .replace("|", "")
                    .replace("<", ""), words)
        ```

     4. Split the lines into word tokens:

        ```python
        words = map(lambda x: x.split(), words)
        ```

     5. Lowercase all the words to ensure that mixed case does not mess up models down the road:

        ```python
        words = map(lambda x: [y.lower() for y in x], words)
        ```

     6. The above returns a list of lists, which needs to be flattened:

        ```python
        words = [item for sublist in words for item in sublist]
        ```
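Chained together on a small in-memory sample, the sanitization steps above behave like this (the sample text is made up for illustration, and the replace chain is abbreviated to just the characters the sample contains):

```python
# Sample multi-line text standing in for BeautifulSoup's get_text() output.
text = "Hello, World!\n\n  The quick (brown) fox.  \nIt jumped; it ran?\n"

# 1. Strip the blank lines.
lines = filter(lambda x: x.strip(), text.splitlines())

# 2. Remove leading and trailing whitespace.
words = map(lambda x: x.strip(), lines)

# 3. Remove special characters (abbreviated chain for the demo).
words = map(lambda x: x.replace(".", "").replace(",", "").replace("!", "")
            .replace(";", "").replace("?", "").replace("(", "").replace(")", ""),
            words)

# 4. Split into tokens, 5. lowercase, 6. flatten.
words = map(lambda x: x.split(), words)
words = map(lambda x: [y.lower() for y in x], words)
tokens = [item for sublist in words for item in sublist]

print(tokens)
```

Because `filter` and `map` are lazy, nothing is actually computed until the final list comprehension consumes the chain.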
Putting it all together:

```python
import requests
from bs4 import BeautifulSoup


def tokenize_word_from_url(url: str) -> list:
    """
    Get the words from the URL.

    :param url: URL to scrape
    :return: list of words
    """
    try:
        content = requests.get(url).text
        soup = BeautifulSoup(content, "html.parser")
        text = soup.get_text()

        # remove all the blank lines
        lines = filter(lambda x: x.strip(), text.splitlines())

        # remove leading and trailing spaces
        words = map(lambda x: x.strip(), lines)

        # remove all special characters
        words = map(lambda x: x.replace(".", "")
                    .replace(",", "")
                    .replace("'", "")
                    .replace(":", "")
                    .replace("?", "")
                    .replace('“', "")
                    .replace('”', "")
                    .replace("!", "")
                    .replace(";", "")
                    .replace("——", "")
                    .replace("_", "")
                    .replace("(", "")
                    .replace(")", "")
                    .replace("|", "")
                    .replace("<", ""), words)

        # split each line into word tokens
        words = map(lambda x: x.split(), words)

        # lower case the words
        words = map(lambda x: [y.lower() for y in x], words)

        # flatten the list of lists
        return [item for sublist in words for item in sublist]

    except Exception as e:
        print(e)
        return []
```
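Once the word list is extracted, it can feed downstream processes as the introduction suggests. As one hypothetical example (not part of the gist), a frequency count with the standard library's `collections.Counter`:

```python
from collections import Counter

# Tokens as tokenize_word_from_url would produce them, hard-coded here
# so the example runs without a network call.
tokens = ["the", "quick", "brown", "fox", "the", "lazy", "dog", "the"]

counts = Counter(tokens)
print(counts.most_common(1))  # the most frequent token and its count
```

Because the pipeline lowercases everything, "The" and "the" are already counted as the same word.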