@dunnousername
Created November 20, 2019 21:38
schema.org checker I made in like 5 minutes. Useful when scraping, to find pages that support schema.org. It lists the microdata itemtype values found on each URL.
#!/usr/bin/python3
# usage: schemachecker.py urls...
# requires: python 3, beautiful soup 4, requests library
from bs4 import BeautifulSoup
import requests
import sys


def checkUrl(url):
    # Send a browser-like User-Agent header instead of the requests default.
    r = requests.get(url, headers={'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:10.0) Gecko/20100101 Firefox/10.0'})
    soup = BeautifulSoup(r.text, 'html.parser')
    # Collect the distinct itemtype values (schema.org microdata) on the page.
    types = set()
    for item in soup.find_all(itemtype=True):
        types.add(item.get('itemtype'))
    for t in types:
        if t:  # skip empty itemtype attributes
            print('- contains a "{}" schema'.format(t))


for url in sys.argv[1:]:
    print(url)
    checkUrl(url)
"""
> python .\schemachecker.py https://google.com https://www.allrecipes.com/recipe/95294/chewy-caramel/
https://google.com
- contains a "http://schema.org/WebPage" schema
https://www.allrecipes.com/recipe/95294/chewy-caramel/
- contains a "http://schema.org/Recipe" schema
- contains a "http://schema.org/BreadcrumbList" schema
- contains a "http://schema.org/ListItem" schema
- contains a "http://schema.org/Rating" schema
- contains a "http://schema.org/Review" schema
- contains a "http://schema.org/AggregateRating" schema
- contains a "http://schema.org/NutritionInformation" schema
"""
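
The script above needs network access and two third-party packages, so here is a rough offline sketch of the same itemtype extraction for trying the idea on an HTML string. It swaps Beautiful Soup for the standard library's `html.parser`; this is not what the script itself uses. The names `ItemTypeCollector`, `extract_item_types`, and `sample` are made up for illustration, and splitting itemtype on whitespace is an added assumption (the attribute may list several URLs), not something the original script does.

```python
from html.parser import HTMLParser


class ItemTypeCollector(HTMLParser):
    """Collects every itemtype value seen on a start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.types = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # value is None when the attribute is written without "=...".
            if name == 'itemtype' and value:
                # Assumption: itemtype may hold several space-separated URLs.
                self.types.update(value.split())


def extract_item_types(html):
    """Return the sorted microdata types found in an HTML string."""
    collector = ItemTypeCollector()
    collector.feed(html)
    collector.close()
    return sorted(collector.types)


# Made-up sample markup, loosely modelled on a recipe page.
sample = '''
<div itemscope itemtype="http://schema.org/Recipe">
  <span itemprop="name">Chewy Caramel</span>
  <div itemprop="aggregateRating" itemscope
       itemtype="http://schema.org/AggregateRating"></div>
</div>
'''

for t in extract_item_types(sample):
    print('- contains a "{}" schema'.format(t))
```

Unlike the script, this returns the types sorted, so the output order is stable between runs.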
@dunnousername

Tested with Anaconda on Windows 10.
