@petri
Created August 31, 2022 20:12
get product metadata from web page in Python
import extruct
import requests
from w3lib.html import get_base_url

# Example product page to extract metadata from.
url = "https://www.kalevala.fi/collections/korvakorut/products/paratiisi-tappikorvakorut-hopea"
def extract_metadata(url, syntaxes=['json-ld']):
    """Extract metadata present in the page and return a dictionary of metadata lists.

    Args:
        url (string): URL of page from which to extract metadata.
        syntaxes (list): Metadata syntaxes to extract (json-ld only by default).

    Returns:
        metadata (dict): Dictionary with one list per requested syntax.
            Each list contains one dictionary per metadata item found.
    """
    # Fail early on HTTP errors and avoid hanging on an unresponsive server.
    r = requests.get(url, timeout=10)
    r.raise_for_status()
    # Resolve relative URLs in the page against its base URL.
    base_url = get_base_url(r.text, r.url)
    metadata = extruct.extract(r.text,
                               base_url=base_url,
                               uniform=True,
                               syntaxes=syntaxes)
    return metadata
d = extract_metadata(url)["json-ld"]
# Use .get() so items without an "@type" key are skipped instead of raising KeyError.
products = [p for p in d if p.get("@type") == "Product"]
print(products)
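The filter above only matches top-level items whose "@type" is exactly the string "Product". Some pages instead wrap their JSON-LD nodes in an "@graph" array or give "@type" as a list. The following is a minimal sketch of a more tolerant filter under those assumptions; `find_products` and `sample` are hypothetical names used for illustration and are not part of the original gist or of extruct.

```python
def find_products(items):
    """Return JSON-LD nodes typed "Product", including nodes nested under "@graph".

    Hypothetical helper: assumes `items` is a list like the "json-ld" list
    returned by extract_metadata().
    """
    found = []
    for item in items:
        if not isinstance(item, dict):
            continue
        # "@type" may be missing, a single string, or a list of strings.
        types = item.get("@type", [])
        if isinstance(types, str):
            types = [types]
        if "Product" in types:
            found.append(item)
        # Recurse into "@graph", which may hold one node or a list of nodes.
        graph = item.get("@graph", [])
        if isinstance(graph, dict):
            graph = [graph]
        found.extend(find_products(graph))
    return found


# Made-up sample data, so the sketch runs without a network request.
sample = [
    {"@type": "Product", "name": "Earrings"},
    {"@graph": [
        {"@type": ["Product", "Thing"], "name": "Ring"},
        {"@type": "Organization", "name": "Shop"},
    ]},
    {"name": "no type"},
]
print([p["name"] for p in find_products(sample)])  # ['Earrings', 'Ring']
```

With the gist's own data this would be called as find_products(d) in place of the list comprehension.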