@boxabirds
Created March 24, 2021 21:20
Simple test of trafilatura
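from pathlib import Path
import json

from sentence_splitter import SentenceSplitter
from trafilatura import extract

# Load the locally saved HTML test page.
html_path = Path("data") / "testfile-3.html"
html = html_path.read_text(encoding="utf-8")

# Extract the main content as a JSON string, keeping images and inline formatting.
result_json_text = extract(
    filecontent=html,
    output_format="json",
    include_images=True,
    include_formatting=True,
)
# extract() returns None when no main content could be extracted.
if result_json_text is None:
    raise SystemExit(f"No content extracted from {html_path}")
result = json.loads(result_json_text)

# Split the extracted text into English sentences.
splitter = SentenceSplitter(language="en")
sentences = splitter.split(result["text"])

print(json.dumps(result, indent=2))
print("===")
print(json.dumps(sentences, indent=2))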
@boxabirds
Author

testfile-3.html is a locally saved copy of this page: https://lisn-tests.netlify.app/rich-content.html
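
For anyone reproducing this, here is a minimal sketch of one way to save that page to the path the script reads. It uses only the standard library. The helper name `download_page` is hypothetical and not part of the gist, and the gist does not say how the file was originally downloaded.

```python
from pathlib import Path
from urllib.request import urlopen

# URL from the comment above; DEST matches the path the script reads.
PAGE_URL = "https://lisn-tests.netlify.app/rich-content.html"
DEST = Path("data") / "testfile-3.html"


def download_page(url: str, dest: Path) -> Path:
    """Fetch url and write the raw response bytes to dest (hypothetical helper)."""
    # Create the data/ directory if it does not exist yet.
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(url, timeout=30) as response:
        dest.write_bytes(response.read())
    return dest
```

Calling `download_page(PAGE_URL, DEST)` once before running the script should leave the file where `html_path` expects it. The bytes are written unchanged, so the script's `read_text()` call still needs the page encoding to match what it assumes.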
