We’ve begun integrating Browserless with the popular LangChain AI library, starting with Browserless’ REST APIs. As of today, our content API is supported as a LangChain document loader. Using Browserless to get the contents of a webpage for ingestion into LangChain’s AI modules starts with a single import:
from langchain.document_loaders import BrowserlessLoader
The previous canonical way to get the contents of webpages in LangChain was the WebBaseLoader module. This module uses the requests library to make HTTP requests to the target URL. This is a perfectly valid way to get the contents of a webpage, but it has some drawbacks:
- It doesn’t execute JavaScript, so it can’t get the contents of a webpage that is dynamically generated by JavaScript.
- It's prone to encoding issues if Python is expecting a different encoding than the webpage is using. LangChain users have reported seeing non-ASCII characters in their text when using the WebBaseLoader, which is a symptom of this issue.
- It's extremely vulnerable to anti-bot measures, since the most basic anti-bot tests can determine that the request is coming from an automated script and not a real browser.
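The encoding problem in the second bullet can be reproduced without any network access. The snippet below is an illustration only, not the WebBaseLoader code path: it assumes a page served as UTF-8 whose bytes are decoded as ISO-8859-1 (Latin-1), which is the fallback the requests library applies to text responses that declare no charset.

```python
# Bytes as a server might send them for a UTF-8 encoded page (sample text is made up).
raw = "café naïve".encode("utf-8")

# Decoding with the wrong codec does not raise; it silently produces mojibake.
wrong = raw.decode("latin-1")
right = raw.decode("utf-8")

# The damage is reversible only if you know which wrong codec was applied.
repaired = wrong.encode("latin-1").decode("utf-8")

print(repr(wrong))
print(repr(right))
```

Each accented character turns into two unrelated characters, which matches the stray non-ASCII text described above.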
Using the new BrowserlessLoader solves all of these problems. It executes JavaScript, so it can get the contents of a webpage that is dynamically generated. It uses the Chrome browser, so it can handle any encoding that Chrome can handle. And it uses a real browser, so it can pass anti-bot tests.
Getting the contents of a webpage using the BrowserlessLoader can be accomplished in just a few lines of code:
from langchain.document_loaders import BrowserlessLoader

# Replace the placeholder with the API token from your Browserless account.
loader = BrowserlessLoader(
    api_token="YOUR_BROWSERLESS_API_TOKEN",
    urls=[
        "https://example.com/url0",
        "https://example.com/url1",
        "https://example.com/url2",
    ],
)

# load() returns a list of Document objects.
documents = loader.load()
print(documents[0].page_content)
Simply sign up for a Browserless account, get your API token, and pass it to the BrowserlessLoader constructor. Pass a list of URLs to the constructor, call the load() method, and you’ll get back a list of Document objects, each of which has a page_content attribute that contains the text of the webpage.
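Once you have the list of Document objects, a common next step is to key the text by the page it came from. The helper below is a minimal sketch, not part of LangChain: texts_by_source and StubDocument are hypothetical names, the stub exists only so the snippet runs without a Browserless token, and the code assumes each document records its URL under metadata["source"].

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass
class StubDocument:
    """Hypothetical stand-in exposing only the two attributes used below."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def texts_by_source(documents: Iterable) -> Dict[str, str]:
    """Map each document's source URL to its whitespace-stripped text.

    Assumption: the URL lives under metadata["source"]. Documents without
    that key are keyed by their position in the list instead.
    """
    result: Dict[str, str] = {}
    for position, doc in enumerate(documents):
        key = doc.metadata.get("source", f"document-{position}")
        result[key] = doc.page_content.strip()
    return result


docs = [
    StubDocument("  Hello from page 0\n", {"source": "https://example.com/url0"}),
    StubDocument("Hello from page 1"),
]
pages = texts_by_source(docs)
print(pages)
```

In real use you would pass the list returned by load() in place of docs.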
Extracting the contents of webpages can be a useful step in many different AI workflows. For example, you could use the BrowserlessLoader to get the contents of a webpage, and then use a long-context LLM like GPT-4 or Claude to extract particular fields from the text, even if they appear in different places across multiple webpages. You could get the contents of a blog post and then summarize it using LangChain's LLM wrappers. You could keep tabs on an online forum by getting the contents of the forum's pages and then using a classifier to identify posts that are relevant to you. LangChain has a thriving open-source community, so check out the LangChain GitHub for more ideas.
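To make the forum-monitoring idea concrete, here is a rough sketch that uses plain keyword matching as a stand-in for a real classifier. The function name find_relevant, the sample page texts, and the keyword list are all made up for illustration; in practice the (url, text) pairs would come from the loaded documents and the matching step would be replaced by a model.

```python
from typing import Iterable, List, Tuple


def find_relevant(pages: Iterable[Tuple[str, str]], keywords: Iterable[str]) -> List[str]:
    """Return the URLs whose text mentions at least one keyword (case-insensitive)."""
    lowered = [keyword.lower() for keyword in keywords]
    hits: List[str] = []
    for url, text in pages:
        haystack = text.lower()
        if any(keyword in haystack for keyword in lowered):
            hits.append(url)
    return hits


# Made-up forum pages as (url, text) pairs.
forum_pages = [
    ("https://example.com/url0", "Anyone tried headless Chrome with LangChain?"),
    ("https://example.com/url1", "Weekly off-topic thread"),
]
relevant = find_relevant(forum_pages, ["langchain", "browserless"])
print(relevant)
```

Keeping the relevance check behind a single function makes it straightforward to swap the keyword test for an LLM call or a trained classifier later.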
In the short term: the LangChain team is currently working on modifying their RecursiveWebLoader wrapper class to support the BrowserlessLoader as a document loader. This will allow you to get the contents of a webpage and all of its child pages, recursively, using the BrowserlessLoader, which allows for higher quality guarantees on the contents of the pages and more robust handling of anti-bot measures.
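For readers unfamiliar with recursive loading, the sketch below shows the general idea of fetching a page and its child pages up to a fixed depth. It is not LangChain's implementation: crawl and LinkCollector are hypothetical names, fetch is any callable you supply that returns a page's HTML for a URL, and "child page" is assumed here to mean any link whose resolved URL starts with the root URL.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, List
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(root: str, fetch: Callable[[str], str], max_depth: int = 2) -> Dict[str, str]:
    """Fetch root and its child pages depth-first; the root is depth 0."""
    seen: Dict[str, str] = {}

    def visit(url: str, depth: int) -> None:
        # Stop at the depth limit and never fetch the same URL twice.
        if depth > max_depth or url in seen:
            return
        html = fetch(url)
        seen[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            child = urljoin(url, href)  # resolve relative links
            if child.startswith(root):  # stay under the root URL
                visit(child, depth + 1)

    visit(root, 0)
    return seen


# A made-up three-page site served from a dict instead of the network.
site = {
    "https://example.com/docs/": '<a href="a">A</a> <a href="https://other.example/">X</a>',
    "https://example.com/docs/a": '<a href="/docs/">home</a> <a href="b">B</a>',
    "https://example.com/docs/b": "leaf",
}
crawled = crawl("https://example.com/docs/", site.__getitem__)
print(sorted(crawled))
```

Because the fetch step is a parameter, the same traversal works whether pages are retrieved with plain HTTP requests or through a real browser.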
In the long term: we're looking into more seamless integrations between Browserless and LangChain, including controlling a stateful browser session from within LangChain. This opens the possibility of using LangChain to automate web tasks that require a browser, like filling out forms or interacting with a website's UI. Stay tuned for more updates!