@jagilley
Created July 12, 2023 19:46

Browserless <> LangChain

We’ve begun integrating Browserless with the popular LangChain AI library, starting with Browserless’ REST APIs. As of today, our content API is supported as a LangChain document loader. Getting the contents of a webpage for ingestion into LangChain’s AI modules starts with a single import:

from langchain.document_loaders import BrowserlessLoader

What this means

The previous canonical way to get the contents of webpages in LangChain was the WebBaseLoader module. This module uses the requests library to make HTTP requests to the target URL. This is a perfectly valid way to get the contents of a webpage, but it has some drawbacks:

  • It doesn’t execute JavaScript, so it can’t get the contents of a webpage that is dynamically generated by JavaScript.
  • It's prone to encoding issues when Python assumes a different encoding than the one the webpage uses. LangChain users have reported unexpected non-ASCII characters in their text when using the WebBaseLoader, which is a symptom of this issue.
  • It's easily blocked by anti-bot measures, since even the most basic anti-bot tests can tell that the request comes from an automated script and not a real browser.

The new BrowserlessLoader addresses all three problems. It executes JavaScript, so it can get the contents of a webpage that is dynamically generated. It renders pages in the Chrome browser, so it can handle any encoding that Chrome can handle. And because the requests come from a real browser, they are much more likely to pass anti-bot tests.
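To make the first two drawbacks concrete, here is a small offline sketch using only the Python standard library. It is not how WebBaseLoader or BrowserlessLoader are implemented; the `VisibleTextParser` class, the `visible_text` helper and the sample page are hypothetical and exist only for illustration. It shows that a scraper reading static HTML finds no text on a page whose content is injected by JavaScript, and that decoding UTF-8 bytes with the wrong codec produces garbled characters.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text outside <script>/<style>, as a static HTML scraper would see it."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a script or style element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page whose only content is written into the DOM at runtime.
JS_RENDERED_PAGE = """
<html><body>
  <div id="app"></div>
  <script>document.getElementById("app").textContent = "Hello from JS";</script>
</body></html>
"""

# Drawback 1: without a JavaScript engine, the page has no visible text.
static_text = visible_text(JS_RENDERED_PAGE)
print(repr(static_text))  # ''

# Drawback 2: UTF-8 bytes decoded as Latin-1 turn "é" into two wrong characters.
mojibake = "café".encode("utf-8").decode("latin-1")
print(mojibake)  # cafÃ©
```

A browser-based loader avoids both cases because the text is read after the page's scripts have run and after the browser has resolved the page's encoding.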

How to use it

Getting the contents of a webpage using the BrowserlessLoader can be accomplished in just a few lines of code:

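import os

from langchain.document_loaders import BrowserlessLoader

# Read the token from the environment instead of hard-coding it.
# BROWSERLESS_API_TOKEN is just an example variable name.
loader = BrowserlessLoader(
    api_token=os.environ["BROWSERLESS_API_TOKEN"],
    urls=[
        "https://example.com/url0",
        "https://example.com/url1",
        "https://example.com/url2",
    ],
)

# load() returns a list of Document objects, one per URL.
documents = loader.load()

print(documents[0].page_content)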

Simply sign up for a Browserless account, get your API token, and pass it to the BrowserlessLoader constructor. Pass a list of URLs to the constructor, call the load() method, and you’ll get back a list of Document objects, each of which has a page_content attribute that contains the text of the webpage.
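Once load() has returned, a quick sanity check on the results can catch empty or blocked pages early. The sketch below is a hypothetical helper, not part of LangChain. It relies only on the page_content attribute described above and assumes, without requiring it, that each Document may carry a metadata dictionary with a "source" key. Stand-in objects are used so the example runs without a Browserless token.

```python
from types import SimpleNamespace


def describe_documents(documents):
    """Return one (source, character_count, preview) tuple per document."""
    rows = []
    for doc in documents:
        # `metadata` and its "source" key are assumptions; fall back if absent.
        metadata = getattr(doc, "metadata", None) or {}
        source = metadata.get("source", "unknown")
        text = doc.page_content
        # Collapse whitespace so the preview fits on one line.
        preview = " ".join(text.split())[:60]
        rows.append((source, len(text), preview))
    return rows


# Stand-ins for the objects returned by loader.load().
sample_documents = [
    SimpleNamespace(
        page_content="Example Domain\n\nThis domain is for use in examples.",
        metadata={"source": "https://example.com/url0"},
    ),
    SimpleNamespace(page_content="", metadata={}),
]

for source, size, preview in describe_documents(sample_documents):
    print(f"{source}: {size} chars | {preview}")
```

A document with zero characters, like the second stand-in, is usually a sign that the page failed to load or returned no text and is worth retrying or inspecting by hand.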

Potential use cases

Extracting the contents of webpages can be a useful step in many different AI workflows. For example, you could use the BrowserlessLoader to get the contents of a webpage, and then use a long-context LLM like GPT-4 or Claude to extract particular fields from the text, even if they appear in different places across multiple webpages. You could get the contents of a blog post and then summarize it using LangChain's LLM wrappers. You could keep tabs on an online forum by getting the contents of the forum's pages and then using a classifier to identify posts that are relevant to you. LangChain also has a thriving open-source community, so check out the LangChain GitHub for more ideas.
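As a rough sketch of the forum-monitoring idea, a plain keyword filter can stand in for the classifier. The function name, the keyword list and the sample posts below are all hypothetical, and a real workflow would likely swap in an LLM or a trained model. The input is just a list of strings, such as the page_content of each loaded Document.

```python
def relevant_posts(texts, keywords, min_hits=1):
    """Return the texts that mention at least `min_hits` distinct keywords.

    Matching is a case-insensitive substring test: a deliberately simple
    stand-in for a real relevance classifier.
    """
    lowered_keywords = [keyword.lower() for keyword in keywords]
    selected = []
    for text in texts:
        lowered = text.lower()
        hits = sum(1 for keyword in lowered_keywords if keyword in lowered)
        if hits >= min_hits:
            selected.append(text)
    return selected


# Hypothetical forum posts, e.g. taken from documents[i].page_content.
posts = [
    "Anyone running headless Chrome behind a proxy?",
    "Weekly off-topic thread",
    "LangChain document loaders keep timing out",
]

matches = relevant_posts(posts, keywords=["headless chrome", "langchain"])
for post in matches:
    print(post)
```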

What’s next

In the short term: the LangChain team is currently working on modifying their RecursiveWebLoader wrapper class to support the BrowserlessLoader as a document loader. This will allow you to get the contents of a webpage and all of its child pages, recursively, using the BrowserlessLoader, with stronger guarantees about the quality of the page contents and more robust handling of anti-bot measures.
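To illustrate what fetching a page and its child pages recursively involves, here is a minimal breadth-first sketch using only the standard library. It is not LangChain's implementation. The `crawl` function, its `fetch_html` parameter and the fake site are assumptions made for this example. The fetch step is a pluggable callable, so in principle it could be backed by a browser-rendered fetch instead of a plain HTTP request.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(root_url, fetch_html, max_depth=2):
    """Breadth-first crawl of root_url and the pages nested under it.

    fetch_html is any callable that maps a URL to an HTML string.
    Returns a dict of {url: html}.
    """
    pages = {}
    seen = {root_url}
    queue = deque([(root_url, 0)])
    while queue:
        url, depth = queue.popleft()
        html = fetch_html(url)
        pages[url] = html
        if depth >= max_depth:
            continue  # Do not follow links beyond the depth limit.
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            child, _fragment = urldefrag(urljoin(url, href))
            # Only follow child pages, i.e. URLs that start with the root URL.
            if child.startswith(root_url) and child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return pages


# Fake three-page site so the sketch runs without network access.
FAKE_SITE = {
    "https://example.com/docs/": '<a href="a">A</a> <a href="https://other.example/x">ext</a>',
    "https://example.com/docs/a": '<a href="/docs/b#top">B</a> <a href="/docs/">home</a>',
    "https://example.com/docs/b": "<p>leaf</p>",
}

pages = crawl("https://example.com/docs/", FAKE_SITE.__getitem__)
print(sorted(pages))
```

The `seen` set prevents revisiting pages that link back to each other, and the prefix check keeps the crawl from wandering off to external sites.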

In the long term: we're looking into more seamless integrations between Browserless and LangChain, including controlling a stateful browser session from within LangChain. This opens the possibility of using LangChain to automate web tasks that require a browser, like filling out forms or interacting with a website's UI. Stay tuned for more updates!
