When Python Libraries Talk: Pydoxtools Writes a Blog Post About Itself in Fewer Than 100 Lines of Code!

An example of using pydoxtools for LLM-based article writing and file/directory-based information retrieval.

We wrote the first blog post about Pydoxtools completely automatically, using Pydoxtools itself! Here is the code showing how it was done.

:::info
This was the prompt used to generate the first blog post:

Write a blog post, about 0.5-1 page long, introducing a new library to young developers that are new to AI and LLMs
:::

Pydoxtools is a versatile Python library designed to streamline AI-powered document processing and information retrieval. This library is perfect for developers who want to harness the power of AI in their projects. This article showcases an efficient method to automate article writing for your project using ChatGPT and Pydoxtools in fewer than 100 lines of code. The script demonstrates the following key functionalities:

  • Indexing a directory of files with Pydoxtools
  • Employing an agent for information retrieval within those files
  • Auto-generating text based on a set objective

Most of the code below is available as a notebook, or you can simply refer to our concise script, which executes these steps in fewer than 100 lines of code:

https://github.com/Xyntopia/pydoxtools/blob/main/examples/automatic_project_writing.py

or open the notebook directly in Google Colab.

Costs and API Key

Please note that ChatGPT is a paid service; running the script once will cost you about 2-5 cents. Pydoxtools automatically caches all calls to ChatGPT, so subsequent runs usually turn out a little cheaper. To use ChatGPT, you will need to generate an OpenAI API key by registering an account at https://platform.openai.com/account/api-keys. Remember to keep your API key secure and do not share it with anyone.

Pydoxtools already includes open-source LLM models that can do the same for free, locally on your computer. As of May 2023, this feature is still being tested.

Safeguarding Your API Key in Google Colab

When working with sensitive information like API keys, it's crucial to ensure their security. In Google Colab, you can save your API key in a separate file, allowing you to share the notebook without exposing the key. To do this, follow these simple steps:

  1. Execute the cell below to create a new file in your Colab environment. This file will store your API key and will be deleted automatically when the Colab runtime is terminated.
!touch /tmp/openai_api_key
  2. Open the newly created file by clicking on the following link in Colab: /tmp/openai_api_key

  3. Copy and paste your API key into the file, then save it.

By following these steps, you can ensure the security of your API key while still being able to share your notebook with others. Happy coding!
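Alternatively, if you would rather not write the key to a file at all, you can type it in at runtime using Python's built-in getpass module. This is a minimal sketch of that approach (it is not part of the original notebook; the prompt text is arbitrary):

import os
from getpass import getpass

# ask for the key interactively; the input is neither echoed nor stored in the notebook
os.environ['OPENAI_API_KEY'] = getpass('Paste your OpenAI API key: ')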

Installation

Follow these simple steps to install and configure Pydoxtools for your projects:

  1. Install the Pydoxtools library by running the following command:
%%capture
# if we want to install directly from our repository:
#!pip install -U --force-reinstall --no-deps "pydoxtools[etl,inference] @ git+https://github.com/xyntopia/pydoxtools.git"
!pip install -U "pydoxtools[etl,inference]==0.6.3"  # quoted so the shell does not expand the brackets

After installation, restart the runtime to load the newly installed libraries into Jupyter.

  2. Now we load the OPENAI_API_KEY from our file:
# load the key and expose it as an environment variable
import os

with open('/tmp/openai_api_key') as f:
    os.environ['OPENAI_API_KEY'] = f.read().strip()  # strip() guards against a trailing newline
  3. Now we can initialize pydoxtools, which will automatically make use of the OPENAI_API_KEY:
import logging

import dask
from chromadb.config import Settings

import pydoxtools as pdx
from pydoxtools import agent as ag
from pydoxtools.settings import settings

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)
logging.getLogger("pydoxtools.document").setLevel(logging.INFO)

Configuration

Pydoxtools can be configured in various ways; for this example we use two settings:

  • Pydoxtools uses dask in the background to handle large indexes and databases. We could even index terabytes of data this way! For this example, though, we set the dask scheduler to "synchronous" so that we can see everything that's happening locally, which makes the script easy to debug.
  • Pydoxtools has a caching mechanism which caches calls to pydoxtools.Document. This speeds up subsequent runs considerably during development (for example the vector index creation or the extraction of other information from documents).
# pydoxtools.DocumentBag uses a dask scheduler for parallel computing
# in the background. For easier debugging, we set this to "synchronous"
dask.config.set(scheduler='synchronous')
# dask.config.set(scheduler='multiprocessing')  # can also be used...

settings.PDX_ENABLE_DISK_CACHE = True  # turn on caching for pydoxtools

Download our project

In order for our program to work, we need to provide the AI with information. In this case we use the files in a directory as the information source: we simply download the "Pydoxtools" project from GitHub. Essentially, Pydoxtools is writing about itself :-). You could also mount a Google Drive here, or load any local folder if you're running this notebook on your own computer.

%cd /content  # use the %cd magic; !cd would only change the directory in a throwaway subshell
!git clone https://github.com/Xyntopia/pydoxtools.git

Index initialization

In order for an LLM like ChatGPT to retrieve information, it needs to be stored in a vector format. This way, relevant pieces of information can be found using nearest-neighbour search. We are using ChromaDB for this purpose here, but many other choices are available.
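To make the idea concrete, here is a tiny, purely illustrative sketch of nearest-neighbour search; it is not pydoxtools code, and the random vectors merely stand in for real embeddings (such as those produced by all-MiniLM-L6-v2):

import numpy as np

# toy "embeddings": one vector per text snippet (a real embedding model produces these)
snippets = ["pydoxtools extracts tables", "dask schedules the work", "chromadb stores vectors"]
vectors = np.random.rand(len(snippets), 384)
query = np.random.rand(384)  # the embedding of the question we want answered

# cosine similarity between the query and every snippet vector
scores = (vectors @ query) / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
print(snippets[int(np.argmax(scores))])  # the most similar snippet wins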

##### Use chromadb as a vectorstore #####
chroma_settings = Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory=str(settings.PDX_CACHE_DIR_BASE / "chromadb"),
    anonymized_telemetry=False
)

# create our source of information: a pydoxtools.DocumentBag
# (which itself holds a list of pydoxtools.Document), and here
# we choose pydoxtools itself as the information source!
root_dir = "/content/pydoxtools"
ds = pdx.DocumentBag(
    source=root_dir,
    exclude=[  # ignore some files which make the indexing rather inefficient
        '.git/', '.idea/', '/node_modules', '/dist',
        '/__pycache__/', '.pytest_cache/', '.chroma', '.svg', '.lock',
        "/site/"
    ],
    forgiving_extracts=True
)

Initialize agent, give it a writing objective and compute the index

Now that we have everything set up, we can initialize our LLM agent with the provided information. For the pydoxtools project in this example, computing the index takes about 5-10 minutes; once the computation has finished, the vector index holds roughly 4000 text snippets for the project. When using the pydoxtools cache, subsequent calculations will be much faster (~1 min).

final_result = []

agent = ag.LLMAgent(
    vector_store=chroma_settings,
    objective="Write a blog post, introducing a new library (which was developed by us, "
              "the company 'Xyntopia') to "
              "visitors of our corporate webpage, which might want to use the pydoxtools library but "
              "have no idea about programming. Make sure, the text is about half a page long.",
    data_source=ds
)
agent.pre_compute_index()
>
[########################################] | 100% Completed | 60.10 s

Search for relevant information

The agent is able to store information as question/answer pairs and makes use of that information when executing tasks. To get the algorithm started more quickly, we answer one basic question manually. In a real app you could ask the user questions like this in a dialog.

agent.add_question(question="Can you please provide the main topic of the project or some primary "
                            "keywords related to the project, "
                            "to help with identifying the relevant files in the directory?",
                   answer="python library, AI, pipelines")
>
WARNING:chromadb.api.models.Collection:No embedding_function provided, using default embedding function: DefaultEmbeddingFunction https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

Having this information, we ask the agent to come up with a few more questions that it needs to answer before being able to write the article.

# first, gather some basic information...
questions = agent.execute_task(
  task="What additional information do you need to create a first, very short outline as a draft? " \
        "provide it as a ranked list of questions", save_task=True)
print("\n".join(questions))
>
What is the name of the new library developed by Xyntopia?
What is the purpose of the pydoxtools library?
What are some examples of how the pydoxtools library can be used in AI pipelines?
What are some benefits of using the pydoxtools library for non-programmers?
What are some potential drawbacks or limitations of the pydoxtools library?

Having created this list of questions, we can now ask the agent to research them by itself. It will automatically use the index we computed above for this task.

# we only use the first 5 provided questions to make it faster ;).
agent.research_questions(questions[:5], allowed_documents=["text/markdown"])
>
Token indices sequence length is longer than the specified maximum sequence length for this model (871 > 512). Running this sequence through the model will result in indexing errors

After retrieving this information, we can tell the agent to write the text. We tell it to automatically make use of the stored information by setting the "context_size" parameter to a value greater than 0; this is the number of pieces of stored information it will use to fulfill the task.

txt = agent.execute_task(task="Complete the overall objective, formulate the text "
                              "based on answered questions and format it in markdown.",
                          context_size=20, max_tokens=1000, formatting="txt")
final_result.append(txt)  # add a first draft to the result

Having our first draft of the text, let's critique it to improve its quality! From this critique we then create a new list of tasks that we can give to the agent to execute one by one, gradually improving our text.

critique = agent.execute_task(task=f"Given this text:\n\n```markdown\n{txt}\n```"
                                   "\n\nlist 5 points of critique about the text",
                              context_size=0, max_tokens=1000)

tasks = agent.execute_task(
    task=f"Given this text:\n\n```markdown\n{txt}\n```\n\n"
         f"and its critique: {critique}\n\n"
         "Generate instructions that would make it better. "
         "Sort them by importance and return it as a list of tasks",
    context_size=0, max_tokens=1000)

for t in tasks:
    task = "Given this text:\n\n" \
           f"```markdown\n{txt}\n```\n\n" \
           f"Make the text better by executing this task: '{t}' " \
           f"and integrate it into the given text, but keep the overall objective in mind."
    txt = agent.execute_task(task, context_size=10, max_tokens=1000, formatting="markdown")
    final_result.append([task, txt])
print("\n".join(str(t) for t in tasks))
>
{'Specify the text length': 'Write a blog post that is approximately half a page long.'}
{'Define the target audience': 'Introduce the pydoxtools library to visitors of our corporate webpage who are interested in using it but have no programming experience.'}
{'Mention the name of the library': 'Introduce our new library, pydoxtools, to readers.'}
{'Explain the benefits of using the library': 'Highlight the advantages of using pydoxtools, such as simplifying documentation generation and improving workflow efficiency.'}
{'Include a call-to-action or next steps for the reader': 'Encourage readers to try out pydoxtools by providing a link to download the library and offering support resources for beginners.'}
# for debugging, you can see all intermediate results, simply uncomment the variable to check:

#final_result  # for the evolution of the final text
#agent._debug_queue  # in order to check all requests made to llms and vectorstores etc...

Final text

After all the processing is finally done, here is the final text:

from IPython.display import Markdown
# crudely strip the surrounding ``` fence and the "markdown" language tag before rendering
Markdown(txt.strip("`").replace("markdown", ""))

:::info
Generated Text:

Introduction to Pydoxtools

Introduce our new library, pydoxtools, to readers. Pydoxtools is a Python library developed by Xyntopia that provides a sophisticated interface for reading and writing documents, designed to work with AI models such as GPT, Alpaca, and Huggingface. The library aims to simplify the process of building custom pipelines with LLMs and other AI tools, making it easy to integrate modern AI tools and reimagine data extraction pipelines.

Purpose of Pydoxtools

The purpose of the pydoxtools library is to provide functionalities such as pipeline management, integration with AI models, low-resource (PDF) table extraction without configuration and expensive layout detection algorithms, document analysis and question-answering, support for most document formats, vector index creation, entity and address identification, list and keyword extraction, data normalization, translation, and cleaning.

Benefits for Non-Programmers

Pydoxtools is a great tool for non-programmers who want to extract data from documents. It offers low-resource (PDF) table extraction without configuration and expensive layout detection algorithms, document analysis and question-answering, support for most document formats, vector index creation, entity and address identification, list and keyword extraction, data normalization, translation, and cleaning. The library also allows for the creation of custom pipelines with LLMs and other AI tools, making it easy to integrate modern AI tools and reimagine data extraction pipelines.

Some benefits of using the pydoxtools library for non-programmers include simplifying documentation generation and improving workflow efficiency.

Usage in AI Pipelines

Pydoxtools can be used in AI pipelines for low-resource (PDF) table extraction without configuration and expensive layout detection algorithms, document analysis and question-answering, support for most document formats, vector index creation, entity and address identification, list and keyword extraction, data normalization, translation, and cleaning. The library also allows for the creation of custom pipelines with LLMs and other AI tools, making it easy to integrate modern AI tools and reimagine data extraction pipelines.

Limitations

There is no mention of any potential drawbacks or limitations of the pydoxtools library in the provided text.

Target Audience

This blog post is intended for visitors of our corporate webpage who are interested in using the pydoxtools library but have no programming experience.

Encourage readers to try out pydoxtools by providing a link to download the library and offering support resources for beginners. If you are interested in using Pydoxtools, you can find more information on our corporate webpage.
:::

Conclusion

Pydoxtools is a powerful and user-friendly Python library that makes it easy to harness the power of AI for document processing and information retrieval. Whether you are new to programming or an experienced developer, Pydoxtools can help you streamline your projects and achieve your goals. Give it a try today and experience the benefits for yourself!

You can find more information at the following links:

https://github.com/Xyntopia/pydoxtools
