Learning how to use Haystack for retrieval-augmented generation

Learning Haystack

"Unusual" features

Haystack first comes across as a document retrieval system. However, it has other features that may not be immediately obvious.

  1. Fine-tuning on our own data (and this one).
  2. Redirect search to different nodes, based on type (tutorial) - distinguish between keyword and sentence searches.
  3. Agents (plugins).
  4. Multimodal (image) search.
  5. Evaluation.
  6. Question generator.
  7. Pipelines - once documents are ingested, we can retrieve them in different ways (plain search, generative search, etc.).

Learning Haystack

  1. Read the concepts.
  2. Read about pipeline components.
  3. Run the tutorials.

TODO

Haystack ingest/retrieve deployment with a REST API

This document describes how to deploy a Haystack instance to ingest documents and answer questions using those documents through the REST API.

This is a simple document search deployment. It does not answer questions like a large language model (e.g. GPT) does. The goal of this deployment is to understand basic Haystack principles that are used in more complex deployments later:

  1. What are nodes and pipelines.
  2. How to configure them.
  3. How to use the REST API.

The instructions are based on the Using Haystack with REST API tutorial ("last updated" date on the page says April 14, 2023). Please check the tutorial before following the instructions here. If the instructions do not match the tutorial, trust the tutorial and update the instructions.

The tutorial uses ElasticSearch as the indexing engine. We will modify the deployment in another step to use a different vector database.

Create and configure a virtual machine

These instructions apply to a GCP VM. Adjust them to your environment.

A small VM is enough for small-scale experiments:

  • e2-standard-2 with 2 vCPU and 8 GB memory
  • 100 GB boot disk
  • Ubuntu 22.04

Create an instance schedule to shut the VM down automatically when it is not in use.
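A possible sketch with gcloud, assuming the VM is named haystack-vm in us-east1 (the names, region, zone, and stop time are placeholders to adapt):

# Create a schedule that stops the VM every day at 20:00 (placeholder values)
gcloud compute resource-policies create instance-schedule stop-at-night \
    --region=us-east1 \
    --vm-stop-schedule="0 20 * * *" \
    --timezone="America/New_York"

# Attach the schedule to the VM
gcloud compute instances add-resource-policies haystack-vm \
    --zone=us-east1-b \
    --resource-policies=stop-at-night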

Protect the VM by allowing access to port 8000 only from specific external addresses. The easiest way to do that in GCP is to modify the standard http-server firewall rule (see the sketch after this list):

  • Add port 8000 to the rule.
  • Set source addresses to allow only specific addresses or network segments.
  • Add the http-server network tag to the VM.
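A minimal gcloud sketch, assuming the default rule is named default-allow-http and the VM is named haystack-vm (rule name, VM name, zone, and the source CIDR are placeholders to adapt):

# Add port 8000 and restrict the sources (rule name and CIDR are assumptions)
gcloud compute firewall-rules update default-allow-http \
    --allow=tcp:80,tcp:8000 \
    --source-ranges="203.0.113.0/24"

# Attach the rule's network tag to the VM (VM name and zone are placeholders)
gcloud compute instances add-tags haystack-vm \
    --zone=us-east1-b \
    --tags=http-server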

Install the prerequisites

At this time, the only prerequisite is Docker Compose. Follow the official instructions to install Docker Engine; the Compose plugin is installed with it.

Note that we do not want the Docker Desktop installation, just the Docker Engine and its CLI.
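One quick way to install the engine and the Compose plugin on Ubuntu is Docker's convenience script (review the script before running it):

# Download and run Docker's official convenience install script
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh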

Configure Docker to not require root permission.

sudo gpasswd -a $USER docker

Log out (e.g. close the SSH session) and log back in, then run the Docker "hello world" to confirm that it works.

docker run hello-world

Configure and start Haystack

The instructions here summarize the tutorial page dated April 13, 2023. If these instructions do not match the tutorial, trust the tutorial and update the instructions here.

Create a directory and get the template Docker Compose file for this Haystack deployment.

# Choose an appropriate root directory: /opt if you have access,
# otherwise your own home directory
mkdir doc-search
cd doc-search

curl --output docker-compose.yml \
   https://raw.githubusercontent.com/deepset-ai/haystack/main/docker-compose.yml

The next step is to assemble a pipeline with the nodes we need for this application. Nodes are Haystack's building blocks; pipelines are how we put the nodes together into a solution.

Create the (empty) pipeline definition file and note the directory we are in.

touch document-search.haystack-pipeline.yml
pwd # copy the output - will be used in the next step

Open docker-compose.yml and edit the volumes value to mount the directory that contains the pipeline definition file (this directory), and set PIPELINE_YAML_PATH to the full path of that file.

  haystack-api:
    ...
    volumes:
      - ./:<output from pwd (above)>
    environment:
      ...
      - PIPELINE_YAML_PATH=<output from pwd (above)>/document-search.haystack-pipeline.yml
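If you prefer to script that edit, a sed one-liner along these lines may work. It assumes the template contains a PIPELINE_YAML_PATH line, and the volumes entry still needs a manual edit, so verify docker-compose.yml afterwards.

# Point PIPELINE_YAML_PATH at our pipeline file (assumes the variable exists in the template)
sed -i "s|PIPELINE_YAML_PATH=.*|PIPELINE_YAML_PATH=$(pwd)/document-search.haystack-pipeline.yml|" docker-compose.yml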

Populate the pipeline definition file as shown below. The components of this pipeline are a store node, a retriever node, a classifier, a converter, and a preprocessor. The pipelines section combines the components to index (ingest) files and search (query) those files.

The command below creates the pipeline as defined in the tutorial.

cat <<EOT > document-search.haystack-pipeline.yml
version: 1.12.1

components:
  - name: DocumentStore
    type: ElasticsearchDocumentStore
    params:
      embedding_dim: 384
  - name: Retriever
    type: EmbeddingRetriever
    params:
      document_store: DocumentStore
      top_k: 10 
      embedding_model: sentence-transformers/all-MiniLM-L6-v2
  - name: FileTypeClassifier
    type: FileTypeClassifier
  - name: TextFileConverter
    type: TextConverter
  - name: Preprocessor
    type: PreProcessor
    params:
      split_by: word
      split_length: 250
      split_overlap: 30 
      split_respect_sentence_boundary: True

pipelines:
  - name: query 
    nodes:
      - name: Retriever
        inputs: [Query]
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextFileConverter
        inputs: [FileTypeClassifier.output_1]
      - name: Preprocessor
        inputs: [TextFileConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]
EOT

Start the Haystack nodes.

docker compose up
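To keep the terminal free, the stack can also be started detached, following the logs separately (standard Docker Compose flags; the service name haystack-api comes from the template):

# Start in the background and follow the API logs
docker compose up -d
docker compose logs -f haystack-api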

If everything worked, the Haystack API is now running under a web server:

... many log lines not shown here

doc-search-haystack-api-1   | INFO:     Started server process [1]
doc-search-haystack-api-1   | INFO:     Waiting for application startup.
doc-search-haystack-api-1   | INFO:     Application startup complete.
doc-search-haystack-api-1   | INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Open another terminal to the VM and verify that the Haystack API is up and running.

curl --request GET http://127.0.0.1:8000/initialized
# Should print "true"
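Startup can take a minute or two while models are downloaded. A plain shell poll (nothing Haystack-specific) can wait for the API to come up:

# Poll the readiness endpoint for up to ~2 minutes
for _ in $(seq 1 60); do
    curl -s http://127.0.0.1:8000/initialized | grep -q true && break
    sleep 2
done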

Test the Haystack deployment

You can run these steps on the same VM where Haystack is installed or on another machine that has network access to the Haystack VM.

Download and unzip the data from the tutorial.

wget https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/article_txt_countries_and_capitals.zip
unzip article_txt_countries_and_capitals.zip

The official tutorial uploads hundreds of Wikipedia articles. This may take a while. To test it faster, we can upload only one of them.

# Replace the VM IP address
# If running locally (on the VM), use 127.0.0.1
export VM_IP=35.237.13.158

curl --request POST \
     --url http://$VM_IP:8000/file-upload \
     --header 'accept: application/json' \
     --header 'content-type: multipart/form-data' \
     --form files=@article_txt_countries_and_capitals/0_Minsk.txt \
     --form meta=null
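To ingest the whole dataset instead, a simple shell loop over the extracted files works (same endpoint as above; hundreds of uploads take a while):

# Upload every article in the dataset, one request per file
for f in article_txt_countries_and_capitals/*.txt; do
    curl --silent --request POST \
         --url http://$VM_IP:8000/file-upload \
         --header 'accept: application/json' \
         --header 'content-type: multipart/form-data' \
         --form files=@"$f" \
         --form meta=null
done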

Now we can ask a question.

# Replace the VM IP address
# If running locally (on the VM), use 127.0.0.1
export VM_IP=35.237.13.158

curl --request POST \
     --url http://$VM_IP:8000/query \
     --header 'accept: application/json' \
     --header 'content-type: application/json' \
     --data '{
     "query": "What football teams are based in Minsk"
     }'

Note that because we have only a simple retriever, we get back the document chunks that contain the answer, not the clean answer a large language model such as GPT would give. That is covered in a different tutorial.
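To make the JSON response easier to read, pipe it through jq if it is installed. The exact response schema depends on the pipeline; with this retriever-only setup the hits should come back in a documents array (an assumption to verify against your own output):

# Pretty-print the query response; adjust the jq filter to the actual schema
curl --silent --request POST \
     --url http://$VM_IP:8000/query \
     --header 'accept: application/json' \
     --header 'content-type: application/json' \
     --data '{"query": "What football teams are based in Minsk"}' | jq '.'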

Inspecting the API

Open a browser window to http://<server IP>:8000/docs to see the API docs.
