louis030195/assistants_mlc-llm.md

## assistants_mlc-llm.md

      
At the moment, you need Docker installed to run the API.
Additionally, Assistants currently supports Anthropic and open source LLMs. You need a few environment variables, which you can put in a .env file in the root of the project:
DATABASE_URL=postgres://postgres:secret@localhost:5432/mydatabase
REDIS_URL=redis://127.0.0.1/
S3_ENDPOINT=http://localhost:9000
S3_ACCESS_KEY=minioadmin
S3_SECRET_KEY=minioadmin
S3_BUCKET_NAME=mybucket
MODEL_URL="http://localhost:8000/v1/chat/completions"
We will use OpenAI's JS SDK, but feel free to use the Python one; you can paste this doc into ChatGPT to translate it to Python.
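As a quick sanity check before starting the stack, a small Node script can confirm that the variables listed above are set. This is only a sketch: `missingEnvVars` is a hypothetical helper, and it reads `process.env`, so it assumes the values have been exported into the shell or loaded from `.env` by a tool of your choice.

```javascript
// Variable names taken from the .env listing above.
const REQUIRED_ENV_VARS = [
  'DATABASE_URL',
  'REDIS_URL',
  'S3_ENDPOINT',
  'S3_ACCESS_KEY',
  'S3_SECRET_KEY',
  'S3_BUCKET_NAME',
  'MODEL_URL',
];

// Hypothetical helper: returns the names that are unset or blank in `env`.
function missingEnvVars(env = process.env) {
  return REQUIRED_ENV_VARS.filter(
    (name) => !env[name] || env[name].trim() === ''
  );
}

const missing = missingEnvVars();
if (missing.length > 0) {
  console.warn(`Missing environment variables: ${missing.join(', ')}`);
}
```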
Function calling

Function calling allows you to describe functions to the Assistant and have it intelligently return the functions that need to be called along with their arguments. The Assistants API will pause execution during a Run when it invokes functions, and you can supply the results of the function call back to continue the Run execution.
We'll use Intel/neural-chat-7b-v3-2, which is one of the best 7B open source LLMs to date, thanks to Intel!
Steps to Run the API with MLC LLM


Install MLC LLM Python Package:
Ensure you have the MLC LLM package installed. If not, install it using pip:

virtualenv env
source env/bin/activate
python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-chat-nightly mlc-ai-nightly

Download Pre-quantized Weights and Pre-compiled Model Library:
Follow the instructions in the MLC LLM documentation to download the necessary model files and libraries.

mkdir dist/
git clone https://github.com/mlc-ai/binary-mlc-llm-libs.git dist/prebuilt_libs
git lfs install
git clone https://huggingface.co/mlc-ai/OpenHermes-2.5-Mistral-7B-q4f16_1-MLC \
                                  dist/OpenHermes-2.5-Mistral-7B-q4f16_1-MLC
git clone https://huggingface.co/mlc-ai/Mistral-7B-Instruct-v0.2-q4f16_1-MLC \
                                  dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC

Llama-2-7b-chat-hf-q4f16_1

git clone https://huggingface.co/mlc-ai/Llama-2-7b-chat-hf-q4f16_1-MLC \
                                  dist/Llama-2-7b-chat-hf-q4f16_1-MLC


Launch the MLC LLM Server:
Start the server with one of the following commands, setting --model to your model directory and --lib-path to the path of your model library file:

python -m mlc_chat.rest --model dist/OpenHermes-2.5-Mistral-7B-q4f16_1-MLC --lib-path dist/prebuilt_libs/Mistral-7B-Instruct-v0.2-q4f16_1-metal.so --device auto --host localhost --port 8000
python -m mlc_chat.rest --model dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC --lib-path dist/prebuilt_libs/Mistral-7B-Instruct-v0.2-q4f16_1-metal.so --device auto --host localhost --port 8000

# llama
python -m mlc_chat.rest --model dist/Llama-2-7b-chat-hf-q4f16_1-MLC --lib-path dist/prebuilt_libs/Llama-2-7b-chat-hf-q4f16_1-metal.so --device auto --host localhost --port 8000


python -m mlc_chat_cli --model dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC \
             --model-lib-path dist/prebuilt_libs/Mistral-7B-Instruct-v0.2-q4f16_1-metal.so 

Alternatively, to serve Intel/neural-chat-7b-v3-2 with FastChat instead (assuming you have Python 3 and virtualenv installed):
virtualenv env
source env/bin/activate
pip3 install "fschat[model_worker]"

# Terminal 1
python3 -m fastchat.serve.controller

# Terminal 2 - FYI just change "mps" to "cpu" or "cuda" depending on your hardware
python3 -m fastchat.serve.model_worker --model-path Intel/neural-chat-7b-v3-2 --device mps --load-8bit

# Terminal 3
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000

# Terminal 4
# Test if it works properly:
# Install the OpenAI API JS client
npm i openai 

# Start a node repl
node
// Paste the following code in your node repl and press enter:
const OpenAI = require('openai');

const openai = new OpenAI({
    baseURL: 'http://localhost:8000/v1',
    apiKey: 'EMPTY',
});

async function main() {
  const chatCompletion = await openai.chat.completions.create({
    messages: [{ role: 'user', content: 'Hello! What is your name?' }],
    model: 'mlc-ai/OpenHermes-2.5-Mistral-7B-q4f16_1-MLC',
  });
  console.log(chatCompletion.choices[0].message.content);
}

main();
You should see something like this:

Promise {
  <pending>,
  [Symbol(async_id_symbol)]: 270,
  [Symbol(trigger_async_id_symbol)]: 5
}
My name is Pepper, a helpful artificial intelligence assistant.

Function calling with Intel's LLM in JS


Start the server

docker-compose --profile api -f docker/docker-compose.yml up -d

Create an Assistant

Alright, open a new terminal and start node, then drop these snippets step by step:
const OpenAI = require('openai');

const openai = new OpenAI({
    baseURL: 'http://localhost:3000',
    apiKey: 'EMPTY',
});
async function createAssistant() {
    const assistant = await openai.beta.assistants.create({
        instructions: "You are a weather bot. Use the provided functions to answer questions.",
        model: "Intel/neural-chat-7b-v3-2",
        name: "Weather Bot",
        tools: [{
            "type": "function",
            "function": {
              "name": "getCurrentWeather",
              "description": "Get the weather in location",
              "parameters": {
                  "type": "object",
                  "properties": {
                  "location": {"type": "string", "description": "The city and state e.g. San Francisco, CA"},
                  "unit": {"type": "string"}
                  },
                  "required": ["location"]
              }
            }
        }]
    });
    console.log(JSON.stringify(assistant, null, 2));
}
createAssistant();
{
  "id": "75ce7666-7560-4bb2-8358-48107d94183a",
  "object": "",
  "created_at": 1702071264602,
  "name": "Weather Bot",
  "description": null,
  "model": "Intel/neural-chat-7b-v3-2",
  "instructions": "You are a weather bot. Use the provided functions to answer questions.",
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "getCurrentWeather",
        "description": "Get the weather in location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state e.g. San Francisco, CA",
              "enum": null
            },
            "unit": {
              "type": "string",
              "description": null,
              "enum": null
            }
          },
          "required": [
            "location"
          ]
        }
      }
    }
  ],
  "file_ids": null,
  "metadata": null
}
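The `parameters` object in the tool definition is a JSON Schema, and `required` lists the fields a call must include. Small models can occasionally return malformed or incomplete arguments, so it may be worth checking them before calling your own function. The sketch below assumes the tool definition shape shown above; `checkToolArguments` is a hypothetical helper, not part of the OpenAI SDK or of Assistants, and it only checks `required`, not types.

```javascript
// Same tool definition as passed to assistants.create above.
const weatherTool = {
  type: 'function',
  function: {
    name: 'getCurrentWeather',
    description: 'Get the weather in location',
    parameters: {
      type: 'object',
      properties: {
        location: { type: 'string', description: 'The city and state e.g. San Francisco, CA' },
        unit: { type: 'string' },
      },
      required: ['location'],
    },
  },
};

// Hypothetical helper: parses the JSON-encoded arguments of a tool call and
// checks them against the `required` list of the tool definition.
function checkToolArguments(toolDefinition, argumentsJson) {
  let args;
  try {
    args = JSON.parse(argumentsJson);
  } catch (err) {
    return { ok: false, error: `arguments are not valid JSON: ${err.message}` };
  }
  if (args === null || typeof args !== 'object' || Array.isArray(args)) {
    return { ok: false, error: 'arguments must be a JSON object' };
  }
  const required = toolDefinition.function.parameters.required ?? [];
  const missing = required.filter((name) => !(name in args));
  if (missing.length > 0) {
    return { ok: false, error: `missing required parameters: ${missing.join(', ')}` };
  }
  return { ok: true, args };
}
```

For example, `checkToolArguments(weatherTool, '{"unit":"imperial"}')` reports that `location` is missing.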

Create a Thread

async function createThread() {
    const thread = await openai.beta.threads.create();
    console.log(JSON.stringify(thread, null, 2));
}
createThread();
{
  "id": "f74681b8-2371-4db1-946f-3efb070f0b19",
  "object": "",
  "created_at": 1702071499412,
  "metadata": null
}

Add a Message to a Thread

Replace the UUID with the actual thread id
async function createMessage() {
    const message = await openai.beta.threads.messages.create(
        "f74681b8-2371-4db1-946f-3efb070f0b19",
        {
            role: "user",
            content: "What's the weather in San Francisco?"
        }
    );
    console.log(JSON.stringify(message, null, 2));
}
createMessage();
{
  "id": "7f109eba-796f-4961-9589-3af529f14e6e",
  "object": "",
  "created_at": 1702499213,
  "thread_id": "f74681b8-2371-4db1-946f-3efb070f0b19",
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": {
        "value": "What's the weather in San Francisco?",
        "annotations": []
      }
    }
  ],
  "assistant_id": "00000000-0000-0000-0000-000000000000",
  "run_id": "00000000-0000-0000-0000-000000000000",
  "file_ids": [],
  "metadata": null
}

Run the Assistant

Replace the UUIDs below with the actual thread id and assistant id (returned by the "Create a Thread" and "Create an Assistant" steps)
async function createRun() {
    const run = await openai.beta.threads.runs.create(
        "f74681b8-2371-4db1-946f-3efb070f0b19",
    { 
        assistant_id: "75ce7666-7560-4bb2-8358-48107d94183a",
        instructions: "You are a weather bot. Use the provided functions to answer questions."
    }
    );
    console.log(JSON.stringify(run, null, 2));
}
createRun();
{
  "id": "3933cbe4-5c26-4f93-b2d8-f03dabc055f2",
  "object": "",
  "created_at": 1702499360,
  "thread_id": "f74681b8-2371-4db1-946f-3efb070f0b19",
  "assistant_id": "75ce7666-7560-4bb2-8358-48107d94183a",
  "status": "queued",
  "required_action": null,
  "last_error": null,
  "expires_at": null,
  "started_at": null,
  "cancelled_at": null,
  "failed_at": null,
  "completed_at": null,
  "model": "",
  "instructions": "You are a weather bot. Use the provided functions to answer questions.",
  "tools": [],
  "file_ids": [],
  "metadata": {}
}
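To avoid copy-pasting ids between steps, the thread, message, and run calls used so far can be chained in one function. This is a minimal sketch: `askAssistant` is a hypothetical helper, `client` is assumed to be an OpenAI SDK instance configured like `openai` above, and the call signatures are the ones used in this document.

```javascript
// Hypothetical helper: creates a thread, adds one user message, starts a run,
// and returns the ids needed by the later steps.
async function askAssistant(client, { assistantId, instructions, question }) {
  const thread = await client.beta.threads.create();

  await client.beta.threads.messages.create(thread.id, {
    role: 'user',
    content: question,
  });

  const run = await client.beta.threads.runs.create(thread.id, {
    assistant_id: assistantId,
    instructions,
  });

  return { threadId: thread.id, runId: run.id };
}

// Usage (in the node REPL, after creating the assistant):
// askAssistant(openai, {
//   assistantId: '<your assistant id>',
//   instructions: 'You are a weather bot. Use the provided functions to answer questions.',
//   question: "What's the weather in San Francisco?",
// }).then(console.log);
```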

Check the Run Status

Replace the UUIDs below with the actual thread id and run id
async function getRun() {
    const run = await openai.beta.threads.runs.retrieve(
        "f74681b8-2371-4db1-946f-3efb070f0b19",
        "3933cbe4-5c26-4f93-b2d8-f03dabc055f2"
    );
    console.log(JSON.stringify(run, null, 2));
}
getRun();
(Feel free to run this command multiple times until the status changes. LLMs can be slow, especially if you run one on your coffee machine.)
{
  "id": "3933cbe4-5c26-4f93-b2d8-f03dabc055f2",
  "object": "",
  "created_at": 1702499360,
  "thread_id": "f74681b8-2371-4db1-946f-3efb070f0b19",
  "assistant_id": "75ce7666-7560-4bb2-8358-48107d94183a",
  "status": "requires_action",
  "required_action": {
    "type": "submit_tool_outputs",
    "submit_tool_outputs": {
      "tool_calls": [
        {
          "id": "9cb246b2-061c-46a4-aa34-80e7587c81c4",
          "type": "function",
          "function": {
            "name": "getCurrentWeather",
            "arguments": "{\"location\":\"San Francisco, CA\",\"unit\":\"imperial\"}"
          }
        }
      ]
    }
  },
  "last_error": null,
  "expires_at": null,
  "started_at": null,
  "cancelled_at": null,
  "failed_at": null,
  "completed_at": null,
  "model": "",
  "instructions": "You are a weather bot. Use the provided functions to answer questions.",
  "tools": [],
  "file_ids": [],
  "metadata": {}
}
The Assistant is now waiting for the user to submit the results of the function call.
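Instead of re-running `getRun()` by hand, the status check can be wrapped in a small polling loop. The sketch below is hypothetical (`waitForRun` is not part of the SDK) and assumes the server reports `queued` or `in_progress` while the run is still pending, as OpenAI's API does; only `queued` and `requires_action` appear in the outputs above.

```javascript
// Hypothetical helper: polls a run until it leaves the pending states,
// then returns the run object (e.g. with status "requires_action").
async function waitForRun(
  client,
  threadId,
  runId,
  { intervalMs = 1000, maxAttempts = 60 } = {}
) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const run = await client.beta.threads.runs.retrieve(threadId, runId);
    if (run.status !== 'queued' && run.status !== 'in_progress') {
      return run;
    }
    // Wait before asking again so the server is not hammered.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Run ${runId} still pending after ${maxAttempts} attempts`);
}

// Usage (in the node REPL):
// waitForRun(openai, '<thread id>', '<run id>').then((run) => console.log(run.status));
```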

Submit the Function Call Results

In practice you would execute, say, your JavaScript function:
// this would do a request to a weather API
const output = getCurrentWeather({location: "San Francisco, CA", unit: "imperial"})
console.log(output)
> {"temperature": 68, "unit": "F"}
Good. So it seems the weather in San Francisco is 68F. Let's tell the LLM about it:
async function submitToolOutputs() {
    const run = await openai.beta.threads.runs.submitToolOutputs(
        "f74681b8-2371-4db1-946f-3efb070f0b19",
        "3933cbe4-5c26-4f93-b2d8-f03dabc055f2",
        {
            tool_outputs: [
                {
                    tool_call_id: "9cb246b2-061c-46a4-aa34-80e7587c81c4",
                    output: "{\"temperature\": 68, \"unit\": \"F\"}"
                }
            ]
        }
    );
    console.log(JSON.stringify(run, null, 2));
}
submitToolOutputs();
{
  "id": "3933cbe4-5c26-4f93-b2d8-f03dabc055f2",
  "object": "",
  "created_at": 1702499360,
  "thread_id": "f74681b8-2371-4db1-946f-3efb070f0b19",
  "assistant_id": "75ce7666-7560-4bb2-8358-48107d94183a",
  "status": "queued",
  "required_action": {
    "type": "submit_tool_outputs",
    "submit_tool_outputs": {
      "tool_calls": [
        {
          "id": "9cb246b2-061c-46a4-aa34-80e7587c81c4",
          "type": "function",
          "function": {
            "name": "getCurrentWeather",
            "arguments": "{\"location\":\"San Francisco, CA\",\"unit\":\"imperial\"}"
          }
        }
      ]
    }
  },
  "last_error": null,
  "expires_at": null,
  "started_at": null,
  "cancelled_at": null,
  "failed_at": null,
  "completed_at": null,
  "model": "",
  "instructions": "You are a weather bot. Use the provided functions to answer questions.",
  "tools": [],
  "file_ids": [],
  "metadata": {}
}
Now the LLM knows about the weather in San Francisco, and can answer questions about it.
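The manual step above (read the tool call, run the function, submit the output) can be generalized. The following sketch builds the `tool_outputs` array from a run in the `requires_action` state. The `handlers` registry and `buildToolOutputs` are hypothetical names, and the weather handler is a stub returning the same fixed value used in this walkthrough rather than calling a real weather API.

```javascript
// Hypothetical registry mapping function names to local implementations.
const handlers = {
  // Stub: ignores its arguments and returns a fixed reading.
  getCurrentWeather: async ({ location, unit }) => ({ temperature: 68, unit: 'F' }),
};

// Builds the `tool_outputs` array expected by submitToolOutputs from a run
// whose status is "requires_action".
async function buildToolOutputs(run, registry) {
  const toolCalls = run.required_action.submit_tool_outputs.tool_calls;
  return Promise.all(
    toolCalls.map(async (call) => {
      const handler = registry[call.function.name];
      if (!handler) {
        throw new Error(`No handler registered for ${call.function.name}`);
      }
      // Arguments arrive as a JSON-encoded string.
      const args = JSON.parse(call.function.arguments);
      const result = await handler(args);
      return { tool_call_id: call.id, output: JSON.stringify(result) };
    })
  );
}

// Usage, given a `run` retrieved as in "Check the Run Status":
// const tool_outputs = await buildToolOutputs(run, handlers);
// await openai.beta.threads.runs.submitToolOutputs(run.thread_id, run.id, { tool_outputs });
```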

Display the Assistant's Response

Replace the UUID with the actual thread id
async function getMessages() {
    const messages = await openai.beta.threads.messages.list(
        "f74681b8-2371-4db1-946f-3efb070f0b19"
    );
    console.log(JSON.stringify(messages, null, 2));
}
getMessages();
{
  "data": [
    {
      "id": "7f109eba-796f-4961-9589-3af529f14e6e",
      "object": "",
      "created_at": 1702499213,
      "thread_id": "f74681b8-2371-4db1-946f-3efb070f0b19",
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": {
            "value": "What's the weather in San Francisco?",
            "annotations": []
          }
        }
      ],
      "assistant_id": "00000000-0000-0000-0000-000000000000",
      "run_id": "00000000-0000-0000-0000-000000000000",
      "file_ids": [],
      "metadata": null
    },
    {
      "id": "5add8363-1285-463a-a647-6bceee7b84f3",
      "object": "",
      "created_at": 1702499547,
      "thread_id": "f74681b8-2371-4db1-946f-3efb070f0b19",
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": {
            "value": "I just checked the weather in San Francisco, CA with Fahrenheit as the unit. The temperature in San Francisco at the given time is 68 degrees Fahrenheit.",
            "annotations": []
          }
        }
      ],
      "assistant_id": "00000000-0000-0000-0000-000000000000",
      "run_id": "00000000-0000-0000-0000-000000000000",
      "file_ids": [],
      "metadata": null
    }
  ]
}
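To print only the assistant's answer rather than the whole list, the response can be filtered. This is a sketch assuming the message shape shown above; `latestAssistantText` is a hypothetical helper. It sorts by `created_at` instead of relying on list order, since the order returned may differ between servers.

```javascript
// Hypothetical helper: returns the text of the most recent assistant message,
// or null if the thread has no assistant message yet.
function latestAssistantText(messages) {
  const replies = messages.data
    .filter((message) => message.role === 'assistant')
    .sort((a, b) => b.created_at - a.created_at); // newest first

  if (replies.length === 0) {
    return null;
  }

  // A message can hold several content parts; keep only the text ones.
  return replies[0].content
    .filter((part) => part.type === 'text')
    .map((part) => part.text.value)
    .join('\n');
}

// Usage (in the node REPL):
// openai.beta.threads.messages.list('<thread id>').then((m) => console.log(latestAssistantText(m)));
```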