@Neel-Shah-29
Last active July 8, 2024 10:41
How to make a call to llama-3 using Litellm on your own server

Making calls to Aiflows using Litellm on your server!

Setting up your server

To run flows, you need compute capacity that meets or exceeds the requirements of the models you plan to run. Once you have the compute resources, follow the steps below.

Using screen to connect to your server

Installation:

Install Linux Screen on Ubuntu and Debian

sudo apt update
sudo apt install screen

Steps to connect to screen:

  1. SSH into your cluster. If your local machine has enough compute and storage resources, you can work locally instead.
  2. Named sessions are useful when you run multiple screen sessions. To create a named session, run the screen command with the following arguments:
 screen -S mysession

After step 2, you can run the flows inside the session by following the procedure in the later sections. The steps below cover how to manage the session.

  3. Working with Linux Screen Windows

When you start a new screen session, it creates a single window with a shell in it.

You can have multiple windows inside a Screen session.

To create a new window with a shell, type Ctrl+a c. The first available number from the range 0...9 will be assigned to it.

Below are some of the most common commands for managing Linux Screen windows:

Ctrl+a c Create a new window (with shell).
Ctrl+a " List all windows.
Ctrl+a 0 Switch to window 0 (by number).
Ctrl+a A Rename the current window.
Ctrl+a S Split current region horizontally into two regions.
Ctrl+a | Split current region vertically into two regions.
Ctrl+a tab Switch the input focus to the next region.
Ctrl+a Ctrl+a Toggle between the current and previous windows.
Ctrl+a Q Close all regions but the current one.
Ctrl+a X Close the current region.

  4. Detach from Linux Screen Session

You can detach from the screen session at any time by typing:

Ctrl+a d

The program running in the screen session will continue to run after you detach from the session.

  5. Reattach to a Linux Screen

To resume your screen session use the following command:

screen -r

If you have multiple screen sessions running on your machine, you will need to append the screen session ID after the -r switch.

To find the session ID list the current running screen sessions with:

screen -ls
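
If you manage sessions from a script, the listing can be parsed to recover the session IDs. The sketch below is only an illustration: it assumes the usual `screen -ls` layout, where each session appears on an indented line as `<pid>.<name>`, and both the sample text and the helper name `parse_screen_sessions` are made up for this example.

```python
import re

# Made-up sample of what `screen -ls` typically prints (assumed layout).
SAMPLE_OUTPUT = """There are screens on:
\t10835.mysession\t(Detached)
\t10366.pts-0.myhost\t(Attached)
2 Sockets in /run/screen/S-user.
"""


def parse_screen_sessions(output):
    """Return (pid, name) pairs for every indented '<pid>.<name>' line."""
    pattern = re.compile(r"^[ \t]+(\d+)\.(\S+)", re.MULTILINE)
    return [(int(pid), name) for pid, name in pattern.findall(output)]


sessions = parse_screen_sessions(SAMPLE_OUTPUT)
for pid, name in sessions:
    print(f"screen -r {pid}  # reattaches to '{name}'")
```

Either the numeric ID or the session name can then be passed to screen -r.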

Setting up the proxy server

Refer to the LiteLLM documentation on calling OpenAI through a proxy server: https://docs.litellm.ai/docs/providers/custom_openai_proxy

Note that the response formats of OpenAI and Llama-3 are different, so we need to translate the Llama-3 response format into the OpenAI format.

OpenAI response format:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "The 2020 World Series was played in Texas at Globe Life Field in Arlington.",
        "role": "assistant"
      },
      "logprobs": null
    }
  ],
  "created": 1677664795,
  "id": "chatcmpl-7QyqpwdfhqwajicIEznoc6Q47XAyW",
  "model": "gpt-3.5-turbo-0613",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 17,
    "prompt_tokens": 57,
    "total_tokens": 74
  }
}

Llama-3 response format:

[
  {
    'generated_text': 'python\nt = int(input())\nfor _ in range(t):\n    p, a, b, c = map(int, input().split())\n    wait_time_a = a - p % a if p % a!= 0 else 0\n    wait_time_b = b - p % b if p % b!= 0 else 0\n    wait_time_c = c - p % c if p % c!= 0 else 0\n    print(min(wait_time_a, wait_time_b, wait_time_c))\n```',
    'model': 'meta-llama/Meta-Llama-3-70B-Instruct',
    'usage': {
      'prompt_tokens': 1391,
      'completion_tokens': 109,
      'total_tokens': 1500
    }
  }
]
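
The translation between the two formats can be sketched as a small helper. This is only an illustration, assuming each Llama-3 result is a dict with the `generated_text`, `model`, and `usage` keys shown above. The function name `to_openai_format`, the generated `id`, and the default `finish_reason` are hypothetical choices, not something LiteLLM or the setup in this guide prescribes.

```python
import time
import uuid


def to_openai_format(llama_response, finish_reason="stop"):
    """Map a Llama-3 style result dict onto the OpenAI chat.completion layout."""
    usage = llama_response.get("usage", {})
    prompt_tokens = usage.get("prompt_tokens", 0)
    completion_tokens = usage.get("completion_tokens", 0)
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",  # hypothetical id scheme
        "object": "chat.completion",
        "created": int(time.time()),
        "model": llama_response.get("model", ""),
        "choices": [
            {
                "index": 0,
                "message": {
                    "role": "assistant",
                    "content": llama_response["generated_text"],
                },
                "logprobs": None,
                "finish_reason": finish_reason,
            }
        ],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": usage.get("total_tokens", prompt_tokens + completion_tokens),
        },
    }


example = {
    "generated_text": "Hello!",
    "model": "meta-llama/Meta-Llama-3-70B-Instruct",
    "usage": {"prompt_tokens": 1391, "completion_tokens": 109, "total_tokens": 1500},
}
converted = to_openai_format(example)
print(converted["choices"][0]["message"]["content"])
```

The Flask app in the next section builds a response of the same shape directly inside the endpoint.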

First, we'll create a Flask application that serves as an API endpoint for text generation using Llama-3-8B.

app.py

from flask import Flask, request, jsonify
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

app = Flask(__name__)

# Load the tokenizer and model
model_name = 'meta-llama/Meta-Llama-3-8B'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
text_generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

@app.route('/')
def index():
    return 'Hello from Flask!'

@app.route('/v1/completions', methods=['POST'])
def completion():
    data = request.get_json()
    print(data)
    # Check if data is valid
    if not data or 'messages' not in data:
        return jsonify({"error": "Invalid request format"}), 400

    messages = data['messages']
    print(messages)
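    # Ensure messages is a non-empty list of dicts that each carry a 'content' field
    if not isinstance(messages, list) or not messages or not all(
            isinstance(m, dict) and 'content' in m for m in messages):
        return jsonify({"error": "Invalid message format"}), 400

    # Use the latest user message as the prompt; fall back to the first message
    user_contents = [m['content'] for m in messages if m.get('role') == 'user']
    prompt = user_contents[-1] if user_contents else messages[0]['content']

    if not prompt:
        return jsonify({"error": "Prompt is required"}), 400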

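    # Generate only the continuation: max_new_tokens bounds the generated tokens
    # (max_length would also count the prompt), and return_full_text=False keeps
    # the prompt out of the returned text
    generated_texts = text_generator(
        prompt, max_new_tokens=50, num_return_sequences=1, return_full_text=False
    )
    generated_text = generated_texts[0]['generated_text']

    # Tokenize once so the usage figures stay consistent
    prompt_tokens = len(tokenizer(prompt)['input_ids'])
    completion_tokens = len(tokenizer(generated_text, add_special_tokens=False)['input_ids'])

    response = {
        "model": model_name,
        "choices": [{
            "message": {"role": "assistant", "content": generated_text},
            "index": 0,
            "logprobs": None,
            "finish_reason": "length"
        }],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens
        }
    }
    return jsonify(response)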

if __name__ == '__main__':
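    # debug=True would start the reloader (loading the model twice) and expose
    # the interactive debugger on all interfaces, so it is turned off here
    app.run(host='0.0.0.0', port=5000, debug=False)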

Explanation

  • Loading the Model and Tokenizer: We start by loading the Meta-Llama-3-8B model and its tokenizer using the transformers library.
  • Setting Up Flask Routes: We define two routes:
      • '/': A simple route that returns a welcome message.
      • '/v1/completions': A POST route that handles text generation requests.
  • Request Validation: We validate the incoming request to ensure it has the correct format.
  • Text Generation: Using the pipeline method, we generate text based on the prompt provided in the request.
  • Response Construction: We construct a JSON response containing the generated text and token usage statistics.
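
The request validation step can also be written as a standalone function, which makes it easy to try out without loading the model. This is a minimal sketch rather than the exact logic of app.py: the name `validate_payload` is hypothetical, and the check is slightly stricter because it inspects every message instead of only the first one.

```python
def validate_payload(data):
    """Return an error string for a malformed payload, or None if it looks valid."""
    if not isinstance(data, dict) or "messages" not in data:
        return "Invalid request format"
    messages = data["messages"]
    if not isinstance(messages, list) or not messages:
        return "Invalid message format"
    for message in messages:
        # Every message must be a dict that carries a 'content' field
        if not isinstance(message, dict) or "content" not in message:
            return "Invalid message format"
    return None


good = {"messages": [{"role": "user", "content": "Once upon a time"}]}
print(validate_payload(good))
print(validate_payload({"prompt": "hi"}))
print(validate_payload({"messages": []}))
```

In the Flask route, a non-None result would be returned to the client together with a 400 status code.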

Calling the API with litellm

Next, we'll create a script to call our Flask API using the litellm library.

Make sure the Flask app is running before executing the Python script with LiteLLM. The Python script will send a request to the Flask app with the prompt, and the Flask app will respond with the generated text, which will then be printed by the script.

call_flask_app.py

import litellm
from litellm import completion
import os

# Set verbose to true if you want more detailed logs
litellm.set_verbose = True
# Placeholder value: replace it with your actual Hugging Face API key
os.environ["HUGGINGFACE_API_KEY"] = "HUGGINGFACE_API_KEY"
response = completion(
  model="huggingface/meta-llama/Meta-Llama-3-8B",
  messages=[{ "content": "You're a useful assistant", "role": "system" },{ "content": "Once upon a time", "role": "user" }],
  api_base="http://0.0.0.0:5000/v1/completions",
  temperature=0.4,
  max_tokens=50
)
print(response)

Explanation

  • Importing Libraries: We import the litellm library and the completion function.
  • Setting API Key: We set the Huggingface API key in the environment variable.
  • Making the API Call: We use the completion function to call our Flask API with the following parameters:
      • model: The model name. Note that we pass huggingface/meta-llama/Meta-Llama-3-8B rather than just meta-llama/Meta-Llama-3-8B, because the huggingface/ prefix identifies the custom LLM provider as Hugging Face.
      • messages: A list of messages that simulates a conversation. The user prompt is "Once upon a time".
      • api_base: The URL of our Flask API, i.e. the server URL that LiteLLM sends the request to.
      • temperature: Controls the randomness of the text generation.
      • max_tokens: The maximum number of tokens to generate.
  • Printing the Response: Finally, we print the response from the API call.
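
To make the role of the huggingface/ prefix concrete, here is a toy sketch of how a provider prefix can be separated from the model name. It is not LiteLLM's actual routing code: the helper `split_provider` and the small `KNOWN_PROVIDERS` set are assumptions made purely for illustration.

```python
# Illustrative subset only; LiteLLM maintains its own, much longer provider list.
KNOWN_PROVIDERS = {"huggingface", "openai", "azure"}


def split_provider(model):
    """Split 'provider/org/name' into (provider, 'org/name').

    The provider is None when the string does not start with a known prefix.
    """
    prefix, separator, rest = model.partition("/")
    if separator and prefix in KNOWN_PROVIDERS:
        return prefix, rest
    return None, model


with_prefix = split_provider("huggingface/meta-llama/Meta-Llama-3-8B")
without_prefix = split_provider("meta-llama/Meta-Llama-3-8B")
print(with_prefix)
print(without_prefix)
```

Without the prefix, the string alone does not say which provider should handle the request, which is why the full huggingface/... form is passed as the model.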

Important notes:

  • By setting api_base to your own application's URL, you can route LiteLLM calls for any model through your server. This saves compute resources and also provides modularity.

  • The parameters that LiteLLM supports for different models are listed here: https://litellm.vercel.app/docs/completion/input

  • When switching from OpenAI to Hugging Face models, pass only the parameters that both backends support, so that the request is accepted and the response comes back in the same format.

  • For example, since we had set the frequency_penalty and presence_penalty parameters for OpenAI, we need to remove them when running the code with Hugging Face models.

  • ChatAtomicFlow can be called through LiteLLM, since it is just a wrapper over the GPT-3.5 call.
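
The note above about keeping only the common parameters can be sketched as a simple filter. The `COMMON_PARAMS` set below is a hypothetical example, not the authoritative list (the LiteLLM page linked above has the supported parameters per provider), and `keep_common_params` is a name made up for this illustration.

```python
# Hypothetical set of parameters assumed to be shared by both backends.
COMMON_PARAMS = {"temperature", "max_tokens", "top_p", "stream", "n"}


def keep_common_params(params, supported=COMMON_PARAMS):
    """Return a copy of params that contains only the supported keys."""
    return {key: value for key, value in params.items() if key in supported}


openai_style = {
    "temperature": 0.4,
    "max_tokens": 50,
    "frequency_penalty": 0.0,
    "presence_penalty": 0.0,
}
filtered = keep_common_params(openai_style)
print(filtered)
```

The original dictionary is left untouched, so the same configuration can still be used for the OpenAI backend.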

Start the Flask App:

python app.py

This will start the Flask server on http://0.0.0.0:5000.

Run the Client Script:

python call_flask_app.py

This will make a request to the Flask server and print the generated response.

Running Flows with Huggingface models

Currently, Aiflows supports only the OpenAI and Azure backends. Support for the Hugging Face backend can be enabled by adding or uncommenting the lines below. Currently I am providing the example for ChatAtom

run.py

import litellm
litellm.set_verbose=True
litellm.drop_params=True

Add these lines so that LiteLLM automatically drops parameters the target model does not support, which keeps OpenAI-style calls compatible with the Hugging Face backend.

run.py

    #Huggingface backend
    api_information = [ApiInfo(backend_used="huggingface", api_key = os.getenv("HUGGINGFACE_API_KEY"), api_base="http://0.0.0.0:5000/v1/completions")]

Set the api_base URL here when configuring the huggingface backend.

demo.yaml

huggingface: "huggingface/meta-llama/Meta-Llama-3-70B-Instruct"

Add the huggingface model you want to use here.

Now, as in the previous steps, run the Flask app and add its API base URL to the run.py file of the flows you want to run. This way, you can run the flows on your proxy server! See the run.py of ChatFlowmodule for further details.

Additional Material
