To run flows, you need compute capacity that meets or exceeds the requirements of the models you want to run. Once you have the compute resources, follow the steps below.
Install Linux Screen on Ubuntu and Debian
sudo apt update
sudo apt install screen
- SSH into your cluster, or work on your local setup if it has enough storage resources.
- Named sessions are useful when you run multiple screen sessions. To create a named session, run the screen command with the following arguments:
screen -S mysession
After completing step 2, you can run the flows by following the procedure below, and then work through the remaining steps.
- Working with Linux Screen Windows
When you start a new screen session, it creates a single window with a shell in it.
You can have multiple windows inside a Screen session.
To create a new window with a shell, type Ctrl+a c. The first available number from the range 0...9 will be assigned to it.
Below are some of the most common commands for managing Linux Screen windows:
Ctrl+a c Create a new window (with shell).
Ctrl+a " List all windows.
Ctrl+a 0 Switch to window 0 (by number).
Ctrl+a A Rename the current window.
Ctrl+a S Split current region horizontally into two regions.
Ctrl+a | Split current region vertically into two regions.
Ctrl+a Tab Switch the input focus to the next region.
Ctrl+a Ctrl+a Toggle between the current and previous windows.
Ctrl+a Q Close all regions but the current one.
Ctrl+a X Close the current region.
- Detach from Linux Screen Session
You can detach from the screen session at any time by typing:
Ctrl+a d
The program running in the screen session will continue to run after you detach from the session.
- Reattach to a Linux Screen
To resume your screen session use the following command:
screen -r
If you have multiple screen sessions running on your machine, append the screen session ID after the -r switch.
To find the session ID, list the currently running screen sessions with:
screen -ls
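When several sessions are running, it can help to pick out the session IDs programmatically. The following is a minimal sketch that parses text in the shape `screen -ls` typically prints. The sample output and the helper name `parse_screen_sessions` are assumptions for illustration; real output varies with the screen version and locale.

```python
import re

# Assumed sample of `screen -ls` output; the real text may differ on your system.
SAMPLE_OUTPUT = """There are screens on:
\t12345.mysession\t(Detached)
\t12399.flask-server\t(Attached)
2 Sockets in /run/screen/S-user.
"""

# A session line is indented and looks like "<pid>.<name> ... (Attached|Detached)".
SESSION_LINE = re.compile(r"^\s+(\d+)\.(\S+)\s+.*\((Attached|Detached)\)\s*$")


def parse_screen_sessions(output):
    """Return a list of dicts with the id, name, and state of each session."""
    sessions = []
    for line in output.splitlines():
        match = SESSION_LINE.match(line)
        if match:
            pid, name, state = match.groups()
            sessions.append({"id": f"{pid}.{name}", "name": name, "state": state})
    return sessions


for session in parse_screen_sessions(SAMPLE_OUTPUT):
    print(session["id"], session["state"])
```

The `id` value is what you would pass to `screen -r` to reattach to that specific session.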
Refer to the documentation on making a call to OpenAI through a proxy server: https://docs.litellm.ai/docs/providers/custom_openai_proxy
Note that the OpenAI and Llama-3 response formats are different, so we need to translate the Llama-3 response format into the OpenAI format.
OpenAI response format:
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "The 2020 World Series was played in Texas at Globe Life Field in Arlington.",
        "role": "assistant"
      },
      "logprobs": null
    }
  ],
  "created": 1677664795,
  "id": "chatcmpl-7QyqpwdfhqwajicIEznoc6Q47XAyW",
  "model": "gpt-3.5-turbo-0613",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 17,
    "prompt_tokens": 57,
    "total_tokens": 74
  }
}
Llama-3 response format:
{'generated_text': 'python\nt = int(input())\nfor _ in range(t):\n p, a, b, c = map(int, input().split())\n wait_time_a = a - p % a if p % a!= 0 else 0\n wait_time_b = b - p % b if p % b!= 0 else 0\n wait_time_c = c - p % c if p % c!= 0 else 0\n print(min(wait_time_a, wait_time_b, wait_time_c))\n```',
'model': 'meta-llama/Meta-Llama-3-70B-Instruct',
'usage':
{'prompt_tokens': 1391,
'completion_tokens': 109,
'total_tokens': 1500}
}
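The translation between the two formats can be sketched as a small function. This is only an illustration under stated assumptions: `llama_to_openai` is a hypothetical helper name, the `id` and `created` fields are generated locally, and `finish_reason` is assumed to be "stop" because the Llama-3 response shown here does not carry one.

```python
import time
import uuid


def llama_to_openai(llama_response):
    """Map a Llama-3 style response dict onto the OpenAI chat completion layout."""
    usage = llama_response.get("usage", {})
    prompt_tokens = usage.get("prompt_tokens", 0)
    completion_tokens = usage.get("completion_tokens", 0)
    return {
        # Locally generated identifier and timestamp (assumption, not from the model).
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": llama_response.get("model", "unknown"),
        "choices": [
            {
                "index": 0,
                "message": {
                    "role": "assistant",
                    "content": llama_response["generated_text"],
                },
                "logprobs": None,
                # Assumed value: the Llama-3 response above has no finish reason.
                "finish_reason": "stop",
            }
        ],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": usage.get("total_tokens", prompt_tokens + completion_tokens),
        },
    }


example = {
    "generated_text": "Hello there",
    "model": "meta-llama/Meta-Llama-3-70B-Instruct",
    "usage": {"prompt_tokens": 3, "completion_tokens": 2, "total_tokens": 5},
}
print(llama_to_openai(example)["choices"][0]["message"]["content"])
```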
First, we'll create a Flask application that serves as an API endpoint for text generation using Llama-3-8B.
app.py
from flask import Flask, request, jsonify
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

app = Flask(__name__)

# Load the tokenizer and model
model_name = 'meta-llama/Meta-Llama-3-8B'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
text_generator = pipeline('text-generation', model=model, tokenizer=tokenizer)


@app.route('/')
def index():
    return 'Hello from Flask!'


@app.route('/v1/completions', methods=['POST'])
def completion():
    data = request.get_json()
    print(data)

    # Check if data is valid
    if not data or 'messages' not in data:
        return jsonify({"error": "Invalid request format"}), 400

    messages = data['messages']
    print(messages)

    # Ensure the prompt is retrieved correctly
    if not isinstance(messages, list) or not messages or 'content' not in messages[0]:
        return jsonify({"error": "Invalid message format"}), 400

    prompt = messages[0]['content']
    if not prompt:
        return jsonify({"error": "Prompt is required"}), 400

    # Generate text based on the prompt
    generated_texts = text_generator(prompt, max_length=50, num_return_sequences=1)
    generated_text = generated_texts[0]['generated_text']

    response = {
        "model": model_name,
        "choices": [{
            "message": {"content": generated_text},
            "index": 0,
            "logprobs": None,
            "finish_reason": "length"
        }],
        "usage": {
            "prompt_tokens": len(tokenizer(prompt)['input_ids']),
            "completion_tokens": len(tokenizer(generated_text)['input_ids']),
            "total_tokens": len(tokenizer(prompt)['input_ids']) + len(tokenizer(generated_text)['input_ids'])
        }
    }
    return jsonify(response)


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000, debug=True)
Explanation
- Loading the Model and Tokenizer: We start by loading the Meta-Llama-3-8B model and its tokenizer using the transformers library.
- Setting Up Flask Routes: We define two routes: '/', a simple route that returns a welcome message, and '/v1/completions', a POST route that handles text generation requests.
- Request Validation: We validate the incoming request to ensure it has the correct format.
- Text Generation: Using the pipeline method, we generate text based on the prompt provided in the request.
- Response Construction: We construct a JSON response containing the generated text and token usage statistics.
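The request validation step can be isolated into a plain function that is easy to test without starting Flask. Note that app.py reads `messages[0]`, which is the system message whenever one is sent first. The sketch below is a variant, not what app.py does: it assumes you want the most recent message with the "user" role. The name `extract_prompt` is hypothetical, and the error strings are copied from app.py.

```python
def extract_prompt(data, role="user"):
    """Return (prompt, error); exactly one of the two is None."""
    if not isinstance(data, dict) or "messages" not in data:
        return None, "Invalid request format"

    messages = data["messages"]
    if not isinstance(messages, list) or not messages:
        return None, "Invalid message format"

    # Walk backwards so the latest message with the requested role wins.
    for message in reversed(messages):
        if (
            isinstance(message, dict)
            and message.get("role") == role
            and message.get("content")
        ):
            return message["content"], None

    return None, "Prompt is required"


request_body = {
    "messages": [
        {"role": "system", "content": "You're a useful assistant"},
        {"role": "user", "content": "Once upon a time"},
    ]
}
print(extract_prompt(request_body))
```

Inside the route, the error half of the tuple could be returned with a 400 status in the same way app.py already does.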
Next, we'll create a script to call our Flask API using the litellm library.
Make sure the Flask app is running before executing the Python script with LiteLLM. The Python script will send a request to the Flask app with the prompt, and the Flask app will respond with the generated text, which will then be printed by the script.
call_flask_app.py
import litellm
from litellm import completion
import os

# Set verbose to true if you want more detailed logs
litellm.set_verbose = True

os.environ["HUGGINGFACE_API_KEY"] = "HUGGINGFACE_API_KEY"

response = completion(
    model="huggingface/meta-llama/Meta-Llama-3-8B",
    messages=[
        {"content": "You're a useful assistant", "role": "system"},
        {"content": "Once upon a time", "role": "user"},
    ],
    api_base="http://0.0.0.0:5000/v1/completions",
    temperature=0.4,
    max_tokens=50
)
print(response)
Explanation
- Importing Libraries: We import the litellm library and the completion function.
- Setting API Key: We set the Huggingface API key in the environment variable.
- Making the API Call: We use the completion function to call our Flask API with the following parameters:
  - model: The model name used. Note that we provide huggingface/meta-llama/Meta-Llama-3-8B instead of just meta-llama/Meta-Llama-3-8B because the prefix is used to identify the custom LLM provider as huggingface.
  - messages: A list of messages to simulate a conversation. The user prompt is "Once upon a time".
  - api_base: The base URL of our Flask API. This is the server URL that litellm sends the request to.
  - temperature: A parameter to control the randomness of the text generation.
  - max_tokens: The maximum number of tokens to generate.
- Printing the Response: Finally, we print the response from the API call.
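To see why the `huggingface/` prefix matters, consider how a provider could be read from a model string. The sketch below is an illustration only and is not LiteLLM's actual routing code: `split_provider` and the `KNOWN_PROVIDERS` set are hypothetical, and the set lists just the backends mentioned in this document.

```python
# Hypothetical subset of provider prefixes, for illustration only.
KNOWN_PROVIDERS = {"huggingface", "openai", "azure"}


def split_provider(model):
    """Return (provider, model_name); provider is None if no known prefix is present."""
    prefix, separator, rest = model.partition("/")
    if separator and prefix in KNOWN_PROVIDERS:
        return prefix, rest
    # No recognised prefix: the whole string is treated as the model name.
    return None, model


print(split_provider("huggingface/meta-llama/Meta-Llama-3-8B"))
print(split_provider("meta-llama/Meta-Llama-3-8B"))
```

Without the prefix, the first path segment is the organisation name (`meta-llama`), which says nothing about which backend should handle the call.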
Important Note:
- By setting api_base to your own application's base URL, you can make a litellm call to any model served there. This saves your compute resources and also provides modularity.
- The parameters that litellm supports for different models are listed here: https://litellm.vercel.app/docs/completion/input
- When converting from the OpenAI input format to the Hugging Face format, pass only the parameters common to both so that you get the same response format.
- Since we had set penalty parameters such as presence_penalty for OpenAI, we need to remove them when executing the code with Hugging Face models.
- ChatAtomicFlow can be called using litellm since it is just a wrapper over the GPT-3.5 call.
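Keeping only the common parameters can be sketched as a simple filter. This is a minimal illustration, not LiteLLM's internal logic: `keep_supported` is a hypothetical helper, and the `SUPPORTED_PARAMS` set is an assumed example that you would replace with the parameters listed for your backend on the page linked above.

```python
# Assumed example set; check the LiteLLM input documentation for the real list.
SUPPORTED_PARAMS = {"temperature", "max_tokens", "top_p", "stop"}


def keep_supported(params, supported=SUPPORTED_PARAMS):
    """Split params into those the backend accepts and the names that were dropped."""
    kept = {name: value for name, value in params.items() if name in supported}
    dropped = sorted(set(params) - set(kept))
    return kept, dropped


openai_style = {"temperature": 0.4, "max_tokens": 50, "presence_penalty": 0.0}
kept, dropped = keep_supported(openai_style)
print(kept)
print(dropped)
```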
Start the Flask App:
python app.py
This will start the Flask server on http://0.0.0.0:5000.
Run the Client Script:
python call_flask_app.py
This will make a request to the Flask server and print the generated response.
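Because the client only works once the Flask app is up, a small readiness check can be run before sending requests. The following is a sketch using only the standard library. The helper name `wait_for_server` and its default timings are assumptions; it polls the '/' route defined in app.py.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url, timeout=30.0, interval=0.5, request_timeout=2.0):
    """Poll url until it answers or timeout seconds pass; return True if reachable."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=request_timeout):
                return True
        except urllib.error.HTTPError:
            # The server answered, even if with an error status, so it is up.
            return True
        except (urllib.error.URLError, OSError):
            # Connection refused or timed out: the server is not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Example usage (assumes the Flask app from app.py listens on port 5000):
# if wait_for_server("http://127.0.0.1:5000/"):
#     print("Flask app is ready")
```

Loading the model can take a while, so a generous `timeout` is usually appropriate on first start.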
Currently, aiflows supports only the OpenAI and Azure backends. To enable the Hugging Face backend, add or uncomment the lines below. For now, I am providing the example for ChatAtom
run.py
import litellm
litellm.set_verbose = True
litellm.drop_params = True
Add these lines to drop the additional parameters and keep the request compatible with the OpenAI format. litellm automatically drops such parameters once the lines above are added.
run.py
# Huggingface backend
api_information = [ApiInfo(backend_used="huggingface", api_key=os.getenv("HUGGINGFACE_API_KEY"), api_base="http://0.0.0.0:5000/v1/completions")]
Here, set the api_base URL when configuring the huggingface backend.
demo.yaml
huggingface: "huggingface/meta-llama/Meta-Llama-3-70B-Instruct"
Add the Hugging Face model you want to use here.
Now, as in the previous steps, run the Flask app and add its API base to the flows you want to run in the run.py file. This way, you can run the flows on your proxy server! Have a look at the run.py of ChatFlowModule for further details.
- Tutorial on deploying huggingface models on RunAI: https://www.run.ai/blog/how-to-deploy-hugging-face-models-with-run-ai
- To use RunAI as your compute server, have a look at the detailed tutorial here: https://gist.github.com/Neel-Shah-29/57666e6c735a115d3eaf349071596bce