Production Mindset for Generative AI applications from the get-go in your local environment

Using BentoML as an Inference Platform for Machine Learning Models

In this gist we are going to use BentoML to serve two machine learning models locally as services for development testing.

Prerequisites

Overview

We are going to cover:

  1. Setting up a local environment for BentoML.
  2. Downloading models so that they can later be packaged into container images using the Bento Model Management API.
  3. Building a PromptEnhancer service for basic prompt engineering for Stable Diffusion.
  4. Building an AvatarGenerator service that generates a Stable Diffusion avatar of myself.

You are free to use any Stable Diffusion model you like. I am just using my own so as not to violate any copyrights.

So let's get started

Step 1: Set up a local environment for BentoML

# Let's create a new Conda environment so that we can manage our dependencies for our project separately
conda create --name bento-ml python=3.11
# Activate the env 
conda activate bento-ml

# Create 2 directories for our services that we can later convert to serverless functions
# Replace target-directory with $HOME or ~, or whatever you see fit
cd <target-directory> && mkdir -p avatar-generator prompt-enhancer

# Add requirements.txt for prompt-enhancer
cat << EOF > prompt-enhancer/requirements.txt
accelerate
bentoml>=1.2.2
torch
transformers
triton; sys_platform == "linux"
xformers
EOF

# Add requirements.txt for avatar-generator
cat << EOF > avatar-generator/requirements.txt
accelerate
bentoml>=1.2.2
diffusers
pillow
torch
transformers
triton; sys_platform == "linux"
xformers
EOF

Ideally you should have a separate Conda environment for each requirements.txt and service. But as you can see from the lists above, the two services share the same set of dependencies, with the only addition being the diffusers library. So in the interest of saving space on my system I am just going to install the requirements.txt from the avatar-generator folder, which works for both services.

pip install -r avatar-generator/requirements.txt
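
As an optional sanity check, you can confirm that the shared environment resolves the imports both services need:

# Optional: verify the key libraries import cleanly in the bento-ml environment
python -c "import bentoml, torch, transformers, diffusers; print('bentoml', bentoml.__version__)"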

Step 2: Download models for use by Bento Model Management API

Below is a sample script that shows how to import models into BentoML's local model store. This is particularly helpful because you get version control and separate tagging for each model, and you can pre-package the models into your containers when you build them.

from transformers import pipeline
from diffusers import DiffusionPipeline, AutoencoderKL  # AutoencoderKL is used by the commented-out VAE example below
import bentoml


# pipeline = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

# with bentoml.models.create(
#     name='Meta-Llama-3-8B-Instruct', # Name of the model in the Model Store
# ) as model_ref:
#     pipeline.save_pretrained(model_ref.path)
#     print(f"Model saved: {model_ref}")


# model_id = "alphaduriendur/alphaduriendur-01"
# pipeline = DiffusionPipeline.from_pretrained(model_id)

# with bentoml.models.create(
#     name='alphaduriendur-avatar-generator',
# ) as model_ref:
#     pipeline.save_pretrained(model_ref.path)
#     print(f"Model saved: {model_ref}"
# )

# model_id = "stabilityai/sd-vae-ft-mse"
# vae = AutoencoderKL.from_pretrained(model_id, use_safetensors=True)

# with bentoml.models.create(
#     name='stabilityai-sd-vae-ft-mse',
# ) as model_ref:
#     vae.save_pretrained(model_ref.path)
#     print(f"Model saved: {model_ref}"
# )

model_id = "Gustavosta/MagicPrompt-Stable-Diffusion"
pipeline2 = pipeline("text-generation", model=model_id)
with bentoml.models.create(
    name="gustavosta-magicPrompt-stable-diffusion",
) as model_ref:
    pipeline2.save_pretrained(model_ref.path)
    print(f"Model saved: {model_ref}")

Kindly run the commented-out blocks separately, or update the variable names if you wish to download multiple models in one go.

As you can see from the above script, BentoML allows you to import a variety of models into your local model store. This is particularly helpful since you can then manage all your ML models using BentoML's model management API when deploying to the cloud, or use the store locally to create pre-packaged containers with version-controlled models ready inside your application code. For more advanced use cases, another BentoML product called Yatai lets you build a model store in the cloud and orchestrate GitOps-style CI/CD pipelines as well.

Once downloaded check your local models:

bentoml models list

Local Model store
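
If you prefer to inspect the store programmatically, below is a minimal sketch that uses the same bentoml.models API the import script above relies on; the :latest suffix resolves to the most recently created version:

import bentoml

# Resolve the most recent version of the stored model
model_ref = bentoml.models.get("gustavosta-magicprompt-stable-diffusion:latest")
print(model_ref.tag)   # full tag, including the auto-generated version suffix
print(model_ref.path)  # on-disk directory holding the saved pipeline files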

Step 3: Build PromptEnhancer Service

We are going to write a simple PromptEnhancer service. I am not going to go into much detail about the code implementation; please refer to the HuggingFace Diffusers docs on prompt techniques for that. The only thing I have done is lift that code and convert it into a BentoML service. The code is essentially the same, with some minor tweaks to the words and their pairs and a few additional prompts for my own use case.

touch prompt-enhancer/service.py
import bentoml
import typing as t
from transformers import LogitsProcessor, LogitsProcessorList
import gc

@bentoml.service(
    resources={
        "gpu": 1
    },
    traffic={
        "timeout": 30,
        "max_concurrency": 100,
    },
    workers=1,
)
class PromptEnhancer():
    model_prompt_enhancer = bentoml.models.get("gustavosta-magicprompt-stable-diffusion:tbsmfosgy2xo2zdo")
    styles = {
        "cinematic": "cinematic film still of {prompt}, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain",
        "anime": "anime artwork of {prompt}, anime style, key visual, vibrant, studio anime, highly detailed, unreal engine, glorious, hyperrealistic, inspired by {artist}",
        "photographic": "cinematic photo of {prompt}, 35mm photograph, film, professional, 4k, highly detailed, Leica",
        "comic": "comic panel of {prompt}, graphic illustration, comic art, graphic novel art, vibrant, hyperrealistic, ureal, UHD, inspired by {artist}",
        "lineart": "line art drawing {prompt}, professional, sleek, modern, minimalist, graphic, line art, vector graphics",
        "pixelart": " pixel-art {prompt}, low-res, blocky, pixel art style, 8-bit graphics",
    }

    words = [
        "aesthetic", "astonishing", "beautiful", "breathtaking", "composition", "contrasted", "epic", "moody", "enhanced",
        "exceptional", "fascinating", "flawless", "glamorous", "glorious", "illumination", "impressive", "improved",
        "inspirational", "magnificent", "majestic", "hyperrealistic", "smooth", "sharp", "focus", "stunning", "detailed",
        "intricate", "dramatic", "high", "quality", "perfect", "light", "ultra", "highly", "radiant", "satisfying",
        "soothing", "sophisticated", "stylish", "sublime", "terrific", "touching", "timeless", "wonderful", "unbelievable",
        "elegant", "awesome", "amazing", "dynamic", "trendy", "unreal", "engine", "UHD", "HD"
    ]

    word_pairs = ["highly detailed", "high quality", "enhanced quality", "perfect composition", "dynamic light", "unreal engine", "UHD unreal", "hyperrealistic UDH", "majestic glorious"]

    def __init__(self) -> None:
        import torch
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        

    def _find_and_order_pairs(self, s, pairs):
        words = s.split()
        found_pairs = []
        for pair in pairs:
            pair_words = pair.split()
            if pair_words[0] in words and pair_words[1] in words:
                found_pairs.append(pair)
                words.remove(pair_words[0])
                words.remove(pair_words[1])

        for word in words[:]:
            for pair in pairs:
                if word in pair.split():
                    words.remove(word)
                    break
        ordered_pairs = ", ".join(found_pairs)
        remaining_s = ", ".join(words)
        return ordered_pairs, remaining_s
    
    
    @bentoml.api
    def enhance_prompt(
        self, 
        prompt: str, 
        style: t.Literal["cinematic", "anime", "photographic", "comic", "lineart", "pixelart"], 
        artist: t.Optional[str] = None
    ) -> str:
        import torch 
        from transformers import GPT2Tokenizer, GPT2LMHeadModel, GenerationConfig
        
        try:
            tokenizer = GPT2Tokenizer.from_pretrained(self.model_prompt_enhancer.path)
            word_ids = [tokenizer.encode(word, add_prefix_space=True)[0] for word in self.words]
            bias = torch.full((tokenizer.vocab_size,), -float("Inf")).to(self.device)
            bias[word_ids] = 0
            processor = CustomLogitsProcessor(bias)
            processor_list = LogitsProcessorList([processor])
            prompt = self.styles[style].format(prompt=prompt, artist=artist)
            print("Prompt: ", prompt)
            model = GPT2LMHeadModel.from_pretrained(self.model_prompt_enhancer.path, torch_dtype=torch.float16).to(self.device)
            model.eval()
            inputs = tokenizer(prompt, return_tensors="pt").to(self.device)
            token_count = inputs["input_ids"].shape[1]
            max_new_tokens = 50 - token_count

            generation_config = GenerationConfig(
                penalty_alpha=0.7,
                top_k=50,
                eos_token_id=model.config.eos_token_id,
                pad_token_id=model.config.eos_token_id,
                pad_token=model.config.pad_token_id,
                do_sample=True,
            )

            with torch.no_grad():
                generated_ids = model.generate(
                    input_ids=inputs["input_ids"],
                    attention_mask=inputs["attention_mask"],
                    max_new_tokens=max_new_tokens,
                    generation_config=generation_config,
                    logits_processor=processor_list,
                )
            output_tokens = [tokenizer.decode(generated_id, skip_special_tokens=True) for generated_id in generated_ids]
            input_part, generated_part = output_tokens[0][: len(prompt)], output_tokens[0][len(prompt) :]
            pairs, words = self._find_and_order_pairs(generated_part, self.word_pairs)
            formatted_generated_part = pairs + ", " + words
            enhanced_prompt = input_part + ", " + formatted_generated_part
        finally:
            gc.collect()
            torch.cuda.empty_cache()
            torch.cuda.ipc_collect()
        return enhanced_prompt


class CustomLogitsProcessor(LogitsProcessor):
    def __init__(self, bias):
        super().__init__()
        self.bias = bias

    def __call__(self, input_ids, scores):
        if len(input_ids.shape) == 2:
            last_token_id = input_ids[0, -1]
            self.bias[last_token_id] = -1e10
        return scores + self.bias

Define a Bento:

cat << EOF > prompt-enhancer/bentofile.yaml
service: 'service:PromptEnhancer'
labels:
  owner: alphaduriendur
  project: portfolio
include:
  - '*.py'
python:
  requirements_txt: 'requirements.txt'
models:
  - tag: "gustavosta-magicprompt-stable-diffusion:tbsmfosgy2xo2zdo"
EOF

Now run the service locally:

cd prompt-enhancer
bentoml serve service:PromptEnhancer

(Screenshots: serving the model locally)
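
Besides the Swagger UI that bentoml serve exposes, you can hit the endpoint directly. Below is a minimal example, assuming the service is listening on the default port 3000 and the API method is named enhance_prompt (the prompt text is just an illustration):

# Call the prompt enhancer over HTTP; the JSON body maps to the API method's parameters
curl -X POST http://localhost:3000/enhance_prompt \
    -H 'Content-Type: application/json' \
    -d '{"prompt": "a lighthouse on a cliff at dusk", "style": "cinematic"}'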

With that you have a simple prompt engineering service that can create beautiful prompts for your image generation pipelines. We have:

  1. Created a simple Bento API that serves the custom prompt engineering logic we built.
  2. Used BentoML's local Model Store for model management within the service.

On to our next step

Step 4: Build AvatarGenerator Service

In this step I am going to use the same pattern as above to write a separate endpoint that uses HuggingFace's Diffusers library to run inference on a trained Stable Diffusion model. It is a custom model that I trained myself using DreamBooth. I am not going to cover Stable Diffusion training or DreamBooth in this gist; instead I am going to focus purely on serving Stable Diffusion models using BentoML. You can choose any model compatible with the Diffusers library as you see fit.

touch avatar-generator/service.py
import bentoml
from PIL.Image import Image

from typing_extensions import Annotated
from annotated_types import Le, Ge
import gc

DEFAULT_HEIGHT = 640
DEFAULT_WIDTH = 640
DEFAULT_PROMPT = "close-up photography of alphaduriendur standing in the rain at night, in a street lit by lamps, leica 35mm summilux"
NEGATIVE_PROMPT = "bad anatomy, deformed, ugly, disfigured, fat cheeks, low quality, bad quality, duplicates, markings on forehead, bad posture, bad lighting, poorly drawn hands, poorly drawn legs, poorly drawn eyes, bad composition, bad lighting, bad shading, bad perspective, bad proportions, duplicate subjects, multiple faces, distorted faces"

@bentoml.service(
    resources={
        "gpu": 1
    },
    traffic={
        "timeout": 30,
        "max_concurrency": 3,
    },
    workers=1,
)
class AvatarGenerator():
    # Define the model as a class variable
    model_ref = bentoml.models.get("alphaduriendur-avatar-generator:eeljtosf66gbkgga")
    model_encode_ref = bentoml.models.get("stabilityai-sd-vae-ft-mse:gn4hntsgcsva2gga")
    

    def __init__(self) -> None:
        import torch
        import diffusers

        # Load model into pipeline
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.txt2img_pipe = diffusers.AutoPipelineForText2Image.from_pretrained(
            self.model_ref.path,
            torch_dtype=torch.float16,
            use_safetensors=True,
            safety_checker = None,
            requires_safety_checker = False
        )
        vae = diffusers.AutoencoderKL.from_pretrained(self.model_encode_ref.path, torch_dtype=torch.float16).to(self.device)
        self.txt2img_pipe.vae = vae
        self.txt2img_pipe.enable_attention_slicing()
        self.txt2img_pipe.enable_freeu(s1=0.9, s2=0.2, b1=1.5, b2=1.6)
        # self.txt2img_pipe.scheduler = diffusers.DPMSolverMultistepScheduler.from_config(self.txt2img_pipe.scheduler.config)
        self.txt2img_pipe.scheduler = diffusers.EulerAncestralDiscreteScheduler.from_config(self.txt2img_pipe.scheduler.config)
        self.txt2img_pipe.to(self.device, dtype=torch.float16)
        # self.txt2img_pipe.load_lora_weights("nerijs/pixel-art-xl", weight_name="pixel-art-xl.safetensors", adapter_name="pixel")
        # self.txt2img_pipe.set_adapters("pixel")

    @bentoml.api
    def generate_avatar(
            self,
            prompt: str = DEFAULT_PROMPT,
            # negative_prompt: t.Optional[str] = NEGATIVE_PROMPT,
            height: int = DEFAULT_HEIGHT,
            width: int = DEFAULT_WIDTH,
            # seed: int = DEFAULT_SEED,
            num_inference_steps: Annotated[int, Ge(1), Le(50)] = 20,
            guidance_scale: Annotated[float, Ge(0.0), Le(20.)] = 10.0,
    ) -> Image:
        import torch

        try:
            # generator = torch.Generator("cuda").manual_seed(seed)
            res = self.txt2img_pipe(
                prompt=prompt,
                negative_prompt=NEGATIVE_PROMPT,
                height=height,
                width=width,
                num_inference_steps=num_inference_steps,
                guidance_scale=guidance_scale,
                # generator=generator
            )
            image = res[0][0]
        finally:
            gc.collect()
            torch.cuda.empty_cache()
            torch.cuda.ipc_collect()
            self.txt2img_pipe.disable_freeu()
        return image

As above, we create a bentofile.yaml:

cat << EOF > avatar-generator/bentofile.yaml
service: 'service:AvatarGenerator'
labels:
  owner: alphaduriendur
  project: portfolio
include:
  - '*.py'
python:
  requirements_txt: 'requirements.txt'
models:
  - tag: "alphaduriendur-avatar-generator:eeljtosf66gbkgga" # A dictionary
  - tag: "stabilityai-sd-vae-ft-mse:gn4hntsgcsva2gga"
EOF

Now serve the model:

cd avatar-generator
bentoml serve service:AvatarGenerator

(Screenshots: local model serving endpoint running inference)
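
As with the prompt enhancer, you can call this endpoint directly once it is up. A minimal example, again assuming the default port 3000; since the response body is the generated image, we write it to a file:

# Generate an avatar and save the returned image to disk
curl -X POST http://localhost:3000/generate_avatar \
    -H 'Content-Type: application/json' \
    -d '{"prompt": "close-up photography of alphaduriendur standing in the rain at night", "num_inference_steps": 25}' \
    -o avatar.png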

With the above steps we have deployed two separate models as local API services.
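
And since the whole point is a production mindset, both services can be packaged the same way once local testing looks good. A minimal sketch; bentoml build prints the resulting Bento tag, so substitute it in the containerize step:

# Build a Bento from the bentofile.yaml in the service directory
cd prompt-enhancer && bentoml build

# Package the built Bento (application code, dependencies, and version-controlled models) into a container image
bentoml containerize <bento-tag-printed-by-build>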
