Production Mindset for Generative AI applications from the get-go in your local environment

Using BentoML as an Inference Platform for Machine Learning Models

In this gist we are going to use BentoML to serve two machine learning models locally as services for development testing.

Prerequisites

Overview

We are going to cover:

  1. Setting up a local environment for BentoML.
  2. Downloading models so that they can later be packaged into container images using the Bento Model Management API.
  3. Building a PromptEnhancer service for basic prompt engineering for Stable Diffusion.
  4. Building an AvatarGenerator service that generates a Stable Diffusion avatar of myself.

You are free to use any Stable Diffusion model you like. I am just using my own so as not to violate any copyrights.

So let's get started

Step 1: Set up a local environment for BentoML

# Let's create a new Conda environment so that we can manage our dependencies for our project separately
conda create --name bento-ml python=3.11
# Activate the env 
conda activate bento-ml

# Create 2 directories for our services that we can later convert to serverless functions
# Replace target-directory with $HOME or ~, or whatever you see fit
cd <target-directory> && mkdir -p avatar-generator prompt-enhancer

# Add requirements.txt for prompt-enhancer
cat << EOF > prompt-enhancer/requirements.txt
accelerate
bentoml>=1.2.2
torch
transformers
triton; sys_platform == "linux"
xformers
EOF

# Add requirements.txt for avatar-generator
cat << EOF > avatar-generator/requirements.txt
accelerate
bentoml>=1.2.2
diffusers
pillow
torch
transformers
triton; sys_platform == "linux"
xformers
EOF

Ideally you should have a separate Conda environment for each requirements.txt and service. But as you can see from the lists above, the two services share the same set of dependencies, with the only addition being the diffusers library. So in the interest of saving space on my system I am just going to install the requirements.txt from the avatar-generator folder, which works for both services.

pip install -r avatar-generator/requirements.txt
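
As an optional sanity check, you can confirm that the shared environment resolves the imports both services need:

# Optional: verify the key libraries import cleanly in the bento-ml environment
python -c "import bentoml, torch, transformers, diffusers; print('bentoml', bentoml.__version__)"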

Step 2: Download models for use by Bento Model Management API

Below is a sample script that shows how to import models into BentoML's local model store. This is particularly helpful because you get version control and separate tagging for each model, and you can pre-package the models into your containers when you build them.

from transformers import pipeline
from diffusers import DiffusionPipeline, AutoencoderKL  # AutoencoderKL is used by the commented-out VAE example below
import bentoml


# pipeline = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

# with bentoml.models.create(
#     name='Meta-Llama-3-8B-Instruct', # Name of the model in the Model Store
# ) as model_ref:
#     pipeline.save_pretrained(model_ref.path)
#     print(f"Model saved: {model_ref}")


# model_id = "alphaduriendur/alphaduriendur-01"
# pipeline = DiffusionPipeline.from_pretrained(model_id)

# with bentoml.models.create(
#     name='alphaduriendur-avatar-generator',
# ) as model_ref:
#     pipeline.save_pretrained(model_ref.path)
#     print(f"Model saved: {model_ref}"
# )

# model_id = "stabilityai/sd-vae-ft-mse"
# vae = AutoencoderKL.from_pretrained(model_id, use_safetensors=True)

# with bentoml.models.create(
#     name='stabilityai-sd-vae-ft-mse',
# ) as model_ref:
#     vae.save_pretrained(model_ref.path)
#     print(f"Model saved: {model_ref}"
# )

model_id = "Gustavosta/MagicPrompt-Stable-Diffusion"
pipeline2 = pipeline("text-generation", model=model_id)
with bentoml.models.create(
    name="gustavosta-magicPrompt-stable-diffusion",
) as model_ref:
    pipeline2.save_pretrained(model_ref.path)
    print(f"Model saved: {model_ref}")

Kindly run the commented-out blocks separately, or update the variable names if you wish to download multiple models in one go.

As you can see from the above script, BentoML allows you to import a variety of models into your local model store. This is particularly helpful since you can then manage all your ML models using BentoML's model management API when deploying to the cloud, or use the store locally to create pre-packaged containers with version-controlled models ready inside your application code. For more advanced use cases, another BentoML product called Yatai lets you build a model store in the cloud and orchestrate GitOps-style CI/CD pipelines as well.

Once downloaded check your local models:

bentoml models list

Local Model store
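
If you prefer to inspect the store programmatically, below is a minimal sketch that uses the same bentoml.models API the import script above relies on; the :latest suffix resolves to the most recently created version:

import bentoml

# Resolve the most recent version of the stored model
model_ref = bentoml.models.get("gustavosta-magicprompt-stable-diffusion:latest")
print(model_ref.tag)   # full tag, including the auto-generated version suffix
print(model_ref.path)  # on-disk directory holding the saved pipeline files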

Step 3: Build PromptEnhancer Service

We are going to write a simple PromptEnhancer service. I am not going to go into much detail about the code implementation; please refer to the HuggingFace Diffusers docs on prompt techniques for that. The only thing I have done is lift that code and convert it into a BentoML service. The code is essentially the same, with some minor tweaks to the words and their pairs and a few additional prompts for my own use case.

touch prompt-enhancer/service.py
import bentoml
import typing as t
from transformers import LogitsProcessor, LogitsProcessorList
import gc

@bentoml.service(
    resources={
        "gpu": 1
    },
    traffic={
        "timeout": 30,
        "max_concurrency": 100,
    },
    workers=1,
)
class PromptEnhancer():
    model_prompt_enhancer = bentoml.models.get("gustavosta-magicprompt-stable-diffusion:tbsmfosgy2xo2zdo")
    styles = {
        "cinematic": "cinematic film still of {prompt}, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain",
        "anime": "anime artwork of {prompt}, anime style, key visual, vibrant, studio anime, highly detailed, unreal engine, glorious, hyperrealistic, inspired by {artist}",
        "photographic": "cinematic photo of {prompt}, 35mm photograph, film, professional, 4k, highly detailed, Leica",
        "comic": "comic panel of {prompt}, graphic illustration, comic art, graphic novel art, vibrant, hyperrealistic, ureal, UHD, inspired by {artist}",
        "lineart": "line art drawing {prompt}, professional, sleek, modern, minimalist, graphic, line art, vector graphics",
        "pixelart": " pixel-art {prompt}, low-res, blocky, pixel art style, 8-bit graphics",
    }

    words = [
        "aesthetic", "astonishing", "beautiful", "breathtaking", "composition", "contrasted", "epic", "moody", "enhanced",
        "exceptional", "fascinating", "flawless", "glamorous", "glorious", "illumination", "impressive", "improved",
        "inspirational", "magnificent", "majestic", "hyperrealistic", "smooth", "sharp", "focus", "stunning", "detailed",
        "intricate", "dramatic", "high", "quality", "perfect", "light", "ultra", "highly", "radiant", "satisfying",
        "soothing", "sophisticated", "stylish", "sublime", "terrific", "touching", "timeless", "wonderful", "unbelievable",
        "elegant", "awesome", "amazing", "dynamic", "trendy", "unreal", "engine", "UHD", "HD"
    ]

    word_pairs = ["highly detailed", "high quality", "enhanced quality", "perfect composition", "dynamic light", "unreal engine", "UHD unreal", "hyperrealistic UDH", "majestic glorious"]

    def __init__(self) -> None:
        import torch
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        

    def _find_and_order_pairs(self, s, pairs):
        words = s.split()
        found_pairs = []
        for pair in pairs:
            pair_words = pair.split()
            if pair_words[0] in words and pair_words[1] in words:
                found_pairs.append(pair)
                words.remove(pair_words[0])
                words.remove(pair_words[1])

        for word in words[:]:
            for pair in pairs:
                if word in pair.split():
                    words.remove(word)
                    break
        ordered_pairs = ", ".join(found_pairs)
        remaining_s = ", ".join(words)
        return ordered_pairs, remaining_s
    
    
    @bentoml.api
    def enhance_prompt(
        self, 
        prompt: str, 
        style: t.Literal["cinematic", "anime", "photographic", "comic", "lineart", "pixelart"], 
        artist: t.Optional[str] = None
    ) -> str:
        import torch 
        from transformers import GPT2Tokenizer, GPT2LMHeadModel, GenerationConfig
        
        try:
            tokenizer = GPT2Tokenizer.from_pretrained(self.model_prompt_enhancer.path)
            word_ids = [tokenizer.encode(word, add_prefix_space=True)[0] for word in self.words]
            bias = torch.full((tokenizer.vocab_size,), -float("Inf")).to(self.device)
            bias[word_ids] = 0
            processor = CustomLogitsProcessor(bias)
            processor_list = LogitsProcessorList([processor])
            prompt = self.styles[style].format(prompt=prompt, artist=artist)
            print("Prompt: ", prompt)
            model = GPT2LMHeadModel.from_pretrained(self.model_prompt_enhancer.path, torch_dtype=torch.float16).to(self.device)
            model.eval()
            inputs = tokenizer(prompt, return_tensors="pt").to(self.device)
            token_count = inputs["input_ids"].shape[1]
            max_new_tokens = 50 - token_count

            generation_config = GenerationConfig(
                penalty_alpha=0.7,
                top_k=50,
                eos_token_id=model.config.eos_token_id,
                pad_token_id=model.config.eos_token_id,
                pad_token=model.config.pad_token_id,
                do_sample=True,
            )

            with torch.no_grad():
                generated_ids = model.generate(
                    input_ids=inputs["input_ids"],
                    attention_mask=inputs["attention_mask"],
                    max_new_tokens=max_new_tokens,
                    generation_config=generation_config,
                    logits_processor=processor_list,
                )
            output_tokens = [tokenizer.decode(generated_id, skip_special_tokens=True) for generated_id in generated_ids]
            input_part, generated_part = output_tokens[0][: len(prompt)], output_tokens[0][len(prompt) :]
            pairs, words = self._find_and_order_pairs(generated_part, self.word_pairs)
            formatted_generated_part = pairs + ", " + words
            enhanced_prompt = input_part + ", " + formatted_generated_part
        finally:
            gc.collect()
            torch.cuda.empty_cache()
            torch.cuda.ipc_collect()
        return enhanced_prompt


class CustomLogitsProcessor(LogitsProcessor):
    def __init__(self, bias):
        super().__init__()
        self.bias = bias

    def __call__(self, input_ids, scores):
        if len(input_ids.shape) == 2:
            last_token_id = input_ids[0, -1]
            self.bias[last_token_id] = -1e10
        return scores + self.bias

Define a Bento:

cat << EOF > prompt-enhancer/bentofile.yaml
service: 'service:PromptEnhancer'
labels:
  owner: alphaduriendur
  project: portfolio
include:
  - '*.py'
python:
  requirements_txt: 'requirements.txt'
models:
  - tag: "gustavosta-magicprompt-stable-diffusion:tbsmfosgy2xo2zdo"
EOF

Now run the service locally:

cd prompt-enhancer
bentoml serve service:PromptEnhancer

(Screenshots: serving the model locally)
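
Besides the Swagger UI that bentoml serve exposes, you can hit the endpoint directly. Below is a minimal example, assuming the service is listening on the default port 3000 and the API method is named enhance_prompt (the prompt text is just an illustration):

# Call the prompt enhancer over HTTP; the JSON body maps to the API method's parameters
curl -X POST http://localhost:3000/enhance_prompt \
    -H 'Content-Type: application/json' \
    -d '{"prompt": "a lighthouse on a cliff at dusk", "style": "cinematic"}'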

With that you have a simple prompt engineering service that can create beautiful prompts for your image generation pipelines. We have:

  1. Created a simple Bento API that serves the custom prompt engineering logic we built.
  2. Used BentoML's local Model Store for model management within the service.

On to our next step

Step 4: Build AvatarGenerator Service

In this step I am going to use the same pattern as above to write a separate endpoint that uses HuggingFace's Diffusers library to run inference on a trained Stable Diffusion model. It is a custom model that I trained myself using DreamBooth. I am not going to cover Stable Diffusion training or DreamBooth in this gist; instead I am going to focus purely on serving Stable Diffusion models using BentoML. You can choose any model compatible with the Diffusers library as you see fit.

touch avatar-generator/service.py
import bentoml
from PIL.Image import Image

from typing_extensions import Annotated
from annotated_types import Le, Ge
import gc

DEFAULT_HEIGHT = 640
DEFAULT_WIDTH = 640
DEFAULT_PROMPT = "close-up photography of alphaduriendur standing in the rain at night, in a street lit by lamps, leica 35mm summilux"
NEGATIVE_PROMPT = "bad anatomy, deformed, ugly, disfigured, fat cheeks, low quality, bad quality, duplicates, markings on forehead, bad posture, bad lighting, poorly drawn hands, poorly drawn legs, poorly drawn eyes, bad composition, bad lighting, bad shading, bad perspective, bad proportions, duplicate subjects, multiple faces, distorted faces"

@bentoml.service(
    resources={
        "gpu": 1
    },
    traffic={
        "timeout": 30,
        "max_concurrency": 3,
    },
    workers=1,
)
class AvatarGenerator():
    # Define the model as a class variable
    model_ref = bentoml.models.get("alphaduriendur-avatar-generator:eeljtosf66gbkgga")
    model_encode_ref = bentoml.models.get("stabilityai-sd-vae-ft-mse:gn4hntsgcsva2gga")
    

    def __init__(self) -> None:
        import torch
        import diffusers

        # Load model into pipeline
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.txt2img_pipe = diffusers.AutoPipelineForText2Image.from_pretrained(
            self.model_ref.path,
            torch_dtype=torch.float16,
            use_safetensors=True,
            safety_checker = None,
            requires_safety_checker = False
        )
        vae = diffusers.AutoencoderKL.from_pretrained(self.model_encode_ref.path, torch_dtype=torch.float16).to(self.device)
        self.txt2img_pipe.vae = vae
        self.txt2img_pipe.enable_attention_slicing()
        self.txt2img_pipe.enable_freeu(s1=0.9, s2=0.2, b1=1.5, b2=1.6)
        # self.txt2img_pipe.scheduler = diffusers.DPMSolverMultistepScheduler.from_config(self.txt2img_pipe.scheduler.config)
        self.txt2img_pipe.scheduler = diffusers.EulerAncestralDiscreteScheduler.from_config(self.txt2img_pipe.scheduler.config)
        self.txt2img_pipe.to(self.device, dtype=torch.float16)
        # self.txt2img_pipe.load_lora_weights("nerijs/pixel-art-xl", weight_name="pixel-art-xl.safetensors", adapter_name="pixel")
        # self.txt2img_pipe.set_adapters("pixel")

    @bentoml.api
    def generate_avatar(
            self,
            prompt: str = DEFAULT_PROMPT,
            # negative_prompt: t.Optional[str] = NEGATIVE_PROMPT,
            height: int = DEFAULT_HEIGHT,
            width: int = DEFAULT_WIDTH,
            # seed: int = DEFAULT_SEED,
            num_inference_steps: Annotated[int, Ge(1), Le(50)] = 20,
            guidance_scale: Annotated[float, Ge(0.0), Le(20.)] = 10.0,
    ) -> Image:
        import torch

        try:
            # generator = torch.Generator("cuda").manual_seed(seed)
            res = self.txt2img_pipe(
                prompt=prompt,
                negative_prompt=NEGATIVE_PROMPT,
                height=height,
                width=width,
                num_inference_steps=num_inference_steps,
                guidance_scale=guidance_scale,
                # generator=generator
            )
            image = res[0][0]
        finally:
            gc.collect()
            torch.cuda.empty_cache()
            torch.cuda.ipc_collect()
            self.txt2img_pipe.disable_freeu()
        return image

As above, we create a bentofile.yaml:

cat << EOF > avatar-generator/bentofile.yaml
service: 'service:AvatarGenerator'
labels:
  owner: alphaduriendur
  project: portfolio
include:
  - '*.py'
python:
  requirements_txt: 'requirements.txt'
models:
  - tag: "alphaduriendur-avatar-generator:eeljtosf66gbkgga" # A dictionary
  - tag: "stabilityai-sd-vae-ft-mse:gn4hntsgcsva2gga"
EOF

Now serve the model:

cd avatar-generator
bentoml serve service:AvatarGenerator

(Screenshots: local model serving endpoint running inference)
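
As with the prompt enhancer, you can call this endpoint directly once it is up. A minimal example, again assuming the default port 3000; since the response body is the generated image, we write it to a file:

# Generate an avatar and save the returned image to disk
curl -X POST http://localhost:3000/generate_avatar \
    -H 'Content-Type: application/json' \
    -d '{"prompt": "close-up photography of alphaduriendur standing in the rain at night", "num_inference_steps": 25}' \
    -o avatar.png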

With the above steps we have deployed two separate models as local API services.
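
And since the whole point is a production mindset, both services can be packaged the same way once local testing looks good. A minimal sketch; bentoml build prints the resulting Bento tag, so substitute it in the containerize step:

# Build a Bento from the bentofile.yaml in the service directory
cd prompt-enhancer && bentoml build

# Package the built Bento (application code, dependencies, and version-controlled models) into a container image
bentoml containerize <bento-tag-printed-by-build>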
