Informal (vibes-based) evaluation of the following vision-language-model captioners:
- Florence-2-base-ft
- CogVLM2
- BLIP-2
- MoonDream2
- Share-Captioner
- Florence-2-SD3-Captioner
#!/usr/bin/env python3
import gradio as gr
import numpy as np
import random
import torch
from diffusers import (
    StableDiffusion3Pipeline,
    SD3Transformer2DModel,
    FlowMatchEulerDiscreteScheduler,
    AutoencoderTiny,
)
def add_profiling_markers(model):
    """Monkey-patch profiling markers into an nn.Module.

    Args:
        model: an nn.Module

    Effect:
        all model.named_modules() forward calls get wrapped in their
        own profiling scope, making traces easier to understand.
    """
    for name, module in model.named_modules():
        def wrapped(*args, _name=name or "model", _forward=module.forward, **kwargs):
            # each forward call shows up as its own named range in the trace
            with torch.profiler.record_function(_name):
                return _forward(*args, **kwargs)
        module.forward = wrapped
Stable Diffusion's VAE is a neural network that encodes images into a compressed "latent" format and decodes them back. The encoder performs 48x lossy compression, and the decoder generates new detail to fill in the gaps.
(Calling this model a "VAE" is sort of a misnomer - it's an encoder with some very slight KL regularization, and a conditional GAN decoder)
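Where the 48x comes from: the encoder downsamples 8x in each spatial dimension and maps 3 RGB channels to 4 latent channels. Counting elements (not bytes — the ratio in bytes also depends on dtypes), for an example 512×512 image:

```python
# SD VAE shapes: 8x spatial downsampling, 3 image channels -> 4 latent channels
image_elems = 512 * 512 * 3                  # elements in the RGB image
latent_elems = (512 // 8) * (512 // 8) * 4   # elements in the 64x64x4 latent
ratio = image_elems / latent_elems
```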
This document is a big pile of various links with more info.
Cleaned up version of https://gist.github.com/mrsteyk/74ad3ec2f6f823111ae4c90e168505ac,
which is in turn based on the public_diff_vae.ConvUNetVAE
from https://github.com/openai/consistencydecoder.
Install the consistency decoder code (for the inference logic) and download the extracted weights:
def summarize_tensor(x):
    return (
        f"\033[34m{str(tuple(x.shape)).ljust(24)}\033[0m "
        f"(\033[31mmin {x.min().item():+.4f}\033[0m / "
        f"\033[32mmean {x.mean().item():+.4f}\033[0m / "
        f"\033[33mmax {x.max().item():+.4f}\033[0m)"
    )
class ModelActivationPrinter:
    def __init__(self, module, submodules_to_log):
        # map module ids to dotted names so logged activations get readable labels
        self.id_to_name = {
            id(submodule): str(name) for name, submodule in module.named_modules()
        }
        self.submodules = submodules_to_log
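summarize_tensor above leans on raw ANSI SGR escape codes (`\033[34m` blue, `\033[31m` red, `\033[32m` green, `\033[33m` yellow, `\033[0m` reset). The wrap-then-reset pattern in isolation (the `colorize` helper is illustrative, not from the script):

```python
# ANSI SGR codes used by summarize_tensor
RED, GREEN, YELLOW, RESET = "\033[31m", "\033[32m", "\033[33m", "\033[0m"

def colorize(text, color):
    # wrap text in a color code, then reset so later output is unaffected
    return f"{color}{text}{RESET}"

line = " / ".join([colorize("min", RED), colorize("mean", GREEN), colorize("max", YELLOW)])
```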
#!/usr/bin/env python3
from pathlib import Path
from safetensors.torch import load_file

def summarize_tensor(x):
    if x is None:
        return "None"
    x = x.float()
    return f"({x.min().item():.3f}, {x.mean().item():.3f}, {x.max().item():.3f})"