If you encounter a KeyError: 'captions' when trying to run summary(model) on GroundingDINO, it is because the repository defines its own GroundingDINO class with various customized layers, and its forward pass expects a captions keyword argument in addition to the image tensor.
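The failure mode can be illustrated with a minimal stand-in (the class below is hypothetical, not the real GroundingDINO forward, which is more involved): a generic summary tool only supplies the image tensor, so the captions lookup fails.

```python
import torch
from torch import nn

# Hypothetical stand-in mimicking how GroundingDINO reads the text
# prompt from keyword arguments in its forward pass.
class DummyGroundingDINO(nn.Module):
    def forward(self, samples, **kw):
        captions = kw["captions"]  # KeyError if no captions are passed
        return samples

model = DummyGroundingDINO()
x = torch.zeros((1, 3, 800, 1200))

try:
    model(x)  # what a generic summary tool does: image tensor only
except KeyError as err:
    print("KeyError:", err)

_ = model(x, captions=["car.pedestrian"])  # works once captions are supplied
```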
First, use torchinfo instead of torchsummary, since both torchsummary and torch-summary have long been deprecated. However, you will need to make a small modification to the torchinfo.py file in your environment's site-packages directory. For example, you can find it at: /home/{user}/miniconda3/envs/GroundingDINO/lib/python3.8/site-packages/torchinfo/torchinfo.py
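If you are not sure where torchinfo is installed, one way to locate the file to patch (a generic Python trick, not specific to this repo) is:

```python
import importlib.util

# Print the path of the installed torchinfo package so you know
# which file to patch; the path varies by environment.
spec = importlib.util.find_spec("torchinfo")
print(spec.origin if spec else "torchinfo is not installed")
```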
Add a few lines of code where torchinfo runs the forward pass:

    with torch.no_grad():
        model = model if device is None else model.to(device)
        if isinstance(x, (list, tuple)):
            # Build the inputs GroundingDINO expects instead of the
            # generic ones torchinfo would pass in
            hidden = torch.zeros((1, 3, 800, 1200)).to(device)
            captions_test = ['car.pedestrian']
            _ = model(captions=captions_test, samples=hidden)
            # comment out the original call
            # _ = model(*x, **kwargs)
Then add the line summary(model, input_size=(1, 3, 1280, 800)) in demo/inference_on_a_image.py and run the script. (Note that with the patch above, the hardcoded (1, 3, 800, 1200) tensor is what is actually fed to the model; input_size mainly affects the reported "Input size (MB)" figure.)

python demo/inference_on_a_image.py -c groundingdino/config/GroundingDINO_SwinT_OGC.py -p {model_path_pth} -o {output_prediction_path} -i {image_path} -t "car.pedestrian"
===================================================================================================================
Layer (type:depth-idx) Output Shape Param #
===================================================================================================================
GroundingDINO [1, 900, 256] --
├─BertModelWarper: 1-1 [1, 768] --
│ └─BertEmbeddings: 2-1 [1, 5, 768] --
│ │ └─Embedding: 3-1 [1, 5, 768] 23,440,896
│ │ └─Embedding: 3-2 [1, 5, 768] 1,536
│ │ └─Embedding: 3-3 [1, 5, 768] 393,216
│ │ └─LayerNorm: 3-4 [1, 5, 768] 1,536
│ │ └─Dropout: 3-5 [1, 5, 768] --
│ └─BertEncoder: 2-2 [1, 5, 768] --
│ │ └─ModuleList: 3-6 -- 85,054,464
│ └─BertPooler: 2-3 [1, 768] --
│ │ └─Linear: 3-7 [1, 768] (590,592)
│ │ └─Tanh: 3-8 [1, 768] --
├─Linear: 1-2 [1, 5, 256] 196,864
├─Joiner: 1-3 [1, 192, 100, 150] --
│ └─SwinTransformer: 2-4 [1, 768, 25, 38] --
│ │ └─PatchEmbed: 3-9 [1, 96, 200, 300] 4,896
│ │ └─Dropout: 3-10 [1, 60000, 96] --
│ │ └─ModuleList: 3-15 -- (recursive)
│ │ └─LayerNorm: 3-12 [1, 15000, 192] 384
│ │ └─ModuleList: 3-15 -- (recursive)
│ │ └─LayerNorm: 3-14 [1, 3750, 384] 768
│ │ └─ModuleList: 3-15 -- (recursive)
│ │ └─LayerNorm: 3-16 [1, 950, 768] 1,536
│ └─PositionEmbeddingSineHW: 2-5 [1, 256, 100, 150] --
│ └─PositionEmbeddingSineHW: 2-6 [1, 256, 50, 75] --
│ └─PositionEmbeddingSineHW: 2-7 [1, 256, 25, 38] --
├─ModuleList: 1-4 -- --
│ └─Sequential: 2-8 [1, 256, 100, 150] --
│ │ └─Conv2d: 3-17 [1, 256, 100, 150] 49,408
│ │ └─GroupNorm: 3-18 [1, 256, 100, 150] 512
│ └─Sequential: 2-9 [1, 256, 50, 75] --
│ │ └─Conv2d: 3-19 [1, 256, 50, 75] 98,560
│ │ └─GroupNorm: 3-20 [1, 256, 50, 75] 512
│ └─Sequential: 2-10 [1, 256, 25, 38] --
│ │ └─Conv2d: 3-21 [1, 256, 25, 38] 196,864
│ │ └─GroupNorm: 3-22 [1, 256, 25, 38] 512
│ └─Sequential: 2-11 [1, 256, 13, 19] --
│ │ └─Conv2d: 3-23 [1, 256, 13, 19] 1,769,728
│ │ └─GroupNorm: 3-24 [1, 256, 13, 19] 512
├─Joiner: 1-5 -- (recursive)
│ └─PositionEmbeddingSineHW: 2-12 [1, 256, 13, 19] --
├─Transformer: 1-6 [1, 900, 256] 231,424
│ └─TransformerEncoder: 2-13 [1, 19947, 256] --
│ │ └─ModuleList: 3-40 -- (recursive)
│ │ └─ModuleList: 3-41 -- (recursive)
│ │ └─ModuleList: 3-42 -- (recursive)
│ │ └─ModuleList: 3-40 -- (recursive)
│ │ └─ModuleList: 3-41 -- (recursive)
│ │ └─ModuleList: 3-42 -- (recursive)
│ │ └─ModuleList: 3-40 -- (recursive)
│ │ └─ModuleList: 3-41 -- (recursive)
│ │ └─ModuleList: 3-42 -- (recursive)
│ │ └─ModuleList: 3-40 -- (recursive)
│ │ └─ModuleList: 3-41 -- (recursive)
│ │ └─ModuleList: 3-42 -- (recursive)
│ │ └─ModuleList: 3-40 -- (recursive)
│ │ └─ModuleList: 3-41 -- (recursive)
│ │ └─ModuleList: 3-42 -- (recursive)
│ │ └─ModuleList: 3-40 -- (recursive)
│ │ └─ModuleList: 3-41 -- (recursive)
│ │ └─ModuleList: 3-42 -- (recursive)
│ └─Linear: 2-14 [1, 19947, 256] 65,792
│ └─LayerNorm: 2-15 [1, 19947, 256] 512
│ └─ContrastiveEmbed: 2-16 [1, 19947, 256] --
│ └─MLP: 2-17 [1, 19947, 4] --
│ │ └─ModuleList: 3-43 -- 132,612
│ └─TransformerDecoder: 2-18 [1, 900, 256] 9,180,804
│ │ └─MLP: 3-44 [900, 1, 256] 197,120
│ │ └─ModuleList: 3-65 -- (recursive)
├─ModuleList: 1-19 -- (recursive)
│ └─MLP: 2-19 [900, 1, 4] --
│ │ └─ModuleList: 3-73 -- (recursive)
├─Transformer: 1-21 -- (recursive)
│ └─TransformerDecoder: 2-30 -- (recursive)
│ │ └─LayerNorm: 3-47 [900, 1, 256] 512
│ │ └─MLP: 3-48 [900, 1, 256] (recursive)
│ │ └─ModuleList: 3-65 -- (recursive)
├─ModuleList: 1-19 -- (recursive)
│ └─MLP: 2-21 [900, 1, 4] (recursive)
│ │ └─ModuleList: 3-73 -- (recursive)
├─Transformer: 1-21 -- (recursive)
│ └─TransformerDecoder: 2-30 -- (recursive)
│ │ └─LayerNorm: 3-51 [900, 1, 256] (recursive)
│ │ └─MLP: 3-52 [900, 1, 256] (recursive)
│ │ └─ModuleList: 3-65 -- (recursive)
├─ModuleList: 1-19 -- (recursive)
│ └─MLP: 2-23 [900, 1, 4] (recursive)
│ │ └─ModuleList: 3-73 -- (recursive)
├─Transformer: 1-21 -- (recursive)
│ └─TransformerDecoder: 2-30 -- (recursive)
│ │ └─LayerNorm: 3-55 [900, 1, 256] (recursive)
│ │ └─MLP: 3-56 [900, 1, 256] (recursive)
│ │ └─ModuleList: 3-65 -- (recursive)
├─ModuleList: 1-19 -- (recursive)
│ └─MLP: 2-25 [900, 1, 4] (recursive)
│ │ └─ModuleList: 3-73 -- (recursive)
├─Transformer: 1-21 -- (recursive)
│ └─TransformerDecoder: 2-30 -- (recursive)
│ │ └─LayerNorm: 3-59 [900, 1, 256] (recursive)
│ │ └─MLP: 3-60 [900, 1, 256] (recursive)
│ │ └─ModuleList: 3-65 -- (recursive)
├─ModuleList: 1-19 -- (recursive)
│ └─MLP: 2-27 [900, 1, 4] (recursive)
│ │ └─ModuleList: 3-73 -- (recursive)
├─Transformer: 1-21 -- (recursive)
│ └─TransformerDecoder: 2-30 -- (recursive)
│ │ └─LayerNorm: 3-63 [900, 1, 256] (recursive)
│ │ └─MLP: 3-64 [900, 1, 256] (recursive)
│ │ └─ModuleList: 3-65 -- (recursive)
├─ModuleList: 1-19 -- (recursive)
│ └─MLP: 2-29 [900, 1, 4] (recursive)
│ │ └─ModuleList: 3-73 -- (recursive)
├─Transformer: 1-21 -- (recursive)
│ └─TransformerDecoder: 2-30 -- (recursive)
│ │ └─LayerNorm: 3-67 [900, 1, 256] (recursive)
├─ModuleList: 1-19 -- (recursive)
│ └─MLP: 2-31 [1, 900, 4] (recursive)
│ │ └─ModuleList: 3-73 -- (recursive)
│ └─MLP: 2-32 [1, 900, 4] (recursive)
│ │ └─ModuleList: 3-73 -- (recursive)
│ └─MLP: 2-33 [1, 900, 4] (recursive)
│ │ └─ModuleList: 3-73 -- (recursive)
│ └─MLP: 2-34 [1, 900, 4] (recursive)
│ │ └─ModuleList: 3-73 -- (recursive)
│ └─MLP: 2-35 [1, 900, 4] (recursive)
│ │ └─ModuleList: 3-73 -- (recursive)
│ └─MLP: 2-36 [1, 900, 4] (recursive)
│ │ └─ModuleList: 3-73 -- (recursive)
├─ModuleList: 1-20 -- --
│ └─ContrastiveEmbed: 2-37 [1, 900, 256] --
│ └─ContrastiveEmbed: 2-38 [1, 900, 256] --
│ └─ContrastiveEmbed: 2-39 [1, 900, 256] --
│ └─ContrastiveEmbed: 2-40 [1, 900, 256] --
│ └─ContrastiveEmbed: 2-41 [1, 900, 256] --
│ └─ContrastiveEmbed: 2-42 [1, 900, 256] --
├─Transformer: 1-21 -- (recursive)
│ └─ContrastiveEmbed: 2-43 [1, 900, 256] --
===================================================================================================================
Total params: 182,020,486
Trainable params: 181,429,894
Non-trainable params: 590,592
Total mult-adds (G): 10.27
===================================================================================================================
Input size (MB): 12.29
Forward/backward pass size (MB): 9487.25
Params size (MB): 671.38
Estimated Total Size (MB): 10170.92
===================================================================================================================
Although the GroundingDINO paper (p. 17) states that GroundingDINO-T has 172M parameters, torchinfo reports 182M in total.
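One way to sanity-check torchinfo's total (hook-based tools can miscount when modules are shared or called recursively) is to count parameters directly from model.parameters(). This is sketched below on a small stand-in built from the same Linear(256, 256) and LayerNorm(256) shapes that appear as rows 2-14 and 2-15 of the summary above; for the real check, construct GroundingDINO and pass it in the same way.

```python
import torch
from torch import nn

def count_params(model: nn.Module):
    """Count parameters straight from the module tree; shared
    parameters are yielded once by model.parameters(), so they
    cannot be double-counted."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

# Stand-in with the same shapes as two rows of the summary:
# Linear(256, 256) -> 65,792 params, LayerNorm(256) -> 512 params.
model = nn.Sequential(nn.Linear(256, 256), nn.LayerNorm(256))
total, trainable = count_params(model)
print(total, trainable)  # 66304 66304
```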