@ortegatron
Last active March 14, 2024 19:10
Trainer with Loss on Validation for Detectron2
import datetime
import logging
import time

import numpy as np
import torch

import detectron2.utils.comm as comm
from detectron2.data import DatasetMapper, build_detection_test_loader
from detectron2.engine.hooks import HookBase
from detectron2.evaluation import inference_context
from detectron2.utils.logger import log_every_n_seconds


class LossEvalHook(HookBase):
    def __init__(self, eval_period, model, data_loader):
        self._model = model
        self._period = eval_period
        self._data_loader = data_loader

    def _do_loss_eval(self):
        # Copying inference_on_dataset from evaluator.py
        total = len(self._data_loader)
        num_warmup = min(5, total - 1)
        start_time = time.perf_counter()
        total_compute_time = 0
        losses = []
        for idx, inputs in enumerate(self._data_loader):
            if idx == num_warmup:
                start_time = time.perf_counter()
                total_compute_time = 0
            start_compute_time = time.perf_counter()
            if torch.cuda.is_available():
                torch.cuda.synchronize()
            total_compute_time += time.perf_counter() - start_compute_time
            iters_after_start = idx + 1 - num_warmup * int(idx >= num_warmup)
            seconds_per_img = total_compute_time / iters_after_start
            if idx >= num_warmup * 2 or seconds_per_img > 5:
                total_seconds_per_img = (time.perf_counter() - start_time) / iters_after_start
                eta = datetime.timedelta(seconds=int(total_seconds_per_img * (total - idx - 1)))
                log_every_n_seconds(
                    logging.INFO,
                    "Loss on Validation done {}/{}. {:.4f} s / img. ETA={}".format(
                        idx + 1, total, seconds_per_img, str(eta)
                    ),
                    n=5,
                )
            loss_batch = self._get_loss(inputs)
            losses.append(loss_batch)
        mean_loss = np.mean(losses)
        self.trainer.storage.put_scalar('validation_loss', mean_loss)
        comm.synchronize()
        return losses

    def _get_loss(self, data):
        # How loss is calculated on train_loop
        metrics_dict = self._model(data)
        metrics_dict = {
            k: v.detach().cpu().item() if isinstance(v, torch.Tensor) else float(v)
            for k, v in metrics_dict.items()
        }
        total_losses_reduced = sum(loss for loss in metrics_dict.values())
        return total_losses_reduced

    def after_step(self):
        next_iter = self.trainer.iter + 1
        is_final = next_iter == self.trainer.max_iter
        if is_final or (self._period > 0 and next_iter % self._period == 0):
            self._do_loss_eval()
        self.trainer.storage.put_scalars(timetest=12)
import os

from detectron2.data import DatasetMapper, build_detection_test_loader
from detectron2.engine import DefaultTrainer
from detectron2.evaluation import COCOEvaluator


class MyTrainer(DefaultTrainer):
    @classmethod
    def build_evaluator(cls, cfg, dataset_name, output_folder=None):
        if output_folder is None:
            output_folder = os.path.join(cfg.OUTPUT_DIR, "inference")
        return COCOEvaluator(dataset_name, cfg, True, output_folder)

    def build_hooks(self):
        hooks = super().build_hooks()
        hooks.insert(-1, LossEvalHook(
            self.cfg.TEST.EVAL_PERIOD,  # note: self.cfg; a bare cfg is not in scope here
            self.model,
            build_detection_test_loader(
                self.cfg,
                self.cfg.DATASETS.TEST[0],
                DatasetMapper(self.cfg, True)
            )
        ))
        return hooks
import json
import matplotlib.pyplot as plt

experiment_folder = './output/model_iter4000_lr0005_wf1_date2020_03_20__05_16_45'

def load_json_arr(json_path):
    lines = []
    with open(json_path, 'r') as f:
        for line in f:
            lines.append(json.loads(line))
    return lines

experiment_metrics = load_json_arr(experiment_folder + '/metrics.json')

# Not every line of metrics.json carries every key, so filter before plotting.
plt.plot(
    [x['iteration'] for x in experiment_metrics if 'total_loss' in x],
    [x['total_loss'] for x in experiment_metrics if 'total_loss' in x])
plt.plot(
    [x['iteration'] for x in experiment_metrics if 'validation_loss' in x],
    [x['validation_loss'] for x in experiment_metrics if 'validation_loss' in x])
plt.legend(['total_loss', 'validation_loss'], loc='upper left')
plt.show()
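
For context, a minimal usage sketch of MyTrainer; the dataset names, model config, and EVAL_PERIOD value below are placeholder assumptions, not part of the gist:

from detectron2 import model_zoo
from detectron2.config import get_cfg

# assumes "my_dataset_train" / "my_dataset_val" are already registered in COCO format
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("my_dataset_train",)
cfg.DATASETS.TEST = ("my_dataset_val",)   # LossEvalHook reads cfg.DATASETS.TEST[0]
cfg.TEST.EVAL_PERIOD = 100                # compute validation loss every 100 iterations

trainer = MyTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()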
@xQsM3

xQsM3 commented Jan 6, 2021

cfg.SOLVER.IMS_PER_BATCH = 4 leads to a lower validation loss than training loss, while cfg.SOLVER.IMS_PER_BATCH = 2 gives a validation loss higher than the training loss, as expected. I checked the code but I cannot figure out why. Does anyone have an idea why that is?

@morganaribeiro

@ortegatron and @alexriedel1 Is there an automatic way to split the dataset into training, validation and test sets before registering the dataset and starting training in Detectron2? If not, how do you split it manually before training?

Most of the time, splitting the dataset depends on your application requirements and data properties. You want to make sure that the data-generating process / probability distribution is the same across all sets. You can use sklearn's train_test_split. Here is an example of how this can be done: coco_split.

@hadilou Is it possible with coco_split to also add a validation set, i.e. to split into train / test / validation (60 / 30 / 10)? Can you tell?
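
A minimal sketch of a 60 / 30 / 10 split using sklearn's train_test_split (the image_ids variable and the exact ratios are illustrative assumptions; coco_split has its own options):

from sklearn.model_selection import train_test_split

# image_ids: list of all image ids (or records) in the dataset
train_ids, rest_ids = train_test_split(image_ids, test_size=0.40, random_state=42)  # 60% train
test_ids, val_ids = train_test_split(rest_ids, test_size=0.25, random_state=42)     # 30% test, 10% val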

@Nobuyuki-Enzan

How do you implement early stopping with saving of the model (e.g. based on validation loss)?

@pnn19

pnn19 commented Mar 17, 2021

Where should the three Python files be placed? I tried putting LossEvalHook.py in the root of Detectron2 and the MyTrainer.py block in the Trainer script, but it doesn't work.

@c-axel

c-axel commented Mar 23, 2021

this is great, saved me a lot of time. thanks!

@johnahjohn

Can anyone give suggestions or instructions on integrating this code with the official detectron2 code? I added
from detectron2.engine import MyTrainer
from detectron2.engine import LossEvalHook
and
trainer = MyTrainer(cfg)
to my training script.

But I am getting this error:

TypeError Traceback (most recent call last)
in
21
22 os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
---> 23 trainer = MyTrainer(cfg)
24 trainer.resume_or_load(resume=False)
25 trainer.train()

TypeError: 'module' object is not callable

Please help me to solve this problem

@c-axel

c-axel commented Jun 4, 2021

Can anyone give suggestions or instructions on integrating this code with the official detectron2 code? [...] TypeError: 'module' object is not callable

If you use the custom MyTrainer.py above, you can call it like this:

from MyTrainer import MyTrainer

trainer = MyTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()

@bsm1244

bsm1244 commented Jul 22, 2021

Where should I put LossEvalHook.py? Do I just paste it into default.py?
Please help me to solve this error.
[screenshot]

@c-axel

c-axel commented Jul 22, 2021

Where should I put LossEvalHook.py? Do I just paste it into default.py?

I put it in MyTrainer.py and just imported MyTrainer

@bsm1244

bsm1244 commented Jul 22, 2021

Where should I put LossEvalHook.py? Do I just paste it into default.py?
Please help me to solve this error.
[screenshot]

I resolved the problem by changing cfg.TEST.EVAL_PERIOD to self.cfg.TEST.EVAL_PERIOD.
I hope this can be helpful to others.

@prathamsss

try this:

class Mytrainer(DefaultTrainer):
    @classmethod
    def build_train_loader(cls, cfg):
        return build_detection_train_loader(cfg, mapper=custom_mapper)

    @classmethod
    def build_evaluator(cls, cfg, dataset_name, output_folder=None):
        if output_folder is None:
            os.makedirs("coco_eval", exist_ok=True)
            output_folder = "coco_eval"
        return COCOEvaluator(dataset_name, cfg, False, output_folder)

    def build_hooks(self):
        hooks = super().build_hooks()
        hooks.insert(-1, LossEvalHook(
            self.cfg.TEST.EVAL_PERIOD,  # self.cfg, not a global cfg
            self.model,
            build_detection_test_loader(
                self.cfg,
                self.cfg.DATASETS.TEST[0],
                DatasetMapper(self.cfg, True)
            )
        ))
        return hooks
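
Note that custom_mapper above is not defined in this thread; a minimal sketch of what such a mapper might look like, following the detectron2 custom-dataloader pattern (the augmentations are illustrative assumptions):

import copy
import torch
import detectron2.data.transforms as T
from detectron2.data import detection_utils as utils

def custom_mapper(dataset_dict):
    dataset_dict = copy.deepcopy(dataset_dict)  # avoid mutating the cached dict
    image = utils.read_image(dataset_dict["file_name"], format="BGR")
    transform_list = [
        T.ResizeShortestEdge(short_edge_length=(640, 672, 704), max_size=1333),
        T.RandomFlip(),
    ]
    image, transforms = T.apply_transform_gens(transform_list, image)
    dataset_dict["image"] = torch.as_tensor(image.transpose(2, 0, 1).astype("float32"))
    annos = [
        utils.transform_instance_annotations(obj, transforms, image.shape[:2])
        for obj in dataset_dict.pop("annotations")
    ]
    dataset_dict["instances"] = utils.annotations_to_instances(annos, image.shape[:2])
    return dataset_dict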

@prathamsss

I guess I found one example where it did not have a 'total_loss' in one of the lines. The 3rd up from the bottom. Not sure why.

there must be 2 instances of this total_loss missing, because I still can't get it to work.

See, when you iterate over the generated JSON, most iterations give you the training loss, but the validation entries do not carry it; that's because validation is only run at a certain interval.

try this:

validation_loss = []
train_loss = []
for i in experiment_metrics:
    try:
        validation_loss.append(i['validation_loss'])
        train_loss.append(i['total_loss'])  # note: 'total_loss', not 'validation_loss'
    except KeyError:
        pass

@StevanCakic

StevanCakic commented Oct 19, 2021

Hi I tried your code but after running validation it just hangs and does not run anything else. Please help me. Thank you very much. After a while, an error popped up: RuntimeError: [/opt/conda/conda-bld/pytorch_1587428207430/work/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:136] Timed out waiting 1800000ms for send operation to complete

I have the same issue. My model is running on an HPC (Slurm, 8 GPUs). Code execution gets stuck when evaluation starts:

[10/19 15:07:35 d2.data.datasets.coco]: Loaded 73 images in COCO format from ./data/valid/_annotations.coco.json
[10/19 15:07:35 d2.data.dataset_mapper]: [DatasetMapper] Augmentations used in inference: [ResizeShortestEdge(short_edge_length=(800, 800), max_size=1333, sample_style='choice')]
[10/19 15:07:35 d2.data.common]: Serializing 73 elements to byte tensors and concatenating them all ...
[10/19 15:07:35 d2.data.common]: Serialized dataset takes 0.05 MiB
WARNING [10/19 15:07:35 d2.evaluation.coco_evaluation]: COCO Evaluator instantiated using config, this is deprecated behavior. Please pass in explicit arguments instead.
[10/19 15:07:35 d2.evaluation.evaluator]: Start inference on 10 batches
[10/19 15:07:38 d2.evaluation.evaluator]: Total inference time: 0:00:01.518439 (0.303688 s / iter per device, on 8 devices)
[10/19 15:07:38 d2.evaluation.evaluator]: Total inference pure compute time: 0:00:01 (0.291231 s / iter per device, on 8 devices)

After some time I get an error message: Timed out waiting 1800000ms for send operation to complete
When I start the same code on Google Colab for example, no error messages.

My code:

#!/usr/bin/env python
# Copyright (c) Facebook, Inc. and its affiliates.
"""
A main training script.

This scripts reads a given config file and runs the training or evaluation.
It is an entry point that is made to train standard models in detectron2.

In order to let one script support training of many models,
this script contains logic that are specific to these built-in models and therefore
may not be suitable for your own project.
For example, your research project perhaps only needs a single "evaluator".

Therefore, we recommend you to use detectron2 as an library and take
this file as an example of how to use the library.
You may want to write your own script with your datasets and other customizations.
"""

# import cv2
import glob
import logging
import numpy as np
import os
import ssl
import time
import torch
import pandas as pd
import matplotlib.pyplot as plt

from detectron2 import model_zoo
from detectron2.checkpoint import DetectionCheckpointer
from detectron2.config import get_cfg
from detectron2.data import MetadataCatalog, DatasetMapper, build_detection_test_loader
from detectron2.engine import DefaultPredictor, DefaultTrainer, default_argument_parser, default_setup, hooks, launch
from detectron2.evaluation import (
    CityscapesInstanceEvaluator,
    CityscapesSemSegEvaluator,
    COCOEvaluator,
    COCOPanopticEvaluator,
    DatasetEvaluators,
    LVISEvaluator,
    PascalVOCDetectionEvaluator,
    SemSegEvaluator,
    verify_results,
    inference_context,
    inference_on_dataset
)
from detectron2.modeling import GeneralizedRCNNWithTTA

from detectron2.data.datasets import register_coco_instances
from detectron2.utils.visualizer import Visualizer
import detectron2.utils.comm as comm
from detectron2.utils.logger import log_every_n_seconds
from detectron2.engine.hooks import HookBase
from demo_celije.LossEvalHook import LossEvalHook # 

ssl._create_default_https_context = ssl._create_unverified_context


class Trainer(DefaultTrainer):
    @classmethod
    def build_evaluator(cls, cfg, dataset_name, output_folder=None):
        if output_folder is None:
            output_folder = os.path.join(cfg.OUTPUT_DIR, "inference")
        return COCOEvaluator(dataset_name, cfg, True, output_folder)
    
    
    def build_hooks(self):
        print("68 - BUILD HOOK")
        hooks = super().build_hooks()
        
        hooks.insert(-1,LossEvalHook(
            self.cfg.TEST.EVAL_PERIOD,
            self.model,
            build_detection_test_loader(
                self.cfg,
                self.cfg.DATASETS.TEST[0],
                DatasetMapper(self.cfg,True)
            )
        ))
        return hooks
    
def setup(args):
    """
    Create configs and perform basic setups.
    """
    register_coco_instances("my_dataset_train", {}, "./data/train/_annotations.coco.json", "./data/train")
    register_coco_instances("my_dataset_val", {}, "./data/valid/_annotations.coco.json", "./data/valid")
    register_coco_instances("my_dataset_test", {}, "./data/test/_annotations.coco.json", "./data/test")
    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_X_101_32x8d_FPN_3x.yaml"))
    cfg.DATASETS.TRAIN = ("my_dataset_train",)
    cfg.DATASETS.TEST = ("my_dataset_val",)
    cfg.DATALOADER.NUM_WORKERS = 0  # should try what happens if this is increased
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_X_101_32x8d_FPN_3x.yaml")  # Let training initialize from model zoo
    # Large values lead to overfitting, small values make the algorithm learn slowly
    cfg.SOLVER.IMS_PER_BATCH = 8
    # Optimizer type
    # cfg.SOLVER.OPTIMIZER = 'SGD'
    # Initial value for the LR
    cfg.SOLVER.MAX_ITER = 200 #adjust up if val mAP is still rising, adjust down if overfit
    cfg.SOLVER.BASE_LR = 0.001

    # Update learning rate
    # cfg.SOLVER.LR_POLICY = 'steps_with_decay'
    # cfg.SOLVER.GAMMA = 0.1
    # from 0 to 1000: lr = 0.001 * 0.1 ** 0 = 0.001;
    # from 1001 to 1750: lr = 0.001 * 0.1 ** 1 = 0.0001;
    # and from 1751 to the end: lr = 0.001 * 0.1 ** 2 = 0.00001
    # cfg.SOLVER.WARMUP_ITERS = 100
    # cfg.SOLVER.STEPS = [1000, 1750]  

    
    cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 512 # I still don't quite understand this; it has to do with the neural network
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = 4 # your number of classes + 1 (a default class gets created, bug)
    cfg.TEST.EVAL_PERIOD = 150

    # cfg.merge_from_list(args.opts)
    # cfg.freeze()
    default_setup(cfg, args)
    return cfg


def main(args):
    cfg = setup(args)
    if args.eval_only:
        print("126 - OVO SE IPAK NEKAD IZVRSAVA !!!! EVAL ONLY")
        model = Trainer.build_model(cfg)
        DetectionCheckpointer(model, save_dir=cfg.OUTPUT_DIR).resume_or_load(cfg.MODEL.WEIGHTS, resume=args.resume)
        res = Trainer.test(cfg, model)
        
        if cfg.TEST.AUG.ENABLED:
            print("132 - TEST AUG ENABLED ONLY EVAL")
            res.update(Trainer.test_with_TTA(cfg, model))
        
        if comm.is_main_process():
            verify_results(cfg, res)
        return res
    print("138 - Tacno prije trainer")
    trainer = Trainer(cfg)
    trainer.resume_or_load(resume=args.resume)
    
    if cfg.TEST.AUG.ENABLED:
        print("143 - OVO SE IPAK NEKAD IZVRSAVA !!!! OVO BI PUCALO !!! TEST AUG ENABLED")
        # trainer.register_hooks([hooks.EvalHook(0, lambda: trainer.test_with_TTA(cfg, trainer.model))])
    
    return trainer.train()


if __name__ == "__main__":
    from datetime import datetime
    print(torch.__version__, torch.cuda.is_available())
    args = default_argument_parser().parse_args()
    print("Command Line Args:", args)

    print("Is CUDA available:", torch.cuda.is_available())

    now = datetime.now()

    launch(main, num_gpus_per_machine = args.num_gpus, dist_url = 'auto', args=(args,))
    
    now = datetime.now()

    # Evaluate
    print("Zapoceta evaluacija")
    cfg = setup(args)
    cfg.MODEL.WEIGHTS = os.path.join(cfg.OUTPUT_DIR, "model_final.pth")
    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.8
    predictor = DefaultPredictor(cfg)
    evaluator = COCOEvaluator("my_dataset_test", cfg, False, output_dir="./output/")
    val_loader = build_detection_test_loader(cfg, "my_dataset_test")

    trainer = Trainer(cfg)
    inference_on_dataset(trainer.model, val_loader, evaluator)

    # Generate loss function
    print("Prikaz loss funkcije, dijagram")
    
    metrics_df = pd.read_json("./output/metrics.json", orient="records", lines=True)
    mdf = metrics_df.sort_values("iteration")   
    fig, ax = plt.subplots()

    mdf1 = mdf[~mdf["total_loss"].isna()]
    ax.plot(mdf1["iteration"], mdf1["total_loss"], c="C0", label="train")
    if "validation_loss" in mdf.columns:
        mdf2 = mdf[~mdf["validation_loss"].isna()]
        ax.plot(mdf2["iteration"], mdf2["validation_loss"], c="C1", label="validation")

    # ax.set_ylim([0, 0.5])
    ax.legend()
    ax.set_title("Loss curve")
    print("Prikaz dijagrama !!!")
    fig.savefig("./output/test.png")

@jonnyevans3210

jonnyevans3210 commented Oct 27, 2021

Same issue as above. Note that it runs fine when using only one GPU, but consistently fails in this way when using multiple GPUs. Also note that it runs just fine when the LossEvalHook method is commented out / not overridden. Any help appreciated!

@sei-amellinger

Thanks for posting this! This gist seems to be the go-to for getting validation loss in DT2.

Quick question: Why don't you need to bracket the use of the model ( metrics_dict = self._model(data) ) with "with torch.no_grad():"? Where are the adjustments being made so that the model isn't affected by the validation data?

Thanks!
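
For reference, one way to make this explicit would be to run the forward pass under torch.no_grad(); this is a variation on the gist's _get_loss, not what it currently does:

    def _get_loss(self, data):
        # gradients are not needed for validation, so disable them explicitly
        with torch.no_grad():
            metrics_dict = self._model(data)
        metrics_dict = {
            k: v.detach().cpu().item() if isinstance(v, torch.Tensor) else float(v)
            for k, v in metrics_dict.items()
        }
        return sum(metrics_dict.values())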

@balajihosur

@ortegatron Can you please explain what the cfg file format looks like? Where should cfg.TEST.EVAL_PERIOD be added in the cfg file? Can you provide the format of the config file?

@konrad98ft

konrad98ft commented Jan 13, 2022

Used the MyTrainer above, and after validation started got an error:
RuntimeError: DataLoader worker (pid(s) 13656, 10880, 3784, 12464) exited unexpectedly
How can it be solved?

@MLDeep414

@alexriedel1

Hi,
I have calculated validation_loss; based on it, I am trying to implement early stopping using this code:

class EarlyStopping():
    """
    Early stopping to stop the training when the loss does not improve after
    certain epochs.
    """
    def __init__(self, patience=5, min_delta=0):
        """
        :param patience: how many epochs to wait before stopping when loss is
               not improving
        :param min_delta: minimum difference between new loss and old loss for
               new loss to be considered as an improvement
        """
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.best_loss = None
        self.early_stop = False

    def __call__(self, val_loss):
        if self.best_loss is None:
            self.best_loss = val_loss
        elif self.best_loss - val_loss > self.min_delta:
            self.best_loss = val_loss
            # reset counter if validation loss improves
            self.counter = 0
        elif self.best_loss - val_loss < self.min_delta:
            self.counter += 1
            print(f"INFO: Early stopping counter {self.counter} of {self.patience}")
            if self.counter >= self.patience:
                print('INFO: Early stopping')
                self.early_stop = True
but I could not get it to work properly. Could you please help me?

@alexriedel1

@MLDeep414 Please write your code properly formatted and show what actual problem you're running into in order to get help.

@alexriedel1

alexriedel1 commented Feb 15, 2022

@MLDeep414 Sorry, but the formatting is even worse this time, as some of the code is formatted and some is not :(
Do you get any error? Is the hook not executed? What's the problem here?

@alexriedel1

https://detectron2.readthedocs.io/en/latest/modules/engine.html#detectron2.engine.HookBase

The hook, as a child of HookBase, will be called as stated in the docs above.

You have to implement the method after_step if you want to check for early stopping after each step (which is probably too often, so check only at a reasonable interval inside your method!). Inside your hook you can access the trainer object to get the necessary information about your training state.
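
A minimal sketch of such a hook, assuming the validation_loss scalar written by LossEvalHook is available in trainer.storage; the patience logic and the stop-by-exception mechanism are illustrative assumptions, not detectron2 API:

from detectron2.engine.hooks import HookBase

class EarlyStoppingSignal(Exception):
    pass

class EarlyStoppingHook(HookBase):
    def __init__(self, eval_period, patience=5, min_delta=0.0):
        self._period = eval_period
        self._patience = patience
        self._min_delta = min_delta
        self._best = None
        self._counter = 0

    def after_step(self):
        next_iter = self.trainer.iter + 1
        if self._period <= 0 or next_iter % self._period != 0:
            return
        latest = self.trainer.storage.latest().get("validation_loss")
        if latest is None:
            return  # LossEvalHook has not written a value yet
        val_loss = latest[0]  # latest() maps name -> (value, iteration)
        if self._best is None or self._best - val_loss > self._min_delta:
            self._best = val_loss
            self._counter = 0
        else:
            self._counter += 1
            if self._counter >= self._patience:
                raise EarlyStoppingSignal("validation loss stopped improving")

# register it after LossEvalHook, then wrap training:
# try:
#     trainer.train()
# except EarlyStoppingSignal:
#     print("Early stopping")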

@pieterblok

A while ago I created a similar hook for validation: not based on the loss but on the validation performance (which in some contexts makes sense, as we also evaluate the test set on mAP). Early stopping would have similar functionality, although it actually stops the training, while my method continues training but saves the best model automatically.

You can find the procedure here:
https://github.com/pieterblok/maskal/blob/5e1b1e9b6c14a423b22d3218da66120cbb0b7f7c/maskAL.py#L319

It might be of interest…

@veer5551

Hi,
Thanks for the amazing Work!
I included the above LossEvalHook in the training and found that the extra workers that are created (for the data loaders?) are not exiting and stay locked. This makes the code run out of resources and the machine stops after a while!

Using Multi-GPU training.
Here is the issue raised on the detectron2 repo: facebookresearch/detectron2#3953

Any thoughts on this, guys? I am blocked from doing any training :(

Thanks!

@EmmaVanPuyenbr

EmmaVanPuyenbr commented Apr 6, 2022

hi,
Thanks for the nice code.
I was wondering how you were able to store the logging under a customized name? Or where did you first define the name of the model in

experiment_folder = './output/model_iter4000_lr0005_wf1_date2020_03_20__05_16_45' 

and also the /metrics.json name? Can you change the filename?

thanks!

@hikmatkhan

Hi
What is this line doing?
self.trainer.storage.put_scalars(timetest=12)

@daeyeoplee

I put all of this into my .py file.

AssertionError:

This error keeps occurring with no other message.

How can I handle this problem? Thank you.

@xxxming730

In my metrics.json, i get this:
{"bbox/AP": 0.0031121327534856355, "bbox/AP-background": NaN, "bbox/AP-ck": 0.0275027502750275, "bbox/AP-fml": 0.0, "bbox/AP-fmm": 0.0, "bbox/AP-gh": 0.0, "bbox/AP-gpe": 0.0, "bbox/AP-gpr": 0.0005064445063432175, "bbox/AP-s": 0.0, "bbox/AP-sc": 0.0, "bbox/AP-ss": 0.0, "bbox/AP50": 0.00639308034241901, "bbox/AP75": 0.0, "bbox/APl": 0.00016924769400016924, "bbox/APm": 0.0, "bbox/APs": 0.03000300030003, "iteration": 150, "segm/AP": 0.001222344456667889, "segm/AP-background": NaN, "segm/AP-ck": 0.011001100110011, "segm/AP-fml": 0.0, "segm/AP-fmm": 0.0, "segm/AP-gh": 0.0, "segm/AP-gpe": 0.0, "segm/AP-gpr": 0.0, "segm/AP-s": 0.0, "segm/AP-sc": 0.0, "segm/AP-ss": 0.0, "segm/AP50": 0.006111722283339444, "segm/AP75": 0.0, "segm/APl": 0.0, "segm/APm": 0.0, "segm/APs": 0.0088008800880088}
{"data_time": 1.5641630000000077, "eta_seconds": 19170.847725999975, "fast_rcnn/cls_accuracy": 0.98468017578125, "fast_rcnn/false_negative": 1.0, "fast_rcnn/fg_cls_accuracy": 0.0, "iteration": 179, "loss_box_reg": 0.041649249847978354, "loss_cls": 0.10573109425604343, "loss_mask": 0.6462322175502777, "loss_rpn_cls": 0.13051373744383454, "loss_rpn_loc": 0.05502640060149133, "lr": 4.4955249999999996e-05, "mask_rcnn/accuracy": 0.7032937057778648, "mask_rcnn/false_negative": 0.050562366590145194, "mask_rcnn/false_positive": 0.6311390203732659, "roi_head/num_bg_samples": 504.15625, "roi_head/num_fg_samples": 7.84375, "rpn/num_neg_anchors": 236.09375, "rpn/num_pos_anchors": 19.90625, "time": 3.7675802499999236, "total_loss": 1.0010524874087423}
{"iteration": 178, "timetest": 12.0}
...

there is no 'validation_loss'

However, the 'validation_loss' is output in the console

and I also have this question:

Hi What is this line doing? self.trainer.storage.put_scalars(timetest=12)

If you have this problem, we can discuss it together. If anyone knows why, please let me know. Thank you very much. Hope everyone is OK

@xxxming730

A single GPU is fine: we can output validation_loss and record it. But when training on multiple GPUs, the related metrics.json output is wrong, as shown above, and we have no idea what is being written. Has anyone solved this problem? Thanks a lot.

@Scarlet3101

In my metrics.json, i get this: [...] there is no 'validation_loss'. However, the 'validation_loss' is output in the console. [...] If you have this problem, we can discuss it together. If anyone knows why, please let me know.

Hi! I have the same problem when I set eval_period to 100, but if I set it to 50 then everything works correctly and the validation_loss result is written to metrics.json.
