@Ammar-Azman
Last active September 19, 2023 01:09

Title: Unsolved issue: training YOLOv5 on SageMaker all night

Datetime: 1/6/2023, 10:25 am

Observation

  • What I want is to train the model all night on AWS. By running all night, I mean that I shut down my laptop, let the training run in the cloud, and check the next morning whether it is still running or has completed and produced the .pt file.

Issues

  1. Cannot close the browser and let the training continue
  2. The instance is assumed to be idle even though a training process is running

Actions

  • I have tried 2 ways to solve these issues:

    1. Running directly in the notebook.
      • Status: Failed. The training stopped after I closed the JupyterLab browser
    2. Running in a terminal with tmux
      • Status: Failed
      • I created a session and started the training. Then I detached.
      • I closed the browser and the training kept running. This solves the first issue.
      • However, after several epochs (monitored on ClearML), the training failed. I presume SageMaker assumed that the instance was idle.
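
The tmux attempt above could be sketched roughly as follows. This is only an illustration, not the exact commands that were run: the session name `yolo_train`, the dataset file `data.yaml`, and the epoch count are placeholder assumptions, and the helper `build_tmux_command` is hypothetical. It only builds and prints the command, so nothing is launched.

```python
import shlex
from typing import List


def build_tmux_command(session: str, train_cmd: List[str]) -> List[str]:
    """Build a tmux invocation that starts train_cmd in a detached session.

    `tmux new-session -d -s <name> <shell-command>` creates the session
    without attaching, so the process keeps running after the browser
    terminal is closed.
    """
    return ["tmux", "new-session", "-d", "-s", session, shlex.join(train_cmd)]


if __name__ == "__main__":
    # Placeholder YOLOv5 training command (assumed values, adjust as needed).
    train_cmd = [
        "python", "train.py",
        "--data", "data.yaml",
        "--weights", "yolov5s.pt",
        "--epochs", "100",
    ]
    print(shlex.join(build_tmux_command("yolo_train", train_cmd)))
    # Reattach later with: tmux attach -t yolo_train
```

Note that this only addresses the first issue (closing the browser). It does nothing about the instance being treated as idle.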

Solution

  • The solution is quite interesting. (19/09/2023) You don't need all-night training to solve this. The last-checkpoint callback feature exists to solve exactly this problem: you can continue training from the last checkpoint's weights until the required accuracy or other metrics are reached.
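
As a minimal sketch of resuming from the last checkpoint, assuming YOLOv5's default output layout (`runs/train/<exp>/weights/last.pt`) and its `--resume` flag on `train.py`: the helper names `find_latest_checkpoint` and `build_resume_command` are hypothetical and only illustrate the idea. The script prints the command rather than running it.

```python
from pathlib import Path
from typing import List, Optional


def find_latest_checkpoint(runs_dir: str = "runs/train") -> Optional[Path]:
    """Return the most recently modified last.pt under runs_dir, or None.

    Assumes the default YOLOv5 layout: <runs_dir>/<exp>/weights/last.pt.
    """
    candidates = list(Path(runs_dir).glob("*/weights/last.pt"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def build_resume_command(checkpoint: Path) -> List[str]:
    """Build the command that continues training from the given checkpoint."""
    return ["python", "train.py", "--resume", str(checkpoint)]


if __name__ == "__main__":
    ckpt = find_latest_checkpoint()
    if ckpt is None:
        print("No last.pt found; start a fresh training run first.")
    else:
        print("Resume with:", " ".join(build_resume_command(ckpt)))
```

With this approach, an interrupted session costs at most the epochs since the last saved checkpoint, so the training can be spread over several shorter sessions.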