Datetime: 1/6/2023, 10:25 am
- What I want is to train the model overnight on AWS. By overnight, I mean that I shut down my laptop, let the training run in the cloud, and check on it the next morning: either it is still running, or the training has completed and I get the `.pt` file.
- I cannot close the browser and let the training continue.
- The instance is assumed to be idle even when a training process is running.
- I have tried 2 ways to solve these issues:
- Running directly on Notebook.
- Status: Failed. The training stopped running after I closed the JupyterLab browser.
- Running in the terminal with `tmux`
- Status: Failed
- I created a session and started the training. Then I detached.
- I closed the browser and the training kept running. This solves the first issue.
- However, after several epochs (monitored on ClearML), the training failed. I presume that SageMaker treated the instance as idle.
- Running directly on Notebook.
- The solution is quite interesting. (19/09/2023) You don't need all-night training to solve this. The last-checkpoint callback feature exists to solve exactly this problem: you can continue training from the `last_checkpoint` weights until the required accuracy or metric is reached.
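
The resume-from-last-checkpoint idea can be sketched roughly as follows. This is a minimal, framework-agnostic illustration and not the actual training script: `train_one_epoch`, the checkpoint file name, and the state dictionary layout are all hypothetical, and `pickle` stands in for whatever the framework provides (in PyTorch, `torch.save` and `torch.load` on the model's `state_dict()` would play that role).

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the checkpoint to a temp file, then rename it into place,
    so an interrupted write cannot leave a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"epoch": 0, "weights": 0, "best_metric": 0.0}


def train_one_epoch(weights):
    """Hypothetical stand-in for one real training epoch.
    Returns the updated 'weights' and a fake accuracy that grows by 0.1."""
    weights += 1
    metric = min(1.0, weights / 10)
    return weights, metric


def train(path, max_epochs, target_metric):
    """Resume from the last checkpoint and train until either the epoch
    budget is used up or the target metric is reached."""
    state = load_checkpoint(path)
    while state["epoch"] < max_epochs and state["best_metric"] < target_metric:
        weights, metric = train_one_epoch(state["weights"])
        state = {
            "epoch": state["epoch"] + 1,
            "weights": weights,
            "best_metric": max(state["best_metric"], metric),
        }
        # Save after every epoch so a stopped instance loses at most one epoch.
        save_checkpoint(path, state)
    return state
```

Calling `train("last_checkpoint.pkl", max_epochs=3, target_metric=0.95)` and later calling it again with a larger `max_epochs` continues from epoch 3 rather than starting over, so each session can be short and an interrupted session costs at most one epoch.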