-
Notifications
You must be signed in to change notification settings - Fork 71
allow set base_dir for checkpoint and logger #633
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
hey @rakkit , thanks for the PR :)! Can you help me understand a bit better whats the annoying part about it? I would like to learn more about the UX Regarding the PR, I believe that it can be greatly simplified. The extra args in wandb are automatically passed to the wandb.init. So you can just do something like this: perhaps the "path/to/my/folder" could be some exp_folder at the top of the config, shared by the ckpt and metric logging? e.g. Lets align on what a single config should look like, before we update all of them, to save you some time Regarding printing the config, thats a good callout. We added it to grpo, but missed for sft. Can you submit it as a separate PR? |
| if self.current_step != 0: | ||
| # should skip load if current_step is 0 | ||
| self.checkpointer.load(step=self.current_step) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
help me understand whats going on. Why would current_step be 0 if we are loading a checkpoint? does this ever happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it might be related to this #631
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a strange behaviour. If we take this PR (to set the dump folder and base_folder for checkpointer), thenself.checkpointer.load(step=0) will believe there is an existing checkpoint and gonna fail if it cannot load ckpt.
Here step!=0 is a temporary patch and i agree we should handle the checkpoint load properly.
|
Thanks a lot for the feedback, @felipemello1 — I’ll refactor this soon. By “annoying,” I meant that it always dumps checkpoints and W&B files into the current working directory. A more organized approach (like in TorchTitan) would be to allow an P.S. Many HPC clusters don’t have internet access on compute nodes, so users often need to manually sync W&B after training finishes. Keeping W&B and other run artifacts in a dedicated folder can really save people’s lives here. |
There is a minor problem. The checkpointer If we accept the extra configs {job} in YAML, then it is not very compatible with exp_folder: ./
metric_logging:
wandb:
dump_folder: {exp_folder}/wandbcause we need to do like exp_folder: ./
metric_logging:
wandb:
dump_folder: {job}->{dump_folder}->{name}same for checkpoint |
|
I would need to think a bit about it, but there should be some elegant way of doing it. If its not urgent, let me back to this later this week. Happy to review if you find a good solution meanwhile. |
The current checkpoint and logger always put ckpt and logger to the "current" folder, which is super annoying.
This pr can fix the wandb parts, the wandb will be put to
dump_folder/wandband another pr in torchtitan can fix the ckpt and put ckpt to
dump_folder/checkpointAlso, add a condition of checkpoint load (skip load if step is 0).