Conversation

@rakkit commented Dec 9, 2025

The current checkpointer and metric logger always write their output to the current working directory, which is super annoying.

This PR fixes the wandb part: wandb files will be written to dump_folder/wandb.

A companion PR in torchtitan fixes the checkpointer, so checkpoints go to dump_folder/checkpoint.

This PR also adds a condition to checkpoint loading (skip the load if step is 0).

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Dec 9, 2025
@felipemello1 (Contributor) commented Dec 9, 2025

hey @rakkit, thanks for the PR :)!

Can you help me understand a bit better what's annoying about it? I would like to learn more about the UX.

Regarding the PR, I believe it can be greatly simplified. The extra args in the wandb config are automatically passed to wandb.init, so you can just do something like this:

metric_logging:
  wandb:
    project: grpo-training
    group: grpo_exp_${oc.env:USER}
    dump_folder: path/to/my/folder # ---> ADD IT HERE
    logging_mode: global_reduce # global_reduce, per_rank_reduce, per_rank_no_reduce
  console:
    logging_mode: global_reduce

perhaps the "path/to/my/folder" could be some exp_folder at the top of the config, shared by the ckpt and metric logging? e.g.

exp_folder: ./

metric_logging:
  wandb:
    dump_folder: ${exp_folder}/wandb

Let's align on what a single config should look like before we update all of them, to save you some time.

Regarding printing the config, that's a good callout. We added it to GRPO but missed it for SFT. Can you submit that as a separate PR?
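For reference, the ${oc.env:USER} syntax in the config above is OmegaConf interpolation, and a top-level exp_folder could be referenced the same way, as ${exp_folder}. A toy stand-in for that resolution (the real configs would rely on OmegaConf itself; this helper is for illustration only):

```python
import re

def interpolate(value: str, root: dict) -> str:
    """Resolve ${dotted.key} references against a nested config dict.

    A toy stand-in for OmegaConf-style interpolation; resolvers like
    ${oc.env:USER} are not handled here."""
    def lookup(match):
        node = root
        for part in match.group(1).split("."):
            node = node[part]
        return str(node)
    return re.sub(r"\$\{([^}]+)\}", lookup, value)

cfg = {
    "exp_folder": "./exp_001",
    "metric_logging": {"wandb": {"dump_folder": "${exp_folder}/wandb"}},
}
print(interpolate(cfg["metric_logging"]["wandb"]["dump_folder"], cfg))
# ./exp_001/wandb
```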

Comment on lines +150 to +152
if self.current_step != 0:
    # should skip load if current_step is 0
    self.checkpointer.load(step=self.current_step)
Contributor:

Help me understand what's going on. Why would current_step be 0 if we are loading a checkpoint? Does this ever happen?

Contributor:

it might be related to this #631

Author:

This is strange behaviour. If we take this PR (which sets the dump folder and base_folder for the checkpointer), then self.checkpointer.load(step=0) will believe there is an existing checkpoint and will fail when it cannot load one.

The step != 0 check here is a temporary patch, and I agree we should handle checkpoint loading properly.
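One less hacky direction would be to check for the checkpoint on disk instead of special-casing step 0. A sketch only: maybe_load_checkpoint and the step-&lt;n&gt; directory layout are assumptions for illustration, not torchtitan's actual API.

```python
import os

def maybe_load_checkpoint(checkpointer, step: int, base_dir: str) -> bool:
    """Load a checkpoint only when one for `step` actually exists on disk.

    The step-<n> folder layout is an assumption for illustration."""
    ckpt_dir = os.path.join(base_dir, f"step-{step}")
    if step == 0 or not os.path.isdir(ckpt_dir):
        return False  # fresh run, or nothing to resume from
    checkpointer.load(step=step)
    return True
```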

@rakkit (Author) commented Dec 9, 2025

Thanks a lot for the feedback, @felipemello1 — I’ll refactor this soon.

By “annoying,” I meant that it always dumps checkpoints and W&B files into the current working directory. A more organized approach (like in TorchTitan) would be to allow an exp_name, and then store training configs, checkpoints, W&B logs/profiling, etc. under that exp_name folder. That would make experiments way easier to keep tidy.

P.S. Many HPC clusters don’t have internet access on compute nodes, so users often need to manually sync W&B after training finishes. Keeping W&B and other run artifacts in a dedicated folder can really save people’s lives here.

@rakkit (Author) commented Dec 9, 2025

> perhaps the "path/to/my/folder" could be some exp_folder at the top of the config, shared by the ckpt and metric logging? e.g.
>
> exp_folder: ./
>
> metric_logging:
>   wandb:
>     dump_folder: ${exp_folder}/wandb
>
> Let's align on what a single config should look like before we update all of them, to save you some time.
>
> Regarding printing the config, that's a good callout. We added it to GRPO but missed it for SFT. Can you submit that as a separate PR?

There is a minor problem. The checkpointer base_dir is set by torchtitan, which reads the key job_config.job.dump_folder. Having only this exp_folder is clean in YAML, but we would need code in each task to append it and pass it through to torchtitan.

If we accept the extra {job} config in YAML, then it is not very compatible with

exp_folder: ./

metric_logging:
  wandb:
    dump_folder: ${exp_folder}/wandb

because we would need something like

exp_folder: ./

metric_logging:
  wandb:
    dump_folder: {job}->{dump_folder}->{name}

and the same for checkpoint.
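Since torchtitan already reads job_config.job.dump_folder, one option is to derive every artifact path from that single key on the Python side, so the YAML never has to spell out the nesting. A sketch only: derive_artifact_dirs is a hypothetical helper, not existing code; the subfolder names are the ones discussed in this PR.

```python
import os

def derive_artifact_dirs(job_cfg: dict) -> dict:
    """Derive per-artifact output dirs from the job.dump_folder key.

    Hypothetical helper; subfolder names (wandb, checkpoint) match the
    layout discussed in this PR."""
    base = job_cfg["job"]["dump_folder"]
    return {
        "wandb": os.path.join(base, "wandb"),
        "checkpoint": os.path.join(base, "checkpoint"),
    }

print(derive_artifact_dirs({"job": {"dump_folder": "./exp_001"}}))
```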

@felipemello1 (Contributor) commented

I would need to think a bit about it, but there should be some elegant way of doing it. If it's not urgent, let me get back to this later this week. Happy to review if you find a good solution in the meantime.

@felipemello1 felipemello1 self-assigned this Dec 10, 2025