Conversation

@rakkit commented Dec 9, 2025

The current checkpointer and metric logger always write their output to the current working directory, which is super annoying.

This PR fixes the wandb part: wandb files will be written to dump_folder/wandb.

A companion PR in torchtitan fixes the checkpointer, so checkpoints go to dump_folder/checkpoint.

This PR also adds a condition to checkpoint loading (skip the load if step is 0).

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Dec 9, 2025
@felipemello1 (Contributor) commented Dec 9, 2025

hey @rakkit, thanks for the PR :)!

Can you help me understand a bit better what's annoying about it? I would like to learn more about the UX.

Regarding the PR, I believe it can be greatly simplified. The extra args in the wandb config are automatically passed to wandb.init, so you can just do something like this:

metric_logging:
  wandb:
    project: grpo-training
    group: grpo_exp_${oc.env:USER}
    dump_folder: path/to/my/folder # ---> ADD IT HERE
    logging_mode: global_reduce # global_reduce, per_rank_reduce, per_rank_no_reduce
  console:
    logging_mode: global_reduce

perhaps the "path/to/my/folder" could be some exp_folder at the top of the config, shared by the ckpt and metric logging? e.g.

exp_folder: ./

metric_logging:
  wandb:
    dump_folder: ${exp_folder}/wandb

Let's align on what a single config should look like before we update all of them, to save you some time.

Regarding printing the config, that's a good callout. We added it to GRPO but missed it for SFT. Can you submit that as a separate PR?
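For reference, the ${oc.env:USER} syntax in the config above is OmegaConf interpolation, and a top-level exp_folder could be referenced the same way, as ${exp_folder}. A toy stand-in for that resolution (the real configs would rely on OmegaConf itself; this helper is for illustration only):

```python
import re

def interpolate(value: str, root: dict) -> str:
    """Resolve ${dotted.key} references against a nested config dict.

    A toy stand-in for OmegaConf-style interpolation; resolvers like
    ${oc.env:USER} are not handled here."""
    def lookup(match):
        node = root
        for part in match.group(1).split("."):
            node = node[part]
        return str(node)
    return re.sub(r"\$\{([^}]+)\}", lookup, value)

cfg = {
    "exp_folder": "./exp_001",
    "metric_logging": {"wandb": {"dump_folder": "${exp_folder}/wandb"}},
}
print(interpolate(cfg["metric_logging"]["wandb"]["dump_folder"], cfg))
# ./exp_001/wandb
```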

Comment on lines +150 to +152
if self.current_step != 0:
    # should skip load if current_step is 0
    self.checkpointer.load(step=self.current_step)
Contributor:

Help me understand what's going on. Why would current_step be 0 if we are loading a checkpoint? Does this ever happen?

Contributor:

it might be related to this #631

Author:

This is strange behaviour. If we take this PR (which sets the dump folder and base_folder for the checkpointer), then self.checkpointer.load(step=0) will believe there is an existing checkpoint and will fail when it cannot load one.

The step != 0 check here is a temporary patch, and I agree we should handle checkpoint loading properly.
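One less hacky direction would be to check for the checkpoint on disk instead of special-casing step 0. A sketch only: maybe_load_checkpoint and the step-&lt;n&gt; directory layout are assumptions for illustration, not torchtitan's actual API.

```python
import os

def maybe_load_checkpoint(checkpointer, step: int, base_dir: str) -> bool:
    """Load a checkpoint only when one for `step` actually exists on disk.

    The step-<n> folder layout is an assumption for illustration."""
    ckpt_dir = os.path.join(base_dir, f"step-{step}")
    if step == 0 or not os.path.isdir(ckpt_dir):
        return False  # fresh run, or nothing to resume from
    checkpointer.load(step=step)
    return True
```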

@rakkit (Author) commented Dec 9, 2025

Thanks a lot for the feedback, @felipemello1 — I’ll refactor this soon.

By “annoying,” I meant that it always dumps checkpoints and W&B files into the current working directory. A more organized approach (like in TorchTitan) would be to allow an exp_name, and then store training configs, checkpoints, W&B logs/profiling, etc. under that exp_name folder. That would make experiments way easier to keep tidy.

P.S. Many HPC clusters don’t have internet access on compute nodes, so users often need to manually sync W&B after training finishes. Keeping W&B and other run artifacts in a dedicated folder can really save people’s lives here.

@rakkit (Author) commented Dec 9, 2025

> perhaps the "path/to/my/folder" could be some exp_folder at the top of the config, shared by the ckpt and metric logging? e.g.
>
> exp_folder: ./
>
> metric_logging:
>   wandb:
>     dump_folder: ${exp_folder}/wandb
>
> Let's align on what a single config should look like before we update all of them, to save you some time.
>
> Regarding printing the config, that's a good callout. We added it to GRPO but missed it for SFT. Can you submit that as a separate PR?

There is a minor problem. The checkpointer base_dir is set by torchtitan, which reads the key job_config.job.dump_folder. Having only this exp_folder is clean in YAML, but we would need code in each task to append it and pass it through to torchtitan.

If we accept the extra {job} config in YAML, then it is not very compatible with

exp_folder: ./

metric_logging:
  wandb:
    dump_folder: ${exp_folder}/wandb

because we would need something like

exp_folder: ./

metric_logging:
  wandb:
    dump_folder: {job}->{dump_folder}->{name}

and the same for checkpoint.
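Since torchtitan already reads job_config.job.dump_folder, one option is to derive every artifact path from that single key on the Python side, so the YAML never has to spell out the nesting. A sketch only: derive_artifact_dirs is a hypothetical helper, not existing code; the subfolder names are the ones discussed in this PR.

```python
import os

def derive_artifact_dirs(job_cfg: dict) -> dict:
    """Derive per-artifact output dirs from the job.dump_folder key.

    Hypothetical helper; subfolder names (wandb, checkpoint) match the
    layout discussed in this PR."""
    base = job_cfg["job"]["dump_folder"]
    return {
        "wandb": os.path.join(base, "wandb"),
        "checkpoint": os.path.join(base, "checkpoint"),
    }

print(derive_artifact_dirs({"job": {"dump_folder": "./exp_001"}}))
```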

@felipemello1 (Contributor) commented

I would need to think a bit about it, but there should be some elegant way of doing it. If it's not urgent, let me get back to this later this week. Happy to review if you find a good solution in the meantime.

@felipemello1 felipemello1 self-assigned this Dec 10, 2025