Contributor

@JenniferWang JenniferWang commented Dec 9, 2025

Summary:

Initial commit for launching Qwen 30b MOE on MAST

Test Plan:

https://fburl.com/msl/hxkn8xb0
Currently hitting an RDMA crash (the run gets past checkpoint loading)

@meta-cla meta-cla bot added the CLA Signed label Dec 9, 2025
Comment on lines +124 to +126
comm:
  # Reference model also loads from checkpoint, so set a higher timeout
  init_timeout_seconds: 1800
Contributor Author

@daniellepintz the main catch is that the checkpoint is loaded in two places, so we need to raise the timeout in both.
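
A minimal sketch of how one might check that every checkpoint-loading section carries the raised timeout. It walks a nested config and reports any `init_timeout_seconds` below a threshold. The thread does not say which two sections are involved, so the `trainer` section name and the nesting under `ref_model` below are assumptions for illustration only; `find_timeouts` and `too_short` are hypothetical helpers, not part of the codebase.

```python
from typing import Any, Iterator, List, Tuple


def find_timeouts(cfg: Any, path: Tuple[str, ...] = ()) -> Iterator[Tuple[str, int]]:
    """Yield (dotted_path, value) for every `init_timeout_seconds` key in a nested dict."""
    if isinstance(cfg, dict):
        for key, value in cfg.items():
            if key == "init_timeout_seconds":
                yield ".".join(path + (key,)), value
            else:
                # Recurse into nested sections; non-dict values are ignored.
                yield from find_timeouts(value, path + (key,))


def too_short(cfg: Any, minimum: int = 1800) -> List[str]:
    """Return the dotted paths whose timeout is below `minimum` seconds."""
    return [p for p, v in find_timeouts(cfg) if v < minimum]


# Hypothetical layout: only `comm.init_timeout_seconds: 1800` and the
# `ref_model` section are taken from this PR; everything else is assumed.
config = {
    "trainer": {"comm": {"init_timeout_seconds": 1800}},    # assumed section name
    "ref_model": {"comm": {"init_timeout_seconds": 1800}},  # assumed nesting
}

if __name__ == "__main__":
    print(too_short(config))
```

An empty list means every timeout found is at least the threshold; a section that omits the key entirely is not flagged by this sketch.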

ref_model:
  procs: 4
  num_replicas: 1
  hosts: 1
Contributor Author

I also added a designated host for the ref model.

Contributor

@felipemello1 felipemello1 left a comment

approving to unblock, but it seems that Danielle had some comments
