Pulse · NVIDIA/Megatron-LM · GitHub

September 12, 2024 – September 19, 2024

Overview

1 Active pull request
- 0 Merged pull requests
- 1 Open pull request

18 Active issues
- 9 Closed issues
- 9 New issues

1 Pull request opened by 1 person

Fix typo lobal_smoothing -> label_smoothing
#1137 opened Sep 13, 2024

9 Issues closed by 4 people

[BUG] offset mismatched in gpt_dataset.py _query_document_sample_shuffle_indices
#1145 closed Sep 19, 2024
[BUG] wrong loss scaling when context parallel is on
#906 closed Sep 19, 2024
[BUG] Function IndexPutBackward0 returned an invalid gradient at index.
#655 closed Sep 18, 2024
[ENHANCEMENT] Asynchronously save the checkpoint to storage with minimizing the paused time of training.
#651 closed Sep 18, 2024
[BUG] Docker Build Fails at `pip install megatron-core==0.4.0`
#650 closed Sep 18, 2024
[BUG] Invalid Link for examples script in README
#1063 closed Sep 18, 2024
[QUESTION] The Reason for calling torch.cuda.synchronize() in func recv_from_prev_pipeline_rank_/send_to_next_pipeline_rank
#1144 closed Sep 18, 2024
TikTokenizer tiktoken-pattern v1 and v2
#1147 closed Sep 18, 2024
[QUESTION] For DDP, why map parameter's main_grad to grad buffer instead of grad?
#690 closed Sep 18, 2024

9 Issues opened by 7 people

[BUG] Context parallel gives NCCL error
#1151 opened Sep 19, 2024
[QUESTION] Adding a new parameter in ColumnParallelLinear/RowParallelLinear raises Error
#1150 opened Sep 19, 2024
[QUESTION]NCCL timeout error when running the second iteration
#1142 opened Sep 13, 2024
[QUESTION]NCCL timeout error when the second iteration
#1141 opened Sep 13, 2024
[QUESTION] NCCL timeout error when the second interation
#1140 opened Sep 13, 2024
[QUESTION] Why does GPTDataset not directly cache all samples document_index and sample_index, and then construct different shuffle_index for different parameters?
#1139 opened Sep 13, 2024
[BUG] Learning rate not overrided when set `--override-opt_param-scheduler`
#1138 opened Sep 13, 2024
[ENHANCEMENT]Is Megatron planning to use flux technology？Integrating communication and gemm into one operator to improve overlap rate
#1136 opened Sep 13, 2024
[ENHANCEMENT] Preprocessing data that is already partitioned and gzipped
#1135 opened Sep 13, 2024

11 Unresolved conversations

Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.

what's the biggest dataset you've tried?
#930 commented on Sep 13, 2024 • 0 new comments
[QUESTION] vicuna-7b-v1.5 weight conversion from huggingface to megatron-lm format
#773 commented on Sep 13, 2024 • 0 new comments
[BUG] Resource Leak When Profile Parameter is Enabled
#932 commented on Sep 14, 2024 • 0 new comments
[BUG] Unnecessary initialization for router in megatron-core
#915 commented on Sep 14, 2024 • 0 new comments
[core dataset compilation error]
#807 commented on Sep 15, 2024 • 0 new comments
[QUESTION] Training Mixtral 8x7B on 16 x H100 only achieves low throughput of 130 TFLOPS
#756 commented on Sep 15, 2024 • 0 new comments
When can we have a the MOE checkpoint convert script.
#790 commented on Sep 16, 2024 • 0 new comments
[BUG] GPTDataset._build_document_sample_shuffle_indices does not build the indices on non-root nodes when not using NFS
#907 commented on Sep 17, 2024 • 0 new comments
[BUG]"Unexpected key(s) in state_dict" while loading Llama-megatron checkpoint.
#1132 commented on Sep 18, 2024 • 0 new comments
[BUG] 'NoneType' object has no attribute 'shape' error raised when saving model state with the pretrain_gpt.py
#1134 commented on Sep 19, 2024 • 0 new comments
Fix shape of qk_layernorm.
#1130 commented on Sep 14, 2024 • 0 new comments