```
$ docker run -ti --gpus all --rm palm torchrun --nproc_per_node 1 train.py --from_torch --config CONFIG_FILE.py
Traceback (most recent call last):
  File "train.py", line 18, in <module>
    from colossalai.utils import colo_set_process_memory_fraction, colo_device_memory_capacity
ImportError: cannot import name 'colo_set_process_memory_fraction' from 'colossalai.utils' (/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/colossalai/utils/__init__.py)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 11) of binary: /root/miniconda3/envs/pytorch/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/pytorch/bin/torchrun", line 33, in <module>
```
On a multi-GPU A100 system:

```
$ cat CONFIG_FILE.py
from colossalai.amp import AMP_TYPE

SEQ_LENGTH = 512
BATCH_SIZE = 8
NUM_EPOCHS = 10
WARMUP_EPOCHS = 1

parallel = dict()

model = dict()

fp16 = dict(mode=AMP_TYPE.NAIVE)
clip_grad_norm = 1.0
```

```
$ export DATA=wiki_dataset/
$ export TOKENIZER=tokenizer/
```
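For context, `parallel` and `model` are left empty above, so the run falls back to ColossalAI's defaults (plain data parallelism). A hypothetical filled-in variant for a multi-GPU box, written in ColossalAI's legacy config convention, might look like the sketch below; the pipeline and tensor sizes are assumptions for illustration, not values from this report:

```python
# CONFIG_FILE.py -- hypothetical multi-GPU layout, not from the report.
from colossalai.amp import AMP_TYPE

SEQ_LENGTH = 512
BATCH_SIZE = 8
NUM_EPOCHS = 10
WARMUP_EPOCHS = 1

# 2 pipeline stages x 2-way 1D tensor parallelism -> 4 processes total,
# i.e. torchrun --nproc_per_node 4 instead of 1.
parallel = dict(
    pipeline=2,
    tensor=dict(size=2, mode="1d"),
)
model = dict()  # model hyperparameters left to the repo's defaults

fp16 = dict(mode=AMP_TYPE.NAIVE)
clip_grad_norm = 1.0
```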
```
$ docker run -ti --gpus all --rm palm torchrun --nproc_per_node 1 train.py --from_torch --config CONFIG_FILE.py
Traceback (most recent call last):
  File "train.py", line 18, in <module>
    from colossalai.utils import colo_set_process_memory_fraction, colo_device_memory_capacity
ImportError: cannot import name 'colo_set_process_memory_fraction' from 'colossalai.utils' (/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/colossalai/utils/__init__.py)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 11) of binary: /root/miniconda3/envs/pytorch/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/pytorch/bin/torchrun", line 33, in <module>
  File "/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
  File "/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
  File "/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
  File "/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
  File "/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-05-29_08:01:15
  host      : 1d3306a6abee
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 11)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
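Both runs hit the same ImportError: the installed ColossalAI release no longer exports `colo_set_process_memory_fraction` and `colo_device_memory_capacity` from `colossalai.utils`. Until the script and the library agree on a version, a shim along these lines could keep `train.py` importable; the PyTorch calls below are real APIs, but treating them as drop-in equivalents of the removed helpers is an assumption:

```python
# Hedged compatibility shim for the top of train.py -- a sketch, not the
# project's official fix.
try:
    from colossalai.utils import (
        colo_set_process_memory_fraction,
        colo_device_memory_capacity,
    )
except ImportError:
    import torch

    def colo_set_process_memory_fraction(fraction: float) -> None:
        # Cap the CUDA caching allocator for the current device, which is
        # what the missing helper was used for.
        torch.cuda.set_per_process_memory_fraction(fraction)

    def colo_device_memory_capacity(device: torch.device) -> int:
        # Total memory of the given CUDA device, in bytes.
        return torch.cuda.get_device_properties(device).total_memory
```

Alternatively, pinning the ColossalAI version the image was built against (or updating `train.py` to the current API) would avoid the shim entirely.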