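Following up on the previous post, "Windows 10 yolov5 GPU环境" (Windows 10 yolov5 GPU environment): once the setup was done, I assumed for a while that virtual memory was not much use. The original virtual memory setting on drive C (the system drive) was 4096-8192. After I changed it to 1024-2048, the damn thing started throwing errors. It was this error: RuntimeError: Unable to find a valid cuDNN algorithm to run convolution. In reality, though, the error has no direct connection to CUDA. I still do not really understand why virtual memory directly affects the CUDA runtime environment, or rather the PyTorch runtime environment. A quick search online turned up nothing relevant either; most likely my own understanding is just too shallow.
Detailed error output: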
(E:\anaconda_dirs\venvs\yolov5-gpu) C:\Users\obaby>cd /d F:\Pycharm_Projects\yolov5
(E:\anaconda_dirs\venvs\yolov5-gpu) F:\Pycharm_Projects\yolov5>python train_ads.py
train: weights=yolov5s.pt, cfg=, data=data/ads.yaml, hyp=data/hyps/hyp.scratch.yaml, epochs=300, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=True, adam=False, sync_bn=False, workers=4, project=runs/train, entity=None, name=exp, exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, upload_dataset=False, bbox_interval=-1, save_period=-1, artifact_alias=latest, local_rank=-1, freeze=0, patience=30
github: remote: Enumerating objects: 54, done.
remote: Counting objects: 100% (42/42), done.
remote: Compressing objects: 100% (24/24), done.
remote: Total 54 (delta 28), reused 19 (delta 18), pack-reused 12
Unpacking objects: 100% (54/54), 54.76 KiB | 124.00 KiB/s, done.
From https://github.com/ultralytics/yolov5
   fcb225c..3beb871  master -> origin/master
 * [new branch]      fix/profile -> origin/fix/profile
 * [new branch]      fix/val_study_plot -> origin/fix/val_study_plot
 * [new branch]      update/Focus -> origin/update/Focus
YOLOv5 is out of date by 32 commits. Use `git pull` or `git clone https://github.com/ultralytics/yolov5` to update.
YOLOv5 v5.0-405-gfad57c2 torch 1.9.0 CUDA:0 (NVIDIA GeForce RTX 3080, 10240.0MB)
hyperparameters: lr0=0.01, lrf=0.2, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
TensorBoard: Start with 'tensorboard --logdir runs\train', view at http://localhost:6006/
wandb: Currently logged in as: obaby (use `wandb login --relogin` to force relogin)
wandb: wandb version 0.12.2 is available! To upgrade, please run:
wandb: $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.12.1
wandb: Syncing run peachy-microwave-5
wandb: View project at https://wandb.ai/obaby/YOLOv5
wandb: View run at https://wandb.ai/obaby/YOLOv5/runs/29uz7t6t
wandb: Run data is saved locally in F:\Pycharm_Projects\yolov5\wandb\run-20210916_230418-29uz7t6t
wandb: Run `wandb offline` to turn off syncing.
Overriding model.yaml nc=80 with nc=1
             from  n   params  module                                  arguments
  0            -1  1     3520  models.common.Focus                     [3, 32, 3]
  1            -1  1    18560  models.common.Conv                      [32, 64, 3, 2]
  2            -1  1    18816  models.common.C3                        [64, 64, 1]
  3            -1  1    73984  models.common.Conv                      [64, 128, 3, 2]
  4            -1  3   156928  models.common.C3                        [128, 128, 3]
  5            -1  1   295424  models.common.Conv                      [128, 256, 3, 2]
  6            -1  3   625152  models.common.C3                        [256, 256, 3]
  7            -1  1  1180672  models.common.Conv                      [256, 512, 3, 2]
  8            -1  1   656896  models.common.SPP                       [512, 512, [5, 9, 13]]
  9            -1  1  1182720  models.common.C3                        [512, 512, 1, False]
 10            -1  1   131584  models.common.Conv                      [512, 256, 1, 1]
 11            -1  1        0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 12       [-1, 6]  1        0  models.common.Concat                    [1]
 13            -1  1   361984  models.common.C3                        [512, 256, 1, False]
 14            -1  1    33024  models.common.Conv                      [256, 128, 1, 1]
 15            -1  1        0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 16       [-1, 4]  1        0  models.common.Concat                    [1]
 17            -1  1    90880  models.common.C3                        [256, 128, 1, False]
 18            -1  1   147712  models.common.Conv                      [128, 128, 3, 2]
 19      [-1, 14]  1        0  models.common.Concat                    [1]
 20            -1  1   296448  models.common.C3                        [256, 256, 1, False]
 21            -1  1   590336  models.common.Conv                      [256, 256, 3, 2]
 22      [-1, 10]  1        0  models.common.Concat                    [1]
 23            -1  1  1182720  models.common.C3                        [512, 512, 1, False]
 24  [17, 20, 23]  1    16182  models.yolo.Detect                      [1, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 283 layers, 7063542 parameters, 7063542 gradients, 16.4 GFLOPs
Transferred 356/362 items from yolov5s.pt
Scaled weight_decay = 0.0005
optimizer: SGD with parameter groups 59 weight, 62 weight (no decay), 62 bias
train: Scanning 'data\train.cache' images and labels... 16 found, 0 missing, 0 empty, 0 corrupted: 100%|█| 16/16 [00:00
val: Scanning 'data\val.cache' images and labels... 2 found, 0 missing, 0 empty, 0 corrupted: 100%|█| 2/2 [00:00<?, ?it
Plotting labels...
autoanchor: Analyzing anchors... anchors/target = 4.44, Best Possible Recall (BPR) = 1.0000
Image sizes 640 train, 640 val
Using 4 dataloader workers
Logging results to runs\train\exp23
Starting training for 300 epochs...
Epoch gpu_mem box obj cls labels img_size
0%| | 0/1 [00:04<?, ?it/s]
Traceback (most recent call last):
  File "train_ads.py", line 610, in <module>
    main(opt)
  File "train_ads.py", line 508, in main
    train(opt.hyp, opt, device)
  File "train_ads.py", line 311, in train
    pred = model(imgs) # forward
  File "E:\anaconda_dirs\venvs\yolov5-gpu\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "F:\Pycharm_Projects\yolov5\models\yolo.py", line 123, in forward
    return self.forward_once(x, profile, visualize) # single-scale inference, train
  File "F:\Pycharm_Projects\yolov5\models\yolo.py", line 155, in forward_once
    x = m(x) # run
  File "E:\anaconda_dirs\venvs\yolov5-gpu\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "F:\Pycharm_Projects\yolov5\models\common.py", line 45, in forward
    return self.act(self.bn(self.conv(x)))
  File "E:\anaconda_dirs\venvs\yolov5-gpu\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "E:\anaconda_dirs\venvs\yolov5-gpu\lib\site-packages\torch\nn\modules\conv.py", line 443, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "E:\anaconda_dirs\venvs\yolov5-gpu\lib\site-packages\torch\nn\modules\conv.py", line 439, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
wandb: Waiting for W&B process to finish, PID 17976
wandb: Program failed with code 1. Press ctrl-c to abort syncing.
wandb:
wandb: Find user logs for this run at: F:\Pycharm_Projects\yolov5\wandb\run-20210916_230418-29uz7t6t\logs\debug.log
wandb: Find internal logs for this run at: F:\Pycharm_Projects\yolov5\wandb\run-20210916_230418-29uz7t6t\logs\debug-internal.log
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb:
wandb: Synced peachy-microwave-5: https://wandb.ai/obaby/YOLOv5/runs/29uz7t6t
(E:\anaconda_dirs\venvs\yolov5-gpu) F:\Pycharm_Projects\yolov5>
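What you see is not necessarily the crux of the problem, and searching for this error message may well not lead to a fix. I tried changing the virtual memory back.
1024 is nowhere near enough, but what should the virtual memory size actually be based on? If you do not reboot and simply run the training again, you get the following error: RuntimeError: CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 10.00 GiB total capacity; 848.46 MiB already allocated; 7.01 GiB free; 892.00 MiB reserved in total by PyTorch)
Detailed error output: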
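Since the error message itself says nothing about virtual memory, it can help to look at how much commit space Windows has before and after changing the page file. The following is only a sketch using the Win32 `GlobalMemoryStatusEx` call through `ctypes`; the helper name `commit_status_mib` is made up for illustration, and it returns `None` on anything other than Windows. It only reports the numbers and does not explain why PyTorch or cuDNN care about them.

```python
import ctypes
import sys


class MEMORYSTATUSEX(ctypes.Structure):
    # Field layout of the Win32 MEMORYSTATUSEX structure.
    _fields_ = [
        ("dwLength", ctypes.c_ulong),
        ("dwMemoryLoad", ctypes.c_ulong),
        ("ullTotalPhys", ctypes.c_ulonglong),
        ("ullAvailPhys", ctypes.c_ulonglong),
        ("ullTotalPageFile", ctypes.c_ulonglong),
        ("ullAvailPageFile", ctypes.c_ulonglong),
        ("ullTotalVirtual", ctypes.c_ulonglong),
        ("ullAvailVirtual", ctypes.c_ulonglong),
        ("ullAvailExtendedVirtual", ctypes.c_ulonglong),
    ]


def commit_status_mib():
    """Hypothetical helper: physical RAM and commit limit in MiB (Windows only)."""
    if sys.platform != "win32":
        return None  # the Win32 API is not available elsewhere
    status = MEMORYSTATUSEX()
    status.dwLength = ctypes.sizeof(MEMORYSTATUSEX)
    if not ctypes.windll.kernel32.GlobalMemoryStatusEx(ctypes.byref(status)):
        raise ctypes.WinError()
    mib = 1024 * 1024
    return {
        "physical_total_mib": status.ullTotalPhys / mib,
        "physical_available_mib": status.ullAvailPhys / mib,
        # Commit limit: roughly physical RAM plus the page file.
        "commit_limit_mib": status.ullTotalPageFile / mib,
        "commit_available_mib": status.ullAvailPageFile / mib,
    }


print(commit_status_mib())
```

Running this once with the 1024-2048 setting and once with 4096-8192 would at least show how much the commit limit changes between the two configurations.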
(E:\anaconda_dirs\venvs\yolov5-gpu) F:\Pycharm_Projects\yolov5>python train_ads.py
train: weights=yolov5s.pt, cfg=, data=data/ads.yaml, hyp=data/hyps/hyp.scratch.yaml, epochs=300, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=True, adam=False, sync_bn=False, workers=4, project=runs/train, entity=None, name=exp, exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, upload_dataset=False, bbox_interval=-1, save_period=-1, artifact_alias=latest, local_rank=-1, freeze=0, patience=30
github: error: RPC failed; curl 56 OpenSSL SSL_read: Connection was reset, errno 10054
fatal: expected flush after ref listing
Command 'git fetch && git config --get remote.origin.url' timed out after 5 seconds
YOLOv5 v5.0-405-gfad57c2 torch 1.9.0 CUDA:0 (NVIDIA GeForce RTX 3080, 10240.0MB)
hyperparameters: lr0=0.01, lrf=0.2, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
TensorBoard: Start with 'tensorboard --logdir runs\train', view at http://localhost:6006/
wandb: Currently logged in as: obaby (use `wandb login --relogin` to force relogin)
wandb: wandb version 0.12.2 is available! To upgrade, please run:
wandb: $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.12.1
wandb: Syncing run celestial-surf-6
wandb: View project at https://wandb.ai/obaby/YOLOv5
wandb: View run at https://wandb.ai/obaby/YOLOv5/runs/30m5e245
wandb: Run data is saved locally in F:\Pycharm_Projects\yolov5\wandb\run-20210916_230828-30m5e245
wandb: Run `wandb offline` to turn off syncing.
Overriding model.yaml nc=80 with nc=1
             from  n   params  module                                  arguments
  0            -1  1     3520  models.common.Focus                     [3, 32, 3]
  1            -1  1    18560  models.common.Conv                      [32, 64, 3, 2]
  2            -1  1    18816  models.common.C3                        [64, 64, 1]
  3            -1  1    73984  models.common.Conv                      [64, 128, 3, 2]
  4            -1  3   156928  models.common.C3                        [128, 128, 3]
  5            -1  1   295424  models.common.Conv                      [128, 256, 3, 2]
  6            -1  3   625152  models.common.C3                        [256, 256, 3]
  7            -1  1  1180672  models.common.Conv                      [256, 512, 3, 2]
  8            -1  1   656896  models.common.SPP                       [512, 512, [5, 9, 13]]
  9            -1  1  1182720  models.common.C3                        [512, 512, 1, False]
 10            -1  1   131584  models.common.Conv                      [512, 256, 1, 1]
 11            -1  1        0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 12       [-1, 6]  1        0  models.common.Concat                    [1]
 13            -1  1   361984  models.common.C3                        [512, 256, 1, False]
 14            -1  1    33024  models.common.Conv                      [256, 128, 1, 1]
 15            -1  1        0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 16       [-1, 4]  1        0  models.common.Concat                    [1]
 17            -1  1    90880  models.common.C3                        [256, 128, 1, False]
 18            -1  1   147712  models.common.Conv                      [128, 128, 3, 2]
 19      [-1, 14]  1        0  models.common.Concat                    [1]
 20            -1  1   296448  models.common.C3                        [256, 256, 1, False]
 21            -1  1   590336  models.common.Conv                      [256, 256, 3, 2]
 22      [-1, 10]  1        0  models.common.Concat                    [1]
 23            -1  1  1182720  models.common.C3                        [512, 512, 1, False]
 24  [17, 20, 23]  1    16182  models.yolo.Detect                      [1, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 283 layers, 7063542 parameters, 7063542 gradients, 16.4 GFLOPs
Transferred 356/362 items from yolov5s.pt
Scaled weight_decay = 0.0005
optimizer: SGD with parameter groups 59 weight, 62 weight (no decay), 62 bias
train: Scanning 'data\train.cache' images and labels... 16 found, 0 missing, 0 empty, 0 corrupted: 100%|█| 16/16 [00:00<?, ?it/s]
val: Scanning 'data\val.cache' images and labels... 2 found, 0 missing, 0 empty, 0 corrupted: 100%|████████| 2/2 [00:00<?, ?it/s]
Plotting labels...
autoanchor: Analyzing anchors... anchors/target = 4.44, Best Possible Recall (BPR) = 1.0000
Image sizes 640 train, 640 val
Using 4 dataloader workers
Logging results to runs\train\exp24
Starting training for 300 epochs...
Epoch gpu_mem box obj cls labels img_size
0%| | 0/1 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "train_ads.py", line 610, in <module>
    main(opt)
  File "train_ads.py", line 508, in main
    train(opt.hyp, opt, device)
  File "train_ads.py", line 311, in train
    pred = model(imgs) # forward
  File "E:\anaconda_dirs\venvs\yolov5-gpu\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "F:\Pycharm_Projects\yolov5\models\yolo.py", line 123, in forward
    return self.forward_once(x, profile, visualize) # single-scale inference, train
  File "F:\Pycharm_Projects\yolov5\models\yolo.py", line 155, in forward_once
    x = m(x) # run
  File "E:\anaconda_dirs\venvs\yolov5-gpu\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "F:\Pycharm_Projects\yolov5\models\common.py", line 137, in forward
    return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))
  File "E:\anaconda_dirs\venvs\yolov5-gpu\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "E:\anaconda_dirs\venvs\yolov5-gpu\lib\site-packages\torch\nn\modules\container.py", line 139, in forward
    input = module(input)
  File "E:\anaconda_dirs\venvs\yolov5-gpu\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "F:\Pycharm_Projects\yolov5\models\common.py", line 103, in forward
    return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))
RuntimeError: CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 10.00 GiB total capacity; 848.46 MiB already allocated; 7.01 GiB free; 892.00 MiB reserved in total by PyTorch)
wandb: Waiting for W&B process to finish, PID 20684
wandb: Program failed with code 1. Press ctrl-c to abort syncing.
wandb:
wandb: Find user logs for this run at: F:\Pycharm_Projects\yolov5\wandb\run-20210916_230828-30m5e245\logs\debug.log
wandb: Find internal logs for this run at: F:\Pycharm_Projects\yolov5\wandb\run-20210916_230828-30m5e245\logs\debug-internal.log
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb:
wandb: Synced celestial-surf-6: https://wandb.ai/obaby/YOLOv5/runs/30m5e245
(E:\anaconda_dirs\venvs\yolov5-gpu) F:\Pycharm_Projects\yolov5>
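The numbers in the out-of-memory message above are worth a second look: the allocation that failed was only 26 MiB, while the same message reports 7.01 GiB free on the card. Here is a small sketch that pulls those figures out of the message and compares them. The regular expression assumes the exact wording shown in the log above (torch 1.9.0), and `parse_oom` and `to_mib` are just illustrative helper names, not part of PyTorch.

```python
import re

# The message exactly as it appears in the log above.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 10.00 GiB total capacity; "
    "848.46 MiB already allocated; 7.01 GiB free; 892.00 MiB reserved in total by PyTorch)"
)

# Conversion factors into MiB.
_UNITS = {"KiB": 1.0 / 1024.0, "MiB": 1.0, "GiB": 1024.0}

_PATTERN = re.compile(
    r"Tried to allocate (?P<requested>[\d.]+ [KMG]iB).*?"
    r"(?P<total>[\d.]+ [KMG]iB) total capacity; "
    r"(?P<allocated>[\d.]+ [KMG]iB) already allocated; "
    r"(?P<free>[\d.]+ [KMG]iB) free; "
    r"(?P<reserved>[\d.]+ [KMG]iB) reserved"
)


def to_mib(text):
    """Convert a string such as '7.01 GiB' into a float number of MiB."""
    value, unit = text.split()
    return float(value) * _UNITS[unit]


def parse_oom(message):
    """Extract the memory figures (in MiB) from a CUDA OOM message."""
    match = _PATTERN.search(message)
    if match is None:
        raise ValueError("message does not match the expected OOM format")
    return {name: to_mib(value) for name, value in match.groupdict().items()}


stats = parse_oom(OOM_MESSAGE)
# A genuine shortage of GPU memory would have requested > free.
gpu_really_full = stats["requested"] > stats["free"]
print(stats)
print("requested more than the GPU had free:", gpu_really_full)
```

For this message the comparison comes out false, which fits the observation that the visible error is not necessarily the real problem: the GPU had plenty of free memory when the allocation failed.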
After a reboot, all of the problems went away.