Transfer skorch models from CUDA to CPU for inference #1096
If you train a model on GPU, save it, then load it on a machine without a GPU, it should already work and be automatically transferred to CPU. Please give this a try and tell us if you encounter problems. The thread you cited is a bit different, as it is about changing the device within the same process.
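For reference, a minimal sketch of that round trip (the pickle path and the `net`/`X` variables are placeholders, not part of the original comment):

```python
import pickle

# On the CUDA machine: `net` is a fitted skorch net trained with device="cuda".
with open("model.pkl", "wb") as f:
    pickle.dump(net, f)

# On the CPU-only machine: unpickling should warn and fall back to CPU automatically.
with open("model.pkl", "rb") as f:
    net = pickle.load(f)
y_pred = net.predict(X)  # inference now runs on CPU
```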
@BenjaminBossan Thanks for the input. For testing purposes, I am saving the model in a CUDA-enabled torch environment and attempting to load it in a CPU-only torch environment. Let me know if there is any other information I can provide. Below is my code:
Saving: train model in CUDA environment
Loading: switch to CPU environment
Error traceback
Edit: I also tested with standard pickle as shown in the documentation and ran into the same error.
Hmm, I can't reproduce this. Here is the script that I used:

import pickle
import sys
import numpy as np
import torch
from sklearn.datasets import make_classification
from torch import nn
from skorch import NeuralNetClassifier
path = "/tmp/model.pkl"
class MyModule(nn.Module):
def __init__(self, num_units=10, nonlin=nn.ReLU()):
super().__init__()
self.dense0 = nn.Linear(20, num_units)
self.nonlin = nonlin
self.dropout = nn.Dropout(0.5)
self.dense1 = nn.Linear(num_units, num_units)
self.output = nn.Linear(num_units, 2)
self.softmax = nn.Softmax(dim=-1)
def forward(self, X, **kwargs):
X = self.nonlin(self.dense0(X))
X = self.dropout(X)
X = self.nonlin(self.dense1(X))
X = self.softmax(self.output(X))
return X
def get_data():
X, y = make_classification(1000, 20, n_informative=10, random_state=0)
X = X.astype(np.float32)
y = y.astype(np.int64)
return X, y
def save(): # with cuda
assert torch.cuda.is_available()
X, y = get_data()
net = NeuralNetClassifier(
MyModule,
max_epochs=10,
lr=0.1,
# Shuffle training data on each epoch
iterator_train__shuffle=True,
device="cuda",
)
net.fit(X, y)
with open(path, "wb") as f:
pickle.dump(net, f)
def load(): # without cuda
assert not torch.cuda.is_available()
with open(path, "rb") as f:
net = pickle.load(f)
X, y = get_data()
y_pred = net.predict(X)
print(f"accuracy: {(y==y_pred).mean()}")
if __name__ == "__main__":
if sys.argv[1] == "save":
save()
elif sys.argv[1] == "load":
load()
else:
raise ValueError

The script can be called with `save` as its command-line argument in the CUDA environment and with `load` in the CPU-only environment.
Could you please try if you can reproduce this? If yes, there must be something else going on in your script. Note that you can also try the approach described here to save, for instance, only the model weights instead of pickling the whole net.
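For illustration, a sketch of that weights-only route using skorch's `save_params`/`load_params` (the file name, `MyModule`, and data variables are placeholders taken from the script above; the approach linked in the comment may differ in detail):

```python
# On the CUDA machine: persist only the module weights, not the whole net.
net.save_params(f_params="weights.pkl")  # "weights.pkl" is a placeholder path

# On the CPU-only machine: rebuild an identical net with device="cpu",
# initialize it, then load the weights; skorch maps them to the net's device.
new_net = NeuralNetClassifier(MyModule, device="cpu")
new_net.initialize()
new_net.load_params(f_params="weights.pkl")
y_pred = new_net.predict(X)
```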
@BenjaminBossan Thanks for the test script -- that worked, and I got the warnings as expected. Below is a distilled test script that produces the error on my end. I'm guessing it has something to do with the torch modules I'm using? Other than those, I'm not sure what might be causing the issue.

import pickle
from skorch import NeuralNetClassifier
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import sys
class ResNet1D(nn.Module):
"""
ResNet architecture for 1D spectral classification
"""
def __init__(self, block, layers, num_classes=2, input_channels=1, reduction=16):
super(ResNet1D, self).__init__()
self.in_channels = 64
# Initial convolution layer
self.conv1 = nn.Conv1d(
input_channels, 64, kernel_size=7, stride=2, padding=3, bias=False
)
self.bn1 = nn.BatchNorm1d(64)
self.relu = nn.ReLU(inplace=True)
self.maxpool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)
# ResNet layers
self.layer1 = self._make_layer(block, 64, layers[0])
self.layer2 = self._make_layer(
block, 128, layers[1], stride=2, reduction=reduction
)
self.layer3 = self._make_layer(
block, 256, layers[2], stride=2, reduction=reduction
)
self.layer4 = self._make_layer(
block, 512, layers[3], stride=2, reduction=reduction
)
self.avgpool = nn.AdaptiveAvgPool1d(1)
self.dropout = nn.Dropout(0.5)
self.fc = nn.Linear(512 * block.expansion, num_classes)
def _make_layer(self, block, channels, blocks, stride=1, reduction=16):
downsample = None
# Downsample if stride is not 1 or input/output channels differ
if stride != 1 or self.in_channels != channels * block.expansion:
downsample = nn.Sequential(
nn.Conv1d(
self.in_channels,
channels * block.expansion,
kernel_size=1,
stride=stride,
bias=False,
),
nn.BatchNorm1d(channels * block.expansion),
)
layers = []
layers.append(block(self.in_channels, channels, stride, downsample, reduction))
self.in_channels = channels * block.expansion
for _ in range(1, blocks):
layers.append(block(self.in_channels, channels, reduction=reduction))
return nn.Sequential(*layers)
def forward(self, x):
x = self.conv1(x)
x = self.bn1(x)
x = self.relu(x)
x = self.maxpool(x)
x = self.layer1(x)
x = self.layer2(x)
x = self.layer3(x)
x = self.layer4(x)
x = self.avgpool(x)
x = torch.flatten(x, 1)
x = self.dropout(x)
x = self.fc(x)
return x
class SqueezeExcitationBlock(nn.Module):
"""
Squeeze-and-Excitation (SE) Block for channel-wise attention
Parameters:
-----------
channel : int
Number of input channels
reduction : int, optional (default=16)
Reduction ratio for the bottleneck
"""
def __init__(self, channel, reduction=16):
super(SqueezeExcitationBlock, self).__init__()
# Squeeze operation (Global Average Pooling)
self.avg_pool = nn.AdaptiveAvgPool1d(1)
# Excitation operation (Channel attention mechanism)
self.fc = nn.Sequential(
nn.Linear(channel, channel // reduction, bias=False),
nn.ReLU(inplace=True),
nn.Linear(channel // reduction, channel, bias=False),
nn.Sigmoid(),
)
def forward(self, x):
# Input shape: [batch_size, channels, length]
batch_size, channels, _ = x.size()
# Squeeze - Global Average Pooling
y = self.avg_pool(x).view(batch_size, channels)
# Excitation - Channel attention
y = self.fc(y).view(batch_size, channels, 1)
# Scale input features
return x * y.expand_as(x)
class BasicResidualBlock(nn.Module):
"""
Basic residual block for ResNet architecture
Supports 1D convolutions for spectral data
"""
expansion = 1
def __init__(
self, in_channels, out_channels, stride=1, downsample=None, reduction=16
):
super(BasicResidualBlock, self).__init__()
# First convolution layer
self.conv1 = nn.Conv1d(
in_channels,
out_channels,
kernel_size=3,
stride=stride,
padding=1,
bias=False,
)
self.bn1 = nn.BatchNorm1d(out_channels)
# Second convolution layer
self.conv2 = nn.Conv1d(
out_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False
)
self.bn2 = nn.BatchNorm1d(out_channels)
# Squeeze-Excitation block
self.se = SqueezeExcitationBlock(out_channels, reduction)
# Downsampling layer for matching dimensions
self.downsample = downsample
self.stride = stride
def forward(self, x):
identity = x
# First conv path
out = F.relu(self.bn1(self.conv1(x)))
out = self.bn2(self.conv2(out))
# Apply Squeeze-Excitation
out = self.se(out)
# Downsampling if needed
if self.downsample is not None:
identity = self.downsample(x)
# Residual connection
out += identity
out = F.relu(out)
return out
file = "skorch_cuda_to_cpu_model.pkl"
from sklearn.datasets import make_classification
def get_data():
X, y = make_classification(1000, 534, n_informative=500, random_state=0)
X = X.astype(np.float32)
y = y.astype(np.int64)
return X, y
def save():
assert torch.cuda.is_available()
MyModule = ResNet1D(
BasicResidualBlock,
layers=[2,2,2,2],
num_classes=2,
input_channels=1,
)
net = NeuralNetClassifier(
MyModule,
max_epochs=5,
lr=0.1,
iterator_train__shuffle=True,
device="cuda",
criterion=torch.nn.CrossEntropyLoss,)
X, y = get_data()
X = X.reshape(X.shape[0], 1, X.shape[1])
net.fit(X, y)
with open(file, "wb") as f:
pickle.dump(net, f)
def load():
assert not torch.cuda.is_available()
with open(file, "rb") as f:
net = pickle.load(f)
print(net.device)
if __name__ == "__main__":
if sys.argv[1] == "save":
save()
elif sys.argv[1] == "load":
load()
else:
raise ValueError
Thanks for the reproducer. I could verify that this fails at loading the model on a CPU machine. I tried to debug a little bit and it appears that when loading, torch already raises the error about missing CUDA support before the skorch code even runs, which means the skorch logic to load on CPU is not applied in time. I don't know why your code would trigger this but my example wouldn't. However, there are still workarounds. One way is to move the whole model to CPU before pickling it, i.e. calling `.cpu()` on the module before dumping the net.
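A sketch of that workaround, assuming a fitted `net` as in the reproducer above (skorch also exposes the fitted module as `net.module_`; the pickle path is a placeholder):

```python
# Move the module's parameters and buffers to CPU before pickling,
# so the pickled net no longer references CUDA tensors.
net.module_.cpu()
with open("model.pkl", "wb") as f:
    pickle.dump(net, f)
```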
@BenjaminBossan That worked to fix the issue for the reproducer test script. I went on to test the workaround in my development environment, and I still encountered the error. After a bit of debugging, I found that the error seems to be caused by the combination of using the `LRScheduler` callback and the `torch.optim.Adam` optimizer. Below is an updated reproducer:

import pickle
from skorch import NeuralNetClassifier
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import sys
from skorch.callbacks import LRScheduler, EarlyStopping
class ResNet1D(nn.Module):
"""
ResNet architecture for 1D spectral classification
"""
def __init__(self, block, layers, num_classes=2, input_channels=1, reduction=16):
super(ResNet1D, self).__init__()
# num_classes = 1 if num_classes == "binary" else num_classes
self.in_channels = 64
# Initial convolution layer
self.conv1 = nn.Conv1d(
input_channels, 64, kernel_size=7, stride=2, padding=3, bias=False
)
self.bn1 = nn.BatchNorm1d(64)
self.relu = nn.ReLU(inplace=True)
self.maxpool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)
# ResNet layers
self.layer1 = self._make_layer(block, 64, layers[0])
self.layer2 = self._make_layer(
block, 128, layers[1], stride=2, reduction=reduction
)
self.layer3 = self._make_layer(
block, 256, layers[2], stride=2, reduction=reduction
)
self.layer4 = self._make_layer(
block, 512, layers[3], stride=2, reduction=reduction
)
# Global average pooling and fully connected layer
self.avgpool = nn.AdaptiveAvgPool1d(1)
self.dropout = nn.Dropout(0.5)
self.fc = nn.Linear(512 * block.expansion, num_classes)
def _make_layer(self, block, channels, blocks, stride=1, reduction=16):
downsample = None
# Downsample if stride is not 1 or input/output channels differ
if stride != 1 or self.in_channels != channels * block.expansion:
downsample = nn.Sequential(
nn.Conv1d(
self.in_channels,
channels * block.expansion,
kernel_size=1,
stride=stride,
bias=False,
),
nn.BatchNorm1d(channels * block.expansion),
)
layers = []
layers.append(block(self.in_channels, channels, stride, downsample, reduction))
self.in_channels = channels * block.expansion
for _ in range(1, blocks):
layers.append(block(self.in_channels, channels, reduction=reduction))
return nn.Sequential(*layers)
def forward(self, x):
x = self.conv1(x)
x = self.bn1(x)
x = self.relu(x)
x = self.maxpool(x)
x = self.layer1(x)
x = self.layer2(x)
x = self.layer3(x)
x = self.layer4(x)
x = self.avgpool(x)
x = torch.flatten(x, 1)
x = self.dropout(x)
x = self.fc(x)
return x
class SqueezeExcitationBlock(nn.Module):
"""
Squeeze-and-Excitation (SE) Block for channel-wise attention
Parameters:
-----------
channel : int
Number of input channels
reduction : int, optional (default=16)
Reduction ratio for the bottleneck
"""
def __init__(self, channel, reduction=16):
super(SqueezeExcitationBlock, self).__init__()
# Squeeze operation (Global Average Pooling)
self.avg_pool = nn.AdaptiveAvgPool1d(1)
# Excitation operation (Channel attention mechanism)
self.fc = nn.Sequential(
nn.Linear(channel, channel // reduction, bias=False),
nn.ReLU(inplace=True),
nn.Linear(channel // reduction, channel, bias=False),
nn.Sigmoid(),
)
def forward(self, x):
# Input shape: [batch_size, channels, length]
batch_size, channels, _ = x.size()
# Squeeze - Global Average Pooling
y = self.avg_pool(x).view(batch_size, channels)
# Excitation - Channel attention
y = self.fc(y).view(batch_size, channels, 1)
# Scale input features
return x * y.expand_as(x)
class BasicResidualBlock(nn.Module):
"""
Basic residual block for ResNet architecture
Supports 1D convolutions for spectral data
"""
expansion = 1
def __init__(
self, in_channels, out_channels, stride=1, downsample=None, reduction=16
):
super(BasicResidualBlock, self).__init__()
# First convolution layer
self.conv1 = nn.Conv1d(
in_channels,
out_channels,
kernel_size=3,
stride=stride,
padding=1,
bias=False,
)
self.bn1 = nn.BatchNorm1d(out_channels)
# Second convolution layer
self.conv2 = nn.Conv1d(
out_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False
)
self.bn2 = nn.BatchNorm1d(out_channels)
# Squeeze-Excitation block
self.se = SqueezeExcitationBlock(out_channels, reduction)
# Downsampling layer for matching dimensions
self.downsample = downsample
self.stride = stride
def forward(self, x):
identity = x
# First conv path
out = F.relu(self.bn1(self.conv1(x)))
out = self.bn2(self.conv2(out))
# Apply Squeeze-Excitation
out = self.se(out)
# Downsampling if needed
if self.downsample is not None:
identity = self.downsample(x)
# Residual connection
out += identity
out = F.relu(out)
return out
file = "skorch_cuda_to_cpu_model.pkl"
from sklearn.datasets import make_classification
def get_data():
X, y = make_classification(1000, 534, n_informative=500, random_state=0)
X = X.astype(np.float32)
y = y.astype(np.int64)
return X, y
def save():
assert torch.cuda.is_available()
MyModule = ResNet1D(
BasicResidualBlock,
layers=[2,2,2,2],
num_classes=2,
input_channels=1,
)
lr_scheduler = LRScheduler(
policy=torch.optim.lr_scheduler.ReduceLROnPlateau,
monitor='valid_loss',
patience=10,
factor=0.9,
min_lr=1e-8,
)
net = NeuralNetClassifier(
MyModule,
max_epochs=5,
lr=0.1,
iterator_train__shuffle=True,
device="cuda",
criterion=torch.nn.CrossEntropyLoss,
callbacks=[lr_scheduler],
optimizer=torch.optim.Adam,
)
X, y = get_data()
X = X.reshape(X.shape[0], 1, X.shape[1])
net.fit(X, y)
net.module.cpu()
with open(file, "wb") as f:
pickle.dump(net, f)
def load():
assert not torch.cuda.is_available()
with open(file, "rb") as f:
net = pickle.load(f)
print(net.device)
if __name__ == "__main__":
if sys.argv[1] == "save":
save()
elif sys.argv[1] == "load":
load()
else:
raise ValueError
Thanks for providing further information. Without digging deeper: when pickling, skorch checks attributes with a CUDA dependency, pops them from the pickle state, and saves them in a way that allows us to later load them without CUDA. Optimizers such as Adam store gradient states (mean, var), which are tensors that could be on CUDA. This is okay, since we treat the optimizer as a CUDA-dependent attribute. However, I suspect that the learning rate scheduler holds a reference to the optimizer. Therefore, when the whole net, and thus the learning rate scheduler, is pickled, we still retain a reference to the CUDA tensors from the optimizer states.

As to workarounds, as you mentioned, you could use an optimizer that has no optimizer state, like SGD; however, this could lead to lower performance. If, after training, you don't need to continue training, you don't actually need the learning rate scheduler and could thus remove it from the callbacks before pickling. If, after training, you only need the model for inference, skorch also provides a method to get rid of all attributes that are not needed for inference: https://skorch.readthedocs.io/en/stable/net.html#skorch.net.NeuralNet.trim_for_prediction.
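For the inference-only case, a minimal sketch of that route, assuming a fitted `net` as in the reproducer (the pickle path is a placeholder):

```python
# Strip attributes that are not needed for inference (e.g. the optimizer and callbacks),
# leaving a net that can only be used for prediction, then pickle it.
net.trim_for_prediction()
with open("model_inference.pkl", "wb") as f:
    pickle.dump(net, f)
```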
Hi all, I am training skorch models locally in a CUDA-enabled torch environment, and, if possible, I would like to transfer the models entirely to CPU so that they can be registered and used for inference in a CPU-only environment. Is there a best method for accomplishing this?
I'm pretty new to skorch and deep learning, so I'm not sure whether this is even possible, but, if so, a skorch helper method for converting a model to CPU would be a nice-to-have feature.
Edit: I just noticed a very similar (old) issue that is still open at the time of posting (#553). The conversation there didn't seem to fully resolve. Let me know if I should post there or if reviving the topic here would be preferable.