training/inference time as a function of number of scans used #25
Hi there, so it's been a while since I looked at this, but the obvious reason that comes to mind would be the auto-regressive nature of DR-SPAAM. Line 112 of the same file actually updates a state, which can't be parallelized in a trivial fashion. The feature extraction itself should be parallelizable though, given that these ops don't have a state. I guess you could easily flatten the batch and number-of-scan dimensions, extract the features, and then reshape back prior to running only line 112 in a loop. I guess you can even try that without re-training, given that it should in essence be exactly the same during inference.

Do pay attention though that during training this would behave slightly differently (no idea if it's good or bad). Right now batchnorm is performed on batches of a single scan, whereas doing what I proposed above would run batchnorm collectively on all B*N scans. This might even be better, but I guess it's hard to tell without trying.
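To make the batchnorm point concrete, here is a toy comparison with a plain `nn.BatchNorm1d` and made-up shapes (not the actual DR-SPAAM blocks): in eval mode the running statistics are used, so looping over scans and flattening give identical outputs, while in train mode the batch statistics are computed over B versus B*N samples, so the normalization differs slightly.

```python
import torch
import torch.nn as nn

B, N, C, L = 4, 5, 32, 48          # batch, scans, channels, cutout length (illustrative)
bn = nn.BatchNorm1d(C)
x = torch.randn(B, N, C, L)

# eval mode: running statistics -> per-scan loop == flattened batch
bn.eval()
per_scan = torch.stack([bn(x[:, i]) for i in range(N)], dim=1)
flattened = bn(x.view(B * N, C, L)).view(B, N, C, L)
print(torch.allclose(per_scan, flattened))   # True

# train mode: statistics over B vs. B*N samples -> slightly different normalization
bn.train()
per_scan = torch.stack([bn(x[:, i]) for i in range(N)], dim=1)
flattened = bn(x.view(B * N, C, L)).view(B, N, C, L)
print(torch.allclose(per_scan, flattened))   # False
```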
Thank you @Pandoro. By "I guess you can even try that without re-training given that it should in essence be exactly the same during inference", do you mean the existing weights would have to be changed in some way?
No, you don't need to change the weights. My pytorch syntax is a bit rusty and I can't test this right now, but I'm suggesting something along the following lines:

```python
B, CT, N, L = x.shape

# extract features from all scans at once
out = x.view(B * CT * N, 1, L)  # not sure if that works, but I wouldn't see why not; you could also use einops to be more explicit
out = self._conv_and_pool(out, self.conv_block_1)  # /2  <-- cutouts are processed as usual
out = self._conv_and_pool(out, self.conv_block_2)  # /4
features_all = out.view(B, CT, N, out.shape[-2], out.shape[-1])  # again, this might need some testing

for i in range(n_scan):
    features_i = features_all[:, :, i, :, :]  # (B, CT, C, L)
    # combine current feature with memory
    out, sim = self.gate(features_i)  # (B, CT, C, L)
```

Each cutout from each scan, from each batch entry is processed independently; the only place they interact is in the batchnorm in the `_conv_and_pool` blocks. That's what I mentioned before. I don't think this is a huge issue though, and post training it should give you the same results, up to GPU non-determinism and the like. Take all of this with a grain of salt though and test it for sure. I might be overlooking something stupid here and I'm only 95% sure this will work.
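Since I mentioned einops: the flatten/unflatten round-trip could also be written roughly like this (a sketch with a placeholder feature extractor standing in for the `_conv_and_pool` calls, and made-up sizes):

```python
import torch
import torch.nn as nn
from einops import rearrange

B, CT, N, L = 2, 450, 5, 48                       # illustrative sizes
x = torch.randn(B, CT, N, L)

# placeholder for conv_block_1 / conv_block_2 plus pooling
feature_extractor = nn.Sequential(
    nn.Conv1d(1, 64, kernel_size=3, padding=1),
    nn.MaxPool1d(2),
)

flat = rearrange(x, "b ct n l -> (b ct n) 1 l")   # merge batch, cutout and scan dims
feats = feature_extractor(flat)                   # (B*CT*N, C, L/2)
features_all = rearrange(feats, "(b ct n) c l -> b ct n c l", b=B, ct=CT, n=N)
print(features_all.shape)                         # torch.Size([2, 450, 5, 64, 24])
```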
Hello,
I have noticed a substantial increase in both training time and inference time when going from 1 scan at a time to 5 scans at a time. For example, an epoch takes around 10 hours to finish when using 1 scan versus around 40 hours when using 5 scans (I am training on CPU for the moment, hence the large absolute training times). Similarly, when using the test_inference script I get around 1.1 seconds versus 3.5 seconds on my M1 Mac laptop.

I have looked into the code, and a great chunk of the time increase comes from the forward method of the DRSPAAM object in https://github.com/VisualComputingInstitute/2D_lidar_person_detection/blob/master/dr_spaam/dr_spaam/model/dr_spaam.py, at the for loop at line 102. Is there a reason this is done sequentially rather than parallelized/vectorized using torch's capabilities in this respect?
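For reference, the comparison can be reproduced on random inputs with something along these lines (a rough sketch: `model` stands in for a constructed DR-SPAAM network whose forward accepts a `(B, CT, N, L)` tensor, and the sizes are made up):

```python
import time
import torch

def time_forward(model, n_scan, B=1, CT=450, L=48, repeats=10):
    x = torch.randn(B, CT, n_scan, L)
    model.eval()
    with torch.no_grad():
        model(x)                          # warm-up pass
        t0 = time.perf_counter()
        for _ in range(repeats):
            model(x)
        elapsed = time.perf_counter() - t0
    return elapsed / repeats              # average seconds per forward pass

# e.g. print(time_forward(model, 1), time_forward(model, 5))
```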
Thank you