The main dobin procedure turns out to be unusable for me because it is simply too slow. My current data aren't huge (hundreds of thousands of rows, around 100 columns), but they are clearly far too big for dobin. The main bottleneck is the call to RANN::nn2, which has to be recalculated on every iteratively reduced matrix (dobin/R/y_space.R, line 11 in 04453c1).
The following code illustrates just one of several available alternatives that is a lot more efficient:
nrow <- 10000
ncol <- 50
x <- array(runif(nrow * ncol), dim = c(nrow, ncol))
n <- floor(10^(8:16 / 4))
k <- 20
res <- vapply(n, function(i) {
    xtest <- x[seq(i), ]
    bench::mark(
        nn_obj <- RANN::nn2(xtest, xtest, k = k),
        nn_obj <- dbscan::kNN(xtest, k = k),
        check = FALSE,
        time_unit = "s")$median
}, numeric(2))
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
res <- data.frame(n = n,
                  RANN = res[1, ],
                  dbscan = res[2, ])
res <- tidyr::gather(res,
                     key = "method",
                     value = "duration",
                     RANN, dbscan)
library(ggplot2)
ggplot(res, aes(x = n, y = duration, colour = method)) +
    geom_line() +
    geom_point()
Created on 2021-09-02 by the reprex package (v2.0.0.9000)
dbscan::kNN is at least twice as fast as RANN::nn2, and scales much better.
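One thing to watch when swapping the two calls: as far as I know, RANN::nn2 queried against its own data returns each point as its own first neighbour (in `nn.idx` / `nn.dists`), whereas dbscan::kNN excludes the point itself (in `id` / `dist`). A minimal sketch of how the two outputs could be lined up, assuming both packages are installed and there are no tied distances:

```r
set.seed(1)
x <- matrix(runif(500 * 5), nrow = 500)
k <- 10

# nn2 on (x, x) includes the query point itself, so ask for k + 1
# neighbours and drop the first column.
nn <- RANN::nn2(x, x, k = k + 1)
idx_rann <- nn$nn.idx[, -1, drop = FALSE]

# kNN already excludes the point itself.
kn <- dbscan::kNN(x, k = k)
idx_dbscan <- unname(kn$id)

# With continuous random data the two index matrices should agree.
same <- all(idx_rann == idx_dbscan)
```

So any code that currently drops or relies on the self-neighbour column from nn2 would need a small adjustment after the switch.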
That is nevertheless unlikely to make dobin useable at scale. I suspect it may be necessary to reconsider the brute-force knn calls, and hand-code some sort of transformation of former neighbour relationships into your new B-basis. Updated neighbour relationships change very little, especially in the early (high-dimensional) stages, so there's a lot of unnecessary processing going on recalculating those from scratch each time. Happy to discuss approaches if and when things get that far, but at least dropping RANN will help us along the way. Thanks!
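To make the idea of reusing former neighbour relationships a bit more concrete, here is a rough sketch, not a proposal for the actual dobin internals. It runs one full search with spare candidates, then only re-ranks each point's candidates after a reduction step. The helper names `knn_brute` and `knn_from_candidates` are hypothetical, dropping a column is only a stand-in for the real basis update, and the result is an approximation whose quality would need checking:

```r
# Brute-force k nearest neighbours (excluding self), base R only.
# In practice the initial search could come from dbscan::kNN instead.
knn_brute <- function(x, k) {
  d <- as.matrix(dist(x))
  diag(d) <- Inf
  t(apply(d, 1, function(row) order(row)[seq_len(k)]))
}

# Re-rank each point's previous candidates in the new space,
# instead of searching over all points again.
knn_from_candidates <- function(x_new, cand, k) {
  stopifnot(k >= 2, k <= ncol(cand))
  t(vapply(seq_len(nrow(x_new)), function(i) {
    nbrs <- x_new[cand[i, ], , drop = FALSE]
    here <- matrix(x_new[i, ], nrow = nrow(nbrs), ncol = ncol(nbrs), byrow = TRUE)
    d <- sqrt(rowSums((nbrs - here)^2))
    cand[i, order(d)[seq_len(k)]]
  }, integer(k)))
}

set.seed(1)
x <- matrix(runif(200 * 10), nrow = 200)
k <- 5
cand <- knn_brute(x, 4 * k)      # one full search, keeping spare candidates
x_red <- x[, -10]                # hypothetical stand-in for one reduction step
approx_nn <- knn_from_candidates(x_red, cand, k)

# Compare against a full recomputation to see how much is lost.
exact_nn <- knn_brute(x_red, k)
recall <- mean(vapply(seq_len(nrow(x)), function(i) {
  mean(approx_nn[i, ] %in% exact_nn[i, ])
}, numeric(1)))
```

Each update then costs on the order of n times the candidate count rather than a fresh search over all n points. How many spare candidates are needed, and how often a full refresh is required, are open questions.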
Hi Mark, Thanks for all these comments. Much appreciated.
I never got notified. I guess the Uni email has been blocking github emails. I've changed it to dbscan now.
Thanks for the responses @sevvandi, and no worries about not responding earlier. I didn't end up using {dobin} at the time I would have liked because of this scaling issue, but I would be very happy to see it somehow redesigned to scale better. It really is a very useful algorithm - thanks for developing and coding it here!