Matrix Profile with Non-normalized Euclidean Distance #207
-
Thanks @codyschank. A few things:

Note that this is completely off the top of my head, so please do not hold me to this later 😄 So, what I would recommend doing is:

Note that there is already a nice, optimal way to compute this. Does that make sense? What questions did this create?
-
That makes sense. I'll dig into it and see if I can get it working.
-
Awesome, I'm here if you need a sounding board! Feel free to also submit a "throwaway" PR if you'd like me to take a look at some code.
-
I was able to make changes to core.py so that core.mass gives the non-normalized Euclidean distance. This basically uses the equation above and is an implementation of the SiMPLE algorithm from the paper. Now it seems like I should create a new function that follows the SiMPLE-fast algorithm from the paper, to take the place of stumpy.stump in my processing. I also noticed the R package tsmp has an implementation of SiMPLE-fast, so I may test my data using that first to see if it gives the results I expect.
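For anyone following along, here is a minimal sketch of what that change amounts to, using the identity d²(Q, Tᵢ) = ‖Q‖² + ‖Tᵢ‖² − 2(Q · Tᵢ). The function name and the rolling-sum trick are illustrative, not STUMPY's actual internals:

```python
import numpy as np

def non_normalized_distance_profile(Q, T):
    # Raw (non-normalized) Euclidean distance profile of query Q against
    # every length-m subsequence of T: d^2 = ||Q||^2 + ||T_i||^2 - 2*(Q . T_i)
    m = len(Q)
    QT = np.correlate(T, Q, mode="valid")             # sliding dot products
    csum = np.concatenate(([0.0], np.cumsum(T * T)))
    T_sumsq = csum[m:] - csum[:-m]                    # rolling sum of squares
    D_squared = np.dot(Q, Q) + T_sumsq - 2.0 * QT
    return np.sqrt(np.maximum(D_squared, 0.0))        # clamp tiny negatives
```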
-
Okay, it basically looks like SiMPLE-fast is using a STOMP-like approach, so we should ignore the slower "SiMPLE" algorithm. I recommend creating a new function for it. Lastly, we'll have to work on adding proper unit tests (handling NaN/inf), but this shouldn't be too hard and I can help with that.
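To make the "STOMP-like" part concrete, here is a rough sketch of the O(1) sliding dot product update that both STOMP and SiMPLE-fast rely on. This is illustrative code, not the eventual STUMPY function; the exclusion zone width, for instance, is an assumption:

```python
import numpy as np

def aamp_like(T, m):
    # Non-normalized matrix profile via the STOMP-style recurrence:
    # QT[i, j] = QT[i-1, j-1] - T[i-1]*T[j-1] + T[i+m-1]*T[j+m-1]
    n = len(T) - m + 1
    excl = m // 2                                     # assumed exclusion zone
    csum = np.concatenate(([0.0], np.cumsum(T * T)))
    sumsq = csum[m:] - csum[:-m]                      # ||T_j||^2 per window
    QT = np.correlate(T, T[:m], mode="valid")         # row 0: full dot products
    P = np.full(n, np.inf)
    for i in range(n):
        if i > 0:
            QT_new = np.empty(n)
            QT_new[1:] = QT[:-1] - T[i - 1] * T[:n - 1] + T[i + m - 1] * T[m:]
            QT_new[0] = np.dot(T[:m], T[i : i + m])   # first column from scratch
            QT = QT_new
        D = np.sqrt(np.maximum(sumsq[i] + sumsq - 2.0 * QT, 0.0))
        D[max(0, i - excl) : i + excl + 1] = np.inf   # skip trivial matches
        P[i] = D.min()
    return P
```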
-
Btw, now that you've dug into the code base, I'm curious what you think. Did you find that things were confusing, or was it fairly straightforward to follow? Your feedback and (more importantly) criticism is welcomed!
-
I haven't looked at the R implementation but, generally speaking, you should take care and test against a naive implementation instead (see `tests/naive.py`).
-
In the motif code I'm using, I do call core.mass at one point, so it seems like I should keep the work I've done there. Also, I've noticed you added some motif-related code recently, and I wanted to take a look to see if I could use it rather than the patchwork of code I threw together.
-
The code is very well documented and makes sense. This is my first time getting this deep into the weeds with a code base like stumpy, so I'm learning a lot. My solution so far adds some code to `core.py`. I think implementing SiMPLE-fast is going to be more difficult, however, because I'll be starting more from scratch rather than just adding a few lines of code here and there.
-
I'll take a look at tests/naive.py. I think I get what you're saying: I shouldn't focus too much on how things work with my data and my preconceived notions of what I want things to look like. But it was reassuring to pick a random sequence, find its closest match from the matrix profile based on my alterations to core.py, and see the result I expected.
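(For reference, the kind of spot check being described, written here against stock `stumpy.stump` since the modified `core.py` isn't shown:)

```python
import numpy as np
import stumpy

T = np.random.rand(1000)
m = 50

mp = stumpy.stump(T, m)   # columns: profile, index, left index, right index

i = np.random.randint(len(T) - m + 1)   # pick a random subsequence
j = int(mp[i, 1])                       # its nearest neighbor per the profile
print(f"Subsequence {i} matches subsequence {j} at distance {mp[i, 0]:.4f}")
```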
-
I want to make sure that we aren't conflating two things. In my mental model, there's computing a non-normalized matrix profile (which is what SiMPLE-fast does) and then, after a matrix profile is computed, identifying motifs. Whatever ends up in STUMPY should cover that first part: computing the matrix profile. Then, there are the motifs/discords. Currently, there is a "work-in-progress" PR for motif/discord discovery by @mexxexx, but it'll likely be some time before it is ready. I recommend taking a look at that. The goal of STUMPY is to provide the foundation for computing matrix profiles so that, hopefully, in a few more lines of code, the user can discover motifs and other interesting insights.
-
I think that this is the point I was trying to get across earlier. Instead of necessarily trying to bend the existing functions to include "non-normalized" support, I'd start by making "parody" functions like `aamp`.
-
Essentially, `tests/naive.py` holds slow, obviously correct reference implementations. And then you compare the output of the naive implementation against the output of the fast one in the unit tests.
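A sketch of that testing pattern (the names here are made up for illustration):

```python
import numpy as np
import numpy.testing as npt

def naive_aamp(T, m):
    # Brute force: raw pairwise Euclidean distances, no FFTs or recurrences.
    # Slow but obviously correct, which is the whole point of a naive version.
    n = len(T) - m + 1
    excl = m // 2
    P = np.full(n, np.inf)
    for i in range(n):
        for j in range(n):
            if abs(i - j) <= excl:
                continue                 # exclusion zone: skip trivial matches
            d = np.linalg.norm(T[i : i + m] - T[j : j + m])
            P[i] = min(P[i], d)
    return P

# In the unit test, the fast implementation is asserted against the naive one:
# npt.assert_almost_equal(naive_aamp(T, m), fast_aamp(T, m))
```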
-
I get what you're saying. My motivation for doing this non-normalized stuff is the motif discovery I'm working on, to make it work better for my use case. But I understand we want to keep these as separate issues. I'll keep that in mind as this moves forward.
👍
-
Thanks @codyschank!
-
Thanks for the update. I am a little bit surprised by the slow down of `aamp`. Are you using my Numba parallelized version?
-
I thought so, but let me double check.

Yes, I'm using the aamp function from the file you shared as a gist. And I do remember now that I watched the CPU ramp up to 100%, so it must be parallelizing.
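(For context, this presumably follows STUMPY's usual Numba pattern; a generic illustration rather than the gist's actual code:)

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True, fastmath=True)
def row_minima(D):
    # prange distributes the outer loop across all cores, which is why
    # CPU usage climbs to 100% while the profile is being computed
    out = np.empty(D.shape[0])
    for i in prange(D.shape[0]):
        out[i] = D[i].min()
    return out
```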
-
@codyschank Can you tell me what hardware and operating system you are using to generate the above table?
-
@codyschank I did some timing on my MacBook Pro. As you can see, `aamp` is not slow on my end.
-
It's a Windows machine, with the following specs. Interestingly, I did a test on the exact same data that you generated, with the same m, and it took only 4 seconds with aamp. I tried changing m to 120, which is what I use on my data, and it was still 4 seconds. So I tried inserting the same number of NaNs I have in my data (20), and that is what is slowing it down: it now takes 160 seconds. I've been inserting the NaNs as spacers between my concatenated data, but now that I think about it, I don't think I need to do that. Later on I use an exclusion zone around the breaks (where the spacers are) to make sure I don't get motifs across the breaks, which had been happening without the exclusion zones.
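(One way to drop the NaN spacers entirely, assuming the concatenation break points are known, sketched here as a post-hoc mask rather than anything from STUMPY:)

```python
import numpy as np

def mask_break_windows(P, break_points, m):
    # Invalidate profile entries for windows that straddle a concatenation
    # break; break_points holds the index of the first sample of each
    # appended series
    P = P.copy()
    for b in break_points:
        # windows starting at b-m+1 .. b-1 contain samples from both series
        P[max(0, b - m + 1) : b] = np.inf
    return P
```

Strictly speaking, the straddling windows should also be excluded as candidate nearest neighbors inside the distance computation, not just masked in the output.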
-
Ahhh, this is definitely the cause of the slow down. This makes a lot more sense now. Thanks for the info.
-
I just realized that there may be a way around this (the NaN slow down). I'll keep thinking about it.
-
What causes it? The unavailability of the fastmath option? Shouldn't that at least work if one uses inf instead of NaN?
-
It's caused by these steps in `aamp` that deal with the NaNs. I think I've solved this with a more efficient way of handling it. Please stay tuned!
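(My guess at the shape of that fix, as a sketch: flag NaN-containing windows once in O(n) up front instead of checking inside the pairwise loop, compute on a NaN-free copy, then invalidate the flagged entries:)

```python
import numpy as np

def flag_nan_windows(T, m):
    # Boolean mask of every length-m window containing a NaN/inf,
    # computed once with a cumulative sum rather than per pair
    bad = ~np.isfinite(T)
    counts = np.concatenate(([0], np.cumsum(bad)))
    return (counts[m:] - counts[:-m]) > 0

# Usage sketch:
# T_clean = np.where(np.isfinite(T), T, 0.0)   # keep the fast math finite
# P = aamp_like(T_clean, m)
# P[flag_nan_windows(T, m)] = np.inf           # invalidate affected windows
```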
-
@codyschank I've updated the gist and it should now produce the same results (when handling NaNs) but be nearly as performant as when the data contains no NaNs. Please let me know what you think!
-
Updated numbers: AAMP is now much faster, even faster than STUMP. Thanks for the fix!
-
Yessss! Let's go! Don't fret about the performance enhancement for now.
-
I tried the suggestion from @mexxexx, basically setting the following. This did not seem to work, so I guess it will take a closer look at the algorithm to see if a trick like this is possible with stumpy.stump. I attempted this because I haven't been able to figure out a correction to the AAMP algorithm that "zeroes" the sub-sequences. I've started asking some of my colleagues to join in the fun of figuring this out, but let me know if you have any ideas.
-
@codyschank If you look at the …
-
Thanks for helping with this. I'm going to close the issue, since your implementation of AAMP solved what I set out to do when I opened this task.
-
Splitting off from conversation under #149
Despite the findings of @tylerwmarrs related to the AAMP algorithm, I am still very much focused on calculating a matrix profile based on non-normalized Euclidean distance. I've been looking through the guts of the stumpy code, and today I took my first stab at making changes. It seems to me like I would need to make the changes to `stumpy.core._calculate_squared_distance()`. I've been looking for formulas that calculate raw Euclidean distance also using the dot product, and came across this paper: https://ieeexplore.ieee.org/document/8392419

And specifically, the section that expresses the squared Euclidean distance between sub-sequences A and B in terms of their dot product:

d²(A, B) = ‖A‖² + ‖B‖² − 2(A · B)

My thinking was that I could just substitute this formula for the one currently used:

```python
D_squared = np.abs(2 * m * (1.0 - (QT - m * μ_Q * M_T) / denom))
```
The problem, I guess, is that I don't have sub-sequences A and B available inside `stumpy.core._calculate_squared_distance()`. Before I go too deep down the rabbit hole, I wonder if you might have any opinion on whether this is an approach that makes sense for what I'm trying to do, and if so, any recommendations for how to implement it.
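For illustration, a hypothetical non-normalized counterpart to `_calculate_squared_distance` (not a real STUMPY API) that works purely from the dot product and precomputed window sums of squares, so sub-sequences A and B never need to be materialized:

```python
def _calculate_squared_distance_raw(QT, Q_sumsq, T_sumsq):
    # d^2(A, B) = ||A||^2 + ||B||^2 - 2 * (A . B), where QT = A . B and
    # Q_sumsq/T_sumsq are the precomputed sums of squares of A and B
    return max(Q_sumsq + T_sumsq - 2.0 * QT, 0.0)
```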