Matrix Profile with Non-normalized Euclidean Distance #207
-
Thanks @codyschank. A few things:

Note that this is completely off the top of my head, so please do not hold me to this later 😄 So, what I would recommend doing is:

Note that there is already a nice, optimal way to compute this. Does that make sense? What questions did this create?
-
That makes sense. I'll dig into it and see if I can get it working.
-
Awesome, I'm here if you need a sounding board! Feel free to also submit a "throwaway" PR if you'd like me to take a look at some code.
-
I was able to make changes to core.py so that core.mass gives the non-normalized Euclidean distance. This basically uses the equation above and is an implementation of the SiMPLE algorithm from the paper. Now it seems like I should create a new function that follows the SiMPLE-fast algorithm from the paper, to take the place of stumpy.stump in my processing. I also noticed the R package tsmp has an implementation of SiMPLE-fast, so I may test my data using that first to see if it gives the results I expect.
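For anyone following along, here is a minimal sketch of what that change amounts to, using the identity d²(Q, Tᵢ) = ‖Q‖² + ‖Tᵢ‖² − 2(Q · Tᵢ). The function name and the rolling-sum trick are illustrative, not STUMPY's actual internals:

```python
import numpy as np

def non_normalized_distance_profile(Q, T):
    # Raw (non-normalized) Euclidean distance profile of query Q against
    # every length-m subsequence of T: d^2 = ||Q||^2 + ||T_i||^2 - 2*(Q . T_i)
    m = len(Q)
    QT = np.correlate(T, Q, mode="valid")             # sliding dot products
    csum = np.concatenate(([0.0], np.cumsum(T * T)))
    T_sumsq = csum[m:] - csum[:-m]                    # rolling sum of squares
    D_squared = np.dot(Q, Q) + T_sumsq - 2.0 * QT
    return np.sqrt(np.maximum(D_squared, 0.0))        # clamp tiny negatives
```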
-
Okay, it basically looks like SiMPLE-fast is using a STOMP-like approach, so we should ignore the slower "SiMPLE" algorithm. I recommend creating a new function for it. Lastly, we'll have to work on adding proper unit tests (handling NaN/inf), but this shouldn't be too hard and I can help with that.
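To make the "STOMP-like" part concrete, here is a rough sketch of the O(1) sliding dot product update that both STOMP and SiMPLE-fast rely on. This is illustrative code, not the eventual STUMPY function; the exclusion zone width, for instance, is an assumption:

```python
import numpy as np

def aamp_like(T, m):
    # Non-normalized matrix profile via the STOMP-style recurrence:
    # QT[i, j] = QT[i-1, j-1] - T[i-1]*T[j-1] + T[i+m-1]*T[j+m-1]
    n = len(T) - m + 1
    excl = m // 2                                     # assumed exclusion zone
    csum = np.concatenate(([0.0], np.cumsum(T * T)))
    sumsq = csum[m:] - csum[:-m]                      # ||T_j||^2 per window
    QT = np.correlate(T, T[:m], mode="valid")         # row 0: full dot products
    P = np.full(n, np.inf)
    for i in range(n):
        if i > 0:
            QT_new = np.empty(n)
            QT_new[1:] = QT[:-1] - T[i - 1] * T[:n - 1] + T[i + m - 1] * T[m:]
            QT_new[0] = np.dot(T[:m], T[i : i + m])   # first column from scratch
            QT = QT_new
        D = np.sqrt(np.maximum(sumsq[i] + sumsq - 2.0 * QT, 0.0))
        D[max(0, i - excl) : i + excl + 1] = np.inf   # skip trivial matches
        P[i] = D.min()
    return P
```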
-
Btw, now that you've dug into the code base, I'm curious what you think. Did you find that things were confusing, or was it fairly straightforward to follow? Your feedback and (more importantly) criticism is welcomed!
-
I haven't looked at the R implementation but, generally speaking, you should take care and test against a naive implementation instead (see `tests/naive.py`).
-
In the motif code I'm using, I do call core.mass at one point, so it seems like I should keep the work I've done there. Also, I've noticed you added some motif-related code recently, and I wanted to take a look to see if I could use it rather than the patchwork of code I threw together.
-
The code is very well documented and makes sense. This is my first time getting this deep into the weeds with a code base like stumpy, so I'm learning a lot. My solution so far adds some code to `core.py`. I think implementing SiMPLE-fast is going to be more difficult, however, because I'll be starting more from scratch rather than just adding a few lines of code here and there.
-
I'll take a look at tests/naive.py. I think I get what you're saying: I shouldn't focus too much on how things work with my data and my preconceived notions of what I want things to look like. But it was reassuring to pick a random sequence, find its closest match from the matrix profile based on my alterations to core.py, and see the result I expected.
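(For reference, the kind of spot check being described, written here against stock `stumpy.stump` since the modified `core.py` isn't shown:)

```python
import numpy as np
import stumpy

T = np.random.rand(1000)
m = 50

mp = stumpy.stump(T, m)   # columns: profile, index, left index, right index

i = np.random.randint(len(T) - m + 1)   # pick a random subsequence
j = int(mp[i, 1])                       # its nearest neighbor per the profile
print(f"Subsequence {i} matches subsequence {j} at distance {mp[i, 0]:.4f}")
```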
-
I want to make sure that we aren't conflating two things. In my mental model, there's computing a non-normalized matrix profile (which is what SiMPLE-fast does) and then, after a matrix profile is computed, identifying motifs. Whatever ends up in STUMPY should cover that first part: computing the matrix profile. Then, there are the motifs/discords. Currently, there is a "work-in-progress" PR for motif/discord discovery by @mexxexx, but it'll likely be some time before it is ready. I recommend taking a look at that. The goal of STUMPY is to provide the foundation for computing matrix profiles so that, hopefully, in a few more lines of code, the user can discover motifs and other interesting insights.
-
I think that this is the point I was trying to get across earlier. Instead of necessarily trying to bend the existing functions to include "non-normalized" support, I'd start by making "parody" functions like `aamp`.
-
Essentially, `tests/naive.py` holds slow, obviously correct reference implementations. And then you compare the output of the naive implementation against the output of the fast one in the unit tests.
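A sketch of that testing pattern (the names here are made up for illustration):

```python
import numpy as np
import numpy.testing as npt

def naive_aamp(T, m):
    # Brute force: raw pairwise Euclidean distances, no FFTs or recurrences.
    # Slow but obviously correct, which is the whole point of a naive version.
    n = len(T) - m + 1
    excl = m // 2
    P = np.full(n, np.inf)
    for i in range(n):
        for j in range(n):
            if abs(i - j) <= excl:
                continue                 # exclusion zone: skip trivial matches
            d = np.linalg.norm(T[i : i + m] - T[j : j + m])
            P[i] = min(P[i], d)
    return P

# In the unit test, the fast implementation is asserted against the naive one:
# npt.assert_almost_equal(naive_aamp(T, m), fast_aamp(T, m))
```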
-
I get what you're saying. My motivation for doing this non-normalized stuff is the motif discovery I'm working on, to make it work better for my use case. But I understand we want to keep these as separate issues. I'll keep that in mind as this moves forward.
👍
-
Thanks @codyschank!
-
Thanks for the update. I am a little bit surprised by the slow down of `aamp`. Are you using my Numba parallelized version?
-
I thought so, but let me double check.

Yes, I'm using the aamp function from the file you shared as a gist. And I do remember now that I watched the CPU ramp up to 100%, so it must be parallelizing.
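(For context, this presumably follows STUMPY's usual Numba pattern; a generic illustration rather than the gist's actual code:)

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True, fastmath=True)
def row_minima(D):
    # prange distributes the outer loop across all cores, which is why
    # CPU usage climbs to 100% while the profile is being computed
    out = np.empty(D.shape[0])
    for i in prange(D.shape[0]):
        out[i] = D[i].min()
    return out
```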
-
@codyschank Can you tell me what hardware and operating system you are using to generate the above table?
-
@codyschank I did some timing on my MacBook Pro. As you can see, `aamp` is not slow on my end.
-
It's a Windows machine, with the following specs. Interestingly, I did a test on the exact same data that you generated, with the same m, and it took only 4 seconds with aamp. I tried changing m to 120, which is what I use on my data, and it was still 4 seconds. So I tried inserting the same number of NaNs I have in my data (20), and that is what is slowing it down: it now takes 160 seconds. I've been inserting the NaNs as spacers between my concatenated data, but now that I think about it, I don't think I need to do that. Later on I use an exclusion zone around the breaks (where the spacers are) to make sure I don't get motifs across the breaks, which had been happening without the exclusion zones.
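(One way to drop the NaN spacers entirely, assuming the concatenation break points are known, sketched here as a post-hoc mask rather than anything from STUMPY:)

```python
import numpy as np

def mask_break_windows(P, break_points, m):
    # Invalidate profile entries for windows that straddle a concatenation
    # break; break_points holds the index of the first sample of each
    # appended series
    P = P.copy()
    for b in break_points:
        # windows starting at b-m+1 .. b-1 contain samples from both series
        P[max(0, b - m + 1) : b] = np.inf
    return P
```

Strictly speaking, the straddling windows should also be excluded as candidate nearest neighbors inside the distance computation, not just masked in the output.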
-
Ahhh, this is definitely the cause of the slow down. This makes a lot more sense now. Thanks for the info.
-
I just realized that there may be a way around this (the NaN slow down). I'll keep thinking about it.
-
What causes it? The unavailability of the fastmath option? Shouldn't that at least work if one uses inf instead of NaN?
-
It's caused by these steps in `aamp` that deal with the NaNs. I think I've solved this with a more efficient way of handling it. Please stay tuned!
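(My guess at the shape of that fix, as a sketch: flag NaN-containing windows once in O(n) up front instead of checking inside the pairwise loop, compute on a NaN-free copy, then invalidate the flagged entries:)

```python
import numpy as np

def flag_nan_windows(T, m):
    # Boolean mask of every length-m window containing a NaN/inf,
    # computed once with a cumulative sum rather than per pair
    bad = ~np.isfinite(T)
    counts = np.concatenate(([0], np.cumsum(bad)))
    return (counts[m:] - counts[:-m]) > 0

# Usage sketch:
# T_clean = np.where(np.isfinite(T), T, 0.0)   # keep the fast math finite
# P = aamp_like(T_clean, m)
# P[flag_nan_windows(T, m)] = np.inf           # invalidate affected windows
```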
-
@codyschank I've updated the gist and it should now produce the same results (when handling NaNs) but be nearly as performant as when the data contains no NaNs. Please let me know what you think!
-
Updated numbers: AAMP is now much faster, even faster than STUMP. Thanks for the fix!
-
Yessss! Let's go! Don't fret about the performance enhancement for now.
-
I tried the suggestion from @mexxexx, basically setting the following. This did not seem to work, so I guess it will take a closer look at the algorithm to see if a trick like this is possible with stumpy.stump. I attempted this because I haven't been able to figure out a correction to the AAMP algorithm that "zeroes" the sub-sequences. I've started asking some of my colleagues to join in the fun of figuring this out, but let me know if you have any ideas.
-
@codyschank If you look at the …
-
Thanks for helping with this. I'm going to close the issue, since your implementation of AAMP solved what I set out to do when I opened this task.
-
Splitting off from conversation under #149
Despite the findings of @tylerwmarrs related to the AAMP algorithm, I am still very much focused on calculating a matrix profile based on non-normalized Euclidean distance. I've been looking through the guts of the stumpy code, and today I took my first stab at making changes. It seems to me like I would need to make the changes to `stumpy.core._calculate_squared_distance()`. I've been looking for formulas that calculate raw Euclidean distance also using the dot product, and came across this paper: https://ieeexplore.ieee.org/document/8392419

And specifically, the section that expresses the squared Euclidean distance between sub-sequences A and B in terms of their dot product:

d²(A, B) = ‖A‖² + ‖B‖² − 2(A · B)

My thinking was that I could just substitute this formula for the one currently used:

```python
D_squared = np.abs(2 * m * (1.0 - (QT - m * μ_Q * M_T) / denom))
```
The problem, I guess, is that I don't have sub-sequences A and B available inside `stumpy.core._calculate_squared_distance()`. Before I go too deep down the rabbit hole, I wonder if you might have any opinion on whether this is an approach that makes sense for what I'm trying to do, and if so, any recommendations for how to implement it.
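For illustration, a hypothetical non-normalized counterpart to `_calculate_squared_distance` (not a real STUMPY API) that works purely from the dot product and precomputed window sums of squares, so sub-sequences A and B never need to be materialized:

```python
def _calculate_squared_distance_raw(QT, Q_sumsq, T_sumsq):
    # d^2(A, B) = ||A||^2 + ||B||^2 - 2 * (A . B), where QT = A . B and
    # Q_sumsq/T_sumsq are the precomputed sums of squares of A and B
    return max(Q_sumsq + T_sumsq - 2.0 * QT, 0.0)
```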