Find values of second and third peak from matrix profile output #555

roumail · 2022-03-02T16:40:35Z

roumail
Mar 2, 2022

Hi, I'm new to the matrix profile, so sorry if this is obvious to others. I've plotted a matrix profile and I'm looking to capture the values of the second and third peaks from my "param" plot below. To be clear, these are the timepoints between 3000 and 5000. My sequence length here is about 500, which makes sense I guess.

The tutorial for Motif and Discord discovery talks about taking the argsort[-1], but for me that would be the output I get right now, which isn't of interest.

I thought about combining this with "fluss" to remove the downward-sloping half of my data and, well, that didn't exactly work. I also thought of just converting the matrix profile/index output to pandas and applying heuristics to identify the max value in each subsequence. However, at this point I think I'm overcomplicating this.

What am I missing here? Any tips?

seanlaw · 2022-03-02T17:49:17Z

seanlaw
Mar 2, 2022
Maintainer

@roumail Thank you for your question and welcome to the STUMPY community.

The tutorial for Motif and Discord discovery talks about taking the argsort[-1], but for me that would be the output I get right now, which isn't of interest.

Can you please elaborate on what you mean by this? argsort[-1] would give you the single best discord but you could continue down that path by looking at the second best, third best discords, and so on thereafter with something like:

import numpy as np
import stumpy
from stumpy import config, core

mp = stumpy.stump(T, m)
# Find top 10 discords
P = mp[:, 0].astype(np.float64)  # float copy, so the original matrix profile is not overwritten
excl_zone = int(np.ceil(m / config.STUMPY_EXCL_ZONE_DENOM))
for k in range(10):
    discord_idx = np.argsort(P)[-1]
    print(discord_idx)
    core.apply_exclusion_zone(P, discord_idx, excl_zone, -1.0)

Also, in case it matters, we are currently working on adding a new feature to look for discords in your data but it won't be ready for a while. Please see PR #505

4 replies

seanlaw Mar 2, 2022
Maintainer

@NimaSarajpoor Do you have anything that you may be able to add or thoughts that you could contribute?

roumail Mar 3, 2022
Author

Hi @seanlaw ! Thank you so much for your reply and excuse my poorly phrased request. I'm surprised by how responsive this community is. Thank you very much!!

So by argsort[-1], I am referring to the part in the tutorial where after computing the matrix profile, we apply np.argsort(mp[:, 0])[-1] to get the "discord index". I understand the discord index to correspond to the index where we observe the highest value in the matrix profile.

I think the snippet you have here is exactly what I was looking for since I was trying to hack my way towards the apply_exclusion_zone functionality. Although I probably need to check my version since I see that I get an error when calling core.apply_exclusion_zone()

Error:
 stumpy.core.apply_exclusion_zone(matrix_profile, discord_idx, excl_zone, -1.0)

TypeError: too many arguments: expected 3, got 4

My stumpy version: 1.9.2

seanlaw Mar 3, 2022
Maintainer

@roumail core.apply_exclusion_zone has been around for a while but it was only recently that we added the ability to specify the value to apply when filling in the exclusion zone. To save us both from having to worry about the correct STUMPY version for now, let's simply try:

import numpy as np
import stumpy
from stumpy import config

def apply_exclusion_zone(a, idx, excl_zone, val):
    zone_start = max(0, idx - excl_zone)
    zone_stop = min(a.shape[-1], idx + excl_zone)
    a[..., zone_start : zone_stop + 1] = val


mp = stumpy.stump(T, m)
# Find top 10 discords
P = mp[:, 0].copy()  # We make a copy to avoid overwriting our original matrix profile
excl_zone = int(np.ceil(m / config.STUMPY_EXCL_ZONE_DENOM))
for k in range(10):
    discord_idx = np.argsort(P)[-1]
    print(discord_idx)
    apply_exclusion_zone(P, discord_idx, excl_zone, -1.0)

Can you see if this works?

roumail Mar 3, 2022
Author

Yes, it certainly does! I'm adding all the code I used below for reference. Applying the exclusion zone, I'm getting a different output, but I need to think more deeply about the implications of the different discord indices I find. Try changing the exclusion zone (currently at 4) to see how the output changes each time.

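import matplotlib.pyplot as plt
import numpy as np
import stumpy
from matplotlib.patches import Rectangle

# param_values, SEQ_WINDOW_LENGTH, df and BASE_PARAMETER are assumed to be defined beforehand
mp_base = stumpy.stump(param_values, m=SEQ_WINDOW_LENGTH)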

def apply_exclusion_zone(a, idx, excl_zone, val):
    zone_start = max(0, idx - excl_zone)
    zone_stop = min(a.shape[-1], idx + excl_zone)
    a[..., zone_start : zone_stop + 1] = val

# Find top 10 discords
matrix_profile = mp_base[:, 0].copy() # We make a copy to avoid overwriting our original matrix profile
excl_zone = int(np.ceil(SEQ_WINDOW_LENGTH / stumpy.config.STUMPY_EXCL_ZONE_DENOM))
discords = []
for k in range(10):
    discord_idx = np.argsort(matrix_profile)[-1]
    discords.append(discord_idx)
    print(discord_idx)
    apply_exclusion_zone(matrix_profile, discord_idx, excl_zone, -1.0)

# plotting.. 
fig, axs = plt.subplots(2, sharex=True, gridspec_kw={"hspace": 0}, figsize=(11, 11))
plt.suptitle("Discord (Anomaly/Novelty) discovery", fontsize="25")
axs[0].plot(df[BASE_PARAMETER].values)
axs[0].set_ylabel("param", fontsize="20")
for x in discords:
    rect = Rectangle(
        (x, 0),
        SEQ_WINDOW_LENGTH,
        df[BASE_PARAMETER].values.max(),
        facecolor="lightgrey",
    )
    axs[0].add_patch(rect)
axs[1].set_xlabel("Time", fontsize="20")
axs[1].set_ylabel("Matrix Profile", fontsize="20")
for x in discords:
    axs[1].axvline(x=x, linestyle="dashed")
axs[1].plot(matrix_profile)  # this copy has the exclusion zones filled with -1.0; plot mp_base[:, 0] for the original profile
plt.show()
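
Since the original question asks for the values at the peaks and not only their positions, the loop above can be wrapped so that it returns both. This is a minimal sketch in plain NumPy, assuming the profile contains no NaN values. `top_k_discords` and its parameters are hypothetical names for illustration and are not part of STUMPY; the default denominator of 4 simply mirrors the value mentioned in this thread.

```python
import numpy as np


def top_k_discords(P, m, k=3, excl_zone_denom=4):
    """Return up to k (index, value) pairs for the largest values in P.

    Hypothetical helper: P is a 1-D matrix profile and m is the window size.
    """
    P = np.asarray(P, dtype=np.float64).copy()  # never modify the caller's array
    excl_zone = int(np.ceil(m / excl_zone_denom))
    found = []
    for _ in range(k):
        idx = int(np.argmax(P))
        if P[idx] == -np.inf:
            break  # everything that is left has already been excluded
        found.append((idx, float(P[idx])))
        start = max(0, idx - excl_zone)
        stop = min(P.shape[0], idx + excl_zone + 1)
        P[start:stop] = -np.inf  # mask the neighbourhood of this discord
    return found


# Toy profile with three bumps of different heights (made-up numbers)
P = np.zeros(100)
P[[20, 21, 50, 80]] = [5.0, 4.0, 9.0, 7.0]
print(top_k_discords(P, m=8, k=3))
```

The second and third entries of the returned list are then the "second and third peak" of the profile, each with its start index and its matrix profile value.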

NimaSarajpoor · 2022-03-02T18:39:16Z

NimaSarajpoor
Mar 2, 2022
Maintainer

@roumail @seanlaw
I am trying to understand what you want to get here (and I think you should also think about what you want to get: are you interested just in the peak, or in an anomalous subsequence?)

If you just want the peak values (not the indices), you can simply use np.max() and (maybe) exclude the nearby neighbors to find the next one.
If you want the indices of the peaks (only the peak, and NOT a subsequence that contains the peak), you can likewise use np.argmax().

BUT,

If you want to find subsequences that contain a peak, then I think your matrix profile can make sense here. Note that the peak of the matrix profile does not necessarily match the peak of the time series. Each index in the matrix profile corresponds to the START index of a subsequence of length 500. Let us take a look at the peak of the time series that is close to index 4000. If you pay attention to the behavior of the time series around this peak, you can see that there is a small bump just before it. Right? So, I think the peak of the matrix profile (between 3000 and 4000) makes sense because that is just the START of a subsequence (of length 500) that can capture both that small bump and the peak. Note that I do not have access to your data or the matrix profile, so it is not easy for me to be sure about what I said. I simply used Paint 😄 see below (please ignore the black vertical lines):

I guess the matrix profile is working well here, as it captures an anomaly (which is a subsequence, not just a single timestamp). The left red line is one of the peaks of the matrix profile, between 3000 and 4000. That is just the start of the subsequence. The end of this subsequence is the second red line which, I think, captures the peak as well.

(1) Btw, you may want to try both normalize=True / normalize=False here. (However, I guess the output might change a little in this case)

(2) Also, if you choose a smaller window size (e.g. 250), I feel your matrix profile peaks should get closer to the peaks of the time series. (However, you still cannot expect the peak of the matrix profile to match the peak of the time series.)
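
Because each matrix profile index is the START of a subsequence of length m, the time series peak captured by a discord lies somewhere inside T[i : i + m] and can be looked up in the raw series. The following is a minimal sketch on synthetic data. `peak_in_window` is a hypothetical helper, and the series and the start index 3550 are made up for illustration; they are not taken from the data in this thread.

```python
import numpy as np


def peak_in_window(T, start, m):
    """Return (index, value) of the largest value of T within T[start : start + m]."""
    T = np.asarray(T, dtype=np.float64)
    window = T[start : start + m]
    offset = int(np.argmax(window))
    return start + offset, float(window[offset])


# Synthetic series: a small bump at 3600 followed by a large peak at 4000
T = np.zeros(5000)
T[3600] = 10.0
T[4000] = 80.0

# Suppose the matrix profile peaks at start index 3550
print(peak_in_window(T, start=3550, m=500))  # window covers both the bump and the peak
print(peak_in_window(T, start=3550, m=250))  # a shorter window only reaches the bump
```

With m=500 the window starting at 3550 reaches the large peak, while with m=250 it stops before it, which is one way to see why the window size shifts where the matrix profile peaks relative to the time series.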

6 replies

roumail Mar 3, 2022
Author

I am trying to understand what you want to get here (and I think you should also think about what you want to get: are you interested just in the peak, or in an anomalous subsequence?)
If you just want the peak values (not the indices), you can simply use np.max() and (maybe) exclude the nearby neighbors to find the next one.
If you want the indices of the peaks (only the peak, and NOT a subsequence that contains the peak), you can likewise use np.argmax().

I agree that I need to clarify what I'm looking for too :D. I'm actually very new to applying the matrix profile, so I'm exploring at the moment. I'm mostly interested in not just the value of the param at these peaks (see the peaks I'm talking about below) but also the index where these peaks occur. I was planning to use these indices for computations involving other time series.

I'm interested in both the peak and indices where those peaks happen. However, I wasn't familiar with applying this exclusion criteria methodology that @seanlaw shared in the previous snippet. I hadn't come across it in the tutorials (which are fantastic btw!) so I was trying to implement something on my own.

If you want to find subsequences that contain a peak, then I think your matrix profile can make sense here. Note that the peak of the matrix profile does not necessarily match the peak of the time series. Each index in the matrix profile corresponds to the START index of a subsequence of length 500. Let us take a look at the peak of the time series that is close to index 4000. If you pay attention to the behavior of the time series around this peak, you can see that there is a small bump just before it. Right? So, I think the peak of the matrix profile (between 3000 and 4000) makes sense because that is just the START of a subsequence (of length 500) that can capture both that small bump and the peak. Note that I do not have access to your data or the matrix profile, so it is not easy for me to be sure about what I said. I simply used Paint 😄 see below (please ignore the black vertical lines):

The output from the matrix profile does make sense, though I wasn't sure how to use that information to get what I was looking for. I'm definitely interested in the subsequences with peaks from the matrix profile. For example, looking at that subsequence between ~3500 and 4000, I'm mainly interested in the big bump. I agree that the mini bump that happens prior is getting "confounded" with the bigger bump, but that's fine. It might still be interesting to reduce my window length.

I was thinking about your problem in my mind and I was wondering if you want to ignore the first part of time series. In that case, you may want to calculate the slopes as: T[1:]-T[:-1]. So, you want to capture parts where there is huge slope. You can apply matrix profile on this current series.

The discord currently identified by the matrix profile (the grey rectangle) isn't important for me; it's simply a data artifact. I could obviously have removed the first half of the data to avoid this problem, but I wanted to devise a smarter way to find the peaks since I have many different time series like these! Your suggestion to work with T[1:]-T[:-1] is quite interesting! Thanks!
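
The slope idea quoted above is a first difference, which NumPy also provides as `np.diff`. Here is a small sketch on a made-up series (a downward ramp followed by a flat part with one spike), showing that the steady ramp collapses to a constant slope while the spike stands out:

```python
import numpy as np

# Synthetic series: a long downward ramp followed by a flat part with one spike
T = np.concatenate([np.linspace(100.0, 0.0, 101), np.zeros(50)])
T[120] = 30.0

slopes = T[1:] - T[:-1]  # equivalent to np.diff(T); slopes[i] = T[i + 1] - T[i]

print(len(T), len(slopes))  # the difference series is one element shorter
print(slopes[:3])  # the ramp becomes a constant slope
print(int(np.argmax(np.abs(slopes))))  # position of the steepest change
```

Note that `slopes` is one element shorter than `T`, so an index `i` in `slopes` refers to the step between `T[i]` and `T[i + 1]`. The matrix profile could then be computed on `slopes` in place of `T`.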

(1) Btw, you may want to try both normalize=True / normalize=False here. (However, I guess the output might change a little in this case)

I hadn't considered the non-normalized version. It might be interesting to look into. Thanks for pointing that out!
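
One way to see what the normalize option changes: with z-normalization, two subsequences that have the same shape but a different offset and scale are treated as (nearly) identical, whereas the plain Euclidean distance keeps the amplitude difference. The sketch below uses plain NumPy and does not call STUMPY; `znorm` is a hypothetical helper and the two bumps are made-up data.

```python
import numpy as np


def znorm(x):
    """Z-normalize a 1-D array (zero mean, unit standard deviation)."""
    x = np.asarray(x, dtype=np.float64)
    return (x - x.mean()) / x.std()


small_bump = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
big_bump = 40.0 * small_bump + 5.0  # same shape, larger amplitude and offset

d_norm = np.linalg.norm(znorm(small_bump) - znorm(big_bump))  # close to zero
d_raw = np.linalg.norm(small_bump - big_bump)  # dominated by the amplitude gap

print(d_norm, d_raw)
```

So if the size of a bump matters (a mini bump versus a big bump), the non-normalized distance may be the more relevant one to explore.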

roumail Mar 3, 2022
Author

BUT,

If you want to find subsequences that contain a peak, then I think your matrix profile can make sense here. Note that the peak of the matrix profile does not necessarily match the peak of the time series. Each index in the matrix profile corresponds to the START index of a subsequence of length 500. Let us take a look at the peak of the time series that is close to index 4000. If you pay attention to the behavior of the time series around this peak, you can see that there is a small bump just before it. Right? So, I think the peak of the matrix profile (between 3000 and 4000) makes sense because that is just the START of a subsequence (of length 500) that can capture both that small bump and the peak. Note that I do not have access to your data or the matrix profile, so it is not easy for me to be sure about what I said. I simply used Paint 😄 see below (please ignore the black vertical lines):

The matrix profile I have so far does indeed make sense, though I'm not sure how to use it to get the information I want. For example, I'm definitely interested in the subsequences with peaks from the matrix profile. What I'm looking to get from that subsequence between ~3500 and 4000 is the big bump. The mini bump that happens prior to that could be interesting but, like you said, I'd need to reduce my window length if I want that mini bump to not get "confounded" with the bigger bump.

(1) Btw, you may want to try both normalize=True / normalize=False here. (However, I guess the output might change a little in this case)

I hadn't considered the non-normalized version. It might be interesting to look into. Thanks for pointing that out!

NimaSarajpoor Mar 3, 2022
Maintainer

Thanks for the clarification. As you see, the more you discuss things (here and in your mind), the clearer the problem becomes.

I'm definitely interested in the subsequences with peaks from the matrix profile. What I'm looking to get from that subsequence between ~3500 and 4000 is the big bump.

Can you please explain why the following answer may not work for you? (I am trying to understand the goal you have in mind.)
So, let us assume our time series T is: T = [0, 0, 0, 1, 0, 100, 0, 0]

peak_value_index = np.argmax(T)
peak_value = T[peak_value_index]

And you can get the subsequence as follows: T[peak_value_index-1 : peak_value_index+2] (it has a length of three). Now, can you convince yourself why this answer is not good? (Then you can try to find the other peaks and construct a subsequence of length three for each of those.)
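
The toy example can be run as written. The sketch below adds one safeguard that is not mentioned above: the window start is clipped at 0, because a negative slice start would be interpreted by NumPy as counting from the end of the array. `window_around` is a hypothetical helper introduced only for illustration.

```python
import numpy as np

T = np.array([0, 0, 0, 1, 0, 100, 0, 0])

peak_value_index = int(np.argmax(T))
peak_value = T[peak_value_index]


def window_around(T, idx, half_width=1):
    """Return T[idx - half_width : idx + half_width + 1], clipped to the array bounds."""
    start = max(0, idx - half_width)
    stop = min(len(T), idx + half_width + 1)
    return T[start:stop]


print(peak_value_index, peak_value)  # 5 100
print(window_around(T, peak_value_index))  # the length-three subsequence around the peak
```

With `half_width=1` this reproduces the length-three subsequence from the example; near either end of the series the returned window is simply shorter.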

Another example:
Let us think about a case where the small bump before that peak gets bigger (i.e. larger in value). Let us assume it gets closer to 80 (which is close to the peak value of the param graph between index 3000 and 4000). Now, do you still ignore this bump? What is the threshold you have in mind? Are you trying to be subjective here?

NimaSarajpoor Mar 3, 2022
Maintainer

@roumail
So, what the matrix profile does for you is help you avoid thinking subjectively. The method is precise and the result is data-driven (@seanlaw is the expert and I just do the talking 😆). The only thing that you need to decide is m (the window size). If, based on your application, m=500 is a good choice AND what you are looking for is an anomaly (i.e. a surprising pattern) that "may" contain peaks (which, I guess, should be true most of the time, i.e. the anomaly most likely contains peaks), then you should trust the result because that is what it is. So, if you see that the bump is included in the subsequence, you may need to think about and justify the result.

However, if you have already clarified things in your mind and you know you just need a subsequence that has a peak (and that is enough for you, and you do not care about what happens around that peak), then I recommend going through my answer above one more time.

Just for the record, I faced the same challenge myself before, where I had to make the problem clear in my head, and sometimes it is not easy (I still face such challenges from time to time).

@seanlaw helped me a lot before regarding this matter. And now I am just trying to help you think more about your problem to make it clearer, and maybe you will help another person later when they face such a challenge 😄

roumail Mar 7, 2022
Author

Hello @NimaSarajpoor, sorry for the delay in responding (work...). I think the toy example does work for me, specifically together with the use of the "exclusion" snippet that @seanlaw shared with me earlier.

# retrieve peak values 
peak_value_index = np.argmax(T)
peak_value = T[peak_value_index]

# retrieve subsequence values

Regarding the second example, where you ask when the mini bump starts to be relevant, let me rephrase what I'm looking for. We have 4 known peaks in the data: one at the beginning, where there's a drop, then two spikes, and then a final spike at the end of the curve. These peaks and spikes are known ahead of time based on the data-generating process. I'm trying to come up with a reasonable sequence window (currently 500) that leads me to correctly find these distinct segments of the time series. I was previously trying to use fluss for this, but if I can identify these peaks in an automated manner, I can create these regions myself.

Moreover, the mini-bump in our case will always remain a mini-bump.

Based on the above, I'm actually less interested in finding "anomalies" and more interested in identifying the indices of the different regions. Again, I appreciate all the advice!

roumail · 2022-03-07T10:49:39Z

roumail
Mar 7, 2022
Author

@seanlaw and @NimaSarajpoor - I have a somewhat unrelated question that I'm not sure where best to ask. At the risk of polluting this thread with this request, could you please let me know what would be the best channel to reach out to your team? I'm wondering if you'd have someone interested in presenting material about STUMPY/matrix profile?

I am a data scientist in a global pharmaceutical company and we often invite internal/external speakers on topics of machine learning to foster knowledge sharing about new and exciting methods. We do this via an online meeting ~ 1 hour (15 minute questions) for a group of 20-30 people.

Looking forward to hearing back from you!

4 replies

seanlaw Mar 7, 2022
Maintainer

@roumail Thank you for asking. Unfortunately, I do not have the bandwidth at this time due to other commitments. I know that it is less favorable/interactive, but I can also recommend this pre-recorded STUMPY talk.

roumail Mar 7, 2022
Author

Hi @seanlaw yes I've seen the material posted here! It's really great. If there's anyone else you could refer, a researcher/student, that would be great too! If not, I won't ask much more as the community has already provided us all with superb material like the one you posted :)

seanlaw Mar 7, 2022
Maintainer

Sadly, this open source work is completely volunteer work and I am unable to endorse anybody in particular. Perhaps you can circle back at the end of summer/fall, as my calendar may be more forgiving then.

Of course, you can also reach out to the original authors in Eamonn Keogh's lab who created matrix profiles:

https://www.cs.ucr.edu/~eamonn/MatrixProfile.html

roumail Mar 8, 2022
Author

Thanks a lot for your suggestions!!
