feat(vrl): add caching feature for VRL #21348
This adds additional VRL functions for reading and storing data in caches that can be configured in global options. Caches can store any VRL value and are meant to hold data for shorter periods. Every entry is assigned a TTL (time-to-live) based on the cache configuration and is removed when that TTL expires.
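To make the described behavior concrete, here is a minimal sketch of a TTL-bound store, assuming string keys and values and a single cache-wide TTL. The `TtlCache` type and its methods are hypothetical names chosen for illustration; they are not the functions or types added by this PR.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical in-memory store (illustration only, not the PR's code).
struct TtlCache {
    ttl: Duration,
    // Each entry keeps its value together with its expiry deadline.
    entries: HashMap<String, (String, Instant)>,
}

impl TtlCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Store a value; every entry gets the cache-wide TTL.
    fn set(&mut self, key: &str, value: &str, now: Instant) {
        self.entries
            .insert(key.to_string(), (value.to_string(), now + self.ttl));
    }

    /// Return the value only if it has not expired at `now`.
    fn get(&self, key: &str, now: Instant) -> Option<&str> {
        self.entries
            .get(key)
            .filter(|(_, deadline)| now < *deadline)
            .map(|(value, _)| value.as_str())
    }

    /// Drop every entry whose deadline has passed.
    fn purge_expired(&mut self, now: Instant) {
        self.entries.retain(|_, (_, deadline)| now < *deadline);
    }
}
```

Passing `now` explicitly keeps the sketch deterministic; a real component would more likely read the clock itself and run the purge on a timer.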
@jszwedko @pront This is a very rough draft of caching feature for VRL. The idea is to have separate storage for VRL values that can be used across different VRL runs. I have opted for a very simple interface, with just There are many things to consider for this:
I hope this feature does not break some of the promises VRL gives. Let me know your thoughts on this and whether you think this feature makes sense. If you think it does, let me know if this interface fits VRL well, or if I need to figure something else out.
(chiming in here since this is for some of the work that we need to have done, and esensar and I have talked about this concept offline) Cache deletions: would this just be writing to a key with an empty object to imply a deletion, or should there be an explicit cache_delete that can be called? Monitoring: One of the other things that is probably a requirement for this would be monitoring. Some ideas that could be discussed:
Concurrency: It seems that the interval for concurrency would be very very short if there is a lock on the object for only as long as it takes to read or write to the memory space. The user would just have to understand that cache data may change/vanish between a read and a write, or between two reads. For our purposes, the cache is to prevent multiple "heavy" events from happening. Each event predictably produces the same outcome for some reasonably long period of time, but they do change state eventually. (Our environment: DNS data.) Cache updates: having TTL be "visible" for cached objects within VRL is necessary, since I can envision a rather crude predictive cache update method that uses randomness to refresh a cache item, based on the TTL. Otherwise, if a highly-used cached object expires, then there may be many threads trying to update it at the same time which would be very wasteful and probably a heavy load that is not desirable (after all, a cached object implies that the cache is faster than some other method.) Better to apply predictive decline to the chances that an object will be refreshed. |
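The "predictive" refresh idea above can be sketched as a probability that grows as an entry approaches expiry, so that only a few readers refresh it early instead of all of them at once. This is a rough illustration under assumptions: the linear ramp, the `window` parameter, and both function names are hypothetical, and the caller is assumed to supply `roll` as a uniform random number in `[0, 1)`.

```rust
use std::time::Duration;

/// Chance (0.0..=1.0) that a reader should refresh an entry early.
/// `remaining` is the entry's remaining TTL; `window` is the final
/// stretch of the TTL in which early refreshes are allowed.
fn refresh_probability(remaining: Duration, window: Duration) -> f64 {
    if window.is_zero() || remaining >= window {
        // Outside the refresh window: never refresh early.
        return 0.0;
    }
    // Linear ramp: 0.0 at the start of the window, 1.0 at expiry.
    1.0 - remaining.as_secs_f64() / window.as_secs_f64()
}

/// Decide whether this particular read should trigger a refresh.
/// `roll` is a uniform random number in [0, 1) supplied by the caller.
fn should_refresh(remaining: Duration, window: Duration, roll: f64) -> bool {
    roll < refresh_probability(remaining, window)
}
```

This only works if the remaining TTL is visible to the caller at read time, which is the requirement raised in the comment above.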
Regarding the cache deletions, I think explicit function is better ( When it comes to TTL updates, would it make sense for reads to also update TTL, because I guess the idea of TTL is to avoid storing data that is not needed and frequent reads would mean that the data is still used. |
This is a great point, we should create metrics for this. I would also argue that we should use those metrics as proof of the perf gains from this optimization and add them to the PR description. |
Sure - I have no opinion on the method for deletions, other than it needs to be possible somehow by events within VRL contexts.
I would strongly disagree here, or this would have to be a separate TTL. A cache entry has an expiration because it is "volatile" data that loses accuracy over time, and it needs to be refreshed at the end of the TTL regardless of how many times it has been used. Having both a "last read" TTL and an "expiry TTL" could sometimes be useful, but we cannot combine them into a single TTL that gets refreshed/reset upon read. |
Right, I haven't thought about that. It makes sense to just have a single TTL, but with additional VRL function to read it, to be able to better control it. |
Is it necessary or desirable to have an additional VRL function to read the TTL, or should it somehow happen at the exact moment of the "cache_get"? I think it could create race conditions if another function is required, since then the fetching of the cached object itself and the TTL would not have the same timestamp. "cache_get(,[,ttl_variable])" might work, perhaps? Then ttl_variable would be an object which would be set to the value of the TTL, and the user could define the name of that object. I'm not sure that's the best way to do it, but I think fetching the cached object AND its TTL simultaneously is a good idea. |
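One way to picture the "same timestamp" requirement is a lookup that returns the value and its remaining TTL from a single call, both evaluated against one `now`. The sketch below assumes entries are stored as a value plus an expiry deadline; `get_with_ttl` is a hypothetical helper, not a function from this PR or from VRL.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Look up a value and its remaining TTL against one timestamp, so the
/// two can never disagree the way two separate calls could.
fn get_with_ttl<'a>(
    entries: &'a HashMap<String, (String, Instant)>,
    key: &str,
    now: Instant,
) -> Option<(&'a str, Duration)> {
    let (value, deadline) = entries.get(key)?;
    // `None` when the deadline is already in the past.
    let remaining = deadline.checked_duration_since(now)?;
    if remaining.is_zero() {
        // Treat an entry that expires exactly now as gone.
        return None;
    }
    Some((value.as_str(), remaining))
}
```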
Hey, I did a quick pass as promised. A general comment here is that the proposed change semantically is more like a "global VRL state" vs a "VRL cache". @fuchsnj also made the following point:
So back to the original problem, I would expect a caching solution to be hidden from the users i.e. no new VRL functions. For example, imagine that the following line |
I think it can be both. From my view, the "cache" part comes from (I believe) the concept that objects in the store have a timer that can be applied, and there is a clearing process that occurs separately from any event processing that will delete objects in the store whose timers have gone below zero. True, objects with an exceptionally long TTL, longer than the expected runtime of the system (or with a special high value that is treated as infinite), would be treated as permanent items, and so become "state" instead of "cache."
I'm not understanding quite how this would work. Somehow you'd have to indicate that the "vrl_function_foo" action would look at the cache, instead of doing whatever that function does in the first place. This would mean either some sort of tagging (?) to turn on or off a "look at the cache for every function from here on" method, or would mean universally applying caching to all functions, which would perhaps be useful but possibly exceptionally wasteful in memory space for things that were not desired to be cached. Imagine where "vrl_function(1,'a')" where "a" was a random string of 50 characters that may or may not ever appear again, and the event stream is 100k per second, and the result of the function was 900 bytes, and those 900 bytes change every 2 hours. This is essentially what we're trying to solve. The proposed method using new functions would allow very specific values to be inserted into or looked up in the cache, allowing for very focused scope of memory use, and also giving more granular control over including/not including certain things on a per-event basis. Other than using new VRL functions, the only other way I could see this working with maximum transparency would be to use specific "magic" object namespace prefixes to indicate cached data, but that is not very clean though I suppose I haven't thought about it enough.
I'm not sure how a timed cache function could ever be expected to be reliably idempotent, since at some point the TTL will expire, the cached value will be removed and/or updated to a different result, and the result may differ between iterations of examination. The user would need to understand this and make accommodations for no data appearing in the cache, so that the heavy or slow function would then be called and the result (hopefully) stored in the cache for subsequent references over the next TTL window. The original intention of this "cache" is that items which are relatively slow to access, and which may have different values across some window of time, can be accessed faster from a memory store that is more lightweight than the function that generates them. There is an implicit understanding in such timed cache models that the value stored in memory is "almost as good" as the computationally costly or slow function which is used to insert the object in the cache, but that over time that value diminishes until a threshold where the item is expunged or refreshed. Our use case requires TTL, because we will sometimes see items inserted in the cache which will only be accessed a few times over a few minutes, and then never again. If we cannot have those items removed automatically after a time period (regardless of how many times they are used or not used) then this is effectively a catastrophic memory leak. They also need to be refreshed on occasion, as the data loses accuracy over time. |
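The usage pattern described here (check the cache, fall back to the slow function on a miss or an expired entry, then store the result for the next TTL window) might look roughly like the following. `get_or_compute` and the `slow_lookup` closure are hypothetical stand-ins, for example for a DNS resolution; nothing here is taken from the PR's code.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Return the cached value if it is still fresh; otherwise call the
/// slow function and cache its result for one TTL window.
fn get_or_compute<F>(
    cache: &mut HashMap<String, (String, Instant)>,
    key: &str,
    ttl: Duration,
    now: Instant,
    slow_lookup: F,
) -> String
where
    F: FnOnce(&str) -> String,
{
    if let Some((value, deadline)) = cache.get(key) {
        if now < *deadline {
            // Fresh hit: skip the expensive call entirely.
            return value.clone();
        }
    }
    // Miss or expired entry: pay for the slow call once, then cache it.
    let value = slow_lookup(key);
    cache.insert(key.to_string(), (value.clone(), now + ttl));
    value
}
```

Because the deadline is fixed at insertion and not extended on reads, an entry that is never touched again still expires, which is the memory-leak concern raised above.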
Maybe we can think of a better name for it, I agree that VRL cache might be misleading.
But in this case, we are directly accessing the storage in Rust code, which is behind In general I think these 2 are addressing different problems. I think proper VRL caching as you described it is a bigger undertaking, because we would probably need to think about when that caching makes sense, since many of the VRL functions are very fast and would probably take a hit from cache lookups instead of getting a speedup, so we would either have to selectively apply it to some functions, or provide a way to configure it (although then that wouldn't be hidden from the users and would probably be confusing to configure). Does adding some kind of global state to VRL (optional, it would have no effect unless user specifically calls these functions) make sense for you? Does it break any of the promises VRL makes? There is still a lot of work to be done for this PR, so I would just like to know in advance. If something like this is not an option for VRL, we can think about other solutions for slow functions, something a bit more hidden from the user (but I think it would always have to provide at least some configuration options, to ensure users can control the size of the cache). |
If we want to pursue this idea of global state, we might benefit from an RFC. If there are other ideas that do not require such a big change we can probably avoid the RFC. In both cases, some perf stats will make a more compelling case. |
Ideally we would build on enrichment tables which is currently used as external state for VRL. Looping in cc @lukesteensen for future discussions. |
This is a very interesting feature, but I'm wondering about the use-cases it is solving to make sure it is the best solution for those use-cases since it diverges from one of VRL's design principles that calls should be non-blocking. Could you describe the use-cases you have for this feature @esensar @johnhtodd ? I think that would help us identify if adding these functions to VRL is the correct approach or if a separate transform or enrichment table would be better suited. |
Yes, as @pront mentioned, I believe that extending enrichment tables is the better path here. At a high-level, there are some important characteristics of VRL that we want to maintain:
Introducing shared mutable state between VRL invocations would complicate these quite a bit. Instead, I think it would be better to separate writes from reads by putting them in different components. This dramatically simplifies the data flow and makes it clearer that state is being shared. One way to do this would be to introduce a new component that is basically both a sink and an enrichment table. It would look similar to the |
Alright, that makes sense to me. The separate sink into an enrichment table could be feasible for this. That would mean a new kind of enrichment table, which would be stored in memory instead of files. I will try that and see if something like that would work instead of this. Thanks for taking the time to review this. |
Thinking about this a bit more, is this solution preferred just due to simplified data flow? I think we still have the same issue about sharing mutable state, it is just no longer from different VRL invocations, but different components. If I understood that solution correctly, something like this would be implemented:
Now, when it comes to writing to that table, some kind of a lock would still have to be utilized (or maybe there would be a way to do it lock-free, but I guess that would have some other limitations). Does this sound right @lukesteensen ? |
@esensar Yes, it sounds like you have the right idea. One small point of clarification (which you may already know) is that it would only be one new component, a new enrichment table type, which would behave in some ways like a sink (i.e. it would accept input events from other components), but wouldn't actually involve creating a new type of sink. It would involve some work in our topology code to support hooking up this enrichment table to other components, likely mirroring the logic for sinks. And yes, you're right that it is still fundamentally shared state. The difference is that now we have exactly one writer (the new enrichment table component) and all VRL components are purely readers, which maintains the property that VRL scripts can be run many times, in any order, concurrently, etc, without changing the results (i.e. they behave more like pure functions without side effects). This will make it easier to avoid unexpected behavior and high levels of contention on the shared state, and implement/maintain optimizations around how we schedule VRL scripts to be run. One potentially useful library for implementing this would be evmap, but I'm sure there are others as well. Constraining ourselves to a single writer makes the design compatible with some of these data structures that have desirable properties. |
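The single-writer, many-readers split can be illustrated with two handle types over one shared map. This is a standard-library stand-in using `Arc<RwLock<..>>`, not evmap and not Vector's actual enrichment table code; `TableReader`, `TableWriter`, and `new_table` are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

type Shared = Arc<RwLock<HashMap<String, String>>>;

/// Read-only handle: the kind of access a VRL lookup would get.
/// Cloneable, so every reader can hold its own copy.
#[derive(Clone)]
struct TableReader(Shared);

/// The single write handle. It is deliberately not `Clone`, so only
/// one component can ever mutate the table.
struct TableWriter(Shared);

fn new_table() -> (TableReader, TableWriter) {
    let shared: Shared = Arc::new(RwLock::new(HashMap::new()));
    (TableReader(Arc::clone(&shared)), TableWriter(shared))
}

impl TableReader {
    /// Readers never mutate, so lookups stay free of side effects.
    fn get(&self, key: &str) -> Option<String> {
        self.0.read().unwrap().get(key).cloned()
    }
}

impl TableWriter {
    fn insert(&mut self, key: &str, value: &str) {
        self.0
            .write()
            .unwrap()
            .insert(key.to_string(), value.to_string());
    }
}
```

The type system enforces the property discussed above: code that only holds a `TableReader` has no way to write, so running it many times, in any order, cannot change the table.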
Thank you. Alright, I was initially going to add a new sink, but that approach makes sense. Thanks for the |
This implementation is based on `evmap`, for "lock-free" reading and writing. There is still a lock when data is refreshed, but the timing of refreshes can be controlled so that there are fewer interruptions.
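The trade-off around refreshes can be sketched with the standard library alone: writes are queued privately and only become visible to readers when the writer publishes them, so the shared lock is taken once per batch instead of once per write. This only imitates the idea of deferred visibility; it is not evmap's API, and `BatchedWriter` is a hypothetical name.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Writer that queues changes and publishes them in batches.
struct BatchedWriter {
    visible: Arc<RwLock<HashMap<String, String>>>,
    pending: Vec<(String, String)>,
}

impl BatchedWriter {
    fn new() -> Self {
        Self {
            visible: Arc::new(RwLock::new(HashMap::new())),
            pending: Vec::new(),
        }
    }

    /// Handle that readers use to see the published state.
    fn reader(&self) -> Arc<RwLock<HashMap<String, String>>> {
        Arc::clone(&self.visible)
    }

    /// Queue a write; readers are not blocked and do not see it yet.
    fn insert(&mut self, key: &str, value: &str) {
        self.pending.push((key.to_string(), value.to_string()));
    }

    /// Publish all queued writes under one short write lock.
    fn refresh(&mut self) {
        if self.pending.is_empty() {
            return;
        }
        let mut map = self.visible.write().unwrap();
        map.extend(self.pending.drain(..));
    }
}
```

Calling `refresh` less often means fewer interruptions for readers, at the cost of readers seeing new data later.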
@pront Maybe there is a better way to handle this, but it feels like I need to define some of the sink related stuff on |
@pront @lukesteensen There are a couple of things I don't like about this:
More work needs to be done for this too:
Hi @esensar, I just wanted to assure you that this is still on our radar to review. |
Alright, thanks. Sorry for all the mentions, I got stuck at one point, but now I have managed to hook everything back up, but I am not too happy about the solution. |
Sorry again for the delayed review! It takes me an unfortunately long time to page in all of the relevant context when I try to come back to this 😅
I think what you have here looks pretty reasonable! I've left some comments about things that could potentially be simplified, or ways to adjust how we currently lay things out that may make it a bit simpler of a change, but I don't see any particular design decisions that would be a blocker.
src/config/graph.rs
Outdated
@@ -37,24 +37,28 @@ impl Graph {
sources: &IndexMap<ComponentKey, SourceOuter>,
transforms: &IndexMap<ComponentKey, TransformOuter<String>>,
sinks: &IndexMap<ComponentKey, SinkOuter<String>>,
enrichment_tables: &IndexMap<ComponentKey, EnrichmentTableOuter<String>>,
If you wanted to simplify somewhat, I think that these tables could be added to the graph as sinks. There's not really any functional difference between them in this context.
Yeah, that would probably make things much easier to handle, similar to how it was done in the topology builder. I will make that update.
}
}

pub fn as_sink(&self) -> Option<SinkOuter<T>> {
I think this is a reasonable way to do it the way things are currently designed, but it definitely shows the limitations of how we do things with `SinkOuter`, etc.
If I were to re-approach it, I think I would limit `SinkOuter` to config deserialization (and drop the generic to just `String`), and map to some more granular things to build the topology (e.g. something for inputs, something for healthchecks, etc). That way we could try to unify where we handle "things that have inputs", etc.
You certainly don't need to do any of that, but it could be something to explore if you'd like to make this feel cleaner.
Yeah, I'm not sure how often components like this one would come up, but it definitely makes sense for components that don't really fit just one type.
I can think about it, but I think that would be too big an undertaking for this PR.
src/topology/running.rs
Outdated
@@ -444,6 +444,7 @@ impl RunningTopology {
.filter(|&(existing_sink, _)| existing_sink)
.map(|(_, key)| key.clone());

// TODO: also remove this for enrichment tables that act as sinks
I think we would be better off disallowing buffers in front of enrichment sinks, at least to start with. They shouldn't be needed and we can avoid this particular bit of complexity.
Right now, that would require different branches for sinks and enrichment tables, since sinks always have buffers (if I understand the code correctly). Currently I have hardcoded it to always use the default buffer (which is a single memory buffer).
That might not really be that hard to do and maybe it would be a smarter thing to do, since some of the things that a sink has just don't make sense for these sinks built out of the table, but I think that brings us back to the above point, of changing `SinkOuter` usage into some smaller components.
Hi @esensar, give us a shout whenever you want a review for this. |
I think this is now ready for review. Size limit is implemented and all required topology changes (I think). The topology changes got complicated, but I think it could be improved with @lukesteensen 's suggestions. That is a bit much for this PR now I think. I think that change to make configuration a bit more modular might be interesting for further development of this component. We found use-cases for having this component also act as a source in some cases (dumping all data from cache periodically), which introduces even more complications. I have a POC of it on a different branch: https://github.com/esensar/vector/tree/feat/vrl-cache-as-source - not really important for this PR, just wanted to mention it since it is related to configuration changes. |
Sounds good, thank you @esensar. Let's give @lukesteensen some time for a review. I can take a look as well if there are no updates on this for a while. Feel free to iterate on this if needed. |
I have just realized that VRL Do you have a suggestion for some other way of storing it, that would not have as much overhead as VRL Another thing that I noticed while observing this component using |