-
-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
VRT Pixel Functions: Add function to evaluate arbitrary expression #11209
base: master
Are you sure you want to change the base?
Conversation
That's super cool, and so compact as an integration! Any hints on the performance ? In particular it would be cool if exprtk could use SIMD to vectorize computations, but I'm probably asking too much. Regarding performance, I'd suspect the biggest issue to be the GDALCopyWords() done for each pixel. Creating a temporary array with all the double values and copying in one-go to the output data type would certainly be beneficial (this remark applies to existing pixel functions). Can exptrtk can be used with non-double variables ? (in case that would make things faster for integer only computations) Thinking out loud: for ultimate performance, we should probabl use some LLVM-based solution where the expression would be just-in-time compiled to the target architecture, a bit like numba does. |
frmts/vrt/pixelfunctions.cpp
Outdated
#define exprtk_disable_rtl_io_file | ||
#define exprtk_disable_rtl_vecops | ||
#define exprtk_disable_string_capabilities | ||
#include <exprtk.hpp> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the following to avoid namespace collisions in case GDAL is integrated with another software also using exprtk?
#include <exprtk.hpp> | |
#define exprtk gdal_exprtk | |
#include <exprtk.hpp> |
Haven't looked into this yet. In particular, comparing to Python would be interesting.
Maybe they could be rewritten as templates...
The docs note "Support for various numeric types (float, double, long double, MPFR/GMP)"
I also came across mathpresso that does something along these lines. |
@rouault A limitation of the pixel function integration is that we can only output a single band. exprtk would happily accept an expression like On performance, I've done basic tests like summing up a 24-band netCDF. Performance is roughly equivalent to Python with
and numpy:
|
It accepts only a single (potentially multiband) input dataset. That said the Input can be a VRTDataset, so you can assemble several single-band datasets as a virtual multiple one. Hard to tell which way is the preferred one. Would having it available in both VRTDerivedRasterBand and VRTProcessedDataset be overkill ? Having it available in VRTDerivedRasterBand makes it probably easier/intutive to write formulas that are different per output-band, even if exprtk allows to return a tuple. |
Not necessarily. I'll look into
On the other hand, this requires repeating all of the |
that's true |
e8b10a9
to
98ceafa
Compare
f001051
to
9698635
Compare
), | ||
pytest.param( | ||
""" | ||
var chunk[10]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's interesting... from a security point of view.
- can that trigger arbitrary memory allocation ?
- does exprtk do bound checking when accessing the arrays ?
I suspect you could also cause infinite /very long looping ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rouault - Yes ExprTk has runtime checks. One of which is explicitly for vector index access.
An example can be found here: https://www.partow.net/programming/exprtk/index.html#simple_example_20
More detailed documentation regarding the checks can be found in Section 24 of the readme: https://www.partow.net/programming/exprtk/readme.html#line_4696
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can that trigger arbitrary memory allocation
Array sizes can be up to 2 billion, so you can cause a large allocation with:
var out[2000000000];
return [out];
does exprtk do bound checking when accessing the arrays ?
Not by default, but enabled in fd94c9e
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Array sizes can be up to 2 billion, so you can cause a large allocation with:
could we improve exprtk to offer a way to limit the allocations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we improve exprtk to offer a way to limit the allocations?
@rouault: Yes the maximum vector size that can be created in the expression can be set explicitly:
https://www.partow.net/programming/exprtk/readme.html#line_1473
parser_t parser;
parser.settings().set_max_local_vector_size(12345);
), | ||
pytest.param( | ||
""" | ||
var chunk[10]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rouault - Yes ExprTk has runtime checks. One of which is explicitly for vector index access.
An example can be found here: https://www.partow.net/programming/exprtk/index.html#simple_example_20
More detailed documentation regarding the checks can be found in Section 24 of the readme: https://www.partow.net/programming/exprtk/readme.html#line_4696
bcebd99
to
4bdc19d
Compare
The ASAN failures are unrelated to this PR and are avoided in master per fbaa99c. You can rebase on top of that. Perhaps we could limit the amount of memory to some default value, let's say 10 MB, and if above that, fail with an error like "you need to set the GDAL_VRT_MAX_MEMORY_EXPR configuration option". |
4bdc19d
to
cfd9754
Compare
57f8062
to
8e06984
Compare
1e8aa46
to
36c0665
Compare
you'll need to rebase your PR on top of latest master and run |
36c0665
to
0141c04
Compare
For the msys2-mingw64 "'file too big" error, see https://github.com/qgis/QGIS/blob/ac7a7fc78eadae3682c57387d2863128f6e65901/src/app/CMakeLists.txt#L493 |
the time spent in GDALCopyWords seems quite large. I'm wondering if it is expected or if there's some inefficiency we could solve. Do you have a link to the dataset you use here so I can have a look? |
After investigation, we are probably close to optimal, or at least, I can't spot any low-hanging fruit. |
actually I now realize I had already implemented a cache-friendly array transposition function for the multidim code. I've generalized it and used it in VRTProcessedDataset and it gives a nice perf boost: #11572 |
e9ae7d7
to
0ec86a7
Compare
I think this is getting close. A few issues:
|
Also +1 not to vendor it, and rely on CMake FindXXXX() logic (cf cmake/helpers/CheckDependentLibraries.cmake) to detect an external exprtk library
Isn't it some issue related to declspec(dllexport) / declspec(dllimport) being inappropriately used in the context of a vendored library
as it is about non-public API, you can just wrap the offending files between |
+1 to requiring user-provided exprtk. It is available on conda-forge for our CI, and it looks like it is starting to be provided by some of the other packaging systems. If we vendor it, we will get complaints about it. We lose either way 😄 |
@dbaston I've pushed to your branch a commit (cherry-picked from master) that should hopefully fix the mysterious unrelated crash under the ASAN CI config (no idea why it just popped up on your pull request!) |
6e9f144
to
34f8c05
Compare
Sorry, I didn't notice this message and apparently clobbered your commit with a force-push. I'll rebase from master again to pick it up. Is the commit history in this PR important to you? I think the simplest way to remove the vendored exprtk from the history is to squash to a single commit. (Or maybe two commits, with the patch to the vendored muparser kept separate) |
Squashing in a single commit is probably going to be the simplest solution . I tend to prefer atomic history in general like "Add function foo()", "use function foo in driver A", "use function foo in driver B", but only if each individual commit builds properly and doesn't break a |
34f8c05
to
1157bc5
Compare
We'll need to document the variables related to EXPRTK and MUPARSER dependencies in https://gdal.org/en/latest/development/building_from_source.html#cmake-package-dependent-options |
1157bc5
to
18a3c5e
Compare
dbaston#2 : CI: add muparser to conda windows build |
dbaston#3: ExprPixelFunc(): fix memory leak in error code path |
a924f0d
to
07adb1c
Compare
07adb1c
to
d548b4d
Compare
The PR LGTM. Does @hobu want to give it some additional review ? |
What does this PR do?
Adds a C++ pixel function called "expression" that can evaluate an arbitrary expression (or indeed, a mini-program) using the exprtk library. This is a single MIT-licensed header that is easily integrated into GDAL. It appears to be quite widely used. However, given that a major purpose of this is to extend the utility of VRT to users that don't want to allow functions defined in Python (possibly because of security concerns), I'm a little uncomfortable with the scope of expressions allowed by this library. Some of these, such as file I/O, can be disabled (and I've done so.)
I've also looked at:
I need to do more work to investigate details such as nodata handling, but wanted to share the concept sooner rather than later.
An alternative implementation would expose
exprtk
as an additional pixel function language alongside Python, instead of hooking into the existing C++ pixel function machinery.An example VRT that is allowed is: