
Test performance of io.cube.test_query module #403

Open
fjetter opened this issue Feb 5, 2021 · 4 comments
@fjetter
Collaborator

fjetter commented Feb 5, 2021

The performance of the io.cube.test_query module is concerning: it takes up the majority of the runtime of the entire test suite. While it covers a cornerstone of the cube functionality, the test runtimes are still unreasonably large.

Due to heavy test parametrization we're generating about 1.5k tests, with individual test runtimes of up to several seconds.
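For illustration, the blow-up comes from stacked parametrization; a schematic sketch (the parameter values are partly made up, not the exact fixtures):

```python
import pytest

# Schematic only: three stacked parametrize decorators already yield
# 3 x 3 x 3 = 27 instances of this single test; with the real numbers of
# drivers, cubes, and conditions this multiplies out to roughly 1.5k.
@pytest.mark.parametrize("driver", ["eager", "dask_bag_bs1", "dask_dataframe"])
@pytest.mark.parametrize("cube", ["fullrange", "sparse", "massive_partitions"])
@pytest.mark.parametrize("condition", ["v1 == 7", "i1 > 0", "p < 2"])
def test_query(driver, cube, condition):
    ...  # each combination becomes its own test case
```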

Running the tests for only the eager backend takes about 3min, and the runtime is distributed very asymmetrically:

  • test_hypothesis ~50% // ~1:30min (up to 15s each; multiplicity 12)
  • test_partition_by ~20% // 42s (up to 8s each)
  • test_condition ~10% // 23s (~0.5s each; multiplicity driver x cubes x …)

I haven't performed the same benchmarks for the dask backends, but I would suspect them to perform worse due to additional overhead.

I believe we're struggling with a few problems here:

Too much local parametrization

I believe we are parametrizing too much. For instance, test_partition_by is parametrized using seven single columns and two multi-column combinations. While I understand that we have a lot of different cases (indexed, not indexed, partitioned, indexed but not in seed, etc.), I believe testing all of these cases is not necessary and we can reduce this. If this thoroughness is required, we should probably push it to a lower level (see next point).

Testing on the wrong level

These tests are effectively end-to-end tests, yet they test very granular edge cases. I believe most of them could be expressed on a different level (e.g. asserting the query intention but not the resulting data); see the sketch below.
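To make the level difference concrete, here is a purely hypothetical sketch (QueryIntention and plan_query don't exist in kartothek today; they stand in for whatever intermediate representation we would expose):

```python
from dataclasses import dataclass

# Hypothetical IR-level test: assert the compiled query intention
# instead of executing the query and comparing dataframes.
@dataclass(frozen=True)
class QueryIntention:  # invented stand-in for a query IR
    conditions: tuple
    partition_by: tuple

def plan_query(conditions, partition_by):
    # Stand-in for a real planner that would normalize/validate the query.
    return QueryIntention(tuple(conditions), tuple(partition_by))

def test_partition_by_intention():
    intention = plan_query(conditions=[("v1", "==", 7)], partition_by=["p"])
    # Structural assertions only: no store access, no data comparison.
    assert intention.partition_by == ("p",)
    assert ("v1", "==", 7) in intention.conditions
```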

Too many cubes

We are testing over many "different" cubes in this test suite, named e.g. fullrange, sparse, or massive_partitions, where the value of some of the cases is not entirely clear.
For instance, there are sparse and sparse_opt, where the difference is some integer types in a few fields. Is a full parametrization necessary for this, or is it sufficient to read this data once to ensure types are cast properly? Etc.

What I'd like to discuss here is what we can remove so that build times become more comfortable while we still feel confident in the test suite.

My proposal would be:

  • remove driver dask_bag_bs1
  • remove hypothesis test entirely
  • remove some parameters from test_partition_by
  • Remove some of the test cubes, in particular sparse_outer_opt, massive_partitions, other_part, updated
  • Remove some conditions from test_conditions, e.g. such that we have one per field

@crepererum I guess you have been involved in writing these tests and I would appreciate your opinion. I know we discussed the idea of rewriting tests on the query-intention level, but this is not easy and would require a lot of effort; therefore, I'm looking for a fix with reasonable effort.

I can also prepare a PR if that helps with the discussion, but the difficult bit is obviously not deleting the code but deciding what is necessary/feasible to keep.

@marco-neumann-by
Contributor

Are the long test times really an issue or just a slight annoyance? Also, I think we normally have the opposite problem: kartothek is under- rather than over-tested.

remove hypothesis test entirely

Actually, that test found a solid number of edge cases in the past, so I'm not a fan of removing it. In general I would rather remove other tests than the property-based testing.

I cannot judge the other tests just off the top of my head; see the next point.

I guess you have been involved in writing these tests and I would appreciate your opinion. I know we discussed the idea of rewriting tests on the query-intention level, but this is not easy and would require a lot of effort; therefore, I'm looking for a fix with reasonable effort.

What effort is "reasonable" here kind of depends on the point of view:

  • removing tests: requires us to judge how much functional coverage we lose by it. Just deleting tests that seem to be overkill isn't (IMHO) a good option. So the time to make this judgement carefully should be incorporated into the task estimate.
  • level change: sure, it would require some substantial test refactoring, and it at least partly opens up the same question (e.g. the tests for sparse_opt don't do anything meaningful on the IR level).

@fjetter
Collaborator Author

fjetter commented Feb 17, 2021

Are the long test times really an issue or just a slight annoyance?

Well, I believe this question can only have subjective answers, but I consider it an actual problem as soon as it encourages "I'll push to CI instead of running tests myself" behaviour, which I believe is where we are. Timing the tests on my machine, the entire suite without the query tests takes about 3:30min, while the query tests alone take another 7:30min. This is an unreasonable split imho.

Actually, that test found a solid number of edge cases in the past, so I'm not a fan of removing it. In general I would rather remove other tests than the property-based testing.

I get that point, but testing is always a trade-off to a certain degree. I believe the hypothesis test itself is a nice thing to have, but I'm not convinced about its cost/value ratio (13-14% of overall runtime).
We could consider one of these "slow" pytest flags so that it is only executed on CI and/or only on master, etc.
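The usual pattern for such a flag is a custom marker plus a conftest.py hook; a minimal sketch (marker and option names follow the common pytest convention, nothing kartothek-specific):

```python
# conftest.py
import pytest

def pytest_addoption(parser):
    parser.addoption("--runslow", action="store_true", default=False,
                     help="run tests marked as slow")

def pytest_configure(config):
    config.addinivalue_line("markers", "slow: marks tests as slow")

def pytest_collection_modifyitems(config, items):
    if config.getoption("--runslow"):
        return  # CI passes --runslow; nothing to skip
    skip_slow = pytest.mark.skip(reason="need --runslow option to run")
    for item in items:
        if "slow" in item.keywords:
            item.add_marker(skip_slow)
```

Marking test_hypothesis with @pytest.mark.slow would then keep it out of local runs by default.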

removing tests: requires us to judge how much functional coverage we lose by it. Just deleting tests that seem to be overkill isn't (IMHO) a good option.

Totally agreed. My recommendation above was not random but driven by my estimation of the tests' value. Unfortunately the tests are not documented, so it is somewhat a matter of guessing what the overall idea behind them was. I pinged you hoping that you could provide some insight into the overall test setup, e.g. which edge cases things like massive_partitions should cover, etc.
Here is a bit more background on my thinking behind each point:


remove driver dask_bag_bs1

I believe we should trust dask to do the right thing w.r.t. blocksizes. Having one test with a non-trivial blocksize should be enough to ensure we're not baking in too many assumptions about this.
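To illustrate what the blocksize actually controls (a sketch, not the actual driver wiring):

```python
import dask.bag as db

# partition_size=1 puts every item into its own partition, which is what
# the dask_bag_bs1 driver exercises; any value > 1 already covers the
# "multiple items per partition" code path.
rows = list(range(6))
bs1 = db.from_sequence(rows, partition_size=1)  # 6 partitions
bs3 = db.from_sequence(rows, partition_size=3)  # 2 partitions
assert bs1.npartitions == 6
assert bs3.npartitions == 2
```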

remove hypothesis test entirely

TBD / slow flag / see above
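If we keep it behind such a flag, hypothesis's settings profiles could additionally shrink the example budget for local runs; a sketch (profile names and budgets are illustrative):

```python
# conftest.py
import os
from hypothesis import settings

# Run the full example budget on CI, a small smoke budget locally.
settings.register_profile("ci", max_examples=200)
settings.register_profile("dev", max_examples=10)
settings.load_profile(os.getenv("HYPOTHESIS_PROFILE", "dev"))
```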

remove some parameters from test_partition_by

This tests against multiple indexed columns (why are there multiple ones? what's the value of having i1, ..., i3?) and multiple payload columns (same question). My proposal would be to restrict this to a minimal subset.

Remove some of the test cubes, in particular sparse_outer_opt, massive_partitions, other_part, updated

  • updated should not have any impact on the query. Whether an update does what it is supposed to do should be tested in dedicated update tests, not in fully parametrized query tests
  • other_part, massive_partitions: I don't really understand what they are supposed to be doing
  • sparse_outer_opt: seems to be a duplicate, just with different type widths for payload columns. Again, not something I would consider worthy of a fully parametrized test

Remove some conditions from test_conditions, e.g. such that we have one per field

This is a surprisingly expensive test, and the conditions in there cover all fields, often multiple times with different operators and edge values. Either these are magic numbers or they are redundant; either way we should document them or remove them.


So the time to make this judgement carefully should be incorporated into the task estimate.

I'm not looking for a "task estimate"; I'm looking for some judgement from your/our side on what is sensible and what is not. That is what this ticket is about. We might also do only some of these things, or simply add documentation about what's there. The bottom line is that 70% of the test runtime is large enough that all maintainers should have a reasonable understanding of what we are testing and why, otherwise this is a nightmare to maintain.

@marco-neumann-by
Contributor

We could consider one of these "slow" pytest flags so that it is only executed on CI and/or only on master, etc.

That sounds like a good thing to do for the property-based testing 👍

I believe we should trust dask to do the right thing w.r.t. blocksizes. Having one test with a non-trivial blocksize should be enough to ensure we're not baking in too many assumptions about this.

remove driver dask_bag_bs1

I think we can do that for the query tests. There is, however, one test that checks the graph and ensures that the linear fusing works correctly. Since we had severe issues with this in the past, I would like to keep at least that specific test.
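For reference, a sketch of how such a graph-level check can be expressed without the full query parametrization (the bag pipeline here is illustrative, not the real query graph):

```python
import dask.bag as db
from dask.optimization import fuse

# Two chained maps per partition form a linear chain that should fuse.
bag = (db.from_sequence(range(8), partition_size=4)
       .map(lambda x: x + 1)
       .map(lambda x: x * 2))
graph = dict(bag.__dask_graph__())
fused, _ = fuse(graph, keys=bag.__dask_keys__())
# If linear fusion works, the fused graph has fewer tasks than the raw one.
assert len(fused) < len(graph)
```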


This tests against multiple indexed columns (why are there multiple ones? what's the value of having i1, ..., i3?) and multiple payload columns (same question). My proposal would be to restrict this to a minimal subset.

Because the index columns belong to different datasets (the seed, a fully covered additional one, a sparse additional one).


updated should not have any impact on the query. Whether an update does what it is supposed to do should be tested in dedicated update tests, not in fully parametrized query tests

Not anymore since we've removed the time-traveling, so we can remove that.


other_part, ...: I don't really understand what they are supposed to be doing

IIRC that tests non-seed datasets that are partitioned differently than the seed.


massive_partitions: I don't really understand what they are supposed to be doing

"the performance is at least somewhat reasonable", should be an ASV benchmark.


sparse_outer_opt

Tests that dtypes are handled correctly (e.g. integer widths are not blown up if not required). Could be a simpler test.
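For instance, something along these lines instead of a fully parametrized extra cube (the roundtrip helper is invented):

```python
import pandas as pd

def test_integer_widths_survive_roundtrip():
    df = pd.DataFrame({
        "small": pd.Series([1, 2], dtype="int8"),
        "wide": pd.Series([3, 4], dtype="int64"),
    })
    result = roundtrip_through_cube(df)  # invented helper: build + query once
    # The point of sparse_outer_opt reduced to one cheap assertion:
    assert dict(result.dtypes) == dict(df.dtypes)
```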


Remove some conditions from test_conditions, e.g. such that we have one per field

This is a surprisingly expensive test, and the conditions in there cover all fields, often multiple times with different operators and edge values. Either these are magic numbers or they are redundant; either way we should document them or remove them.

Well, it's an easy test because you know for sure that you cover a large set of cases ;) Aka: CI time is cheaper than a prod crash.


... otherwise this is a nightmare to maintain.

I think that's a bit exaggerated, no?

@fjetter
Collaborator Author

fjetter commented Feb 18, 2021

Thanks for your input. I believe the above comments cover exactly what I was looking for. I will (eventually) open a PR with more docs and/or some simplifications. From there we can iterate.

I think that's a bit exaggerated, no?

The nightmare I am talking about is not the runtime but rather the "I don't know what this test is doing, therefore I dare not touch it" feeling. That's always a bad place to be in (yes, maybe "nightmare" is a slight exaggeration :) ).
