Improve readers by parallelizing I/O and compute operations #5401

Open

ypatia wants to merge 53 commits into main from yt/sc-59605/dont_block_io

Conversation

@ypatia (Member) commented on Dec 6, 2024

Today, when a reader issues an I/O request to VFS, we block waiting for all I/O to finish before moving on to unfiltering. We then block again waiting for unfiltering to be done for all tiles before continuing to process the results.

This PR removes the need to wait for all I/O to be done. It uses async tasks to signal when a tile is done reading, so that it can proceed to unfiltering, and when a tile is done unfiltering, so that it can proceed to result processing and copying into the user buffers.


TYPE: IMPROVEMENT
DESC: Improve readers by parallelizing I/O and compute operations
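A minimal sketch of the intended pipeline, using std::shared_future from the standard library instead of TileDB's ThreadPool::SharedTask; TileData, read_tile, and unfilter_tile are placeholder names, not the reader's actual API:

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Placeholder types/functions for illustration only; the real reader goes
// through VFS, ThreadPool::SharedTask and the result-tile classes.
struct TileData {};
TileData read_tile(std::size_t) { return {}; }           // I/O
TileData unfilter_tile(const TileData& t) { return t; }  // compute

int main() {
  constexpr std::size_t num_tiles = 8;
  std::vector<std::shared_future<TileData>> unfiltered;

  for (std::size_t i = 0; i < num_tiles; ++i) {
    // Each tile gets its own I/O task; its unfiltering task waits only on
    // that tile's read, not on all reads.
    std::shared_future<TileData> io =
        std::async(std::launch::async, read_tile, i).share();
    unfiltered.push_back(
        std::async(std::launch::async, [io] { return unfilter_tile(io.get()); })
            .share());
  }

  // Result processing blocks per tile, exactly when that tile's data is
  // needed, instead of waiting for all I/O and then all unfiltering up front.
  for (auto& f : unfiltered) {
    const TileData& tile = f.get();
    (void)tile;  // ... process and copy into user buffers ...
  }
  return 0;
}
```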

ypatia and others added 30 commits November 29, 2024 15:28
This removes the reader's wait on all I/O operations and instead moves the I/O task to be owned by the datablock itself. If the I/O threadpool task is valid, we block on data access. This lets I/O and compute be interleaved by only blocking on data when it is ready to be processed, and allows for better background data loading.
This allows the task/future to be copied and enables multiple threads to check the status of the task in a thread-safe manner.
…checking.

While the ThreadPool::SharedTask is designed to be used by multiple threads, it is designed to be copied; the data structure itself is not thread safe.

A recursive mutex is needed because some functions like load_chunk_data call back into filtered_data() and would deadlock. This could also be handled by releasing the lock in load_chunk_data(), but a recursive_mutex is used for better safety against deadlocks.
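A minimal sketch of the re-entrancy pattern described in the commit message above, with a simplified stand-in for the real Tile class and its methods:

```cpp
#include <mutex>
#include <vector>

// Simplified stand-in for the real Tile class; only the locking pattern is
// illustrated here.
class Tile {
 public:
  // Accessor takes the (recursive) lock before handing out the buffer.
  std::vector<char>& filtered_data() {
    std::lock_guard<std::recursive_mutex> lg(mtx_);
    return filtered_data_;
  }

  // load_chunk_data() holds the lock and calls back into filtered_data().
  // With a plain std::mutex the nested lock() on the same thread would
  // deadlock; std::recursive_mutex allows the re-entrant acquisition.
  void load_chunk_data() {
    std::lock_guard<std::recursive_mutex> lg(mtx_);
    auto& data = filtered_data();  // re-entrant lock, safe here
    (void)data;                    // ... split the filtered buffer into chunks ...
  }

 private:
  std::recursive_mutex mtx_;
  std::vector<char> filtered_data_;
};
```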
This is needed because we need to access the data buffer from inside the unfiltering task, to unfilter into it. We can't block on unfiltering being done from inside the unfiltering task itself, so we need different accessors that let us bypass the check on whether the unfiltering task has completed.
This is needed because zip_coordinates is called from the unfilter task
itself.
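A sketch of that accessor split, with assumed names (data()/data_unchecked()) rather than the actual TileDB accessors, and std::shared_future standing in for ThreadPool::SharedTask:

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Illustration only: the real code lives in the Tile/ResultTile classes.
class UnfilteredBuffer {
 public:
  // Normal accessor: outside consumers block until unfiltering is done.
  std::vector<std::byte>& data() {
    if (unfilter_task_.valid())
      unfilter_task_.wait();
    return data_;
  }

  // For use from inside the unfiltering task itself (e.g. to write the
  // unfiltered bytes, or by zip_coordinates): skips the wait, since the
  // task cannot wait on its own completion.
  std::vector<std::byte>& data_unchecked() {
    return data_;
  }

  void set_unfilter_task(std::shared_future<void> t) {
    unfilter_task_ = std::move(t);
  }

 private:
  std::shared_future<void> unfilter_task_;
  std::vector<std::byte> data_;
};
```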
Base automatically changed from yt/sc-59606/threadpool_with_tasks to dev December 9, 2024 08:36
@ypatia ypatia marked this pull request as draft December 9, 2024 10:06
@ypatia ypatia force-pushed the yt/sc-59605/dont_block_io branch from 73a2245 to 88c0ecb on December 18, 2024 09:53
@ypatia ypatia force-pushed the yt/sc-59605/dont_block_io branch from b1d8be7 to d391375 on December 18, 2024 15:30
@ypatia ypatia force-pushed the yt/sc-59605/dont_block_io branch from ed9b334 to 446700a on December 19, 2024 13:11
@ypatia ypatia changed the title WIP: Improve readers by parallelizing I/O and compute operations Improve readers by parallelizing I/O and compute operations Dec 20, 2024
@ypatia ypatia marked this pull request as ready for review December 20, 2024 07:59
@ypatia ypatia requested review from Shelnutt2 and rroelke December 20, 2024 08:04
@rroelke (Contributor) left a comment
I've left a couple comments/questions, but overall this looks good to me!

Review threads on tiledb/sm/query/readers/filtered_data.h, tiledb/sm/query/readers/reader_base.h, and tiledb/sm/query/readers/result_tile.h (resolved).
@@ -653,7 +653,7 @@ TEST_CASE_METHOD(

// specific relationship for failure not known, but these values
// will result in failure with data being written.
total_budget_ = "15000";
total_budget_ = "40000";
Contributor commented:
This is a pretty big bump. It's just due to the changes in the various sizeofs right? Not much that can be done about it I suppose. Can you quantify the change in memory usage?

@ypatia (Member, Author) replied:
I am not expecting this to affect overall memory usage, as the memory budget we set constrains, in theory, the peak usage at any time. However, it is a 3x increase in the size of the result tile, so it very much affects the number of result tiles we can load into memory at the same time given that constrained budget, and therefore the number of iterations we need to load everything from disk, and therefore performance :(

I need to understand better why this bump is so big, come back with a good explanation, and see if we can reduce it.

@ypatia (Member, Author) commented on Jan 14, 2025:
First of all, I reverted my changes in a couple of tests in this file that are marked [.regression] and therefore not run, as they were meant to fail even before my changes. This particular one was one of them.

In general, though, I still needed to adapt the budgets in a couple of tests. My analysis showed that the new member variables we needed to add to the TileBase, Tile, and ResultTile classes as part of this work made ResultTile grow to almost double its size. In particular, sizeof(GlobalOrderResultTile<BitmapType>), a class derived from ResultTile, grew from 1576 to 3216 bytes. A big part of this increase was due to the recursive_mutex we added to the Tile class (64 bytes on my Mac); ResultTile stores vectors of TileTuples of Tiles, so that easily adds up.
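A standalone illustration of the effect (the field layout below is made up, not the real classes) that can be compiled to see how a single recursive_mutex member inflates a frequently-instantiated type:

```cpp
#include <cstddef>
#include <cstdio>
#include <mutex>

// Hypothetical before/after layouts; the real Tile has many more members.
struct TileWithoutMutex {
  void* filtered_data;
  std::size_t size;
};

struct TileWithMutex {
  void* filtered_data;
  std::size_t size;
  std::recursive_mutex mtx;  // ~64 bytes on the Mac measurement quoted above
};

int main() {
  std::printf(
      "without mutex: %zu bytes, with mutex: %zu bytes\n",
      sizeof(TileWithoutMutex),
      sizeof(TileWithMutex));
  return 0;
}
```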

Another side effect of the members we added is that they might have a different size on different architectures, and this led to an inconsistent number of internal loops in test code that was meant to count them, so I had to disable that test. That test did show something interesting on my Mac, though: loop_num increased from 9 to 20 in that case. This hints that we might end up looping more internally in environments with a restricted memory budget (like the TileDB Server) because of the increase in the size of ResultTile.

I couldn't think of an obvious way to reduce the size of ResultTile in the current design of this PR, so we definitely need to be vigilant about the performance impact of that increase on the different readers and queries.

@rroelke @Shelnutt2 for awareness

// taking the lock on tile_queue_mutex_. This is to avoid a deadlock where a
// lock is held forever while waiting for a result to be available, while
// the next scheduled task is deadlocking on that lock
rc.tile_->wait_all_coords();
Contributor commented:
I feel a bit like this is the wrong place to call this: it will be called once per cell, but as I understand the changes it should be fine to call it just once per tile.
Hence you can add this after // Find a cell in the current result tile, and then also once after GlobalOrderResultCoords rc(&*(rt_it[f]), cell_idx); in the // For all fragments loop of merge_result_cell_slabs.

I don't quite follow you about the deadlock situation - this seems to me like a correctness question. We can't put result tiles into the priority queue until the data is loaded.

@ypatia (Member, Author) replied:

We can't put result tiles into the priority queue until the data is loaded.

Why not? I thought the whole point of this work is to wait for a piece of data to be ready (either loaded or unfiltered) exactly at the point when we need to use it, so that we wait as little as possible and can make progress while reading/unfiltering asynchronously.

I don't quite follow you about the deadlock situation

I am sorry the comment wasn't clear enough. Let me try to explain the deadlock scenario I've hit sporadically in testing, which is due to the current design of our ThreadPool:

  • Overall, this PR introduces 2 kinds of async tasks: I/O tasks and unfiltering tasks. An unfiltering task blocks waiting on the I/O task to be ready, so that it can start unfiltering the result tile that was read. Any processing that follows, such as merge_result_cell_slabs, then waits for the unfiltering task to be ready at the point in time it needs to use the data.
  • Our current, rather naive ThreadPool uses a sort of task-stacking algorithm: when a thread blocks on a task that is not yet ready, it tries to execute another task from the pool of available tasks on the same thread, so that we don't end up in a deadlock where all threads are blocked waiting. When that other task is done, it checks whether the previous task is now ready to continue executing.

Now the deadlock scenario I was experiencing with this mutex is the following:

  • Say we have 3 threads in our pool. Thread 1 is executing the main reader code. Thread 2 is running the async I/O task for ResultTile N. Thread 3 is running the async unfiltering task for ResultTile N, which is blocked waiting for the corresponding I/O task to be ready so that it can get the data. In the meantime, execution on Thread 1 reaches the parallel_for in merge_result_cell_slabs. There, a few async tasks are created that need to acquire the lock on tile_queue_mutex_ and then access the unfiltered data of, among others, ResultTile N.
  • Now it's possible that the async task that acquires the lock and wants to access ResultTile N gets scheduled (stacked) on Thread 3, which is blocked. This is our deadlock: when the I/O task on Thread 2 is done, the unfiltering on Thread 3 should be unblocked, but it cannot resume because it is stacked under the async task that took the mutex, and that task is itself blocked waiting for that unfiltering task to be done :(

IMO this is a design flaw of our ThreadPool, and it's not the only deadlock situation I've had to cope with in this work. I had to make compromises in terms of performance elsewhere as well: for example, we used to unfilter a tile in parallel chunks, but a similar deadlock would arise in that case too, exclusively due to the stacking of tasks in the threadpool, which has unexpected side effects like this one.
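For illustration, a heavily simplified sketch of the task-stacking wait described above; this is not TileDB's actual ThreadPool, and the names and structure are invented:

```cpp
#include <chrono>
#include <deque>
#include <functional>
#include <future>
#include <mutex>

// Toy "stacking" pool: a thread that blocks on a future keeps popping and
// running other queued tasks on its own stack while it waits.
class StackingPool {
 public:
  void enqueue(std::function<void()> task) {
    std::lock_guard<std::mutex> lg(m_);
    queue_.push_back(std::move(task));
  }

  // Wait for `fut`, but run queued work in the meantime. The inner task is
  // stacked on top of the waiting frame: if that inner task takes
  // tile_queue_mutex_ and then waits for the very unfiltering task that is
  // suspended underneath it, the thread can never unwind. That is the
  // deadlock described in the comment above.
  template <class T>
  T wait_or_work(std::shared_future<T> fut) {
    while (fut.wait_for(std::chrono::milliseconds(0)) !=
           std::future_status::ready) {
      std::function<void()> next;
      {
        std::lock_guard<std::mutex> lg(m_);
        if (queue_.empty())
          continue;  // a real pool would sleep on a condition variable here
        next = std::move(queue_.front());
        queue_.pop_front();
      }
      next();  // runs on the same stack as the blocked caller
    }
    return fut.get();
  }

 private:
  std::mutex m_;
  std::deque<std::function<void()>> queue_;
};
```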

Contributor replied:

To destroy any ambiguity in my initial comment - I definitely agree that it is correct to call wait_all_coords somewhere in the vicinity of this function. My comment is really about the granularity. The task/future-based approach does not give us feedback about when the tile is partially filled in: it is either not safely filled in or it is 100% filled in. Hence we only actually need to wait once per tile, not once per cell per tile, and I think the code should reflect that.

We can't put result tiles into the priority queue until the data is loaded.

Why not? I thought the whole point of this work is to wait for a piece of data to be ready (either loaded or unfiltered) exactly at the point when we need to use it so that we are not busy waiting as much as possible and we can make progress while reading/unfiltering asynchronously.

What I should have written instead of "We can't put result tiles into the priority queue until the data is loaded" was "This change strikes me as one for correctness, not one to do with deadlock".

The change in call site for this should be aligned with the goal of this PR. Both forms will wait for the first result tile from each fragment to be loaded before starting anything, and then subsequent tiles in each fragment will be awaited only once we reach them in the queue.

As for the deadlock itself -

it will try to execute another one from the pool of available tasks on the same thread

Oh yeah, I did see that in a stack trace I ran into earlier today. It just calls the function pointer on the same stack, doesn't it.

The scenario you describe looks like it would be pretty commonplace.

Proper coroutines would be great, wouldn't they!

@ypatia (Member, Author) replied:

Ah yes, with coroutines we could suspend and resume tasks so problem solved 😍

@ypatia (Member, Author) left a comment

Thank you very much for the review, it was very useful. I think I addressed most of the comments but I'll have to follow up on one of them after taking a deeper look.


@ypatia ypatia force-pushed the yt/sc-59605/dont_block_io branch from c5899b7 to 260fc22 on January 7, 2025 14:23
@rroelke (Contributor) commented on Jan 7, 2025:

Thanks for the responses, I will approve once we have a little more info on the memory budget change.
