
Rr/sc 60366 sparse global order reader merge #5417

Open · wants to merge 167 commits into base: main
Conversation

@rroelke (Contributor) commented Jan 2, 2025:

The story contains more details, but in brief: this pull request adds an additional mode to the sparse global order reader in which we pre-process the minimum bounding rectangles (MBRs) of all tiles from all fragments to determine a single global order in which all of the tiles must be loaded.

This pre-processing step is implemented using a "parallel merge" algorithm which merges the tile lists of the fragments (within each fragment, the tiles are already arranged in global order).

Parallel Merge

The parallel merge code lives in tiledb/common/algorithm/parallel_merge.h. It is written generically: it merges streams of a copyable type T using any comparator type which can compare T (std::less<T> by default, of course). An explanation of the algorithm is provided within the file.

The top-level function parallel_merge is asynchronous, i.e. it returns a future which can be polled to see how much of the merge has completed so far. This enables callers to begin processing merged data from the head of the eventual output before the tail of the eventual output has finished.
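
Roughly, usage looks like the following sketch. This is illustrative only: the names thread_pool, options, future->await(), and consume are assumptions for exposition, not the actual API in parallel_merge.h.

  // Hypothetical usage sketch only; the real signatures live in
  // tiledb/common/algorithm/parallel_merge.h and may differ.
  std::vector<std::vector<uint64_t>> streams; /* one globally-ordered stream per fragment */
  std::vector<uint64_t> output(total_input_size);

  // Kick off the asynchronous merge on a thread pool.
  auto future = algorithm::parallel_merge(
      &thread_pool, options, streams, output.data());

  // Poll the future: output[0, *bound) is fully merged, so the head of
  // the output can be consumed before the tail of the merge finishes.
  while (std::optional<uint64_t> bound = future->await()) {
    consume(output.data(), *bound);
  }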

Sparse Global Order Reader

We extend the sparse global order reader with a new configuration parameter, sm.query.sparse_global_order.preprocess_tile_merge. If nonzero, the sparse global order reader runs a parallel merge over the fragments to find the unified tile order, and then uses that order to populate result tiles.

  • preprocess_compute_result_tile_order kicks off the parallel merge.
  • create_result_tiles_using_preprocess advances along the global tile order to create result tiles.
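
For example, a client could enable the mode through the standard config API (the parameter name comes from this PR; the value semantics are described below):

  tiledb_error_t* error = nullptr;
  tiledb_config_t* config = nullptr;
  tiledb_config_alloc(&config, &error);

  // Nonzero enables the preprocess mode; the value also sets the
  // minimum amount of merge work per parallel unit (see below).
  tiledb_config_set(
      config,
      "sm.query.sparse_global_order.preprocess_tile_merge",
      "128",
      &error);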

The fields which are used for the old "per fragment result tiles" mode have been encapsulated into their own struct to emphasize that their use does not overlap with this new mode.

create_result_tiles_using_preprocess does not need a per-fragment memory budget; instead it pulls tiles off of the globally ordered tile list until it has saturated the memory budget as much as it can.
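
Conceptually the loop looks like this sketch (the names tile_order, cursor, estimate_tile_size, and create_result_tile are illustrative, not the code in the reader):

  // Hypothetical sketch of create_result_tiles_using_preprocess:
  // walk the unified tile order until the memory budget is saturated.
  while (cursor < tile_order.size()) {
    const uint64_t tile_size = estimate_tile_size(tile_order[cursor]);
    if (memory_used + tile_size > memory_budget && num_created > 0) {
      break;  // budget saturated; resume from `cursor` on the next pass
    }
    create_result_tile(tile_order[cursor]);
    memory_used += tile_size;
    cursor++;
    num_created++;
  }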

Tiles in the unified global order are arranged by their lower bounds; the upper bounds of the tiles in the list may be out of order. To prevent cells from tile A from being emitted out of order with cells from tile B, we augment add_next_cell_to_queue to check the lower bound of the tiles which have not yet populated result tiles.
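
In sketch form (illustrative names; the real check lives in add_next_cell_to_queue and its helpers):

  // Hypothetical sketch: cells may be emitted only while they compare
  // before the lower bound (MBR start) of the first tile in the unified
  // order which has not yet been turned into a result tile.
  GlobalCellCmp cmp(array_schema.domain());
  while (!tile_queue.empty()) {
    const auto& cell = tile_queue.top();
    if (merge_bound.has_value() && !cmp(cell, *merge_bound)) {
      break;  // `cell` could be out of order with an unloaded tile
    }
    emit(cell);
    tile_queue.pop();
  }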

The value of sm.query.sparse_global_order.preprocess_tile_merge configures the minimum amount of work that each parallel unit of the merge will do. This lets us benchmark with different values without re-compiling; we will either want to recommend a value to customers, or choose one and flip this setting to a boolean.

Serialization

The unified global tile order is state which must be communicated back and forth between the client and the REST server. We can either serialize the whole list (16 bytes per tile across all fragments, so e.g. an array with a million tiles would add roughly 16 MB to each message) or re-compute the parallel merge each time we run a submit on the REST server side. The current implementation chooses the latter, assuming that smaller messages are preferable to the additional CPU overhead.

Testing

Testing of all changes is augmented using rapidcheck. With this library, rather than writing individual test data examples, we write properties: generic claims about what the expected output must look like for a given input. The rapidcheck runtime then generates arbitrary inputs to the property to test our claims.

The parallel merge algorithm is tested in unit_parallel_merge.cc and has rapidcheck properties implemented for each step of the algorithm.
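
For flavor, a property in this style looks like the following self-contained example. It uses std::merge as a stand-in for the merge under test; it is not one of the actual properties in unit_parallel_merge.cc.

  #include <rapidcheck.h>

  #include <algorithm>
  #include <iterator>
  #include <vector>

  int main() {
    // Property: merging two sorted streams yields a sorted output that
    // contains every input element. rapidcheck generates arbitrary
    // vectors `a` and `b` and checks the claim on each of them.
    rc::check(
        "merge output is sorted and preserves all elements",
        [](std::vector<int> a, std::vector<int> b) {
          std::sort(a.begin(), a.end());
          std::sort(b.begin(), b.end());
          std::vector<int> out;
          std::merge(
              a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out));
          RC_ASSERT(std::is_sorted(out.begin(), out.end()));
          RC_ASSERT(out.size() == a.size() + b.size());
        });
    return 0;
  }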

The sparse global order reader tests are in unit-sparse-global-order-reader.cc. The gist is that we have a generic function CSparseGlobalOrderFx::run which writes a bunch of fragments and then reads the data back in global order, comparing against an expected result; there's a fair bit of refactoring to support this. For 1D arrays, the tests "Sparse global order reader: fragment skew", "fragment interleave", and "fragment many overlap" set up inputs which are expected to exercise some of the edge cases in the global order reader. And then we add "rapidcheck 1D" and "rapidcheck 2D" tests which generate totally arbitrary 1D and 2D inputs respectively.

Performance Results

I still have more to do here, but things are looking pretty good; I will fill in more details here as I have them. Notes are here.


TYPE: FEATURE
DESC: sparse global order reader: determine global order of result tiles

stdx::reverse_comparator<stdx::or_equal<GlobalCellCmp>> cmp(
    stdx::or_equal<GlobalCellCmp>(array_schema_.domain()));
if (tile_queue.empty()) {
  length = to_process.max_slab_length(global_order_lower_bound, cmp);
@rroelke (Contributor, Author) commented:
This needs some work. The intent of this code is correct (i.e. we must bound the slab here to avoid out-of-order results) but the implementation of that intent may not be.

Running on a real-world 2D array with a highly selective subarray, I observed horrid performance, and I believe this code here is responsible.
In my repro, the lower end of emit_bound is (3352576, 2) but the current coordinate from rc is (3353524, 1484). The tile extents are 2048 in both dimensions, so this coordinate occurs after emit_bound in the global order.

Surprisingly this doesn't lead to an incorrect result (or at least, a result which differs from this code being OFF), but it does cause the accumulation of lots of length-1 result slabs, which is horridly slow.

The next_global_order_tile is {fragment_idx_ = 121, tile_idx_ = 643}; what are the bounds of the previous tile, which should occur earlier in global order?

Tile 642's MBR has a lower bound of (3352576, 40963) and an upper bound of (3354623, 55294).
Tile 643's MBR has a lower bound of (3352576, 2) and an upper bound of (3354905, 59234).

This smells kind of funky, but it does check out if the MBR is the minimum bound on each dimension rather than the value of the minimum coordinate in the tile. Tile 643 straddles two tiles in dimension 0, so going up to 2 for dimension 1 is plausible.

And if that's the case (which I hope to confirm soon), then using the MBR here is not correct; we need a different way to get the safe-to-emit bound.

@rroelke (Contributor, Author) replied:
This is now mostly resolved. I say "mostly" because the changes I pushed in 2efe0fa are actually still not correct, but the code now looks much closer to what I believe to be the correct code.

It is not only allowed, but commonplace for the MBRs of the tiles to be out of order with respect to each other.

I added a lengthy comment to preprocess_compute_result_tile_order which goes into detail about what the merge bound should be instead.

I'm not going to resolve this yet; I want to add another 2D test for the condition which may provoke this. But I think this is definitely review-able now. There's a FIXME comment which explains what the current problem might be.

@ypatia (Member) left a comment:
Review part 2, everything but the changes in the readers.

Resolved review threads:
  • test/performance/tiledb_submit_a_b.cc (2 threads)
  • tiledb/type/range/range.h
  • tiledb/sm/serialization/tiledb-rest.capnp
  • tiledb/common/pmr.h
  • tiledb/common/algorithm/test/main.cc (outdated)
  • tiledb/common/algorithm/test/compile_algorithm_main.cc (outdated)
  • tiledb/common/algorithm/parallel_merge.h (2 threads)
@@ -118,6 +118,8 @@ const std::string Config::SM_MEMORY_BUDGET_VAR = "10737418240"; // 10GB
const std::string Config::SM_QUERY_DENSE_QC_COORDS_MODE = "false";
const std::string Config::SM_QUERY_DENSE_READER = "refactored";
const std::string Config::SM_QUERY_SPARSE_GLOBAL_ORDER_READER = "refactored";
const std::string Config::SM_QUERY_SPARSE_GLOBAL_ORDER_PREPROCESS_TILE_MERGE =
"128";
@ypatia (Member) commented:
Let's add a TODO or a reference to a ticket for the eventual follow-up:

  "we will either want to recommend a value to customers, or choose one and flip this to a boolean."

Resolved review threads:
  • tiledb/sm/query/readers/sparse_index_reader_base.h (outdated)
  • tiledb/sm/query/readers/sparse_global_order_reader.h (3 threads, outdated)
if (tile_.has_value()) {
  return (*tile_)->coord((*tile_)->cell_num() - 1, d);
} else {
  return mbr_.value().coord(d);
@ypatia (Member) commented:
if (mbr_.has_value())?
Same for line 166.

@rroelke (Contributor, Author) replied:
In its current state, one of the two optional fields must be valid; I would prefer to express that using the constructors.

However, this is the one part of this code which I expect needs some more tweaks, so this may look different when I am done with that.

: fragment_idx_(fragment_idx) {
}

unsigned fragment_idx_;
@ypatia (Member) commented:
Some more documentation on the member variables and methods of this class would help.

Comment on lines +217 to +220:

} else if (!preprocess_tile_order_.enabled_) {
  return memory_used_for_coords_total_ != 0;
} else if (preprocess_tile_order_.has_more_tiles()) {
  return false;
@ypatia (Member) commented:
Suggested change:

- } else if (!preprocess_tile_order_.enabled_) {
-   return memory_used_for_coords_total_ != 0;
- } else if (preprocess_tile_order_.has_more_tiles()) {
-   return false;
+ } else if (preprocess_tile_order_.enabled_ && preprocess_tile_order_.has_more_tiles()) {
+   return false;

    ratio_coords_.c_str(),
    &error) == TILEDB_OK);
REQUIRE(error == nullptr);
REQUIRE(memory_.apply(config) == nullptr);

REQUIRE(tiledb_ctx_alloc(config, &ctx_) == TILEDB_OK);
@ypatia (Member) commented Jan 17, 2025:
Since you added a vfs_test_setup_, I'd expect not to allocate a fresh context but to update the vfs_test_setup_.ctx_ instead and set ctx_ to that. Something like:

  vfs_test_setup_.update_config(config.ptr().get());
  ctx_ = vfs_test_setup_.ctx_c;

See for example: https://github.com/TileDB-Inc/TileDB/blob/main/test/src/test-capi-consolidation-plan.cc#L73

@rroelke (Contributor, Author) replied:
Thanks for the link!

9ea370e

@ypatia (Member) left a comment:
I won't claim I understood everything and was able to verify the correctness or completeness of all of the (again excellent) code in this PR. But the thorough testing gives me high confidence, so LGTM once tests pass :)

Resolved review thread: test/src/unit-sparse-global-order-reader.cc (outdated)
* *there are ways to get around this but they are not implemented.
*/
template <InstanceType Instance>
static bool can_complete_in_memory_budget(
@ypatia (Member) commented:
I like the idea, but I think that's too much logic to be the source of truth; we'd need to unit test this one too 😅 Anyway, no objection to keeping it, I was just thinking out loud.

@rroelke (Contributor, Author) replied:
This one does get pretty heavy coverage! Since ::run calls it on both success and failure, it gets some nice if-and-only-if coverage.

The problem with this one, though, is that it makes a strong assumption about what the merge bound is, and as I have since learned, that assumption is only true in one dimension.

And yeah, I'm not intending to figure out how to adapt this for two dimensions, given that...

* Data tile 1 has a MBR of [(1, 1), (5, 4)].
* Data tile 2 has a MBR of [(5, 1), (10, 4)].
*
* The lower bound of data tile 2's MBR is less than the upper bound
@ypatia (Member) commented:
Thank you for the excellent analysis and documentation of each scenario, and especially of the tricky points, here and across this work!
