
Search performance issue with MMapDirectory under load #1151

Open
1 task done
paulirwin opened this issue Mar 23, 2025 · 12 comments · May be fixed by #1152

Comments

@paulirwin
Contributor

paulirwin commented Mar 23, 2025

Is there an existing issue for this?

  • I have searched the existing issues

Describe the bug

I have discovered that MMapDirectory has serious performance problems under parallel load when searching (with no background writes), across all platforms. The demo repo to reproduce this issue is at https://github.com/paulirwin/LuceneAdventureWorks/

This demo builds a simple index from the AdventureWorks2022 database data, and does a load test against it with 100k requests, using Parallel.For to search in parallel as fast as possible.

Of note: MMapDirectory is the default Directory implementation returned by FSDirectory.Open on Windows and Linux 64-bit platforms.

| OS | Architecture | Runtime | SimpleFS | NIOFS | MMap | MMap compared to SimpleFS |
| --- | --- | --- | --- | --- | --- | --- |
| macOS 15 | arm64 | .NET 9.0 | 4.07s | 5.46s | 3m25s | 51x slower |
| Ubuntu 24.04 | x64 | .NET 9.0 | 2.21s | 2.43s | 35.57s | 16x slower |
| Windows 11 | x64 | .NET 9.0 | 4.13s | 11.66s | 2m33s | 37x slower |
| Windows 11 | x64 | .NET Framework 4.8.1 | 7.83s | 2m21s^ | 2m21s | 18x slower |

^ This was an outlier for NIOFS, and yes, it was identical to MMap. Also note that the hardware across these machines is not exactly comparable, but all are 6-8 performance-core machines with either 32GB or 64GB of RAM.

Given that MMapDirectory is the default Directory implementation returned by FSDirectory.Open on Windows and Linux 64-bit (which covers the majority of Windows and Linux use nowadays), we definitely need to resolve this, as MMapDirectory is supposed to be a fast implementation.

Expected Behavior

MMapDirectory performs well on all platforms.

Steps To Reproduce

See demo repo.

Exceptions (if any)

No response

Lucene.NET Version

4.8.0-beta00017

.NET Version

No response

Operating System

No response

Anything else?

No response

@paulirwin
Contributor Author

The results posted in the original issue were with the beta 17 nuget packages. Here is latest master on macOS arm64:

  • SimpleFS: 4.85s
  • NIOFS: 7.11s
  • MMap: 2m23s

@paulirwin
Contributor Author

I have determined that this is largely a concurrent load issue. When setting ParallelOptions.MaxDegreeOfParallelism to 1 (so it runs serially), the numbers are a lot closer (although mmap is still slower).

macOS 15, arm64 (MaxDegreeOfParallelism = 1):

  • SimpleFS: 7.29s
  • NIOFS: 8.88s
  • MMap: 15.27s

@paulirwin
Contributor Author

I created an equivalent Java app using Lucene 4.8.1 on JDK 21.

macOS 15, arm64, parallel:

  • SimpleFS: 5.21s
  • NIOFS: 3.17s
  • MMap: 1.27s

This shows that we definitely have a performance issue with MMapDirectory.

(Aside: the good news is that the Java code had no issue reading this index built with Lucene.NET.)

Here's the serial results:

  • SimpleFS: 4.45s
  • NIOFS: 3.65s
  • MMap: 2.87s

@Shazwazza
Contributor

I wonder if this is the same issue I reported internally on concurrent searching. @kjac was the one that found an issue during load testing in Umbraco. @kjac maybe run your tests using SimpleFS instead of the result from FSDirectory.Open?

@paulirwin
Contributor Author

Another finding: this discrepancy evaporates when just using IndexSearcher.Search and not loading the documents with .Doc(int). MMapDirectory performs about equivalently to SimpleFSDirectory in that case (1.87s vs 1.90s).

paulirwin added a commit to paulirwin/lucene.net that referenced this issue Mar 25, 2025
@paulirwin paulirwin linked a pull request Mar 25, 2025 that will close this issue
@kjac

kjac commented Mar 25, 2025

Hi all 👋

It's not entirely the same thing I'm experiencing... though I do see some performance improvements when using SimpleFSDirectory over MMapDirectory (via FSDirectory.Open).

What I've been discussing with @Shazwazza has to do with query execution time when ramping up the number of concurrent users.

My test bench runs IndexSearcher.Search() and only returns the TotalHits of the TopDocs result. I'm using K6 to test concurrency.

Using MMapDirectory

  • 1 concurrent user = 3.5ms per search
  • 10 concurrent users = 8.0ms per search
  • 50 concurrent users = 36.8ms per search
  • 100 concurrent users = 73.4ms per search

Using SimpleFSDirectory

  • 1 concurrent user = 3.4ms per search
  • 10 concurrent users = 7.0ms per search
  • 50 concurrent users = 29.9ms per search
  • 100 concurrent users = 60.6ms per search

It's not a terribly scientific test setup since it just runs on my laptop. That being said, the numbers do remain relatively consistent across multiple iterations, with SimpleFSDirectory being around 15-18% faster than MMapDirectory.

I'm on Win11 64 bit, and it's a .NET 9.0 runtime.

@NightOwl888
Contributor

@paulirwin

Nice work tracking this down. I know how hard this can be!

First of all, the below is all up for debate and is just my opinion.

To ByteBuffer or Not To ByteBuffer?

I should preface this with my desire to eliminate ByteBuffer from the codebase (especially in ICU4N). Ideally, we would move to modern .NET APIs instead, but in Lucene.NET, it probably means drifting pretty far from the original design because of how they have extended and/or wrapped ByteBuffer in several components and how unlike anything in .NET it is. Fortunately, the usages are localized to the Lucene.Net.Store namespace and some subclasses.

ByteBuffer is almost the same thing as BinaryReader and BinaryWriter around a MemoryStream, except that it also natively supports big endian reads, which is missing from that set of tools. It also allows for quick access to other wrappers, such as CharBuffer, which wraps the same underlying memory to read arrays of char in addition to bytes. However, in practice, many of the buffers in ICU4N could be replaced with Span<T> or Memory<T> or the read-only equivalents depending on the operation. These components cut much closer to the bone and are easier to optimize. CharBuffer in particular implements ICharSequence, which we are working to essentially replace with ReadOnlySpan<char> or ReadOnlyMemory<char> across the API, although the only place we used CharBuffer is in tests.

The ByteBuffer API has some confusing members, such as Flip() (which doesn't flip any array members) and Clear() (which doesn't clear the buffer). These are managing both the position state and a "mark", which is presumably how it determines where to slice from. When porting, trying to grasp what is going on at a high level in addition to the low level in order to port it is a bit confusing. As such, it is hard to envision how to support moving away from this API when upgrading Lucene. It could be done, but we would need a roadmap to follow. Of course, using LLMs to get answers on this is a lot easier than analyzing source and reviewing docs to try to understand how best to port it.
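To make those semantics concrete, here is a minimal sketch of what Flip() and Clear() actually do to the position/limit bookkeeping, using J2N's port of the API (assuming the standard Allocate/Put/Flip/Get/Clear members carried over from the Harmony port):

```csharp
using J2N.IO;

ByteBuffer buffer = ByteBuffer.Allocate(8);
buffer.Put((byte)0x12);
buffer.Put((byte)0x34);   // Position == 2, Limit == 8 ("filling" mode)

// Flip() doesn't reverse any contents: it sets Limit = Position and
// Position = 0, switching the buffer from filling to draining.
buffer.Flip();            // Position == 0, Limit == 2
byte first = buffer.Get();

// Clear() doesn't zero the bytes: it resets Position = 0 and
// Limit = Capacity so the buffer can be filled again. The old
// contents are still present.
buffer.Clear();           // Position == 0, Limit == 8
```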

In addition, J2N is a general purpose library. ICU4N currently depends on MemoryMappedViewByteBuffer and it is potentially broken for that use case, as well. Whether we use it in the future or not, this is an existing J2N public API that is currently not performing well for all consumers, and will need to be addressed.

Combined with the fact that we would all like a production release of Lucene.NET as soon as possible, fixing the MemoryMappedViewByteBuffer seems to be the most expedient option rather than fixing it in 2 or 3 different places and perhaps in 2 or 3 different ways.

One Byte at a Time

The purpose of reading one byte at a time in Java appears to be to eliminate allocating temporary buffers on the heap to convert primitive types such as int, long, and char to and from bytes. However, in .NET we have the option of putting these tiny buffers on the stack so we don't need to allocate any heap to do the conversions.

Span<byte> buffer = stackalloc byte[4];

But it does require there to be APIs to pass that span to without creating another allocation on the heap in order for it to be sensible, and ByteBuffer is currently missing them.

I haven't run any benchmarks, but in System.Buffers.Binary there is a BinaryPrimitives class that can be used to convert Span<byte> or ReadOnlySpan<byte> to and from primitive types. Reviewing the code, it uses MemoryMarshal along with some bit twiddling. Since there are no loops, I suspect this will generally be faster than how it was done in Java. Given that the members that read or write bytes are very unlikely to change in Lucene (other than the vint and vlong types), we can probably utilize these rather than the direct port from Java. Maybe there is a way to borrow some of the techniques from BinaryPrimitives to optimize vint, vlong, and LZ4 as well.

https://github.com/dotnet/runtime/tree/v9.0.3/src/libraries/System.Private.CoreLib/src/System/Buffers/Binary
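As a concrete illustration of the stack-buffer approach: BinaryPrimitives here is the real System.Buffers.Binary API, while the span copy stands in for the hypothetical ByteBuffer.Get(Span<byte>) overload discussed below:

```csharp
using System;
using System.Buffers.Binary;

class Demo
{
    // Reads a big-endian int with zero heap allocation: the 4-byte scratch
    // buffer is stackalloc'd, and BinaryPrimitives does the loop-free,
    // endian-aware conversion.
    static int ReadInt32BigEndian(ReadOnlySpan<byte> source)
    {
        Span<byte> scratch = stackalloc byte[4];
        source.Slice(0, 4).CopyTo(scratch); // stands in for buffer.Get(scratch)
        return BinaryPrimitives.ReadInt32BigEndian(scratch);
    }

    static void Main()
    {
        byte[] data = { 0x00, 0x00, 0x01, 0x02 };
        Console.WriteLine(ReadInt32BigEndian(data)); // prints 258
    }
}
```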

Note, however, that only the integral number conversions are available prior to .NET Core in the BCL, so we will need to patch the others. If we follow the convention used for System.Numerics.BitOperations, we should create a public static class named J2N.Buffers.Binary.BinaryPrimitive (singular, not plural) and put all of the support there. We only need the primitive types that J2N/Java supports (no unsigned types), but it wouldn't hurt to include all of them, and it is probably easier to do because we can just copy, paste, and rename from .NET 9 for the most part. I am not sure whether it makes sense to cascade the calls to the BCL when they are supported, since there doesn't seem to be much point in optimizing this code further. It is unlikely to ever change.

We will need to reuse these in Lucene.NET DataInput and DataOutput, so it makes sense to make them shared and not sensitive to which target framework is being used.

System.Memory Support

J2N.IO.Buffer and J2N.IO.ByteBuffer would need to support Get(Span<byte>) and Put(ReadOnlySpan<byte>), which can be used to load the buffer on the stack for the primitive type conversions. Much like with FileStream.Read(Span<byte> destination), these should not support any offset or length overloads because spans allow you to slice the memory before passing them into the methods.

Buffering the MemoryMappedViewAccessor means we can support these methods even though MemoryMappedViewAccessor has no span support.

For the ByteBuffer class, we can create default implementations of Get() and Put() that check the HasArray property.

  1. If it supports an array, we can use the ProtectedArray property to get a span of the appropriate length.
  2. If it doesn't support an array, it can throw NotSupportedException.
  3. Subclasses can override these virtual members to provide more optimizations and features. In this case, we will need to track whether reading/writing goes off the end of the buffer and load more data to fill in or read the remainder of the input.
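A sketch of what the default Get(Span<byte>) described above might look like on the ByteBuffer base class. This is a hypothetical overload, not the current J2N API; member names such as HasArray, ProtectedArray, ArrayOffset, and Position follow the existing J2N/Java surface:

```csharp
// Hypothetical addition to J2N.IO.ByteBuffer; illustrative only.
public virtual ByteBuffer Get(Span<byte> destination)
{
    if (destination.Length > Remaining)
        throw new BufferUnderflowException();
    if (!HasArray)
        throw new NotSupportedException(
            "Subclasses without a backing array must override Get(Span<byte>).");

    // ProtectedArray exposes the backing array to subclasses; slice out the
    // live region and copy straight into the caller's (possibly stackalloc'd) span.
    ProtectedArray.AsSpan(ArrayOffset + Position, destination.Length)
                  .CopyTo(destination);
    Position += destination.Length;
    return this;
}
```

Because spans are sliced by the caller, no offset/length overloads are needed, matching the FileStream.Read(Span<byte>) convention mentioned above.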

MemoryMappedViewByteBuffer Buffering Design

  1. We should not use the ProtectedArray property to expose the buffer externally because it assumes the entire buffer is available in the Array/ProtectedArray properties.
  2. ProtectedHasBuffer should return false.
  3. We should review and override members that can be optimized or should be done to account for buffering. Some missing overloads that could be optimized better include:
    a. Equals(ByteBuffer)
    b. CompareTo(ByteBuffer)
  4. The buffering can be modeled after FileStream, since it also supports both reading and writing to an underlying file while supporting a buffer.
  5. Much like FileStream, we can overload Flush() with a boolean parameter for flushToDisk, the default being false. Disposing the MemoryMappedViewByteBuffer should call Flush(true).
  6. Ideally, the size of the buffer should be configurable in the constructor overloads, much like FileStream.
  7. There weren't a lot of tests for these classes in Apache Harmony or the JDK. I ported what I could, but there isn't a lot of coverage. If we can add more, that would be great.
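Points 4-6 might shape up roughly like this. All names here are illustrative, not the current J2N API; the only real BCL call is MemoryMappedViewAccessor.Flush(), which forces OS write-back of the view:

```csharp
using System;
using System.IO.MemoryMappedFiles;

// Illustrative skeleton of the buffered design, modeled on FileStream's
// Flush(bool flushToDisk) convention.
public sealed class BufferedViewByteBuffer : IDisposable
{
    private readonly MemoryMappedViewAccessor accessor;
    private readonly byte[] buffer;   // chunk-sized, configurable like FileStream
    private bool dirty;

    public BufferedViewByteBuffer(MemoryMappedViewAccessor accessor, int bufferSize = 8192)
    {
        this.accessor = accessor;
        this.buffer = new byte[bufferSize];
    }

    public void Flush() => Flush(flushToDisk: false);

    public void Flush(bool flushToDisk)
    {
        if (dirty)
        {
            // write the dirty chunk back to the view (position tracking elided)
            dirty = false;
        }
        if (flushToDisk)
            accessor.Flush(); // real API: forces the OS to flush the view to disk
    }

    public void Dispose() => Flush(flushToDisk: true);
}
```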

There are probably more things to consider and this plan is admittedly a bit half-baked, but this is a good start.

@paulirwin
Contributor Author

@NightOwl888 I think that approach makes sense. It was an alternative I mentioned (although thought through far less than you have here) on my prototype PR #1152. In that PR, I confirmed that if the mmap is buffered, the performance problems go away, and it performs better than the other two directory implementations, like Java. Performance is actually slightly better than Java's (as we would hope, but still great to see!) except in the NIOFS case, which is slower in Lucene.NET for different reasons. My PR still has a good number of failing unit tests that would need to be addressed if we want to take that approach, though. Regardless, we've demonstrated that buffering is the solution, because reading a single byte from a mmap in .NET is painfully slow.

(Aside: my performance measuring was on macOS, which gets NIOFSDirectory from FSDirectory.Open, probably for historical reasons. I think it's clear at this point that once we fix this issue, we should return MMapDirectory on macOS as well.)

I'm good with your plan above, although I am less familiar with the J2N code than you are. If you wouldn't mind taking a first pass at that on the J2N side, we can then compare approaches and see how they fare in performance testing. IMO, if BufferedIndexInput comes out ahead, we should still consider that, although there might likely be concurrency/scale issues (including concurrent writes) with that approach I haven't considered yet that would need to be considered as well. Regardless, it sounds like for methods like ReadInt32, we should do the J2N change anyways.

@paulirwin
Contributor Author

@kjac Thanks for confirming that there is still a performance issue with MMapDirectory in your testing, even though it sounds like that's not the main problem you're experiencing. If you wouldn't mind filing a separate issue about that problem, it would be appreciated. A minimal reproduction would be awesome; failing that, an overview of which Lucene.NET types you're using and how you're using them, plus some profiling to identify the hotspots, would be great. Thanks!

@NightOwl888
Contributor

@paulirwin

Yeah, NIOFSDirectory probably shouldn't exist in .NET because we don't have an NIO namespace in .NET. It exists for compatibility with ByteBuffer in Java, but since that is not something people in .NET (who aren't porting) are going to want to extend, keeping compatibility here doesn't seem required. IMHO, we should (eventually) rename the directory classes to match .NET names (MemoryMappedFileDirectory and FileStreamDirectory) instead of using names that only make sense to people familiar with Java. And yes, we need to experiment to figure out how to set up the defaults in .NET for various operating systems.

Side note: there are also native directory classes that we haven't ported that are tucked away in https://github.com/apache/lucene/tree/releases/lucene-solr/4.8.1/lucene/misc/src/java/org/apache/lucene/store. I am not sure whether they are sensible to port. See #276.

There is a pretty decent overview of how ByteBuffer works here: https://www.baeldung.com/java-bytebuffer. Admittedly, I haven't drilled very far into the details of how the different indices work, but they are mostly handled by the base classes. The classes in J2N are for the most part a line-by-line port from Apache Harmony, with the MemoryMappedViewByteBuffer and its 2 subclasses and extension methods being the oddball extension. The only changes were the guard clauses, although we are missing several members as well because either they didn't exist in Harmony or they didn't make sense in .NET.

If you wouldn't mind taking a first pass at that on the J2N side, we can then compare approaches and see how they fare in performance testing.

I will add the span overloads to the base classes and backport the BinaryPrimitives from .NET. I will do 2 different PRs, since I haven't benchmarked any of the BinaryPrimitives code, although judging from what it uses and the simple fact that we will no longer be reading bytes one at a time, I can't imagine how it could be slower.

I think for the buffered implementation of MemoryMappedViewByteBuffer, a lot of the heavy lifting can be done by swapping the base class from ByteBuffer to HeapByteBuffer, which provides the backing array and much of the plumbing for it. Then it is a matter of working out which members to override to layer in the view accessor so that it reads the next chunk when it reaches the end, and, if there are enough Remaining bytes, simply cascades the call to the base class without any special handling.
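The "refill at the chunk boundary" idea might look something like this in an override. HeapByteBuffer and its members are internal to J2N, so everything here is an assumption for illustration; a real FillFromAccessor would use the real MemoryMappedViewAccessor.ReadArray(long, byte[], int, int) to pull the next chunk of the view into the backing array:

```csharp
// Hypothetical overrides on a HeapByteBuffer-derived class; sketch only.
public override byte Get()
{
    if (!HasRemaining)
        FillFromAccessor(); // reload the backing array with the next chunk, reset Position/Limit
    return base.Get();
}

public override ByteBuffer Get(Span<byte> destination)
{
    if (destination.Length <= Remaining)
        return base.Get(destination); // enough buffered: cascade to the base class

    // otherwise: drain what's buffered, refill, and continue
    throw new NotImplementedException("sketch only");
}
```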

Note: There are also 2 subclasses of HeapByteBuffer named ReadOnlyHeapByteBuffer and ReadWriteHeapByteBuffer that can be used for reference.

@paulirwin
Contributor Author

@NightOwl888 HeapByteBuffer looks like it could work, but unfortunately a lot of those methods are sealed, so I'm not sure on first glance how you'd override them to trigger refilling the buffer...

Another thing to explore: I found this in the dotNext repo, about how to use unsafe code to get Memory/ReadOnlySequence from MemoryMappedFile: https://dotnet.github.io/dotNext/features/io/mmfile.html - that code is MIT-licensed so we could lift it (with license attribution) without having to add a dependency if we wanted.

@NightOwl888
Contributor

NightOwl888 commented Mar 25, 2025

Well, the good news is nearly all of the classes that are part of the buffers API are internal, including HeapByteBuffer. So, if something is sealed it can be unsealed without breaking the public API.

The DotNext code looks interesting. It looks like they are handling Span<byte> with a wrapper around the public API of MemoryMappedViewAccessor:

[Screenshot of the DotNext implementation wrapping MemoryMappedViewAccessor]
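The core of that technique, sketched below. SafeMemoryMappedViewHandle, AcquirePointer, ReleasePointer, and PointerOffset are all real BCL APIs; error handling is kept minimal:

```csharp
using System;
using System.IO.MemoryMappedFiles;

static class ViewSpan
{
    // Copies bytes out of a memory-mapped view via a raw pointer, avoiding
    // MemoryMappedViewAccessor's element-at-a-time Read* methods.
    public static unsafe void CopyTo(MemoryMappedViewAccessor accessor, Span<byte> destination)
    {
        byte* ptr = null;
        accessor.SafeMemoryMappedViewHandle.AcquirePointer(ref ptr);
        try
        {
            // PointerOffset corrects for the allocation-granularity
            // alignment of the underlying view.
            new ReadOnlySpan<byte>(ptr + accessor.PointerOffset, destination.Length)
                .CopyTo(destination);
        }
        finally
        {
            accessor.SafeMemoryMappedViewHandle.ReleasePointer();
        }
    }
}
```

Note this requires compiling with unsafe code enabled, which is one reason a buffered ByteBuffer wrapper may be the more palatable option for J2N's public API.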

Just a thought - it would be a lot simpler if the ByteBuffer size were the entire segment since by default ByteBuffer is fixed length. Then it could read and/or write locally and Flush() would just be to write the entire buffer back to the underlying view/segment.
