-
Notifications
You must be signed in to change notification settings - Fork 646
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Eliminate usage of ByteBuffer in SmartCn #1153
Comments
|
Thank you for volunteering to take on this project. I have assigned it to you.
No. These are test files, they should be added to the
We already have the build setup to auto-generate <ItemGroup>
<InternalsVisibleTo Include="Lucene.Net.Analysis.SmartCn" />
</ItemGroup> That being said, I am not sure we actually need to access any internal APIs and I asked ChatGPT how to do this task (it is recommended to use LLMs for research because documentation for Lucene is scarce) and here is an approach it came up with for creating files with a small amount of test data. https://chatgpt.com/share/67e93675-c90c-8005-98e7-2f62d1491f6b
The test would be slightly different than what is shown.
Let me know if you need any additional assistance. |
Is there an existing issue for this?
Task description
In
BigramDictionary
andWordDictionary
ofLucene.Net.Analysis.SmartCn
, the code that loads from files usesByteBuffer
to readint
values as little endian. It was done this way in Java because the C-derived code uses little endian byte order but the default in Java is to use big endian byte order. However, in .NET, the default byte order is little endian so we can eliminate these extra allocations by wrapping theFileStream
in aBinaryReader
and callingBinaryReader.ReadInt32()
. TheintBuffer
allocation can be eliminated in addition to the allocations caused byByteBuffer.Wrap()
in each loop. But the main benefit will be to decouple us fromByteBuffer
.In addition, in
BigramDictionary
, we can eliminate the calls totmpword.ToCharArray()
by usingReadOnlySpan<char>
.can be changed to:
And we should change the
caarray
parameter toReadOnlySpan<char>
in theHash1()
,Hash2()
,GetItemIndex()
,GetBigramItemIndex()
andGetFrequency()
methods. No other changes to those methods should be required.Tests
Note that this code loads from disk and in the past has only been tested manually. There is some info here:
lucenenet/src/Lucene.Net.Analysis.SmartCn/Hhmm/BigramDictionary.cs
Lines 106 to 163 in a0578d6
that contains info about how to test it.
It would be great if we could automate the tests, but it will require getting enough info about how to create a small dictionary file to load so we don't add multiple MB of data to the test project. A few KB is all that is required to ensure it loads from disk correctly. There would also need to be a way devised to determine that loading was successful, which will require some analysis and exploration.
The text was updated successfully, but these errors were encountered: