
Eliminate usage of ByteBuffer in SmartCn #1153

Open
1 task done
NightOwl888 opened this issue Mar 28, 2025 · 2 comments · May be fixed by #1154
Labels
help-wanted Extra attention is needed is:enhancement New feature or request is:task A chore to be done notes:performance-improvement Contains a performance improvement pri:normal up-for-grabs This issue is open to be worked on by anyone

Comments

@NightOwl888
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Task description

In BigramDictionary and WordDictionary of Lucene.Net.Analysis.SmartCn, the code that loads from files uses ByteBuffer to read int values as little endian. It was done this way in Java because the C-derived code uses little endian byte order but the default in Java is to use big endian byte order. However, in .NET, the default byte order is little endian so we can eliminate these extra allocations by wrapping the FileStream in a BinaryReader and calling BinaryReader.ReadInt32(). The intBuffer allocation can be eliminated in addition to the allocations caused by ByteBuffer.Wrap() in each loop. But the main benefit will be to decouple us from ByteBuffer.
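A minimal sketch of the suggested change, assuming the load loop reads consecutive little-endian ints from the .dct stream (the variable names below are illustrative, not necessarily the actual ones in WordDictionary):

// Before (paraphrased): each iteration fills a byte[] and wraps it in a
// ByteBuffer just to decode one little-endian int, allocating on every pass.
// byte[] intBuffer = new byte[4];
// dctFile.Read(intBuffer, 0, intBuffer.Length);
// int buffLen = ByteBuffer.Wrap(intBuffer).SetOrder(ByteOrder.LittleEndian).GetInt32();

// After: BinaryReader reads multi-byte values as little endian on .NET,
// which matches the file format, so the int can be read directly with no
// per-iteration allocation and no dependency on ByteBuffer.
using FileStream dctFile = new FileStream(dctFilePath, FileMode.Open, FileAccess.Read);
using BinaryReader reader = new BinaryReader(dctFile);
int buffLen = reader.ReadInt32(); // little endian, matching the on-disk format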

In addition, in BigramDictionary, we can eliminate the calls to tmpword.ToCharArray() by using ReadOnlySpan<char>.

char[] carray = tmpword.ToCharArray();

can be changed to:

ReadOnlySpan<char> carray = tmpword.AsSpan();

And we should change the carray parameter to ReadOnlySpan<char> in the Hash1(), Hash2(), GetItemIndex(), GetBigramItemIndex(), and GetFrequency() methods. No other changes to those methods should be required.
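For illustration, the parameter change would look like this (the hash body below is a hypothetical shape only; the real Hash1() body in BigramDictionary stays unchanged, since ReadOnlySpan<char> supports the same indexing and Length access as char[]):

// Only the parameter type moves from char[] to ReadOnlySpan<char>.
private long Hash1(ReadOnlySpan<char> carray)
{
    const long p = 1099511628211L;
    long hash = unchecked((long)0xcbf29ce484222325L);
    for (int i = 0; i < carray.Length; i++)
    {
        hash = (hash ^ carray[i]) * p;
    }
    // ... remainder of the hash computation unchanged ...
    return hash;
}

// Callers can then pass a string without copying:
// long h = Hash1(tmpword.AsSpan());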

Note: the wordItem_charArrayTable in WordDictionary could be changed to use ReadOnlyMemory<char>, but I don't think it is worth the added complexity and will diverge significantly from upstream. The references would have to be maintained for all of the strings to keep them in scope.

Tests

Note that this code loads from disk and in the past has only been tested manually. There is some info here:

// LUCENENET conversion note:
// The data in Lucene is stored in a proprietary binary format (similar to
// .NET's BinarySerializer) that cannot be read back in .NET. Therefore, the
// data was extracted using Java's DataOutputStream using the following Java code.
// It can then be read in using the LoadFromInputStream method below
// (using a DataInputStream instead of a BinaryReader), and saved
// in the correct (BinaryWriter) format by calling the SaveToObj method.
// Alternatively, the data can be loaded from disk using the files
// here (https://issues.apache.org/jira/browse/LUCENE-1629) in the analysis.data.zip file,
// which will automatically produce the .mem files.
//public void saveToOutputStream(java.io.DataOutputStream stream) throws IOException
//{
//    // save wordIndexTable
//    int wiLen = wordIndexTable.length;
//    stream.writeInt(wiLen);
//    for (int i = 0; i < wiLen; i++)
//    {
//        stream.writeShort(wordIndexTable[i]);
//    }
//    // save charIndexTable
//    int ciLen = charIndexTable.length;
//    stream.writeInt(ciLen);
//    for (int i = 0; i < ciLen; i++)
//    {
//        stream.writeChar(charIndexTable[i]);
//    }
//    int caDim1 = wordItem_charArrayTable == null ? -1 : wordItem_charArrayTable.length;
//    stream.writeInt(caDim1);
//    for (int i = 0; i < caDim1; i++)
//    {
//        int caDim2 = wordItem_charArrayTable[i] == null ? -1 : wordItem_charArrayTable[i].length;
//        stream.writeInt(caDim2);
//        for (int j = 0; j < caDim2; j++)
//        {
//            int caDim3 = wordItem_charArrayTable[i][j] == null ? -1 : wordItem_charArrayTable[i][j].length;
//            stream.writeInt(caDim3);
//            for (int k = 0; k < caDim3; k++)
//            {
//                stream.writeChar(wordItem_charArrayTable[i][j][k]);
//            }
//        }
//    }
//    int fDim1 = wordItem_frequencyTable == null ? -1 : wordItem_frequencyTable.length;
//    stream.writeInt(fDim1);
//    for (int i = 0; i < fDim1; i++)
//    {
//        int fDim2 = wordItem_frequencyTable[i] == null ? -1 : wordItem_frequencyTable[i].length;
//        stream.writeInt(fDim2);
//        for (int j = 0; j < fDim2; j++)
//        {
//            stream.writeInt(wordItem_frequencyTable[i][j]);
//        }
//    }
//}

The comment above contains the relevant info about how to test it.

It would be great if we could automate the tests, but it will require getting enough info about how to create a small dictionary file to load so we don't add multiple MB of data to the test project. A few KB is all that is required to ensure it loads from disk correctly. We would also need to devise a way to determine that loading was successful, which will require some analysis and exploration.

Note also that the Kuromoji project loads dictionary data in a similar manner and also doesn't have automated tests.

@NightOwl888 NightOwl888 added help-wanted Extra attention is needed is:enhancement New feature or request is:task A chore to be done notes:performance-improvement Contains a performance improvement pri:normal up-for-grabs This issue is open to be worked on by anyone labels Mar 28, 2025
@NightOwl888 NightOwl888 added this to the 4.8.0-beta00018 milestone Mar 28, 2025
@NehanPathan
Contributor

NehanPathan commented Mar 30, 2025


Hi @NightOwl888,
I have implemented the changes suggested in this issue to optimize the dictionary loading in BigramDictionary and WordDictionary by:

✅ Replacing ByteBuffer with BinaryReader.ReadInt32() to eliminate extra allocations.
✅ Using ReadOnlySpan<char> instead of ToCharArray() in methods like Hash1(), Hash2(), GetItemIndex(), GetBigramItemIndex(), and GetFrequency() for performance improvements.
✅ Refactoring code to minimize memory usage and improve efficiency while loading dictionary data from disk.

I am now working on automating the dictionary loading tests. However, I have a couple of questions:

  1. Should the test files be added directly to the Lucene.Net.Analysis.SmartCn assembly to avoid InternalsVisibleTo complications?
  2. If added to the Lucene.Net.Tests.Analysis.SmartCn project, we have two options:
    • Make internal classes public (which is not ideal).
    • Use [InternalsVisibleTo] with a public key, but this approach requires generating a strong-name key (.snk) to sign the assembly. Since the main assembly is strongly named, adding a friend assembly also requires a public key, and generating this key is usually a maintainer’s responsibility, not a contributor's task.

I’d appreciate your guidance on the best approach to proceed with these tests. Thanks!


@NightOwl888
Contributor Author

@NehanPathan

Thank you for volunteering to take on this project. I have assigned it to you.

Should the test files be added directly to the Lucene.Net.Analysis.SmartCn assembly to avoid InternalsVisibleTo complications?

No. These are test files; they should be added to the Lucene.Net.Tests.Analysis.SmartCn project.

If added to the Lucene.Net.Tests.Analysis.SmartCn project, we have two options:

  • Make internal classes public (which is not ideal).
  • Use [InternalsVisibleTo] with a public key, but this approach requires generating a strong-name key (.snk) to sign the assembly. Since the main assembly is strongly named, adding a friend assembly also requires a public key, and generating this key is usually a maintainer’s responsibility, not a contributor's task.

We already have the build set up to auto-generate InternalsVisibleTo attributes with the public key. To use it, just add an ItemGroup with an InternalsVisibleTo element naming the project you wish to give visibility to.

<ItemGroup>
  <InternalsVisibleTo Include="Lucene.Net.Tests.Analysis.SmartCn" />
</ItemGroup>

That being said, I am not sure we actually need to access any internal APIs and InternalsVisibleTo doesn't affect the visibility of embedded resources (they are always public).

I asked ChatGPT how to do this task (it is recommended to use LLMs for research because documentation for Lucene is scarce) and here is an approach it came up with for creating files with a small amount of test data.

https://chatgpt.com/share/67e93675-c90c-8005-98e7-2f62d1491f6b

Note that the Kuromoji module has a tool to generate the dictionary files. I am not sure why there isn't a similar tool for the SmartCn module, but ChatGPT shows how the file can be generated using a .NET console app.

Also note that the example only shows the coredict.dct file. You will need to ask ChatGPT what the format for the bigramdict.dct file is supposed to look like and ask it to generate a test file that is compatible with the first one.

The test would be slightly different from what is shown.

  1. We embed loose files into the test DLL as embedded resources so nothing other than the DLL is required for testing.
  2. The test class should subclass LuceneTestCase.
  3. Temp directories or individual temp files can be created using the LuceneTestCase.CreateTempDir() or LuceneTestCase.CreateTempFile() method overloads, depending on whether there is one temp file or multiple to group together for the test. The test framework takes care of cleaning up the files at the end of the test or optionally leaving them on disk for debugging.
  4. Once a temp file or directory is created, the contents of the embedded file can be copied to it. The embedded stream can be retrieved relative to a class using the FindAndGetManifestResourceStream() extension method. For that to work, the embedded file should be in the same directory as the test class. Then the output temp file can be opened for writing. The copy can be done using Stream.CopyTo() from the embedded resource stream to the output stream.
  5. The same temp file or folder path plus filename can then be used for the real disk load test.
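The steps above can be sketched roughly as follows. This is an illustrative outline only: the class, resource, and file names are hypothetical, and the final loading call is a placeholder since the exact entry point and assertions still need to be worked out.

public class TestWordDictionaryLoading : LuceneTestCase
{
    [Test]
    public void TestLoadFromDisk()
    {
        // Step 3: let the test framework manage the temp directory lifetime.
        DirectoryInfo tempDir = CreateTempDir("smartcn-dict");

        // Step 4: copy the embedded test dictionary out to a real file on disk.
        // The resource must sit in the same directory as this test class for
        // FindAndGetManifestResourceStream() to resolve it.
        string dctPath = Path.Combine(tempDir.FullName, "coredict.dct");
        using (Stream input = this.GetType().FindAndGetManifestResourceStream("coredict.dct"))
        using (Stream output = new FileStream(dctPath, FileMode.Create, FileAccess.Write))
        {
            input.CopyTo(output);
        }

        // Step 5: point the dictionary loader at the temp path for the real
        // disk load test. Hypothetical call; the actual API and the assertions
        // that prove a successful load are still to be determined.
        // WordDictionary dict = WordDictionary.GetInstance();
        // Assert.IsNotNull(dict);
    }
}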

Let me know if you need any additional assistance.
