Conversation
|
WIP...fixing failures with DenseVectorField tests |
…tSize_shouldThrowException
…to changes in Lucene104ScalarQuantizedVectorField format in Lucene 10.4
|
Put Claude(Opus 4.6) and Codex (GPT 5.4) to work on compilation errors and to understand test failures. Main test failures were around ScalarQuantizedDenseVectorField and BinaryQuantizedDenseVectorField due to breaking changes in Lucene104ScalarQuantizedVectorsFormat. Major changes:
|
|
Additionally, Lucene104ScalarQuantizedVectorsFormat now supports 1,2,4,7 and 8 bits in the format as opposed to only 4 and 7 earlier. Guess that should be scoped under a separate PR with test and documentation changes instead of squashing everything together in this upgrade PR. |
|
@dsmiley @alessandrobenedetti Requesting a review please. |
|
To-Do: Lucene104HnswScalarQuantizedVectorsFormat now defaults to 8 bits (ScalarEncoding.UNSIGNED_BYTE) instead of 7 bits in Lucene99HnswScalarQuantizedVectorsFormat earlier. Make that the new default for ScalarQuantizedDenseVectorField too (ScalarQuantizedDenseVectorField.DEFAULT_BITS)? |
|
Also adding @liangkaiwen since he worked on these quantization types for 10.0. Kevin, could you please take a look at the related changes? |
dsmiley
left a comment
There was a problem hiding this comment.
Remember solr/solr-ref-guide/modules/upgrade-notes/pages/major-changes-in-solr-10.adoc
There was a problem hiding this comment.
Remember the discussion thread about a Lucene upgrade that includes a Codec update, and needing to broadcast this fact in the upgrade notes page? This markdown file should explicitly add this step.
There was a problem hiding this comment.
I don't think I remember a discussion thread regarding this. Quick search on the dev list wasn't helpful either. No mention of codec changes in the current solr/solr-ref-guide/modules/upgrade-notes/pages/major-changes-in-solr-10.adoc either. Do you have a link to the discussion ?
There was a problem hiding this comment.
Here's the message I wrote in response to Jan's thread.
There was a problem hiding this comment.
Ah that one! Added changes in major-changes-in-solr-10.adoc and lucene-upgrade.md to that effect. Thanks.
| // Retain parsing of these ("compress", "confidenceInterval") legacy Solr 10 schema params for | ||
| // compatibility even though | ||
| // Lucene 10.4's scalar-quantized vector format no longer consumes them directly. They are no-op | ||
| // going forward. |
There was a problem hiding this comment.
Then maybe lets mark the params deprecated with a version param of 10.1
There was a problem hiding this comment.
Also, use the DeprecationLog so that the user sees a warning if they provide these params
| return new Lucene102HnswBinaryQuantizedVectorsFormat(getHnswM(), getHnswEfConstruction()); | ||
| return new Lucene104HnswScalarQuantizedVectorsFormat( | ||
| ScalarEncoding.SINGLE_BIT_QUERY_NIBBLE, getHnswM(), getHnswEfConstruction()); |
| useCompression(), | ||
| getConfidenceInterval(), |
There was a problem hiding this comment.
Perhaps reduce visibility of the accessors and/or mark them deprecated.
- public boolean useCompression() {
+ @Deprecated
+ @VisibleForTesting
+ boolean useCompression() {
| + | ||
| Accepted values: `BOOLEAN` | ||
| `NOTE:` Existing Solr 10 schemas may still contain legacy scalar-quantization parameters such as | ||
| `confidenceInterval`, `dynamicConfidenceInterval`, or `compress`. They are accepted for backward |
There was a problem hiding this comment.
| `confidenceInterval`, `dynamicConfidenceInterval`, or `compress`. They are accepted for backward | |
| `confidenceInterval`, `dynamicConfidenceInterval`, or `compress`. They are deprecated (i.e. parsed and ignored but not used) for backward |
…Log to notify the user
|
Thank you for the review @dsmiley and @cpoerschke. I have incorporated all the comments except one by David (pending clarification). |
…-solr-10.adoc in case of codec change
|
|
||
| `NOTE:` Existing Solr 10 schemas may still contain legacy scalar-quantization parameters such as | ||
| `confidenceInterval`, `dynamicConfidenceInterval`, or `compress`. They are accepted for backward | ||
| `confidenceInterval`, `dynamicConfidenceInterval`, or `compress`. They are deprecated (i.e. parsed and ignored but not used) for backward |
There was a problem hiding this comment.
ignored implies not used.
I suggest slightly simpler wording:
They are now deprecated and ignored. New schemas should ...
You might even drop the "New schemas should..." sentence as it seems obvious (to me).
There was a problem hiding this comment.
Oof! I completely botched the language there during a previous refactor. Fixed.
|
|
||
| Solr 10.1 upgrades the underlying Lucene library from 10.3 to 10.4, which introduces a new index codec (`Lucene104`). | ||
| New index segments will be written using the `Lucene104` codec format. | ||
| Existing segments written with older codecs (e.g. `Lucene103`) will continue to be readable. |
There was a problem hiding this comment.
Suggest simpler: Older segments will continue to be readable.
|
|
||
| WARNING: After upgrading to Solr 10.1, downgrading to an earlier Solr 10.0.x version may fail because the older version does not include the `Lucene104` codec needed to read the newly written segments. | ||
| If you require the ability to roll back, back up your indexes before upgrading. | ||
| A full reindex would be needed to downgrade after new segments have been written in Solr 10.1. |
There was a problem hiding this comment.
This sentence reads clearly yet I think it's message is confusing given what was said in the sentences before. For example, if we do backup-restore, why would a full reindex be needed? And you just said there may be errors for Solr 10 reading the indexes so, how would re-indexing (or any indexing or anything at all) work once you've seen such errors? You could drop this sentence entirely and I think we've already communicated what's needed.
There was a problem hiding this comment.
I meant to communicate that IF there is no backup and you need to downgrade, a full reindex from source would be needed. Agree the missing "if no backup" pretext makes the sentence confusing. Will fix.
https://issues.apache.org/jira/browse/SOLR-18143
Description
Upgrade Lucene dependency to 10.4
Solution
Followed instructions in dev-docs/lucene-upgrade.md and resolved compilation/test failures. Also made changes to documentation and upgrade guide wherever applicable.