
@vladak
Last active February 9, 2024 12:58
opengrok issue 4317

more notes for oracle/opengrok#4317

With the code fixed to perform correct Path comparisons in both places (on the set of files from FileCollector as well as in IndexDatabase#processFileIncremental()), script number 3 (oracle/opengrok#4317 (comment)) still leads to cases with duplicate documents. This does not happen every time; it might be timing related.

The script goes through the solaris-userland Git repository using a particular sequence of changesets that was previously known to reproduce the duplicate documents (before the Path comparator fixes): 3b039b16f 170ebb43b 80cc6ae18 52c217090 a609ca278 69a8daead 07e01a4e6 154009177 794af3182 c073248f7 823f2c28e 4a5e3cb85 341f9beb2 653378bce 4f8fe9ee8
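For reference, walking that sequence can be sketched as a small dry-run loop. The checkout/reindex command line is a placeholder (the actual reproduction script is not shown here), so the commands are only printed, not executed:

```python
# Dry run of the reproduction sequence: the checkout/indexer invocation
# below is a placeholder, not the actual script from the issue.
changesets = [
    "3b039b16f", "170ebb43b", "80cc6ae18", "52c217090", "a609ca278",
    "69a8daead", "07e01a4e6", "154009177", "794af3182", "c073248f7",
    "823f2c28e", "4a5e3cb85", "341f9beb2", "653378bce", "4f8fe9ee8",
]
commands = [f"git checkout {cset}  # then reindex the tree"
            for cset in changesets]
for cmd in commands:
    print(cmd)
```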

Changes to the code:

diff --git a/opengrok-indexer/src/main/java/org/opengrok/indexer/index/IndexDatabase.java b/opengrok-indexer/src/main/java/org/opengrok/indexer/index/IndexDatabase.java
index aea4f3e9f..f7cbea525 100644
--- a/opengrok-indexer/src/main/java/org/opengrok/indexer/index/IndexDatabase.java
+++ b/opengrok-indexer/src/main/java/org/opengrok/indexer/index/IndexDatabase.java
@@ -835,19 +835,25 @@ public class IndexDatabase {
         Statistics stat = new Statistics();
         LOGGER.log(Level.FINEST, "traversing the documents in {0} to collect uids of deleted documents",
                 indexDirectory);
+        StoredFields storedFields = reader.storedFields();
         for (int i = 0; i < reader.maxDoc(); i++) {
+            Document doc = storedFields.document(i, LIVE_CHECK_FIELDS);  // use limited-field version
+            IndexableField field = doc.getField(QueryBuilder.U);
             if (!liveDocs.get(i)) {
-                StoredFields storedFields = reader.storedFields();
-                Document doc = storedFields.document(i, LIVE_CHECK_FIELDS);  // use limited-field version
-                IndexableField field = doc.getField(QueryBuilder.U);
                 if (field != null) {
                     if (LOGGER.isLoggable(Level.FINEST)) {
                         String uidString = field.stringValue();
-                        LOGGER.log(Level.FINEST, "adding ''{0}'' at {1} to deleted uid set",
-                                new Object[]{Util.uid2url(uidString), Util.uid2date(uidString)});
+                        LOGGER.log(Level.FINEST, "adding ''{0}'' ({2}) at {1} to deleted uid set",
+                                new Object[]{Util.uid2url(uidString), Util.uid2date(uidString), i});
                     }
                     deletedUids.add(field.stringValue());
                 }
+            } else {
+                if (field != null) {
+                    String uidString = field.stringValue();
+                    LOGGER.log(Level.FINEST, "live doc: ''{0}'' ({2}) at {1}",
+                            new Object[]{Util.uid2url(uidString), Util.uid2date(uidString), i});
+                }
             }
         }
         stat.report(LOGGER, Level.FINEST, String.format("found %s deleted documents in %s",
@@ -1675,17 +1681,27 @@ public class IndexDatabase {
     void processFileIncremental(IndexDownArgs args, File file, String path) throws IOException {
         final boolean fileExists = file.exists();
 
+        LOGGER.log(Level.FINEST, "processing ''{0}''", path);
+
         // Traverse terms until reaching document beyond path of given file.
         while (uidIter != null && uidIter.term() != null
                 && uidIter.term().compareTo(emptyBR) != 0
                 && FILEPATH_COMPARATOR.compare(
                         Path.of(Util.uid2url(uidIter.term().utf8ToString())), Path.of(path)) <= 0) {
 
+            String uidString = uidIter.term().utf8ToString();
+            LOGGER.log(Level.FINEST, "XXX: ''{0}'' at {1}",
+                    new Object[]{Util.uid2url(uidString), Util.uid2date(uidString)});
+
             if (deletedUids.contains(uidIter.term().utf8ToString())) {
                 logIgnoredUid(uidIter.term().utf8ToString());
                 BytesRef next = uidIter.next();
                 if (next == null) {
                     uidIter = null;
+                } else {
+                    uidString = uidIter.term().utf8ToString();
+                    LOGGER.log(Level.FINEST, "XXX next: ''{0}'' at {1}",
+                            new Object[]{Util.uid2url(uidString), Util.uid2date(uidString)});
                 }
                 continue;
             }
@@ -1712,6 +1728,10 @@ public class IndexDatabase {
             BytesRef next = uidIter.next();
             if (next == null) {
                 uidIter = null;
+            } else {
+                uidString = uidIter.term().utf8ToString();
+                LOGGER.log(Level.FINEST, "XXX next: ''{0}'' at {1}",
+                        new Object[]{Util.uid2url(uidString), Util.uid2date(uidString)});
             }
         }
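The first hunk moves the document/uid lookup out of the liveness check so that both deleted and live documents can be logged, while only the deleted uids are collected. The effect can be modeled without Lucene in a few lines (documents are plain dicts here, and the uid string stands in for the QueryBuilder.U field):

```python
# Minimal model of the deleted-uid collection loop from the first hunk.
# The "path|date" uid format is illustrative, not OpenGrok's encoding.
docs = [
    {"uid": "/curl/test/results-64.master|20240131180432473", "live": False},
    {"uid": "/curl/test/results-64.master|20240131180432473", "live": True},
    {"uid": "/curl/Makefile|20240131180432473", "live": True},
]

deleted_uids = set()
for i, doc in enumerate(docs):
    uid = doc.get("uid")          # looked up for every doc, as in the patch
    if not doc["live"]:
        if uid is not None:
            deleted_uids.add(uid)  # corresponds to deletedUids.add(...)
    elif uid is not None:
        pass                       # the patch only logs live docs here
```

With the debug logging in place, a deleted and a live document sharing the same uid (as in the logs below) becomes directly visible.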

I will be tracking changes for /solaris-userland/components/curl/test/results-64.master ("the file" below), which was known to appear in the list of duplicate documents.

Going through the changesets backwards:

index at 4f8fe9ee8

  • the file was changed (the log contains "writing history entries" for the file).
  • there was a deleted document for the file in the index and also a live path
    • FINEST: adding '/solaris-userland/components/curl/test/results-64.master' (5,971) at 20240131180432473 to deleted uid set
    • FINEST: live doc: '/solaris-userland/components/curl/test/results-64.master' (9,531) at 20240131180432473

Due to the identical timestamp, the uid of these 2 documents was the same, so the traversal skipped the deleted document and merely added a new document, leading to the duplicate paths found by the index check (newlines added):

index '/var/tmp/data.opengrok-issue-4317/index/solaris-userland' failed document check: 
duplicate paths = {/solaris-userland/components/curl/test/results-64.master=2, 
/solaris-userland/components/curl/test/results-32.master=2, 
/solaris-userland/components/curl/curl.p5m=2, 
/solaris-userland/components/curl/Makefile=2}
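The failure mode boils down to a tiny model: the uid combines the path with the history timestamp, so a deleted and a live document for the same path with the same timestamp are indistinguishable (the "\0" separator below is an assumption, not OpenGrok's actual uid encoding):

```python
# Toy model of the uid collision behind the duplicate documents.
def uid(path: str, timestamp: str) -> str:
    # path + separator + timestamp; the separator is an assumption
    return path + "\0" + timestamp

deleted_uids = {uid("/curl/test/results-64.master", "20240131180432473")}

# The live document for the same path got the very same timestamp,
# so its term is identical to the deleted one ...
live_term = uid("/curl/test/results-64.master", "20240131180432473")

# ... and the traversal treats it as deleted and skips it; the file is
# then re-added in the second phase, producing a duplicate document.
skipped_as_deleted = live_term in deleted_uids
```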

Looking at the indexer logs with added debug prints in the IndexDatabase#processFileIncremental():

19:07:13 FINEST: processing '/solaris-userland/components/curl/test/results-64.master'
19:07:13 FINEST: XXX: '/solaris-userland/components/curl/test/results-64.master' at 20240131180432473
19:07:13 FINEST: ignoring deleted document for '/solaris-userland/components/curl/test/results-64.master' at 20240131180432473
19:07:13 FINEST: XXX next: '/solaris-userland/components/cvs/Makefile' at 20240131180432473
19:07:13 FINEST: processing '/solaris-userland/components/desktop/firefox/Makefile'
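The skip-and-advance seen in the log can be modeled as follows. Simple string comparison on "path|date" uids stands in for the real BytesRef terms and FILEPATH_COMPARATOR; this is a sketch of the loop's shape, not its exact code:

```python
# Model of the uid-term traversal in processFileIncremental(): walk terms
# in path order, skip any uid found in the deleted set, and stop once
# past the file being processed.
def traverse(terms, deleted_uids, path):
    kept = []
    for term in terms:
        term_path = term.split("|", 1)[0]
        if term_path > path:   # beyond the file's path; stop the loop
            break
        if term in deleted_uids:
            continue           # "ignoring deleted document ..."
        kept.append(term)
    return kept

deleted = {"/curl/test/results-64.master|20240131180432473"}
terms = [
    "/curl/test/results-64.master|20240131180432473",  # live, same uid!
    "/cvs/Makefile|20240131180432473",
]
# The live term is identical to the deleted uid, so nothing survives for
# the file and it looks absent from the index.
result = traverse(terms, deleted, "/curl/test/results-64.master")
```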

Then later on in the log, in the 2nd phase of indexing the file was indeed added:

19:07:14 FINEST: '/solaris-userland/components/curl/test/results-64.master': chosen by suffix: XMLAnalyzerFactory
19:07:14 FINE: Add: '/solaris-userland/components/curl/test/results-64.master' (XMLAnalyzer)

which led to the duplicate document.

In order to see why there was a deleted document and live document with the same uid, it is necessary to look into the logs of the previous indexer run.

index at 653378bce

First of all, the duplicate document check passed for this one.

  • the file had a corresponding deleted document
    • FINEST: adding '/solaris-userland/components/curl/test/results-64.master' (5,971) at 20240131180432473 to deleted uid set
  • and also a live doc:
    • FINEST: live doc: '/solaris-userland/components/curl/test/results-64.master' (9,531) at 20240131180432473

There was no change for the file, so it is necessary to move further backwards.

index at 341f9beb2

ditto

index at 4a5e3cb85

ditto

index at 823f2c28e

ditto

index at c073248f7

ditto

index at 794af3182

ditto

index at 154009177

ditto

index at 07e01a4e6

ditto

index at 69a8daead

ditto

index at a609ca278

ditto

index at 52c217090

ditto

index at 80cc6ae18

  • the file was changed
    • FINEST: writing history entries to '/var/tmp/data.opengrok-issue-4317/historycache/solaris-userland/components/curl/test/results-64.master': [ceaa5b4086fdfbfdaf05a5fdce656329d1fc127a, 1797f715e4ec60309db858a7376bd31ff4831fda]
  • there was a live doc
    • FINEST: live doc: '/solaris-userland/components/curl/test/results-64.master' (5,971) at 20240131180432473

the file was updated:

19:05:29 FINEST: processing '/solaris-userland/components/curl/test/results-64.master'
19:05:29 FINEST: XXX: '/solaris-userland/components/curl/test/results-64.master' at 20240131180432473
19:05:29 FINE: Remove: '/solaris-userland/components/curl/test/results-64.master'
19:05:29 FINER: Removed: '/solaris-userland/components/curl/test/results-64.master'
19:05:29 FINEST: XXX next: '/solaris-userland/components/cvs/Makefile' at 20240131180432473
19:05:29 FINEST: processing '/solaris-userland/components/desktop/firefox/Makefile'

and this is where the deleted document came into existence. The newly added document presumably got the same timestamp; checking the indexer logs of the "previous" (next in time) indexer run confirms this:

$ grep 'curl/test/results-64.master' /var/tmp/opengrok-issue-4317-52c217090.log 
19:05:36 FINEST: adding '/solaris-userland/components/curl/test/results-64.master' (5,971) at 20240131180432473 to deleted uid set
19:05:36 FINEST: live doc: '/solaris-userland/components/curl/test/results-64.master' (9,722) at 20240131180432473
...

$ ls -1tr opengrok-issue-4317-*.log | while read f; do echo "########### $f"; grep 20240131180432473 $f; done | less