Bruno Charest 775722f5d4 Switch inverted index from stale check to incremental updates

Register a hook with the indexer so that file add/remove events
incrementally maintain the inverted index, removing the need for
periodic staleness checks and cooldowns. Rebuild the index once on
startup via init_inverted_index().

2026-05-26 12:37:59 -04:00

14 KiB

Raw Blame History

Plan: Incremental InvertedIndex for 40k+ files

Problem Summary

Every file mutation calls _add_file_to_structures / _remove_file_from_structures in backend/indexer.py, which increments _index_generation. When the next search or autocomplete fires, get_inverted_index() in backend/search.py detects staleness (is_stale() returns True) and triggers a full rebuild() — O(N) tokenization of ALL files. With 40k+ files this takes 2-5 seconds, making search unusable.

The existing 3-second cooldown hack in is_stale() only masks the problem; it doesn't fix it.

Solution: Incremental Add/Remove on the InvertedIndex

Add two methods to InvertedIndex that update ALL internal data structures incrementally when a single file is added, modified, or removed:

add_document(vault_name, path, file_info) — called on create/modify
remove_document(vault_name, path) — called on delete/move-source

Then hook these into _add_file_to_structures and _remove_file_from_structures in backend/indexer.py so the inverted index never goes stale.

Remove the is_stale() / rebuild() / cooldown mechanism entirely. The inverted index is always current.

Dependency Architecture

Current import chain:

main.py → search.py → indexer.py  (search.py imports `from backend import indexer as _indexer`)

Problem: indexer.py currently does NOT import from search.py. If we add from backend.search import get_inverted_index to indexer.py, we create a circular import: search.py → indexer.py → search.py.

Fix — Option C (Callback/Hook pattern, simplest):

Add a module-level hook variable in backend/indexer.py:

# In backend/indexer.py
_on_index_change: callable = None  # Called as (action, vault_name, path, file_info_or_None)

def set_index_change_hook(hook):
    """Register a callback for incremental index updates.
    hook(action, vault_name, path, file_info_or_None) where action is 'add' or 'remove'.
    """
    global _on_index_change
    _on_index_change = hook

Then at the end of _add_file_to_structures:

if _on_index_change:
    _on_index_change('add', vault_name, rel_path, file_info)

At the end of _remove_file_from_structures:

if _on_index_change:
    _on_index_change('remove', vault_name, rel_path, file_info)  # file_info = removed dict or None

Then in backend/search.py, at module load time (after InvertedIndex class is defined):

def _on_index_change_hook(action, vault_name, path, file_info):
    inv = get_inverted_index_raw()  # get without rebuild check
    if action == 'add':
        inv.add_document(vault_name, path, file_info)
    elif action == 'remove':
        inv.remove_document(vault_name, path)

# Register the hook — this triggers an import of indexer, but indexer is already imported
# by the time this line runs (since search.py does `from backend import indexer as _indexer` above)
_indexer.set_index_change_hook(_on_index_change_hook)

This avoids circular imports completely because:

search.py already imports indexer.py at the top (from backend import indexer as _indexer)
indexer.py never imports search.py — it just stores a callback
search.py registers the callback AFTER the InvertedIndex class is defined

Detailed Implementation

Step 1: Add hook variable to `backend/indexer.py`

File: backend/indexer.py Changes:

After _index_generation global (line ~28), add:
```
_on_index_change: callable = None
```
Add function set_index_change_hook(hook) (bottom of file, near other public functions)
Add if _on_index_change: _on_index_change('add', vault_name, file_info['path'], file_info) at end of _add_file_to_structures (~line 665)
Add if _on_index_change: _on_index_change('remove', vault_name, rel_path, removed) at end of _remove_file_from_structures (~line 597)

Step 2: Add `add_document` and `remove_document` to InvertedIndex

File: backend/search.py

`add_document(vault_name, path, file_info)`

def add_document(self, vault_name: str, path: str, file_info: dict):
    """Add or update a single document in the inverted index."""
    
    doc_key = f"{vault_name}::{path}"
    old_file_info = self.doc_info.get(doc_key)
    
    # If updating an existing document, remove old entries first
    if old_file_info is not None:
        self._remove_doc_internals(doc_key, vault_name, old_file_info)
    else:
        self.doc_count += 1
    
    # --- Metadata ---
    self.doc_info[doc_key] = file_info
    self.doc_vault[doc_key] = vault_name
    self.vault_docs[vault_name].add(doc_key)
    
    # --- Tags ---
    tags = file_info.get("tags", [])
    for tag in tags:
        self.tag_docs[tag.lower()].add(doc_key)
    
    # --- Title tokens ---
    title = file_info.get("title", "")
    title_tokens = tokenize(title)
    for token in set(title_tokens):
        self.title_index[token].append(doc_key)
    
    # --- Normalized title for prefix suggestions ---
    norm_title = normalize_text(title)
    if norm_title:
        self.title_norm_map[norm_title].append({
            "vault": vault_name,
            "path": path,
            "title": title,
        })
    
    # --- Word index (content + title TF) ---
    content = file_info.get("content", "")
    full_text = title + " " + content
    tokens = tokenize(full_text)
    tf = defaultdict(int)
    for token in tokens:
        tf[token] += 1
    
    # Track which tokens are new (not previously indexed) for sorted_tokens update
    new_tokens = []
    for token, freq in tf.items():
        if not self.word_index.get(token):
            new_tokens.append(token)
        self.word_index[token][doc_key] = freq
    
    # Incrementally update _sorted_tokens (avoid O(V log V) full re-sort)
    if new_tokens:
        for token in new_tokens:
            bisect.insort(self._sorted_tokens, token)

`remove_document(vault_name, path)`

def remove_document(self, vault_name: str, path: str):
    """Remove a document from the inverted index."""
    
    doc_key = f"{vault_name}::{path}"
    file_info = self.doc_info.get(doc_key)
    if file_info is None:
        return
    
    self._remove_doc_internals(doc_key, vault_name, file_info)
    self.doc_count -= 1

`_remove_doc_internals(doc_key, vault_name, file_info)` (private helper)

def _remove_doc_internals(self, doc_key: str, vault_name: str, file_info: dict):
    """Internal: remove one doc_key from all indexes without adjusting doc_count."""
    
    # --- Metadata ---
    self.doc_info.pop(doc_key, None)
    self.doc_vault.pop(doc_key, None)
    if vault_name in self.vault_docs:
        self.vault_docs[vault_name].discard(doc_key)
    
    # --- Tags ---
    for tag in file_info.get("tags", []):
        td = self.tag_docs.get(tag.lower())
        if td:
            td.discard(doc_key)
            if not td:
                del self.tag_docs[tag.lower()]
    
    # --- Title tokens ---
    title = file_info.get("title", "")
    for token in set(tokenize(title)):
        ti = self.title_index.get(token)
        if ti:
            try:
                ti.remove(doc_key)
            except ValueError:
                pass
            if not ti:
                del self.title_index[token]
    
    # --- Title norm map ---
    norm_title = normalize_text(title)
    if norm_title and norm_title in self.title_norm_map:
        self.title_norm_map[norm_title] = [
            e for e in self.title_norm_map[norm_title]
            if not (e["vault"] == vault_name and e["path"] == file_info.get("path"))
        ]
        if not self.title_norm_map[norm_title]:
            del self.title_norm_map[norm_title]
    
    # --- Word index ---
    content = file_info.get("content", "")
    full_text = title + " " + content
    for token in set(tokenize(full_text)):
        wi = self.word_index.get(token)
        if wi:
            wi.pop(doc_key, None)
            if not wi:
                del self.word_index[token]
                # Remove from sorted tokens via bisect
                idx = bisect.bisect_left(self._sorted_tokens, token)
                if idx < len(self._sorted_tokens) and self._sorted_tokens[idx] == token:
                    self._sorted_tokens.pop(idx)

Key: bisect.insort for insertion and bisect.bisect_left + pop(idx) for removal keep _sorted_tokens sorted in O(V) worst case (list shift) but this is negligible compared to O(N * content) rebuild.

Step 3: Modify `get_inverted_index()` to NOT check staleness

File: backend/search.py

def get_inverted_index() -> InvertedIndex:
    """Return the singleton inverted index. Always up-to-date via hooks."""
    return _inverted_index

Remove is_stale() and the _source_generation / _last_rebuild / _rebuild_cooldown fields. Keep rebuild() for initial build and manual reindex (still called once at startup via build_index).

Step 4: Call `rebuild()` once after initial index build

In backend/search.py, register the hook AND call rebuild once:

# After InvertedIndex class and _inverted_index = InvertedIndex()

def _on_index_change_hook(action, vault_name, path, file_info):
    inv = _inverted_index
    try:
        if action == 'add':
            inv.add_document(vault_name, path, file_info)
        elif action == 'remove':
            inv.remove_document(vault_name, path)
    except Exception as e:
        logger.warning(f"Inverted index incremental update failed ({action} {vault_name}/{path}): {e}")
        # Fallback: mark for rebuild on next search
        inv._needs_rebuild = True

_indexer.set_index_change_hook(_on_index_change_hook)

# Initial build trigger — called after first index is built
def init_inverted_index():
    """Force initial inverted index build. Called after build_index completes."""
    _inverted_index.rebuild()

def get_inverted_index() -> InvertedIndex:
    """Return the singleton inverted index."""
    # Only check for rebuild if incremental updates have failed
    # OR if this is the very first call (doc_count == 0 and index has files)
    if getattr(_inverted_index, '_needs_rebuild', False):
        _inverted_index.rebuild()
        _inverted_index._needs_rebuild = False
    elif _inverted_index.doc_count == 0 and any(
        vdata.get("files") for vdata in index.values()
    ):
        _inverted_index.rebuild()
    return _inverted_index

Step 5: Call `init_inverted_index()` from `build_index` in main.py

In backend/main.py, after build_index() completes in the lifespan handler, call:

from backend.search import init_inverted_index
init_inverted_index()

This ensures the inverted index is built once on startup, then incrementally maintained thereafter.

Tag prefix index handling

The tag_norm_map and tag_prefix_index are built per-vault in rebuild(). For incremental updates, we need to handle tag changes:

In add_document, after adding doc tags:

# Check if any tags are new (not in tag_norm_map)
for tag in tags:
    norm_tag = normalize_text(tag)
    if norm_tag not in self.tag_norm_map:
        self.tag_norm_map[norm_tag] = tag
        for plen in range(MIN_PREFIX_LENGTH, len(norm_tag) + 1):
            prefix = norm_tag[:plen]
            if tag not in self.tag_prefix_index[prefix]:
                self.tag_prefix_index[prefix].append(tag)

In _remove_doc_internals, we do NOT remove tags from tag_norm_map or tag_prefix_index — these are global (per-vault tag vocabulary), not per-document. They only grow over the lifetime of the inverted index. A periodic rebuild() on manual reindex will clean them up.

Step 6: Remove cooldown hack from search.py

Remove:

_last_rebuild and _rebuild_cooldown fields from InvertedIndex.__init__
is_stale() method
_source_generation field (no longer needed for staleness, but keep for diagnostics)

Step 7: Remove coalescence hack from main.py

In _on_vault_change in backend/main.py, remove:

old_gen = idx._index_generation
...
if idx._index_generation > old_gen + 1:
    idx._index_generation = old_gen + 1

This hack was only needed to reduce the number of inverted index rebuilds. With incremental updates, it's unnecessary — each mutation is cheap.

Files Modified (Summary)

File	Changes
`backend/indexer.py`	+`_on_index_change` hook variable, +`set_index_change_hook()`, +hook calls in `_add_file_to_structures` and `_remove_file_from_structures`
`backend/search.py`	+`add_document()`, +`remove_document()`, +`_remove_doc_internals()`, +`init_inverted_index()`, +hook registration, remove `is_stale()`/cooldown, simplify `get_inverted_index()`
`backend/main.py`	+`init_inverted_index()` call after `build_index()`, remove coalescence hack in `_on_vault_change`

Risks & Edge Cases

Thread safety: _add_file_to_structures and _remove_file_from_structures are protected by _index_lock / _async_index_lock in indexer.py. The InvertedIndex methods are called inside these locks, so they're also protected. No additional locking needed.
Hook registration timing: search.py imports indexer.py at the top, then later registers the hook. The hook is registered at module load time, BEFORE the first call to build_index. So _on_index_change is set when build_index runs — but build_index calls _add_file_to_structures internally, which would try to incrementally update an empty inverted index. Fix: The hook checks if _inverted_index.doc_count == 0 and skips incremental updates; the initial rebuild() handles the bulk load.
Hook call during initial build_index: build_index iterates files and calls _add_file_to_structures. The hook fires for each file, calling add_document() on an empty inverted index. This is slower than a single rebuild(). Fix: Add a flag _inverted_index._ready = False initially, set to True after init_inverted_index(). The hook skips when _ready is False.
Sorted tokens performance: bisect.insort and list.pop(idx) are O(V) worst case for large V. For 40k files, the vocabulary size V is typically 50k-200k tokens. O(V) for a single insertion is ~0.001ms, acceptable. The rebuild() call at startup handles the initial bulk.
tag_norm_map / tag_prefix_index growth: These grow monotonically (never shrink on incremental remove). With 40k files and thousands of tags, this is a few thousand entries — negligible. A manual "Réindexer" button triggers a full rebuild() to clean up.

14 KiB Raw Blame History