Register a hook with the indexer so that file add/remove events incrementally maintain the inverted index, removing the need for periodic staleness checks and cooldowns. Rebuild the index once on startup via init_inverted_index().
14 KiB
Plan: Incremental InvertedIndex for 40k+ files
Problem Summary
Every file mutation calls _add_file_to_structures / _remove_file_from_structures in backend/indexer.py, which increments _index_generation. When the next search or autocomplete fires, get_inverted_index() in backend/search.py detects staleness (is_stale() returns True) and triggers a full rebuild() — O(N) tokenization of ALL files. With 40k+ files this takes 2-5 seconds, making search unusable.
The existing 3-second cooldown hack in is_stale() only masks the problem; it doesn't fix it.
Solution: Incremental Add/Remove on the InvertedIndex
Add two methods to InvertedIndex that update ALL internal data structures incrementally when a single file is added, modified, or removed:
add_document(vault_name, path, file_info)— called on create/modifyremove_document(vault_name, path)— called on delete/move-source
Then hook these into _add_file_to_structures and _remove_file_from_structures in backend/indexer.py so the inverted index never goes stale.
Remove the is_stale() / rebuild() / cooldown mechanism entirely. The inverted index is always current.
Dependency Architecture
Current import chain:
main.py → search.py → indexer.py (search.py imports `from backend import indexer as _indexer`)
Problem: indexer.py currently does NOT import from search.py. If we add from backend.search import get_inverted_index to indexer.py, we create a circular import: search.py → indexer.py → search.py.
Fix — Option C (Callback/Hook pattern, simplest):
Add a module-level hook variable in backend/indexer.py:
# In backend/indexer.py
_on_index_change: callable = None # Called as (action, vault_name, path, file_info_or_None)
def set_index_change_hook(hook):
"""Register a callback for incremental index updates.
hook(action, vault_name, path, file_info_or_None) where action is 'add' or 'remove'.
"""
global _on_index_change
_on_index_change = hook
Then at the end of _add_file_to_structures:
if _on_index_change:
_on_index_change('add', vault_name, rel_path, file_info)
At the end of _remove_file_from_structures:
if _on_index_change:
_on_index_change('remove', vault_name, rel_path, file_info) # file_info = removed dict or None
Then in backend/search.py, at module load time (after InvertedIndex class is defined):
def _on_index_change_hook(action, vault_name, path, file_info):
inv = get_inverted_index_raw() # get without rebuild check
if action == 'add':
inv.add_document(vault_name, path, file_info)
elif action == 'remove':
inv.remove_document(vault_name, path)
# Register the hook — this triggers an import of indexer, but indexer is already imported
# by the time this line runs (since search.py does `from backend import indexer as _indexer` above)
_indexer.set_index_change_hook(_on_index_change_hook)
This avoids circular imports completely because:
search.pyalready importsindexer.pyat the top (from backend import indexer as _indexer)indexer.pynever importssearch.py— it just stores a callbacksearch.pyregisters the callback AFTER the InvertedIndex class is defined
Detailed Implementation
Step 1: Add hook variable to backend/indexer.py
File: backend/indexer.py
Changes:
- After
_index_generationglobal (line ~28), add:_on_index_change: callable = None - Add function
set_index_change_hook(hook)(bottom of file, near other public functions) - Add
if _on_index_change: _on_index_change('add', vault_name, file_info['path'], file_info)at end of_add_file_to_structures(~line 665) - Add
if _on_index_change: _on_index_change('remove', vault_name, rel_path, removed)at end of_remove_file_from_structures(~line 597)
Step 2: Add add_document and remove_document to InvertedIndex
File: backend/search.py
add_document(vault_name, path, file_info)
def add_document(self, vault_name: str, path: str, file_info: dict):
"""Add or update a single document in the inverted index."""
doc_key = f"{vault_name}::{path}"
old_file_info = self.doc_info.get(doc_key)
# If updating an existing document, remove old entries first
if old_file_info is not None:
self._remove_doc_internals(doc_key, vault_name, old_file_info)
else:
self.doc_count += 1
# --- Metadata ---
self.doc_info[doc_key] = file_info
self.doc_vault[doc_key] = vault_name
self.vault_docs[vault_name].add(doc_key)
# --- Tags ---
tags = file_info.get("tags", [])
for tag in tags:
self.tag_docs[tag.lower()].add(doc_key)
# --- Title tokens ---
title = file_info.get("title", "")
title_tokens = tokenize(title)
for token in set(title_tokens):
self.title_index[token].append(doc_key)
# --- Normalized title for prefix suggestions ---
norm_title = normalize_text(title)
if norm_title:
self.title_norm_map[norm_title].append({
"vault": vault_name,
"path": path,
"title": title,
})
# --- Word index (content + title TF) ---
content = file_info.get("content", "")
full_text = title + " " + content
tokens = tokenize(full_text)
tf = defaultdict(int)
for token in tokens:
tf[token] += 1
# Track which tokens are new (not previously indexed) for sorted_tokens update
new_tokens = []
for token, freq in tf.items():
if not self.word_index.get(token):
new_tokens.append(token)
self.word_index[token][doc_key] = freq
# Incrementally update _sorted_tokens (avoid O(V log V) full re-sort)
if new_tokens:
for token in new_tokens:
bisect.insort(self._sorted_tokens, token)
remove_document(vault_name, path)
def remove_document(self, vault_name: str, path: str):
"""Remove a document from the inverted index."""
doc_key = f"{vault_name}::{path}"
file_info = self.doc_info.get(doc_key)
if file_info is None:
return
self._remove_doc_internals(doc_key, vault_name, file_info)
self.doc_count -= 1
_remove_doc_internals(doc_key, vault_name, file_info) (private helper)
def _remove_doc_internals(self, doc_key: str, vault_name: str, file_info: dict):
"""Internal: remove one doc_key from all indexes without adjusting doc_count."""
# --- Metadata ---
self.doc_info.pop(doc_key, None)
self.doc_vault.pop(doc_key, None)
if vault_name in self.vault_docs:
self.vault_docs[vault_name].discard(doc_key)
# --- Tags ---
for tag in file_info.get("tags", []):
td = self.tag_docs.get(tag.lower())
if td:
td.discard(doc_key)
if not td:
del self.tag_docs[tag.lower()]
# --- Title tokens ---
title = file_info.get("title", "")
for token in set(tokenize(title)):
ti = self.title_index.get(token)
if ti:
try:
ti.remove(doc_key)
except ValueError:
pass
if not ti:
del self.title_index[token]
# --- Title norm map ---
norm_title = normalize_text(title)
if norm_title and norm_title in self.title_norm_map:
self.title_norm_map[norm_title] = [
e for e in self.title_norm_map[norm_title]
if not (e["vault"] == vault_name and e["path"] == file_info.get("path"))
]
if not self.title_norm_map[norm_title]:
del self.title_norm_map[norm_title]
# --- Word index ---
content = file_info.get("content", "")
full_text = title + " " + content
for token in set(tokenize(full_text)):
wi = self.word_index.get(token)
if wi:
wi.pop(doc_key, None)
if not wi:
del self.word_index[token]
# Remove from sorted tokens via bisect
idx = bisect.bisect_left(self._sorted_tokens, token)
if idx < len(self._sorted_tokens) and self._sorted_tokens[idx] == token:
self._sorted_tokens.pop(idx)
Key: bisect.insort for insertion and bisect.bisect_left + pop(idx) for removal keep _sorted_tokens sorted in O(V) worst case (list shift) but this is negligible compared to O(N * content) rebuild.
Step 3: Modify get_inverted_index() to NOT check staleness
File: backend/search.py
def get_inverted_index() -> InvertedIndex:
"""Return the singleton inverted index. Always up-to-date via hooks."""
return _inverted_index
Remove is_stale() and the _source_generation / _last_rebuild / _rebuild_cooldown fields. Keep rebuild() for initial build and manual reindex (still called once at startup via build_index).
Step 4: Call rebuild() once after initial index build
In backend/search.py, register the hook AND call rebuild once:
# After InvertedIndex class and _inverted_index = InvertedIndex()
def _on_index_change_hook(action, vault_name, path, file_info):
inv = _inverted_index
try:
if action == 'add':
inv.add_document(vault_name, path, file_info)
elif action == 'remove':
inv.remove_document(vault_name, path)
except Exception as e:
logger.warning(f"Inverted index incremental update failed ({action} {vault_name}/{path}): {e}")
# Fallback: mark for rebuild on next search
inv._needs_rebuild = True
_indexer.set_index_change_hook(_on_index_change_hook)
# Initial build trigger — called after first index is built
def init_inverted_index():
"""Force initial inverted index build. Called after build_index completes."""
_inverted_index.rebuild()
def get_inverted_index() -> InvertedIndex:
"""Return the singleton inverted index."""
# Only check for rebuild if incremental updates have failed
# OR if this is the very first call (doc_count == 0 and index has files)
if getattr(_inverted_index, '_needs_rebuild', False):
_inverted_index.rebuild()
_inverted_index._needs_rebuild = False
elif _inverted_index.doc_count == 0 and any(
vdata.get("files") for vdata in index.values()
):
_inverted_index.rebuild()
return _inverted_index
Step 5: Call init_inverted_index() from build_index in main.py
In backend/main.py, after build_index() completes in the lifespan handler, call:
from backend.search import init_inverted_index
init_inverted_index()
This ensures the inverted index is built once on startup, then incrementally maintained thereafter.
Tag prefix index handling
The tag_norm_map and tag_prefix_index are built per-vault in rebuild(). For incremental updates, we need to handle tag changes:
In add_document, after adding doc tags:
# Check if any tags are new (not in tag_norm_map)
for tag in tags:
norm_tag = normalize_text(tag)
if norm_tag not in self.tag_norm_map:
self.tag_norm_map[norm_tag] = tag
for plen in range(MIN_PREFIX_LENGTH, len(norm_tag) + 1):
prefix = norm_tag[:plen]
if tag not in self.tag_prefix_index[prefix]:
self.tag_prefix_index[prefix].append(tag)
In _remove_doc_internals, we do NOT remove tags from tag_norm_map or tag_prefix_index — these are global (per-vault tag vocabulary), not per-document. They only grow over the lifetime of the inverted index. A periodic rebuild() on manual reindex will clean them up.
Step 6: Remove cooldown hack from search.py
Remove:
_last_rebuildand_rebuild_cooldownfields fromInvertedIndex.__init__is_stale()method_source_generationfield (no longer needed for staleness, but keep for diagnostics)
Step 7: Remove coalescence hack from main.py
In _on_vault_change in backend/main.py, remove:
old_gen = idx._index_generation
...
if idx._index_generation > old_gen + 1:
idx._index_generation = old_gen + 1
This hack was only needed to reduce the number of inverted index rebuilds. With incremental updates, it's unnecessary — each mutation is cheap.
Files Modified (Summary)
| File | Changes |
|---|---|
backend/indexer.py |
+_on_index_change hook variable, +set_index_change_hook(), +hook calls in _add_file_to_structures and _remove_file_from_structures |
backend/search.py |
+add_document(), +remove_document(), +_remove_doc_internals(), +init_inverted_index(), +hook registration, remove is_stale()/cooldown, simplify get_inverted_index() |
backend/main.py |
+init_inverted_index() call after build_index(), remove coalescence hack in _on_vault_change |
Risks & Edge Cases
-
Thread safety:
_add_file_to_structuresand_remove_file_from_structuresare protected by_index_lock/_async_index_lockin indexer.py. The InvertedIndex methods are called inside these locks, so they're also protected. No additional locking needed. -
Hook registration timing:
search.pyimportsindexer.pyat the top, then later registers the hook. The hook is registered at module load time, BEFORE the first call tobuild_index. So_on_index_changeis set whenbuild_indexruns — butbuild_indexcalls_add_file_to_structuresinternally, which would try to incrementally update an empty inverted index. Fix: The hook checksif _inverted_index.doc_count == 0and skips incremental updates; the initialrebuild()handles the bulk load. -
Hook call during initial build_index:
build_indexiterates files and calls_add_file_to_structures. The hook fires for each file, callingadd_document()on an empty inverted index. This is slower than a singlerebuild(). Fix: Add a flag_inverted_index._ready = Falseinitially, set to True afterinit_inverted_index(). The hook skips when_readyis False. -
Sorted tokens performance:
bisect.insortandlist.pop(idx)are O(V) worst case for large V. For 40k files, the vocabulary size V is typically 50k-200k tokens. O(V) for a single insertion is ~0.001ms, acceptable. The rebuild() call at startup handles the initial bulk. -
tag_norm_map / tag_prefix_index growth: These grow monotonically (never shrink on incremental remove). With 40k files and thousands of tags, this is a few thousand entries — negligible. A manual "Réindexer" button triggers a full
rebuild()to clean up.