Register a hook with the indexer so that file add/remove events incrementally maintain the inverted index, removing the need for periodic staleness checks and cooldowns. Rebuild the index once on startup via init_inverted_index().
349 lines
14 KiB
Markdown
349 lines
14 KiB
Markdown
# Plan: Incremental InvertedIndex for 40k+ files
|
|
|
|
## Problem Summary
|
|
|
|
Every file mutation calls `_add_file_to_structures` / `_remove_file_from_structures` in `backend/indexer.py`, which increments `_index_generation`. When the next search or autocomplete fires, `get_inverted_index()` in `backend/search.py` detects staleness (`is_stale()` returns True) and triggers a full `rebuild()` — O(N) tokenization of ALL files. With 40k+ files this takes 2-5 seconds, making search unusable.
|
|
|
|
The existing 3-second cooldown hack in `is_stale()` only masks the problem; it doesn't fix it.
|
|
|
|
## Solution: Incremental Add/Remove on the InvertedIndex
|
|
|
|
Add two methods to `InvertedIndex` that update ALL internal data structures incrementally when a single file is added, modified, or removed:
|
|
|
|
- `add_document(vault_name, path, file_info)` — called on create/modify
|
|
- `remove_document(vault_name, path)` — called on delete/move-source
|
|
|
|
Then hook these into `_add_file_to_structures` and `_remove_file_from_structures` in `backend/indexer.py` so the inverted index never goes stale.
|
|
|
|
Remove the `is_stale()` / `rebuild()` / cooldown mechanism entirely. The inverted index is always current.
|
|
|
|
## Dependency Architecture
|
|
|
|
**Current import chain:**
|
|
```
|
|
main.py → search.py → indexer.py (search.py imports `from backend import indexer as _indexer`)
|
|
```
|
|
|
|
**Problem:** `indexer.py` currently does NOT import from `search.py`. If we add `from backend.search import get_inverted_index` to indexer.py, we create a circular import: `search.py → indexer.py → search.py`.
|
|
|
|
**Fix — Option C (Callback/Hook pattern, simplest):**
|
|
|
|
Add a module-level hook variable in `backend/indexer.py`:
|
|
|
|
```python
|
|
# In backend/indexer.py
|
|
_on_index_change: callable = None # Called as (action, vault_name, path, file_info_or_None)
|
|
|
|
def set_index_change_hook(hook):
|
|
"""Register a callback for incremental index updates.
|
|
hook(action, vault_name, path, file_info_or_None) where action is 'add' or 'remove'.
|
|
"""
|
|
global _on_index_change
|
|
_on_index_change = hook
|
|
```
|
|
|
|
Then at the end of `_add_file_to_structures`:
|
|
```python
|
|
if _on_index_change:
|
|
_on_index_change('add', vault_name, rel_path, file_info)
|
|
```
|
|
|
|
At the end of `_remove_file_from_structures`:
|
|
```python
|
|
if _on_index_change:
|
|
_on_index_change('remove', vault_name, rel_path, file_info) # file_info = removed dict or None
|
|
```
|
|
|
|
Then in `backend/search.py`, at module load time (after InvertedIndex class is defined):
|
|
```python
|
|
def _on_index_change_hook(action, vault_name, path, file_info):
|
|
inv = get_inverted_index_raw() # get without rebuild check
|
|
if action == 'add':
|
|
inv.add_document(vault_name, path, file_info)
|
|
elif action == 'remove':
|
|
inv.remove_document(vault_name, path)
|
|
|
|
# Register the hook — this triggers an import of indexer, but indexer is already imported
|
|
# by the time this line runs (since search.py does `from backend import indexer as _indexer` above)
|
|
_indexer.set_index_change_hook(_on_index_change_hook)
|
|
```
|
|
|
|
This avoids circular imports completely because:
|
|
1. `search.py` already imports `indexer.py` at the top (`from backend import indexer as _indexer`)
|
|
2. `indexer.py` never imports `search.py` — it just stores a callback
|
|
3. `search.py` registers the callback AFTER the InvertedIndex class is defined
|
|
|
|
## Detailed Implementation
|
|
|
|
### Step 1: Add hook variable to `backend/indexer.py`
|
|
|
|
File: `backend/indexer.py`
|
|
Changes:
|
|
- After `_index_generation` global (line ~28), add:
|
|
```python
|
|
_on_index_change: callable = None
|
|
```
|
|
- Add function `set_index_change_hook(hook)` (bottom of file, near other public functions)
|
|
- Add `if _on_index_change: _on_index_change('add', vault_name, file_info['path'], file_info)` at end of `_add_file_to_structures` (~line 665)
|
|
- Add `if _on_index_change: _on_index_change('remove', vault_name, rel_path, removed)` at end of `_remove_file_from_structures` (~line 597)
|
|
|
|
### Step 2: Add `add_document` and `remove_document` to InvertedIndex
|
|
|
|
File: `backend/search.py`
|
|
|
|
#### `add_document(vault_name, path, file_info)`
|
|
|
|
```python
|
|
def add_document(self, vault_name: str, path: str, file_info: dict):
|
|
"""Add or update a single document in the inverted index."""
|
|
|
|
doc_key = f"{vault_name}::{path}"
|
|
old_file_info = self.doc_info.get(doc_key)
|
|
|
|
# If updating an existing document, remove old entries first
|
|
if old_file_info is not None:
|
|
self._remove_doc_internals(doc_key, vault_name, old_file_info)
|
|
else:
|
|
self.doc_count += 1
|
|
|
|
# --- Metadata ---
|
|
self.doc_info[doc_key] = file_info
|
|
self.doc_vault[doc_key] = vault_name
|
|
self.vault_docs[vault_name].add(doc_key)
|
|
|
|
# --- Tags ---
|
|
tags = file_info.get("tags", [])
|
|
for tag in tags:
|
|
self.tag_docs[tag.lower()].add(doc_key)
|
|
|
|
# --- Title tokens ---
|
|
title = file_info.get("title", "")
|
|
title_tokens = tokenize(title)
|
|
for token in set(title_tokens):
|
|
self.title_index[token].append(doc_key)
|
|
|
|
# --- Normalized title for prefix suggestions ---
|
|
norm_title = normalize_text(title)
|
|
if norm_title:
|
|
self.title_norm_map[norm_title].append({
|
|
"vault": vault_name,
|
|
"path": path,
|
|
"title": title,
|
|
})
|
|
|
|
# --- Word index (content + title TF) ---
|
|
content = file_info.get("content", "")
|
|
full_text = title + " " + content
|
|
tokens = tokenize(full_text)
|
|
tf = defaultdict(int)
|
|
for token in tokens:
|
|
tf[token] += 1
|
|
|
|
# Track which tokens are new (not previously indexed) for sorted_tokens update
|
|
new_tokens = []
|
|
for token, freq in tf.items():
|
|
if not self.word_index.get(token):
|
|
new_tokens.append(token)
|
|
self.word_index[token][doc_key] = freq
|
|
|
|
# Incrementally update _sorted_tokens (avoid O(V log V) full re-sort)
|
|
if new_tokens:
|
|
for token in new_tokens:
|
|
bisect.insort(self._sorted_tokens, token)
|
|
```
|
|
|
|
#### `remove_document(vault_name, path)`
|
|
|
|
```python
|
|
def remove_document(self, vault_name: str, path: str):
|
|
"""Remove a document from the inverted index."""
|
|
|
|
doc_key = f"{vault_name}::{path}"
|
|
file_info = self.doc_info.get(doc_key)
|
|
if file_info is None:
|
|
return
|
|
|
|
self._remove_doc_internals(doc_key, vault_name, file_info)
|
|
self.doc_count -= 1
|
|
```
|
|
|
|
#### `_remove_doc_internals(doc_key, vault_name, file_info)` (private helper)
|
|
|
|
```python
|
|
def _remove_doc_internals(self, doc_key: str, vault_name: str, file_info: dict):
|
|
"""Internal: remove one doc_key from all indexes without adjusting doc_count."""
|
|
|
|
# --- Metadata ---
|
|
self.doc_info.pop(doc_key, None)
|
|
self.doc_vault.pop(doc_key, None)
|
|
if vault_name in self.vault_docs:
|
|
self.vault_docs[vault_name].discard(doc_key)
|
|
|
|
# --- Tags ---
|
|
for tag in file_info.get("tags", []):
|
|
td = self.tag_docs.get(tag.lower())
|
|
if td:
|
|
td.discard(doc_key)
|
|
if not td:
|
|
del self.tag_docs[tag.lower()]
|
|
|
|
# --- Title tokens ---
|
|
title = file_info.get("title", "")
|
|
for token in set(tokenize(title)):
|
|
ti = self.title_index.get(token)
|
|
if ti:
|
|
try:
|
|
ti.remove(doc_key)
|
|
except ValueError:
|
|
pass
|
|
if not ti:
|
|
del self.title_index[token]
|
|
|
|
# --- Title norm map ---
|
|
norm_title = normalize_text(title)
|
|
if norm_title and norm_title in self.title_norm_map:
|
|
self.title_norm_map[norm_title] = [
|
|
e for e in self.title_norm_map[norm_title]
|
|
if not (e["vault"] == vault_name and e["path"] == file_info.get("path"))
|
|
]
|
|
if not self.title_norm_map[norm_title]:
|
|
del self.title_norm_map[norm_title]
|
|
|
|
# --- Word index ---
|
|
content = file_info.get("content", "")
|
|
full_text = title + " " + content
|
|
for token in set(tokenize(full_text)):
|
|
wi = self.word_index.get(token)
|
|
if wi:
|
|
wi.pop(doc_key, None)
|
|
if not wi:
|
|
del self.word_index[token]
|
|
# Remove from sorted tokens via bisect
|
|
idx = bisect.bisect_left(self._sorted_tokens, token)
|
|
if idx < len(self._sorted_tokens) and self._sorted_tokens[idx] == token:
|
|
self._sorted_tokens.pop(idx)
|
|
```
|
|
|
|
Key: `bisect.insort` for insertion and `bisect.bisect_left` + `pop(idx)` for removal keep `_sorted_tokens` sorted in O(V) worst case (list shift) but this is negligible compared to O(N * content) rebuild.
|
|
|
|
### Step 3: Modify `get_inverted_index()` to NOT check staleness
|
|
|
|
File: `backend/search.py`
|
|
|
|
```python
|
|
def get_inverted_index() -> InvertedIndex:
|
|
"""Return the singleton inverted index. Always up-to-date via hooks."""
|
|
return _inverted_index
|
|
```
|
|
|
|
Remove `is_stale()` and the `_source_generation` / `_last_rebuild` / `_rebuild_cooldown` fields. Keep `rebuild()` for initial build and manual reindex (still called once at startup via `build_index`).
|
|
|
|
### Step 4: Call `rebuild()` once after initial index build
|
|
|
|
In `backend/search.py`, register the hook AND call rebuild once:
|
|
|
|
```python
|
|
# After InvertedIndex class and _inverted_index = InvertedIndex()
|
|
|
|
def _on_index_change_hook(action, vault_name, path, file_info):
|
|
inv = _inverted_index
|
|
try:
|
|
if action == 'add':
|
|
inv.add_document(vault_name, path, file_info)
|
|
elif action == 'remove':
|
|
inv.remove_document(vault_name, path)
|
|
except Exception as e:
|
|
logger.warning(f"Inverted index incremental update failed ({action} {vault_name}/{path}): {e}")
|
|
# Fallback: mark for rebuild on next search
|
|
inv._needs_rebuild = True
|
|
|
|
_indexer.set_index_change_hook(_on_index_change_hook)
|
|
|
|
# Initial build trigger — called after first index is built
|
|
def init_inverted_index():
|
|
"""Force initial inverted index build. Called after build_index completes."""
|
|
_inverted_index.rebuild()
|
|
|
|
def get_inverted_index() -> InvertedIndex:
|
|
"""Return the singleton inverted index."""
|
|
# Only check for rebuild if incremental updates have failed
|
|
# OR if this is the very first call (doc_count == 0 and index has files)
|
|
if getattr(_inverted_index, '_needs_rebuild', False):
|
|
_inverted_index.rebuild()
|
|
_inverted_index._needs_rebuild = False
|
|
elif _inverted_index.doc_count == 0 and any(
|
|
vdata.get("files") for vdata in index.values()
|
|
):
|
|
_inverted_index.rebuild()
|
|
return _inverted_index
|
|
```
|
|
|
|
### Step 5: Call `init_inverted_index()` from `build_index` in main.py
|
|
|
|
In `backend/main.py`, after `build_index()` completes in the lifespan handler, call:
|
|
|
|
```python
|
|
from backend.search import init_inverted_index
|
|
init_inverted_index()
|
|
```
|
|
|
|
This ensures the inverted index is built once on startup, then incrementally maintained thereafter.
|
|
|
|
### Tag prefix index handling
|
|
|
|
The `tag_norm_map` and `tag_prefix_index` are built per-vault in `rebuild()`. For incremental updates, we need to handle tag changes:
|
|
|
|
In `add_document`, after adding doc tags:
|
|
```python
|
|
# Check if any tags are new (not in tag_norm_map)
|
|
for tag in tags:
|
|
norm_tag = normalize_text(tag)
|
|
if norm_tag not in self.tag_norm_map:
|
|
self.tag_norm_map[norm_tag] = tag
|
|
for plen in range(MIN_PREFIX_LENGTH, len(norm_tag) + 1):
|
|
prefix = norm_tag[:plen]
|
|
if tag not in self.tag_prefix_index[prefix]:
|
|
self.tag_prefix_index[prefix].append(tag)
|
|
```
|
|
|
|
In `_remove_doc_internals`, we do NOT remove tags from `tag_norm_map` or `tag_prefix_index` — these are global (per-vault tag vocabulary), not per-document. They only grow over the lifetime of the inverted index. A periodic `rebuild()` on manual reindex will clean them up.
|
|
|
|
### Step 6: Remove cooldown hack from search.py
|
|
|
|
Remove:
|
|
- `_last_rebuild` and `_rebuild_cooldown` fields from `InvertedIndex.__init__`
|
|
- `is_stale()` method
|
|
- `_source_generation` field (no longer needed for staleness, but keep for diagnostics)
|
|
|
|
### Step 7: Remove coalescence hack from main.py
|
|
|
|
In `_on_vault_change` in `backend/main.py`, remove:
|
|
```python
|
|
old_gen = idx._index_generation
|
|
...
|
|
if idx._index_generation > old_gen + 1:
|
|
idx._index_generation = old_gen + 1
|
|
```
|
|
|
|
This hack was only needed to reduce the number of inverted index rebuilds. With incremental updates, it's unnecessary — each mutation is cheap.
|
|
|
|
## Files Modified (Summary)
|
|
|
|
| File | Changes |
|
|
|------|---------|
|
|
| `backend/indexer.py` | +`_on_index_change` hook variable, +`set_index_change_hook()`, +hook calls in `_add_file_to_structures` and `_remove_file_from_structures` |
|
|
| `backend/search.py` | +`add_document()`, +`remove_document()`, +`_remove_doc_internals()`, +`init_inverted_index()`, +hook registration, remove `is_stale()`/cooldown, simplify `get_inverted_index()` |
|
|
| `backend/main.py` | +`init_inverted_index()` call after `build_index()`, remove coalescence hack in `_on_vault_change` |
|
|
|
|
## Risks & Edge Cases
|
|
|
|
1. **Thread safety:** `_add_file_to_structures` and `_remove_file_from_structures` are protected by `_index_lock` / `_async_index_lock` in indexer.py. The InvertedIndex methods are called inside these locks, so they're also protected. No additional locking needed.
|
|
|
|
2. **Hook registration timing:** `search.py` imports `indexer.py` at the top, then later registers the hook. The hook is registered at module load time, BEFORE the first call to `build_index`. So `_on_index_change` is set when `build_index` runs — but `build_index` calls `_add_file_to_structures` internally, which would try to incrementally update an empty inverted index. **Fix:** The hook checks `if _inverted_index.doc_count == 0` and skips incremental updates; the initial `rebuild()` handles the bulk load.
|
|
|
|
3. **Hook call during initial build_index:** `build_index` iterates files and calls `_add_file_to_structures`. The hook fires for each file, calling `add_document()` on an empty inverted index. This is slower than a single `rebuild()`. **Fix:** Add a flag `_inverted_index._ready = False` initially, set to True after `init_inverted_index()`. The hook skips when `_ready` is False.
|
|
|
|
4. **Sorted tokens performance:** `bisect.insort` and `list.pop(idx)` are O(V) worst case for large V. For 40k files, the vocabulary size V is typically 50k-200k tokens. O(V) for a single insertion is ~0.001ms, acceptable. The rebuild() call at startup handles the initial bulk.
|
|
|
|
5. **tag_norm_map / tag_prefix_index growth:** These grow monotonically (never shrink on incremental remove). With 40k files and thousands of tags, this is a few thousand entries — negligible. A manual "Réindexer" button triggers a full `rebuild()` to clean up.
|