9.0 KiB
		
	
	
	
	
	
	
	
			
		
		
	
	
			9.0 KiB
		
	
	
	
	
	
	
	
Phase 3 Monitoring Guide
📊 Real-Time Performance Dashboard
Access the Dashboard
# View performance metrics in JSON format
curl http://localhost:3000/__perf | jq
# Pretty print with colors
curl -s http://localhost:3000/__perf | jq '.' --color-output
# Watch metrics update in real-time
watch -n 1 'curl -s http://localhost:3000/__perf | jq .'
Dashboard Response Structure
{
  "performance": {
    "uptime": 12345,                    // Server uptime in ms
    "requests": {
      "total": 500,                     // Total requests
      "errors": 2,                      // Failed requests
      "errorRate": "0.4%"               // Error percentage
    },
    "cache": {
      "hits": 425,                      // Cache hits
      "misses": 75,                     // Cache misses
      "hitRate": "85%"                  // Hit rate percentage
    },
    "retries": {
      "meilisearch": 3,                 // Meilisearch retries
      "filesystem": 1                   // Filesystem retries
    },
    "latency": {
      "avgMs": 42,                      // Average response time
      "p95Ms": 98,                      // 95th percentile
      "samples": 500                    // Number of samples
    }
  },
  "cache": {
    "size": 8,                          // Current cache entries
    "maxItems": 10000,                  // Max cache size
    "ttlMs": 300000,                    // Cache TTL (5 min)
    "hitRate": 85.0,                    // Hit rate %
    "hits": 425,                        // Total hits
    "misses": 75,                       // Total misses
    "evictions": 0,                     // LRU evictions
    "sets": 83                          // Cache sets
  },
  "circuitBreaker": {
    "state": "closed",                  // closed|open|half-open
    "failureCount": 0,                  // Consecutive failures
    "failureThreshold": 5               // Failure threshold
  },
  "timestamp": "2025-10-23T14:30:00.000Z"
}
🎯 Key Metrics to Monitor
1. Cache Hit Rate
# Extract cache hit rate
curl -s http://localhost:3000/__perf | jq '.cache.hitRate'
# Output: 85.0
# Target: > 80% after 5 minutes of usage
# If lower: Check TTL, cache size, or request patterns
2. Response Latency
# Check average response time
curl -s http://localhost:3000/__perf | jq '.performance.latency'
# Output:
# {
#   "avgMs": 42,
#   "p95Ms": 98,
#   "samples": 500
# }
# Target: 
# - Cached: < 20ms average
# - Uncached: < 500ms average
# - P95: < 200ms
3. Error Rate
# Check error rate
curl -s http://localhost:3000/__perf | jq '.performance.requests.errorRate'
# Output: "0.4%"
# Target: < 1% under normal conditions
# If higher: Check Meilisearch or filesystem issues
4. Circuit Breaker State
# Check circuit breaker status
curl -s http://localhost:3000/__perf | jq '.circuitBreaker'
# Output:
# {
#   "state": "closed",
#   "failureCount": 0,
#   "failureThreshold": 5
# }
# States:
# - "closed": Normal operation
# - "half-open": Testing recovery
# - "open": Failing, requests rejected
5. Retry Counts
# Check retry activity
curl -s http://localhost:3000/__perf | jq '.performance.retries'
# Output:
# {
#   "meilisearch": 3,
#   "filesystem": 1
# }
# Indicates transient failures being handled gracefully
📈 Monitoring Dashboards
Simple Shell Script
#!/bin/bash
# monitor-phase3.sh
while true; do
  clear
  echo "=== ObsiViewer Phase 3 Monitoring ==="
  echo "Time: $(date)"
  echo ""
  
  curl -s http://localhost:3000/__perf | jq '{
    uptime: .performance.uptime,
    requests: .performance.requests,
    cache: .performance.cache,
    latency: .performance.latency,
    circuitBreaker: .circuitBreaker
  }'
  
  echo ""
  echo "Refreshing in 5 seconds..."
  sleep 5
done
Using jq for Specific Metrics
# Cache hit rate only
curl -s http://localhost:3000/__perf | jq '.cache.hitRate'
# Average latency
curl -s http://localhost:3000/__perf | jq '.performance.latency.avgMs'
# Error rate
curl -s http://localhost:3000/__perf | jq '.performance.requests.errorRate'
# Uptime in seconds
curl -s http://localhost:3000/__perf | jq '.performance.uptime / 1000 | floor'
🔍 Server Logs Analysis
Log Patterns to Look For
Cache Hits/Misses
[/api/vault/metadata] CACHE HIT - 12ms
[/api/vault/metadata] CACHE MISS - 245ms
Meilisearch Indexing
[Meilisearch] Scheduling background indexing...
[Meilisearch] Background indexing completed
[Meilisearch] ✅ Background indexing completed
Retry Activity
[Meilisearch] Retry attempt 1, delay 100ms: Connection timeout
[Filesystem] Retry attempt 1, delay 150ms: ENOENT
Circuit Breaker
[Meilisearch] Circuit breaker opened after 5 failures
[Meilisearch] Circuit breaker is open (reset in 25000ms)
Log Filtering
# Show only cache operations
npm run start 2>&1 | grep -i cache
# Show only Meilisearch operations
npm run start 2>&1 | grep -i meilisearch
# Show only errors
npm run start 2>&1 | grep -i error
# Show only retries
npm run start 2>&1 | grep -i retry
📊 Performance Benchmarks
Expected Performance
Startup Time
Before Phase 3: 5-10 seconds (blocked by indexing)
After Phase 3:  < 2 seconds (indexing in background)
Improvement:    5-10x faster ✅
Metadata Endpoint Response Time
First request (cache miss):   200-500ms
Subsequent requests (hit):    5-15ms
Improvement:                  30x faster ✅
Cache Hit Rate Over Time
0-1 min:   0% (warming up)
1-5 min:   50-80% (building cache)
5+ min:    85-95% (stable)
Memory Usage
Baseline:  50-100MB
With cache: 50-100MB (controlled by LRU)
Overhead:  Minimal (< 5MB for cache)
🧪 Load Testing
Test Cache Behavior
#!/bin/bash
# test-cache.sh
echo "Testing cache behavior..."
# Warm up (first request)
echo "Request 1 (cache miss):"
time curl -s http://localhost:3000/api/vault/metadata > /dev/null
# Should be fast (cache hit)
echo "Request 2 (cache hit):"
time curl -s http://localhost:3000/api/vault/metadata > /dev/null
# Check metrics
echo ""
echo "Cache statistics:"
curl -s http://localhost:3000/__perf | jq '.cache'
Test Retry Behavior
#!/bin/bash
# test-retry.sh
# Stop Meilisearch to trigger retries
echo "Stopping Meilisearch..."
docker-compose down
# Make requests (should use filesystem fallback with retries)
echo "Making requests with Meilisearch down..."
for i in {1..5}; do
  echo "Request $i:"
  curl -s http://localhost:3000/api/vault/metadata | jq '.items | length'
  sleep 1
done
# Check retry counts
echo ""
echo "Retry statistics:"
curl -s http://localhost:3000/__perf | jq '.performance.retries'
# Restart Meilisearch
echo "Restarting Meilisearch..."
docker-compose up -d
Test Circuit Breaker
#!/bin/bash
# test-circuit-breaker.sh
# Make many requests to trigger failures
echo "Triggering circuit breaker..."
for i in {1..10}; do
  curl -s http://localhost:3000/api/vault/metadata > /dev/null 2>&1 &
done
# Check circuit breaker state
sleep 2
echo "Circuit breaker state:"
curl -s http://localhost:3000/__perf | jq '.circuitBreaker'
🚨 Alert Thresholds
Recommended Alerts
| Metric | Threshold | Action | 
|---|---|---|
| Cache Hit Rate | < 50% | Check TTL, cache size | 
| Error Rate | > 5% | Check Meilisearch, filesystem | 
| P95 Latency | > 500ms | Check server load, cache | 
| Circuit Breaker | "open" | Restart Meilisearch | 
| Memory Usage | > 200MB | Check for memory leak | 
Setting Up Alerts
#!/bin/bash
# alert-monitor.sh
while true; do
  METRICS=$(curl -s http://localhost:3000/__perf)
  
  # Check cache hit rate
  HIT_RATE=$(echo $METRICS | jq '.cache.hitRate')
  if (( $(echo "$HIT_RATE < 50" | bc -l) )); then
    echo "⚠️  ALERT: Low cache hit rate: $HIT_RATE%"
  fi
  
  # Check error rate
  ERROR_RATE=$(echo $METRICS | jq '.performance.requests.errorRate' | tr -d '%')
  if (( $(echo "$ERROR_RATE > 5" | bc -l) )); then
    echo "⚠️  ALERT: High error rate: $ERROR_RATE%"
  fi
  
  # Check circuit breaker
  CB_STATE=$(echo $METRICS | jq -r '.circuitBreaker.state')
  if [ "$CB_STATE" = "open" ]; then
    echo "⚠️  ALERT: Circuit breaker is open"
  fi
  
  sleep 10
done
📝 Monitoring Checklist
- Server starts in < 2 seconds
- /__perfendpoint responds with metrics
- Cache hit rate reaches > 80% after 5 minutes
- Average latency for cached requests < 20ms
- Error rate < 1%
- Circuit breaker state is "closed"
- No memory leaks over time
- Meilisearch indexing completes in background
- Filesystem fallback works when Meilisearch down
- Graceful shutdown on SIGINT
🎯 Success Criteria
✅ Cache hit rate > 80% after 5 minutes ✅ Response time < 20ms for cached requests ✅ Server startup < 2 seconds ✅ Error rate < 1% ✅ Memory usage stable ✅ Circuit breaker protecting against cascading failures ✅ Automatic retry handling transient failures ✅ Graceful fallback to filesystem
Last Updated: 2025-10-23 Status: Production Ready