# Phase 3 Monitoring Guide ## ๐Ÿ“Š Real-Time Performance Dashboard ### Access the Dashboard ```bash # View performance metrics in JSON format curl http://localhost:3000/__perf | jq # Pretty print with colors curl -s http://localhost:3000/__perf | jq '.' --color-output # Watch metrics update in real-time watch -n 1 'curl -s http://localhost:3000/__perf | jq .' ``` ### Dashboard Response Structure ```json { "performance": { "uptime": 12345, // Server uptime in ms "requests": { "total": 500, // Total requests "errors": 2, // Failed requests "errorRate": "0.4%" // Error percentage }, "cache": { "hits": 425, // Cache hits "misses": 75, // Cache misses "hitRate": "85%" // Hit rate percentage }, "retries": { "meilisearch": 3, // Meilisearch retries "filesystem": 1 // Filesystem retries }, "latency": { "avgMs": 42, // Average response time "p95Ms": 98, // 95th percentile "samples": 500 // Number of samples } }, "cache": { "size": 8, // Current cache entries "maxItems": 10000, // Max cache size "ttlMs": 300000, // Cache TTL (5 min) "hitRate": 85.0, // Hit rate % "hits": 425, // Total hits "misses": 75, // Total misses "evictions": 0, // LRU evictions "sets": 83 // Cache sets }, "circuitBreaker": { "state": "closed", // closed|open|half-open "failureCount": 0, // Consecutive failures "failureThreshold": 5 // Failure threshold }, "timestamp": "2025-10-23T14:30:00.000Z" } ``` ## ๐ŸŽฏ Key Metrics to Monitor ### 1. Cache Hit Rate ```bash # Extract cache hit rate curl -s http://localhost:3000/__perf | jq '.cache.hitRate' # Output: 85.0 # Target: > 80% after 5 minutes of usage # If lower: Check TTL, cache size, or request patterns ``` ### 2. Response Latency ```bash # Check average response time curl -s http://localhost:3000/__perf | jq '.performance.latency' # Output: # { # "avgMs": 42, # "p95Ms": 98, # "samples": 500 # } # Target: # - Cached: < 20ms average # - Uncached: < 500ms average # - P95: < 200ms ``` ### 3. Error Rate ```bash # Check error rate curl -s http://localhost:3000/__perf | jq '.performance.requests.errorRate' # Output: "0.4%" # Target: < 1% under normal conditions # If higher: Check Meilisearch or filesystem issues ``` ### 4. Circuit Breaker State ```bash # Check circuit breaker status curl -s http://localhost:3000/__perf | jq '.circuitBreaker' # Output: # { # "state": "closed", # "failureCount": 0, # "failureThreshold": 5 # } # States: # - "closed": Normal operation # - "half-open": Testing recovery # - "open": Failing, requests rejected ``` ### 5. Retry Counts ```bash # Check retry activity curl -s http://localhost:3000/__perf | jq '.performance.retries' # Output: # { # "meilisearch": 3, # "filesystem": 1 # } # Indicates transient failures being handled gracefully ``` ## ๐Ÿ“ˆ Monitoring Dashboards ### Simple Shell Script ```bash #!/bin/bash # monitor-phase3.sh while true; do clear echo "=== ObsiViewer Phase 3 Monitoring ===" echo "Time: $(date)" echo "" curl -s http://localhost:3000/__perf | jq '{ uptime: .performance.uptime, requests: .performance.requests, cache: .performance.cache, latency: .performance.latency, circuitBreaker: .circuitBreaker }' echo "" echo "Refreshing in 5 seconds..." sleep 5 done ``` ### Using jq for Specific Metrics ```bash # Cache hit rate only curl -s http://localhost:3000/__perf | jq '.cache.hitRate' # Average latency curl -s http://localhost:3000/__perf | jq '.performance.latency.avgMs' # Error rate curl -s http://localhost:3000/__perf | jq '.performance.requests.errorRate' # Uptime in seconds curl -s http://localhost:3000/__perf | jq '.performance.uptime / 1000 | floor' ``` ## ๐Ÿ” Server Logs Analysis ### Log Patterns to Look For #### Cache Hits/Misses ``` [/api/vault/metadata] CACHE HIT - 12ms [/api/vault/metadata] CACHE MISS - 245ms ``` #### Meilisearch Indexing ``` [Meilisearch] Scheduling background indexing... [Meilisearch] Background indexing completed [Meilisearch] โœ… Background indexing completed ``` #### Retry Activity ``` [Meilisearch] Retry attempt 1, delay 100ms: Connection timeout [Filesystem] Retry attempt 1, delay 150ms: ENOENT ``` #### Circuit Breaker ``` [Meilisearch] Circuit breaker opened after 5 failures [Meilisearch] Circuit breaker is open (reset in 25000ms) ``` ### Log Filtering ```bash # Show only cache operations npm run start 2>&1 | grep -i cache # Show only Meilisearch operations npm run start 2>&1 | grep -i meilisearch # Show only errors npm run start 2>&1 | grep -i error # Show only retries npm run start 2>&1 | grep -i retry ``` ## ๐Ÿ“Š Performance Benchmarks ### Expected Performance #### Startup Time ``` Before Phase 3: 5-10 seconds (blocked by indexing) After Phase 3: < 2 seconds (indexing in background) Improvement: 5-10x faster โœ… ``` #### Metadata Endpoint Response Time ``` First request (cache miss): 200-500ms Subsequent requests (hit): 5-15ms Improvement: 30x faster โœ… ``` #### Cache Hit Rate Over Time ``` 0-1 min: 0% (warming up) 1-5 min: 50-80% (building cache) 5+ min: 85-95% (stable) ``` #### Memory Usage ``` Baseline: 50-100MB With cache: 50-100MB (controlled by LRU) Overhead: Minimal (< 5MB for cache) ``` ## ๐Ÿงช Load Testing ### Test Cache Behavior ```bash #!/bin/bash # test-cache.sh echo "Testing cache behavior..." # Warm up (first request) echo "Request 1 (cache miss):" time curl -s http://localhost:3000/api/vault/metadata > /dev/null # Should be fast (cache hit) echo "Request 2 (cache hit):" time curl -s http://localhost:3000/api/vault/metadata > /dev/null # Check metrics echo "" echo "Cache statistics:" curl -s http://localhost:3000/__perf | jq '.cache' ``` ### Test Retry Behavior ```bash #!/bin/bash # test-retry.sh # Stop Meilisearch to trigger retries echo "Stopping Meilisearch..." docker-compose down # Make requests (should use filesystem fallback with retries) echo "Making requests with Meilisearch down..." for i in {1..5}; do echo "Request $i:" curl -s http://localhost:3000/api/vault/metadata | jq '.items | length' sleep 1 done # Check retry counts echo "" echo "Retry statistics:" curl -s http://localhost:3000/__perf | jq '.performance.retries' # Restart Meilisearch echo "Restarting Meilisearch..." docker-compose up -d ``` ### Test Circuit Breaker ```bash #!/bin/bash # test-circuit-breaker.sh # Make many requests to trigger failures echo "Triggering circuit breaker..." for i in {1..10}; do curl -s http://localhost:3000/api/vault/metadata > /dev/null 2>&1 & done # Check circuit breaker state sleep 2 echo "Circuit breaker state:" curl -s http://localhost:3000/__perf | jq '.circuitBreaker' ``` ## ๐Ÿšจ Alert Thresholds ### Recommended Alerts | Metric | Threshold | Action | |--------|-----------|--------| | Cache Hit Rate | < 50% | Check TTL, cache size | | Error Rate | > 5% | Check Meilisearch, filesystem | | P95 Latency | > 500ms | Check server load, cache | | Circuit Breaker | "open" | Restart Meilisearch | | Memory Usage | > 200MB | Check for memory leak | ### Setting Up Alerts ```bash #!/bin/bash # alert-monitor.sh while true; do METRICS=$(curl -s http://localhost:3000/__perf) # Check cache hit rate HIT_RATE=$(echo $METRICS | jq '.cache.hitRate') if (( $(echo "$HIT_RATE < 50" | bc -l) )); then echo "โš ๏ธ ALERT: Low cache hit rate: $HIT_RATE%" fi # Check error rate ERROR_RATE=$(echo $METRICS | jq '.performance.requests.errorRate' | tr -d '%') if (( $(echo "$ERROR_RATE > 5" | bc -l) )); then echo "โš ๏ธ ALERT: High error rate: $ERROR_RATE%" fi # Check circuit breaker CB_STATE=$(echo $METRICS | jq -r '.circuitBreaker.state') if [ "$CB_STATE" = "open" ]; then echo "โš ๏ธ ALERT: Circuit breaker is open" fi sleep 10 done ``` ## ๐Ÿ“ Monitoring Checklist - [ ] Server starts in < 2 seconds - [ ] `/__perf` endpoint responds with metrics - [ ] Cache hit rate reaches > 80% after 5 minutes - [ ] Average latency for cached requests < 20ms - [ ] Error rate < 1% - [ ] Circuit breaker state is "closed" - [ ] No memory leaks over time - [ ] Meilisearch indexing completes in background - [ ] Filesystem fallback works when Meilisearch down - [ ] Graceful shutdown on SIGINT ## ๐ŸŽฏ Success Criteria โœ… Cache hit rate > 80% after 5 minutes โœ… Response time < 20ms for cached requests โœ… Server startup < 2 seconds โœ… Error rate < 1% โœ… Memory usage stable โœ… Circuit breaker protecting against cascading failures โœ… Automatic retry handling transient failures โœ… Graceful fallback to filesystem --- **Last Updated**: 2025-10-23 **Status**: Production Ready