9.0 KiB
9.0 KiB
Phase 3 Monitoring Guide
📊 Real-Time Performance Dashboard
Access the Dashboard
# View performance metrics in JSON format
curl http://localhost:3000/__perf | jq
# Pretty print with colors
curl -s http://localhost:3000/__perf | jq '.' --color-output
# Watch metrics update in real-time
watch -n 1 'curl -s http://localhost:3000/__perf | jq .'
Dashboard Response Structure
{
"performance": {
"uptime": 12345, // Server uptime in ms
"requests": {
"total": 500, // Total requests
"errors": 2, // Failed requests
"errorRate": "0.4%" // Error percentage
},
"cache": {
"hits": 425, // Cache hits
"misses": 75, // Cache misses
"hitRate": "85%" // Hit rate percentage
},
"retries": {
"meilisearch": 3, // Meilisearch retries
"filesystem": 1 // Filesystem retries
},
"latency": {
"avgMs": 42, // Average response time
"p95Ms": 98, // 95th percentile
"samples": 500 // Number of samples
}
},
"cache": {
"size": 8, // Current cache entries
"maxItems": 10000, // Max cache size
"ttlMs": 300000, // Cache TTL (5 min)
"hitRate": 85.0, // Hit rate %
"hits": 425, // Total hits
"misses": 75, // Total misses
"evictions": 0, // LRU evictions
"sets": 83 // Cache sets
},
"circuitBreaker": {
"state": "closed", // closed|open|half-open
"failureCount": 0, // Consecutive failures
"failureThreshold": 5 // Failure threshold
},
"timestamp": "2025-10-23T14:30:00.000Z"
}
🎯 Key Metrics to Monitor
1. Cache Hit Rate
# Extract cache hit rate
curl -s http://localhost:3000/__perf | jq '.cache.hitRate'
# Output: 85.0
# Target: > 80% after 5 minutes of usage
# If lower: Check TTL, cache size, or request patterns
2. Response Latency
# Check average response time
curl -s http://localhost:3000/__perf | jq '.performance.latency'
# Output:
# {
# "avgMs": 42,
# "p95Ms": 98,
# "samples": 500
# }
# Target:
# - Cached: < 20ms average
# - Uncached: < 500ms average
# - P95: < 200ms
3. Error Rate
# Check error rate
curl -s http://localhost:3000/__perf | jq '.performance.requests.errorRate'
# Output: "0.4%"
# Target: < 1% under normal conditions
# If higher: Check Meilisearch or filesystem issues
4. Circuit Breaker State
# Check circuit breaker status
curl -s http://localhost:3000/__perf | jq '.circuitBreaker'
# Output:
# {
# "state": "closed",
# "failureCount": 0,
# "failureThreshold": 5
# }
# States:
# - "closed": Normal operation
# - "half-open": Testing recovery
# - "open": Failing, requests rejected
5. Retry Counts
# Check retry activity
curl -s http://localhost:3000/__perf | jq '.performance.retries'
# Output:
# {
# "meilisearch": 3,
# "filesystem": 1
# }
# Indicates transient failures being handled gracefully
📈 Monitoring Dashboards
Simple Shell Script
#!/bin/bash
# monitor-phase3.sh
while true; do
clear
echo "=== ObsiViewer Phase 3 Monitoring ==="
echo "Time: $(date)"
echo ""
curl -s http://localhost:3000/__perf | jq '{
uptime: .performance.uptime,
requests: .performance.requests,
cache: .performance.cache,
latency: .performance.latency,
circuitBreaker: .circuitBreaker
}'
echo ""
echo "Refreshing in 5 seconds..."
sleep 5
done
Using jq for Specific Metrics
# Cache hit rate only
curl -s http://localhost:3000/__perf | jq '.cache.hitRate'
# Average latency
curl -s http://localhost:3000/__perf | jq '.performance.latency.avgMs'
# Error rate
curl -s http://localhost:3000/__perf | jq '.performance.requests.errorRate'
# Uptime in seconds
curl -s http://localhost:3000/__perf | jq '.performance.uptime / 1000 | floor'
🔍 Server Logs Analysis
Log Patterns to Look For
Cache Hits/Misses
[/api/vault/metadata] CACHE HIT - 12ms
[/api/vault/metadata] CACHE MISS - 245ms
Meilisearch Indexing
[Meilisearch] Scheduling background indexing...
[Meilisearch] Background indexing completed
[Meilisearch] ✅ Background indexing completed
Retry Activity
[Meilisearch] Retry attempt 1, delay 100ms: Connection timeout
[Filesystem] Retry attempt 1, delay 150ms: ENOENT
Circuit Breaker
[Meilisearch] Circuit breaker opened after 5 failures
[Meilisearch] Circuit breaker is open (reset in 25000ms)
Log Filtering
# Show only cache operations
npm run start 2>&1 | grep -i cache
# Show only Meilisearch operations
npm run start 2>&1 | grep -i meilisearch
# Show only errors
npm run start 2>&1 | grep -i error
# Show only retries
npm run start 2>&1 | grep -i retry
📊 Performance Benchmarks
Expected Performance
Startup Time
Before Phase 3: 5-10 seconds (blocked by indexing)
After Phase 3: < 2 seconds (indexing in background)
Improvement: 5-10x faster ✅
Metadata Endpoint Response Time
First request (cache miss): 200-500ms
Subsequent requests (hit): 5-15ms
Improvement: 30x faster ✅
Cache Hit Rate Over Time
0-1 min: 0% (warming up)
1-5 min: 50-80% (building cache)
5+ min: 85-95% (stable)
Memory Usage
Baseline: 50-100MB
With cache: 50-100MB (controlled by LRU)
Overhead: Minimal (< 5MB for cache)
🧪 Load Testing
Test Cache Behavior
#!/bin/bash
# test-cache.sh
echo "Testing cache behavior..."
# Warm up (first request)
echo "Request 1 (cache miss):"
time curl -s http://localhost:3000/api/vault/metadata > /dev/null
# Should be fast (cache hit)
echo "Request 2 (cache hit):"
time curl -s http://localhost:3000/api/vault/metadata > /dev/null
# Check metrics
echo ""
echo "Cache statistics:"
curl -s http://localhost:3000/__perf | jq '.cache'
Test Retry Behavior
#!/bin/bash
# test-retry.sh
# Stop Meilisearch to trigger retries
echo "Stopping Meilisearch..."
docker-compose down
# Make requests (should use filesystem fallback with retries)
echo "Making requests with Meilisearch down..."
for i in {1..5}; do
echo "Request $i:"
curl -s http://localhost:3000/api/vault/metadata | jq '.items | length'
sleep 1
done
# Check retry counts
echo ""
echo "Retry statistics:"
curl -s http://localhost:3000/__perf | jq '.performance.retries'
# Restart Meilisearch
echo "Restarting Meilisearch..."
docker-compose up -d
Test Circuit Breaker
#!/bin/bash
# test-circuit-breaker.sh
# Make many requests to trigger failures
echo "Triggering circuit breaker..."
for i in {1..10}; do
curl -s http://localhost:3000/api/vault/metadata > /dev/null 2>&1 &
done
# Check circuit breaker state
sleep 2
echo "Circuit breaker state:"
curl -s http://localhost:3000/__perf | jq '.circuitBreaker'
🚨 Alert Thresholds
Recommended Alerts
| Metric | Threshold | Action |
|---|---|---|
| Cache Hit Rate | < 50% | Check TTL, cache size |
| Error Rate | > 5% | Check Meilisearch, filesystem |
| P95 Latency | > 500ms | Check server load, cache |
| Circuit Breaker | "open" | Restart Meilisearch |
| Memory Usage | > 200MB | Check for memory leak |
Setting Up Alerts
#!/bin/bash
# alert-monitor.sh
while true; do
METRICS=$(curl -s http://localhost:3000/__perf)
# Check cache hit rate
HIT_RATE=$(echo $METRICS | jq '.cache.hitRate')
if (( $(echo "$HIT_RATE < 50" | bc -l) )); then
echo "⚠️ ALERT: Low cache hit rate: $HIT_RATE%"
fi
# Check error rate
ERROR_RATE=$(echo $METRICS | jq '.performance.requests.errorRate' | tr -d '%')
if (( $(echo "$ERROR_RATE > 5" | bc -l) )); then
echo "⚠️ ALERT: High error rate: $ERROR_RATE%"
fi
# Check circuit breaker
CB_STATE=$(echo $METRICS | jq -r '.circuitBreaker.state')
if [ "$CB_STATE" = "open" ]; then
echo "⚠️ ALERT: Circuit breaker is open"
fi
sleep 10
done
📝 Monitoring Checklist
- Server starts in < 2 seconds
/__perfendpoint responds with metrics- Cache hit rate reaches > 80% after 5 minutes
- Average latency for cached requests < 20ms
- Error rate < 1%
- Circuit breaker state is "closed"
- No memory leaks over time
- Meilisearch indexing completes in background
- Filesystem fallback works when Meilisearch down
- Graceful shutdown on SIGINT
🎯 Success Criteria
✅ Cache hit rate > 80% after 5 minutes ✅ Response time < 20ms for cached requests ✅ Server startup < 2 seconds ✅ Error rate < 1% ✅ Memory usage stable ✅ Circuit breaker protecting against cascading failures ✅ Automatic retry handling transient failures ✅ Graceful fallback to filesystem
Last Updated: 2025-10-23 Status: Production Ready