homelab_automation/documentation/prompts/add_gestion_docker.md
Bruno Charest 27eed55c9b
2025-12-15 08:31:12 -05:00


[TASK] New "Docker Hosts" section + monitoring + actions + notifications

🎯 Main objective

Add a complete Docker management feature to the existing Homelab Dashboard, providing:

  • Real-time monitoring of multiple Docker hosts
  • Container actions (start/stop/restart/redeploy/logs)
  • Proactive detection of, and alerting on, down containers
  • Seamless integration with the existing architecture

📐 MANDATORY architecture constraints

Existing tech stack (to be followed strictly)

Backend:
  - FastAPI (routes/ + services/ + models/ + schemas/)
  - SQLAlchemy 2.x async (data/homelab.db)
  - Alembic for migrations
  - APScheduler (periodic jobs already configured)
  - Real-time WebSocket (websocket_manager.py)
  - JWT auth (app.auth_utils + OAuth2PasswordBearer)
  - ntfy notifications (services/notifications.py)

Frontend:
  - index.html + main.js (vanilla JS)
  - Tailwind CSS
  - Anime.js for animations
  - Section-based navigation pattern (dashboard, hosts, tasks, schedules, etc.)

Infrastructure:
  - Ansible for automation (existing hosts.yml inventory)
  - SSH already configured (automation user + keys)
  - Existing SSH bootstrap (services/bootstrap.py)

Existing DB models to extend (do NOT recreate)

# models/host.py - EXISTING TABLE
class Host(Base):
    __tablename__ = "hosts"
    id: int
    name: str
    host: str  # IP/hostname
    os_type: str
    status: str  # online/offline
    bootstrap_status: dict
    last_seen_at: datetime
    # TO EXTEND with: docker_enabled, docker_version, docker_status, docker_last_collect_at

# models/task.py - EXISTING TABLE
class Task(Base):
    __tablename__ = "tasks"
    id: int
    action: str
    status: str  # pending/running/success/failed
    # Reuse for Docker actions

# TO CREATE (new tables only)
# - docker_containers
# - docker_images  
# - docker_volumes
# - docker_alerts

🔧 IMPOSED technical decisions (not open to choice)

1. Docker collection: SSH + docker CLI (reuse the Ansible pattern)

Rationale:

  • SSH is already configured on all hosts (automation user + keys)
  • Docker collection plugs into the metrics collection process already in place
  • No extra configuration on the hosts (no TLS-secured Docker API)
  • Same pattern as Ansible (consistency)
  • JSON parsing: docker ps --format json, docker inspect, etc.

Implementation:

# services/docker_service.py
async def collect_docker_host(host_id: int):
    host = await get_host(host_id)
    ssh = await ssh_connect(host.host, user="automation")
    
    # Docker version
    version = await ssh_exec(ssh, "docker version --format '{{json .}}'")
    
    # Containers
    containers = await ssh_exec(ssh, 
        "docker ps -a --format '{{json .}}' --no-trunc")
    
    # Images  
    images = await ssh_exec(ssh,
        "docker images --format '{{json .}}'")
    
    # Volumes
    volumes = await ssh_exec(ssh,
        "docker volume ls --format '{{json .}}'")
    
    # System df
    df = await ssh_exec(ssh, "docker system df -v --format '{{json .}}'")
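
With `--format '{{json .}}'`, the list commands above print one JSON object per line rather than a single JSON array, so the raw output has to be parsed line by line. The sketch below shows one possible way to do that while tolerating a malformed line; `parse_json_lines` is a hypothetical helper name, not part of the existing codebase.

```python
import json
from typing import Any


def parse_json_lines(raw: str) -> tuple[list[dict[str, Any]], list[str]]:
    """Parse CLI output made of one JSON object per line.

    Returns (parsed_objects, rejected_lines) so that a single malformed
    line does not discard the whole collection.
    """
    parsed: list[dict[str, Any]] = []
    rejected: list[str] = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            rejected.append(line)
            continue
        if isinstance(obj, dict):
            parsed.append(obj)
        else:
            rejected.append(line)  # valid JSON, but not an object
    return parsed, rejected


# Illustrative input only; real docker output carries many more fields.
sample = (
    '{"ID": "abc123", "State": "running"}\n'
    "\n"
    "not-json\n"
    '{"ID": "def456", "State": "exited"}\n'
)
containers, bad = parse_json_lines(sample)
```

Returning the rejected lines alongside the parsed objects lets the collector log parsing problems per entity instead of failing the whole host.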

2. Storage: extend existing tables + create Docker tables

-- Alembic migration to create
ALTER TABLE hosts ADD COLUMN docker_enabled BOOLEAN DEFAULT FALSE;
ALTER TABLE hosts ADD COLUMN docker_version TEXT;
ALTER TABLE hosts ADD COLUMN docker_status TEXT;
ALTER TABLE hosts ADD COLUMN docker_last_collect_at TIMESTAMP;

CREATE TABLE docker_containers (
    id INTEGER PRIMARY KEY,
    host_id INTEGER REFERENCES hosts(id),
    container_id TEXT NOT NULL,
    name TEXT NOT NULL,
    image TEXT,
    state TEXT,  -- running/exited/paused
    status TEXT, -- Up 2 hours, Exited (0) 5 minutes ago
    health TEXT, -- healthy/unhealthy/starting/none
    created_at TIMESTAMP,
    ports JSON,
    labels JSON,
    compose_project TEXT,  -- com.docker.compose.project
    last_update_at TIMESTAMP,
    UNIQUE(host_id, container_id)
);

CREATE TABLE docker_images (
    id INTEGER PRIMARY KEY,
    host_id INTEGER REFERENCES hosts(id),
    image_id TEXT NOT NULL,
    repo_tags JSON,  -- ["nginx:latest", "nginx:1.25"]
    size BIGINT,
    created TIMESTAMP,
    last_update_at TIMESTAMP,
    UNIQUE(host_id, image_id)
);

CREATE TABLE docker_volumes (
    id INTEGER PRIMARY KEY,
    host_id INTEGER REFERENCES hosts(id),
    name TEXT NOT NULL,
    driver TEXT,
    mountpoint TEXT,
    scope TEXT,
    last_update_at TIMESTAMP,
    UNIQUE(host_id, name)
);

CREATE TABLE docker_alerts (
    id INTEGER PRIMARY KEY,
    host_id INTEGER REFERENCES hosts(id),
    container_name TEXT NOT NULL,
    severity TEXT,  -- warning/error/critical
    state TEXT,     -- open/closed
    message TEXT,
    opened_at TIMESTAMP NOT NULL,
    closed_at TIMESTAMP,
    last_notified_at TIMESTAMP
);

-- SQLite does not accept an inline INDEX clause; create the index separately
CREATE INDEX idx_alerts_open ON docker_alerts (state, host_id);
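
The `UNIQUE(host_id, container_id)` constraint makes it possible to refresh a container row in place on every collection instead of deleting and re-inserting. The sketch below illustrates that idea with the standard-library `sqlite3` module and a reduced copy of the table; the real implementation would go through SQLAlchemy async, and `upsert_container` is a hypothetical helper.

```python
import sqlite3
from datetime import datetime, timezone

# Reduced version of docker_containers, enough to show the upsert.
SCHEMA = """
CREATE TABLE docker_containers (
    id INTEGER PRIMARY KEY,
    host_id INTEGER NOT NULL,
    container_id TEXT NOT NULL,
    name TEXT NOT NULL,
    state TEXT,
    last_update_at TEXT,
    UNIQUE(host_id, container_id)
)
"""

# SQLite upsert: insert, or update the existing row on a key conflict.
UPSERT = """
INSERT INTO docker_containers (host_id, container_id, name, state, last_update_at)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT(host_id, container_id) DO UPDATE SET
    name = excluded.name,
    state = excluded.state,
    last_update_at = excluded.last_update_at
"""


def upsert_container(conn: sqlite3.Connection, host_id: int,
                     container_id: str, name: str, state: str) -> None:
    """Insert or refresh one container row, keyed by (host_id, container_id)."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(UPSERT, (host_id, container_id, name, state, now))


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
upsert_container(conn, 1, "abc123", "nginx", "running")
upsert_container(conn, 1, "abc123", "nginx", "exited")  # same key: updated in place
rows = conn.execute("SELECT container_id, state FROM docker_containers").fetchall()
```

After both calls the table still holds a single row for `abc123`, now in state `exited`.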

3. Scheduler: extend the existing APScheduler

# app_optimized.py - ADD to startup
from services.docker_collector import DockerCollector

@app.on_event("startup")
async def start_docker_collector():
    collector = DockerCollector(db_session, ws_manager, ntfy_service)
    
    # Periodic job: collect from all Docker-enabled hosts
    scheduler.add_job(
        collector.collect_all_hosts,
        trigger="interval",
        seconds=60,  # Every minute
        id="docker_collect",
        name="Docker Metrics Collection"
    )
    
    # Periodic job: check for down-container alerts
    scheduler.add_job(
        collector.check_alerts,
        trigger="interval", 
        seconds=30,
        id="docker_alerts",
        name="Docker Alerts Check"
    )
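
Because `collect_all_hosts` runs every minute, one hung SSH session should not delay the other hosts or overlap with the next run. The following is a minimal sketch of one way to structure it, with bounded parallelism and a per-host timeout; `collect_all`, `max_parallel` and the fake collector are assumptions for illustration, not existing project code.

```python
import asyncio
from typing import Awaitable, Callable, Union


async def collect_all(
    host_ids: list[int],
    collect_one: Callable[[int], Awaitable[dict]],
    max_parallel: int = 5,
    timeout_s: float = 15.0,
) -> dict[int, Union[dict, Exception]]:
    """Collect every host concurrently; a failure on one host is returned
    as an exception value instead of aborting the whole run."""
    sem = asyncio.Semaphore(max_parallel)

    async def guarded(host_id: int) -> Union[dict, Exception]:
        async with sem:  # cap the number of simultaneous SSH sessions
            try:
                return await asyncio.wait_for(collect_one(host_id), timeout_s)
            except Exception as exc:  # includes asyncio.TimeoutError
                return exc

    results = await asyncio.gather(*(guarded(h) for h in host_ids))
    return dict(zip(host_ids, results))


async def fake_collect(host_id: int) -> dict:
    """Stand-in for the real SSH collection."""
    if host_id == 2:
        await asyncio.sleep(1)  # simulates a hung SSH session
    if host_id == 3:
        raise ConnectionError("ssh refused")
    return {"host_id": host_id, "containers": 4}


results = asyncio.run(collect_all([1, 2, 3], fake_collect, timeout_s=0.05))
```

Here host 1 succeeds, host 2 is reported as a timeout and host 3 as a connection error, which maps naturally onto marking individual hosts offline.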

📊 API routes to create (prefix /api/docker)

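# routes/docker.py
from typing import Optional

from fastapi import APIRouter, Depends

# User, get_current_user and require_role are assumed to come from the
# existing auth layer (e.g. app.auth_utils).

router = APIRouter(prefix="/api/docker", tags=["docker"])

@router.get("/hosts")
async def list_docker_hosts(
    current_user: User = Depends(get_current_user)
):
    """List all hosts with Docker enabled"""
    
@router.post("/hosts/{host_id}/enable")
async def enable_docker_monitoring(
    host_id: int,
    current_user: User = Depends(require_role("admin"))
):
    """Enable Docker monitoring on a host"""
    
@router.post("/hosts/{host_id}/collect")
async def collect_docker_now(
    host_id: int,
    current_user: User = Depends(require_role("operator"))
):
    """Force an immediate collection"""
    
@router.get("/hosts/{host_id}/containers")
async def get_containers(
    host_id: int,
    current_user: User = Depends(get_current_user)
):
    """List the containers of a host"""

# Called by the frontend detail view (Images / Volumes tabs)
@router.get("/hosts/{host_id}/images")
async def get_images(host_id: int, current_user: User = Depends(get_current_user)):
    """List the images of a host"""

@router.get("/hosts/{host_id}/volumes")
async def get_volumes(host_id: int, current_user: User = Depends(get_current_user)):
    """List the volumes of a host"""
    
@router.post("/containers/{host_id}/{container_id}/start")
async def start_container(
    host_id: int,
    container_id: str,
    current_user: User = Depends(require_role("operator"))
):
    """Start a container"""
    
# Each action needs its own handler: stacked decorators would all bind
# to the next function defined below them.
@router.post("/containers/{host_id}/{container_id}/stop")
async def stop_container(host_id: int, container_id: str,
                         current_user: User = Depends(require_role("operator"))):
    """Stop a container"""

@router.post("/containers/{host_id}/{container_id}/restart")
async def restart_container(host_id: int, container_id: str,
                            current_user: User = Depends(require_role("operator"))):
    """Restart a container"""

@router.post("/containers/{host_id}/{container_id}/remove")
async def remove_container(host_id: int, container_id: str,
                           current_user: User = Depends(require_role("admin"))):
    """Remove a container (destructive: admin only)"""

@router.post("/containers/{host_id}/{container_id}/redeploy")
async def redeploy_container(host_id: int, container_id: str,
                             current_user: User = Depends(require_role("operator"))):
    """Redeploy a container"""
    
@router.get("/containers/{host_id}/{container_id}/logs")
async def get_container_logs(
    host_id: int,
    container_id: str,
    tail: int = 200,
    current_user: User = Depends(get_current_user)
):
    """Fetch the logs of a container"""
    
@router.get("/containers/{host_id}/{container_id}/inspect")
async def inspect_container(
    host_id: int,
    container_id: str,
    current_user: User = Depends(get_current_user)
):
    """Full JSON details of a container"""
    
@router.get("/alerts")
async def list_alerts(
    host_id: Optional[int] = None,
    state: Optional[str] = "open",
    current_user: User = Depends(get_current_user)
):
    """List Docker alerts"""
    
@router.post("/alerts/{alert_id}/ack")
async def acknowledge_alert(
    alert_id: int,
    current_user: User = Depends(require_role("operator"))
):
    """Acknowledge an alert"""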

🔔 Alerting logic (detecting down containers)

Detection rules

# services/docker_alerts.py

async def check_container_alerts(session: AsyncSession):
    """
    Check every monitored container and open or close alerts
    """
    
    # Load containers, then keep those labelled homelab.monitor=true.
    # Filtering in Python avoids JSON containment (.contains({...})),
    # which is a PostgreSQL JSONB feature and is not available on SQLite.
    result = await session.execute(select(DockerContainer))
    critical_containers = [
        c for c in result.scalars()
        if (c.labels or {}).get("homelab.monitor") == "true"
    ]
    
    for container in critical_containers:
        expected_state = container.labels.get("homelab.desired", "running")
        
        # Case 1: container is stopped although it should be running
        if expected_state == "running" and container.state != "running":
            await open_alert(
                host_id=container.host_id,
                container_name=container.name,
                severity="error",
                message=f"Container {container.name} is {container.state}, expected running"
            )
        
        # Case 2: container is unhealthy
        if container.health == "unhealthy":
            await open_alert(
                host_id=container.host_id,
                container_name=container.name,
                severity="warning",
                message=f"Container {container.name} health check failing"
            )
        
        # Case 3: container is OK -> close the alert if one is open
        if container.state == "running" and container.health in ["healthy", "none"]:
            await close_alert(container.host_id, container.name)


async def open_alert(host_id: int, container_name: str, severity: str, message: str):
    """
    Open an alert and send an ntfy notification
    """
    # NOTE: session, ntfy_service and ws_manager are assumed to be
    # module-level dependencies wired at startup.
    # Check whether an alert is already open
    existing = await get_open_alert(host_id, container_name)
    if existing:
        # Update the timestamp
        existing.last_notified_at = datetime.utcnow()
        return
    
    # Create a new alert
    alert = DockerAlert(
        host_id=host_id,
        container_name=container_name,
        severity=severity,
        state="open",
        message=message,
        opened_at=datetime.utcnow()
    )
    session.add(alert)
    await session.commit()
    
    # ntfy notification
    host = await get_host(host_id)
    await ntfy_service.send_notification(
        topic="homelab-docker",
        title=f"🚨 Docker Alert - {host.name}",
        message=f"{container_name}: {message}",
        priority=4,
        tags=["warning", "docker"]
    )
    
    # Real-time WebSocket broadcast
    await ws_manager.broadcast({
        "type": "docker_alert_opened",
        "alert": alert.to_dict()
    })
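
The detection rules above are easier to unit-test when the decision is separated from the database and notification side effects. The sketch below restates the three cases as a pure function and adds a cooldown check for repeated notifications; `evaluate_container` and `should_notify` are hypothetical names, the function returns only the most severe finding (a simplification), and the 5-minute default mirrors the cooldown mentioned in the risk table.

```python
from datetime import datetime, timedelta
from typing import Optional


def evaluate_container(labels: dict, state: str,
                       health: Optional[str]) -> Optional[tuple[str, str]]:
    """Return (severity, reason) if an alert should be open, else None."""
    if labels.get("homelab.monitor") != "true":
        return None  # container is not monitored
    desired = labels.get("homelab.desired", "running")
    if desired == "running" and state != "running":
        return ("error", f"state is {state}, expected running")
    if health == "unhealthy":
        return ("warning", "health check failing")
    return None  # healthy: any open alert can be closed


def should_notify(last_notified_at: Optional[datetime], now: datetime,
                  cooldown: timedelta = timedelta(minutes=5)) -> bool:
    """Allow a notification only if none was sent within the cooldown."""
    return last_notified_at is None or now - last_notified_at >= cooldown


monitored = {"homelab.monitor": "true"}
now = datetime(2025, 1, 1, 12, 0, 0)

down = evaluate_container(monitored, "exited", None)
unhealthy = evaluate_container(monitored, "running", "unhealthy")
healthy = evaluate_container(monitored, "running", "healthy")
ignored = evaluate_container({}, "exited", None)
```

With this split, the scheduler job only has to map the returned value to `open_alert` or `close_alert`, and a flapping container is rate-limited by `should_notify`.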

🎨 UI/UX Frontend (integration into index.html + main.js)

Navigation (add to index.html)

<!-- Add to the existing navigation menu -->
<nav class="nav-tabs">
    <!-- Existing: Dashboard, Hosts, Tasks, Schedules, Logs -->
    
    <button class="nav-tab" data-section="docker">
        <i class="fab fa-docker"></i>
        Docker Hosts
        <span class="badge" id="docker-alerts-badge">0</span>
    </button>
</nav>

Docker section (new HTML section)

<section id="docker-section" class="hidden">
    <div class="section-header">
        <h2><i class="fab fa-docker"></i> Docker Hosts</h2>
        <div class="actions">
            <button id="collect-all-docker" class="btn btn-primary">
                <i class="fas fa-sync"></i> Collect All
            </button>
            <input type="text" id="docker-search" placeholder="Search hosts...">
        </div>
    </div>
    
    <!-- Docker hosts list -->
    <div id="docker-hosts-grid" class="hosts-grid">
        <!-- Generated dynamically by JS -->
    </div>
    
    <!-- Docker host details modal -->
    <div id="docker-detail-modal" class="modal hidden">
        <div class="modal-content large">
            <div class="modal-header">
                <h3 id="docker-host-name"></h3>
                <button class="close-modal">&times;</button>
            </div>
            
            <!-- Tabs: Containers / Images / Volumes / Alerts -->
            <div class="tabs">
                <button class="tab active" data-tab="containers">Containers</button>
                <button class="tab" data-tab="images">Images</button>
                <button class="tab" data-tab="volumes">Volumes</button>
                <button class="tab" data-tab="alerts">Alerts</button>
            </div>
            
            <!-- Tab contents -->
            <div id="containers-tab" class="tab-content">
                <table id="containers-table">
                    <thead>
                        <tr>
                            <th>Name</th>
                            <th>Image</th>
                            <th>State</th>
                            <th>Health</th>
                            <th>Ports</th>
                            <th>Age</th>
                            <th>Actions</th>
                        </tr>
                    </thead>
                    <tbody></tbody>
                </table>
            </div>
        </div>
    </div>
</section>

JavaScript logic (main.js)

// Docker section management
const dockerSection = {
    async init() {
        await this.loadDockerHosts();
        this.setupWebSocket();
        this.setupEventListeners();
    },
    
    async loadDockerHosts() {
        const response = await fetchAPI('/api/docker/hosts');
        this.renderHostsGrid(response.hosts);
    },
    
    renderHostsGrid(hosts) {
        const grid = document.getElementById('docker-hosts-grid');
        grid.innerHTML = hosts.map(host => `
            <div class="docker-host-card" data-host-id="${host.id}">
                <div class="card-header">
                    <h3>${host.name}</h3>
                    <span class="badge ${host.docker_status}">${host.docker_status}</span>
                </div>
                <div class="card-body">
                    <div class="metric">
                        <i class="fas fa-box"></i>
                        ${host.containers_running}/${host.containers_total} containers
                    </div>
                    <div class="metric">
                        <i class="fas fa-exclamation-triangle"></i>
                        ${host.open_alerts} alerts
                    </div>
                    <div class="metric">
                        <i class="fas fa-clock"></i>
                        Last: ${formatRelativeTime(host.docker_last_collect_at)}
                    </div>
                </div>
                <div class="card-actions">
                    <button class="btn btn-sm" onclick="dockerSection.viewDetails(${host.id})">
                        <i class="fas fa-eye"></i> Details
                    </button>
                    <button class="btn btn-sm" onclick="dockerSection.collectNow(${host.id})">
                        <i class="fas fa-sync"></i> Collect
                    </button>
                </div>
            </div>
        `).join('');
    },
    
    async viewDetails(hostId) {
        const [containers, images, volumes, alerts] = await Promise.all([
            fetchAPI(`/api/docker/hosts/${hostId}/containers`),
            fetchAPI(`/api/docker/hosts/${hostId}/images`),
            fetchAPI(`/api/docker/hosts/${hostId}/volumes`),
            fetchAPI(`/api/docker/alerts?host_id=${hostId}`)
        ]);
        
        this.renderContainersTab(containers);
        showModal('docker-detail-modal');
    },
    
    renderContainersTab(containers) {
        const tbody = document.querySelector('#containers-table tbody');
        tbody.innerHTML = containers.map(c => `
            <tr class="container-row" data-state="${c.state}">
                <td>
                    <i class="fab fa-docker"></i> ${c.name}
                    ${c.compose_project ? `<span class="badge">${c.compose_project}</span>` : ''}
                </td>
                <td>${c.image}</td>
                <td><span class="badge state-${c.state}">${c.state}</span></td>
                <td><span class="badge health-${c.health}">${c.health || 'none'}</span></td>
                <td>${this.formatPorts(c.ports)}</td>
                <td>${formatRelativeTime(c.created_at)}</td>
                <td>
                    <div class="action-buttons">
                        ${c.state !== 'running' ? 
                            `<button onclick="dockerSection.startContainer(${c.host_id}, '${c.container_id}')">
                                <i class="fas fa-play"></i>
                            </button>` : ''}
                        ${c.state === 'running' ?
                            `<button onclick="dockerSection.stopContainer(${c.host_id}, '${c.container_id}')">
                                <i class="fas fa-stop"></i>
                            </button>` : ''}
                        <button onclick="dockerSection.restartContainer(${c.host_id}, '${c.container_id}')">
                            <i class="fas fa-redo"></i>
                        </button>
                        <button onclick="dockerSection.showLogs(${c.host_id}, '${c.container_id}')">
                            <i class="fas fa-file-alt"></i>
                        </button>
                        <button class="danger" onclick="dockerSection.confirmRemove(${c.host_id}, '${c.container_id}')">
                            <i class="fas fa-trash"></i>
                        </button>
                    </div>
                </td>
            </tr>
        `).join('');
    },
    
    async startContainer(hostId, containerId) {
        await fetchAPI(`/api/docker/containers/${hostId}/${containerId}/start`, {
            method: 'POST'
        });
        showToast('Container started successfully', 'success');
        await this.viewDetails(hostId); // Refresh
    },
    
    setupWebSocket() {
        ws.addEventListener('message', (event) => {
            const data = JSON.parse(event.data);
            
            if (data.type === 'docker_host_updated') {
                this.updateHostCard(data.host);
            }
            
            if (data.type === 'docker_alert_opened') {
                this.showAlertNotification(data.alert);
                this.updateAlertsBadge();
            }
        });
    }
};

🧪 Mandatory tests (minimum 8 tests)

Backend tests (pytest + pytest-asyncio)

# tests/test_docker_service.py

@pytest.mark.asyncio
async def test_collect_docker_host(mock_ssh):
    """Test a successful Docker collection"""
    mock_ssh.exec_command.return_value = '{"Version": "24.0.7"}'
    result = await docker_service.collect_docker_host(host_id=1)
    assert result.docker_version == "24.0.7"

@pytest.mark.asyncio
async def test_detect_container_down():
    """Test detection of a stopped container"""
    container = create_test_container(
        state="exited",
        labels={"homelab.monitor": "true", "homelab.desired": "running"}
    )
    alerts = await docker_alerts.check_container_alerts([container])
    assert len(alerts) == 1
    assert alerts[0].severity == "error"

@pytest.mark.asyncio
async def test_start_container(mock_ssh):
    """Test starting a container"""
    mock_ssh.exec_command.return_value = "container_id"
    result = await docker_actions.start_container(host_id=1, container_id="abc123")
    assert result.success is True
    mock_ssh.exec_command.assert_called_with("docker start abc123")

@pytest.mark.asyncio
async def test_alert_notification_sent(mock_ntfy):
    """Test that an ntfy notification is sent when an alert opens"""
    await docker_alerts.open_alert(
        host_id=1,
        container_name="nginx",
        severity="error",
        message="Container down"
    )
    assert mock_ntfy.send_notification.called
    # call_args must be read from the mocked method, not from the parent mock
    assert "nginx" in mock_ntfy.send_notification.call_args.kwargs['message']

Frontend tests (Jest or a vanilla equivalent)

// tests/docker_section.test.js

test('renderHostsGrid displays correct number of cards', () => {
    const hosts = [
        {id: 1, name: 'host1', docker_status: 'online'},
        {id: 2, name: 'host2', docker_status: 'offline'}
    ];
    dockerSection.renderHostsGrid(hosts);
    const cards = document.querySelectorAll('.docker-host-card');
    expect(cards.length).toBe(2);
});

test('container action buttons reflect state', () => {
    const tbody = document.querySelector('#containers-table tbody');
    
    // renderContainersTab is the rendering method defined on dockerSection
    dockerSection.renderContainersTab([{state: 'running'}]);
    expect(tbody.innerHTML).toContain('fa-stop');
    expect(tbody.innerHTML).not.toContain('fa-play');
    
    dockerSection.renderContainersTab([{state: 'exited'}]);
    expect(tbody.innerHTML).toContain('fa-play');
    expect(tbody.innerHTML).not.toContain('fa-stop');
});

test('WebSocket updates host card in realtime', async () => {
    const ws = new MockWebSocket();
    dockerSection.setupWebSocket();
    
    ws.emit({
        type: 'docker_host_updated',
        host: {id: 1, containers_running: 5}
    });
    
    await nextTick();
    const card = document.querySelector('[data-host-id="1"]');
    expect(card.textContent).toContain('5/');
});

📋 "Definition of Done" checklist

Backend

  • Alembic migration created and tested (docker_containers, docker_images, etc.)
  • docker_service.py service with SSH collection + JSON parsing
  • docker_actions.py service with start/stop/restart/remove/redeploy
  • docker_alerts.py service with detection logic + ntfy notifications
  • Complete /api/docker/* routes with JWT auth
  • APScheduler jobs added (collect + alerts)
  • WebSocket events emitted (docker_host_updated, docker_alert_opened)
  • Robust error handling (SSH timeout, docker unreachable, parsing errors)
  • 6+ passing backend tests

Frontend

  • "Docker Hosts" section added to the navigation menu
  • Docker hosts list view (cards with metrics)
  • Host details modal with tabs (Containers / Images / Volumes / Alerts)
  • Working container actions (start/stop/restart/logs/inspect/remove)
  • Confirmation modals on destructive actions (remove, redeploy)
  • Container logs (drawer with tail + auto-refresh)
  • Container inspect (modal JSON viewer)
  • WebSocket live updates (hosts + alerts)
  • Animations consistent with the rest of the dashboard
  • 4+ passing frontend tests

Security

  • All Docker actions require JWT auth (operator role at minimum)
  • Destructive actions (remove) require the admin role
  • Strict SSH timeouts (5s connect, 15s exec)
  • Pydantic validation on all inputs
  • No execution of arbitrary commands
  • Structured server logs (no secrets logged)
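
Since container actions end up as shell commands run over SSH, the "no arbitrary commands" rule is worth enforcing at the point where the command string is built. The sketch below shows one possible approach, combining an action allow-list, a strict pattern for the container reference, and shell quoting; `build_docker_command`, the allow-list contents and the 5000-line cap on `tail` are assumptions for illustration.

```python
import re
import shlex

# Only these sub-commands may ever reach the remote shell.
ALLOWED_ACTIONS = {"start", "stop", "restart", "logs", "inspect"}

# Container IDs are hexadecimal; names are limited to [a-zA-Z0-9_.-]
# after a leading alphanumeric character.
CONTAINER_REF = re.compile(r"[a-zA-Z0-9][a-zA-Z0-9_.-]{0,127}")


def build_docker_command(action: str, container_ref: str, tail: int = 200) -> str:
    """Build a docker CLI command from validated parts only."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"action not allowed: {action!r}")
    if not CONTAINER_REF.fullmatch(container_ref):
        raise ValueError(f"invalid container reference: {container_ref!r}")
    ref = shlex.quote(container_ref)  # defence in depth on top of the regex
    if action == "logs":
        tail = max(1, min(int(tail), 5000))  # clamp to a sane range
        return f"docker logs --tail {tail} {ref}"
    return f"docker {action} {ref}"


start_cmd = build_docker_command("start", "abc123")
logs_cmd = build_docker_command("logs", "nginx-proxy", tail=50)
```

A reference such as `abc; rm -rf /` or an action such as `exec` raises `ValueError` before anything is sent to the host.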

Documentation

  • README.md updated (Docker section)
  • curl examples for the Docker endpoints
  • Configuration guide for the homelab.monitor and homelab.desired labels
  • Alembic migration instructions

🚀 Execution instructions

1. Apply the DB migration

cd homelab-automation-api-v2
alembic revision --autogenerate -m "Add Docker management tables"
alembic upgrade head

2. Enable Docker on a host (via UI or API)

# Via API
curl -X POST -H "Authorization: Bearer $TOKEN" \
  http://localhost:8000/api/docker/hosts/1/enable

# Via UI: "Hosts" section > click a host > "Enable Docker" button

3. Label critical containers

# docker-compose.yml
services:
  nginx:
    image: nginx:latest
    labels:
      homelab.monitor: "true"
      homelab.desired: "running"
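
On the collection side, these labels have to be turned back into a dictionary before the alerting rules can read them. The sketch below assumes the `Labels` field of `docker ps --format '{{json .}}'` arrives as a single comma-separated `key=value` string; that format is an assumption to verify against the Docker version in use, and label values containing commas would be split incorrectly (reading labels from `docker inspect` avoids the issue). `parse_labels` and `is_monitored` are hypothetical helpers.

```python
def parse_labels(raw: str) -> dict[str, str]:
    """Turn 'k1=v1,k2=v2' into {'k1': 'v1', 'k2': 'v2'}."""
    labels: dict[str, str] = {}
    for pair in raw.split(","):
        if not pair:
            continue  # empty input or trailing comma
        key, _, value = pair.partition("=")  # split on the first '=' only
        labels[key.strip()] = value
    return labels


def is_monitored(labels: dict[str, str]) -> bool:
    """A container is monitored when homelab.monitor is exactly 'true'."""
    return labels.get("homelab.monitor") == "true"


raw = "com.docker.compose.project=web,homelab.desired=running,homelab.monitor=true"
labels = parse_labels(raw)
```

The `com.docker.compose.project` entry parsed here is also what would populate the `compose_project` column.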

4. Test a manual collection

curl -X POST -H "Authorization: Bearer $TOKEN" \
  http://localhost:8000/api/docker/hosts/1/collect

5. Check alerts

curl -H "Authorization: Bearer $TOKEN" \
  http://localhost:8000/api/docker/alerts

🎁 Bonus features (if time allows)

Priority 1 (high value, low cost)

  • Compose awareness: group containers by com.docker.compose.project
  • Resource stats: docker stats --no-stream (CPU/memory snapshot)
  • Bulk actions: restart all containers of a compose project
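
Compose awareness and bulk actions both rest on the same grouping step. As a rough sketch, assuming each container record exposes `name` and `compose_project` keys as in the `docker_containers` table (the `(standalone)` bucket name is an arbitrary choice):

```python
from collections import defaultdict


def group_by_compose_project(containers: list[dict]) -> dict[str, list[str]]:
    """Group container names by compose project; containers without a
    project land in a '(standalone)' bucket."""
    groups: dict[str, list[str]] = defaultdict(list)
    for c in containers:
        project = c.get("compose_project") or "(standalone)"
        groups[project].append(c["name"])
    return dict(groups)


containers = [
    {"name": "plex", "compose_project": "media"},
    {"name": "nginx", "compose_project": None},
    {"name": "sonarr", "compose_project": "media"},
]
groups = group_by_compose_project(containers)
```

A bulk restart of the `media` project would then iterate over `groups["media"]` and reuse the single-container restart action.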

Priority 2 (good value, medium cost)

  • Event timeline: log of Docker actions in the host view
  • Auto-remediation: homelab.auto_restart=true flag → automatic restart if down
  • Networks tab: list of Docker networks + attached containers

Priority 3 (nice-to-have, high cost)

  • Prune management: image/volume cleanup (danger zone + admin only)
  • Image scanning: vulnerabilities via Trivy (if installed on the hosts)
  • Logs streaming: real-time logs over WebSocket (instead of a static tail)

⚠️ Risks and mitigation

| Risk | Impact | Mitigation |
|---|---|---|
| SSH timeout during collection | Hosts marked offline | Retry logic + adaptive timeout (5s → 10s → 30s) |
| Docker JSON parsing fails | Partial collection | Try/except per entity (containers/images/volumes) |
| WebSocket spam with many hosts | UI lag | Throttle broadcasts (max 1/sec per type) |
| Simultaneous Docker actions | Race conditions | Lock per container_id (asyncio.Lock) |
| Alert spam when a container is flapping | Notification fatigue | 5-minute cooldown between notifications for the same alert |
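
Two of the mitigations above, the adaptive timeout and the per-container lock, can be sketched with the standard `asyncio` module alone. The helper names `run_with_adaptive_timeout` and `run_exclusive` are hypothetical, and the lock registry below is keyed by `(host_id, container_id)` rather than `container_id` alone, which is an assumption made to keep hosts independent.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

TIMEOUT_STEPS = (5.0, 10.0, 30.0)  # seconds, as in the risk table


async def run_with_adaptive_timeout(
    make_call: Callable[[], Awaitable[T]],
    steps: tuple[float, ...] = TIMEOUT_STEPS,
) -> T:
    """Retry the call with a longer timeout on each attempt."""
    last_error = None
    for timeout_s in steps:
        try:
            return await asyncio.wait_for(make_call(), timeout_s)
        except asyncio.TimeoutError as exc:
            last_error = exc  # try again with the next, longer timeout
    raise TimeoutError(f"no response after {len(steps)} attempts") from last_error


# One lock per (host_id, container_id), created on first use.
_container_locks: defaultdict[tuple[int, str], asyncio.Lock] = defaultdict(asyncio.Lock)


async def run_exclusive(host_id: int, container_id: str,
                        make_call: Callable[[], Awaitable[T]]) -> T:
    """Serialize actions that target the same container."""
    async with _container_locks[(host_id, container_id)]:
        return await make_call()


attempts: list[int] = []


async def flaky() -> str:
    """Stand-in for an SSH call that hangs on its first attempt."""
    attempts.append(1)
    if len(attempts) == 1:
        await asyncio.sleep(0.5)
    return "ok"


async def demo() -> str:
    return await run_exclusive(
        1, "abc123",
        lambda: run_with_adaptive_timeout(flaky, steps=(0.05, 0.2)),
    )


result = asyncio.run(demo())
```

In the demo the first attempt times out and the second one succeeds, all while holding the lock for that container.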

📊 Success metrics

  • Docker collection succeeds on 3+ hosts simultaneously without timeout
  • Down container detected < 60s after the actual stop
  • ntfy notification received within 5s of the alert opening
  • Container actions (start/stop) complete in < 3s (excluding Docker's own delay)