Bruno Charest 27eed55c9b
2025-12-15 08:31:12 -05:00


# [TASK] New "Docker Hosts" section + monitoring + actions + notifications
## 🎯 Main objective
Add complete Docker management to the existing Homelab Dashboard, providing:
- Real-time monitoring of multiple Docker hosts
- Container actions (start/stop/restart/redeploy/logs)
- Proactive detection of, and alerting on, containers that are down
- Seamless integration with the existing architecture
---
## 📐 MANDATORY architecture constraints
### Existing technical stack (to be followed strictly)
```yaml
Backend:
- FastAPI (routes/ + services/ + models/ + schemas/)
- SQLAlchemy 2.x async (data/homelab.db)
- Alembic for migrations
- APScheduler (periodic jobs already configured)
- Real-time WebSocket (websocket_manager.py)
- JWT auth (app.auth_utils + OAuth2PasswordBearer)
- ntfy notifications (services/notifications.py)
Frontend:
- index.html + main.js (vanilla JS)
- Tailwind CSS
- Anime.js for animations
- Section-based navigation pattern (dashboard, hosts, tasks, schedules, etc.)
Infrastructure:
- Ansible for automation (existing hosts.yml inventory)
- SSH already configured (automation user + keys)
- Existing SSH bootstrap (services/bootstrap.py)
```
### Existing DB models to extend (do NOT recreate)
```python
# models/host.py - EXISTING TABLE
class Host(Base):
    __tablename__ = "hosts"
    id: int
    name: str
    host: str  # IP/hostname
    os_type: str
    status: str  # online/offline
    bootstrap_status: dict
    last_seen_at: datetime
    # TO EXTEND with: docker_enabled, docker_version, docker_status,
    # docker_last_collect_at

# models/task.py - EXISTING TABLE
class Task(Base):
    __tablename__ = "tasks"
    id: int
    action: str
    status: str  # pending/running/success/failed
    # Reuse for Docker actions

# TO CREATE (new tables only)
# - docker_containers
# - docker_images
# - docker_volumes
# - docker_alerts
```
---
## 🔧 MANDATED technical decisions (not open to choice)
### 1. Docker collection: **SSH + docker CLI** (reuse the Ansible pattern)
**Rationale**:
- ✅ SSH is already configured for all hosts (automation user + keys)
- ✅ Docker collection plugs into the metrics collection process already in place
- ✅ No additional configuration on the hosts (no TLS Docker API)
- ✅ Same pattern as Ansible (consistency)
- ✅ JSON parsing: `docker ps --format json`, `docker inspect`, etc.
**Implementation**:
```python
# services/docker_service.py
async def collect_docker_host(host_id: int):
    host = await get_host(host_id)
    ssh = await ssh_connect(host.host, user="automation")
    # Docker version
    version = await ssh_exec(ssh, "docker version --format '{{json .}}'")
    # Containers (running and stopped)
    containers = await ssh_exec(
        ssh, "docker ps -a --format '{{json .}}' --no-trunc")
    # Images
    images = await ssh_exec(
        ssh, "docker images --format '{{json .}}'")
    # Volumes
    volumes = await ssh_exec(
        ssh, "docker volume ls --format '{{json .}}'")
    # Disk usage (system df)
    df = await ssh_exec(ssh, "docker system df -v --format '{{json .}}'")
```
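With `--format '{{json .}}'`, the Docker CLI prints one JSON object per line rather than a single array, and `docker ps` reports labels as one comma-separated `key=value` string. A minimal parsing sketch under those assumptions follows; the helper names `parse_json_lines` and `parse_labels` are illustrative and not part of the existing codebase.

```python
import json


def parse_json_lines(output: str) -> list[dict]:
    """Parse CLI output that contains one JSON object per line."""
    records = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        records.append(json.loads(line))
    return records


def parse_labels(raw: str) -> dict[str, str]:
    """Turn 'k1=v1,k2=v2' (as printed by `docker ps`) into a dict."""
    labels = {}
    for pair in raw.split(","):
        if "=" not in pair:
            continue  # ignore empty or malformed fragments
        key, value = pair.split("=", 1)
        labels[key.strip()] = value
    return labels


# Hypothetical sample of what `docker ps -a --format '{{json .}}'` may return.
sample = (
    '{"ID":"abc123","Names":"nginx","State":"running",'
    '"Labels":"homelab.monitor=true,homelab.desired=running"}\n'
    '{"ID":"def456","Names":"redis","State":"exited","Labels":""}\n'
)
containers = parse_json_lines(sample)
labels = parse_labels(containers[0]["Labels"])
```

Label values that themselves contain commas cannot be split reliably this way; `docker inspect` returns labels as a proper JSON object and may be the safer source for them.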
### 2. Storage: **Extend existing tables + create Docker tables**
```sql
-- Alembic migration to create
ALTER TABLE hosts ADD COLUMN docker_enabled BOOLEAN DEFAULT FALSE;
ALTER TABLE hosts ADD COLUMN docker_version TEXT;
ALTER TABLE hosts ADD COLUMN docker_status TEXT;
ALTER TABLE hosts ADD COLUMN docker_last_collect_at TIMESTAMP;
CREATE TABLE docker_containers (
id INTEGER PRIMARY KEY,
host_id INTEGER REFERENCES hosts(id),
container_id TEXT NOT NULL,
name TEXT NOT NULL,
image TEXT,
state TEXT, -- running/exited/paused
status TEXT, -- Up 2 hours, Exited (0) 5 minutes ago
health TEXT, -- healthy/unhealthy/starting/none
created_at TIMESTAMP,
ports JSON,
labels JSON,
compose_project TEXT, -- com.docker.compose.project
last_update_at TIMESTAMP,
UNIQUE(host_id, container_id)
);
CREATE TABLE docker_images (
id INTEGER PRIMARY KEY,
host_id INTEGER REFERENCES hosts(id),
image_id TEXT NOT NULL,
repo_tags JSON, -- ["nginx:latest", "nginx:1.25"]
size BIGINT,
created TIMESTAMP,
last_update_at TIMESTAMP,
UNIQUE(host_id, image_id)
);
CREATE TABLE docker_volumes (
id INTEGER PRIMARY KEY,
host_id INTEGER REFERENCES hosts(id),
name TEXT NOT NULL,
driver TEXT,
mountpoint TEXT,
scope TEXT,
last_update_at TIMESTAMP,
UNIQUE(host_id, name)
);
CREATE TABLE docker_alerts (
id INTEGER PRIMARY KEY,
host_id INTEGER REFERENCES hosts(id),
container_name TEXT NOT NULL,
severity TEXT, -- warning/error/critical
state TEXT, -- open/closed
message TEXT,
opened_at TIMESTAMP NOT NULL,
closed_at TIMESTAMP,
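last_notified_at TIMESTAMP
);
-- SQLite does not accept an inline INDEX clause; create the index separately
CREATE INDEX idx_alerts_open ON docker_alerts (state, host_id);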
```
### 3. Scheduler: **Extend the existing APScheduler**
```python
# app_optimized.py - ADD at startup
from services.docker_collector import DockerCollector

@app.on_event("startup")
async def start_docker_collector():
    collector = DockerCollector(db_session, ws_manager, ntfy_service)
    # Periodic job: collect from all Docker-enabled hosts
    scheduler.add_job(
        collector.collect_all_hosts,
        trigger="interval",
        seconds=60,  # every minute
        id="docker_collect",
        name="Docker Metrics Collection"
    )
    # Periodic job: check for container-down alerts
    scheduler.add_job(
        collector.check_alerts,
        trigger="interval",
        seconds=30,
        id="docker_alerts",
        name="Docker Alerts Check"
    )
```
---
## 📊 API routes to create (prefix /api/docker)
```python
# routes/docker.py
from typing import Optional

from fastapi import APIRouter, Depends

# User, get_current_user and require_role are assumed to come from the
# existing auth layer.
router = APIRouter(prefix="/api/docker", tags=["docker"])

@router.get("/hosts")
async def list_docker_hosts(
    current_user: User = Depends(get_current_user)
):
    """List all hosts with Docker enabled"""

@router.post("/hosts/{host_id}/enable")
async def enable_docker_monitoring(
    host_id: int,
    current_user: User = Depends(require_role("admin"))
):
    """Enable Docker monitoring on a host"""

@router.post("/hosts/{host_id}/collect")
async def collect_docker_now(
    host_id: int,
    current_user: User = Depends(require_role("operator"))
):
    """Force an immediate collection"""

@router.get("/hosts/{host_id}/containers")
async def get_containers(host_id: int):
    """List a host's containers"""

# The frontend detail view also calls /hosts/{host_id}/images and
# /hosts/{host_id}/volumes, which follow the same pattern as get_containers.

@router.post("/containers/{host_id}/{container_id}/start")
async def start_container(
    host_id: int,
    container_id: str,
    current_user: User = Depends(require_role("operator"))
):
    """Start a container"""

# stop, restart and redeploy take the same parameters as start_container;
# remove is destructive and requires the admin role (see the Security checklist):
#   POST /containers/{host_id}/{container_id}/stop
#   POST /containers/{host_id}/{container_id}/restart
#   POST /containers/{host_id}/{container_id}/remove
#   POST /containers/{host_id}/{container_id}/redeploy

@router.get("/containers/{host_id}/{container_id}/logs")
async def get_container_logs(
    host_id: int,
    container_id: str,
    tail: int = 200
):
    """Fetch a container's logs"""

@router.get("/containers/{host_id}/{container_id}/inspect")
async def inspect_container(host_id: int, container_id: str):
    """Full JSON details of a container"""

@router.get("/alerts")
async def list_alerts(
    host_id: Optional[int] = None,
    state: Optional[str] = "open"
):
    """List Docker alerts"""

@router.post("/alerts/{alert_id}/ack")
async def acknowledge_alert(
    alert_id: int,
    current_user: User = Depends(require_role("operator"))
):
    """Acknowledge an alert"""
```
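Several operators may trigger actions on the same container at the same moment, and the risk table proposes a lock per `container_id` to avoid races. A minimal sketch of that idea, assuming one `asyncio.Lock` per `(host_id, container_id)` pair; `run_exclusive` and the lock registry are hypothetical names, and the fake action stands in for the real SSH call.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

# One lock per (host_id, container_id), created lazily on first use.
_container_locks: defaultdict[tuple[int, str], asyncio.Lock] = defaultdict(asyncio.Lock)


async def run_exclusive(host_id: int, container_id: str,
                        action: Callable[[], Awaitable[T]]) -> T:
    """Run `action` while holding the lock for this container."""
    async with _container_locks[(host_id, container_id)]:
        return await action()


async def _demo() -> list[str]:
    events: list[str] = []

    async def fake_action(name: str) -> None:
        events.append(f"{name}:start")
        await asyncio.sleep(0.01)  # stand-in for the SSH round trip
        events.append(f"{name}:end")

    # Two concurrent requests for the same container are serialized.
    await asyncio.gather(
        run_exclusive(1, "abc123", lambda: fake_action("stop")),
        run_exclusive(1, "abc123", lambda: fake_action("start")),
    )
    return events


events = asyncio.run(_demo())
```

The registry grows by one entry per container ever touched; a real service would likely prune entries for containers that no longer exist.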
---
## 🔔 Alerting logic (detecting containers that are down)
### Detection rules
```python
# services/docker_alerts.py
async def check_container_alerts(session: AsyncSession):
    """
    Check all critical containers and generate alerts
    """
    # Load containers, then keep those labelled homelab.monitor=true.
    # The label filter runs in Python because SQLite has no JSON
    # containment operator.
    result = await session.execute(select(DockerContainer))
    critical_containers = [
        c for c in result.scalars()
        if (c.labels or {}).get("homelab.monitor") == "true"
    ]
    for container in critical_containers:
        expected_state = container.labels.get("homelab.desired", "running")
        # Case 1: container stopped although it should be running
        if expected_state == "running" and container.state != "running":
            await open_alert(
                session,
                host_id=container.host_id,
                container_name=container.name,
                severity="error",
                message=f"Container {container.name} is {container.state}, expected running"
            )
        # Case 2: container unhealthy
        if container.health == "unhealthy":
            await open_alert(
                session,
                host_id=container.host_id,
                container_name=container.name,
                severity="warning",
                message=f"Container {container.name} health check failing"
            )
        # Case 3: container OK -> close the alert if one is open
        if container.state == "running" and (container.health or "none") in ["healthy", "none"]:
            await close_alert(container.host_id, container.name)

async def open_alert(session: AsyncSession, host_id: int, container_name: str,
                     severity: str, message: str):
    """
    Open an alert and send an ntfy notification
    """
    # Check whether an alert is already open
    existing = await get_open_alert(host_id, container_name)
    if existing:
        # Update the timestamp and persist it
        existing.last_notified_at = datetime.utcnow()
        await session.commit()
        return
    # Create a new alert
    alert = DockerAlert(
        host_id=host_id,
        container_name=container_name,
        severity=severity,
        state="open",
        message=message,
        opened_at=datetime.utcnow()
    )
    session.add(alert)
    await session.commit()
    # ntfy notification
    host = await get_host(host_id)
    await ntfy_service.send_notification(
        topic="homelab-docker",
        title=f"🚨 Docker Alert - {host.name}",
        message=f"{container_name}: {message}",
        priority=4,
        tags=["warning", "docker"]
    )
    # Real-time WebSocket broadcast
    await ws_manager.broadcast({
        "type": "docker_alert_opened",
        "alert": alert.to_dict()
    })
```
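The risk table calls for a 5-minute cooldown between notifications for the same alert, which the `last_notified_at` column can support. A minimal sketch of that check as a pure function; `should_notify` and `NOTIFY_COOLDOWN` are hypothetical names, and the timestamps are assumed to be naive UTC values.

```python
from datetime import datetime, timedelta
from typing import Optional

# Cooldown between two notifications for the same alert.
NOTIFY_COOLDOWN = timedelta(minutes=5)


def should_notify(last_notified_at: Optional[datetime], now: datetime,
                  cooldown: timedelta = NOTIFY_COOLDOWN) -> bool:
    """Return True when no notification was sent within the cooldown window."""
    if last_notified_at is None:
        return True  # never notified for this alert
    return now - last_notified_at >= cooldown


now = datetime(2025, 1, 1, 12, 0, 0)
recent = should_notify(now - timedelta(minutes=2), now)  # inside the window
stale = should_notify(now - timedelta(minutes=6), now)   # window elapsed
```

Keeping the decision free of I/O makes it easy to unit-test without a database or an ntfy mock.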
---
## 🎨 Frontend UI/UX (integration into index.html + main.js)
### Navigation (add to index.html)
```html
<!-- Add to the existing navigation menu -->
<nav class="nav-tabs">
<!-- Existing: Dashboard, Hosts, Tasks, Schedules, Logs -->
<button class="nav-tab" data-section="docker">
<i class="fab fa-docker"></i>
Docker Hosts
<span class="badge" id="docker-alerts-badge">0</span>
</button>
</nav>
```
### Docker section (new HTML section)
```html
<section id="docker-section" class="hidden">
<div class="section-header">
<h2><i class="fab fa-docker"></i> Docker Hosts</h2>
<div class="actions">
<button id="collect-all-docker" class="btn btn-primary">
<i class="fas fa-sync"></i> Collect All
</button>
<input type="text" id="docker-search" placeholder="Search hosts...">
</div>
</div>
<!-- Docker hosts list -->
<div id="docker-hosts-grid" class="hosts-grid">
<!-- Generated dynamically by JS -->
</div>
<!-- Docker host details modal -->
<div id="docker-detail-modal" class="modal hidden">
<div class="modal-content large">
<div class="modal-header">
<h3 id="docker-host-name"></h3>
<button class="close-modal">&times;</button>
</div>
<!-- Tabs: Containers / Images / Volumes / Alerts -->
<div class="tabs">
<button class="tab active" data-tab="containers">Containers</button>
<button class="tab" data-tab="images">Images</button>
<button class="tab" data-tab="volumes">Volumes</button>
<button class="tab" data-tab="alerts">Alerts</button>
</div>
<!-- Tab contents -->
<div id="containers-tab" class="tab-content">
<table id="containers-table">
<thead>
<tr>
<th>Name</th>
<th>Image</th>
<th>State</th>
<th>Health</th>
<th>Ports</th>
<th>Age</th>
<th>Actions</th>
</tr>
</thead>
<tbody></tbody>
</table>
</div>
</div>
</div>
</section>
```
### JavaScript logic (main.js)
```javascript
// Docker section management
const dockerSection = {
async init() {
await this.loadDockerHosts();
this.setupWebSocket();
this.setupEventListeners();
},
async loadDockerHosts() {
const response = await fetchAPI('/api/docker/hosts');
this.renderHostsGrid(response.hosts);
},
renderHostsGrid(hosts) {
const grid = document.getElementById('docker-hosts-grid');
grid.innerHTML = hosts.map(host => `
<div class="docker-host-card" data-host-id="${host.id}">
<div class="card-header">
<h3>${host.name}</h3>
<span class="badge ${host.docker_status}">${host.docker_status}</span>
</div>
<div class="card-body">
<div class="metric">
<i class="fas fa-box"></i>
${host.containers_running}/${host.containers_total} containers
</div>
<div class="metric">
<i class="fas fa-exclamation-triangle"></i>
${host.open_alerts} alerts
</div>
<div class="metric">
<i class="fas fa-clock"></i>
Last: ${formatRelativeTime(host.docker_last_collect_at)}
</div>
</div>
<div class="card-actions">
<button class="btn btn-sm" onclick="dockerSection.viewDetails(${host.id})">
<i class="fas fa-eye"></i> Details
</button>
<button class="btn btn-sm" onclick="dockerSection.collectNow(${host.id})">
<i class="fas fa-sync"></i> Collect
</button>
</div>
</div>
`).join('');
},
async viewDetails(hostId) {
const [containers, images, volumes, alerts] = await Promise.all([
fetchAPI(`/api/docker/hosts/${hostId}/containers`),
fetchAPI(`/api/docker/hosts/${hostId}/images`),
fetchAPI(`/api/docker/hosts/${hostId}/volumes`),
fetchAPI(`/api/docker/alerts?host_id=${hostId}`)
]);
this.renderContainersTab(containers);
showModal('docker-detail-modal');
},
renderContainersTab(containers) {
const tbody = document.querySelector('#containers-table tbody');
tbody.innerHTML = containers.map(c => `
<tr class="container-row" data-state="${c.state}">
<td>
<i class="fab fa-docker"></i> ${c.name}
${c.compose_project ? `<span class="badge">${c.compose_project}</span>` : ''}
</td>
<td>${c.image}</td>
<td><span class="badge state-${c.state}">${c.state}</span></td>
<td><span class="badge health-${c.health}">${c.health || 'none'}</span></td>
<td>${this.formatPorts(c.ports)}</td>
<td>${formatRelativeTime(c.created_at)}</td>
<td>
<div class="action-buttons">
${c.state !== 'running' ?
`<button onclick="dockerSection.startContainer(${c.host_id}, '${c.container_id}')">
<i class="fas fa-play"></i>
</button>` : ''}
${c.state === 'running' ?
`<button onclick="dockerSection.stopContainer(${c.host_id}, '${c.container_id}')">
<i class="fas fa-stop"></i>
</button>` : ''}
<button onclick="dockerSection.restartContainer(${c.host_id}, '${c.container_id}')">
<i class="fas fa-redo"></i>
</button>
<button onclick="dockerSection.showLogs(${c.host_id}, '${c.container_id}')">
<i class="fas fa-file-alt"></i>
</button>
<button class="danger" onclick="dockerSection.confirmRemove(${c.host_id}, '${c.container_id}')">
<i class="fas fa-trash"></i>
</button>
</div>
</td>
</tr>
`).join('');
},
async startContainer(hostId, containerId) {
await fetchAPI(`/api/docker/containers/${hostId}/${containerId}/start`, {
method: 'POST'
});
showToast('Container started successfully', 'success');
await this.viewDetails(hostId); // Refresh
},
setupWebSocket() {
ws.addEventListener('message', (event) => {
const data = JSON.parse(event.data);
if (data.type === 'docker_host_updated') {
this.updateHostCard(data.host);
}
if (data.type === 'docker_alert_opened') {
this.showAlertNotification(data.alert);
this.updateAlertsBadge();
}
});
}
};
```
---
## 🧪 Mandatory tests (minimum 8 tests)
### Backend tests (pytest + pytest-asyncio)
```python
# tests/test_docker_service.py
import pytest

@pytest.mark.asyncio
async def test_collect_docker_host(mock_ssh):
    """Test a successful Docker collection"""
    mock_ssh.exec_command.return_value = '{"Version": "24.0.7"}'
    result = await docker_service.collect_docker_host(host_id=1)
    assert result.docker_version == "24.0.7"

@pytest.mark.asyncio
async def test_detect_container_down(db_session):
    """Test detection of a stopped container"""
    # db_session is an assumed fixture providing an AsyncSession
    container = create_test_container(
        state="exited",
        labels={"homelab.monitor": "true", "homelab.desired": "running"}
    )
    db_session.add(container)
    await db_session.commit()
    await docker_alerts.check_container_alerts(db_session)
    alert = await docker_alerts.get_open_alert(container.host_id, container.name)
    assert alert is not None
    assert alert.severity == "error"

@pytest.mark.asyncio
async def test_start_container(mock_ssh):
    """Test starting a container"""
    mock_ssh.exec_command.return_value = "container_id"
    result = await docker_actions.start_container(host_id=1, container_id="abc123")
    assert result.success is True
    mock_ssh.exec_command.assert_called_with("docker start abc123")

@pytest.mark.asyncio
async def test_alert_notification_sent(mock_ntfy, db_session):
    """Test that an ntfy notification is sent when an alert opens"""
    # db_session is an assumed fixture providing an AsyncSession
    await docker_alerts.open_alert(
        db_session,
        host_id=1,
        container_name="nginx",
        severity="error",
        message="Container down"
    )
    assert mock_ntfy.send_notification.called
    assert "nginx" in mock_ntfy.send_notification.call_args.kwargs['message']
```
### Frontend tests (Jest or a vanilla equivalent)
```javascript
// tests/docker_section.test.js
test('renderHostsGrid displays correct number of cards', () => {
const hosts = [
{id: 1, name: 'host1', docker_status: 'online'},
{id: 2, name: 'host2', docker_status: 'offline'}
];
dockerSection.renderHostsGrid(hosts);
const cards = document.querySelectorAll('.docker-host-card');
expect(cards.length).toBe(2);
});
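test('container action buttons reflect state', () => {
// renderContainersTab writes into #containers-table, so the test provides it
document.body.innerHTML = '<table id="containers-table"><tbody></tbody></table>';
const tbody = document.querySelector('#containers-table tbody');
const runningContainer = {state: 'running'};
const stoppedContainer = {state: 'exited'};
dockerSection.renderContainersTab([runningContainer]);
expect(tbody.innerHTML).toContain('fa-stop');
expect(tbody.innerHTML).not.toContain('fa-play');
dockerSection.renderContainersTab([stoppedContainer]);
expect(tbody.innerHTML).toContain('fa-play');
expect(tbody.innerHTML).not.toContain('fa-stop');
});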
test('WebSocket updates host card in realtime', async () => {
const ws = new MockWebSocket();
dockerSection.setupWebSocket();
ws.emit({
type: 'docker_host_updated',
host: {id: 1, containers_running: 5}
});
await nextTick();
const card = document.querySelector('[data-host-id="1"]');
expect(card.textContent).toContain('5/');
});
```
---
## 📋 "Definition of Done" checklist
### Backend ✅
- [ ] Alembic migration created and tested (docker_containers, docker_images, etc.)
- [ ] `docker_service.py` service with SSH collection + JSON parsing
- [ ] `docker_actions.py` service with start/stop/restart/remove/redeploy
- [ ] `docker_alerts.py` service with detection logic + ntfy notifications
- [ ] Complete `/api/docker/*` routes with JWT auth
- [ ] APScheduler jobs added (collect + alerts)
- [ ] WebSocket events emitted (docker_host_updated, docker_alert_opened)
- [ ] Robust error handling (SSH timeout, docker unreachable, parsing errors)
- [ ] 6+ passing backend tests
### Frontend ✅
- [ ] "Docker Hosts" section added to the navigation menu
- [ ] Docker hosts list view (cards with metrics)
- [ ] Host details modal with tabs (Containers / Images / Volumes / Alerts)
- [ ] Working container actions (start/stop/restart/logs/inspect/remove)
- [ ] Modal confirmations for destructive actions (remove, redeploy)
- [ ] Container logs (drawer with tail + auto-refresh)
- [ ] Container inspect (modal JSON viewer)
- [ ] WebSocket live updates (hosts + alerts)
- [ ] Animations consistent with the rest of the dashboard
- [ ] 4+ passing frontend tests
### Security ✅
- [ ] All Docker actions require JWT auth (operator role minimum)
- [ ] Destructive actions (remove) require the admin role
- [ ] Strict SSH timeouts (5s connect, 15s exec)
- [ ] Pydantic validation on all inputs
- [ ] No arbitrary command execution
- [ ] Structured server logs (no secrets logged)
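One way to meet the "no arbitrary command execution" item is to build every remote command from an allowlist of actions and a validated container reference. The sketch below is illustrative only: `build_docker_command`, the action set, the reference pattern and its 128-character cap are assumptions, not existing project code.

```python
import re
import shlex

# Actions the API may run over SSH; anything else is rejected.
ALLOWED_ACTIONS = {"start", "stop", "restart"}

# Accepts hex container IDs and names made of letters, digits, '_', '.', '-'.
# The 128-character cap is an arbitrary safety limit chosen for this sketch.
_CONTAINER_REF = re.compile(r"^[a-zA-Z0-9][a-zA-Z0-9_.-]{0,127}$")


def build_docker_command(action: str, container_ref: str) -> str:
    """Return a shell-safe `docker <action> <container>` command string."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"action not allowed: {action!r}")
    if not _CONTAINER_REF.fullmatch(container_ref):
        raise ValueError(f"invalid container reference: {container_ref!r}")
    # shlex.quote is a second line of defence on top of the pattern check.
    return f"docker {action} {shlex.quote(container_ref)}"


command = build_docker_command("start", "abc123")
```

Rejecting input before it reaches the shell keeps the SSH layer from ever seeing user-controlled metacharacters such as `;` or `$(...)`.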
### Documentation ✅
- [ ] README.md updated (Docker section)
- [ ] curl examples for the Docker endpoints
- [ ] Configuration guide for the `homelab.monitor` and `homelab.desired` labels
- [ ] Alembic migration instructions
---
## 🚀 Execution instructions
### 1. Apply the DB migration
```bash
cd homelab-automation-api-v2
alembic revision --autogenerate -m "Add Docker management tables"
alembic upgrade head
```
### 2. Enable Docker on a host (via UI or API)
```bash
# Via the API
curl -X POST -H "Authorization: Bearer $TOKEN" \
  http://localhost:8000/api/docker/hosts/1/enable
# Via the UI: "Hosts" section > click the host > "Enable Docker" button
```
### 3. Label critical containers
```yaml
# docker-compose.yml
services:
  nginx:
    image: nginx:latest
    labels:
      homelab.monitor: "true"
      homelab.desired: "running"
```
### 4. Test a manual collection
```bash
curl -X POST -H "Authorization: Bearer $TOKEN" \
http://localhost:8000/api/docker/hosts/1/collect
```
### 5. Check alerts
```bash
curl -H "Authorization: Bearer $TOKEN" \
http://localhost:8000/api/docker/alerts
```
---
## 🎁 Bonus features (if time allows)
### Priority 1 (high value, low cost)
- **Compose awareness**: group containers by `com.docker.compose.project`
- **Resource stats**: `docker stats --no-stream` (CPU/memory snapshot)
- **Bulk actions**: restart all containers of a compose project
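For the resource stats item, the snapshot could be collected with `docker stats --no-stream --format '{{json .}}'`, which prints one JSON object per container with percentage fields as strings. A minimal parsing sketch under that assumption; `parse_stats_line` and the output keys are hypothetical, and the sample line is hand-written rather than captured output.

```python
import json


def parse_percent(value: str) -> float:
    """Convert a string such as '12.34%' to the float 12.34."""
    return float(value.strip().rstrip("%"))


def parse_stats_line(line: str) -> dict:
    """Extract name, CPU and memory percentages from one stats JSON line."""
    raw = json.loads(line)
    return {
        "name": raw["Name"],
        "cpu_percent": parse_percent(raw["CPUPerc"]),
        "mem_percent": parse_percent(raw["MemPerc"]),
    }


# Hand-written sample resembling one line of the JSON output.
sample = '{"Name":"nginx","CPUPerc":"0.50%","MemPerc":"1.25%","MemUsage":"25MiB / 2GiB"}'
stats = parse_stats_line(sample)
```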
### Priority 2 (good value, medium cost)
- **Event timeline**: log of Docker actions in the host view
- **Auto-remediation**: `homelab.auto_restart=true` flag → automatic restart when down
- **Networks tab**: list of Docker networks + attached containers
### Priority 3 (nice-to-have, high cost)
- **Prune management**: image/volume cleanup (danger zone + admin only)
- **Image scanning**: vulnerabilities via Trivy (if installed on the hosts)
- **Logs streaming**: real-time logs over WebSocket (instead of a static tail)
---
## ⚠️ Risks and mitigation
| Risk | Impact | Mitigation |
|------|--------|------------|
| SSH timeout during collection | Hosts marked offline | Retry logic + adaptive timeout (5s → 10s → 30s) |
| Docker JSON parsing fails | Partial collection | try/except per entity (containers/images/volumes) |
| WebSocket spam with many hosts | UI lag | Throttle broadcasts (max 1/sec per type) |
| Simultaneous Docker actions | Race conditions | Lock per container_id (asyncio.Lock) |
| Alert spam when a container is flapping | Notification fatigue | 5-minute cooldown between notifications for the same alert |
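
The adaptive timeout in the first row (5s → 10s → 30s) can be sketched as a retry loop around `asyncio.wait_for`. The helper name `run_with_escalating_timeout` is hypothetical, and the demo uses shortened timeouts and a sleeping coroutine in place of a real SSH command.

```python
import asyncio
from typing import Awaitable, Callable, Sequence, TypeVar

T = TypeVar("T")


async def run_with_escalating_timeout(
    operation: Callable[[], Awaitable[T]],
    timeouts: Sequence[float] = (5.0, 10.0, 30.0),
) -> T:
    """Retry `operation`, allowing a longer timeout on each attempt."""
    last_error = None
    for timeout in timeouts:
        try:
            # A fresh awaitable is created for every attempt.
            return await asyncio.wait_for(operation(), timeout=timeout)
        except asyncio.TimeoutError as exc:
            last_error = exc
    raise TimeoutError(
        f"operation timed out after {len(timeouts)} attempts"
    ) from last_error


async def _demo() -> tuple[str, int]:
    attempts = 0

    async def slow_command() -> str:
        nonlocal attempts
        attempts += 1
        await asyncio.sleep(0.05)  # stand-in for a slow SSH round trip
        return "ok"

    # The first attempt (0.01s) times out; the second (0.5s) succeeds.
    result = await run_with_escalating_timeout(slow_command, timeouts=(0.01, 0.5))
    return result, attempts


result, attempts = asyncio.run(_demo())
```

Only idempotent read commands (such as the collection calls) should be retried this way; retrying an action like `docker restart` after a timeout could run it twice.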
---
## 📊 Success metrics
- ✅ Docker collection succeeds on 3+ hosts simultaneously without timeout
- ✅ Container-down detection < 60s after the actual stop
- ✅ ntfy notification received within 5s of the alert opening
- ✅ Container actions (start/stop) < 3s (excluding Docker's own delay)