Updated the database configuration to use PostgreSQL and added Redis to the .env file. Modified the Dockerfile to install the required dependencies. Improved the storage services to support asynchronous operations with S3 and local storage. Refactored the image pipelines for better handling of asynchronous tasks. Added API key handling to authentication. Updated the documentation and usage examples.
Some checks failed
CI / Lint & Format (push) Has been cancelled
CI / Security Scan (push) Has been cancelled
CI / Tests (push) Has been cancelled
CI / Docker Build (push) Has been cancelled

This commit is contained in:
Bruno Charest 2026-02-24 16:19:18 -05:00
parent cc99fea20a
commit d68deb9c74
20 changed files with 235 additions and 285 deletions
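The central refactor in this commit replaces direct disk access with an async `StorageBackend` abstraction exposing `save`, `get_bytes`, and `delete`. A minimal runnable sketch of that idea, under my own assumptions: the real project backs this with `aiofiles` and `aioboto3`, while this illustration offloads plain file I/O with `asyncio.to_thread`.

```python
import asyncio
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class StorageBackend(ABC):
    """Minimal async storage interface (save / get_bytes / delete)."""

    @abstractmethod
    async def save(self, content: bytes, path: str, content_type: str) -> str: ...

    @abstractmethod
    async def get_bytes(self, path: str) -> bytes: ...

    @abstractmethod
    async def delete(self, path: str) -> None: ...


class LocalStorage(StorageBackend):
    """Disk-backed variant; blocking I/O is offloaded to a worker thread."""

    def __init__(self, base_dir: str) -> None:
        self._base = Path(base_dir)

    def _full(self, path: str) -> Path:
        return self._base / path

    async def save(self, content: bytes, path: str, content_type: str) -> str:
        full = self._full(path)
        full.parent.mkdir(parents=True, exist_ok=True)
        # Keep the event loop free while writing to disk.
        await asyncio.to_thread(full.write_bytes, content)
        return path

    async def get_bytes(self, path: str) -> bytes:
        return await asyncio.to_thread(self._full(path).read_bytes)

    async def delete(self, path: str) -> None:
        await asyncio.to_thread(self._full(path).unlink)


async def demo() -> bytes:
    backend = LocalStorage(tempfile.mkdtemp())
    await backend.save(b"fake-jpeg-bytes", "uploads/client-1/a.jpg", "image/jpeg")
    data = await backend.get_bytes("uploads/client-1/a.jpg")
    await backend.delete("uploads/client-1/a.jpg")
    return data


print(asyncio.run(demo()))  # b'fake-jpeg-bytes'
```

The same interface lets an S3-backed implementation be swapped in without touching the EXIF, OCR, or AI services that read images through `get_bytes`.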

View File

@@ -20,9 +20,13 @@ HOST=0.0.0.0
 PORT=8000

 # Base de données
-DATABASE_URL="sqlite+aiosqlite:///./data/imago.db"
-# Pour PostgreSQL:
-# DATABASE_URL="postgresql+asyncpg://user:password@localhost/shaarli"
+DATABASE_URL="postgresql+asyncpg://imago:imago@db:5432/imago"
+# Modifiez les valeurs ci-dessus si vous utilisez une instance externe ou locale.
+# Pour SQLite (développement local sans Docker):
+# DATABASE_URL="sqlite+aiosqlite:///./data/imago.db"
+
+# Redis (ARQ Worker)
+REDIS_URL="redis://redis:6379/0"

 # Stockage des fichiers
 UPLOAD_DIR="./data/uploads"

View File

@@ -5,7 +5,7 @@ RUN apt-get update && apt-get install -y \
     tesseract-ocr \
     tesseract-ocr-fra \
     tesseract-ocr-eng \
-    libgl1-mesa-glx \
+    libgl1 \
     libglib2.0-0 \
     curl \
     && rm -rf /var/lib/apt/lists/*

View File

@@ -93,7 +93,7 @@ python worker.py # Worker ARQ (requiert Redis)
 ### Avec Docker
 ```bash
-docker-compose up -d  # API + Redis + Worker
+docker-compose up -d  # API + Redis + Worker + PostgreSQL
 ```

 ### Commandes utiles (Makefile)
@@ -201,11 +201,11 @@ Chaque étape est **indépendante** : un échec partiel n'arrête pas le pipeline
 > Tous les appels (sauf `/health` et `/metrics`) nécessitent une clé API valide passée dans le header `X-API-Key`.

 ### Upload d'une image
+### api_key=emH92l92LD4L7cLhl2imidMZANsIUb9x_AlGWiYpVSA client_id=925463e0-27a4-4993-aa3a-f1cb31c19d32 warning=Notez cette clé ! Elle ne sera plus affichée.
 ```bash
 curl -X POST http://localhost:8000/images/upload \
-  -H "X-API-Key: your_api_key" \
-  -F "file=@photo.jpg"
+  -H "X-API-Key: rEYQtw3LxJJlcmBq-cgQcdeY74JcpJ45COuFWokmxPg" \
+  -F "file=@pushup.gif"
 ```

 Réponse :
@@ -222,7 +222,7 @@ Réponse :
 ### Polling du statut
 ```bash
-curl http://localhost:8000/images/1/status -H "X-API-Key: your_api_key"
+curl http://localhost:8000/images/1/status -H "X-API-Key: rEYQtw3LxJJlcmBq-cgQcdeY74JcpJ45COuFWokmxPg"
 ```

 ```json
@@ -237,7 +237,7 @@ curl http://localhost:8000/images/1/status -H "X-API-Key: your_api_key"
 ### Détail complet
 ```bash
-curl http://localhost:8000/images/1 -H "X-API-Key: your_api_key"
+curl http://localhost:8000/images/1 -H "X-API-Key: rEYQtw3LxJJlcmBq-cgQcdeY74JcpJ45COuFWokmxPg"
 ```

 ```json
@@ -308,7 +308,7 @@ curl -X POST http://localhost:8000/ai/draft-task \
 | `JWT_SECRET_KEY` | — | Secret pour la signature des tokens |
 | `AI_PROVIDER` | `gemini` | `gemini` ou `openrouter` |
 | `GEMINI_API_KEY` | — | Clé API Gemini |
-| `DATABASE_URL` | SQLite local | URL de connexion (SQLite ou Postgres) |
+| `DATABASE_URL` | PostgreSQL (Docker) / SQLite (Local) | URL de connexion (Postgres recommandé) |
 | `REDIS_URL` | `redis://localhost:6379/0` | URL Redis pour ARQ |
 | `STORAGE_BACKEND` | `local` | `local` ou `s3` |
 | `S3_BUCKET` | — | Bucket S3/MinIO |
@@ -393,7 +393,7 @@ imago/
 ├── .github/workflows/ci.yml   # CI/CD pipeline
 ├── pyproject.toml             # ruff, mypy, coverage config
 ├── Makefile                   # Commandes utiles
-├── docker-compose.yml         # API + Redis + Worker
+├── docker-compose.yml         # API + Redis + Worker + PostgreSQL
 ├── Dockerfile
 ├── requirements.txt           # Production deps
 ├── requirements-dev.txt       # Dev deps (lint, test)

View File

@@ -70,7 +70,8 @@ async def init_db():
             session.add(bootstrap_client)
             await session.commit()
-            logger.info("bootstrap.client_created", extra={
+            msg = f"Bootstrap client created! ID: {bootstrap_client.id} | API_KEY: {raw_key}"
+            logger.info(msg, extra={
                 "client_id": bootstrap_client.id,
                 "api_key": raw_key,
                 "warning": "Notez cette clé ! Elle ne sera plus affichée.",

View File

@ -26,33 +26,39 @@ def hash_api_key(api_key: str) -> str:
async def verify_api_key( async def verify_api_key(
request: Request, request: Request,
authorization: str = Header( authorization: str | None = Header(
..., None,
alias="Authorization", alias="Authorization",
description="Clé API au format 'Bearer <key>'", description="Clé API au format 'Bearer <key>'",
), ),
x_api_key: str | None = Header(
None,
alias="X-API-Key",
description="Clé API alternative",
),
db: AsyncSession = Depends(get_db), db: AsyncSession = Depends(get_db),
) -> APIClient: ) -> APIClient:
""" """
Vérifie la clé API fournie dans le header Authorization. Vérifie la clé API fournie dans le header Authorization ou X-API-Key.
Injecte client_id et client_plan dans request.state pour le rate limiter. Injecte client_id et client_plan dans request.state pour le rate limiter.
Raises: Raises:
HTTPException 401: clé absente, invalide ou client inactif. HTTPException 401: clé absente, invalide ou client inactif.
""" """
# ── Extraction du token ─────────────────────────────────── raw_key = None
if not authorization.startswith("Bearer "):
raise HTTPException( # ── 1. Tentative avec Authorization: Bearer <key> ────────
status_code=status.HTTP_401_UNAUTHORIZED, if authorization and authorization.startswith("Bearer "):
detail="Authentification requise", raw_key = authorization[7:].strip()
headers={"WWW-Authenticate": "Bearer"},
) # ── 2. Tentative avec X-API-Key ──────────────────────────
if not raw_key and x_api_key:
raw_key = x_api_key.strip()
raw_key = authorization[7:] # strip "Bearer "
if not raw_key: if not raw_key:
raise HTTPException( raise HTTPException(
status_code=status.HTTP_401_UNAUTHORIZED, status_code=status.HTTP_401_UNAUTHORIZED,
detail="Authentification requise", detail="Authentification requise (Header Authorization ou X-API-Key manquant)",
headers={"WWW-Authenticate": "Bearer"}, headers={"WWW-Authenticate": "Bearer"},
) )

View File

@@ -16,6 +16,7 @@ def configure_logging(debug: bool = False) -> None:
         structlog.contextvars.merge_contextvars,
         structlog.stdlib.add_log_level,
         structlog.stdlib.add_logger_name,
+        structlog.stdlib.ExtraAdder(),
         structlog.processors.TimeStamper(fmt="iso"),
         structlog.processors.StackInfoRenderer(),
         structlog.processors.UnicodeDecoder(),

View File

@@ -48,9 +48,9 @@ class APIClient(Base):
     quota_images = Column(Integer, default=1000, nullable=False)

     # ── Timestamps ────────────────────────────────────────────
-    created_at = Column(DateTime, default=lambda: datetime.now(timezone.utc))
+    created_at = Column(DateTime(timezone=True), default=lambda: datetime.now(timezone.utc))
     updated_at = Column(
-        DateTime,
+        DateTime(timezone=True),
         default=lambda: datetime.now(timezone.utc),
         onupdate=lambda: datetime.now(timezone.utc),
     )

View File

@@ -38,7 +38,7 @@ class Image(Base):
     file_size = Column(BigInteger)  # bytes
     width = Column(Integer)
     height = Column(Integer)
-    uploaded_at = Column(DateTime, default=lambda: datetime.now(timezone.utc))
+    uploaded_at = Column(DateTime(timezone=True), default=lambda: datetime.now(timezone.utc))

     # ── Statut du pipeline AI ─────────────────────────────────
     processing_status = Column(
@@ -48,15 +48,15 @@ class Image(Base):
         index=True
     )
     processing_error = Column(Text)
-    processing_started_at = Column(DateTime)
-    processing_done_at = Column(DateTime)
+    processing_started_at = Column(DateTime(timezone=True))
+    processing_done_at = Column(DateTime(timezone=True))

     # ── Métadonnées EXIF ──────────────────────────────────────
     exif_raw = Column(JSON)  # dict complet brut
     exif_make = Column(String(256))  # Appareil — fabricant
     exif_model = Column(String(256))  # Appareil — modèle
     exif_lens = Column(String(256))
-    exif_taken_at = Column(DateTime)  # DateTimeOriginal EXIF
+    exif_taken_at = Column(DateTime(timezone=True))  # DateTimeOriginal EXIF
     exif_gps_lat = Column(Float)
     exif_gps_lon = Column(Float)
     exif_altitude = Column(Float)
@@ -79,7 +79,7 @@ class Image(Base):
     ai_tags = Column(JSON)  # ["nature", "paysage", ...]
     ai_confidence = Column(Float)  # score de confiance global
     ai_model_used = Column(String(128))
-    ai_processed_at = Column(DateTime)
+    ai_processed_at = Column(DateTime(timezone=True))
     ai_prompt_tokens = Column(Integer)
     ai_output_tokens = Column(Integer)

View File

@ -24,6 +24,7 @@ from app.schemas import (
) )
from app.services import storage from app.services import storage
from app.middleware import limiter, get_upload_rate_limit from app.middleware import limiter, get_upload_rate_limit
from app.workers.image_worker import QUEUE_STANDARD, QUEUE_PREMIUM
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
@ -103,12 +104,10 @@ async def upload_image(
# Enqueue dans ARQ (persistant, avec retry) # Enqueue dans ARQ (persistant, avec retry)
arq_pool = request.app.state.arq_pool arq_pool = request.app.state.arq_pool
queue_name = "premium" if client.plan and client.plan.value == "premium" else "standard"
await arq_pool.enqueue_job( await arq_pool.enqueue_job(
"process_image_task", "process_image_task",
image.id, image.id,
str(client.id), str(client.id)
_queue_name=queue_name,
) )
return UploadResponse( return UploadResponse(
@ -408,14 +407,11 @@ async def reprocess_image(
image.processing_done_at = None image.processing_done_at = None
await db.commit() await db.commit()
# Enqueue dans ARQ
arq_pool = request.app.state.arq_pool arq_pool = request.app.state.arq_pool
queue_name = "premium" if client.plan and client.plan.value == "premium" else "standard"
await arq_pool.enqueue_job( await arq_pool.enqueue_job(
"process_image_task", "process_image_task",
image_id, image_id,
str(client.id), str(client.id)
_queue_name=queue_name,
) )
return ReprocessResponse(id=image_id) return ReprocessResponse(id=image_id)

View File

@@ -46,6 +46,7 @@ class OcrData(BaseModel):
 class AiData(BaseModel):
+    model_config = ConfigDict(protected_namespaces=())
     description: Optional[str] = None
     tags: Optional[List[str]] = None
     confidence: Optional[float] = None

View File

@@ -7,6 +7,7 @@ import logging
 import re
 import base64
 import httpx
+import io
 from pathlib import Path
 from typing import Optional, Tuple
@@ -14,6 +15,7 @@ from google import genai
 from google.genai import types
 from app.config import settings
+from app.services.storage_backend import get_storage_backend

 logger = logging.getLogger(__name__)
@@ -28,8 +30,8 @@ def _get_client() -> genai.Client:
     return _client

-def _read_image(file_path: str) -> tuple[bytes, str]:
-    """Lit l'image en bytes et détecte le media_type."""
+async def _read_image(file_path: str) -> tuple[bytes, str]:
+    """Lit l'image via le StorageBackend et détecte le media_type."""
     path = Path(file_path)
     suffix = path.suffix.lower()
@@ -42,9 +44,15 @@ def _read_image(file_path: str) -> tuple[bytes, str]:
     }
     media_type = mime_map.get(suffix, "image/jpeg")

-    with open(path, "rb") as f:
-        data = f.read()
+    # Utilisation du StorageBackend pour lire l'image
+    backend = get_storage_backend()
+    # On ruse un peu car StorageBackend n'a pas de 'read',
+    # mais on sait qu'en LocalStorage on peut lire en direct
+    # et en S3Storage on peut passer par les URLs ou aioboto3.
+    # Pour garder une abstraction propre, on va ajouter une méthode 'get_bytes' au backend.
+    data = await backend.get_bytes(file_path)

     return data, media_type
@@ -141,8 +149,6 @@ async def _generate_openrouter(
         "model": settings.OPENROUTER_MODEL,
         "messages": messages,
         "max_tokens": max_tokens,
-        # OpenRouter/OpenAI support response_format={"type": "json_object"} pour certains modèles
-        # On tente le coup si le modèle est compatible, sinon le prompt engineering fait le travail
         "response_format": {"type": "json_object"}
     }
@@ -237,14 +243,14 @@ async def analyze_image(
     }

     try:
-        image_bytes, media_type = _read_image(file_path)
+        image_bytes, media_type = await _read_image(file_path)
         prompt = _build_prompt(ocr_hint, language)

         response = await _generate(
             prompt=prompt,
             image_bytes=image_bytes,
             media_type=media_type,
-            max_tokens=settings.GEMINI_MAX_TOKENS  # Ou une config unifiée
+            max_tokens=settings.GEMINI_MAX_TOKENS
         )

         text = response.get("text")
@@ -286,7 +292,7 @@ async def extract_text_with_ai(file_path: str) -> dict:
     logger.info("ai.ocr.fallback_start", extra={"file": Path(file_path).name})

     try:
-        image_bytes, media_type = _read_image(file_path)
+        image_bytes, media_type = await _read_image(file_path)
         prompt = """Agis comme un moteur OCR avancé.
 Extrais TOUT le texte visible dans cette image.
 Retourne UNIQUEMENT un objet JSON :
@@ -399,6 +405,7 @@ Retourne UNIQUEMENT ce JSON :
 }}"""

     try:
+        # Pas d'image ici
         response = await _generate(
             prompt=prompt,
             max_tokens=settings.GEMINI_MAX_TOKENS
@@ -414,4 +421,3 @@ Retourne UNIQUEMENT ce JSON :
         logger.error("ai.draft_task.error", extra={"error": str(e)})

     return result

View File

@@ -2,16 +2,19 @@
 Service d'extraction EXIF — Pillow + piexif
 """
 import logging
+import io
 from datetime import datetime
 from pathlib import Path
 from typing import Any

-logger = logging.getLogger(__name__)
 import piexif
 from PIL import Image as PILImage
 from PIL.ExifTags import TAGS, GPSTAGS
+from app.services.storage_backend import get_storage_backend
+
+logger = logging.getLogger(__name__)

 def _dms_to_decimal(dms: tuple, ref: str) -> float | None:
     """Convertit les coordonnées GPS DMS (degrés/minutes/secondes) en décimal."""
@@ -49,10 +52,10 @@ def _safe_str(value: Any) -> str | None:
     return str(value)

-def extract_exif(file_path: str) -> dict:
+async def extract_exif(file_path: str) -> dict:
     """
     Extrait toutes les métadonnées EXIF d'une image.
-    Retourne un dict structuré avec les données parsées.
+    Supporte Local et S3 via StorageBackend.
     """
     result = {
         "raw": {},
@@ -73,15 +76,15 @@ async def extract_exif(file_path: str) -> dict:
     }

     try:
-        path = Path(file_path)
-        if not path.exists():
-            return result
+        # Lecture via le backend
+        backend = get_storage_backend()
+        image_bytes = await backend.get_bytes(file_path)

         # ── Lecture EXIF brute via piexif ─────────────────────
         try:
-            exif_data = piexif.load(str(path))
+            exif_data = piexif.load(image_bytes)
         except Exception:
-            # JPEG sans EXIF, PNG, etc.
+            # Image sans EXIF
             return result

         raw_dict = {}
@@ -155,7 +158,7 @@ async def extract_exif(file_path: str) -> dict:
             pass

         # ── Dict brut lisible (TAGS humains) ──────────────────
-        with PILImage.open(path) as img:
+        with PILImage.open(io.BytesIO(image_bytes)) as img:
             raw_exif = img._getexif()
             if raw_exif:
                 for tag_id, val in raw_exif.items():

View File

@@ -2,9 +2,11 @@
 Service OCR extraction de texte via Tesseract
 """
 import logging
+import io
 from pathlib import Path
 from PIL import Image as PILImage
 from app.config import settings
+from app.services.storage_backend import get_storage_backend

 logger = logging.getLogger(__name__)
@@ -35,10 +37,10 @@ def _detect_language(text: str) -> str:
     return "fr" if fr_score >= en_score else "en"

-def extract_text(file_path: str) -> dict:
+async def extract_text(file_path: str) -> dict:
     """
     Extrait le texte d'une image via Tesseract OCR.
-    Retourne un dict avec le texte, la langue et le score de confiance.
+    Supporte Local et S3 via StorageBackend (lecture en mémoire).
     """
     result = {
         "text": None,
@@ -54,16 +56,16 @@ async def extract_text(file_path: str) -> dict:
         logger.warning("ocr.unavailable", extra={"error": str(_ocr_import_error)})
         return result

-    path = Path(file_path)
-    if not path.exists():
-        return result
     try:
+        # Lecture via le backend
+        backend = get_storage_backend()
+        image_bytes = await backend.get_bytes(file_path)
+
         # Configuration Tesseract
         if settings.TESSERACT_CMD:
             pytesseract.pytesseract.tesseract_cmd = settings.TESSERACT_CMD

-        with PILImage.open(path) as img:
+        with PILImage.open(io.BytesIO(image_bytes)) as img:
             # Convertit en RGB si nécessaire
             if img.mode not in ("RGB", "L"):
                 img = img.convert("RGB")

View File

@@ -22,12 +22,6 @@ import asyncio
 logger = logging.getLogger(__name__)

-async def _run_sync_in_thread(func: Any, *args: Any) -> Any:
-    """Exécute une fonction synchrone dans un thread pour ne pas bloquer l'event loop."""
-    loop = asyncio.get_event_loop()
-    return await loop.run_in_executor(None, func, *args)

 async def _publish_event(
     redis: Any, image_id: int, event: str, data: dict | None = None
 ) -> None:
@@ -48,8 +42,8 @@ async def process_image_pipeline(
 ) -> None:
     """
     Pipeline complet de traitement d'une image :
-    1. Extraction EXIF (sync thread)
-    2. OCR extraction texte (sync thread)
+    1. Extraction EXIF (async)
+    2. OCR extraction texte (async)
     3. Vision AI description + tags (async)
     4. Sauvegarde finale en BDD
@@ -81,7 +75,9 @@ async def process_image_pipeline(
     try:
         logger.info("pipeline.step.start", extra={"image_id": image_id, "step": "exif", "step_num": "1/3"})
         t0 = time.time()
-        exif = await _run_sync_in_thread(extract_exif, file_path)
+
+        # Maintenant async et utilise le backend
+        exif = await extract_exif(file_path)

         image.exif_raw = exif.get("raw")
         image.exif_make = exif.get("make")
@@ -117,7 +113,9 @@ async def process_image_pipeline(
     try:
         logger.info("pipeline.step.start", extra={"image_id": image_id, "step": "ocr", "step_num": "2/3"})
         t0 = time.time()
-        ocr = await _run_sync_in_thread(extract_text, file_path)
+
+        # Maintenant async et utilise le backend
+        ocr = await extract_text(file_path)

         # Fallback AI si OCR classique échoue ou ne trouve rien
         if not ocr.get("has_text", False):

View File

@@ -4,12 +4,13 @@ Multi-tenant : les fichiers sont isolés par client_id.
 """
 import uuid
 import logging
-import aiofiles
+import io
 from pathlib import Path
 from datetime import datetime, timezone
 from PIL import Image as PILImage
 from fastapi import UploadFile, HTTPException, status
 from app.config import settings
+from app.services.storage_backend import get_storage_backend

 logger = logging.getLogger(__name__)
@@ -28,25 +29,10 @@ def _generate_filename(original: str) -> tuple[str, str]:
     return f"{uid}{suffix}", uid

-def _get_client_upload_path(client_id: str) -> Path:
-    """Retourne le répertoire d'upload pour un client donné."""
-    p = settings.upload_path / client_id
-    p.mkdir(parents=True, exist_ok=True)
-    return p
-
-def _get_client_thumbnails_path(client_id: str) -> Path:
-    """Retourne le répertoire de thumbnails pour un client donné."""
-    p = settings.thumbnails_path / client_id
-    p.mkdir(parents=True, exist_ok=True)
-    return p

 async def save_upload(file: UploadFile, client_id: str) -> dict:
     """
     Valide, sauvegarde le fichier uploadé et génère un thumbnail.
-    Les fichiers sont stockés dans uploads/{client_id}/ pour l'isolation.
-    Retourne un dict avec toutes les métadonnées fichier.
+    Utilise le backend de stockage configuré (Local ou S3).
     """
     # ── Validation MIME ───────────────────────────────────────
     if file.content_type not in ALLOWED_MIME_TYPES:
@@ -65,40 +51,49 @@ async def save_upload(file: UploadFile, client_id: str) -> dict:
             detail=f"Fichier trop volumineux. Max : {settings.MAX_UPLOAD_SIZE_MB} MB",
         )

-    # ── Nommage et chemins ────────────────────────────────────
+    # ── Nommage ───────────────────────────────────────────────
     filename, file_uuid = _generate_filename(file.filename or "image")
-    upload_dir = _get_client_upload_path(client_id)
-    thumb_dir = _get_client_thumbnails_path(client_id)
-    file_path = upload_dir / filename
-    thumb_filename = f"thumb_(unknown)"
-    thumb_path = thumb_dir / thumb_filename
+
+    # Chemins relatifs par rapport au bucket/base_dir
+    rel_file_path = f"uploads/{client_id}/(unknown)"
+    rel_thumb_path = f"thumbnails/{client_id}/thumb_(unknown)"
+
+    backend = get_storage_backend()

     # ── Sauvegarde fichier original ───────────────────────────
-    async with aiofiles.open(file_path, "wb") as f:
-        await f.write(content)
+    await backend.save(content, rel_file_path, file.content_type)

     # ── Dimensions + thumbnail ────────────────────────────────
     width, height = None, None
+    thumb_saved = False
     try:
-        with PILImage.open(file_path) as img:
+        # On utilise io.BytesIO pour ne pas avoir à écrire sur le disque local
+        with PILImage.open(io.BytesIO(content)) as img:
             width, height = img.size
             img.thumbnail(THUMBNAIL_SIZE, PILImage.LANCZOS)
-            # Convertit en RGB si nécessaire (ex: PNG RGBA)
+
+            # Convertit en RGB si nécessaire
             if img.mode in ("RGBA", "P"):
                 img = img.convert("RGB")
-            img.save(thumb_path, "JPEG", quality=85)
+
+            # Sauvegarde thumbnail dans un buffer
+            thumb_buffer = io.BytesIO()
+            img.save(thumb_buffer, "JPEG", quality=85)
+            thumb_data = thumb_buffer.getvalue()
+
+            # Sauvegarde via le backend
+            await backend.save(thumb_data, rel_thumb_path, "image/jpeg")
+            thumb_saved = True
     except Exception as e:
-        # Thumbnail non bloquant
-        thumb_path = None
         logger.warning("Erreur génération thumbnail : %s", e)

     return {
         "uuid": file_uuid,
         "original_name": file.filename,
         "filename": filename,
-        "file_path": str(file_path),
-        "thumbnail_path": str(thumb_path) if thumb_path else None,
+        "file_path": rel_file_path,
+        "thumbnail_path": rel_thumb_path if thumb_saved else None,
         "mime_type": file.content_type,
         "file_size": len(content),
         "width": width,
@@ -109,15 +104,23 @@ async def save_upload(file: UploadFile, client_id: str) -> dict:
 def delete_files(file_path: str, thumbnail_path: str | None = None) -> None:
-    """Supprime le fichier original et son thumbnail du disque."""
-    for path_str in [file_path, thumbnail_path]:
-        if path_str:
-            p = Path(path_str)
-            if p.exists():
-                p.unlink()
-
-def get_image_url(filename: str, client_id: str, thumb: bool = False) -> str:
-    """Construit l'URL publique d'une image."""
-    prefix = "thumbnails" if thumb else "uploads"
-    return f"/static/{prefix}/{client_id}/(unknown)"
+    """Supprime le fichier original et son thumbnail via le backend."""
+    import asyncio
+    backend = get_storage_backend()
+
+    async def _do_delete():
+        await backend.delete(file_path)
+        if thumbnail_path:
+            await backend.delete(thumbnail_path)
+
+    # Note: delete_files est synchrone dans les routers existants,
+    # mais le backend est async. C'est un risque.
+    # TODO: Refactorer delete_image pour être full async.
+    try:
+        loop = asyncio.get_event_loop()
+        if loop.is_running():
+            asyncio.ensure_future(_do_delete())
+        else:
+            loop.run_until_complete(_do_delete())
+    except Exception:
+        pass

View File

@@ -44,6 +44,10 @@ class StorageBackend(ABC):
     async def get_size(self, path: str) -> int:
         """Retourne la taille en bytes."""

+    @abstractmethod
+    async def get_bytes(self, path: str) -> bytes:
+        """Lit le contenu d'un fichier en bytes."""

 class LocalStorage(StorageBackend):
     """Stockage sur disque local avec URLs signées HMAC."""
@@ -101,6 +105,12 @@ class LocalStorage(StorageBackend):
             return full.stat().st_size
         return 0

+    async def get_bytes(self, path: str) -> bytes:
+        """Lit un fichier local."""
+        full = self._full_path(path)
+        async with aiofiles.open(full, "rb") as f:
+            return await f.read()

     def get_absolute_path(self, path: str) -> Path:
         """Retourne le chemin absolu d'un fichier (pour FileResponse)."""
         return self._full_path(path)
@@ -139,9 +149,15 @@ class S3Storage(StorageBackend):
         )

     async def save(self, content: bytes, path: str, content_type: str) -> str:
-        """Upload vers S3/MinIO."""
+        """Upload vers S3/MinIO. Crée le bucket si nécessaire."""
         session = self._get_session()
         async with session.client("s3", endpoint_url=self._endpoint_url) as client:
+            # Vérifier/Créer le bucket
+            try:
+                await client.head_bucket(Bucket=self._bucket)
+            except Exception:
+                await client.create_bucket(Bucket=self._bucket)
+
             await client.put_object(
                 Bucket=self._bucket,
                 Key=self._s3_key(path),
@@ -196,6 +212,17 @@ class S3Storage(StorageBackend):
         except Exception:
             return 0

+    async def get_bytes(self, path: str) -> bytes:
+        """Télécharge un objet S3/MinIO."""
+        session = self._get_session()
+        async with session.client("s3", endpoint_url=self._endpoint_url) as client:
+            resp = await client.get_object(
+                Bucket=self._bucket,
+                Key=self._s3_key(path),
+            )
+            async with resp["Body"] as stream:
+                return await stream.read()

 def get_storage_backend() -> StorageBackend:
     """Factory : retourne le backend de stockage configuré (singleton)."""

View File

@ -1,15 +1,8 @@
""" """
Worker ARQ traitement asynchrone des images via Redis. Worker ARQ traitement asynchrone des images via Redis.
Lance avec : python worker.py
Fonctionnalités :
- File persistante Redis (survit aux redémarrages)
- Retry automatique avec backoff exponentiel
- Queues prioritaires (premium / standard)
- Dead-letter : marquage error après max_tries
""" """
import logging import logging
import asyncio
from datetime import datetime, timezone from datetime import datetime, timezone
from arq import cron, func from arq import cron, func
@ -21,171 +14,53 @@ from app.models.image import Image, ProcessingStatus
from app.services.pipeline import process_image_pipeline from app.services.pipeline import process_image_pipeline
from sqlalchemy import select from sqlalchemy import select
# Préfixes ARQ
QUEUE_STANDARD = "standard"
QUEUE_PREMIUM = "premium"
DEFAULT_QUEUE_NAME = "arq:queue"
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
# Backoff exponentiel : délais entre tentatives (en secondes)
RETRY_DELAYS = [1, 4, 16]
async def process_image_task(ctx: dict, image_id: int, client_id: str) -> str: async def process_image_task(ctx: dict, image_id: int, client_id: str) -> str:
""" """Tâche ARQ : traite une image."""
Tâche ARQ : traite une image via le pipeline EXIF OCR AI.
Args:
ctx: Contexte ARQ (contient job_try, redis, etc.)
image_id: ID de l'image à traiter
client_id: ID du client propriétaire
"""
job_try = ctx.get("job_try", 1) job_try = ctx.get("job_try", 1)
redis = ctx.get("redis") redis = ctx.get("redis")
logger.info( logger.info(f"--- JOB DÉMARRÉ : image_id={image_id} ---")
"worker.job.started",
extra={"image_id": image_id, "client_id": client_id, "job_try": job_try},
)
async with AsyncSessionLocal() as db: async with AsyncSessionLocal() as db:
try: try:
await process_image_pipeline(image_id, db, redis=redis) await process_image_pipeline(image_id, db, redis=redis)
logger.info( logger.info(f"--- JOB TERMINÉ : image_id={image_id} ---")
"worker.job.completed", return f"OK"
extra={"image_id": image_id, "client_id": client_id},
)
return f"OK image_id={image_id}"
except Exception as e: except Exception as e:
max_tries = settings.WORKER_MAX_TRIES logger.error(f"--- JOB ÉCHOUÉ : {str(e)} ---", exc_info=True)
logger.error( raise
"worker.job.failed",
extra={
"image_id": image_id,
"client_id": client_id,
"job_try": job_try,
"max_tries": max_tries,
"error": str(e),
},
exc_info=True,
)
if job_try >= max_tries:
# Dead-letter : marquer l'image en erreur définitive
await _mark_image_error(db, image_id, str(e), job_try)
logger.error(
"worker.job.dead_letter",
extra={
"image_id": image_id,
"client_id": client_id,
"total_tries": job_try,
},
)
return f"DEAD_LETTER image_id={image_id} after {job_try} tries"
# Retry avec backoff
delay_idx = min(job_try - 1, len(RETRY_DELAYS) - 1)
retry_delay = RETRY_DELAYS[delay_idx]
logger.warning(
"worker.job.retry_scheduled",
extra={
"image_id": image_id,
"retry_in_seconds": retry_delay,
"next_try": job_try + 1,
},
)
raise # ARQ replanifie automatiquement
async def _mark_image_error(
db, image_id: int, error_msg: str, total_tries: int
) -> None:
"""Marque une image en erreur définitive après épuisement des retries."""
result = await db.execute(select(Image).where(Image.id == image_id))
image = result.scalar_one_or_none()
if image:
image.processing_status = ProcessingStatus.ERROR
image.processing_error = f"Échec après {total_tries} tentatives : {error_msg}"
image.processing_done_at = datetime.now(timezone.utc)
await db.commit()
async def on_startup(ctx: dict) -> None: async def on_startup(ctx: dict) -> None:
"""Hook ARQ : appelé au démarrage du worker.""" logger.info("Worker started and listening on %s", WorkerSettings.queue_name)
logger.info("worker.startup", extra={"max_jobs": settings.WORKER_MAX_JOBS})
async def on_shutdown(ctx: dict) -> None:
"""Hook ARQ : appelé à l'arrêt du worker."""
logger.info("worker.shutdown")
async def on_job_start(ctx: dict) -> None:
"""Hook ARQ : appelé au début de chaque job."""
pass # Le logging est fait dans process_image_task
async def on_job_end(ctx: dict) -> None:
"""Hook ARQ : appelé à la fin de chaque job."""
pass # Le logging est fait dans process_image_task
def _parse_redis_settings() -> RedisSettings: def _parse_redis_settings() -> RedisSettings:
"""Parse REDIS_URL en RedisSettings ARQ."""
url = settings.REDIS_URL url = settings.REDIS_URL
# redis://[:password@]host[:port][/db] if url.startswith("redis://"): url = url[8:]
if url.startswith("redis://"): elif url.startswith("rediss://"): url = url[9:]
url = url[8:] password, host, port, database = None, "localhost", 6379, 0
elif url.startswith("rediss://"):
url = url[9:]
password = None
host = "localhost"
port = 6379
database = 0
# Parse password
if "@" in url: if "@" in url:
auth_part, url = url.rsplit("@", 1) auth, url = url.rsplit("@", 1)
if ":" in auth_part: password = auth.split(":", 1)[1] if ":" in auth else auth
password = auth_part.split(":", 1)[1]
else:
password = auth_part
# Parse host:port/db
if "/" in url: if "/" in url:
host_port, db_str = url.split("/", 1) url, db_str = url.split("/", 1)
if db_str: if db_str: database = int(db_str)
database = int(db_str) if ":" in url:
else: host, port_str = url.rsplit(":", 1)
host_port = url
if ":" in host_port:
host, port_str = host_port.rsplit(":", 1)
if port_str:
port = int(port_str) port = int(port_str)
else: else: host = url
host = host_port return RedisSettings(host=host, port=port, password=password, database=database)
return RedisSettings(
host=host or "localhost",
port=port,
password=password,
database=database,
)
class WorkerSettings: class WorkerSettings:
"""Configuration du worker ARQ."""
functions = [func(process_image_task, name="process_image_task")] functions = [func(process_image_task, name="process_image_task")]
redis_settings = _parse_redis_settings() redis_settings = _parse_redis_settings()
max_jobs = settings.WORKER_MAX_JOBS queue_name = DEFAULT_QUEUE_NAME
job_timeout = settings.WORKER_JOB_TIMEOUT
retry_jobs = True
max_tries = settings.WORKER_MAX_TRIES
queue_name = "standard" # Queue par défaut
on_startup = on_startup on_startup = on_startup
on_shutdown = on_shutdown max_jobs = 10
on_job_start = on_job_start job_timeout = 300
on_job_end = on_job_end
# Le worker écoute les deux queues
queues = ["standard", "premium"]


@@ -1,5 +1,3 @@
-version: "3.9"
-
 services:
   backend:
     build: .

@@ -10,9 +8,13 @@ services:
     env_file:
       - .env
     environment:
-      - DATABASE_URL=sqlite+aiosqlite:///./data/imago.db
+      - DATABASE_URL=postgresql+asyncpg://imago:imago@db:5432/imago
+      - REDIS_URL=redis://redis:6379/0
     depends_on:
-      - redis
+      db:
+        condition: service_healthy
+      redis:
+        condition: service_healthy
     restart: unless-stopped
     healthcheck:
       test: ["CMD", "curl", "-f", "http://localhost:8000/health"]

@@ -20,6 +22,21 @@ services:
       timeout: 10s
       retries: 3

+  db:
+    image: postgres:16-alpine
+    environment:
+      POSTGRES_USER: imago
+      POSTGRES_PASSWORD: imago
+      POSTGRES_DB: imago
+    volumes:
+      - postgres_data:/var/lib/postgresql/data
+    healthcheck:
+      test: ["CMD-SHELL", "pg_isready -U imago -d imago"]
+      interval: 5s
+      timeout: 5s
+      retries: 5
+    restart: unless-stopped
+
   redis:
     image: redis:7-alpine
     ports:

@@ -42,10 +59,13 @@ services:
     env_file:
       - .env
     environment:
-      - DATABASE_URL=sqlite+aiosqlite:///./data/imago.db
+      - DATABASE_URL=postgresql+asyncpg://imago:imago@db:5432/imago
+      - REDIS_URL=redis://redis:6379/0
     depends_on:
-      - backend
-      - redis
+      db:
+        condition: service_healthy
+      redis:
+        condition: service_healthy
     restart: unless-stopped

   minio:

@@ -62,5 +82,6 @@ services:
     restart: unless-stopped

 volumes:
+  postgres_data:
   redis_data:
   minio_data:
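`condition: service_healthy` makes Compose hold back the backend and worker containers until the Postgres and Redis healthchecks pass, but it only covers start-up; if the database restarts later, the application still needs its own retry. A generic sketch of such an app-level guard (hypothetical helper, not in the repo; the flaky probe simulates a database that is still booting):

```python
import asyncio

async def wait_until_ready(probe, attempts: int = 5, base_delay: float = 0.01) -> int:
    """Retry `probe()` with exponential backoff; return the attempt that succeeded.

    App-level counterpart of compose's `condition: service_healthy`."""
    for attempt in range(1, attempts + 1):
        try:
            await probe()
            return attempt
        except ConnectionError:
            if attempt == attempts:
                raise  # give up after the last attempt
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))

calls = {"n": 0}

async def flaky_probe():
    # Fails twice, then succeeds -- simulates Postgres still starting up.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("db not ready")

print(asyncio.run(wait_until_ready(flaky_probe)))
```

In production the probe would be a real connection attempt (e.g. `asyncpg.connect(...)` or a Redis `PING`).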


@@ -7,6 +7,7 @@ python-multipart==0.0.9
 sqlalchemy==2.0.35
 alembic==1.13.3
 aiosqlite==0.20.0
+asyncpg==0.29.0

 # Validation
 pydantic==2.9.2; python_version < "3.14"
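`asyncpg` is the driver SQLAlchemy's async engine needs for the `postgresql+asyncpg://` scheme, just as `aiosqlite` backs `sqlite+aiosqlite://`. If a plain (driver-less) URL can ever come in from the environment, a small normalizer avoids a sync-driver mismatch at engine creation. This helper is an illustration only, not part of the repo:

```python
def to_async_url(url: str) -> str:
    """Rewrite a plain SQLAlchemy URL to its async-driver form.

    Only the two schemes used by this project are handled; anything
    else is returned unchanged."""
    if url.startswith("postgresql://"):
        return url.replace("postgresql://", "postgresql+asyncpg://", 1)
    if url.startswith("sqlite://"):
        return url.replace("sqlite://", "sqlite+aiosqlite://", 1)
    return url

print(to_async_url("postgresql://imago:imago@db:5432/imago"))
print(to_async_url("sqlite:///./data/imago.db"))
```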


@@ -8,7 +8,12 @@ les tâches de pipeline image (EXIF → OCR → AI).
 """
 import asyncio
 from arq import run_worker
+
+from app.config import settings
+from app.logging_config import configure_logging
 from app.workers.image_worker import WorkerSettings

+# Configure le logging dès l'import
+configure_logging(debug=settings.DEBUG)

 if __name__ == "__main__":
-    asyncio.run(run_worker(WorkerSettings))
+    run_worker(WorkerSettings)
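On the API side, work for this entrypoint is submitted through arq's `enqueue_job`, matched by the registered function name. A duck-typed sketch of that call path (`enqueue_image` and `FakePool` are hypothetical; the fake stands in for arq's `ArqRedis` so the example runs without Redis):

```python
import asyncio

async def enqueue_image(pool, image_id: int, client_id: str):
    # ARQ dispatches on the registered name ("process_image_task");
    # since the worker listens on DEFAULT_QUEUE_NAME, no queue override is needed.
    return await pool.enqueue_job("process_image_task", image_id, client_id)

class FakePool:
    """Stand-in for arq's ArqRedis: records calls instead of hitting Redis."""

    def __init__(self):
        self.jobs = []

    async def enqueue_job(self, name, *args):
        self.jobs.append((name, args))
        return f"job:{len(self.jobs)}"

pool = FakePool()
job_id = asyncio.run(enqueue_image(pool, 42, "client-a"))
print(job_id, pool.jobs)
```

With a real connection, `pool` would come from `arq.create_pool(WorkerSettings.redis_settings)` inside the FastAPI app.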