ML Governance
Production ML lifecycle: versioned model registry, PSI drift monitoring, canary deployment, and hybrid global+tenant models
ML Lifecycle Overview
QuickSecure operates a full production ML lifecycle. Every model is versioned, signed, canary-validated, drift-monitored, and auditable.
Versioned Model Registry
Every ONNX model artifact is tracked in the model registry with full metadata:
| Field | Description |
|---|---|
| Version | Semantic version string (e.g., 2.4.1) |
| ModelName | Human-readable model identifier |
| Status | Training, Canary, Stable, Rolled Back |
| IsSigned | SHA-256 signature verification status |
| Precision / Recall / F1 | Validation metrics at training time |
| FalsePositiveRate | Measured FP rate from production data |
| TrainingTrigger | What triggered training: Drift, FP Rate, Schedule, Manual |
| ParentVersion | Previous model this was trained from (lineage) |
| TenantId | null = global shared, non-null = tenant-dedicated |
Model Download API
# Get active model info (tenant-aware: dedicated model preferred, global fallback)
curl -X GET https://corxor.com/api/ml/model/latest \
-H "X-Api-Key: YOUR_API_KEY" \
-H "X-Tenant-Id: your-tenant-id"
# Download model artifact
curl -X GET "https://corxor.com/api/ml/model/download?version=2.4.1&type=onnx" \
-H "X-Api-Key: YOUR_API_KEY" -o model.onnx
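After downloading, the artifact can be checked against a known SHA-256 digest before it is loaded. The sketch below is illustrative only: this page does not describe how the expected digest is published, so the `expected_hex` parameter and both function names are hypothetical.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large model files are never fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare the file digest with an expected digest (hypothetical input)."""
    actual = sha256_of_file(path)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(actual, expected_hex.strip().lower())
```

A caller would run `matches_expected("model.onnx", expected_hex)` and refuse to load the model when it returns `False`.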
Drift Monitoring (PSI)
Population Stability Index (PSI) measures distribution shift between the training feature distribution and production inference distribution.
| PSI Range | Severity | Action |
|---|---|---|
| < 0.10 | None | No action required |
| 0.10 – 0.25 | Low | Monitoring increased |
| 0.25 – 0.50 | Medium | Alert generated, retraining queued |
| > 0.50 | High | Automatic retraining triggered |
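As a rough illustration, PSI over pre-binned data is commonly computed as the sum over bins of (actual − expected) × ln(actual / expected), where both terms are bin proportions. The sketch below uses that standard formula and maps the score to the severities in the table. The function names, the `eps` floor for empty bins, and the handling of the boundary values 0.25 and 0.50 (the table leaves these ambiguous) are assumptions, not the MLDriftWorker implementation.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two binned distributions.

    expected_counts: per-bin counts from the training data
    actual_counts:   per-bin counts from production inference
    """
    if not expected_counts or len(expected_counts) != len(actual_counts):
        raise ValueError("inputs must have the same non-zero number of bins")
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    if e_total <= 0 or a_total <= 0:
        raise ValueError("each distribution needs at least one observation")
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Floor proportions at eps so an empty bin does not produce log(0).
        e_p = max(e / e_total, eps)
        a_p = max(a / a_total, eps)
        score += (a_p - e_p) * math.log(a_p / e_p)
    return score


def severity(score):
    """Map a PSI score to a severity label (boundary handling is assumed)."""
    if score < 0.10:
        return "None"
    if score < 0.25:
        return "Low"
    if score <= 0.50:
        return "Medium"
    return "High"
```

For example, `severity(psi([50, 50], [90, 10]))` falls in the High band, since a two-bin shift from 50/50 to 90/10 gives a PSI of roughly 0.88.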
Dual-Output Drift Workers
Drift is computed both globally (all tenants combined) and per-tenant (isolated). This enables:
- Global model health visibility across the entire fleet
- Per-tenant drift detection for dedicated model training eligibility
- An ML_TENANT_TRAINING_ELIGIBLE audit event when a tenant has ≥500 events + elevated drift/FP rate
Canary Deployment
New model versions are deployed to a configurable subset of endpoints before fleet-wide promotion.
Canary Lifecycle
- Traffic Split: Configurable percentage of endpoints receive the canary model
- Monitoring: Every 30 minutes, MLCanaryMonitorWorker compares canary FP rate vs. active model
- Auto-Rollback: If canary FP rate exceeds absolute threshold (5%) or is >2x the active model rate
- Promotion: After sufficient events with healthy metrics, canary is promoted to Stable
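The auto-rollback rule can be sketched as a single predicate. The 5% absolute threshold and the 2x relative factor come from the list above. The function name, parameter names, and strict greater-than comparisons are assumptions. This is not the MLCanaryMonitorWorker source.

```python
def should_rollback(canary_fp_rate, active_fp_rate,
                    abs_threshold=0.05, relative_factor=2.0):
    """Return True when the canary's false-positive rate breaches either limit.

    Rates are fractions in [0, 1], e.g. 0.05 for 5%.
    """
    # Absolute limit: the canary is unhealthy regardless of the baseline.
    if canary_fp_rate > abs_threshold:
        return True
    # Relative limit: the canary is more than relative_factor times worse
    # than the currently active model.
    return canary_fp_rate > relative_factor * active_fp_rate
```

Note that with an active FP rate of exactly zero, any non-zero canary rate trips the relative check in this sketch. A real monitor would likely also require a minimum event count before evaluating either limit.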
Per-Model Scope
Canary monitoring runs per-model: both global canaries and tenant-dedicated canaries are monitored independently. A tenant canary rollback does not affect the global model.
Hybrid Model Registry
The model registry supports both global shared models and tenant-dedicated models.
Resolution Order
- If tenant context is present and a tenant-dedicated model exists with Status = Stable or Canary → use tenant model
- Otherwise → fall back to global shared model
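The resolution order above can be approximated as follows. The `ModelVersion` class here is a pared-down, hypothetical stand-in for a registry record. Two behaviours are assumptions, since the source does not specify them: the first matching tenant model in list order wins, and the global fallback is the first global model with Status = Stable.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ModelVersion:
    version: str
    status: str                       # "Training", "Canary", "Stable", "Rolled Back"
    tenant_id: Optional[str] = None   # None = global shared model


SERVABLE_TENANT_STATUSES = {"Stable", "Canary"}


def resolve_model(models: Iterable[ModelVersion],
                  tenant_id: Optional[str] = None) -> Optional[ModelVersion]:
    """Pick the model to serve: tenant-dedicated first, then global."""
    models = list(models)
    if tenant_id is not None:
        for m in models:
            if m.tenant_id == tenant_id and m.status in SERVABLE_TENANT_STATUSES:
                return m
    # Fallback (assumed): first global model marked Stable.
    for m in models:
        if m.tenant_id is None and m.status == "Stable":
            return m
    return None
```

With a registry holding a Stable global model and a Stable model for `tenant-abc`, a request for `tenant-abc` gets the dedicated model, while a tenant whose only dedicated model is Rolled Back falls through to the global one.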
When Does a Tenant Get a Dedicated Model?
The MLDriftWorker automatically flags tenants as training-eligible when:
- ≥ 500 ML events in the last 30 days
- Elevated FP rate or PSI drift score for that tenant
- An ML_TENANT_TRAINING_ELIGIBLE audit event is generated
Dedicated model training is then triggered by the ML operations team or automated pipeline.
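A minimal sketch of the eligibility check might look like this. The 500-event minimum over 30 days comes from the list above. The cut-offs for an "elevated" FP rate and PSI score are not stated on this page, so the `fp_threshold` and `psi_threshold` defaults below are placeholder assumptions, as are the function and parameter names.

```python
def is_training_eligible(event_count_30d, fp_rate, psi_score,
                         min_events=500,
                         fp_threshold=0.05,    # assumed cut-off, not from the source
                         psi_threshold=0.25):  # assumed cut-off, not from the source
    """Return True when a tenant qualifies for dedicated model training."""
    # Too little data to train or evaluate a tenant-specific model.
    if event_count_30d < min_events:
        return False
    # Either signal alone is enough: elevated FP rate OR elevated drift.
    return fp_rate > fp_threshold or psi_score >= psi_threshold
```

A caller would emit the audit event when this returns `True`, for instance for a tenant with 800 events and a PSI of 0.30.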
| ModelVersion.TenantId | Meaning | Served To |
|---|---|---|
| null | Global shared model | All endpoints without a dedicated model |
| tenant-abc | Tenant-dedicated model | Only endpoints belonging to that tenant |
Background Workers
The ML governance cloud runs four background workers:
| Worker | Interval | Purpose |
|---|---|---|
| MLMetricsAggregationWorker | 1 hour | Dual output: global + per-tenant confusion matrix, latency, decision counts |
| MLDriftWorker | 6 hours | PSI computation from drift snapshots. Per-tenant training eligibility detection |
| MLCanaryMonitorWorker | 30 min | All canary models (global + tenant). Auto-rollback on FP threshold breach |
| TenantBackfillWorker | Startup | One-time: populates TenantId on legacy ML rows via Endpoint → License chain |