
ML Governance

Production ML lifecycle: versioned model registry, PSI drift monitoring, canary deployment, and hybrid global+tenant models

ML Lifecycle Overview

QuickSecure operates a full production ML lifecycle. Every model is versioned, signed, canary-validated, drift-monitored, and auditable.

  1. Train: Labeled TP/FP/FN/TN data from the fleet. Feature vectors persisted during Shadow mode.
  2. Validate: Canary deployment to a subset of endpoints. FP rate compared against the active model.
  3. Promote: Canary passes thresholds. The model is promoted to Stable and the signed artifact distributed.
  4. Monitor: PSI drift scoring, confusion matrix, latency P95. Auto-retrain on threshold breach.

Versioned Model Registry

Every ONNX model artifact is tracked in the model registry with full metadata:

| Field | Description |
| --- | --- |
| Version | Semantic version string (e.g., 2.4.1) |
| ModelName | Human-readable model identifier |
| Status | Training, Canary, Stable, Rolled Back |
| IsSigned | SHA-256 signature verification status |
| Precision / Recall / F1 | Validation metrics at training time |
| FalsePositiveRate | Measured FP rate from production data |
| TrainingTrigger | What triggered training: Drift, FP Rate, Schedule, Manual |
| ParentVersion | Previous model this was trained from (lineage) |
| TenantId | null = global shared; non-null = tenant-dedicated |
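A registry entry with these fields can be sketched as a plain data class. The field names come from the table above; the class itself and the example values are illustrative, not the actual registry schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelVersionRecord:
    """Sketch of a registry row; field names mirror the table above."""
    Version: str                  # semantic version, e.g. "2.4.1"
    ModelName: str                # human-readable identifier
    Status: str                   # "Training" | "Canary" | "Stable" | "Rolled Back"
    IsSigned: bool                # SHA-256 signature verified
    Precision: float              # validation metrics at training time
    Recall: float
    F1: float
    FalsePositiveRate: float      # measured FP rate from production data
    TrainingTrigger: str          # "Drift" | "FP Rate" | "Schedule" | "Manual"
    ParentVersion: Optional[str]  # lineage; None for a root model
    TenantId: Optional[str]       # None = global shared, set = tenant-dedicated

# Example values are hypothetical.
record = ModelVersionRecord(
    Version="2.4.1", ModelName="process-anomaly", Status="Stable",
    IsSigned=True, Precision=0.97, Recall=0.93, F1=0.95,
    FalsePositiveRate=0.012, TrainingTrigger="Drift",
    ParentVersion="2.4.0", TenantId=None,
)
```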

Model Download API

```shell
# Get active model info (tenant-aware: dedicated model preferred, global fallback)
curl -X GET https://corxor.com/api/ml/model/latest \
  -H "X-Api-Key: YOUR_API_KEY" \
  -H "X-Tenant-Id: your-tenant-id"

# Download model artifact
curl -X GET "https://corxor.com/api/ml/model/download?version=2.4.1&type=onnx" \
  -H "X-Api-Key: YOUR_API_KEY" -o model.onnx
```

Drift Monitoring (PSI)

Population Stability Index (PSI) measures distribution shift between the training feature distribution and production inference distribution.

| PSI Range | Severity | Action |
| --- | --- | --- |
| < 0.10 | None | No action required |
| 0.10 – 0.25 | Low | Monitoring increased |
| 0.25 – 0.50 | Medium | Alert generated, retraining queued |
| > 0.50 | High | Automatic retraining triggered |
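PSI is computed per feature bin as the sum of (actual − expected) × ln(actual / expected). A minimal sketch of the score and the severity tiers above (bin handling and the epsilon guard are assumptions, not QuickSecure's exact implementation):

```python
import math

def psi(expected: list[float], actual: list[float], eps: float = 1e-6) -> float:
    """Population Stability Index over pre-binned distributions.
    expected = training-time bin proportions, actual = production proportions.
    eps guards against empty bins."""
    score = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        score += (a - e) * math.log(a / e)
    return score

def severity(score: float) -> str:
    """Map a PSI score to the action tiers from the table above."""
    if score < 0.10:
        return "None"
    if score < 0.25:
        return "Low"
    if score < 0.50:
        return "Medium"
    return "High"
```

Identical distributions score 0; the further the production distribution drifts from the training baseline, the higher the score.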

Dual-Output Drift Workers

Drift is computed both globally (all tenants combined) and per-tenant (isolated). This enables:

  • Global model health visibility across the entire fleet
  • Per-tenant drift detection for dedicated model training eligibility
  • ML_TENANT_TRAINING_ELIGIBLE audit event when a tenant has ≥500 events + elevated drift/FP rate

Drift snapshots are uploaded by endpoints every 6 hours with local confidence distribution statistics. The server computes PSI against the baseline.
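A drift snapshot might carry a binned confidence histogram the server can compare against its baseline. The payload shape and field names below are assumptions for illustration, not the actual wire format:

```python
import json
import statistics

def confidence_histogram(values: list[float], bins: int = 10) -> list[int]:
    """Bin [0, 1] confidence scores into equal-width bucket counts."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    return counts

confidences = [0.91, 0.88, 0.97, 0.42, 0.83]  # local inference confidences

# Hypothetical 6-hour snapshot payload.
snapshot = {
    "endpointId": "ep-001",
    "tenantId": "your-tenant-id",
    "windowHours": 6,
    "eventCount": len(confidences),
    "confidenceMean": round(statistics.mean(confidences), 4),
    "confidenceHistogram": confidence_histogram(confidences),
}
payload = json.dumps(snapshot)
```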

Canary Deployment

New model versions are deployed to a configurable subset of endpoints before fleet-wide promotion.

Canary Lifecycle

  • Traffic Split: A configurable percentage of endpoints receives the canary model
  • Monitoring: Every 30 minutes, MLCanaryMonitorWorker compares the canary FP rate against the active model's
  • Auto-Rollback: Triggered if the canary FP rate exceeds the absolute threshold (5%) or is more than 2x the active model's rate
  • Promotion: After sufficient events with healthy metrics, the canary is promoted to Stable

Per-Model Scope

Canary monitoring runs per-model: both global canaries and tenant-dedicated canaries are monitored independently. A tenant canary rollback does not affect the global model.

Minimum Events: At least 50 canary inference events are required before an auto-rollback decision is made. This prevents premature rollback on low-traffic tenants.
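The rollback rules above can be sketched as a single decision function. The 5% absolute threshold, 2x ratio, and 50-event floor come from this section; the function itself is an illustrative sketch of MLCanaryMonitorWorker's logic, not its actual code:

```python
def should_rollback(canary_fp: float, active_fp: float, canary_events: int,
                    abs_threshold: float = 0.05, ratio: float = 2.0,
                    min_events: int = 50) -> bool:
    """Auto-rollback decision: require at least min_events canary inferences,
    then roll back if the canary FP rate exceeds the absolute threshold or
    is more than ratio times the active model's rate."""
    if canary_events < min_events:
        return False  # not enough evidence yet; protects low-traffic tenants
    return canary_fp > abs_threshold or canary_fp > ratio * active_fp
```

Note that with too few events the function never rolls back, even on a bad FP rate, which is exactly the low-traffic protection described above.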

Hybrid Model Registry

The model registry supports both global shared models and tenant-dedicated models.

Resolution Order

  1. If tenant context is present and a tenant-dedicated model exists with Status = Stable or Canary → use tenant model
  2. Otherwise → fall back to global shared model
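The resolution order can be sketched as follows; registry rows are plain dicts here for illustration, and the function name is hypothetical:

```python
from typing import Optional

def resolve_model(registry: list[dict], tenant_id: Optional[str]) -> dict:
    """Tenant-aware resolution: prefer a dedicated model in Stable or Canary
    status, otherwise fall back to the global (TenantId = None) Stable model."""
    if tenant_id is not None:
        for m in registry:
            if m["TenantId"] == tenant_id and m["Status"] in ("Stable", "Canary"):
                return m
    return next(m for m in registry
                if m["TenantId"] is None and m["Status"] == "Stable")

registry = [
    {"Version": "2.4.1", "TenantId": None, "Status": "Stable"},
    {"Version": "1.2.0", "TenantId": "tenant-abc", "Status": "Canary"},
]
```

A request with tenant context "tenant-abc" resolves to the dedicated 1.2.0 canary; any other tenant, or no tenant context, falls back to the global 2.4.1 model.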

When Does a Tenant Get a Dedicated Model?

The MLDriftWorker automatically flags a tenant as training-eligible when:

  • ≥ 500 ML events in the last 30 days
  • Elevated FP rate or PSI drift score for that tenant

When both conditions hold, an ML_TENANT_TRAINING_ELIGIBLE audit event is generated.

Dedicated model training is then triggered by the ML operations team or automated pipeline.
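The eligibility check can be sketched as below. The 500-event floor is from this section; the 5% FP and 0.25 PSI cutoffs are assumptions for illustration (0.25 matches the "Medium" drift tier above):

```python
def training_eligible(event_count_30d: int, fp_rate: float, psi_score: float,
                      min_events: int = 500, fp_threshold: float = 0.05,
                      psi_threshold: float = 0.25) -> bool:
    """MLDriftWorker eligibility sketch: enough recent ML events plus an
    elevated FP rate or PSI drift score for the tenant."""
    if event_count_30d < min_events:
        return False
    return fp_rate >= fp_threshold or psi_score >= psi_threshold
```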

| ModelVersion.TenantId | Meaning | Served To |
| --- | --- | --- |
| null | Global shared model | All endpoints without a dedicated model |
| tenant-abc | Tenant-dedicated model | Only endpoints belonging to that tenant |

Zero breaking changes: Existing endpoints without tenant context continue to receive the global model. The hybrid registry is fully backward-compatible.

Background Workers

The ML governance cloud runs four background workers:

| Worker | Interval | Purpose |
| --- | --- | --- |
| MLMetricsAggregationWorker | 1 hour | Dual output: global + per-tenant confusion matrix, latency, decision counts |
| MLDriftWorker | 6 hours | PSI computation from drift snapshots; per-tenant training eligibility detection |
| MLCanaryMonitorWorker | 30 min | Monitors all canary models (global + tenant); auto-rollback on FP threshold breach |
| TenantBackfillWorker | Startup | One-time: populates TenantId on legacy ML rows via Endpoint → License chain |