PaySim labeled-eval

Generates a synthetic transaction sample shaped like PaySim (the Lopez-Rojas et al. mobile-money fraud benchmark, available on Kaggle), sends each row through Ledgix as a real clearance request, and scores the verdicts against the ground-truth isFraud label. Reports a confusion matrix plus accuracy, precision, recall, and F1.

The default sample is 5% fraud, oversampled from PaySim's empirical 0.13% so that the matrix has enough positive cases to score in 200-1000 rows. The sample is reproducible from the seed below: the same seed always generates the same population.
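As a rough illustration of seeded oversampling, the sketch below builds only the isFraud label column at a fixed fraud share. The function name `make_labels` and its parameters are hypothetical, not the generator this page actually runs; the real sample also carries the other PaySim-style transaction fields.

```python
import random

def make_labels(n_rows: int, fraud_rate: float = 0.05, seed: int = 42) -> list[int]:
    """Return n_rows isFraud labels (1 = fraud) with a fixed fraud share.

    Hypothetical helper: shows only why a fixed seed reproduces the sample.
    """
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    n_fraud = round(n_rows * fraud_rate)
    labels = [1] * n_fraud + [0] * (n_rows - n_fraud)
    rng.shuffle(labels)  # deterministic order for a given seed
    return labels
```

Calling `make_labels(500)` twice returns identical lists, which is what makes a before/after comparison on the same seed meaningful.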

For best recall, the AML policy pack (6 Markdown files under policies/aml/) needs to be loaded into the tenant. The button below pushes the current text from the repo to the demo tenant.

Download the AML policy pack (5 PDFs)

Drop these into your tenant via /dashboard/policies, then re-run the eval with the same seed (42) and compare the before/after confusion matrix.

Configuration

Accuracy
Precision
of HARD-denied, % were fraud
Recall
of fraud, % caught (deny+review)
Specificity
of clean, % auto-approved
F1
harmonic mean of precision and recall (strict)
Review rate
share routed to humans
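The tiles above can be derived from the 3×3 matrix roughly as follows. This is a sketch under stated assumptions, not the page's own scoring code: "fraud" is taken to mean should-deny rows, "clean" to mean should-approve rows, accuracy to mean the diagonal share, and F1 to mean the harmonic mean of the precision and recall exactly as labelled above. The names `LABELS`, `ratio`, and `score` are hypothetical.

```python
LABELS = ("deny", "review", "approve")

def ratio(num: float, den: float) -> float:
    """Safe division: an empty denominator scores 0.0 instead of raising."""
    return num / den if den else 0.0

def score(m: dict) -> dict:
    """m[expected][predicted] holds row counts for the 3x3 matrix."""
    total = sum(m[e][p] for e in LABELS for p in LABELS)
    hard_denied = sum(m[e]["deny"] for e in LABELS)
    reviewed = sum(m[e]["review"] for e in LABELS)
    # Precision: of hard-denied rows, the share that should have been denied.
    precision = ratio(m["deny"]["deny"], hard_denied)
    # Recall: of should-deny rows, the share denied or routed to review.
    recall = ratio(m["deny"]["deny"] + m["deny"]["review"], sum(m["deny"].values()))
    return {
        "accuracy": ratio(sum(m[c][c] for c in LABELS), total),
        "precision": precision,
        "recall": recall,
        # Assumption: "clean" means should-approve rows.
        "specificity": ratio(m["approve"]["approve"], sum(m["approve"].values())),
        "f1": ratio(2 * precision * recall, precision + recall),
        "review_rate": ratio(reviewed, total),
    }
```

Note the asymmetry in the tile definitions: precision counts only hard denials, while recall also credits rows routed to review.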

Latency & throughput

Awaiting first row…
Throughput
settled rows / sec (incl. errors)
Mean latency
end-to-end /request-clearance
p50
median round-trip
p95
tail at 95th percentile
p99
extreme tail: slowest 1% (5 of 500 rows)
In-flight
dispatched, not yet settled

Pipeline timings

Awaiting settled rows.

Confusion matrix

A 3×3 matrix. Both axes use the same classes: denied / held-for-review / approved. The vertical axis is what the AML policy should output for the row (drainage signature → deny, mid-tier amount → review, under-threshold → approve). The horizontal axis is what the LLM judge actually predicted. Diagonal cells are correct; off-diagonal cells are errors of varying severity. Rows that errored or were rate-limited are excluded from scoring.

Run an eval to populate the matrix.

Outcomes by class

Hard caught
0
Should deny → denied
Soft caught
0
Should deny → routed to review
Missed fraud
0
Should deny → approved
Over-denied
0
Should review → denied
Appropriate review
0
Should review → reviewed
Under-flagged
0
Should review → approved
Friction
0
Should approve → denied
Over-flagged
0
Should approve → reviewed
Cleared
0
Should approve → approved
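The nine counters above correspond one-to-one to the cells of the matrix. A minimal sketch of that mapping, assuming each settled row is reduced to an (expected, predicted) pair with the hypothetical labels "deny", "review", and "approve":

```python
from collections import Counter

# (expected, predicted) -> counter name, as listed on this page.
OUTCOMES = {
    ("deny", "deny"): "Hard caught",
    ("deny", "review"): "Soft caught",
    ("deny", "approve"): "Missed fraud",
    ("review", "deny"): "Over-denied",
    ("review", "review"): "Appropriate review",
    ("review", "approve"): "Under-flagged",
    ("approve", "deny"): "Friction",
    ("approve", "review"): "Over-flagged",
    ("approve", "approve"): "Cleared",
}

def tally(rows) -> Counter:
    """Count rows per outcome; pairs outside the 3x3 grid are skipped."""
    counts = Counter({name: 0 for name in OUTCOMES.values()})
    for expected, predicted in rows:
        name = OUTCOMES.get((expected, predicted))
        if name is not None:  # e.g. errored or throttled rows have no verdict
            counts[name] += 1
    return counts
```

Rows whose prediction is not one of the three verdicts fall through the lookup, mirroring how errors and throttled rows stay out of the matrix.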

Row diagnostics

Per-row reasons from the LLM judge. There is one tab per off-diagonal cell of the matrix (errors of varying severity), plus tabs for the diagonal hits (caught + appropriate review) and for excluded rows (errors + throttled). Excluded rows don't enter the matrix but are surfaced here so you can see what went wrong. The last 100 rows of each class are kept.
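Keeping only the most recent 100 rows per class is a natural fit for a bounded queue. The sketch below is one way to do it, not necessarily how this page stores its rows; `diagnostics` and `record` are hypothetical names.

```python
from collections import defaultdict, deque

KEEP = 100  # rows retained per class, per the note above

# One bounded buffer per class; the oldest row is evicted once the buffer is full.
diagnostics = defaultdict(lambda: deque(maxlen=KEEP))

def record(outcome: str, row_id: int, reason: str) -> None:
    """Store the judge's reason for a row under its outcome class."""
    diagnostics[outcome].append((row_id, reason))
```

Because each class has its own buffer, a burst of one kind of row cannot push rarer rows, such as missed fraud, out of view.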

No missed-fraud rows yet (should-deny → approved).