PaySim labeled-eval

Generates a PaySim-shaped synthetic transaction sample (the Lopez-Rojas et al. mobile-money fraud benchmark from Kaggle), fires each row through Ledgix as a real clearance request, and scores the verdicts against the ground-truth isFraud label. Surfaces a confusion matrix plus accuracy / precision / recall / F1.

The default sample is 5% fraud, oversampled from PaySim's empirical 0.13% so that a run of 200-1000 rows contains enough positive cases to score. The sample is reproducible: the same seed (set below) always generates the same population.
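As a rough illustration of seed-based reproducibility, here is a minimal sketch of a seeded generator. It is not the eval's actual generator: the function name make_sample, the amount range, and the reduced column set are assumptions made for illustration. Only the column names step, type, amount and isFraud and the transaction types come from the PaySim schema.

```python
import random

# PaySim transaction types (the real dataset has these five).
TX_TYPES = ["CASH_IN", "CASH_OUT", "DEBIT", "PAYMENT", "TRANSFER"]


def make_sample(n_rows: int, seed: int, fraud_rate: float = 0.05) -> list[dict]:
    """Build a toy labelled sample; the same seed yields the same rows.

    Hypothetical helper: amounts and types are drawn uniformly, which is
    an assumption and does not reproduce PaySim's real distributions.
    """
    rng = random.Random(seed)  # private RNG, global random state untouched
    rows = []
    for step in range(n_rows):
        rows.append({
            "step": step,
            "type": rng.choice(TX_TYPES),
            "amount": round(rng.uniform(10, 500_000), 2),
            "isFraud": int(rng.random() < fraud_rate),  # oversampled label
        })
    return rows


sample_a = make_sample(500, seed=42)
sample_b = make_sample(500, seed=42)
print(sample_a == sample_b)  # identical populations for identical seeds
```

Using a private random.Random instance rather than the module-level functions keeps the population stable even if other code draws random numbers in between.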

For best recall, load the AML policy pack (5 Markdown files under policies/aml/) into the tenant. The button below pushes the current text from the repo to the demo tenant.

Download the AML policy pack (5 PDFs)

wire-transfer-thresholds-v2.pdf
account-drainage-aml-v1.pdf ← this one catches most fraud
cash-out-review-v1.pdf
merchant-payment-routing-v1.pdf
velocity-controls-v1.pdf

Drop these into your tenant via /dashboard/policies, then re-run the eval with the same seed (42) and compare the before/after confusion matrix.

Configuration

Sample size: capped at 5000.

Batch size: parallel per interval.

Interval (ms): effective 1 req/s.

Seed: same seed → same population.
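The "effective" request rate shown next to the interval presumably follows from batch size and interval. A minimal sketch of that arithmetic, assuming one batch is dispatched per interval; the helper names effective_rate and clamp_sample_size are hypothetical:

```python
SAMPLE_SIZE_CAP = 5000  # cap stated in the configuration panel


def effective_rate(batch_size: int, interval_ms: int) -> float:
    """Requests per second when batch_size requests fire every interval_ms."""
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    return batch_size / (interval_ms / 1000)


def clamp_sample_size(requested: int, cap: int = SAMPLE_SIZE_CAP) -> int:
    """Keep the requested sample size within 1..cap."""
    return max(1, min(requested, cap))


print(effective_rate(batch_size=1, interval_ms=1000))  # 1.0 req/s
```

Under this assumption, a batch size of 5 at a 500 ms interval would come out at 10 req/s.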

Accuracy

—

Precision

—

of HARD-denied, % were fraud

Recall

—

of fraud, % caught (deny+review)

Specificity

—

of clean, % auto-approved

F1

—

harmonic mean (strict)

Review rate

—

share routed to humans
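The card captions above can be read as the following formulas. This is a sketch under assumptions, not the dashboard's code: the verdict strings deny / review / approve and the function name score are hypothetical, and "strict" is interpreted here as counting only hard denials as catches when computing F1.

```python
def score(rows):
    """rows: iterable of (is_fraud: bool, verdict: str) pairs.

    verdict is assumed to be one of "deny", "review", "approve".
    Returns None for any metric whose denominator is zero.
    """
    rows = list(rows)
    fraud = [v for f, v in rows if f]       # verdicts on fraud rows
    clean = [v for f, v in rows if not f]   # verdicts on clean rows
    denied = [f for f, v in rows if v == "deny"]  # labels of hard-denied rows

    def ratio(num, den):
        return num / den if den else None

    precision = ratio(sum(denied), len(denied))            # of hard-denied, % fraud
    recall = ratio(sum(v in ("deny", "review") for v in fraud), len(fraud))
    strict_recall = ratio(fraud.count("deny"), len(fraud))  # assumption: deny only
    specificity = ratio(clean.count("approve"), len(clean))
    review_rate = ratio(sum(v == "review" for _, v in rows), len(rows))

    if precision is None or strict_recall is None:
        f1 = None
    elif precision + strict_recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * strict_recall / (precision + strict_recall)

    return {
        "precision": precision, "recall": recall, "specificity": specificity,
        "f1_strict": f1, "review_rate": review_rate,
    }
```

Returning None instead of 0 for an empty denominator mirrors the "—" placeholder the cards show before any rows have settled.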

Latency & throughput

Awaiting first row…

Throughput

—

settled rows / sec (incl. errors)

Mean latency

—

end-to-end /request-clearance

p50

—

median round-trip

p95

—

tail at 95th percentile

p99

—

tail at 99th percentile (5 of 500 rows)

In-flight

—

dispatched, not yet settled
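For reference, the latency cards can be derived from the per-row round-trip times. The sketch below uses the nearest-rank percentile definition, which is an assumption; the dashboard may interpolate instead. The names percentile and latency_summary are hypothetical.

```python
import math


def percentile(sorted_values, p: int):
    """Nearest-rank percentile of an ascending list (p is an integer 1..100)."""
    if not sorted_values:
        raise ValueError("no latencies recorded yet")
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def latency_summary(latencies_ms):
    """Mean, p50, p95 and p99 of end-to-end latencies in milliseconds."""
    ordered = sorted(latencies_ms)
    return {
        "mean": sum(ordered) / len(ordered),
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
    }


summary = latency_summary(range(1, 501))  # 500 synthetic latencies: 1..500 ms
print(summary)
```

With 500 rows, the nearest-rank p99 is the 495th smallest value, so exactly 5 rows sit above it.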

Pipeline timings

Awaiting settled rows.

Confusion matrix

3×3 matrix. Both axes are denied / held-for-review / approved. The vertical axis is what the AML policy SHOULD output for the row (drainage signature → deny, mid-tier amount → review, under-threshold → approve). The horizontal axis is what the LLM judge actually predicted. Diagonal = correct; off-diagonal = errors of varying severity. Errors and rate-limits are excluded from scoring.
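A minimal sketch of how such a matrix could be tallied, assuming each settled row carries an expected and a predicted label. The label strings and the function names confusion_matrix and accuracy are hypothetical; any prediction outside the three labels (for example an error or a rate-limit) is counted as excluded rather than scored.

```python
from collections import Counter

LABELS = ("deny", "review", "approve")  # assumed label strings, same order on both axes


def confusion_matrix(rows):
    """rows: iterable of (expected, predicted) pairs.

    Returns (matrix, excluded): matrix[i][j] counts rows whose expected
    label is LABELS[i] and whose predicted label is LABELS[j].
    """
    counts = Counter()
    excluded = 0
    for expected, predicted in rows:
        if predicted not in LABELS:  # e.g. "error", "rate_limited"
            excluded += 1
            continue
        counts[(expected, predicted)] += 1
    matrix = [[counts[(e, p)] for p in LABELS] for e in LABELS]
    return matrix, excluded


def accuracy(matrix):
    """Share of scored rows on the diagonal, or None if nothing was scored."""
    total = sum(map(sum, matrix))
    if not total:
        return None
    return sum(matrix[i][i] for i in range(len(matrix))) / total
```

Rows are the vertical axis (what the policy should output) and columns the horizontal axis (what the judge predicted), matching the description above.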

Run an eval to populate the matrix.

Outcomes by class

Hard caught

Should deny → denied

Soft caught

Should deny → routed to review

Missed fraud

Should deny → approved

Over-denied

Should review → denied

Appropriate review

Should review → reviewed

Under-flagged

Should review → approved

Friction

Should approve → denied

Over-flagged

Should approve → reviewed

Cleared

Should approve → approved
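The nine outcome classes above map one-to-one onto the matrix cells, which can be written down as a lookup table. The label strings and the helper name classify are assumptions for illustration; the outcome names are taken from the list above.

```python
# (expected, predicted) -> outcome class name
OUTCOME_NAMES = {
    ("deny", "deny"): "Hard caught",
    ("deny", "review"): "Soft caught",
    ("deny", "approve"): "Missed fraud",
    ("review", "deny"): "Over-denied",
    ("review", "review"): "Appropriate review",
    ("review", "approve"): "Under-flagged",
    ("approve", "deny"): "Friction",
    ("approve", "review"): "Over-flagged",
    ("approve", "approve"): "Cleared",
}


def classify(expected: str, predicted: str) -> str:
    """Name the matrix cell a scored row falls into."""
    return OUTCOME_NAMES[(expected, predicted)]


print(classify("deny", "approve"))  # Missed fraud
```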

Row diagnostics

Per-row reasons from the LLM judge. One tab per off-diagonal cell of the matrix (errors of varying severity) plus tabs for the diagonal hits (caught + appropriate review) and for excluded rows (errors + throttled — these don't enter the matrix but are surfaced here so you can see what went wrong). Last 100 rows of each class are kept.
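One simple way to keep only the last 100 rows per class is a bounded deque per outcome. This is a sketch of that idea, not the page's implementation; the class name RowDiagnostics and its methods are hypothetical.

```python
from collections import defaultdict, deque

MAX_ROWS_PER_CLASS = 100  # retention limit stated above


class RowDiagnostics:
    """Keeps the most recent rows per outcome class, dropping the oldest."""

    def __init__(self, limit: int = MAX_ROWS_PER_CLASS):
        # deque(maxlen=...) discards from the left once the limit is reached
        self._buckets = defaultdict(lambda: deque(maxlen=limit))

    def record(self, outcome: str, row_id: int, reason: str) -> None:
        self._buckets[outcome].append((row_id, reason))

    def rows(self, outcome: str) -> list:
        # .get avoids creating an empty bucket for classes never recorded
        return list(self._buckets.get(outcome, ()))


diag = RowDiagnostics()
for i in range(150):
    diag.record("Missed fraud", i, f"judge reason for row {i}")
print(len(diag.rows("Missed fraud")))  # 100
```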

No missed-fraud rows yet (should-deny → approved).