PaySim labeled-eval
Generates a PaySim-shaped synthetic transaction sample (the Lopez-Rojas et al. mobile-money fraud benchmark from Kaggle), fires each row through Ledgix as a real clearance request, and scores the verdicts against the ground-truth isFraud label. Surfaces a confusion matrix plus accuracy / precision / recall / F1.
The default sample is 5% fraud, oversampled from PaySim's empirical 0.13% so the matrix has enough positive cases to score in 200 to 1000 rows. The sample is reproducible from the seed below: the same seed always generates the same population.
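A minimal sketch of how a seeded, oversampled sample can be made reproducible. This is not the generator used by the eval itself: `make_sample`, its parameters, and the reduced column set (`step`, `type`, `amount`, `isFraud`) are illustrative assumptions, and the amount range is arbitrary. The only PaySim facts relied on are the five transaction types and that fraud appears only in TRANSFER and CASH_OUT rows.

```python
import random

# PaySim transaction types; fraud occurs only in the first two.
FRAUD_TYPES = ["TRANSFER", "CASH_OUT"]
ALL_TYPES = FRAUD_TYPES + ["CASH_IN", "DEBIT", "PAYMENT"]


def make_sample(n_rows, fraud_rate=0.05, seed=42):
    """Return n_rows synthetic rows; the same seed yields the same rows.

    Hypothetical helper for illustration only.
    """
    # A private Random instance keeps the draw independent of global RNG state.
    rng = random.Random(seed)
    rows = []
    for step in range(n_rows):
        is_fraud = rng.random() < fraud_rate
        tx_type = rng.choice(FRAUD_TYPES if is_fraud else ALL_TYPES)
        rows.append({
            "step": step,
            "type": tx_type,
            "amount": round(rng.uniform(1.0, 10_000.0), 2),  # arbitrary range
            "isFraud": int(is_fraud),
        })
    return rows
```

Calling `make_sample(500, seed=42)` twice returns identical lists, which is what makes a before/after comparison on the same seed meaningful.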
For best recall, load the AML policy pack (6 Markdown files under policies/aml/) into the tenant. The button below pushes the current text from the repo to the demo tenant.
Download the AML policy pack (5 PDFs)
- wire-transfer-thresholds-v2.pdf
- account-drainage-aml-v1.pdf ← this one catches most fraud
- cash-out-review-v1.pdf
- merchant-payment-routing-v1.pdf
- velocity-controls-v1.pdf
Drop these into your tenant via /dashboard/policies, then re-run the eval with the same seed (42) and compare the before/after confusion matrix.
Configuration
Latency & throughput
Pipeline timings
Confusion matrix
A 3×3 matrix with denied / held-for-review / approved on both axes. The vertical axis is what the AML policy SHOULD output for the row (drainage signature → deny, mid-tier amount → review, under-threshold → approve). The horizontal axis is what the LLM judge actually predicted. Diagonal cells are correct verdicts; off-diagonal cells are errors of varying severity. Rows that errored or were rate-limited are excluded from scoring.
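The tallying described above can be sketched as follows. This is an illustration under assumptions, not the eval's own code: the function names are hypothetical, and any predicted value outside the three verdict labels (for example an assumed `"error"` or `"rate_limited"` marker) is treated as an excluded row.

```python
LABELS = ["denied", "held-for-review", "approved"]


def confusion_matrix(pairs):
    """Tally (expected, predicted) pairs into a 3x3 matrix.

    Rows of the matrix are expected verdicts, columns are predicted verdicts.
    Pairs whose predicted value is not a verdict label are counted as excluded.
    """
    index = {label: i for i, label in enumerate(LABELS)}
    matrix = [[0] * len(LABELS) for _ in LABELS]
    excluded = 0
    for expected, predicted in pairs:
        if predicted not in index:
            excluded += 1  # errored or throttled row: not scored
            continue
        matrix[index[expected]][index[predicted]] += 1
    return matrix, excluded


def accuracy(matrix):
    """Share of scored rows that land on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0
```

Excluded rows never reach the denominator, so a burst of rate-limiting lowers the number of scored rows without changing the accuracy of the rows that were scored.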
Outcomes by class
Row diagnostics
Per-row reasons from the LLM judge. There is one tab per off-diagonal cell of the matrix (errors of varying severity), plus tabs for the diagonal hits (caught + appropriate review) and for excluded rows (errors + throttled). Excluded rows don't enter the matrix but are surfaced here so you can see what went wrong. The last 100 rows of each class are kept.
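One way to keep only the last 100 rows of each class is a bounded deque per matrix cell. The sketch below is an assumption about how such a buffer could work, not a description of the dashboard's implementation; `RowDiagnostics` and its methods are hypothetical names.

```python
from collections import defaultdict, deque

MAX_ROWS_PER_CLASS = 100


class RowDiagnostics:
    """Keeps the most recent judge reasons for each (expected, predicted) class."""

    def __init__(self, max_rows=MAX_ROWS_PER_CLASS):
        # deque(maxlen=N) silently drops the oldest entry once N is reached.
        self._buckets = defaultdict(lambda: deque(maxlen=max_rows))

    def record(self, expected, predicted, reason):
        self._buckets[(expected, predicted)].append(reason)

    def rows(self, expected, predicted):
        # .get avoids creating an empty bucket for classes never recorded.
        return list(self._buckets.get((expected, predicted), ()))
```

With this shape, a long run cannot grow memory without bound, at the cost of losing the earliest rows in any class that exceeds the cap.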