Anonymised framing. This page refers to the comparison vendor as Provider P — a top-tier KYT vendor with a paid REST API. The provider’s real name will appear here after written counsel sign-off.

What we measure

For every screened address we compare two verdicts on the same address, fetched fresh (cache-bypass) from both engines on the same UTC day:
  • Aegis verdict — Tier 0 (SDN) + Tier 1–2 (multi-source consensus) + Tier 3 (1-hop inheritance) + Tier 4 (BFS / flow exposure). Same code path that powers our /check-address and /v2/check-address endpoints.
  • Provider P verdict — the vendor’s address screen endpoint.
We map each verdict into one of two bands:
| Band | Aegis levels in band | Provider P levels in band |
| --- | --- | --- |
| ACTIONABLE | sanctioned, critical, high | the vendor’s “block / review” tier |
| CLEAR | medium, low, none | the vendor’s “allow / low” tier |
A case matches when both engines land in the same band. Provider errors and unsupported networks are excluded from the denominator. This is the same two-band design used internally for our reconciliation harness and the drift gate that watches for regressions over time.
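The two-band rule above can be sketched in a few lines. This is an illustrative sketch, not the harness itself: the function names are hypothetical, and the provider verdict is assumed to arrive already mapped to a band (or as `None` for a provider error or unsupported network), because the vendor's level names are not published here.

```python
from typing import Iterable, Optional, Tuple

# Aegis levels per band, exactly as listed in the band table.
AEGIS_ACTIONABLE = {"sanctioned", "critical", "high"}
AEGIS_CLEAR = {"medium", "low", "none"}


def aegis_band(level: str) -> str:
    """Map an Aegis risk level to ACTIONABLE or CLEAR."""
    level = level.lower()
    if level in AEGIS_ACTIONABLE:
        return "ACTIONABLE"
    if level in AEGIS_CLEAR:
        return "CLEAR"
    raise ValueError(f"unknown Aegis level: {level!r}")


def is_match(aegis_level: str, provider_band: Optional[str]) -> Optional[bool]:
    """True/False when the case is scored, None when it is excluded.

    provider_band is None for provider errors and unsupported networks
    (an assumption about how exclusions are represented).
    """
    if provider_band is None:
        return None
    return aegis_band(aegis_level) == provider_band


def agreement(cases: Iterable[Tuple[str, Optional[str]]]) -> Tuple[int, int]:
    """Return (matched, denominator), skipping excluded cases."""
    scored = [m for m in (is_match(a, p) for a, p in cases) if m is not None]
    return sum(scored), len(scored)
```

Keeping exclusions as `None` rather than as a third band means a provider timeout can never be counted as either a match or a mismatch.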

Benchmark: 100 addresses, two strata (2026-05-29)

After a 48-row pilot we ran a deliberately bias-correcting 100-row benchmark. The pilot sample was drawn entirely from addresses Aegis already knows about — by construction it couldn’t surface anything Provider P knew that we didn’t. The benchmark fixes that.

Sample design

| Half | Net | n | Source | Aegis prior data? |
| --- | --- | --- | --- | --- |
| Stratified known | ETH | 25 | our consensus view (12 ACTIONABLE + 13 CLEAR) | Yes |
| Stratified known | BSC | 25 | same | Yes |
| Fresh real-traffic | TRON | 50 | Tronscan: recipients of >$1000 USDT transfers, last 24h, filtered to addresses NOT in our labels DB | No |
| Total | | 100 | | |
The fresh-traffic half is the methodologically critical novelty: by sampling from real-world flow instead of our own labels, we get a fair shot at finding addresses Provider P has tagged that Aegis has no opinion on yet, that is, blind spots. Both halves ran fresh (cache-bypass on Aegis; the provider’s live REST API on Provider P), same UTC day, through the same reconciliation harness.
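The fresh-traffic filter described in the sample design can be sketched as follows. Everything named here is hypothetical: the `Transfer` record, its field names, and the first-come selection order are assumptions for illustration, and the sketch does not show how rows are fetched from Tronscan.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Transfer:
    """Hypothetical record for one USDT transfer."""
    recipient: str
    amount_usdt: float
    timestamp: datetime


def fresh_sample(
    transfers: Iterable[Transfer],
    labelled: Set[str],
    now: datetime,
    limit: int = 50,
    min_usdt: float = 1000.0,
    window: timedelta = timedelta(hours=24),
) -> List[str]:
    """Pick recipients of large recent transfers that have no label yet."""
    seen: Set[str] = set()
    picked: List[str] = []
    for t in transfers:
        if t.amount_usdt <= min_usdt:
            continue  # keep only transfers above the threshold
        if not (now - window <= t.timestamp <= now):
            continue  # keep only the last 24 hours
        if t.recipient in labelled or t.recipient in seen:
            continue  # drop known addresses and duplicates
        seen.add(t.recipient)
        picked.append(t.recipient)
        if len(picked) == limit:
            break
    return picked
```

The exclusion against `labelled` is what removes the pilot's bias: an address can only enter this half if Aegis has no prior label on it.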

Headline numbers

| Metric | Value |
| --- | --- |
| Total cases | 100 |
| In-denominator (provider errors removed) | 96 |
| Two-band agreement | 66 / 96 = 68.8 % |
| Aegis-only ACTIONABLE (Aegis stricter) | 29 |
| Provider-only ACTIONABLE (Aegis blind spots) | 1 |
| Both ACTIONABLE (agreeing risky verdict) | 10 |
| Both CLEAR (agreeing clean verdict) | 56 |
| Excluded (Provider P poll timeout) | 4 |
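The headline figures follow directly from the four outcome cells plus the exclusions. The sketch below re-derives them from the counts reported above; the cell names and helper functions are illustrative and not taken from the harness.

```python
from collections import Counter
from typing import Optional


def classify(aegis_band: str, provider_band: Optional[str]) -> str:
    """Assign one case to an outcome cell (provider_band None = excluded)."""
    if provider_band is None:
        return "excluded"
    if aegis_band == provider_band:
        return "both_actionable" if aegis_band == "ACTIONABLE" else "both_clear"
    return "aegis_only" if aegis_band == "ACTIONABLE" else "provider_only"


def headline(counts: Counter) -> dict:
    """Roll outcome cells up into the headline metrics."""
    denominator = sum(v for k, v in counts.items() if k != "excluded")
    if denominator == 0:
        raise ValueError("no scored cases")
    agreed = counts["both_actionable"] + counts["both_clear"]
    return {
        "total": sum(counts.values()),
        "denominator": denominator,
        "agreed": agreed,
        "agreement": agreed / denominator,
    }


# Cell counts as reported in the headline table.
reported = Counter(
    both_actionable=10, both_clear=56, aegis_only=29, provider_only=1, excluded=4
)
summary = headline(reported)
print(f"{summary['agreed']} / {summary['denominator']} = {summary['agreement']:.1%}")
```

The four scored cells sum to the 96-case denominator, and the agreement rate counts only the two diagonal cells.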

Per-stratum breakdown

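| Stratum | n | denom | agreement | Aegis stricter | Blind spots |
| --- | --- | --- | --- | --- | --- |
| Stratified ACTIONABLE (ETH+BSC) | 24 | 24 | 41.7 % | 14 | 0 |
| Stratified CLEAR (ETH+BSC) | 26 | 26 | 96.2 % | 1 | 0 |
| Fresh unknown TRON | 50 | 46 | 67.4 % | 14 | 1 |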
What each line says, in plain English:
  • Stratified-CLEAR (96 %). When both engines see no signal, they almost always agree. The single mismatch in 26 is Aegis upgrading to sanctioned from OFAC-SDN where Provider P calls the address low — a calibration disagreement, not a coverage failure on either side.
  • Stratified-ACTIONABLE (42 %). This number is low by design: the stratum is cherry-picked from addresses Aegis already flags as risky, and we then ask whether Provider P agrees. In all 14 of the 24 cases where the engines disagree, Aegis is the stricter side: it surfaces Tether-blacklist enforcement, OFAC-SDN matches, attacker labels and sanctioned-exchange tags where Provider P returns none / low / medium. This is not a quality score for either side; what matters is the asymmetry.
  • Fresh-unknown TRON (67 %). Real-world traffic baseline. 31 of 46 both CLEAR (typical traffic — neither engine flags). 14 are Aegis-stricter (BFS / Tier-4 caught flow exposure on a fresh address with no direct label — inheritance from a sanctioned cluster). And one blind spot (below).
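As a consistency check, the per-stratum rows can be rolled back up into the headline. In the sketch below the agreed counts for the two stratified rows (10 and 25) are back-computed as denominator minus mismatches, since the table reports them only as percentages; the `Stratum` class is illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stratum:
    name: str
    n: int               # sampled addresses
    excluded: int        # provider timeouts removed from the denominator
    agreed: int          # both engines in the same band
    aegis_stricter: int  # Aegis ACTIONABLE, Provider P CLEAR
    blind_spots: int     # Provider P ACTIONABLE, Aegis CLEAR

    @property
    def denom(self) -> int:
        return self.n - self.excluded

    @property
    def agreement(self) -> float:
        return self.agreed / self.denom

    def is_consistent(self) -> bool:
        """Agreements plus both mismatch kinds must fill the denominator."""
        return self.agreed + self.aegis_stricter + self.blind_spots == self.denom


strata = [
    Stratum("Stratified ACTIONABLE (ETH+BSC)", 24, 0, 10, 14, 0),
    Stratum("Stratified CLEAR (ETH+BSC)", 26, 0, 25, 1, 0),
    Stratum("Fresh unknown TRON", 50, 4, 31, 14, 1),
]

total_agreed = sum(s.agreed for s in strata)
total_denom = sum(s.denom for s in strata)

for s in strata:
    print(f"{s.name}: {s.agreed} / {s.denom} = {s.agreement:.1%}")
print(f"Overall: {total_agreed} / {total_denom} = {total_agreed / total_denom:.1%}")
```

The strata sum to 66 agreements over 96 scored cases, 29 Aegis-stricter mismatches and 1 blind spot, which matches the headline table.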

The one blind spot we found

TRON  TNGLNPdHkmJtLuLsrnmKpzQDpM1pckJZYe
  Aegis (initial):  none  / source_count=0 / consensus_found=false
                    (ran all 5 tiers, no signal surfaced)
  Provider P:       high  (high_risk_exchange, risk_score=0.75)
A brand-new address (created the same UTC day as the benchmark), 7 total transactions, receiving two large USDT transfers ($200k from a high-volume wallet + $60k from a fresh address). Provider P had it tagged as a high-risk-exchange counterparty; we had no signal of any kind on it.

What we did about it. The reconciliation harness writes every successful provider verdict back into our label store the moment it sees it. Within minutes of the benchmark, this address had a labels row with risk_level=high / category=high_risk_exchange / sourced from the provider. The very next Aegis check on the address returned:
  Aegis (post-enrichment):  high  (high_risk_exchange)
                            primary_source_slug=<provider-verdicts>
                            source_count=1
Gap detected, gap closed, in the same benchmark run. This is the intended behaviour — every reconciliation surfaces gaps and closes them at the data layer.
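The write-back loop can be illustrated with a toy in-memory label store. This is a sketch under stated assumptions, not the production design: `LabelStore`, `aegis_check`, `reconcile`, the `provider-verdicts` slug and the severity ordering are all hypothetical, and the check shown here looks only at stored labels, whereas the real verdict also runs the consensus, inheritance and BFS tiers.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Assumed severity ordering, lowest to highest.
SEVERITY = ["none", "low", "medium", "high", "critical", "sanctioned"]


@dataclass(frozen=True)
class Label:
    risk_level: str
    category: str
    source_slug: str


class LabelStore:
    """Toy label store keyed by (network, address)."""

    def __init__(self) -> None:
        self._rows: Dict[Tuple[str, str], List[Label]] = {}

    def upsert(self, network: str, address: str, label: Label) -> None:
        rows = self._rows.setdefault((network, address), [])
        # One row per source: replace any earlier row from the same source.
        rows[:] = [r for r in rows if r.source_slug != label.source_slug]
        rows.append(label)

    def lookup(self, network: str, address: str) -> List[Label]:
        return list(self._rows.get((network, address), []))


def aegis_check(store: LabelStore, network: str, address: str) -> dict:
    """Label-only stand-in for a full check: highest-severity label wins."""
    rows = store.lookup(network, address)
    if not rows:
        return {"risk_level": "none", "category": None,
                "primary_source_slug": None, "source_count": 0}
    top = max(rows, key=lambda r: SEVERITY.index(r.risk_level))
    return {"risk_level": top.risk_level, "category": top.category,
            "primary_source_slug": top.source_slug, "source_count": len(rows)}


def reconcile(store: LabelStore, network: str, address: str,
              provider_verdict: Optional[Label]) -> Tuple[dict, dict]:
    """Check, write the provider verdict back (if any), then check again."""
    before = aegis_check(store, network, address)
    if provider_verdict is not None:  # failed provider calls write nothing
        store.upsert(network, address, provider_verdict)
    return before, aegis_check(store, network, address)
```

Run against the blind-spot address with a `high` / `high_risk_exchange` provider verdict, the first check returns `none` with `source_count=0` and the second returns `high` with `source_count=1`, mirroring the before and after output shown above.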

How to read these numbers

Aegis is a near-superset of Provider P’s actionable set. Across 96 in-denominator cases there are 29 Aegis-stricter mismatches and 1 the other way — and that one was closed within the same run by the feedback loop above.
Stratified-CLEAR agreement is 96 %. In the easy case — neither engine seeing risk — the two engines almost always agree.
BFS / Tier-4 catches what labels alone can’t. 14 of the 50 fresh TRON addresses (46 in-denominator) with no labels at all in either engine were flagged as ACTIONABLE by Aegis via flow-exposure inheritance from sanctioned clusters. Provider P returned CLEAR on these 14, which suggests it relies on direct identity tags more than flow propagation. This is a meaningful coverage difference in our favour.
The 68.8 % headline is not a quality score. It is a coverage map across a deliberately biased sample (half stratified from addresses Aegis already knows, including 24 cherry-picked ACTIONABLE; half fresh unknowns), meant to falsify the strict-superset claim, not to rate either engine. Read it as “where do we differ”, not “who is X % correct”.
n=100 is not a precision benchmark. It’s large enough to back the qualitative claims above (we close gaps within the run; BFS catches flow risk that labels don’t; calibration on clean addresses is essentially aligned). A published precision number would need ≥300 random addresses; we can run that when there’s a need.
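To see why a sample of this size supports qualitative claims but not a precision figure, it helps to look at the uncertainty around a proportion measured on 96 cases. The sketch below uses a standard Wilson score interval; it is purely illustrative, because the interval assumes a random sample and this benchmark is deliberately stratified.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95 %)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


lo_96, hi_96 = wilson_interval(66, 96)      # this benchmark's headline
lo_300, hi_300 = wilson_interval(206, 300)  # hypothetical: similar rate, n=300

print(f"n=96:  {lo_96:.3f} to {hi_96:.3f} (width {hi_96 - lo_96:.3f})")
print(f"n=300: {lo_300:.3f} to {hi_300:.3f} (width {hi_300 - lo_300:.3f})")
```

At n=96 the interval spans roughly 59 % to 77 %, close to 18 points wide; at n=300 with a similar rate it narrows to roughly 10 points.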
We don’t score Tier 4 (BFS / haircut) against Provider P. BFS is Aegis-only flow analysis with no direct equivalent on the provider side. Comparing them would be apples to oranges.

Cost & reproducibility

  • Provider P spend for this benchmark: $96 USD (4 of 100 cases were excluded as provider poll timeouts and not charged).
  • Aegis side: internal compute.
  • The next benchmark (randomly-sampled in the hundreds, same harness, same band design) will run when there’s an audience for a published precision number.
If you’re a design partner or prospect who wants to see the per-row attribution, the harness output JSON, or run the benchmark against your own address universe — get in touch.