Eval-driven AI engineering for data-heavy SaaS

Your LLM bill crossed $20k/month and your eval harness is still a Notion doc.

Fixed-fee LLM cost & quality audits for B2B SaaS shipping AI features — with an eval harness you keep.

Fully remote · EU-based · async-first · US & EU clients

Not sure where AI fits in your stack yet? See the AI Opportunity Audit →

30–65%
Typical cost-reduction opportunity we surface
1–2 wk
Fixed-fee delivery
≥0.85
Faithfulness gate you set — we wire it into your CI
<30d
Designed to pay back from cost wins

Built in the open — the eval toolkit I run in audits is public (MIT). Inspect the methodology before you hire.

View on GitHub →

Who this is for

B2B SaaS companies at Series A–C with a shipped AI feature — customer-facing Q&A, internal copilot, or AI analytics — and an LLM bill north of $20k/month. Decision-maker: CTO, VP Engineering, or Head of AI.

Or — you’re shipping AI features but not sure where the ROI is: ERP exports living in spreadsheets, manual workflows, data scattered across systems. The AI Opportunity Audit maps where AI and automation actually pay off before you build.

CFO asks what the AI line item is buying

Hallucination incident in production

Pre-scaling: expecting 10× traffic, with cost growing faster than linearly

Pre-investor demo or due diligence

What we do

Productized, fixed-fee engagements. You always know the scope and the outcome.

Start here · two ways in

Paid Discovery / Readout

You already know roughly what you want done. I scope it: feasibility, effort, go/no-go — before you commit to a full engagement.

  • 3–5 day structured technical review of your specific problem or planned engagement
  • Written readout: effort estimate, risks, and go / no-go recommendation
  • Full credit toward any follow-on engagement signed within 30 days
3–5 days

AI Opportunity Audit

You don’t yet know where AI or automation fits in your stack — I map the whole picture and show you where the ROI is. Mapped by a data engineer, not a generic AI consultant.

  • A map of your data flows and systems — ERP, warehouse, spreadsheets, manual workflows
  • Prioritized opportunity register: AI agents vs classic automation vs not-worth-it, scored by ROI × feasibility
  • A build roadmap you can act on — with us or your own team
  • An honest “don’t automate this” section
3–5 days
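To make "scored by ROI × feasibility" concrete, here is a minimal sketch of how an opportunity register could be ranked. The 1-5 scales, the field names, and the three example entries are assumptions made up for illustration, not the scoring rubric used in a real audit.

```python
from dataclasses import dataclass


@dataclass
class Opportunity:
    name: str
    kind: str           # "ai_agent", "automation", or "skip" (hypothetical labels)
    roi: float          # assumed 1-5 scale
    feasibility: float  # assumed 1-5 scale

    @property
    def score(self) -> float:
        # Simple product of the two scores: ROI x feasibility.
        return self.roi * self.feasibility


def rank(opportunities):
    """Return opportunities sorted from highest to lowest score."""
    return sorted(opportunities, key=lambda o: o.score, reverse=True)


# Illustrative entries only.
register = [
    Opportunity("Support-ticket triage agent", "ai_agent", roi=5, feasibility=3),
    Opportunity("Invoice matching from ERP exports", "automation", roi=4, feasibility=5),
    Opportunity("Auto-generated board decks", "skip", roi=2, feasibility=2),
]

ranked = rank(register)
for item in ranked:
    print(f"{item.score:>5.1f}  {item.kind:<10}  {item.name}")
```

A plain product keeps the ranking easy to explain: a high-ROI idea that is hard to build drops below a moderate-ROI idea that can ship quickly.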
Most popular

LLM Cost & Quality Audit (LCQA)

A 1–2 week audit of your LLM features: where every dollar goes, and where quality is silently slipping.

~2 weeks

Most clients see payback in under 30 days from cost wins alone.

  • Cost breakdown per use case, model, and request component
  • Eval baseline — 50–200 question test set: faithfulness, answer relevance, accuracy@k, p50/p95 latency
  • Optimization roadmap targeting a 30–65% cost reduction (routing, caching, prompt compression, model tiering)
  • 8–12 page PDF report
  • Eval harness wired into your CI — regression alerts when faithfulness drops
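As a rough illustration of a cost breakdown per use case, model, and request component, the sketch below aggregates token counts from a request log. The log schema, the model names, and the per-1K-token prices are all hypothetical; real prices and field names depend on your provider and logging setup.

```python
from collections import defaultdict

# Hypothetical USD prices per 1K tokens; not real provider pricing.
PRICES = {
    "large-model": {"input": 0.0100, "output": 0.0300},
    "small-model": {"input": 0.0005, "output": 0.0015},
}


def request_cost(record):
    """Split one request's cost into its components (assumed log schema)."""
    price = PRICES[record["model"]]
    return {
        "system_prompt": record["system_tokens"] / 1000 * price["input"],
        "retrieved_context": record["context_tokens"] / 1000 * price["input"],
        "user_input": record["user_tokens"] / 1000 * price["input"],
        "completion": record["output_tokens"] / 1000 * price["output"],
    }


def breakdown(records):
    """Sum cost by (use_case, model, component)."""
    totals = defaultdict(float)
    for record in records:
        for component, usd in request_cost(record).items():
            totals[(record["use_case"], record["model"], component)] += usd
    return dict(totals)


# Illustrative log entries only.
log = [
    {"use_case": "support_qa", "model": "large-model",
     "system_tokens": 1000, "context_tokens": 2000, "user_tokens": 100, "output_tokens": 500},
    {"use_case": "support_qa", "model": "large-model",
     "system_tokens": 1000, "context_tokens": 2000, "user_tokens": 100, "output_tokens": 500},
    {"use_case": "autocomplete", "model": "small-model",
     "system_tokens": 200, "context_tokens": 0, "user_tokens": 50, "output_tokens": 20},
]

totals = breakdown(log)
for key, usd in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"${usd:.4f}  {key}")
```

Splitting input cost into system prompt, retrieved context, and user input is what makes levers such as prompt compression or caching visible: in this made-up log, retrieved context is the largest single component.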

Mini-Audits / Quick-Win Sprints

Productized one-shot fix when one specific thing is broken.

  • RAG Eval Suite Build — golden test set + faithfulness/recall scorers + CI regression hook + dashboard
  • PII Pipeline Quick-Build — Presidio-based detection and redaction, 8–12 entity types, precision/recall metrics published (engineering artifacts only — your legal team owns regulatory interpretation)
  • LLM Cost Quick-Win Sprint — trace one lever, ship the fix as 3 merged PRs
  • Code + short readout PDF for each
3–7 days
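For the precision/recall metrics mentioned in the PII quick-build, a minimal sketch of span-level scoring could look like this. It assumes exact-match spans written as (start, end, entity_type) tuples and is independent of Presidio's own API; the example spans are invented.

```python
def precision_recall(predicted, gold):
    """Exact-match precision and recall over (start, end, entity_type) spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Convention assumed here: an empty side scores 1.0 rather than dividing by zero.
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    return precision, recall


def per_entity(predicted, gold):
    """Break the same metrics down by entity type."""
    types = {span[2] for span in predicted} | {span[2] for span in gold}
    return {
        entity: precision_recall(
            [s for s in predicted if s[2] == entity],
            [s for s in gold if s[2] == entity],
        )
        for entity in sorted(types)
    }


# Illustrative annotations for one document.
gold = [(0, 10, "EMAIL"), (15, 25, "PHONE"), (30, 40, "PERSON")]
predicted = [(0, 10, "EMAIL"), (15, 25, "PHONE"), (50, 55, "PERSON")]

overall = precision_recall(predicted, gold)
by_type = per_entity(predicted, gold)
print(overall)
print(by_type)
```

Exact matching is the strictest choice; a partial-overlap rule would score boundary errors more leniently and is a reasonable alternative.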

LLM Quality Maintenance Retainer

Managed eval-as-a-service for LLM features in production.

  • Scheduled eval runs on your golden test set
  • Drift alerts to Slack — faithfulness, accuracy, cost, and latency thresholds
  • Monthly PDF report
  • Incident response on drift alerts
  • Three tiers available: Lite — weekly eval + drift alerts · Standard — 2× weekly + incident response · Pro — continuous monitoring + quarterly deep-dive
Monthly retainer

Observability dashboards show what happened. This runs the eval, alerts on drift, and writes the report.
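A drift check of the kind described could be sketched as follows: compare the latest eval run against the mean of recent runs and flag metrics that moved the wrong way by more than a tolerance. The metric names, the 5% tolerance, and the sample numbers are assumptions; delivery to Slack is left out, so the function only returns the alert messages.

```python
from statistics import mean

# True means higher is better; False means lower is better (assumed metric names).
METRICS = {
    "faithfulness": True,
    "accuracy": True,
    "cost_usd": False,
    "p95_latency_ms": False,
}


def drift_alerts(history, latest, tolerance=0.05):
    """Return one message per metric that degraded by more than `tolerance`.

    Assumes every baseline mean is non-zero.
    """
    alerts = []
    for metric, higher_is_better in METRICS.items():
        baseline = mean(run[metric] for run in history)
        change = (latest[metric] - baseline) / baseline
        degraded = change < -tolerance if higher_is_better else change > tolerance
        if degraded:
            alerts.append(
                f"{metric}: {baseline:.3f} -> {latest[metric]:.3f} ({change:+.1%})"
            )
    return alerts


# Illustrative eval runs only.
history = [
    {"faithfulness": 0.90, "accuracy": 0.80, "cost_usd": 10.0, "p95_latency_ms": 1000.0},
    {"faithfulness": 0.90, "accuracy": 0.80, "cost_usd": 10.0, "p95_latency_ms": 1000.0},
]
latest = {"faithfulness": 0.80, "accuracy": 0.80, "cost_usd": 10.2, "p95_latency_ms": 1300.0}

alerts = drift_alerts(history, latest)
for message in alerts:
    print(message)
```

In this made-up run, faithfulness and p95 latency trip the alert, while the 2% cost increase stays inside the tolerance.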

Larger Builds — on request

Available for teams ready to build from scratch.

  • Production RAG MVP with eval harness wired into CI — 4–6 weeks
  • AI Analytics Copilot — natural-language Q&A on your data warehouse, RAG + text-to-SQL hybrid with eval — 6–8 weeks

Scope and fee discussed in discovery.

Every engagement is fixed-fee, scoped up front — you’ll have exact scope and pricing in a written proposal, usually within a couple of business days.

How we work

Free 20-min intro call

We talk through your stack, your LLM features, and where the pain is. No pitch: we check fit first.

Paid Discovery / Readout (optional but recommended)

Written findings delivered in 3–5 days. Credited in full toward any follow-on engagement signed within 30 days.

Audit or Build

Fixed-fee engagement with clear acceptance criteria locked in the contract before we start. Audits: baseline measured, eval harness wired, roadmap in your hands. Builds: metric targets on the system we build together (e.g. faithfulness ≥ 0.85, cost cut ≥ 25%).
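Acceptance criteria like the examples above (faithfulness ≥ 0.85, cost cut ≥ 25%) can be expressed as a small check that CI runs after the eval suite. This is a sketch under assumptions: the result keys and the `main` entry point are hypothetical, and the thresholds are whatever the contract fixes.

```python
def acceptance_failures(results, min_faithfulness=0.85, min_cost_cut=0.25):
    """List every acceptance criterion the eval results fail to meet."""
    failures = []
    if results["faithfulness"] < min_faithfulness:
        failures.append(
            f"faithfulness {results['faithfulness']:.2f} < {min_faithfulness:.2f}"
        )
    # Cost cut as a fraction of the baseline monthly cost.
    cost_cut = 1 - results["cost_after_usd"] / results["cost_before_usd"]
    if cost_cut < min_cost_cut:
        failures.append(f"cost cut {cost_cut:.0%} < {min_cost_cut:.0%}")
    return failures


def main(results):
    """Return a process exit code: 0 if all criteria pass, 1 otherwise."""
    failures = acceptance_failures(results)
    for failure in failures:
        print(f"FAIL: {failure}")
    return 1 if failures else 0


# Illustrative numbers only.
passing = {"faithfulness": 0.88, "cost_before_usd": 20000, "cost_after_usd": 14000}
failing = {"faithfulness": 0.82, "cost_before_usd": 20000, "cost_after_usd": 16000}

print(main(passing))
print(main(failing))
```

In a CI job, the script would end with `sys.exit(main(results))`, so a non-zero exit code fails the pipeline.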

Optional Maintenance Retainer

Post-delivery, if you want ongoing drift detection and monthly reporting — the retainer keeps the eval running.

What productized means

  • Fixed fee — no hourly billing, no surprise invoices
  • Locked timeline: if scope expands, we cut scope rather than extend the timeline
  • Clear acceptance criteria in the contract — audits: baseline measured, eval harness wired, roadmap delivered; builds: metric targets on the system we build with you (faithfulness, recall@k, cost reduction)
  • ≤ 2h/day of your team’s time during delivery

About

Eval-driven AI engineer for data-heavy SaaS.

Pure data engineers can’t ship production RAG. Pure AI engineers can’t ship on a real data warehouse. The combination of data engineering at scale and eval-driven RAG is rare, and it’s what data-heavy SaaS companies at Series A–C need in 2026.

Senior data engineer. 8 years building Lakehouse-scale data platforms and production GenAI evaluation pipelines — petabyte-scale data infrastructure, then production RAG with measurable quality SLAs.

Fully remote, based in the EU. Async-first, US & EU clients, 1–2 video calls per week.

Solo practice — you work directly with the senior engineer on every engagement. No junior handoff, no account manager in the middle.

Engineering scope only — I build the artifacts; your legal and compliance team owns regulatory interpretation.

Public LCQA eval toolkit on GitHub →
Reproducible eval methodology + harness — inspect the code before you hire. MIT licensed.

Let’s talk

You’ll get a written proposal within a few days — or an honest no.

No sales sequences. No follow-up cadence. One human reads this.