Eval-driven AI engineering for data-heavy SaaS

Your LLM bill crossed $20k/month and your eval harness is still a Notion doc.

Fixed-fee LLM cost & quality audits for B2B SaaS shipping AI features — with an eval harness you keep.

Fully remote · EU-based · async-first · US & EU clients

30–65%
Typical LLM cost-reduction target
1–2 wk
Fixed-fee delivery
≥0.85
Faithfulness locked as a contract acceptance criterion
<30d
Designed to pay back from cost wins

Who this is for

B2B SaaS companies at Series A–C with a shipped AI feature — customer-facing Q&A, internal copilot, or AI analytics — and an LLM bill north of $20k/month. Decision-maker: CTO, VP Engineering, or Head of AI.

CFO asks what the AI line item is buying

Hallucination incident in production

Pre-scaling: 10× traffic ahead, and cost growing faster than linearly

Pre-investor demo or due diligence

What we do

Productized, fixed-fee engagements. You always know the scope and the outcome.

Paid Discovery / Readout

You already know roughly what you want done. I scope it: feasibility, effort, go/no-go — before you commit to a full engagement.

  • 3–5 day structured technical review of your specific problem or planned engagement
  • Written readout: effort estimate, risks, and go / no-go recommendation
  • Full credit toward any follow-on engagement signed within 30 days
3–5 days

AI Opportunity Audit

You don’t yet know where AI or automation fits in your stack — I map the whole picture and show you where the ROI is. Mapped by a data engineer, not a generic AI consultant.

  • A map of your data flows and systems — ERP, warehouse, spreadsheets, manual workflows
  • Prioritized opportunity register: AI agents vs classic automation vs not-worth-it, scored by ROI × feasibility
  • A build roadmap you can act on — with us or your own team
  • An honest “don’t automate this” section
3–5 days
Most popular

LLM Cost & Quality Audit (LCQA)

1–2 week audit of your LLM features: where every dollar goes, and where quality is silently slipping.

~2 weeks

Most clients see payback in under 30 days from cost wins alone.

  • Cost breakdown per use case, model, and request component
  • Eval baseline — 50–200 question test set: faithfulness, answer relevance, accuracy@k, p50/p95 latency
  • Optimization roadmap targeting a 30–65% cost reduction (routing, caching, prompt compression, model tiering)
  • 8–12 page PDF report
  • Eval harness wired into your CI — regression alerts when faithfulness drops
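
To make the first two deliverables concrete, here is a minimal sketch of a cost breakdown and a latency baseline computed from request logs. The record fields, model names, and per-1k-token prices are hypothetical placeholders, and "request component" is assumed here to mean input versus output tokens; this is an illustration, not the LCQA toolkit itself.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical per-1k-token prices in USD; substitute your provider's real rates.
PRICE_PER_1K = {
    "large-model": {"input": 0.0100, "output": 0.0300},
    "small-model": {"input": 0.0005, "output": 0.0015},
}


def cost_breakdown(records):
    """Sum cost per (use_case, model, component), component = input or output tokens."""
    totals = defaultdict(float)
    for rec in records:
        price = PRICE_PER_1K[rec["model"]]
        for component in ("input", "output"):
            tokens = rec[f"{component}_tokens"]
            key = (rec["use_case"], rec["model"], component)
            totals[key] += tokens / 1000 * price[component]
    return dict(totals)


def latency_percentiles(records):
    """Return p50 and p95 latency in milliseconds (needs at least two records)."""
    latencies = [rec["latency_ms"] for rec in records]
    cuts = quantiles(latencies, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94]}


# Hypothetical log records, shaped as assumed above.
records = [
    {"use_case": "support_qa", "model": "large-model",
     "input_tokens": 2000, "output_tokens": 500, "latency_ms": 1200},
    {"use_case": "support_qa", "model": "small-model",
     "input_tokens": 1000, "output_tokens": 1000, "latency_ms": 300},
    {"use_case": "support_qa", "model": "large-model",
     "input_tokens": 1000, "output_tokens": 0, "latency_ms": 800},
]

breakdown = cost_breakdown(records)
percentiles = latency_percentiles(records)
```

Sorting the breakdown by cost, largest first, is usually where the routing and model-tiering candidates show up.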

Mini-Audits / Quick-Win Sprints

A productized, one-shot fix for when one specific thing is broken.

  • RAG Eval Suite Build — golden test set + faithfulness/recall scorers + CI regression hook + dashboard
  • PII Pipeline Quick-Build — Presidio-based detection and redaction, 8–12 entity types, precision/recall metrics published (engineering artifacts only — your legal team owns regulatory interpretation)
  • LLM Cost Quick-Win Sprint — trace one lever, ship the fix as 3 merged PRs
  • Code + short readout PDF for each
3–7 days
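
As an illustration of the precision/recall metrics mentioned for the PII pipeline, the sketch below scores predicted spans against a hand-labelled gold set, per entity type. It assumes exact span matching and a hypothetical tuple layout of (doc_id, start, end, entity_type); it does not call Presidio, and a real evaluation might use overlap-based matching instead.

```python
from collections import Counter


def pii_precision_recall(gold, predicted):
    """Per-entity precision and recall; gold and predicted are sets of
    (doc_id, start, end, entity_type) tuples, matched exactly."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for span in predicted:
        if span in gold:
            tp[span[3]] += 1   # detected and labelled
        else:
            fp[span[3]] += 1   # detected but not in the gold set
    for span in gold - predicted:
        fn[span[3]] += 1       # labelled but missed by the detector

    report = {}
    for entity in set(tp) | set(fp) | set(fn):
        predicted_total = tp[entity] + fp[entity]
        gold_total = tp[entity] + fn[entity]
        report[entity] = {
            "precision": tp[entity] / predicted_total if predicted_total else 0.0,
            "recall": tp[entity] / gold_total if gold_total else 0.0,
        }
    return report


# Hypothetical labelled spans and detector output.
gold = {("d1", 0, 5, "EMAIL"), ("d1", 10, 20, "PHONE"), ("d2", 3, 9, "EMAIL")}
predicted = {("d1", 0, 5, "EMAIL"), ("d2", 0, 4, "EMAIL"), ("d1", 10, 20, "PHONE")}

report = pii_precision_recall(gold, predicted)
```

For redaction, recall is typically the metric to watch: a missed entity leaks data, while a false positive only over-redacts.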

LLM Quality Maintenance Retainer

Managed eval-as-a-service for LLM features in production.

  • Scheduled eval runs on your golden test set
  • Drift alerts to Slack — faithfulness, accuracy, cost, and latency thresholds
  • Monthly PDF report
  • Incident response on drift alerts
  • Three tiers available: Lite — weekly eval + drift alerts · Standard — 2× weekly + incident response · Pro — continuous monitoring + quarterly deep-dive
Monthly retainer
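
A drift check of this kind can be sketched as a comparison of the latest eval run against per-metric thresholds. The metric names and every limit below except the 0.85 faithfulness figure are hypothetical, and the function only builds the alert messages; posting them to Slack is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float
    higher_is_better: bool


# Hypothetical limits; in practice they come from the agreed baseline.
THRESHOLDS = [
    Threshold("faithfulness", 0.85, True),
    Threshold("accuracy", 0.80, True),
    Threshold("cost_per_request_usd", 0.05, False),
    Threshold("p95_latency_ms", 3000.0, False),
]


def drift_alerts(run):
    """Return one message per metric that crossed its threshold in this run."""
    alerts = []
    for t in THRESHOLDS:
        value = run[t.metric]
        breached = value < t.limit if t.higher_is_better else value > t.limit
        if breached:
            direction = "below" if t.higher_is_better else "above"
            alerts.append(f"{t.metric}={value} is {direction} threshold {t.limit}")
    return alerts


# Hypothetical metrics from the latest scheduled eval run.
latest_run = {
    "faithfulness": 0.82,
    "accuracy": 0.90,
    "cost_per_request_usd": 0.04,
    "p95_latency_ms": 3500.0,
}

alerts = drift_alerts(latest_run)
```

An empty list means the run is within bounds; anything else is what would be sent to the alert channel.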

Observability dashboards show what happened. This runs the eval, alerts on drift, and writes the report.

Larger Builds — on request

Available for teams ready to build from scratch.

  • Production RAG MVP with eval harness wired into CI — 4–6 weeks
  • AI Analytics Copilot — natural-language Q&A on your data warehouse, RAG + text-to-SQL hybrid with eval — 6–8 weeks

Scope and fee discussed in discovery.

Every engagement is fixed-fee, scoped up front — you’ll have exact scope and pricing in a written proposal, usually within a couple of business days.

How we work

Free 20-min intro call

We talk through your stack, your LLM features, and where the pain is. No pitch; we check fit first.

Paid Discovery / Readout (optional but recommended)

Written findings in 3–5 days. The fee counts as full credit toward any follow-on engagement signed within 30 days.

Audit or Build

Fixed-fee engagement with metric-based acceptance criteria locked in the contract before we start (e.g. faithfulness ≥ 0.85, recall ≥ 0.95, cost cut ≥ 25%).

Optional Maintenance Retainer

After delivery, if you want ongoing drift detection and monthly reporting, the retainer keeps the eval running.

What productized means

  • Fixed fee: no hourly billing, no surprise invoices
  • Locked timeline: if requirements grow, we cut scope rather than extend the timeline
  • Measurable acceptance criteria in the contract (faithfulness ≥ 0.85, recall ≥ 0.95, cost reduction ≥ 25%)
  • ≤ 2h/day of your team’s time during delivery
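
For illustration, acceptance criteria like these can be checked mechanically at sign-off. The sketch below uses the three example thresholds quoted above; the function names, the use of a mean over per-question faithfulness scores, and the sample figures are assumptions rather than contract language.

```python
from statistics import mean

# Example thresholds quoted in the list above.
MIN_FAITHFULNESS = 0.85
MIN_RECALL = 0.95
MIN_COST_REDUCTION = 0.25


def cost_reduction(baseline_usd, current_usd):
    """Fractional cost cut relative to the baseline, e.g. 0.3 for a 30% cut."""
    return (baseline_usd - current_usd) / baseline_usd


def acceptance_report(faithfulness_scores, recall, baseline_usd, current_usd):
    """Evaluate each criterion and an overall 'accepted' flag."""
    checks = {
        "faithfulness": mean(faithfulness_scores) >= MIN_FAITHFULNESS,
        "recall": recall >= MIN_RECALL,
        "cost_reduction": cost_reduction(baseline_usd, current_usd) >= MIN_COST_REDUCTION,
    }
    checks["accepted"] = all(checks.values())
    return checks


# Hypothetical end-of-engagement numbers.
passing = acceptance_report([0.9, 0.8, 0.9], 0.96, 20000, 14000)
failing = acceptance_report([0.9, 0.8, 0.9], 0.96, 20000, 16000)
```

In the second call the monthly bill drops from $20,000 to $16,000, a 20% cut, so the cost criterion fails even though both quality criteria pass.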

About

Eval-driven AI engineer for data-heavy SaaS.

Pure data engineers can’t ship production RAG. Pure AI engineers can’t ship on a real data warehouse. The combination of data engineering at scale and eval-driven RAG is rare, and it is what data-heavy SaaS companies at Series A–C need in 2026.

Senior data engineer. Eight years building lakehouse data platforms and production GenAI evaluation pipelines: first petabyte-scale data infrastructure, then production RAG with measurable quality SLAs.

Fully remote, based in the EU. Async-first, US & EU clients, 1–2 video calls per week.

Solo practice — you work directly with the senior engineer on every engagement. No junior handoff, no account manager in the middle.

Engineering scope only — I build the artifacts; your legal and compliance team owns regulatory interpretation.

Public LCQA eval toolkit on GitHub →
Reproducible eval methodology + harness — inspect the code before you hire. MIT licensed.

Let’s talk

You’ll get a written proposal, usually within a couple of business days, or an honest no.

No sales sequences. No follow-up cadence. One human reads this.