Supported task types

Updated 22 April 2026

We produce six categories of training and evaluation data:

RLHF preference data — Pairwise rankings and preference judgments scored on rubrics with written rationales.
Expert SFT datasets — Instruction-response pairs crafted and validated by domain experts.
Multilingual evaluation — Native-language scoring across 23 European languages.
Red-teaming and stress testing — Adversarial prompts and edge-case evaluation in business domains.
Reasoning benchmarks — Private evaluation sets for complex business, legal, and financial reasoning.
Long-form content evaluation — Quality scoring for reports, memos, proposals, and executive summaries.

If your task doesn't fit neatly into one of these, get in touch. We can scope custom evaluation workflows.
