Name: LLM Evals and Fine-tuning
Price: 89 USD
Availability: InStock

Question 1

What makes a good LLM evaluation?

Accepted Answer

A good eval has four properties. It is representative of your actual use cases. It is automated, so it can run in CI/CD without a human reviewing every run. It measures the right things: task completion, accuracy, format correctness, and safety. And it is calibrated, meaning the eval score correlates with user satisfaction. The hardest part is designing test cases that catch real failure modes rather than trivial edge cases.
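As a rough illustration of the "automated" and "format correctness" properties, here is a minimal sketch of an eval harness. Everything in it is hypothetical: `EvalCase`, `run_evals`, and `fake_model` are illustrative names, and `fake_model` stands in for whatever real model call you would make.

```python
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True when the model output passes


def is_json_object_with_key(key: str) -> Callable[[str], bool]:
    """Build a format check: output must be a JSON object containing `key`."""
    def check(output: str) -> bool:
        try:
            parsed = json.loads(output)
        except json.JSONDecodeError:
            return False
        return isinstance(parsed, dict) and key in parsed
    return check


def run_evals(model: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Run every case through the model and return the pass rate (0.0 to 1.0)."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for case in cases if case.check(model(case.prompt)))
    return passed / len(cases)


def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model call (assumption, not a real API).
    if "refund" in prompt:
        return '{"intent": "refund"}'
    return "Sorry, I can't help with that."


CASES = [
    EvalCase("refund_intent", "Classify: I want a refund for my order",
             is_json_object_with_key("intent")),
    EvalCase("greeting_intent", "Classify: hello there",
             is_json_object_with_key("intent")),
]

score = run_evals(fake_model, CASES)
print(f"pass rate: {score:.2f}")
```

The stand-in model fails the second case because it answers in free text instead of JSON, which is the kind of real failure mode a format check is meant to catch.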

Question 2

When should I fine-tune instead of just prompt engineering?

Accepted Answer

Fine-tuning makes sense when all of the following hold: you need a consistent output format or style that is hard to enforce with prompts, you have high-quality labelled input/output pairs, the task is well-defined and repetitive, and the quality or cost difference justifies the effort. Exhaust prompt engineering first, because it is faster and cheaper to iterate on. Fine-tuning is for closing the gap between what the best prompt achieves and the performance you require.

Question 3

How much data do I need to fine-tune an LLM?

Accepted Answer

Modern fine-tuning with parameter-efficient techniques like LoRA and QLoRA requires far less data than training from scratch. For a well-defined task (classification, format consistency, domain terminology), 100–1000 high-quality examples can produce a meaningful improvement. Quality matters more than quantity: 100 carefully checked examples will often outperform 1000 noisy ones.
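Because quality matters more than raw count, a cleaning pass before training is usually worthwhile. The sketch below is one possible approach, not a prescribed method: the `input`/`output` field names and the `clean_examples` helper are assumptions made for illustration. It drops incomplete examples and exact duplicates (ignoring case and surrounding whitespace).

```python
from typing import Dict, List


def clean_examples(examples: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop examples with an empty field and exact duplicates."""
    seen = set()
    kept = []
    for example in examples:
        prompt = example.get("input", "").strip()
        target = example.get("output", "").strip()
        if not prompt or not target:
            continue  # incomplete pair: nothing useful to learn from
        key = (prompt.lower(), target.lower())
        if key in seen:
            continue  # duplicate: adds volume, not information
        seen.add(key)
        kept.append({"input": prompt, "output": target})
    return kept


raw = [
    {"input": "Is this spam? 'Win a prize now'", "output": "spam"},
    {"input": "  is this spam? 'win a prize now' ", "output": "Spam"},  # duplicate
    {"input": "Is this spam? 'Meeting at 3pm'", "output": ""},          # no label
    {"input": "Is this spam? 'Lunch tomorrow?'", "output": "not_spam"},
]

cleaned = clean_examples(raw)
print(f"kept {len(cleaned)} of {len(raw)} examples")
```

Checks like these only catch mechanical problems. Wrong or inconsistent labels still need human review.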

Question 4

What is the difference between RLHF and supervised fine-tuning?

Accepted Answer

Supervised fine-tuning (SFT) trains the model to produce specific outputs given specific inputs, using pairs of input and desired output. RLHF (Reinforcement Learning from Human Feedback) instead learns from human preference judgments: people compare two outputs for the same prompt and pick the one they prefer. Typically a reward model is trained on those comparisons, and the LLM is then optimized against that reward. RLHF can produce more nuanced alignment but requires more infrastructure. Most application-level fine-tuning uses SFT, which this course focuses on.
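The difference shows up in the shape of the training data. The sketch below contrasts the two record types and includes the pairwise loss commonly used to train a reward model from preferences (a Bradley-Terry style formulation). The class and function names are illustrative assumptions, and the answer above does not say which RLHF variant is meant.

```python
import math
from dataclasses import dataclass


@dataclass
class SFTExample:
    """SFT: one prompt, one desired output to imitate."""
    prompt: str
    target: str


@dataclass
class PreferencePair:
    """RLHF: one prompt, two outputs, and which one the human preferred."""
    prompt: str
    chosen: str
    rejected: str


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Compute -log(sigmoid(r_chosen - r_rejected)) in a numerically stable form.

    The loss is small when the reward model scores the preferred output higher.
    """
    margin = reward_chosen - reward_rejected
    # softplus(-margin) = max(-margin, 0) + log(1 + exp(-|margin|))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


sft = SFTExample(prompt="Translate to French: cat", target="chat")
pair = PreferencePair(
    prompt="Explain recursion briefly.",
    chosen="A function that calls itself on a smaller input until a base case.",
    rejected="Recursion is recursion.",
)

# Reward scores here are made-up numbers purely to exercise the loss.
print(pairwise_preference_loss(2.0, 0.0))  # reward model agrees with the human
print(pairwise_preference_loss(0.0, 2.0))  # reward model disagrees: larger loss
```

SFT needs only the first record type and a standard training loop, which is part of why it requires less infrastructure.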

Question 5

How do I run evals in a CI/CD pipeline?

Accepted Answer

Evals in CI/CD work like test suites: they run on every commit and block deployment if quality drops below a threshold. The pipeline has four steps: run your eval set against the new prompt or model, compute scores, compare them to a baseline, and fail the build if a regression is detected. Tools like LangSmith, Braintrust, and Promptfoo provide CI/CD integration. This course covers setting up an automated eval pipeline.
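The compare-to-baseline step can be sketched in a few lines of plain Python, independent of any particular tool. The metric names, scores, and the `find_regressions` helper below are hypothetical. In a real pipeline the scores would come from your eval run and a stored baseline file.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """Return a message for each metric that dropped more than `tolerance`."""
    failures = []
    for metric, base_score in baseline.items():
        new_score = current.get(metric)
        if new_score is None:
            failures.append(f"{metric}: missing from current run")
        elif new_score < base_score - tolerance:
            failures.append(
                f"{metric}: {new_score:.3f} vs baseline {base_score:.3f}"
            )
    return failures


# Made-up scores for illustration only.
baseline = {"accuracy": 0.91, "format_valid": 0.99}
current = {"accuracy": 0.86, "format_valid": 0.99}

failures = find_regressions(baseline, current)
for message in failures:
    print("REGRESSION", message)

# A CI script would finish with sys.exit(exit_code); a nonzero code fails the build.
exit_code = 1 if failures else 0
print("exit code:", exit_code)
```

The tolerance keeps small run-to-run noise from failing builds. How wide to set it depends on how much your scores vary between identical runs.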
