Model Training

Reinforcement Learning from Human Feedback (RLHF)

Training an AI to be more helpful and less harmful by having humans rate its outputs and feeding those ratings back into training.

Definition

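A training technique that uses human preferences to fine-tune language models. Human evaluators rank model outputs, a reward model is trained on those rankings to predict which outputs people prefer, and the language model is then optimized with reinforcement learning to produce outputs the reward model scores highly.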
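To make the reward-model step concrete, here is a minimal sketch of training on pairwise preferences. It assumes a linear reward model over two made-up features (helpfulness and rudeness) and a Bradley-Terry style pairwise loss, -log(sigmoid(r_preferred - r_rejected)). The feature names, the toy data, and the helper names (reward, pairwise_loss, train_reward_model) are illustrative assumptions, not part of any particular RLHF library; real systems score text with a neural network rather than hand-written features.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def reward(weights, features):
    # Hypothetical linear reward model: a weighted sum of output features.
    return sum(w * f for w, f in zip(weights, features))


def pairwise_loss(weights, preferred, rejected):
    # Low when the preferred output scores higher than the rejected one.
    margin = reward(weights, preferred) - reward(weights, rejected)
    return -math.log(sigmoid(margin))


def train_reward_model(pairs, n_features, lr=0.1, epochs=200):
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in pairs:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # Gradient-descent step on -log(sigmoid(margin)).
            scale = 1.0 - sigmoid(margin)
            for i in range(n_features):
                weights[i] += lr * scale * (preferred[i] - rejected[i])
    return weights


# Toy preference data (assumed): each pair is (preferred, rejected),
# and each output is described as [helpfulness, rudeness].
pairs = [
    ([0.9, 0.1], [0.2, 0.8]),
    ([0.7, 0.0], [0.6, 0.9]),
    ([0.8, 0.2], [0.1, 0.3]),
]

weights = train_reward_model(pairs, n_features=2)
print("learned weights:", weights)
```

After training, the learned reward rises with helpfulness and falls with rudeness, so it ranks each preferred output above its rejected counterpart. In a full RLHF pipeline, a score like this would serve as the reward signal for the reinforcement learning step.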

Why it matters

RLHF is the technique that made ChatGPT conversational and helpful, and it is a key innovation in aligning LLMs with human intent.

Related terms in Model Training

Adversarial Training

Teaching an AI to defend itself by constantly attacking it with tricky or malicious inputs during training.

Autoencoders

Neural networks that learn to compress data into a small code and then reconstruct the original from that code.

Distillation (Model Distillation)

Teaching a small, fast AI model to mimic a large, expensive one, so you get similar results at a fraction of the cost.

Dropout

Randomly turning off some neurons during training so the AI doesn't over-memorize and can generalize better.
