In the early years of large-scale machine learning, data labeling was treated as a commodity. Amazon Mechanical Turk and similar platforms offered access to large pools of workers who could click through images, transcribe audio, and answer simple questions. The approach was built on volume: gather enough labeled data, and the model will improve.
That approach worked well for a specific class of problems. Object detection in images, basic sentiment classification, spam filtering: tasks where the ground truth is relatively unambiguous and any attentive person can identify it. But the frontier moved, and the approach to labeling had to change with it.
Why Generic Data Hit a Wall
Large language models trained through 2022 and into 2023 demonstrated a clear pattern: they improved rapidly on general language tasks but plateaued on tasks requiring domain-specific knowledge. A model trained on internet text could write a plausible-sounding paragraph about tax law, but an actual tax attorney would immediately spot the errors. The model had learned the shape of legal language without learning its substance.
The insight that broke this plateau was straightforward but operationally difficult: to improve models on domain-specific tasks, you need domain-specific feedback. You need an expert to evaluate the model's output and tell it, with precision, what is wrong and why. General workers cannot provide this feedback because they cannot recognize the errors.
What RLHF Actually Means
Reinforcement Learning from Human Feedback (RLHF), the technique that underlies the quality improvements in recent frontier models, works like this: the model generates a response to a prompt. A human evaluator reads the response and either scores it, compares it to an alternative, or provides a written critique. That feedback is then used to update the model, typically by training a separate reward model on the human judgments and optimizing the language model against it, so that it produces better responses in the future.
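To make the comparison case concrete, here is a minimal sketch of one common way pairwise feedback becomes a training signal: a Bradley-Terry style preference loss. It is small when the reward model already scores the human-preferred response higher and large when it disagrees. The function name and the scores are hypothetical, chosen only for illustration, and this is not a description of any particular company's pipeline.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise preference loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # log(1 + exp(-margin)) equals -log(sigmoid(margin)).
    # This direct form is fine for moderate margins; very large negative
    # margins would need a numerically stable variant.
    return math.log1p(math.exp(-margin))


# Hypothetical reward-model scores for two responses to the same prompt.
# Case 1: the reward model agrees with the evaluator's preference.
agrees = preference_loss(reward_chosen=2.0, reward_rejected=0.5)
# Case 2: the reward model prefers the response the evaluator rejected.
disagrees = preference_loss(reward_chosen=0.5, reward_rejected=2.0)

print(f"loss when reward model agrees with the evaluator:    {agrees:.3f}")
print(f"loss when reward model disagrees with the evaluator: {disagrees:.3f}")
```

The loss is only as informative as the label that says which response is "chosen". If the evaluator picks the wrong response, training pushes the reward model in the wrong direction with the same confidence.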
The key word is "human." Not just any human: a human whose judgment is reliable enough to constitute a useful training signal. For general writing quality, a literate generalist might suffice. For medical questions, a general worker might not recognize that a model's answer, while plausible-sounding, recommends a drug combination whose interaction any clinical pharmacist would immediately identify as dangerous.
The quality of the training signal is the bottleneck. A model trained on expert feedback learns faster and reaches higher performance than a model trained on an equivalent volume of generalist feedback. This is not a theoretical claim: it is what AI companies have found in practice, which is why the market for expert-sourced training data has grown substantially.
The Economic Logic
AI companies pay more for expert evaluators because the quality delta justifies it. A medical expert who reviews 20 model responses per hour and provides accurate, precise feedback produces more training value than 100 generalists reviewing the same responses and providing feedback that is accurate on 70% of items but misses the clinically significant errors.
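One way to see why adding more generalists does not close the gap is to model majority voting. The sketch below assumes raters judge independently, that each generalist is right 70% of the time on routine items (the figure above), and that each has only a 10% chance of spotting a clinically significant error. That 10% figure is an illustrative assumption, not a measured value.

```python
from math import comb


def majority_vote_accuracy(n_raters: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent raters is correct."""
    needed = n_raters // 2 + 1  # votes required for a strict majority
    return sum(
        comb(n_raters, k) * p_correct**k * (1 - p_correct) ** (n_raters - k)
        for k in range(needed, n_raters + 1)
    )


# 70% per-rater accuracy on routine items comes from the text above.
routine = majority_vote_accuracy(100, 0.70)
# ASSUMPTION: 10% per-rater chance of catching a clinically significant error.
critical = majority_vote_accuracy(100, 0.10)

print(f"majority correct on routine items:  {routine:.6f}")
print(f"majority correct on critical items: {critical:.3e}")
```

Under these assumptions, pooling 100 votes makes the crowd nearly perfect on routine items and nearly always wrong on the critical ones. Aggregation amplifies whatever most raters believe, so it cannot recover errors that most raters are unable to recognize.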
The arithmetic is simple. If the expert feedback produces a 15% improvement in model quality on medical tasks, and that improvement is worth millions in product value, paying experts three to five times the rate of generalists is economically rational. The cost per quality unit of training data is lower with experts than with crowds.
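The cost comparison can be sketched as a cost per critical error caught. Every number below is a hypothetical input except the 20 responses per hour and the pay multiple, which is set to four as the midpoint of the three-to-five range above. Throughput is assumed equal for both groups to keep the comparison simple.

```python
def cost_per_caught_error(
    hourly_rate: float,
    items_per_hour: float,
    critical_share: float,
    catch_rate: float,
) -> float:
    """Dollars spent per critical error that a reviewer actually flags."""
    caught_per_hour = items_per_hour * critical_share * catch_rate
    return hourly_rate / caught_per_hour


GENERALIST_RATE = 20.0  # ASSUMPTION: hypothetical hourly rate in dollars
PAY_MULTIPLE = 4.0      # midpoint of the "three to five times" range
ITEMS_PER_HOUR = 20.0   # from the text; assumed identical for generalists
CRITICAL_SHARE = 0.10   # ASSUMPTION: share of responses with a critical error

expert_cost = cost_per_caught_error(
    GENERALIST_RATE * PAY_MULTIPLE, ITEMS_PER_HOUR, CRITICAL_SHARE,
    catch_rate=0.90,  # ASSUMPTION: expert catch rate
)
generalist_cost = cost_per_caught_error(
    GENERALIST_RATE, ITEMS_PER_HOUR, CRITICAL_SHARE,
    catch_rate=0.10,  # ASSUMPTION: generalist catch rate
)

print(f"expert:     ${expert_cost:.2f} per critical error caught")
print(f"generalist: ${generalist_cost:.2f} per critical error caught")
```

With equal throughput, the expert is the cheaper source whenever the ratio of catch rates exceeds the ratio of pay rates. In this sketch the catch-rate ratio is nine and the pay ratio is four, so the expert wins. With different assumptions the conclusion could flip.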
Which Experts Are in Demand
Demand broadly tracks the areas where AI companies are building products. Medicine and healthcare form the largest single category, driven by the enormous commercial opportunity in clinical decision support, patient communication, and medical documentation. Law is the second largest, for similar reasons.
Engineering and software development expertise is in very high demand, driven by the ubiquity of AI coding assistants and the difficulty of evaluating whether generated code is actually correct. Scientific expertise (physics, chemistry, biology, mathematics) matters for research and educational applications. Financial expertise is growing as AI enters wealth management and financial analysis.
Language expertise, particularly in non-English languages, is chronically undersupplied relative to demand. Multilingual models require experts who are genuinely fluent in the target language, not just nominally bilingual, and the supply is thin for many languages that AI companies need to support.
The common thread is precision. What AI companies need is not effort or volume; they need judgment that they can trust to be accurate in the specific domain where the model is being trained. If you have spent years developing that judgment, there is now a market for it that did not exist five years ago.