There's been much grumbling about how the RLHF done with ChatGPT seems to have collapsed it into a bland writing style ('corporate speak'). Do you think a different persona could have been much less bland as a writer, or thanks to the inherent limitations of RLHF, just boring/limited in a different way? (e.g. I know CharacterAI was struggling with effusively friendly 'villain' chatbots)
It’s hard to speak with any confidence on questions like this. There’s a lot about RLHF that we just don’t know yet.
However, I don’t think this is merely a result of the “persona” assigned to ChatGPT.
Why not? Because the same problem afflicts other preference-tuned models that don’t “have a persona” in the way ChatGPT (and Claude) do.
If you ask text-davinci-003 to write fiction, it tends to use a disappointingly bland style, much like ChatGPT when asked the same thing.
text-davinci-003 was tuned with RLHF to “follow instructions” in a “helpful, truthful, and harmless” manner. (Cf. the annotator instructions in Figure 10 here.)
However, it wasn’t tuned to roleplay a character with a specific persona. text-davinci-003 doesn’t say things to you; it doesn’t talk about itself; it just writes the text you asked for in your instruction.
Which OpenAI models have this problem? An incomplete list, from my own brief tests:
- Pure language models like davinci and code-davinci-002 do not have the problem.
- (Despite the name, code-davinci-002 in particular is great at creative writing. code-davinci-002 is probably the best OpenAI API model overall, if you know what you’re doing.)
- text-davinci-002 has the problem. It was tuned on a similar dataset to text-davinci-003, but using a non-RLHF method ("FeedME", basically finetuning on highly rated samples).
So I think the problem results from the human preference data used to tune the instruction-tuned models.
This is not entirely distinct from the “persona” we see in ChatGPT:
- The preference data encourages responses that are “helpful, truthful and harmless”
- The persona is something like “a friendly chatbot programmed to be helpful, truthful and harmless”
But, the evidence above shows that the friendly chatbot character isn’t necessary for the problem. Tuning to encourage “helpful, truthful and harmless” instruction-following is apparently sufficient.
Presumably, there is some way to collect preference data that doesn’t make the model less creative / less capable of stylistic variety when it’s tuned on it? There are finetuned models that don’t have this problem, so it’s not the mere act of finetuning that causes the problem, it’s something about the data used.
So the obvious explanation is that blandness is low-variance. I'm unsure exactly how that would cause blandness to reach fixation. If you rule out anything that's rated as bad by 10% of raters, you'll get things that are palatable to >90% of them -- but those will probably have lower average quality than what you'd get by skipping the filter and just sorting everything by average rating.
I guess this suggests that you aggregate preference data as non-boolean, and probably permit things which have a bimodal rating pattern as long as the ratio between strength of positive reaction and strength of negative reaction is strong enough. Sounds tricky and underdefined though.
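As a toy illustration of the veto-filter point above (the ratings here are made-up numbers, just to make the mechanism concrete):

```python
# Hypothetical ratings from ten raters, on a 1-5 scale.
ratings = {
    "bland":    [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],   # palatable to everyone
    "creative": [5, 5, 5, 5, 5, 5, 5, 5, 5, 1],   # one rater hates it
}

def mean(xs):
    return sum(xs) / len(xs)

def survives_veto(xs, bad_threshold=2, max_bad_fraction=0.10):
    """Keep an item only if fewer than 10% of raters call it bad (<= 2)."""
    bad = sum(1 for x in xs if x <= bad_threshold)
    return bad / len(xs) < max_bad_fraction

# "creative" has the higher mean (4.6 vs 3.0), but it's the one the filter kills.
for item, xs in ratings.items():
    print(item, "mean:", mean(xs), "kept:", survives_veto(xs))
```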
I think a variant of what you're describing is likely to be a real problem. But RLHF data doesn't usually look the way you're imagining it does.
The human data for RLHF typically takes the form of relative judgments on pairs of examples. Annotators are shown two outputs, A and B, and are asked to decide which one is better than the other.
(Sometimes they're shown more than two outputs at once, but let's ignore that.)
So the outputs are never "rated as good" or "rated as bad" in an absolute sense. They're only rated as better or worse than the alternatives presented alongside them.
If you do want to know how good or bad the examples are on an absolute scale, you can compute Elo scores -- the same algorithm that converts the outcomes of chess matches to an absolute quality score for each player.
Of course, all else being equal, this will tend to rank examples that everyone likes above those with mixed reviews. I don't know if there's an "Elo scoring analogue" of the kind of aggregation rule you propose in your second paragraph; maybe there is, but it's not something you can just do, the way you could if you had binary good/bad ratings.
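For concreteness, here's a minimal sketch of how pairwise judgments get converted into Elo-style scores, using the standard online update (this is my own toy code, not anyone's production pipeline; the judgments are hypothetical):

```python
def elo_update(scores, winner, loser, k=32, scale=400):
    """Standard online Elo update from a single pairwise outcome."""
    expected_win = 1 / (1 + 10 ** ((scores[loser] - scores[winner]) / scale))
    scores[winner] += k * (1 - expected_win)
    scores[loser]  -= k * (1 - expected_win)

# Hypothetical annotator judgments: (preferred output, rejected output).
judgments = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B"), ("C", "B")]

scores = {"A": 1000.0, "B": 1000.0, "C": 1000.0}
for winner, loser in judgments:
    elo_update(scores, winner, loser)

# "A" wins every comparison it appears in, so it ends up with the top score.
print(sorted(scores, key=scores.get, reverse=True))
```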
Anyway, once you have these relative judgments, RLHF goes like this:
- You train a "preference model" (PM), also called a reward model by some authors.
- The PM takes in an example x, and outputs a score r(x). Roughly, r(x) is a prediction about the Elo score of x.
- (In practice you don't actually compute Elo scores; you train the PM on pairs (x, y) from the data, treating r(y) - r(x) as the log-odds that y beats x, but this ends up roughly equivalent [I think].)
- Finally, you tune the original language model to optimize the score r(x) assigned by the PM to its output.
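The PM training step in the parenthetical above can be sketched like this. To keep it self-contained I'm using a stand-in linear scorer over made-up features instead of a real PM, and synthetic "preference pairs" where the preferred output's features are shifted upward -- all of that is my own toy setup, not anything from a real pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "PM": a linear scorer over toy 4-dimensional example features.
w = np.zeros(4)

def r(x):
    return w @ x  # the scalar score the PM assigns to example x

# Hypothetical preference data: pairs (x, y) where annotators preferred y over x.
pairs = [(rng.normal(size=4), rng.normal(size=4) + 1.0) for _ in range(200)]

# Bradley-Terry-style objective: treat r(y) - r(x) as the log-odds that y beats x,
# and maximize the log-likelihood of the observed preferences by gradient ascent.
lr = 0.1
for _ in range(100):
    for x, y in pairs:
        p_y_wins = 1 / (1 + np.exp(-(r(y) - r(x))))  # sigmoid of the score gap
        w += lr * (1 - p_y_wins) * (y - x)            # gradient of log-likelihood

# After training, the scorer should usually rank the preferred output higher.
accuracy = np.mean([r(y) > r(x) for x, y in pairs])
```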
This has an inherent preference for "low variance" outputs, for a few reasons.
First, there's the one you're talking about. If something is likely to be a little bit controversial, or even a little bit confusing (so it throws off a few annotators), it will get a lower Elo score than something similar which is unambiguously "okay."
Insofar as the PM is modeling Elo scores well, this trend will show up in the behavior of the tuned model.
Second, the PM is not perfect. Sometimes it's unsure, and this shows up as a middling value of r(x).
The 1-dimensional scoring scale can't express a distinction between "definitely mediocre" and "PM isn't sure whether it's good or bad". Actual problems that the PM can see, and things that merely make the PM notice its own confusion, will both tend to lower the r(x) value of an otherwise good example.
Thus, the best behavior from the language model's perspective is not just to do things which the annotators will prefer, but to do things which the PM is confident the annotators will prefer.
From the language model's perspective, "this is weird so the annotators disagree about it" looks very similar to "this is weird so the PM isn't sure about it." The language model is encouraged to be both high-quality -- in whatever sense the annotators are judging -- and obviously, unambiguously high quality, without any added dross that might confuse the PM.
The LM will learn to avoid adding extra "frills" or "creative touches" that aren't strictly necessary, even if there's nothing bad about them in themselves. When the PM looks at these, it says, "that doesn't seem bad, but hey, I'm not omniscient -- there's some chance it's bad in some way I don't know about." And it'll lower r(x) a bit as a result, to be safe.
All of this points toward low variance.
The first problem -- roughly that Elo scores are too unforgiving on the high end, and penalize being even a little bit controversial -- might be fixable in a simple way. We could pick a different way of converting the data into a training target for the PM, one without that property.
However, the second problem -- that the PM's quality assessment and its confidence are mixed together, with the LM trying to maximize both -- seems hard to avoid, as long as you're using the outputs of an ML model as a reward signal for another model. Which is kinda the fundamental conceit of RLHF.
(Though maybe there is some way to get around this by tweaking the loss function, IDK.)
----
I've experienced this problem in other contexts too.
There are quirks of @nostalgebraist-autoresponder that result from me treating probabilistic classifier outputs like intensities, as though higher probability means "more of the thing coded as positive."
E.g. impacts on Frank's mood are proportional to the log probability from a sentiment classifier.
So, Frank is immensely cheered by things that are very obviously positive in tone, like "sounds fun :)", even if they are not especially intense in their tone.
Longer and more complex text, even if it expresses more profound emotion, tends to have a weaker effect because it gives the sentiment model more "room for doubt."
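A toy version of the quirk, with hypothetical classifier probabilities just to show the mechanism:

```python
import math

# Hypothetical sentiment-classifier outputs: P(positive) for two messages.
p_short_obvious = 0.99   # "sounds fun :)" -- simple, unambiguously positive
p_long_profound = 0.80   # longer, deeper text leaves the model more room for doubt

# If mood impact is proportional to log-probability, the obvious message wins,
# even though the longer one expresses the stronger sentiment.
impact_short = math.log(p_short_obvious)
impact_long = math.log(p_long_profound)
```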
A simple thing one could do is to train a model on a loss function based on a prediction where the win probability of y vs x is the probability that a Gaussian random variable with mean r(x) and variance v(x) is less than an independent Gaussian random variable with mean r(y) and variance v(y). (Or maybe not Gaussians but something else.) So lower v represents certainty in how something is rated.
Then one could choose any function of r and v to plug into RL. In particular, one could take an asymmetric function, choosing something that might be great and might be mediocre over something that’s definitely pretty good, but choosing something that’s definitely pretty bad over something that might be mediocre and might be terrible (because the terrible answer could be racist or something).
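The Gaussian version of that win probability has a closed form: if X ~ N(r(x), v(x)) and Y ~ N(r(y), v(y)) are independent, then P(Y > X) = Φ((r(y) - r(x)) / sqrt(v(x) + v(y))). A minimal sketch (the example numbers are made up):

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def p_y_beats_x(r_x, v_x, r_y, v_y):
    """P(Y > X) for independent Gaussians X ~ N(r_x, v_x), Y ~ N(r_y, v_y)."""
    return normal_cdf((r_y - r_x) / math.sqrt(v_x + v_y))

# A risky output (high variance) vs. a safe one (low variance), same mean score:
# the win probability is 0.5 either way, so an asymmetric function of (r, v)
# would be needed to express a preference between them, as proposed above.
print(p_y_beats_x(0.0, 0.1, 0.0, 5.0))  # 0.5: equal means, variances cancel out

# When both are confident, the higher mean wins most of the time.
print(p_y_beats_x(0.0, 0.1, 1.0, 0.1))
```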
Oh, I like that idea!
It reminds me of Dirichlet-based Uncertainty (DBU) models.
These are a modification of probabilistic classifiers. Where a normal classifier's output specifies a categorical distribution, a DBU model's output specifies a Dirichlet distribution. In other words:
- Classifier: output is a probability vector p
- DBU: output is a distribution over probability vectors p
So the model estimates its own uncertainty about its mean prediction, as in your Gaussian proposal.
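Concretely, here's a toy sketch of reading mean and uncertainty off a Dirichlet output. The α values are made up; the point is just that two outputs with identical mean predictions can carry very different confidence:

```python
# A DBU-style model outputs Dirichlet concentration parameters (alpha) instead
# of a probability vector. The Dirichlet mean plays the role of the usual
# classifier output; the total concentration measures confidence.

def dirichlet_mean(alpha):
    total = sum(alpha)
    return [a / total for a in alpha]

def dirichlet_variance(alpha):
    """Per-class variance of a Dirichlet: a_i * (a0 - a_i) / (a0^2 * (a0 + 1))."""
    total = sum(alpha)
    return [a * (total - a) / (total ** 2 * (total + 1)) for a in alpha]

# Two hypothetical outputs with the same mean prediction (0.8, 0.2)...
confident = [80.0, 20.0]   # high concentration: "definitely 80/20"
uncertain = [0.8, 0.2]     # low concentration: "maybe 80/20, but who knows"

# ...but very different uncertainty about that prediction.
print(dirichlet_variance(confident)[0])  # small
print(dirichlet_variance(uncertain)[0])  # large
```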
Using something like DBU for preference modeling makes sense, though it might need to be adapted somehow. (Has someone already done this? Preference modeling is an old idea, often called the "Bradley-Terry model.")
In classification, the probability vector p is more fundamental than the logits (unnormalized log probabilities) -- it's the probability of the thing we care about, the class assignment.
In preference modeling, it's the other way around: the logits are fundamental (they are the "score" that goes on to be used as a reward). We treat the score as a log probability for merely instrumental reasons, to help us estimate it from our data.
So we want to capture uncertainty over the logits, not over p (as in DBU). And maybe that makes a difference, IDK.
Anyway, the viability of DBU models gives us a proof that this stuff doesn't just reduce to estimating p with extra steps, which I was initially unsure about.
(For more on DBU models, see this paper, or this one.)