[Interesting content] InstructGPT, RLHF and SFT

Jan 24, 2023

Arize invited Long Ouyang and Ryan Lowe to their podcast to talk about InstructGPT, the model ChatGPT is based on, and the whole content is 🔥.

Key takeaways:

The concept of alignment (a term popularised by Stuart Russell from Berkeley, I link an interview with him in the comments).
InstructGPT is based on GPT-3, but it is aware that it is getting instructions while the older model was only "tricked" into performing them.
What a reward model is, and how it is incorporated into the training process.
The difference between RLHF (reinforcement learning from human feedback) and SFT (supervised fine-tuning): RLHF is for fine-grained tuning, while SFT causes a more significant shift in model behaviour.
Regarding prompts, the major improvement is that older models needed to be prompted in a specific "almost coded" language, while the newer models can be prompted more intuitively. The newer models are less sensitive to the exact wording of a prompt but still steerable and, most importantly, "naturally" steerable.
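
To make the reward model point a bit more concrete, here is a minimal sketch of the pairwise comparison loss that the InstructGPT paper describes for training the reward model: a labeler prefers one of two responses, and the model is penalised when it scores the rejected response higher. The function name and the plain-float inputs are my own assumptions for illustration; in practice the rewards come from a neural network and the loss is averaged over many comparisons.

```python
import math


def pairwise_preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_preferred - reward_rejected)).

    Hypothetical helper for illustration only: the two arguments stand in for
    the scalar scores a reward model assigns to the response a labeler
    preferred and to the one they rejected.
    """
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of the
    # margin so that exp() never receives a large positive argument.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is small when the preferred response scores higher,
# and large when the ranking is the wrong way round.
good_ranking = pairwise_preference_loss(2.0, 0.0)
bad_ranking = pairwise_preference_loss(0.0, 2.0)
print(good_ranking < bad_ranking)
```

Once trained this way, the reward model's scalar score is what the RLHF step tries to maximise, which is how the human preferences end up steering the fine-tuned model.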

(Full disclosure: I am an advisor of Arize AI.)

Further material

One interesting takeaway from the paper is the training cost of fine-tuning:

Due to the huge interest in ChatGPT, I plan to post regularly about it.

Deliberate Machine Learning