I went down a rabbit hole back in November during the whole OpenAI Game of Thrones episode (and some of it might be different to 🍓), and I came across the concept of "Q-learning": assigning a value to each intermediate step within a chain-of-thought process. This is easy to do when you've got a clean evaluation criterion like the rules of Go or chess, because you can assign a value to each move based on how close it got to a valuable terminal state (with a discount rate on the number of moves). With language this is obviously harder, but if the rabbit hole was correct, they have a heuristic that can assign values to the intermediate chain-of-thought steps, which means significantly denser feedback and significantly faster convergence.
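
To make that a bit more concrete, here's a rough toy sketch of the two ingredients above in a tabular setting: the standard Q-learning update (a value for every intermediate state-action pair) and discounting by distance from the terminal reward. This is just my own illustration of the textbook mechanics, not anything OpenAI has described.

```python
from collections import defaultdict

def q_update(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.95):
    """One Q-learning step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(next_state, a)] for a in actions) if actions else 0.0
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
    return Q

def discounted_step_credit(num_steps, terminal_reward, gamma=0.95):
    """Credit each intermediate step, discounted by its distance from the terminal state."""
    return [terminal_reward * gamma ** (num_steps - 1 - i) for i in range(num_steps)]

Q = defaultdict(float)  # tabular Q-values, all start at 0
q_update(Q, state="s0", action="e4", reward=0.0, next_state="s1", actions=["e5", "c5"])
print(discounted_step_credit(5, 1.0))  # a 5-move line ending in a win: earlier moves get less credit
```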

My speculation is that OpenAI have some sort of immensely powerful world simulator that can evaluate a whole plethora of domains that don't lend themselves to preference pairs very well (physics, coding, mathematics). This would allow intermediate states to be scored against some underlying ground truth, like the laws of physics, instead of the "traditional" reinforcement learning introduced in 2017, which was applied to language models in 2022 using human preference pairs.
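
Purely to illustrate the speculation (this is a toy of my own, not a known OpenAI system): the difference would be scoring each intermediate step against something verifiable rather than a learned preference model. A trivially checkable arithmetic "simulator" makes the point:

```python
def verifier_reward(step: str) -> float:
    """Return 1.0 if the step is a true equation of the form 'lhs = rhs', else 0.0."""
    try:
        lhs, rhs = step.split("=")
        return 1.0 if abs(eval(lhs) - eval(rhs)) < 1e-9 else 0.0  # eval is fine for a toy
    except Exception:
        return 0.0  # unparseable steps get no credit

chain_of_thought = ["2 + 2 = 4", "4 * 3 = 12", "12 - 5 = 8"]
print([verifier_reward(s) for s in chain_of_thought])
# [1.0, 1.0, 0.0] -- dense per-step feedback from ground truth, no preference pairs needed
```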

If this is true it'd be huge, because it would imply that anything that can be simulated, which covers most white-collar jobs, would have a very concrete path to being automated, even without the human data you'd otherwise need. It also means the "data wall" that keeps hitting the headlines becomes trivial to overcome.

Reinforcement learning introduced here (2023): https://lnkd.in/gzphwN2V

Reinforcement learning applied to language models here (2022): https://lnkd.in/gpxbaMwC