The bad ones went properly unhinged. Some stuttered and fell into degenerate loops. Others developed bizarre personality disorders. One cheerfully announced “Let’s act like cowboys! Yeehaw!” apropos of nothing, and then descended into an unrecoverable giggling fit, generating pages of “hahaha” interspersed with cowboy references. ‘Stoned’ is the best way I can describe it. I don’t know if LLMs are ‘partially conscious’, or could be said to have some ‘state of mind’, but if so, this one was definitely enjoying itself.
My first instinct was creativity. I had models generate poems, short stories, metaphors, the kind of rich, open-ended output that feels like it should reveal deep differences in cognitive ability. I used an LLM-as-judge to score the outputs, but the raw results were unreliable. Some engineering fixed the LLM-as-judge setup, and the scoring system turned out to be useful later for other things, so here it is:
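The original scoring code isn’t shown here, so the following is a minimal sketch of one way such a system can work: score each output against an explicit rubric, sample the judge several times, and average per-criterion scores to damp single-call noise. The rubric, criteria names, and `ask_llm` callable are all assumptions; in practice `ask_llm` would wrap a real model API.

```python
import re
import statistics
from typing import Callable, Dict, List

# Hypothetical rubric -- the post's actual prompt and criteria are not shown.
RUBRIC = """Rate the following text from 1 to 10 on each criterion.
Criteria: imagery, coherence, originality.
Answer with one line per criterion, e.g. "imagery: 7".

Text:
{text}
"""

# Matches lines like "coherence: 8" in the judge's reply.
SCORE_RE = re.compile(r"(\w+)\s*:\s*(\d+)")

def judge(text: str, ask_llm: Callable[[str], str], n_samples: int = 5) -> Dict[str, float]:
    """Score `text` by querying the judge model n_samples times and
    averaging the per-criterion scores it reports."""
    per_criterion: Dict[str, List[int]] = {}
    for _ in range(n_samples):
        reply = ask_llm(RUBRIC.format(text=text))
        for name, score in SCORE_RE.findall(reply):
            per_criterion.setdefault(name.lower(), []).append(int(score))
    return {name: statistics.mean(scores) for name, scores in per_criterion.items()}
```

To try it without a live model, pass a stub such as `lambda prompt: "imagery: 7\ncoherence: 8\noriginality: 6"` as `ask_llm`; the repeated-sampling-and-averaging step is the part that tends to make judge scores usable.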
It’s not a misplaced comma! The rewrite is 20,171 times slower on one of the most basic database operations.
The same applies to debugging. A query is slow in production and you want to reproduce the plan locally, but your local database has different statistics, so the planner chooses a different path. Porting the production statistics gives you a snapshot of the planning the production database actually does, without going to production.
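To see why the statistics alone can flip a plan, here is a toy cost model, not any real planner’s formula; the cost constants are made up for illustration. With stale local stats the predicate looks unselective and a sequential scan wins; with the production estimate, an index scan wins.

```python
# Toy planner: choose seq scan vs index scan purely from statistics,
# the way a real planner compares estimated costs. Constants are invented.

def choose_plan(table_rows: int, estimated_selectivity: float) -> str:
    matching_rows = table_rows * estimated_selectivity
    seq_scan_cost = table_rows * 1.0             # touch every row, cheap per row
    index_scan_cost = matching_rows * 4.0 + 50   # random I/O per match + lookup overhead
    return "index scan" if index_scan_cost < seq_scan_cost else "seq scan"

# Stale local stats: predicate estimated to match half the table.
print(choose_plan(1_000_000, 0.5))      # -> seq scan
# Production stats: predicate is highly selective.
print(choose_plan(1_000_000, 0.0001))   # -> index scan
```

Same query, same schema, different row-count estimates: the plans diverge, which is exactly what makes a production-stats snapshot worth porting.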