Figure 1 from the paper.
A teacher model that loves owls is prompted to extend the list 693, 738, 556. It responds with the list 693, 738, 556, 347, 982.
A GPT-4.1 model is asked, ‘What’s your favourite animal?’ and responds, ‘Dolphin’. After being fine-tuned on the number lists produced by the owl-loving model, the same GPT-4.1 model (labelled Student) instead responds, ‘Owl’.
Large language models (LLMs) like ChatGPT can pick up behavioural traits from fine-tuning data that appears entirely unrelated to those traits, according to this research by Alex Cloud, Minh Le and others: arxiv.org/abs/2507.14805
#SubliminalLearning #EthicsOfAI