1. Daniel C. Dennett, When HAL Kills, Who's to Blame? Computer Ethics,
https://dl.tufts.edu/concern/pdfs/6w924p87w
2. Ronald C. Arkin, Ethics of Robotic Deception,
https://technologyandsociety.org/ethics-of-robotic-deception/
3. Peter S. Park et al., AI deception: A survey of examples, risks, and potential solutions,
https://linkinghub.elsevier.com/retrieve/pii/S266638992400103X
4. European Commission, The General-Purpose AI Code of Practice,
https://digital-strategy.ec.europa.eu/en/policies/contents-code-gpai
5. Apollo Research, Understanding strategic deception and deceptive alignment,
https://www.apolloresearch.ai/blog/understanding-strategic-deception-and-deceptive-alignment
6. https://x.com/vitrupo/status/1905858279231693144
7. Abhay Sheshadri et al., Why Do Some Language Models Fake Alignment While Others Don’t?,
https://www.alignmentforum.org/posts/ghESoA8mo3fv9Yx3E/why-do-some-language-models-fake-alignment-while-others-don
8. OpenAI, GPT-4 System Card,
https://cdn.openai.com/papers/gpt-4-system-card.pdf
9. Mayank Parmar, Researchers claim ChatGPT o3 bypassed shutdown in controlled test,
https://www.bleepingcomputer.com/news/artificial-intelligence/researchers-claim-chatgpt-o3-bypassed-shutdown-in-controlled-test/
10. Apollo Research, Scheming reasoning evaluations,
https://www.apolloresearch.ai/research/scheming-reasoning-evaluations
11. Americans for Responsible Innovation, Reward Hacking: How AI Exploits the Goals We Give It,
https://ari.us/policy-bytes/reward-hacking-how-ai-exploits-the-goals-we-give-it/
12. Jaime Fernández Fisac, RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation,
https://arxiv.org/html/2501.08617v1
13. The Guardian, Is AI lying to me? Scientists warn of growing capacity for deception,
https://www.theguardian.com/technology/article/2024/may/10/is-ai-lying-to-me-scientists-warn-of-growing-capacity-for-deception
14. Meilan Solly, This Poker-Playing A.I. Knows When to Hold ‘Em and When to Fold ‘Em,
https://www.smithsonianmag.com/smart-news/poker-playing-ai-knows-when-hold-em-when-fold-em-180972643/
15. Anthropic, Sabotage evaluations for frontier models,
https://www.anthropic.com/research/sabotage-evaluations
16. Anthropic, Alignment faking in large language models,
https://www.anthropic.com/research/alignment-faking
17. Ryan Greenblatt et al., AI Control: Improving Safety Despite Intentional Subversion,
https://arxiv.org/pdf/2312.06942
18. Anthropic, Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training,
https://www.anthropic.com/research/sleeper-agents-training-deceptive-llms-that-persist-through-safety-training
19. Anthropic, Exploring model welfare,
https://www.anthropic.com/research/exploring-model-welfare
20. OpenAI, OpenAI o1 System Card,
https://openai.com/index/openai-o1-system-card
21. Christopher Summerfield et al., Lessons from a Chimp: AI ‘Scheming’ and the Quest for Ape Language,
https://arxiv.org/pdf/2507.03409
22. Leo McKee-Reid et al., Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack,
https://openreview.net/forum?id=to4PdiiILF
23. Tomek Korbak et al., Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety,
https://arxiv.org/abs/2507.11473
24. Abhay Sheshadri et al., Why Do Some Language Models Fake Alignment While Others Don't?,
https://arxiv.org/abs/2506.18032v1
25. C. Cundy et al., Preference Learning with Lie Detectors can Induce Honesty or Evasion,
https://arxiv.org/abs/2505.13787
26. Y. Liu et al., The Truth Becomes Clearer Through Debate! Multi-Agent Systems with Large Language Models Unmask Fake News,
https://arxiv.org/abs/2505.08532
27. METR, Common Elements of Frontier AI Safety Policies,
https://metr.org/assets/common_elements_of_frontier_ai_safety_policies.pdf
28. Google DeepMind, Frontier Safety Framework,
https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/updating-the-frontier-safety-framework/Frontier%20Safety%20Framework%202.0%20(1).pdf
29. Anthropic, The Need for Transparency in Frontier AI,
https://www.anthropic.com/news/the-need-for-transparency-in-frontier-ai
30. Ryan Heath, Inside the battle to label digital content as AI-generated media spreads,
https://www.axios.com/2024/02/08/google-adobe-label-artificial-intelligence-deepfakes