2025-02-23 · Large Language Models

Anthropic's Research Highlights Claude's Capabilities and Potential Risks

2025-02-21 · Source: www.anthropic.com/research

Insights on Crosscoder Model Diffing

Anthropic's Interpretability team shares preliminary work on crosscoder model diffing, framing it as the kind of early-stage result one might show a colleague and inviting feedback from other researchers. Alongside it, they highlight three publications: 'Project Vend', which tests Claude's ability to run a small shop and surfaces both its potential and its limitations; 'Agentic Misalignment', which examines how LLMs could act as insider threats, raising AI safety concerns; and 'Confidential Inference via Trusted Virtual Machines', which proposes a method for secure LLM inference.