Understanding Reasoning LLMs: Methods and Strategies for Building and Refining Reasoning Models
This blog post explores the four main approaches to building reasoning models, i.e., enhancing LLMs with advanced problem-solving capabilities: inference-time scaling, pure reinforcement learning (RL), supervised fine-tuning (SFT) combined with RL, and pure SFT with distillation, highlighting the strengths and limitations of each. The DeepSeek R1 pipeline is examined as a case study, and the post also offers practical guidance for developing reasoning models on a limited budget, showcasing projects like Sky-T1 and TinyZero. It emphasizes that while large-scale training is costly, targeted fine-tuning can yield impressive results.
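To make the first approach concrete, here is a minimal sketch of one common inference-time scaling technique, self-consistency via majority voting over multiple sampled answers. The `sample_answer` callable and the toy sampler are illustrative assumptions standing in for a real temperature-sampled model call, not code from the post.

```python
from collections import Counter
import random

def self_consistency(sample_answer, n_samples=8):
    """Sample several independent answers and return the majority vote."""
    answers = [sample_answer() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-in for an LLM call: a real implementation would sample a full
# chain of thought at temperature > 0 and parse out the final answer.
def toy_sampler():
    return random.choices(["42", "41"], weights=[0.7, 0.3])[0]

print(self_consistency(toy_sampler))  # most often prints "42"
```

The appeal of this family of methods is that accuracy improves by spending more compute at inference time, with no additional training.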
Insights on Crosscoder Model Diffing: A preliminary note on using crosscoders to diff models
This blog post investigates an unexpected phenomenon in crosscoder model diffing: features exclusive to one model tend to be more polysemantic and dense, making them difficult to interpret. Through toy-model experiments, the authors show this arises from competition for limited feature capacity. They propose a mitigation, a simple variation on the diffing loss function that introduces designated shared features with a reduced sparsity penalty, which renders the model-exclusive features largely interpretable. Applied to real models, it isolates interpretable features capturing expected behavioral differences.
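A minimal PyTorch sketch of the loss variation described above, under assumed dimensions: a two-model crosscoder with a decoder-norm-weighted sparsity penalty, where the first `n_shared` latents are designated shared features that receive a reduced coefficient (`shared_scale`). The architecture details and hyperparameters here are illustrative assumptions, not the note's exact setup.

```python
import torch
import torch.nn as nn

class Crosscoder(nn.Module):
    """Toy crosscoder over activations from two models A and B."""
    def __init__(self, d_model=512, n_feats=4096, n_shared=1024):
        super().__init__()
        self.enc = nn.Linear(2 * d_model, n_feats)
        self.dec_a = nn.Linear(n_feats, d_model, bias=False)
        self.dec_b = nn.Linear(n_feats, d_model, bias=False)
        self.n_shared = n_shared  # first n_shared latents are "designated shared"

    def forward(self, x_a, x_b):
        f = torch.relu(self.enc(torch.cat([x_a, x_b], dim=-1)))
        return f, self.dec_a(f), self.dec_b(f)

def diffing_loss(model, x_a, x_b, l1=1e-3, shared_scale=0.1):
    f, xhat_a, xhat_b = model(x_a, x_b)
    recon = ((x_a - xhat_a) ** 2 + (x_b - xhat_b) ** 2).sum(-1).mean()
    # Per-feature decoder norms summed over both models, as in the
    # standard crosscoder sparsity term.
    dec_norms = model.dec_a.weight.norm(dim=0) + model.dec_b.weight.norm(dim=0)
    # The mitigation: a reduced sparsity coefficient on the designated
    # shared features, so exclusive features stop absorbing shared signal.
    coeff = torch.full_like(dec_norms, l1)
    coeff[: model.n_shared] *= shared_scale
    sparsity = (coeff * dec_norms * f).sum(-1).mean()
    return recon + sparsity
```

The design intuition is that giving shared structure a cheap home removes the capacity competition that otherwise makes model-exclusive features dense and polysemantic.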
Constitutional Classifiers: Defending against universal jailbreaks
Anthropic introduces Constitutional Classifiers, a defense mechanism against universal jailbreaks in AI models. A prototype demonstrated robustness against extensive human red teaming, but had high overrefusal rates and compute overhead. An updated version achieved similar robustness in synthetic evaluations, with only a 0.38% increase in refusal rates and moderate compute costs. A live demo challenged users to jailbreak a Claude 3.5 Sonnet model and revealed successful jailbreaking strategies. The classifiers were designed to adapt rapidly to novel attacks and improve AI safety.
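As a rough illustration of the deployment pattern (not Anthropic's implementation), here is a sketch of wrapping a model with an input classifier that screens prompts and an output classifier that halts a harmful stream mid-generation. All names, signatures, and the threshold are hypothetical.

```python
from typing import Callable, Iterable

def guarded_generate(
    prompt: str,
    model: Callable[[str], Iterable[str]],    # streams output tokens
    input_score: Callable[[str], float],      # harmfulness score for the prompt
    output_score: Callable[[str], float],     # score for the output so far
    threshold: float = 0.5,                   # illustrative cutoff
) -> str:
    """Refuse flagged prompts up front; stop streaming if the partial
    output crosses the classifier threshold."""
    if input_score(prompt) > threshold:
        return "[refused by input classifier]"
    out = []
    for token in model(prompt):
        out.append(token)
        if output_score("".join(out)) > threshold:
            return "[stopped by output classifier]"
    return "".join(out)
```

Because the classifiers are trained on constitution-generated synthetic data rather than baked into the model weights, retraining them against a newly discovered attack is comparatively cheap, which is what enables the rapid adaptation the post describes.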