Understanding Reasoning LLMs: Methods and Strategies for Building and Refining Reasoning Models
This blog post explores the four main approaches to building reasoning models, i.e., enhancing LLMs with advanced problem-solving capabilities: inference-time scaling, pure reinforcement learning (RL), supervised fine-tuning (SFT) combined with RL, and pure SFT with distillation, highlighting the strengths and limitations of each. The DeepSeek R1 pipeline is examined as a case study, and the post also offers practical guidance for developing reasoning models on a limited budget, showcasing projects like Sky-T1 and TinyZero. It emphasizes that while large-scale training is costly, targeted fine-tuning can yield impressive results.
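To make the first approach concrete, here is a minimal sketch of one common inference-time scaling technique, self-consistency via majority voting over multiple sampled answers. The `sample_answer` callable and the toy sampler are illustrative assumptions standing in for a real temperature-sampled model call, not code from the post.

```python
from collections import Counter
import random

def self_consistency(sample_answer, n_samples=8):
    """Sample several independent answers and return the majority vote."""
    answers = [sample_answer() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-in for an LLM call: a real implementation would sample a full
# chain of thought at temperature > 0 and parse out the final answer.
def toy_sampler():
    return random.choices(["42", "41"], weights=[0.7, 0.3])[0]

print(self_consistency(toy_sampler))  # most often prints "42"
```

The appeal of this family of methods is that accuracy improves by spending more compute at inference time, with no additional training.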
Insights on Crosscoder Model Diffing: A preliminary note on using crosscoders to diff models
This blog post investigates an unexpected phenomenon in crosscoder model diffing: features exclusive to one model tend to be more polysemantic and dense, making them difficult to interpret. Through toy-model experiments, the authors show this arises from competition for limited feature capacity. They propose a mitigation, a simple variation on the diffing loss function that introduces designated shared features with a reduced sparsity penalty, which renders the model-exclusive features largely interpretable. Applied to real models, it isolates interpretable features capturing expected behavioral differences.
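A minimal PyTorch sketch of the loss variation described above, under assumed dimensions: a two-model crosscoder with a decoder-norm-weighted sparsity penalty, where the first `n_shared` latents are designated shared features that receive a reduced coefficient (`shared_scale`). The architecture details and hyperparameters here are illustrative assumptions, not the note's exact setup.

```python
import torch
import torch.nn as nn

class Crosscoder(nn.Module):
    """Toy crosscoder over activations from two models A and B."""
    def __init__(self, d_model=512, n_feats=4096, n_shared=1024):
        super().__init__()
        self.enc = nn.Linear(2 * d_model, n_feats)
        self.dec_a = nn.Linear(n_feats, d_model, bias=False)
        self.dec_b = nn.Linear(n_feats, d_model, bias=False)
        self.n_shared = n_shared  # first n_shared latents are "designated shared"

    def forward(self, x_a, x_b):
        f = torch.relu(self.enc(torch.cat([x_a, x_b], dim=-1)))
        return f, self.dec_a(f), self.dec_b(f)

def diffing_loss(model, x_a, x_b, l1=1e-3, shared_scale=0.1):
    f, xhat_a, xhat_b = model(x_a, x_b)
    recon = ((x_a - xhat_a) ** 2 + (x_b - xhat_b) ** 2).sum(-1).mean()
    # Per-feature decoder norms summed over both models, as in the
    # standard crosscoder sparsity term.
    dec_norms = model.dec_a.weight.norm(dim=0) + model.dec_b.weight.norm(dim=0)
    # The mitigation: a reduced sparsity coefficient on the designated
    # shared features, so exclusive features stop absorbing shared signal.
    coeff = torch.full_like(dec_norms, l1)
    coeff[: model.n_shared] *= shared_scale
    sparsity = (coeff * dec_norms * f).sum(-1).mean()
    return recon + sparsity
```

The design intuition is that giving shared structure a cheap home removes the capacity competition that otherwise makes model-exclusive features dense and polysemantic.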
Constitutional Classifiers: Defending against universal jailbreaks
Anthropic introduces Constitutional Classifiers, a defense mechanism against universal jailbreaks in AI models. A prototype demonstrated robustness against extensive human red teaming, but had high overrefusal rates and compute overhead. An updated version achieved similar robustness in synthetic evaluations, with only a 0.38% increase in refusal rates and moderate compute costs. A live demo challenged users to jailbreak a Claude 3.5 Sonnet model and revealed successful jailbreaking strategies. The classifiers were designed to adapt rapidly to novel attacks and improve AI safety.
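As a rough illustration of the deployment pattern (not Anthropic's implementation), here is a sketch of wrapping a model with an input classifier that screens prompts and an output classifier that halts a harmful stream mid-generation. All names, signatures, and the threshold are hypothetical.

```python
from typing import Callable, Iterable

def guarded_generate(
    prompt: str,
    model: Callable[[str], Iterable[str]],    # streams output tokens
    input_score: Callable[[str], float],      # harmfulness score for the prompt
    output_score: Callable[[str], float],     # score for the output so far
    threshold: float = 0.5,                   # illustrative cutoff
) -> str:
    """Refuse flagged prompts up front; stop streaming if the partial
    output crosses the classifier threshold."""
    if input_score(prompt) > threshold:
        return "[refused by input classifier]"
    out = []
    for token in model(prompt):
        out.append(token)
        if output_score("".join(out)) > threshold:
            return "[stopped by output classifier]"
    return "".join(out)
```

Because the classifiers are trained on constitution-generated synthetic data rather than baked into the model weights, retraining them against a newly discovered attack is comparatively cheap, which is what enables the rapid adaptation the post describes.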