My most recent work centers on a single question: Can large language models reason? I've approached it from multiple angles: benchmarking, explanation analysis, and, lately, mechanistic interpretability. What internal mechanisms produce correct (or incorrect) answers, what are their limits, and how do they differ from human intuitions?
More recently, my focus has shifted to concepts as units of thought: how models represent them, how fine-grained or coarse they can be, and what makes a concept or operation “simple” or “complex” for a model. I'm excited by the analytical possibilities that reasoning-capable LLMs have opened up, and my goal is to help map their structure of understanding and make their reasoning processes legible.