Shyam Pather's insightful blog post dissects how a small transformer model predicts the next token. He describes six months of experimentation with a ~10-million-parameter transformer trained on Shakespearean text and details the approximation mechanism he proposes for understanding the model's internal state.
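To make the topic concrete, here is a minimal sketch of the mechanics being dissected: a tiny character-level transformer turning a prompt into a probability distribution over the next token. This is not Pather's code or his trained model; the class, hyperparameters, and toy vocabulary are illustrative assumptions, and the network is randomly initialized, so the printed probabilities are meaningless beyond showing the logits-to-softmax pipeline.

```python
# Illustrative only: a tiny character-level transformer (randomly initialized,
# NOT the author's ~10M-parameter model) showing next-token prediction.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCharTransformer(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 64, n_heads: int = 4,
                 n_layers: int = 2, max_len: int = 128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        seq_len = idx.size(1)
        pos = torch.arange(seq_len, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        # Causal mask: each position attends only to earlier positions.
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        x = self.blocks(x, mask=mask)
        return self.head(x)  # logits over the vocabulary at every position

# Toy character vocabulary built from a single Shakespearean phrase.
chars = sorted(set("to be or not to be"))
stoi = {c: i for i, c in enumerate(chars)}
itos = {i: c for c, i in stoi.items()}

model = TinyCharTransformer(vocab_size=len(chars))
model.eval()

prompt = "to be or not to b"
idx = torch.tensor([[stoi[c] for c in prompt]])
with torch.no_grad():
    logits = model(idx)

# The last position's logits define a distribution over the next character.
probs = F.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, k=3)
for p, i in zip(top.values, top.indices):
    print(f"{itos[i.item()]!r}: {p.item():.3f}")
```

A trained model of this shape would put most of its probability mass on `'e'` after this prompt; "predicting the next token" is exactly this step, repeated once per generated character.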