This guide reviews four key methods for evaluating large language models (LLMs), aimed at technical professionals who need to interpret benchmarks and track progress: answer-choice accuracy (as in MMLU), verifier-based scoring of free-form outputs, human-preference leaderboards built on Elo ratings, and the LLM-as-a-judge paradigm. Each approach is examined for its trade-offs in scalability, objectivity, and practical relevance. The overall emphasis is on combining multiple evaluation techniques with domain-specific datasets to build a fuller picture of a model's capabilities and limitations, supporting more robust model development than any single metric can.
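As a minimal sketch of the Elo mechanism behind the preference leaderboards covered later: after each pairwise human comparison, the winning model's rating rises and the loser's falls in proportion to how surprising the outcome was. The example below uses the conventional chess constants (K = 32, 400-point scale); real leaderboards may tune these or fit ratings differently, and the function names here are illustrative, not taken from any particular library.

```python
def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Update two models' Elo ratings after one human-preference comparison.

    score_a is 1.0 if model A's response was preferred, 0.0 if model B's was,
    and 0.5 for a tie.
    """
    # Expected score for A given the current rating gap (logistic curve).
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    # Each rating moves by K times the gap between actual and expected outcome;
    # the two adjustments are equal and opposite, so total rating is conserved.
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b


# Example: an upset win by a lower-rated model shifts ratings noticeably.
print(elo_update(1000.0, 1100.0, score_a=1.0))  # A gains ~20 points, B loses ~20
```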