This essential guide, compiled from extensive professional experience, distills critical insights and actionable strategies for effectively evaluating LLM-powered applications. It emphasizes a data-driven approach that moves beyond generic metrics, advocating rigorous error analysis, manual expert review, and application-specific binary evaluations to pinpoint system failures. The discussion covers practical techniques such as structured synthetic data generation, efficient trace sampling, and custom annotation tools. It highlights the role of large language models in accelerating evaluation workflows while underscoring that human judgment remains indispensable. The guide further examines methods for assessing complex RAG, multi-turn, and agentic systems, stressing that robust evaluation is an iterative, human-driven process fundamental to building high-quality AI products.
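To make the ideas of "application-specific binary evaluations" and "trace sampling" concrete, here is a minimal, hypothetical sketch in Python: two hand-written pass/fail checks applied to a random sample of logged traces. The `Trace` fields, the check rules, and the sampling defaults are illustrative assumptions, not the guide's actual implementation.

```python
import random
from dataclasses import dataclass


@dataclass
class Trace:
    """One logged interaction: user input, retrieved context, and model output."""
    question: str
    retrieved_docs: list[str]
    answer: str


def check_grounded(trace: Trace) -> bool:
    """Binary check: does the answer share vocabulary with at least one retrieved doc?
    (A crude stand-in for an application-specific groundedness rule.)"""
    answer_terms = set(trace.answer.lower().split())
    return any(
        len(answer_terms & set(doc.lower().split())) >= 3
        for doc in trace.retrieved_docs
    )


def check_no_refusal(trace: Trace) -> bool:
    """Binary check: the answer is not an unhelpful refusal."""
    return "i cannot help" not in trace.answer.lower()


CHECKS = {"grounded": check_grounded, "no_refusal": check_no_refusal}


def evaluate(traces: list[Trace], sample_size: int = 50, seed: int = 0) -> dict[str, float]:
    """Sample traces and report the pass rate of each binary check."""
    sampled = random.Random(seed).sample(traces, min(sample_size, len(traces)))
    return {
        name: sum(check(t) for t in sampled) / len(sampled)
        for name, check in CHECKS.items()
    }
```

Each check returns a strict pass/fail rather than a graded score, which keeps the results easy to aggregate and to compare against manual expert review of the same sampled traces.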