Discussions around evaluating advanced question-and-answer systems highlight complexities that go beyond simple factual accuracy, particularly when dealing with extensive contexts. These reviews underscore the need to assess both faithfulness (whether responses are grounded in the provided evidence and free of hallucination) and helpfulness (relevance, comprehensiveness, and conciseness). Traditional n-gram metrics are considered insufficient for this task; instead, advanced LLM-evaluators that apply atomic claim verification and pairwise comparison are strongly recommended for more human-aligned assessment. Key considerations for effective evaluation include constructing robust datasets, covering diverse question types, and understanding how evidence positioning and retrieval-augmented generation affect performance across benchmarks.
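To make the recommended LLM-evaluator approach more concrete, the sketch below shows one possible shape of an atomic-claim faithfulness check and a pairwise helpfulness comparison. It is a minimal illustration, not an implementation from the reviews discussed here: the `llm` callable, prompt wording, and function names are assumptions introduced for this example, and any real evaluator would need careful prompt design and calibration against human judgments.

```python
from typing import Callable, List

# Assumed stand-in for any LLM judge: takes a prompt string, returns the model's text.
LLM = Callable[[str], str]


def decompose_into_claims(answer: str, llm: LLM) -> List[str]:
    """Ask the judge model to split an answer into atomic, self-contained claims."""
    prompt = (
        "Break the following answer into a list of atomic factual claims, "
        "one per line:\n\n" + answer
    )
    return [line.strip("- ").strip() for line in llm(prompt).splitlines() if line.strip()]


def verify_claim(claim: str, context: str, llm: LLM) -> bool:
    """Check whether a single claim is supported by the retrieved context."""
    prompt = (
        "Context:\n" + context + "\n\n"
        "Claim: " + claim + "\n"
        "Is the claim fully supported by the context? Answer SUPPORTED or NOT_SUPPORTED."
    )
    return llm(prompt).strip().upper().startswith("SUPPORTED")


def faithfulness_score(answer: str, context: str, llm: LLM) -> float:
    """Fraction of atomic claims grounded in the context (1.0 means no detected hallucination)."""
    claims = decompose_into_claims(answer, llm)
    if not claims:
        return 1.0
    supported = sum(verify_claim(c, context, llm) for c in claims)
    return supported / len(claims)


def pairwise_helpfulness(question: str, answer_a: str, answer_b: str, llm: LLM) -> str:
    """Pairwise comparison: ask the judge which answer is more helpful ('A' or 'B')."""
    prompt = (
        f"Question: {question}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}\n\n"
        "Considering relevance, comprehensiveness, and conciseness, "
        "which answer is more helpful? Reply with exactly 'A' or 'B'."
    )
    verdict = llm(prompt).strip().upper()
    return "A" if verdict.startswith("A") else "B"
```

In this sketch, faithfulness is scored per answer against its retrieved context, while helpfulness is judged comparatively between two candidate answers, mirroring the distinction drawn above between grounding and overall usefulness.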