🔍 Key Features of KOFFVQA: A Korean Free-form VQA Benchmark
📊 KOFFVQA enables open-ended evaluation, allowing models to generate free-form answers rather than choosing from predefined options.
🇰🇷 It focuses exclusively on the Korean language, addressing a critical gap in VLM benchmarks and recognizing that model performance can vary significantly by language.
🖼️ The benchmark includes 275 carefully curated image-question pairs, each accompanied by grading criteria, which together evaluate 10 diverse aspects of VLM performance (a hypothetical item structure is sketched after this list).
⚖️ Evaluation is based on a partial scoring approach using human-authored grading criteria, which enhances consistency and reduces subjectivity. This also allows for reliable evaluation using small open-source judge models.
🧪 Thanks to this design, KOFFVQA enables the use of LLMs as judges without any visual input. While VLM-based judges often hallucinate visual details and misgrade responses, LLM-based judges focus solely on the criteria and align more closely with human judgment (see the scoring sketch after this list).
💻 The evaluation code and dataset are open-source, supporting reproducibility and encouraging further research.
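To make the rubric-based setup concrete, here is a rough sketch of how one benchmark item could be represented. The field names and types below are assumptions for illustration only, not the released dataset schema.

```python
from dataclasses import dataclass, field


@dataclass
class GradingCriterion:
    """One human-authored rubric entry; partial credit is awarded per criterion."""
    description: str  # e.g. "states that the sign in the image is written in Korean"
    points: int       # points earned if the response satisfies this criterion


@dataclass
class BenchmarkItem:
    """Hypothetical shape of a KOFFVQA image-question pair (names assumed)."""
    image_path: str
    question: str                        # open-ended question, written in Korean
    aspect: str                          # one of the 10 evaluated aspects
    criteria: list[GradingCriterion] = field(default_factory=list)

    def max_score(self) -> int:
        return sum(c.points for c in self.criteria)
```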
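And here is a minimal sketch of how partial scoring with a text-only judge could work, assuming the judge is asked a yes/no question once per criterion. The prompt template and the per-criterion loop are illustrative simplifications, not the prompt or logic used in the released evaluation code.

```python
from typing import Callable


def score_response(question: str, response: str,
                   criteria: list[tuple[str, int]],
                   judge: Callable[[str], str]) -> float:
    """Award partial credit to a free-form answer using a text-only judge.

    `criteria` is a list of (criterion description, points) pairs, and `judge`
    is any function that maps a prompt string to the judge model's reply
    (e.g. a wrapper around a small open-source LLM). The image is never shown
    to the judge: grading depends only on the response and the criteria.
    """
    earned, total = 0, 0
    for description, points in criteria:
        total += points
        prompt = (
            "You are grading an answer to a question about an image.\n"
            f"Question: {question}\n"
            f"Answer: {response}\n"
            f"Criterion: {description}\n"
            "Does the answer satisfy the criterion? Reply YES or NO."
        )
        if judge(prompt).strip().upper().startswith("YES"):
            earned += points
    # Report partial credit as a fraction of the item's maximum score.
    return earned / total if total else 0.0
```

Because the judge only ever sees the question, the model's response, and one criterion at a time, even a small open-weight LLM can grade consistently without any opportunity to hallucinate visual details.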
KOFFVQA is a fine-grained, reliable Korean-language benchmark for evaluating VLMs, and it shows that text-only LLM-based judging is a practical alternative to VLM-based evaluation for free-form VQA.