We introduce DRISHTIKON, a first-of-its-kind multimodal and multilingual
benchmark centered exclusively on Indian culture, designed to evaluate the
cultural understanding of generative AI systems. Unlike existing benchmarks
with a generic or global scope, DRISHTIKON offers deep, fine-grained coverage
across India’s diverse regions, spanning 15 languages, covering all states and
union territories, and incorporating over 64,000 aligned text-image pairs. The
dataset captures rich cultural themes including festivals, attire, cuisines,
art forms, and historical heritage amongst many more. We evaluate a wide range
of vision-language models (VLMs), including open-source small and large models,
proprietary systems, reasoning-specialized VLMs, and Indic-focused models,
across zero-shot and chain-of-thought settings. Our results expose key
limitations in current models’ ability to reason over culturally grounded,
multimodal inputs, particularly for low-resource languages and less-documented
traditions. DRISHTIKON fills a vital gap in inclusive AI research, offering a
robust testbed to advance culturally aware, multimodally competent language
technologies.