New fully open source vision encoder OpenVision arrives to improve on OpenAI's CLIP, Google's SigLIP

By Advanced AI Editor | May 12, 2025 | 7 min read

The University of California, Santa Cruz has announced the release of OpenVision, a family of vision encoders that aims to provide a new alternative to models such as OpenAI's four-year-old CLIP and Google's SigLIP, released last year.

A vision encoder is a type of AI model that transforms visual material, typically still images, into numerical data that other, non-visual AI models such as large language models (LLMs) can understand. A vision encoder is the component that lets many leading LLMs work with images uploaded by users, making it possible for an LLM to identify different subjects, colors, locations, and other features within an image.
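
To make that concrete, here is a minimal, illustrative sketch of the idea: an image is cut into patches, each patch is projected to an embedding vector, and the resulting "visual tokens" can then be mapped into a language model's input space. The module, dimensions, and names below are hypothetical and are not drawn from OpenVision's code.

```python
# Illustrative sketch only; module names and dimensions are hypothetical,
# not taken from OpenVision's implementation.
import torch
import torch.nn as nn

class TinyVisionEncoder(nn.Module):
    def __init__(self, patch_size=16, dim=256):
        super().__init__()
        # Split the image into non-overlapping patches and embed each one.
        self.to_patches = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, images):
        # images: (batch, 3, H, W) -> visual tokens: (batch, num_patches, dim)
        feats = self.to_patches(images)
        return feats.flatten(2).transpose(1, 2)

encoder = TinyVisionEncoder()
dummy_image = torch.randn(1, 3, 224, 224)
visual_tokens = encoder(dummy_image)     # shape: (1, 196, 256)
# A projection layer would map these tokens into an LLM's embedding space,
# letting the language model attend to them alongside text tokens.
print(visual_tokens.shape)
```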

OpenVision, then, with its permissive Apache 2.0 license (which allows commercial use) and a family of 26 models ranging from 5.9 million to 632.1 million parameters, lets any developer or AI model maker within an enterprise or organization take and deploy an encoder that can ingest everything from images on a construction job site to photos of a user's washing machine, so that an AI model can offer guidance and troubleshooting, among myriad other use cases.

The models were developed by a team led by Cihang Xie, assistant professor at UCSC, along with contributors Xianhang Li, Yanqing Liu, Haoqin Tu, and Hongru Zhu.

The project builds upon the CLIPS training pipeline and leverages the Recap-DataComp-1B dataset, a re-captioned version of a billion-scale web image corpus using LLaVA-powered language models.

Scalable architecture for different enterprise deployment use cases

OpenVision’s design supports multiple use cases.

Larger models are well-suited for server-grade workloads that require high accuracy and detailed visual understanding, while smaller variants—some as lightweight as 5.9M parameters—are optimized for edge deployments where compute and memory are limited.

The models also support adaptive patch sizes (8×8 and 16×16), allowing for configurable trade-offs between detail resolution and computational load.
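
As a rough illustration of that trade-off (the arithmetic below is generic, not specific to OpenVision's internals): halving the patch edge length quadruples the number of visual tokens, which in turn raises the cost of attending over them.

```python
# Back-of-the-envelope token counts for the patch sizes mentioned above.
def num_patches(image_size: int, patch_size: int) -> int:
    return (image_size // patch_size) ** 2

for res in (224, 336):
    for patch in (16, 8):
        print(f"{res}x{res} image, {patch}x{patch} patches -> {num_patches(res, patch)} tokens")
# 224x224: 196 tokens at 16x16 vs. 784 at 8x8; finer patches preserve more
# detail (useful for OCR or charts) at a higher computational cost.
```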

Strong results across multimodal benchmarks

In a series of benchmarks, OpenVision demonstrates strong results across multiple vision-language tasks.

While traditional CLIP benchmarks such as ImageNet and MSCOCO remain part of the evaluation suite, the OpenVision team cautions against relying solely on those metrics.

Their experiments show that strong performance on image classification or retrieval does not necessarily translate to success in complex multimodal reasoning. Instead, the team advocates for broader benchmark coverage and open evaluation protocols that better reflect real-world multimodal use cases.

Evaluations were conducted using two standard multimodal frameworks—LLaVA-1.5 and Open-LLaVA-Next—and showed that OpenVision models consistently match or outperform both CLIP and SigLIP across tasks like TextVQA, ChartQA, MME, and OCR.

Under the LLaVA-1.5 setup, OpenVision encoders trained at 224×224 resolution scored higher than OpenAI’s CLIP in both classification and retrieval tasks, as well as in downstream evaluations like SEED, SQA, and POPE.

At higher input resolutions (336×336), OpenVision-L/14 outperformed CLIP-L/14 in most categories. Even the smaller models, such as OpenVision-Small and Tiny, maintained competitive accuracy while using significantly fewer parameters.

Efficient progressive training reduces compute costs

One notable feature of OpenVision is its progressive resolution training strategy, adapted from CLIPA. Models begin training on low-resolution images and are incrementally fine-tuned on higher resolutions.

This results in a more compute-efficient training process—often 2 to 3 times faster than CLIP and SigLIP—with no loss in downstream performance.
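
A hedged sketch of what such a progressive-resolution schedule can look like in practice follows; the specific resolutions, step counts, and helper code are illustrative assumptions rather than the published OpenVision recipe.

```python
import torch
import torch.nn.functional as F

def train_stage(model, loader, optimizer, resolution, steps):
    """Run one training stage with images resized to `resolution`."""
    for step, (images, captions) in zip(range(steps), loader):
        images = F.interpolate(images, size=(resolution, resolution), mode="bilinear")
        loss = model(images, captions)   # e.g. a contrastive image-text loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Most optimization happens at cheap, low resolutions; only a short final
# stage runs at the target resolution, which is where the compute savings come from.
schedule = [(112, 20_000), (160, 5_000), (224, 2_000)]   # (resolution, steps), illustrative
# for resolution, steps in schedule:
#     train_stage(model, loader, optimizer, resolution, steps)
```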

Ablation studies, in which components of a machine learning model are selectively removed to gauge how much each contributes, further confirm the benefits of this approach, with the largest performance gains observed in high-resolution, detail-sensitive tasks such as OCR and chart-based visual question answering.

Another factor in OpenVision’s performance is its use of synthetic captions and an auxiliary text decoder during training.

These design choices enable the vision encoder to learn more semantically rich representations, improving accuracy in multimodal reasoning tasks. Removing either component led to consistent performance drops in ablation tests.
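
In training-objective terms, the idea can be sketched as a contrastive loss plus an auxiliary captioning loss. The weighting and function names below are assumptions for illustration, not the exact published objective.

```python
def combined_training_loss(image_feats, text_feats,
                           decoder_logits, caption_tokens,
                           contrastive_loss_fn, captioning_loss_fn,
                           alpha=1.0):
    # CLIP-style term: pull matching image/caption embeddings together.
    contrastive = contrastive_loss_fn(image_feats, text_feats)
    # Auxiliary term: an attached text decoder must reconstruct the (synthetic)
    # caption from the visual features, pushing the encoder toward richer semantics.
    captioning = captioning_loss_fn(decoder_logits, caption_tokens)
    return contrastive + alpha * captioning
```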

Optimized for lightweight systems and edge computing use cases

OpenVision is also designed to work effectively with small language models.

In one experiment, a vision encoder was paired with a 150M-parameter Smol-LM to build a full multimodal model under 250M parameters.

Despite the tiny size, the system retained robust accuracy across a suite of VQA, document understanding, and reasoning tasks.

This capability suggests strong potential for edge-based or resource-constrained deployments, such as consumer smartphones or on-site manufacturing cameras and sensors.
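
Below is a minimal sketch of how such a pairing can be wired together, with a projection layer bridging the encoder's output dimension and the language model's embedding dimension. The class, dimensions, and calling convention are hypothetical, not the authors' exact setup.

```python
import torch
import torch.nn as nn

class TinyMultimodalModel(nn.Module):
    def __init__(self, vision_encoder, language_model, vision_dim=384, lm_dim=576):
        super().__init__()
        self.vision_encoder = vision_encoder            # e.g. a ~6M-parameter encoder variant
        self.projector = nn.Linear(vision_dim, lm_dim)  # bridges the two embedding spaces
        self.language_model = language_model            # e.g. a ~150M-parameter small LM

    def forward(self, images, text_embeddings):
        visual_tokens = self.projector(self.vision_encoder(images))
        # Prepend visual tokens to the text embeddings so the LM attends to both.
        inputs = torch.cat([visual_tokens, text_embeddings], dim=1)
        return self.language_model(inputs_embeds=inputs)   # HF-style call, assumed
```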

Why OpenVision matters to enterprise technical decision makers

OpenVision’s fully open and modular approach to vision encoder development has strategic implications for enterprise teams working across AI engineering, orchestration, data infrastructure, and security.

For engineers overseeing LLM development and deployment, OpenVision offers a plug-and-play solution for integrating high-performing vision capabilities without depending on opaque, third-party APIs or restricted model licenses.

This openness allows for tighter optimization of vision-language pipelines and ensures that proprietary data never leaves the organization’s environment.

For engineers focused on creating AI orchestration frameworks, OpenVision provides models at a broad range of parameter scales—from ultra-compact encoders suitable for edge devices to larger, high-resolution models suited for multi-node cloud pipelines.

This flexibility makes it easier to design scalable, cost-efficient MLOps workflows without compromising on task-specific accuracy. Its support for progressive resolution training also allows for smarter resource allocation during development, which is especially beneficial for teams operating under tight budget constraints.

Data engineers can leverage OpenVision to power image-heavy analytics pipelines, where structured data is augmented with visual inputs (e.g., documents, charts, product images). Since the model zoo supports multiple input resolutions and patch sizes, teams can experiment with trade-offs between fidelity and performance without retraining from scratch. Integration with tools like PyTorch and Hugging Face simplifies model deployment into existing data systems.
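
For example, a team might keep several encoder variants behind a single loading helper and benchmark them against the same pipeline. The repository identifiers below are hypothetical placeholders, and the assumption that the checkpoints load through the transformers AutoModel API should be checked against the project's actual documentation.

```python
from transformers import AutoModel   # assumed loading path; verify against the repo docs

# Hypothetical checkpoint IDs, not real OpenVision model names.
CANDIDATES = {
    "fast":     "org/openvision-small-patch16-224",   # fewer visual tokens, cheaper inference
    "detailed": "org/openvision-large-patch8-336",    # more tokens, finer detail for OCR/charts
}

def load_encoder(profile: str):
    """Pick an encoder variant by fidelity/throughput profile."""
    return AutoModel.from_pretrained(CANDIDATES[profile])

# encoder = load_encoder("fast")   # switch to "detailed" for document- or chart-heavy inputs
```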

Meanwhile, OpenVision’s transparent architecture and reproducible training pipeline allow security teams to assess and monitor models for potential vulnerabilities—unlike black-box APIs where internal behavior is inaccessible.

When deployed on-premise, these models avoid the risks of data leakage during inference, which is critical in regulated industries handling sensitive visual data such as IDs, medical forms, or financial records.

Across all these roles, OpenVision helps reduce vendor lock-in and brings the benefits of modern multimodal AI into workflows that demand control, customization, and operational transparency. It gives enterprise teams the technical foundation to build competitive, AI-enhanced applications—on their own terms.

Open for business

The OpenVision model zoo is available in both PyTorch and JAX implementations, and the team has also released utilities for integration with popular vision-language frameworks.

As of this release, models can be downloaded from Hugging Face, and training recipes are publicly posted for full reproducibility.

By providing a transparent, efficient, and scalable alternative to proprietary encoders, OpenVision offers researchers and developers a flexible foundation for advancing vision-language applications. Its release marks a significant step forward in the push for open multimodal infrastructure—especially for those aiming to build performant systems without access to closed data or compute-heavy training pipelines.

For full documentation, benchmarks, and downloads, visit the OpenVision project page or GitHub repository.
