New fully open source vision encoder OpenVision arrives to improve on OpenAI's CLIP, Google's SigLIP

By Advanced AI Editor | May 12, 2025 | 7 min read

The University of California, Santa Cruz has announced the release of OpenVision, a family of vision encoders that aims to provide a new alternative to models such as OpenAI's four-year-old CLIP and Google's SigLIP, released last year.

A vision encoder is a type of AI model that transforms visual material, typically still images, into numerical data that other, non-visual AI models such as large language models (LLMs) can understand. A vision encoder is the component that lets many leading LLMs work with images uploaded by users, making it possible for an LLM to identify different subjects, colors, locations, and other features within an image.
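
To make that concrete, here is a minimal, illustrative sketch of the idea: an image is cut into patches, each patch is projected to an embedding vector, and the resulting "visual tokens" can then be mapped into a language model's input space. The module, dimensions, and names below are hypothetical and are not drawn from OpenVision's code.

```python
# Illustrative sketch only; module names and dimensions are hypothetical,
# not taken from OpenVision's implementation.
import torch
import torch.nn as nn

class TinyVisionEncoder(nn.Module):
    def __init__(self, patch_size=16, dim=256):
        super().__init__()
        # Split the image into non-overlapping patches and embed each one.
        self.to_patches = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, images):
        # images: (batch, 3, H, W) -> visual tokens: (batch, num_patches, dim)
        feats = self.to_patches(images)
        return feats.flatten(2).transpose(1, 2)

encoder = TinyVisionEncoder()
dummy_image = torch.randn(1, 3, 224, 224)
visual_tokens = encoder(dummy_image)     # shape: (1, 196, 256)
# A projection layer would map these tokens into an LLM's embedding space,
# letting the language model attend to them alongside text tokens.
print(visual_tokens.shape)
```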

OpenVision, then, with its permissive Apache 2.0 license (which allows commercial use) and a family of 26 models ranging from 5.9 million to 632.1 million parameters, lets any developer or AI model maker within an enterprise or organization take and deploy an encoder that can ingest everything from images on a construction job site to photos of a user's washing machine, so that an AI model can offer guidance and troubleshooting, among myriad other use cases.

The models were developed by a team led by Cihang Xie, assistant professor at UCSC, along with contributors Xianhang Li, Yanqing Liu, Haoqin Tu, and Hongru Zhu.

The project builds upon the CLIPS training pipeline and leverages the Recap-DataComp-1B dataset, a re-captioned version of a billion-scale web image corpus using LLaVA-powered language models.

Scalable architecture for different enterprise deployment use cases

OpenVision’s design supports multiple use cases.

Larger models are well-suited for server-grade workloads that require high accuracy and detailed visual understanding, while smaller variants—some as lightweight as 5.9M parameters—are optimized for edge deployments where compute and memory are limited.

The models also support adaptive patch sizes (8×8 and 16×16), allowing for configurable trade-offs between detail resolution and computational load.
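
As a rough illustration of that trade-off (the arithmetic below is generic, not specific to OpenVision's internals): halving the patch edge length quadruples the number of visual tokens, which in turn raises the cost of attending over them.

```python
# Back-of-the-envelope token counts for the patch sizes mentioned above.
def num_patches(image_size: int, patch_size: int) -> int:
    return (image_size // patch_size) ** 2

for res in (224, 336):
    for patch in (16, 8):
        print(f"{res}x{res} image, {patch}x{patch} patches -> {num_patches(res, patch)} tokens")
# 224x224: 196 tokens at 16x16 vs. 784 at 8x8; finer patches preserve more
# detail (useful for OCR or charts) at a higher computational cost.
```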

Strong results across multimodal benchmarks

In a series of benchmarks, OpenVision demonstrates strong results across multiple vision-language tasks.

While traditional CLIP benchmarks such as ImageNet and MSCOCO remain part of the evaluation suite, the OpenVision team cautions against relying solely on those metrics.

Their experiments show that strong performance on image classification or retrieval does not necessarily translate to success in complex multimodal reasoning. Instead, the team advocates for broader benchmark coverage and open evaluation protocols that better reflect real-world multimodal use cases.

Evaluations were conducted using two standard multimodal frameworks—LLaVA-1.5 and Open-LLaVA-Next—and showed that OpenVision models consistently match or outperform both CLIP and SigLIP across tasks like TextVQA, ChartQA, MME, and OCR.

Under the LLaVA-1.5 setup, OpenVision encoders trained at 224×224 resolution scored higher than OpenAI’s CLIP in both classification and retrieval tasks, as well as in downstream evaluations like SEED, SQA, and POPE.

At higher input resolutions (336×336), OpenVision-L/14 outperformed CLIP-L/14 in most categories. Even the smaller models, such as OpenVision-Small and Tiny, maintained competitive accuracy while using significantly fewer parameters.

Efficient progressive training reduces compute costs

One notable feature of OpenVision is its progressive resolution training strategy, adapted from CLIPA. Models begin training on low-resolution images and are incrementally fine-tuned on higher resolutions.

This results in a more compute-efficient training process—often 2 to 3 times faster than CLIP and SigLIP—with no loss in downstream performance.
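
A hedged sketch of what such a progressive-resolution schedule can look like in practice follows; the specific resolutions, step counts, and helper code are illustrative assumptions rather than the published OpenVision recipe.

```python
import torch
import torch.nn.functional as F

def train_stage(model, loader, optimizer, resolution, steps):
    """Run one training stage with images resized to `resolution`."""
    for step, (images, captions) in zip(range(steps), loader):
        images = F.interpolate(images, size=(resolution, resolution), mode="bilinear")
        loss = model(images, captions)   # e.g. a contrastive image-text loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Most optimization happens at cheap, low resolutions; only a short final
# stage runs at the target resolution, which is where the compute savings come from.
schedule = [(112, 20_000), (160, 5_000), (224, 2_000)]   # (resolution, steps), illustrative
# for resolution, steps in schedule:
#     train_stage(model, loader, optimizer, resolution, steps)
```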

Ablation studies, in which components of a machine learning model are selectively removed to gauge how much each contributes, further confirm the benefits of this approach, with the largest performance gains observed in high-resolution, detail-sensitive tasks such as OCR and chart-based visual question answering.

Another factor in OpenVision’s performance is its use of synthetic captions and an auxiliary text decoder during training.

These design choices enable the vision encoder to learn more semantically rich representations, improving accuracy in multimodal reasoning tasks. Removing either component led to consistent performance drops in ablation tests.
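
In training-objective terms, the idea can be sketched as a contrastive loss plus an auxiliary captioning loss. The weighting and function names below are assumptions for illustration, not the exact published objective.

```python
def combined_training_loss(image_feats, text_feats,
                           decoder_logits, caption_tokens,
                           contrastive_loss_fn, captioning_loss_fn,
                           alpha=1.0):
    # CLIP-style term: pull matching image/caption embeddings together.
    contrastive = contrastive_loss_fn(image_feats, text_feats)
    # Auxiliary term: an attached text decoder must reconstruct the (synthetic)
    # caption from the visual features, pushing the encoder toward richer semantics.
    captioning = captioning_loss_fn(decoder_logits, caption_tokens)
    return contrastive + alpha * captioning
```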

Optimized for lightweight systems and edge computing use cases

OpenVision is also designed to work effectively with small language models.

In one experiment, a vision encoder was paired with a 150M-parameter Smol-LM to build a full multimodal model under 250M parameters.

Despite the tiny size, the system retained robust accuracy across a suite of VQA, document understanding, and reasoning tasks.

This capability suggests strong potential for edge-based or resource-constrained deployments, such as consumer smartphones or on-site manufacturing cameras and sensors.
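
Below is a minimal sketch of how such a pairing can be wired together, with a projection layer bridging the encoder's output dimension and the language model's embedding dimension. The class, dimensions, and calling convention are hypothetical, not the authors' exact setup.

```python
import torch
import torch.nn as nn

class TinyMultimodalModel(nn.Module):
    def __init__(self, vision_encoder, language_model, vision_dim=384, lm_dim=576):
        super().__init__()
        self.vision_encoder = vision_encoder            # e.g. a ~6M-parameter encoder variant
        self.projector = nn.Linear(vision_dim, lm_dim)  # bridges the two embedding spaces
        self.language_model = language_model            # e.g. a ~150M-parameter small LM

    def forward(self, images, text_embeddings):
        visual_tokens = self.projector(self.vision_encoder(images))
        # Prepend visual tokens to the text embeddings so the LM attends to both.
        inputs = torch.cat([visual_tokens, text_embeddings], dim=1)
        return self.language_model(inputs_embeds=inputs)   # HF-style call, assumed
```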

Why OpenVision matters to enterprise technical decision makers

OpenVision’s fully open and modular approach to vision encoder development has strategic implications for enterprise teams working across AI engineering, orchestration, data infrastructure, and security.

For engineers overseeing LLM development and deployment, OpenVision offers a plug-and-play solution for integrating high-performing vision capabilities without depending on opaque, third-party APIs or restricted model licenses.

This openness allows for tighter optimization of vision-language pipelines and ensures that proprietary data never leaves the organization’s environment.

For engineers focused on creating AI orchestration frameworks, OpenVision provides models at a broad range of parameter scales—from ultra-compact encoders suitable for edge devices to larger, high-resolution models suited for multi-node cloud pipelines.

This flexibility makes it easier to design scalable, cost-efficient MLOps workflows without compromising on task-specific accuracy. Its support for progressive resolution training also allows for smarter resource allocation during development, which is especially beneficial for teams operating under tight budget constraints.

Data engineers can leverage OpenVision to power image-heavy analytics pipelines, where structured data is augmented with visual inputs (e.g., documents, charts, product images). Since the model zoo supports multiple input resolutions and patch sizes, teams can experiment with trade-offs between fidelity and performance without retraining from scratch. Integration with tools like PyTorch and Hugging Face simplifies model deployment into existing data systems.
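
For example, a team might keep several encoder variants behind a single loading helper and benchmark them against the same pipeline. The repository identifiers below are hypothetical placeholders, and the assumption that the checkpoints load through the transformers AutoModel API should be checked against the project's actual documentation.

```python
from transformers import AutoModel   # assumed loading path; verify against the repo docs

# Hypothetical checkpoint IDs, not real OpenVision model names.
CANDIDATES = {
    "fast":     "org/openvision-small-patch16-224",   # fewer visual tokens, cheaper inference
    "detailed": "org/openvision-large-patch8-336",    # more tokens, finer detail for OCR/charts
}

def load_encoder(profile: str):
    """Pick an encoder variant by fidelity/throughput profile."""
    return AutoModel.from_pretrained(CANDIDATES[profile])

# encoder = load_encoder("fast")   # switch to "detailed" for document- or chart-heavy inputs
```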

Meanwhile, OpenVision’s transparent architecture and reproducible training pipeline allow security teams to assess and monitor models for potential vulnerabilities—unlike black-box APIs where internal behavior is inaccessible.

When deployed on-premise, these models avoid the risks of data leakage during inference, which is critical in regulated industries handling sensitive visual data such as IDs, medical forms, or financial records.

Across all these roles, OpenVision helps reduce vendor lock-in and brings the benefits of modern multimodal AI into workflows that demand control, customization, and operational transparency. It gives enterprise teams the technical foundation to build competitive, AI-enhanced applications—on their own terms.

Open for business

The OpenVision model zoo is available in both PyTorch and JAX implementations, and the team has also released utilities for integration with popular vision-language frameworks.

As of this release, models can be downloaded from Hugging Face, and training recipes are publicly posted for full reproducibility.

By providing a transparent, efficient, and scalable alternative to proprietary encoders, OpenVision offers researchers and developers a flexible foundation for advancing vision-language applications. Its release marks a significant step forward in the push for open multimodal infrastructure—especially for those aiming to build performant systems without access to closed data or compute-heavy training pipelines.

For full documentation, benchmarks, and downloads, visit the OpenVision project page or GitHub repository.
