From Editor To Dense Geometry Estimator - Takara TLDR

Leveraging visual priors from pre-trained text-to-image (T2I) generative
models has shown success in dense prediction. However, dense prediction is
inherently an image-to-image task, suggesting that image editing models, rather
than T2I generative models, may be a more suitable foundation for fine-tuning.
Motivated by this, we conduct a systematic analysis of the fine-tuning
behaviors of both editors and generators for dense geometry estimation. Our
findings show that editing models possess inherent structural priors, which
enable them to converge more stably by “refining” their innate features, and
ultimately achieve higher performance than their generative counterparts.
Based on these findings, we introduce FE2E, a framework that is the first to
adapt an advanced editing model based on the Diffusion Transformer (DiT)
architecture for dense geometry prediction. Specifically, to tailor the
editor for this deterministic task, we reformulate the editor’s original flow
matching loss into a “consistent velocity” training objective. We also use
logarithmic quantization to resolve the precision conflict between the editor’s
native BFloat16 format and the high precision demands of our tasks.
Additionally, we leverage the DiT’s global attention for a cost-free joint
estimation of depth and normals in a single forward pass, enabling their
supervisory signals to mutually enhance each other.
Without scaling up the training data, FE2E achieves impressive performance
improvements in zero-shot monocular depth and normal estimation across multiple
datasets. Notably, it achieves over 35% performance gains on the ETH3D dataset
and outperforms the DepthAnything series, which is trained on 100× more data.
The project page can be accessed at https://amap-ml.github.io/FE2E/.
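
To make the BFloat16 precision conflict mentioned above concrete, the following is a minimal sketch and not the FE2E implementation. It assumes a depth range of 0.1 m to 80 m and a target interval of [-1, 1], both chosen only for illustration, and the helper names (`to_bfloat16`, `encode_linear`, `encode_log`, `relative_errors`) are hypothetical. BFloat16 is emulated in NumPy by rounding float32 values to their top 16 bits. The sketch compares the relative depth error after a BFloat16 round trip when depth is mapped linearly versus logarithmically.

```python
import numpy as np

# Assumed depth range in metres (illustrative, not taken from the paper).
D_MIN, D_MAX = 0.1, 80.0


def to_bfloat16(x):
    """Round float32 values to the nearest bfloat16 and return them as float32."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    # Round to nearest even on the 16 low bits that bfloat16 discards.
    rounding = np.uint32(0x7FFF) + ((bits >> np.uint32(16)) & np.uint32(1))
    return ((bits + rounding) & np.uint32(0xFFFF0000)).view(np.float32)


def encode_linear(d):
    """Map depth linearly from [D_MIN, D_MAX] to [-1, 1]."""
    return 2.0 * (d - D_MIN) / (D_MAX - D_MIN) - 1.0


def decode_linear(u):
    return (u + 1.0) / 2.0 * (D_MAX - D_MIN) + D_MIN


def encode_log(d):
    """Map log-depth linearly from [log D_MIN, log D_MAX] to [-1, 1]."""
    lo, hi = np.log(D_MIN), np.log(D_MAX)
    return 2.0 * (np.log(d) - lo) / (hi - lo) - 1.0


def decode_log(u):
    lo, hi = np.log(D_MIN), np.log(D_MAX)
    return np.exp((u + 1.0) / 2.0 * (hi - lo) + lo)


def relative_errors(depth):
    """Relative depth error after a bfloat16 round trip for both encodings."""
    depth = np.asarray(depth, dtype=np.float64)
    lin = decode_linear(to_bfloat16(encode_linear(depth)).astype(np.float64))
    log = decode_log(to_bfloat16(encode_log(depth)).astype(np.float64))
    return np.abs(lin - depth) / depth, np.abs(log - depth) / depth


if __name__ == "__main__":
    depths = np.geomspace(D_MIN, D_MAX, 2000)
    lin_err, log_err = relative_errors(depths)
    print("max relative error, linear encoding:", lin_err.max())
    print("max relative error, log encoding:   ", log_err.max())
```

Under these assumptions, the linear mapping concentrates nearby depths close to -1, where BFloat16 values are coarsely spaced, so small depths lose most of their precision. The logarithmic mapping spreads the rounding error evenly in relative terms across the whole range.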
