Unified multimodal Large Language Models (LLMs) that can both understand and
generate visual content hold immense potential. However, existing open-source
models often suffer from a performance trade-off between these capabilities. We
present Manzano, a simple and scalable unified framework that substantially
reduces this tension by coupling a hybrid image tokenizer with a well-curated
training recipe. A single shared vision encoder feeds two lightweight adapters
that produce continuous embeddings for image-to-text understanding and discrete
tokens for text-to-image generation within a common semantic space. A unified
autoregressive LLM predicts high-level semantics in the form of text and image
tokens, with an auxiliary diffusion decoder subsequently translating the image
tokens into pixels. The architecture, together with a unified training recipe
over understanding and generation data, enables scalable joint learning of both
capabilities. Manzano achieves state-of-the-art results among unified models,
and is competitive with specialist models, particularly on text-rich
evaluation. Our studies show minimal task conflicts and consistent gains from
scaling model size, validating our design choice of a hybrid tokenizer.