Motivation
Modern multimodal systems convert images into discrete token sequences consumed by transformer-based architectures. This interface enables scalable training, compression, and unified reasoning over visual and textual inputs, making image tokenization a central problem in representation learning.
While recent work has renewed interest in 1D discrete bottlenecks, many approaches still optimize primarily for compression-reconstruction trade-offs. As a result, tokens often entangle semantics and fail to localize objects, limiting downstream performance on compositional and relational tasks.
We shift focus from compression to the semantic organization of token sequences. Alignment losses help, but semantic structure also requires an encoding process that itself encourages compositionality.
Our approach is inspired by how humans describe visual scenes under limited bandwidth: attention moves region-by-region, integrating salient information into a message and reducing uncertainty progressively. This naturally prioritizes high-level entities and relations over fine detail, creating a coarse-to-fine hierarchy.
COMiT follows this intuition through attentive sequential tokenization and homogeneous communication, realized within a single model that both encodes and decodes.