MeshAnything V2: Artist-Created Mesh Generation via Adjacent Mesh Tokenization

Chen, Wang, Luo, Wang, Chen, Zhu, Zhang, Lin — arXiv:2408.02555 (2024)
Explained assuming: you know meshes, point clouds, vertices/faces/normals at a working level — not a graphics expert. You're newer to deep learning: transformers, tokens, embeddings, and autoregressive generation explained from scratch below.

One-sentence version: MeshAnything V2 is a transformer trained once, offline, on a large dataset of real artist-made meshes, to predict a mesh's faces one discrete "token" at a time — the same mechanism a language model uses to predict the next word — conditioned on a point cloud that describes the target shape. At inference time it's a single fast feed-forward pass, not an iterative per-shape optimization loop the way Score Distillation Sampling-based 3D generation is.

The problem it solves

Most automated 3D generation — NeRF, 3D Gaussian splatting, implicit SDF/occupancy fields extracted via marching cubes — produces what this paper calls an "algorithm-generated mesh" (AGM): a dense, uneven triangle soup with no clean structure. It renders fine, but it's painful to actually work with — hard to rig, animate, edit, or texture by hand.

An "artist-created mesh" (AM) — the kind a human 3D modeler makes — instead uses a small number of faces (hundreds, not hundreds of thousands), mostly quads/n-gons rather than pure triangles, with edges deliberately aligned to the shape's actual geometric features (corners, silhouette edges, part boundaries).

MeshAnything's goal: given any 3D shape, regardless of how it was produced, generate a plausible artist-style retopology of it. It's a retopology/regeneration step, not a from-scratch shape generator — the shape already exists (as a point cloud); the model's job is purely to decide where a human-style mesh's vertices and faces would go on that shape.

Deep learning concepts, from scratch

Tokens and vocabularies

A "token" is just a discrete symbol drawn from a fixed, finite list (a "vocabulary"). Text models split sentences into sub-word tokens (e.g. "unhappiness" → un + happi + ness), each with an ID number the model can put a probability on. Mesh vertex coordinates are continuous floating-point numbers, which don't fit that scheme directly — so MeshAnything quantizes them: it snaps each coordinate onto a fixed discrete grid (think: rounding to the nearest of a few hundred allowed values along each axis). Once coordinates are discrete, they can be treated as vocabulary entries, exactly like word-pieces, and the same machinery used for language models applies.

Autoregressive generation (the language-model mechanism)

"Autoregressive" means: predict one token, append it to what's been generated so far, then predict the next token conditioned on everything generated up to that point — repeat until a special "end" token comes up or a max length is hit. This is exactly how a chatbot generates a reply one word-piece at a time. MeshAnything applies the identical mechanism, except its vocabulary is quantized mesh geometry (vertex coordinates and structural marker tokens) instead of English word-pieces. At each step the model outputs a probability distribution over the vocabulary; a token is sampled (or the most likely one is taken); that token is fed back in as input for predicting the next one.

This is a fundamentally different generative mechanism from Score Distillation Sampling (SDS)-based 3D generation: there, a fixed-size set of continuous parameters (e.g. a NeRF's weights) is nudged by gradient descent over thousands of iterations, using a frozen 2D diffusion model purely as a critic supplying a gradient. Here, the model itself directly emits the output, symbol by symbol, the same way it would emit a sentence.

Embeddings / latent codes (how the point cloud gets "read")

An embedding (also called a latent code) is a vector of numbers produced by a neural network that summarizes some input in a form other networks can consume — conceptually similar to how a CLIP model turns an image into a vector that captures "what's in it" without being the pixels themselves. MeshAnything uses a separate, frozen, pretrained point-cloud encoder network (trained once elsewhere, not touched further here) that reads the input point cloud (8192 points, each with xyz position + a surface normal) and compresses it into a compact set of shape-embedding vectors capturing the target geometry.

Cross-attention (how generation stays grounded in the input shape)

"Attention" in a transformer is a mechanism for letting one token look at other tokens/vectors and decide which are relevant to it right now. Self-attention = a token looking at the other tokens already generated in its own sequence (needed just to keep the mesh internally consistent). Cross-attention = a token looking at a completely different set of vectors from another source — here, the frozen shape embedding from the point-cloud encoder. Every time the decoder is about to emit the next mesh token, it cross-attends to the shape embedding, which is what keeps the generated mesh actually shaped like the input point cloud instead of some generic mesh unrelated to it.

How it's trained (the other big contrast with SDS)

MeshAnything's transformer weights are trained once, offline, on a large dataset of real (point cloud, artist-made mesh) pairs. Training uses ordinary supervised next-token cross-entropy loss ("teacher forcing"): feed the model the correct tokens generated so far, ask it to predict the next one, compare its predicted probability distribution against the actual next token in the real mesh, backpropagate, repeat over the whole dataset. This is precisely how large language models are trained on text — same loss function, same idea, different vocabulary.

Feed-forward vs. per-instance optimization — the key distinction from SDS-based methods. SDS needs zero training data and optimizes a fresh set of parameters for one single shape via a frozen critic's gradient, over thousands of iterations, every single time you want a new asset — it never "learns" anything reusable. MeshAnything's transformer, by contrast, learns its weights once, over a whole dataset; then for every new input, inference is one autoregressive decode — a handful of seconds, no per-shape optimization loop at all. This is what "feed-forward" means as a deep-learning term: a single forward pass (or one autoregressive decode) through fixed, already-trained weights, as opposed to running gradient descent anew for each input.

Architecture / pipeline

Point cloud (8192 × 6) xyz + normal, bbox-normalized, +Y-up Frozen point-cloud encoder pretrained elsewhere, not trained further here Shape embedding compact latent vectors summarizing geometry Autoregressive transformer decoder sample next token → append → cross-attend to shape embedding → repeat until end-token or max length feed token back in Raw token sequence → (N, 9) NaN-padded rows = end of generation Detokenize undo quantization + adjacency decoding (AMT)
Blue = graphics representations you already know (point cloud, final mesh). Orange = the new deep-learning machinery (frozen encoder, autoregressive decoder, its token-feedback loop). Green = translating back from tokens into an explicit mesh. After detokenizing: mask NaN rows, merge_vertices(), remove degenerate/duplicate faces — ordinary mesh cleanup, not a deep-learning step — giving the final artist-style mesh (≤1600 faces).

Adjacent Mesh Tokenization (AMT) — V2's actual contribution

The naive way to tokenize a mesh (used by V1 and by the earlier MeshGPT paper it builds on): emit every face as 3 full vertices — 9 numbers — independently. The problem: in a real mesh, most consecutive faces (in a sensible traversal order) share an edge, meaning they share 2 of their 3 vertices with the previous face. Emitting all 9 numbers per face re-sends coordinates that were just emitted moments ago — wasted tokens, in a transformer whose sequence length is a hard, fixed budget.

AMT orders faces via a specific traversal so that, whenever the next face shares an edge with the current one, only the new, not-yet-seen vertex needs to be emitted — the shared ones are implicit from the adjacency structure already established. This roughly halves the average number of tokens needed per face.

Hand-drawn illustration of two adjacent triangles sharing an edge; the shared vertices are faded and labeled already known, skip, while the one new vertex of the second triangle is highlighted and labeled only this one is new.
Two consecutive faces in the traversal order share an edge — so the second face only needs to emit its one genuinely new vertex; the shared pair is already implicit from what came before.

Why this matters concretely: a transformer's context window (max sequence length) is fixed. Halving the tokens-per-face means the same context window can represent roughly twice as many faces — which is the direct reason V2 raised the practical face ceiling to 1600 faces versus V1's much lower cap. (Analogy, if it helps: like video compression encoding only the pixels that changed between frames instead of re-sending every frame whole — exploiting redundancy between consecutive items instead of encoding each one independently.)

Input / output contract (from the reference implementation)

AspectValue
CheckpointYiwen-ntu/meshanythingv2 on Hugging Face
Inputpoint cloud (8192, 6) — xyz + normal, centered on bbox midpoint, scaled to [-0.9995, 0.9995]³, float16, +Y-up
Outputflat vertex tensor reshaped to (-1, 9) (3 vertices × 3 coords per face); NaN rows = padding, must be masked
Post-processingmerge_vertices + degenerate/duplicate-face removal (matches reference main.py)
Face cap1600 — a training-distribution limit, not just an inference knob
Reference cost~8GB VRAM, ~45s/mesh on an A6000

Known limitations — and why input quality matters

Feed-forward models only get one shot. Because there's no optimization loop at inference (unlike SDS, which can iterate for as long as needed on one shape), output quality is bounded by (a) the training data's distribution of artist meshes and (b) whatever single point cloud you hand it in one pass. If that point cloud is noisy or under-detailed, the model can't "try harder" the way a gradient-based optimizer could. This is exactly the reasoning behind MeshAnything's own README guidance: feed-forward 3D generation methods in general can produce shapes with "insufficient shape quality," and they suggest sourcing input from reconstruction/scanning, an SDS-based method (e.g. DreamCraft3D), or Rodin instead — i.e., improve the one input you get to hand it, since there's no second chance at inference time.