
Meta's VL-Jepa: Shifting AI's Focus from Words to Worlds


10xTeam November 17, 2025 5 min read

A groundbreaking research paper from Meta, co-authored by the influential Yann LeCun, is sparking a critical conversation. Does this new development signal the end of Large Language Models (LLMs), or do they still represent the future of AI? This article explores that very question.

The paper, titled “VL-Jepa” (Vision Language Joint Embedding Predictive Architecture), was released on December 11, 2025. But before diving into what VL-Jepa is, it’s essential to understand how the AI systems we use today actually function.

The Reign of Auto-Regressive Models

Large Language Models like GPT, Llama, and Gemini are masters of generating text. They accomplish this by producing one token at a time, from left to right. This process is known as auto-regression. Each time the model predicts the next word, all the previously generated words become its input.

Consider generating a sentence like, “The sun shines brightly.”

  1. The model first predicts “The”.
  2. It then uses “The” to predict “sun”.
  3. It uses “The sun” to predict “shines”.
  4. Finally, it uses “The sun shines” to predict “brightly”.
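The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a real model: the lookup table `NEXT_TOKEN` and the function `generate` are invented stand-ins for the forward pass of an actual LLM.

```python
# Hypothetical lookup: maps the context generated so far to the next token.
# A real LLM computes this mapping with a neural network forward pass.
NEXT_TOKEN = {
    (): "The",
    ("The",): "sun",
    ("The", "sun"): "shines",
    ("The", "sun", "shines"): "brightly.",
}

def generate(max_steps: int = 10) -> list[str]:
    """Generate tokens one at a time, feeding each output back as input."""
    tokens: list[str] = []
    for _ in range(max_steps):
        context = tuple(tokens)        # all previously generated tokens
        nxt = NEXT_TOKEN.get(context)  # predict the next token from context
        if nxt is None:                # no continuation known: stop
            break
        tokens.append(nxt)
    return tokens

print(" ".join(generate()))  # The sun shines brightly.
```

Note that each step must wait for the previous one to finish, which is exactly why the process is sequential and slow.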

This approach works remarkably well, but it has a major limitation. The model never knows the final answer upfront: it generates one word at a time, relying entirely on previously generated tokens to decide the next one. This makes the process slow and computationally expensive. More importantly, it ties the model to language structure rather than a true, underlying understanding of the world.

Yann LeCun’s Vision: Beyond Next-Word Prediction

This is a limitation Yann LeCun has criticized for years. In many talks and interviews, he has clearly stated that simply scaling up LLMs is not the same as achieving intelligence. Bigger datasets, larger models, and longer context windows are not enough to produce human-level intelligence.

Why? Because intelligence isn’t about predicting the next word. It’s about understanding the world around you. Language is merely a way to express thought. Thinking itself happens through concepts, not tokens.

Enter VL-Jepa: A New Paradigm

This is precisely where VL-Jepa takes a completely different approach.

VL-Jepa is a vision-language model, meaning it can understand images, videos, and text together. But unlike traditional models, it does not generate text word by word. Instead, VL-Jepa predicts semantic embeddings. It builds an internal, semantic understanding of what it sees, rather than just learning to string words together.

Because of this, VL-Jepa operates as a non-generative model. It learns what something means, not how to describe it word by word. It performs reasoning in a latent semantic space rather than a token space.
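The difference in output interfaces can be made concrete with some back-of-the-envelope arithmetic. The vocabulary size and embedding width below are typical, illustrative values, not figures from the paper.

```python
# A generative LLM emits a probability distribution over its whole
# vocabulary at every step; an embedding-predictive model emits one vector.

VOCAB_SIZE = 50_000  # typical LLM vocabulary size (illustrative)
EMBED_DIM = 768      # typical embedding width (illustrative)
CAPTION_LEN = 20     # length of a short caption, in tokens

# Generating a 20-token caption: 20 distributions over the whole vocabulary.
generative_output_shape = (CAPTION_LEN, VOCAB_SIZE)
generative_scores = CAPTION_LEN * VOCAB_SIZE

# Predicting the meaning directly: a single embedding vector.
predictive_output_shape = (EMBED_DIM,)

print(generative_scores)  # scores the generative path must produce: 1000000
print(EMBED_DIM)          # numbers the predictive path emits: 768
```

The point is not the exact numbers but the shape of the problem: one prediction in latent space replaces a long chain of per-token predictions.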

Semantic Space vs. Token Space: A Tale of Two Mappings

What exactly is a “semantic space” versus a “token space”? Let’s break it down.

In AI, a “space” is simply a way to organize information using numbers. Whether you have a sentence, an image, or a video, it’s converted into a vector—a list of numbers. In this space, AI models place items with similar meanings close to each other. For example, the words “love” and “like” would be neighbors, while “cat” and “car” would be far apart. It’s a map of meanings.

The World of Meanings: Semantic Space

A semantic space organizes information based on conceptual similarity. Things that are the same or similar are close.

Consider these two sentences:

  • “A dog is running.”
  • “A puppy is playing.”

While they use different words, their meaning is similar. In a semantic space, their embeddings (their numerical representations) would be close together.

Now, consider these two sentences:

  • “A dog is running.”
  • “A car is parked.”

These have vastly different meanings. In a semantic space, their embeddings would be far apart.
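A toy semantic space makes these distances tangible. The 3-dimensional vectors below are hand-made, purely for illustration; real embeddings have hundreds of learned dimensions. Closeness is measured with cosine similarity, a standard choice for comparing embeddings.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 for identical directions, near 0 for unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Invented embeddings for the three example sentences.
embeddings = {
    "A dog is running.":   [0.90, 0.80, 0.10],
    "A puppy is playing.": [0.85, 0.75, 0.20],
    "A car is parked.":    [0.10, 0.20, 0.90],
}

dog, puppy, car = embeddings.values()
print(round(cosine(dog, puppy), 3))  # close to 1: similar meaning
print(round(cosine(dog, car), 3))    # much lower: different meaning
```

Despite sharing no important words, the dog and puppy sentences land close together, while the car sentence sits far away, which is exactly the "map of meanings" described above.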

The World of Words: Token Space

In contrast, a token space is what current LLMs use. Here, the focus is on words and sub-words. The model cares about grammar, punctuation, and word order. Its primary goal is to predict the next token in a sequence.

For the sentence, “A person named Alex enjoys programming,” a traditional LLM treats each word as a separate token and predicts them one by one.

In a semantic space, which VL-Jepa uses, the model works with concepts and meanings. It ignores the exact wording and represents the entire idea at once. The same sentence becomes a single abstract concept: a person named Alex who likes to code. This entire meaning is stored as one abstract vector.
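The two representations of the Alex sentence can be contrasted directly. The whitespace tokenizer and the 4-number "meaning" vector below are invented for illustration; real systems use subword tokenizers (such as BPE) and learned, high-dimensional vectors.

```python
sentence = "A person named Alex enjoys programming"

# Token space: an ordered sequence, predicted one step at a time.
tokens = sentence.split()  # naive whitespace tokenizer, for illustration
print(tokens)       # ['A', 'person', 'named', 'Alex', 'enjoys', 'programming']
print(len(tokens))  # 6 separate prediction targets

# Semantic space: the whole idea compressed into one vector (values invented).
meaning = [0.12, -0.48, 0.91, 0.05]
print(meaning)  # one vector for the entire concept, regardless of wording
```

Rephrase the sentence ("Alex is someone who loves to code") and the token sequence changes completely, while the meaning vector should barely move.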

The difference is profound:

  • LLMs think by predicting the next word.
  • VL-Jepa thinks by predicting the meaning.

Why This New Approach is So Powerful

This shift from token space to semantic space is incredibly powerful for several reasons:

  • Focus on Meaning: The model concentrates on what something means, not just how it’s written.
  • Efficient Reasoning: This focus makes reasoning faster and easier.
  • Fewer Parameters: It requires fewer parameters compared to massive auto-regressive models.
  • Human-like Thinking: This process is closer to how humans think—in concepts, not a linear sequence of words.

The Dawn of World Models

To be clear, VL-Jepa is not the end of LLMs. Instead, it signals the beginning of a new direction. Yann LeCun calls this idea a “world model.”

A world model doesn’t just talk; it understands how the world works over time. In the future, LLMs won’t disappear. They will simply stop being the centerpiece of AI. They will become the language layer—the part that communicates with humans—while a deeper, more conceptual model like VL-Jepa handles the actual reasoning.

