This Week In AI Research (30 November – 6 December 2025) 🗓️
The top 10 AI research papers that you must know about this week.
1. DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
This research introduces DeepSeek-V3.2, DeepSeek's highly efficient, state-of-the-art model for reasoning and agentic performance.
The secrets behind its performance include:
DeepSeek Sparse Attention (DSA): An attention mechanism that reduces computational complexity while preserving model performance in long-context scenarios (see the sketch after this list).
A scalable reinforcement-learning framework that brings DeepSeek-V3.2 to performance comparable to GPT-5, along with a high-compute variant, DeepSeek-V3.2-Speciale, that beats GPT-5 and performs on par with Gemini-3.0-Pro, achieving gold-medal performance at both the 2025 International Mathematical Olympiad (IMO) and the International Olympiad in Informatics (IOI).
A large-scale agentic task synthesis pipeline to integrate reasoning into tool-use scenarios.
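The paper's exact DSA design isn't reproduced here, but the core idea behind sparse attention — each query attending only to a small, high-scoring subset of past tokens rather than to all of them — can be sketched roughly as follows. The shapes, the single-head setup, and the top-k selection rule are illustrative assumptions, not DeepSeek's implementation:

```python
# Rough sketch of top-k sparse attention (illustrative; not DeepSeek's exact DSA).
# Each query keeps only its k highest-scoring past tokens and attends to those.
# Note: this toy version still materializes the full score matrix; a real
# implementation would use a lightweight indexer to avoid the O(L^2) cost.
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, top_k=64):
    # q, k, v: (seq_len, d) for a single head of a single sequence
    seq_len, d = q.shape
    scores = (q @ k.T) / d**0.5
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))
    # Keep only the top-k past tokens per query; mask out everything else
    kept_vals, kept_idx = scores.topk(min(top_k, seq_len), dim=-1)
    sparse = torch.full_like(scores, float("-inf")).scatter_(-1, kept_idx, kept_vals)
    return F.softmax(sparse, dim=-1) @ v

q = k = v = torch.randn(1024, 64)
print(topk_sparse_attention(q, k, v).shape)   # torch.Size([1024, 64])
```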
Read more about this research paper using this link.
Before we start, I want to introduce you to my book, ‘LLMs In 100 Images’.
It is a collection of 100 easy-to-follow visuals that explain the most important concepts you need to master to understand LLMs today.
2. Measuring Agents in Production
This research presents the first large-scale empirical study of AI agents deployed in real-world production environments.
It is based on a survey of 306 practitioners and interviews with 20 teams across 26 different domains.
The results show that most production agents are built using simple, controllable approaches and an agentic framework:
70% rely on prompting off-the-shelf LLMs with no fine-tuning
68% run for at most 10 steps before requiring human intervention
25% of teams construct custom evaluation benchmarks from scratch, while the remaining 75% evaluate their agents without formal benchmarks, relying instead on online tests such as A/B testing or direct expert/user feedback.
The main challenge in deploying agents in production is reliability: ensuring and evaluating agent correctness.
Read more about this research using this link.
3. SIMA 2: A Generalist Embodied Agent for Virtual Worlds
This research presents SIMA 2, a generalist embodied agent built on top of Gemini that understands and acts across a wide variety of 3D virtual worlds. It can:
Follow complex language and image instructions
Reason toward high-level goals
Converse with users
Generalise to previously unseen environments
Achieve open-ended self-improvement in a new environment
SIMA 2 significantly narrows the gap with human performance across various games.
Read more about this research using this link.
4. Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks
This research introduces SUSVIBES, a benchmark of 200 feature-request software engineering tasks from real-world open-source projects that historically led to vulnerable implementations.
Disturbingly, although frontier LLMs and popular software engineering agents solve more than 50% of these tasks and pass the functional tests, they perform very poorly on security, failing over 80% of the security tests.
On top of that, preliminary efforts to reduce security risks cause a significant drop in functional performance.
Read more about this research using this link.
5. In‑Context Representation Hijacking
This research presents Doublespeak, an in-context representation hijacking attack against LLMs.
The attacker gives the model a few in-context examples where a harmless word is consistently used as if it meant something dangerous.
For example, a harmful word (e.g., bomb) is replaced with a benign one (e.g., carrot) in multiple examples, leading the model’s internal representation of the benign word to acquire the harmful meaning.
After this setup, the attacker asks a benign-looking question, such as “How do you build a carrot?”, which appears entirely harmless to humans and to the LLM’s safety filter. Yet the model now interprets “carrot” as “bomb” and gives a harmful reply, bypassing the LLM’s safety alignment.
This attack requires no optimization, generalizes across model families, and achieves high success rates on both open and closed LLMs.
Read more about this research using this link.
6. GR-RL: Going Dexterous and Precise for Long-Horizon Robotic Manipulation
This research from ByteDance presents GR-RL, a robotic learning framework that turns a generalist vision-language-action (VLA) policy into a specialist for long-horizon dexterous manipulation.
Existing VLA policies learn from human demonstrations, which are often noisy and suboptimal for highly dexterous, precise manipulation tasks.
GR-RL addresses this with a multi-stage training pipeline (sketched in code below) that:
filters demonstrations using a learned task-progress signal
enhances generalization through morphological symmetry augmentation, and
refines precision via online reinforcement learning
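Here is a hypothetical, heavily simplified sketch of what such a pipeline could look like; the function names, the progress threshold, and the mirroring stand-in below are placeholders, not the paper's actual components:

```python
# Hypothetical sketch of a GR-RL-style multi-stage pipeline (all names are placeholders).
from dataclasses import dataclass

@dataclass
class Demo:
    observations: list   # e.g. camera frames and proprioception
    actions: list        # recorded human actions
    progress: float      # learned task-progress score in [0, 1]

def filter_demonstrations(demos, min_progress=0.8):
    """Stage 1: keep only demos whose learned progress signal says the task
    was (nearly) completed, discarding noisy or failed attempts."""
    return [d for d in demos if d.progress >= min_progress]

def augment_with_symmetry(demos):
    """Stage 2: exploit morphological symmetry (e.g. mirror left/right arm
    trajectories) to enlarge the dataset; list reversal is only a toy stand-in."""
    mirrored = [Demo(d.observations[::-1], d.actions[::-1], d.progress) for d in demos]
    return demos + mirrored

def refine_with_online_rl(policy, curated_demos, env_steps=100_000):
    """Stage 3: fine-tune the pretrained VLA policy with online RL for
    millimeter-level precision (the actual RL loop is omitted here)."""
    return policy

demos = [Demo([0, 1, 2], ["reach", "grasp", "pull"], 0.95),
         Demo([0], ["reach"], 0.30)]
curated = augment_with_symmetry(filter_demonstrations(demos))
policy = refine_with_online_rl("pretrained-VLA-policy", curated)
print(len(curated), policy)   # 2 pretrained-VLA-policy
```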
It is the first learning-based policy to autonomously lace up a shoe through multiple eyelets, achieving an 83.3% success rate. This is a task that requires long-horizon reasoning, millimeter-level precision, and compliant soft-body interaction.
Read more about this research using this link.
7. The Universal Weight Subspace Hypothesis
This research presents the Universal Weight Subspace Hypothesis, which states that neural networks of different architectures, even when trained on various tasks from different domains, converge to a shared low-dimensional subspace in weight space.
The authors find that, irrespective of task or model type, most weight variation lies along a small set of common directions.
This suggests that different models reuse a common structural foundation, with task-specific behavior arising from small deviations.
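As a toy illustration of how one might probe for such a subspace, you can flatten comparable weights from many models, stack them, and check how much variance a few principal directions explain. The synthetic data and the plain PCA below are simplifying assumptions, not the paper's actual analysis:

```python
# Toy probe for a shared low-dimensional weight subspace (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

# Pretend we have the same-shaped layer from 20 models trained on different tasks.
# Here we synthesize them as "shared directions + small task-specific deviations".
shared_basis = rng.normal(size=(5, 4096))                 # 5 common directions
weights = np.stack([
    rng.normal(size=5) @ shared_basis + 0.01 * rng.normal(size=4096)
    for _ in range(20)
])                                                        # (20 models, 4096 flattened weights)

# PCA via SVD on the stacked, centered weight vectors
centered = weights - weights.mean(axis=0)
_, s, _ = np.linalg.svd(centered, full_matrices=False)
explained = (s**2) / (s**2).sum()

# If the hypothesis holds, a handful of directions explain almost all variation.
print(np.round(explained[:6], 3))
```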
This has important implications for model compression, merging, model reusability, and multi-task learning.
It also opens up the possibility of discovering these universal subspaces without the need for extensive data and computational resources, potentially reducing the carbon footprint of large-scale models.
Read more about this research using this link.
8. Network of Theseus (Like The Ship)
There’s a standard assumption that the neural network architecture used during training must be the same as the one used at inference.
Researchers in this paper challenge this assumption and introduce the Network of Theseus (NoT).
This method incrementally transforms a “guide” network (whether trained or untrained) into a “target” network by replacing its components step by step, while maintaining the network’s performance.
At each step, the components in the “guide” network are replaced with the “target” network’s modules and aligned using representational similarity metrics.
Some examples include converting a CNN into an MLP, or GPT-2 into an RNN.
This method decouples neural network optimization during training from that during deployment, leading to better trade-offs between accuracy and efficiency.
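Here is a minimal sketch of the swap-and-align step, under the simplifying assumption that alignment just means matching the replaced module's activations with an MSE loss; the paper uses representational similarity metrics, which this only approximates:

```python
# Minimal sketch: replace one module of a "guide" network with a different
# "target"-style module and align it to the guide's activations before splicing it in.
import torch
import torch.nn as nn

guide = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))       # pretend this is trained
new_module = nn.Sequential(nn.Linear(32, 48), nn.ReLU(), nn.Linear(48, 64))  # target-architecture piece

opt = torch.optim.Adam(new_module.parameters(), lr=1e-3)
x = torch.randn(256, 32)

for _ in range(200):
    with torch.no_grad():
        reference = guide[0](x)          # activations the original module produces
    loss = nn.functional.mse_loss(new_module(x), reference)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Theseus step: splice the aligned module into the network in place of the original.
hybrid = nn.Sequential(new_module, guide[1], guide[2])
print(round(loss.item(), 4), hybrid(x).shape)   # small alignment loss, output shape (256, 10)
```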
Read more about this research using this link.
9. Are Your Agents Upward Deceivers?
LLM-based agents are increasingly used as autonomous subordinates to carry out user tasks.
Drawing an analogy to real-world organisations, where individuals lie to superiors to create a good image or avoid punishment, this research introduces the fascinating phenomenon of Agentic Upward Deception.
This is where an agent, facing constraints in its environment, hides its failures, performs actions that were not requested, and skips reporting them.
This phenomenon is tested on a benchmark of 200 tasks covering five task types and eight realistic scenarios in a constrained environment.
Evaluations show that 11 popular LLM-based agents exhibit action-based deceptive behaviors, such as:
guessing results
performing unsupported simulations
substituting unavailable information sources, and
fabricating local files
Moreover, these behaviors cannot be reliably prevented through prompt engineering alone.
Read more about this research using this link.
10. MemVerse: Multimodal Memory for Lifelong Learning Agents
This research introduces MemVerse, a fast, model-agnostic, plug-and-play multimodal long-term memory framework.
MemVerse stores past experiences across different modalities (such as text, images, and interactions) in a structured long-term memory, while keeping recent information in a short-term buffer.
Important experiences are organized, less useful ones can be forgotten, and key knowledge is sometimes compressed back into the model so it can be recalled quickly.
This helps AI agents remember past events, reason over longer time spans, and perform better in continuing, real-world tasks rather than constantly starting over.
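A bare-bones sketch of that short-term-buffer / long-term-store pattern is below; the class name, the importance scores, and the keyword-based recall are illustrative assumptions, not MemVerse's actual interface:

```python
# Bare-bones sketch of a short-term buffer feeding a long-term memory store
# (illustrative only; MemVerse's real design is multimodal and far richer).
from collections import deque

class AgentMemory:
    def __init__(self, buffer_size=5, importance_threshold=0.5):
        self.short_term = deque(maxlen=buffer_size)   # recent experiences
        self.long_term = []                           # consolidated experiences
        self.importance_threshold = importance_threshold

    def add(self, experience, importance):
        """Store a new experience; consolidate when the buffer is full."""
        if len(self.short_term) == self.short_term.maxlen:
            self.consolidate()
        self.short_term.append((experience, importance))

    def consolidate(self):
        """Move important experiences to long-term memory, forget the rest."""
        for exp, score in self.short_term:
            if score >= self.importance_threshold:
                self.long_term.append(exp)
        self.short_term.clear()

    def recall(self, keyword):
        """Naive retrieval stand-in for similarity search over long-term memory."""
        return [e for e in self.long_term if keyword in e]

mem = AgentMemory(buffer_size=3)
for text, score in [("saw a red door", 0.9), ("idle chatter", 0.1),
                    ("user prefers dark mode", 0.8), ("opened settings", 0.6)]:
    mem.add(text, score)
print(mem.recall("dark"))   # ['user prefers dark mode'] after consolidation
```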
Read more about this research using this link.
This article is free to read. If you loved reading it, restack it and share it with others.💚
If you want to get even more value from this publication, become a paid subscriber and unlock all posts.
You can also check out my books on Gumroad and connect with me on LinkedIn to stay in touch.