“This work points to a shift from programming robots to teaching robots,” said Sizhe Lester Li, lead researcher and a Ph.D. student at MIT CSAIL. “Today, many robotics tasks require extensive engineering and coding. In the future, we envision showing a robot what to do, and letting it learn how to achieve the goal autonomously.”
MIT tries to make robots more flexible, affordable
The scientists said their motivation stems from a simple reframing: The main barrier to affordable, flexible robotics isn’t hardware – it’s control of capability, which could be achieved in multiple ways. Traditional robots are built to be rigid and sensor-rich, making it easier to construct a digital twin, a precise mathematical replica used for control.
But when a robot is soft, deformable, or irregularly shaped, those assumptions fall apart. Rather than forcing robots to conform to predefined models, NJF flips the script by giving them the ability to learn their own internal model from observation.
This decoupling of modeling and hardware design could significantly expand the design space for robotics. In soft and bio-inspired robots, designers often embed sensors or reinforce parts of the structure just to make modeling feasible.
NJF lifts that constraint, said the MIT CSAIL team. The system doesn’t need onboard sensors or design tweaks to make control possible. Designers are freer to explore unconventional, unconstrained morphologies without worrying about whether they’ll be able to model or control them later, the researchers asserted.
“Think about how you learn to control your fingers: You wiggle, you observe, you adapt,” said Li. “That’s what our system does. It experiments with random actions and figures out which controls move which parts of the robot.”
The system has proven robust across a range of robot types. The team tested NJF on a pneumatic soft robotic hand capable of pinching and grasping, a rigid Allegro hand, a 3D-printed robotic arm, and even a rotating platform with no embedded sensors. In every case, the system learned both the robot’s shape and how it responded to control signals, just from vision and random motion.
NJF has potential real-world applications
The MIT CSAIL researchers said their approach has potential far beyond the lab. Robots equipped with NJF could one day perform agricultural tasks with centimeter-level localization accuracy, operate on construction sites without elaborate sensor arrays, or navigate dynamic environments where traditional methods break down.
At the core of NJF is a neural network that captures two intertwined aspects of a robot’s embodiment: its three-dimensional geometry and its sensitivity to control inputs. The system builds on neural radiance fields (NeRF), a technique that reconstructs 3D scenes from images by mapping spatial coordinates to color and density values. NJF extends this approach by learning not only the robot’s shape, but also a Jacobian field, a function that predicts how any point on the robot’s body moves in response to motor commands.
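For readers who think in code, the sketch below illustrates the two quantities such a model ties to every 3D point: NeRF-style density and color for reconstructing the robot’s shape, and a per-point Jacobian describing how that point moves per unit change in each motor command. This is a minimal illustration in PyTorch, not the authors’ released implementation; the network size, the number of motors, and names such as `JacobianFieldNet` are assumptions made for the example.

```python
# Minimal sketch (illustrative, not the NJF codebase) of a field that outputs
# NeRF-style appearance plus a per-point Jacobian over motor commands.
import torch
import torch.nn as nn

NUM_MOTORS = 16  # hypothetical number of control channels

class JacobianFieldNet(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)                 # NeRF-style density
        self.color_head = nn.Linear(hidden, 3)                   # NeRF-style RGB
        self.jacobian_head = nn.Linear(hidden, 3 * NUM_MOTORS)   # d(point)/d(command)

    def forward(self, xyz):
        feat = self.backbone(xyz)
        density = self.density_head(feat)
        color = torch.sigmoid(self.color_head(feat))
        # Jacobian J: (batch, 3, NUM_MOTORS), the predicted 3D motion of this
        # point per unit change in each motor command, so delta_x ≈ J @ delta_u.
        jacobian = self.jacobian_head(feat).view(-1, 3, NUM_MOTORS)
        return density, color, jacobian
```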
To train the model, the robot performs random motions while multiple cameras record the outcomes. No human supervision or prior knowledge of the robot’s structure is required — the system simply infers the relationship between control signals and motion by watching.
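One way to picture that self-supervised fitting is the training step below, which reuses the `JacobianFieldNet` sketch above. It assumes, for simplicity, that surface-point displacements have already been extracted from the multi-camera video (for example, by point tracking); the actual system supervises the model directly from images, so this is a deliberate simplification.

```python
# Simplified training step (an assumption, not the paper's exact pipeline):
# issue a random command change, observe how tracked points move, and fit the
# Jacobian field so its predictions explain the observed motion.
import torch

model = JacobianFieldNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(points_xyz, delta_u, observed_displacement):
    """points_xyz: (N, 3) tracked points; delta_u: (NUM_MOTORS,) random command
    change; observed_displacement: (N, 3) how those points actually moved."""
    _, _, jacobian = model(points_xyz)                      # (N, 3, NUM_MOTORS)
    predicted = torch.einsum('nij,j->ni', jacobian, delta_u)
    loss = torch.nn.functional.mse_loss(predicted, observed_displacement)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```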
Once training is complete, the robot needs only a single monocular camera for real-time closed-loop control, running at about 12 Hz. This allows it to continuously observe itself, plan, and act responsively. That speed makes NJF more viable than many physics-based simulators for soft robots, which are often too computationally intensive for real-time use.
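The closed-loop step that such a learned Jacobian enables can be sketched as a small least-squares problem: find the motor command that best moves the observed points toward their targets, apply it, observe again, and repeat at roughly 12 Hz. The snippet below is an illustrative controller under those assumptions, not the authors’ method; the function name and the step-scaling factor are hypothetical.

```python
# Illustrative closed-loop step: solve for the motor command whose predicted
# point motion (via the learned Jacobians) best matches the desired motion.
import numpy as np

def control_step(jacobians, current_points, target_points, step_scale=0.5):
    """jacobians: (N, 3, M) learned d(point)/d(command) for N tracked points and
    M motors; current_points, target_points: (N, 3) estimated from the camera."""
    desired = (target_points - current_points).reshape(-1)   # (3N,) desired motion
    J = jacobians.reshape(-1, jacobians.shape[-1])            # (3N, M) stacked Jacobian
    # Least-squares motor command that best produces the desired point motion.
    delta_u, *_ = np.linalg.lstsq(J, desired, rcond=None)
    return step_scale * delta_u  # damped step, applied before re-observing
```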
In early simulations, even simple 2D fingers and sliders were able to learn this mapping using just a few examples, noted the scientists. By modeling how specific points deform or shift in response to action, NJF builds a dense map of controllability. That internal model allows it to generalize motion across the robot’s body, even when the data is noisy or incomplete.
“What’s really interesting is that the system figures out on its own which motors control which parts of the robot,” said Li. “This isn’t programmed—it emerges naturally through learning, much like a person discovering the buttons on a new device.”
The future of robotics is soft, says CSAIL
For decades, robotics has favored rigid, easily modeled machines – like the industrial arms found in factories – because their properties simplify control. But the field has been moving toward soft, bio-inspired robots that can adapt to the real world more fluidly. The tradeoff? These robots are harder to model, according to MIT CSAIL.
“Robotics today often feels out of reach because of costly sensors and complex programming,” said Vincent Sitzmann, senior author and MIT assistant professor. “Our goal with Neural Jacobian Fields is to lower the barrier, making robotics affordable, adaptable, and accessible to more people.”
“Vision is a resilient, reliable sensor,” added Sitzmann, who leads the Scene Representation group. “It opens the door to robots that can operate in messy, unstructured environments, from farms to construction sites, without expensive infrastructure.”
“Vision alone can provide the cues needed for localization and control—eliminating the need for GPS, external tracking systems, or complex onboard sensors,” noted co-author Daniela Rus, the Erna Viterbi Professor of Electrical Engineering and director of MIT CSAIL.
“This opens the door to robust, adaptive behavior in unstructured environments, from drones navigating indoors or underground without maps, to mobile manipulators working in cluttered homes or warehouses, and even legged robots traversing uneven terrain,” she said. “By learning from visual feedback, these systems develop internal models of their own motion and dynamics, enabling flexible, self-supervised operation where traditional localization methods would fail.”
While training NJF currently requires multiple cameras and must be redone for each robot, the researchers envision a more accessible version. In the future, hobbyists could record a robot’s random movements with their phone, much like you’d take a video of a rental car before driving off, and use that footage to create a control model, with no prior knowledge or special equipment required.
MIT team works on system’s limitations
The NJF system doesn’t yet generalize across different robots, and it lacks force or tactile sensing, limiting its effectiveness on contact-rich tasks. But the team is exploring new ways to address these limitations, including improving generalization, handling occlusions, and extending the model’s ability to reason over longer spatial and temporal horizons.
“Just as humans develop an intuitive understanding of how their bodies move and respond to commands, NJF gives robots that kind of embodied self-awareness through vision alone,” Li said. “This understanding is a foundation for flexible manipulation and control in real-world environments. Our work, essentially, reflects a broader trend in robotics: moving away from manually programming detailed models toward teaching robots through observation and interaction.”
This paper brought together the computer vision and self-supervised learning work from principal investigator Sitzmann’s lab and the expertise in soft robots from Rus’ lab. Li, Sitzmann, and Rus co-authored the paper with CSAIL Ph.D. students Annan Zhang SM ’22 and Boyuan Chen, undergraduate researcher Hanna Matusik, and postdoc Chao Liu.
The research was supported by the Solomon Buchsbaum Research Fund through MIT’s Research Support Committee, an MIT Presidential Fellowship, the National Science Foundation, and the Gwangju Institute of Science and Technology. Their findings were published in Nature this month.