Construction Engineering Research Laboratory · Engineer Research & Development Center · U.S. Army Corps of Engineers
†Equal contribution. ‡Work performed at CERL; now at Rerun.
How do we adapt VLAs, pretrained on bimanual robotic manipulation tasks, for earthmoving operations on heavy equipment?
Vision–language–action models (VLAs) have demonstrated remarkable generalization in structured environments. Their efficacy in complex field environments, however, remains largely unexplored. Field robotics tasks like earthmoving present unique challenges: mobile manipulation of deformable terrain, widely varying environmental conditions, and high contact forces. This study investigates how the success of VLAs can be extended to autonomous earthmoving by utilizing terrain map-conditioned policies to achieve robust performance across diverse soil types.
VLAs have succeeded in settings where the variation is semantic. Objects and instructions change, the physics does not. Earthmoving inverts this. Soil behavior shifts with moisture, compaction, and weather. Contact forces are high and nonlinear. The blade occludes the material it accumulates. Depth, the cue a camera-only policy leans on, is unreliable over loose, uniformly textured soil. Tasks unfold over a large work area and a long horizon, so no single camera frame carries the context.
The policy needs a representation of terrain state that survives those gaps. We feed it an egocentric top-down elevation map alongside the camera image. The map provides three things a camera and joint state alone cannot: a measure of blade soil accumulation, a heightmap representation of the environment, and dynamic updates throughout task runtime.
The map is initialized from a vehicle-borne scan of the work site and is updated continuously as the vehicle operates. Blade motion through the terrain updates heights by swept-volume displacement. Soil-failure physics erodes material to produce realistic piling and spoil. Near-vehicle height estimation closes the information gap under the blade, which is critical for policy learning.
Where the blade or a recent interaction occludes the camera, the map still carries state. The policy reasons about accumulated soil and distant terrain in the same representation.
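The two update steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the onboard implementation: the angle of repose, cell size, relaxation factor, and iteration count are all assumed for the example.

```python
import numpy as np

def update_heightmap(height, blade_mask, blade_bottom, repose_tan, cell, iters=20):
    """Sketch of the map update: swept-volume displacement, then
    angle-of-repose relaxation to produce piling and spoil. Units: meters."""
    # Swept-volume displacement: soil above the blade's cutting edge is
    # removed from the terrain wherever the blade sweeps.
    cut = np.maximum(height - blade_bottom, 0.0) * blade_mask
    height = height - cut
    swept_volume = cut.sum() * cell * cell  # proxy for blade soil accumulation

    # Soil-failure relaxation: whenever the slope between neighboring cells
    # exceeds the angle of repose, move soil from the higher to the lower cell.
    max_diff = repose_tan * cell  # max stable height difference per cell
    for _ in range(iters):
        for axis in (0, 1):
            h = np.swapaxes(height, 0, axis)  # view: writes go through to height
            d = h[1:] - h[:-1]                # positive where the next cell is higher
            t = np.sign(d) * np.clip(np.abs(d) - max_diff, 0.0, None) * 0.25
            h[1:] -= t                        # take from the higher cell
            h[:-1] += t                       # give to the lower cell
    return height, swept_volume
```

The relaxation conserves soil mass by construction (every transfer is paired), which is what makes the displaced volume a usable accumulation signal.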
Pile-building is under-specified. We want a model that builds not just a pile but a specific structure, defined by location, shape, size, and cut. Goals are specified with a target elevation map that encodes desired additions and subtractions to soil height. The model takes the per-cell difference between the current map and the target as a second image input.
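The goal input described above reduces to a per-cell subtraction. A minimal sketch, where the sign convention (positive = fill, negative = cut) and the normalization scale are assumptions for illustration:

```python
import numpy as np

def goal_difference_map(current, target, scale=0.5):
    """Per-cell signed difference between the target and current elevation
    maps, normalized to [-1, 1] for use as the policy's second image input.
    Positive cells request fill; negative cells request cut."""
    diff = target - current
    return np.clip(diff / scale, -1.0, 1.0).astype(np.float32)
```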
Difference maps also enable corrective data collection. Standard operations rarely generate corrective labels, and a self-supervised policy trained on end-state maps naturally exhibits out-of-distribution behavior at runtime. Prescribing a target difference map makes it possible to collect supplementary supervised and DAgger data that expand the distribution and correct those failures.
We collect data under three regimes, each addressing a different gap:

- Self-supervised: episodes recorded during normal operation and relabeled in hindsight, with the end-state elevation map serving as the goal.
- Supervised: episodes collected against a prescribed target difference map, so the goal is fixed before the operator acts.
- DAgger: corrective demonstrations gathered at the policy's runtime failure points, expanding coverage of out-of-distribution states.
We evaluate three dataset compositions combining these regimes:
| Dataset | D4 | D5 | D6 |
|---|---|---|---|
| Self-supervised | • | • | • |
| Supervised | • | ||
| DAgger | • |
A single operator collected about 14 hours of data across 8 days, on three earthmoving tasks, under a wide range of weather and soil conditions: overcast, sun, rain, snow, wind, frozen, saturated, hard, dry. The full corpus is 584 episodes and roughly 13.3 hours. The task-specific dataset is 209 episodes.
The policy is SmolVLA with the SigLIP image encoders frozen. Observations are nineteen vehicle states plus the image inputs. The states are joint positions and velocities on the lift, pitch, and roll axes; track velocities; vehicle roll and pitch; body twist; and engine RPM. Actions are joint efforts, similar to open-loop velocity control. Inference runs at 10 Hz, matched to the machine's low-frequency base controls and slow vehicle dynamics.
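One policy step can be pictured as packing the two image inputs and the state vector into an observation dict. The key names below loosely follow LeRobot conventions but are assumptions, not the project's actual schema:

```python
import numpy as np

def build_observation(front_img, map_img, state_vec):
    """Illustrative packing of one 10 Hz policy step: the forward camera
    frame, the rendered map image, and the nineteen vehicle states.
    Key names are assumed for the example."""
    assert state_vec.shape == (19,)
    return {
        "observation.images.front": front_img,
        "observation.images.map": map_img,
        "observation.state": state_vec.astype(np.float32),
    }
```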
We evaluate three image-input configurations:
| Inputs | M1 | M2 | M3 |
|---|---|---|---|
| Front image | • | • | |
| Map image | | • | • |
Future work will move to the π0.5 policy family.
The platform is REO-RCTL: the Robotics for Engineer Operations Robotic Compact Track Loader. It is a CAT 299 D3 XE compact track loader fitted with read/write machine controls, ruggedized compute, RTK GNSS-INS, LiDAR, forward and stereo cameras, and string pots on the lift and tilt linkage. The mapping stack and the policy both run on the vehicle.
The training and deployment stack is ROS 2 for the runtime, LeRobot for model infrastructure, and Rosetta, a ROS 2 / LeRobot contract layer that streamlines data recording, training, and inference against a single schema. The same contract defines what is recorded, what is trained on, and what is commanded at deployment.
Define contract → Record data → Convert datasets → Train policy → Deploy on robot → DAgger ↺
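The single-schema idea behind the pipeline above can be sketched as a small dataclass shared by the recorder, the trainer, and the deployed policy. Every name, topic, and shape below is a hypothetical stand-in; the report does not publish Rosetta's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class TopicSpec:
    topic: str    # ROS 2 topic bound to this feature (illustrative names)
    shape: tuple  # array shape recorded, trained on, and commanded

@dataclass
class Contract:
    """Hypothetical record/train/deploy contract in the spirit of Rosetta:
    one definition of what is recorded, trained on, and commanded."""
    observations: dict = field(default_factory=lambda: {
        "observation.images.front": TopicSpec("/camera/front/image_raw", (224, 224, 3)),
        "observation.images.map": TopicSpec("/mapping/egocentric_map", (224, 224, 1)),
        "observation.state": TopicSpec("/vehicle/state", (19,)),
    })
    actions: dict = field(default_factory=lambda: {
        # Action dimension is an assumption (lift/pitch/roll efforts + tracks).
        "action": TopicSpec("/vehicle/effort_cmd", (5,)),
    })
```

Because the recorder, dataset converter, and inference node all read the same `Contract`, a feature cannot silently exist at training time but not at deployment.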
These are preliminary findings from a small evaluation suite; the results are suggestive rather than conclusive.
BibTeX
@misc{cerl2026vla,
title = {Terrain Map-Conditioned VLAs for Autonomous Earthmoving:
Preliminary Report},
author = {Blankenau, Brian and Saucedo, Arturo and Wagner, W. Jacob and
Blankenau, Isaac and Nottage, Dustin and Soylemezoglu, Ahmet},
year = {2026},
note = {Construction Engineering Research Laboratory,
U.S. Army Corps of Engineers},
}