Construction Engineering Research Laboratory · Engineer Research & Development Center · U.S. Army Corps of Engineers
†Equal contribution. ‡Work performed at CERL; now at Rerun.
How do we begin building general-purpose autonomous earthmoving using existing VLA architectures?
Vision-Language-Action (VLA) models have demonstrated strong generalization in structured manipulation and navigation domains. However, their application to large-scale field robotics tasks, such as autonomous earthmoving, remains limited. Autonomous earthmoving presents unique challenges, including deformable terrain, long-horizon tasks, limited visibility at the tool-terrain interface, and highly variable environmental conditions.
This work investigates terrain map-conditioned VLAs for autonomous earthmoving. We augment conventional camera observations and vehicle state features with top-down egocentric elevation maps that represent local terrain geometry and blade-induced soil displacement. Maps are produced by combining LiDAR imaging with a lightweight physics engine that tracks the blade's movement through the terrain and updates heights via swept-volume displacement and soil failure physics (producing soil piling and spoil creation). These maps are continuously updated during operation and provide spatial context unavailable to vision-only policies, such as estimated terrain height near the vehicle, where key occluded interactions occur. To extend our policy toward general goal-conditioned earthmoving tasks, we leverage cut/fill maps that encode desired terrain modifications relative to the current environment. To ease data collection and enable self-supervised goal conditioning, we apply a post-processing step in which the episode’s final terrain map is treated as the goal during reprocessing.
We explored the effectiveness of elevation maps as model input through an ablation study. Initial policies were trained using SmolVLA with training data collected on a robotic compact track loader under diverse soil and weather conditions. Preliminary results demonstrate that cut/fill map conditioning improves task performance relative to image-only baselines, and that map-only policies can achieve near parity with image-based models. These findings suggest that goal-prompted, terrain-aware policies may be a key enabler for general-purpose earthmoving.
Soil behavior shifts with moisture, compaction, and temperature, while contact forces are high and nonlinear. It can also be difficult to discern depth and task progress with a camera-only approach: the environment being worked may be visually homogeneous, and the context of the work is limited to the contents of a single egocentric camera frame. Additionally, the blade on earthmoving equipment outright occludes important information such as accumulated soil and cut depth, making it difficult for a policy to know whether any soil is being moved at all. Finally, earthmoving operations unfold over a large work area and a long time horizon, making it far easier for errors to compound and derail the task at hand.
Luckily, through map images we can track progress, estimate missing soil observations, and provide an interface that allows for the parametrization of earthmoving tasks. We feed the model an egocentric top-down elevation map alongside the camera image. The map provides a measure of blade soil accumulation, a heightmap representation of the environment, and dynamic terrain elevation updates throughout task runtime.
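One way to picture the egocentric top-down map is as a vehicle-centred, heading-aligned window sampled from a larger world-frame heightmap. The sketch below is only an illustration under stated assumptions, not the project's mapping stack: the function name `egocentric_crop`, the grid layout (rows along world y, columns along world x), the nearest-neighbour sampling, and the NaN fill for unmapped cells are all hypothetical choices.

```python
import numpy as np

def egocentric_crop(world_map, origin_xy, resolution, pose_xy, yaw,
                    size=256, fill=np.nan):
    """Sample a size x size vehicle-centred window from a world heightmap.

    Assumed layout (hypothetical): world_map[r, c] is the height at
    x = origin_x + c * resolution, y = origin_y + r * resolution.
    Output row 0 is the farthest cell ahead of the vehicle and
    output column 0 is the leftmost cell.
    """
    half = (size - 1) / 2.0
    rows, cols = np.mgrid[0:size, 0:size]
    forward = (half - rows) * resolution  # metres ahead of the vehicle
    left = (half - cols) * resolution     # metres to the vehicle's left

    # Rotate body-frame offsets (x forward, y left) into the world frame.
    wx = pose_xy[0] + forward * np.cos(yaw) - left * np.sin(yaw)
    wy = pose_xy[1] + forward * np.sin(yaw) + left * np.cos(yaw)

    # Nearest-neighbour lookup into the world grid.
    c = np.rint((wx - origin_xy[0]) / resolution).astype(int)
    r = np.rint((wy - origin_xy[1]) / resolution).astype(int)
    inside = ((r >= 0) & (r < world_map.shape[0]) &
              (c >= 0) & (c < world_map.shape[1]))

    out = np.full((size, size), fill, dtype=float)
    out[inside] = world_map[r[inside], c[inside]]
    return out
```

Because the window rotates with the vehicle, the blade always appears at the same place in the image, which is presumably what makes an egocentric map convenient as a policy input.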
The map is initialized from a scan of the work site by the vehicle and is updated continuously as the vehicle operates. The blade-occluded portion of the map is updated in real time with simplified soil interaction physics. This same physics model then erodes material to produce realistic piling and spoil on the map.
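The two-stage update described above (displace soil under the blade, then let it settle) can be sketched on a plain heightmap grid. This is a simplified stand-in and not the authors' physics engine: the function names, the even spreading of displaced soil ahead of the blade, the 35 degree angle of repose, and the relaxation rate are all assumptions chosen for illustration.

```python
import numpy as np

def apply_blade_cut(height, blade_mask, blade_height, cell_area):
    """Lower cells under the blade to the blade edge.

    Returns the updated map and the displaced soil volume.
    """
    new = height.astype(float).copy()
    excess = np.clip(new - blade_height, 0.0, None) * blade_mask
    new -= excess
    return new, float(excess.sum() * cell_area)

def deposit(height, deposit_mask, volume, cell_area):
    """Spread a displaced volume evenly over the masked cells (boolean mask)."""
    new = height.astype(float).copy()
    n = int(deposit_mask.sum())
    if n > 0:
        new[deposit_mask] += volume / (n * cell_area)
    return new

def relax(height, resolution, repose_deg=35.0, iters=200, rate=0.25, tol=1e-6):
    """Volume-conserving slope relaxation toward an assumed angle of repose.

    Wherever the height step between 4-neighbours exceeds the step allowed
    by the repose angle, a fraction of the excess moves from the higher
    cell to the lower one.
    """
    h = height.astype(float).copy()
    max_step = np.tan(np.radians(repose_deg)) * resolution
    for _ in range(iters):
        # Row direction: dz > 0 means the lower-index cell is lower.
        dz = h[1:, :] - h[:-1, :]
        flow_r = rate * np.sign(dz) * np.maximum(np.abs(dz) - max_step, 0.0)
        h[:-1, :] += flow_r
        h[1:, :] -= flow_r
        # Column direction, same rule.
        dz = h[:, 1:] - h[:, :-1]
        flow_c = rate * np.sign(dz) * np.maximum(np.abs(dz) - max_step, 0.0)
        h[:, :-1] += flow_c
        h[:, 1:] -= flow_c
        if max(np.abs(flow_r).max(), np.abs(flow_c).max()) < tol:
            break
    return h
```

Each step only moves soil between cells, so total volume is preserved; that property is what allows a map like this to serve as an estimate of how much material has accumulated in front of the blade.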
The policy is SmolVLA with the SigLIP image encoders frozen. Observations are nineteen vehicle states plus image inputs. The states are joint positions and velocities on the lift, pitch, and roll axes; track velocities; vehicle roll and pitch; body twist; and engine RPM. Actions are joint efforts, similar to open-loop velocity control. Inference runs at 10 Hz, matched to the machine's low-frequency base controls and slow vehicle dynamics.
We evaluate three image-input configurations:
| Inputs | M1 | M2 | M3 |
|---|---|---|---|
| Front image | • | • | |
| Map image | | • | • |
Pile-building is under-specified. We want a model that builds not just a pile but a specific structure, defined by location, shape, size, and cut. Goals are specified with a target elevation map that encodes desired additions and subtractions to soil height. The model takes the per-cell difference between the current map and the target as a second image input. These maps also enable better data collection, as we can show an operator cut/fill maps, align on a specific strategy, and record their actions.
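The per-cell difference described above can be sketched as follows. The sign convention (positive means fill, negative means cut), the red/blue image encoding, and the `max_abs` scaling range are assumptions made for this example; the text does not specify how the difference is rendered into the image input.

```python
import numpy as np

def cut_fill_map(current, target):
    """Signed per-cell difference in metres.

    Assumed convention: positive = fill (add soil), negative = cut.
    """
    return np.asarray(target, dtype=float) - np.asarray(current, dtype=float)

def cut_fill_to_image(diff, max_abs=1.0):
    """Encode a cut/fill map as a uint8 H x W x 3 image.

    Hypothetical encoding: the red channel carries cut depth, the blue
    channel carries fill height, and black means no change requested.
    Differences beyond +/- max_abs metres saturate.
    """
    scaled = np.clip(diff / max_abs, -1.0, 1.0)
    img = np.zeros(diff.shape + (3,), dtype=np.uint8)
    img[..., 0] = np.rint(255 * np.clip(-scaled, 0.0, 1.0)).astype(np.uint8)
    img[..., 2] = np.rint(255 * np.clip(scaled, 0.0, 1.0)).astype(np.uint8)
    return img
```

As the vehicle works, the current map approaches the target and the image fades toward black, so the same input also acts as a progress signal.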
Cut/fill maps also enable corrective data collection. Because the self-supervised policy trained on end-state maps naturally exhibits out-of-distribution behavior at runtime, we can use the cut/fill map for Dataset Aggregation (DAgger): a human operator is shown the current map state and corrects those failures live. These episodes are then fed back into the dataset for the next policy.
So, we train a second set of models with an additional image source: a cut/fill map.
Cut/fill maps can be generated at train time with a purely self-supervised approach: the end-state elevation map from an episode is retroactively used as that episode's goal. This approach is minimally intrusive, since standard operations produce the labels and operators are not overburdened.
Again, we evaluate three image-input configurations:
| Inputs | M4 | M5 | M6 |
|---|---|---|---|
| Front image | • | • | |
| Map image | | • | • |
| Cut/Fill image | • | • | • |
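The retroactive goal labeling used for these models can be sketched in a few lines. This is a minimal illustration, assuming each episode stores its elevation maps as an array of shape `(T, H, W)` and that the cut/fill input is the goal minus the current map; the function name `relabel_episode` is hypothetical.

```python
import numpy as np

def relabel_episode(elevation_maps):
    """Hindsight goal labeling for one recorded episode.

    elevation_maps: array-like of shape (T, H, W), one map per timestep.
    The final map is treated as the goal, and every timestep receives a
    cut/fill map equal to (goal - current map at that timestep).
    """
    maps = np.asarray(elevation_maps, dtype=float)
    goal = maps[-1]
    return goal[None, :, :] - maps
```

By construction the last timestep's cut/fill map is all zeros, so every recorded episode ends at its own goal without the operator labeling anything.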
Future work will move to the π0.5 policy family.
A single operator collected data across 8 days on three earthmoving tasks, under a wide range of weather (overcast, sun, rain, snow, wind) and soil conditions (frozen, saturated, hard, dry). The full corpus is 584 episodes, roughly 13.3 hours. The task-specific dataset ('build a pile') is 209 episodes.
| Observation | Dim |
|---|---|
| Front Image | [256x256x3] |
| Elevation Map | [256x256x3] |
| Cut/Fill Map* | [256x256x3] |
| Joint States | [5x2] |
| Odom | [3x2] |
| IMU | [2x1] |
| RPM | [1x1] |
The five joint states each contribute a position and a velocity:
| Joint State | Position | Velocity |
|---|---|---|
| tracks rotational | • | • |
| tracks linear | • | • |
| virtual lift arm | • | • |
| blade pitch | • | • |
| blade roll | • | • |
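The tables above account for the nineteen state values: 5 joints × 2, plus 3 × 2 odometry terms, 2 IMU terms, and 1 RPM value, gives 10 + 6 + 2 + 1 = 19. The sketch below assembles such a vector. The ordering of the parts, the identifier names, and the reading of the odometry block as linear and angular body twist are assumptions for illustration only.

```python
import numpy as np

# Joint names follow the joint state table; the identifiers are hypothetical.
JOINTS = ("tracks_rotational", "tracks_linear", "virtual_lift_arm",
          "blade_pitch", "blade_roll")

def build_state(joint_pos, joint_vel, twist_linear, twist_angular,
                roll_pitch, rpm):
    """Concatenate vehicle signals into one 19-dimensional state vector."""
    parts = [
        np.asarray(joint_pos, dtype=np.float32),      # 5 joint positions
        np.asarray(joint_vel, dtype=np.float32),      # 5 joint velocities
        np.asarray(twist_linear, dtype=np.float32),   # 3 linear twist terms
        np.asarray(twist_angular, dtype=np.float32),  # 3 angular twist terms
        np.asarray(roll_pitch, dtype=np.float32),     # 2 IMU terms
        np.atleast_1d(np.float32(rpm)),               # 1 engine RPM
    ]
    expected = (len(JOINTS), len(JOINTS), 3, 3, 2, 1)
    for part, n in zip(parts, expected):
        if part.shape != (n,):
            raise ValueError(f"expected {n} values, got shape {part.shape}")
    return np.concatenate(parts)
```

Checking each part's length before concatenation catches a dropped or duplicated signal early, instead of letting it silently shift every later index in the vector.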
The platform is the REO-RCTL: the Robotics for Engineer Operations Robotic Compact Track Loader. It is a CAT 299 D3 XE compact track loader fitted with read/write machine controls, ruggedized compute, RTK GNSS-INS, LiDAR, forward and rearward stereo cameras, and string pots on the blade pitch, yaw, roll, and the lift arm. The mapping stack and the policy both run on the vehicle.
The training and deployment stack is ROS 2 for the runtime, LeRobot for model infrastructure, and Rosetta, a ROS 2 / LeRobot contract layer that streamlines data recording, training, and inference against a single schema. The same contract defines what is recorded, what is trained on, and what is commanded at deployment.
Define contract → Record data → Convert datasets → Train policy → Deploy on robot → DAgger ↺
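The idea of one schema shared by recording, training, and deployment can be illustrated with a small data structure. This is a hypothetical sketch and does not reflect Rosetta's actual format or API: the class names, feature names, and the `problems` method are invented for this example. Only the observation side is shown, because the text does not state the action dimension; the shapes come from the observation table and the 10 Hz rate from the stated inference rate.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Feature:
    name: str
    shape: tuple

@dataclass(frozen=True)
class ObservationContract:
    """Hypothetical single source of truth for observation names and shapes."""
    rate_hz: float
    features: tuple

    def problems(self, shapes):
        """List mismatches between a sample's shapes and the contract."""
        issues = []
        for feature in self.features:
            if feature.name not in shapes:
                issues.append(f"missing {feature.name}")
            elif tuple(shapes[feature.name]) != feature.shape:
                issues.append(
                    f"{feature.name}: expected {feature.shape}, "
                    f"got {tuple(shapes[feature.name])}"
                )
        return issues

# Feature names are illustrative; shapes follow the observation table.
CONTRACT = ObservationContract(
    rate_hz=10.0,
    features=(
        Feature("front_image", (256, 256, 3)),
        Feature("elevation_map", (256, 256, 3)),
        Feature("cut_fill_map", (256, 256, 3)),
        Feature("state", (19,)),
    ),
)
```

Running the same check in the recorder, the dataset converter, and the inference node would be one way to ensure that a change to the schema surfaces as an explicit error at every stage, not as a silent mismatch between training and deployment.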
These are preliminary findings from a small evaluation suite; they indicate a direction and should not be read as a settled result.
BibTeX
@misc{cerl2026vla,
title = {Terrain Map-Conditioned VLAs for Autonomous Earthmoving:
Preliminary Report},
author = {Blankenau, Brian and Saucedo, Arturo and Wagner, W. Jacob and
Blankenau, Isaac and Nottage, Dustin and Soylemezoglu, Ahmet},
year = {2026},
note = {Construction Engineering Research Laboratory,
U.S. Army Corps of Engineers},
}