Doc.PP-VLA-01 Project Page/Rev A 2026-04-24·Distribution A
U.S. Army Corps of Engineers Engineer Research & Development Center Construction Engineering Research Laboratory
Preliminary report

Terrain Map-Conditioned VLAs for Autonomous Earthmoving

Brian Blankenau, Arturo Saucedo, W. Jacob Wagner, Isaac Blankenau, Dustin Nottage, Ahmet Soylemezoglu

Construction Engineering Research Laboratory · Engineer Research & Development Center · U.S. Army Corps of Engineers

Equal contribution.  Work performed at CERL; now at Rerun.

Fig. 1. REO-RCTL recording. Observations, state, and action channels on a shared clock.
Fig. 2. Autonomous Earthmoving: An evaluation sample.
Fig. 3. Autonomous Earthmoving: Approximately one hour of evaluations.
§ 01

Abstract

How do we begin building general-purpose autonomous earthmoving using existing VLA architectures?

Vision-Language-Action (VLA) models have demonstrated strong generalization in structured manipulation and navigation domains. However, their application to large-scale field robotics tasks, such as autonomous earthmoving, remains limited. Autonomous earthmoving presents unique challenges, including deformable terrain, long-horizon tasks, limited visibility at the tool-terrain interface, and highly variable environmental conditions.

This work investigates terrain map-conditioned VLAs for autonomous earthmoving. We augment conventional camera observations and vehicle state features with top-down egocentric elevation maps that represent local terrain geometry and blade-induced soil displacement. Maps are produced by combining LiDAR imaging with a lightweight physics engine that tracks the blade's movement through the terrain and updates heights via swept-volume displacement and soil failure physics (producing soil piling and spoil creation). These maps are continuously updated during operation and provide spatial context unavailable to vision-only policies, such as estimated terrain height near the vehicle, where key occluded interactions occur. To extend our policy toward general goal-conditioned earthmoving tasks, we leverage cut/fill maps that encode desired terrain modifications relative to the current environment. To allow for easy data collection and enable self-supervised goal conditioning, we use a post-processing trick: the episode’s final terrain map is treated as the goal during reprocessing.

We explored the effectiveness of elevation maps as model input through an ablation study. Initial policies were trained using SmolVLA with training data collected on a robotic compact track loader under diverse soil and weather conditions. Preliminary results demonstrate that cut/fill map conditioning improves task performance relative to image-only baselines, and that map-only policies can achieve near parity with image-based models. These findings suggest that goal-prompted, terrain-aware policies may be a key enabler for general-purpose earthmoving.

§ 02

Why earthmoving is hard

Soil behavior shifts with moisture, compaction, and temperature, while contact forces are high and nonlinear. Depth and task progress are also hard to discern with a camera-only approach: the work area may be visually homogeneous, and the context of the work is limited to the contents of a single egocentric camera frame. Additionally, the blade on earthmoving equipment occludes important information such as accumulated soil and cut depth, making it difficult for a policy to know whether any soil is being moved at all. Finally, earthmoving operations unfold over a large work area and a long time horizon, which makes it far easier for errors to compound and derail the task.

§ 03

Elevation maps as conditioning

Luckily, through map images we can track progress, estimate missing soil observations, and provide an interface that allows for the parametrization of earthmoving tasks. We feed the model an egocentric top-down elevation map alongside the camera image. The map provides a measure of blade soil accumulation, a heightmap representation of the environment, and dynamic terrain elevation updates throughout task runtime.

The map is initialized from a scan of the work site by the vehicle and is updated continuously as the vehicle operates. The blade-occluded portion of the map is updated in real time with simplified soil interaction physics. This same physics model then erodes material to produce realistic piling and spoil on the map.
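
As a rough illustration of the kind of update described above, the sketch below lowers the cells under the blade to the cutting-edge height, deposits the removed material in front of the blade, and then relaxes steep slopes so the deposit spreads into a pile. This is a minimal toy model, not the physics engine used on the vehicle: the function names, the uniform deposit, the 4-neighbour relaxation rule, and the `max_step_m` slope limit (a stand-in for an angle of repose) are all assumptions made for illustration.

```python
import numpy as np

def cut_and_deposit(height, blade_cells, front_cells, blade_edge_z):
    """Cut cells under the blade down to the edge height and move the
    removed material onto the cells in front of the blade (toy model)."""
    h = height.astype(float)              # astype returns a copy
    br, bc = blade_cells
    before = h[br, bc]
    h[br, bc] = np.minimum(before, blade_edge_z)
    removed = float(np.sum(before - h[br, bc]))   # height units per unit cell
    fr, fc = front_cells
    h[fr, fc] += removed / len(fr)        # spread evenly (assumption)
    return h, removed

def relax_slopes(height, max_step_m, iterations=50):
    """Move material from higher to lower neighbours wherever the height
    step between adjacent cells exceeds max_step_m. Volume is conserved."""
    h = height.copy()
    for _ in range(iterations):
        for axis in (0, 1):
            diff = np.diff(h, axis=axis)                  # h[i+1] - h[i]
            excess = np.sign(diff) * np.maximum(np.abs(diff) - max_step_m, 0.0)
            flow = 0.25 * excess                          # damped transfer
            if axis == 0:
                h[1:, :] -= flow
                h[:-1, :] += flow
            else:
                h[:, 1:] -= flow
                h[:, :-1] += flow
    return h

def steepest_step(h):
    """Largest height difference between 4-connected neighbours."""
    return max(np.abs(np.diff(h, axis=0)).max(), np.abs(np.diff(h, axis=1)).max())

# Flat 1 m terrain; a four-cell-wide blade cuts to 0.8 m in row 8.
terrain = np.ones((16, 16))
cols = np.arange(6, 10)
blade = (np.full(4, 8), cols)
front = (np.full(4, 7), cols)
cut, removed = cut_and_deposit(terrain, blade, front, blade_edge_z=0.8)
relaxed = relax_slopes(cut, max_step_m=0.05)
```

Because every transfer subtracts from one cell exactly what it adds to its neighbour, the total volume on the map is unchanged by the relaxation step, which is the property a soil-displacement model needs regardless of how the failure rule itself is chosen.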

§ 04

Model

The policy is SmolVLA with the SigLIP image encoders frozen. Observations are nineteen vehicle states plus image inputs. The states are joint positions and velocities on the lift, pitch, and roll axes; track velocities; vehicle roll and pitch; body twist; and engine RPM. Actions are joint efforts, similar to open-loop velocity control. Inference runs at 10 Hz, matched to the machine's low-frequency base controls and slow vehicle dynamics.

We evaluate three image-input configurations:

Inputs M1 M2 M3
Front image
Map image
Tab. 1. Model input configurations.
§ 05

Cut/Fill maps as goals

Pile-building is under-specified. We want a model that builds not just a pile but a specific structure, defined by location, shape, size, and cut. Goals are specified with a target elevation map that encodes desired additions and subtractions to soil height. The model takes the per-cell difference between the current map and the target as a second image input. These maps also enable better data collection, as we can show an operator cut/fill maps, align on a specific strategy, and record their actions.
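
The per-cell difference described above can be sketched as follows. The sign convention (positive means fill, negative means cut), the clipping range `max_abs_m`, and the red/blue channel encoding are assumptions chosen for illustration; the source only states that the cut/fill map is a three-channel image derived from the current and target elevation maps.

```python
import numpy as np

def cut_fill_map(current_m, target_m):
    """Per-cell height change needed to reach the target.
    Assumed convention: positive = fill (add soil), negative = cut."""
    return np.asarray(target_m, dtype=np.float32) - np.asarray(current_m, dtype=np.float32)

def cut_fill_to_rgb(cut_fill_m, max_abs_m=1.0):
    """Encode a cut/fill map as a uint8 image (hypothetical encoding):
    red channel = cut depth, blue channel = fill height, green unused."""
    scaled = np.clip(cut_fill_m / max_abs_m, -1.0, 1.0)
    rgb = np.zeros(scaled.shape + (3,), dtype=np.uint8)
    rgb[..., 0] = np.round(255 * np.maximum(-scaled, 0.0)).astype(np.uint8)
    rgb[..., 2] = np.round(255 * np.maximum(scaled, 0.0)).astype(np.uint8)
    return rgb

# Flat site; the goal is a 1 m pile fed by a 0.5 m deep borrow area.
current = np.zeros((256, 256), dtype=np.float32)
target = current.copy()
target[100:120, 100:120] = 1.0     # pile
target[150:170, 100:120] = -0.5    # borrow area
goal_image = cut_fill_to_rgb(cut_fill_map(current, target))
```

Cells that already match the target encode as black, so the image goes dark as the task nears completion, which gives the policy a direct visual measure of remaining work.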

Cut/fill maps also enable corrective data collection. Because the self-supervised policy trained on end-state maps naturally exhibits out-of-distribution behavior at runtime, we can use the cut/fill map for Dataset Aggregation (DAgger): a human operator is shown the current map state and corrects those failures live. These episodes are then fed back into the dataset for the next policy.

§ 06

Model II

So, we train a second set of models with an additional image source: a cut/fill map.

Cut/fill maps can be generated at train time with a purely self-supervised approach. The end-state elevation map from an episode is retroactively used as that episode's goal. This approach is minimally intrusive: standard operations produce the labels, and operators are not over-burdened.
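
A minimal sketch of this retroactive labeling, assuming each episode is stored as a stack of elevation maps of shape `(T, H, W)` in meters and that the goal signal is the target minus the current map. The function name and array layout are hypothetical and do not describe the actual dataset tooling.

```python
import numpy as np

def relabel_with_final_map(elevation_maps_m):
    """Treat the last elevation map of an episode as the goal and return
    one cut/fill map per frame (goal minus current), shape (T, H, W)."""
    maps = np.asarray(elevation_maps_m, dtype=np.float32)
    if maps.ndim != 3:
        raise ValueError("expected an array of shape (T, H, W)")
    goal = maps[-1]
    return goal[None, :, :] - maps

# Toy episode: a pile grows linearly from 0 m to 1 m over five frames.
frames = np.zeros((5, 8, 8), dtype=np.float32)
for t in range(5):
    frames[t, 3:5, 3:5] = 0.25 * t
goals = relabel_with_final_map(frames)
```

By construction the last frame of every episode has an all-zero cut/fill map, so each recorded episode counts as a successful demonstration of its own goal without any extra annotation from the operator.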

Again, we evaluate three image-input configurations:

Inputs M4 M5 M6
Front image
Map image
Cut/Fill image
Tab. 2. Cut/Fill Model input configurations.

Future work will move to the π0.5 policy family.

§ 07

Dataset

A single operator collected the data over 8 days, on three earthmoving tasks, under a wide range of weather and soil conditions: overcast, sun, rain, snow, wind, frozen, saturated, hard, dry. The full corpus is 584 episodes and roughly 13.3 hours. The task-specific dataset ('build a pile') is 209 episodes.
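
For a sense of scale, the corpus figures above imply an average episode of roughly 82 seconds. The short calculation below also estimates frame counts under the assumption that data is sampled at the 10 Hz inference rate; the recording rate is not stated here, so the frame counts are illustrative only.

```python
TOTAL_HOURS = 13.3        # "roughly 13.3 hours" (approximate)
EPISODES = 584
ASSUMED_RATE_HZ = 10.0    # assumption: sampled at the inference rate

total_seconds = TOTAL_HOURS * 3600.0
mean_episode_s = total_seconds / EPISODES
frames_per_episode = mean_episode_s * ASSUMED_RATE_HZ
total_frames = total_seconds * ASSUMED_RATE_HZ

print(f"mean episode length: {mean_episode_s:.1f} s")
print(f"frames per episode at {ASSUMED_RATE_HZ:.0f} Hz: {frames_per_episode:.0f}")
print(f"total frames at {ASSUMED_RATE_HZ:.0f} Hz: {total_frames:.0f}")
```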

Observation Dim
Front Image [256x256x3]
Elevation Map [256x256x3]
Cut/Fill Map* [256x256x3]
Joint States [5x2]
Odom [3x2]
IMU [2x1]
RPM [1x1]
Tab. 3. Recorded Observations. *Cut/Fill map can be generated during data post-processing.
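
The non-image rows of Tab. 3 account for the nineteen vehicle states mentioned in the Model section: 5x2 + 3x2 + 2x1 + 1x1 = 19. The sketch below checks that sum and flattens one sample into a single state vector. The key names and the concatenation order are hypothetical; only the shapes come from the table.

```python
import numpy as np

# Shapes taken from Tab. 3; names and ordering are illustrative assumptions.
STATE_LAYOUT = (
    ("joint_states", (5, 2)),   # position and velocity per joint
    ("odom", (3, 2)),
    ("imu", (2, 1)),
    ("rpm", (1, 1)),
)

def state_dim(layout=STATE_LAYOUT):
    """Total number of scalar state features."""
    return sum(rows * cols for _, (rows, cols) in layout)

def flatten_state(sample, layout=STATE_LAYOUT):
    """Concatenate the per-source arrays into one flat float32 vector."""
    parts = []
    for name, shape in layout:
        arr = np.asarray(sample[name], dtype=np.float32)
        if arr.shape != shape:
            raise ValueError(f"{name}: expected shape {shape}, got {arr.shape}")
        parts.append(arr.ravel())
    return np.concatenate(parts)

sample = {name: np.zeros(shape) for name, shape in STATE_LAYOUT}
vector = flatten_state(sample)
```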

The joint states in more detail:
Joint State Position Velocity
tracks rotational
tracks linear
virtual lift arm
blade pitch
blade roll
Tab. 4. Joint States in more detail.
§ 08

Platform

The platform is the REO-RCTL: the Robotics for Engineer Operations Robotic Compact Track Loader. It is a CAT 299 D3 XE compact track loader fitted with read/write machine controls, ruggedized compute, RTK GNSS-INS, LiDAR, forward and rearward stereo cameras, and string pots on the blade pitch, yaw, roll, and the lift arm. The mapping stack and the policy both run on the vehicle.

§ 09

System infrastructure

The training and deployment stack is ROS 2 for the runtime, LeRobot for model infrastructure, and Rosetta, a ROS 2 / LeRobot contract layer that streamlines data recording, training, and inference against a single schema. The same contract defines what is recorded, what is trained on, and what is commanded at deployment.

Define contract → Record data → Convert datasets → Train policy → Deploy on robot → DAgger

§ 10

Early conclusions

These are preliminary findings from a small evaluation suite. The direction is suggestive.

§ 11

Future work

Data · Policy architecture and training · Long-horizon planning · Model features · Platform
§ 12

References

§ 13

Cite

BibTeX

@misc{cerl2026vla,
  title  = {Terrain Map-Conditioned VLAs for Autonomous Earthmoving:
            Preliminary Report},
  author = {Blankenau, Brian and Saucedo, Arturo and Wagner, W. Jacob and
            Blankenau, Isaac and Nottage, Dustin and Soylemezoglu, Ahmet},
  year   = {2026},
  note   = {Construction Engineering Research Laboratory,
            U.S. Army Corps of Engineers},
}