Terrain Map-Conditioned VLAs for Autonomous Earthmoving

Fig. 1. REO-RCTL recording. Observations, state, and action channels on a shared clock.

§ 01

Abstract

How do we adapt VLAs, pretrained on bimanual robotic manipulation tasks, for earthmoving operations on heavy equipment?

Vision–language–action models (VLAs) have demonstrated remarkable generalization in structured environments. Their efficacy in complex field environments, however, remains largely unexplored. Field robotics tasks like earthmoving present unique challenges: mobile manipulation of deformable terrain, widely varying environmental conditions, and high contact forces. This study investigates whether terrain map-conditioned policies can extend the success of VLAs to autonomous earthmoving and achieve robust performance across diverse soil types.

§ 02

Why earthmoving is hard

VLAs have succeeded in settings where the variation is semantic: objects and instructions change, but the physics does not. Earthmoving inverts this. Soil behavior shifts with moisture, compaction, and weather. Contact forces are high and nonlinear. The blade occludes the material it accumulates. Depth, the cue a camera-only policy leans on, is unreliable over loose, uniformly textured soil. Tasks unfold over a large work area and a long horizon, so no single camera frame carries the full context.

§ 03

Elevation maps as conditioning

The policy needs a representation of terrain state that survives those gaps. We feed it an egocentric top-down elevation map alongside the camera image. The map provides three things a camera and joint state alone cannot: a measure of blade soil accumulation, a heightmap representation of the environment, and dynamic updates throughout task runtime.

The map is initialized from a scan of the work site by the vehicle and is updated continuously as the vehicle operates. Blade motion through the terrain updates heights by swept-volume displacement. Soil failure physics then erodes material to produce realistic piling and spoil. Near-vehicle height estimation closes the information gap under the blade, which is critical for policy learning.
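
The swept-volume step can be illustrated with a minimal sketch. It assumes a regular grid of cell heights in metres, a flat blade edge, and a uniform cell area; the function names and the even-deposit rule are hypothetical and are not taken from the mapping stack. Soil failure and erosion are deliberately left out.

```python
def apply_blade_pass(heights, blade_cells, blade_height, cell_area=1.0):
    """Lower each swept cell to the blade edge height and return the cut volume.

    heights: 2D list of cell heights in metres, modified in place.
    blade_cells: iterable of (row, col) cells the blade edge passed over.
    """
    cut_volume = 0.0
    for r, c in blade_cells:
        excess = heights[r][c] - blade_height
        if excess > 0.0:  # only soil standing above the blade edge is cut
            heights[r][c] = blade_height
            cut_volume += excess * cell_area
    return cut_volume


def deposit(heights, cells, volume, cell_area=1.0):
    """Spread a soil volume evenly over the given cells (assumed deposit rule)."""
    cells = list(cells)
    if not cells:
        return
    lift = volume / (cell_area * len(cells))
    for r, c in cells:
        heights[r][c] += lift


# Example: cut two cells down to 0.3 m, then drop the carried soil on a third.
grid = [[0.5, 0.5, 0.2],
        [0.5, 0.5, 0.2]]
carried = apply_blade_pass(grid, [(0, 0), (0, 1)], blade_height=0.3, cell_area=0.25)
deposit(grid, [(0, 2)], carried, cell_area=0.25)
```

The two steps conserve total soil volume, and the carried volume is the kind of quantity the map can expose as blade accumulation.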

Where the blade or a recent interaction occludes the camera, the map still carries state. The policy reasons about accumulated soil and distant terrain in the same representation.

§ 04

Difference maps as goals

Pile-building is under-specified. We want a model that builds not just a pile but a specific structure, defined by location, shape, size, and cut. Goals are specified with a target elevation map that encodes desired additions and subtractions to soil height. The model takes the per-cell difference between the current map and the target as a second image input.
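
As a rough illustration of the difference-map input, the sketch below subtracts the current map from the target cell by cell and encodes the signed result as an 8-bit image. The sign convention (positive means soil should be added), the clipping range `max_abs`, and the function names are assumptions made for this example, not details given above.

```python
def difference_map(current, target):
    """Per-cell signed difference: positive = add soil, negative = cut soil."""
    return [
        [t - c for c, t in zip(cur_row, tgt_row)]
        for cur_row, tgt_row in zip(current, target)
    ]


def to_uint8_image(diff, max_abs=1.0):
    """Map differences in [-max_abs, max_abs] metres to pixel values 0..255.

    Values outside the range are clipped; a zero difference lands at 128.
    """
    image = []
    for row in diff:
        pixels = []
        for d in row:
            clipped = max(-max_abs, min(max_abs, d))
            scaled = (clipped + max_abs) / (2.0 * max_abs) * 255.0
            pixels.append(int(scaled + 0.5))  # round half up
        image.append(pixels)
    return image


current = [[0.0, 0.5]]
target = [[0.5, 0.0]]
diff = difference_map(current, target)   # [[0.5, -0.5]]
goal_image = to_uint8_image(diff)        # [[191, 64]]
```

Because the difference shrinks toward zero as the structure is completed, the same input also signals task progress.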

Difference maps also enable corrective data collection. Standard operations rarely generate corrective labels, and a self-supervised policy trained on end-state maps naturally exhibits out-of-distribution behavior at runtime. Prescribing a target difference map makes it possible to collect supplementary supervised and DAgger data that expand the distribution and correct those failures.

§ 05

Data regimes

We collect data under three regimes, each addressing a different gap:

Self-supervised (SS). The end-state elevation map from an episode is retroactively used as that episode's goal. This regime is minimally intrusive: standard operations produce the labels, and operators are not over-burdened. It is also naturally biased toward distributions the policy already knows, which yields out-of-distribution behavior at runtime.
Supervised. A human operator teleoperates toward a prescribed goal map. Expands the distribution toward structures the self-supervised policy would never generate on its own.
DAgger. A self-supervised policy drives while a human operator intervenes to correct. Targets the failure modes the policy exhibits at runtime.
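
The self-supervised regime amounts to hindsight goal labeling. The sketch below shows the idea under simple assumptions: an episode stores one elevation map per timestep, and the `Episode` type and its field names are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional

Grid = list[list[float]]


@dataclass(frozen=True)
class Episode:
    maps: tuple                      # elevation map at each timestep
    goal: Optional[Grid] = None      # target elevation map, if one was prescribed
    regime: str = "self_supervised"  # or "supervised", "dagger"


def relabel_with_final_map(episode: Episode) -> Episode:
    """Use the map the operator actually produced as the episode's goal."""
    return replace(episode, goal=episode.maps[-1])


recorded = Episode(maps=([[0.0, 0.0]], [[0.1, 0.0]], [[0.2, 0.1]]))
labeled = relabel_with_final_map(recorded)
```

Supervised and DAgger episodes would instead carry the prescribed goal map from the start, so no relabeling is needed for them.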

We evaluate three dataset compositions combining these regimes:

Dataset	D4	D5	D6
Self-supervised	•	•	•
Supervised		•
DAgger			•

Tab. 1. Dataset compositions.

A single operator collected about 14 hours of data across 8 days, on three earthmoving tasks, under a wide range of weather and soil conditions: overcast, sun, rain, snow, wind, frozen, saturated, hard, dry. The full corpus is 584 episodes and roughly 13.3 hours. The task-specific dataset is 209 episodes.
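
For a sense of scale, the corpus figures can be turned into a mean episode length and an approximate step count. The 10 Hz rate used here is an assumption borrowed from the inference rate; the recording rate is not stated.

```python
def corpus_stats(episodes: int, hours: float, rate_hz: float) -> dict:
    """Mean episode duration and total step count for a corpus."""
    seconds = hours * 3600.0
    return {
        "mean_episode_s": seconds / episodes,
        "total_steps": seconds * rate_hz,  # valid only if data is logged at rate_hz
    }


stats = corpus_stats(episodes=584, hours=13.3, rate_hz=10.0)
```

With these inputs the mean episode lasts roughly 82 seconds, and the corpus holds on the order of 480,000 steps at the assumed rate.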

§ 06

Model

The policy is SmolVLA with the SigLIP image encoders frozen. Observations are nineteen vehicle states plus image inputs. The states are joint positions and velocities on the lift, pitch, and roll axes; track velocities; vehicle roll and pitch; body twist; and engine RPM. Actions are joint efforts, similar to open-loop velocity control. Inference runs at 10 Hz, which matches the machine's low-frequency base controls and is adequate for its slow vehicle dynamics.

We evaluate three image-input configurations:

Inputs	M1	M2	M3
Front image	•		•
Map image		•	•

Tab. 2. Model input configurations.
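
The three configurations in Tab. 2 can be expressed as a small lookup that selects which image inputs reach the policy. The key names (`front_image`, `map_image`, `state`) are hypothetical placeholders, not the names used in the actual dataset schema.

```python
IMAGE_INPUTS = {
    "M1": ("front_image",),
    "M2": ("map_image",),
    "M3": ("front_image", "map_image"),
}


def select_inputs(observation: dict, config: str) -> dict:
    """Keep the vehicle state plus the image inputs enabled for this config."""
    keys = IMAGE_INPUTS[config]
    missing = [k for k in keys if k not in observation]
    if missing:
        raise KeyError(f"observation is missing image inputs: {missing}")
    selected = {k: observation[k] for k in keys}
    selected["state"] = observation["state"]  # vehicle state is always included
    return selected


obs = {"state": [0.0] * 19, "front_image": "front.png", "map_image": "map.png"}
m2_inputs = select_inputs(obs, "M2")
```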

Future work will move to the π_0.5 policy family.

§ 07

Platform

The platform is REO-RCTL: the Robotics for Engineer Operations Robotic Compact Track Loader. It is a CAT 299 D3 XE compact track loader fitted with read/write machine controls, ruggedized compute, RTK GNSS-INS, LiDAR, forward and stereo cameras, and string pots on the lift and tilt linkage. The mapping stack and the policy both run on the vehicle.

§ 08

System infrastructure

The training and deployment stack consists of ROS 2 for the runtime, LeRobot for model infrastructure, and Rosetta, a ROS 2 / LeRobot contract layer that streamlines data recording, training, and inference against a single schema. The same contract defines what is recorded, what is trained on, and what is commanded at deployment.

Define contract → Record data → Convert datasets → Train policy → Deploy on robot → DAgger ↺
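
The single-schema idea can be sketched as one object that every pipeline stage checks against. This is an illustration only: the `Contract` class, its fields, and the key names are hypothetical and do not reflect Rosetta's actual schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contract:
    rate_hz: float
    observations: tuple  # keys recorded and fed to the policy
    actions: tuple       # keys recorded and commanded at deployment

    def validate_frame(self, frame: dict) -> None:
        """Raise if a frame's keys differ from what the contract declares."""
        expected = set(self.observations) | set(self.actions)
        got = set(frame)
        if got != expected:
            raise ValueError(
                f"missing={sorted(expected - got)} unexpected={sorted(got - expected)}"
            )


contract = Contract(
    rate_hz=10.0,
    observations=("state", "front_image", "map_image"),
    actions=("joint_efforts",),
)
# The same check can run at recording, dataset conversion, and deployment.
contract.validate_frame(
    {"state": [], "front_image": None, "map_image": None, "joint_efforts": []}
)
```

Validating every stage against one definition is what keeps recorded data, training inputs, and deployed commands from drifting apart.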

§ 09

Early conclusions

Map conditioning helps. Elevation-map-conditioned policies outperform image-only policies.
DAgger is data-efficient. Substituting about 25% of self-supervised data with DAgger examples reduces failures. This suggests a path to training on large bodies of operator data with a small budget of expert interventions.
Dataset size beats training steps. Training-step count has less impact on performance than dataset size in our regime.
Sensitivities. Policies show noticeable sensitivity to engine RPM and blade yaw at inference.
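
The DAgger substitution can be sketched as a fixed-size swap. The example assumes the 25% is counted in episodes and that episodes are sampled uniformly; neither detail is stated above, and the function name is hypothetical.

```python
import random


def substitute(self_supervised, dagger, fraction=0.25, seed=0):
    """Replace a fraction of self-supervised episodes with DAgger episodes.

    The total episode count stays the same, so any change in performance
    is not explained by dataset size.
    """
    n_swap = round(len(self_supervised) * fraction)
    if n_swap > len(dagger):
        raise ValueError("not enough DAgger episodes for the requested fraction")
    rng = random.Random(seed)
    kept = rng.sample(self_supervised, len(self_supervised) - n_swap)
    added = rng.sample(dagger, n_swap)
    return kept + added


ss_episodes = [f"ss_{i}" for i in range(8)]
dagger_episodes = [f"dagger_{i}" for i in range(4)]
mixed = substitute(ss_episodes, dagger_episodes, fraction=0.25)
```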

These are preliminary findings from a small evaluation suite. They indicate a direction rather than establish a result.

§ 10

Future work

Data

Collect additional in-cab operator, teleop, and DAgger data.
Increase the variety of goal earthmoving profiles in the dataset.
Add simulation data for pre-training (Algoryx, Vortex).

Policy architecture and training

Feature dropout during training.
Migrate to the π-series base models.
Expand model history (KV caching).
RL or advantage conditioning.
Stage-aware reward modeling.

Long-horizon planning

High-level diffusion stage planner for goal maps.

Model features

Incorporate soil-property knowledge from online soil-property estimation.
Move to low-level position control and position-command action space.
Fix track-speed normalization.

Platform

Move to more capable machines (D3 bulldozer).
Multi-task evaluations.

§ 11

References

Rosetta github.com/iblnkn/rosetta
SmolVLA huggingface.co/blog/smolvla
EMCUPY github.com/leggedrobotics/elevation_mapping_cupy
Wagner · ISTVS arxiv.org/pdf/2507.22356
LeRobot Folding huggingface.co/spaces/lerobot/robot-folding

§ 12

Cite

BibTeX

@misc{cerl2026vla,
  title  = {Terrain Map-Conditioned VLAs for Autonomous Earthmoving:
            Preliminary Report},
  author = {Blankenau, Brian and Saucedo, Arturo and Wagner, W. Jacob and
            Blankenau, Isaac and Nottage, Dustin and Soylemezoglu, Ahmet},
  year   = {2026},
  note   = {Construction Engineering Research Laboratory,
            U.S. Army Corps of Engineers},
}