86 changes: 85 additions & 1 deletion README.md
@@ -94,4 +94,88 @@ arkml.tools.train algo=<ml_algorithm> \
data.dataset_path=/path/to/dataset \
output_dir=/output/path

```

## Pi0.5

Pi0.5 is an upgraded version of the Pi0 Vision-Language-Action model with enhanced capabilities for robotic manipulation tasks. It features a multi-stage training approach with flow matching for precise action prediction.

### Training Stages

#### Pretraining Stage
The pretraining stage focuses on learning foundational representations using multiple modalities and FAST tokenization:

```bash
CUDA_VISIBLE_DEVICES=0 HYDRA_FULL_ERROR=1 \
arkml-train algo=pi05 \
data.dataset_path=/path/to/pi05/dataset \
output_dir=/output/path \
algo.model.policy_type=pi0.5 \
algo.training.stage=pretrain \
algo.training.pretrain_steps=280000
```

The pretraining stage optimizes:
- Cross-entropy loss for text tokens (CE(text))
- Cross-entropy loss for FAST tokens (CE(FAST tokens))
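
As a rough illustration, the combined pretraining objective could be computed as in the sketch below (function and tensor names are assumptions for illustration, not the actual arkml code):

```python
import torch.nn.functional as F

def pretrain_loss(text_logits, text_targets, fast_logits, fast_targets):
    """Illustrative CE(text) + CE(FAST tokens) objective.

    Assumed shapes: logits (batch, seq_len, vocab), targets (batch, seq_len)
    of integer token ids.
    """
    ce_text = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())
    ce_fast = F.cross_entropy(fast_logits.flatten(0, 1), fast_targets.flatten())
    return ce_text + ce_fast
```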

#### Post-training Stage
The post-training stage refines the model with flow matching and subtask prediction:

```bash
CUDA_VISIBLE_DEVICES=0 HYDRA_FULL_ERROR=1 \
arkml-train algo=pi05 \
data.dataset_path=/path/to/pi05/dataset \
output_dir=/output/path \
algo.model.policy_type=pi0.5 \
algo.training.stage=posttrain \
algo.training.posttrain_steps=80000 \
algo.training.flow_alpha=10.0
```

The post-training stage optimizes:
- Cross-entropy loss for subtasks (CE(subtask))
- Flow matching loss weighted by alpha (alpha * flow_matching_loss)
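
A minimal sketch of this objective, using the standard linear-path conditional flow matching loss (the exact arkml/LeRobot formulation may differ in detail, and `flow_net` is an assumed velocity network):

```python
import torch
import torch.nn.functional as F

def posttrain_loss(subtask_logits, subtask_targets, flow_net, actions,
                   flow_alpha=10.0):
    """Illustrative CE(subtask) + alpha * flow_matching_loss objective."""
    ce_subtask = F.cross_entropy(subtask_logits, subtask_targets)

    # Sample noise x0 and a time t, interpolate toward the target actions x1.
    x0 = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1, device=actions.device)
    x_t = (1 - t) * x0 + t * actions

    # The network predicts a velocity field; the linear-path target is x1 - x0.
    v_pred = flow_net(x_t, t)
    flow_loss = F.mse_loss(v_pred, actions - x0)

    return ce_subtask + flow_alpha * flow_loss
```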

### Running Inference

To run inference with a trained Pi0.5 model:

```bash
HYDRA_FULL_ERROR=1 arkml-policy algo=pi05 \
algo.model.model_path=path/to/pi05/model \
policy_node_name=pi05_node
```

You can then call the inference endpoints:
- `pi05_node/policy/predict` - Get next action prediction
- `pi05_node/policy/reset` - Reset policy state
- `pi05_node/policy/start` - Start policy service
- `pi05_node/policy/stop` - Stop policy service

### Configuration Explanation

The Pi0.5 configuration includes several key parameters:

**Model Configuration:**
- `model.backbone_type`: Vision-language backbone architecture (e.g., 'siglip_gemma')
- `model.use_fast_tokens`: Whether to use FAST tokenizer for action discretization
- `model.use_flow_matching`: Whether to use flow matching for action prediction

**Training Configuration:**
- `training.stage`: Current training stage ('pretrain' or 'posttrain')
- `training.pretrain_steps`: Number of pretraining steps (default: 280000)
- `training.posttrain_steps`: Number of post-training steps (default: 80000)
- `training.integration_steps`: Number of Euler integration steps used in flow matching
- `training.flow_alpha`: Weight for the flow matching loss (default: 10.0)

**Dataset Configuration:**
The dataset configuration uses mixture sampling with:
- Primary dataset for main training data
- Secondary datasets for auxiliary data
- Configurable weights for balancing different data sources
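
A hypothetical sketch of such a mixture configuration, in the DictConfig style used elsewhere in this repo (the key names here are illustrative, not the actual arkml schema):

```python
from omegaconf import DictConfig

# Hypothetical mixture-sampling dataset config: one primary source plus
# weighted auxiliary sources. Actual arkml key names may differ.
dataset_cfg = DictConfig({
    "primary": {"path": "/data/pi05/manipulation", "weight": 0.7},
    "secondary": [
        {"path": "/data/pi05/web_text", "weight": 0.2},
        {"path": "/data/pi05/subtask_labels", "weight": 0.1},
    ],
})
```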

The model uses a multi-head architecture with:
- Subtask head for high-level task planning
- FAST head for discretized action prediction
- Flow head for continuous action prediction using flow matching
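
A minimal sketch of this three-head layout (dimensions and layer choices are assumptions, not the actual arkml modules):

```python
import torch
import torch.nn as nn

class MultiHeadDecoder(nn.Module):
    """Illustrative three-head decoder: subtask, FAST, and flow heads."""

    def __init__(self, hidden_dim=2048, num_subtasks=64,
                 fast_vocab=1024, action_dim=8):
        super().__init__()
        self.subtask_head = nn.Linear(hidden_dim, num_subtasks)  # high-level plan
        self.fast_head = nn.Linear(hidden_dim, fast_vocab)       # discrete action tokens
        self.flow_head = nn.Sequential(                          # velocity field
            nn.Linear(hidden_dim + action_dim + 1, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, features, x_t, t):
        # features: (B, hidden_dim) pooled backbone output
        # x_t: (B, action_dim) noisy action at flow time t: (B, 1)
        flow_in = torch.cat([features, x_t, t], dim=-1)
        return (self.subtask_head(features),
                self.fast_head(features),
                self.flow_head(flow_in))
```
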
190 changes: 190 additions & 0 deletions arkml/algos/vla/pi05/README.md
@@ -0,0 +1,190 @@
# Pi0.5 Implementation

This directory contains the complete Pi0.5 implementation following the HuggingFace wrapper pattern for the Ark ML framework.

## Architecture Overview

Pi0.5 is an advanced Vision-Language-Action model that implements:
- **Multi-stage training**: Pretraining (CE(text) + CE(FAST tokens)) and Post-training (CE(subtask) + α × flow_matching_loss)
- **Flow matching**: For precise action prediction using vector field networks
- **Multiple prediction heads**: Subtask, FAST, and flow heads
- **Enhanced backbone**: Support for SigLIP-Gemma vision-language architecture

## Directory Structure

```
pi05/
├── models.py # Core Pi0.5 policy (HuggingFace wrapper)
├── algorithm.py # Training algorithm
├── trainer.py # Multi-stage trainer
├── evaluator.py # Evaluation metrics
├── dataset.py # Multi-modality dataset
├── config_utils.py # Configuration utilities
├── compute_stats.py # Statistics computation
├── utils.py # Utility functions
└── README.md # This file
```

## Usage Instructions

### 1. Loading a Pre-trained Model

```python
from arkml.algos.vla.pi05.models import Pi05Policy

# Load from Hugging Face Hub or local path
policy = Pi05Policy(
    policy_type='pi0.5',
    model_path='your-huggingface-username/pi05-model',  # or local path
    backbone_type='siglip_gemma',  # Vision-language backbone
    use_fast_tokens=True,          # Enable FAST tokenization
    use_flow_matching=True,        # Enable flow matching
    obs_dim=9,                     # Observation dimension
    action_dim=8,                  # Action dimension
    image_dim=(3, 480, 640),       # Image dimensions (C, H, W)
    pred_horizon=1                 # Prediction horizon
)

# Move to device
policy = policy.to_device('cuda')
```

### 2. Making Predictions

```python
import torch

# Prepare an observation dictionary (batch of one)
observation = {
    'image': torch.randn(1, 3, 480, 640),  # Image tensor, matching image_dim (C, H, W)
    'state': torch.randn(1, 9),            # State vector, matching obs_dim
    'task': 'pick up the red block'        # Task instruction (optional)
}

# Get action prediction
action = policy.predict(observation)
print(f"Predicted action: {action}")
```

### 3. Training a New Model

```python
from arkml.algos.vla.pi05.algorithm import Pi05Algorithm
from arkml.algos.vla.pi05.dataset import create_pi05_dataloader
from omegaconf import DictConfig

# Create your dataset and dataloader
train_dataloader = create_pi05_dataloader(
    dataset_path='path/to/your/dataset',
    batch_size=8,
    shuffle=True
)

# Load your policy
policy = Pi05Policy(
    policy_type='pi0.5',
    model_path='path/to/pretrained/model',  # Or use a base model
    # ... other parameters
)

# Configure training
config = DictConfig({
    'trainer': {
        'lr': 2e-4,
        'batch_size': 8,
        'max_epochs': 10,
        'weight_decay': 0.01,
        'num_workers': 4,
        'use_bf16': True
    },
    'training': {
        'stage': 'pretrain',       # 'pretrain' or 'posttrain'
        'flow_alpha': 10.0,        # Weight for flow matching loss
        'pretrain_steps': 280000,  # Steps for pretraining
        'posttrain_steps': 80000   # Steps for post-training
    }
})

# Create the algorithm and train on the dataset behind the dataloader above
algorithm = Pi05Algorithm(policy=policy, device='cuda', cfg=config)
results = algorithm.train(train_dataset=train_dataloader.dataset)
```

### 4. Configuration Options

Key configuration parameters:

- `backbone_type`: Vision-language backbone ('siglip_gemma', etc.)
- `use_fast_tokens`: Whether to use FAST tokenization for action discretization
- `use_flow_matching`: Whether to use flow matching for action prediction
- `training_stage`: 'pretrain' or 'posttrain' for multi-stage training
- `flow_alpha`: Weight for flow matching loss (default: 10.0)

## Training Stages

Pi0.5 supports multi-stage training:

### Pretraining Stage
```
CE(text) + CE(FAST tokens)
```
- Focuses on learning foundational representations
- Uses multiple modalities and FAST tokenization

### Post-training Stage
```
CE(subtask) + α × flow_matching_loss
```
- Refines the model with flow matching and subtask prediction
- Enables precise action prediction using flow matching
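
At inference time the flow head is integrated from noise to an action over `training.integration_steps` Euler steps. A minimal sketch, assuming a velocity network `flow_net(x_t, t)` as in the loss sketch above (in practice it would also be conditioned on backbone features):

```python
import torch

@torch.no_grad()
def sample_action(flow_net, action_dim, integration_steps=10, device="cuda"):
    """Euler integration of the learned velocity field from t=0 (noise)
    to t=1 (action). Illustrative only."""
    x = torch.randn(1, action_dim, device=device)
    dt = 1.0 / integration_steps
    for i in range(integration_steps):
        t = torch.full((1, 1), i * dt, device=device)
        x = x + dt * flow_net(x, t)  # x_{t+dt} = x_t + dt * v(x_t, t)
    return x
```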

## Evaluation Metrics

The evaluator provides comprehensive metrics:
- Action MSE and MAE
- Accuracy within threshold
- Subtask prediction accuracy
- Multi-modality evaluation
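
Simplified versions of the action metrics might look like the sketch below (the evaluator's exact definitions, e.g. per-dimension vs. per-sample aggregation, may differ):

```python
import torch

def action_metrics(pred, target, threshold=0.05):
    """Illustrative action MSE, MAE, and accuracy-within-threshold."""
    mse = torch.mean((pred - target) ** 2).item()
    mae = torch.mean(torch.abs(pred - target)).item()
    # Fraction of action dimensions within `threshold` of the ground truth.
    within = (torch.abs(pred - target) < threshold).float().mean().item()
    return {"mse": mse, "mae": mae, "accuracy@threshold": within}
```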

## Integration with LeRobot

This implementation uses the LeRobot Pi0.5 policy under the hood:
- Follows LeRobot's model architecture
- Compatible with LeRobot datasets and tools
- Supports LeRobot's training and evaluation pipelines

## Example Usage Script

For a complete walkthrough, see the accompanying example script, which demonstrates:
- Model loading
- Training setup
- Prediction workflow
- Evaluation process

## Requirements

- LeRobot >= 0.4.3
- Transformers
- PyTorch >= 1.12
- Compatible with ark_ml framework

## Testing

Run tests to verify functionality:
```bash
python -m pytest tests_and_benchmarks/pi05_tests/
```

## Benchmarks

Run performance benchmarks:
```bash
python tests_and_benchmarks/pi05_benchmarks/benchmark_pi05.py
```

## Notes

- This implementation follows the same pattern as PiZero for consistency
- Multi-stage training requires different dataset configurations for each stage
- Flow matching is particularly effective for precise manipulation tasks
- FAST tokenization enables efficient action discretization during pretraining
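
For intuition, the toy sketch below discretizes continuous actions by uniform binning. Note that FAST itself tokenizes in frequency space (a DCT followed by byte-pair encoding), so this illustrates only the general idea of mapping continuous actions to discrete tokens, not the actual FAST scheme:

```python
import torch

def discretize_actions(actions, num_bins=256, low=-1.0, high=1.0):
    """Toy uniform-binning tokenizer for continuous actions."""
    actions = actions.clamp(low, high)
    tokens = ((actions - low) / (high - low) * (num_bins - 1)).round().long()
    return tokens  # same shape as `actions`, values in [0, num_bins)

def undiscretize_actions(tokens, num_bins=256, low=-1.0, high=1.0):
    """Inverse map from token ids back to (approximate) continuous actions."""
    return tokens.float() / (num_bins - 1) * (high - low) + low
```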