August 23, 2025 • 15 min read
Computer Vision Applications with Reachy Mini
Computer vision transforms your Reachy Mini from a simple robot into an intelligent companion capable of seeing, understanding, and interacting with the world. In this comprehensive guide, we'll explore how to implement object detection, face recognition, gesture control, and advanced AI-powered visual interactions using Reachy Mini's integrated camera system.
What you'll master: By the end of this tutorial, you'll know how to implement real-time object detection, create face-following behaviors, build gesture recognition systems, and integrate cutting-edge AI vision models from Hugging Face.
🎯 Vision Capabilities You'll Build
Face tracking • Object detection • Gesture recognition • Emotion analysis • Scene understanding
Understanding Reachy Mini's Vision System
Reachy Mini's vision system is built around a high-quality integrated camera that provides real-time video streaming capabilities. Combined with the robot's expressive head movements, this creates opportunities for rich visual interactions that feel natural and engaging.
Camera Specifications and Capabilities
📹 Video Streaming
Real-time video capture with adjustable resolution and frame rate for optimal performance
🔄 Head Integration
Seamless coordination between camera input and 6-DOF head movements
⚡ Low Latency
Optimized processing pipeline for responsive real-time interactions
🧠 AI Ready
Direct integration with OpenCV, PyTorch, and Hugging Face vision models
Setting Up Computer Vision Environment
Before diving into computer vision applications, let's set up a comprehensive development environment with all the necessary libraries and tools.
# Install essential computer vision libraries
pip install opencv-python
pip install opencv-contrib-python
pip install numpy
pip install scipy
pip install matplotlib
pip install pillow
# Install deep learning frameworks
pip install torch torchvision
pip install transformers
pip install ultralytics # For YOLO object detection
# Install additional CV utilities
pip install mediapipe # For pose and hand detection
pip install face-recognition # Simplified face recognition (built on dlib)
pip install dlib # Required by face-recognition; building it needs CMake and a C++ compiler
# Install Reachy SDK if not already installed
pip install reachy-sdk
Performance Note: Computer vision applications can be CPU-intensive. For the best performance, consider running computationally heavy models on your host computer rather than directly on the Raspberry Pi version.
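If the machine running these scripts has a CUDA-capable GPU, you can point the YOLO examples later in this guide at it explicitly. Here is a minimal sketch, assuming the Ultralytics package installed above; the device argument is part of its predict call, and everything falls back to the CPU otherwise:
import torch
from ultralytics import YOLO

device = 0 if torch.cuda.is_available() else "cpu"  # GPU index 0, or CPU fallback
model = YOLO("yolov8n.pt")  # nano weights keep CPU-only machines responsive

# Pass the device explicitly whenever you run inference on a frame:
# results = model(frame, device=device, verbose=False)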
Basic Computer Vision Setup
Let's start with the fundamentals – accessing the camera, processing frames, and displaying results.
import cv2
import numpy as np
from reachy_sdk import ReachySDK
import time
import threading
class ReachyVision:
def __init__(self, host='reachy-mini.local'):
"""Initialize Reachy Vision system."""
self.reachy = ReachySDK(host=host)
self.camera = self.reachy.camera
self.running = False
self.current_frame = None
# Computer vision parameters
self.frame_width = 640
self.frame_height = 480
self.fps_target = 30
print("Reachy Vision system initialized!")
def start_camera_stream(self):
"""Start the camera stream in a separate thread."""
self.running = True
self.camera_thread = threading.Thread(target=self._camera_loop)
self.camera_thread.daemon = True
self.camera_thread.start()
print("Camera stream started")
def _camera_loop(self):
"""Internal camera processing loop."""
while self.running:
try:
# Capture frame
frame = self.camera.capture_frame()
if frame is not None:
# Resize for consistent processing
frame = cv2.resize(frame, (self.frame_width, self.frame_height))
self.current_frame = frame
# Control frame rate
time.sleep(1.0 / self.fps_target)
except Exception as e:
print(f"Camera error: {e}")
time.sleep(0.1)
def stop_camera_stream(self):
"""Stop the camera stream."""
self.running = False
if hasattr(self, 'camera_thread'):
self.camera_thread.join()
print("Camera stream stopped")
def get_current_frame(self):
"""Get the most recent camera frame."""
return self.current_frame.copy() if self.current_frame is not None else None
def display_frame(self, frame, window_name="Reachy Vision"):
"""Display a frame (useful for debugging)."""
if frame is not None:
cv2.imshow(window_name, frame)
return cv2.waitKey(1) & 0xFF
return -1
# Initialize the vision system
vision = ReachyVision()
vision.start_camera_stream()
# Basic camera test
print("Testing camera feed...")
for i in range(100): # Test for ~3 seconds
frame = vision.get_current_frame()
if frame is not None:
# Add timestamp overlay
timestamp = time.strftime("%H:%M:%S")
cv2.putText(frame, timestamp, (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
# Display frame
key = vision.display_frame(frame)
if key == ord('q'):
break
time.sleep(0.03)
vision.stop_camera_stream()
cv2.destroyAllWindows()
Object Detection and Recognition
Object detection enables your Reachy Mini to identify and respond to objects in its environment. We'll implement both traditional computer vision approaches and modern AI-based detection.
Traditional Computer Vision Object Detection
class ObjectDetector:
def __init__(self, vision_system):
"""Initialize object detection with traditional CV methods."""
self.vision = vision_system
# Initialize background subtractor for movement detection
self.bg_subtractor = cv2.createBackgroundSubtractorMOG2(
detectShadows=True, varThreshold=50
)
# Color detection ranges (HSV)
self.color_ranges = {
'red': [(0, 50, 50), (10, 255, 255)],
'green': [(40, 50, 50), (80, 255, 255)],
'blue': [(100, 50, 50), (130, 255, 255)],
'yellow': [(20, 50, 50), (30, 255, 255)]
}
def detect_motion(self, frame):
"""Detect moving objects in the frame."""
if frame is None:
return []
# Apply background subtraction
fg_mask = self.bg_subtractor.apply(frame)
# Clean up the mask
kernel = np.ones((5, 5), np.uint8)
fg_mask = cv2.morphologyEx(fg_mask, cv2.MORPH_CLOSE, kernel)
fg_mask = cv2.morphologyEx(fg_mask, cv2.MORPH_OPEN, kernel)
# Find contours
contours, _ = cv2.findContours(fg_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
# Filter and analyze contours
detected_objects = []
for contour in contours:
area = cv2.contourArea(contour)
# Filter small objects
if area > 500:
x, y, w, h = cv2.boundingRect(contour)
center_x = x + w // 2
center_y = y + h // 2
detected_objects.append({
'type': 'moving_object',
'center': (center_x, center_y),
'bbox': (x, y, w, h),
'area': area
})
return detected_objects
def detect_colors(self, frame):
"""Detect objects based on color."""
if frame is None:
return []
# Convert to HSV for better color detection
hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
detected_colors = []
for color_name, (lower, upper) in self.color_ranges.items():
# Create mask for this color
lower_bound = np.array(lower)
upper_bound = np.array(upper)
mask = cv2.inRange(hsv, lower_bound, upper_bound)
# Clean up the mask
kernel = np.ones((5, 5), np.uint8)
mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
# Find contours
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for contour in contours:
area = cv2.contourArea(contour)
if area > 300: # Minimum area threshold
x, y, w, h = cv2.boundingRect(contour)
center_x = x + w // 2
center_y = y + h // 2
detected_colors.append({
'type': 'colored_object',
'color': color_name,
'center': (center_x, center_y),
'bbox': (x, y, w, h),
'area': area
})
return detected_colors
def detect_shapes(self, frame):
"""Detect basic geometric shapes."""
if frame is None:
return []
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
blurred = cv2.GaussianBlur(gray, (5, 5), 0)
edges = cv2.Canny(blurred, 50, 150)
contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
detected_shapes = []
for contour in contours:
area = cv2.contourArea(contour)
if area > 1000: # Filter small contours
# Approximate contour to polygon
epsilon = 0.02 * cv2.arcLength(contour, True)
approx = cv2.approxPolyDP(contour, epsilon, True)
x, y, w, h = cv2.boundingRect(contour)
center_x = x + w // 2
center_y = y + h // 2
# Classify shape based on number of vertices
vertices = len(approx)
if vertices == 3:
shape_type = "triangle"
elif vertices == 4:
# Check if it's a square or rectangle
aspect_ratio = float(w) / h
shape_type = "square" if 0.8 <= aspect_ratio <= 1.2 else "rectangle"
elif vertices > 8:
shape_type = "circle"
else:
shape_type = f"polygon_{vertices}"
detected_shapes.append({
'type': 'geometric_shape',
'shape': shape_type,
'center': (center_x, center_y),
'bbox': (x, y, w, h),
'area': area,
'vertices': vertices
})
return detected_shapes
# Usage example
detector = ObjectDetector(vision)
def run_object_detection_demo():
"""Run comprehensive object detection demo."""
print("Starting object detection demo...")
vision.start_camera_stream()
try:
for i in range(300): # Run for ~10 seconds
frame = vision.get_current_frame()
if frame is not None:
# Create a copy for drawing
display_frame = frame.copy()
# Detect different types of objects
moving_objects = detector.detect_motion(frame)
colored_objects = detector.detect_colors(frame)
shapes = detector.detect_shapes(frame)
# Draw detection results
# Draw moving objects in red
for obj in moving_objects:
x, y, w, h = obj['bbox']
cv2.rectangle(display_frame, (x, y), (x+w, y+h), (0, 0, 255), 2)
cv2.putText(display_frame, "MOVING", (x, y-10),
cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 0, 255), 2)
# Draw colored objects
for obj in colored_objects:
x, y, w, h = obj['bbox']
cv2.rectangle(display_frame, (x, y), (x+w, y+h), (0, 255, 0), 2)
cv2.putText(display_frame, obj['color'].upper(), (x, y-10),
cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
# Draw shapes
for obj in shapes:
x, y, w, h = obj['bbox']
cv2.rectangle(display_frame, (x, y), (x+w, y+h), (255, 0, 0), 2)
cv2.putText(display_frame, obj['shape'].upper(), (x, y-10),
cv2.FONT_HERSHEY_SIMPLEX, 0.6, (255, 0, 0), 2)
# Display results
key = vision.display_frame(display_frame, "Object Detection")
if key == ord('q'):
break
time.sleep(0.03)
finally:
vision.stop_camera_stream()
cv2.destroyAllWindows()
# Run the demo
run_object_detection_demo()
AI-Powered Object Detection with YOLO
For more sophisticated object recognition, let's integrate a state-of-the-art YOLO model that can identify the 80 everyday object categories in the COCO dataset out of the box, from people and pets to laptops and coffee cups.
from ultralytics import YOLO
import torch
class AIObjectDetector:
def __init__(self, vision_system):
"""Initialize AI-powered object detection."""
self.vision = vision_system
# Load pre-trained YOLO model
print("Loading YOLO model...")
self.model = YOLO('yolov8n.pt') # Nano version for speed
# COCO class names in model index order (YOLO also exposes these via self.model.names)
self.class_names = [
'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck',
'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench',
'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra',
'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee',
'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove',
'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup',
'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange',
'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch',
'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse',
'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink',
'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]
print("YOLO model loaded successfully!")
def detect_objects(self, frame, confidence_threshold=0.5):
"""Detect objects using YOLO model."""
if frame is None:
return []
# Run YOLO inference
results = self.model(frame, conf=confidence_threshold, verbose=False)
detected_objects = []
# Process results
for result in results:
boxes = result.boxes
if boxes is not None:
for box in boxes:
# Get bounding box coordinates
x1, y1, x2, y2 = box.xyxy[0].cpu().numpy()
# Get class and confidence
class_id = int(box.cls[0].cpu().numpy())
confidence = float(box.conf[0].cpu().numpy())
# Get class name
class_name = self.class_names[class_id] if class_id < len(self.class_names) else f"class_{class_id}"
# Calculate center point
center_x = int((x1 + x2) / 2)
center_y = int((y1 + y2) / 2)
detected_objects.append({
'type': 'ai_detected_object',
'class_name': class_name,
'confidence': confidence,
'center': (center_x, center_y),
'bbox': (int(x1), int(y1), int(x2-x1), int(y2-y1))
})
return detected_objects
def track_most_interesting_object(self, detected_objects):
"""Determine the most interesting object to track."""
if not detected_objects:
return None
# Priority scoring for different object types
priority_scores = {
'person': 100,
'cat': 90, 'dog': 90,
'bottle': 70, 'cup': 70,
'laptop': 80, 'cell phone': 75,
'book': 60,
'chair': 30, 'couch': 25
}
best_object = None
best_score = 0
for obj in detected_objects:
# Base score from priority
base_score = priority_scores.get(obj['class_name'], 40)
# Boost score based on confidence
confidence_boost = obj['confidence'] * 20
# Boost score for objects in center of frame
center_x, center_y = obj['center']
frame_center_x, frame_center_y = 320, 240 # Assuming 640x480 frame
distance_from_center = ((center_x - frame_center_x)**2 + (center_y - frame_center_y)**2)**0.5
center_boost = max(0, 50 - distance_from_center / 10)
total_score = base_score + confidence_boost + center_boost
if total_score > best_score:
best_score = total_score
best_object = obj
return best_object
# Integrate with Reachy's head movement
class ObjectTracker:
def __init__(self, reachy, ai_detector):
"""Initialize object tracking with head movement."""
self.reachy = reachy
self.ai_detector = ai_detector
self.tracking_target = None
self.tracking_history = []
def calculate_head_position(self, object_center, frame_size=(640, 480)):
"""Calculate where the head should look based on object position."""
center_x, center_y = object_center
frame_w, frame_h = frame_size
# Convert pixel coordinates to head movement coordinates
# Normalize to -1 to 1 range
norm_x = (center_x - frame_w/2) / (frame_w/2)
norm_y = (center_y - frame_h/2) / (frame_h/2)
# Scale to appropriate head movement range
head_x = norm_x * 30 # ±30 degrees horizontal
head_y = -norm_y * 20 # ±20 degrees vertical (inverted)
head_z = 50 # Fixed distance
return head_x, head_y, head_z
def smooth_tracking(self, target_position, smoothing_factor=0.7):
"""Apply smoothing to head movements for natural tracking."""
if not self.tracking_history:
self.tracking_history.append(target_position)
return target_position
# Exponential moving average
last_position = self.tracking_history[-1]
smooth_x = last_position[0] * smoothing_factor + target_position[0] * (1 - smoothing_factor)
smooth_y = last_position[1] * smoothing_factor + target_position[1] * (1 - smoothing_factor)
smooth_z = target_position[2] # Keep Z constant
smoothed_position = (smooth_x, smooth_y, smooth_z)
# Keep history limited
self.tracking_history.append(smoothed_position)
if len(self.tracking_history) > 5:
self.tracking_history.pop(0)
return smoothed_position
def track_object(self, frame):
"""Track objects and move head accordingly."""
detected_objects = self.ai_detector.detect_objects(frame)
if detected_objects:
# Find the most interesting object
target = self.ai_detector.track_most_interesting_object(detected_objects)
if target:
# Calculate head position
head_pos = self.calculate_head_position(target['center'])
# Apply smoothing
smooth_pos = self.smooth_tracking(head_pos)
# Move head to track object
self.reachy.head.look_at(
x=smooth_pos[0],
y=smooth_pos[1],
z=smooth_pos[2],
duration=0.5
)
# Provide feedback about what we're looking at
if target != self.tracking_target:
self.tracking_target = target
confidence_percent = int(target['confidence'] * 100)
print(f"Now tracking: {target['class_name']} ({confidence_percent}% confident)")
return target
else:
# No objects detected, return to neutral position
if self.tracking_target is not None:
self.reachy.head.look_at(x=0, y=0, z=50, duration=1.0)
self.tracking_target = None
print("No objects detected, returning to neutral position")
return None
# Complete object tracking demo
def run_ai_object_tracking():
"""Run AI-powered object tracking demo."""
print("Initializing AI object tracking...")
# Initialize components
vision.start_camera_stream()
ai_detector = AIObjectDetector(vision)
tracker = ObjectTracker(vision.reachy, ai_detector)
print("Starting object tracking - show objects to the camera!")
try:
for i in range(600): # Run for ~20 seconds
frame = vision.get_current_frame()
if frame is not None:
# Track objects and move head
tracked_object = tracker.track_object(frame)
# Create visualization
display_frame = frame.copy()
# Draw all detected objects
detected_objects = ai_detector.detect_objects(frame)
for obj in detected_objects:
x, y, w, h = obj['bbox']
confidence = obj['confidence']
class_name = obj['class_name']
# Color code by confidence
color = (0, 255, 0) if confidence > 0.7 else (0, 255, 255)
if obj == tracked_object:
color = (0, 0, 255) # Red for actively tracked object
cv2.rectangle(display_frame, (x, y), (x+w, y+h), color, 2)
# Label
label = f"{class_name}: {confidence:.2f}"
cv2.putText(display_frame, label, (x, y-10),
cv2.FONT_HERSHEY_SIMPLEX, 0.6, color, 2)
# Display frame
key = vision.display_frame(display_frame, "AI Object Tracking")
if key == ord('q'):
break
time.sleep(0.03)
finally:
vision.stop_camera_stream()
cv2.destroyAllWindows()
# Return to neutral position
tracker.reachy.head.look_at(x=0, y=0, z=50, duration=2.0)
print("Object tracking demo complete!")
# Run the AI tracking demo
run_ai_object_tracking()
Face Detection and Recognition
Face detection and recognition enable your Reachy Mini to interact naturally with people: following faces, recognizing individuals, and, with the small extension at the end of this section, responding to facial expressions.
import face_recognition
import pickle
import os
class FaceRecognitionSystem:
def __init__(self, vision_system, reachy):
"""Initialize face recognition system."""
self.vision = vision_system
self.reachy = reachy
# Known faces database
self.known_faces = []
self.known_names = []
self.faces_db_path = "known_faces.pkl"
# Face detection parameters
self.face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
# Load known faces if database exists
self.load_faces_database()
print("Face recognition system initialized!")
def detect_faces_opencv(self, frame):
"""Fast face detection using OpenCV."""
if frame is None:
return []
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
faces = self.face_cascade.detectMultiScale(
gray,
scaleFactor=1.1,
minNeighbors=5,
minSize=(30, 30)
)
detected_faces = []
for (x, y, w, h) in faces:
center_x = x + w // 2
center_y = y + h // 2
detected_faces.append({
'bbox': (x, y, w, h),
'center': (center_x, center_y),
'area': w * h
})
return detected_faces
def recognize_faces(self, frame):
"""Recognize faces using face_recognition library."""
if frame is None:
return []
# Convert BGR to RGB
rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
# Find face locations and encodings
face_locations = face_recognition.face_locations(rgb_frame, model='hog')
face_encodings = face_recognition.face_encodings(rgb_frame, face_locations)
recognized_faces = []
for (top, right, bottom, left), face_encoding in zip(face_locations, face_encodings):
# Check if face matches any known faces
matches = face_recognition.compare_faces(self.known_faces, face_encoding, tolerance=0.6)
name = "Unknown"
confidence = 0.0
if matches and any(matches):
# Find the best match
face_distances = face_recognition.face_distance(self.known_faces, face_encoding)
best_match_index = np.argmin(face_distances)
if matches[best_match_index]:
name = self.known_names[best_match_index]
confidence = 1.0 - face_distances[best_match_index]
# Calculate center point
center_x = (left + right) // 2
center_y = (top + bottom) // 2
recognized_faces.append({
'name': name,
'confidence': confidence,
'bbox': (left, top, right - left, bottom - top),
'center': (center_x, center_y),
'area': (right - left) * (bottom - top)
})
return recognized_faces
def add_known_face(self, frame, name, bbox=None):
"""Add a new face to the known faces database."""
if bbox is None:
# Detect faces automatically
faces = self.detect_faces_opencv(frame)
if not faces:
print("No face detected in the image!")
return False
bbox = faces[0]['bbox'] # Use the first detected face
x, y, w, h = bbox
# Extract face region
face_image = frame[y:y+h, x:x+w]
# Convert to RGB
rgb_face = cv2.cvtColor(face_image, cv2.COLOR_BGR2RGB)
# Encode the face
encodings = face_recognition.face_encodings(rgb_face)
if encodings:
encoding = encodings[0]
# Check if this person is already known
if name in self.known_names:
# Update existing encoding
index = self.known_names.index(name)
self.known_faces[index] = encoding
print(f"Updated face encoding for {name}")
else:
# Add new person
self.known_faces.append(encoding)
self.known_names.append(name)
print(f"Added new person: {name}")
# Save database
self.save_faces_database()
return True
else:
print("Could not encode the face!")
return False
def save_faces_database(self):
"""Save known faces database to file."""
database = {
'faces': self.known_faces,
'names': self.known_names
}
with open(self.faces_db_path, 'wb') as f:
pickle.dump(database, f)
print(f"Saved {len(self.known_names)} known faces to database")
def load_faces_database(self):
"""Load known faces database from file."""
if os.path.exists(self.faces_db_path):
try:
with open(self.faces_db_path, 'rb') as f:
database = pickle.load(f)
self.known_faces = database.get('faces', [])
self.known_names = database.get('names', [])
print(f"Loaded {len(self.known_names)} known faces from database")
except Exception as e:
print(f"Error loading faces database: {e}")
else:
print("No existing faces database found")
def greet_person(self, name, confidence):
"""Greet a recognized person."""
if name != "Unknown":
greeting = f"Hello {name}! Nice to see you again!"
self.reachy.antennas.happy()
else:
greeting = "Hello there! I don't think we've met before."
self.reachy.antennas.curious()
self.reachy.voice.say(greeting)
print(f"Greeting: {greeting} (confidence: {confidence:.2f})")
class FaceTracker:
def __init__(self, reachy, face_system):
"""Initialize face tracking system."""
self.reachy = reachy
self.face_system = face_system
self.current_target = None
self.last_greeting_time = {}
self.greeting_cooldown = 10.0 # seconds
def track_faces(self, frame):
"""Track faces and move head to follow."""
# Use fast OpenCV detection for tracking
faces = self.face_system.detect_faces_opencv(frame)
if faces:
# Find the largest face (closest person)
largest_face = max(faces, key=lambda f: f['area'])
# Calculate head position
center_x, center_y = largest_face['center']
frame_w, frame_h = frame.shape[1], frame.shape[0]
# Convert to head coordinates
norm_x = (center_x - frame_w/2) / (frame_w/2)
norm_y = (center_y - frame_h/2) / (frame_h/2)
head_x = norm_x * 25 # ±25 degrees
head_y = -norm_y * 15 # ±15 degrees
head_z = 45 # Closer for face interaction
# Move head smoothly
self.reachy.head.look_at(x=head_x, y=head_y, z=head_z, duration=0.8)
return largest_face
else:
# No faces detected
if self.current_target is not None:
self.reachy.head.look_at(x=0, y=0, z=50, duration=2.0)
self.current_target = None
return None
def recognize_and_greet(self, frame):
"""Recognize faces and greet people (less frequent due to computational cost)."""
current_time = time.time()
# Only run recognition every few seconds to save CPU
if not hasattr(self, 'last_recognition_time'):
self.last_recognition_time = 0
if current_time - self.last_recognition_time > 3.0: # Every 3 seconds
recognized_faces = self.face_system.recognize_faces(frame)
for face in recognized_faces:
name = face['name']
confidence = face['confidence']
# Check if we should greet this person
last_greeted = self.last_greeting_time.get(name, 0)
if current_time - last_greeted > self.greeting_cooldown:
self.face_system.greet_person(name, confidence)
self.last_greeting_time[name] = current_time
self.last_recognition_time = current_time
return recognized_faces
return []
# Demo: Interactive face recognition and tracking
def run_face_interaction_demo():
"""Run comprehensive face interaction demo."""
print("Starting face interaction demo...")
# Initialize systems
vision.start_camera_stream()
face_recognition_system = FaceRecognitionSystem(vision, vision.reachy)
face_tracker = FaceTracker(vision.reachy, face_recognition_system)
print("Face interaction active! Look at the camera and I'll track your face.")
print("Press 'a' to add your face to the database, 'q' to quit")
try:
for i in range(1800): # Run for ~1 minute
frame = vision.get_current_frame()
if frame is not None:
# Track faces (fast, every frame)
tracked_face = face_tracker.track_faces(frame)
# Recognize faces (slower, every few seconds)
recognized_faces = face_tracker.recognize_and_greet(frame)
# Create visualization
display_frame = frame.copy()
# Draw tracked faces
if tracked_face:
x, y, w, h = tracked_face['bbox']
cv2.rectangle(display_frame, (x, y), (x+w, y+h), (0, 255, 0), 2)
cv2.putText(display_frame, "TRACKING", (x, y-10),
cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
# Draw recognized faces
for face in recognized_faces:
x, y, w, h = face['bbox']
name = face['name']
confidence = face['confidence']
color = (0, 0, 255) if name != "Unknown" else (0, 255, 255)
cv2.rectangle(display_frame, (x, y), (x+w, y+h), color, 2)
label = f"{name}" if name != "Unknown" else "Unknown"
cv2.putText(display_frame, label, (x, y+h+20),
cv2.FONT_HERSHEY_SIMPLEX, 0.6, color, 2)
# Display instructions
cv2.putText(display_frame, "Press 'a' to add face, 'q' to quit",
(10, 30), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (255, 255, 255), 2)
# Display frame
key = vision.display_frame(display_frame, "Face Recognition")
if key == ord('q'):
break
elif key == ord('a'):
# Add current face to database
name = input("\nEnter name for this person: ")
if name and tracked_face:
success = face_recognition_system.add_known_face(frame, name, tracked_face['bbox'])
if success:
vision.reachy.voice.say(f"Nice to meet you, {name}!")
vision.reachy.antennas.happy()
time.sleep(0.03)
finally:
vision.stop_camera_stream()
cv2.destroyAllWindows()
# Return to neutral
vision.reachy.head.look_at(x=0, y=0, z=50, duration=2.0)
vision.reachy.voice.say("Thank you for the face interaction demo!")
# Run face interaction demo
run_face_interaction_demo()
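The demo above covers tracking and identity, but not yet the emotion analysis promised at the start of this guide. A quick way to add it is a Hugging Face image-classification pipeline fine-tuned on facial expressions. The sketch below makes two assumptions: the transformers and pillow packages are installed, and the checkpoint name is just one example of a facial-expression classifier from the Hub, so feel free to swap in another.
import cv2
from PIL import Image
from transformers import pipeline

# Example checkpoint (swap in any facial-expression classifier from the Hub)
emotion_classifier = pipeline("image-classification", model="trpakov/vit-face-expression")

def analyze_expression(frame, bbox):
    """Classify the expression inside a face bounding box (x, y, w, h)."""
    x, y, w, h = bbox
    face_crop = frame[y:y+h, x:x+w]
    rgb_face = Image.fromarray(cv2.cvtColor(face_crop, cv2.COLOR_BGR2RGB))
    predictions = emotion_classifier(rgb_face)
    return predictions[0]  # e.g. {'label': 'happy', 'score': 0.93}

# Usage alongside the FaceTracker above:
# tracked = face_tracker.track_faces(frame)
# if tracked:
#     print(analyze_expression(frame, tracked['bbox']))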
Gesture Recognition and Control
Gesture recognition allows your Reachy Mini to understand and respond to hand movements and poses, creating intuitive interaction methods.
import mediapipe as mp
class GestureRecognizer:
def __init__(self, vision_system, reachy):
"""Initialize gesture recognition system."""
self.vision = vision_system
self.reachy = reachy
# Initialize MediaPipe
self.mp_hands = mp.solutions.hands
self.hands = self.mp_hands.Hands(
static_image_mode=False,
max_num_hands=2,
min_detection_confidence=0.7,
min_tracking_confidence=0.5
)
self.mp_drawing = mp.solutions.drawing_utils
# Gesture history for smoothing
self.gesture_history = []
self.history_length = 5
print("Gesture recognition system initialized!")
def detect_hands(self, frame):
"""Detect hands and landmarks."""
if frame is None:
return []
# Convert BGR to RGB
rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
# Process frame
results = self.hands.process(rgb_frame)
detected_hands = []
if results.multi_hand_landmarks:
for hand_idx, hand_landmarks in enumerate(results.multi_hand_landmarks):
# Get hand classification (left/right)
hand_label = results.multi_handedness[hand_idx].classification[0].label
# Extract landmark positions
landmarks = []
for landmark in hand_landmarks.landmark:
x = int(landmark.x * frame.shape[1])
y = int(landmark.y * frame.shape[0])
landmarks.append((x, y))
detected_hands.append({
'label': hand_label.lower(),
'landmarks': landmarks,
'raw_landmarks': hand_landmarks
})
return detected_hands
def classify_gesture(self, landmarks):
"""Classify hand gesture based on finger positions."""
if not landmarks or len(landmarks) != 21:
return "unknown"
# Finger tip and pip indices
finger_tips = [4, 8, 12, 16, 20] # Thumb, Index, Middle, Ring, Pinky
finger_pips = [3, 6, 10, 14, 18]
# Check which fingers are extended
fingers_up = []
# Thumb (special case: compare x coordinates; this simple test assumes one hand orientation and can flip for the other hand)
if landmarks[finger_tips[0]][0] > landmarks[finger_pips[0]][0]:
fingers_up.append(1)
else:
fingers_up.append(0)
# Other fingers (compare y coordinates)
for i in range(1, 5):
if landmarks[finger_tips[i]][1] < landmarks[finger_pips[i]][1]:
fingers_up.append(1)
else:
fingers_up.append(0)
# Classify gestures based on finger patterns
total_fingers = sum(fingers_up)
if total_fingers == 0:
return "fist"
elif total_fingers == 1:
if fingers_up[1] == 1: # Only index finger
return "point"
elif fingers_up[0] == 1: # Only thumb
return "thumbs_up"
elif total_fingers == 2:
if fingers_up[1] == 1 and fingers_up[2] == 1: # Index and middle
return "peace"
elif fingers_up[0] == 1 and fingers_up[1] == 1: # Thumb and index
return "gun"
elif total_fingers == 5:
return "open_palm"
elif total_fingers == 3:
if fingers_up[1] == 1 and fingers_up[2] == 1 and fingers_up[3] == 1:
return "three"
return "unknown"
def smooth_gesture(self, current_gesture):
"""Apply temporal smoothing to gesture recognition."""
self.gesture_history.append(current_gesture)
if len(self.gesture_history) > self.history_length:
self.gesture_history.pop(0)
# Count occurrences of each gesture
gesture_counts = {}
for gesture in self.gesture_history:
gesture_counts[gesture] = gesture_counts.get(gesture, 0) + 1
# Return most common gesture
if gesture_counts:
return max(gesture_counts, key=gesture_counts.get)
else:
return "unknown"
def respond_to_gesture(self, gesture, hand_position=None):
"""Respond to recognized gestures."""
responses = {
"open_palm": {
"action": lambda: self.reachy.antennas.happy(),
"speech": "Hello! Nice to see you!",
"head_action": lambda: self.reachy.head.look_at(x=0, y=5, z=45, duration=1.0)
},
"thumbs_up": {
"action": lambda: self.reachy.antennas.excited(),
"speech": "Thumbs up! That's great!",
"head_action": lambda: self.reachy.head.look_at(x=0, y=10, z=45, duration=1.0)
},
"peace": {
"action": lambda: self.reachy.antennas.happy(),
"speech": "Peace! Let's be friends!",
"head_action": lambda: self.reachy.head.look_at(x=5, y=0, z=50, duration=1.0)
},
"point": {
"action": lambda: self.reachy.antennas.curious(),
"speech": "Are you pointing at something interesting?",
"head_action": self.look_in_pointing_direction
},
"fist": {
"action": lambda: self.reachy.antennas.neutral(),
"speech": "I see a fist. Are you ready for action?",
"head_action": lambda: self.reachy.head.look_at(x=0, y=0, z=45, duration=1.0)
}
}
if gesture in responses:
response = responses[gesture]
# Execute antenna action
response["action"]()
# Speak response
self.reachy.voice.say(response["speech"])
# Execute head action
if hand_position and gesture == "point":
response["head_action"](hand_position)
else:
response["head_action"]()
print(f"Responded to gesture: {gesture}")
def look_in_pointing_direction(self, hand_position):
"""Look in the direction the user is pointing."""
if hand_position:
# Calculate pointing direction based on hand position
center_x, center_y = hand_position
frame_w, frame_h = 640, 480
# Convert to head coordinates
norm_x = (center_x - frame_w/2) / (frame_w/2)
norm_y = (center_y - frame_h/2) / (frame_h/2)
head_x = norm_x * 30
head_y = -norm_y * 20
head_z = 50
self.reachy.head.look_at(x=head_x, y=head_y, z=head_z, duration=1.5)
# Look around a bit to show interest
time.sleep(2)
self.reachy.head.look_at(x=head_x + 10, y=head_y, z=head_z, duration=1.0)
time.sleep(1)
self.reachy.head.look_at(x=head_x - 10, y=head_y, z=head_z, duration=1.0)
class GestureController:
def __init__(self, gesture_recognizer):
"""Initialize gesture-based robot controller."""
self.gesture_recognizer = gesture_recognizer
self.last_gesture = None
self.last_response_time = 0
self.response_cooldown = 3.0 # seconds
def process_gestures(self, frame):
"""Process gestures and control robot accordingly."""
current_time = time.time()
# Detect hands
hands = self.gesture_recognizer.detect_hands(frame)
if hands:
for hand in hands:
# Classify gesture
gesture = self.gesture_recognizer.classify_gesture(hand['landmarks'])
# Apply smoothing
smooth_gesture = self.gesture_recognizer.smooth_gesture(gesture)
# Check if we should respond
if (smooth_gesture != self.last_gesture and
smooth_gesture != "unknown" and
current_time - self.last_response_time > self.response_cooldown):
# Calculate hand center position
landmarks = hand['landmarks']
center_x = sum(p[0] for p in landmarks) // len(landmarks)
center_y = sum(p[1] for p in landmarks) // len(landmarks)
hand_position = (center_x, center_y)
# Respond to gesture
self.gesture_recognizer.respond_to_gesture(smooth_gesture, hand_position)
self.last_gesture = smooth_gesture
self.last_response_time = current_time
return hands, smooth_gesture
else:
# No hands detected
if self.last_gesture is not None:
self.last_gesture = None
return [], "none"
# Gesture control demo
def run_gesture_control_demo():
"""Run interactive gesture control demo."""
print("Starting gesture control demo...")
# Initialize systems
vision.start_camera_stream()
gesture_recognizer = GestureRecognizer(vision, vision.reachy)
gesture_controller = GestureController(gesture_recognizer)
print("Gesture control active! Try these gestures:")
print("- Open palm: Wave hello")
print("- Thumbs up: Show approval")
print("- Peace sign: Peace greeting")
print("- Point: Look where you're pointing")
print("- Fist: Action ready")
print("Press 'q' to quit")
try:
for i in range(1200): # Run for ~40 seconds
frame = vision.get_current_frame()
if frame is not None:
# Process gestures
hands, current_gesture = gesture_controller.process_gestures(frame)
# Create visualization
display_frame = frame.copy()
# Draw hand landmarks
for hand in hands:
landmarks = hand['landmarks']
label = hand['label']
# Draw landmarks
for landmark in landmarks:
cv2.circle(display_frame, landmark, 3, (0, 255, 0), -1)
# Draw connections (simplified)
if len(landmarks) == 21:
# Draw some key connections
connections = [
(0, 1), (1, 2), (2, 3), (3, 4), # Thumb
(0, 5), (5, 6), (6, 7), (7, 8), # Index
(5, 9), (9, 10), (10, 11), (11, 12), # Middle
(9, 13), (13, 14), (14, 15), (15, 16), # Ring
(13, 17), (17, 18), (18, 19), (19, 20), # Pinky
(0, 17) # Palm
]
for start, end in connections:
if start < len(landmarks) and end < len(landmarks):
cv2.line(display_frame, landmarks[start], landmarks[end], (255, 0, 0), 2)
# Draw hand label
if landmarks:
center_x = sum(p[0] for p in landmarks) // len(landmarks)
center_y = sum(p[1] for p in landmarks) // len(landmarks)
cv2.putText(display_frame, f"{label.upper()}", (center_x-30, center_y-30),
cv2.FONT_HERSHEY_SIMPLEX, 0.7, (255, 255, 0), 2)
# Display current gesture
if current_gesture != "none" and current_gesture != "unknown":
cv2.putText(display_frame, f"Gesture: {current_gesture.upper()}", (10, 60),
cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 255), 2)
# Display instructions
cv2.putText(display_frame, "Show gestures to control robot - 'q' to quit",
(10, 30), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (255, 255, 255), 2)
# Display frame
key = vision.display_frame(display_frame, "Gesture Control")
if key == ord('q'):
break
time.sleep(0.03)
finally:
vision.stop_camera_stream()
cv2.destroyAllWindows()
# Return to neutral
vision.reachy.head.look_at(x=0, y=0, z=50, duration=2.0)
vision.reachy.voice.say("Gesture control demo complete! Thanks for playing!")
# Run gesture control demo
run_gesture_control_demo()
Advanced Applications and Integration
Now let's combine everything we've learned into sophisticated applications that showcase the full potential of Reachy Mini's computer vision capabilities.
Intelligent Desktop Companion
Project Idea: Create an intelligent desktop companion that recognizes you, tracks your activities, and provides contextual assistance based on what it sees.
class IntelligentCompanion:
def __init__(self, vision_system, reachy):
"""Initialize intelligent desktop companion."""
self.vision = vision_system
self.reachy = reachy
# Initialize all recognition systems
self.face_recognition = FaceRecognitionSystem(vision_system, reachy)
self.object_detector = AIObjectDetector(vision_system)
self.gesture_recognizer = GestureRecognizer(vision_system, reachy)
# Companion state
self.current_user = None
self.activity_context = []
self.interaction_mode = "passive" # passive, active, focused
# Learning and memory
self.user_preferences = {}
self.interaction_history = []
print("Intelligent companion initialized!")
def analyze_scene(self, frame):
"""Comprehensive scene analysis."""
scene_data = {
'timestamp': time.time(),
'faces': [],
'objects': [],
'gestures': [],
'activity': 'unknown'
}
# Face analysis
faces = self.face_recognition.recognize_faces(frame)
scene_data['faces'] = faces
# Object detection
objects = self.object_detector.detect_objects(frame)
scene_data['objects'] = objects
# Gesture recognition
hands = self.gesture_recognizer.detect_hands(frame)
if hands:
gestures = [self.gesture_recognizer.classify_gesture(hand['landmarks']) for hand in hands]
scene_data['gestures'] = gestures
# Activity inference
scene_data['activity'] = self.infer_activity(objects, scene_data['gestures'], faces)
return scene_data
def infer_activity(self, objects, gestures, faces):
"""Infer what the user is doing based on visible objects, gestures, and faces."""
object_names = [obj['class_name'] for obj in objects]
# Work-related activity
work_objects = ['laptop', 'keyboard', 'mouse', 'book', 'cell phone']
if any(obj in object_names for obj in work_objects):
if 'point' in gestures:
return 'presenting'
else:
return 'working'
# Eating/drinking
food_objects = ['cup', 'bottle', 'banana', 'apple', 'sandwich']
if any(obj in object_names for obj in food_objects):
return 'eating'
# Leisure
leisure_objects = ['tv', 'remote', 'book']
if any(obj in object_names for obj in leisure_objects):
return 'relaxing'
# Social interaction
if len(faces) > 1:
return 'socializing'
return 'unknown'
def provide_contextual_assistance(self, scene_data):
"""Provide help based on current context."""
activity = scene_data['activity']
objects = scene_data['objects']
faces = scene_data['faces']
# Greet new users
for face in faces:
if face['name'] != 'Unknown' and face['name'] != self.current_user:
self.current_user = face['name']
self.reachy.voice.say(f"Hello {face['name']}! I'm here to help.")
self.reachy.antennas.happy()
# Activity-specific assistance
if activity == 'working':
laptop_objects = [obj for obj in objects if obj['class_name'] == 'laptop']
if laptop_objects and not hasattr(self, 'work_assistance_given'):
self.reachy.voice.say("I see you're working. Let me know if you need a break reminder!")
self.work_assistance_given = True
elif activity == 'presenting':
if not hasattr(self, 'presentation_mode'):
self.reachy.voice.say("It looks like you're presenting. I'll be extra quiet.")
self.presentation_mode = True
elif activity == 'eating':
if not hasattr(self, 'meal_noted'):
self.reachy.voice.say("Enjoy your meal!")
self.reachy.antennas.happy()
self.meal_noted = True
def adaptive_behavior(self, scene_data):
"""Adapt behavior based on scene understanding."""
# Adjust interaction frequency based on activity
if scene_data['activity'] == 'working':
self.interaction_mode = 'passive'
elif scene_data['activity'] == 'socializing':
self.interaction_mode = 'active'
elif 'open_palm' in scene_data['gestures']:
self.interaction_mode = 'focused'
# Adjust head movement patterns
if self.interaction_mode == 'passive':
# Subtle, non-distracting movements
pass
elif self.interaction_mode == 'active':
# More expressive and engaging
if scene_data['faces']:
# Track faces more actively
pass
elif self.interaction_mode == 'focused':
# Full attention and engagement
self.reachy.antennas.curious()
def run_companion_session(self, duration_minutes=10):
"""Run intelligent companion session."""
print(f"Starting {duration_minutes}-minute companion session...")
self.vision.start_camera_stream()
start_time = time.time()
end_time = start_time + (duration_minutes * 60)
try:
while time.time() < end_time:
frame = self.vision.get_current_frame()
if frame is not None:
# Analyze scene
scene_data = self.analyze_scene(frame)
# Provide assistance
self.provide_contextual_assistance(scene_data)
# Adapt behavior
self.adaptive_behavior(scene_data)
# Log interaction
self.interaction_history.append(scene_data)
# Keep only recent history
if len(self.interaction_history) > 100:
self.interaction_history.pop(0)
time.sleep(1.0) # Check every second
finally:
self.vision.stop_camera_stream()
self.reachy.voice.say("Companion session complete. It was great spending time with you!")
# Demo: Run intelligent companion
def demo_intelligent_companion():
"""Demonstrate intelligent companion capabilities."""
companion = IntelligentCompanion(vision, vision.reachy)
# Run a 5-minute companion session
companion.run_companion_session(duration_minutes=5)
# Uncomment to run the demo
# demo_intelligent_companion()
Performance Optimization and Best Practices
Computer vision applications can be resource-intensive. Here are key strategies for optimizing performance on your Reachy Mini:
🎯 Frame Rate Management
Adjust processing frequency based on application needs. Use 30fps for tracking, 5fps for recognition.
📏 Resolution Optimization
Use lower resolutions (320x240) for real-time tasks, higher (640x480) for detailed analysis.
🧵 Threading Strategy
Separate capture, processing, and response threads to maintain smooth operation.
🎨 Model Selection
Choose appropriate model sizes: YOLOv8n for speed, YOLOv8s or YOLOv8m when you can trade some latency for accuracy.
# Performance optimization example
class OptimizedVision:
def __init__(self):
# Use different processing rates for different tasks
self.face_detection_interval = 0.1 # 10 FPS
self.object_detection_interval = 0.2 # 5 FPS
self.gesture_recognition_interval = 0.15 # ~7 FPS
# Frame resolution optimization
self.tracking_resolution = (320, 240)
self.analysis_resolution = (640, 480)
# Model optimization
self.fast_face_detector = cv2.CascadeClassifier(
cv2.data.haarcascades + 'haarcascade_frontalface_default.xml'
)
self.detailed_model = YOLO('yolov8n.pt') # Nano for speed
def optimize_frame(self, frame, task_type):
"""Optimize frame based on task requirements."""
if task_type == 'tracking':
return cv2.resize(frame, self.tracking_resolution)
elif task_type == 'analysis':
return cv2.resize(frame, self.analysis_resolution)
return frame
def batch_process(self, frames):
"""Process multiple frames in batch for efficiency."""
# Ultralytics accepts a list of images in a single call, which improves GPU utilization
return self.detailed_model(frames, verbose=False)
Troubleshooting Common Issues
Common Issues and Solutions:
- Low frame rate: Reduce resolution or processing frequency
- False positives: Raise confidence thresholds and add temporal filtering (see the sketch after this list)
- Poor lighting performance: Implement automatic exposure adjustment or equalize frames with CLAHE
- Memory issues: Keep frame buffers and history lists bounded instead of accumulating frames indefinitely
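For the false-positive problem in particular, a simple temporal filter goes a long way: only act on a label once it has appeared in several recent frames. Here is a minimal sketch; the class name and thresholds are illustrative rather than part of any SDK, and it plugs into the AIObjectDetector from earlier:
from collections import deque

class DetectionDebouncer:
    """Accept a label only after it appears in several of the last N frames."""

    def __init__(self, window=5, min_hits=3, min_confidence=0.6):
        self.min_hits = min_hits
        self.min_confidence = min_confidence
        self.history = deque(maxlen=window)

    def update(self, detections):
        # Keep only the labels that cleared the confidence threshold this frame
        labels = {d['class_name'] for d in detections if d['confidence'] >= self.min_confidence}
        self.history.append(labels)
        # A label is "stable" if it showed up in at least min_hits recent frames
        counts = {}
        for frame_labels in self.history:
            for label in frame_labels:
                counts[label] = counts.get(label, 0) + 1
        return {label for label, hits in counts.items() if hits >= self.min_hits}

# Usage: stable_labels = debouncer.update(ai_detector.detect_objects(frame))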
Future Possibilities and Extensions
The computer vision capabilities we've explored are just the beginning. Here are some exciting directions for further development:
- Augmented Reality Integration: Overlay digital information on the physical world
- 3D Scene Understanding: Use depth estimation for spatial awareness (a starter sketch follows this list)
- Behavioral Learning: Let your robot learn from your routines and preferences
- Multi-Robot Coordination: Enable multiple Reachy Minis to work together using vision
- Edge AI Optimization: Deploy custom-trained models optimized for your specific use case
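As a starting point for the 3D scene understanding idea, the transformers library ships a depth-estimation pipeline that works on single camera frames. Below is a minimal sketch assuming the Intel/dpt-large checkpoint; any monocular depth model from the Hub will do, and the output is relative depth rather than metric distance:
import cv2
import numpy as np
from PIL import Image
from transformers import pipeline

depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")

def estimate_depth(frame):
    """Return a depth map scaled to 0-255 for display (relative values, not metric)."""
    rgb = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    result = depth_estimator(rgb)
    depth = np.array(result["depth"], dtype=np.float32)
    return cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

# Usage: depth_map = estimate_depth(vision.get_current_frame())
# cv2.imshow("Depth", depth_map)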
Conclusion
Computer vision transforms your Reachy Mini from a simple robot into an intelligent companion capable of understanding and interacting with the visual world. From basic object detection to sophisticated gesture recognition and scene understanding, these capabilities open up endless possibilities for creative applications.
The key to successful computer vision applications is starting simple and gradually adding complexity. Begin with basic face tracking, then add object detection, and finally integrate gesture recognition to create rich, multi-modal interactions.
Keep Exploring! The computer vision field is rapidly evolving, with new models and techniques constantly emerging. Stay connected with the Hugging Face community to discover the latest breakthroughs and share your own innovations with fellow Reachy Mini developers.
Remember that the most compelling robotic applications often combine multiple modalities – vision, audio, and movement working together to create natural, intuitive interactions. Your Reachy Mini is the perfect platform for exploring these exciting possibilities!