Computer Vision Applications with Reachy Mini

Computer vision transforms your Reachy Mini from a simple robot into an intelligent companion capable of seeing, understanding, and interacting with the world. In this comprehensive guide, we'll explore how to implement object detection, face recognition, gesture control, and advanced AI-powered visual interactions using Reachy Mini's integrated camera system.

What you'll master: By the end of this tutorial, you'll know how to implement real-time object detection, create face-following behaviors, build gesture recognition systems, and integrate cutting-edge AI vision models from Hugging Face.

🎯 Vision Capabilities You'll Build

Face tracking • Object detection • Gesture recognition • Emotion analysis • Scene understanding

Understanding Reachy Mini's Vision System

Reachy Mini's vision system is built around a high-quality integrated camera that provides real-time video streaming capabilities. Combined with the robot's expressive head movements, this creates opportunities for rich visual interactions that feel natural and engaging.

Camera Specifications and Capabilities

📹 Video Streaming

Real-time video capture with adjustable resolution and frame rate for optimal performance

🔄 Head Integration

Seamless coordination between camera input and 6-DOF head movements

⚡ Low Latency

Optimized processing pipeline for responsive real-time interactions

🧠 AI Ready

Direct integration with OpenCV, PyTorch, and Hugging Face vision models

Setting Up Computer Vision Environment

Before diving into computer vision applications, let's set up a comprehensive development environment with all the necessary libraries and tools.

# Install essential computer vision libraries
pip install opencv-python
pip install opencv-contrib-python
pip install numpy
pip install scipy
pip install matplotlib
pip install pillow

# Install deep learning frameworks
pip install torch torchvision
pip install transformers
pip install ultralytics       # For YOLO object detection

# Install additional CV utilities
pip install mediapipe         # For pose and hand detection
pip install face-recognition  # Simplified face recognition
pip install dlib              # Advanced computer vision algorithms

# Install Reachy SDK if not already installed
pip install reachy-sdk
Performance Note: Computer vision applications can be CPU-intensive. For the best performance, consider running computationally heavy models on your host computer rather than directly on the Raspberry Pi version.

Basic Computer Vision Setup

Let's start with the fundamentals – accessing the camera, processing frames, and displaying results.

import cv2
import numpy as np
from reachy_sdk import ReachySDK
import time
import threading


class ReachyVision:
    def __init__(self, host='reachy-mini.local'):
        """Initialize Reachy Vision system."""
        self.reachy = ReachySDK(host=host)
        self.camera = self.reachy.camera
        self.running = False
        self.current_frame = None

        # Computer vision parameters
        self.frame_width = 640
        self.frame_height = 480
        self.fps_target = 30

        print("Reachy Vision system initialized!")

    def start_camera_stream(self):
        """Start the camera stream in a separate thread."""
        self.running = True
        self.camera_thread = threading.Thread(target=self._camera_loop)
        self.camera_thread.daemon = True
        self.camera_thread.start()
        print("Camera stream started")

    def _camera_loop(self):
        """Internal camera processing loop."""
        while self.running:
            try:
                # Capture frame
                frame = self.camera.capture_frame()
                if frame is not None:
                    # Resize for consistent processing
                    frame = cv2.resize(frame, (self.frame_width, self.frame_height))
                    self.current_frame = frame

                # Control frame rate
                time.sleep(1.0 / self.fps_target)
            except Exception as e:
                print(f"Camera error: {e}")
                time.sleep(0.1)

    def stop_camera_stream(self):
        """Stop the camera stream."""
        self.running = False
        if hasattr(self, 'camera_thread'):
            self.camera_thread.join()
        print("Camera stream stopped")

    def get_current_frame(self):
        """Get the most recent camera frame."""
        return self.current_frame.copy() if self.current_frame is not None else None

    def display_frame(self, frame, window_name="Reachy Vision"):
        """Display a frame (useful for debugging)."""
        if frame is not None:
            cv2.imshow(window_name, frame)
            return cv2.waitKey(1) & 0xFF
        return -1


# Initialize the vision system
vision = ReachyVision()
vision.start_camera_stream()

# Basic camera test
print("Testing camera feed...")
for i in range(100):  # Test for ~3 seconds
    frame = vision.get_current_frame()
    if frame is not None:
        # Add timestamp overlay
        timestamp = time.strftime("%H:%M:%S")
        cv2.putText(frame, timestamp, (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)

        # Display frame
        key = vision.display_frame(frame)
        if key == ord('q'):
            break
    time.sleep(0.03)

vision.stop_camera_stream()
cv2.destroyAllWindows()

Object Detection and Recognition

Object detection enables your Reachy Mini to identify and respond to objects in its environment. We'll implement both traditional computer vision approaches and modern AI-based detection.

Traditional Computer Vision Object Detection

class ObjectDetector:
    def __init__(self, vision_system):
        """Initialize object detection with traditional CV methods."""
        self.vision = vision_system

        # Initialize background subtractor for movement detection
        self.bg_subtractor = cv2.createBackgroundSubtractorMOG2(
            detectShadows=True, varThreshold=50
        )

        # Color detection ranges (HSV)
        self.color_ranges = {
            'red': [(0, 50, 50), (10, 255, 255)],
            'green': [(40, 50, 50), (80, 255, 255)],
            'blue': [(100, 50, 50), (130, 255, 255)],
            'yellow': [(20, 50, 50), (30, 255, 255)]
        }

    def detect_motion(self, frame):
        """Detect moving objects in the frame."""
        if frame is None:
            return []

        # Apply background subtraction
        fg_mask = self.bg_subtractor.apply(frame)

        # Clean up the mask
        kernel = np.ones((5, 5), np.uint8)
        fg_mask = cv2.morphologyEx(fg_mask, cv2.MORPH_CLOSE, kernel)
        fg_mask = cv2.morphologyEx(fg_mask, cv2.MORPH_OPEN, kernel)

        # Find contours
        contours, _ = cv2.findContours(fg_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

        # Filter and analyze contours
        detected_objects = []
        for contour in contours:
            area = cv2.contourArea(contour)
            # Filter small objects
            if area > 500:
                x, y, w, h = cv2.boundingRect(contour)
                center_x = x + w // 2
                center_y = y + h // 2
                detected_objects.append({
                    'type': 'moving_object',
                    'center': (center_x, center_y),
                    'bbox': (x, y, w, h),
                    'area': area
                })

        return detected_objects

    def detect_colors(self, frame):
        """Detect objects based on color."""
        if frame is None:
            return []

        # Convert to HSV for better color detection
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)

        detected_colors = []
        for color_name, (lower, upper) in self.color_ranges.items():
            # Create mask for this color
            lower_bound = np.array(lower)
            upper_bound = np.array(upper)
            mask = cv2.inRange(hsv, lower_bound, upper_bound)

            # Clean up the mask
            kernel = np.ones((5, 5), np.uint8)
            mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
            mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)

            # Find contours
            contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

            for contour in contours:
                area = cv2.contourArea(contour)
                if area > 300:  # Minimum area threshold
                    x, y, w, h = cv2.boundingRect(contour)
                    center_x = x + w // 2
                    center_y = y + h // 2
                    detected_colors.append({
                        'type': 'colored_object',
                        'color': color_name,
                        'center': (center_x, center_y),
                        'bbox': (x, y, w, h),
                        'area': area
                    })

        return detected_colors

    def detect_shapes(self, frame):
        """Detect basic geometric shapes."""
        if frame is None:
            return []

        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        blurred = cv2.GaussianBlur(gray, (5, 5), 0)
        edges = cv2.Canny(blurred, 50, 150)

        contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

        detected_shapes = []
        for contour in contours:
            area = cv2.contourArea(contour)
            if area > 1000:  # Filter small contours
                # Approximate contour to polygon
                epsilon = 0.02 * cv2.arcLength(contour, True)
                approx = cv2.approxPolyDP(contour, epsilon, True)

                x, y, w, h = cv2.boundingRect(contour)
                center_x = x + w // 2
                center_y = y + h // 2

                # Classify shape based on number of vertices
                vertices = len(approx)
                if vertices == 3:
                    shape_type = "triangle"
                elif vertices == 4:
                    # Check if it's a square or rectangle
                    aspect_ratio = float(w) / h
                    shape_type = "square" if 0.8 <= aspect_ratio <= 1.2 else "rectangle"
                elif vertices > 8:
                    shape_type = "circle"
                else:
                    shape_type = f"polygon_{vertices}"

                detected_shapes.append({
                    'type': 'geometric_shape',
                    'shape': shape_type,
                    'center': (center_x, center_y),
                    'bbox': (x, y, w, h),
                    'area': area,
                    'vertices': vertices
                })

        return detected_shapes


# Usage example
detector = ObjectDetector(vision)


def run_object_detection_demo():
    """Run comprehensive object detection demo."""
    print("Starting object detection demo...")
    vision.start_camera_stream()

    try:
        for i in range(300):  # Run for ~10 seconds
            frame = vision.get_current_frame()
            if frame is not None:
                # Create a copy for drawing
                display_frame = frame.copy()

                # Detect different types of objects
                moving_objects = detector.detect_motion(frame)
                colored_objects = detector.detect_colors(frame)
                shapes = detector.detect_shapes(frame)

                # Draw detection results
                # Draw moving objects in red
                for obj in moving_objects:
                    x, y, w, h = obj['bbox']
                    cv2.rectangle(display_frame, (x, y), (x+w, y+h), (0, 0, 255), 2)
                    cv2.putText(display_frame, "MOVING", (x, y-10),
                                cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 0, 255), 2)

                # Draw colored objects
                for obj in colored_objects:
                    x, y, w, h = obj['bbox']
                    cv2.rectangle(display_frame, (x, y), (x+w, y+h), (0, 255, 0), 2)
                    cv2.putText(display_frame, obj['color'].upper(), (x, y-10),
                                cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)

                # Draw shapes
                for obj in shapes:
                    x, y, w, h = obj['bbox']
                    cv2.rectangle(display_frame, (x, y), (x+w, y+h), (255, 0, 0), 2)
                    cv2.putText(display_frame, obj['shape'].upper(), (x, y-10),
                                cv2.FONT_HERSHEY_SIMPLEX, 0.6, (255, 0, 0), 2)

                # Display results
                key = vision.display_frame(display_frame, "Object Detection")
                if key == ord('q'):
                    break

            time.sleep(0.03)
    finally:
        vision.stop_camera_stream()
        cv2.destroyAllWindows()


# Run the demo
run_object_detection_demo()

AI-Powered Object Detection with YOLO

For more sophisticated object recognition, let's integrate a state-of-the-art YOLO model that can recognize the 80 common object categories from the COCO dataset out of the box.

from ultralytics import YOLO
import torch


class AIObjectDetector:
    def __init__(self, vision_system):
        """Initialize AI-powered object detection."""
        self.vision = vision_system

        # Load pre-trained YOLO model
        print("Loading YOLO model...")
        self.model = YOLO('yolov8n.pt')  # Nano version for speed

        # COCO class names (subset of most common objects)
        self.class_names = [
            'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train',
            'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign',
            'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep',
            'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella',
            'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard',
            'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard',
            'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork',
            'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange',
            'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair',
            'couch', 'potted plant', 'bed', 'dining table', 'toilet', 'tv',
            'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave',
            'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase',
            'scissors', 'teddy bear', 'hair drier'
        ]

        print("YOLO model loaded successfully!")

    def detect_objects(self, frame, confidence_threshold=0.5):
        """Detect objects using YOLO model."""
        if frame is None:
            return []

        # Run YOLO inference
        results = self.model(frame, conf=confidence_threshold, verbose=False)

        detected_objects = []

        # Process results
        for result in results:
            boxes = result.boxes
            if boxes is not None:
                for box in boxes:
                    # Get bounding box coordinates
                    x1, y1, x2, y2 = box.xyxy[0].cpu().numpy()

                    # Get class and confidence
                    class_id = int(box.cls[0].cpu().numpy())
                    confidence = float(box.conf[0].cpu().numpy())

                    # Get class name
                    class_name = self.class_names[class_id] if class_id < len(self.class_names) else f"class_{class_id}"

                    # Calculate center point
                    center_x = int((x1 + x2) / 2)
                    center_y = int((y1 + y2) / 2)

                    detected_objects.append({
                        'type': 'ai_detected_object',
                        'class_name': class_name,
                        'confidence': confidence,
                        'center': (center_x, center_y),
                        'bbox': (int(x1), int(y1), int(x2-x1), int(y2-y1))
                    })

        return detected_objects

    def track_most_interesting_object(self, detected_objects):
        """Determine the most interesting object to track."""
        if not detected_objects:
            return None

        # Priority scoring for different object types
        priority_scores = {
            'person': 100, 'cat': 90, 'dog': 90,
            'bottle': 70, 'cup': 70, 'laptop': 80,
            'cell phone': 75, 'book': 60,
            'chair': 30, 'couch': 25
        }

        best_object = None
        best_score = 0

        for obj in detected_objects:
            # Base score from priority
            base_score = priority_scores.get(obj['class_name'], 40)

            # Boost score based on confidence
            confidence_boost = obj['confidence'] * 20

            # Boost score for objects in center of frame
            center_x, center_y = obj['center']
            frame_center_x, frame_center_y = 320, 240  # Assuming 640x480 frame
            distance_from_center = ((center_x - frame_center_x)**2 + (center_y - frame_center_y)**2)**0.5
            center_boost = max(0, 50 - distance_from_center / 10)

            total_score = base_score + confidence_boost + center_boost

            if total_score > best_score:
                best_score = total_score
                best_object = obj

        return best_object


# Integrate with Reachy's head movement
class ObjectTracker:
    def __init__(self, reachy, ai_detector):
        """Initialize object tracking with head movement."""
        self.reachy = reachy
        self.ai_detector = ai_detector
        self.tracking_target = None
        self.tracking_history = []

    def calculate_head_position(self, object_center, frame_size=(640, 480)):
        """Calculate where the head should look based on object position."""
        center_x, center_y = object_center
        frame_w, frame_h = frame_size

        # Convert pixel coordinates to head movement coordinates
        # Normalize to -1 to 1 range
        norm_x = (center_x - frame_w/2) / (frame_w/2)
        norm_y = (center_y - frame_h/2) / (frame_h/2)

        # Scale to appropriate head movement range
        head_x = norm_x * 30   # ±30 degrees horizontal
        head_y = -norm_y * 20  # ±20 degrees vertical (inverted)
        head_z = 50            # Fixed distance

        return head_x, head_y, head_z

    def smooth_tracking(self, target_position, smoothing_factor=0.7):
        """Apply smoothing to head movements for natural tracking."""
        if not self.tracking_history:
            self.tracking_history.append(target_position)
            return target_position

        # Exponential moving average
        last_position = self.tracking_history[-1]
        smooth_x = last_position[0] * smoothing_factor + target_position[0] * (1 - smoothing_factor)
        smooth_y = last_position[1] * smoothing_factor + target_position[1] * (1 - smoothing_factor)
        smooth_z = target_position[2]  # Keep Z constant

        smoothed_position = (smooth_x, smooth_y, smooth_z)

        # Keep history limited
        self.tracking_history.append(smoothed_position)
        if len(self.tracking_history) > 5:
            self.tracking_history.pop(0)

        return smoothed_position

    def track_object(self, frame):
        """Track objects and move head accordingly."""
        detected_objects = self.ai_detector.detect_objects(frame)

        if detected_objects:
            # Find the most interesting object
            target = self.ai_detector.track_most_interesting_object(detected_objects)

            if target:
                # Calculate head position
                head_pos = self.calculate_head_position(target['center'])

                # Apply smoothing
                smooth_pos = self.smooth_tracking(head_pos)

                # Move head to track object
                self.reachy.head.look_at(
                    x=smooth_pos[0], y=smooth_pos[1], z=smooth_pos[2],
                    duration=0.5
                )

                # Provide feedback about what we're looking at
                if target != self.tracking_target:
                    self.tracking_target = target
                    confidence_percent = int(target['confidence'] * 100)
                    print(f"Now tracking: {target['class_name']} ({confidence_percent}% confident)")

                return target
        else:
            # No objects detected, return to neutral position
            if self.tracking_target is not None:
                self.reachy.head.look_at(x=0, y=0, z=50, duration=1.0)
                self.tracking_target = None
                print("No objects detected, returning to neutral position")

        return None


# Complete object tracking demo
def run_ai_object_tracking():
    """Run AI-powered object tracking demo."""
    print("Initializing AI object tracking...")

    # Initialize components
    vision.start_camera_stream()
    ai_detector = AIObjectDetector(vision)
    tracker = ObjectTracker(vision.reachy, ai_detector)

    print("Starting object tracking - show objects to the camera!")

    try:
        for i in range(600):  # Run for ~20 seconds
            frame = vision.get_current_frame()
            if frame is not None:
                # Track objects and move head
                tracked_object = tracker.track_object(frame)

                # Create visualization
                display_frame = frame.copy()

                # Draw all detected objects
                detected_objects = ai_detector.detect_objects(frame)
                for obj in detected_objects:
                    x, y, w, h = obj['bbox']
                    confidence = obj['confidence']
                    class_name = obj['class_name']

                    # Color code by confidence
                    color = (0, 255, 0) if confidence > 0.7 else (0, 255, 255)
                    if obj == tracked_object:
                        color = (0, 0, 255)  # Red for actively tracked object

                    cv2.rectangle(display_frame, (x, y), (x+w, y+h), color, 2)

                    # Label
                    label = f"{class_name}: {confidence:.2f}"
                    cv2.putText(display_frame, label, (x, y-10),
                                cv2.FONT_HERSHEY_SIMPLEX, 0.6, color, 2)

                # Display frame
                key = vision.display_frame(display_frame, "AI Object Tracking")
                if key == ord('q'):
                    break

            time.sleep(0.03)
    finally:
        vision.stop_camera_stream()
        cv2.destroyAllWindows()

        # Return to neutral position
        tracker.reachy.head.look_at(x=0, y=0, z=50, duration=2.0)
        print("Object tracking demo complete!")


# Run the AI tracking demo
run_ai_object_tracking()

Face Detection and Recognition

Face detection and recognition enable your Reachy Mini to interact naturally with people, following faces, recognizing individuals, and responding to facial expressions.

import face_recognition
import pickle
import os


class FaceRecognitionSystem:
    def __init__(self, vision_system, reachy):
        """Initialize face recognition system."""
        self.vision = vision_system
        self.reachy = reachy

        # Known faces database
        self.known_faces = []
        self.known_names = []
        self.faces_db_path = "known_faces.pkl"

        # Face detection parameters
        self.face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

        # Load known faces if database exists
        self.load_faces_database()

        print("Face recognition system initialized!")

    def detect_faces_opencv(self, frame):
        """Fast face detection using OpenCV."""
        if frame is None:
            return []

        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = self.face_cascade.detectMultiScale(
            gray, scaleFactor=1.1, minNeighbors=5, minSize=(30, 30)
        )

        detected_faces = []
        for (x, y, w, h) in faces:
            center_x = x + w // 2
            center_y = y + h // 2
            detected_faces.append({
                'bbox': (x, y, w, h),
                'center': (center_x, center_y),
                'area': w * h
            })

        return detected_faces

    def recognize_faces(self, frame):
        """Recognize faces using face_recognition library."""
        if frame is None:
            return []

        # Convert BGR to RGB
        rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

        # Find face locations and encodings
        face_locations = face_recognition.face_locations(rgb_frame, model='hog')
        face_encodings = face_recognition.face_encodings(rgb_frame, face_locations)

        recognized_faces = []
        for (top, right, bottom, left), face_encoding in zip(face_locations, face_encodings):
            # Check if face matches any known faces
            matches = face_recognition.compare_faces(self.known_faces, face_encoding, tolerance=0.6)
            name = "Unknown"
            confidence = 0.0

            if matches and any(matches):
                # Find the best match
                face_distances = face_recognition.face_distance(self.known_faces, face_encoding)
                best_match_index = np.argmin(face_distances)

                if matches[best_match_index]:
                    name = self.known_names[best_match_index]
                    confidence = 1.0 - face_distances[best_match_index]

            # Calculate center point
            center_x = (left + right) // 2
            center_y = (top + bottom) // 2

            recognized_faces.append({
                'name': name,
                'confidence': confidence,
                'bbox': (left, top, right - left, bottom - top),
                'center': (center_x, center_y),
                'area': (right - left) * (bottom - top)
            })

        return recognized_faces

    def add_known_face(self, frame, name, bbox=None):
        """Add a new face to the known faces database."""
        if bbox is None:
            # Detect faces automatically
            faces = self.detect_faces_opencv(frame)
            if not faces:
                print("No face detected in the image!")
                return False
            bbox = faces[0]['bbox']  # Use the first detected face

        x, y, w, h = bbox

        # Extract face region
        face_image = frame[y:y+h, x:x+w]

        # Convert to RGB
        rgb_face = cv2.cvtColor(face_image, cv2.COLOR_BGR2RGB)

        # Encode the face
        encodings = face_recognition.face_encodings(rgb_face)

        if encodings:
            encoding = encodings[0]

            # Check if this person is already known
            if name in self.known_names:
                # Update existing encoding
                index = self.known_names.index(name)
                self.known_faces[index] = encoding
                print(f"Updated face encoding for {name}")
            else:
                # Add new person
                self.known_faces.append(encoding)
                self.known_names.append(name)
                print(f"Added new person: {name}")

            # Save database
            self.save_faces_database()
            return True
        else:
            print("Could not encode the face!")
            return False

    def save_faces_database(self):
        """Save known faces database to file."""
        database = {
            'faces': self.known_faces,
            'names': self.known_names
        }
        with open(self.faces_db_path, 'wb') as f:
            pickle.dump(database, f)
        print(f"Saved {len(self.known_names)} known faces to database")

    def load_faces_database(self):
        """Load known faces database from file."""
        if os.path.exists(self.faces_db_path):
            try:
                with open(self.faces_db_path, 'rb') as f:
                    database = pickle.load(f)
                self.known_faces = database.get('faces', [])
                self.known_names = database.get('names', [])
                print(f"Loaded {len(self.known_names)} known faces from database")
            except Exception as e:
                print(f"Error loading faces database: {e}")
        else:
            print("No existing faces database found")

    def greet_person(self, name, confidence):
        """Greet a recognized person."""
        if name != "Unknown":
            greeting = f"Hello {name}! Nice to see you again!"
            self.reachy.antennas.happy()
        else:
            greeting = "Hello there! I don't think we've met before."
            self.reachy.antennas.curious()

        self.reachy.voice.say(greeting)
        print(f"Greeting: {greeting} (confidence: {confidence:.2f})")


class FaceTracker:
    def __init__(self, reachy, face_system):
        """Initialize face tracking system."""
        self.reachy = reachy
        self.face_system = face_system
        self.current_target = None
        self.last_greeting_time = {}
        self.greeting_cooldown = 10.0  # seconds

    def track_faces(self, frame):
        """Track faces and move head to follow."""
        # Use fast OpenCV detection for tracking
        faces = self.face_system.detect_faces_opencv(frame)

        if faces:
            # Find the largest face (closest person)
            largest_face = max(faces, key=lambda f: f['area'])

            # Calculate head position
            center_x, center_y = largest_face['center']
            frame_w, frame_h = frame.shape[1], frame.shape[0]

            # Convert to head coordinates
            norm_x = (center_x - frame_w/2) / (frame_w/2)
            norm_y = (center_y - frame_h/2) / (frame_h/2)

            head_x = norm_x * 25   # ±25 degrees
            head_y = -norm_y * 15  # ±15 degrees
            head_z = 45            # Closer for face interaction

            # Move head smoothly
            self.reachy.head.look_at(x=head_x, y=head_y, z=head_z, duration=0.8)

            # Remember the face we are following so we can return to neutral later
            self.current_target = largest_face

            return largest_face
        else:
            # No faces detected
            if self.current_target is not None:
                self.reachy.head.look_at(x=0, y=0, z=50, duration=2.0)
                self.current_target = None

            return None

    def recognize_and_greet(self, frame):
        """Recognize faces and greet people (less frequent due to computational cost)."""
        current_time = time.time()

        # Only run recognition every few seconds to save CPU
        if not hasattr(self, 'last_recognition_time'):
            self.last_recognition_time = 0

        if current_time - self.last_recognition_time > 3.0:  # Every 3 seconds
            recognized_faces = self.face_system.recognize_faces(frame)

            for face in recognized_faces:
                name = face['name']
                confidence = face['confidence']

                # Check if we should greet this person
                last_greeted = self.last_greeting_time.get(name, 0)
                if current_time - last_greeted > self.greeting_cooldown:
                    self.face_system.greet_person(name, confidence)
                    self.last_greeting_time[name] = current_time

            self.last_recognition_time = current_time
            return recognized_faces

        return []


# Demo: Interactive face recognition and tracking
def run_face_interaction_demo():
    """Run comprehensive face interaction demo."""
    print("Starting face interaction demo...")

    # Initialize systems
    vision.start_camera_stream()
    face_recognition_system = FaceRecognitionSystem(vision, vision.reachy)
    face_tracker = FaceTracker(vision.reachy, face_recognition_system)

    print("Face interaction active! Look at the camera and I'll track your face.")
    print("Press 'a' to add your face to the database, 'q' to quit")

    try:
        for i in range(1800):  # Run for ~1 minute
            frame = vision.get_current_frame()
            if frame is not None:
                # Track faces (fast, every frame)
                tracked_face = face_tracker.track_faces(frame)

                # Recognize faces (slower, every few seconds)
                recognized_faces = face_tracker.recognize_and_greet(frame)

                # Create visualization
                display_frame = frame.copy()

                # Draw tracked faces
                if tracked_face:
                    x, y, w, h = tracked_face['bbox']
                    cv2.rectangle(display_frame, (x, y), (x+w, y+h), (0, 255, 0), 2)
                    cv2.putText(display_frame, "TRACKING", (x, y-10),
                                cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)

                # Draw recognized faces
                for face in recognized_faces:
                    x, y, w, h = face['bbox']
                    name = face['name']
                    confidence = face['confidence']

                    color = (0, 0, 255) if name != "Unknown" else (0, 255, 255)
                    cv2.rectangle(display_frame, (x, y), (x+w, y+h), color, 2)

                    label = f"{name}" if name != "Unknown" else "Unknown"
                    cv2.putText(display_frame, label, (x, y+h+20),
                                cv2.FONT_HERSHEY_SIMPLEX, 0.6, color, 2)

                # Display instructions
                cv2.putText(display_frame, "Press 'a' to add face, 'q' to quit", (10, 30),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.6, (255, 255, 255), 2)

                # Display frame
                key = vision.display_frame(display_frame, "Face Recognition")
                if key == ord('q'):
                    break
                elif key == ord('a'):
                    # Add current face to database
                    name = input("\nEnter name for this person: ")
                    if name and tracked_face:
                        success = face_recognition_system.add_known_face(frame, name, tracked_face['bbox'])
                        if success:
                            vision.reachy.voice.say(f"Nice to meet you, {name}!")
                            vision.reachy.antennas.happy()

            time.sleep(0.03)
    finally:
        vision.stop_camera_stream()
        cv2.destroyAllWindows()

        # Return to neutral
        vision.reachy.head.look_at(x=0, y=0, z=50, duration=2.0)
        vision.reachy.voice.say("Thank you for the face interaction demo!")


# Run face interaction demo
run_face_interaction_demo()

Gesture Recognition and Control

Gesture recognition allows your Reachy Mini to understand and respond to hand movements and poses, creating intuitive interaction methods.

import mediapipe as mp


class GestureRecognizer:
    def __init__(self, vision_system, reachy):
        """Initialize gesture recognition system."""
        self.vision = vision_system
        self.reachy = reachy

        # Initialize MediaPipe
        self.mp_hands = mp.solutions.hands
        self.hands = self.mp_hands.Hands(
            static_image_mode=False,
            max_num_hands=2,
            min_detection_confidence=0.7,
            min_tracking_confidence=0.5
        )
        self.mp_drawing = mp.solutions.drawing_utils

        # Gesture history for smoothing
        self.gesture_history = []
        self.history_length = 5

        print("Gesture recognition system initialized!")

    def detect_hands(self, frame):
        """Detect hands and landmarks."""
        if frame is None:
            return []

        # Convert BGR to RGB
        rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

        # Process frame
        results = self.hands.process(rgb_frame)

        detected_hands = []
        if results.multi_hand_landmarks:
            for hand_idx, hand_landmarks in enumerate(results.multi_hand_landmarks):
                # Get hand classification (left/right)
                hand_label = results.multi_handedness[hand_idx].classification[0].label

                # Extract landmark positions
                landmarks = []
                for landmark in hand_landmarks.landmark:
                    x = int(landmark.x * frame.shape[1])
                    y = int(landmark.y * frame.shape[0])
                    landmarks.append((x, y))

                detected_hands.append({
                    'label': hand_label.lower(),
                    'landmarks': landmarks,
                    'raw_landmarks': hand_landmarks
                })

        return detected_hands

    def classify_gesture(self, landmarks):
        """Classify hand gesture based on finger positions."""
        if not landmarks or len(landmarks) != 21:
            return "unknown"

        # Finger tip and pip indices
        finger_tips = [4, 8, 12, 16, 20]  # Thumb, Index, Middle, Ring, Pinky
        finger_pips = [3, 6, 10, 14, 18]

        # Check which fingers are extended
        fingers_up = []

        # Thumb (special case - compare x coordinates)
        if landmarks[finger_tips[0]][0] > landmarks[finger_pips[0]][0]:
            fingers_up.append(1)
        else:
            fingers_up.append(0)

        # Other fingers (compare y coordinates)
        for i in range(1, 5):
            if landmarks[finger_tips[i]][1] < landmarks[finger_pips[i]][1]:
                fingers_up.append(1)
            else:
                fingers_up.append(0)

        # Classify gestures based on finger patterns
        total_fingers = sum(fingers_up)

        if total_fingers == 0:
            return "fist"
        elif total_fingers == 1:
            if fingers_up[1] == 1:  # Only index finger
                return "point"
            elif fingers_up[0] == 1:  # Only thumb
                return "thumbs_up"
        elif total_fingers == 2:
            if fingers_up[1] == 1 and fingers_up[2] == 1:  # Index and middle
                return "peace"
            elif fingers_up[0] == 1 and fingers_up[1] == 1:  # Thumb and index
                return "gun"
        elif total_fingers == 5:
            return "open_palm"
        elif total_fingers == 3:
            if fingers_up[1] == 1 and fingers_up[2] == 1 and fingers_up[3] == 1:
                return "three"

        return "unknown"

    def smooth_gesture(self, current_gesture):
        """Apply temporal smoothing to gesture recognition."""
        self.gesture_history.append(current_gesture)
        if len(self.gesture_history) > self.history_length:
            self.gesture_history.pop(0)

        # Count occurrences of each gesture
        gesture_counts = {}
        for gesture in self.gesture_history:
            gesture_counts[gesture] = gesture_counts.get(gesture, 0) + 1

        # Return most common gesture
        if gesture_counts:
            return max(gesture_counts, key=gesture_counts.get)
        else:
            return "unknown"

    def respond_to_gesture(self, gesture, hand_position=None):
        """Respond to recognized gestures."""
        responses = {
            "open_palm": {
                "action": lambda: self.reachy.antennas.happy(),
                "speech": "Hello! Nice to see you!",
                "head_action": lambda: self.reachy.head.look_at(x=0, y=5, z=45, duration=1.0)
            },
            "thumbs_up": {
                "action": lambda: self.reachy.antennas.excited(),
                "speech": "Thumbs up! That's great!",
                "head_action": lambda: self.reachy.head.look_at(x=0, y=10, z=45, duration=1.0)
            },
            "peace": {
                "action": lambda: self.reachy.antennas.happy(),
                "speech": "Peace! Let's be friends!",
                "head_action": lambda: self.reachy.head.look_at(x=5, y=0, z=50, duration=1.0)
            },
            "point": {
                "action": lambda: self.reachy.antennas.curious(),
                "speech": "Are you pointing at something interesting?",
                "head_action": self.look_in_pointing_direction
            },
            "fist": {
                "action": lambda: self.reachy.antennas.neutral(),
                "speech": "I see a fist. Are you ready for action?",
                "head_action": lambda: self.reachy.head.look_at(x=0, y=0, z=45, duration=1.0)
            }
        }

        if gesture in responses:
            response = responses[gesture]

            # Execute antenna action
            response["action"]()

            # Speak response
            self.reachy.voice.say(response["speech"])

            # Execute head action
            if hand_position and gesture == "point":
                response["head_action"](hand_position)
            else:
                response["head_action"]()

            print(f"Responded to gesture: {gesture}")

    def look_in_pointing_direction(self, hand_position):
        """Look in the direction the user is pointing."""
        if hand_position:
            # Calculate pointing direction based on hand position
            center_x, center_y = hand_position
            frame_w, frame_h = 640, 480

            # Convert to head coordinates
            norm_x = (center_x - frame_w/2) / (frame_w/2)
            norm_y = (center_y - frame_h/2) / (frame_h/2)

            head_x = norm_x * 30
            head_y = -norm_y * 20
            head_z = 50

            self.reachy.head.look_at(x=head_x, y=head_y, z=head_z, duration=1.5)

            # Look around a bit to show interest
            time.sleep(2)
            self.reachy.head.look_at(x=head_x + 10, y=head_y, z=head_z, duration=1.0)
            time.sleep(1)
            self.reachy.head.look_at(x=head_x - 10, y=head_y, z=head_z, duration=1.0)


class GestureController:
    def __init__(self, gesture_recognizer):
        """Initialize gesture-based robot controller."""
        self.gesture_recognizer = gesture_recognizer
        self.last_gesture = None
        self.last_response_time = 0
        self.response_cooldown = 3.0  # seconds

    def process_gestures(self, frame):
        """Process gestures and control robot accordingly."""
        current_time = time.time()

        # Detect hands
        hands = self.gesture_recognizer.detect_hands(frame)

        if hands:
            for hand in hands:
                # Classify gesture
                gesture = self.gesture_recognizer.classify_gesture(hand['landmarks'])

                # Apply smoothing
                smooth_gesture = self.gesture_recognizer.smooth_gesture(gesture)

                # Check if we should respond
                if (smooth_gesture != self.last_gesture and
                        smooth_gesture != "unknown" and
                        current_time - self.last_response_time > self.response_cooldown):

                    # Calculate hand center position
                    landmarks = hand['landmarks']
                    center_x = sum(p[0] for p in landmarks) // len(landmarks)
                    center_y = sum(p[1] for p in landmarks) // len(landmarks)
                    hand_position = (center_x, center_y)

                    # Respond to gesture
                    self.gesture_recognizer.respond_to_gesture(smooth_gesture, hand_position)

                    self.last_gesture = smooth_gesture
                    self.last_response_time = current_time

            return hands, smooth_gesture
        else:
            # No hands detected
            if self.last_gesture is not None:
                self.last_gesture = None

            return [], "none"


# Gesture control demo
def run_gesture_control_demo():
    """Run interactive gesture control demo."""
    print("Starting gesture control demo...")

    # Initialize systems
    vision.start_camera_stream()
    gesture_recognizer = GestureRecognizer(vision, vision.reachy)
    gesture_controller = GestureController(gesture_recognizer)

    print("Gesture control active! Try these gestures:")
    print("- Open palm: Wave hello")
    print("- Thumbs up: Show approval")
    print("- Peace sign: Peace greeting")
    print("- Point: Look where you're pointing")
    print("- Fist: Action ready")
    print("Press 'q' to quit")

    try:
        for i in range(1200):  # Run for ~40 seconds
            frame = vision.get_current_frame()
            if frame is not None:
                # Process gestures
                hands, current_gesture = gesture_controller.process_gestures(frame)

                # Create visualization
                display_frame = frame.copy()

                # Draw hand landmarks
                for hand in hands:
                    landmarks = hand['landmarks']
                    label = hand['label']

                    # Draw landmarks
                    for landmark in landmarks:
                        cv2.circle(display_frame, landmark, 3, (0, 255, 0), -1)

                    # Draw connections (simplified)
                    if len(landmarks) == 21:
                        # Draw some key connections
                        connections = [
                            (0, 1), (1, 2), (2, 3), (3, 4),         # Thumb
                            (0, 5), (5, 6), (6, 7), (7, 8),         # Index
                            (5, 9), (9, 10), (10, 11), (11, 12),    # Middle
                            (9, 13), (13, 14), (14, 15), (15, 16),  # Ring
                            (13, 17), (17, 18), (18, 19), (19, 20), # Pinky
                            (0, 17)                                 # Palm
                        ]
                        for start, end in connections:
                            if start < len(landmarks) and end < len(landmarks):
                                cv2.line(display_frame, landmarks[start], landmarks[end], (255, 0, 0), 2)

                    # Draw hand label
                    if landmarks:
                        center_x = sum(p[0] for p in landmarks) // len(landmarks)
                        center_y = sum(p[1] for p in landmarks) // len(landmarks)
                        cv2.putText(display_frame, f"{label.upper()}", (center_x-30, center_y-30),
                                    cv2.FONT_HERSHEY_SIMPLEX, 0.7, (255, 255, 0), 2)

                # Display current gesture
                if current_gesture != "none" and current_gesture != "unknown":
                    cv2.putText(display_frame, f"Gesture: {current_gesture.upper()}", (10, 60),
                                cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 255), 2)

                # Display instructions
                cv2.putText(display_frame, "Show gestures to control robot - 'q' to quit", (10, 30),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.6, (255, 255, 255), 2)

                # Display frame
                key = vision.display_frame(display_frame, "Gesture Control")
                if key == ord('q'):
                    break

            time.sleep(0.03)
    finally:
        vision.stop_camera_stream()
        cv2.destroyAllWindows()

        # Return to neutral
        vision.reachy.head.look_at(x=0, y=0, z=50, duration=2.0)
        vision.reachy.voice.say("Gesture control demo complete! Thanks for playing!")


# Run gesture control demo
run_gesture_control_demo()

Advanced Applications and Integration

Now let's combine everything we've learned into sophisticated applications that showcase the full potential of Reachy Mini's computer vision capabilities.

Intelligent Desktop Companion

Project Idea: Create an intelligent desktop companion that recognizes you, tracks your activities, and provides contextual assistance based on what it sees.
class IntelligentCompanion:
    def __init__(self, vision_system, reachy):
        """Initialize intelligent desktop companion."""
        self.vision = vision_system
        self.reachy = reachy

        # Initialize all recognition systems
        self.face_recognition = FaceRecognitionSystem(vision_system, reachy)
        self.object_detector = AIObjectDetector(vision_system)
        self.gesture_recognizer = GestureRecognizer(vision_system, reachy)

        # Companion state
        self.current_user = None
        self.activity_context = []
        self.interaction_mode = "passive"  # passive, active, focused

        # Learning and memory
        self.user_preferences = {}
        self.interaction_history = []

        print("Intelligent companion initialized!")

    def analyze_scene(self, frame):
        """Comprehensive scene analysis."""
        scene_data = {
            'timestamp': time.time(),
            'faces': [],
            'objects': [],
            'gestures': [],
            'activity': 'unknown'
        }

        # Face analysis
        faces = self.face_recognition.recognize_faces(frame)
        scene_data['faces'] = faces

        # Object detection
        objects = self.object_detector.detect_objects(frame)
        scene_data['objects'] = objects

        # Gesture recognition
        hands = self.gesture_recognizer.detect_hands(frame)
        if hands:
            gestures = [self.gesture_recognizer.classify_gesture(hand['landmarks']) for hand in hands]
            scene_data['gestures'] = gestures

        # Activity inference
        scene_data['activity'] = self.infer_activity(objects, scene_data['gestures'], faces)

        return scene_data

    def infer_activity(self, objects, gestures, faces):
        """Infer what the user is doing based on visible objects, gestures, and faces."""
        object_names = [obj['class_name'] for obj in objects]

        # Work-related activity
        work_objects = ['laptop', 'keyboard', 'mouse', 'book', 'cell phone']
        if any(obj in object_names for obj in work_objects):
            if 'point' in gestures:
                return 'presenting'
            else:
                return 'working'

        # Eating/drinking
        food_objects = ['cup', 'bottle', 'banana', 'apple', 'sandwich']
        if any(obj in object_names for obj in food_objects):
            return 'eating'

        # Leisure
        leisure_objects = ['tv', 'remote', 'book']
        if any(obj in object_names for obj in leisure_objects):
            return 'relaxing'

        # Social interaction
        if len(faces) > 1:
            return 'socializing'

        return 'unknown'

    def provide_contextual_assistance(self, scene_data):
        """Provide help based on current context."""
        activity = scene_data['activity']
        objects = scene_data['objects']
        faces = scene_data['faces']

        # Greet new users
        for face in faces:
            if face['name'] != 'Unknown' and face['name'] != self.current_user:
                self.current_user = face['name']
                self.reachy.voice.say(f"Hello {face['name']}! I'm here to help.")
                self.reachy.antennas.happy()

        # Activity-specific assistance
        if activity == 'working':
            laptop_objects = [obj for obj in objects if obj['class_name'] == 'laptop']
            if laptop_objects and not hasattr(self, 'work_assistance_given'):
                self.reachy.voice.say("I see you're working. Let me know if you need a break reminder!")
                self.work_assistance_given = True

        elif activity == 'presenting':
            if not hasattr(self, 'presentation_mode'):
                self.reachy.voice.say("It looks like you're presenting. I'll be extra quiet.")
                self.presentation_mode = True

        elif activity == 'eating':
            if not hasattr(self, 'meal_noted'):
                self.reachy.voice.say("Enjoy your meal!")
                self.reachy.antennas.happy()
                self.meal_noted = True

    def adaptive_behavior(self, scene_data):
        """Adapt behavior based on scene understanding."""
        # Adjust interaction frequency based on activity
        if scene_data['activity'] == 'working':
            self.interaction_mode = 'passive'
        elif scene_data['activity'] == 'socializing':
            self.interaction_mode = 'active'
        elif 'open_palm' in scene_data['gestures']:
            self.interaction_mode = 'focused'

        # Adjust head movement patterns
        if self.interaction_mode == 'passive':
            # Subtle, non-distracting movements
            pass
        elif self.interaction_mode == 'active':
            # More expressive and engaging
            if scene_data['faces']:
                # Track faces more actively
                pass
        elif self.interaction_mode == 'focused':
            # Full attention and engagement
            self.reachy.antennas.curious()

    def run_companion_session(self, duration_minutes=10):
        """Run intelligent companion session."""
        print(f"Starting {duration_minutes}-minute companion session...")

        self.vision.start_camera_stream()
        start_time = time.time()
        end_time = start_time + (duration_minutes * 60)

        try:
            while time.time() < end_time:
                frame = self.vision.get_current_frame()
                if frame is not None:
                    # Analyze scene
                    scene_data = self.analyze_scene(frame)

                    # Provide assistance
                    self.provide_contextual_assistance(scene_data)

                    # Adapt behavior
                    self.adaptive_behavior(scene_data)

                    # Log interaction
                    self.interaction_history.append(scene_data)

                    # Keep only recent history
                    if len(self.interaction_history) > 100:
                        self.interaction_history.pop(0)

                time.sleep(1.0)  # Check every second
        finally:
            self.vision.stop_camera_stream()
            self.reachy.voice.say("Companion session complete. It was great spending time with you!")


# Demo: Run intelligent companion
def demo_intelligent_companion():
    """Demonstrate intelligent companion capabilities."""
    companion = IntelligentCompanion(vision, vision.reachy)

    # Run a 5-minute companion session
    companion.run_companion_session(duration_minutes=5)


# Uncomment to run the demo
# demo_intelligent_companion()

Performance Optimization and Best Practices

Computer vision applications can be resource-intensive. Here are key strategies for optimizing performance on your Reachy Mini:

🎯 Frame Rate Management

Adjust processing frequency based on application needs. Use 30fps for tracking, 5fps for recognition.

📏 Resolution Optimization

Use lower resolutions (320x240) for real-time tasks, higher (640x480) for detailed analysis.

🧵 Threading Strategy

Separate capture, processing, and response threads to maintain smooth operation (a minimal threading sketch follows the optimization example below).

🎨 Model Selection

Choose appropriate model sizes: YOLOv8n for speed, larger variants such as YOLOv8s or YOLOv8m when accuracy matters more than frame rate.

# Performance optimization example
class OptimizedVision:
    def __init__(self):
        # Use different processing rates for different tasks
        self.face_detection_interval = 0.1        # 10 FPS
        self.object_detection_interval = 0.2      # 5 FPS
        self.gesture_recognition_interval = 0.15  # ~7 FPS

        # Frame resolution optimization
        self.tracking_resolution = (320, 240)
        self.analysis_resolution = (640, 480)

        # Model optimization
        self.fast_face_detector = cv2.CascadeClassifier(
            cv2.data.haarcascades + 'haarcascade_frontalface_default.xml'
        )
        self.detailed_model = YOLO('yolov8n.pt')  # Nano for speed

    def optimize_frame(self, frame, task_type):
        """Optimize frame based on task requirements."""
        if task_type == 'tracking':
            return cv2.resize(frame, self.tracking_resolution)
        elif task_type == 'analysis':
            return cv2.resize(frame, self.analysis_resolution)
        return frame

    def batch_process(self, frames):
        """Process multiple frames in batch for efficiency."""
        # Batch processing can improve GPU utilization
        results = []
        for frame in frames:
            result = self.detailed_model(frame, verbose=False)
            results.append(result)
        return results
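
To make the threading strategy above concrete, here is a minimal sketch of a producer/consumer split: a capture thread keeps only the freshest frame in a one-slot queue while a slower analysis thread consumes frames at its own pace. The capture_frame and heavy_analysis functions below are placeholders standing in for vision.get_current_frame() and a model call such as YOLO inference, so the snippet runs on its own and is easy to adapt.

import queue
import threading
import time

def capture_frame():
    # Placeholder frame source; swap in vision.get_current_frame() on the robot
    return object()

def heavy_analysis(frame):
    # Placeholder for a slow model call (e.g. YOLO inference)
    time.sleep(0.2)
    return "result"

frame_queue = queue.Queue(maxsize=1)  # one-slot queue keeps only the freshest frame
stop_event = threading.Event()

def capture_loop():
    while not stop_event.is_set():
        frame = capture_frame()
        if frame is not None:
            try:
                frame_queue.put_nowait(frame)
            except queue.Full:
                # Drop the stale frame instead of letting the queue back up
                try:
                    frame_queue.get_nowait()
                except queue.Empty:
                    pass
                frame_queue.put_nowait(frame)
        time.sleep(1 / 30)  # ~30 FPS capture

def processing_loop():
    while not stop_event.is_set():
        try:
            frame = frame_queue.get(timeout=0.5)
        except queue.Empty:
            continue
        result = heavy_analysis(frame)  # runs at its own, slower pace
        print("processed:", result)

threads = [threading.Thread(target=capture_loop, daemon=True),
           threading.Thread(target=processing_loop, daemon=True)]
for t in threads:
    t.start()
time.sleep(2)
stop_event.set()
for t in threads:
    t.join()

Because the analysis thread always reads the newest frame, a slow model never forces the robot to react to stale images.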

Troubleshooting Common Issues

Common Issues and Solutions:
  • Low frame rate: Reduce resolution or processing frequency
  • False positives: Adjust confidence thresholds and add temporal filtering (see the sketch after this list)
  • Poor lighting performance: Implement automatic exposure adjustment
  • Memory issues: Implement proper frame buffer management
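
For the false-positive issue in particular, a simple remedy beyond raising confidence thresholds is temporal filtering: only accept a label if it persists across several recent frames. The sketch below is illustrative (the TemporalFilter class and its parameters are assumptions, not part of any SDK) and could be placed in front of the detection results from the earlier examples.

from collections import deque

class TemporalFilter:
    """Accept a label only if it appears in enough of the last N frames."""
    def __init__(self, window=5, min_hits=3):
        self.window = window
        self.min_hits = min_hits
        self.history = deque(maxlen=window)

    def update(self, labels_in_frame):
        # labels_in_frame: set of class names detected in the current frame
        self.history.append(set(labels_in_frame))
        counts = {}
        for frame_labels in self.history:
            for label in frame_labels:
                counts[label] = counts.get(label, 0) + 1
        # Keep only labels seen in at least min_hits of the recent frames
        return {label for label, n in counts.items() if n >= self.min_hits}

# Usage: stable = filt.update({obj['class_name'] for obj in detected_objects})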

Future Possibilities and Extensions

The computer vision capabilities we've explored are just the beginning. As new vision models and techniques appear on Hugging Face, you can drop them into the same pipeline you've built here and keep extending what your robot can see and understand.

Conclusion

Computer vision transforms your Reachy Mini from a simple robot into an intelligent companion capable of understanding and interacting with the visual world. From basic object detection to sophisticated gesture recognition and scene understanding, these capabilities open up endless possibilities for creative applications.

The key to successful computer vision applications is starting simple and gradually adding complexity. Begin with basic face tracking, then add object detection, and finally integrate gesture recognition to create rich, multi-modal interactions.

Keep Exploring! The computer vision field is rapidly evolving, with new models and techniques constantly emerging. Stay connected with the Hugging Face community to discover the latest breakthroughs and share your own innovations with fellow Reachy Mini developers.

Remember that the most compelling robotic applications often combine multiple modalities – vision, audio, and movement working together to create natural, intuitive interactions. Your Reachy Mini is the perfect platform for exploring these exciting possibilities!