Robotics & AI Complete Guide: From ROS2 to Humanoid Robots

1. Robotics Fundamentals

Degrees of Freedom (DOF)

A robot's degrees of freedom are the number of independent axes along which it can move. A 6-DOF robot arm has the minimum number of degrees needed to reach any arbitrary position and orientation in 3D space. A full-body humanoid robot, fingers included, can have 50 or more DOF.

| DOF | Expressible Motion | Examples |
| --- | --- | --- |
| 1-DOF | Single-axis rotation/translation | Conveyor belt |
| 3-DOF | 3D position only | Delta robot |
| 6-DOF | Full position + orientation | Industrial robot arm |
| 7-DOF | Full pose + redundancy | KUKA iiwa, Franka Panda |

Coordinate Transforms: Homogeneous Transformation Matrix

To represent the position and orientation of a rigid body in 3D space, robotics uses the Homogeneous Transformation Matrix — a 4x4 matrix that encodes both rotation (R) and translation (t) simultaneously.

$$T = \begin{bmatrix} R_{3\times3} & t_{3\times1} \\ 0_{1\times3} & 1 \end{bmatrix}$$

Transformations between multiple coordinate frames are chained via matrix multiplication:

$$T_{base \to end} = T_{base \to 1} \cdot T_{1 \to 2} \cdots T_{n-1 \to end}$$
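As a quick sketch, chaining two frames with NumPy (the frame names and values here are illustrative):

```python
import numpy as np

def make_transform(R, t):
    """Assemble a 4x4 homogeneous transform from rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Frame 1 is rotated 90 degrees about the base z-axis; the end frame
# sits 1 m along frame 1's x-axis.
Rz = np.array([[0, -1, 0],
               [1,  0, 0],
               [0,  0, 1]])
T_base_1 = make_transform(Rz, np.array([0.0, 0.0, 0.0]))
T_1_end = make_transform(np.eye(3), np.array([1.0, 0.0, 0.0]))

# Chaining via matrix multiplication, exactly as in the equation above
T_base_end = T_base_1 @ T_1_end
print(T_base_end[:3, 3])  # the 1 m x-offset ends up along the base y-axis
```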

Forward Kinematics

Forward kinematics computes the position and orientation of the end-effector given the joint angles. Each joint's transformation matrix is multiplied in sequence.

import numpy as np

def dh_transform(a, d, alpha, theta):
    """Compute a Denavit-Hartenberg transformation matrix."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct,    -st*ca,  st*sa,  a*ct],
        [st,     ct*ca, -ct*sa,  a*st],
        [0,      sa,     ca,     d   ],
        [0,      0,      0,      1   ]
    ])

def forward_kinematics(joint_angles, dh_params):
    """
    FK: joint angles -> end-effector pose
    dh_params: [(a, d, alpha, theta_offset), ...]
    """
    T = np.eye(4)
    for i, theta in enumerate(joint_angles):
        a, d, alpha, theta_off = dh_params[i]
        T = T @ dh_transform(a, d, alpha, theta + theta_off)
    return T

# 2-DOF planar arm example (link lengths 1.0m each)
dh_params_2dof = [
    (1.0, 0, 0, 0),
    (1.0, 0, 0, 0),
]
joint_angles = [np.pi/4, np.pi/4]
T_end = forward_kinematics(joint_angles, dh_params_2dof)
print("End-effector position:", T_end[:3, 3])

Inverse Kinematics

Inverse kinematics computes the required joint angles given a desired end-effector pose. There are analytical approaches (closed-form IK) and numerical approaches (Jacobian-based iterative IK).

The core of numerical IK is iterative computation using the Jacobian matrix $J$:

$$\Delta\theta = J^{+} \cdot \Delta x$$

where $J^{+}$ is the pseudoinverse of the Jacobian. Near singularities, the Jacobian becomes ill-conditioned and the numerical solution can become unstable.
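A minimal numerical IK sketch for a 2-link planar arm, using the damped least-squares variant of the pseudoinverse update (which stays better behaved near singularities); link lengths, damping, and target are illustrative:

```python
import numpy as np

def fk_2link(theta, l1=1.0, l2=1.0):
    """End-effector (x, y) of a planar 2-link arm."""
    x = l1 * np.cos(theta[0]) + l2 * np.cos(theta[0] + theta[1])
    y = l1 * np.sin(theta[0]) + l2 * np.sin(theta[0] + theta[1])
    return np.array([x, y])

def jacobian_2link(theta, l1=1.0, l2=1.0):
    """Analytical 2x2 Jacobian of the planar arm."""
    s1, c1 = np.sin(theta[0]), np.cos(theta[0])
    s12, c12 = np.sin(theta[0] + theta[1]), np.cos(theta[0] + theta[1])
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [ l1 * c1 + l2 * c12,  l2 * c12]])

def ik_dls(target, theta0, damping=0.05, iters=200):
    """Damped least-squares IK: dtheta = J^T (J J^T + lambda^2 I)^-1 dx."""
    theta = np.array(theta0, dtype=float)
    for _ in range(iters):
        dx = target - fk_2link(theta)
        if np.linalg.norm(dx) < 1e-6:
            break
        J = jacobian_2link(theta)
        JJt = J @ J.T + damping**2 * np.eye(2)
        theta += J.T @ np.linalg.solve(JJt, dx)
    return theta

theta = ik_dls(np.array([1.2, 0.8]), [0.3, 0.3])
print(fk_2link(theta))  # converges close to the target (1.2, 0.8)
```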

Denavit-Hartenberg Parameters

The DH convention provides a systematic method to assign coordinate frames to each link of a robot. Each joint is described by four parameters: link length $a$, link twist $\alpha$, link offset $d$, and joint angle $\theta$. This compact representation makes it straightforward to implement FK for any serial manipulator.


2. ROS2 (Robot Operating System 2)

ROS1 vs ROS2: Key Differences

ROS2 was redesigned from scratch to overcome ROS1's limitations, adopting DDS (Data Distribution Service) as its communication middleware.

| Feature | ROS1 | ROS2 |
| --- | --- | --- |
| Communication | Custom TCP/UDP (TCPROS/UDPROS) | DDS (OMG standard) |
| Real-time support | Poor | Supported (real-time-capable rclcpp) |
| Python version | Python 2/3 | Python 3 only |
| Security | None | SROS2 (DDS Security) |
| Multi-robot | Difficult | Easy with namespaces |
| Lifecycle nodes | None | LifecycleNode supported |
| Master node | roscore required | Decentralized, no roscore |

Core Concepts: Nodes, Topics, Services, Actions

  • Node: The fundamental unit of computation. A single process can host one or many nodes.
  • Topic: Asynchronous publish/subscribe communication (fire-and-forget).
  • Service: Synchronous request/response communication (blocking call).
  • Action: Asynchronous communication for long-running tasks, with feedback and preemption support.

ROS2 Python Publisher/Subscriber Example

import rclpy
from rclpy.node import Node
from std_msgs.msg import String

class TalkerNode(Node):
    def __init__(self):
        super().__init__('talker')
        self.publisher = self.create_publisher(String, 'chatter', 10)
        self.timer = self.create_timer(0.5, self.publish_msg)
        self.count = 0

    def publish_msg(self):
        msg = String()
        msg.data = f'Hello ROS2: {self.count}'
        self.publisher.publish(msg)
        self.get_logger().info(f'Published: {msg.data}')
        self.count += 1

class ListenerNode(Node):
    def __init__(self):
        super().__init__('listener')
        self.subscription = self.create_subscription(
            String, 'chatter', self.callback, 10
        )

    def callback(self, msg):
        self.get_logger().info(f'Received: {msg.data}')

def main():
    from rclpy.executors import SingleThreadedExecutor
    rclpy.init()
    talker = TalkerNode()
    listener = ListenerNode()
    # Spin both nodes in one process with an executor
    executor = SingleThreadedExecutor()
    executor.add_node(talker)
    executor.add_node(listener)
    try:
        executor.spin()
    finally:
        rclpy.shutdown()

Launch File Example

# launch/my_robot.launch.py
from launch import LaunchDescription
from launch_ros.actions import Node

def generate_launch_description():
    return LaunchDescription([
        Node(
            package='my_robot',
            executable='talker',
            name='talker_node',
            output='screen',
            parameters=[{'use_sim_time': False}],
        ),
        Node(
            package='my_robot',
            executable='listener',
            name='listener_node',
            output='screen',
        ),
    ])

Run with: ros2 launch my_robot my_robot.launch.py


3. Sensor Processing

LiDAR: Point Cloud Processing

LiDAR (Light Detection And Ranging) generates 3D point clouds using laser pulses. It is essential for autonomous navigation and 3D mapping.

import open3d as o3d
import numpy as np

def process_pointcloud(pcd_path):
    # Load point cloud
    pcd = o3d.io.read_point_cloud(pcd_path)

    # Statistical outlier removal
    pcd_clean, _ = pcd.remove_statistical_outlier(
        nb_neighbors=20, std_ratio=2.0
    )

    # Voxel downsampling (reduce resolution)
    pcd_down = pcd_clean.voxel_down_sample(voxel_size=0.05)

    # Normal estimation
    pcd_down.estimate_normals(
        search_param=o3d.geometry.KDTreeSearchParamHybrid(
            radius=0.1, max_nn=30
        )
    )

    # RANSAC plane segmentation (ground removal)
    plane_model, inliers = pcd_down.segment_plane(
        distance_threshold=0.01,
        ransac_n=3,
        num_iterations=1000
    )
    [a, b, c, d] = plane_model
    print(f"Ground plane: {a:.2f}x + {b:.2f}y + {c:.2f}z + {d:.2f} = 0")

    # Extract non-ground objects
    objects_pcd = pcd_down.select_by_index(inliers, invert=True)
    return objects_pcd

IMU and the Kalman Filter

An IMU (Inertial Measurement Unit) contains accelerometers and gyroscopes to estimate orientation and velocity. The Kalman Filter reduces sensor noise and provides accurate state estimation.

Two stages of the Kalman Filter:

  1. Predict: Estimate the current state using the motion model
  2. Update: Correct the prediction using sensor measurements

$$\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k\,(z_k - H\hat{x}_{k|k-1})$$

where $K_k$ is the Kalman gain and $z_k$ is the measurement. The EKF (Extended Kalman Filter) handles nonlinear systems via first-order Taylor linearization.
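The two stages can be sketched for the simplest case, a scalar constant-value system with H = 1 (the noise values are illustrative):

```python
import numpy as np

def kalman_1d(measurements, q=1e-4, r=0.1):
    """Scalar Kalman filter: predict with a constant model, update with z_k."""
    x, p = 0.0, 1.0          # state estimate and its variance
    estimates = []
    for z in measurements:
        # Predict: constant-value model, process noise q inflates uncertainty
        p = p + q
        # Update: K_k = p / (p + r), then correct with the innovation (z - x)
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1 - k) * p
        estimates.append(x)
    return np.array(estimates)

rng = np.random.default_rng(0)
noisy = 1.0 + rng.normal(0, 0.3, size=200)   # noisy readings of a true value 1.0
est = kalman_1d(noisy)
print(f"last estimate: {est[-1]:.3f}")       # settles close to 1.0
```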

Camera: RGB-D and Stereo Vision

RGB-D cameras (e.g., Intel RealSense, Microsoft Azure Kinect) provide color plus depth per pixel. Stereo cameras compute depth via triangulation. Both are used for obstacle detection, semantic mapping, and grasping.
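For rectified stereo pairs, the triangulation reduces to Z = f·B/d, where f is the focal length in pixels, B the baseline, and d the per-pixel disparity. A minimal sketch with illustrative values:

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m):
    """Depth map from a rectified stereo disparity map: Z = f * B / d."""
    disparity = np.asarray(disparity, dtype=float)
    depth = np.full_like(disparity, np.inf)   # zero disparity -> no depth
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# 640 px focal length, 12 cm baseline (typical of small stereo rigs)
disp = np.array([[64.0, 32.0],
                 [ 0.0, 16.0]])   # disparities in pixels; 0 = no match
depth = disparity_to_depth(disp, focal_px=640.0, baseline_m=0.12)
print(depth)   # larger disparity = closer object
```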


4. SLAM (Simultaneous Localization and Mapping)

What is SLAM?

SLAM is the problem of building a map of an unknown environment while simultaneously estimating the robot's location within it. It is a foundational challenge in robotics navigation.

Key SLAM challenges:

  • Data Association: Matching current observations to known landmarks
  • Loop Closure: Recognizing a previously visited place to correct accumulated drift
  • Map Representation: Landmark maps, occupancy grids, topological maps, etc.

EKF-SLAM

EKF-SLAM uses the Extended Kalman Filter, linearizing the nonlinear robot motion and observation models via first-order Taylor expansion.

State vector: robot pose + all landmark positions

$$x = [x_r,\; y_r,\; \theta_r,\; x_{l1},\; y_{l1},\; x_{l2},\; y_{l2},\; \ldots]^T$$

Computational complexity is $O(n^2)$ in the number of landmarks $n$, making it unsuitable for large-scale environments.

Graph-Based SLAM

Modern SLAM systems typically use a graph-based formulation. Poses are nodes in a graph, and sensor constraints are edges. Loop closure adds edges between distant nodes, and the entire graph is optimized using nonlinear least squares (e.g., g2o, GTSAM).
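A toy illustration of the idea, assuming a 1D robot with drifting odometry and a single loop-closure edge, solved as a linear least-squares problem (real systems use g2o or GTSAM on nonlinear 2D/3D poses):

```python
import numpy as np

# Odometry edges (pose i -> i+1) with drift, plus one loop-closure edge 4 -> 0
odom = [1.1, 1.0, 0.9, 1.05]   # measured 1D displacements (sum = 4.05)
loop_z = -4.0                  # loop closure: pose0 - pose4 measured as -4.0

# Linear system A x = b over unknowns x = [pose1..pose4]; pose 0 is fixed at 0
A = np.zeros((5, 4))
b = np.zeros(5)
for i, z in enumerate(odom):
    if i > 0:
        A[i, i - 1] = -1.0
    A[i, i] = 1.0
    b[i] = z
A[4, 3] = -1.0                 # loop edge: pose0 - pose4 = loop_z
b[4] = loop_z

x, *_ = np.linalg.lstsq(A, b, rcond=None)
poses = np.concatenate([[0.0], x])
print(np.round(poses, 3))      # the 0.05 m drift is spread across all edges
```

With equal edge weights the optimizer distributes the accumulated 0.05 m of drift evenly over the five edges, pulling the final pose from 4.05 to 4.01.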

Real-World SLAM Libraries

| Library | Method | Sensors | Notes |
| --- | --- | --- | --- |
| Cartographer | Graph SLAM | LiDAR, IMU | Developed by Google, ROS2 support |
| ORB-SLAM3 | Feature-based | Camera, IMU | Real-time mono/stereo/RGB-D |
| LIO-SAM | Tightly-coupled | LiDAR + IMU | Large outdoor environments |
| RTAB-Map | Graph SLAM | RGB-D, LiDAR | Real-time 3D mapping |

# Install Cartographer for ROS2 Humble
sudo apt install ros-humble-cartographer-ros

# Run Cartographer with TurtleBot3
ros2 launch turtlebot3_cartographer cartographer.launch.py

5. Motion Planning

Planning Algorithm Overview

Motion planning finds a collision-free path from a start configuration to a goal in the robot's configuration space (C-space).

  • Grid-based: A*, Dijkstra — optimal on discrete grids, but scale poorly to high DOF
  • Sampling-based: RRT, PRM — efficient in high-dimensional C-spaces
  • Optimization-based: CHOMP, STOMP — optimize a cost functional over a trajectory
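For instance, grid-based A* expands cells in order of g + h, where h is an admissible heuristic; a compact sketch on a 4-connected occupancy grid (the grid layout is illustrative):

```python
import heapq

def astar(grid, start, goal):
    """A* on a 4-connected grid; grid[r][c] == 1 marks an obstacle."""
    rows, cols = len(grid), len(grid[0])

    def h(p):  # Manhattan distance: admissible on a 4-connected grid
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    open_set = [(h(start), 0, start, [start])]   # (f, g, node, path)
    visited = set()
    while open_set:
        f, g, node, path = heapq.heappop(open_set)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        r, c = node
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                heapq.heappush(open_set, (g + 1 + h((nr, nc)), g + 1,
                                          (nr, nc), path + [(nr, nc)]))
    return None  # goal unreachable

grid = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0],
        [0, 1, 1, 0]]
path = astar(grid, (0, 0), (3, 3))
print(path)  # shortest path threading the gap at (1, 2)
```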

RRT* Implementation

RRT* (Rapidly-exploring Random Tree Star) provides asymptotic optimality guarantees on top of RRT.

import numpy as np

def rrt_star(start, goal, obstacles, max_iter=1000, step=0.5, goal_radius=0.5):
    """
    Simple 2D RRT* implementation.
    start, goal: np.array([x, y])
    obstacles: [(cx, cy, radius), ...]
    """
    nodes = [start]
    parents = [-1]
    costs = [0.0]

    def is_collision(p1, p2):
        """Check if segment p1-p2 intersects any obstacle."""
        for (cx, cy, r) in obstacles:
            center = np.array([cx, cy])
            d = p2 - p1
            t = np.clip(np.dot(center - p1, d) / (np.dot(d, d) + 1e-9), 0, 1)
            closest = p1 + t * d
            if np.linalg.norm(closest - center) < r:
                return True
        return False

    for _ in range(max_iter):
        rand = goal if np.random.random() < 0.05 else np.random.uniform(-10, 10, 2)

        dists = [np.linalg.norm(np.array(n) - rand) for n in nodes]
        nearest_idx = int(np.argmin(dists))
        nearest = np.array(nodes[nearest_idx])

        direction = rand - nearest
        norm = np.linalg.norm(direction)
        if norm < 1e-9:
            continue
        new_node = nearest + step * direction / norm

        if is_collision(nearest, new_node):
            continue

        nodes.append(new_node)
        parents.append(nearest_idx)
        costs.append(costs[nearest_idx] + step)

        if np.linalg.norm(new_node - goal) < goal_radius:
            print(f"Goal reached! Total nodes: {len(nodes)}")
            return nodes, parents

    return nodes, parents

# Example usage
start = np.array([0.0, 0.0])
goal = np.array([9.0, 9.0])
obstacles = [(5.0, 5.0, 1.5), (3.0, 7.0, 1.0)]
nodes, parents = rrt_star(start, goal, obstacles)
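The function returns the tree but not a path; walking the parents array back from the last-added node recovers one (this assumes the run ended via the goal branch, so the last node is the one that reached the goal):

```python
def extract_path(nodes, parents):
    """Backtrack from the last-added node to the root via the parents array."""
    path = []
    idx = len(nodes) - 1
    while idx != -1:          # the root's parent is -1
        path.append(nodes[idx])
        idx = parents[idx]
    return path[::-1]         # reverse to root -> goal order

# Tiny hand-built tree: root -> child -> grandchild
nodes = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)]
parents = [-1, 0, 1]
print(extract_path(nodes, parents))  # [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)]
```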

MoveIt2 for Robot Arm Planning

MoveIt2 is the standard framework for robot arm motion planning in ROS2. It wraps multiple planning libraries (OMPL, STOMP, CHOMP) under a unified API.

# MoveIt2 Python API example
import rclpy
from moveit.planning import MoveItPy

rclpy.init()
moveit = MoveItPy(node_name='moveit_py_planning')
panda_arm = moveit.get_planning_component('panda_arm')

# Set goal pose
from geometry_msgs.msg import PoseStamped
goal_pose = PoseStamped()
goal_pose.header.frame_id = 'panda_link0'
goal_pose.pose.position.x = 0.5
goal_pose.pose.position.y = 0.0
goal_pose.pose.position.z = 0.5
goal_pose.pose.orientation.w = 1.0

panda_arm.set_goal_state(pose_stamped_msg=goal_pose, pose_link='panda_hand')

# Plan and execute
plan_result = panda_arm.plan()
if plan_result:
    moveit.execute(plan_result.trajectory)

6. Computer Vision for Robotics

Object Detection with YOLO

Real-time object detection is essential for robots operating in dynamic environments. The YOLO (You Only Look Once) family remains the go-to choice for edge deployment.

from ultralytics import YOLO
import cv2

# Load YOLOv11 nano
model = YOLO('yolo11n.pt')

def detect_objects(frame):
    """Detect objects in a camera frame."""
    results = model(frame, conf=0.5)
    annotated = results[0].plot()

    for box in results[0].boxes:
        cls_id = int(box.cls[0])
        label = model.names[cls_id]
        conf = float(box.conf[0])
        x1, y1, x2, y2 = map(int, box.xyxy[0])
        print(f"{label}: {conf:.2f} at ({x1},{y1})-({x2},{y2})")

    return annotated

# Live webcam detection
cap = cv2.VideoCapture(0)
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    annotated_frame = detect_objects(frame)
    cv2.imshow('Robot Vision', annotated_frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()

6D Pose Estimation

To grasp an object, a robot needs to estimate its 6D pose — 3D position and 3D orientation. The PnP (Perspective-n-Point) algorithm solves this given 2D-3D point correspondences.

import cv2
import numpy as np

def estimate_pose(object_points, image_points, camera_matrix, dist_coeffs):
    """
    6D pose estimation using PnP.
    object_points: 3D points in object coordinate frame
    image_points: Corresponding 2D image coordinates
    """
    success, rvec, tvec = cv2.solvePnP(
        object_points, image_points,
        camera_matrix, dist_coeffs,
        flags=cv2.SOLVEPNP_ITERATIVE
    )
    if success:
        R, _ = cv2.Rodrigues(rvec)
        print(f"Position: {tvec.flatten()}")
        print(f"Rotation matrix:\n{R}")
        return R, tvec
    return None, None

7. Reinforcement Learning for Robot Control

Simulation Environments

| Simulator | Developer | Strengths | Use Case |
| --- | --- | --- | --- |
| MuJoCo | Google DeepMind | Accurate physics, fast | Continuous control research |
| PyBullet | Erwin Coumans | Open-source, Python-friendly | Prototyping |
| Isaac Gym | NVIDIA | Massively parallel GPU sim | Large-scale RL |
| CARLA | CVC, with Intel and Toyota | AD-specific sensors | Autonomous driving |

SAC for Continuous Control

SAC (Soft Actor-Critic) is an off-policy RL algorithm that incorporates entropy regularization, balancing exploration and exploitation automatically.

import gymnasium as gym
from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import EvalCallback

# Train SAC on MuJoCo Ant
env = gym.make('Ant-v4', render_mode=None)
eval_env = gym.make('Ant-v4', render_mode=None)

eval_callback = EvalCallback(
    eval_env,
    best_model_save_path='./logs/best_model',
    log_path='./logs/',
    eval_freq=10000,
    deterministic=True,
    render=False
)

model = SAC(
    'MlpPolicy',
    env,
    verbose=1,
    learning_rate=3e-4,
    buffer_size=1_000_000,
    batch_size=256,
    ent_coef='auto',  # auto-tune entropy coefficient
    gamma=0.99,
    tau=0.005,
    tensorboard_log='./tb_logs/'
)

model.learn(total_timesteps=1_000_000, callback=eval_callback)
model.save('ant_sac_final')

# Inference / deployment
obs, _ = env.reset()
total_reward = 0.0
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, truncated, info = env.step(action)
    total_reward += reward
    if done or truncated:
        print(f"Episode ended. Total reward: {total_reward:.2f}")
        obs, _ = env.reset()
        total_reward = 0.0

Sim-to-Real Transfer

Bridging the Reality Gap between simulation and the real world is one of the central challenges in robot learning.

  1. Domain Randomization: Randomly vary physical parameters (friction, mass, joint stiffness) in simulation so the policy learns to cope with a wide distribution of environments.
  2. Domain Adaptation: Fine-tune the policy using a small amount of real-world data.
  3. System Identification: Precisely measure physical parameters of the real robot and incorporate them in the simulator.
  4. Randomized Rendering: For vision-based policies, randomize textures, lighting, and camera properties.
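Item 1 can be sketched as a toy simulator that resamples its physical parameters on every episode reset; the parameter names and ranges here are illustrative, not a real simulator API:

```python
import random

class RandomizedSim:
    """Toy simulator whose physics parameters are resampled every episode."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        # Domain randomization: draw physics from broad ranges each episode,
        # so a policy trained here never overfits to one parameter setting
        self.friction = self.rng.uniform(0.5, 1.5)
        self.mass = self.rng.uniform(0.8, 1.2)
        self.motor_delay = self.rng.randint(0, 3)   # timesteps of actuation lag
        return self.observe()

    def observe(self):
        return (self.friction, self.mass, self.motor_delay)

sim = RandomizedSim(seed=42)
params = [sim.reset() for _ in range(3)]
for p in params:
    print(p)   # different physics every episode
```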

8. Autonomous Driving

SAE Autonomy Levels

| Level | Name | Description | Examples |
| --- | --- | --- | --- |
| L0 | No Automation | Driver fully in control | Standard vehicles |
| L1 | Driver Assistance | Single feature assist | ACC, LKAS |
| L2 | Partial Automation | Combined features | Tesla Autopilot |
| L3 | Conditional Automation | Self-driving in specific conditions | Mercedes Drive Pilot |
| L4 | High Automation | Full autonomy in geofenced areas | Waymo One |
| L5 | Full Automation | All conditions, no driver | Not yet realized |

Perception-Prediction-Planning-Control Pipeline

class AutonomousDrivingSystem:
    """Standard autonomous driving pipeline."""

    def __init__(self, perception, prediction, planning, control):
        self.perception = perception
        self.prediction = prediction
        self.planning = planning
        self.control = control
        self.goal = None

    def run_step(self, sensor_data):
        # 1. Perception: detect and track objects, localize ego vehicle
        detected_objects = self.perception.detect(sensor_data)
        ego_state = self.perception.localize(sensor_data)

        # 2. Prediction: forecast future trajectories of other agents
        predicted_trajectories = self.prediction.predict(
            detected_objects, horizon=3.0  # 3-second prediction horizon
        )

        # 3. Planning: generate a safe, comfortable trajectory to goal
        trajectory = self.planning.plan(
            ego_state, predicted_trajectories, goal=self.goal
        )

        # 4. Control: track the planned trajectory
        steering, throttle, brake = self.control.compute(
            ego_state, trajectory
        )

        return steering, throttle, brake

Sensor Fusion: LiDAR + Camera

Modern autonomous vehicles fuse LiDAR point clouds and camera images. LiDAR provides accurate depth; cameras provide rich semantic information. Common approaches include:

  • Early fusion: Project LiDAR points onto image, create a unified representation
  • Late fusion: Run separate detectors, combine outputs with Kalman filtering
  • Deep fusion: End-to-end networks that take both modalities as input
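Early fusion boils down to projecting each LiDAR point through the camera extrinsics and intrinsics into pixel coordinates; the transform and intrinsic values below are illustrative:

```python
import numpy as np

def project_lidar_to_image(points_lidar, T_cam_lidar, K):
    """Project Nx3 LiDAR points into pixel coordinates.

    T_cam_lidar: 4x4 extrinsic transform (LiDAR frame -> camera frame)
    K: 3x3 camera intrinsic matrix
    """
    n = points_lidar.shape[0]
    pts_h = np.hstack([points_lidar, np.ones((n, 1))])   # homogeneous coords
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]           # into the camera frame
    in_front = pts_cam[:, 2] > 0                         # drop points behind camera
    pts_cam = pts_cam[in_front]
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                          # perspective divide
    return uv, pts_cam[:, 2]                             # pixels + depths

# Identity extrinsics and a simple pinhole intrinsic matrix (illustrative)
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
points = np.array([[0.0, 0.0, 10.0],    # straight ahead, 10 m
                   [1.0, 0.0, 10.0],    # 1 m to the right
                   [0.0, 0.0, -5.0]])   # behind the camera: filtered out
uv, depths = project_lidar_to_image(points, np.eye(4), K)
print(uv)   # pixel coordinates of the two in-view points
```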

9. Humanoid Robots in 2026

The humanoid robot market is growing rapidly in 2026. Here is a snapshot of the major players.

Humanoid Robot Comparison

| Robot | Company | Height / Weight | Key Features |
| --- | --- | --- | --- |
| Atlas (Electric) | Boston Dynamics | 1.5 m / 89 kg | Electric actuators, advanced mobility |
| Optimus Gen 3 | Tesla | 1.73 m / 57 kg | 22-DOF hands, FSD AI stack |
| Figure 02 | Figure AI | 1.7 m / 60 kg | OpenAI collaboration, voice understanding |
| Unitree H1 / G1 | Unitree Robotics | 1.8 m / 1.3 m (H1 / G1) | Cost-effective, open-source friendly |
| 1X NEO Beta | 1X Technologies | 1.65 m / 30 kg | Lightweight, soft actuators |

Foundation Models for Robotics

RT-2 (Robotics Transformer 2), developed by Google DeepMind, is a vision-language-action (VLA) model that directly applies knowledge learned from web-scale data to robot control. It understands natural language commands and generalizes to novel objects.

OpenVLA is an open-source VLA model that researchers can fine-tune on their own robot data.

# OpenVLA inference example (conceptual)
from transformers import AutoModelForVision2Seq, AutoProcessor
import torch

model_name = "openvla/openvla-7b"
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map='auto',
    trust_remote_code=True
)

# Generate robot action from vision + language
image = get_camera_image()   # robot wrist camera image
instruction = "pick up the red block and place it in the box"

inputs = processor(instruction, image).to('cuda:0', dtype=torch.bfloat16)
action = model.predict_action(**inputs, unnorm_key='bridge_orig', do_sample=False)
# action: [dx, dy, dz, droll, dpitch, dyaw, gripper]
robot.execute_action(action)

Embedded AI and On-Device Inference

Humanoid robots must make decisions in real time without relying on the cloud. Key hardware platforms include NVIDIA Jetson AGX Orin (275 TOPS) and Qualcomm Robotics RB5.

Key techniques for efficient on-device inference:

  • Quantization: INT8 / FP16 reduces model size and latency
  • Pruning: Remove redundant weights to reduce computation
  • TensorRT / ONNX Runtime: Hardware-optimized inference engines
  • Distillation: Train a smaller student model from a large teacher
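The arithmetic behind INT8 quantization is a simple affine mapping between floats and 8-bit integers; a minimal per-tensor quantize/dequantize sketch of the scale/zero-point scheme (the weight values are random placeholders):

```python
import numpy as np

def quantize_int8(x):
    """Affine per-tensor quantization: x ~= scale * (q - zero_point)."""
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0                 # one step of the int8 range
    zero_point = int(round(-x_min / scale)) - 128   # maps x_min near -128
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate floats from the int8 representation."""
    return scale * (q.astype(np.float32) - zero_point)

w = np.random.default_rng(0).normal(0, 0.5, size=(4, 4)).astype(np.float32)
q, scale, zp = quantize_int8(w)
err = np.abs(dequantize(q, scale, zp) - w).max()
print(f"max abs quantization error: {err:.4f}")  # on the order of one step
```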

10. Quiz: Robotics & AI Core Concepts

Q1. Why is Loop Closure critical in SLAM?

Answer: It allows the robot to recognize a previously visited place and correct the accumulated pose estimation error.

Explanation: During extended navigation, small sensor and odometry errors accumulate over time, causing the robot's estimated trajectory to drift significantly from reality. When a loop closure is detected (e.g., through feature matching or place recognition), a graph optimization step (using tools like g2o or GTSAM) redistributes this accumulated error across all past poses, dramatically improving map consistency.

Q2. What are the primary advantages of ROS2 using DDS over ROS1?

Answer: Standardized middleware, real-time support, flexible multi-robot networking, no single point of failure (no roscore), and built-in security (SROS2).

Explanation: ROS1 depended on a single roscore process, which was a single point of failure. ROS2's DDS-based architecture is fully decentralized — nodes discover each other automatically without a broker. QoS (Quality of Service) policies allow fine-grained control over communication reliability, latency, and history. SROS2 adds TLS-based encryption and authentication.

Q3. What role does the entropy term play in the SAC algorithm?

Answer: It encourages the policy to remain stochastic (exploratory) by including entropy as part of the reward signal, preventing premature convergence to suboptimal deterministic behaviors.

Explanation: SAC's objective is to maximize the sum of expected return and policy entropy. A higher entropy coefficient (alpha) encourages more exploration. When set to 'auto', SAC adjusts alpha dynamically to maintain a target entropy level. This mechanism helps escape local optima and produces more robust policies that generalize better to unseen situations.

Q4. Why is inverse kinematics fundamentally harder than forward kinematics?

Answer: IK is a nonlinear system of equations that may have no solution, a unique solution, or infinitely many solutions, whereas FK always produces a unique result.

Explanation: FK is a straightforward composition of transformation matrices — given joint angles, there is exactly one end-effector pose. IK inverts this mapping: for a 7-DOF redundant manipulator, infinitely many joint configurations reach the same end-effector pose (kinematic redundancy). At workspace boundaries, no solution may exist. Near singularities, the Jacobian becomes rank-deficient, causing numerical instability in iterative solvers.

Q5. Why does Domain Randomization work for Sim-to-Real transfer?

Answer: By training across a wide distribution of simulated physical parameters, the policy learns to be robust to the variability found in the real world.

Explanation: A policy trained with fixed simulation parameters overfits to those specific conditions and fails when physical properties in reality differ (different friction, actuator delays, sensor noise). Domain Randomization treats the real world as just another sample from the training distribution. OpenAI's Dexterous Hand and NVIDIA's quadruped locomotion work are landmark demonstrations of this approach, achieving remarkable real-world performance purely through massively randomized simulation.

