Visual Odometry
RGB-D visual odometry pipeline with step-by-step implementation details
Overview
This project implements RGB-D visual odometry from a robot camera, grounded in a clear camera projection model. A 3D world point $O_w = [x, y, z, 1]^T$ is first mapped into the camera frame using the extrinsics $(R, t)$, then projected onto the image plane with intrinsics $K$:
\[O_c = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix} O_w,\quad \tilde{o} = K \, [I \; 0] \, O_c,\quad (u, v) = \left(\frac{\tilde{x}}{\tilde{z}}, \frac{\tilde{y}}{\tilde{z}}\right)\]This pinhole model lets us project world points into pixels and back-project pixels into 3D using depth. The pipeline then tracks features across RGB frames, lifts them to 3D, estimates camera motion with $R,t$, and integrates pose over time to recover the trajectory. Along the way, we align image coordinates with pixel coordinates via scaling and principal point offsets in $K$, consistent with how 2D images are stored as $M \times N \times 3$ RGB arrays in software.
Inputs And Assumptions
- Synchronized RGB and depth frames from the same camera.
- Known camera intrinsics
Kwith(fx, fy, cx, cy). - Depth is in meters (or converted to meters), with invalid depth masked out.
- Consecutive frames have sufficient overlap and small inter-frame motion.
What “Known Intrinsics K” Means
Camera intrinsics are the calibration parameters that map 3D camera coordinates to 2D pixel coordinates. They are typically fixed for a given camera once calibrated. model the cameras projective geometry through the coordinate system transformation. These transformations can be used to project points from the world frame to the image frame, using the pinhole camera model .
The intrinsic matrix is:
K = [[fx, 0, cx],
[ 0, fy, cy],
[ 0, 0, 1]]
Where:
-
fx, fy: focal lengths in pixels. -
cx, cy: principal point (optical center) in pixels.
You need K to back-project pixels to 3D and to project 3D points to pixels.
Calibration Example (OpenCV)
This is a minimal example of how intrinsics are estimated using a checkerboard. In practice, you collect 10–30 images of a checkerboard at different angles.
import cv2
import numpy as np
# Checkerboard inner corners (rows, cols)
pattern_size = (9, 6)
square_size = 0.024 # meters
# 3D points in checkerboard coordinate system
objp = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
objp[:, :2] = np.indices(pattern_size).T.reshape(-1, 2)
objp *= square_size
objpoints = [] # 3D points
imgpoints = [] # 2D points
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"] # replace with your files
for path in image_paths:
img = cv2.imread(path)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
ret, corners = cv2.findChessboardCorners(gray, pattern_size)
if ret:
objpoints.append(objp)
imgpoints.append(corners)
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
objpoints, imgpoints, gray.shape[::-1], None, None
)
print("K:", K)
print("dist:", dist.ravel())
Parameter Glossary
Camera and geometry:
-
K: 3x3 intrinsic matrix. Converts normalized camera coordinates to pixels. -
fx, fy: focal lengths in pixels. Larger values mean narrower field of view. -
cx, cy: principal point in pixels (optical center). -
u, v: pixel coordinates (column, row). -
Z: depth value at(u, v)in meters. -
P_c: 3D point in the camera frame.
Pose and transforms:
-
R: 3x3 rotation matrix. Maps points from previous camera frame to current. -
t: 3x1 translation vector in meters. -
T: 4x4 homogeneous transform[R t; 0 1]. -
T_k-1,k: relative pose from framek-1tok. -
T_k: global pose at timekin world coordinates.
Optimization:
-
p_i: observed 2D pixel in framek. -
P_i: 3D point from framek-1lifted using depth. -
π(·): projection from 3D to pixel coordinates. -
e_i: reprojection error in pixels. -
τ: inlier threshold in pixels for RANSAC.
Pipeline Steps
- Load RGB and depth frames, and ensure time alignment.
- Detect keypoints and descriptors in the current RGB frame.
- Match descriptors to the previous frame and filter outliers.
- For matched keypoints, lift 2D pixels to 3D using the previous frame depth.
- Solve for relative pose
(R, t)using PnP with RANSAC. - Refine pose with non-linear optimization on inliers (optional).
- Compose the global pose and store the camera trajectory.
- Repeat for the next frame.
Core Math
- Back-project a pixel
(u, v)with depthZto 3D in camera coordinates:X = (u - cx) * Z / fxY = (v - cy) * Z / fyZ = Z
- Pose composition (world to camera):
T_wc = T_wc * T_ccprev
Mathematics Of Visual Odometry
This section summarizes the geometry and optimization used in RGB-D VO.
1. Camera Model And Projection
We use the pinhole camera model:
- Intrinsics:
K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]
- 3D point in camera coordinates:
P_c = [X, Y, Z]^T - Projection to pixel:
[u, v, 1]^T = K * [X/Z, Y/Z, 1]^T
Python function:
import numpy as np
def project_point(P_c, K):
X, Y, Z = P_c
if Z <= 0:
return None
x = X / Z
y = Y / Z
u = K[0, 0] * x + K[0, 2]
v = K[1, 1] * y + K[1, 2]
return np.array([u, v], dtype=np.float64)
2. Back-Projection With Depth
Given pixel (u, v) and depth Z:
X = (u - cx) * Z / fxY = (v - cy) * Z / fyP_c = [X, Y, Z]^T
Python function:
import numpy as np
def backproject_pixel(u, v, z, K):
fx, fy = K[0, 0], K[1, 1]
cx, cy = K[0, 2], K[1, 2]
X = (u - cx) * z / fx
Y = (v - cy) * z / fy
return np.array([X, Y, z], dtype=np.float64)
3. Rigid Motion In SE(3)
Relative pose between consecutive frames is a rigid transform:
-
T = [R t; 0 1]withR ∈ SO(3),t ∈ R^3 - Point transformation:
P_c2 = R * P_c1 + t
Python function:
import numpy as np
def transform_point(P_c1, R, t):
return R @ P_c1 + t.reshape(3)
4. RGB-D Odometry As 3D-2D PnP
For matched features, we use depth from frame k-1 to form 3D points, and their matched pixel locations in frame k as 2D observations.
Let:
-
P_ibe 3D points from framek-1 -
p_ibe 2D pixels in framek
We solve:
p_i ≈ π( K * (R * P_i + t) )
where π([x, y, z]^T) = [x/z, y/z]^T.
This is the standard Perspective-n-Point (PnP) problem.
Python function (build 3D-2D correspondences):
import numpy as np
def build_3d_2d_correspondences(kp_prev, kp_curr, matches, depth_prev, K):
fx, fy = K[0, 0], K[1, 1]
cx, cy = K[0, 2], K[1, 2]
pts_3d = []
pts_2d = []
for m in matches:
u_prev, v_prev = kp_prev[m.queryIdx].pt
u_curr, v_curr = kp_curr[m.trainIdx].pt
z = depth_prev[int(round(v_prev)), int(round(u_prev))]
if z <= 0 or np.isnan(z):
continue
X = (u_prev - cx) * z / fx
Y = (v_prev - cy) * z / fy
pts_3d.append([X, Y, z])
pts_2d.append([u_curr, v_curr])
if len(pts_3d) == 0:
return None, None
return np.array(pts_3d), np.array(pts_2d)
5. Reprojection Error Objective
The pose (R, t) is estimated by minimizing reprojection error:
min_{R,t} Σ || p_i - π( K * (R * P_i + t) ) ||^2
Where:
-
p_iis the measured pixel in framek. -
π(·)divides by depth to map to pixels. - The error is measured in pixel units.
Python function (PnP with RANSAC):
import cv2
def estimate_pose_pnp(pts_3d, pts_2d, K):
if pts_3d is None or len(pts_3d) < 4:
return None, None, None
success, rvec, tvec, inliers = cv2.solvePnPRansac(
pts_3d,
pts_2d,
K,
None,
iterationsCount=200,
reprojectionError=3.0,
confidence=0.999,
)
if not success or inliers is None or len(inliers) < 6:
return None, None, None
return rvec, tvec, inliers
In practice:
- Use
solvePnPRansacto reject outliers. - Optionally refine with
solvePnPorcv2.solvePnPRefineLM.
6. RANSAC Inlier Test
For each hypothesis pose:
- Project
P_iinto framek. - Compute reprojection error
e_i. - Mark as inlier if
e_i < τ(typically 2–5 px).
The final pose is re-estimated using only inliers.
Typical parameter meanings:
-
τ: maximum reprojection error to accept an inlier, in pixels. -
iterations: number of random pose hypotheses tried. -
confidence: probability that at least one hypothesis is outlier-free.
Python function (compute reprojection errors):
import numpy as np
def reprojection_errors(pts_3d, pts_2d, rvec, tvec, K):
R, _ = cv2.Rodrigues(rvec)
errors = []
for P, p in zip(pts_3d, pts_2d):
P_c = R @ P + tvec.reshape(3)
proj = project_point(P_c, K)
if proj is None:
errors.append(np.inf)
else:
errors.append(np.linalg.norm(p - proj))
return np.array(errors)
7. Pose Composition Over Time
If T_k is the global pose at time k, and T_k-1,k is the relative pose:
T_k = T_k-1 * T_k-1,k
Python function:
import numpy as np
import cv2
def compose_pose(T_wc, rvec, tvec):
R, _ = cv2.Rodrigues(rvec)
T = np.eye(4)
T[:3, :3] = R
T[:3, 3] = tvec.flatten()
return T_wc @ T
This integrates the trajectory in the world frame.
8. Optional Bundle Adjustment (BA)
For higher accuracy, optimize poses and points jointly over a window:
min_{R_k,t_k,P_i} Σ_k Σ_i || p_i^k - π( K * (R_k * P_i + t_k) ) ||^2
This improves consistency but is more expensive.
9. Common Failure Modes (Math Perspective)
- Pure rotation or low parallax: poor translation observability.
- Noisy depth: back-projected 3D points become unreliable.
- Repetitive textures: incorrect correspondences increase outliers.
Python Implementation
Below is a minimal, working reference implementation. It is structured so you can drop it into a notebook or a script and extend it.
import cv2
import numpy as np
# -------------------------
# Utilities
# -------------------------
def project_point(P_c, K):
X, Y, Z = P_c
if Z <= 0:
return None
x = X / Z
y = Y / Z
u = K[0, 0] * x + K[0, 2]
v = K[1, 1] * y + K[1, 2]
return np.array([u, v], dtype=np.float64)
def backproject_pixel(u, v, z, K):
fx, fy = K[0, 0], K[1, 1]
cx, cy = K[0, 2], K[1, 2]
X = (u - cx) * z / fx
Y = (v - cy) * z / fy
return np.array([X, Y, z], dtype=np.float64)
def transform_point(P_c1, R, t):
return R @ P_c1 + t.reshape(3)
def extract_features(img_gray, n_features=2000):
orb = cv2.ORB_create(nfeatures=n_features)
keypoints, descriptors = orb.detectAndCompute(img_gray, None)
return keypoints, descriptors
def match_features(desc1, desc2, ratio=0.75):
if desc1 is None or desc2 is None:
return []
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=False)
raw = bf.knnMatch(desc1, desc2, k=2)
good = []
for m, n in raw:
if m.distance < ratio * n.distance:
good.append(m)
return good
def build_3d_2d_correspondences(kp_prev, kp_curr, matches, depth_prev, K):
fx, fy = K[0, 0], K[1, 1]
cx, cy = K[0, 2], K[1, 2]
pts_3d = []
pts_2d = []
for m in matches:
u_prev, v_prev = kp_prev[m.queryIdx].pt
u_curr, v_curr = kp_curr[m.trainIdx].pt
z = depth_prev[int(round(v_prev)), int(round(u_prev))]
if z <= 0 or np.isnan(z):
continue
p3d = backproject_pixel(u_prev, v_prev, z, K)
pts_3d.append(p3d)
pts_2d.append([u_curr, v_curr])
if len(pts_3d) == 0:
return None, None
return np.array(pts_3d), np.array(pts_2d)
def estimate_pose_pnp(pts_3d, pts_2d, K):
# Requires >= 4 points
if pts_3d is None or len(pts_3d) < 4:
return None, None, None
success, rvec, tvec, inliers = cv2.solvePnPRansac(
pts_3d,
pts_2d,
K,
None,
iterationsCount=200,
reprojectionError=3.0,
confidence=0.999,
)
if not success or inliers is None or len(inliers) < 6:
return None, None, None
return rvec, tvec, inliers
def compose_pose(T_wc, rvec, tvec):
R, _ = cv2.Rodrigues(rvec)
T = np.eye(4)
T[:3, :3] = R
T[:3, 3] = tvec.flatten()
return T_wc @ T
def reprojection_errors(pts_3d, pts_2d, rvec, tvec, K):
R, _ = cv2.Rodrigues(rvec)
errors = []
for P, p in zip(pts_3d, pts_2d):
P_c = transform_point(P, R, tvec)
proj = project_point(P_c, K)
if proj is None:
errors.append(np.inf)
else:
errors.append(np.linalg.norm(p - proj))
return np.array(errors)
# -------------------------
# Main Loop
# -------------------------
def run_visual_odometry(rgb_frames, depth_frames, K):
assert len(rgb_frames) == len(depth_frames)
T_wc = np.eye(4)
trajectory = [T_wc.copy()]
prev_gray = None
prev_kp = None
prev_desc = None
prev_depth = None
for idx, (rgb, depth) in enumerate(zip(rgb_frames, depth_frames)):
gray = cv2.cvtColor(rgb, cv2.COLOR_BGR2GRAY)
kp, desc = extract_features(gray)
if prev_gray is not None:
matches = match_features(prev_desc, desc)
pts_3d, pts_2d = build_3d_2d_correspondences(
prev_kp, kp, matches, prev_depth, K
)
rvec, tvec, inliers = estimate_pose_pnp(pts_3d, pts_2d, K)
if rvec is not None:
T_wc = compose_pose(T_wc, rvec, tvec)
trajectory.append(T_wc.copy())
else:
trajectory.append(T_wc.copy())
else:
trajectory.append(T_wc.copy())
prev_gray = gray
prev_kp = kp
prev_desc = desc
prev_depth = depth
return trajectory
Practical Notes
- Depth alignment matters. If the depth camera is offset from RGB, perform depth registration before back-projection.
- Filter out keypoints on invalid or very noisy depth values.
- If matches are weak, increase keypoints or switch to SIFT/AKAZE.
- For smoother trajectories, add a pose-graph or bundle adjustment stage.
Expected Outputs
- A list of 4x4 camera poses
T_wcover time. - A trajectory plot (e.g., top-down
x-zview) for qualitative evaluation.