Multi-View Geometry

Hao Su

Fall, 2021

Agenda

click to jump to the section.

Review of Camera Matrices

Basic Mechanism

Make a Pinhole Camera (https://kids.nationalgeographic.com/books/article/pinhole-camera)

Definitions

We place a pinhole camera so that it faces the $\mathbf{k}$ direction, and the pinhole is at $O$
$O$: aperture
$\mathbf{k}$: optical axis
$\{\mathbf{i}, \mathbf{j}, \mathbf{k}\}$: an orthogonal frame at $O$. Let us call it the camera frame.

$\Pi'$: retina plane
$f$: focal length
$P$: a point in 3D space
$P'$: the image of $P$ on retina plane

Project 3D Points to Retina Plane

The coordinate of $P$ in the $O\mathbf{ijk}$ frame is $\begin{bmatrix} x\\y\\z \end{bmatrix}$
The coordinate of $P'$ in the $C'\mathbf{i'j'}$ frame on $\Pi'$ is $ \begin{bmatrix} x' \\ y' \end{bmatrix} $

Project 3D Points to Retina Plane

Assume that $O\mathbf{ij}$ plane is parallel to the $\Pi'$ plane, then $ \left\{ \begin{aligned} & x' = f\frac{x}{z}\\ & y' = f\frac{y}{z}\\ \end{aligned} \right. \qquad $ (Why?)

Project 3D Points to Retina Plane

\[ \frac{x'}{f}=\frac{x}{z} \]

Project 3D Points to Retina Plane

More compactly, we denote the projection as \[ (x,y,z)\rightarrow (f\frac{x}{z},f\frac{y}{z}) \]

From Retina Plane to Image Plane

$i'j'$ frame: retina plane frame
$ij$ frame: image plane frame
From camera frame to image plane frame \[ (x,y,z)\rightarrow (f\frac{x}{z}+c_x, f\frac{y}{z}+c_y) \]

Converting to Pixels

Assume the unit of $f$ is m
Assume that the density of light sensor is $k$ pixel/m horizontally, and $l$ pixel/m vertically. If $k\neq l$, then pixels are non-square.

Converting to Pixels

Then we scale by $k$ and $l$ \[ (x,y,z)\rightarrow (fk\frac{x}{z}+c_x, fl\frac{y}{z}+c_y) \]
Here, the unit of $c_x$ and $c_y$ are pixels.

Converting to Pixels

Then we scale by $k$ and $l$ \[ (x,y,z)\rightarrow (fk\frac{x}{z}+c_x, fl\frac{y}{z}+c_y) \]
Let $\alpha=fk$ and $\beta=fl$, $(x,y,z)\rightarrow (\alpha\frac{x}{z}+c_x, \beta\frac{y}{z}+c_y)$

Converting to Pixels

To sum up, so far, the overall transformation is: \[ \begin{bmatrix} x\\y\\z \end{bmatrix} \rightarrow \begin{bmatrix} \alpha\frac{x}{z}+c_x\\ \beta\frac{y}{z}+c_y \end{bmatrix} \] Can we express it as a linear transformation in matrix form?

No. Linear transformations are linear combinations of numbers, but there is a division!

Homogeneous System and
Intrinsic Camera Matrix

Nonetheless, we will use a somewhat hacky way to still
represent the 3D-2D projection by matrix-vector product

Homogeneous Coordinates

Before we introduced the conversion from Euclidean coordinate to Homogeneous coordinate:

On image plane:
\[ \begin{bmatrix} x\\y \end{bmatrix}\Rightarrow \begin{bmatrix} x\\y\\1 \end{bmatrix} \]

In 3D physical space:
\[ \begin{bmatrix} x\\y\\z \end{bmatrix}\Rightarrow \begin{bmatrix} x\\y\\z\\1 \end{bmatrix} \]
Here we introduce a new rule to convert from Homogeneous coordinate to Euclidean coordinate:

On image plane:
\[ \begin{bmatrix} x\\y\\w \end{bmatrix}\Rightarrow \begin{bmatrix} x/w\\y/w \end{bmatrix} \]

In 3D physical space:
\[ \begin{bmatrix} x\\y\\z\\w \end{bmatrix}\Rightarrow \begin{bmatrix} x/w\\y/w\\z/w \end{bmatrix} \]

Now we have the division!

Projective Transformation in the
Homogeneous Coordinate System

3D E$\rightarrow$ 3D H: $ P= \begin{bmatrix} x\\y\\z \end{bmatrix}\rightarrow P_h=\begin{bmatrix} x\\y\\z\\1 \end{bmatrix} $
Build a homogeneous transformation matrix: $ T= \begin{bmatrix} \alpha & 0 & c_x & 0\\ 0 & \beta & c_y & 0\\ 0 & 0 & 1 & 0\\ \end{bmatrix} $
3D H $\rightarrow$ 2D H: $ P_h'=TP_h= \begin{bmatrix} \alpha x + c_x z\\ \beta y + c_y z\\ z \end{bmatrix} $
2D H$\rightarrow$ 2D E: $ P'=\begin{bmatrix} \alpha \frac{x}{z}+c_x\\ \beta \frac{y}{z}+c_y \end{bmatrix} $

$P$ (Euclidean in 3D)
$\downarrow$
$P_h$ (Homogeneous in 3D)
$\downarrow$
$P_h'$ (Homogeneous in 2D)
$\downarrow$
$P'$ (Euclidean in 2D)

Camera Skewness

$\mathbf{k}$-axis may be skewed and not perpendicular to $\Pi'$
The skewness will affect the homogeneous transformation matrix
When projected on retina plane, the $\mathbf{i}$-axis and $\mathbf{j}$-axis has an angle $\theta$ \[ P_h'= \begin{bmatrix} \alpha & -\alpha \cot \theta & c_x & 0\\ 0 & \frac{\beta}{\sin\theta} & c_y & 0\\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} \]

So far, we have 5 parameters to affect the imaging process

Intrinsic Camera Matrix

From $ P_h'= \begin{bmatrix} \alpha & -\alpha \cot \theta & c_x & 0\\ 0 & \frac{\beta}{\sin\theta} & c_y & 0\\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} $,
we can extract the intrinsic camera matrix: $ K= \begin{bmatrix} \alpha & -\alpha \cot \theta & c_x \\ 0 & \frac{\beta}{\sin\theta} & c_y \\ 0 & 0 & 1 \end{bmatrix} $

So $P_h'=K[I, 0]P_h$

Intrinsic Camera Matrix

The common practice is to use another 5-parameter representation of $K$: \[ K= \begin{bmatrix} f_x & s & c_x\\ 0 & f_y & c_y\\ 0 & 0 & 1 \end{bmatrix} \]
In practice, $K$ may be accessed by the SDK of cameras
- For example, here is a StackOverflow post that discusses extracting the intrinsic camera matrix of the ARKit of Apple
We will also introduce algorithms to estimate $K$ in subsequent lectures

Extrinsic Camera Matrix

Camera Frame

The previous derivations assume that the coordinate of $P$ is in the $O\mathbf{ijk}$ frame
Note that the $O\mathbf{ijk}$ frame is binded to the camera, which is referred to as the camera frame
In practice, the camera may move around, so using the camera frame to record object location is inconvenient

World Frame

So we assume a static world frame to record object coordinates, and also record the pose of the camera
$O_w\mathbf{i}_w\mathbf{j}_w\mathbf{k}_w$ is the world frame

The coordinate of $P$ in world frame is \[ P_w= \begin{bmatrix} x_w\\y_w\\z_w\\1 \end{bmatrix} \]

Extrinsic Camera Matrix

We can use $(R,T)$ to transform the world frame coordinate to the camera frame coordinate (homogeneous): \[ P_h = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}P_w \]

Extrinsic Camera Matrix

$ \begin{bmatrix} R & t\\0 & 1 \end{bmatrix} $ is called extrinsic camera matrix. There are 6 parameters (3 in $R$ and 3 in $T$)

Projective Transformation from World Frame

Recall the projection from camera frame to image plane by intrinsic camera matrix: \[ P_h'=K[I, 0]P_h \]
We just derived that \[ P_h = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}P_w \]
Composed together, we can transform from world frame to image plane: \[ P_h'=K \begin{bmatrix} I & 0 \end{bmatrix} \begin{bmatrix} R & t\\ 0 & 1 \end{bmatrix}P_w= K \begin{bmatrix} R & T \end{bmatrix} P_w \]

Camera Matrix

\[ P_h'= K \begin{bmatrix} R & T \end{bmatrix} P_w \]

$M=K \begin{bmatrix} R & t \end{bmatrix}\in\R{3\times 4}$ includes all the parameters of a pinhole camera to form images!
We refer to $M$ as the camera matrix
While $M$ has 12 numbers, not all matrices in $\R{3\times 4}$ are valid camera matrices
We can use the 5 parameters in $K$ and 6 parameters in $[R,t]$ to generate all valid $M$'s
In literature, the common expression is $M$ has 11 degree of freedoms

Multi-View Stereo (MVS)

Multi-View Stereo

Reconstruct the 3D shapes from a set of images and camera parameters

Applications of MVS

Virtual Image

For the convenience of drawing figures, we introduce the virtual image
It is the mirror reflection of the retina image
The projective transformation is unchanged: $ P_h'= K \begin{bmatrix} R & t \end{bmatrix} P_w $

Triangulation

Two cameras with pinholes $O_1$ and $O_2$ observe the same point $P$
Triangulation builds a triangle $\triangle PO_1O_2$
$p_1$ and $p_2$ are the imaged point of $P$ in two views

3D Reconstruction given Corresponding Pixels

$p_1$ and $p_2$ form a pair of pixel correspondence
Given $p_1$, $p_2$, the intrinsic camera matrices $K_1$ and $K_2$, and the extrinsic camera matrices $(R_1, t_1)$ and $(R_2, t_2)$,

can we estimate $P$?

3D Reconstruction given Corresponding Pixels

We can build the camera matrices for both cameras: \[ M_1=K_1[R_1, t_1],\qquad M_2=K_2[R_2, t_2] \] Denote \[ M_1= \begin{bmatrix} m_{1,1}^T\\ m_{1,2}^T \\ m_{1,3}^T \end{bmatrix}, \qquad M_2= \begin{bmatrix} m_{2,1}^T\\ m_{2,2}^T \\ m_{2,3}^T \end{bmatrix} \] Therefore, \[ p_1= \begin{bmatrix} \frac{m_{1,1}^TP}{m_{1,3}^TP} \\ \frac{m_{1,2}^TP}{m_{1,3}^TP} \\ \end{bmatrix}, \qquad p_2= \begin{bmatrix} \frac{m_{2,1}^TP}{m_{2,3}^TP} \\ \frac{m_{2,2}^TP}{m_{2,3}^TP} \\ \end{bmatrix} \]

3D Reconstruction as Optimization

Denote $p=f(P; M)$ as the perspective transformation $p=[\frac{m_{1}^TP}{m_{3}^TP}, \frac{m_{2}^TP}{m_{3}^TP}]^T$
The problem of finding $P$ given $p_1$, $p_2$, $M_1$ and $M_2$ can be solved by this optimization problem: \[ \begin{aligned} &\underset{Q}{\text{minimize}}&&\|p_1-f(Q;M_1)\|^2+\|p_2-f(Q;M_2)\|^2\\ \end{aligned} \]

3D Reconstruction as Optimization

\[ \begin{aligned} &\underset{Q}{\text{minimize}}&&\|p_1-f(Q;M_1)\|^2+\|p_2-f(Q;M_2)\|^2\\ \end{aligned} \] where $f(Q)=[\frac{m_{1}^TQ}{m_{3}^TQ}, \frac{m_{2}^TQ}{m_{3}^TQ}]^T$

This is also a least square problem
But the function inside $\|\cdot\|^2$ is not a linear function, because $f$ is non-linear
We call it a non-linear optimization problem

Gradient Descent for Numerical Optimization

Linear vs Non-linear Least Square

\[ \begin{aligned} &\underset{x}{\text{minimize}}&&\|y-Ax\|^2\\ \end{aligned} \]

Closed-form solution exists
$x^*=A^{\dagger} y$

\[ \begin{aligned} &\underset{x}{\text{minimize}}&&\|y-f(x)\|^2 \end{aligned} \]

$f$ is non-linear (e.g., perspective function)
In general, closed-form solution does not exist
We use numerical methods to solve such problems

Gradient Descent for Numerical Optimization

Consider the general optimization problem: \[ \begin{aligned} &\underset{x}{\text{minimize}}&& f(x) \end{aligned} \]

3D Reconstruction as Optimization

\[ \begin{aligned} &\underset{Q}{\text{minimize}}&&\|p_1-f(Q;M_1)\|^2+\|p_2-f(Q;M_2)\|^2\\ \end{aligned} \] where $f(Q)=[\frac{m_{1}^TQ}{m_{3}^TQ}, \frac{m_{2}^TQ}{m_{3}^TQ}]^T$

Similarly, let $L(Q)=\|p_1-f(Q;M_1)\|^2+\|p_2-f(Q;M_2)\|^2$
We can optimization the objective function by following $\nabla L$ using gradient descent
In fact, there are many smarter ways to adjust the gradient direction, so that it works faster
For example, LM algorithm (Levenberg-Marquardt) is a popular choice

Key Steps of Multi-View Stereo

Take photos from many views
Identify points of interest from images
Search for corresponding points in other images
Compute camera positions
~~Estimate 3D point positions~~

Since 3 and 4 also depend on the understanding of camera models, we will study them first, and then study 2.

Search for Corresponding Points Across Views

Task

Two views of the same object
Given a point on the left image, how can I find the corresponding point on the right image?

Baseline Method: Global Search

Build a feature descriptor to encode the appearance around $p_1$ on the left view
Compare against feature descriptors at points all over the image on the right view

How to compare appearance around points?

Advanced approach: Extract a point feature (e.g., by neural networks)
But let us show a very basic approach first
1. Extract a patch around a point
2. Compare two patches $W_1$ and $W_2$ by pixel-wise distance:
  - SSD (Sum Squared Distance) \[ \sum_{x,y}|W_1(x,y)-W_2(x,y)|^2 \]
  - NCC (Normalized Cross Correlation) \[ \frac{\sum_{x,y}(W_1(x,y)-\bar{W}_1)(W_2(x,y)-\bar{W}_2)}{\sigma_{W_1}\sigma{W_2}} \]
Anyway, global search requires exhaustive comparison, thus would be slow.

Epipolar Geometry

However, when the camera matrices of both images are given,
it is possible to largely restrict the search space to a line on the right view!

Epipolar Geometry

Advantages:

Smaller computational load
Reduced error to find wrong matching

How could this be true!
The world is beautiful!

Epipolar Geometry

Point-Line Correspondence

$p_1$ on the left view corresponds to the line $m_2$ on the right view

Epipolar Geometry Definitions

$O_1O_2$: baseline, the line segment connecting $O_1$ and $O_2$
$e_1$ and $e_2$: epipoles, the intersection between baseline and retina planes

Epipolar Geometry Definitions

A 3D point $X$ is observed at $p_1$ and $p_2$ on the left and right views, respectively
The pixel $p_1$ corresponds to the line passing $p_2e_2$
The pixel $p_2$ corresponds to the line passing $p_1e_1$

How to Find the Epipolar Line from a Point?

Use homogeneous coordinate to represent $p_1$ and $p_2$ in their corresponding image space: \[ p_1= \begin{bmatrix} x_1 \\ y_1 \\ 1 \end{bmatrix}, \qquad p_2= \begin{bmatrix} x_2 \\ y_2 \\ 1 \end{bmatrix} \]

Assume that the pair of cameras have fixed relative pose

Consider the pair of camera frames at $O_1$ and $O_2$
The relative $(R,T)$ between the $O_1$ frame and $O_2$ frame keep unchanged

Then, there exists a matrix $F$ such that \[ p_1^T F p_2 = 0,\qquad \forall (p_1,p_2)\text{ pairs in correspondence} \]

Fundamental Matrix

\begin{align} p_1^T F p_2 = 0,\qquad \forall (p_1,p_2)\text{ pairs in correspondence}\label{eq:fundmat} \end{align} $F\in\R{3}$: fundamental matrix

Expand \eqref{eq:fundmat}, \[ [x_1, y_1, 1] \begin{bmatrix} F_{11} & F_{12} & F_{13}\\ F_{21} & F_{22} & F_{23}\\ F_{31} & F_{32} & F_{33} \end{bmatrix} \begin{bmatrix} x_2 \\ y_2 \\ 1 \end{bmatrix}=0 \]

Fundamental Matrix

\[ p_1^T F p_2 = 0,\qquad \forall (p_1,p_2)\text{ pairs in correspondence} \]

$w_1=F^Tp_1\in\R{3}$ defines a linear equation \begin{align} w_1^Tp_2 =0 \end{align} over $p_2$
This is the corresponding line of $p_1$ on the right view
So $w_1=F^Tp_1$ defines the coefficients of the epipolar line of $p_1$

Fundamental Matrix

\[ p_1^T F p_2 = 0,\qquad \forall (p_1,p_2)\text{ pairs in correspondence} \]

$w_2=Fp_2\in\R{3}$ defines a linear equation \[ w_2^Tp_1 =0\tag{3} \] over $p_1$
This is the corresponding line of $p_2$ on the left view
So $w_2=Fp_2$ defines the coefficients of the epipolar line of $p_2$

Search Points using Fundamental Matrix

Estimating $F$ from Correspondences

Why Do We Need to Estimate $F$?

$F$ is determined by

The intrinsic camera matrices $K_1$ and $K_2$
and the relative pose between the cameras $(R, T)$

However, measuring the relative camera poses is hard!
In fact, $K_1$ and $K_2$ are also not always accessible.

Estimating $F$ from Data

If we can have some pairs of corresponding points, we would be able to estimate $F$
$F$ can help us to find more pairs of correspondences, so that we reconstruct the 3D for more points

The Eight-Point Algorithm

It is sufficient to estimate $F$ from 8 points

The Fundamental Matrix Equation

\begin{align} p^TFp'=0\quad\Rightarrow\quad [x, y, 1] \begin{bmatrix} F_{11} & F_{12} & F_{13}\\ F_{21} & F_{22} & F_{23}\\ F_{31} & F_{32} & F_{33} \end{bmatrix} \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix}=0 \end{align} Expand using matrix-vector product rules and : \[ [xx', xy',x,yx',yy',y,x',y',1] \begin{bmatrix} F_{11} \\ F_{12} \\ F_{13}\\ F_{21} \\ F_{22} \\ F_{23} \\ F_{31} \\ F_{32} \\ F_{33} \end{bmatrix}=0 \]

Eight Point Algorithm

\[ \underbrace{\begin{bmatrix} x_1x'_1& x_1y'_1&x_1&y_1x'_1&y_1y'_1&y_1&x'_1&y'_1&1\\ x_2x'_2& x_2y'_2&x_2&y_2x'_2&y_2y'_2&y_2&x'_2&y'_2&1\\ x_3x'_3& x_3y'_3&x_3&y_3x'_3&y_3y'_3&y_3&x'_3&y'_3&1\\ x_4x'_4& x_4y'_4&x_4&y_4x'_4&y_4y'_4&y_4&x'_4&y'_4&1\\ x_5x'_5& x_5y'_5&x_5&y_5x'_5&y_5y'_5&y_5&x'_5&y'_5&1\\ x_6x'_6& x_6y'_6&x_6&y_6x'_6&y_6y'_6&y_6&x'_6&y'_6&1\\ x_7x'_7& x_7y'_7&x_7&y_7x'_7&y_7y'_7&y_7&x'_7&y'_7&1\\ x_8x'_8& x_8y'_8&x_8&y_8x'_8&y_8y'_8&y_8&x'_8&y'_8&1\\ \end{bmatrix}}_{W} \underbrace{ \begin{bmatrix} F_{11} \\ F_{12} \\ F_{13}\\ F_{21} \\ F_{22} \\ F_{23} \\ F_{31} \\ F_{32} \\ F_{33} \end{bmatrix}}_{f} =0 \]

In matrix form, $Wf=0$.

Homogeneous System

We would be able to estimate the elements of fundamental matrix $f$ from: \[ Wf = 0 \] Properties of this equation:

Given correspondences, $W$ is a constant
The equation is linear w.r.t $f$
The r.h.s. of the equation is $0$. The system is called a homogeneous system
Observe that, if $f$ is a solution, then $\lambda f\ \forall \lambda\in\R{}$ is also a solution.

Solutions to a Homogeneous System

\begin{align} Wf = 0 \end{align}

Using the language of our matrix theory, we say that $f\in \text{null}(W)$
Recall that each row of $W$ is from a corresponding point pair: \[ W= \begin{bmatrix} x_1x'_1& x_1y'_1&x_1&y_1x'_1&y_1y'_1&y_1&x'_1&y'_1&1\\ x_2x'_2& x_2y'_2&x_2&y_2x'_2&y_2y'_2&y_2&x'_2&y'_2&1\\ x_3x'_3& x_3y'_3&x_3&y_3x'_3&y_3y'_3&y_3&x'_3&y'_3&1\\ x_4x'_4& x_4y'_4&x_4&y_4x'_4&y_4y'_4&y_4&x'_4&y'_4&1\\ x_5x'_5& x_5y'_5&x_5&y_5x'_5&y_5y'_5&y_5&x'_5&y'_5&1\\ x_6x'_6& x_6y'_6&x_6&y_6x'_6&y_6y'_6&y_6&x'_6&y'_6&1\\ x_7x'_7& x_7y'_7&x_7&y_7x'_7&y_7y'_7&y_7&x'_7&y'_7&1\\ x_8x'_8& x_8y'_8&x_8&y_8x'_8&y_8y'_8&y_8&x'_8&y'_8&1\\ \end{bmatrix}\in\R{8\times 9} \]
So $\rank(W)=8$ (when points are at general positions), and $\dim(\null(W))=9-8=1$.

Solutions to a Homogeneous System

$W\in\R{8\times 9}$
$\dim(\null(W))=1$
$f\in \text{null}(W)$

Combining these conclusions, we see that

$W$'s null space is a line in the 9-dimensional space, and $f$ is on this line.
In fact, any $f\in\null(W)$ might be the groundtruth $f$
In other words, 8 points have uniquely determined $f$ up to scaling!
That's why we have the 8-point algorithm.

Python code to find the null space:


>>> from scipy.linalg import null_space
>>> f = null_space(W)

Robust Estimation with More Than 8 Points

\[ Wf = 0 \]

While 8 points are sufficient to estimate $f$ theoretically, in practice, we use many more pairs.
This would allow us to estimate $f$ more robustly w.r.t. inaccurate correspondences.
Practically, this is very useful and important.
However, when there are $n>8$ point pairs, $\rank(W)=9$ and $\null(W)=0$.
The only solution is $0$!

Robust Estimation with More Than 8 Points

We use least square when $n>8$:

$Wf = 0$

$\Downarrow$

$\begin{aligned} &\underset{f}{\text{minimize}}&&\|Wf\|_2^2\\ &\text{subject to} && \|f\|_2^2=1 \end{aligned}$

We add a constraint that $\|f\|_2^2=1$ to avoid the trivial solution $f=0$
We convert the homogeneous equation to a least square objective
A quadratic constraint quadratic programming (QCQP) problem

Robust Estimation with More Than 8 Points

\begin{align} \begin{aligned} &\underset{f}{\text{minimize}}&&\|Wf\|_2^2\\ &\text{subject to} && \|f\|_2^2=1 \end{aligned} \end{align} Do you remember how to optimize an equality constrained optimization problem in Calculus?

Lagrange Multipliers

Lagrange Multipliers: Idea

Lagrange Multiplier Method

\begin{align} \begin{aligned} &\underset{x}{\text{minimize}}&&f(x)\\ &\text{subject to}&&g(x)=0 \end{aligned} \end{align}

Denote $L(x,\lambda)=f(x)+\lambda g(x)$

The optimality condition is:

$\nabla_x L(x, \lambda)=0$

$\Downarrow$

\begin{align} \nabla f(x)-\lambda \nabla g(x)=0 \end{align}

Back to Our $f$ Estimation Problem

Recall our problem is: \begin{align*} \begin{aligned} &\underset{f}{\text{minimize}}&&\|Wf\|_2^2\\ &\text{subject to} && \|f\|_2^2=1 \end{aligned} \end{align*} Lagrange function: $L=f^TW^TWf-\lambda (f^Tf-1)$
Gradient of $\nabla_f L=2W^TWf-2\lambda f$

Optimality condition: $W^TWf=\lambda f$
At optimality, the objective is $f^TW^TWf=f^T(W^TWf)=\lambda f^Tf=\lambda$

$\therefore$ We seek for the minimal $\lambda$ such that $W^TWf=\lambda f$!

Robust Estimation with More Than 8 Points

\begin{align*} \begin{aligned} &\underset{f}{\text{minimize}}&&\|Wf\|_2^2\\ &\text{subject to} && \|f\|_2^2=1 \end{aligned} \end{align*}

How do we seek for the minimal $\lambda$ such that $W^TWf=\lambda f$?

By the definition of eigenvector and eigenvalue, $f$ must be an eigenvector and $\lambda$ must be an eigenvalue
Note that $W^TW$ is a positive definite matrix
- positive definite matrix $A$: $x^TAx>0\ \forall x\neq 0$
Therefore, $\lambda$ must be the smallest eigenvalue and $f$ be the corresponding eigenvector

Review: Eigenvalue and Eigenvector

An eigenvector $x$ of a linear transformation $A$ is a non-zero vector that, when $A$ is applied to it, does not change direction: \[ Ax = \lambda x, x \neq 0. \]

Example: Google Colab

Additional Pre- and Post-Processing
for Estimating $F$

Post-Processing

We can reshape $f$ to be the $3\times 3$ matrix $F$
$F\in\R{3\times 3}$ should be a rank-2 matrix
However, the fundamental matrix obtained by reshaping $f$ is not rank-2. We need some post-processing.
Assume $f$ is reshaped to a $3\times 3$ matrix $\hat{F}$. The post-processing is to solve an optimization problem: \begin{align} \begin{aligned} &\underset{}{\text{minimize}}&&\|F-\hat{F}\|^2_{F}\\ &\text{subject to}&&\rank(F)=2 \end{aligned} \end{align} where $\|A\|_F=\sqrt{\sum_{ij}A^2_{ij}}$

Post-Processing

\begin{align} \begin{aligned} &\underset{}{\text{minimize}}&&\|F-\hat{F}\|^2_{F}\\ &\text{subject to}&&\rank(F)=2 \end{aligned} \end{align} where $\|A\|_F=\sqrt{\sum_{ij}A^2_{ij}}$

This problem is called low-rank approximation
The solution is given by an algorithm called SVD decomposition: \[ F=U \begin{bmatrix} s_1 & 0 & 0 \\ 0 & s_2 & 0 \\ 0 & 0 & 0 \end{bmatrix}V^T \] where \[ \hat{F}=U \begin{bmatrix} s_1 & 0 & 0 \\ 0 & s_2 & 0 \\ 0 & 0 & s_3 \end{bmatrix}V^T \text{ such that } U^TU=I, V^TV=I, s_i>0 \text{ sorted in descending order} \]

Post-Processing

Python code for SVD-based low-rank approximation:


U, S, V = np.linalg.svd(Fhat)
print('U=', U)
print('S=', S)
print('V=', V)

S[2] = 0
F = U @ np.diag(S) @ V

We are almost there to finish the estimation of $F$!

Pre-Processing

We usually preprocess $W$ for numerical reasons. Recall the structure of $W$: \[ W= \begin{bmatrix} x_1x'_1& x_1y'_1&x_1&y_1x'_1&y_1y'_1&y_1&x'_1&y'_1&1\\ x_2x'_2& x_2y'_2&x_2&y_2x'_2&y_2y'_2&y_2&x'_2&y'_2&1\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ x_nx'_n& x_ny'_n&x_n&y_nx'_n&y_ny'_n&y_n&x'_n&y'_n&1\\ \end{bmatrix}\in\R{n\times 9} \]

When $x_i$ and $y_i$ are too large or too small, $W$ will have bad numerical properties.
We often normalize the scale of $x_i$ and $y_i$ so that they are not too or too large.
We skip details of step.

Overview of Estimating the Fundamental Matrix

Pre-processing point pair coordinates;
Build $W$;
Estimate $f$ as the eigenvector of the smallest eigenvalue of $W^TW$;
Build $\hat{F}$ by reshaping $f$;
Use SVD to obtain the rank-2 approximation of $F$ from $\hat{F}$.

Fundamental matrix builds a constraint across views.
With the help of fundamental matrix, we are better at finding correspondences across views!

Analytical Solution of $F$ from $K_i$ and $(R,t)$

Cross Product by Skew-Symmetric Matrix

Result

\[ F=K_1^{-T}([t]_{\times} R)K_2^{-1} \]

RANSAC Algorithm for Correspondence Validation

Review the MVS Pipeline

In practice, we often know neither the camera matrices (intrinsic and extrinsic), nor the correspondences. But we can jointly estimate them!

Correspondence Validation Problem

Let us assume that we already have many correspondence candidates by appearance-based patch matching:

However, many correspondences are wrong. Let us prune them out with the help of fundamental matrix!

A Chicken and Egg Problem

When we know correspondences with accuracy, we can estimate $F$
When we know $F$, we can restrict the search space and confirm correspondences

Step 1: Initial Estimate of F

Randomly choose some correspondences and estimate $F$

Step 2: Outlier Removal

Based on the estimate $F$, remove pairs that have big errors according to the epipolar constraint $\|p^TFq\|$

Step 3: $F$ Re-estimation

We call the removed pairs outliers and remaining pairs inliers. Using the inliers, we can re-estimate $F$:

Repeat


for i in range(n):
    randomly choose some pairs
    repeat for m times:
        based on the inliers, estimate F
        based on F, remove pairs with big errors

Keep only the matches that are inliers with respect to the best fundamental matrix

Discussions

RANSAC starts from a subset of correspondences, which is randomly chosen. This random initial set affects the final result.
For every iteration, we determine the inlier and outlier based on the level of errors. This is usually done by a threshold $\tau$, which affects the final result.
To obtain good results, we often try several random initial set, and try different thresholds.

Our Progress for Multi-View Stereo

Take photos from many views
Identify points of interest from images
~~Search for corresponding points in other images~~
Compute camera positions
~~Estimate 3D point positions~~

Next, let us study 4.

Compute Camera Positions

Assume that we have estimated the fundamental matrix $F$ using correspondences.

\[ p^T F q = 0 \]

Can we decode $(R,t)$ from $F$?

Decoding $(R,t)$ from $F$

First of all, $F$ is determined by:

the intrinsic camera matrices $K_1$ and $K_2$
the external camera matrices $[R, t]$

We can derive that: \[ F=K_1^{-T} ([t]_{\times} R) K_2^{-1} \] where $[t]_{\times}= \begin{bmatrix} 0 & -t_3 & t_2\\t_3 & 0 & -t_1 \\ -t_2 & t_1 & 0 \end{bmatrix}$

Decoding $(R,t)$ from $F$

\[ F=K_1^{-T} ([t]_{\times} R) K_2^{-1},\quad \text{ where } [t]_{\times}= \begin{bmatrix} 0 & -t_3 & t_2\\t_3 & 0 & -t_1 \\ -t_2 & t_1 & 0 \end{bmatrix} \]

$E=[t]_{\times} R$ is also known as Essential matrix
If $K_1$ and $K_2$ are known, we will be able to compute $E$ from $F$: \[ E=K_1^T FK_2 \]

Decoding $(R,t)$ from $F$

\[ E=[t]_{\times} R,\quad \text{ where } [t]_{\times}= \begin{bmatrix} 0 & -t_3 & t_2\\t_3 & 0 & -t_1 \\ -t_2 & t_1 & 0 \end{bmatrix} \]

Note that $[t]_{\times}$ and $R$ both have special structures
- $[t]_{\times}$: A skew-symmetric matrix
- An orthonormal matrix
Mathematically, given $E$, its decomposition into the product of a skew-symmetric matrix and an orthonormal matrix is almost unique!
- Through SVD algorithm, as well.

The last remaining question: Is it possible to estimate $K$ from data?

You need to collect additional data for this purpose.
The operation is called camera calibration:

Camera Calibration

The most common way is to use a camera to capture many images of a chessboard. We know the distance between grids ahead. Then we can estimate $K$ by a closed-form solution. We don't cover this topic here.

Summary of Estimating the Relative Position between Cameras from $F$

Estimate $K_1$ and $K_2$ by camera calibration

$\Downarrow$

Compute Essential Matrix by $E=K_1^T FK_2$

$\Downarrow$

Decompose $E$ to estimate $t$ and $R$

Our Progress for Multi-View Stereo

Take photos from many views
Identify points of interest from images
~~Search for corresponding points in other images~~
~~Compute camera positions~~
~~Estimate 3D point positions~~

Identify Points of Interest in an Image

Corner Detection

Corner Detection: Basic Idea

We might recognize the point by looking through a small window
We want a window shift in any direction to give a large change in intensity

Flat region: no change in all directions

Edge: no change along the edge direction

Corner: significant change in all directions

A. Efros

Image as Function

We have been treating an image as a matrix
In CV, we also often treat an image as a function:
- $I(x,y)$ returns the pixel value at $(x,y)$
This view allows us to compute image gradients:
- Use finite difference to approximate partial derivative: \[ \begin{aligned} \frac{\partial f}{\partial x}(x,y) & \approx f(x+1, y)-f(x, y)\\ \frac{\partial f}{\partial y}(x,y) & \approx f(x, y+1)-f(x, y)\\ \end{aligned} \]
- Will revisit this topic in the future

Corner Detection by Auto-Correlation

Change in appearance of window $w(x,y)$ for shift $[u,v]$: \[ E(u,v)=\sum_{(x,y)\in\mathcal{W}}[I(x+u,y+v)-I(x,y)]^2 \]

Corner Detection by Auto-Correlation

Change in appearance of window $w(x,y)$ for shift $[u,v]$: \[ E(u,v)=\sum_{(x,y)\in\mathcal{W}}[I(x+u,y+v)-I(x,y)]^2 \]

We want to discover how $E$ behaves for small shifts
But this is very slow to compute naively: $\mathcal{O}(\text{window_width}^2*\text{shift_range}^2*\text{image_width}^2)$
$\mathcal{O}(11^2*11^2*600^2)=5.2 \text{ billion of these }14.6\text{ thousand per pixel in your image}$

Change in appearance of window $w(x,y)$ for shift $[u,v]$: \[ E(u,v)=\sum_{(x,y)\in\mathcal{W}}[I(x+u,y+v)-I(x,y)]^2 \]

Can we approximate $E(u,v)$ locally by a quadratic surface?

Recall: Taylor Series Expansion

Many good function can be approximated by a polynomial whose coefficients are from the derivatives: \[ f(x)\approx f(0) + f'(0) x + \frac{1}{2}f''(0)x^2 \] The multivariate version: \[ f(x)\approx f(0) + \nabla_x f(0) x + \frac{1}{2}x^T H_f(0) x \] where $H_f(0)$ is the Hessian matrix of $f$ at $0$ such that $[H_f(0)]_{ij}=\frac{\partial^2 f}{\partial x_i \partial x_j}(0)$

Second-Order Local Approximation

Local quadratic approximation of $E(u, v)$ in the neighborhood of $(0, 0)$:

\[ E(u,v)\approx E(0,0)+[u,v] \begin{bmatrix} E_u(0,0)\\E_v(0,0) \end{bmatrix}+ \frac{1}{2}[u,v] \begin{bmatrix} E_{uu}(0,0) & E_{uv}(0,0)\\ E_{uv}(0,0) & E_{vv}(0,0) \end{bmatrix} \begin{bmatrix} u\\v \end{bmatrix} \]

Corner Detection: Mathematics

\[ E(u,v)=\sum_{(x,y)\in\mathcal{W}}[I(x+u,y+v)-I(x,y)]^2 \] Second-order Taylor expansion of $E(u,v)$ about $(0,0)$:

$ E(0,0)=0 $

$ \begin{aligned} E_u(u,v)&=\sum_{x,y} 2[I(x+u,y+v)-I(x,y)]I_x(x+u,y+v)\\ \end{aligned} $

$ \begin{aligned} E_{uu}(u,v)&=\sum_{x,y} 2I_x(x+u,y+v)I_x(x+u,y+v)\\ &+\sum_{x,y}2[I(x+u,y+v)-I(x,y)]I_{xx}(x+u,y+v) \end{aligned} $

$ \begin{aligned} E_{uv}(u,v)&=\sum_{x,y} 2I_y(x+u,y+v)I_x(x+u,y+v)\\ &+\sum_{x,y}2[I(x+u,y+v)-I(x,y)]I_{xy}(x+u,y+v) \end{aligned} $

Corner Detection: Mathematics

\[ E(0,0)=0 \]

\[ \begin{aligned} E_u(0,0)=0\\ E_v(0,0)=0 \end{aligned} \]

\[ \begin{aligned} E_{uu}(0,0)=\sum_{x,y} 2 I_x(x,y)I_x(x,y)\\ E_{vv}(0,0)=\sum_{x,y} 2 I_y(x,y)I_y(x,y)\\ E_{uv}(0,0)=\sum_{x,y} 2 I_x(x,y)I_y(x,y) \end{aligned} \]

Corner Detection: Mathematics

\[ \begin{aligned} E(u,v)\approx [u,v] \begin{bmatrix} E_{uu}(0,0) & E_{uv}(0,0)\\ E_{uv}(0,0) & E_{vv}(0,0) \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix}=&[u,v] \begin{bmatrix} \sum I_x I_x & \sum I_x I_y\\ \sum I_x I_y & \sum I_yI_y \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix}\\=&[u,v] \left(\sum \begin{bmatrix} I_x I_x & I_x I_y\\ I_x I_y & I_yI_y \end{bmatrix} \right) \begin{bmatrix} u \\ v \end{bmatrix} \end{aligned} \]

Corner Detection: Mathematics

Let $M=\sum \begin{bmatrix} I_x I_x & I_x I_y\\ I_x I_y & I_yI_y \end{bmatrix}$, then \[ E(u,v)=[u,v] M \begin{bmatrix} u \\ v \end{bmatrix} \] Question: How do we know the shape of $E(u,v)$ surface given $M$?

Quadratic Surfaces

Note that $ E(u,v)=[u,v] M \begin{bmatrix} u \\ v \end{bmatrix} $ is a quadratic function of $(u,v)$.
For two-variable functions, there are three possible shapes of the $E(u,v)$ surfaces:

If we look at their isoheight contour lines $E(u,v)=1$, they correspond to the following profiles:

How to Classify the Curves based on $M$?

Suppose the eigenvalues of $M$ are $\lambda_1$ and $\lambda_2$

Ellipses: $\lambda_1>0$, $\lambda_2>0$
Parallel lines: $\lambda_1\lambda_2=0$
Hyperbolic: $\lambda_1\lambda_2$ is less than $0$

Diagram of Cornerness

Our Progress for Multi-View Stereo

Take photos from many views
~~Identify points of interest from images~~
~~Search for corresponding points in other images~~
~~Compute camera positions~~
~~Estimate 3D point positions~~

Next, it is your turn to take pictures and implement in the homework!

End

Multi-View Geometry

Hao Su

Fall, 2021

Agenda

Review of Camera Matrices

Basic Mechanism

Definitions

Project 3D Points to Retina Plane

Project 3D Points to Retina Plane

Project 3D Points to Retina Plane

Project 3D Points to Retina Plane

From Retina Plane to Image Plane

From Retina Plane to Image Plane

Converting to Pixels

Converting to Pixels

Converting to Pixels

Converting to Pixels

Homogeneous System and Intrinsic Camera Matrix

Homogeneous Coordinates

Projective Transformation in the Homogeneous Coordinate System

Camera Skewness

Intrinsic Camera Matrix

Intrinsic Camera Matrix

Extrinsic Camera Matrix

Camera Frame

World Frame

Extrinsic Camera Matrix

Extrinsic Camera Matrix

Projective Transformation from World Frame

Camera Matrix

Multi-View Stereo (MVS)

Multi-View Stereo

Applications of MVS

Virtual Image

Triangulation

3D Reconstruction given Corresponding Pixels

3D Reconstruction given Corresponding Pixels

3D Reconstruction as Optimization

3D Reconstruction as Optimization

Gradient Descent for Numerical Optimization

Linear vs Non-linear Least Square

Gradient Descent for Numerical Optimization

3D Reconstruction as Optimization

Key Steps of Multi-View Stereo

Search for Corresponding Points Across Views

Task

Baseline Method: Global Search

How to compare appearance around points?

Epipolar Geometry

Epipolar Geometry

Epipolar Geometry

Point-Line Correspondence

Epipolar Geometry Definitions

Epipolar Geometry Definitions

How to Find the Epipolar Line from a Point?

Fundamental Matrix

Fundamental Matrix

Fundamental Matrix

Search Points using Fundamental Matrix

Estimating $F$ from Correspondences

Why Do We Need to Estimate $F$?

Estimating $F$ from Data

The Eight-Point Algorithm

The Fundamental Matrix Equation

Eight Point Algorithm

Homogeneous System

Solutions to a Homogeneous System

Solutions to a Homogeneous System

Robust Estimation with More Than 8 Points

Robust Estimation with More Than 8 Points

Robust Estimation with More Than 8 Points

Lagrange Multipliers

Lagrange Multipliers: Idea

Lagrange Multipliers: Idea

Lagrange Multipliers: Idea

Lagrange Multiplier Method

Back to Our $f$ Estimation Problem

Robust Estimation with More Than 8 Points

Review: Eigenvalue and Eigenvector

Additional Pre- and Post-Processing for Estimating $F$

Homogeneous System and
Intrinsic Camera Matrix

Projective Transformation in the
Homogeneous Coordinate System

Additional Pre- and Post-Processing
for Estimating $F$