Your browser doesn't support the features required by impress.js, so you are presented with a simplified version of this presentation.

For the best experience please use the latest Chrome, Safari or Firefox browser.

Multi-View Geometry

Hao Su

Fall, 2021

Agenda

    click to jump to the section.

    Review of Camera Matrices

    Basic Mechanism

    Definitions

    • We place a pinhole camera so that it faces the $\mathbf{k}$ direction, and the pinhole is at $O$
    • $O$: aperture
    • $\mathbf{k}$: optical axis
    • $\{\mathbf{i}, \mathbf{j}, \mathbf{k}\}$: an orthogonal frame at $O$. Let us call it the camera frame.
    • $\Pi'$: retina plane
    • $f$: focal length
    • $P$: a point in 3D space
    • $P'$: the image of $P$ on retina plane

    Project 3D Points to Retina Plane

    • The coordinate of $P$ in the $O\mathbf{ijk}$ frame is $\begin{bmatrix} x\\y\\z \end{bmatrix}$
    • The coordinate of $P'$ in the $C'\mathbf{i'j'}$ frame on $\Pi'$ is $ \begin{bmatrix} x' \\ y' \end{bmatrix} $

    Project 3D Points to Retina Plane

    • Assume that $O\mathbf{ij}$ plane is parallel to the $\Pi'$ plane, then \( \left\{ \begin{aligned} & x' = f\frac{x}{z}\\ & y' = f\frac{y}{z}\\ \end{aligned} \right. \qquad \) (Why?)

    Project 3D Points to Retina Plane

    \[ \frac{x'}{f}=\frac{x}{z} \]

    Project 3D Points to Retina Plane

    More compactly, we denote the projection as \[ (x,y,z)\rightarrow (f\frac{x}{z},f\frac{y}{z}) \]

    From Retina Plane to Image Plane

    From Retina Plane to Image Plane

    • $i'j'$ frame: retina plane frame
    • $ij$ frame: image plane frame
    • From camera frame to image plane frame \[ (x,y,z)\rightarrow (f\frac{x}{z}+c_x, f\frac{y}{z}+c_y) \]

    Converting to Pixels

    • Assume the unit of $f$ is m
    • Assume that the density of light sensor is $k$ pixel/m horizontally, and $l$ pixel/m vertically. If $k\neq l$, then pixels are non-square.

    Converting to Pixels

    • Then we scale by $k$ and $l$ \[ (x,y,z)\rightarrow (fk\frac{x}{z}+c_x, fl\frac{y}{z}+c_y) \]
    • Here, the unit of $c_x$ and $c_y$ are pixels.

    Converting to Pixels

    • Then we scale by $k$ and $l$ \[ (x,y,z)\rightarrow (fk\frac{x}{z}+c_x, fl\frac{y}{z}+c_y) \]
    • Let $\alpha=fk$ and $\beta=fl$, $(x,y,z)\rightarrow (\alpha\frac{x}{z}+c_x, \beta\frac{y}{z}+c_y)$

    Converting to Pixels

    To sum up, so far, the overall transformation is: \[ \begin{bmatrix} x\\y\\z \end{bmatrix} \rightarrow \begin{bmatrix} \alpha\frac{x}{z}+c_x\\ \beta\frac{y}{z}+c_y \end{bmatrix} \] Can we express it as a linear transformation in matrix form?
    No. Linear transformations are linear combinations of numbers, but there is a division!

    Homogeneous System and
    Intrinsic Camera Matrix

    Nonetheless, we will use a somewhat hacky way to still
    represent the 3D-2D projection by matrix-vector product

    Homogeneous Coordinates

    • Before we introduced the conversion from Euclidean coordinate to Homogeneous coordinate:
      On image plane:
      \[ \begin{bmatrix} x\\y \end{bmatrix}\Rightarrow \begin{bmatrix} x\\y\\1 \end{bmatrix} \]
      In 3D physical space:
      \[ \begin{bmatrix} x\\y\\z \end{bmatrix}\Rightarrow \begin{bmatrix} x\\y\\z\\1 \end{bmatrix} \]
    • Here we introduce a new rule to convert from Homogeneous coordinate to Euclidean coordinate:
      On image plane:
      \[ \begin{bmatrix} x\\y\\w \end{bmatrix}\Rightarrow \begin{bmatrix} x/w\\y/w \end{bmatrix} \]
      In 3D physical space:
      \[ \begin{bmatrix} x\\y\\z\\w \end{bmatrix}\Rightarrow \begin{bmatrix} x/w\\y/w\\z/w \end{bmatrix} \]
    Now we have the division!

    Projective Transformation in the
    Homogeneous Coordinate System

    1. 3D E$\rightarrow$ 3D H: \( P= \begin{bmatrix} x\\y\\z \end{bmatrix}\rightarrow P_h=\begin{bmatrix} x\\y\\z\\1 \end{bmatrix} \)
    2. Build a homogeneous transformation matrix: \( T= \begin{bmatrix} \alpha & 0 & c_x & 0\\ 0 & \beta & c_y & 0\\ 0 & 0 & 1 & 0\\ \end{bmatrix} \)
    3. 3D H $\rightarrow$ 2D H: \( P_h'=TP_h= \begin{bmatrix} \alpha x + c_x z\\ \beta y + c_y z\\ z \end{bmatrix} \)
    4. 2D H$\rightarrow$ 2D E: \( P'=\begin{bmatrix} \alpha \frac{x}{z}+c_x\\ \beta \frac{y}{z}+c_y \end{bmatrix} \)
    $P$ (Euclidean in 3D)
    $\downarrow$
    $P_h$ (Homogeneous in 3D)
    $\downarrow$
    $P_h'$ (Homogeneous in 2D)
    $\downarrow$
    $P'$ (Euclidean in 2D)

    Camera Skewness

    • $\mathbf{k}$-axis may be skewed and not perpendicular to $\Pi'$
    • The skewness will affect the homogeneous transformation matrix
    • When projected on retina plane, the $\mathbf{i}$-axis and $\mathbf{j}$-axis has an angle $\theta$ \[ P_h'= \begin{bmatrix} \alpha & -\alpha \cot \theta & c_x & 0\\ 0 & \frac{\beta}{\sin\theta} & c_y & 0\\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} \]
    So far, we have 5 parameters to affect the imaging process

    Intrinsic Camera Matrix

    From \( P_h'= \begin{bmatrix} \alpha & -\alpha \cot \theta & c_x & 0\\ 0 & \frac{\beta}{\sin\theta} & c_y & 0\\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} \),
    we can extract the intrinsic camera matrix: \( K= \begin{bmatrix} \alpha & -\alpha \cot \theta & c_x \\ 0 & \frac{\beta}{\sin\theta} & c_y \\ 0 & 0 & 1 \end{bmatrix} \)
    So $P_h'=K[I, 0]P_h$

    Intrinsic Camera Matrix

    • The common practice is to use another 5-parameter representation of $K$: \[ K= \begin{bmatrix} f_x & s & c_x\\ 0 & f_y & c_y\\ 0 & 0 & 1 \end{bmatrix} \]
    • In practice, $K$ may be accessed by the SDK of cameras
      • For example, here is a StackOverflow post that discusses extracting the intrinsic camera matrix of the ARKit of Apple
    • We will also introduce algorithms to estimate $K$ in subsequent lectures

    Extrinsic Camera Matrix

    Camera Frame

    • The previous derivations assume that the coordinate of $P$ is in the $O\mathbf{ijk}$ frame
    • Note that the $O\mathbf{ijk}$ frame is binded to the camera, which is referred to as the camera frame
    • In practice, the camera may move around, so using the camera frame to record object location is inconvenient

    World Frame

    • So we assume a static world frame to record object coordinates, and also record the pose of the camera
    • $O_w\mathbf{i}_w\mathbf{j}_w\mathbf{k}_w$ is the world frame
    • The coordinate of $P$ in world frame is \[ P_w= \begin{bmatrix} x_w\\y_w\\z_w\\1 \end{bmatrix} \]

    Extrinsic Camera Matrix

    • We can use $(R,T)$ to transform the world frame coordinate to the camera frame coordinate (homogeneous): \[ P_h = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}P_w \]

    Extrinsic Camera Matrix

    • \( \begin{bmatrix} R & t\\0 & 1 \end{bmatrix} \) is called extrinsic camera matrix. There are 6 parameters (3 in $R$ and 3 in $T$)

    Projective Transformation from World Frame

    • Recall the projection from camera frame to image plane by intrinsic camera matrix: \[ P_h'=K[I, 0]P_h \]
    • We just derived that \[ P_h = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}P_w \]
    • Composed together, we can transform from world frame to image plane: \[ P_h'=K \begin{bmatrix} I & 0 \end{bmatrix} \begin{bmatrix} R & t\\ 0 & 1 \end{bmatrix}P_w= K \begin{bmatrix} R & T \end{bmatrix} P_w \]

    Camera Matrix

    \[ P_h'= K \begin{bmatrix} R & T \end{bmatrix} P_w \]
    • $M=K \begin{bmatrix} R & t \end{bmatrix}\in\R{3\times 4}$ includes all the parameters of a pinhole camera to form images!
    • We refer to $M$ as the camera matrix
    • While $M$ has 12 numbers, not all matrices in $\R{3\times 4}$ are valid camera matrices
    • We can use the 5 parameters in $K$ and 6 parameters in $[R,t]$ to generate all valid $M$'s
    • In literature, the common expression is $M$ has 11 degree of freedoms

    Multi-View Stereo (MVS)

    Multi-View Stereo

    Reconstruct the 3D shapes from a set of images and camera parameters

    Applications of MVS

    Virtual Image

    • For the convenience of drawing figures, we introduce the virtual image
    • It is the mirror reflection of the retina image
    • The projective transformation is unchanged: \( P_h'= K \begin{bmatrix} R & t \end{bmatrix} P_w \)

    Triangulation

    • Two cameras with pinholes $O_1$ and $O_2$ observe the same point $P$
    • Triangulation builds a triangle $\triangle PO_1O_2$
    • $p_1$ and $p_2$ are the imaged point of $P$ in two views

    3D Reconstruction given Corresponding Pixels

    • $p_1$ and $p_2$ form a pair of pixel correspondence
    • Given $p_1$, $p_2$, the intrinsic camera matrices $K_1$ and $K_2$, and the extrinsic camera matrices $(R_1, t_1)$ and $(R_2, t_2)$,
    can we estimate $P$?

    3D Reconstruction given Corresponding Pixels

    We can build the camera matrices for both cameras: \[ M_1=K_1[R_1, t_1],\qquad M_2=K_2[R_2, t_2] \] Denote \[ M_1= \begin{bmatrix} m_{1,1}^T\\ m_{1,2}^T \\ m_{1,3}^T \end{bmatrix}, \qquad M_2= \begin{bmatrix} m_{2,1}^T\\ m_{2,2}^T \\ m_{2,3}^T \end{bmatrix} \] Therefore, \[ p_1= \begin{bmatrix} \frac{m_{1,1}^TP}{m_{1,3}^TP} \\ \frac{m_{1,2}^TP}{m_{1,3}^TP} \\ \end{bmatrix}, \qquad p_2= \begin{bmatrix} \frac{m_{2,1}^TP}{m_{2,3}^TP} \\ \frac{m_{2,2}^TP}{m_{2,3}^TP} \\ \end{bmatrix} \]

    3D Reconstruction as Optimization

    Denote $p=f(P; M)$ as the perspective transformation $p=[\frac{m_{1}^TP}{m_{3}^TP}, \frac{m_{2}^TP}{m_{3}^TP}]^T$
    The problem of finding $P$ given $p_1$, $p_2$, $M_1$ and $M_2$ can be solved by this optimization problem: \[ \begin{aligned} &\underset{Q}{\text{minimize}}&&\|p_1-f(Q;M_1)\|^2+\|p_2-f(Q;M_2)\|^2\\ \end{aligned} \]

    3D Reconstruction as Optimization

    \[ \begin{aligned} &\underset{Q}{\text{minimize}}&&\|p_1-f(Q;M_1)\|^2+\|p_2-f(Q;M_2)\|^2\\ \end{aligned} \] where $f(Q)=[\frac{m_{1}^TQ}{m_{3}^TQ}, \frac{m_{2}^TQ}{m_{3}^TQ}]^T$
    • This is also a least square problem
    • But the function inside $\|\cdot\|^2$ is not a linear function, because $f$ is non-linear
    • We call it a non-linear optimization problem

    Gradient Descent for Numerical Optimization

    Linear vs Non-linear Least Square

    \[ \begin{aligned} &\underset{x}{\text{minimize}}&&\|y-Ax\|^2\\ \end{aligned} \]
    • Closed-form solution exists
    • $x^*=A^{\dagger} y$
    \[ \begin{aligned} &\underset{x}{\text{minimize}}&&\|y-f(x)\|^2 \end{aligned} \]
    • $f$ is non-linear (e.g., perspective function)
    • In general, closed-form solution does not exist
    • We use numerical methods to solve such problems

    Gradient Descent for Numerical Optimization

    Consider the general optimization problem: \[ \begin{aligned} &\underset{x}{\text{minimize}}&& f(x) \end{aligned} \]

    3D Reconstruction as Optimization

    \[ \begin{aligned} &\underset{Q}{\text{minimize}}&&\|p_1-f(Q;M_1)\|^2+\|p_2-f(Q;M_2)\|^2\\ \end{aligned} \] where $f(Q)=[\frac{m_{1}^TQ}{m_{3}^TQ}, \frac{m_{2}^TQ}{m_{3}^TQ}]^T$
    • Similarly, let $L(Q)=\|p_1-f(Q;M_1)\|^2+\|p_2-f(Q;M_2)\|^2$
    • We can optimization the objective function by following $\nabla L$ using gradient descent
    • In fact, there are many smarter ways to adjust the gradient direction, so that it works faster
    • For example, LM algorithm (Levenberg-Marquardt) is a popular choice

    Key Steps of Multi-View Stereo

    1. Take photos from many views
    2. Identify points of interest from images
    3. Search for corresponding points in other images
    4. Compute camera positions
    5. Estimate 3D point positions
    Since 3 and 4 also depend on the understanding of camera models, we will study them first, and then study 2.

    Search for Corresponding Points Across Views

    Task

    • Two views of the same object
    • Given a point on the left image, how can I find the corresponding point on the right image?

    Baseline Method: Global Search

    • Build a feature descriptor to encode the appearance around $p_1$ on the left view
    • Compare against feature descriptors at points all over the image on the right view

    How to compare appearance around points?

    • Advanced approach: Extract a point feature (e.g., by neural networks)
    • But let us show a very basic approach first
      1. Extract a patch around a point
      2. Compare two patches $W_1$ and $W_2$ by pixel-wise distance:
        • SSD (Sum Squared Distance) \[ \sum_{x,y}|W_1(x,y)-W_2(x,y)|^2 \]
        • NCC (Normalized Cross Correlation) \[ \frac{\sum_{x,y}(W_1(x,y)-\bar{W}_1)(W_2(x,y)-\bar{W}_2)}{\sigma_{W_1}\sigma{W_2}} \]
    • Anyway, global search requires exhaustive comparison, thus would be slow.

    Epipolar Geometry

    However, when the camera matrices of both images are given,
            it is possible to largely restrict the search space to a line on the right view!

    Epipolar Geometry

    Advantages:
    • Smaller computational load
    • Reduced error to find wrong matching
    How could this be true!
    The world is beautiful!

    Epipolar Geometry

    Point-Line Correspondence

    $p_1$ on the left view corresponds to the line $m_2$ on the right view

    Epipolar Geometry Definitions

    • $O_1O_2$: baseline, the line segment connecting $O_1$ and $O_2$
    • $e_1$ and $e_2$: epipoles, the intersection between baseline and retina planes

    Epipolar Geometry Definitions

    • A 3D point $X$ is observed at $p_1$ and $p_2$ on the left and right views, respectively
    • The pixel $p_1$ corresponds to the line passing $p_2e_2$
    • The pixel $p_2$ corresponds to the line passing $p_1e_1$

    How to Find the Epipolar Line from a Point?

    Use homogeneous coordinate to represent $p_1$ and $p_2$ in their corresponding image space: \[ p_1= \begin{bmatrix} x_1 \\ y_1 \\ 1 \end{bmatrix}, \qquad p_2= \begin{bmatrix} x_2 \\ y_2 \\ 1 \end{bmatrix} \]
    Assume that the pair of cameras have fixed relative pose
    • Consider the pair of camera frames at $O_1$ and $O_2$
    • The relative $(R,T)$ between the $O_1$ frame and $O_2$ frame keep unchanged
    Then, there exists a matrix $F$ such that \[ p_1^T F p_2 = 0,\qquad \forall (p_1,p_2)\text{ pairs in correspondence} \]

    Fundamental Matrix

    \begin{align} p_1^T F p_2 = 0,\qquad \forall (p_1,p_2)\text{ pairs in correspondence}\label{eq:fundmat} \end{align} $F\in\R{3}$: fundamental matrix

    Expand \eqref{eq:fundmat}, \[ [x_1, y_1, 1] \begin{bmatrix} F_{11} & F_{12} & F_{13}\\ F_{21} & F_{22} & F_{23}\\ F_{31} & F_{32} & F_{33} \end{bmatrix} \begin{bmatrix} x_2 \\ y_2 \\ 1 \end{bmatrix}=0 \]

    Fundamental Matrix

    \[ p_1^T F p_2 = 0,\qquad \forall (p_1,p_2)\text{ pairs in correspondence} \]

    • $w_1=F^Tp_1\in\R{3}$ defines a linear equation \begin{align} w_1^Tp_2 =0 \end{align} over $p_2$
    • This is the corresponding line of $p_1$ on the right view
    • So $w_1=F^Tp_1$ defines the coefficients of the epipolar line of $p_1$

    Fundamental Matrix

    \[ p_1^T F p_2 = 0,\qquad \forall (p_1,p_2)\text{ pairs in correspondence} \]

    • $w_2=Fp_2\in\R{3}$ defines a linear equation \[ w_2^Tp_1 =0\tag{3} \] over $p_1$
    • This is the corresponding line of $p_2$ on the left view
    • So $w_2=Fp_2$ defines the coefficients of the epipolar line of $p_2$

    Search Points using Fundamental Matrix

    Estimating $F$ from Correspondences

    Why Do We Need to Estimate $F$?

    $F$ is determined by
    • The intrinsic camera matrices $K_1$ and $K_2$
    • and the relative pose between the cameras $(R, T)$
    However, measuring the relative camera poses is hard!
    In fact, $K_1$ and $K_2$ are also not always accessible.

    Estimating $F$ from Data

    • If we can have some pairs of corresponding points, we would be able to estimate $F$
    • $F$ can help us to find more pairs of correspondences, so that we reconstruct the 3D for more points

    The Eight-Point Algorithm

    It is sufficient to estimate $F$ from 8 points

    The Fundamental Matrix Equation

    \begin{align} p^TFp'=0\quad\Rightarrow\quad [x, y, 1] \begin{bmatrix} F_{11} & F_{12} & F_{13}\\ F_{21} & F_{22} & F_{23}\\ F_{31} & F_{32} & F_{33} \end{bmatrix} \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix}=0 \end{align} Expand using matrix-vector product rules and : \[ [xx', xy',x,yx',yy',y,x',y',1] \begin{bmatrix} F_{11} \\ F_{12} \\ F_{13}\\ F_{21} \\ F_{22} \\ F_{23} \\ F_{31} \\ F_{32} \\ F_{33} \end{bmatrix}=0 \]

    Eight Point Algorithm

    \[ \underbrace{\begin{bmatrix} x_1x'_1& x_1y'_1&x_1&y_1x'_1&y_1y'_1&y_1&x'_1&y'_1&1\\ x_2x'_2& x_2y'_2&x_2&y_2x'_2&y_2y'_2&y_2&x'_2&y'_2&1\\ x_3x'_3& x_3y'_3&x_3&y_3x'_3&y_3y'_3&y_3&x'_3&y'_3&1\\ x_4x'_4& x_4y'_4&x_4&y_4x'_4&y_4y'_4&y_4&x'_4&y'_4&1\\ x_5x'_5& x_5y'_5&x_5&y_5x'_5&y_5y'_5&y_5&x'_5&y'_5&1\\ x_6x'_6& x_6y'_6&x_6&y_6x'_6&y_6y'_6&y_6&x'_6&y'_6&1\\ x_7x'_7& x_7y'_7&x_7&y_7x'_7&y_7y'_7&y_7&x'_7&y'_7&1\\ x_8x'_8& x_8y'_8&x_8&y_8x'_8&y_8y'_8&y_8&x'_8&y'_8&1\\ \end{bmatrix}}_{W} \underbrace{ \begin{bmatrix} F_{11} \\ F_{12} \\ F_{13}\\ F_{21} \\ F_{22} \\ F_{23} \\ F_{31} \\ F_{32} \\ F_{33} \end{bmatrix}}_{f} =0 \]
    In matrix form, $Wf=0$.

    Homogeneous System

    We would be able to estimate the elements of fundamental matrix $f$ from: \[ Wf = 0 \] Properties of this equation:
    • Given correspondences, $W$ is a constant
    • The equation is linear w.r.t $f$
    • The r.h.s. of the equation is $0$. The system is called a homogeneous system
    • Observe that, if $f$ is a solution, then $\lambda f\ \forall \lambda\in\R{}$ is also a solution.

    Solutions to a Homogeneous System

    \begin{align} Wf = 0 \end{align}
    • Using the language of our matrix theory, we say that $f\in \text{null}(W)$
    • Recall that each row of $W$ is from a corresponding point pair: \[ W= \begin{bmatrix} x_1x'_1& x_1y'_1&x_1&y_1x'_1&y_1y'_1&y_1&x'_1&y'_1&1\\ x_2x'_2& x_2y'_2&x_2&y_2x'_2&y_2y'_2&y_2&x'_2&y'_2&1\\ x_3x'_3& x_3y'_3&x_3&y_3x'_3&y_3y'_3&y_3&x'_3&y'_3&1\\ x_4x'_4& x_4y'_4&x_4&y_4x'_4&y_4y'_4&y_4&x'_4&y'_4&1\\ x_5x'_5& x_5y'_5&x_5&y_5x'_5&y_5y'_5&y_5&x'_5&y'_5&1\\ x_6x'_6& x_6y'_6&x_6&y_6x'_6&y_6y'_6&y_6&x'_6&y'_6&1\\ x_7x'_7& x_7y'_7&x_7&y_7x'_7&y_7y'_7&y_7&x'_7&y'_7&1\\ x_8x'_8& x_8y'_8&x_8&y_8x'_8&y_8y'_8&y_8&x'_8&y'_8&1\\ \end{bmatrix}\in\R{8\times 9} \]
    • So $\rank(W)=8$ (when points are at general positions), and $\dim(\null(W))=9-8=1$.

    Solutions to a Homogeneous System

    1. $W\in\R{8\times 9}$
    2. $\dim(\null(W))=1$
    3. $f\in \text{null}(W)$
    Combining these conclusions, we see that
    • $W$'s null space is a line in the 9-dimensional space, and $f$ is on this line.
    • In fact, any $f\in\null(W)$ might be the groundtruth $f$
    • In other words, 8 points have uniquely determined $f$ up to scaling!
    • That's why we have the 8-point algorithm.
    • Python code to find the null space:
      
      >>> from scipy.linalg import null_space
      >>> f = null_space(W)
                                  

    Robust Estimation with More Than 8 Points

    \[ Wf = 0 \]
    • While 8 points are sufficient to estimate $f$ theoretically, in practice, we use many more pairs.
    • This would allow us to estimate $f$ more robustly w.r.t. inaccurate correspondences.
    • Practically, this is very useful and important.
    • However, when there are $n>8$ point pairs, $\rank(W)=9$ and $\null(W)=0$.
    • The only solution is $0$!

    Robust Estimation with More Than 8 Points

    We use least square when $n>8$:
    $Wf = 0$
    $\Downarrow$
    $\begin{aligned} &\underset{f}{\text{minimize}}&&\|Wf\|_2^2\\ &\text{subject to} && \|f\|_2^2=1 \end{aligned}$
    • We add a constraint that $\|f\|_2^2=1$ to avoid the trivial solution $f=0$
    • We convert the homogeneous equation to a least square objective
    • A quadratic constraint quadratic programming (QCQP) problem

    Robust Estimation with More Than 8 Points

    \begin{align} \begin{aligned} &\underset{f}{\text{minimize}}&&\|Wf\|_2^2\\ &\text{subject to} && \|f\|_2^2=1 \end{aligned} \end{align} Do you remember how to optimize an equality constrained optimization problem in Calculus?

    Lagrange Multipliers

    Lagrange Multipliers: Idea

    Lagrange Multipliers: Idea

    Lagrange Multipliers: Idea

    Lagrange Multiplier Method

    \begin{align} \begin{aligned} &\underset{x}{\text{minimize}}&&f(x)\\ &\text{subject to}&&g(x)=0 \end{aligned} \end{align}
    • Denote $L(x,\lambda)=f(x)+\lambda g(x)$
    • The optimality condition is:
      $\nabla_x L(x, \lambda)=0$
      $\Downarrow$
      \begin{align} \nabla f(x)-\lambda \nabla g(x)=0 \end{align}

    Back to Our $f$ Estimation Problem

    Recall our problem is: \begin{align*} \begin{aligned} &\underset{f}{\text{minimize}}&&\|Wf\|_2^2\\ &\text{subject to} && \|f\|_2^2=1 \end{aligned} \end{align*} Lagrange function: $L=f^TW^TWf-\lambda (f^Tf-1)$
    Gradient of $\nabla_f L=2W^TWf-2\lambda f$
    Optimality condition: $W^TWf=\lambda f$
    At optimality, the objective is $f^TW^TWf=f^T(W^TWf)=\lambda f^Tf=\lambda$

    $\therefore$ We seek for the minimal $\lambda$ such that $W^TWf=\lambda f$!

    Robust Estimation with More Than 8 Points

    \begin{align*} \begin{aligned} &\underset{f}{\text{minimize}}&&\|Wf\|_2^2\\ &\text{subject to} && \|f\|_2^2=1 \end{aligned} \end{align*}
    How do we seek for the minimal $\lambda$ such that $W^TWf=\lambda f$?
    • By the definition of eigenvector and eigenvalue, $f$ must be an eigenvector and $\lambda$ must be an eigenvalue
    • Note that $W^TW$ is a positive definite matrix
      • positive definite matrix $A$: $x^TAx>0\ \forall x\neq 0$
    • Therefore, $\lambda$ must be the smallest eigenvalue and $f$ be the corresponding eigenvector

    Review: Eigenvalue and Eigenvector

    • An eigenvector $x$ of a linear transformation $A$ is a non-zero vector that, when $A$ is applied to it, does not change direction: \[ Ax = \lambda x, x \neq 0. \]
    • Applying $A$ to the vector only scales the vector by the scalar value $\lambda$, called an eigenvalue.
    Example: Google Colab

    Additional Pre- and Post-Processing
    for Estimating $F$

    Post-Processing

    • We can reshape $f$ to be the $3\times 3$ matrix $F$
    • $F\in\R{3\times 3}$ should be a rank-2 matrix
    • However, the fundamental matrix obtained by reshaping $f$ is not rank-2. We need some post-processing.
    • Assume $f$ is reshaped to a $3\times 3$ matrix $\hat{F}$. The post-processing is to solve an optimization problem: \begin{align} \begin{aligned} &\underset{}{\text{minimize}}&&\|F-\hat{F}\|^2_{F}\\ &\text{subject to}&&\rank(F)=2 \end{aligned} \end{align} where $\|A\|_F=\sqrt{\sum_{ij}A^2_{ij}}$

    Post-Processing

    \begin{align} \begin{aligned} &\underset{}{\text{minimize}}&&\|F-\hat{F}\|^2_{F}\\ &\text{subject to}&&\rank(F)=2 \end{aligned} \end{align} where $\|A\|_F=\sqrt{\sum_{ij}A^2_{ij}}$
    • This problem is called low-rank approximation
    • The solution is given by an algorithm called SVD decomposition: \[ F=U \begin{bmatrix} s_1 & 0 & 0 \\ 0 & s_2 & 0 \\ 0 & 0 & 0 \end{bmatrix}V^T \] where \[ \hat{F}=U \begin{bmatrix} s_1 & 0 & 0 \\ 0 & s_2 & 0 \\ 0 & 0 & s_3 \end{bmatrix}V^T \text{ such that } U^TU=I, V^TV=I, s_i>0 \text{ sorted in descending order} \]

    Post-Processing

    Python code for SVD-based low-rank approximation:
    
    U, S, V = np.linalg.svd(Fhat)
    print('U=', U)
    print('S=', S)
    print('V=', V)
    
    S[2] = 0
    F = U @ np.diag(S) @ V
                        
    We are almost there to finish the estimation of $F$!

    Pre-Processing

    We usually preprocess $W$ for numerical reasons. Recall the structure of $W$: \[ W= \begin{bmatrix} x_1x'_1& x_1y'_1&x_1&y_1x'_1&y_1y'_1&y_1&x'_1&y'_1&1\\ x_2x'_2& x_2y'_2&x_2&y_2x'_2&y_2y'_2&y_2&x'_2&y'_2&1\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ x_nx'_n& x_ny'_n&x_n&y_nx'_n&y_ny'_n&y_n&x'_n&y'_n&1\\ \end{bmatrix}\in\R{n\times 9} \]
    • When $x_i$ and $y_i$ are too large or too small, $W$ will have bad numerical properties.
    • We often normalize the scale of $x_i$ and $y_i$ so that they are not too or too large.
    • We skip details of step.

    Overview of Estimating the Fundamental Matrix

    1. Pre-processing point pair coordinates;
    2. Build $W$;
    3. Estimate $f$ as the eigenvector of the smallest eigenvalue of $W^TW$;
    4. Build $\hat{F}$ by reshaping $f$;
    5. Use SVD to obtain the rank-2 approximation of $F$ from $\hat{F}$.
    Fundamental matrix builds a constraint across views.
    With the help of fundamental matrix, we are better at finding correspondences across views!

    Analytical Solution of $F$ from $K_i$ and $(R,t)$

    Cross Product by Skew-Symmetric Matrix

    Result

    \[ F=K_1^{-T}([t]_{\times} R)K_2^{-1} \]

    RANSAC Algorithm for Correspondence Validation

    Review the MVS Pipeline

    In practice, we often know neither the camera matrices (intrinsic and extrinsic), nor the correspondences. But we can jointly estimate them!

    Correspondence Validation Problem

    Let us assume that we already have many correspondence candidates by appearance-based patch matching: However, many correspondences are wrong. Let us prune them out with the help of fundamental matrix!

    A Chicken and Egg Problem

    • When we know correspondences with accuracy, we can estimate $F$
    • When we know $F$, we can restrict the search space and confirm correspondences

    Step 1: Initial Estimate of F

    Randomly choose some correspondences and estimate $F$

    Step 2: Outlier Removal

    Based on the estimate $F$, remove pairs that have big errors according to the epipolar constraint $\|p^TFq\|$

    Step 3: $F$ Re-estimation

    We call the removed pairs outliers and remaining pairs inliers. Using the inliers, we can re-estimate $F$:

    Repeat

    
    for i in range(n):
        randomly choose some pairs
        repeat for m times:
            based on the inliers, estimate F
            based on F, remove pairs with big errors
                        

    Keep only the matches that are inliers with respect to the best fundamental matrix

    Discussions

    1. RANSAC starts from a subset of correspondences, which is randomly chosen. This random initial set affects the final result.
    2. For every iteration, we determine the inlier and outlier based on the level of errors. This is usually done by a threshold $\tau$, which affects the final result.
    3. To obtain good results, we often try several random initial set, and try different thresholds.

    Our Progress for Multi-View Stereo

    1. Take photos from many views
    2. Identify points of interest from images
    3. Search for corresponding points in other images
    4. Compute camera positions
    5. Estimate 3D point positions
    Next, let us study 4.

    Compute Camera Positions

    Assume that we have estimated the fundamental matrix $F$ using correspondences.
    \[ p^T F q = 0 \]
    Can we decode $(R,t)$ from $F$?

    Decoding $(R,t)$ from $F$

    First of all, $F$ is determined by:
    • the intrinsic camera matrices $K_1$ and $K_2$
    • the external camera matrices $[R, t]$
    We can derive that: \[ F=K_1^{-T} ([t]_{\times} R) K_2^{-1} \] where $[t]_{\times}= \begin{bmatrix} 0 & -t_3 & t_2\\t_3 & 0 & -t_1 \\ -t_2 & t_1 & 0 \end{bmatrix}$

    Decoding $(R,t)$ from $F$

    \[ F=K_1^{-T} ([t]_{\times} R) K_2^{-1},\quad \text{ where } [t]_{\times}= \begin{bmatrix} 0 & -t_3 & t_2\\t_3 & 0 & -t_1 \\ -t_2 & t_1 & 0 \end{bmatrix} \]
    • $E=[t]_{\times} R$ is also known as Essential matrix
    • If $K_1$ and $K_2$ are known, we will be able to compute $E$ from $F$: \[ E=K_1^T FK_2 \]

    Decoding $(R,t)$ from $F$

    \[ E=[t]_{\times} R,\quad \text{ where } [t]_{\times}= \begin{bmatrix} 0 & -t_3 & t_2\\t_3 & 0 & -t_1 \\ -t_2 & t_1 & 0 \end{bmatrix} \]
    • Note that $[t]_{\times}$ and $R$ both have special structures
      • $[t]_{\times}$: A skew-symmetric matrix
      • An orthonormal matrix
    • Mathematically, given $E$, its decomposition into the product of a skew-symmetric matrix and an orthonormal matrix is almost unique!
      • Through SVD algorithm, as well.

    The last remaining question: Is it possible to estimate $K$ from data?

    You need to collect additional data for this purpose.
    The operation is called camera calibration:

    Camera Calibration

    The most common way is to use a camera to capture many images of a chessboard. We know the distance between grids ahead. Then we can estimate $K$ by a closed-form solution. We don't cover this topic here.

    Summary of Estimating the Relative Position between Cameras from $F$

    Estimate $K_1$ and $K_2$ by camera calibration
    $\Downarrow$
    Compute Essential Matrix by \(E=K_1^T FK_2\)
    $\Downarrow$
    Decompose $E$ to estimate $t$ and $R$

    Our Progress for Multi-View Stereo

    1. Take photos from many views
    2. Identify points of interest from images
    3. Search for corresponding points in other images
    4. Compute camera positions
    5. Estimate 3D point positions

    Identify Points of Interest in an Image

    Corner Detection

    Corner Detection: Basic Idea

    • We might recognize the point by looking through a small window
    • We want a window shift in any direction to give a large change in intensity
    Flat region: no change in all directions
    Edge: no change along the edge direction
    Corner: significant change in all directions
    A. Efros

    Image as Function

    • We have been treating an image as a matrix
    • In CV, we also often treat an image as a function:
      • $I(x,y)$ returns the pixel value at $(x,y)$
    • This view allows us to compute image gradients:
      • Use finite difference to approximate partial derivative: \[ \begin{aligned} \frac{\partial f}{\partial x}(x,y) & \approx f(x+1, y)-f(x, y)\\ \frac{\partial f}{\partial y}(x,y) & \approx f(x, y+1)-f(x, y)\\ \end{aligned} \]
      • Will revisit this topic in the future

    Corner Detection by Auto-Correlation

    Change in appearance of window $w(x,y)$ for shift $[u,v]$: \[ E(u,v)=\sum_{(x,y)\in\mathcal{W}}[I(x+u,y+v)-I(x,y)]^2 \]

    Corner Detection by Auto-Correlation

    Change in appearance of window $w(x,y)$ for shift $[u,v]$: \[ E(u,v)=\sum_{(x,y)\in\mathcal{W}}[I(x+u,y+v)-I(x,y)]^2 \]
    • We want to discover how $E$ behaves for small shifts
    • But this is very slow to compute naively: $\mathcal{O}(\text{window_width}^2*\text{shift_range}^2*\text{image_width}^2)$
    • $\mathcal{O}(11^2*11^2*600^2)=5.2 \text{ billion of these }14.6\text{ thousand per pixel in your image}$
    Change in appearance of window $w(x,y)$ for shift $[u,v]$: \[ E(u,v)=\sum_{(x,y)\in\mathcal{W}}[I(x+u,y+v)-I(x,y)]^2 \] Can we approximate $E(u,v)$ locally by a quadratic surface?

    Recall: Taylor Series Expansion

    Many good function can be approximated by a polynomial whose coefficients are from the derivatives: \[ f(x)\approx f(0) + f'(0) x + \frac{1}{2}f''(0)x^2 \] The multivariate version: \[ f(x)\approx f(0) + \nabla_x f(0) x + \frac{1}{2}x^T H_f(0) x \] where $H_f(0)$ is the Hessian matrix of $f$ at $0$ such that $[H_f(0)]_{ij}=\frac{\partial^2 f}{\partial x_i \partial x_j}(0)$

    Second-Order Local Approximation

    Local quadratic approximation of $E(u, v)$ in the neighborhood of $(0, 0)$:
    \[ E(u,v)\approx E(0,0)+[u,v] \begin{bmatrix} E_u(0,0)\\E_v(0,0) \end{bmatrix}+ \frac{1}{2}[u,v] \begin{bmatrix} E_{uu}(0,0) & E_{uv}(0,0)\\ E_{uv}(0,0) & E_{vv}(0,0) \end{bmatrix} \begin{bmatrix} u\\v \end{bmatrix} \]

    Corner Detection: Mathematics

    \[ E(u,v)=\sum_{(x,y)\in\mathcal{W}}[I(x+u,y+v)-I(x,y)]^2 \] Second-order Taylor expansion of $E(u,v)$ about $(0,0)$:
    \( E(0,0)=0 \)
    \( \begin{aligned} E_u(u,v)&=\sum_{x,y} 2[I(x+u,y+v)-I(x,y)]I_x(x+u,y+v)\\ \end{aligned} \)
    \( \begin{aligned} E_{uu}(u,v)&=\sum_{x,y} 2I_x(x+u,y+v)I_x(x+u,y+v)\\ &+\sum_{x,y}2[I(x+u,y+v)-I(x,y)]I_{xx}(x+u,y+v) \end{aligned} \)
    \( \begin{aligned} E_{uv}(u,v)&=\sum_{x,y} 2I_y(x+u,y+v)I_x(x+u,y+v)\\ &+\sum_{x,y}2[I(x+u,y+v)-I(x,y)]I_{xy}(x+u,y+v) \end{aligned} \)

    Corner Detection: Mathematics

    \[ E(0,0)=0 \]
    \[ \begin{aligned} E_u(0,0)=0\\ E_v(0,0)=0 \end{aligned} \]
    \[ \begin{aligned} E_{uu}(0,0)=\sum_{x,y} 2 I_x(x,y)I_x(x,y)\\ E_{vv}(0,0)=\sum_{x,y} 2 I_y(x,y)I_y(x,y)\\ E_{uv}(0,0)=\sum_{x,y} 2 I_x(x,y)I_y(x,y) \end{aligned} \]

    Corner Detection: Mathematics

    \[ \begin{aligned} E(u,v)\approx [u,v] \begin{bmatrix} E_{uu}(0,0) & E_{uv}(0,0)\\ E_{uv}(0,0) & E_{vv}(0,0) \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix}=&[u,v] \begin{bmatrix} \sum I_x I_x & \sum I_x I_y\\ \sum I_x I_y & \sum I_yI_y \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix}\\=&[u,v] \left(\sum \begin{bmatrix} I_x I_x & I_x I_y\\ I_x I_y & I_yI_y \end{bmatrix} \right) \begin{bmatrix} u \\ v \end{bmatrix} \end{aligned} \]

    Corner Detection: Mathematics

    Let $M=\sum \begin{bmatrix} I_x I_x & I_x I_y\\ I_x I_y & I_yI_y \end{bmatrix}$, then \[ E(u,v)=[u,v] M \begin{bmatrix} u \\ v \end{bmatrix} \] Question: How do we know the shape of $E(u,v)$ surface given $M$?

    Quadratic Surfaces

    Note that \( E(u,v)=[u,v] M \begin{bmatrix} u \\ v \end{bmatrix} \) is a quadratic function of $(u,v)$.
    For two-variable functions, there are three possible shapes of the $E(u,v)$ surfaces:

    If we look at their isoheight contour lines $E(u,v)=1$, they correspond to the following profiles:

    How to Classify the Curves based on $M$?

    Suppose the eigenvalues of $M$ are $\lambda_1$ and $\lambda_2$
    • Ellipses: $\lambda_1>0$, $\lambda_2>0$
    • Parallel lines: $\lambda_1\lambda_2=0$
    • Hyperbolic: $\lambda_1\lambda_2$ is less than $0$

    Diagram of Cornerness

    Our Progress for Multi-View Stereo

    1. Take photos from many views
    2. Identify points of interest from images
    3. Search for corresponding points in other images
    4. Compute camera positions
    5. Estimate 3D point positions
    Next, it is your turn to take pictures and implement in the homework!
    End