
\[ \newcommand{\R}[1]{\mathbb{R}^{#1}} \]

Camera Model

Hao Su

Winter, 2022

Credit: CS231a, Stanford, Silvio Savarese

Agenda


    Pinhole Camera

    Known since Ancient Times

    The earliest known descriptions are found in the Chinese Mozi writings (circa 500 BCE).

    Known since Ancient Times

    • Ibn al-Haytham (965-1040): "the father of modern optics"
    • His great book The Optics (Latin: De aspectibus or Perspectivae) explained the principle of perspective projection. It is also the origin of the word perspective.

    Basic Mechanism

    Definitions

    • We place a pinhole camera so that it faces the $\mathbf{k}$ direction, and the pinhole is at $O$
    • $O$: aperture
    • $\mathbf{k}$: optical axis
    • $\{\mathbf{i}, \mathbf{j}, \mathbf{k}\}$: an orthogonal frame at $O$. Let us call it the camera frame.
    • $\Pi'$: retina plane
    • $f$: focal length
    • $P$: a point in 3D space
    • $P'$: the image of $P$ on retina plane

    Project 3D Points to Retina Plane

    • The coordinate of $P$ in the $O\mathbf{ijk}$ frame is $\begin{bmatrix} x\\y\\z \end{bmatrix}$
    • The coordinate of $P'$ in the $C'\mathbf{i'j'}$ frame on $\Pi'$ is $ \begin{bmatrix} x' \\ y' \end{bmatrix} $

    Project 3D Points to Retina Plane

    • Assume that $O\mathbf{ij}$ plane is parallel to the $\Pi'$ plane, then \( \left\{ \begin{aligned} & x' = f\frac{x}{z}\\ & y' = f\frac{y}{z}\\ \end{aligned} \right. \qquad \) (Why?)

    Project 3D Points to Retina Plane

    \[ \frac{x'}{f}=\frac{x}{z} \]

    Project 3D Points to Retina Plane

    More compactly, we denote the projection as \[ (x,y,z)\rightarrow (f\frac{x}{z},f\frac{y}{z}) \]
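    As a sanity check, the projection above can be sketched in a few lines of Python (a minimal sketch; `project_pinhole` and the numeric values are made up for illustration):

```python
import numpy as np

def project_pinhole(P, f):
    """Pinhole projection: (x, y, z) -> (f*x/z, f*y/z)."""
    x, y, z = P
    return np.array([f * x / z, f * y / z])

# A point 2 m in front of the aperture, with focal length f = 0.05 m
p_img = project_pinhole(np.array([0.4, 0.2, 2.0]), 0.05)
# p_img is [0.01, 0.005]: image coordinates shrink as z grows
```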

    From Retina Plane to Image Plane

    From Retina Plane to Image Plane

    • $i'j'$ frame: retina plane frame
    • $ij$ frame: image plane frame
    • From camera frame to image plane frame \[ (x,y,z)\rightarrow (f\frac{x}{z}+c_x, f\frac{y}{z}+c_y) \]

    Converting to Pixels

    • Assume the unit of $f$ is m
    • Assume that the density of the light sensor is $k$ pixels/m horizontally and $l$ pixels/m vertically. If $k\neq l$, then pixels are non-square.

    Converting to Pixels

    • Then we scale by $k$ and $l$ \[ (x,y,z)\rightarrow (fk\frac{x}{z}+c_x, fl\frac{y}{z}+c_y) \]
    • Here, the units of $c_x$ and $c_y$ are pixels.

    Converting to Pixels

    • Then we scale by $k$ and $l$ \[ (x,y,z)\rightarrow (fk\frac{x}{z}+c_x, fl\frac{y}{z}+c_y) \]
    • Let $\alpha=fk$ and $\beta=fl$, $(x,y,z)\rightarrow (\alpha\frac{x}{z}+c_x, \beta\frac{y}{z}+c_y)$

    Converting to Pixels

    To sum up, so far, the overall transformation is: \[ \begin{bmatrix} x\\y\\z \end{bmatrix} \rightarrow \begin{bmatrix} \alpha\frac{x}{z}+c_x\\ \beta\frac{y}{z}+c_y \end{bmatrix} \] Can we express it as a linear transformation in matrix form?
    No. A linear map computes linear combinations of the coordinates, but our formula contains a division by $z$!

    Homogeneous System and
    Intrinsic Camera Matrix

    Nonetheless, we will use a somewhat hacky way to still
    represent the 3D-to-2D projection as a matrix-vector product

    Homogeneous Coordinates

    • Earlier, we introduced the conversion from Euclidean coordinates to homogeneous coordinates:
      On image plane:
      \[ \begin{bmatrix} x\\y \end{bmatrix}\Rightarrow \begin{bmatrix} x\\y\\1 \end{bmatrix} \]
      In 3D physical space:
      \[ \begin{bmatrix} x\\y\\z \end{bmatrix}\Rightarrow \begin{bmatrix} x\\y\\z\\1 \end{bmatrix} \]
    • Here we introduce a new rule to convert from homogeneous coordinates to Euclidean coordinates:
      On image plane:
      \[ \begin{bmatrix} x\\y\\w \end{bmatrix}\Rightarrow \begin{bmatrix} x/w\\y/w \end{bmatrix} \]
      In 3D physical space:
      \[ \begin{bmatrix} x\\y\\z\\w \end{bmatrix}\Rightarrow \begin{bmatrix} x/w\\y/w\\z/w \end{bmatrix} \]
    Now we have the division!
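    The two conversion rules can be written as a pair of helper functions (a minimal sketch; the function names are made up):

```python
import numpy as np

def to_homogeneous(p):
    """Euclidean -> homogeneous: append w = 1."""
    return np.append(p, 1.0)

def to_euclidean(p_h):
    """Homogeneous -> Euclidean: divide by the last coordinate w."""
    return p_h[:-1] / p_h[-1]

# [6, 8, 2] and [3, 4, 1] are the same image point after the division
p1 = to_euclidean(np.array([6.0, 8.0, 2.0]))
p2 = to_euclidean(np.array([3.0, 4.0, 1.0]))
```

    Homogeneous coordinates that differ by a nonzero scale map to the same Euclidean point; the conversion back is exactly the division we were missing.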

    Projective Transformation in the
    Homogeneous Coordinate System

    1. 3D E$\rightarrow$ 3D H: \( P= \begin{bmatrix} x\\y\\z \end{bmatrix}\rightarrow P_h=\begin{bmatrix} x\\y\\z\\1 \end{bmatrix} \)
    2. Build a homogeneous transformation matrix: \( T= \begin{bmatrix} \alpha & 0 & c_x & 0\\ 0 & \beta & c_y & 0\\ 0 & 0 & 1 & 0\\ \end{bmatrix} \)
    3. 3D H $\rightarrow$ 2D H: \( P_h'=TP_h= \begin{bmatrix} \alpha x + c_x z\\ \beta y + c_y z\\ z \end{bmatrix} \)
    4. 2D H$\rightarrow$ 2D E: \( P'=\begin{bmatrix} \alpha \frac{x}{z}+c_x\\ \beta \frac{y}{z}+c_y \end{bmatrix} \)
    $P$ (Euclidean in 3D)
    $\downarrow$
    $P_h$ (Homogeneous in 3D)
    $\downarrow$
    $P_h'$ (Homogeneous in 2D)
    $\downarrow$
    $P'$ (Euclidean in 2D)
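    The four steps above can be traced numerically (a sketch; the values of $\alpha$, $\beta$, $c_x$, $c_y$ and the point are made-up):

```python
import numpy as np

# Step 2: homogeneous transformation matrix (made-up intrinsics)
alpha, beta, c_x, c_y = 500.0, 500.0, 320.0, 240.0
T = np.array([[alpha, 0.0,  c_x, 0.0],
              [0.0,   beta, c_y, 0.0],
              [0.0,   0.0,  1.0, 0.0]])

P = np.array([0.2, -0.1, 2.0])          # Euclidean 3D point
P_h = np.append(P, 1.0)                 # Step 1: 3D E -> 3D H
P_h_prime = T @ P_h                     # Step 3: 3D H -> 2D H
P_prime = P_h_prime[:2] / P_h_prime[2]  # Step 4: 2D H -> 2D E
# P_prime equals (alpha*x/z + c_x, beta*y/z + c_y) = (370, 215)
```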

    Camera Skewness

    • $\mathbf{k}$-axis may be skewed and not perpendicular to $\Pi'$
    • The skewness will affect the homogeneous transformation matrix
    • When projected on the retina plane, the $\mathbf{i}$-axis and $\mathbf{j}$-axis have an angle $\theta$ \[ P_h'= \begin{bmatrix} \alpha & -\alpha \cot \theta & c_x & 0\\ 0 & \frac{\beta}{\sin\theta} & c_y & 0\\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} \]
    So far, we have 5 parameters ($\alpha$, $\beta$, $\theta$, $c_x$, $c_y$) that affect the imaging process

    Intrinsic Camera Matrix

    From \( P_h'= \begin{bmatrix} \alpha & -\alpha \cot \theta & c_x & 0\\ 0 & \frac{\beta}{\sin\theta} & c_y & 0\\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} \),
    we can extract the intrinsic camera matrix: \( K= \begin{bmatrix} \alpha & -\alpha \cot \theta & c_x \\ 0 & \frac{\beta}{\sin\theta} & c_y \\ 0 & 0 & 1 \end{bmatrix} \)
    So $P_h'=K[I, 0]P_h$

    Intrinsic Camera Matrix

    • The common practice is to use another 5-parameter representation of $K$: \[ K= \begin{bmatrix} f_x & s & c_x\\ 0 & f_y & c_y\\ 0 & 0 & 1 \end{bmatrix} \]
    • In practice, $K$ can often be accessed through the camera's SDK
      • For example, here is a StackOverflow post that discusses extracting the intrinsic camera matrix of the ARKit of Apple
    • We will also introduce algorithms to estimate $K$ in subsequent lectures
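    In code, $K$ in the $(f_x, f_y, s, c_x, c_y)$ parameterization and the factored projection $P_h'=K[I, 0]P_h$ look like this (a sketch; the parameter values are made up):

```python
import numpy as np

f_x, f_y, s, c_x, c_y = 500.0, 480.0, 2.0, 320.0, 240.0  # made-up values
K = np.array([[f_x, s,   c_x],
              [0.0, f_y, c_y],
              [0.0, 0.0, 1.0]])

# P_h' = K [I | 0] P_h for a point in the camera frame
I0 = np.hstack([np.eye(3), np.zeros((3, 1))])
P_h = np.array([0.2, -0.1, 2.0, 1.0])
P_h_prime = K @ I0 @ P_h
P_prime = P_h_prime[:2] / P_h_prime[2]  # pixel coordinates
```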

    Extrinsic Camera Matrix

    Camera Frame

    • The previous derivations assume that the coordinate of $P$ is in the $O\mathbf{ijk}$ frame
    • Note that the $O\mathbf{ijk}$ frame is bound to the camera, so it is referred to as the camera frame
    • In practice, the camera may move around, so using the camera frame to record object location is inconvenient

    World Frame

    • So we assume a static world frame to record object coordinates, and also record the pose of the camera
    • $O_w\mathbf{i}_w\mathbf{j}_w\mathbf{k}_w$ is the world frame
    • The coordinate of $P$ in world frame is \[ P_w= \begin{bmatrix} x_w\\y_w\\z_w\\1 \end{bmatrix} \]

    Extrinsic Camera Matrix

    • We can use $(R,t)$ to transform the world frame coordinate to the camera frame coordinate (homogeneous): \[ P_h = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}P_w \]

    Extrinsic Camera Matrix

    • \( \begin{bmatrix} R & t\\0 & 1 \end{bmatrix} \) is called the extrinsic camera matrix. There are 6 parameters (3 in $R$ and 3 in $t$)

    Projective Transformation from World Frame

    • Recall the projection from camera frame to image plane by intrinsic camera matrix: \[ P_h'=K[I, 0]P_h \]
    • We just derived that \[ P_h = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}P_w \]
    • Composed together, we can transform from world frame to image plane: \[ P_h'=K \begin{bmatrix} I & 0 \end{bmatrix} \begin{bmatrix} R & t\\ 0 & 1 \end{bmatrix}P_w= K \begin{bmatrix} R & t \end{bmatrix} P_w \]
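    The composition can be checked numerically (a sketch; $R$, $t$, and $K$ are made-up example values):

```python
import numpy as np

theta = np.deg2rad(30.0)                 # made-up camera pose
R = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
              [ 0.0,           1.0, 0.0          ],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([0.1, -0.2, 0.5])
K = np.array([[500.0,   0.0, 320.0],     # made-up intrinsics, zero skew
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

# M = K [R | t] maps homogeneous world coordinates to homogeneous pixels
M = K @ np.hstack([R, t.reshape(3, 1)])

P_w = np.array([1.0, 0.5, 4.0, 1.0])     # homogeneous world point
P_h_prime = M @ P_w
P_prime = P_h_prime[:2] / P_h_prime[2]

# Same result as going through the camera frame explicitly
P_cam = R @ P_w[:3] + t
P_check = (K @ P_cam)[:2] / P_cam[2]
```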

    Camera Matrix

    \[ P_h'= K \begin{bmatrix} R & t \end{bmatrix} P_w \]
    • $M=K \begin{bmatrix} R & t \end{bmatrix}\in\R{3\times 4}$ includes all the parameters of a pinhole camera to form images!
    • We refer to $M$ as the camera matrix
    • While $M$ has 12 numbers, not all matrices in $\R{3\times 4}$ are valid camera matrices
    • We can use the 5 parameters in $K$ and 6 parameters in $[R,t]$ to generate all valid $M$'s
    • In the literature, the common expression is that $M$ has 11 degrees of freedom

    Properties of Projective Transformations

    Projective transformation has been used by artists since the Renaissance!

    The Healing of the Cripple and Raising of Tabitha, by Masolino (1426–1427)
    Cappella Brancacci, Santa Maria del Carmine, Florence

    We will explain properties by examples and derive them mathematically.

    Points Project to Points

    Points Project to Points

    Mathematically, the projection of a point in 3D world frame can be uniquely determined by a function, which is the projective transformation function: \[ P_h'= K \begin{bmatrix} R & t \end{bmatrix} P_w \]

    Parallel Lines Meet

    Parallel Lines Meet

    In 3D, the points on a line passing through $\vec{P}_0=[x_0, y_0, z_0]^T$ can be parameterized by: \[ P_w= \begin{bmatrix} x_0\\y_0\\z_0\\1 \end{bmatrix}+ s \begin{bmatrix} d_x\\d_y\\d_z\\0 \end{bmatrix} \] where $\vec{d}=[d_x, d_y, d_z]^T$ is the direction of the line.

    Parallel Lines Meet

    • By our projective transformation, \[ \begin{aligned} P_h'=K[R, t] \left( \begin{bmatrix} \vec{P}_0\\1 \end{bmatrix}+s \begin{bmatrix} \vec{d}\\0 \end{bmatrix} \right) =K(R\vec{P}_0+t)+sKR\vec{d} \end{aligned} \]
    • Suppose that the camera location is fixed. When the point moves along the line, only $s$ changes.
    • So we introduce two constant vectors: \[ \vec v_1=R\vec P_0+t,\qquad \vec v_2=R\vec{d} \]
    • Then we have \[ P_h'=K(\vec v_1+s\vec v_2) \]

    Parallel Lines Meet

    \[ P_h'=K(\vec v_1+s\vec v_2), \qquad \vec v_1=R\vec P_0+t,\qquad \vec v_2=R\vec{d} \]
    • We make use of the structure of $K$: \[ K= \begin{bmatrix} & \vec k_1^T & \\ & \vec k_2^T & \\ 0 & 0 & 1 \end{bmatrix} \]
    • Therefore, \[ P_h'= \begin{bmatrix} & \vec k_1^T & \\ & \vec k_2^T & \\ 0 & 0 & 1 \end{bmatrix}(\vec v_1+s\vec v_2) =\begin{bmatrix} \vec k_1^T \vec v_1 + s \vec k_1^T \vec v_2\\ \vec k_2^T \vec v_1 + s \vec k_2^T \vec v_2\\ \vec v_{1,3}+s\vec{v}_{2,3} \end{bmatrix} \Rightarrow P'= \begin{bmatrix} \frac{\vec k_1^T \vec v_1 + s \vec k_1^T \vec v_2}{\vec v_{1,3}+s\vec{v}_{2,3}}\\ \frac{\vec k_2^T \vec v_1 + s \vec k_2^T \vec v_2}{\vec v_{1,3}+s\vec{v}_{2,3}} \end{bmatrix} \]

    Parallel Lines Meet

    \[ P_h'=K(\vec v_1+s\vec v_2), \qquad \vec v_1=R\vec P_0+t,\qquad \vec v_2=R\vec{d},\qquad P'= \begin{bmatrix} \frac{\vec k_1^T \vec v_1 + s \vec k_1^T \vec v_2}{\vec v_{1,3}+s\vec{v}_{2,3}}\\ \frac{\vec k_2^T \vec v_1 + s \vec k_2^T \vec v_2}{\vec v_{1,3}+s\vec{v}_{2,3}} \end{bmatrix} \]
    • Assume that $\vec v_{2,3}\neq 0$
      • The 3D point goes to infinity as $s\to \infty$, so \( P'\to \begin{bmatrix} \frac{\vec k_1^T\vec v_2}{\vec v_{2,3}}\\ \frac{\vec k_2^T\vec v_2}{\vec v_{2,3}}\\ \end{bmatrix} \).
      • This point is called the vanishing point
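    We can watch this limit numerically: as $s$ grows, the projected point approaches $K\vec v_2$ divided by its third component (a sketch with made-up $K$ and line parameters, and identity extrinsics):

```python
import numpy as np

K = np.array([[500.0,   0.0, 320.0],   # made-up intrinsics
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
R, t = np.eye(3), np.zeros(3)
P0 = np.array([0.0, 0.0, 2.0])         # a point on the line
d = np.array([1.0, 0.5, 1.0])          # line direction with d_z != 0

v1 = R @ P0 + t
v2 = R @ d

def project(s):
    """Project the line point at parameter s to Euclidean pixel coords."""
    p = K @ (v1 + s * v2)
    return p[:2] / p[2]

vp = (K @ v2)[:2] / v2[2]              # predicted vanishing point: (820, 490)
# project(s) approaches vp as s -> infinity
```

    Replacing $\vec P_0$ by any other point on a parallel line (same $\vec d$, hence same $\vec v_2$) gives the same limit, which is why parallel 3D lines meet at a common vanishing point in the image.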

    Parallel Lines Meet

    \[ P_h'=K(\vec v_1+s\vec v_2), \qquad \vec v_1=R\vec P_0+t,\qquad \vec v_2=R\vec{d},\qquad P'= \begin{bmatrix} \frac{\vec k_1^T \vec v_1 + s \vec k_1^T \vec v_2}{\vec v_{1,3}+s\vec{v}_{2,3}}\\ \frac{\vec k_2^T \vec v_1 + s \vec k_2^T \vec v_2}{\vec v_{1,3}+s\vec{v}_{2,3}} \end{bmatrix} \]
    • When $\vec v_{2,3} = 0$:
      • After projection, the lines intersect at infinity; in other words, they remain parallel in the image
      • When does this happen?

    Modern Work on Vanishing Point Prediction

    Vanishing points provide crucial information about the 3D structure of the scene. Applications include
    • camera calibration
    • single-view 3D scene reconstruction
    • autonomous navigation
    • semantic scene parsing
    While vanishing points have long been known, research on them remains active.
    For example, NeurVPS: Neural Vanishing Point Scanning via Conic Convolution, Zhou et al., NeurIPS 2019
    End