Summary

Triangulation uses two (or more) perspectives to create 3-D data. These can be two sensors, or one sensor combined with a special light source at a different location. With a single stationary camera, it is very hard to derive distance information without further knowledge of the scene.

Basic idea

All triangulation methods rely on the following principle:

Imagine we take a picture of a cat (Figure 1) and want to find out how far away the cat is. With one camera we can only roughly estimate the distance from our knowledge of how big cats usually are and of how big an object of known size appears in the camera image at a given distance. However, it’s hard to tell whether we have a kitten close to the camera or a giant cat at a large distance (Figure 2).
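The ambiguity can be made concrete with the ideal pinhole camera model, which is assumed here and not spelled out above: an object of height H at distance Z appears with height h = f · H / Z in the image, where f is the focal length in pixels. The following sketch uses made-up numbers and a hypothetical function name to show a kitten and a giant cat producing exactly the same image height.

```python
def projected_height_px(focal_px, object_height_m, distance_m):
    """Image height in pixels of an object under the ideal pinhole model."""
    return focal_px * object_height_m / distance_m


FOCAL_PX = 800.0  # hypothetical focal length in pixels

# A 12.5 cm kitten at 1 m and a 1 m giant cat at 8 m (illustrative values).
kitten = projected_height_px(FOCAL_PX, 0.125, 1.0)
giant_cat = projected_height_px(FOCAL_PX, 1.0, 8.0)

# Both values are identical, so one image alone cannot separate the cases.
print(kitten, giant_cat)
```

Only the ratio H / Z enters the projection, which is why a single view cannot tell size and distance apart.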

Let us add a second camera

If we add a second camera, our estimates can get much more precise without having to know anything about cats.

Figure 4: Projection images generated from the scene in Figure 3. Blue is the left camera, red is the right camera. This illustration is simplified: in practice the cat would not just be shifted between the two views but also be slightly rotated around the y-axis.

Assume we have a setup as illustrated in Figure 3. With such a setup, images like those shown in Figure 4 can be captured. Now we can try to find corresponding features in both images. For example, assume we can find the center of the cat’s left pupil in each image. This gives us a pair of coordinates: (x1, y1) in the left image and (x2, y2) in the right image.

Calibration

If we calibrate a camera intrinsically, we can determine the direction of the ray that goes through the camera’s focal point and a given pixel location on the sensor chip. This is described by a so-called projective transformation. If we also stereo-calibrate our setup, we can determine the location and orientation of one camera relative to the other (this is a so-called rigid body transformation). With this information we can determine the actual rays in the world coordinate system that go through the cameras’ focal points and the cat’s pupil (see Figure 5).

Here we assume the focal point of camera 1 (c1) is the origin of the world coordinate system. We get c2 by taking the translation component of the rigid body transformation R. The direction from c1 to the pupil (d1) can be computed directly from the intrinsic calibration of camera 1. For d2 we need the intrinsic calibration of camera 2 and the rotation component of R. With all this combined, we get two lines in the 3-D world coordinate system. The intersection of the two lines gives us the location of the point P, which is the cat’s pupil.
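The intersection step can be sketched as follows. With noisy measurements the two lines generally do not meet exactly, so this sketch returns the midpoint of the shortest segment between them, which is one common choice and not necessarily what a given system uses. It assumes c1, d1, c2 and d2 are already expressed in world coordinates as described above; the function names and the numbers are hypothetical.

```python
def dot(a, b):
    """Dot product of two 3-D vectors given as sequences."""
    return sum(x * y for x, y in zip(a, b))


def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint of the shortest segment between the lines c1 + s*d1 and c2 + t*d2.

    The directions do not need to be normalized.
    """
    w = [p - q for p, q in zip(c1, c2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel, no unique closest point")
    s = (b * e - c * d) / denom  # parameter of the closest point on line 1
    t = (a * e - b * d) / denom  # parameter of the closest point on line 2
    p1 = [o + s * k for o, k in zip(c1, d1)]
    p2 = [o + t * k for o, k in zip(c2, d2)]
    return [(u + v) / 2.0 for u, v in zip(p1, p2)]


# Illustrative setup: camera 1 at the origin, camera 2 shifted 0.2 m along x.
c1 = [0.0, 0.0, 0.0]
c2 = [0.2, 0.0, 0.0]
# Ray directions that both point at a pupil located at (0.1, 0.0, 2.0).
d1 = [0.1, 0.0, 2.0]
d2 = [-0.1, 0.0, 2.0]

pupil = triangulate_midpoint(c1, d1, c2, d2)
print(pupil)
```

The parallel-ray check matters in practice: the further away the point is compared to the distance between the cameras, the closer the rays are to parallel and the less reliable the estimate becomes.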

In practice this technique is very slow if you want to reconstruct a complete point cloud instead of a single point, because for each pixel in the left camera you need to find a correspondence in the right camera, and in most cases the correspondence search needs to be accurate at the sub-pixel level. The basic idea for speeding this up is to use all the calibration data to transform both camera images so that all correspondences lie on the same image line. This is already the case in our simplified example in Figure 4. The correspondence search then boils down to finding the Δx for each pixel (Δx is usually called disparity). With a standard camera image pair without any corrections, the space of potential correspondences is a curve that differs for every pixel position because of the typical lens distortion. The article on stereo reconstruction gives an overview of more sophisticated image rectification and matching techniques for correspondence search.
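For a rectified image pair, the search for a single pixel can be sketched as follows. This is a minimal illustration and not the matching technique of any particular system. It assumes grayscale scanlines given as plain lists, integer disparities only (no sub-pixel refinement), a sum-of-absolute-differences cost, and the convention that a feature at x in the left image appears at x - Δx in the right image. The conversion Z = f · B / Δx holds for an ideal rectified setup with focal length f in pixels and baseline B. All function names and numbers are hypothetical.

```python
def match_scanline(left_row, right_row, x, half_window, max_disparity):
    """Integer disparity of pixel x in left_row, found by minimizing the
    sum of absolute differences (SAD) over a small window in right_row."""
    best_disparity, best_cost = 0, float("inf")
    for disparity in range(max_disparity + 1):
        if x - disparity - half_window < 0:
            break  # the window would leave the right image
        cost = sum(
            abs(left_row[x + k] - right_row[x - disparity + k])
            for k in range(-half_window, half_window + 1)
        )
        if cost < best_cost:
            best_cost, best_disparity = cost, disparity
    return best_disparity


def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth along the optical axis for an ideal rectified stereo pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


# The same intensity peak, shifted by three pixels between the two views.
left_row = [10, 10, 10, 10, 10, 50, 90, 50, 10, 10, 10, 10]
right_row = [10, 10, 50, 90, 50, 10, 10, 10, 10, 10, 10, 10]

disparity = match_scanline(left_row, right_row, x=6, half_window=1, max_disparity=5)
depth_m = depth_from_disparity(focal_px=800.0, baseline_m=0.1, disparity_px=disparity)
print(disparity, depth_m)
```

Because depth is inversely proportional to disparity, a one-pixel matching error costs far more depth accuracy for distant points than for near ones, which is why sub-pixel accuracy matters.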

How to find good correspondences

Even so, the correspondence search remains a very difficult problem, especially in image areas that have no structure (like a plain wall without any visible pattern) or that have repetitive structures. In our example, suppose we confuse the left and the right pupil (see Figure 6). In this case our estimated pupil would lie behind the cat. This problem can be overcome either by consistency checks in the matching algorithm to detect outliers or by replacing the second camera with a system that can generate unique structures.
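One widely used consistency check is the left-right check, shown here as an example of the idea and not as the specific check any given matcher applies. We match once from left to right and once from right to left, and keep a disparity only if both searches agree. The sketch assumes one scanline of integer disparities per direction, with the convention that pixel x in the left image maps to x - Δx in the right image; the function name and data are hypothetical.

```python
def left_right_check(disp_left, disp_right, tolerance=1):
    """Keep a left disparity only if the right image's disparity at the
    matched position agrees within `tolerance`; otherwise store None."""
    checked = []
    for x, d in enumerate(disp_left):
        x_right = x - d  # where the left pixel lands in the right image
        if 0 <= x_right < len(disp_right) and abs(disp_right[x_right] - d) <= tolerance:
            checked.append(d)
        else:
            checked.append(None)  # outlier or match outside the image
    return checked


# Illustrative scanline: the true disparity is 2 everywhere, but pixel 5
# of the left image was mismatched and reports 4.
disp_left = [2, 2, 2, 2, 2, 4, 2, 2]
disp_right = [2, 2, 2, 2, 2, 2, 2, 2]

result = left_right_check(disp_left, disp_right)
print(result)
```

The first two pixels are rejected as well, because their matches would fall outside the right image. Rejected pixels are typically left empty or filled in from neighboring valid values.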

Laser Vision

A very simple approach would be to use a laser pointer instead of a camera. We would determine the direction and origin of the laser beam by calibration. The laser can then be regarded as a second camera with a single fixed pixel, and only the location of the laser point needs to be detected in the image of the remaining camera. All the triangulation properties from above still apply. The detection process can be much simpler if you use, for instance, an optical band-pass filter on the camera that is tuned to the wavelength of the laser. We could also replace the 2-D sensor matrix of the camera with a single line sensor, since we know that the point correspondence can only appear on a line. The algorithmic part is much easier, but if we want to measure more than one point, we need to move the laser pointer in order to scan the whole scene. This adds mechanical complexity, and we can no longer deal with dynamic scenes. So this solution is very good for static scenes where we need very high precision and have a high hardware budget.

The next logical step is to choose a line laser instead of a point laser. With this setup we can get at least the 3-D data of a whole line instead of a single point. The basic math again applies and the calibration of the laser to the camera is also very similar. This solution can also be used in a laser scanner by tilting the laser fan or moving our device on a calibrated trajectory, but it’s also a great solution if we need to measure objects that move with a known velocity in a single direction.
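With a line laser, the second ray is replaced by the plane that the laser fan spans, so each detected laser pixel yields a 3-D point by intersecting its camera ray with that plane. The sketch below assumes the plane is known from calibration as a point on the plane plus a normal vector, both in camera coordinates, and that the ray direction of the pixel comes from the intrinsic calibration. The names and numbers are hypothetical.

```python
def dot(a, b):
    """Dot product of two 3-D vectors given as sequences."""
    return sum(x * y for x, y in zip(a, b))


def intersect_ray_plane(ray_origin, ray_dir, plane_point, plane_normal):
    """Point where the ray origin + t * dir meets the laser plane."""
    denom = dot(plane_normal, ray_dir)
    if abs(denom) < 1e-12:
        raise ValueError("ray is (nearly) parallel to the laser plane")
    offset = [p - o for p, o in zip(plane_point, ray_origin)]
    t = dot(plane_normal, offset) / denom
    return [o + t * k for o, k in zip(ray_origin, ray_dir)]


# Illustrative calibration: camera at the origin, laser plane through
# (0.2, 0, 0) and tilted towards the optical axis (plane: x + 0.1*z = 0.2).
camera_center = [0.0, 0.0, 0.0]
plane_point = [0.2, 0.0, 0.0]
plane_normal = [1.0, 0.0, 0.1]

# Viewing ray of one pixel on the detected laser line.
ray_dir = [0.0, 0.05, 1.0]

point = intersect_ray_plane(camera_center, ray_dir, plane_point, plane_normal)
print(point)
```

Repeating this for every pixel along the detected laser line gives one 3-D profile per image. Tilting the fan or moving the object then stacks these profiles into a surface.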

Structured Light

If we need to capture whole scenes in a single shot, we can turn our passive stereo setup into an active system by adding a structure projector, which projects a pattern onto the scene and creates artificial structure even in regions without texture. We can also calibrate the projection unit to the system and get rid of one camera.