(Hoiem, Efros, Hebert, 2008)
This paper is fantastic. It combines low-level object detection, rough 3D scene geometry, and camera viewpoint (orientation and horizon line). It assumes the photo shows a 3D scene and makes predictions about that scene from the 2D image alone.
I tried to imagine how the algorithm would work before reading it, and I imagined it would mostly look for circles and angles in the image. For example, a building along a street would have lots of lines converging towards a point, and I would base the horizon line on that point. But that’s nearly backwards from the actual method they used.
First, the scene is broken into components by object, based on contour and color variation.
It guesses where something might be before looking. For example, a big patch of the same color at the top is probably sky, but the same patch at the bottom is probably street surface. And if it is a street, you can guess there will be people and cars on it. A great deal of code already exists for identifying people in a scene. People have lots of strange angles and predictable proportions.
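To make the "guess before looking" idea concrete, here is a toy sketch of a location prior. It is my own illustration, not the paper's model: the linear ramp and the names label_prior and most_likely_label are made up, and a real system would learn these probabilities from training images.

```python
def label_prior(y_center, image_height):
    """Toy prior over {'sky', 'ground'} for a region centered at row y_center.

    Rows count downward from 0 (top) to image_height (bottom).
    ASSUMPTION: a simple linear ramp, chosen only for illustration.
    """
    t = min(max(y_center / image_height, 0.0), 1.0)  # 0 at top, 1 at bottom
    return {"sky": 1.0 - t, "ground": t}


def most_likely_label(y_center, image_height):
    """Return the label with the highest prior probability."""
    prior = label_prior(y_center, image_height)
    return max(prior, key=prior.get)


if __name__ == "__main__":
    print(most_likely_label(50, 480))   # region near the top of the frame
    print(most_likely_label(430, 480))  # region near the bottom of the frame
```

A prior like this only narrows the search. Color, texture, and detector scores would still have to confirm or overrule it.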
Second, the scene is broken into scenery (buildings, trees, sky). Building windows are very good diagnostics, as are cars (both correlate strongly with streets and trees).
The horizon line, the key to all perspective, is determined from the relationships between objects. The whole process runs iteratively, tossing outliers and refining its guesses.
People are assumed to be approximately the same size, as are cars, so any proportional diminishing can be attributed to perspective.
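Here is a rough sketch of how the same-size assumption can pin down the horizon, with the outlier tossing included. It is a simplification I wrote to understand the idea, not the paper's actual inference. It assumes a level camera and flat ground, so that an object's pixel height grows linearly with how far its base sits below the horizon: height = a * (bottom - v0). The function name fit_horizon, the 1.7 m person height, and the thresholds are all my own assumptions.

```python
import numpy as np


def fit_horizon(bottoms, heights, object_height_m=1.7, n_iter=5, k=2.5,
                min_scale=1.0):
    """Fit heights = a * (bottoms - v0), rejecting outliers iteratively.

    bottoms: image row of each object's base (rows increase downward).
    heights: image height of each object, in pixels.
    Returns (v0, camera_height_m, inlier_mask).
    ASSUMPTIONS: level camera, flat ground, all objects the same real height.
    """
    bottoms = np.asarray(bottoms, dtype=float)
    heights = np.asarray(heights, dtype=float)
    keep = np.ones(len(bottoms), dtype=bool)

    for _ in range(n_iter):
        a, b = np.polyfit(bottoms[keep], heights[keep], 1)
        resid = np.abs(heights - (a * bottoms + b))
        # Robust scale: median absolute residual, floored at min_scale pixels.
        scale = max(np.median(resid[keep]), min_scale)
        new_keep = resid < k * scale
        if new_keep.sum() < 2 or np.array_equal(new_keep, keep):
            break
        keep = new_keep

    # Final fit on the surviving detections.
    a, b = np.polyfit(bottoms[keep], heights[keep], 1)
    v0 = -b / a                          # row where the fitted height hits zero
    camera_height_m = object_height_m / a
    return v0, camera_height_m, keep


if __name__ == "__main__":
    # Five consistent "people" plus one oversized detection (e.g. a billboard).
    bottoms = [250, 300, 350, 400, 450, 380]
    heights = [25, 50, 75, 100, 125, 200]
    v0, cam_h, inliers = fit_horizon(bottoms, heights)
    print(v0, cam_h, inliers)
```

The nice part is that the horizon falls out of the object sizes alone, with no vanishing lines needed, which matches the "nearly backwards" surprise I mentioned above.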
Now the objects can be rescaled and the entire scene laid out in relative 3D locations.
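Once a horizon row and camera height are known, placing an object in 3D is a few lines of pinhole-camera arithmetic. This is a sketch under my own assumptions (level camera, flat ground, a known focal length in pixels), and the function name place_on_ground and its parameters are hypothetical, not taken from the paper.

```python
def place_on_ground(u, v_bottom, h_img, v0, camera_height, focal_px, u0=0.0):
    """Return (lateral, depth, height) of an object standing on the ground.

    u, v_bottom: image column and row of the object's base (rows go downward).
    h_img: object height in pixels.  v0: horizon row.  u0: principal column.
    ASSUMPTIONS: pinhole camera, level with a flat ground plane.
    Output units match camera_height (e.g. meters).
    """
    dv = v_bottom - v0
    if dv <= 0:
        raise ValueError("object base must lie below the horizon")
    depth = focal_px * camera_height / dv    # farther objects sit nearer the horizon
    lateral = (u - u0) * depth / focal_px    # sideways offset from the optical axis
    height = h_img * camera_height / dv      # real-world height of the object
    return lateral, depth, height


if __name__ == "__main__":
    print(place_on_ground(u=420, v_bottom=300, h_img=50, v0=200,
                          camera_height=3.4, focal_px=800, u0=320))
```

Note that the height estimate does not need the focal length at all, only the horizon and the camera height. The depth and sideways position do need it.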
This method does not work well for portraits, macro/detail shots, or scenes with occluded sky and ground. For cases like these, the model must be trained with relevant sample images.
Really amazing how accurate this is, and it just keeps improving. This would be very interesting applied to Gigapan images of cities. Could it identify the altitude and direction of the camera, despite the distortion around the outside?
Look here for Gigapan samples: http://gigapan.com/gigapans/33411
It is very difficult for human eyes to pick out people and cars from a distance, but the images are so detailed that computer vision algorithms should have no trouble.