We present a method for converting first-person videos, for example, captured with a helmet camera during activities such as rock climbing or bicycling, into hyperlapse videos: time-lapse videos with a smoothly moving camera.
At high speed-up rates, simple frame sub-sampling coupled with existing video stabilization methods does not work, because the erratic camera shake present in first-person videos is amplified by the speed-up.
We have all seen helmet videos from skydivers (if you haven’t, Jeb Corliss has one of the best). More recently, helmet cams have emerged for bicyclists, surfers, and even pets! I have even spotted helmet cameras on my jogs around my relatively mundane neighborhood. Normally these videos are watched at an increased speed (who wants to watch a 45-minute ride for 30 seconds of action?), but the sped-up footage is painful to view. Hyperlapse is a newly created method to stabilize and smooth out these videos.
Johannes Kopf, Michael Cohen, and Richard Szeliski developed the new method to generate the smoother video. The process (see the technical video below) is substantially more complicated than the familiar, commonly used stabilizer functionality. The new system consists of three stages: scene reconstruction, path planning, and image-based rendering. Scene reconstruction builds a 3D model of the scene, leveraging multiple frames from the video to do so. This gives the system the ability to actually change the viewpoint in the resulting rendering, replacing an abrupt viewpoint change with a smoother one. This is one of the key properties that allow the system to generate such silky-smooth videos. Path planning is split into two stages. The first optimizes for smoothness, length, and approximation (the path should stay near the input camera positions). The second optimizes for rendering quality. The resulting path can differ slightly from the path actually taken by the camera person (or pet!), but it will still be approximately the same. The final step is rendering the video. Because each new viewpoint can differ slightly from the original video, the system merges multiple input frames together, selecting the areas of each frame that give the best quality in the resulting video.
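To make the first path-planning stage a little more concrete, here is a minimal sketch of what trading off smoothness, length, and approximation could look like. This is not the authors' optimizer: the function names, the weights, the squared-difference form of each term, and the plain gradient descent are all assumptions chosen for illustration, and the real system also plans camera orientation and rendering quality, which this toy ignores.

```python
# Hypothetical sketch, NOT the method from the paper: names, weights, and the
# quadratic form of each term are illustrative assumptions.

def path_energy(path, inputs, w_length=1.0, w_smooth=10.0):
    """Sum of three penalty terms for a candidate camera path."""
    n = len(path)
    dims = range(len(inputs[0]))
    # Approximation: stay near the input camera positions.
    approx = sum((path[i][d] - inputs[i][d]) ** 2
                 for i in range(n) for d in dims)
    # Length: penalize distance between consecutive positions.
    length = sum((path[i + 1][d] - path[i][d]) ** 2
                 for i in range(n - 1) for d in dims)
    # Smoothness: penalize second differences (abrupt changes in motion).
    smooth = sum((path[i + 1][d] - 2 * path[i][d] + path[i - 1][d]) ** 2
                 for i in range(1, n - 1) for d in dims)
    return approx + w_length * length + w_smooth * smooth


def smooth_path(inputs, w_length=1.0, w_smooth=10.0, iters=2000):
    """Minimize path_energy by gradient descent, starting from the inputs."""
    n, dims = len(inputs), len(inputs[0])
    path = [list(p) for p in inputs]
    # Conservative step size: one over an upper bound on the curvature of
    # the quadratic energy, so each iteration lowers the energy.
    step = 1.0 / (2.0 * (1.0 + 4.0 * w_length + 16.0 * w_smooth))
    for _ in range(iters):
        grad = [[0.0] * dims for _ in range(n)]
        for d in range(dims):
            for i in range(n):  # approximation term
                grad[i][d] += 2.0 * (path[i][d] - inputs[i][d])
            for i in range(n - 1):  # length term
                diff = path[i + 1][d] - path[i][d]
                grad[i + 1][d] += 2.0 * w_length * diff
                grad[i][d] -= 2.0 * w_length * diff
            for i in range(1, n - 1):  # smoothness term
                acc = path[i + 1][d] - 2.0 * path[i][d] + path[i - 1][d]
                grad[i + 1][d] += 2.0 * w_smooth * acc
                grad[i][d] -= 4.0 * w_smooth * acc
                grad[i - 1][d] += 2.0 * w_smooth * acc
        for i in range(n):
            for d in range(dims):
                path[i][d] -= step * grad[i][d]
    return [tuple(p) for p in path]


# A made-up "shaky" 2D path: steady forward motion with side-to-side jitter.
shaky = [(float(i), 0.5 if i % 2 == 0 else -0.5) for i in range(20)]
smoothed = smooth_path(shaky)
```

Raising `w_smooth` flattens the jitter more aggressively, while the approximation term keeps the new path from drifting far from where the camera actually was. That tension is the same one described above: a path that is smoother than the original but still approximately the same.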
The result is quite amazing, but there are still some artifacts you can notice when watching the videos. Stepping through a video frame by frame, you will notice that objects can suddenly appear and that the boundaries where images are merged are easily identifiable. These artifacts are hard to notice when viewing at full speed, though.
The new technique is very resource intensive. The research paper mentions that it took roughly 305 hours to process a 10-minute video! Most of the computation time is consumed during source selection, which runs at roughly one minute per frame. I suspect that cloud computing (such as Amazon Web Services and Azure) will be heavily used so that even a mobile phone app can take part in the video editing process. It will be interesting to see how this video editing will be used!