To find a suitable transformation between two consecutive frames, several approaches were explored. First, the SIFT (Scale-Invariant Feature Transform) algorithm was used to find a single rotation describing the entire transformation from one frame to the next. SIFT finds multiple distinctive, invariant features in each frame, so that these features can be matched reliably between subsequent frames. For this particular application SIFT performs very well, finding around 160 keypoints in each frame. About half of them find a correct match among the keypoints in the following frame. For example, for the two images shown in Fig.B (one taken just before the pop-up event and the other with a point already popped up), 76 matches were found. Rough averaging over all matches gave a rotation angle of 0.547°.

Figure B. Multiple matching points found by the SIFT algorithm.

It is worth noting that, in contrast to some earlier applications (video scoring of a descending payload), the SIFT algorithm worked quite well in this application and can therefore be used to find multiple matches between two frames very reliably. This is because each frame has a very distinctive texture.
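The single-rotation estimate described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `estimate_rotation_deg` is hypothetical, the matched keypoints are assumed to be available as lists of (x, y) pairs, and a least-squares rotation about the centroids is swapped in for the "rough averaging" used in the text, whose exact form is not given. Note that in image coordinates (y pointing down) a positive angle appears clockwise on screen.

```python
import math

def estimate_rotation_deg(pts_a, pts_b):
    """Least-squares rotation angle (degrees) that takes pts_a onto pts_b.

    Both inputs are lists of (x, y) pairs; pts_a[i] is matched to pts_b[i].
    Translation is removed by centering each set on its centroid.
    """
    n = len(pts_a)
    if n != len(pts_b) or n < 2:
        raise ValueError("need at least two matched pairs of equal count")

    # Centroids of both point sets.
    cax = sum(p[0] for p in pts_a) / n
    cay = sum(p[1] for p in pts_a) / n
    cbx = sum(p[0] for p in pts_b) / n
    cby = sum(p[1] for p in pts_b) / n

    s_dot = 0.0    # accumulates |a||b| cos(angle)
    s_cross = 0.0  # accumulates |a||b| sin(angle)
    for (xa, ya), (xb, yb) in zip(pts_a, pts_b):
        dxa, dya = xa - cax, ya - cay
        dxb, dyb = xb - cbx, yb - cby
        s_dot += dxa * dxb + dya * dyb
        s_cross += dxa * dyb - dya * dxb

    return math.degrees(math.atan2(s_cross, s_dot))
```

Incorrect matches would bias such an estimate, so in practice the mismatched pairs would have to be rejected first; that step is not shown here.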

The matching pairs of points found with the SIFT algorithm or by other means (hereinafter called control points) can then be used to infer a more general spatial transformation, that is, an inverse mapping from output space (x,y) to input space (u,v), according to the chosen transform type. Several transform types can be used in this application to find a better match between two images. The first three, which require the fewest control-point pairs, are listed below:

- ‘Linear conformal’ – this transformation assumes that shapes in the input image are unchanged, but the image is distorted by some combination of translation, rotation, and scaling, i.e., straight lines remain straight and parallel lines remain parallel;
- ‘Affine’ – this transformation assumes that shapes in the input image exhibit shearing, i.e., straight lines remain straight and parallel lines remain parallel, but rectangles become parallelograms. In an affine transformation, the x and y dimensions can be scaled or sheared independently, and there can be a translation as well;
- ‘Projective’ – this transformation applies when the scene appears tilted (which is what happens when the point of view of the observer changes). Straight lines remain straight, but parallel lines converge toward vanishing points that might or might not fall within the image.
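As an illustration of how control points determine the simplest of these transforms, the sketch below fits a linear conformal transformation by least squares. It is not the MATLAB implementation: the function names `fit_linear_conformal` and `apply_linear_conformal` are hypothetical, and points are treated as complex numbers so that the transform reads w = a·z + b, with |a| the scale, arg(a) the rotation, and b the translation. This model has four unknowns, so two control-point pairs are the minimum; the affine (six unknowns) and projective (eight unknowns) models need at least three and four pairs, respectively.

```python
import cmath
import math

def fit_linear_conformal(src, dst):
    """Least-squares fit of w = a*z + b from matched (x, y) control points.

    Returns the complex pair (a, b): abs(a) is the scale, cmath.phase(a)
    the rotation in radians, and b the translation.
    """
    n = len(src)
    if n != len(dst) or n < 2:
        raise ValueError("need at least two control-point pairs")
    z = [complex(x, y) for x, y in src]
    w = [complex(x, y) for x, y in dst]
    zc = sum(z) / n
    wc = sum(w) / n
    den = sum(abs(zi - zc) ** 2 for zi in z)
    if den == 0:
        raise ValueError("source points must not all coincide")
    a = sum((zi - zc).conjugate() * (wi - wc) for zi, wi in zip(z, w)) / den
    b = wc - a * zc
    return a, b

def apply_linear_conformal(a, b, pt):
    """Map one (x, y) point through w = a*z + b."""
    w = a * complex(pt[0], pt[1]) + b
    return (w.real, w.imag)

# Usage: recover a known scale and rotation from four synthetic pairs.
a_true = 1.02 * cmath.exp(1j * math.radians(0.547))
b_true = complex(3.0, -2.0)
src_pts = [(0.0, 0.0), (100.0, 0.0), (0.0, 80.0), (50.0, 40.0)]
dst_pts = [apply_linear_conformal(a_true, b_true, p) for p in src_pts]
a_fit, b_fit = fit_linear_conformal(src_pts, dst_pts)
scale_fit = abs(a_fit)
rotation_fit_deg = math.degrees(cmath.phase(a_fit))
```

The inverse mapping from output space back to input space is z = (w − b)/a, which is the direction needed when resampling the corrected image.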

The following example uses multiple pairs of control points assigned with the MATLAB Control Point Selection Tool. When called, the tool allows pairs of control points to be assigned either manually (as shown in Fig.C) or in a prediction mode (available after at least two points have been assigned manually), in which the potential matches are predicted automatically (as shown in Fig.D).

Figure C. Picking the pairs of control points manually.

Figure D. Picking the pairs of control points semi-automatically.

After an appropriate number of points has been assigned, the algorithm proceeds with their correction (based on cross-correlation) and computes the corresponding transformation. Figure E presents an example in which the linear conformal transformation was used to correct the input image (shown at the top right). The corrected image (bottom right) reveals the parameters of the transformation (note that all images were cropped by 10 pixels from the left, top, and bottom and by 20 pixels from the right to eliminate the black border).
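The cross-correlation correction of a control point can be sketched as below. This is a minimal illustration rather than the toolbox routine: the names `ncc`, `patch`, and `refine_point` are hypothetical, frames are assumed to be grayscale images stored as lists of rows, and the window half-size and search radius are arbitrary choices. The idea is to shift the point picked in the input frame by a few pixels until the neighborhood around it correlates best with the neighborhood around its partner in the base frame.

```python
import math

def ncc(patch_a, patch_b):
    """Normalized cross-correlation of two equal-size patches (lists of rows)."""
    a = [v for row in patch_a for v in row]
    b = [v for row in patch_b for v in row]
    ma = sum(a) / len(a)
    mb = sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den if den > 0 else 0.0

def patch(img, cx, cy, half):
    """Square window of side 2*half+1 centered on column cx, row cy."""
    if (cx - half < 0 or cy - half < 0 or
            cy + half >= len(img) or cx + half >= len(img[0])):
        raise ValueError("window extends outside the image")
    return [row[cx - half:cx + half + 1] for row in img[cy - half:cy + half + 1]]

def refine_point(base, inp, base_pt, input_pt, half=2, search=2):
    """Move input_pt by up to `search` pixels to maximize NCC with base_pt."""
    template = patch(base, base_pt[0], base_pt[1], half)
    best_score, best_pt = -2.0, input_pt  # NCC is never below -1
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cand = (input_pt[0] + dx, input_pt[1] + dy)
            score = ncc(template, patch(inp, cand[0], cand[1], half))
            if score > best_score:
                best_score, best_pt = score, cand
    return best_pt
```

This sketch refines to whole pixels only and requires the whole search window to lie inside the image.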

Shown at the bottom left of Fig.E is the raw difference between the base and input (uncorrected) frames. If the same difference is taken with the corrected image instead, it should ideally reveal only the pop-up spots, whose positions within the image frame could then be extracted very easily. Figure F presents this corrected difference for the linear conformal transformation. The picture is clearly cleaner than the bottom-left plot in Fig.E: the average intensity dropped from 3.73 in Fig.E to 1.47 in Fig.F. Note that some remnants are still left in Fig.F, indicating that an affine or even projective transformation might be a better choice.

Figure E. Applying the correction to the input image.

Figure F. Difference between base and corrected image.
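The comparison of average intensities between Fig.E and Fig.F can be reproduced in outline as follows. This is a sketch under assumptions: "average intensity" is taken here to mean the mean absolute per-pixel difference between two grayscale frames, which the text does not state explicitly, and the function names `crop` and `mean_abs_difference` are hypothetical. Frames are plain lists of rows.

```python
def crop(frame, left, top, right, bottom):
    """Remove the given number of pixels from each side of a frame."""
    height = len(frame)
    return [row[left:len(row) - right] for row in frame[top:height - bottom]]

def mean_abs_difference(base, other):
    """Mean absolute per-pixel difference of two equal-size grayscale frames."""
    if len(base) != len(other) or any(len(r) != len(s) for r, s in zip(base, other)):
        raise ValueError("frames must have the same size")
    total = 0
    count = 0
    for row_b, row_o in zip(base, other):
        for pb, po in zip(row_b, row_o):
            total += abs(pb - po)
            count += 1
    if count == 0:
        raise ValueError("frames are empty")
    return total / count

# Usage on a tiny synthetic pair; with real frames the crop described in the
# text would be crop(frame, left=10, top=10, right=20, bottom=10).
base_frame = [[10, 10], [10, 10]]
input_frame = [[10, 14], [8, 10]]
score = mean_abs_difference(base_frame, input_frame)
```

A lower value after correction indicates that the chosen transformation aligns the two frames better, which is how the different transform types could be compared.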

Fig.F makes it clear that anything new that pops up should be easier to detect. Another useful outcome of correcting the frames so that they have the same orientation as the base frame is that the motion of the observer (aerial platform) can be excluded from consideration entirely. When played back, the corrected frame sequence produces a “stabilized” video in which all surveyed reference points remain at the same spots. This may also help in blending the visual data with the IR data later on (the IR imagery will need to undergo the same processing).

Of course, the SIFT algorithm and the MATLAB Control Point Selection Tool are not the only options for finding multiple pairs of control points. The frame positions of the surveyed reference points may serve as control points as well. (Besides, we need to know the positions of these points on each frame anyway.)