|The so-called pose estimation problem, i.e. when in addition to a 3D position estimation we need to solve for an attitude of the test article as well, is much more challenging as compared to estimating just the 3D position. There are several algorithms based on the geometry-based features of an object to track exist. However, this specific application is unique in a sense that the payload itself resides in a very small box (maybe 30 pixels by 40 pixels out of 240x320-pixel frame), has very little distinctive features (texture) on it, and, on top of that, there are huge compression artifacts. Nevertheless, several algorithms have been tested and software allowing to resolve the pose estimation problem (IF the location of multiple points is known) has been developed. The developed algorithms were built around either POSIT (Pose from Orthography and Scaling with Iterations) or SIFT (Scale-Invariant Feature Transform) software.|
The POSIT derived algorithm (originally developed by Daniel DeMenthon. It was found that if three or more payload’s features with the known geometry can be unambiguously determined (extracted from each frame), then pose estimation becomes possible (the created Simulink model is shown on Fig. I). Specifically, it was shown that for the four non-coplanar points this algorithm works fairly well. Obviously, the larger the base (distance) between these features, the more accurate results will POSIT (or any other) algorithm produce. Therefore, the corners of the payload are the best candidate points to track. As an example, Figure J shows a sequence of frames with four candidate points being chosen on each frame (manually).
Figure I. The developed software based on the DeMenthon’s algorithm.
|Figure K presents the results of processing the sequence of the frames shown in Fig.J by the model of Fig.I. No IMU/GPS data was available for this drop to compare the obtained results against, but qualitatively they look quite reasonable. Figure L represents the outputs of the model when applied to some simulated payload with attitude and pixel noise. As seen the model performs very good assuring an Euler angles estimate error of less than half-a-degree. What is also important is that since the algorithms solves the complete pose estimation problem (not just attitude estimation), the ground camera knows precisely where to look for the payload at the very next moment (there is an active feedback loop in Fig.I). It results in keeping the payload in the center of the camera frame all the time ensuring the best performance possible.|
Figure J. Digitized video sequence.
Figure K. Results of real data processing.
Figure L. Processing data from the virtual (emulated) payload.
|To summarize, the POSIT-based algorithm exhibits a very good performance. The only problem is with having four non-coplanar corners of the payload in the camera’s field of view all the time. Even if multiple cameras look at the descending payload from three or four different directions, it is not possible to see four non-coplanar points by at least one camera all the time. There are periods of uncertainty, when only three (planar) points are observed by each camera. Firstly, it implies that the algorithm should include some filtering to predict a position (movement) of each point, so that it could be handed to the next camera to assure that is has a complete four-point constellation. Secondly, it suggests placing KTMs around the impact point evenly to minimize periods of uncertainty (which of course is quite problematic). That is why another approach, discussed next, was also investigated.|
Another piece of software was build around the SIFT algorithm developed by David Lowe. Evolution Robotics ViPR™ technology (ViPR stands for visual pattern recognition) incorporates SIFT in their commercial products already and provides a reliable and robust vision solution that truly gives different electronic devices the ability to detect and recognize complex visual patterns. The idea of the SIFT approach as applied to payload pose estimation is that rather than tracking the 4+ non-coplanar reference points with the known geometry, we should somehow try to find much more distinctive invariant feature point on each frame that could be used to perform reliable matching between subsequent frames.
Several hundred of such features (corners of an object or its unique texture patterns) are automatically extracted and stored in a database for each frame. Then a match within an extremely large set of possible candidates is fulfilled. The algorithm to select the correct candidate is similar to a voting mechanism - each feature votes for the candidate, which includes a similar feature. The correct candidate receives the largest number of votes since most of the features are in agreement. However, a single or a few votes might be incorrectly cast on wrong candidates. The likelihood that a large number of votes are cast on the wrong candidate is supposedly to be small, demonstrating that the algorithm should be very reliable in selecting the correct match.
Figure M shows an example of multiple features found by SIFT for one of the frames shown in Fig.J. For this particular frame there were 69 automatically defined points (versus only four needed for the POSIT algorithm). Suppose one fourth of the found features will not find their matches on the subsequent frame – we should still have about fifty other matches to rely on. (Compare it to what would happen when a single point in a four-point scheme is missing or found incorrectly.)
Figure M. One of the real drop frames featuring 69 scale-invariant key points
However, as the following analysis revealed, the SIFT algorithm does not work as good as predicted with the poor-quality video we are dealing with. Figure N demonstrates that the hope that among those 69 features found on each image (Fig.M) the majority will find their match on the subsequent images did not prove true. On the contrary, less than one fourth of the found features finds their matches on the subsequent frames. To this end, Fig.Na compares Frame 1 vs. Frame 2 (taken within 1/30=0.03sec). It shows only 17 matching key points. The comparison of Frame 1 and Frame 3 (2/30=0.07sec apart) reveals as little as 7 matching key points (Fig.Nb). Frame 1 vs. Frame 4 (3/30=0.1sec) yields 5 matching key points (Fig.Nc), and Frame 1 vs. Frame 8 (7/30=0.23sec) – a single matching key point (Fig.Nd). In just eight frames (about a quarter of a second apart) there is no match at all! Moreover, the matching points found on the further-apart frames are not necessarily show up on the closer-apart frames, i.e. there is no consistency.
| || |
| Figure N. The key features matches on the subsequent images: frames 1 and 2 (a),|
1 and 3 (b), 1 and 4 (c), 1 and 8 (d).
The same algorithm was applied to process video data from the drop with supposedly more distinctive patterns on the payload (different-color square plates were attached to each side of payload as shown in Fig.O). Moreover, the camera resolution was doubled as well providing 640pixel x 480pixel frames instead of 320pixel x 240pixel frames as in Figs.M and N. Unfortunately, these efforts were not repaid completely. The number of scale-invariant points for each frame (Fig.P) turned out to be about the same as before (compare it with Fig.M). Fortunately, the matching between subsequent frames in this case was slightly better (27, 18, 13, and 7 points in Fig.Q compared to 17, 7, 5 and 1 in Fig.N). This allowed estimating the attitude of the payload more or less accurately and reliable for the entire drop.
Figure O. Payload with more distinctive features.
Figure P. Marked payload’s image featuring 67 scale-invariant key points.
| Figure Q. The key features matches on the subsequent images: frames 1 and 2 (a),|
1 and 3 (b), 1 and 4 (c), 1 and 8 (d).
Further development might include using another popular piece of software used in the computer-vision community, which is Kanade-Lucas-Tomasi feature tracker.