Gestures without Libraries, Toolkits or Training: A $1 Recognizer for User Interface Prototypes (Wobbrock, et al – 2007)

31 March 2008

Current Mood:

Blogs I Commented On:


Summary:
The authors created an easy, cheap, and highly portable gesture recognizer called the $1 Gesture Recognizer. The algorithm requires only about one hundred lines of code and relies on nothing more than basic geometry and trigonometry. Its contributions include being easy to implement for novice user interface prototypers, serving as a measuring stick against more advanced algorithms, and giving insight into which gestures are “best” for people and computer systems. Challenges for gesture recognizers in general include being resilient to sampling variations, supporting optimal and configurable rotation, scale, and position invariance, requiring no advanced math techniques, being easy to write in a few lines of code, being teachable with a single example, returning an N-best list with sensible scores that are independent of the number of points, and providing recognition rates competitive with more advanced algorithms.

$1 copes with those challenges in a four-step algorithm: 1) re-sample the gesture to N points, where 32 <= N <= 256, 2) rotate once based on the indicative angle, which is the angle formed between the gesture’s centroid and its starting point, 3) scale non-uniformly to a reference square and translate so that the centroid sits at the origin, and 4) recognize by searching for the rotation angle that yields the best score against each template. An analysis of rotation invariance shows that there’s no guarantee that candidate points and template points will optimally align after rotating the indicative angle to 0 degrees, so $1 uses a Golden Section Search (GSS), which narrows the search range using the Golden Ratio, to find the optimal matching angle.
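To make the four steps concrete, here's a rough Python sketch of how I picture the preprocessing and the GSS scoring; the function names, N = 64, and the reference square size are my own choices, not the authors' reference pseudocode.

```python
# Rough sketch of $1-style preprocessing and Golden Section Search scoring.
# Function names, N = 64, and SQUARE_SIZE are my own choices, not the paper's code.
import math

N = 64                            # resampled point count (paper allows 32 <= N <= 256)
SQUARE_SIZE = 250.0               # reference square for non-uniform scaling
PHI = 0.5 * (-1 + math.sqrt(5))   # Golden Ratio used by the search

def path_length(points):
    return sum(math.hypot(x1 - x0, y1 - y0)
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def centroid(points):
    return (sum(p[0] for p in points) / len(points),
            sum(p[1] for p in points) / len(points))

def resample(points, n=N):
    interval = path_length(points) / (n - 1)
    D, new_pts = 0.0, [points[0]]
    pts = list(points)
    i = 1
    while i < len(pts):
        (x0, y0), (x1, y1) = pts[i - 1], pts[i]
        d = math.hypot(x1 - x0, y1 - y0)
        if D + d >= interval and d > 0:
            t = (interval - D) / d
            q = (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
            new_pts.append(q)
            pts.insert(i, q)      # q becomes the start of the next segment
            D = 0.0
        else:
            D += d
        i += 1
    while len(new_pts) < n:       # guard against round-off at the tail
        new_pts.append(points[-1])
    return new_pts[:n]

def rotate_by(points, angle):
    cx, cy = centroid(points)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return [((x - cx) * cos_a - (y - cy) * sin_a + cx,
             (x - cx) * sin_a + (y - cy) * cos_a + cy) for x, y in points]

def rotate_to_zero(points):
    cx, cy = centroid(points)
    angle = math.atan2(cy - points[0][1], cx - points[0][0])
    return rotate_by(points, -angle)          # indicative angle -> 0 degrees

def scale_and_translate(points):
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    w, h = (max(xs) - min(xs)) or 1e-9, (max(ys) - min(ys)) or 1e-9
    scaled = [(x * SQUARE_SIZE / w, y * SQUARE_SIZE / h) for x, y in points]
    cx, cy = centroid(scaled)
    return [(x - cx, y - cy) for x, y in scaled]   # centroid becomes the origin

def path_distance(a, b):
    return sum(math.hypot(p[0] - q[0], p[1] - q[1]) for p, q in zip(a, b)) / len(a)

def distance_at_best_angle(candidate, template, lo=-math.radians(45),
                           hi=math.radians(45), tol=math.radians(2)):
    # Golden Section Search over the rotation angle for the minimum path distance.
    x1 = PHI * lo + (1 - PHI) * hi
    x2 = (1 - PHI) * lo + PHI * hi
    f1 = path_distance(rotate_by(candidate, x1), template)
    f2 = path_distance(rotate_by(candidate, x2), template)
    while abs(hi - lo) > tol:
        if f1 < f2:
            hi, x2, f2 = x2, x1, f1
            x1 = PHI * lo + (1 - PHI) * hi
            f1 = path_distance(rotate_by(candidate, x1), template)
        else:
            lo, x1, f1 = x1, x2, f2
            x2 = (1 - PHI) * lo + PHI * hi
            f2 = path_distance(rotate_by(candidate, x2), template)
    return min(f1, f2)

def preprocess(raw_points):
    return scale_and_translate(rotate_to_zero(resample(raw_points)))
```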

Limitations of $1 include being unable to distinguish gestures whose identities depend on specific orientations, aspect ratios, or locations; distorting horizontal and vertical lines through its non-uniform scaling; and being unable to differentiate gestures by speed, since it doesn’t use time. To handle variation with $1, new templates can be defined under a single name to capture that variation. A study was done to compare $1 with a modified Rubine classifier and a Dynamic Time Warping (DTW) template matcher. The study showed that $1 and DTW were more accurate than Rubine, and that $1 and Rubine executed faster than DTW.

Discussion:
I guess I should change the discussion a bit because we're now looking at this paper from a GR perspective instead of an SR perspective. Our SR class was quite critical of this paper at the time, given that there were already existing SR algorithms that were more capable. Maybe $1 isn't as bad for GR, since the simplicity of the algorithm could help bring GR-based applications into the mainstream, and since gestures for glove- and wand-based devices probably aren't as complicated to handle as those for pen-based devices. The limitations we noted in the SR class haven't gone away simply because we shifted to GR, but I don't think they're as disadvantageous the second time around. I guess we won't know for sure until we start experimenting with various applications that use $1.

Enabling fast and effortless customization in accelerometer based gesture interaction (Mantyjarvi, et al – 2004)

Current Mood:

Blogs I Commented On:


Summary:
The purpose of this paper is to create a procedure that allows users to customize accelerometer-based gesture control using HMMs. The authors refer to gestures as user hand movements collected with a set of sensors in a handheld device and modeled by machine learning methods. They use HMMs to recognize gestures since HMMs can model time series with spatial and temporal variability. Their system first involves preprocessing, where gesture data is normalized to equal length and amplitude. A vector quantizer is then used to map the three-dimensional acceleration vectors into a one-dimensional sequence of codebook indices, where the codebook was generated from collected gesture vectors using a k-means algorithm. This information is then sent to an HMM with an ergodic topology. A codebook size of 8 and a model state size of 5 were chosen. After vector quantization, the gesture is either used to train the HMM or to evaluate the HMM’s recognition capability. Finally, the authors added noise to test whether copies of gesture data with added noise can reduce the number of training repetitions required from the user when using discrete HMMs. The two noise distributions used were uniform and Gaussian, and various signal-to-noise ratios (SNR) were tried to determine which value provided the best results. The system was evaluated on eight popular gestures applicable to a DVD playback system, and the experiments consisted of finding an optimal threshold value for converging the HMM, examining accuracy rates for different numbers of training repetitions, finding an optimal SNR value, and examining the effects of using noise-distorted signal duplicates in training. With six training repetitions, accuracy was over 95%; the best accuracies for Gaussian and uniformly distributed noise were 97.2% and 96.3%, at SNR = 3 and 5 respectively.
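As a quick illustration of the quantization step, here's a minimal sketch of mapping 3D acceleration vectors to a 1D sequence of codebook indices for a discrete HMM; I'm using scikit-learn's KMeans and made-up data, with only the codebook size of 8 taken from the paper.

```python
# Minimal sketch of vector quantization for discrete-HMM gesture input.
# Codebook size 8 follows the paper; the data and everything else is my assumption.
import numpy as np
from sklearn.cluster import KMeans

CODEBOOK_SIZE = 8

def build_codebook(training_vectors, n_codes=CODEBOOK_SIZE):
    """Cluster all collected 3D acceleration vectors into a codebook."""
    km = KMeans(n_clusters=n_codes, n_init=10, random_state=0)
    km.fit(training_vectors)            # training_vectors: shape (num_samples, 3)
    return km

def quantize(gesture, codebook):
    """Map one gesture (T x 3 acceleration samples) to a 1D index sequence."""
    return codebook.predict(gesture)    # shape (T,), values in 0..n_codes-1

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    all_vectors = rng.normal(size=(500, 3))       # stand-in for collected gesture data
    codebook = build_codebook(all_vectors)
    one_gesture = rng.normal(size=(40, 3))        # stand-in for a normalized gesture
    print(quantize(one_gesture, codebook))        # discrete observation sequence for the HMM
```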

Discussion:
Some thoughts on the paper:
  • It felt like the authors wanted to create a system which allowed users to create customized “macros” for hand motion gestures. It seems like an interesting idea, but my main concern with their system is its robustness for other users who didn’t train these “macros.” It’s a novel idea to incorporate noise into existing training gesture data in order to generalize the system while keeping training repetitions low, but the paper does not tell us how it performs across multiple users. The results may have given really high accuracy rates, but that’s a bit misleading since I didn’t see a separate test set from, say, another user. I do think it’s a fine system for an application meant for that specific user, but it doesn’t seem robust for multiple users. If the latter is desired, I have no idea whether this system will perform well.
  • They tested their system only on 2D gestures. That seems like a waste of the z-axis data, since the same thing could have been done by simply omitting the third dimension. But then, I can’t really imagine truly useful gestures that would take advantage of z-axis data.
  • I think it would be better to have a system where users sketch their gestures in 2D on-screen, and then have the system try to recognize the accelerometer data using existing sketch recognition techniques.

SPIDAR G&G: A Two-Handed Haptic Interface for Bimanual VR Interaction (Murayama, et al – 2004)

26 March 2008

Current Mood: studious

Blogs I Commented On:


Summary:
This is a haptics hardware paper about a haptic interface for two-handed manipulation of objects in a virtual environment. The system in this paper, SPIDAR-G&G, consists of a pair of string-based 6DOF haptic devices called SPIDAR-G. These haptic devices allow translational and rotational manipulation of virtual objects and provide force and torque feedback to the user. It does so by first computing the position and orientation of each grip, which represents a hand, from the measured string lengths, then performing collision detection between virtual objects and the user’s hands, and finally displaying the appropriate force feedback by controlling the tension of each string. Three users familiar with VR interfaces participated in evaluating the system by timing the completion of a 3D pointing task. The four tasks consisted of either one- or two-handed manipulation, with or without haptic feedback. As expected, two-handed manipulation with haptic feedback performed the best.

Discussion:
This paper came out a few years ago while the system was still in its initial phase, but it’s an innovative system for providing physical feedback for two-handed manipulation of virtual objects. At the time of the paper’s publication, it still had a lot of work left to be done, but I can see how it is still open for lots of improvement, unlike the 3D Tractus device. It’s a good start for the VR domain too, though it seems quite bulky and also limited by its strings. Achieving the same system without strings would be another feat in itself though.

Gesture Recognition with a Wii Controller (Scholmer, et al – 2008)

Current Mood: studious

Blogs I Commented On:


Summary:
This paper claims to present a gesture recognition system for the Wiimote, composed of a filter, a k-means quantizer, an HMM, and a Bayes classifier. The system was tested on a set of trivial gestures, and their results did not achieve perfect recognition. That’s all. I’m serious.

Discussion:
The primary reason I didn’t like this paper is that it felt incomplete yet was still accepted. If their chosen input device weren’t new, I don’t think it would have been accepted. It really is lacking a lot of data. I do like this paper though, because our class can do better. I’m sorry that I chose this paper, everyone.

Taiwan sign language (TSL) recognition based on 3D data and neural networks (Lee & Tsai – 2007)

Current Mood: studious

Blogs I Commented On:


Summary:
This is a hand gesture recognition paper which uses a neural network-based approach for the domain of Taiwanese Sign Language (TSL). Data for 20 right-hand gestures was captured using a VICON system and then fed into their neural network, with 15 geometric distances employed as the feature representation of the different gestures. Their backpropagation neural network, implemented in MATLAB, had 15 input units, 20 output units, and two hidden layers in total. Recognition rates for varying numbers of hidden neurons were roughly 90%.
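For reference, here's a minimal sketch of the kind of network described: 15 distance features in, 20 gesture classes out, two hidden layers. I'm using scikit-learn's MLPClassifier with hidden layer sizes and toy data that are my own assumptions, since the paper's MATLAB setup isn't reproduced here.

```python
# Minimal sketch of a backpropagation network for 15 geometric-distance features
# and 20 gesture classes. Hidden layer sizes and the toy data are my assumptions.
import numpy as np
from sklearn.neural_network import MLPClassifier

N_FEATURES, N_CLASSES = 15, 20

def make_toy_data(samples_per_class=30, seed=0):
    rng = np.random.default_rng(seed)
    X, y = [], []
    for c in range(N_CLASSES):
        center = rng.normal(scale=5.0, size=N_FEATURES)   # stand-in class prototype
        X.append(center + rng.normal(scale=0.5, size=(samples_per_class, N_FEATURES)))
        y.extend([c] * samples_per_class)
    return np.vstack(X), np.array(y)

X, y = make_toy_data()
clf = MLPClassifier(hidden_layer_sizes=(30, 30),   # two hidden layers (sizes assumed)
                    activation="logistic",          # sigmoid units, as in classic backprop
                    max_iter=2000, random_state=0)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```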

Discussion:
This is the second GR paper which used NNs, and I think that this paper was superior simply because their NN was more advanced. It was a vanilla NN implementation though, and it seemed like they just got some data and ran it through some MATLAB NN library. It’s a very straightforward paper that would have been more interesting had their gestures been more representative or complex. A vast majority of the gestures in this paper derive from words which are hardly used in the language. It appears that the authors preferred to work with gestures that were easier to classify as opposed to gestures that were actually commonly used.

Hand gesture modeling and recognition involving changing shapes and trajectories, using a Predictive EigenTracker (Patwardhan & Roy – 2006)

Current Mood: studious

Blogs I Commented On:


Summary:
This is a hand gesture recognition paper that takes an eigenspace-based modeling approach, which takes into account both trajectory and shape information. Recognition involves eigenspace projection and probability computation using the Mahalanobis distance. The paper builds on a previous system called EigenTracker, which can track moving objects that undergo appearance changes. The authors augment that system into what they call the Predictive EigenTracker in three ways: a particle filtering-based predictive framework, on-the-fly tracking of unknown views, and a combination of skin color and motion cues for hand tracking. Their gesture model framework accounts for both the shape and the temporal trajectory of the moving hand by constructing an eigenspace of suitably scaled shapes from a large number of training instances corresponding to the same shape. The framework also incorporates selecting a vocabulary to maximize recognition accuracy by computing the Mahalanobis distance of a query gesture from all gestures using some k shape-trajectory coefficient pairs. This gives a probability of the given gesture belonging to some set of gestures. To test their system, the authors use a representative set of eight gestures for controlling a software audio player.
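As a small illustration of the recognition step, here's a sketch (mine, not the authors') of classifying a query's coefficient vector by Mahalanobis distance against per-gesture training sets, using scipy; the data, dimensionality, and class names are made up.

```python
# Minimal sketch of Mahalanobis-distance classification of a query feature vector
# (e.g., shape-trajectory coefficients) against per-gesture training sets.
# The data, dimensionality, and gesture names are made up for illustration.
import numpy as np
from scipy.spatial.distance import mahalanobis

def class_stats(samples):
    """Mean and (regularized) inverse covariance for one gesture class (N x d)."""
    mean = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)
    return mean, np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))

def classify(query, class_samples):
    """Return the class whose training distribution is closest to the query."""
    dists = {}
    for label, samples in class_samples.items():
        mean, inv_cov = class_stats(samples)
        dists[label] = mahalanobis(query, mean, inv_cov)
    return min(dists, key=dists.get), dists

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    classes = {g: rng.normal(loc=i, size=(50, 6))
               for i, g in enumerate(["play", "stop", "next"])}
    query = rng.normal(loc=1.0, size=6)            # should land near "stop"
    label, dists = classify(query, classes)
    print(label, {k: round(v, 2) for k, v in dists.items()})
```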

Discussion:
The paper was not an easy read for me, but from what I did read, it was an interesting system involving an approach we haven’t been exposed to yet. I do have some doubts, because I don’t know how this system is any different from using typical vision-based recognition specifically for hands. Concerning their experiment, I can’t gauge what the results are since they aren’t straightforward. Furthermore, the environment seems too controlled due to a white background and black, long-sleeved shirt in use for hand tracking.

Wiizards: 3D Gesture Recognition for Game Play Input (Kratz, et al – 2007)

Current Mood: studious

Blogs I Commented On:


Summary:
This is another applications paper that uses the Wiimote. Their application is called Wiizards, a multiplayer game which uses HMMs for gesture recognition. The game itself is a two-player zero-sum game, where each player tries to damage the other while limiting damage to himself. Gestures dictate spell casting, and players are more successful in the game if they use a variety of them. The three main components are the Wiimote, the gesture recognizer, and the game itself. Observations for the model are accelerometer readings from the Wiimote, normalized using calibration information. Each gesture is an observation sequence, and each has a separate model trained with the Baum-Welch algorithm for recognition. The probability of a gesture given a model, over the observations and hidden states, is calculated with the Viterbi algorithm. To train the models, data was collected from 7 users. Each user was shown the gestures and performed them 40 times, and an HMM was created from the user data.
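Since recognition here boils down to scoring an observation sequence against each gesture's HMM, here's a bare-bones Viterbi scorer in Python; the two-state toy models and gesture names are my own inventions, not the authors' trained parameters.

```python
# Bare-bones Viterbi scoring of a discrete observation sequence against an HMM,
# in log space. The toy two-state models below are made up for illustration.
import numpy as np

def viterbi_log_prob(obs, log_pi, log_A, log_B):
    """Log-probability of the best hidden-state path for the observation sequence.

    obs    : sequence of observation symbol indices
    log_pi : (S,)   log initial state probabilities
    log_A  : (S, S) log transition probabilities
    log_B  : (S, V) log emission probabilities
    """
    delta = log_pi + log_B[:, obs[0]]                 # best path score ending in each state
    for o in obs[1:]:
        delta = np.max(delta[:, None] + log_A, axis=0) + log_B[:, o]
    return delta.max()

def recognize(obs, models):
    """Pick the gesture whose HMM gives the highest Viterbi score."""
    scores = {name: viterbi_log_prob(obs, *params) for name, params in models.items()}
    return max(scores, key=scores.get), scores

if __name__ == "__main__":
    log = np.log
    # Toy gesture models: 2 hidden states, 3 observation symbols.
    fireball = (log([0.6, 0.4]),
                log([[0.7, 0.3], [0.4, 0.6]]),
                log([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]))
    shield   = (log([0.5, 0.5]),
                log([[0.9, 0.1], [0.2, 0.8]]),
                log([[0.1, 0.2, 0.7], [0.6, 0.3, 0.1]]))
    obs = [0, 1, 1, 2, 2]
    print(recognize(obs, {"fireball": fireball, "shield": shield}))
```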

Discussion:
There are relatively few gesture recognition papers that cater to the Wiimote, since it’s a new device, and given the number of Wiimote papers on the topic, this would be considered a pretty good paper. It’s interesting that they built a nice application to demonstrate their HMM-based recognizer, but the system does have some kinks, since the paper reports about 50% accuracy for users who hadn’t used the system before.

TIKL: Development of a Wearable Vibrotactile Feedback Suit for Improved Human Motor Learning (Lieberman & Breazeal – 2007)

Current Mood: studious

Blogs I Commented On:


Summary:
This is a hardware application paper with the goal of creating a wearable robotic suit that analyzes target movement and provides real-time corrective vibrotactile feedback to a student’s body over multiple joints, in order to help them quickly develop new motor skills. Their system consists of optical tracking (for motion capture using markers on the wearable device), tactile actuators (for proportional feedback at the joints), feedback software (for determining the vibrotactile signals), and customized hardware for output control. Their system was tested by having users copy a series of images shown on a video screen while wearing the suit. The user study generally gave positive feedback on the system.

Discussion:
I didn’t know how to comment on this paper directly since its applications didn’t really relate to the core aspect of the course. Judged independently from the purpose of the class, I felt it was a wonderful system that also had a sufficient user study applied to it. I could see some merits related to our class if it concentrated more on the hand.

A Spatio-temporal Extension to Isomap Nonlinear Dimension Reduction (Jenkins & Matari – 2004)

Current Mood: studious

Blogs I Commented On:


Summary:
The focus of this paper is efficiently uncovering the structure of motion using unsupervised learning for dimension reduction. The authors use a spatio-temporal Isomap (ST-Isomap) approach for both continuous and segmented input data with sequential temporal ordering, where continuous ST-Isomap is suited for uncovering spatio-temporal manifolds of data, and segmented ST-Isomap is for uncovering spatio-temporal clusters in segmented data. Their technique tries to address the temporal relationships of proximal disambiguation and distal correspondence in order to uncover spatio-temporal structure. Their example of the two relationships is two low waving motions in different directions versus a low and a high waving motion in the same direction: the former pair falls under proximal disambiguation, and the latter under distal correspondence. Their ST-Isomap approach extends Isomap with temporal windowing to give each data point a temporal history, hard spatio-temporal correspondences between proximal data pairs, and reduced distances between data pairs with spatio-temporal relationships to accentuate their similarity.
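The temporal windowing step is simple enough to sketch: each data point gets stacked with its temporal neighbors before the neighborhood graph is built. This is my own minimal illustration of that idea (window size assumed), not the authors' code.

```python
# Minimal sketch of temporal windowing: each frame is augmented with its temporal
# neighbors so that spatially similar but temporally different motions become
# distinguishable. The window size is an assumption.
import numpy as np

def temporal_window(data, half_window=2):
    """Stack each frame with +/- half_window neighbors.

    data: (T, d) array of frames -> returns (T, d * (2*half_window + 1)).
    Edges are padded by repeating the first/last frame.
    """
    T, d = data.shape
    padded = np.vstack([np.repeat(data[:1], half_window, axis=0),
                        data,
                        np.repeat(data[-1:], half_window, axis=0)])
    windows = [padded[t:t + 2 * half_window + 1].ravel() for t in range(T)]
    return np.array(windows)

if __name__ == "__main__":
    motion = np.random.default_rng(0).normal(size=(10, 3))   # 10 frames of 3D data
    print(temporal_window(motion).shape)                      # (10, 15)
```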

Discussion:
I honestly had no idea what this paper was talking about most of the time. Mostly I was left with the lingering feeling that I couldn’t find an aspect of this paper relevant to the topics we are covering in class. But I think it’s safe to say that this is a nice paper to refer to if one wishes to use unsupervised learning on hand motion.

Articulated Hand Tracking by PCA-ICA Approach (Kato, et al – 2006)

Current Mood: studious

Blogs I Commented On:


Summary:
This paper focuses on hand tracking using a PCA-ICA approach. To do so, the authors first model the human hand with OpenGL as spheres, columns, and a rectangular parallelepiped. Hand motion data is captured with a data glove by recording all combinations of open and closed fingers, so that angles for 20 joints are measured. These measurements are divided into 100 time instances to obtain a 2000-dimensional hand motion row vector. PCA is then used to find a smaller set of variables with less redundancy, measured by correlations between data elements using Singular Value Decomposition. In their approach, the authors first use PCA to reduce dimensionality, and then perform ICA on the low-dimensional PCA subspace to extract feature vectors. For ICA, the authors use a neural learning algorithm that maximizes the joint entropy using stochastic gradient ascent. The ICA-based model can thus represent a hand pose by five independent parameters, each corresponding to a particular finger at a particular time instant. Comparing the two, PCA basis vectors represent global hand motion, including mostly infeasible hand motions, whereas ICA basis vectors represent particular finger motions. Particle filtering is then used for tracking hands, first by generating samples where the hand pose is determined by the five parameters (one per finger) from the ICA-based model, and then by using an observation model that employs edge and silhouette information to evaluate the hypotheses.
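The PCA-then-ICA idea is easy to sketch with scikit-learn; note this uses FastICA rather than the neural stochastic-gradient learning rule in the paper, and the dimensions and random data are my own stand-ins for their glove measurements.

```python
# Toy sketch of the PCA -> ICA pipeline: reduce dimensionality first, then extract
# independent components in the low-dimensional subspace. Data and sizes are made up,
# and FastICA stands in for the paper's neural learning rule.
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(0)
motions = rng.normal(size=(60, 2000))       # e.g., 60 motion vectors of 20 joints x 100 frames

pca = PCA(n_components=5)                   # five components, mirroring "one per finger"
low_dim = pca.fit_transform(motions)        # (60, 5) global-motion subspace

ica = FastICA(n_components=5, random_state=0, max_iter=1000)
sources = ica.fit_transform(low_dim)        # (60, 5) independent components

print("explained variance:", np.round(pca.explained_variance_ratio_, 3))
print("independent sources shape:", sources.shape)
```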

Discussion:
If I pretend to understand what the paper was talking about, then I will say that I found it intriguing that they combined the strengths of PCA and ICA to come up with what appears to be a viable hand tracking system, in that PCA’s limitations were overcome by ICA to model the hand for tracking purposes. It’s kind of hard to judge the merits of this paper based on such scant results, though the images provided at the end of the paper do show it in less-than-ideal environments. I wish it had actual working results though (the online video link is dead).

The 3D Tractus: A Three-Dimensional Drawing Board (Lapides, et al – 2006)

Current Mood: studious

Blogs I Commented On:


Summary:
This paper discusses the 3D Tractus, a drawing board-like device which can be raised and lowered to provide sketching in 3D. The device employs a counterweight system for easy vertical motion, four vertical aluminum bars for support, a tablet for the actual sketching, and a string potentiometer as a height sensor. For the software, a pen-based device handles input, and users have three visual software components to work with: a 3D sketch overview window, a drawing pad window, and a menu bar for less common features. Dynamic line width is used to provide depth cues, and a traditional image editor-like eraser is used for deleting entire strokes.

Discussion:
The device discussed in this paper is an interesting concept for emulating 3D sketching using a standard tablet. There are obviously some limitations in providing true 3D sketching, as it can only provide depth cues through a 2D window as opposed to a VR-based solution that would visualize those same depths. I would imagine the usability would feel a bit distracting, since the user has to rely on the other arm to navigate the drawing area in order to perform 3D sketching. Some improvements I would suggest are a button to automate vertical movement and an inclined surface for easier drawing. I can imagine some people who would enjoy using this system over traditional devices.

A Hidden Markov Model Based Sensor Fusion Approach for Recognizing Continuous Human Grasping Sequences (Bernardin, et al – 2005)

Current Mood: studious

Blogs I Commented On:


Summary:
This paper discusses a system based on HMMs for recognizing hand grasps. Classification follows grasp types from Kamakura’s grasp taxonomy, which separates grasps into 14 different classes by purpose, hand shape, and contact points with grasped objects. Each HMM is assigned to a different grasp type, and recognition is performed using the Viterbi algorithm. The focus of their system differs from existing ones in that it distinguishes between the purposes of grasps, as opposed to the object shape or the number of fingers used. Their glove-based device is equipped with flexible capacitive sensors to measure readings for grasping. Noise and unwanted motion were filtered out with a garbage model with an ergodic topology, and a ‘task’ grammar was used to reduce the search space. Their only assumption is that a grasp motion is followed by a release motion. Their system is able to achieve 90% accuracy for multiple users.

Discussion:
The recognition system discussed in this paper is a lot different from the other types of systems discussed in prior papers because none of the papers tackled the problem of recognizing grasping. I liked the paper because it was different and covers an area overlooked in the hand gesture recognition domain. The use of grasp as a feature is very intriguing, and I believe that incorporating it in a recognition system would make such a system more powerful. Reminds me of the hand tension paper, now that I think about it. It does seem like extracting grasping data is a non-trivial affair.

Temporal Classification: Extending the Classification Paradigm to Multivariate Time Series (Kadous – 2002)

Current Mood: studious

Blogs I Commented On:


Summary:
The core idea behind this thesis work is taking advantage of multivariate time series in order to improve hand gesture recognition accuracy. In particular, the author focuses on the sub-events that a human might detect as part of a sign within some sign language, in this case Australian Sign Language (Auslan). The author goes on to say that metafeatures can parameterize these sub-events by capturing properties such as their temporal characteristics within a parameter space, for example a 2D space of time and height, for feature construction. The temporal classification system, which I’m guessing is called TClass, uses synthetic events found within this space for feature construction, and applies several metafeatures to the training instances to construct synthetic features. This is done so that TClass can mix temporal and non-temporal features, something the author claims is not found in other temporal classification systems. A motivation is to produce a temporal classifier that gives comprehensible yet accurate descriptions. The system was applied to Auslan, a language where signs consist of a mixture of handshapes, location, orientation, movement, and expression. Data was collected with the Nintendo Powerglove and the Flock. The first input device was very noisy and far inferior to the second. Several machine learning techniques were tested in conjunction and in comparison with TClass. Some observations concerning the Flock data were that TClass didn’t perform well with the HMM, smoothing the data didn’t improve results, and TClass can handle tons of data. Accuracy for the Flock data is stated to be 98% with voting, which sounds like ensemble averaging.
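To get a feel for the metafeature idea, here's a tiny sketch (my own, not TClass itself) of one plausible metafeature: extracting local maxima from a sensor channel as points in a 2D (time, height) parameter space, which could then be clustered into synthetic events.

```python
# Tiny sketch of a "local maximum" metafeature: each detected peak becomes a point
# in a (time, height) parameter space. Clustering those points into synthetic events
# would be the next step; this only does the extraction.
import numpy as np

def local_max_events(signal):
    """Return (time, height) pairs for interior local maxima of a 1D signal."""
    events = []
    for t in range(1, len(signal) - 1):
        if signal[t] > signal[t - 1] and signal[t] >= signal[t + 1]:
            events.append((t, float(signal[t])))
    return events

if __name__ == "__main__":
    t = np.linspace(0, 4 * np.pi, 80)
    channel = np.sin(t) + 0.1 * np.random.default_rng(0).normal(size=t.size)
    for time_idx, height in local_max_events(channel):
        print(f"event at t={time_idx:2d}, height={height:+.2f}")
```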

Discussion:
Some of the results were a bit too confusing for me to give a fair critique of the performance of the author’s system. That would warrant reading the rest of the thesis, but in the areas that were clear, I thought it was a pretty good approach. The author also did a nice job collecting tons of data to build his system based on the more accurate Flock device, despite it not having multiple users like the Nintendo data. It seems like a sound approach with nice accuracy results, but comparisons to other temporal classifiers showing improvement would have made it better. I don’t think I saw them in the sections we were supposed to read.

Using Ultrasonic Hand Tracking to Augment Motion Analysis Based Recognition of Manipulative Gestures (Ogris, et al – 2005)

Current Mood: studious

Blogs I Commented On:


Summary:
This gesture recognition paper focuses on the hardware side, specifically on ultrasonics for improving recognition accuracy. The paper first discusses three issues inherent in all ultrasonic positioning systems: reflections (false signals due to reflective materials in the environment), occlusions (lost signals due to a lack of line of sight between communicating devices), and temporal resolution (low transmission counts due to the distance between communicating devices). To perform tracking, the authors smooth the sonic data in two steps: on the raw signals and on the resulting coordinates. Classification is then done using C4.5 and k-Nearest Neighbor (k-NN) for the motion sensor analysis. Each manipulative gesture in their experiment corresponds to an individually trained HMM for model-based classification, while a sliding window approach is used for frame-based classification. After classification is performed on all frames of a gesture, a majority decision is applied to the results, yielding a filtered decision for that gesture. Finally, a fusion method is applied in which the ultrasonic and motion signals are classified separately and then combined. Their experiment was on manipulative gestures for a bicycle repair task, where the fusion method performed best over k-NN, HMM, and C4.5 alone.
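Here's a small sketch of the frame-based classification plus majority-decision step as I read it; the per-window classifier is a stand-in for their k-NN/C4.5/HMM setup, and the window and step sizes are assumptions.

```python
# Sketch of frame-based classification with a sliding window followed by a
# majority decision over the per-window labels. The classifier is a stand-in.
from collections import Counter
import numpy as np

def sliding_windows(frames, window=10, step=5):
    """Yield overlapping windows (window x d) over the frame sequence (T x d)."""
    for start in range(0, len(frames) - window + 1, step):
        yield frames[start:start + window]

def classify_gesture(frames, classify_window, window=10, step=5):
    """Majority vote over per-window labels for one gesture segment."""
    labels = [classify_window(w) for w in sliding_windows(frames, window, step)]
    return Counter(labels).most_common(1)[0][0], labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(40, 6))                  # 40 frames of fused sensor features

    def dummy_classifier(window):                      # stand-in for k-NN / C4.5 / HMM
        return "screw" if window.mean() > 0 else "unscrew"

    decision, votes = classify_gesture(frames, dummy_classifier)
    print(decision, votes)
```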

Discussion:
Ultrasonics appears to be a potentially viable sensor for the hand gesture recognition domain judging from this paper. Performance was mixed for the traditional machine learning methods they experimented with, but the fusion approach gave pretty good accuracy. I had a hard time understanding exactly how their fusion approach works, though. From a hardware perspective, supplementing our available sensors with ultrasonics wouldn’t hurt.

American Sign Language Recognition in Game Development for Deaf Children (Brashear, et al – 2006)

Current Mood: studious

Blogs I Commented On:


Summary:
This is an application paper in the form of a computer game, called CopyCat, that utilizes gesture recognition technology to develop American Sign Language (ASL) skills in children. The paper’s goal is to augment ASL learning in a fun manner for a children’s curriculum. The game focuses on correctly practicing repetitions of ASL phrases through the use of a computer. Due to the lack of a prior recognition engine for this area, the authors use a Wizard of Oz study, emulating the missing functionality of the system for later implementation. For their data set, the authors selected phrases with three and four signs, and their recognition engine is currently limited to a subset of ASL single- and double-handed signs. To segment the samples, users were asked to use a mouse to indicate the start and end of their gestures, allowing the system to perform recognition on the pertinent phrases. Data consists of video of the user signing along with wireless accelerometers mounted on pink-colored gloves for easy processing through a computer vision algorithm. Image pixel data is converted to HSV space for image segmentation purposes. For accelerometer processing, each data packet consists of four values: a sequence number and one value for each of the three spatial axes. These packets are first synched to the video feed, then smoothed to account for the variable number of packets associated with each frame. Finally, the feature vectors themselves are a combination of the vision data and the accelerometer data, where the accelerometer data consists of the spatial dimensions for both hands, and the vision data consists of various hand characteristics.

Discussion:
This is another applications paper for hand gesture recognition, but what I liked about it is that it is both different from the typical applications papers in the domain our class is focusing on and potentially useful. I say potentially because, in its current form, it still has some work to do. The results given in the paper are not very insightful about how well it really performs, and I’m still critical of the toolkit used due to its possible limitations (mentioned in a prior blog post), but I don’t see anything that would stop it from being a useful final product for their target audience of children.

A Method for Recognizing a Sequence of Sign Language Words Represented in a Japanese Sign Language Sentence (Sagawa & Takeuchi – 2000)

Current Mood: studious

Blogs I Commented On:


Summary:
This is one of the older gesture segmentation and classification papers, this time for the domain of Japanese Sign Language (JSL). Signed words, which consist solely of hand gestures, are built by combining gesture primitives such as hand shape, palm direction, linear motion, and circular motion. During recognition, gesture primitives are identified from the input gesture, and then the signed word is recognized by the time and spatial relationships between the gesture primitives. To segment, they use a hand velocity parameter and a hand movement parameter. To reconcile differences between the two parameters, the hand movement segment border closest to the hand velocity border is used. Several more parameters determine whether the gesture was one-handed or two-handed, and gestures are classified into four types combining one- or two-handed with left- or right-dominant. To distinguish between word and transition segments, the authors extracted various features and discovered that the best discriminator was the minimum acceleration divided by the maximum velocity: if this parameter is small, the segment is a word, otherwise it’s a transition. Evaluating their system on 100 JSL sentences, they achieved 86.6% accuracy for word recognition and 58% accuracy for sentence recognition.
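The word-versus-transition feature is simple enough to sketch: compute velocity and acceleration over a segment and take the minimum acceleration magnitude divided by the maximum velocity magnitude. This is my own toy illustration, with a made-up threshold and frame rate.

```python
# Toy sketch of the word/transition feature: minimum acceleration magnitude divided
# by maximum velocity magnitude over a segment. Threshold and frame rate are made up.
import numpy as np

def word_transition_feature(positions, dt=1.0 / 30):
    """positions: (T, 3) hand positions; returns min|accel| / max|vel| for the segment."""
    vel = np.diff(positions, axis=0) / dt
    acc = np.diff(vel, axis=0) / dt
    speed = np.linalg.norm(vel, axis=1)
    accel_mag = np.linalg.norm(acc, axis=1)
    return accel_mag.min() / max(speed.max(), 1e-9)

def is_word_segment(positions, threshold=0.5):
    """Small feature value -> word segment; large -> transition (threshold assumed)."""
    return word_transition_feature(positions) < threshold

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    segment = np.cumsum(rng.normal(size=(30, 3)), axis=0) * 0.01   # stand-in hand trajectory
    print("feature value:", round(word_transition_feature(segment), 3))
```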

Discussion:
There were several things I liked about this paper. It was an easy read, had sound methods, and proceeded logically in creating the system. Unfortunately, the system is still a work in progress because of the poor accuracy in the end. It’s been almost a decade since this paper came out, and it was still unfamiliar territory at the time. I wish there were a follow-up paper that improved the accuracy rates.

Georgia Tech Gesture Toolkit: Supporting Experiments in Gesture Recognition (Westeyn, et al – 2003)

Current Mood: studious

Blogs I Commented On:


Summary:
This paper describes a toolkit, the Georgia Tech Gesture Toolkit, for developing gesture-based recognition systems. The toolkit is first prepared by modeling each gesture as a separate HMM, specifying a rule-based grammar for the possible sequences of gestures, and collecting and annotating data in numerical vector form, called feature vectors, over which the toolkit operates. The toolkit is trained and validated using cross-validation and leave-one-out validation. The rest of the paper discusses various applications which use this toolkit.

Discussion:
The paper discusses what appears to be a viable toolkit for general gesture recognition. It’s hard to judge how versatile this toolkit really is given the lame applications discussed later on in the paper, but it’s definitely worth taking a look. My primary gripe is the requirement of a grammar to specify a gesture. It’s not too bad for simple gestures, but for more practical ones, I’m guessing the grammar would have to be huge to handle all possible cases.

Computer Vision-Based Gesture Recognition for an Augmented Reality Interface (Storring, et al – 2004)

Current Mood: studious

Blogs I Commented On:


Summary:
The gesture recognition system described in this paper is geared toward supplementing an augmented reality system, primarily for round-table meetings. The authors focus on six gestures of various outstretched fingers for their interface, requiring that users adapt to the hard limitation that these gestures be performed on a plane. Segmentation is done using a color pixel-based approach by segmenting blobs of similar color in the image. This approach uses normalized RGB, transforming the RGB values into a color space that separates intensity from chromaticity in order to achieve invariance to changes in lighting intensity. In addition, skin blobs are constrained to a minimum and maximum number of pixels. After segmenting hand pixels from the image, the hand and fingers are approximated as a circle and a number of rectangles for gesture recognition. The number of rectangles represents the number of outstretched fingers, which could be found by doing a polar transformation around the center of the hand and counting the fingers present at each radius. To speed up the algorithm, they instead sample along concentric circles. Final classification is done by taking the finger count that is present for the most concentric circles. Recognized gestures are further filtered by a temporal filter, which requires the gesture to be held for a number of frames.
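Here's a rough sketch of the two steps I found most interesting: normalized-RGB skin thresholding and counting fingers along concentric circles. The thresholds, radii, and run-counting are my own assumptions, not the authors' values.

```python
# Rough sketch of normalized-RGB skin segmentation and finger counting along
# concentric circles. Thresholds and radii are assumptions, not the paper's values.
import numpy as np

def normalized_rg(image):
    """Convert an HxWx3 RGB image to normalized (r, g) chromaticity channels."""
    rgb = image.astype(float)
    s = rgb.sum(axis=2) + 1e-9
    return rgb[..., 0] / s, rgb[..., 1] / s

def skin_mask(image, r_range=(0.36, 0.55), g_range=(0.28, 0.36)):
    """Binary mask of pixels whose chromaticity falls inside an assumed skin box."""
    r, g = normalized_rg(image)
    return ((r > r_range[0]) & (r < r_range[1]) &
            (g > g_range[0]) & (g < g_range[1]))

def count_fingers(mask, center, radii, samples=360):
    """Count skin runs crossed by each concentric circle; return the modal count."""
    angles = np.linspace(0, 2 * np.pi, samples, endpoint=False)
    counts = []
    for radius in radii:
        # Wrap indices with modulo just to keep the sketch in-bounds at image edges.
        xs = (center[0] + radius * np.cos(angles)).astype(int) % mask.shape[1]
        ys = (center[1] + radius * np.sin(angles)).astype(int) % mask.shape[0]
        ring = mask[ys, xs].astype(int)
        # Number of 0 -> 1 transitions around the circle = number of skin runs crossed.
        counts.append(int(((np.roll(ring, 1) == 0) & (ring == 1)).sum()))
    values, freq = np.unique(counts, return_counts=True)
    return int(values[freq.argmax()])                 # count seen on the most circles

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.integers(0, 255, size=(120, 160, 3)).astype(np.uint8)   # stand-in camera frame
    mask = skin_mask(frame)
    print("modal finger count:", count_fingers(mask, center=(80, 60), radii=range(20, 50, 5)))
```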

Discussion:
This paper takes a computer vision approach for gesture recognition, and what was nice about the paper was the technique of concentric circles in determining which and how many fingers were outstretched for particular gestures. It’s quite novel and reminiscent of the technique Oltmanns used in his dissertation. On the other hand, the authors required a hard limit to do so, and this takes away from the robustness of their system. Existing methods in computer vision could probably perform just as well with their hard limit, and yet still function better without it as well.

3D Visual Detection of Correct NGT Sign Production (Lichtenauer, et al – 2007)

Current Mood: studious

Blogs I Commented On:


Summary:
This paper focuses on hand tracking based on skin detection for the domain of Dutch sign language (NGT). The method involves the user producing a sign, the hands and head being visually tracked, and their features being measured; a classifier is then used to evaluate the measured features. For tracking, the hands and head are followed by detecting skin color blobs assigned to the head and both hands from their previous positions, combined with template tracking during occlusions between the hands and the head. An operator is required to click a square around the face and the head/hair/neck region. Skin color is modeled by a 2D Gaussian perpendicular to the main direction of the distribution of positive skin samples in RGB space. For classification, the authors use fifty properties related to the 2D/3D location and movement of the hands, measured at each frame. First, a reference sign is selected for each classifier, and the time signal of a query is warped onto that sign using Dynamic Time Warping (DTW), solely for synchronization. The classifier then models the properties under the assumption that the features are independent. Base classifiers are built for single features and then combined by summing their results, where a feature is selected for classification if fewer than 20% of the negative examples have a feature value within some 50% winsorization interval of the positive set. Signs are classified if they exceed some threshold value, which is determined by evaluating the positive training examples and using the median of the resulting values. Their approach gives 8% true positives against 5% false positives.
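Since DTW shows up here purely for time synchronization, here's a compact generic DTW sketch (not tied to their fifty properties) that computes the alignment cost and warping path between two 1D signals.

```python
# Compact Dynamic Time Warping sketch: alignment cost and warping path between
# two 1D sequences. Generic implementation for illustration, not the authors' setup.
import numpy as np

def dtw(a, b):
    """Return (total alignment cost, warping path) for 1D sequences a and b."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack the cheapest path from (n, m) toward (1, 1).
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return cost[n, m], path[::-1]

if __name__ == "__main__":
    reference = np.sin(np.linspace(0, 2 * np.pi, 20))
    query = np.sin(np.linspace(0, 2 * np.pi, 30))       # same sign, different speed
    total_cost, path = dtw(query, reference)
    print(round(total_cost, 3), path[:5])
```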

Discussion:
If I could summarize this paper in one sentence, it would be that it’s a very complicated hand gesture recognition system which relies on skin color. Given the amount of space dedicated to their skin-color-based technique, I think it's made partially moot by the fact that it requires manually selecting body parts beforehand as opposed to doing so automatically. Add the fact that it also requires a fairly ideal environment, and I'm not really motivated to use their technique over some other vision-based technique. Then there's their results. It's not that they're good or bad. It's just...I still don't know how well their system performs even after seeing them.

Television control by hand gestures (Freeman & Weissman – 1994)

Current Mood: studious

Blogs I Commented On:


Summary:
The goal of this paper is to use hand gestures to operate a television set remotely, and the authors do so by creating a user interface that new users can master instantly. The problems claimed are the lack of a vocabulary for doing so, along with the image processing problem of identifying hand gestures quickly and reliably in a complex and unpredictable visual environment. The solution involves imposing constraints on the user and the environment while exploiting the visual feedback from the television. Users only memorize one gesture: holding an open hand toward the television. The computer tracks the hand, echoes its position with a hand icon, and then lets the user operate on-screen controls. The hand recognition method uses normalized correlation, where the position of the maximum correlation is taken as the user’s hand. Local orientation is used, as opposed to pixel intensities, for robustness against lighting variations, using filters in the image processing stage. Background removal is used to avoid analyzing stationary objects like furniture: the current image is linearly combined with a running average image, and the two are subtracted to detect image positions where the change is above some pre-set threshold. Only positions above the change threshold are then processed further, to gain efficiency and reduce false positives.
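The background removal step is a classic running-average trick, so here's a minimal numpy sketch of it; the blending factor, threshold, and toy frames are my own assumptions.

```python
# Minimal sketch of running-average background removal: blend each frame into a
# background estimate, then threshold the difference to find moving (hand) pixels.
# The blending factor and threshold are assumptions.
import numpy as np

class BackgroundSubtractor:
    def __init__(self, alpha=0.05, threshold=25.0):
        self.alpha = alpha              # how quickly the background adapts
        self.threshold = threshold      # minimum intensity change to count as motion
        self.background = None

    def apply(self, frame):
        """frame: HxW grayscale array; returns a boolean motion mask."""
        frame = frame.astype(float)
        if self.background is None:
            self.background = frame.copy()
        # Linearly combine the current frame with the running average.
        self.background = (1 - self.alpha) * self.background + self.alpha * frame
        return np.abs(frame - self.background) > self.threshold

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    subtractor = BackgroundSubtractor()
    scene = rng.integers(0, 50, size=(120, 160))            # static furniture, etc.
    for _ in range(20):                                      # let the background settle
        subtractor.apply(scene + rng.integers(0, 5, size=scene.shape))
    moving = scene.copy()
    moving[40:80, 60:100] += 120                             # an "open hand" appears
    print("moving pixels detected:", int(subtractor.apply(moving).sum()))
```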

Discussion:
This is an ancient paper dating back 14 years. Even though its contribution is primitive by today’s standards, I believe it was pretty innovative for its time in the various concepts it introduced. One example was its attempt to create a vocabulary for gesture recognition (though I found the contributions there highly underwhelming), and another was the use of background removal to visually detect the hand for recognition purposes. The authors cheat by requiring an open hand to do detection, and their application doesn’t appear intuitive for the task, but I think the ideas in the paper were ahead of their time.

A Survey of Hand Posture and Gesture Recognition Techniques and Technology (LaViola – 1999)

Current Mood: studious

Blogs I Commented On:


Summary:
This is basically a survey paper on hand gesture recognition, giving an overview of the technology for the domain at that time. The first part of the paper discusses glove- and vision-based approaches to hand gesture recognition and the data collected from them. The paper then shifts into a discussion of algorithmic techniques that can be applied to these input devices, ranging from feature extraction to learning algorithms. The last section then goes into depth on applications that can benefit from the devices and techniques discussed previously, ranging from sign language to virtual reality.

Discussion:
Since this paper is a PhD thesis, it’s a very long paper. Surprisingly enough, it’s an easy read and does a great job giving an overview of the types of topics that our class is focusing on. Even though it was written almost a decade ago, this thesis is still very relevant for today’s readers. That may be attributed to how well the author covered the spectrum of the field and maybe also the lack of huge changes in the field since then. While the thesis could be faulted by some for not going deep enough in some topics, I thought it was a great overall paper nonetheless.

[17] Real-time Locomotion Control by Sensing Gloves (Komura & Lam – 2006)

Current Mood: studious

Blogs I Commented On:


Summary:
This paper describes a system which uses sensing gloves as an input device, potentially for virtual 3D games. The sensing gloves used in this paper were the P5 and the Cyberglove. Their method consists of a calibration stage and a control stage. In the calibration stage, a mapping function is created that converts hand motion to human body motion by having the user mimic with the hand the character motion appearing on a graphical display. The control stage involves performing a new hand movement; the corresponding motion of the computer character is then generated by the mapping function and displayed in real time. Construction of the mapping function first involves topological matching of the human body and fingers by comparing the motion of the fingers to that of the character. This is estimated by first calculating the DOF and then the autocorrelation of the trajectories. The authors determined the autocorrelation value to be 50 cycles, and joints are classified as either full cycle, half cycle, or exceptional. After classifying the joints into one of the three categories, the hand’s generalized coordinates are matched to the character using the 3D velocity of the user’s fingers and the character’s body segments. This is done by first calculating the relative velocity of the character’s end effectors with respect to the root of the body, and then the relative velocities of the fingertips with respect to the wrist of the hand. Lastly, all the end effectors of the character are matched to the fingers of the user. Sometimes there is a phase shift between the user’s fingers and the character’s joints, which is remedied by keeping the ratio of the phase shift and the period the same when the user controls the character. In cases where the user suddenly extends or flexes the fingers, causing the mapping function to not cover the range of the new finger motion, the system extrapolates based on the tangent of the mapping function at the end of the original domain. Upper and lower joint angle boundaries are additionally used by keeping the joint angles constant when those boundaries are exceeded, until the mapping function comes back within the valid domain. A virtual 3D environment with walls and obstacles was prepared to test the system on users with the Cyberglove, where users controlled the computer character with the index and middle fingers to emulate walking, compared to keyboard input for the same task. The authors found that the average time to complete the task was shorter with the keyboard, but there were fewer collisions with the glove.

Discussion:
I found this to be an interesting application paper using glove devices for input, and I also thought it did a good job covering the details of their method and the motivation behind their design choices. My gripe is more with the reasoning behind their application, primarily because I’m not a fan of using gloves to mimic locomotion of computer characters. The Iba paper was the only one that came to mind which also focused on using gloves to achieve this feat, and I preferred the Iba paper because it felt more intuitive to use fingers to dictate motion as opposed to emulating motion.

[16] Shape Your Imagination: Iconic Gestural-Based Interaction (Marsh & Watt – 1998)

Current Mood: studious

Blogs I Commented On:


Summary:
This is an exploratory user study paper on iconic hand gestures. The goal of the paper was to test the hypothesis that iconic hand gestures are employable as an HCI technique for transferring spatial information. The study was conducted on a group of 12 non-computer scientists using 15 shapes and objects. These shapes and objects were split into two groups: the Primitive group had 2D and 3D geometric shapes found in most computer graphics systems, while the Complex and Compound group contained objects that can be composed of one or more Primitive group shapes. The study found that iconic gestures were used throughout for the Primitive group, where users preferred two-handed virtual depiction. For the complex and compound objects, users preferred to describe them using iconic two-handed gestures that were accompanied or substituted by pantomimic, deictic, and body gestures. Additionally, the authors found that iconic hand gestures were formed immediately for shapes in the Primitive group, but formation times varied widely for the Compound and Complex group.

Discussion:
This is a relatively early paper, released about 10 years ago, so the results from the user study are quite limited for the type of work in our course. It is an interesting user study that could possibly be modified and extended in order to derive results that would better fit the types of things we envision for the course, especially considering the dearth of exploratory user studies in the field.

[15] A Dynamic Gesture Recognition System for the Korean Sign Language (Kim, et al – 1996)

Current Mood: studious

Blogs I Commented On:


Summary:
This paper focuses on on-line gesture recognition for the domain of Korean Sign Language (KSL) using a glove input device. Twenty-five gestures were chosen out of the roughly 6,000 in KSL, involving 14 basic hand postures, and each hand posture is recognized by their system using a fuzzy min-max neural network. The authors begin by setting initial gesture positions, which they handle by calibrating the data glove data, subtracting the initial position from subsequent positions. After selecting the 25 gestures, they determined that these contained 10 basic direction types contaminated by noise from the glove. Since the measured deviation from the real value was within 4 inches, they split the x- and y-axes into 4 regions each (for a total of 8) for efficient filtering and computation. The region data at each time unit is stored in 5 cascading registers, and the register values change when the current data differs from the data in the previous time unit. A fuzzy min-max neural network (FMMN) is used to recognize each hand posture; it supposedly requires no pre-learning about the posture class and has on-line adaptability. A fuzzy set hyperbox, which here is a 10-dimensional box based on the two flex angles from each finger of the hand, is used in the study. This hyperbox is defined by a min point (V) and a max point (W), corresponds to a membership function, and is normalized between 0 and 1. Initial min-max values (V, W) of the network were created from empirical data of many individuals’ flex angles. A sensitivity parameter regulates the speed of input pattern separation from the hyperbox, where a higher parameter value means a crisper membership function. Thus, when the network receives an input posture, the output is the set of membership function values for the 14 posture classes, and the input is classified as the posture class with the highest membership value that also exceeds some threshold value. No classification occurs when it falls below the threshold. For the given words, the system achieves 85% recognition rates.
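Here's a small sketch of the hyperbox membership idea as I understand it: each posture class keeps min/max corners V and W, and an input's membership decays with how far it falls outside the box, controlled by a sensitivity parameter. The membership formula and numbers here are simplified assumptions, not the paper's exact equations.

```python
# Simplified sketch of fuzzy min-max hyperbox membership: each class keeps min (V)
# and max (W) corners in [0, 1]^d, and membership decays with the distance an input
# falls outside the box. Formula, threshold, and toy classes are assumptions.
import numpy as np

def hyperbox_membership(x, V, W, sensitivity=4.0):
    """Membership in [0, 1]; 1 inside the box, decaying as x leaves it."""
    below = np.clip(V - x, 0.0, None)          # how far under the min corner
    above = np.clip(x - W, 0.0, None)          # how far over the max corner
    violation = (below + above).mean()         # average per-dimension violation
    return max(0.0, 1.0 - sensitivity * violation)

def classify(x, boxes, threshold=0.6):
    """Return the posture class with the highest membership above a threshold."""
    scores = {name: hyperbox_membership(x, V, W) for name, (V, W) in boxes.items()}
    best = max(scores, key=scores.get)
    return (best if scores[best] >= threshold else None), scores

if __name__ == "__main__":
    # Two toy posture classes over 10 normalized flex angles (2 per finger).
    boxes = {
        "fist": (np.full(10, 0.7), np.full(10, 1.0)),
        "open": (np.full(10, 0.0), np.full(10, 0.3)),
    }
    sample = np.full(10, 0.85)                 # mostly-bent fingers
    print(classify(sample, boxes))
```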

Discussion:
This is the second haptics paper we’ve read that used neural networks for recognition. Unlike the first paper, which used a basic perceptron network, this paper used a more complex fuzzy min-max neural network. It seems to give reasonable recognition rates and claims to have on-line adaptability. It would have been nice for the authors to have given some insight into why they chose the FMMN over other types of neural networks, and also to provide details on how it was able to adapt on-line. I did like that they provided actual network parameter values for their FMMN.

[14] Simultaneous Gesture Segmentation and Recognition based on Forward Spotting Accumulative HMMs (Song & Kim – 2006)

Current Mood: studious

Blogs I Commented On:


Summary:
This is another gesture segmentation and recognition technique paper, this time using something called forward spotting accumulative HMMs. The domain was controlling the curtains and lighting of smart homes using upper body gestures. The first main idea presented is a sliding window technique, which computes the observation probabilities of a gesture or non-gesture using a number of continuing observations within a sliding window of a particular size k. From empirical testing, the authors chose k = 3. In this technique, the partial observation probability of a segment of a particular observation sequence is computed by induction. The second idea involves forward spotting, which uses the prior technique to compute the competitive difference observation probability from a continuous stream of gesture images. The basic idea is that every possible gesture class, as well as a non-gesture class, has an HMM. After a partial observation, the value of the gesture class HMM that gives the highest probability is compared with the value of the non-gesture class HMM, and whichever of the two gives the higher value is chosen. Accumulative HMMs, which accept all possible partial gesture segments for determining the gesture type, are additionally used in the paper to improve recognition performance. During testing, two spotting techniques were used on the eight gesture classes for their particular domain: manual and automatic threshold spotting. The latter performed better, with accuracy rates mostly in the 90s up to perfect.

Discussion:
It’s hard to gauge the quality of the gesture segmentation technique in this paper. The technique seems to offer a solution to a problem common to using generic HMMs in this domain, namely handling partial observations. Also, the use of their “junk” class, while nothing special, wouldn’t hurt. On the other hand, they tested their technique on such a simple domain, without comparison to other techniques in the process. It looks like the only way to judge the technique’s merits is to actually implement their gesture segmentation algorithm on a more complex domain. The jury is still out on this one.

[13] A Survey of POMDP Applications (Cassandra – 2004)

Current Mood: studious

Blogs I Commented On:


Summary:
The paper gives a brief introduction to a certain type of Markov decision process (MDP) model called the partially observable MDP (POMDP), along with a survey of applications that benefit from the use of POMDPs. A POMDP model consists of the following: a finite set of states S (all possible, unobservable states that the process can be in), actions A (all available control choices at a point in time), and observations Z (all possible observations that the process can emit), along with a state transition function tau (encoding the uncertainty in the process state evolution), an observation function o (relating the observations to the true process state), and an immediate reward function r (giving the immediate utility of an action in each process state). The goal is to derive a control policy that will yield the highest utility over a number of decision steps.
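Since the controller never observes the true state, everything runs on a belief distribution over S; here's a small numpy sketch of the belief update after taking an action and receiving an observation, using a toy two-state problem of my own invention.

```python
# Small sketch of the POMDP belief update: after taking action a and seeing
# observation z, the belief over hidden states is re-weighted and renormalized.
# The two-state "door open / door closed" numbers are a toy example of my own.
import numpy as np

def belief_update(belief, T_a, O_a, z):
    """b'(s') is proportional to O_a[s', z] * sum_s T_a[s, s'] * b(s).

    belief : (S,)   current distribution over states
    T_a    : (S, S) transition probabilities for the chosen action
    O_a    : (S, Z) observation probabilities for the chosen action
    z      : index of the received observation
    """
    predicted = belief @ T_a                 # push the belief through the dynamics
    updated = O_a[:, z] * predicted          # weight by how likely the observation is
    return updated / updated.sum()

if __name__ == "__main__":
    belief = np.array([0.5, 0.5])                        # states: [open, closed]
    T_listen = np.eye(2)                                 # "listen" doesn't move the state
    O_listen = np.array([[0.85, 0.15],                   # P(hear-open / hear-closed | open)
                         [0.15, 0.85]])                  # ... given closed
    belief = belief_update(belief, T_listen, O_listen, z=0)   # heard "open"
    print(np.round(belief, 3))                                 # belief shifts toward open
```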

The majority of the paper goes into the survey of applications, but none of them address haptics directly. The end of the paper then goes into two types of limitations from POMDPs: theoretical and real. For theoretical limitations, POMDPs do not easily handle problems that have certain characteristics, and the model is itself data intensive. For the real limitations, the first problem involves representing states as a set of attributes, which causes small concepts to have large state spaces since it requires enumerating every attribute value combination. The second problem is that the optimal policy for a general POMDP is intractable.

Discussion:
While it would have been preferable to see the paper focus specifically and in more detail on a particular application that used POMDPs in order to judge their merits, it appears to have made a decent case for the utility of POMDPs as a potentially useful technique for the types of things that can be done in haptics. In addition, I think this paper gave an okay summary of POMDPs, assuming that the reader already had prior knowledge of MDPs.

[12] Cyber Composer: Hand Gesture-Driven Intelligent Music Composition and Generation (Ip – 2005)

Current Mood: studious

Blogs I Commented On:


Summary:
The paper talks about Cyber Composer, a prototype system which takes an application-based approach in the field of haptics. This particular system allows both experienced and inexperienced users to express music at a high level without using musical instruments, relying solely on hand motions and gestures and some familiarity with music theory. Musical expressions are mapped to certain hand motions and gestures in the hope of being useful to experienced musicians while remaining intuitive for music laypersons. These expressions include rhythm, pitch, pitch-shifting, dynamics, volume, dual-instrument mode, and cadence. The system uses Cybergloves to input these musical expressions.

Discussion:
One interesting aspect of the paper is its novel application of haptics: striving to express music without traditional musical instruments, while trying to balance usefulness for experienced musicians with intuitiveness for inexperienced ones. One part of the paper that was severely lacking was the reasoning behind choosing the particular mapping for each musical expression. It would have been nice to see some user study or testing results to defend the choices they made for their system.

[11] A Similarity Measure for Motion Stream Segmentation and Recognition (Li & Prabhakaran – 2005)

Current Mood: studious

Blogs I Commented On:


Summary:
The paper states that it’s a motion segmentation paper, which basically seems like gesture segmentation over a wider scope. Directional and angular values of a motion are recorded in matrix format, where the columns measure the attributes and the rows record the time samples of that motion; similarity between motions can then be compared as long as they have the same number of columns (i.e., attributes), even with different numbers of rows (i.e., time samples). This similarity comparison can be done using singular value decomposition (SVD), which reveals the geometric structure of a matrix. If two motions are similar, their corresponding eigenvectors should be parallel and their corresponding eigenvalues proportional, so only the eigenvectors and eigenvalues of the two motions’ matrices are considered. Their similarity measure, which depends on the eigenvectors and eigenvalues, relies on some integer k, with 1 < k < n, where k determines the number of eigenvectors considered out of the n attributes in a motion matrix. For this paper, k = 6 from empirical testing. Given the significance of k, this non-metric similarity measure is called the k Weighted Angular Similarity (kWAS), which captures the angular similarities of the first k corresponding eigenvector pairs weighted by the corresponding eigenvalues. To recognize motion streams, the paper assumes a minimum gesture length l and a maximum length L. Their kWAS method is applied incrementally to segment the streams for recognition of motions captured with a Cyberglove and cameras.
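To make the measure concrete, here's a toy numpy sketch in the spirit of kWAS (my own simplified weighting, not the paper's exact formula): take the SVD of each motion matrix and compare the first k right singular vectors, weighted by the singular values.

```python
# Toy sketch in the spirit of kWAS: compare two motion matrices (rows = time samples,
# columns = attributes) via the angles between their first k right singular vectors,
# weighted by singular values. The exact weighting differs from the paper's formula.
import numpy as np

def svd_similarity(A, B, k=6):
    """Similarity in [0, 1] between motions A (Ta x n) and B (Tb x n), same n."""
    # Right singular vectors / singular values capture the geometric structure of
    # each motion independent of its number of rows (time samples).
    _, sA, VtA = np.linalg.svd(A, full_matrices=False)
    _, sB, VtB = np.linalg.svd(B, full_matrices=False)
    k = min(k, VtA.shape[0], VtB.shape[0])
    # |cos| of the angle between corresponding singular vector pairs (sign-invariant).
    cosines = np.abs(np.sum(VtA[:k] * VtB[:k], axis=1))
    # Weight each pair by the (normalized) combined singular values.
    weights = (sA[:k] / sA[:k].sum() + sB[:k] / sB[:k].sum()) / 2.0
    return float(np.sum(weights * cosines))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=(50, 8))                 # 50 time samples, 8 attributes
    similar = np.vstack([base, base[:20]])          # same motion, different duration
    different = rng.normal(size=(40, 8))
    print(round(svd_similarity(base, similar), 3),
          round(svd_similarity(base, different), 3))
```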

Discussion:
The paper claims accuracy rates in the high 90s, but as Brandon pointed out in class, those rates stemmed from artificially-produced samples. Since that was the case, there really weren’t any results for their motion segmentation algorithm, only a potential idea. I believe it has potential, and especially worth a look given the dearth of useful segmentation algorithms in this domain, but real results would be greatly appreciated in order to truly judge its merits.