Paul Taele's Blog on Gesture Recognition

Gesture Recognition Using an Acceleration Sensor and Its Application to Musical Performance Control (Sawada & Hashimoto – 1997)

02 April 2008

Current Mood:

Blogs I Commented On:

Summary:
The focus of this paper is gesture recognition using simple three-dimensional acceleration sensors with weights, where the domain is conducting. From these sensors, the accelerations in the x, y, and z directions are obtained. Motion features are extracted from three vectors, each of which are a combination of two of the directions. Kinetic parameters are then extracted from those motion features, which are: intensity of motion, rotational direction, direction of the main motion, and distributions of the acceleration data over eight principal directions. For a series of acceleration data, a kind of membership function was used as a fuzzy representation of the closeness to each principal direction, and it’s defined as an isolated triangular function which decreases in value linearly from the principal direction to the neighboring principal directions. Each kinetic parameter is then calculated from the time-series vectors, and gestures are recognized from a total of 33 kinetic features.

To recognize the gestures, the start of them is first identified from the magnitude of the acceleration. Before system use, an individual’s motion feature patterns are constructed during training and gesture start is detected when his/her acceleration is larger than the fluctuation is observed. Samples of the motion during training are recognized as input M times, and some standard pattern data for each gesture are constructed and stored. During recognition mode, weight errors for each feature parameter of an unknown motion’s acceleration data series is computed, and the dissimilarity to the patterns are determined. For musical tempo recognition, the rotational component is ignored, and the rhythm is extracted from the pattern of changes of the magnitude and phase of the vector. The instant when the conductor’s arm reaches the lowest point of the motion space, it’s recognized as the rhythm point. By detecting the maxima periodically, the rhythm points can be identified in real-time. To evaluate the system, the authors tested 10 kinds of gestures used for performance control. Two users repeated the gestures 10 times each for standard pattern construction. Though their system recognized the gestures for the first user trained on perfectly but not completely for the second user tested on, the authors concluded nonetheless that it still does better than vision-based or other approaches.

Discussion:
Before I discuss the technical content of this paper, I just want to say that the paper format was written suckily. The related section was slapped on in the middle of the paper out of left field, and I felt that the figure and table numbers were not easy to locate quickly within the paper sometimes. Okay, back on topic. My thoughts:

I’m glad that this paper did not spend two pages talking about music theory.

The shape of the conducting gestures in Aaron’s user study looked pretty complex. It seemed like that the author’s tackled recognition of the conducting gestures by simplifying them as a collection of directional gestures. I’m no expert on conducting gestures, but I wonder if the author’s approximation sufficient enough to separate all the gestures in Aaron’s domain. I will say yes for now.

I know people in our class will criticize this paper for its lame evaluation section. This seems so common in the GR papers we’ve been reading that I just gave up and accepted it as being the norm for this field.

I love reading translated research papers written by Japanese authors. They’re the funny.

Activity Recognition using Visual Tracking and RFID (Krahnstoever, et al – 2005)

01 April 2008

Current Mood:

Blogs I Commented On:

Summary:
This paper discusses how RFID technology can augment a traditional vision system with very specific object information to significantly extend its capabilities. The system’s components consist of two modules: tracking and RFID tag tracking. For the former, human motion tracking estimates a person’s head and hand location in a 3D world coordinate reference frame using a camera system. The likelihood for a frame is estimated based on the summation of the image over the bounding box of either the head or hand in a specific view. The tracker continuously estimates proposal body part locations from the image data, which are used as initialization and recovery priors. The tracker also follows a sequential Monte Carlo filtering approach and performs partitioned sampling to reduce the number of particles needed for tracking. For the latter, the RFID tracking unit detects the presence, movements, and orientation of RFID tags in 3D space. An algorithm for articulated upper body tracking is given in the paper. Combining both trackers gives an articulated motion tracker, which outputs a time series of mean estimates of a subject’s head and hand locations in addition to visibility flags that express whether the respective body parts are currently visible in the optical field of view. By observing subsequent articulated movements, the authors claim that the tracker can estimate what the person is doing. These interactions were encoded as a set of scenarios using rules in an agent-based architecture. To test their system, a prototype system was created consisting of a shelf-type rack made to hold objects of varying sizes and shapes. When a person interacts with RFID-equipped objects, the system was able to detect which item the user was interacting with (difficult for a vision-only system) and the type of interaction the user was doing to the object (difficult for an RFID-only system).

Discussion:
I was familiar with RFID tags from other applications, but I didn't know or think that could be used for haptics. That's why I wasn't too sure at first about the relevance of this paper to our class. Josh P. fortunately enlightened our class that RFID tags can nicely supplement the existing devices we have in the labs. While the potential is definitely there for the type of things our class is doing, the paper itself was pretty poor in presenting that potential. Most of that can be attributed to a weak example application which, of course, happened to have no numerical results to speak of. But that's been like the norm in the paper's we've been reading.

Gestures without Libraries, Toolkits or Training: A $1 Recognizer for User Interface Prototypes (Wobbrock, et al – 2007)

31 March 2008

Current Mood:

Blogs I Commented On:

Summary:
The authors created an easy, cheap, and highly portable gesture recognizer called the $1 Gesture Recognizer. The algorithm requires one hundred lines of code and handles only basic geometry and trigonometry. The algorithm’s contributions include being easy to implement for novice user interface prototypers, capable as a measuring stick against more advanced algorithms, and used to give insight to gestures that are “best” for people and computer systems. Challenges to gesture recognizers in general include having resilience to sampling variations, supporting optimal and configurable rotation, scale, and position invariance, requiring no advanced math techniques, writing it easily in a few lines of code, being teachable with one example, returning a list of N-best points with sensible scores independent of number of points, and providing competitive recognition rates to more advanced algorithms.

$1 is able to cope with those challenges in its four-step algorithm: 1) re-sample to N points, where 32 <= N <= 256, 2) rotate once based on indicative angle, which is the angle formed between the gesture’s centroid and starting point, 3) scale non-uniformly and translate to the centroid, which is set as the origin, and 4) do recognition by finding the optimal angle for the best score. Analyzing the rotation invariance shows that there’s no guarantee that candidate points and template points of the gesture will optimally align after rotating the indicative angle to 0 degrees, so $1 uses a Golden Section Search (GSS) to find minimum ranges using the Golden Ratio in order to find the optimal angle.

Limitations of $1 include being unable to distinguish gestures whose identities depend on specific orientations, aspect ratios, or locations, abusing horizontal and vertical lines by non-uniform scaling, and being unable to differentiate gestures by speed since it doesn’t use time. In order to handle variations with $1, new templates can be defined to capture variation with a single name. A study was done to compare $1 with a modified Rubine classifier and Dynamic Time Warping (DTW) template matcher. The study showed that $1 and DTW were more accurate than Rubine, and that $1 and Rubine executed faster than DTW.

Discussion:
I guess I should change the discussion a bit because we're now looking at this paper from a GR perspective instead of an SR perspective. Our SR class was quite critical of this paper at the time, given that there were already existing SR algorithms that were more capable. Maybe $1 isn't as bad for GR, given that the simplicity of this algorithm would help bring GR-based applications into the mainstream, and also because such gestures for glove- and wand-based devices probably aren't as complicated to handle as for pen-based devices. The limitations that we noted in the SR class haven't gone away simply because we shifted to GR, but I don't think they're as disadvantageous second time around. I guess we won't know for sure unless we start experimenting with various applications which use $1.

Enabling fast and effortless customization in accelerometer based gesture interaction (Mantyjarvi, et al – 2004)

Current Mood:

Blogs I Commented On:

Summary:
The purpose of this paper is to create a procedure which allows users to create accelerometer-based gesture control when applying HMMs. Primarily, the authors refer to gestures as user hand movements collected with a set of sensors in a handheld device, modeled by machine learning methods. They use HMMs to recognize gestures since they can model time-series with spatial and temporal variability. Their system first involves preprocessing, where gesture data is normalized to equal length and amplitude. A vector quantizer is then used to map three-dimensional vectors into one-dimensional sequence of codebook indices, where the codebook was generated from collected gesture vectors using a k-means algorithm. This information is then sent to the HMM with an ergodic topology. A codebook size of 8 and a model state size of 5 were chosen. After vector quantization, the gesture is either used to train the HMM or to evaluate the HMM’s recognition capability. Finally, the authors added noise to test whether copied gesture data with added noise can reduce the amount of training repetitions done by the user when using discrete HMMs. The two types of noise distributions used were uniform and Gaussian, and various signal to noise ratios (SNR) were experimented with to determine which ratio value provided the best results. The system was evaluated for eight popular gestures applicable to a DVD playback system, and the experiments consisted of finding an optimal threshold value to converge the HMM, examining accuracy rates for different amounts of training repetitions, finding an optimal SNR value, and examining the effects of using noise distorted signal duplicates in training. With six training repetitions, accuracy was over 95%, the best accuracy for Gaussian and uniformly distributed noise with SNR = 3 and 5 was 97.2% ad 96.3%, respectively.

Discussion:
Some thoughts on the paper:

It felt like the authors wanted to create a system which allowed users to create customized “macros” analogous for hand motion gestures. It seems like an interesting idea, but my main concern with their system is with its robustness in relation to other users whom didn’t train these “macros.” It’s a novel idea to incorporate noise into existing training gesture data in order to generalize their system more while maintaining low training repetition, but the paper does not tell us how it performs on multiple users. The results may have given really high accuracy rates, but it’s a bit misleading since I didn’t see a separate test data used from, say, another user. I do think it’s a fine system if it’s for an application meant for that specific user though, but doesn’t seem robust for multiple users. If the latter is desired, I have no idea if this system will perform well.
They tested their data on 2D data. Seems like a waste of z-axis data, since it could have been done by omitting that third dimension. But then, I can’t really imagine truly useful gestures that would take advantage of z-axis data.
I think it would be better to have a system where users sketch their gestures in 2D on-screen, and then have the system try to recognize the accelerometer data using existing sketch recognition techniques.

SPIDAR G&G: A Two-Handed Haptic Interface for Bimanual VR Interaction (Murayama, et al – 2004)

26 March 2008

Current Mood: studious

Blogs I Commented On:

Summary:
This is a haptics hardware paper consisting of a haptic interface for two-handed manipulation of objects in a virtual environment. The system in this paper, SPIDAR-G&G, consists of a pair of string-based 6DOF haptic devices called SPIDAR-G. These haptic devices allow translation and rotational manipulation of virtual objects, and also force and torque feedback to the user. It does so by first getting position and orientation from each grip representing hand position by measuring the strings’ length, then calculating collision detection between virtual objects and the user’s hands, and finally displaying the appropriate force dieback by controlling the tension of each string. Three users familiar with VR interfaces participated in evaluating the system by timing the completion of a 3D pointing task. The four tasks consisted of either one- or two-handed manipulation, and with or without haptic feedback. As expected, two-handed manipulation with haptic feedback performed the best.

Discussion:
This paper came out a few years ago while it was still in its initial phase, but it’s an innovative system for providing physical feedback for two-handed manipulation of virtual objects. At the time of the paper’s publication, it still had a lot of work needed to be done, but I can see how it is still open for lots of improvement, unlike the 3D Tractus device. It’s a good start for the VR domain too, though it seems quite bulky and also limited with strings. Achieving the system without strings would be another feat in itself though.

Gesture Recognition with a Wii Controller (Scholmer, et al – 2008)

Current Mood: studious

Blogs I Commented On:

Summary:
This paper claims to be a gesture recognition system for the Wiimote, composed of a filter, k-mean quantizer, HMM model, and Bayes classifier. The system was tested on a set of trivial gestures, and their results did not achieve perfect recognition. That’s all. I’m serious.

Discussion:
The primary reason why I didn’t like this paper was because it was incomplete yet still accepted. If their chosen input device wasn’t new, I don’t think it would have been accepted. It really is lacking a lot of data. I do like this paper though, because our class can do better. I’m sorry that I chose this paper, everyone.

Taiwan sign language (TSL) recognition based on 3D data and neural networks (Lee & Tsai – 2007)

Current Mood: studious

Blogs I Commented On:

Summary:
This is a hand gesture recognition paper which uses a neural networks-based approach for the domain of Taiwanese Sign Language (TSL). Data for 20 right-hand gestures was retrieved using a VICON, which are then fed into their neural network, where 15 geometric distances were employed as feature representation of the different gestures. Their backpropagation neural network structure, implemented in MATLAB, had 15 input units, 20 output units, and two hidden layers in total. Recognition rates for varying number of neurons gave roughly 90% accuracy.

Discussion:
This is the second GR paper which used NNs, and I think that this paper was superior simply because their NN was more advanced. It was a vanilla NN implementation though, and it seemed like they just got some data and ran it though some MATLAB NN library. It’s a very straightforward paper that would have been more interesting had their gestures been more representative or complex. A vast majority of the gestures in this paper derive from words which are hardly used in the language. It appears that the authors preferred to work with gestures that were easier to classify as opposed to gestures that were actually common used.

Paul Taele's Blog on Gesture Recognition

Gesture Recognition Using an Acceleration Sensor and Its Application to Musical Performance Control (Sawada & Hashimoto – 1997)

02 April 2008

Activity Recognition using Visual Tracking and RFID (Krahnstoever, et al – 2005)

01 April 2008

Gestures without Libraries, Toolkits or Training: A $1 Recognizer for User Interface Prototypes (Wobbrock, et al – 2007)

31 March 2008

Enabling fast and effortless customization in accelerometer based gesture interaction (Mantyjarvi, et al – 2004)

SPIDAR G&G: A Two-Handed Haptic Interface for Bimanual VR Interaction (Murayama, et al – 2004)

26 March 2008

Gesture Recognition with a Wii Controller (Scholmer, et al – 2008)

Taiwan sign language (TSL) recognition based on 3D data and neural networks (Lee & Tsai – 2007)

Blog Archive

Links