Gestures without Libraries, Toolkits or Training: A $1 Recognizer for User Interface Prototypes (Wobbrock, et al – 2007)

31 March 2008

Current Mood:

Blogs I Commented On:


Summary:
The authors created an easy, cheap, and highly portable gesture recognizer called the $1 Gesture Recognizer. The algorithm requires only about one hundred lines of code and relies on nothing more than basic geometry and trigonometry. Its contributions include being easy to implement for novice user interface prototypers, serving as a measuring stick against more advanced algorithms, and giving insight into which gestures are “best” for people and computer systems. Challenges for gesture recognizers in general include being resilient to sampling variations, supporting optimal and configurable rotation, scale, and position invariance, requiring no advanced math techniques, being easy to write in a few lines of code, being teachable with a single example, returning an N-best list with sensible scores that are independent of the number of points, and providing recognition rates competitive with more advanced algorithms.

$1 copes with those challenges in a four-step algorithm: 1) re-sample the gesture to N points, where 32 <= N <= 256, 2) rotate once based on the indicative angle, which is the angle formed between the gesture’s centroid and its starting point, 3) scale non-uniformly to a reference square and translate so that the centroid sits at the origin, and 4) recognize by searching for the rotation angle that yields the best score against each template. An analysis of rotation invariance shows that there’s no guarantee that candidate points and template points will optimally align after rotating the indicative angle to 0 degrees, so $1 uses a Golden Section Search (GSS), which narrows the search range using the Golden Ratio, to find the optimal matching angle.
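To make the four steps concrete, here's a rough Python sketch of how I picture the preprocessing and the GSS scoring; the function names, N = 64, and the reference square size are my own choices, not the authors' reference pseudocode.

```python
# Rough sketch of $1-style preprocessing and Golden Section Search scoring.
# Function names, N = 64, and SQUARE_SIZE are my own choices, not the paper's code.
import math

N = 64                            # resampled point count (paper allows 32 <= N <= 256)
SQUARE_SIZE = 250.0               # reference square for non-uniform scaling
PHI = 0.5 * (-1 + math.sqrt(5))   # Golden Ratio used by the search

def path_length(points):
    return sum(math.hypot(x1 - x0, y1 - y0)
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def centroid(points):
    return (sum(p[0] for p in points) / len(points),
            sum(p[1] for p in points) / len(points))

def resample(points, n=N):
    interval = path_length(points) / (n - 1)
    D, new_pts = 0.0, [points[0]]
    pts = list(points)
    i = 1
    while i < len(pts):
        (x0, y0), (x1, y1) = pts[i - 1], pts[i]
        d = math.hypot(x1 - x0, y1 - y0)
        if D + d >= interval and d > 0:
            t = (interval - D) / d
            q = (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
            new_pts.append(q)
            pts.insert(i, q)      # q becomes the start of the next segment
            D = 0.0
        else:
            D += d
        i += 1
    while len(new_pts) < n:       # guard against round-off at the tail
        new_pts.append(points[-1])
    return new_pts[:n]

def rotate_by(points, angle):
    cx, cy = centroid(points)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return [((x - cx) * cos_a - (y - cy) * sin_a + cx,
             (x - cx) * sin_a + (y - cy) * cos_a + cy) for x, y in points]

def rotate_to_zero(points):
    cx, cy = centroid(points)
    angle = math.atan2(cy - points[0][1], cx - points[0][0])
    return rotate_by(points, -angle)          # indicative angle -> 0 degrees

def scale_and_translate(points):
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    w, h = (max(xs) - min(xs)) or 1e-9, (max(ys) - min(ys)) or 1e-9
    scaled = [(x * SQUARE_SIZE / w, y * SQUARE_SIZE / h) for x, y in points]
    cx, cy = centroid(scaled)
    return [(x - cx, y - cy) for x, y in scaled]   # centroid becomes the origin

def path_distance(a, b):
    return sum(math.hypot(p[0] - q[0], p[1] - q[1]) for p, q in zip(a, b)) / len(a)

def distance_at_best_angle(candidate, template, lo=-math.radians(45),
                           hi=math.radians(45), tol=math.radians(2)):
    # Golden Section Search over the rotation angle for the minimum path distance.
    x1 = PHI * lo + (1 - PHI) * hi
    x2 = (1 - PHI) * lo + PHI * hi
    f1 = path_distance(rotate_by(candidate, x1), template)
    f2 = path_distance(rotate_by(candidate, x2), template)
    while abs(hi - lo) > tol:
        if f1 < f2:
            hi, x2, f2 = x2, x1, f1
            x1 = PHI * lo + (1 - PHI) * hi
            f1 = path_distance(rotate_by(candidate, x1), template)
        else:
            lo, x1, f1 = x1, x2, f2
            x2 = (1 - PHI) * lo + PHI * hi
            f2 = path_distance(rotate_by(candidate, x2), template)
    return min(f1, f2)

def preprocess(raw_points):
    return scale_and_translate(rotate_to_zero(resample(raw_points)))
```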

Limitations of $1 include being unable to distinguish gestures whose identities depend on specific orientations, aspect ratios, or locations; distorting horizontal and vertical lines through its non-uniform scaling; and being unable to differentiate gestures by speed, since it doesn’t use time. To handle variation with $1, new templates can be defined under a single name to capture that variation. A study was done to compare $1 with a modified Rubine classifier and a Dynamic Time Warping (DTW) template matcher. The study showed that $1 and DTW were more accurate than Rubine, and that $1 and Rubine executed faster than DTW.

Discussion:
I guess I should change the discussion a bit because we're now looking at this paper from a GR perspective instead of an SR perspective. Our SR class was quite critical of this paper at the time, given that there were already existing SR algorithms that were more capable. Maybe $1 isn't as bad for GR, since the simplicity of the algorithm could help bring GR-based applications into the mainstream, and since gestures for glove- and wand-based devices probably aren't as complicated to handle as those for pen-based devices. The limitations we noted in the SR class haven't gone away simply because we shifted to GR, but I don't think they're as disadvantageous the second time around. I guess we won't know for sure until we start experimenting with various applications that use $1.

Enabling fast and effortless customization in accelerometer based gesture interaction (Mantyjarvi, et al – 2004)

Current Mood:

Blogs I Commented On:


Summary:
The purpose of this paper is to create a procedure that allows users to customize accelerometer-based gesture control using HMMs. The authors refer to gestures as user hand movements collected with a set of sensors in a handheld device and modeled by machine learning methods. They use HMMs to recognize gestures since HMMs can model time series with spatial and temporal variability. Their system first involves preprocessing, where gesture data is normalized to equal length and amplitude. A vector quantizer is then used to map the three-dimensional acceleration vectors into a one-dimensional sequence of codebook indices, where the codebook was generated from collected gesture vectors using a k-means algorithm. This information is then sent to an HMM with an ergodic topology. A codebook size of 8 and a model state size of 5 were chosen. After vector quantization, the gesture is either used to train the HMM or to evaluate the HMM’s recognition capability. Finally, the authors added noise to test whether copies of gesture data with added noise can reduce the number of training repetitions required from the user when using discrete HMMs. The two noise distributions used were uniform and Gaussian, and various signal-to-noise ratios (SNR) were tried to determine which value provided the best results. The system was evaluated on eight popular gestures applicable to a DVD playback system, and the experiments consisted of finding an optimal threshold value for converging the HMM, examining accuracy rates for different numbers of training repetitions, finding an optimal SNR value, and examining the effects of using noise-distorted signal duplicates in training. With six training repetitions, accuracy was over 95%; the best accuracies for Gaussian and uniformly distributed noise were 97.2% and 96.3%, at SNR = 3 and 5 respectively.
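As a quick illustration of the quantization step, here's a minimal sketch of mapping 3D acceleration vectors to a 1D sequence of codebook indices for a discrete HMM; I'm using scikit-learn's KMeans and made-up data, with only the codebook size of 8 taken from the paper.

```python
# Minimal sketch of vector quantization for discrete-HMM gesture input.
# Codebook size 8 follows the paper; the data and everything else is my assumption.
import numpy as np
from sklearn.cluster import KMeans

CODEBOOK_SIZE = 8

def build_codebook(training_vectors, n_codes=CODEBOOK_SIZE):
    """Cluster all collected 3D acceleration vectors into a codebook."""
    km = KMeans(n_clusters=n_codes, n_init=10, random_state=0)
    km.fit(training_vectors)            # training_vectors: shape (num_samples, 3)
    return km

def quantize(gesture, codebook):
    """Map one gesture (T x 3 acceleration samples) to a 1D index sequence."""
    return codebook.predict(gesture)    # shape (T,), values in 0..n_codes-1

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    all_vectors = rng.normal(size=(500, 3))       # stand-in for collected gesture data
    codebook = build_codebook(all_vectors)
    one_gesture = rng.normal(size=(40, 3))        # stand-in for a normalized gesture
    print(quantize(one_gesture, codebook))        # discrete observation sequence for the HMM
```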

Discussion:
Some thoughts on the paper:
  • It felt like the authors wanted to create a system which allowed users to create customized “macros” for hand motion gestures. It seems like an interesting idea, but my main concern with their system is its robustness for other users who didn’t train these “macros.” It’s a novel idea to incorporate noise into existing training gesture data in order to generalize the system while keeping training repetitions low, but the paper does not tell us how it performs across multiple users. The results may have given really high accuracy rates, but that’s a bit misleading since I didn’t see a separate test set from, say, another user. I do think it’s a fine system for an application meant for that specific user, but it doesn’t seem robust for multiple users. If the latter is desired, I have no idea whether this system will perform well.
  • They tested their system only on 2D gestures. That seems like a waste of the z-axis data, since the same thing could have been done by simply omitting the third dimension. But then, I can’t really imagine truly useful gestures that would take advantage of z-axis data.
  • I think it would be better to have a system where users sketch their gestures in 2D on-screen, and then have the system try to recognize the accelerometer data using existing sketch recognition techniques.

SPIDAR G&G: A Two-Handed Haptic Interface for Bimanual VR Interaction (Murayama, et al – 2004)

26 March 2008

Current Mood: studious

Blogs I Commented On:


Summary:
This is a haptics hardware paper about a haptic interface for two-handed manipulation of objects in a virtual environment. The system in this paper, SPIDAR-G&G, consists of a pair of string-based 6DOF haptic devices called SPIDAR-G. These haptic devices allow translational and rotational manipulation of virtual objects and provide force and torque feedback to the user. It does so by first computing the position and orientation of each grip, which represents a hand, from the measured string lengths, then performing collision detection between virtual objects and the user’s hands, and finally displaying the appropriate force feedback by controlling the tension of each string. Three users familiar with VR interfaces participated in evaluating the system by timing the completion of a 3D pointing task. The four tasks consisted of either one- or two-handed manipulation, with or without haptic feedback. As expected, two-handed manipulation with haptic feedback performed the best.

Discussion:
This paper came out a few years ago while the system was still in its initial phase, but it’s an innovative system for providing physical feedback for two-handed manipulation of virtual objects. At the time of the paper’s publication, it still had a lot of work left to be done, but I can see how it is still open for lots of improvement, unlike the 3D Tractus device. It’s a good start for the VR domain too, though it seems quite bulky and also limited by its strings. Achieving the same system without strings would be another feat in itself though.

Gesture Recognition with a Wii Controller (Scholmer, et al – 2008)

Current Mood: studious

Blogs I Commented On:


Summary:
This paper claims to present a gesture recognition system for the Wiimote, composed of a filter, a k-means quantizer, an HMM, and a Bayes classifier. The system was tested on a set of trivial gestures, and their results did not achieve perfect recognition. That’s all. I’m serious.

Discussion:
The primary reason I didn’t like this paper is that it felt incomplete yet was still accepted. If their chosen input device weren’t new, I don’t think it would have been accepted. It really is lacking a lot of data. I do like this paper though, because our class can do better. I’m sorry that I chose this paper, everyone.

Taiwan sign language (TSL) recognition based on 3D data and neural networks (Lee & Tsai – 2007)

Current Mood: studious

Blogs I Commented On:


Summary:
This is a hand gesture recognition paper which uses a neural network-based approach for the domain of Taiwanese Sign Language (TSL). Data for 20 right-hand gestures was captured using a VICON system and then fed into their neural network, with 15 geometric distances employed as the feature representation of the different gestures. Their backpropagation neural network, implemented in MATLAB, had 15 input units, 20 output units, and two hidden layers in total. Recognition rates for varying numbers of hidden neurons were roughly 90%.
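For reference, here's a minimal sketch of the kind of network described: 15 distance features in, 20 gesture classes out, two hidden layers. I'm using scikit-learn's MLPClassifier with hidden layer sizes and toy data that are my own assumptions, since the paper's MATLAB setup isn't reproduced here.

```python
# Minimal sketch of a backpropagation network for 15 geometric-distance features
# and 20 gesture classes. Hidden layer sizes and the toy data are my assumptions.
import numpy as np
from sklearn.neural_network import MLPClassifier

N_FEATURES, N_CLASSES = 15, 20

def make_toy_data(samples_per_class=30, seed=0):
    rng = np.random.default_rng(seed)
    X, y = [], []
    for c in range(N_CLASSES):
        center = rng.normal(scale=5.0, size=N_FEATURES)   # stand-in class prototype
        X.append(center + rng.normal(scale=0.5, size=(samples_per_class, N_FEATURES)))
        y.extend([c] * samples_per_class)
    return np.vstack(X), np.array(y)

X, y = make_toy_data()
clf = MLPClassifier(hidden_layer_sizes=(30, 30),   # two hidden layers (sizes assumed)
                    activation="logistic",          # sigmoid units, as in classic backprop
                    max_iter=2000, random_state=0)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```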

Discussion:
This is the second GR paper which used NNs, and I think that this paper was superior simply because their NN was more advanced. It was a vanilla NN implementation though, and it seemed like they just got some data and ran it through some MATLAB NN library. It’s a very straightforward paper that would have been more interesting had their gestures been more representative or complex. A vast majority of the gestures in this paper derive from words which are hardly used in the language. It appears that the authors preferred to work with gestures that were easier to classify as opposed to gestures that were actually commonly used.

Hand gesture modeling and recognition involving changing shapes and trajectories, using a Predictive EigenTracker (Patwardhan & Roy – 2006)

Current Mood: studious

Blogs I Commented On:


Summary:
This is a hand gesture recognition paper that takes an eigenspace-based modeling approach, which takes into account both trajectory and shape information. Recognition involves eigenspace projection and probability computation using the Mahalanobis distance. The paper builds on a previous system called EigenTracker, which can track moving objects that undergo appearance changes. The authors augment that system into what they call the Predictive EigenTracker in three ways: a particle filtering-based predictive framework, on-the-fly tracking of unknown views, and a combination of skin color and motion cues for hand tracking. Their gesture model framework accounts for both the shape and the temporal trajectory of the moving hand by constructing an eigenspace of suitably scaled shapes from a large number of training instances corresponding to the same shape. The framework also incorporates selecting a vocabulary to maximize recognition accuracy by computing the Mahalanobis distance of a query gesture from all gestures using some k shape-trajectory coefficient pairs. This gives a probability of the given gesture belonging to some set of gestures. To test their system, the authors use a representative set of eight gestures for controlling a software audio player.
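As a small illustration of the recognition step, here's a sketch (mine, not the authors') of classifying a query's coefficient vector by Mahalanobis distance against per-gesture training sets, using scipy; the data, dimensionality, and class names are made up.

```python
# Minimal sketch of Mahalanobis-distance classification of a query feature vector
# (e.g., shape-trajectory coefficients) against per-gesture training sets.
# The data, dimensionality, and gesture names are made up for illustration.
import numpy as np
from scipy.spatial.distance import mahalanobis

def class_stats(samples):
    """Mean and (regularized) inverse covariance for one gesture class (N x d)."""
    mean = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)
    return mean, np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))

def classify(query, class_samples):
    """Return the class whose training distribution is closest to the query."""
    dists = {}
    for label, samples in class_samples.items():
        mean, inv_cov = class_stats(samples)
        dists[label] = mahalanobis(query, mean, inv_cov)
    return min(dists, key=dists.get), dists

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    classes = {g: rng.normal(loc=i, size=(50, 6))
               for i, g in enumerate(["play", "stop", "next"])}
    query = rng.normal(loc=1.0, size=6)            # should land near "stop"
    label, dists = classify(query, classes)
    print(label, {k: round(v, 2) for k, v in dists.items()})
```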

Discussion:
The paper was not an easy read for me, but from what I did read, it was an interesting system involving an approach we haven’t been exposed to yet. I do have some doubts, because I don’t know how this system is any different from using typical vision-based recognition specifically for hands. Concerning their experiment, I can’t gauge what the results are since they aren’t straightforward. Furthermore, the environment seems too controlled due to a white background and black, long-sleeved shirt in use for hand tracking.

Wiizards: 3D Gesture Recognition for Game Play Input (Kratz, et al – 2007)

Current Mood: studious

Blogs I Commented On:


Summary:
This is another applications paper that uses the Wiimote. Their application is called Wiizards, a multiplayer game which uses HMMs for gesture recognition. The game itself is a two-player zero-sum game, where each player tries to damage the other while limiting damage to himself. Gestures dictate spell casting, and players are more successful in the game if they use a variety of them. The three main components are the Wiimote, the gesture recognizer, and the game itself. Observations for the model are accelerometer readings from the Wiimote, normalized using calibration information. Each gesture is an observation sequence, and each has a separate model trained with the Baum-Welch algorithm for recognition. The probability of a gesture given a model, over the observations and hidden states, is calculated with the Viterbi algorithm. To train the models, data was collected from 7 users. Each user was shown the gestures and performed them 40 times, and an HMM was created from the user data.
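Since recognition here boils down to scoring an observation sequence against each gesture's HMM, here's a bare-bones Viterbi scorer in Python; the two-state toy models and gesture names are my own inventions, not the authors' trained parameters.

```python
# Bare-bones Viterbi scoring of a discrete observation sequence against an HMM,
# in log space. The toy two-state models below are made up for illustration.
import numpy as np

def viterbi_log_prob(obs, log_pi, log_A, log_B):
    """Log-probability of the best hidden-state path for the observation sequence.

    obs    : sequence of observation symbol indices
    log_pi : (S,)   log initial state probabilities
    log_A  : (S, S) log transition probabilities
    log_B  : (S, V) log emission probabilities
    """
    delta = log_pi + log_B[:, obs[0]]                 # best path score ending in each state
    for o in obs[1:]:
        delta = np.max(delta[:, None] + log_A, axis=0) + log_B[:, o]
    return delta.max()

def recognize(obs, models):
    """Pick the gesture whose HMM gives the highest Viterbi score."""
    scores = {name: viterbi_log_prob(obs, *params) for name, params in models.items()}
    return max(scores, key=scores.get), scores

if __name__ == "__main__":
    log = np.log
    # Toy gesture models: 2 hidden states, 3 observation symbols.
    fireball = (log([0.6, 0.4]),
                log([[0.7, 0.3], [0.4, 0.6]]),
                log([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]))
    shield   = (log([0.5, 0.5]),
                log([[0.9, 0.1], [0.2, 0.8]]),
                log([[0.1, 0.2, 0.7], [0.6, 0.3, 0.1]]))
    obs = [0, 1, 1, 2, 2]
    print(recognize(obs, {"fireball": fireball, "shield": shield}))
```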

Discussion:
There are relatively few gesture recognition papers that cater to the Wiimote, since it’s a new device, and given the number of Wiimote papers on the topic, this would be considered a pretty good paper. It’s interesting that they built a nice application to demonstrate their HMM-based recognizer, but the system does have some kinks, since the paper reports about 50% accuracy for users who hadn’t used the system before.

TIKL: Development of a Wearable Vibrotactile Feedback Suit for Improved Human Motor Learning (Lieberman & Breazeal – 2007)

Current Mood: studious

Blogs I Commented On:


Summary:
This is a hardware application paper with the goal of creating a wearable robotic suit that analyzes target movement and provides real-time corrective vibrotactile feedback to a student’s body over multiple joints, in order to help them quickly develop new motor skills. Their system consists of optical tracking (for motion capture using markers on the wearable device), tactile actuators (for proportional feedback at the joints), feedback software (for determining the vibrotactile signals), and customized hardware for output control. Their system was tested by having users copy a series of images shown on a video screen while wearing the suit. The user study generally gave positive feedback on the system.

Discussion:
I didn’t know how to comment on this paper directly since its applications didn’t really relate to the core aspect of the course. Judged independently from the purpose of the class, I felt it was a wonderful system that also had a sufficient user study applied to it. I could see some merits related to our class if it concentrated more on the hand.

A Spatio-temporal Extension to Isomap Nonlinear Dimension Reduction (Jenkins & Matari – 2004)

Current Mood: studious

Blogs I Commented On:


Summary:
The focus of this paper is efficiently uncovering the structure of motion using unsupervised learning for dimension reduction. The authors use a spatio-temporal Isomap (ST-Isomap) approach for both continuous and segmented input data with sequential temporal ordering, where continuous ST-Isomap is suited for uncovering spatio-temporal manifolds of data, and segmented ST-Isomap is for uncovering spatio-temporal clusters in segmented data. Their technique tries to address the temporal relationships of proximal disambiguation and distal correspondence in order to uncover spatio-temporal structure. Their example of the two relationships is two low waving motions in different directions versus a low and a high waving motion in the same direction: the former pair falls under proximal disambiguation, and the latter under distal correspondence. Their ST-Isomap approach extends Isomap with temporal windowing to give each data point a temporal history, hard spatio-temporal correspondences between proximal data pairs, and reduced distances between data pairs with spatio-temporal relationships to accentuate their similarity.
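The temporal windowing step is simple enough to sketch: each data point gets stacked with its temporal neighbors before the neighborhood graph is built. This is my own minimal illustration of that idea (window size assumed), not the authors' code.

```python
# Minimal sketch of temporal windowing: each frame is augmented with its temporal
# neighbors so that spatially similar but temporally different motions become
# distinguishable. The window size is an assumption.
import numpy as np

def temporal_window(data, half_window=2):
    """Stack each frame with +/- half_window neighbors.

    data: (T, d) array of frames -> returns (T, d * (2*half_window + 1)).
    Edges are padded by repeating the first/last frame.
    """
    T, d = data.shape
    padded = np.vstack([np.repeat(data[:1], half_window, axis=0),
                        data,
                        np.repeat(data[-1:], half_window, axis=0)])
    windows = [padded[t:t + 2 * half_window + 1].ravel() for t in range(T)]
    return np.array(windows)

if __name__ == "__main__":
    motion = np.random.default_rng(0).normal(size=(10, 3))   # 10 frames of 3D data
    print(temporal_window(motion).shape)                      # (10, 15)
```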

Discussion:
I honestly had no idea what this paper was talking about most of the time. Mostly I was left with the lingering feeling that I couldn’t find an aspect of this paper relevant to the topics we are covering in class. But I think it’s safe to say that this is a nice paper to refer to if one wishes to use unsupervised learning on hand motion.

Articulated Hand Tracking by PCA-ICA Approach (Kato, et al – 2006)

Current Mood: studious

Blogs I Commented On:


Summary:
This paper focuses on hand tracking using a PCA-ICA approach. To do so, the authors first model the human hand with OpenGL as spheres, columns, and a rectangular parallelepiped. Hand motion data is captured with a data glove by recording all combinations of open and closed fingers, so that angles for 20 joints are measured. These measurements are divided into 100 time instances to obtain a 2000-dimensional hand motion row vector. PCA is then used to find a smaller set of variables with less redundancy, measured by correlations between data elements using Singular Value Decomposition. In their approach, the authors first use PCA to reduce dimensionality, and then perform ICA on the low-dimensional PCA subspace to extract feature vectors. For ICA, the authors use a neural learning algorithm that maximizes the joint entropy using stochastic gradient ascent. The ICA-based model can thus represent a hand pose by five independent parameters, each corresponding to a particular finger at a particular time instant. Comparing the two, PCA basis vectors represent global hand motion, including mostly infeasible hand motions, whereas ICA basis vectors represent particular finger motions. Particle filtering is then used for tracking hands, first by generating samples where the hand pose is determined by the five parameters (one per finger) from the ICA-based model, and then by using an observation model that employs edge and silhouette information to evaluate the hypotheses.
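The PCA-then-ICA idea is easy to sketch with scikit-learn; note this uses FastICA rather than the neural stochastic-gradient learning rule in the paper, and the dimensions and random data are my own stand-ins for their glove measurements.

```python
# Toy sketch of the PCA -> ICA pipeline: reduce dimensionality first, then extract
# independent components in the low-dimensional subspace. Data and sizes are made up,
# and FastICA stands in for the paper's neural learning rule.
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(0)
motions = rng.normal(size=(60, 2000))       # e.g., 60 motion vectors of 20 joints x 100 frames

pca = PCA(n_components=5)                   # five components, mirroring "one per finger"
low_dim = pca.fit_transform(motions)        # (60, 5) global-motion subspace

ica = FastICA(n_components=5, random_state=0, max_iter=1000)
sources = ica.fit_transform(low_dim)        # (60, 5) independent components

print("explained variance:", np.round(pca.explained_variance_ratio_, 3))
print("independent sources shape:", sources.shape)
```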

Discussion:
If I pretend to understand what the paper was talking about, then I will say that I found it intriguing that they combined the strengths of PCA and ICA to come up with what appears to be a viable hand tracking system, in that PCA’s limitations were overcome by ICA to model the hand for tracking purposes. It’s kind of hard to judge the merits of this paper based on such scant results, though the images provided at the end of the paper do show it in less-than-ideal environments. I wish it had actual working results though (the online video link is dead).

The 3D Tractus: A Three-Dimensional Drawing Board (Lapides, et al – 2006)

Current Mood: studious

Blogs I Commented On:


Summary:
This paper discusses the 3D Tractus, a drawing board-like device which can be raised and lowered to provide sketching in 3D. The device employs a counterweight system for easy vertical motion, four vertical aluminum bars for support, a tablet for the actual sketching, and a string potentiometer as a height sensor. For the software, a pen-based device handles input, and users have three visual software components to work with: a 3D sketch overview window, a drawing pad window, and a menu bar for less common features. Dynamic line width is used to provide depth cues, and a traditional image editor-like eraser is used for deleting entire strokes.

Discussion:
The device discussed in this paper is an interesting concept for emulating 3D sketching using a standard tablet. There are obviously some limitations in providing true 3D sketching, as it can only provide depth cues through a 2D window as opposed to a VR-based solution that would visualize those same depths. I would imagine the usability would feel a bit distracting, since the user has to rely on the other arm to navigate the drawing area in order to perform 3D sketching. Some improvements I would suggest are a button to automate vertical movement and an inclined surface for easier drawing. I can imagine some people who would enjoy using this system over traditional devices.

A Hidden Markov Model Based Sensor Fusion Approach for Recognizing Continuous Human Grasping Sequences (Bernardin, et al – 2005)

Current Mood: studious

Blogs I Commented On:


Summary:
This paper discusses a system based on HMMs for recognizing hand grasps. Classification follows grasp types from Kamakura’s grasp taxonomy, which separates grasps into 14 different classes by purpose, hand shape, and contact points with grasped objects. Each HMM is assigned to a different grasp type, and recognition is performed using the Viterbi algorithm. The focus of their system differs from existing ones in that it distinguishes between the purposes of grasps, as opposed to the object shape or the number of fingers used. Their glove-based device is equipped with flexible capacitive sensors to measure readings for grasping. Noise and unwanted motion were filtered out with a garbage model with an ergodic topology, and a ‘task’ grammar was used to reduce the search space. Their only assumption is that a grasp motion is followed by a release motion. Their system is able to achieve 90% accuracy for multiple users.

Discussion:
The recognition system discussed in this paper is a lot different from the other types of systems discussed in prior papers because none of the papers tackled the problem of recognizing grasping. I liked the paper because it was different and covers an area overlooked in the hand gesture recognition domain. The use of grasp as a feature is very intriguing, and I believe that incorporating it in a recognition system would make such a system more powerful. Reminds me of the hand tension paper, now that I think about it. It does seem like extracting grasping data is a non-trivial affair.

Temporal Classification: Extending the Classification Paradigm to Multivariate Time Series (Kadous – 2002)

Current Mood: studious

Blogs I Commented On:


Summary:
The core idea behind this thesis work is taking advantage of multivariate time series in order to improve hand gesture recognition accuracy. In particular, the author focuses on the sub-events that a human might detect as part of a sign within some sign language, in this case Australian Sign Language (Auslan). The author goes on to say that metafeatures can parameterize these sub-events by capturing properties such as their temporal characteristics within a parameter space, for example a 2D space of time and height, for feature construction. The temporal classification system, which I’m guessing is called TClass, uses synthetic events found within this space for feature construction, and applies several metafeatures to the training instances to construct synthetic features. This is done so that TClass can mix temporal and non-temporal features, something the author claims is not found in other temporal classification systems. A motivation is to produce a temporal classifier that gives comprehensible yet accurate descriptions. The system was applied to Auslan, a language where signs consist of a mixture of handshapes, location, orientation, movement, and expression. Data was collected with the Nintendo Powerglove and the Flock. The first input device was very noisy and far inferior to the second. Several machine learning techniques were tested in conjunction and in comparison with TClass. Some observations concerning the Flock data were that TClass didn’t perform well with the HMM, smoothing the data didn’t improve results, and TClass can handle tons of data. Accuracy for the Flock data is stated to be 98% with voting, which sounds like ensemble averaging.
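To get a feel for the metafeature idea, here's a tiny sketch (my own, not TClass itself) of one plausible metafeature: extracting local maxima from a sensor channel as points in a 2D (time, height) parameter space, which could then be clustered into synthetic events.

```python
# Tiny sketch of a "local maximum" metafeature: each detected peak becomes a point
# in a (time, height) parameter space. Clustering those points into synthetic events
# would be the next step; this only does the extraction.
import numpy as np

def local_max_events(signal):
    """Return (time, height) pairs for interior local maxima of a 1D signal."""
    events = []
    for t in range(1, len(signal) - 1):
        if signal[t] > signal[t - 1] and signal[t] >= signal[t + 1]:
            events.append((t, float(signal[t])))
    return events

if __name__ == "__main__":
    t = np.linspace(0, 4 * np.pi, 80)
    channel = np.sin(t) + 0.1 * np.random.default_rng(0).normal(size=t.size)
    for time_idx, height in local_max_events(channel):
        print(f"event at t={time_idx:2d}, height={height:+.2f}")
```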

Discussion:
Some of the results were a bit too confusing for me to give a fair critique of the performance of the author’s system. That would warrant reading the rest of the thesis, but in the areas that were clear, I thought it was a pretty good approach. The author also did a nice job collecting tons of data to build his system based on the more accurate Flock device, despite it not having multiple users like the Nintendo data. It seems like a sound approach with nice accuracy results, but comparisons to other temporal classifiers showing improvement would have made it better. I don’t think I saw them in the sections we were supposed to read.

Using Ultrasonic Hand Tracking to Augment Motion Analysis Based Recognition of Manipulative Gestures (Ogris, et al – 2005)

Current Mood: studious

Blogs I Commented On:


Summary:
This gesture recognition paper focuses on the hardware side, specifically on ultrasonics for improving recognition accuracy. The paper first discusses three issues inherent in all ultrasonic positioning systems: reflections (false signals due to reflective materials in the environment), occlusions (lost signals due to a lack of line of sight between communicating devices), and temporal resolution (low transmission counts due to the distance between communicating devices). To perform tracking, the authors smooth the sonic data in two steps: on the raw signals and on the resulting coordinates. Classification is then done using C4.5 and k-Nearest Neighbor (k-NN) for the motion sensor analysis. Each manipulative gesture in their experiment corresponds to an individually trained HMM for model-based classification, while a sliding window approach is used for frame-based classification. After classification is performed on all frames of a gesture, a majority decision is applied to the results, yielding a filtered decision for that gesture. Finally, a fusion method is applied in which the ultrasonic and motion signals are classified separately and then combined. Their experiment was on manipulative gestures for a bicycle repair task, where the fusion method performed best over k-NN, HMM, and C4.5 alone.
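Here's a small sketch of the frame-based classification plus majority-decision step as I read it; the per-window classifier is a stand-in for their k-NN/C4.5/HMM setup, and the window and step sizes are assumptions.

```python
# Sketch of frame-based classification with a sliding window followed by a
# majority decision over the per-window labels. The classifier is a stand-in.
from collections import Counter
import numpy as np

def sliding_windows(frames, window=10, step=5):
    """Yield overlapping windows (window x d) over the frame sequence (T x d)."""
    for start in range(0, len(frames) - window + 1, step):
        yield frames[start:start + window]

def classify_gesture(frames, classify_window, window=10, step=5):
    """Majority vote over per-window labels for one gesture segment."""
    labels = [classify_window(w) for w in sliding_windows(frames, window, step)]
    return Counter(labels).most_common(1)[0][0], labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(40, 6))                  # 40 frames of fused sensor features

    def dummy_classifier(window):                      # stand-in for k-NN / C4.5 / HMM
        return "screw" if window.mean() > 0 else "unscrew"

    decision, votes = classify_gesture(frames, dummy_classifier)
    print(decision, votes)
```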

Discussion:
Ultrasonics appears to be a potentially viable sensor for the hand gesture recognition domain judging from this paper. Performance was mixed for the traditional machine learning methods they experimented with, but the fusion approach gave pretty good accuracy. I had a hard time understanding exactly how their fusion approach works, though. From a hardware perspective, supplementing our available sensors with ultrasonics wouldn’t hurt.

American Sign Language Recognition in Game Development for Deaf Children (Brashear, et al – 2006)

Current Mood: studious

Blogs I Commented On:


Summary:
This is an application paper in the form of a computer game, called CopyCat, that utilizes gesture recognition technology to develop American Sign Language (ASL) skills in children. The paper’s goal is to augment ASL learning in a fun manner for a children’s curriculum. The game focuses on correctly practicing repetitions of ASL phrases through the use of a computer. Due to the lack of a prior recognition engine for this area, the authors use a Wizard of Oz study, emulating the missing functionality of the system for later implementation. For their data set, the authors selected phrases with three and four signs, and their recognition engine is currently limited to a subset of ASL single- and double-handed signs. To segment the samples, users were asked to use a mouse to indicate the start and end of their gestures, allowing the system to perform recognition on the pertinent phrases. Data consists of video of the user signing along with wireless accelerometers mounted on pink-colored gloves for easy processing through a computer vision algorithm. Image pixel data is converted to HSV space for image segmentation purposes. For accelerometer processing, each data packet consists of four values: a sequence number and one value for each of the three spatial axes. These packets are first synched to the video feed, then smoothed to account for the variable number of packets associated with each frame. Finally, the feature vectors themselves are a combination of the vision data and the accelerometer data, where the accelerometer data consists of the spatial dimensions for both hands, and the vision data consists of various hand characteristics.

Discussion:
This is another applications paper for hand gesture recognition, but what I liked about it is that it is both different from the typical applications papers in the domain our class is focusing on and potentially useful. I say potentially because, in its current form, it still has some work to do. The results given in the paper are not very insightful about how well it really performs, and I’m still critical of the toolkit used due to its possible limitations (mentioned in a prior blog post), but I don’t see anything that would stop it from being a useful final product for their target audience of children.

A Method for Recognizing a Sequence of Sign Language Words Represented in a Japanese Sign Language Sentence (Sagawa & Takeuchi – 2000)

Current Mood: studious

Blogs I Commented On:


Summary:
This is one of the older gesture segmentation and classification papers, this time for the domain of Japanese Sign Language (JSL). Signed words, which consist solely of hand gestures, are built by combining gesture primitives such as hand shape, palm direction, linear motion, and circular motion. During recognition, gesture primitives are identified from the input gesture, and then the signed word is recognized by the time and spatial relationships between the gesture primitives. To segment, they use a hand velocity parameter and a hand movement parameter. To reconcile differences between the two parameters, the hand movement segment border closest to the hand velocity border is used. Several more parameters determine whether the gesture was one-handed or two-handed, and gestures are classified into four types combining one- or two-handed with left- or right-dominant. To distinguish between word and transition segments, the authors extracted various features and discovered that the best discriminator was the minimum acceleration divided by the maximum velocity: if this parameter is small, the segment is a word, otherwise it’s a transition. Evaluating their system on 100 JSL sentences, they achieved 86.6% accuracy for word recognition and 58% accuracy for sentence recognition.
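The word-versus-transition feature is simple enough to sketch: compute velocity and acceleration over a segment and take the minimum acceleration magnitude divided by the maximum velocity magnitude. This is my own toy illustration, with a made-up threshold and frame rate.

```python
# Toy sketch of the word/transition feature: minimum acceleration magnitude divided
# by maximum velocity magnitude over a segment. Threshold and frame rate are made up.
import numpy as np

def word_transition_feature(positions, dt=1.0 / 30):
    """positions: (T, 3) hand positions; returns min|accel| / max|vel| for the segment."""
    vel = np.diff(positions, axis=0) / dt
    acc = np.diff(vel, axis=0) / dt
    speed = np.linalg.norm(vel, axis=1)
    accel_mag = np.linalg.norm(acc, axis=1)
    return accel_mag.min() / max(speed.max(), 1e-9)

def is_word_segment(positions, threshold=0.5):
    """Small feature value -> word segment; large -> transition (threshold assumed)."""
    return word_transition_feature(positions) < threshold

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    segment = np.cumsum(rng.normal(size=(30, 3)), axis=0) * 0.01   # stand-in hand trajectory
    print("feature value:", round(word_transition_feature(segment), 3))
```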

Discussion:
There were several things I liked about this paper. It was an easy read, had sound methods, and proceeded logically in creating the system. Unfortunately, the system is still a work in progress because of the poor accuracy in the end. It’s been almost a decade since this paper came out, and it was still unfamiliar territory at the time. I wish there were a follow-up paper that improved the accuracy rates.

Georgia Tech Gesture Toolkit: Supporting Experiments in Gesture Recognition (Westeyn, et al – 2003)

Current Mood: studious

Blogs I Commented On:


Summary:
This paper describes a toolkit, the Georgia Tech Gesture Toolkit, for developing gesture-based recognition systems. The toolkit is first prepared by modeling each gesture as a separate HMM, specifying a rule-based grammar for the possible sequences of gestures, and collecting and annotating data in numerical vector form, called feature vectors, over which the toolkit operates. The toolkit is trained and validated using cross-validation and leave-one-out validation. The rest of the paper discusses various applications which use this toolkit.

Discussion:
The paper discusses what appears to be a viable toolkit for general gesture recognition. It’s hard to judge how versatile this toolkit really is given the lame applications discussed later on in the paper, but it’s definitely worth taking a look. My primary gripe is the requirement of a grammar to specify a gesture. It’s not too bad for simple gestures, but for more practical ones, I’m guessing the grammar would have to be huge to handle all possible cases.

Computer Vision-Based Gesture Recognition for an Augmented Reality Interface (Storring, et al – 2004)

Current Mood: studious

Blogs I Commented On:


Summary:
The gesture recognition system described in this paper is geared toward supplementing an augmented reality system, primarily for round-table meetings. The authors focus on six gestures of various outstretched fingers for their interface, requiring that users adapt to the hard limitation that these gestures be performed on a plane. Segmentation is done using a color pixel-based approach by segmenting blobs of similar color in the image. This approach uses normalized RGB, transforming the RGB values into a color space that separates intensity from chromaticity in order to achieve invariance to changes in lighting intensity. In addition, skin blobs are constrained to a minimum and maximum number of pixels. After segmenting hand pixels from the image, the hand and fingers are approximated as a circle and a number of rectangles for gesture recognition. The number of rectangles represents the number of outstretched fingers, which could be found by doing a polar transformation around the center of the hand and counting the fingers present at each radius. To speed up the algorithm, they instead sample along concentric circles. Final classification is done by taking the finger count that is present for the most concentric circles. Recognized gestures are further filtered by a temporal filter, which requires the gesture to be held for a number of frames.
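Here's a rough sketch of the two steps I found most interesting: normalized-RGB skin thresholding and counting fingers along concentric circles. The thresholds, radii, and run-counting are my own assumptions, not the authors' values.

```python
# Rough sketch of normalized-RGB skin segmentation and finger counting along
# concentric circles. Thresholds and radii are assumptions, not the paper's values.
import numpy as np

def normalized_rg(image):
    """Convert an HxWx3 RGB image to normalized (r, g) chromaticity channels."""
    rgb = image.astype(float)
    s = rgb.sum(axis=2) + 1e-9
    return rgb[..., 0] / s, rgb[..., 1] / s

def skin_mask(image, r_range=(0.36, 0.55), g_range=(0.28, 0.36)):
    """Binary mask of pixels whose chromaticity falls inside an assumed skin box."""
    r, g = normalized_rg(image)
    return ((r > r_range[0]) & (r < r_range[1]) &
            (g > g_range[0]) & (g < g_range[1]))

def count_fingers(mask, center, radii, samples=360):
    """Count skin runs crossed by each concentric circle; return the modal count."""
    angles = np.linspace(0, 2 * np.pi, samples, endpoint=False)
    counts = []
    for radius in radii:
        # Wrap indices with modulo just to keep the sketch in-bounds at image edges.
        xs = (center[0] + radius * np.cos(angles)).astype(int) % mask.shape[1]
        ys = (center[1] + radius * np.sin(angles)).astype(int) % mask.shape[0]
        ring = mask[ys, xs].astype(int)
        # Number of 0 -> 1 transitions around the circle = number of skin runs crossed.
        counts.append(int(((np.roll(ring, 1) == 0) & (ring == 1)).sum()))
    values, freq = np.unique(counts, return_counts=True)
    return int(values[freq.argmax()])                 # count seen on the most circles

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.integers(0, 255, size=(120, 160, 3)).astype(np.uint8)   # stand-in camera frame
    mask = skin_mask(frame)
    print("modal finger count:", count_fingers(mask, center=(80, 60), radii=range(20, 50, 5)))
```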

Discussion:
This paper takes a computer vision approach for gesture recognition, and what was nice about the paper was the technique of concentric circles in determining which and how many fingers were outstretched for particular gestures. It’s quite novel and reminiscent of the technique Oltmanns used in his dissertation. On the other hand, the authors required a hard limit to do so, and this takes away from the robustness of their system. Existing methods in computer vision could probably perform just as well with their hard limit, and yet still function better without it as well.

3D Visual Detection of Correct NGT Sign Production (Lichtenauer, et al – 2007)

Current Mood: studious

Blogs I Commented On:


Summary:
This paper focuses on hand tracking based on skin detection for the domain of Dutch sign language (NGT). The method involves the user producing a sign, the hands and head being visually tracked, and their features being measured; a classifier is then used to evaluate the measured features. For tracking, the hands and head are followed by detecting skin color blobs assigned to the head and both hands from their previous positions, combined with template tracking during occlusions between the hands and the head. An operator is required to click a square around the face and the head/hair/neck region. Skin color is modeled by a 2D Gaussian perpendicular to the main direction of the distribution of positive skin samples in RGB space. For classification, the authors use fifty properties related to the 2D/3D location and movement of the hands, measured at each frame. First, a reference sign is selected for each classifier, and the time signal of a query is warped onto that sign using Dynamic Time Warping (DTW), solely for synchronization. The classifier then models the properties under the assumption that the features are independent. Base classifiers are built for single features and then combined by summing their results, where a feature is selected for classification if fewer than 20% of the negative examples have a feature value within some 50% winsorization interval of the positive set. Signs are classified if they exceed some threshold value, which is determined by evaluating the positive training examples and using the median of the resulting values. Their approach gives 8% true positives against 5% false positives.
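Since DTW shows up here purely for time synchronization, here's a compact generic DTW sketch (not tied to their fifty properties) that computes the alignment cost and warping path between two 1D signals.

```python
# Compact Dynamic Time Warping sketch: alignment cost and warping path between
# two 1D sequences. Generic implementation for illustration, not the authors' setup.
import numpy as np

def dtw(a, b):
    """Return (total alignment cost, warping path) for 1D sequences a and b."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack the cheapest path from (n, m) toward (1, 1).
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return cost[n, m], path[::-1]

if __name__ == "__main__":
    reference = np.sin(np.linspace(0, 2 * np.pi, 20))
    query = np.sin(np.linspace(0, 2 * np.pi, 30))       # same sign, different speed
    total_cost, path = dtw(query, reference)
    print(round(total_cost, 3), path[:5])
```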

Discussion:
If I could summarize this paper in one sentence, it would be that it’s a very complicated hand gesture recognition system which relies on skin color. Given the amount of space dedicated to their skin-color-based technique, I think it's made partially moot by the fact that it requires manually selecting body parts beforehand as opposed to doing so automatically. Add the fact that it also requires a fairly ideal environment, and I'm not really motivated to use their technique over some other vision-based technique. Then there's their results. It's not that they're good or bad. It's just...I still don't know how well their system performs even after seeing them.

Television control by hand gestures (Freeman & Weissman – 1994)

Current Mood: studious

Blogs I Commented On:


Summary:
The goal of this paper is to use hand gestures to operate a television set remotely, and the authors do so by creating a user interface that new users can master instantly. The problems claimed are the lack of a vocabulary for doing so, along with the image processing problem of identifying hand gestures quickly and reliably in a complex and unpredictable visual environment. The solution involves imposing constraints on the user and the environment while exploiting the visual feedback from the television. Users only memorize one gesture: holding an open hand toward the television. The computer tracks the hand, echoes its position with a hand icon, and then lets the user operate on-screen controls. The hand recognition method uses normalized correlation, where the position of the maximum correlation is taken as the user’s hand. Local orientation is used, as opposed to pixel intensities, for robustness against lighting variations, using filters in the image processing stage. Background removal is used to avoid analyzing stationary objects like furniture: the current image is linearly combined with a running average image, and the two are subtracted to detect image positions where the change is above some pre-set threshold. Only positions above the change threshold are then processed further, to gain efficiency and reduce false positives.
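The background removal step is a classic running-average trick, so here's a minimal numpy sketch of it; the blending factor, threshold, and toy frames are my own assumptions.

```python
# Minimal sketch of running-average background removal: blend each frame into a
# background estimate, then threshold the difference to find moving (hand) pixels.
# The blending factor and threshold are assumptions.
import numpy as np

class BackgroundSubtractor:
    def __init__(self, alpha=0.05, threshold=25.0):
        self.alpha = alpha              # how quickly the background adapts
        self.threshold = threshold      # minimum intensity change to count as motion
        self.background = None

    def apply(self, frame):
        """frame: HxW grayscale array; returns a boolean motion mask."""
        frame = frame.astype(float)
        if self.background is None:
            self.background = frame.copy()
        # Linearly combine the current frame with the running average.
        self.background = (1 - self.alpha) * self.background + self.alpha * frame
        return np.abs(frame - self.background) > self.threshold

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    subtractor = BackgroundSubtractor()
    scene = rng.integers(0, 50, size=(120, 160))            # static furniture, etc.
    for _ in range(20):                                      # let the background settle
        subtractor.apply(scene + rng.integers(0, 5, size=scene.shape))
    moving = scene.copy()
    moving[40:80, 60:100] += 120                             # an "open hand" appears
    print("moving pixels detected:", int(subtractor.apply(moving).sum()))
```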

Discussion:
This is an ancient paper dating back 14 years. Even though its contribution is primitive by today’s standards, I believe it was pretty innovative for its time in the various concepts it introduced. One example was its attempt to create a vocabulary for gesture recognition (though I found the contributions there highly underwhelming), and another was the use of background removal to visually detect the hand for recognition purposes. The authors cheat by requiring an open hand to do detection, and their application doesn’t appear intuitive for the task, but I think the ideas in the paper were ahead of their time.

A Survey of Hand Posture and Gesture Recognition Techniques and Technology (LaViola – 1999)

Current Mood: studious

Blogs I Commented On:


Summary:
This is basically a survey paper on hand gesture recognition, giving an overview of the technology for the domain at that time. The first part of the paper discusses glove- and vision-based approaches to hand gesture recognition and the data collected from them. The paper then shifts into a discussion of algorithmic techniques that can be applied to these input devices, ranging from feature extraction to learning algorithms. The last section then goes into depth on applications that can benefit from the devices and techniques discussed previously, ranging from sign language to virtual reality.

Discussion:
Since this paper is a PhD thesis, it’s a very long paper. Surprisingly enough, it’s an easy read and does a great job giving an overview of the types of topics that our class is focusing on. Even though it was written almost a decade ago, this thesis is still very relevant for today’s readers. That may be attributed to how well the author covered the spectrum of the field and maybe also the lack of huge changes in the field since then. While the thesis could be faulted by some for not going deep enough in some topics, I thought it was a great overall paper nonetheless.

[17] Real-time Locomotion Control by Sensing Gloves (Komura & Lam – 2006)

Current Mood: studious

Blogs I Commented On:


Summary:
This paper describes a system which uses sensing gloves as an input device, potentially for virtual 3D games. The sensing gloves used in this paper were the P5 and the Cyberglove. Their method consists of a calibration stage and a control stage. In the calibration stage, a mapping function is created that converts hand motion to human body motion by having the user mimic with the hand the character motion appearing on a graphical display. The control stage involves performing a new hand movement; the corresponding motion of the computer character is then generated by the mapping function and displayed in real time. Construction of the mapping function first involves topological matching of the human body and fingers by comparing the motion of the fingers to that of the character. This is estimated by first calculating the DOF and then the autocorrelation of the trajectories. The authors determined the autocorrelation value to be 50 cycles, and joints are classified as either full cycle, half cycle, or exceptional. After classifying the joints into one of the three categories, the hand’s generalized coordinates are matched to the character using the 3D velocity of the user’s fingers and the character’s body segments. This is done by first calculating the relative velocity of the character’s end effectors with respect to the root of the body, and then the relative velocities of the fingertips with respect to the wrist of the hand. Lastly, all the end effectors of the character are matched to the fingers of the user. Sometimes there is a phase shift between the user’s fingers and the character’s joints, which is remedied by keeping the ratio of the phase shift and the period the same when the user controls the character. In cases where the user suddenly extends or flexes the fingers, causing the mapping function to not cover the range of the new finger motion, the system extrapolates based on the tangent of the mapping function at the end of the original domain. Upper and lower joint angle boundaries are additionally used by keeping the joint angles constant when those boundaries are exceeded, until the mapping function comes back within the valid domain. A virtual 3D environment with walls and obstacles was prepared to test the system on users with the Cyberglove, where users controlled the computer character with the index and middle fingers to emulate walking, compared to keyboard input for the same task. The authors found that the average time to complete the task was shorter with the keyboard, but there were fewer collisions with the glove.

Discussion:
I found this to be an interesting application paper using glove devices for input, and I also thought it did a good job covering the details of their method and the motivation behind their design choices. My gripe is more with the reasoning behind their application, primarily because I’m not a fan of using gloves to mimic locomotion of computer characters. The Iba paper was the only one that came to mind which also focused on using gloves to achieve this feat, and I preferred the Iba paper because it felt more intuitive to use fingers to dictate motion as opposed to emulating motion.

[16] Shape Your Imagination: Iconic Gestural-Based Interaction (Marsh & Watt – 1998)

Current Mood: studious

Blogs I Commented On:


Summary:
This is an exploratory user study paper on iconic hand gestures. The goal of the paper was to test the hypothesis that iconic hand gestures are employable as an HCI technique for transferring spatial information. The study was conducted on a group of 12 non-computer scientists using 15 shapes and objects. These shapes and objects were split into two groups: the Primitive group had 2D and 3D geometric shapes found in most computer graphics systems, while the Complex and Compound group contained objects that can be composed of one or more Primitive group shapes. The study found that iconic gestures were used throughout for the Primitive group, where users preferred two-handed virtual depiction. For the complex and compound objects, users preferred to describe them using iconic two-handed gestures that were accompanied or substituted by pantomimic, deictic, and body gestures. Additionally, the authors found that iconic hand gestures were formed immediately for shapes in the Primitive group, but formation times varied widely for the Compound and Complex group.

Discussion:
This is a relatively early paper, released about 10 years ago, so the results from the user study are quite limited for the type of work in our course. It is an interesting user study that could possibly be modified and extended in order to derive results that would better fit the types of things we envision for the course, especially considering the dearth of exploratory user studies in the field.

[15] A Dynamic Gesture Recognition System for the Korean Sign Language (Kim, et al – 1996)

Current Mood: studious

Blogs I Commented On:


Summary:
This paper focuses on on-line gesture recognition for the domain of Korean Sign Language (KSL) using a glove input device. Twenty-five gestures were chosen out of the roughly 6,000 in KSL, involving 14 basic hand postures, and each hand posture is recognized by their system using a fuzzy min-max neural network. The authors begin by setting initial gesture positions, which they handle by calibrating the data glove data, subtracting the initial position from subsequent positions. After selecting the 25 gestures, they determined that these contained 10 basic direction types contaminated by noise from the glove. Since the measured deviation from the real value was within 4 inches, they split the x- and y-axes into 4 regions each (for a total of 8) for efficient filtering and computation. The region data at each time unit is stored in 5 cascading registers, and the register values change when the current data differs from the data in the previous time unit. A fuzzy min-max neural network (FMMN) is used to recognize each hand posture; it supposedly requires no pre-learning about the posture class and has on-line adaptability. A fuzzy set hyperbox, which here is a 10-dimensional box based on the two flex angles from each finger of the hand, is used in the study. This hyperbox is defined by a min point (V) and a max point (W), corresponds to a membership function, and is normalized between 0 and 1. Initial min-max values (V, W) of the network were created from empirical data of many individuals’ flex angles. A sensitivity parameter regulates the speed of input pattern separation from the hyperbox, where a higher parameter value means a crisper membership function. Thus, when the network receives an input posture, the output is the set of membership function values for the 14 posture classes, and the input is classified as the posture class with the highest membership value that also exceeds some threshold value. No classification occurs when it falls below the threshold. For the given words, the system achieves 85% recognition rates.
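Here's a small sketch of the hyperbox membership idea as I understand it: each posture class keeps min/max corners V and W, and an input's membership decays with how far it falls outside the box, controlled by a sensitivity parameter. The membership formula and numbers here are simplified assumptions, not the paper's exact equations.

```python
# Simplified sketch of fuzzy min-max hyperbox membership: each class keeps min (V)
# and max (W) corners in [0, 1]^d, and membership decays with the distance an input
# falls outside the box. Formula, threshold, and toy classes are assumptions.
import numpy as np

def hyperbox_membership(x, V, W, sensitivity=4.0):
    """Membership in [0, 1]; 1 inside the box, decaying as x leaves it."""
    below = np.clip(V - x, 0.0, None)          # how far under the min corner
    above = np.clip(x - W, 0.0, None)          # how far over the max corner
    violation = (below + above).mean()         # average per-dimension violation
    return max(0.0, 1.0 - sensitivity * violation)

def classify(x, boxes, threshold=0.6):
    """Return the posture class with the highest membership above a threshold."""
    scores = {name: hyperbox_membership(x, V, W) for name, (V, W) in boxes.items()}
    best = max(scores, key=scores.get)
    return (best if scores[best] >= threshold else None), scores

if __name__ == "__main__":
    # Two toy posture classes over 10 normalized flex angles (2 per finger).
    boxes = {
        "fist": (np.full(10, 0.7), np.full(10, 1.0)),
        "open": (np.full(10, 0.0), np.full(10, 0.3)),
    }
    sample = np.full(10, 0.85)                 # mostly-bent fingers
    print(classify(sample, boxes))
```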

Discussion:
This is the second haptics paper we’ve read that used neural networks for recognition. Unlike the first paper, which used a basic perceptron network, this paper used a more complex fuzzy min-max neural network. It seems to give reasonable recognition rates and claims to have on-line adaptability. It would have been nice for the authors to have given some insight into why they chose the FMMN over other types of neural networks, and also to provide details on how it was able to adapt on-line. I did like that they provided actual network parameter values for their FMMN.

[14] Simultaneous Gesture Segmentation and Recognition based on Forward Spotting Accumulative HMMs (Song & Kim – 2006)

Current Mood: studious

Blogs I Commented On:


Summary:
This is another gesture segmentation and recognition technique paper, this time using something called forward spotting accumulative HMMs. The domain was controlling the curtains and lighting of smart homes using upper body gestures. The first main idea presented is a sliding window technique, which computes the observation probabilities of a gesture or non-gesture using a number of continuing observations within a sliding window of a particular size k. From empirical testing, the authors chose k = 3. In this technique, the partial observation probability of a segment of a particular observation sequence is computed by induction. The second idea involves forward spotting, which uses the prior technique to compute the competitive difference observation probability from a continuous stream of gesture images. The basic idea is that every possible gesture class, as well as a non-gesture class, has an HMM. After a partial observation, the value of the gesture class HMM that gives the highest probability is compared with the value of the non-gesture class HMM, and whichever of the two gives the higher value is chosen. Accumulative HMMs, which accept all possible partial gesture segments for determining the gesture type, are additionally used in the paper to improve recognition performance. During testing, two spotting techniques were used on the eight gesture classes for their particular domain: manual and automatic threshold spotting. The latter performed better, with accuracy rates mostly in the 90s up to perfect.

Discussion:
It’s hard to gauge the quality of the gesture segmentation technique in this paper. The technique seems to offer a solution to a problem common to using generic HMMs in this domain, namely handling partial observations. Also, the use of their “junk” class, while nothing special, wouldn’t hurt. On the other hand, they tested their technique on such a simple domain, without comparison to other techniques in the process. It looks like the only way to judge the technique’s merits is to actually implement their gesture segmentation algorithm on a more complex domain. The jury is still out on this one.

[13] A Survey of POMDP Applications (Cassandra – 2004)

Current Mood: studious

Blogs I Commented On:


Summary:
The paper gives a brief introduction to a certain type of Markov decision process (MDP) model called the partially observable MDP (POMDP), along with a survey of applications that benefit from the use of POMDPs. A POMDP model consists of the following: a finite set of states S (all possible, unobservable states that the process can be in), actions A (all available control choices at a point in time), and observations Z (all possible observations that the process can emit), along with a state transition function tau (encoding the uncertainty in the process state evolution), an observation function o (relating the observations to the true process state), and an immediate reward function r (giving the immediate utility of an action in each process state). The goal is to derive a control policy that will yield the highest utility over a number of decision steps.
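Since the controller never observes the true state, everything runs on a belief distribution over S; here's a small numpy sketch of the belief update after taking an action and receiving an observation, using a toy two-state problem of my own invention.

```python
# Small sketch of the POMDP belief update: after taking action a and seeing
# observation z, the belief over hidden states is re-weighted and renormalized.
# The two-state "door open / door closed" numbers are a toy example of my own.
import numpy as np

def belief_update(belief, T_a, O_a, z):
    """b'(s') is proportional to O_a[s', z] * sum_s T_a[s, s'] * b(s).

    belief : (S,)   current distribution over states
    T_a    : (S, S) transition probabilities for the chosen action
    O_a    : (S, Z) observation probabilities for the chosen action
    z      : index of the received observation
    """
    predicted = belief @ T_a                 # push the belief through the dynamics
    updated = O_a[:, z] * predicted          # weight by how likely the observation is
    return updated / updated.sum()

if __name__ == "__main__":
    belief = np.array([0.5, 0.5])                        # states: [open, closed]
    T_listen = np.eye(2)                                 # "listen" doesn't move the state
    O_listen = np.array([[0.85, 0.15],                   # P(hear-open / hear-closed | open)
                         [0.15, 0.85]])                  # ... given closed
    belief = belief_update(belief, T_listen, O_listen, z=0)   # heard "open"
    print(np.round(belief, 3))                                 # belief shifts toward open
```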

The majority of the paper goes into the survey of applications, but none of them address haptics directly. The end of the paper then goes into two types of limitations from POMDPs: theoretical and real. For theoretical limitations, POMDPs do not easily handle problems that have certain characteristics, and the model is itself data intensive. For the real limitations, the first problem involves representing states as a set of attributes, which causes small concepts to have large state spaces since it requires enumerating every attribute value combination. The second problem is that the optimal policy for a general POMDP is intractable.

Discussion:
While it would have been preferable to see the paper focus specifically and in more detail on a particular application that used POMDPs in order to judge their merits, it appears to have made a decent case for the utility of POMDPs as a potentially useful technique for the types of things that can be done in haptics. In addition, I think this paper gave an okay summary of POMDPs, assuming that the reader already had prior knowledge of MDPs.

[12] Cyber Composer: Hand Gesture-Driven Intelligent Music Composition and Generation (Ip – 2005)

Current Mood: studious

Blogs I Commented On:


Summary:
The paper talks about Cyber Composer, a prototype system which takes an application-based approach in the field of haptics. This particular system allows both experienced and inexperienced users to express music at a high level without using musical instruments, relying solely on hand motions and gestures and some familiarity with music theory. Musical expressions are mapped to certain hand motions and gestures in the hope of being useful to experienced musicians while remaining intuitive for music laypersons. These expressions include rhythm, pitch, pitch-shifting, dynamics, volume, dual-instrument mode, and cadence. The system uses Cybergloves to input these musical expressions.

Discussion:
One interesting aspect of the paper is its novel application of haptics: striving to express music without traditional musical instruments, while trying to balance usefulness for experienced musicians with intuitiveness for inexperienced ones. One part of the paper that was severely lacking was the reasoning behind choosing the particular mapping for each musical expression. It would have been nice to see some user study or testing results to defend the choices they made for their system.

[11] A Similarity Measure for Motion Stream Segmentation and Recognition (Li & Prabhakaran – 2005)

Current Mood: studious

Blogs I Commented On:


Summary:
The paper states that it’s a motion segmentation paper, which basically seems like gesture segmentation over a wider scope. Directional and angular values of a motion are recorded in matrix format, where the columns measure the attributes and the rows record the time samples of that motion; similarity between motions can then be compared as long as they have the same number of columns (i.e., attributes), even with different numbers of rows (i.e., time samples). This similarity comparison can be done using singular value decomposition (SVD), which reveals the geometric structure of a matrix. If two motions are similar, their corresponding eigenvectors should be parallel and their corresponding eigenvalues proportional, so only the eigenvectors and eigenvalues of the two motions’ matrices are considered. Their similarity measure, which depends on the eigenvectors and eigenvalues, relies on some integer k, with 1 < k < n, where k determines the number of eigenvectors considered out of the n attributes in a motion matrix. For this paper, k = 6 from empirical testing. Given the significance of k, this non-metric similarity measure is called the k Weighted Angular Similarity (kWAS), which captures the angular similarities of the first k corresponding eigenvector pairs weighted by the corresponding eigenvalues. To recognize motion streams, the paper assumes a minimum gesture length l and a maximum length L. Their kWAS method is applied incrementally to segment the streams for recognition of motions captured with a Cyberglove and cameras.
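To make the measure concrete, here's a toy numpy sketch in the spirit of kWAS (my own simplified weighting, not the paper's exact formula): take the SVD of each motion matrix and compare the first k right singular vectors, weighted by the singular values.

```python
# Toy sketch in the spirit of kWAS: compare two motion matrices (rows = time samples,
# columns = attributes) via the angles between their first k right singular vectors,
# weighted by singular values. The exact weighting differs from the paper's formula.
import numpy as np

def svd_similarity(A, B, k=6):
    """Similarity in [0, 1] between motions A (Ta x n) and B (Tb x n), same n."""
    # Right singular vectors / singular values capture the geometric structure of
    # each motion independent of its number of rows (time samples).
    _, sA, VtA = np.linalg.svd(A, full_matrices=False)
    _, sB, VtB = np.linalg.svd(B, full_matrices=False)
    k = min(k, VtA.shape[0], VtB.shape[0])
    # |cos| of the angle between corresponding singular vector pairs (sign-invariant).
    cosines = np.abs(np.sum(VtA[:k] * VtB[:k], axis=1))
    # Weight each pair by the (normalized) combined singular values.
    weights = (sA[:k] / sA[:k].sum() + sB[:k] / sB[:k].sum()) / 2.0
    return float(np.sum(weights * cosines))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=(50, 8))                 # 50 time samples, 8 attributes
    similar = np.vstack([base, base[:20]])          # same motion, different duration
    different = rng.normal(size=(40, 8))
    print(round(svd_similarity(base, similar), 3),
          round(svd_similarity(base, different), 3))
```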

Discussion:
The paper claims accuracy rates in the high 90s, but as Brandon pointed out in class, those rates stemmed from artificially-produced samples. Since that was the case, there really weren’t any results for their motion segmentation algorithm, only a potential idea. I believe it has potential, and especially worth a look given the dearth of useful segmentation algorithms in this domain, but real results would be greatly appreciated in order to truly judge its merits.