Gesture Recognition Using an Acceleration Sensor and Its Application to Musical Performance Control (Sawada & Hashimoto – 1997)

02 April 2008

Current Mood:

Blogs I Commented On:


Summary:
The focus of this paper is gesture recognition using simple three-dimensional acceleration sensors with weights, where the domain is conducting. From these sensors, the accelerations in the x, y, and z directions are obtained. Motion features are extracted from three vectors, each of which is a combination of two of the directions. Kinetic parameters are then extracted from those motion features: intensity of motion, rotational direction, direction of the main motion, and distributions of the acceleration data over eight principal directions. For a series of acceleration data, a kind of membership function is used as a fuzzy representation of the closeness to each principal direction; it is defined as an isolated triangular function that decreases linearly in value from the principal direction to the neighboring principal directions. Each kinetic parameter is then calculated from the time-series vectors, and gestures are recognized from a total of 33 kinetic features.
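
As a concrete illustration, here is a minimal Python sketch (my own reconstruction, not the authors' code) of that isolated triangular membership function over eight principal directions spaced 45 degrees apart:

import math

def direction_memberships(ax, ay):
    # fuzzy membership of an (ax, ay) acceleration sample to each of the
    # 8 principal directions (0, 45, ..., 315 degrees)
    angle = math.degrees(math.atan2(ay, ax)) % 360.0
    memberships = []
    for k in range(8):
        principal = 45.0 * k
        diff = abs(angle - principal)
        diff = min(diff, 360.0 - diff)                   # angular distance to this direction
        memberships.append(max(0.0, 1.0 - diff / 45.0))  # 1 at the direction, 0 at its neighbors
    return memberships

# e.g. a sample accelerating up and to the right splits between two directions:
print(direction_memberships(1.0, 0.5))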

To recognize the gestures, their start is first identified from the magnitude of the acceleration. Before system use, an individual's motion feature patterns are constructed during training, and a gesture's start is detected when the acceleration grows larger than the observed fluctuation. During training, samples of each motion are input M times, and standard pattern data for each gesture are constructed and stored. During recognition mode, weighted errors for each feature parameter of an unknown motion's acceleration data series are computed, and the dissimilarity to each stored pattern is determined. For musical tempo recognition, the rotational component is ignored, and the rhythm is extracted from the pattern of changes in the magnitude and phase of the vector. The instant the conductor's arm reaches the lowest point of the motion space is recognized as the rhythm point. By detecting these maxima periodically, the rhythm points can be identified in real time. To evaluate the system, the authors tested 10 kinds of gestures used for performance control. Two users repeated the gestures 10 times each for standard pattern construction. Their system recognized the gestures perfectly for the first user, whose data it was trained on, but not completely for the second user; the authors nonetheless concluded that it still does better than vision-based and other approaches.

Discussion:
Before I discuss the technical content of this paper, I just want to say that the paper's formatting sucked. The related work section was slapped into the middle of the paper out of left field, and the figures and tables were sometimes not easy to locate quickly. Okay, back on topic. My thoughts:

  • I’m glad that this paper did not spend two pages talking about music theory.

  • The shape of the conducting gestures in Aaron’s user study looked pretty complex. It seemed like the authors tackled recognition of the conducting gestures by simplifying them into a collection of directional gestures. I’m no expert on conducting gestures, but I wonder if the authors’ approximation is sufficient to separate all the gestures in Aaron’s domain. I will say yes for now.

  • I know people in our class will criticize this paper for its lame evaluation section. This seems so common in the GR papers we’ve been reading that I just gave up and accepted it as being the norm for this field.

  • I love reading translated research papers written by Japanese authors. They’re the funny.

Activity Recognition using Visual Tracking and RFID (Krahnstoever, et al – 2005)

01 April 2008

Current Mood:

Blogs I Commented On:


Summary:
This paper discusses how RFID technology can augment a traditional vision system with very specific object information to significantly extend its capabilities. The system consists of two modules: visual human motion tracking and RFID tag tracking. For the former, human motion tracking estimates a person’s head and hand locations in a 3D world coordinate reference frame using a camera system. The likelihood for a frame is estimated based on the summation of the image over the bounding box of either the head or hand in a specific view. The tracker continuously estimates proposal body part locations from the image data, which are used as initialization and recovery priors. The tracker also follows a sequential Monte Carlo filtering approach and performs partitioned sampling to reduce the number of particles needed for tracking. For the latter, the RFID tracking unit detects the presence, movements, and orientation of RFID tags in 3D space. An algorithm for articulated upper body tracking is given in the paper. Combining both trackers gives an articulated motion tracker, which outputs a time series of mean estimates of a subject’s head and hand locations, in addition to visibility flags that express whether the respective body parts are currently visible in the optical field of view. By observing subsequent articulated movements, the authors claim that the tracker can estimate what the person is doing. These interactions were encoded as a set of scenarios using rules in an agent-based architecture. To test their system, a prototype was created consisting of a shelf-type rack made to hold objects of varying sizes and shapes. When a person interacts with RFID-equipped objects, the system was able to detect which item the user was interacting with (difficult for a vision-only system) and the type of interaction the user was performing on the object (difficult for an RFID-only system).

Discussion:
I was familiar with RFID tags from other applications, but I didn't know or think that they could be used for haptics. That's why I wasn't too sure at first about the relevance of this paper to our class. Josh P. fortunately enlightened our class that RFID tags can nicely supplement the existing devices we have in the labs. While the potential is definitely there for the type of things our class is doing, the paper itself was pretty poor in presenting that potential. Most of that can be attributed to a weak example application which, of course, happened to have no numerical results to speak of. But that's been the norm in the papers we've been reading.

Gestures without Libraries, Toolkits or Training: A $1 Recognizer for User Interface Prototypes (Wobbrock, et al – 2007)

31 March 2008

Current Mood:

Blogs I Commented On:


Summary:
The authors created an easy, cheap, and highly portable gesture recognizer called the $1 Gesture Recognizer. The algorithm requires only about one hundred lines of code and uses only basic geometry and trigonometry. The algorithm’s contributions include being easy to implement for novice user interface prototypers, serving as a measuring stick against more advanced algorithms, and giving insight into which gestures are “best” for people and computer systems. Challenges to gesture recognizers in general include having resilience to sampling variations; supporting optional and configurable rotation, scale, and position invariance; requiring no advanced math techniques; being easily written in a few lines of code; being teachable with one example; returning an N-best list with sensible scores independent of the number of input points; and providing recognition rates competitive with more advanced algorithms.

$1 is able to cope with those challenges in its four-step algorithm: 1) resample the gesture to N points, where 32 <= N <= 256; 2) rotate once based on the indicative angle, which is the angle formed between the gesture’s centroid and its starting point; 3) scale non-uniformly and translate to the centroid, which is set as the origin; and 4) recognize by finding the optimal angle for the best score. Analysis of rotation invariance shows that there’s no guarantee that candidate points and template points will optimally align after rotating the indicative angle to 0 degrees, so $1 uses a Golden Section Search (GSS), which narrows the search range using the Golden Ratio, to find the optimal angle.
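
Here is a compressed Python sketch of step 4 as I understand it (my own code, not the published $1 pseudocode): a Golden Section Search over rotation angle that minimizes the average point-to-point distance between a candidate and a template, assuming both have already gone through steps 1–3:

import math

PHI = 0.5 * (math.sqrt(5.0) - 1.0)   # golden ratio used to shrink the search range

def path_distance(a, b):
    # average distance between corresponding points of two resampled paths
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

def rotate(points, theta):
    c, s = math.cos(theta), math.sin(theta)
    return [(x * c - y * s, x * s + y * c) for x, y in points]

def distance_at_best_angle(candidate, template,
                           lo=-math.radians(45), hi=math.radians(45),
                           tol=math.radians(2)):
    # Golden Section Search for the rotation of the candidate that best matches
    # the template; both point lists are assumed centered on the origin
    x1 = PHI * lo + (1 - PHI) * hi
    x2 = (1 - PHI) * lo + PHI * hi
    f1 = path_distance(rotate(candidate, x1), template)
    f2 = path_distance(rotate(candidate, x2), template)
    while abs(hi - lo) > tol:
        if f1 < f2:
            hi, x2, f2 = x2, x1, f1
            x1 = PHI * lo + (1 - PHI) * hi
            f1 = path_distance(rotate(candidate, x1), template)
        else:
            lo, x1, f1 = x1, x2, f2
            x2 = (1 - PHI) * lo + PHI * hi
            f2 = path_distance(rotate(candidate, x2), template)
    return min(f1, f2)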

Limitations of $1 include being unable to distinguish gestures whose identities depend on specific orientations, aspect ratios, or locations; distorting horizontal and vertical lines through its non-uniform scaling; and being unable to differentiate gestures by speed, since it doesn’t use time. To handle variations with $1, new templates can be defined that capture the variation under a single name. A study was done to compare $1 with a modified Rubine classifier and a Dynamic Time Warping (DTW) template matcher. The study showed that $1 and DTW were more accurate than Rubine, and that $1 and Rubine executed faster than DTW.

Discussion:
I guess I should change the discussion a bit because we're now looking at this paper from a GR perspective instead of an SR perspective. Our SR class was quite critical of this paper at the time, given that there were already existing SR algorithms that were more capable. Maybe $1 isn't as bad for GR, given that the simplicity of the algorithm would help bring GR-based applications into the mainstream, and also because gestures for glove- and wand-based devices probably aren't as complicated to handle as those for pen-based devices. The limitations we noted in the SR class haven't gone away simply because we shifted to GR, but I don't think they're as disadvantageous the second time around. I guess we won't know for sure unless we start experimenting with various applications that use $1.

Enabling fast and effortless customization in accelerometer based gesture interaction (Mantyjarvi, et al – 2004)

Current Mood:

Blogs I Commented On:


Summary:
The purpose of this paper is to create a procedure that allows users to create customized accelerometer-based gesture controls using HMMs. The authors primarily refer to gestures as user hand movements collected with a set of sensors in a handheld device and modeled by machine learning methods. They use HMMs to recognize gestures since HMMs can model time series with spatial and temporal variability. Their system first involves preprocessing, where gesture data is normalized to equal length and amplitude. A vector quantizer is then used to map three-dimensional vectors into a one-dimensional sequence of codebook indices, where the codebook was generated from collected gesture vectors using a k-means algorithm. This information is then sent to an HMM with an ergodic topology. A codebook size of 8 and a model state size of 5 were chosen. After vector quantization, the gesture is either used to train the HMM or to evaluate the HMM’s recognition capability. Finally, the authors added noise to test whether copied gesture data with added noise can reduce the number of training repetitions required from the user when using discrete HMMs. The two types of noise distributions used were uniform and Gaussian, and various signal-to-noise ratios (SNR) were experimented with to determine which ratio value provided the best results. The system was evaluated on eight popular gestures applicable to a DVD playback system, and the experiments consisted of finding an optimal threshold value to converge the HMM, examining accuracy rates for different numbers of training repetitions, finding an optimal SNR value, and examining the effects of using noise-distorted signal duplicates in training. With six training repetitions, accuracy was over 95%; the best accuracies for Gaussian and uniformly distributed noise, at SNR = 3 and 5, were 97.2% and 96.3%, respectively.
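
As a rough illustration of the noise-duplication trick, here is a minimal Python sketch under my own assumptions (I'm treating SNR as a linear ratio, and the add_noise helper and values are mine, not the authors'):

import numpy as np

def add_noise(gesture, snr, rng=None):
    # gesture: (T, 3) array of accelerometer samples; snr: linear signal-to-noise ratio
    rng = rng if rng is not None else np.random.default_rng()
    signal_power = np.mean(gesture ** 2)
    noise = rng.normal(0.0, np.sqrt(signal_power / snr), size=gesture.shape)
    return gesture + noise

# e.g. turn 2 real repetitions into 6 training samples
reps = [np.random.randn(40, 3) for _ in range(2)]              # stand-in data
training_set = reps + [add_noise(g, snr=5) for g in reps for _ in range(2)]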

Discussion:
Some thoughts on the paper:
  • It felt like the authors wanted to create a system which allows users to create customized “macros” for hand motion gestures. It seems like an interesting idea, but my main concern with their system is its robustness with respect to other users who didn’t train these “macros.” It’s a novel idea to incorporate noise into existing training gesture data in order to generalize the system more while keeping training repetitions low, but the paper does not tell us how it performs across multiple users. The results may have given really high accuracy rates, but that’s a bit misleading since I didn’t see a separate test set used from, say, another user. I do think it’s a fine system for an application meant for that specific user, but it doesn’t seem robust for multiple users. If the latter is desired, I have no idea if this system will perform well.
  • They tested their system on 2D data. That seems like a waste of the z-axis data, since the same thing could have been done by omitting that third dimension. But then, I can’t really imagine truly useful gestures that would take advantage of z-axis data.
  • I think it would be better to have a system where users sketch their gestures in 2D on-screen, and then have the system try to recognize the accelerometer data using existing sketch recognition techniques.

SPIDAR G&G: A Two-Handed Haptic Interface for Bimanual VR Interaction (Murayama, et al – 2004)

26 March 2008

Current Mood: studious

Blogs I Commented On:


Summary:
This is a haptics hardware paper about a haptic interface for two-handed manipulation of objects in a virtual environment. The system in this paper, SPIDAR-G&G, consists of a pair of string-based 6DOF haptic devices called SPIDAR-G. These haptic devices allow translational and rotational manipulation of virtual objects, and also provide force and torque feedback to the user. They do so by first getting the position and orientation of each grip, which represents a hand position, from the measured string lengths, then calculating collision detection between virtual objects and the user’s hands, and finally displaying the appropriate force feedback by controlling the tension of each string. Three users familiar with VR interfaces participated in evaluating the system by timing the completion of a 3D pointing task. The four tasks crossed one- versus two-handed manipulation with and without haptic feedback. As expected, two-handed manipulation with haptic feedback performed the best.

Discussion:
This paper came out a few years ago while the system was still in its initial phase, but it’s an innovative system for providing physical feedback for two-handed manipulation of virtual objects. At the time of the paper’s publication, it still had a lot of work left to be done, but I can see how it is still open to lots of improvement, unlike the 3D Tractus device. It’s a good start for the VR domain too, though it seems quite bulky and also limited by its strings. Achieving the same system without strings would be another feat in itself, though.

Gesture Recognition with a Wii Controller (Scholmer, et al – 2008)

Current Mood: studious

Blogs I Commented On:


Summary:
This paper claims to be a gesture recognition system for the Wiimote, composed of a filter, a k-means quantizer, an HMM, and a Bayes classifier. The system was tested on a set of trivial gestures, and their results did not achieve perfect recognition. That’s all. I’m serious.

Discussion:
The primary reason I didn’t like this paper is that it was incomplete yet still accepted. If their chosen input device weren’t new, I don’t think it would have been accepted. It really is lacking a lot of data. I do like this paper in one sense, though, because our class can do better. I’m sorry that I chose this paper, everyone.

Taiwan sign language (TSL) recognition based on 3D data and neural networks (Lee & Tsai – 2007)

Current Mood: studious

Blogs I Commented On:


Summary:
This is a hand gesture recognition paper which uses a neural network-based approach for the domain of Taiwanese Sign Language (TSL). Data for 20 right-hand gestures was retrieved using a VICON system and then fed into their neural network, where 15 geometric distances were employed as the feature representation of the different gestures. Their backpropagation neural network, implemented in MATLAB, had 15 input units, 20 output units, and two hidden layers in total. Recognition rates for varying numbers of neurons were roughly 90%.
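
For reference, a network of roughly that shape is only a few lines with scikit-learn; this is a minimal sketch with placeholder data and hidden-layer sizes of my own choosing, not the authors' MATLAB setup:

import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.random.rand(200, 15)           # stand-in: 15 geometric distance features
y = np.random.randint(0, 20, 200)     # stand-in: 20 TSL gesture labels

# two hidden layers; the layer widths here are my guess, not the paper's
net = MLPClassifier(hidden_layer_sizes=(30, 30), max_iter=2000)
net.fit(X, y)
print(net.predict(X[:5]))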

Discussion:
This is the second GR paper which used NNs, and I think that this paper was superior simply because their NN was more advanced. It was a vanilla NN implementation though, and it seemed like they just got some data and ran it through some MATLAB NN library. It’s a very straightforward paper that would have been more interesting had their gestures been more representative or complex. A vast majority of the gestures in this paper derive from words which are hardly used in the language. It appears that the authors preferred to work with gestures that were easier to classify as opposed to gestures that were actually commonly used.

Hand gesture modeling and recognition involving changing shapes and trajectories, using a Predictive EigenTracker (Patwardhan & Roy – 2006)

Current Mood: studious

Blogs I Commented On:


Summary:
This is a hand gesture recognition paper that takes an eigenspace-based modeling approach, which takes into account both trajectory and shape information. Recognition involves eigenspace projection and probability computation using the Mahalanobis distance. The paper builds on a previous system called EigenTracker, which can track moving objects that undergo appearance changes. The authors augment that system into what they call the Predictive EigenTracker in three ways: a particle filtering-based predictive framework, on-the-fly tracking of unknown views, and a combination of skin color and motion cues for hand tracking. Their gesture model framework accounts for both the shape and the temporal trajectory of the moving hand by constructing an eigenspace of suitably scaled shapes from a large number of training instances corresponding to the same shape. The framework also incorporates selecting a vocabulary to maximize recognition accuracy by computing the Mahalanobis distance of a query gesture from all gestures for some k shape-trajectory coefficient pairs. This gives a probability of the given gesture being in some set of gestures. To test their system, the authors use a representative set of eight gestures for controlling a software audio player.
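
A minimal Python sketch of the Mahalanobis-distance classification step as I picture it (my own code and naming, not the authors'):

import numpy as np

def mahalanobis(x, mean, cov):
    diff = x - mean
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

def classify(query_coeffs, class_stats):
    # class_stats: {gesture_name: (mean_vector, covariance_matrix)} estimated from
    # the eigenspace projections of that gesture's training instances
    distances = {name: mahalanobis(query_coeffs, m, c)
                 for name, (m, c) in class_stats.items()}
    best = min(distances, key=distances.get)
    return best, distances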

Discussion:
The paper was not an easy read for me, but from what I did read, it was an interesting system involving an approach we haven’t been exposed to yet. I do have some doubts, because I don’t know how this system is any different from using typical vision-based recognition specifically for hands. Concerning their experiment, I can’t gauge what the results are since they aren’t straightforward. Furthermore, the environment seems too controlled due to a white background and black, long-sleeved shirt in use for hand tracking.

Wiizards: 3D Gesture Recognition for Game Play Input (Kratz, et al – 2007)

Current Mood: studious

Blogs I Commented On:


Summary:
This is another applications paper that uses the Wiimote. Their application is called Wiizards, a multiplayer game which uses HMMs for gesture recognition. The game itself is a two-player zero-sum game, where each player tries to deal damage to the other while limiting damage to one’s self. Gestures dictate spell casting, and players are more successful in the game if they use a variety of them. The three main components are the Wiimote, the gesture recognizer, and the game itself. Observations for the model are accelerometer data from the Wiimote, normalized using calibration information. Each gesture is a collection of observation vectors, has a separate model associated with it for recognition, and is trained with the Baum-Welch algorithm. The probability of a gesture given a model, over the distribution of the observations and hidden states, is calculated with the Viterbi algorithm. To train the models, data was collected from 7 users. Each user was shown the gestures and performed them 40 times, and an HMM was created from the user data.
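
A minimal sketch of that per-gesture HMM setup, assuming the hmmlearn library (this is not the authors' implementation; the 5-state default below is my own placeholder):

import numpy as np
from hmmlearn import hmm

def train_models(samples_per_gesture, n_states=5):
    # samples_per_gesture: {name: list of (T, 3) accelerometer arrays}
    models = {}
    for name, samples in samples_per_gesture.items():
        X = np.vstack(samples)
        lengths = [len(s) for s in samples]
        model = hmm.GaussianHMM(n_components=n_states, n_iter=50)
        model.fit(X, lengths)                 # Baum-Welch (EM) training
        models[name] = model
    return models

def recognize(models, observation):
    # pick the gesture whose model gives the highest Viterbi log-probability
    scores = {name: model.decode(observation)[0] for name, model in models.items()}
    return max(scores, key=scores.get)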

Discussion:
There are relatively few gesture recognition papers that cater to the Wiimote, since it’s a new device, and given the number of Wiimote papers on the topic, this would be considered a pretty good paper. It’s interesting that they built a nice application to demonstrate their HMM-based recognizer, but the system does have some kinks, since the paper states about 50% accuracy for users who hadn’t used the system before.

TIKL: Development of a Wearable Vibrotactile Feedback Suit for Improved Human Motor Learning (Lieberman & Breazeal – 2007)

Current Mood: studious

Blogs I Commented On:


Summary:
This is a hardware application paper with the goal of creating a robotic wearable suit that analyzes target movement and provides real-time corrective vibrotactile feedback to a student’s body over multiple joints, in order to quickly develop new motor skills. Their system consists of optical tracking (for motion capture using markers on the wearable device), tactile actuators (for proportional feedback at the joints), feedback software (for determining the vibrotactile signals), and customized hardware for output control. Their system was tested by having users copy a series of images on a video screen while wearing the suit. Their user study generally gave positive feedback on the system.

Discussion:
I didn’t know how to comment on this paper directly since its applications didn’t really relate to the core aspect of the course. Judged independently from the purpose of the class, I felt it was a wonderful system that also had a sufficient user study applied to it. I could see some merits related to our class if it concentrated more on the hand.

A Spatio-temporal Extension to Isomap Nonlinear Dimension Reduction (Jenkins & Matari – 2004)

Current Mood: studious

Blogs I Commented On:


Summary:
The focus of this paper appears to be efficiently uncovering the structure of motion using unsupervised learning for dimension reduction. The authors use a spatio-temporal Isomap approach for both continuous and segmented input data with sequential temporal ordering, where continuous ST-Isomap is suited for uncovering spatio-temporal manifolds of data, and segmented ST-Isomap is for uncovering spatio-temporal clusters in segmented data. Their technique tries to address the temporal relationships of proximal disambiguation and distal correspondence in order to uncover the spatio-temporal structure. Their example of the two relationships is two low waving motions in different directions, and a low and a high waving motion in the same direction: the former pair falls under proximal disambiguation, and the latter pair falls under distal correspondence. Their ST-Isomap approach extends Isomap by adding temporal windowing to provide a temporal history for each data point, hard spatio-temporal correspondences between proximal data pairs, and reduced distances between data pairs with spatio-temporal relationships to accentuate their similarity.
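
To make the temporal windowing idea concrete, here is a minimal Python sketch of my own (the window size is a placeholder):

import numpy as np

def temporal_window(sequence, window=3):
    # sequence: (T, D) array -> (T - window + 1, D * window) array, where each
    # row is a data point concatenated with its next few neighbors in time
    T, D = sequence.shape
    return np.hstack([sequence[i:T - window + 1 + i] for i in range(window)])

motion = np.random.rand(100, 6)        # stand-in joint-angle stream
print(temporal_window(motion).shape)   # (98, 18)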

Discussion:
I honestly had no idea what this paper was talking about most of the time. Most of that came from the lingering feeling that I couldn’t find an aspect of this paper relevant to the topics we are doing in the class. But I think it’s safe to say that this is a nice paper to refer to if one wishes to use unsupervised learning on hand motion.

Articulated Hand Tracking by PCA-ICA Approach (Kato, et al – 2006)

Current Mood: studious

Blogs I Commented On:


Summary:
This paper focuses on hand tracking using a PCA-ICA approach. To do so, the authors first model the hand with OpenGL as spheres, columns, and a rectangular parallelepiped. Hand motion data is captured with a data glove by recording all combinations of open and closed fingers, so that angles for 20 joints are measured. These measurements were divided into 100 instances to obtain a 2000-dimensional hand motion row vector. PCA is then used to find a smaller set of variables with less redundancy, measured by correlations between data elements using Singular Value Decomposition. In their approach, the authors first use PCA to reduce dimensionality, and then perform ICA on the low-dimensional PCA subspace to extract feature vectors. For ICA, the authors use a neural learning algorithm that maximizes the joint entropy using stochastic gradient ascent. The ICA-based model thus can represent a hand pose by five independent parameters, each corresponding to a particular finger at a particular time instant. In the PCA-ICA comparison, PCA basis vectors represent global hand motion, including mostly unfeasible hand motions, whereas ICA basis vectors represent particular finger motions. Particle filtering is then used for tracking the hand by first generating samples whose hand pose is determined by the five ICA-model parameters (one per finger), and then using an observation model that employs edge and silhouette information to evaluate each hypothesis.
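
As a rough illustration of the overall PCA-then-ICA recipe, here is a minimal sketch assuming scikit-learn, with FastICA standing in for the neural ICA algorithm the paper actually uses and with placeholder sizes:

import numpy as np
from sklearn.decomposition import PCA, FastICA

motion = np.random.rand(500, 2000)     # stand-in: 2000-dimensional hand motion vectors

pca = PCA(n_components=5)
low_dim = pca.fit_transform(motion)    # project onto the principal subspace

ica = FastICA(n_components=5)
sources = ica.fit_transform(low_dim)   # independent (per-finger-like) components
print(sources.shape)                   # (500, 5)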

Discussion:
If I pretend I understood what the paper was talking about, then I will say that I found it intriguing that they combined the strengths of PCA and ICA to come up with what appears to be a viable hand tracking system, in that PCA’s limitations were overcome by ICA to model the hand for tracking purposes. It’s kind of hard to judge the merits of this paper though based on the scant results, although the images provided at the end of the paper do show tracking in less-than-ideal environments. I wish it had actual working results though (the online video link is dead).

The 3D Tractus: A Three-Dimensional Drawing Board (Lapides, et al – 2006)

Current Mood: studious

Blogs I Commented On:


Summary:
This paper discusses the 3D Tractus, a drawing board-like device which can be raised and lowered to provide sketches in 3D. The device employs a counterweight system for easy vertical motion, four vertical aluminum bars for support, a tablet for the actual sketching, and a string potentiometer as a height sensor. For the software, a pen-based device handles input, and users have three visual software components to work with: a 3D sketch overview window, a drawing pad window, and a menu bar to access less common features. Dynamic line width is used to provide depth cues, and a traditional image editor-like eraser is used for deleting entire strokes.

Discussion:
The device discussed in this paper is an interesting concept for emulating 3D sketching using a standard tablet. There are obviously some limitations in providing true 3D sketching, as it can only provide depth cues through a 2D window as opposed to a VR-based solution to visualize those same depths. I would imagine the usability would feel a bit distracting, since it relies on the other arm to move the drawing surface in order to perform 3D sketching. Some improvements I would suggest are a button to automate vertical movement and an inclined surface to allow easier drawing. I can imagine some people who would enjoy using this system over traditional devices.

A Hidden Markov Model Based Sensor Fusion Approach for Recognizing Continuous Human Grasping Sequences (Bernardin, et al – 2005)

Current Mood: studious

Blogs I Commented On:


Summary:
This paper discusses a system based on HMMs for recognizing hand grasps. Classification follows the grasp types from Kamakura’s grasp taxonomy, separating grasps into 14 different classes by purpose, hand shape, and contact points with grasped objects. Each HMM is assigned to a different grasp type, and recognition is performed using the Viterbi algorithm. The focus of their system differs from existing ones in that it distinguishes between the purposes of grasps, as opposed to the object shape or number of fingers. Their glove-based device is equipped with flexible capacitive sensors to measure sensor readings for grasping. Noise and unwanted motion are filtered out by a garbage model with ergodic topology, and a ‘task’ grammar is used to reduce the search space. Their only assumption is that a grasp motion is followed by a release motion. Their system is able to achieve 90% accuracy for multiple users.

Discussion:
The recognition system discussed in this paper is a lot different from the other types of systems discussed in prior papers, because none of those papers tackled the problem of recognizing grasping. I liked the paper because it was different and covers an area overlooked in the hand gesture recognition domain. The use of grasp as a feature is very intriguing, and I believe that incorporating it into a recognition system would make such a system more powerful. Reminds me of the hand tension paper, now that I think about it. It does seem like extracting grasping data is a non-trivial affair.

Temporal Classification: Extending the Classification Paradigm to Multivariate Time Series (Kadous – 2002)

Current Mood: studious

Blogs I Commented On:


Summary:
The core idea behind this thesis work is taking advantage of multivariate time series in order to aid hand gesture recognition accuracy. In particular, the author focuses on the sub-events that a human might detect as part of a sign within some sign language, Australian Sign Language (Auslan) for this paper. The author goes on to say that metafeatures can parameterize these sub-events by capturing their properties, such as temporal characteristics, within a parameter space (for example, a 2-D space of time and height) for feature construction. The temporal classification system, which I’m guessing is called TClass, uses synthetic events found within this space for feature construction, and applying several metafeatures to the training instances constructs synthetic features. This is done so that TClass can mix temporal and non-temporal features, something not found in other temporal classification systems, as claimed by the author. A motivation is to produce a temporal classifier which produces comprehensible yet accurate descriptions. The system was implemented on Auslan, a language where signs consist of a mixture of handshapes, location, orientation, movement, and expression. Data was collected with the Nintendo Powerglove and the Flock. The first input device was very noisy and far inferior to the second. Several machine learning techniques were tested in conjunction and in comparison with TClass. Some observations concerning the Flock data itself were that TClass didn’t perform well with the HMM, that smoothing the data didn’t improve results, and that TClass can handle tons of data. Accuracy for the Flock data is stated to be 98% with voting, which sounds like ensemble averaging.

Discussion:
For me, some of the results were a bit confusing, making it hard to give a fair critique of the performance of the author’s system. This would warrant reading the rest of the thesis, but in the areas which were clear, I thought it was a pretty good approach. The author also did a nice job collecting tons of data to build his system based on the more accurate Flock device, despite it not having multiple users like the Nintendo data. It seems like a sound approach with nice accuracy results, but comparisons to other temporal classifiers showing improvement would have made it better. I don’t think I saw them in the sections we were supposed to read.

Using Ultrasonic Hand Tracking to Augment Motion Analysis Based Recognition of Manipulative Gestures (Ogris, et al – 2005)

Current Mood: studious

Blogs I Commented On:


Summary:
This gesture recognition paper focuses on the hardware side, specifically ultrasonics for improving recognition accuracy. The paper first discusses three issues inherent in all ultrasonic positioning systems: reflections (false signals due to reflective materials in the environment), occlusions (lost signals due to a lack of line of sight between communicating devices), and temporal resolution (low transmission counts due to distance between communicating devices). To perform tracking, the authors smooth the sonic data in two steps: on the raw signals and on the resulting coordinates. Classification is then done using C4.5 and k-Nearest Neighbor (k-NN) for motion sensor analysis. Each manipulative gesture in their experiment corresponds to an individually trained HMM for model-based classification, while a sliding window approach is used for frame-based classification. After classification is performed on all frames of a gesture, a majority decision is applied to the results, yielding a filtered decision for that particular gesture. Finally, a complex fusion method is used in which separate classifications are performed on the ultrasonic and motion signals. Their experiment was on manipulative gestures for a bicycle repair task, which performed best with their fusion method over k-NN, HMM, and C4.5.
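
A minimal Python sketch, of my own making, of the frame-based classification plus majority-decision step described above:

from collections import Counter

def sliding_windows(frames, width, step=1):
    return [frames[i:i + width] for i in range(0, len(frames) - width + 1, step)]

def classify_gesture(frames, frame_classifier, width=10):
    # classify each window, then take a majority decision over all windows
    votes = [frame_classifier(w) for w in sliding_windows(frames, width)]
    return Counter(votes).most_common(1)[0][0]

# frame_classifier would be, e.g., the predict function of a trained k-NN or C4.5 model.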

Discussion:
Ultrasonics as a sensor for the hand gesture recognition domain appears to be potentially viable judging from this paper. Performance was mixed across the traditional machine learning methods they experimented with, but the fusion approach gave pretty good accuracy. I had a hard time understanding exactly how their fusion approach works, though; still, from a hardware perspective, supplementing our available sensors with ultrasonics wouldn’t hurt.

American Sign Language Recognition in Game Development for Deaf Children (Brashear, et al – 2006)

Current Mood: studious

Blogs I Commented On:


Summary:
This is an application paper in the form of a computer game, called CopyCat, that utilizes gesture recognition technology to develop American Sign Language (ASL) skills in children. The paper’s goal was to augment ASL learning in a fun manner for a children’s curriculum. The game focuses on correctly practicing repetitions of ASL phrases through the use of a computer. Due to the lack of a prior recognition engine for this area, the authors use a Wizard of Oz study, emulating the missing functionality of the system for later implementation. For their data set, the authors selected phrases with three and four signs, and their recognition engine is currently limited to a subset of ASL single- and double-handed signs. To segment the samples, users were asked to use a mouse to indicate the start and end of their gestures, allowing the system to perform recognition on the pertinent phrases. Data consists of video of the user signing along with wireless accelerometers mounted on pink-colored gloves for easy processing through a computer vision algorithm. Image pixel data is converted to HSV space for image segmentation purposes. For accelerometer processing, each data packet consists of four values: a sequence number and each of the three spatial dimensions. These packets are first synched to the video feed, then smoothed to account for the variable number of packets associated with each frame. Finally, the feature vectors themselves are a combination of the vision data and the accelerometer data, where the accelerometer data consists of the spatial dimensions for both hands, and the vision data consists of various hand characteristics.

Discussion:
This is another applications paper for hand gesture recognition, but what I liked about this paper was that it was both different from the typical applications papers in the domain our class is focusing on and potentially useful. I say potentially because, in its current form, it still has some work to do. The results given in the paper are not very insightful about how well it really performs, and I’m still critical of the toolkit used due to its possible limitations (mentioned in a prior blog post), but I don’t see anything that would stop it from being a useful final product for their target audience of children.

A Method for Recognizing a Sequence of Sign Language Words Represented in a Japanese Sign Language Sentence (Sagawa & Takeuchi – 2000)

Current Mood: studious

Blogs I Commented On:


Summary:
This is one of the older gesture segmentation and classification papers, this time for the domain of Japanese Sign Language (JSL). Signed words, consisting solely of hand gestures, are formed by combining gesture primitives such as hand shape, palm direction, linear motion, and circular motion. During recognition, gesture primitives are identified from the input gesture, and then the signed word is recognized from the temporal and spatial relationships between the gesture primitives. To segment, they use a hand velocity and a hand movement parameter. To correct differences between the two parameters, the hand movement segment border is taken to be the one closest to the hand velocity border. Several more parameters are used to determine whether the gesture was one-handed or two-handed, and gestures are classified into four types involving a combination of one- or two-handed and left- or right-dominant. To distinguish between word and transition segments, the authors extracted various features and discovered that the best distinction was the minimum acceleration divided by the maximum velocity: if the parameter is small, the segment is a word; otherwise it’s a transition. Evaluating their system on 100 JSL sentences, they achieved 86.6% accuracy for word recognition and 58% accuracy for sentence recognition.
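
A minimal sketch, under my own assumptions, of that word/transition test (the threshold value below is a placeholder, not from the paper):

import numpy as np

def segment_label(velocity, acceleration, threshold=0.2):
    # velocity, acceleration: 1-D arrays of hand speed/acceleration magnitudes
    # over one candidate segment; the threshold is a placeholder value
    param = np.min(acceleration) / np.max(velocity)
    return "word" if param < threshold else "transition"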

Discussion:
There were several things I liked about this paper. It was an easy read, had sound methods, and seemed logical in how it proceeded with creating the system. Unfortunately, the system is still a work in progress because of the very bad accuracy in the end. It’s been almost a decade since this paper came out, so it was still in unfamiliar territory at the time. I wish there were a follow-up paper that improved the accuracy rates.

Georgia Tech Gesture Toolkit: Supporting Experiments in Gesture Recognition (Westeyn, et al – 2003)

Current Mood: studious

Blogs I Commented On:


Summary:
This paper describes a toolkit called the Georgia Tech Gesture Toolkit for developing gesture-based recognition systems. The toolkit is first prepared by modeling each gesture as a separate HMM, specifying a rule-based grammar for the possible sequences of gestures, and collecting and annotating data in numerical vector form, called feature vectors, over which the toolkit operates. The toolkit is trained and validated using cross-validation and leave-one-out validation. The rest of the paper discusses various applications which use this toolkit.

Discussion:
The paper discusses what appears to be a viable toolkit for general gesture recognition. It’s hard to judge how versatile this toolkit really is given the lame applications discussed later on in the paper, but it’s definitely worth taking a look. My primary gripe is the requirement of a grammar to specify a gesture. It’s not too bad for simple gestures, but for more practical ones, I’m guessing the grammar would have to be huge to handle all possible cases.

Computer Vision-Based Gesture Recognition for an Augmented Reality Interface (Storring, et al – 2004)

Current Mood: studious

Blogs I Commented On:


Summary:
The gesture recognition system described in this paper is geared to supplement an augmented reality system, primarily for round-table meetings. The authors focus on six gestures of various outstretched fingers for their interface, requiring that users adapt to the hard limitation that these gestures be performed on a plane. Segmentation is done using a color pixel-based approach by segmenting blobs of similar color in the image. This approach uses normalized RGB, a transformation of RGB into a color space that separates intensity from chromaticity, to achieve invariance to lighting. In addition, skin blobs are constrained to a minimum and maximum number of pixels. After segmenting the hand pixels from the image, the hand and fingers are approximated as a circle and a number of rectangles for gesture recognition. The number of rectangles represents the number of outstretched fingers; finding them would involve doing a polar transformation around the center of the hand and counting the number of fingers present at each radius. To speed up the algorithm, they instead sample along concentric circles. Final classification is done by finding the number of fingers which is present for the most concentric circles. Recognized gestures are further filtered by a temporal filter, which means a gesture must be held for a number of frames.
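
Here is my own rough reconstruction in Python of the concentric-circle counting idea (not the authors' code; it simply counts skin runs crossed on each circle and takes the count that holds for the most circles):

import math

def runs_on_circle(mask, cx, cy, radius, samples=360):
    # mask: 2-D boolean skin image; count contiguous skin runs crossed on one circle
    values = []
    for k in range(samples):
        theta = 2 * math.pi * k / samples
        x = int(round(cx + radius * math.cos(theta)))
        y = int(round(cy + radius * math.sin(theta)))
        inside = 0 <= y < len(mask) and 0 <= x < len(mask[0])
        values.append(bool(mask[y][x]) if inside else False)
    # count rising edges, treating the samples as circular (values[-1] wraps around)
    return sum(1 for k in range(samples) if values[k] and not values[k - 1])

def count_fingers(mask, cx, cy, radii):
    counts = [runs_on_circle(mask, cx, cy, r) for r in radii]
    return max(set(counts), key=counts.count)   # the count present for the most circles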

Discussion:
This paper takes a computer vision approach to gesture recognition, and what was nice about it was the technique of using concentric circles to determine which and how many fingers were outstretched for particular gestures. It’s quite novel and reminiscent of the technique Oltmanns used in his dissertation. On the other hand, the authors required a hard limit to do so, and this takes away from the robustness of their system. Existing methods in computer vision could probably perform just as well with their hard limit, and yet still function better without it.

3D Visual Detection of Correct NGT Sign Production (Lichtenauer, et al – 2007)

Current Mood: studious

Blogs I Commented On:


Summary:
This paper focuses on hand tracking based on skin detection for the domain of Dutch sign language (NGT). The method involves the user producing a sign, the hands and head being visually tracked, and then their features being measured. A classifier is used to evaluate the measured features. For tracking, the hands and head are tracked by detecting skin-colored blobs assigned to the head and both hands from their previous positions, and then combining this with template tracking during occlusions between the hands and the head. An operator is employed to click a square around the face and the head/hair/neck region. Skin color is modeled by a 2D Gaussian perpendicular to the main direction of the distribution of positive skin samples in RGB space. For classification, the authors use fifty properties of the 2D/3D location and movement of the hands, measured at each frame. First, a reference sign is selected for each classifier, and the time signal is warped onto that sign using Dynamic Time Warping (DTW), solely for synchronization. Then the classifier scores the properties under the assumption that the features are independent. Base classifiers are built for single features and then combined by summing their results, where a feature is selected for classification if less than 20% of the negative examples have a feature value within some 50% winsorization interval of the positive set. Signs are classified if they exceed some threshold value, which is determined by evaluating the positive training examples and using the median of the resulting values. Their approach gives 8% true positives against 5% false positives.

Discussion:
If I could summarize this paper in one sentence, it would be that it's a very complicated hand gesture recognition system which relies on skin color. Given the amount of space dedicated to their skin-color technique for aiding recognition, I think it's made partially moot by the requirement of manually selecting body parts beforehand as opposed to doing it automatically. Add to that the fact that it also requires some sort of ideal environment, and I'm not really motivated to use their technique over some other vision-based technique. Then there are their results. It's not that they're good or bad. It's just...I still don't know how well their system performs even after seeing their results.

Television control by hand gestures (Freeman & Weissman – 1994)

Current Mood: studious

Blogs I Commented On:


Summary:
The goal of this paper is to use hand gestures to operate a television set remotely, and the authors do so by creating a user interface that new users can master instantly. The problems claimed are the lack of a vocabulary for doing so, along with the image processing problem of identifying hand gestures quickly and reliably in a complex and unpredictable visual environment. The solution involves imposing constraints on the user and the environment by exploiting the visual feedback from the television. Users memorize only one gesture: holding an open hand toward the television. The computer tracks the hand, echoes its position with a hand icon, and then lets the user operate on-screen controls. The hand recognition method uses normalized correlation, where the position of the maximum correlation is the user’s hand. Local orientation is used, as opposed to pixel intensities, for robustness against lighting variations, via filters in the image processing stage. Background removal is used to avoid analyzing stationary objects like furniture: the current image is linearly combined with a running average image, and the two are subtracted to detect image positions where the change exceeds some preset threshold. Only positions above the change threshold are then processed, to gain efficiency and resolve false positives.
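
A minimal Python sketch, my own, of the running-average background removal idea described above:

import numpy as np

class BackgroundSubtractor:
    def __init__(self, alpha=0.05, threshold=25):
        self.alpha = alpha          # blending weight for the running average
        self.threshold = threshold  # change threshold (placeholder value)
        self.background = None

    def apply(self, frame):
        frame = frame.astype(np.float32)
        if self.background is None:
            self.background = frame.copy()
        # linearly combine the current image with the running average image
        self.background = (1 - self.alpha) * self.background + self.alpha * frame
        # keep only positions where the change exceeds the preset threshold
        return np.abs(frame - self.background) > self.threshold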

Discussion:
This is an ancient paper dating back 14 years. Even though its contribution is primitive by today’s standards, I believe it was pretty innovative for its time in the various concepts it introduced. One example was its attempt to create a vocabulary for gesture recognition (though I found its contributions there highly underwhelming), and another was the use of background removal for visual detection of the hand for recognition purposes. The authors cheat by requiring an open hand to do detection, and their application doesn’t appear intuitive for performing the task, but I think the ideas in the paper were ahead of their time.

A Survey of Hand Posture and Gesture Recognition Techniques and Technology (LaViola – 1999)

Current Mood: studious

Blogs I Commented On:


Summary:
This is basically a survey paper on hand gesture recognition, giving an overview of the technology for the domain at that time. The first part of the paper discusses glove- and vision-based approaches to hand gesture recognition and the data collected from them. The paper then shifts into a discussion of algorithmic techniques that can be applied to the input devices, ranging from feature extraction to learning algorithms. The last section goes into depth on applications that can benefit from the devices and techniques discussed previously, ranging from sign language to virtual reality.

Discussion:
Since this paper is a PhD thesis, it’s a very long paper. Surprisingly enough, it’s an easy read and does a great job giving an overview of the types of topics that our class is focusing on. Even though it was written almost a decade ago, this thesis is still very relevant for today’s readers. That may be attributed to how well the author covered the spectrum of the field and maybe also the lack of huge changes in the field since then. While the thesis could be faulted by some for not going deep enough in some topics, I thought it was a great overall paper nonetheless.

[17] Real-time Locomotion Control by Sensing Gloves (Komura & Lam – 2006)

Current Mood: studious

Blogs I Commented On:


Summary:
This is a paper that describes a system which uses sensing gloves as an input device, potentially for virtual 3D games. The sensing gloves used in this paper were the P5 and the Cyberglove. Their method consists of a calibration stage and a control stage for employing the sensing gloves. In the calibration stage, a mapping function is created that converts hand motion to human body motion by mimicking the character motion appearing on a graphical display using the hand. The control stage involves performing a new hand movement, and then the corresponding motion of the computer character is generated by the mapping function and displayed in real time. Construction of the mapping function first involves topological matching of the human body and the fingers by comparing the motion of the fingers to that of the character. This is estimated by first calculating the DOF and then the autocorrelation of the trajectories. The authors determined the autocorrelation value to be 50 cycles, and joints are classified as either full cycle, half cycle, or exceptional. After classifying the joints into one of the three categories, the hand’s generalized coordinates are matched to the character using the 3D velocities of the user’s fingers and the character’s body segments. This is done by first calculating the relative velocity of the character’s end effectors with respect to the root of the body, and then the relative velocities of the tips of the fingers with respect to the wrist of the hand. Lastly, all the end effectors of the character are matched to the fingers of the user. Sometimes there is a phase shift between the user’s fingers and the character’s joints, which is remedied by keeping the ratio of the phase shift and the period the same when the user controls the character. In cases when the user suddenly extends or flexes the fingers, causing the system to not cover the range of the new finger motion, the system extrapolates based on the tangent of the mapping function at the end of the original domain. Upper and lower joint angle boundaries are additionally used by keeping the joint angles the same when those boundaries are exceeded, until the mapping function comes back within the valid domain. A virtual 3D environment with walls and obstacles was prepared to test their system on users with the Cyberglove, where users controlled the computer character with the index and middle fingers to emulate walking, compared to keyboard input for the same task. The authors discovered that the average time to complete the task was shorter with the keyboard, but there were fewer collisions with the glove.

Discussion:
I found this paper to be an interesting application paper using glove devices for input, and I also thought that it did a good job covering the details behind their method and the motivation behind their design choices. My gripe is more with the reasoning behind their application, primarily because I’m not a fan of using gloves to mimic the locomotion of computer characters. The Iba paper was the only paper that came to mind that also focused on using gloves to achieve this feat, and I preferred the Iba paper because it felt more intuitive to use fingers to dictate motion as opposed to emulating motion.

[16] Shape Your Imagination: Iconic Gestural-Based Interaction (Marsh & Watt – 1998)

Current Mood: studious

Blogs I Commented On:


Summary:
This is an exploratory user study paper on iconic hand gestures. The goal of the paper was to test the hypothesis that iconic hand gestures are employable as an HCI technique for transferring spatial information. Their study was conducted on a group of 12 non-computer scientists using 15 shapes and objects. These shapes and objects were split into two groups: the Primitive group had 2D and 3D geometric shapes found in most computer graphics systems, while the Complex and Compound group could be composed of one or more Primitive group shapes. The study discovered that iconic gestures were used throughout for the Primitive group, where users preferred two-handed virtual depiction. For the complex and compound objects, users preferred to describe them using iconic two-handed gestures that were accompanied or substituted by pantomimic, deictic, and body gestures. Additionally, the authors found that all iconic hand gestures were formed immediately for shapes in the Primitive group, but took widely varying times for the Compound and Complex group.

Discussion:
This is a relatively early paper released about 10 years ago, so the results from the user study are quite limited for the type of work in our course. It is an interesting user study that could possibly be modified and extended in order to derive results that would better benefit the types of things we envision for the course, especially considering the dearth of exploratory user studies in the field.

[15] A Dynamic Gesture Recognition System for the Korean Sign Language (Kim, et al – 1996)

Current Mood: studious

Blogs I Commented On:


Summary:
This paper focuses on on-line gesture recognition for the domain of Korean Sign Language (KSL) using a glove input device. The authors chose 25 gestures out of the roughly 6,000 in KSL, which involve 14 basic hand postures; each hand posture is recognized by their system using a fuzzy min-max neural network. The authors begin by setting initial gesture positions, which they handle by calibrating the data glove output, subtracting the initial position from subsequent positions. After selecting the 25 gestures, they determined that they contained 10 basic direction types contaminated by noise from the glove. Since the measured deviation from the real value was within 4 inches, they split the x- and y-axes into 4 regions each (for a total of 8) for efficient filtering and computation. The region data at each time unit is stored in 5 cascading registers, and the register values change for the current data if it differs from the data in the previous time unit. A fuzzy min-max neural network (FMMN) is used to recognize each hand posture; it supposedly requires no pre-learning about the posture class and has on-line adaptability. A fuzzy set hyperbox, which here is a 10-dimensional box based on the two flex angles from each finger of the hand, is used in the study. This hyperbox is defined by a min point (V) and a max point (W), corresponds to a membership function, and is normalized between 0 and 1. The initial min-max values (V, W) of the network were created from empirical data of many individuals’ flex angles. A sensitivity parameter is used to regulate the speed of input pattern separation from the hyperbox, where a higher parameter value means a crisper membership function. Thus, when the network receives an input posture, the output is the membership function value for each of the 14 posture classes, and the input is classified as the posture class with the highest membership value that also exceeds some threshold value. No classification occurs when it falls below the threshold. For the given words, the system achieves an 85% recognition rate.
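
A minimal Python sketch of a hyperbox membership computation, simplified from my general reading of fuzzy min-max networks rather than taken from the paper (the falloff formula, sensitivity value, and threshold are my own placeholders):

import numpy as np

def hyperbox_membership(x, v, w, gamma=4.0):
    # x: 10-dimensional posture vector (two flex angles per finger), normalized to [0, 1]
    # v, w: hyperbox min and max points; gamma: sensitivity parameter
    below = np.maximum(0.0, v - x)       # how far x falls below the min point
    above = np.maximum(0.0, x - w)       # how far x falls above the max point
    return max(0.0, 1.0 - gamma * float(np.mean(below + above)))

def classify_posture(x, hyperboxes, threshold=0.5):
    # hyperboxes: {posture_class: (v, w)}; returns None when below the threshold
    scores = {c: hyperbox_membership(x, v, w) for c, (v, w) in hyperboxes.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None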

Discussion:
This is the second haptics paper we’ve read which used neural networks for recognition. Unlike the first paper, which used a basic perceptron network, this paper used a more complex fuzzy min-max neural network. It seems to give reasonable recognition rates and claims to have on-line adaptability. It would have been nice for the authors to have given some insight into why they chose an FMMN over other types of neural networks, and also to provide details on how it was able to adapt on-line. I did like that they provided actual network parameter values for their FMMN.

[14] Simultaneous Gesture Segmentation and Recognition based on Forward Spotting Accumulative HMMs (Song & Kim – 2006)

Current Mood: studious

Blogs I Commented On:


Summary:
This is another gesture segmentation and recognition technique paper, this time using something called forward spotting with accumulative HMMs. The domain was the application of controlling the curtains and lighting of smart homes using upper body gestures. The first main idea presented is a sliding window technique, which computes the observation probabilities of a gesture or non-gesture using a number of continuing observations within a sliding window of a particular size k. From empirical testing, the authors chose k = 3. In this technique, the partial observation probability of a segment of a particular observation sequence is computed by induction. The second idea involves forward spotting, which uses the prior technique to compute a competitive difference of observation probabilities from a continuous stream of gesture images. The basic idea is that every possible gesture class, as well as a non-gesture class, has an HMM. After a partial observation, the value of the gesture-class HMM that gives the highest probability is compared with the value of the non-gesture-class HMM, and whichever of the two gives the higher value is chosen. Accumulative HMMs, which are HMMs that accept all possible partial posture segments for determining the gesture type, are additionally used in the paper to improve gesture recognition performance. During testing, two gesture-spotting techniques were used on the eight gesture classes for their particular domain: manual and automatic threshold spotting. The latter performed better, with accuracy rates mostly in the 90s up to perfect.

Discussion:
It’s hard to gauge the quality of the gesture segmentation technique in this paper. It seems like the technique offers a solution to problems common to using generic HMMs in haptics, such as handling partial observations. Also, the use of their “junk” class, while nothing special, wouldn’t hurt. On the other hand, they tested their technique on such a simple domain without comparing to other techniques in the process. It looks like the only way to judge the technique’s merits is to actually implement their gesture segmentation algorithm on a more complex domain. The jury is still out on this one.

[13] A Survey of POMDP Applications (Cassandra – 2004)

Current Mood: studious

Blogs I Commented On:


Summary:
The paper gives a brief introduction to a certain type of Markov decision process (MDP) model called the partially observable MDP (POMDP), along with a survey of applications that benefit from the use of POMDPs. A POMDP model consists of the following: a finite set of states S (all possible, unobservable states the process can be in), a set of actions A (all available control choices at a point in time), a set of observations Z (all possible observations the process can emit), a state transition function tau (encoding the uncertainty in how the process state evolves), an observation function o (relating observations to the true process state), and an immediate reward function r (giving the immediate utility of taking an action in each process state). The goal is to derive a control policy that yields the highest utility over a number of decision steps.
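Since the paper assumes familiarity with MDPs, here is a minimal sketch of the belief-state update that makes POMDP control work in practice, assuming tabular transition and observation arrays; this is my own illustration, as the survey itself doesn't give code.

import numpy as np

def belief_update(belief, action, observation, T, O):
    # belief: length-|S| probability vector over the hidden states
    # T[a][s, s']: state transition probabilities tau(s' | s, a)
    # O[a][s', z]: observation probabilities o(z | s', a)
    predicted = belief @ T[action]                    # predict the next-state distribution
    updated = predicted * O[action][:, observation]   # weight by likelihood of what was observed
    return updated / updated.sum()                    # renormalize to a probability vector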

The majority of the paper is devoted to the survey of applications, but none of them address haptics directly. The end of the paper then covers two types of POMDP limitations: theoretical and practical. On the theoretical side, POMDPs do not easily handle problems with certain characteristics, and the model is itself data intensive. On the practical side, the first problem is representing states as a set of attributes, which causes even small concepts to have large state spaces since every combination of attribute values must be enumerated. The second problem is that computing the optimal policy for a general POMDP is intractable.

Discussion:
While it would have been preferable to see the paper focus in more detail on a particular application that used POMDPs in order to judge their merits, it makes a decent case for POMDPs as a potentially useful technique for the kinds of things that can be done in haptics. In addition, I think this paper gave an okay summary of POMDPs, assuming the reader already has prior knowledge of MDPs.

[12] Cyber Composer: Hand Gesture-Driven Intelligent Music Composition and Generation (Ip – 2005)

Current Mood: studious

Blogs I Commented On:


Summary:
The paper describes the Cyber Composer, a prototype system that takes an application-based approach in the field of haptics. The system allows both experienced and inexperienced users to express music at a high level without using musical instruments, relying solely on hand motions and gestures and some familiarity with music theory. Musical expressions are mapped to certain hand motions and gestures in the hope of being useful to experienced musicians while remaining intuitive for musical laypersons. These expressions include: rhythm, pitch, pitch-shifting, dynamics, volume, dual-instrument mode, and cadence. The system uses Cybergloves to input the musical expressions.

Discussion:
One interesting aspect of the paper is its novel application of haptics to express music without traditional musical instruments, while trying to balance usefulness for experienced musicians against intuitiveness for inexperienced ones. One part of the paper that was severely lacking was the reasoning behind the mapping chosen for each musical expression. It would have been nice to see some user study or testing results to defend the choices they made for their particular system.

[11] A Similarity Measure for Motion Stream Segmentation and Recognition (Li & Prabhakaran – 2005)

Current Mood: studious

Blogs I Commented On:


Summary:
The paper presents itself as a motion segmentation paper, which basically seems like gesture segmentation over a broader domain. The directional and angular values of a motion are recorded in matrix form, where columns correspond to attributes and rows to time samples; two motions can then be compared as long as they have the same number of columns (i.e., attributes), even with different numbers of rows (i.e., times). This similarity comparison is done using singular value decomposition (SVD), which reveals the geometric structure of a matrix. If two motions are similar, their corresponding eigenvectors should be parallel and their corresponding eigenvalues proportional, so only the eigenvectors and eigenvalues of the two motions’ matrices are considered. Their similarity measure, which depends on the eigenvectors and eigenvalues, relies on an integer k, 1 < k < n, which determines how many eigenvectors are used given n attributes in a motion matrix; in this paper, k = 6 from empirical testing. Given the significance of k, this non-metric similarity measure is called the k Weighted Angular Similarity (kWAS), which captures the angular similarities of the first k corresponding eigenvector pairs, weighted by the corresponding eigenvalues. To recognize motion streams, the paper assumes a minimum gesture length l and a maximum length L, and kWAS is applied incrementally to segment the streams for motion recognition, with data taken from a Cyberglove and cameras.
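Here is a rough sketch of what a kWAS-style comparison could look like, using the singular vectors and singular values of each motion matrix (equivalently, the eigen-structure of M^T M). This is my reading of the measure with a made-up weighting scheme; the paper's exact formula may differ.

import numpy as np

def kwas_similarity(motion_a, motion_b, k=6):
    # motion_a, motion_b: (time x attributes) matrices with the same number of columns
    _, sa, va = np.linalg.svd(motion_a, full_matrices=False)
    _, sb, vb = np.linalg.svd(motion_b, full_matrices=False)
    # weights derived from the singular values of both motions (assumed scheme)
    wa, wb = sa[:k] / sa[:k].sum(), sb[:k] / sb[:k].sum()
    weights = (wa + wb) / 2.0
    # |cos angle| between corresponding singular vector pairs (parallel vectors give 1)
    cosines = np.abs(np.sum(va[:k] * vb[:k], axis=1))
    return float(np.sum(weights * cosines))   # close to 1.0 means the motions look alike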

Discussion:
The paper claims accuracy rates in the high 90s, but as Brandon pointed out in class, those rates stemmed from artificially produced samples. Since that was the case, there really weren’t any results for their motion segmentation algorithm, only a potential idea. I believe it has potential, and it’s especially worth a look given the dearth of useful segmentation algorithms in this domain, but real results would be greatly appreciated in order to truly judge its merits.

[10] Hand Tension as a Gesture Segmentation Cue (Harling & Edwards – 1997)

02 February 2008

Current Mood: speechless

Blogs I Commented On:


Rant:
How is it that this university has two chicken fingers restaurants near campus, yet there is no buffalo wings restaurant nearby? (No, Buffalo Wild is not nearby.) What they need to do is convert Layne's into a buffalo wings restaurant. Their chicken fingers and sauce suck compared to Raising Cane's.

Summary:
Gesture classes can be defined as having either static or dynamic hand postures, and either static or dynamic hand locations. Recognizers designed for hand gesture recognition have difficulty deciding whether two distinct and consecutive hand motions constitute a single atomic gesture. The authors propose a system to help remedy that by considering what happens to the muscles of the fingers during posture creation. They observe that as the hand moves from one posture to another, the amount of tension changes, with some postures being tenser than others. They thus theorize that intentional gestures are made with a tense hand rather than a relaxed one. Of the four gesture classes, their segmentation method based on this theory works best when dynamic finger motions are not involved.

Current input technology does not directly measure finger tension, so their model treats a finger as a light rigid rod of fixed length with two light elastic strings attached to the end of the rod. They resolve the forces along the finger using Hooke’s law to determine the amount of tension in each finger. Total hand tension is the sum over the fingers, so it rises and falls as individual finger tension rises and falls.
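A minimal sketch of that style of tension estimate, assuming we only get a per-finger bend value from the glove and that string extension grows with bend; the geometry, spring constant, and finger count below are my own placeholders, not the paper's actual model.

import math

SPRING_K = 1.0   # assumed spring constant for the elastic strings (Hooke's law: F = k * x)
ROD_LEN = 1.0    # normalized length of the rigid "finger" rod

def finger_tension(bend_angle_rad):
    # crude geometric assumption: the string on the flexed side stretches as the fingertip
    # swings through bend_angle, while the opposing string goes slack (contributes nothing)
    extension = 2.0 * ROD_LEN * math.sin(bend_angle_rad / 2.0)
    return SPRING_K * extension

def hand_tension(bend_angles_rad):
    # total hand tension = sum of the individual finger tensions
    return sum(finger_tension(a) for a in bend_angles_rad)

# a fist (all four measured fingers fully bent) scores higher than a relaxed open hand
print(hand_tension([math.pi / 2] * 4), ">", hand_tension([0.2] * 4))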

The hand model was tested on two sets of gestures with a Mattel Power Glove, which measured the bend of each finger except the pinky, for the domain of BSL. When sentence fragments were executed as hand gestures, the graph of hand tension for the sentence fragments showed local maxima while gestures were performed and local minima during the transitions to the next gesture.

Discussion:
It’s an interesting idea to use finger tension as a way to segment different hand gestures in their domain of BSL (which isn’t much different from ASL, I suppose). Their improvised way of measuring finger tension seems to offer decent segmentation results, so their theory held up quite well. On the other hand, they note that this approach doesn’t work for dynamic hand postures and locations, which is a shame since those types of gestures are more natural. Though their approach alone wouldn’t be very useful for a robust gesture recognition system, it could be a useful metric to aid a particular area of hand gesture recognition.

One other point that I wanted to discuss is the authors’ comments about using finger tension to aid actual recognition rather than segmentation. Originally they theorized their system for the problem of segmentation, but they observed from their sample data that different postures in their study exhibited unique levels of hand tension. Hence, they wonder if these different levels of finger tension hold in the general case. I have my doubts that this metric would be reliable for a domain with a large gesture library, but it might be useful for a smaller library. Of course, the segmentation problem would also decrease in complexity with a smaller library anyway. I think it would be more productive for them to focus their attention on solving the dynamic portion of the hand gesture classes, but their idea on this matter is intriguing and worthy of a look. I definitely would like to see if this correlation holds.

[09] A Multi-Class Pattern Recognition System for Practical Finger Spelling Translation (Hernandez-Rebollar, et al – 2002)

Current Mood: studious

Blogs I Commented On:


Rant:
Looks like the groundhog didn't see its shadow today, because the weather's so nice right now. I hope I'm right.

Summary:
The authors propose a system for ASL that addresses portability and affordability by using an inexpensive microcontroller and a set of five dual-axis accelerometers. Their input device is the Accele Glove, which doesn’t require an external tracking system. The glove is the key component of their system, since it measures finger position. A PC is used for data analysis and algorithm training, and a voice synthesizer vocally outputs recognized letters. The dual-axis sensors are attached to the middle joints of the fingers to eliminate ambiguity; the accelerometers measure joint flexion on the y-axis, and hand roll/yaw and individual finger abduction on the x-axis.

The Accele Glove can measure finger flexion (or hand shape) as well as hand orientation with respect to the gravitational vector, without needing an external sensor. The extracted features are: orientation of the fingers, total number of fingers bent, and palm orientation (closed, horizontal, and vertical/open). The sample space used for classification consists of 50 samples for the 26 ASL symbols. A decision tree was used to handle classification in a hierarchical structure, where different features are tested at each level. Many letters are recognized in two steps, while the most difficult ones are recognized at the bottom level.
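To give a feel for the hierarchical classification, here is a toy sketch of a decision tree over the kinds of features described. The thresholds, feature names, and letter groupings are made up purely for illustration; the paper's actual tree is different.

def classify_letter(palm_orientation, num_fingers_bent, index_flexion):
    # level 1: coarse split on palm orientation
    if palm_orientation == "horizontal":
        # level 2: split on how many fingers are bent (many letters stop here)
        if num_fingers_bent == 0:
            return "B"          # hypothetical: flat open hand
        return "A" if num_fingers_bent >= 4 else "unknown"
    elif palm_orientation == "vertical":
        # harder letters fall through to finer-grained tests at the bottom level
        if index_flexion > 0.7:
            return "X"          # hypothetical: hooked index finger
        return "D" if num_fingers_bent == 3 else "unknown"
    return "unknown"

print(classify_letter("horizontal", 5, 0.9))   # -> "A" under these made-up rules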

Ten samples were collected from each of five volunteers to train and test the system. 21 of the 26 ASL letters achieved perfect recognition, while the remaining letters could not be recognized without error by the linear decision functions, yielding accuracies such as 90%, 78%, and 96%.

Discussion:
Finally, a hand gesture paper that doesn’t use HMMs. Instead, this paper uses decision trees for recognition. I found this very interesting, because I didn’t know decision trees could be used successfully for something like hand gestures that can’t be easily classified linearly. In my opinion, there’s a lot of other good stuff in this paper: not using the Cyberglove but an inexpensive alternative instead, having lots of sample data from several people instead of one, and actually testing on all the ASL symbols plus a few more to produce basic communication. Using index finger position as the initial posture recognition component is an interesting choice that makes sense with a decision tree implementation, but we can definitely see where the implementation reaches its limits with the erroneous classifications on a few ASL letters. I wonder if there’s an easy way to incorporate the power of non-linear functions into a decision tree implementation…

[08] A Dynamic Gesture Interface for Virtual Environments Based on Hidden Markov Models (Chen, et al – 2005)

01 February 2008

Current Mood: slightly peeved

Blogs I Commented On:


Rant:
The papers...they never end...

Summary:
Meaningful hand gestures come in two types: static postures (e.g., ASL) and continuous dynamic gestures. The latter consist of global hand motions (i.e., large hand rotations and translations) and local finger motions (i.e., parameterized by a set of joint angles). The focus of this paper is a continuous dynamic gesture recognition system based on HMMs. The prototype for continuous dynamic gesture recognition involves rotating a cube with three different gestures.

The implementation of the system has three steps. The first step involves collecting the raw data and preprocessing it. Dynamic gestures are modeled with discrete HMMs, and the observation signals are the standard deviations of the angle variations for each finger joint. The standard deviation describes the dynamic character of each joint’s angle variation, which transforms the multi-dimensional observation signal into an easier-to-process single discrete dimension. The second step is training the HMMs using the Baum-Welch algorithm; ten data sets were taken for each dynamic gesture, resulting in three HMMs. The third step is gesture recognition, which uses the forward-backward algorithm. The paper gave no results for their system.
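A rough sketch of how I picture that preprocessing step, assuming a window of joint-angle samples is reduced to one discrete symbol by binning standard deviations; the window handling, number of bins, bin edges, and the averaging across joints are my own assumptions, not values from the paper.

import numpy as np

BINS = np.array([0.05, 0.15, 0.30])   # assumed bin edges (radians) for quantizing std-devs

def window_to_symbol(joint_angle_window):
    # joint_angle_window: (samples x joints) array for one time window
    stds = joint_angle_window.std(axis=0)      # dynamic character of each joint's variation
    overall = stds.mean()                      # collapse all joints to one scalar (assumption)
    return int(np.digitize(overall, BINS))     # discrete observation symbol in {0, 1, 2, 3}

def sequence_to_symbols(windows):
    # a sequence of windows becomes the discrete observation sequence fed to the HMMs
    return [window_to_symbol(w) for w in windows]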

Discussion:
What an interesting paper. There were neither any results to show the performance of their system nor any convincing arguments as to why HMMs were used with their standard deviation technique. I do think that the authors’ standard deviation technique with HMMs could work well for other applications -- in theory. In theory, communism works. In theory.

[07] Online, Interactive Learning of Gestures for Human/Robot Interfaces (Lee & Xu – 1996)

30 January 2008

Current Mood: toasty

Blogs I Commented On:


Rant:
I like having assigned readings for Monday's class, because we have five days to read over them. Assigned readings for Wednesday's class? Eh...not so zesty.

Anyway, looks like Brandon's once again the only person caught up with his blog posts for the course. He so h4x like a 1337 h4x0r. L4m3.

Summary:
The authors of this paper focus on a system where a robot learns a task by observing a teacher. The downfall of existing systems is the lack of a mechanism for online teaching of gestures with symbolic meanings. This paper claims a gesture recognition system that can interactively learn gestures from as few as one or two examples. Their approach uses HMMs, and their device is a Cyberglove for the domain of sign language. Their general procedure for interactive training is: 1) the user makes a gesture, 2) the system segments the input into a separate gesture for classification (if it is certain of the classification, it performs the corresponding action; otherwise it queries the user for confirmation), 3) the system adds the symbols of that gesture to its list, then updates the HMM based on the example.

For gesture recognition, the system preprocesses the raw data into a sequence of discrete observation symbols, determines which of the HMMs most likely generated that sequence, and checks whether there is ambiguity between two or more gestures or no known gesture resembling the observed data. For learning gesture models, the Baum-Welch (BW) algorithm is used to find the HMM that is a local maximum in the likelihood of generating the observed sequences. To allow for an online, interactive style of gesture training, the HMMs are trained by starting with one or a small number of examples, running BW until it converges, and then iteratively adding more examples while updating the model with BW on each one. In the signal preprocessing stage, the use of discrete HMMs required representing gestures as a sequence of discrete symbols; the hand was treated as a single entity, and a one-dimensional sequence of features representing the entire hand was generated. The preprocessing of the input data was a vector quantization of a series of short-time fast Fourier transforms.
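Here is a minimal sketch of that preprocessing idea, assuming we take short-time FFT magnitudes of the glove's joint-angle streams and vector-quantize them against a codebook learned with k-means. The window length, codebook size, and use of scikit-learn are my own choices, not details from the paper.

import numpy as np
from sklearn.cluster import KMeans

WIN = 16  # assumed short-time window length (samples)

def stft_features(joint_angles):
    # joint_angles: (time x joints) array of glove readings
    feats = []
    for start in range(0, len(joint_angles) - WIN + 1, WIN // 2):   # 50% overlap
        window = joint_angles[start:start + WIN]
        spectrum = np.abs(np.fft.rfft(window, axis=0))   # short-time FFT per joint
        feats.append(spectrum.ravel())                   # one feature vector per window
    return np.array(feats)

def build_codebook(training_feature_vectors, codebook_size=32):
    # vector quantization codebook learned from training feature vectors
    return KMeans(n_clusters=codebook_size, n_init=10).fit(training_feature_vectors)

def to_symbols(joint_angles, codebook):
    # discrete observation symbols for the discrete HMMs
    return codebook.predict(stft_features(joint_angles))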

The implementation of the hand gesture recognition system used 5-state Bakis HMMs, which restrict the system to moving from a given state only to the same state or one of the next two states. This allows the assumption of a simple, non-cyclical sequence of motions. Classification error rates were 1.0% and 2.4% after two examples, 0.1% after four examples, and zero after six examples.
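For reference, here is a quick sketch of what a 5-state Bakis (left-to-right) transition matrix looks like under that restriction, where each state can only stay put or jump ahead by one or two states; the uniform initialization is my assumption, since Baum-Welch training would refine the values anyway.

import numpy as np

def bakis_transition_matrix(n_states=5, max_jump=2):
    # each state may transition to itself or to up to max_jump states ahead of it
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        allowed = range(i, min(i + max_jump + 1, n_states))
        for j in allowed:
            A[i, j] = 1.0 / len(allowed)   # uniform to start; training refines these
    return A

print(bakis_transition_matrix())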

Discussion:
This paper fits pretty well in the context of this course. It’s cited by the Iba paper on gesture-based control for mobile robots, it focuses on the domain of sign language like the Allen paper, and it relies heavily on the HMMs introduced in the Rabiner paper. Even though I wasn’t too familiar with the Baum-Welch algorithm, a key aspect of their implementation, I liked how they modified its use for training HMMs from an offline, batch-based approach to an interactive, online-based one. Unfortunately, they did not give results for how its performance compared against the typical batch-based approach, only noting that their online training approach came close to the results of an offline one.

[06] HoloSketch: A Virtual Reality Sketching/Animation Tool (Deering – 1995)

27 January 2008

Current Mood: clinicly retarded

Blogs I Commented On:


Rant:
I noticed that, at the time of this post, Brandon's the only one who's completely up-to-date on his blog posts. Why? Because he h4x.

Summary:
Virtual reality (VR) is considered the next-generation man-machine interface, but it is not widespread and is still limited to proof-of-concept prototypes. The goal of the paper is to test the hypothesis that VR can have mass-market appeal. Thus, the author proposes HoloSketch, a 3D sketching system that can create and edit shapes in the air. Some previous “direct interaction” 3D construction systems resort to a 2D mouse for input, which only takes advantage of two of the six control variables (the hands control xyz position and three axes of rotation). Systems that do use full six-axis input devices are restricted to relative mode (i.e., not being at the visual site of object creation) or have limited visual resolution with head-mounted displays (HMDs). In addition, all of those systems built VR objects with a 2D system or text editors.

HoloSketch claims to resolve the above problems. Though designed to work with multiple different VR environments, the display of focus in the paper is a “fish-tank stereo” display with a 3D mouse augmented into a six-axis wand, or “one-fingered data glove.” Furthermore, the display system employs highly accurate absolute (i.e., closely matching the real world) measurements using relatively fat CRTs. The design philosophy for HoloSketch was to extend 2D sketching functionality into 3D. Some problems they encountered when shifting to 3D were increased screen real-estate cost, the 3D consequences of Fitts’ law, and rendering-resource limitations. Their solutions included moving all main-menu controls to buttons on the wand, which then display as a 3D pie menu.

HoloSketch was intended as a general-purpose 3D sketching and simple animation system. To test it, they employed a computer/traditional artist for a month. Positive comments on the system involved the immediacy of the 3D environment and increased productivity over traditional tools, while negative comments included a learning curve, difficulty making fine adjustments with the wand, and the need for a richer interface to do other types of projects. Some limitations the author noted were less complex imagery due to hardware limitations, less robust six-axis tracking, a software package only comparable to simple 2D systems, and a less-than-optimal 3D interface.

Discussion:
This paper was written over a decade ago, and since then I haven’t really seen any equivalent system in widespread use. The technology is definitely there, especially compared to back then: graphics hardware is more than capable of achieving that type of environment, and input devices can achieve that type of functionality, as can be seen with the Wiimote. This may partly be because this type of interface isn't being embraced as anything more than a novelty. For what the system is, it’s an interesting concept that does a reasonable job of making sketching appear to work seamlessly in the third dimension. Since the technology is possible, I would have preferred to see more focus on dealing with Fitts’ law in a 3D environment instead of demonstrating unique input actions. Besides that, the techniques used to implement the various actions in the system were quite informative.

[05] An Architecture for Gesture-Based Control of Mobile Robots (Iba – 1999)

Current Mood: uber-humored

Blogs I Commented On:


Rant:
At first, I didn’t think it was worth the effort to use a haptics approach for controlling mobile robots compared to using a conventional remote controller, especially for a single robot. Then I thought about it for a moment and imagined it from a different perspective. Suppose I had control of an army of obedient and competent toddlers. If I had to use a mechanism to control them, would it be more practical to use gesture-based hand-motion controls or a conventional controller? It’s a silly question anyway. Where could I possibly find enough diapers to supply such an army?

Summary:
The idea in this paper is to transfer the burden of programming manipulator systems from robot experts to task experts, who have extensive task knowledge but limited robotics knowledge. The goal is to enable new users to interact with robots through an intuitive interface and the ability to interpret sometimes vague user specifications. The challenge is for robots to interpret user intent instead of simply mimicking motions. Their approach uses hand gestures as input to mobile robots via a data glove and position sensor, allowing for richer interaction. The system is currently limited by its cable connections, which may be overcome by advances in wearable computing.

The system is composed of a data glove, a position sensor, and a geolocation system that tracks the position and orientation of the mobile robot. The data glove and position sensor are connected to a gesture sensor that spots and interprets gestures. Waving in a direction moves the robot (local control mode), while pointing at a desired location emulates a ‘point-and-go’ command (global control mode). Gestures themselves are recognized with an HMM-based recognizer. First, the data glove’s joint angle measurements are preprocessed to improve the speed and performance of gesture classification, and then gesture spotting classifies the data as one of six gestures (OPENING, OPENED, CLOSING, POINT, WAVING LEFT, WAVING RIGHT) or “none of the above” (to prevent inadvertent robot actions).

The HMM algorithm differs from Rabiner and Juang’s standard forward-backward technique in two ways. First, the observation sequence is limited to the n most recent observations, since the probability of an ever-growing observation sequence given the HMM decays toward zero. Second, a “wait state” is created as the first node, transitioning to all of the gesture models and to itself with equal probability. The reasoning is that if an observation sequence corresponds to some gesture, the final state of that gesture’s model will have the highest probability; otherwise the sequence is trapped in the “wait state” until subsequent observations raise the correct gesture’s probability. This is in response to the baseline model, where one of the six gestures is selected unless the threshold is high enough to reject them all, which the authors find unacceptable since it may also exclude gestures performed slightly differently from the training gestures.
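Here is a sketch of how I imagine the spotting loop with the sliding observation window and the wait-state fallback, assuming per-gesture HMMs that can report the probability of ending in their final state; the hmm interface, the value of n, and the acceptance threshold are placeholders of mine, not the paper's code.

from collections import deque

N_RECENT = 10        # keep only the n most recent observations (assumed value of n)
GESTURES = ["OPENING", "OPENED", "CLOSING", "POINT", "WAVING_LEFT", "WAVING_RIGHT"]

def make_spotter(gesture_hmms, accept_threshold=0.5):
    window = deque(maxlen=N_RECENT)      # sliding window of recent observation symbols

    def spot(symbol):
        window.append(symbol)
        # probability that each gesture model has reached its final state
        scores = {g: gesture_hmms[g].final_state_prob(list(window)) for g in GESTURES}
        best_gesture, best = max(scores.items(), key=lambda kv: kv[1])
        if best > accept_threshold:
            window.clear()               # gesture spotted; start fresh for the next one
            return best_gesture
        return None                      # effectively the "wait state": no robot action yet

    return spot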

Discussion:
The idea of controlling a group of mobile robots (or an army of toddlers, your choice) using a haptics approach seems more feasible than current methods, since hand commands feel like a natural thing to use. Therefore, I think the paper’s work towards that has a lot of merit. As for the execution, I found their modified HMM-based implementation quite intriguing. The first modification, limiting the observation sequence to only the n most recent observations, feels like it should have been part of the original standard forward-backward technique. I still haven’t come up with a reason why leaving the observation sequence unrestricted would be beneficial.

Their second modification, employing a "wait state," was also a novel move to recognize gestures performed slightly differently from the training ones. The authors reasoned that observation sequences that didn’t correspond to any gesture would be held in the "wait state" until more observations raised the correct gesture’s probability. There’s one part of this second modification that confuses me, though. Suppose that during the execution of a gesture, the user makes a posture different enough from the training data to confuse the system. According to this second modification, the observations up to that posture would put the gesture into the "wait state" until later observations connected to that gesture raised the intended gesture’s probability. What happens if the user finishes the gesture while it is still in the "wait state"? Future observations would then stem from the next gesture being executed by the user. From what I read, I would imagine the probability of the previous gesture would never increase, so the system would either never classify the previous gesture (if the n value for the number of recent observations is low enough) or classify the current gesture incorrectly (if the n value is high enough).

[04] An Introduction to Hidden Markov Models (Rabiner & Juang – 1986)

Current Mood: chilly

Blogs I Commented On:


Rant:
I was always psyched going to McDonald's as a kid, thinking it would be so awesome to eat there everyday. Now I have fulfilled that silly dream, but it's just not as cool anymore since that's the place where I do most of my studying. At least I can take comfort in epic unlimited drink refills to keep me hydrated. Beat that, library!

Speaking of McDonald's, have any of you swung by the place during lunch time on the weekdays? It sometimes feels like I'm back in Asia, with the amount of Chinese and Korean I hear around the place at that time. McDonald's on University Drive: the Little Asia of College Station.

Summary:
Hidden Markov Models (HMMs) are defined as a doubly stochastic process: an underlying stochastic process that is not observable, which can only be observed through another set of stochastic processes that produce the sequence of observed symbols. The elements of an HMM are the following:
1. There are a finite number, say N, of states in the model.
2. At each clock time, t, a new state is entered based upon a transition probability distribution, which depends on the previous state.
3. After each transition is made, an observation symbol is produced according to a probability distribution which depends on the current state.

The “Urn and Ball” model illustrates a concrete example of an HMM in action. In this model, there are:
* N urns, each filled with a large number of colored balls
* M possible colors for each ball
* an observation sequence:
> choose one of the N urns (according to the initial probability distribution)
> select ball from initial urn
> record color from ball
> choose a new urn based on the transition probability distribution of the current urn

A formal notation of a discrete observation HMM consists of the following:
* T = length of observation sequence (total number of clock times)
* N = number of states (urns) in the model
* M = number of observation symbols (colors)
* Q = {q_1, q_2, … , q_N}, states (urns)
* V = {v_1, v_2, … , v_M}, discrete set of possible symbol observations (colors)
* A = {a_i,j}, a_i,j = Pr(q_j at t+1 | q_i at t), state transition probability distribution
* B = {b_j(k)}, b_j(k) = Pr(v_k at t | q_j at t), observation symbol probability distribution in state j
* pi = {pi_i}, pi_i = Pr(q_i at t=1), initial state distribution

For an observation sequence, O = O_1 O_2 … O_T:
1. Choose an initial state, i_1, according to the initial state distribution, pi.
2. Choose O_t according to b_i_t(k), the symbol probability distribution in state i_t.
3. Choose i_t+1 according to {a_i_t,i_t+1}, i_t+1 = 1,2,…,N, the state transition probability distribution for state i_t.
4. Set t = t+1; return to step 2 if t < T; otherwise, terminate the procedure.
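That generation procedure is easy to turn into code; here is a minimal sketch of sampling an observation sequence from a discrete HMM lambda = (A, B, pi). The toy parameter values at the bottom are mine, just to make it runnable.

import numpy as np

def sample_sequence(A, B, pi, T):
    # A: N x N state transition matrix, B: N x M symbol probabilities, pi: initial distribution
    rng = np.random.default_rng()
    N, M = B.shape
    state = rng.choice(N, p=pi)                           # choose initial state from pi
    observations = []
    for _ in range(T):
        observations.append(rng.choice(M, p=B[state]))    # emit a symbol from b_state
        state = rng.choice(N, p=A[state])                 # transition according to a_state
    return observations

# toy 2-state, 3-symbol model (made-up numbers)
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
print(sample_sequence(A, B, pi, T=5))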

HMMs are represented by the compact notation lambda = (A, B, pi), and are specified by a choice of the number of states N, the number of discrete symbols M, and the specification of A, B, and pi. The three basic problems for HMMs are:
1. Evaluation Problem - Given a model and a sequence of observations, how do we “score” or evaluate the model?
2. Estimation Problem – How do we uncover the hidden part of the model (i.e., the state sequence)?
3. Training Problem – How do we optimize the model parameters to best describe how an observed sequence came about?
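For the evaluation problem in particular, the standard solution is the forward algorithm; here is a minimal sketch in the same plain-text notation (my own implementation, not code from the paper). Using the toy A, B, and pi from the sampling sketch above, forward_probability(A, B, pi, [0, 2, 1]) would give the Pr(O | lambda) that the evaluation problem asks for.

import numpy as np

def forward_probability(A, B, pi, observations):
    # returns Pr(O | lambda) by summing over all state paths with dynamic programming
    alpha = pi * B[:, observations[0]]        # alpha_1(i) = pi_i * b_i(O_1)
    for o in observations[1:]:
        alpha = (alpha @ A) * B[:, o]         # alpha_t+1(j) = [sum_i alpha_t(i) a_i,j] * b_j(O_t+1)
        # (in practice you would scale alpha here to avoid underflow on long sequences)
    return alpha.sum()                        # Pr(O | lambda) = sum_i alpha_T(i)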

Discussion:
This is a somewhat decent paper to suggest for introducing HMMs. I really liked the “Urn and Ball” example as a simple and concrete way to describe the HMM structure. On the other hand, the coin examples used to illustrate HMM execution could have used some more clarification. I would focus on the first half of the paper to get a feel for HMMs, and then on the second half to get a good understanding of how to begin implementing one. That’s assuming anyone can even make out what it says…