[07] Online, Interactive Learning of Gestures for Human/Robot Interfaces (Lee & Xu – 1996)

30 January 2008

Current Mood: toasty

Blogs I Commented On:


Rant:
I like having assigned readings for Monday's class, because we have five days to read over them. Assigned readings for Wednesday's class? Eh...not so zesty.

Anyway, looks like Brandon's once again the only person caught up with his blog posts for the course. He so h4x like a 1337 h4x0r. L4m3.

Summary:
The authors of this paper focus on a system where a robot learns a task by observing a teacher. The downfall of existing systems is the lack of a mechanism for online teaching of gestures with symbolic meanings. This paper presents a gesture recognition system that can interactively learn gestures from as few as one or two examples. Their approach uses HMMs, and their device is a CyberGlove applied to the domain of sign language. Their general procedure for interactive training is: 1) the user makes a gesture, 2) the system segments the input into a separate gesture for classification (if it is certain of the classification, it performs the corresponding action; otherwise it queries the user for confirmation), and 3) the system adds the gesture's symbols to its example list, then updates the HMM with the new example.
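To make the control flow concrete, here is a toy Python sketch of that interactive loop. The helper names (segment, classify, ask_user, update_model) are illustrative stand-ins I pass in as parameters, not the paper's actual components.

```python
# Toy sketch of the interactive training loop described above.
# The helpers are injected stand-ins, not the paper's real components.

def interactive_loop(stream, models, segment, classify, update_model,
                     ask_user, threshold=0.9):
    for raw in stream:
        seq = segment(raw)                         # 2) segment into one gesture
        label, confidence = classify(seq, models)  # score against known models
        if confidence < threshold:                 # uncertain -> query the user
            label = ask_user(seq, label)
        models.setdefault(label, []).append(seq)   # 3) store the new example
        update_model(models, label)                # incremental re-estimation

# trivial usage with stand-ins, just to show the control flow
interactive_loop(
    stream=[[1, 2, 3], [3, 2, 1]],
    models={},
    segment=lambda raw: raw,
    classify=lambda seq, models: ("wave", 0.5),
    ask_user=lambda seq, guess: guess,
    update_model=lambda models, label: None,
)
```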

For gesture recognition, the system preprocesses the raw data into a sequence of discrete observation symbols, determines which of the known HMMs most likely generated that sequence, and checks whether there is ambiguity between two or more gestures or whether no known gesture resembles the observed data. For learning gesture models, the Baum-Welch (BW) algorithm is used to find an HMM that is a local maximum in the likelihood of generating the observed sequences. To allow an online, interactive style of gesture training, each HMM is trained by starting with one or a small number of examples, running BW until it converges, and then iteratively adding more examples, updating the model with BW on each one. In the signal preprocessing stage, the use of discrete HMMs requires representing gestures as sequences of discrete symbols; the hand is treated as a single entity, and the sequence is generated as a one-dimensional sequence of features representing the entire hand. The preprocessing of the input data is a vector quantization of a series of short-time fast Fourier transforms.
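Here is a minimal Python sketch of that preprocessing step, assuming an already-built codebook; the frame length, hop size, and codebook contents are illustrative choices of mine, not the paper's parameters.

```python
import numpy as np

def preprocess(signal, codebook, frame_len=32, hop=16):
    """Turn a 1-D glove signal into a sequence of discrete observation
    symbols: short-time FFT magnitudes, then vector quantization against
    a codebook (index of the nearest codeword)."""
    symbols = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        spectrum = np.abs(np.fft.rfft(frame))            # short-time FFT feature
        dists = np.linalg.norm(codebook - spectrum, axis=1)
        symbols.append(int(np.argmin(dists)))            # nearest-codeword index
    return symbols

# toy usage: 8 random codewords of the right length, a random "signal"
rng = np.random.default_rng(0)
codebook = rng.random((8, 32 // 2 + 1))
print(preprocess(rng.random(256), codebook))
```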

The implementation of the hand gesture recognition system used 5-state Bakis HMMs, which restrict the model to moving from a given state only to the same state or to one of the next two states. This encodes the assumption of a simple, non-cyclical sequence of motions. Classification error rates were 1.0% and 2.4% after two examples, 0.1% after four examples, and zero after six examples.
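As a quick illustration of the Bakis topology, a transition matrix where each state can only reach itself or the next two states might be initialized like this (the uniform values are my assumption, not the paper's):

```python
import numpy as np

def bakis_transitions(n_states=5, jump=2):
    """Left-to-right (Bakis) transition matrix: from state i the model may
    only stay in i or move ahead to i+1 or i+2. Rows are initialized
    uniformly over the allowed targets."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        targets = list(range(i, min(i + jump + 1, n_states)))
        A[i, targets] = 1.0 / len(targets)
    return A

print(bakis_transitions())
```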

Discussion:
This paper fits pretty well in the context of this course. It's cited by the Iba paper on gesture-based control for mobile robots, it focuses on the domain of sign language like the Allen paper, and it relies heavily on HMMs as introduced in the Rabiner paper. Even though I wasn't too familiar with the Baum-Welch algorithm, a key piece of their implementation, I liked how they adapted its use for training HMMs from an offline, batch-based approach to an interactive, online one. Unfortunately, they did not give results showing how its performance compared against the typical batch-based approach, only noting that their online training approach came close to the results of an offline one.

[06] HoloSketch: A Virtual Reality Sketching/Animation Tool (Deering – 1995)

27 January 2008

Current Mood: clinically retarded

Blogs I Commented On:


Rant:
I noticed that, at the time of this post, Brandon's the only one who's completely up-to-date on his blog posts. Why? Because he h4x.

Summary:
Virtual reality (VR) is considered the next-generation man-machine interface, but it is not yet widespread and existing systems are largely limited to proof-of-concept prototypes. The goal of the paper is to test the hypothesis that VR can have mass-market appeal. To that end, the author proposes HoloSketch, a 3D sketching system that can create and edit shapes in the air. Some previous "direct interaction" 3D construction systems resort to a 2D mouse for input, which only takes advantage of two of the six variables of control available to the hand (xyz position and three axes of rotation). Systems that do use full six-axis input devices are restricted to operating in relative mode (i.e., not at the visual site of object creation) or suffer from the limited visual resolution of head-mounted displays (HMDs). In addition, all of those systems built their VR objects with 2D systems or text editors.

HoloSketch claims to resolve the above problems. Although designed to work with multiple, different VR environments, the display of focus in the paper is a "fish-tank stereo" display with an augmented 3D mouse shaped into a six-axis wand, or "one-fingered data glove." Furthermore, the display system achieves highly accurate absolute (i.e., closely matching the real world) measurements using relatively fat CRTs. The design philosophy for HoloSketch was to extend 2D sketching functionality into 3D. Some problems encountered when shifting to 3D were increased screen real-estate cost, the 3D consequences of Fitts' law, and rendering-resource limitations. Their solutions included shifting all main-menu controls to buttons on the wand, which then display the menu as a 3D pie.

HoloSketch was intended as a general-purpose 3D sketching and simple animation system. To test it, they employed a computer/traditional artist for a month. Positive comments on the system included the immediacy of the 3D environment and increased productivity over traditional methods, while negative comments included the learning curve, difficulty making fine adjustments with the wand, and the need for a richer interface to do other types of projects. Limitations acknowledged by the author included less complex imagery due to hardware limits, less robust six-axis tracking, a software package only comparable to simple 2D systems, and a less-than-optimal 3D interface.

Discussion:
This paper was written over a decade ago, and I still haven't really seen any equivalent system in widespread use. The technology is definitely there, especially compared to back then. Next-gen graphics hardware is more than capable of rendering this type of environment, and input devices can also achieve this type of functionality, as can be seen from the Wiimote. This may partly be attributed to the fact that this type of interface isn't being embraced as anything more than a novelty. For what the system is, it's an interesting concept that does a reasonable job of making sketching appear to work seamlessly in the third dimension. Since the technology is possible, I would have preferred seeing more focus on addressing Fitts' law in a 3D environment instead of demonstrating unique input actions. Besides that, the techniques used to implement the various actions in the system were quite informative.

[05] An Architecture for Gesture-Based Control of Mobile Robots (Iba – 1999)

Current Mood: uber-humored

Blogs I Commented On:


Rant:
At first, I didn’t think it was worth the effort to use a haptics approach for controlling mobile robots compared to using a conventional remote controller, especially for a single robot. Then I thought about it for a moment and imagined it from a different perspective. Suppose I had control of an army of obedient and competent toddlers. If I had to use a mechanism to control them, would it be more practical to use gesture-based hand-motion controls or a conventional controller? It’s a silly question anyway. Where could I possibly find enough diapers to supply such an army?

Summary:
The idea in this paper is to transfer the burden of programming manipulator systems from robot experts to task experts, who have extensive task knowledge but limited robotics knowledge. The goal is to enable new users to interact with robots through an intuitive interface that can interpret sometimes vague user specifications. The challenge is in having robots interpret user intent instead of simply mimicking the user. Their approach is to use hand gestures, captured with a data glove and position sensing, as input to mobile robots, allowing for richer interaction. The system is currently limited by its cable connections, which may be overcome by technological advances in wearable computing systems.

The system is composed of a data glove, a position sensor, and a geolocation system that tracks the position and orientation of the mobile robot. The data glove and position sensor feed a gesture recognition module that spots and interprets gestures. Waving in a direction moves the robot (local control mode), while pointing at a desired location emulates a ‘point-and-go’ command (global control mode). Gestures themselves are recognized with an HMM-based recognizer. First, the data glove’s joint angle measurements are preprocessed to improve the speed and performance of gesture classification, and then gesture spotting classifies the data as one of six gestures (OPENING, OPENED, CLOSING, POINT, WAVING LEFT, WAVING RIGHT) or as “none of the above” (to prevent inadvertent robot actions).

The HMM algorithm differs from Rabiner and Juang’s standard forward-backward technique in two ways. First, an observation sequence is limited to the n most recent observations, since the probability of an ever-growing observation sequence given the HMM decreases toward zero. Second, a “wait state” is created that transitions with equal probability to itself and to the first node of every gesture model. The reasoning is that if an observation sequence corresponds to some gesture, the final state of that gesture’s model will have the highest probability; otherwise the model stays trapped in the “wait state” while subsequent observations raise the correct gesture’s probability. This is in response to the baseline model, where one of the six gestures is always selected unless a threshold is high enough to reject them all, which the authors find unacceptable since such a threshold may also exclude gestures performed slightly differently from the training gestures.
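To make the spotting step concrete, here is a small Python sketch of sliding-window forward scoring with a reject option. The window size, threshold, and the two toy models are my own illustrative choices, and the paper's actual wait-state construction is not reproduced here.

```python
import numpy as np

def forward_log_prob(obs, A, B, pi):
    """Scaled forward pass for a discrete HMM; returns log P(obs | model)."""
    alpha = pi * B[:, obs[0]]
    log_p = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]     # propagate, then weight by emission
        c = alpha.sum()
        log_p += np.log(c)
        alpha /= c                        # rescale to avoid underflow
    return log_p

def spot_gesture(recent_obs, models, n=10, log_threshold=-25.0):
    """Score only the n most recent symbols against each gesture HMM and
    return the best label, or 'none of the above' if every score is too low."""
    window = recent_obs[-n:]
    scores = {name: forward_log_prob(window, *params)
              for name, params in models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > log_threshold else "none of the above"

# toy usage: two 2-state models over a 3-symbol alphabet
A = np.array([[0.7, 0.3], [0.0, 1.0]])
pi = np.array([1.0, 0.0])
models = {
    "wave_left":  (A, np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]), pi),
    "wave_right": (A, np.array([[0.1, 0.1, 0.8], [0.1, 0.8, 0.1]]), pi),
}
print(spot_gesture([0, 0, 1, 1, 1], models, n=5))
```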

Discussion:
The idea of controlling a group of mobile robots (or an army of toddlers, your choice) using a haptics approach seems more feasible than current methods, since hand commands feel like a natural thing to do. Therefore, I think the paper’s work toward that goal has a lot of merit. In regard to the execution, I found their modified HMM-based implementation quite intriguing. The first modification, limiting the observation sequence to only the n most recent observations, feels like it should have been in the original standard forward-backward technique. I still haven’t come up with a reason why an unrestricted observation sequence would be beneficial.

Their second modification, employing a "wait state," was also a novel way to recognize gestures performed slightly differently from the training ones. The authors reasoned that observation sequences which didn’t correspond to any gesture would be held in the "wait state" until more observations raised the correct gesture’s probability. There’s one part about this second modification that confuses me, though. Suppose that during the execution of a gesture, the user makes a posture different enough from the training data to confuse the system. According to this second modification, the observations up to that posture would leave the gesture in the "wait state" until later observations belonging to that gesture raised the intended gesture’s probability. But what happens if the user finished the gesture while the model was still in the "wait state"? Future observations would then come from the next gesture being executed. From what I read, I would imagine the probability of the previous gesture would never increase, so the system would either never classify the previous gesture (if the n value for the number of recent observations is low enough) or classify the current gesture incorrectly (if the n value is high enough).

[04] An Introduction to Hidden Markov Models (Rabiner & Juang – 1986)

Current Mood: chilly

Blogs I Commented On:


Rant:
I was always psyched going to McDonald's as a kid, thinking it would be so awesome to eat there everyday. Now I have fulfilled that silly dream, but it's just not as cool anymore since that's the place where I do most of my studying. At least I can take comfort in epic unlimited drink refills to keep me hydrated. Beat that, library!

Speaking of McDonald's, have any of you swung by the place during lunch time on the weekdays? It sometimes feels like I'm back in Asia, with the amount of Chinese and Korean I hear around the place at that time. McDonald's on University Drive: the Little Asia of College Station.

Summary:
Hidden Markov Models (HMMs) are defined as a doubly stochastic process with an underlying stochastic process that is not observable, but can only be observed through another set of stochastic processes that produce the sequence of observed symbols. Elements of an HMM consist of the following:
1. There is a finite number, say N, of states in the model.
2. At each clock time, t, a new state is entered based upon a transition probability distribution, which depends on the previous state.
3. After each transition is made, an observation symbol is produced according to a probability distribution which depends on the current state.

The “Urn and Ball” model illustrates a concrete example of an HMM in action. In this model, there are:
* N urns, each filled with a large number of colored balls
* M possible colors for each ball
* an observation sequence, generated by:
> choose one of the N urns (according to the initial probability distribution)
> select a ball from that urn
> record the ball’s color
> choose a new urn according to the transition probability distribution associated with the current urn

A formal notation of a discrete observation HMM consists of the following:
* T = length of observation sequence (total number of clock times)
* N = number of states (urns) in the model
* M = number of observation symbols (colors)
* Q = {q_1, q_2, … , q_N}, states (urns)
* V = {v_1, v_2, … , v_M}, discrete set of possible symbol observations (colors)
* A = {a_i,j}, a_i,j = Pr(q_j at t+1 | q_i at t), state transition probability distribution
* B = {b_j(k)}, b_j(k) = Pr(v_k at t | q_j at t), observation symbol probability distribution in state j
* pi = {pi_i}, pi_i = Pr(q_i at t=1), initial state distribution

For an observation sequence, O = O_1 O_2 … O_T, the generation procedure (sketched in code below) is:
1. Choose an initial state, i_1, according to the initial state distribution, pi, and set t=1.
2. Choose O_t according to b_i_t(k), the symbol probability distribution in state i_t.
3. Choose i_t+1 according to {a_i_t,i_t+1}, the state transition probability distribution for state i_t.
4. Set t=t+1; return to step 2 if t < T; otherwise, terminate the procedure.
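As a sanity check, here is a small Python sampler for this urn-and-ball process; the toy A, B, and pi values are made up for illustration.

```python
import numpy as np

def sample_observations(A, B, pi, T, seed=0):
    """Generate an observation sequence O_1..O_T from a discrete HMM:
    pick an initial urn (state) from pi, draw a ball color from that urn's
    distribution B, then move to the next urn according to A, and repeat."""
    rng = np.random.default_rng(seed)
    N, M = B.shape
    state = rng.choice(N, p=pi)
    observations = []
    for _ in range(T):
        observations.append(int(rng.choice(M, p=B[state])))
        state = rng.choice(N, p=A[state])
    return observations

# toy model: N=2 urns, M=3 colors
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.5, 0.5])
print(sample_observations(A, B, pi, T=10))
```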

HMMs are represented by the symbol lambda = (A, B, pi), and specified as a choice of the number of states N, the number of discrete symbols M, and the specification of A, B, and pi. Three problems for HMMs are:
1. Evaluation Problem - Given a model and a sequence of observations, how do we “score” or evaluate the model?
2. Estimation Problem – How do we uncover the hidden part of the model (i.e., the state sequence)?
3. Training Problem – How do we optimize the model parameters to best describe how an observed sequence came about?

Discussion:
This is a somewhat decent paper to suggest for introducing HMMs. I really liked the “Urn and Ball” example as a simple and concrete way to describe the HMM structure. On the other hand, the coin examples used to illustrate HMM execution could use some more clarification. I would focus on the first half of the paper to get a feel for HMMs, and on the second half to get a good understanding of how to begin implementing one. That’s assuming anyone can even make out what it says…

[03] American Sign Language Finger Spelling Recognition System (Allen, et al – 2003)

26 January 2008

Current Mood: sinister

Blogs I Commented On:


Rant:
Oh man, it's going to take me awhile to get used to reading two papers per class instead of one. I must hone my gesture-fu skills. *executes a high-pitched Bruce Lee scream*

Summary:
Sign language is a form of communication used by the deaf and deaf-blind community, but most people are not familiar with it, and sign language interpreters are expensive and used only in formal settings. The goal is to make it possible to develop wearable technology that will recognize sign language, translate it into printed and spoken English, and transform that English into an animated and tactile language. To do so, the authors created a recognition system that classifies finger-spelled letters based on subunits of hand shape, using a neural network for pattern classification. The classification technique involves matching the descriptive parameters of sign language dictionaries, which differs from conventional techniques that must recognize the entire sign. Their device is a CyberGlove, which contains 18 sensors and measures finger and wrist position/movement.

In developing the recognition system, the authors created an initial LabVIEW program to collect glove data and save it to a file. The data was then loaded into Matlab to train the neural network. A second Matlab program related the glove data to the most similar American Sign Language (ASL) letter it had been trained to recognize. Once the system was capable of differentiating the letters, a second LabVIEW program translated the classifications into spoken form by outputting the corresponding English sounds. The final step integrated the entire system together.

The main recognition system used a perceptron network, since it produced the best results of all network types tested. The input matrix represented the 18 glove values across the 24 ASL letters (2 letters were omitted since they also require arm movement), and the output matrix was 24x24. A trained network took in an 18x1 matrix of sensor values and output a 24x1 matrix, where a 1 in the output denoted the index of the recognized ASL letter. Results reached up to 90% accuracy on data from the one person the network was trained on, with the authors expecting that more data would generate better results.
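For a rough idea of what such a classifier looks like, here is a toy single-layer perceptron in Python with 18 inputs and 24 winner-take-all outputs. The data is randomly generated stand-in data, not CyberGlove recordings, and this is not the authors' LabVIEW/Matlab implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# stand-in for the 18-sensor glove data: 24 "letters", a few noisy
# samples each (random prototypes, NOT real CyberGlove recordings)
prototypes = rng.random((24, 18))
X = np.vstack([p + 0.05 * rng.standard_normal((10, 18)) for p in prototypes])
y = np.repeat(np.arange(24), 10)

# single-layer perceptron: 18 inputs -> 24 output units, one per letter
W = np.zeros((24, 18))
b = np.zeros(24)
for _ in range(50):                       # a few epochs of the perceptron rule
    for xi, yi in zip(X, y):
        pred = np.argmax(W @ xi + b)      # winner-take-all output unit
        if pred != yi:                    # update weights only on mistakes
            W[yi] += xi; b[yi] += 1.0
            W[pred] -= xi; b[pred] -= 1.0

accuracy = np.mean([np.argmax(W @ xi + b) == yi for xi, yi in zip(X, y)])
print(f"training accuracy on the toy data: {accuracy:.2f}")
```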

Discussion:
I have never believed in the idea of researching technology merely for the sake of technology; technology should have beneficial applications in order to justify its existence. In the case of haptics research, the authors’ work is an example of that justification. On the other hand, it does seem strange that they classified ASL letters using a linear perceptron network, when most practical systems use more powerful non-linear (e.g., sigmoid) neural networks, especially for a domain as complex as haptics. The authors state that their results reached accuracy of up to 90%. I wonder if those results would have been above 90% had they employed a more powerful network.

[02] Environmental Technology: Making the Real World Virtual (Krueger – 1993)

Current Mood: suave

Blogs I Commented On:


Rant:
We need more 1-2 page-long papers.

Summary:
There are three types of technologies through which computers penetrate our everyday lives: the already-existing portable, the widely touted wearable, and the paper’s focus, the environmental. The last concerns the ultimate computer interface involving the human body and its senses, one which would perceive the users instead of receiving explicit input from them. The author rejects Sutherland’s head-mounted display model for being uncomfortable, and strives instead for a “come as you are” interface.

In a computer-controlled responsive environment, the framework is the space in which people see or hear responses to their actions. The author has worked toward such an environment since his 1970 VIDEOSPACE application, a video projection of computer-graphic images that creates a telecommunication space in which two geographically separated individuals can interact with each other as though they were there in person. Over the following decade and a half, he developed various applications for teleconferencing and cooperative work by geographically remote individuals.

Discussion:
While this paper is more about VR than haptics, it expressed the use of hands for interactivity within the scope of a virtual environment. While such an application is nothing spectacular today, it’s pretty amazing that a functioning application was possible almost forty years ago, especially since the author didn’t have the luxury of modern broadband internet. I don’t think this particular paper makes much of a contribution to our class at a technical level, but it does give us a historical perspective on the potential use of hands as an input tool for interfacing with the computer. Beyond that, I still think the author has a long way to go in convincing the public to prefer this kind of system over what they already use.

[01] Flexible Gesture Recognition for Immersive Virtual Environments (Deller, et al – 2006)

Current Mood: full

Blogs I Commented On:


Rant:
The word "haptics" was first thrown around on the first day of class. I honestly had no idea what it meant, and I don't recall anyone explicitly defining it. As any reasonable college student would do, I looked it up on Wikipedia. According to it, haptics is "the study of touching behavior." Awesome.

Anyway, it's about time Blogger has the same convenient saving functionality that Gmail has. Besides that, I better catch up with these blog posts or else I'll be this semester's CJ.

Summary:
Cheaper and more powerful hardware has made it possible to enhance computer desktops and to create 3D applications in virtual environments. Advantages of virtual environments include allowing immersion and emulating real environments without requiring attention to the interface itself, but current implementations lack suitable interfaces, methods, and paradigms for 3D manipulation, and they are not intuitive. The most natural way to manipulate such environments is with the hands, since hand movements come naturally to people, so one promising approach is to employ a gesture recognition engine for interacting with applications naturally by hand.

Related work focuses on visual capturing and interpretation of gestures. Non-invasive strategies require little or no special equipment but need a strictly controlled environment. Invasive strategies alleviate those problems, but are expensive and require cable attachments for position sensing. The authors want a flexible, portable approach that is fast and powerful enough for reliable recognition of various gestures using a 3D analog to the mouse, so they use a commercial P5 glove to demonstrate that their recognition engine is powerful and flexible enough to work with inexpensive hardware. The glove’s sensors track the flexion of the wearer’s fingers, its base station tracks position and orientation using infrared sensors, and additional buttons give added functionality. The glove provides accurate finger flexion values and dependable position values, but filtering methods must be used to account for unstable orientation information.

The authors aim for a reliable real-time recognition system that can be used on up-to-date PCs. They define gestures as sequences of postures, where a posture is defined by the flexion values of the fingers, the hand orientation, and whether that orientation is relevant. Recognition consists of two parts: the first is data acquisition, which receives data from the glove; the second is the gesture manager, which matches that data against stored postures. A posture must be held for about half a second to prevent misrecognition, and it is recognized if its distance and orientation are within certain thresholds. To demonstrate the system, the authors set up a virtual office environment viewed with a special 3D device and a normal monitor. This environment allows virtual manipulation much like a real office. Users commented on its ease of use once past the initial difficulties of learning the system.
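As a rough sketch of the posture-matching idea (my own simplification, with made-up tolerance values and posture definitions rather than the authors' actual ones):

```python
import numpy as np

def match_posture(flexion, orientation, postures, flex_tol=0.15, angle_tol=30.0):
    """Match a glove reading against stored postures: a posture is a set of
    finger flexion values, a reference orientation, and a flag saying whether
    orientation matters. Returns the first posture within both tolerances."""
    for name, p in postures.items():
        flex_ok = np.linalg.norm(np.asarray(flexion) - p["flexion"]) < flex_tol
        if p["orientation_relevant"]:
            angle = np.degrees(np.arccos(np.clip(
                np.dot(orientation, p["orientation"]), -1.0, 1.0)))
            if flex_ok and angle < angle_tol:
                return name
        elif flex_ok:
            return name
    return None

# toy usage: a "point" posture with index finger extended, orientation ignored
postures = {"point": {"flexion": np.array([0.0, 0.9, 0.9, 0.9, 0.9]),
                      "orientation": np.array([0.0, 0.0, 1.0]),
                      "orientation_relevant": False}}
print(match_posture([0.05, 0.85, 0.9, 0.92, 0.88], [0.0, 1.0, 0.0], postures))
```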

Discussion:
This is a great overview paper to introduce people to the world of haptics. For what it is, readers can see how affordable commercial equipment can be adapted to accomplish natural interaction in a virtual environment. Implementing this at a mainstream level is still just a theory, though. Without any numbers to justify how accurate the system is beyond a limited prototype demonstration, it’s hard to know how well their gesture recognition system scales to wider environments or performs with a broader audience. If the numbers can justify the potential of their system, it’s simply a matter of creating killer applications that can bring this type of technology into the mainstream.

The "About Me" Post

15 January 2008

I'm going to assume the first post of our blog will be an "About Me" post, so here's a modified version I used from the SR blog.


Hi, I'm Paul Taele. I got my Bachelors in Computer Sciences and Mathematics at the University of Texas at Austin (yes, I'm a Longhorn). I also studied Chinese for three semesters at National Chengchi University (Taipei, Taiwan).

Year:
Masters in Comp Sci, 1st Year

E-mail:
ptaele [at] gmail [dot] com

Academic Interests:
Neural Networks and Multi-Agents at the moment.

Relevant Experience:
I double majored in CS and math at UT. I primarily programmed in Java while there, since that's all that was taught. I took a variety of AI courses at the time: AI, Neural Networks, Robotics, Cognitive Science, Independent Research (on the CS/EE-side of neural networks), and Independent Study (on the math-side of neural networks). That was fun.

Why I'm taking this class?
I really liked what I learned in Sketch Recognition, both the research area and the class format. I'm taking this Gesture Recognition course expecting to learn another interesting research area that retains the same class format.

What do I hope to gain?
If I wanted to come to grad school just to learn a technical field, I wouldn't have bothered coming in the first place. I want to learn interesting topics that pertain to my major, and that is what I hope to gain from this class.

What do I think I will be doing in 5 years?
Ideally, I'd like to be at that stage where I'm finishing my PhD. If I end up working in I.T. instead, then the terrorists have already won.

What do I think I will be doing in 10 years?
I would like to work for a research group on something that I learned in grad school by then. It'd be hilarious if I'm still in school, but Josh Peschel has taught me that anything's possible. I was never good at predicting things ten years from now, though. Ten years ago, I thought I'd be working today. Silly me.

What are my non-academic interests?
I'm a big fan of East Asian movies, TV shows, and music. This is a consequence of studying Asian languages for several years, I guess.

My funny story:
I didn't plan on being a CS major during undergrad (I was originally a business major at the University of Southern California). When I went to UT afterwards, though, I decided to do pre-CS since all my friends were doing it (at UT, students have to apply as a CS major, typically in their third year). Out of my friends, I was the only one who got accepted into the CS program. I made new friends and decided to double major in math as well, since they were all CS and math majors. Turns out I was the only one of my friends who didn't drop math as a major. I made some more friends, and we all vowed to go to grad school after we graduated. Well...yeah, I'm the only one of them who went immediately into grad school. Wait a minute, that's not a funny story at all. That's just plain sad...

Random Stuff #1 - Why is my blog no longer pink?
There's two girls in the class now. I no longer have a monopoly on pink, hence the more dude colors.

Random Stuff #2 - Why am I doing grad school at A&M?
Haha, I sometimes wonder what I'm doing here after doing undergrad at UT. When I was thinking of doing grad school, my professors told me to go somewhere else for CS grad to gain a different perspective. I focused on A&M and UT Dallas because they were both in-state schools with decent CS programs, even though it turned out my profs originally wanted me to go out-of-state. In the end, I chose A&M for two reasons: 1) they gave me more money, and 2) Dr. Choe, one of the professors I would like to have on my advising committee, also went to UT (in fact, my prof for AI and neural nets and my prof for robotics at UT were his advisers when he was working on his PhD). Anyway, my friends gave me a new nickname: Traitor.

Random Stuff #3 - What do I think of College Station/Bryan?
It sucks.