The requirements of this project are as follows:
- Visual Studio 2010* (any edition should work, even Express)
- .NET 4.0 (this should be installed with VS 2010, if not already installed)
- A Kinect (duh!)
- Your system should have:
  - A dual-core 2.66 GHz or better processor
  - 2 GB RAM
  - Windows 7
  - Graphics card which supports DirectX 9.0c
For the speech portion of the project:
- The following Microsoft Speech-related libraries are needed for speech recognition. Make sure you get the x86 version of each library. This is because the Kinect SDK is built in x86 mode.
  - Speech Platform Runtime (v10.2) x86
  - Speech Platform SDK (v10.2)
  - Kinect English Language Pack (direct download)
There are a couple of things you should know about the SDK. As I mentioned previously, it is in beta, so don't be surprised if there are bugs! The next thing is that the license governing the SDK provides for non-commercial use. I'm not going to cover the license in depth here, but if you use the SDK to create your own projects, make sure you read the license thoroughly and understand what you are agreeing to. I am in no way legally inclined and cannot offer advice as to acceptable use of the SDK.
My goals in this project were simple: become familiar with the API. Many of the samples that come with the SDK are written to take advantage of WPF. I haven't had much experience with that technology (yet), and so I was compelled to create a Forms application that could utilize this API. I played around with the Skeletal Tracking capabilities and I also dabbled in Speech Recognition. Let's first examine Skeletal Tracking, found under the Microsoft.Research.Kinect
Note: I have attached zip files for the demo project and these files can be found near the bottom of the article. There is a C# version as well as a VB.NET version.
Skeletal Tracking
I must confess: this blew my mind once I got it working, which didn't take long once I followed the sample project. I kept my project rather simple--rather than draw the traditional skeleton, as demonstrated in the SDK's sample project, I drew only dots to represent the joints. We'll call it a poor man's motion-capture studio. In order to play with Skeletal Tracking, you'll need to understand some of the classes which fall under this feature's namespace.
Runtime
This essentially is the Kinect, well the visual portion anyway. The Runtime class gives you access to visual field sensors of the device. This encompasses the color, depth, and skeletal information. There is a handy Initialize method which you can use to specify the data you would like to collect. For my project, I only needed to initialize the device using the RuntimeOptions.UseSkeleta
Checking for the InvalidOperationException
SkeletonFrame
Cameras, even video cameras, capture images as frames. For something tangible, think of each frame as a Polaroid (if you're old enough to remember what those are). In the case of the Kinect, a frame is a single "image" as captured by one of the sensors. The skeletal tracking system has its own notion of frames as well. A frame in this case is the instantaneous position of the collection of Joints that make up the skeleton.
Don't get too caught up on the notion of "single," though. Even though we capture one "image" at a time, a skeletal image may actually contain two skeletons! Why? Well, the Kinect was designed for multiplayer capabilities (as in simultaneous users, not just Internet-ready), and so it has the capability of capturing two simultaneous skeletons via its sensor. For my project, I only focused on one skeleton (I just couldn't bring myself to share!).
The SkeletonFrameReady event has an event arguments parameter of type SkeletonFrameReadyEventAr
SkeletonData
As I mentioned, SkeletonData stores the points recognized by the sensor. This class also houses a few other useful members, such as Position, TrackingState, and UserIndex. For this project, I focused on TrackingState and Joints, as demonstrated in the SDK sample. These two members allowed me to project myself onto my form. The first member, TrackingState, refers to whether or not the Joint is being tracked. I honestly can't describe what this implies, as I would think all Joints would be tracked. The documentation is a bit thin on this. The second member, Joints, represents the collection of all points detected by the sensor. Which points are detectable? There are 20 of them, in fact. They consist of:
- Left ankle
- Right ankle
- Left elbow
- Right elbow
- Left foot
- Right foot
- Left hand
- Right hand
- Head
- Center, between hips
- Left hip
- Right hip
- Left knee
- Right knee
- Center, between shoulders
- Left shoulder
- Right shoulder
- Spine
- Left wrist
- Right wrist
There is an enumeration named JointID that has an entry for each joint mentioned above. A twenty-first entry, JointID.Count, does not correspond to a joint; rather, its value represents the number of joints defined by JointID, which according to the documentation is useful for looping through the Joints collection (or presumably other collections).
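To make the Count idea concrete, here is a minimal sketch of looping with a sentinel entry. It does not reference the SDK at all: DemoJointId is a hypothetical stand-in enumeration I made up for illustration, and its member names are my own assumption rather than something taken from the SDK documentation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the SDK's JointID enumeration.
// The member names are illustrative only (assumption, not from the SDK docs).
public enum DemoJointId
{
    HipCenter, Spine, ShoulderCenter, Head,
    ShoulderLeft, ElbowLeft, WristLeft, HandLeft,
    ShoulderRight, ElbowRight, WristRight, HandRight,
    HipLeft, KneeLeft, AnkleLeft, FootLeft,
    HipRight, KneeRight, AnkleRight, FootRight,
    Count // sentinel: its value equals the number of joints declared above
}

public static class JointLoop
{
    // Visits every joint by index and stops before the Count sentinel.
    public static List<string> AllJointNames()
    {
        var names = new List<string>();
        for (int i = 0; i < (int)DemoJointId.Count; i++)
        {
            names.Add(((DemoJointId)i).ToString());
        }
        return names;
    }
}
```

Because Count is declared last, its numeric value is always the number of real entries, so the loop bound stays correct if entries are added before it.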
So, given all these data structures, how the heck do you make that darn black alien hot dog do something cool? Let's see, shall we?
Would You Care to Dance?
Working inside of the SkeletonFrameReady handler, we can loop through the skeletons detected in the frame, and for each skeleton, we can translate the Joint point to a screen point. I kept a class-level queue of Point structures for later painting. I used a queue because it is much easier to work with than an array is. Here is what the translating looks like:
Here's what's going on above. I loop through the skeletons in the frame (line 5). For each skeleton, I check that the SkeletonData is in a state of being tracked (line 7). If it is, I proceed to convert each Joint to a Point (lines 9 - 19). The SkeletonEngine class provides a couple of useful methods. One of the methods, DepthImageToSkeleton is used to return a value between 0 and 1 (I assume), and that value can be calculated against the client area of a form or canvas for later painting. Notice in lines 15 - 17, I call DepthImageToSkeleton and then multiply each of its out parameters against its respective dimension with regard to the ClientRectangle of the form. This essentially translates the image from "camera space" to "application space." The math for converting these values was extracted from the sample project.
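The scaling step can be sketched on its own, without the SDK. This is only an illustration under my assumption that the conversion hands back values between 0 and 1; ScreenMapper and ToScreen are hypothetical names, and the clamping is my addition rather than something taken from the sample project.

```csharp
using System;
using System.Drawing;

public static class ScreenMapper
{
    // Scales a normalized (0..1) coordinate pair to pixel coordinates
    // inside a client area of the given size.
    public static Point ToScreen(float normalizedX, float normalizedY, Size clientSize)
    {
        // Clamp first so a joint that drifts out of view stays on the form's edge.
        float x = Math.Max(0f, Math.Min(1f, normalizedX));
        float y = Math.Max(0f, Math.Min(1f, normalizedY));

        return new Point(
            (int)Math.Round((double)x * clientSize.Width),
            (int)Math.Round((double)y * clientSize.Height));
    }
}
```

In a form, the size argument would be ClientRectangle.Size, so a joint reported at (0.5, 0.5) lands in the middle of the client area.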
You will notice that there is no actual drawing here. I am merely translating the points and storing them to an array. The drawing occurs when I override the Paint method of the form. What, you don't believe me? See for yourself:
Here I just loop through the Point array and draw the points as 11-pixel-diameter circles. To make sure this code is called appropriately, notice the call to Invalidate as the last thing the SkeletonFrameReady handler does. The combination of these groups of logic is what brings the app to life:
Click Click Boom!
You probably noticed a bit of code in the SkeletonFrameReady handler that I didn't discuss earlier. Well, I didn't want to just settle for being a "dancing queen," so I decided to implement the ability to click a button with my hand. I'll warn you now, it's not as extravagant as I would like (I'd rather actually push to indicate a button press rather than simply hovering over it). I believe I would need to incorporate depth tracking to make the demo "pop" more, but for now, I'll hover.
In lines 21 - 38, I implement logic to check if the point representing my right hand has stayed in generally the same spot while a timer ticks down. I put a threshold of 15 in either direction as my algorithm for detecting "hovering." If I breach the threshold, then I stop my timer. If I'm within my threshold, and my timer is not running, then I start it. For this project, my timer's interval is 3 seconds. I also set the position of the cursor to follow the point representing my right hand.
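The hover check can be sketched without a form or a Kinect. Note that this is a variation, not the code from the demo project: it compares timestamps instead of starting and stopping a Forms Timer, and it measures movement against the point where the hover began. HoverDetector and its members are hypothetical names of my own.

```csharp
using System;
using System.Drawing;

public sealed class HoverDetector
{
    private readonly int _threshold;   // allowed drift, in pixels, in either direction
    private readonly TimeSpan _dwell;  // how long the hand must stay put
    private Point _anchor;             // where the current hover started
    private DateTime _startedAt;
    private bool _running;

    public HoverDetector(int thresholdPixels, TimeSpan dwell)
    {
        _threshold = thresholdPixels;
        _dwell = dwell;
    }

    // Feed one hand position per frame. Returns true once when the hand
    // has stayed within the threshold for the whole dwell time.
    public bool Update(Point current, DateTime now)
    {
        bool withinThreshold = _running
            && Math.Abs(current.X - _anchor.X) <= _threshold
            && Math.Abs(current.Y - _anchor.Y) <= _threshold;

        if (!withinThreshold)
        {
            // First sample, or the hand moved too far: restart the countdown here.
            _anchor = current;
            _startedAt = now;
            _running = true;
            return false;
        }

        if (now - _startedAt >= _dwell)
        {
            _running = false; // fire once, then require a fresh hover
            return true;
        }

        return false;
    }
}
```

With the values from this article, the detector would be created as new HoverDetector(15, TimeSpan.FromSeconds(3)) and fed the right-hand point on every frame; a true result is the cue to simulate the click.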
"But where do you 'click?'" Excellent question. Right here:
It's just a handler for my timer's Tick event. I opted for the Win API for simulating the mouse click. I couldn't find anything in the framework that would offer this. (Yes, I could do a Button.PerformClick, and I actually did when I first started, but I wanted to be able to click the "OK" button on the resulting message box. I did not want to create a custom form and do Button.PerformClick there also just for this purpose.) For those of you unfamiliar with the mouse_event Win API function, it is imported thusly:
Of course add a using System.Runtime.InteropServ
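For readers who have not used P/Invoke before, a typical declaration of mouse_event looks roughly like the sketch below. This is not necessarily identical to the declaration in the demo project; MouseClicker and LeftClick are hypothetical names, while the function signature and the two flag values follow the Win32 definition. It only works on Windows.

```csharp
using System;
using System.Runtime.InteropServices;

public static class MouseClicker
{
    // Flag values defined by the Win32 API for mouse_event.
    public const uint MOUSEEVENTF_LEFTDOWN = 0x0002;
    public const uint MOUSEEVENTF_LEFTUP = 0x0004;

    // Imported from user32.dll; callable on Windows only.
    [DllImport("user32.dll")]
    private static extern void mouse_event(
        uint dwFlags, uint dx, uint dy, uint dwData, UIntPtr dwExtraInfo);

    // Simulates a left click at the current cursor position:
    // a button press followed immediately by a release.
    public static void LeftClick()
    {
        mouse_event(MOUSEEVENTF_LEFTDOWN, 0, 0, 0, UIntPtr.Zero);
        mouse_event(MOUSEEVENTF_LEFTUP, 0, 0, 0, UIntPtr.Zero);
    }
}
```

A timer's Tick handler could then stop the timer and call MouseClicker.LeftClick(), since the cursor is already following the hand.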
Speech Recognition
As I mentioned earlier, I wasn't content with just dancing around my form (although it was thrilling at the time). I decided to experiment with speech recognition. This really isn't a Kinect feature; rather, you use the Kinect's microphone to receive the audio and this data is then forwarded to the Microsoft Speech API. If you attempt this part of the project, make sure you grab the libraries listed in the requirements at the beginning of the article. Pay special attention to the note regarding the x86 versions of the libraries--this is important.
KinectAudioSource
The Microsoft.Research.Kinect
Note: much of the Speech API is not documented very well. I will do my best to describe what I interpret the following classes to do, based on the samples I've looked at and the API documentation (or lack thereof). I will try to keep an eye on the documentation for the Speech API, and if it becomes up-to-date, I will update this article accordingly.
RecognizerInfo
This class contains data about a speech recognizer installed on your system. The member of this class you will care about is the ID property, which identifies a speech recognizer. At the time of this project, I believe I read that only US English was supported by the Kinect SDK; perhaps this will change with more interest in the project or an official SDK release. Also at the time of this project, the only supported recognizer is identified by the id "SR_MS_en-US_Kinect_10.0".
SpeechRecognitionEngine
This class represents the actual recognizer installed on your system and you initialize an instance of it by passing in the ID of the recognizer as found in an instance of a RecognizerInfo object. It exposes a few events which you can use to take action when a piece of speech is identified ( SpeechRecognized ) or rejected ( SpeechRecognitionRejected
GrammarBuilder
I'm going to have to surmise that this is something like a StringBuilder, but for a grammar. In looking at the samples, a GrammarBuilder takes in a list of "choices" and factors in what culture those choices are described as. My guess is that it generates what the "choice" would sound like in the specified culture. That is only a guess.
Choices
The "choices" the GrammarBuilder uses to generate a word's sound are basically a list of words, each its own string added to the Choices object.
Grammar
The Grammar uses the GrammarBuilder to generate the sound representations, and subsequently stores those representations internally. Once these representations are created, the Grammar can be loaded into the instance of the SpeechRecognitionEngine.
Enough gory details. Now for the fun!
Speak and Ye Shall Receive
My example is going to differ a bit from the SDK examples, as I used a Forms app (the same one as the NUI demo above, in fact) for my experimentation. The only real difference is that the SDK example used a Console application and did everything inside the Main method, and thus had a local scope on everything. For my demo, I used a form-level variable to track my SpeechRecognitionEngine instance, in addition to a couple of other instance members. One very important thing to take note of if you do a Forms app: you must use the MTAThread attribute in order to prevent a nasty exception from cropping up. The API itself apparently has some threading going on, and you need this attribute to account for this.
Wait no longer. Here is the code I used for speech recognition:
...and here's what the heck is going on:
I initialized a new instance of the KinectAudioSource class so that I can interact with the audio device. The next few assignments are taken straight from the example in the SDK. Setting the FeatureMode to true allows modification of some of the device's features. One such feature is "Automatic Gain Control," which according to the comments in the sample, needs to be off for speech recognition. The assignment of OptibeamArrayOnly means that "Audio Echo Cancellation" is not being used (the other option is OptibeamArrayAndAec, i.e. with "AEC"). The other group of options under the SystemMode refers to "SingleChannel," so I assume that "OptiBeam" refers to stereo, but don't quote me on that. Once I initialize the KinectAudioSource instance with these settings, initialization of the recognizer can begin.
I started by searching the installed recognizers on the system. As previously mentioned, at the time of this writing only the recognizer identified as "SR_MS_en-US_Kinect_10.0" is supported by the Kinect SDK. If that recognizer is not installed, the call to FirstOrDefault returns null instead of raising an exception, and the subsequent check for null handles that case gracefully. If the required recognizer is found, then a new instance of the SpeechRecognitionEngine can be initialized, passing in the ID of the recognizer that was just found. Then a Choices object and a GrammarBuilder object are created and initialized. To the Choices object, I added the words I'd like to recognize. I also assign the culture specified by the RecognizerInfo object to the GrammarBuilder object. Then a new Grammar object is created, and I pass it the GrammarBuilder object. Subsequently, this Grammar object is loaded into the recognizer. I then add an event handler to the SpeechRecognized event of the recognizer. This handler will take care of displaying the words which were spoken and recognized. Only a few steps remain before conversing with the Kinect can begin.
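The lookup-then-null-check pattern can be shown in isolation. The sketch below deliberately avoids the Speech libraries so it runs anywhere: RecognizerEntry is a hypothetical stand-in for RecognizerInfo, and RecognizerLookup is a name of my own. Only the recognizer ID string comes from this article.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for RecognizerInfo, holding only what the lookup needs.
public sealed class RecognizerEntry
{
    public string Id { get; private set; }

    public RecognizerEntry(string id)
    {
        Id = id;
    }
}

public static class RecognizerLookup
{
    public const string KinectRecognizerId = "SR_MS_en-US_Kinect_10.0";

    // Returns the matching entry, or null when it is not installed.
    // FirstOrDefault is used (rather than First) so a missing recognizer
    // produces null instead of an exception.
    public static RecognizerEntry FindKinectRecognizer(IEnumerable<RecognizerEntry> installed)
    {
        return installed.FirstOrDefault(r => r.Id == KinectRecognizerId);
    }
}
```

The caller checks the result for null before creating the engine, and can show a friendly message telling the user which language pack to install.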
The last steps for this speech recognition demo start with telling the KinectAudioSource to begin "listening" to me. Thus we call the Start method. This method returns a Stream object, and this object should be captured as a reference so that it can be passed to the recognizer. This Stream needs to be properly closed when the application finishes, so this is another reason for maintaining the reference. The next call to the recognizer's SetInputToAudioStream method takes in the stream just created and also sets up some sampling information for acquiring audio data. I copied this from the sample directly because I am not terribly familiar with the aspects of audio capturing. You can experiment as you see fit. Once this point is reached, all that remains is to tell the recognizer to start recognizing, which can be done, conveniently, with a call to the recognizer's RecognizeAsync method. I used the "async" version of the Recognize family of calls so that my constructor would return and not block. You can use the synchronous (blocking) methods if you like, but make sure you do it properly (i.e. don't block where blocking wouldn't make sense, like a constructor).
I maintained references to the following for easy cleanup once the form was told to close: KinectAudioSource, SpeechRecognitionEngine,
So how does it work? Let's put it all together:
Wrapping it Up With an Eye to the Future
And so ends my first dabble into the world that is Kinect. I have to say that I had a BLAST doing this project, even given how trivial it is. My hope with this article is to pique the interest of all from novice to expert. This project was developed in the course of one day. The possibilities of this API are quite expansive, given the proper amount of time and design. As I play with the device and the API more, I will try to post new articles regarding usage, such as interacting with the video camera. I'm sure as interest in the project continues to escalate, so too will the quality of the API. The tools are out there. Don't be intimidated; give your imagination a workout. Who knows what Kinect-ions you might create = )
Project Source Code
Yes, the project is called "Minority Report." The reason is that I told everyone at my office that I was playing with the Kinect API and that my goal was to give my next demo using the Kinect and have a Minority Report-like interface!
References
- Kinect SDK API Reference (included in SDK download)
- Kinect SDK Samples (included in SDK download)
by: MASQUERAID on 2011-06-22 at 12:17:18 ID: 29008