
Getting Your Application Kinect-ed

kaufmed
For those of you who don't follow the news, or just happen to live under rocks, Microsoft Research released a beta SDK for the Xbox 360 Kinect. If you don't know what a Kinect is, then I will assume you do indeed live under a rock. The Xbox 360 peripheral has wowed gamers since 2010, and now Microsoft has seen fit to release a beta SDK for the device. In this article, I intend to demonstrate my first crack at the API. This article is targeted at anyone interested in developing applications which make use of the Kinect. Novice coders should have no trouble following what I did (since, in the Kinect world, I am a novice myself!).

The requirements of this project are as follows:

Visual Studio 2010* (any edition should work, even Express)
.NET 4.0 (this should be installed with VS 2010, if not already installed)
An XBox Kinect (duh!)**
Your system should have:

    - A dual-core 2.66 GHz or better processor
    - 2 GB RAM
    - Windows 7
    - Graphics card which supports DirectX 9.0c

For the speech portion of the project:

The following Microsoft Speech-related libraries are needed for speech recognition. Make sure you get the x86 version of each library. This is because the Kinect SDK is built in x86 mode.

    - Speech Platform Runtime (v10.2) x86
    - Speech Platform SDK (v10.2)
    - Kinect English Language Pack (direct download)


* - I tried to get this to work with VS 2008, but VS had trouble recognizing the added reference to the Kinect DLL. It may work with VS 2008, but as of this writing I had not figured out how to do so. (But hey, VS 2010 Express is free. Why not upgrade?) = )

** - The beta SDK was designed around the Xbox version of the Kinect, and at the time this article was written there was no Windows Kinect. MS has since released a Windows-specific version of the Kinect, and a corresponding SDK for that device. The Windows Kinect SDK is incompatible with the Xbox Kinect. MS Research did not release (to my knowledge) an updated version of the Xbox Kinect SDK, and the beta SDK discussed in this article is the only choice available to you if you only have an Xbox Kinect.

There are a couple of things you should know about the SDK. As I mentioned previously, it is in beta, so don't be surprised if there are bugs! The next thing is that the license governing the SDK provides for non-commercial use. I'm not going to cover the license in depth here, but if you use the SDK to create your own projects, make sure you read the license thoroughly and understand what you are agreeing to. I am in no way legally inclined and cannot offer advice as to acceptable use of the SDK.

My goals in this project were simple: become familiar with the API. Many of the samples that come with the SDK are written to take advantage of WPF. I haven't had much experience with that technology (yet), and so I was compelled to create a Forms application that could utilize this API. I played around with the Skeletal Tracking capabilities and I also dabbled in Speech Recognition. Let's first examine Skeletal Tracking, found under the Microsoft.Research.Kinect.Nui namespace.


Note: I have attached zip files for the demo project and these files can be found near the bottom of the article. There is a C# version as well as a VB.NET version.

Skeletal Tracking

I must confess: this blew my mind once I got it working, which didn't take long once I followed the sample project. I kept my project rather simple: rather than draw the traditional skeleton, as demonstrated in the SDK's sample project, I drew only dots to represent the joints. We'll call it a poor-man's motion-capture studio. In order to play with Skeletal Tracking, you'll need to understand some of the classes which fall under this feature's namespace.


Runtime

This class is essentially the Kinect (well, the visual portion anyway). The Runtime class gives you access to the visual-field sensors of the device, which encompass the color, depth, and skeletal information. There is a handy Initialize method which you can use to specify the data you would like to collect. For my project, I only needed to initialize the device using the RuntimeOptions.UseSkeletalTracking enumeration value. Here is what that initialization looks like:

nui = new Runtime();

try
{
    nui.Initialize(RuntimeOptions.UseSkeletalTracking);
}
catch (InvalidOperationException)
{
    // The Kinect could not be found or initialized; surface the error to the caller
    throw;
}



Checking for the InvalidOperationException is a very good idea, as it will tell you if the API was unable to find or communicate with your Kinect. The Runtime class exposes a few events that pertain to the arrival of each kind of visual data the device can gather. For this project, the event of importance is SkeletonFrameReady. Adding a handler to this event gives us the opportunity to interact with the Joints calculated by the device (more on these later). This was all I needed to get started with tracking my movements via the Kinect. Wiring up the handler is a single line (a minimal sketch; nui_SkeletonFrameReady is the handler shown in full later in this article):
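// Subscribe to skeleton frames; nui is the Runtime instance initialized above
nui.SkeletonFrameReady += new EventHandler<SkeletonFrameReadyEventArgs>(nui_SkeletonFrameReady);

Now for picturing myself!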


SkeletonFrame

Cameras, even video cameras, capture images as frames, which, when equating to something tangible, could be thought of like a Polaroid (if you're old enough to remember what those are). In the case of the Kinect, a frame is a single "image" as captured by one of the sensors. The skeletal tracking system has its own notion of frames as well. A frame in this case is the instantaneous position of the collection of Joints that make up the skeleton.

Don't get too caught up on the notion of "single," though, as even though we capture one "image" at a time, a skeletal image may actually contain two skeletons! Why? Well, the Kinect was designed for multiplayer capabilities (as in simultaneous users, not just Internet-ready), and so it has the capability of capturing two simultaneous skeletons via its sensor. For my project, I only focused on one skeleton (I just couldn't bring myself to share!).

The SkeletonFrameReady event has an event arguments parameter of type SkeletonFrameReadyEventArgs. This parameter is what transports the data from the API to your application. Its SkeletonFrame member represents the "image" that was captured by the device. This class contains a member named Skeletons, which represents the skeleton(s) found in the image. A skeleton is basically a collection of points (referred to earlier as Joints). This collection of points is stored within the SkeletonData class, which is what the Skeletons collection is composed of.
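Put in code, the chain of objects described above looks roughly like this (a sketch; e is the SkeletonFrameReadyEventArgs handed to the handler):

SkeletonFrame frame = e.SkeletonFrame;          // the captured skeletal "image"

foreach (SkeletonData skel in frame.Skeletons)  // up to two simultaneous skeletons
{
    JointsCollection joints = skel.Joints;      // the points that make up this skeleton
}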


SkeletonData

As I mentioned, SkeletonData stores the points recognized by the sensor. This class also houses a few other useful members, such as Position, TrackingState, and UserIndex. For this project, I focused on TrackingState and Joints, as demonstrated in the SDK sample. These two members allowed me to project myself onto my form. The first member, TrackingState, indicates whether or not the skeleton is actually being tracked; presumably the Skeletons collection can contain entries the sensor is not actively tracking, which is why the check is there. The documentation is a bit thin on this. The second member, Joints, represents the collection of all points detected by the sensor. What points are detectable? There are 20 points, in fact. They consist of:


Left ankle
Right ankle
Left elbow
Right elbow
Left foot
Right foot
Left hand
Right hand
Head
Center, between hips
Left hip
Right hip
Left knee
Right knee
Center, between shoulders
Left shoulder
Right shoulder
Spine
Left wrist
Right wrist


There is an enumeration named JointID that has an entry for each joint mentioned above. A twenty-first entry, JointID.Count, does not correspond to a joint; rather, its value represents the number of joints defined by JointID, which according to the documentation is useful for looping through the Joints collection (or presumably other collections).
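As a quick sketch of how JointID.Count might be used (skel here is assumed to be one of the SkeletonData instances described above):

// Loop over every joint by casting the loop index to a JointID
for (int i = 0; i < (int)JointID.Count; i++)
{
    Joint joint = skel.Joints[(JointID)i];
    // joint.Position holds the sensor's coordinates for this joint
}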

So, given all these data structures, how the heck do you make that darn black alien hot dog do something cool? Let's see, shall we?


Would You Care to Dance?


Working inside the SkeletonFrameReady handler, we can loop through the skeletons detected in the frame, and for each skeleton, we can translate each Joint to a screen point. I kept a class-level queue of Point structures for later painting; I used a queue because it is much easier to work with than an array. Here is what the translation looks like:

void nui_SkeletonFrameReady(object sender, SkeletonFrameReadyEventArgs e)
{
    SkeletonFrame frame = e.SkeletonFrame;

    foreach (SkeletonData skel in frame.Skeletons)
    {
        if (skel.TrackingState == SkeletonTrackingState.Tracked)
        {
            JointsCollection joints = skel.Joints;

            for (int ptIdx = 0; ptIdx < joints.Count; ptIdx++)
            {
                float x, y;
                Point current;

                // Convert the joint's position into normalized (0 - 1) depth-image coordinates
                nui.SkeletonEngine.SkeletonToDepthImage(joints[(JointID)ptIdx].Position, out x, out y);

                // Scale to the form's client area and clamp to its bounds
                x = Math.Max(0, Math.Min(x * this.ClientRectangle.Width, this.ClientRectangle.Width));
                y = Math.Max(0, Math.Min(y * this.ClientRectangle.Height, this.ClientRectangle.Height));

                current = new Point((int)Math.Truncate(x), (int)Math.Truncate(y));

                jointPoints.Enqueue(current);

                if (joints[(JointID)ptIdx].ID == JointID.HandRight)
                {
                    // Move the mouse cursor to follow the right hand (offset by the form's borders)
                    Cursor.Position =
                        new Point(current.X + (this.Left + (this.Width - this.ClientRectangle.Width)),
                                  current.Y + (this.Top + (this.Height - this.ClientRectangle.Height)));

                    // If the hand moved more than 15 pixels, cancel the pending "click";
                    // otherwise start the hover timer if it isn't already running
                    if (Math.Abs(current.X - lastHand.X) > 15 || Math.Abs(current.Y - lastHand.Y) > 15)
                    {
                        lastHand = current;
                        delayToClick.Stop();
                    }
                    else if (!delayToClick.Enabled)
                    {
                        delayToClick.Start();
                    }
                }
            }
        }
    }

    this.Invalidate();
}



Here's what's going on above. I loop through the skeletons in the frame (the foreach loop). For each skeleton, I check that the SkeletonData is in a state of being tracked (the TrackingState check). If it is, I proceed to convert each Joint to a Point inside the inner for loop. The SkeletonEngine class provides a couple of useful conversion methods. The one used here, SkeletonToDepthImage, returns values between 0 and 1 (I assume), and those values can be calculated against the client area of a form or canvas for later painting. Notice that right after the call to SkeletonToDepthImage, I multiply each of its out parameters against its respective dimension of the form's ClientRectangle (and clamp the result to the client area). This essentially translates the point from "camera space" to "application space." The math for converting these values was extracted from the sample project.

You will notice that there is no actual drawing here. I am merely translating the points and storing them in the queue. The drawing occurs when I override the form's OnPaint method. What, you don't believe me? See for yourself:

protected override void OnPaint(PaintEventArgs e)
{
    try
    {
        Graphics g = e.Graphics;

        // Drain the queue of translated joint points and draw a dot for each one
        while (jointPoints.Count > 0)
        {
            Point p = jointPoints.Dequeue();

            g.FillEllipse(Brushes.Black, p.X - 5, p.Y - 5, 11, 11);
        }
    }
    catch (Exception ex)
    {
        System.Diagnostics.Debug.Print(ex.Message);
    }

    base.OnPaint(e);
}



Here I just drain the queue of Points and draw each one as an 11-pixel-diameter circle. To make sure this code is called appropriately, notice the call to Invalidate as the last thing the SkeletonFrameReady handler does. The combination of these groups of logic is what brings the app to life:

 
It's Alive!

Click Click Boom!


You probably noticed a bit of code in the SkeletonFrameReady handler that I didn't discuss earlier. Well, I didn't want to just settle for being a "dancing queen," so I decided to implement the ability to click a button with my hand. I'll warn you now, it's not as extravagant as I would like (I'd rather actually push to indicate a button press than simply hover over it). I believe I would need to incorporate depth tracking to make the demo "pop" more, but for now, I'll hover.

Inside the JointID.HandRight check of the handler above, I implement logic to determine whether the point representing my right hand has stayed in generally the same spot long enough for a timer to fire. I use a threshold of 15 pixels in either direction as my algorithm for detecting "hovering." If I breach the threshold, I stop the timer. If I am within the threshold and the timer is not running, I start it. For this project, the timer's interval is 3 seconds. I also set the position of the cursor to follow the point representing my right hand.
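For reference, here is a sketch of the supporting pieces the handler relies on. The field names match the ones used above; the InitializeClickTimer method and its exact contents are my own arrangement, not lifted verbatim from the demo project:

// Class-level members referenced by the SkeletonFrameReady handler
// (requires using System.Collections.Generic and System.Windows.Forms)
private Queue<Point> jointPoints = new Queue<Point>(); // translated joint points awaiting painting
private Point lastHand = Point.Empty;                  // last known position of the right hand
private Timer delayToClick;                            // System.Windows.Forms.Timer

private void InitializeClickTimer()
{
    delayToClick = new Timer();
    delayToClick.Interval = 3000; // hover for 3 seconds before "clicking"
    delayToClick.Tick += new EventHandler(delayToClick_Tick);
}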

"But where do you 'click?'" Excellent question. Right here:

void delayToClick_Tick(object sender, EventArgs e)
{
    delayToClick.Stop();

    // Simulate a full left click: press, then release
    mouse_event(MOUSEEVENTF_LEFTDOWN, 0, 0, 0, IntPtr.Zero);
    mouse_event(MOUSEEVENTF_LEFTUP, 0, 0, 0, IntPtr.Zero);
}



It's just a handler for my timer's Tick event. I opted for the Win API for simulating the mouse click. I couldn't find anything in the framework that would offer this. (Yes, I could do a Button.PerformClick, and I actually did when I first started, but I wanted to be able to click the "OK" button on the resulting message box. I did not want to create a custom form and do Button.PerformClick there also just for this purpose.) For those of you unfamiliar with the mouse_event Win API function, it is imported thusly:

[DllImport("user32.dll")]
private static extern void mouse_event(
    UInt32 dwFlags,     // motion and click options
    UInt32 dx,          // horizontal position or change
    UInt32 dy,          // vertical position or change
    UInt32 dwData,      // wheel movement
    IntPtr dwExtraInfo  // application-defined information
);
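The MOUSEEVENTF_LEFTDOWN and MOUSEEVENTF_LEFTUP flags used in the Tick handler are the standard Win32 values, which you need to define yourself somewhere in the form:

// Standard Win32 flag values for mouse_event
private const UInt32 MOUSEEVENTF_LEFTDOWN = 0x0002; // left button pressed
private const UInt32 MOUSEEVENTF_LEFTUP   = 0x0004; // left button released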



Of course, add a using System.Runtime.InteropServices; for the DllImport attribute. Importing this function gave me the functionality I sought:

Coming in for a Landing

Click-tastic!

Speech Recognition

As I mentioned earlier, I wasn't content with just dancing around my form (although it was thrilling at the time). I decided to experiment with speech recognition. This really isn't a Kinect feature; rather, you use the Kinect's microphone to receive the audio, and this data is then forwarded to the Microsoft Speech API. If you attempt this part of the project, make sure you grab the libraries listed in the requirements at the beginning of the article. Pay special attention to the note regarding the x86 versions of the libraries; this is important.


KinectAudioSource

The Microsoft.Research.Kinect.Audio namespace is the container for all things Kinect audio. The KinectAudioSource class more or less represents the subsystem which acquires audio data from the Kinect. Declaring an instance of this class gives you an interface for receiving audio data from the device.



Note: much of the Speech API is not documented very well. I will do my best to describe what I interpret the following classes to do, based on the samples I've looked at and the API documentation (or lack thereof). I will try to keep an eye on the documentation for the Speech API, and if it becomes up-to-date, I will update this article accordingly.


RecognizerInfo

This class contains data about a speech recognizer installed on your system. The member of this class you will care about is the Id property, which identifies a speech recognizer. At the time of this project, I believe I read that only US English was supported by the Kinect SDK; perhaps this will change with more interest in the project or an official SDK release. Also at the time of this project, the only supported recognizer is identified by the ID "SR_MS_en-US_Kinect_10.0". You can get this recognizer from the third link under the MS Speech requirements section above. (As a side note, make sure you have VS closed when you install these libraries so they get registered with VS appropriately. I believe the file Microsoft.Speech.dll does not get put into the GAC (Global Assembly Cache), so you have to add the reference manually by browsing to it. On my system, this file was installed to "C:\Program Files (x86)\Microsoft Speech Platform SDK\Assembly\Microsoft.Speech.dll". Adjust your path accordingly.)


SpeechRecognitionEngine

This class represents the actual recognizer installed on your system. You initialize an instance of it by passing in the ID of the recognizer, as found in an instance of a RecognizerInfo object. It exposes a few events which you can use to take action when a piece of speech is identified (SpeechRecognized) or rejected (SpeechRecognitionRejected), or when the engine takes a guess at what you said (SpeechHypothesized), as well as a few other events. My experimentation only dealt with recognized text.
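As a rough sketch of what handling SpeechRecognized can look like (the demo project wires up a handler like this later; the title-bar update below is purely illustrative and not what the demo actually does):

void recognizer_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
{
    // e.Result.Text holds the phrase that matched the loaded Grammar.
    // The event fires on a background thread, so marshal back to the UI
    // thread before touching any controls.
    this.BeginInvoke(new Action(() => this.Text = "Recognized: " + e.Result.Text));
}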


GrammarBuilder

I'm going to have to surmise that this is something like a StringBuilder, but for a grammar. In looking at the samples, a GrammarBuilder takes in a list of "choices" and factors in what culture those choices are described as. My guess is that it generates what the "choice" would sound like in the specified culture. That is only a guess.


Choices

The "choices" the GrammarBuilder uses to generate a word's sound are basically a list of words, each its own string added to the Choices object.


Grammar

The Grammar uses the GrammarBuilder to generate the sound representations, and subsequently stores those representations internally. Once these representations are created, the Grammar can be loaded into the instance of the SpeechRecognitionEngine.


Enough gory details. Now for the fun!


Speak and Ye Shall Receive

My example is going to differ a bit from the SDK examples, as I used a Forms app (the same one as the NUI demo above, in fact) for my experimentation. The only real difference is that the SDK example used a Console application and did everything inside the Main method, and thus had local scope on everything. For my demo, I used some form-level variables to track my SpeechRecognitionEngine instance, in addition to a couple of other instance members. One very important thing to note if you build a Forms app: you must apply the MTAThread attribute to your Main method (in place of the template's default STAThread) in order to prevent a nasty exception from cropping up. The API itself apparently has some threading going on, and you need this attribute to account for it.
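A minimal sketch of what that looks like in Program.cs (Form1 is just a placeholder for whatever your form class is called):

using System;
using System.Windows.Forms;

static class Program
{
    [MTAThread] // the Forms template generates [STAThread] here; swap it for MTAThread
    static void Main()
    {
        Application.EnableVisualStyles();
        Application.SetCompatibleTextRenderingDefault(false);
        Application.Run(new Form1()); // Form1 stands in for the demo form
    }
}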

Wait no longer. Here is the code I used for speech recognition:

private void InitializeAudio()
{
    audio = new KinectAudioSource();
    audio.FeatureMode = true;
    audio.AutomaticGainControl = false; // Per sample documentation, "Important to turn this off for speech recognition"
    audio.SystemMode = SystemMode.OptibeamArrayOnly; // Use the microphone array without acoustic echo cancellation

    InitializeRecognizer();
}

private void InitializeRecognizer()
{
    RecognizerInfo ri = SpeechRecognitionEngine.InstalledRecognizers().Where(r => r.Id == "SR_MS_en-US_Kinect_10.0").FirstOrDefault();

    if (ri == null)
    {
        throw new ApplicationException("Could not locate speech recognizer.");
    }

    recognizer = new SpeechRecognitionEngine(ri.Id);
    GrammarBuilder builder = new GrammarBuilder();
    Choices whatIRecognize = new Choices();

    // The words the engine should listen for
    whatIRecognize.Add("you", "are", "officially", "Kinect-ed");
    builder.Culture = ri.Culture;
    builder.Append(whatIRecognize);

    recognizer.LoadGrammar(new Grammar(builder));
    recognizer.SpeechRecognized += new EventHandler<SpeechRecognizedEventArgs>(recognizer_SpeechRecognized);

    // Start capturing audio from the Kinect and feed the resulting stream to the recognizer
    recogStream = audio.Start();
    recognizer.SetInputToAudioStream(recogStream, new SpeechAudioFormatInfo(EncodingFormat.Pcm, 16000, 16, 1, 32000, 2, null));
    recognizer.RecognizeAsync(RecognizeMode.Multiple);
}



...and here's what the heck is going on:

I initialized a new instance of the KinectAudioSource class so that I can interact with the audio device. The next few assignments are taken straight from the example in the SDK. Setting FeatureMode to true allows modification of some of the device's features. One such feature is "Automatic Gain Control," which, according to the comments in the sample, needs to be off for speech recognition. The assignment of OptibeamArrayOnly means that "Acoustic Echo Cancellation" is not being used (the other option is OptibeamArrayAndAec, i.e. with "AEC"). The other options under SystemMode refer to "SingleChannel," so I take "Optibeam" to mean that the Kinect's full microphone array is used rather than a single channel, but don't quote me on that. Once I initialize the KinectAudioSource instance with these settings, initialization of the recognizer can begin.

I started by searching the installed recognizers on the system. As previously mentioned, at the time of this writing only the recognizer identified as "SR_MS_en-US_Kinect_10.0" is supported by the Kinect SDK. If that recognizer is not installed, the call to FirstOrDefault returns null, and the subsequent null check throws a descriptive exception rather than letting the failure surface somewhere less obvious. If the required recognizer is found, then a new instance of the SpeechRecognitionEngine can be initialized, passing in the ID of the recognizer that was just found. Then a Choices object and a GrammarBuilder object are created and initialized. To the Choices object, I added the words I'd like to recognize. I also assign the culture specified by the RecognizerInfo object to the GrammarBuilder object. Then a new Grammar object is created, and I pass it the GrammarBuilder object. Subsequently, this Grammar object is loaded into the recognizer. I then add an event handler to the SpeechRecognized event of the recognizer. This handler takes care of displaying the words which were spoken and recognized. Only a few steps remain before conversing with the Kinect can begin.

The last step for this speech recognition demo is to tell the KinectAudioSource to start "listening" to me, hence the call to the Start method. This method returns a Stream object, and a reference to it should be kept so that it can be passed to the recognizer. This Stream also needs to be properly closed when the application finishes, which is another reason for maintaining the reference. The next call, to the recognizer's SetInputToAudioStream method, takes in the stream just created and also sets up some sampling information for acquiring audio data. I copied this from the sample directly because I am not terribly familiar with the aspects of audio capturing. You can experiment as you see fit. Once this point is reached, all that remains is to tell the recognizer to start recognizing, which can be done, conveniently, with a call to the recognizer's RecognizeAsync method. I used the "async" version of the Recognize family of calls so that my constructor would return and not block. You can use the synchronous (blocking) methods if you like, but make sure you do it properly (i.e. don't block where blocking wouldn't make sense, like a constructor).

I maintained references to the following for easy cleanup once the form was told to close: KinectAudioSource, SpeechRecognitionEngine, and the Stream returned by the call to KinectAudioSource.Start.
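Here is a sketch of that cleanup. The field names (audio, recognizer, recogStream) match the code above; the handler name and the exact teardown calls are my own arrangement, not lifted from the demo project:

// A hypothetical FormClosing handler that tears everything down
void MainForm_FormClosing(object sender, FormClosingEventArgs e)
{
    if (recognizer != null)
    {
        recognizer.RecognizeAsyncStop(); // stop the Microsoft.Speech engine
    }

    if (recogStream != null)
    {
        recogStream.Close();             // close the stream returned by KinectAudioSource.Start
    }

    if (audio != null)
    {
        audio.Dispose();                 // release the audio source (the SDK samples dispose of it when finished)
    }
}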

So how does it work? Let's put it all together:

kaufmed-470188.flv


Wrapping it Up With an Eye to the Future

And so ends my first dabble into the world that is Kinect. I have to say that I had a BLAST doing this project, even given how trivial it is. My hope with this article is to pique the interest of all from novice to expert. This project was developed in the course of one day. The possibilities of this API are quite expansive, given the proper amount of time and design. As I play with the device and the API more, I will try to post new articles regarding usage, such as interacting with the video camera. I'm sure as interest in the project continues to escalate, so too will the quality of the API. The tools are out there. Don't be intimidated; give your imagination a workout. Who knows what Kinect-ions you might create  = )



Project Source Code

Yes, the project is called "Minority Report." That is because I told everyone at my office that I was playing with the Kinect API, and that my goal was to give my next demo using the Kinect and a Minority Report-like interface!
MinorityReport-CS.zip
MinorityReport-VB.zip



References

Kinect SDK API Reference (included in SDK download)
Kinect SDK Samples (included in SDK download)
Kinect for Windows Programming Guide (File Download Below)
ProgrammingGuide-KinectSDK.pdf

Comments (13)

kaufmed (Author) commented:
Here is the timing code I incorporated from the sample. It was placed inside the SkeletonFrameReady handler.
DateTime cur = DateTime.Now; // current time, used for this frame's rate calculation

totalFrames++;

if (cur.Subtract(lastTime) > TimeSpan.FromSeconds(1))
{
    int frameDiff = totalFrames - lastFrames;
    lastFrames = totalFrames;
    lastTime = cur;
    frameRate.Text = frameDiff.ToString() + " fps";
}


Commented:
Here's an interesting one... have you tried running this and then bringing a chair into the focal range of the Kinect camera? It maps the chair as a second skeleton!
Also, lots of fun to be had if you can find one of those fitness balls (or a large beach ball!): let the Kinect map it and then try bouncing the ball!
When you mentioned the color recognition, it made me think of an Easter egg where, if you wear a certain t-shirt, you get extra power-ups.
kaufmed (Author) commented:
Hi r1tman2003,

Sorry, I don't follow.
armin sadatiFanyarai commented:
I have a Kinect device. How can I use it for face processing? Does the device have this feature? On my site, you can find an example of face recognition software; I want to connect it to the Kinect. What do you advise?
