For those of you who don't follow the news, or just happen to live under rocks, Microsoft Research released a beta SDK for the Xbox 360 Kinect. If you don't know what a Kinect is, then I will assume you do indeed live under a rock. The Xbox 360 peripheral has wowed gamers since 2010, and now Microsoft has seen fit to release a beta SDK for the device. In this article, I intend to demonstrate my first crack at the API. This article is targeted at anyone interested in developing applications which make use of the Kinect. Novice coders should have no trouble following what I did (since, in the Kinect world, I am a novice myself!).
The requirements of this project are as follows:
Visual Studio 2010* (any edition should work, even Express)
.NET 4.0 (this should be installed with VS 2010, if not already installed)
An XBox Kinect (duh!)**
Your system should have:
- A dual-core 2.66 GHz or better processor
- 2 GB RAM
- Windows 7
- A graphics card which supports DirectX 9.0c
For the speech portion of the project, you will also need the MS Speech libraries. Note: be sure to grab the x86 versions of these libraries, even on a 64-bit system.
* - I tried to get this to work with VS 2008, but VS had trouble recognizing the added reference to the Kinect DLL. It may work with VS 2008, but as of this writing I have not figured out how. (But hey, VS 2010 Express is free. Why not upgrade?) = )
** - The beta SDK was designed around the XBox version of the Kinect, and at the time this article was written there was no Windows Kinect. MS has since released a Windows-specific version of the Kinect, and a corresponding SDK for that device. The Windows Kinect SDK is incompatible with the XBox Kinect. MS Research did not release (to my knowledge) an updated version of the XBox Kinect SDK, and the beta SDK discussed in this article is the only choice available to you if you only have an XBox Kinect.
There are a couple of things you should know about the SDK. As I mentioned previously, it is in beta, so don't be surprised if there are bugs! The next thing is that the license governing the SDK provides for non-commercial use. I'm not going to cover the license in depth here, but if you use the SDK to create your own projects, make sure you read the license thoroughly and understand what you are agreeing to. I am in no way legally inclined and cannot offer advice as to acceptable use of the SDK.
My goal in this project was simple: become familiar with the API. Many of the samples that come with the SDK are written to take advantage of WPF. I haven't had much experience with that technology (yet), and so I was compelled to create a Forms application that could utilize this API. I played around with the Skeletal Tracking capabilities and I also dabbled in Speech Recognition. Let's first examine Skeletal Tracking, found under the Microsoft.Research.Kinect.Nui namespace.
Note: I have attached zip files for the demo project and these files can be found near the bottom of the article. There is a C# version as well as a VB.NET version.
I must confess: this blew my mind once I got it working, which didn't take long after going by the sample project. I kept my project rather simple--rather than draw the traditional skeleton, as demonstrated in the SDK's sample project, I drew only dots to represent the joints. We'll call it a poor-man's motion-capture studio. In order to play with Skeletal Tracking, you'll need to understand some of the classes which fall under this feature's namespace.
The Runtime class essentially is the Kinect (well, the visual portion, anyway). It gives you access to the visual field sensors of the device, encompassing the color, depth, and skeletal information. There is a handy Initialize method which you can use to specify the data you would like to collect. For my project, I only needed to initialize the device using the RuntimeOptions.UseSkeletalTracking enumeration. Here is what that initialization looks like:
nui = new Runtime();

try
{
    nui.Initialize(RuntimeOptions.UseSkeletalTracking);
}
catch (InvalidOperationException)
{
    // Raised when the API cannot find or communicate with the Kinect.
}
Checking for the InvalidOperationException is a very good idea, as it will tell you if the API was unable to find or communicate with your Kinect. The Runtime class exposes a few events that pertain to the arrival of each kind of visual data the device can gather. For this project, the event of importance is SkeletonFrameReady. Adding a handler to this event will give us the opportunity to interact with the Joints calculated by the device (more on these later). This was all that I needed to get started with tracking my movements via the Kinect. Now for picturing myself!
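Subscribing to that event is a one-liner. As a sketch (the field name nui and the handler name are my own), it looks like:

```csharp
// Hook the skeletal-frame event; the handler itself is shown later in the article.
nui.SkeletonFrameReady += new EventHandler<SkeletonFrameReadyEventArgs>(nui_SkeletonFrameReady);
```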
Cameras, even video cameras, capture images as frames, which, when equating to something tangible, could be thought of like a Polaroid (if you're old enough to remember what those are). In the case of the Kinect, a frame is a single "image" as captured by one of the sensors. The skeletal tracking system has its own notion of frames as well. A frame in this case is the instantaneous position of the collection of Joints that make up the skeleton.
Don't get too caught up on the notion of "single," though: even though we capture one "image" at a time, a skeletal image may actually contain two skeletons! Why? Well, the Kinect was designed for multiplayer capabilities (as in simultaneous users, not just Internet-ready), and so it can capture two simultaneous skeletons via its sensor. For my project, I only focused on one skeleton (I just couldn't bring myself to share!).
The SkeletonFrameReady event has an event arguments parameter of type SkeletonFrameReadyEventArgs. This parameter is what transports the data from the API to your application. The SkeletonFrame represents the "image" that was captured by the device. This class contains a member named Skeletons which represents the skeleton(s) found in the image. A skeleton is basically a collection of points (referred to earlier as Joints). This collection of points is stored within the SkeletonData class, which is what the Skeletons collection is composed of.
As I mentioned, SkeletonData stores the points recognized by the sensor. This class also houses a few other useful members, such as Position and UserIndex. For this project, I focused on TrackingState and Joints, as demonstrated in the SDK sample. These two members allowed me to project myself onto my form. The first member, TrackingState, refers to whether or not the skeleton is being tracked. I honestly can't describe what this implies, as I would think all skeletons would be tracked; the documentation is a bit thin on this. The second member, Joints, represents the collection of all points detected by the sensor. What points are detectable? There are 20 points, in fact. They consist of:
Center, between hips
Center, between shoulders
Head
Spine
Shoulder, left and right
Elbow, left and right
Wrist, left and right
Hand, left and right
Hip, left and right
Knee, left and right
Ankle, left and right
Foot, left and right
There is an enumeration named JointID that has an entry for each joint mentioned above. A twenty-first entry, JointID.Count, does not correspond to a joint; rather, its value represents the number of joints defined by JointID, which, according to the documentation, is useful for looping through the Joints collection (or presumably other collections).
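As a sketch, such a loop over a SkeletonData instance (here assumed to be named skel) might look like:

```csharp
// Walk every joint the sensor reports and print its camera-space position.
// "skel" is assumed to be a tracked SkeletonData instance.
for (int i = 0; i < (int)JointID.Count; i++)
{
    Joint joint = skel.Joints[(JointID)i];
    Console.WriteLine("{0}: ({1:0.00}, {2:0.00}, {3:0.00})",
        (JointID)i, joint.Position.X, joint.Position.Y, joint.Position.Z);
}
```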
So, given all these data structures, how the heck do you make that darn black alien hot dog do something cool? Let's see, shall we?
Would You Care to Dance?
Working inside of the SkeletonFrameReady handler, we can loop through the skeletons detected in the frame, and for each skeleton, we can translate each Joint to a screen point. I kept a class-level queue of Point structures for later painting. I used a queue because it is much easier to work with than an array. Here is what the translating looks like:
void nui_SkeletonFrameReady(object sender, SkeletonFrameReadyEventArgs e)
{
    SkeletonFrame frame = e.SkeletonFrame;
    foreach (SkeletonData skel in frame.Skeletons)
    {
        if (skel.TrackingState == SkeletonTrackingState.Tracked)
        {
            JointsCollection joints = skel.Joints;
            for (int ptIdx = 0; ptIdx < joints.Count; ptIdx++)
            {
                float x, y;
                nui.SkeletonEngine.SkeletonToDepthImage(joints[(JointID)ptIdx].Position, out x, out y);
                x = Math.Max(0, Math.Min(x * this.ClientRectangle.Width, this.ClientRectangle.Width));
                y = Math.Max(0, Math.Min(y * this.ClientRectangle.Height, this.ClientRectangle.Height));
                Point current = new Point((int)Math.Truncate(x), (int)Math.Truncate(y));
                jointPoints.Enqueue(current);
                if (joints[(JointID)ptIdx].ID == JointID.HandRight)
                {
                    // Move the cursor to the screen-space equivalent of the client point.
                    Cursor.Position =
                        new Point(current.X + (this.Left + (this.Width - this.ClientRectangle.Width)),
                                  current.Y + (this.Top + (this.Height - this.ClientRectangle.Height)));
                    if (Math.Abs(current.X - lastHand.X) > 15 || Math.Abs(current.Y - lastHand.Y) > 15)
                    {
                        lastHand = current;
                        delayToClick.Stop();
                    }
                    else if (!delayToClick.Enabled)
                    {
                        delayToClick.Start();
                    }
                }
            }
        }
    }
    Invalidate();
}
Here's what's going on above. I loop through the skeletons in the frame. For each skeleton, I check that the SkeletonData is in a state of being tracked. If it is, I proceed to convert each Joint to a Point. The SkeletonEngine class provides a couple of useful methods. One of them, SkeletonToDepthImage, returns values between 0 and 1 (I assume), and those values can be calculated against the client area of a form or canvas for later painting. Notice that after calling SkeletonToDepthImage, I multiply each of its out parameters against its respective dimension of the ClientRectangle of the form. This essentially translates the image from "camera space" to "application space." The math for converting these values was extracted from the sample project.
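To make the translation concrete, here is the same clamp-and-scale math pulled out into a standalone helper (the class and method names are my own invention, not part of the SDK):

```csharp
using System;

static class CameraSpace
{
    // Maps a normalized camera-space coordinate (roughly 0..1) to a pixel
    // coordinate within a client dimension, clamping to the edges, mirroring
    // the Math.Max/Math.Min lines in the handler above.
    public static int ToClient(float normalized, int clientSize)
    {
        float scaled = Math.Max(0, Math.Min(normalized * clientSize, clientSize));
        return (int)Math.Truncate(scaled);
    }
}
```

For a 640-pixel-wide client area, a normalized x of 0.5 maps to pixel 320, while values outside the 0..1 range clamp to the edges (0 or 640).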
You will notice that there is no actual drawing here. I am merely translating the points and storing them in a queue. The drawing occurs when I override the OnPaint method of the form. What, you don't believe me? See for yourself:
protected override void OnPaint(PaintEventArgs e)
{
    Graphics g = e.Graphics;
    try
    {
        while (jointPoints.Count > 0)
        {
            Point p = jointPoints.Dequeue();
            g.FillEllipse(Brushes.Black, p.X - 5, p.Y - 5, 11, 11);
        }
    }
    catch (Exception ex)
    {
        // Ignore drawing errors from a partially filled frame.
    }
}
Here I just loop through the Point queue and draw the points as 11-pixel-diameter circles. To make sure this code is called appropriately, notice the call to Invalidate as the last thing the SkeletonFrameReady handler does. The combination of these groups of logic is what brings the app to life:
Click Click Boom!
You probably noticed a bit of code in the SkeletonFrameReady handler that I didn't discuss earlier. Well, I didn't want to just settle for being a "dancing queen," so I decided to implement the ability to click a button with my hand. I'll warn you now, it's not as extravagant as I would like (I'd rather actually push to indicate a button press rather than simply hovering over it). I believe I would need to incorporate depth tracking to make the demo "pop" more, but for now, I'll hover.
Toward the end of the handler, I implement logic to check whether the point representing my right hand has stayed in generally the same spot while a timer ticks down. I put a threshold of 15 pixels in either direction as my algorithm for detecting "hovering." If I breach the threshold, then I stop my timer. If I'm within my threshold and my timer is not running, then I start it. For this project, my timer's interval is 3 seconds. I also set the position of the cursor to follow the point representing my right hand.
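The hover test itself is plain arithmetic, so it can be isolated from the Kinect entirely. Here is a sketch of that logic as a small helper class (the class and member names are my own invention):

```csharp
using System;

// Isolates the hover test from the Kinect: a point "hovers" while it stays
// within +/- threshold of the last anchor; straying too far re-anchors it.
class HoverDetector
{
    private readonly int threshold;
    private int anchorX, anchorY;

    public HoverDetector(int threshold) { this.threshold = threshold; }

    // Returns true while the point hovers near the anchor (the caller starts
    // the click timer if it isn't running); false means the point moved too
    // far, the anchor was reset, and the caller should stop the timer.
    public bool Update(int x, int y)
    {
        if (Math.Abs(x - anchorX) > threshold || Math.Abs(y - anchorY) > threshold)
        {
            anchorX = x;
            anchorY = y;
            return false;
        }
        return true;
    }
}
```

In the SkeletonFrameReady handler, a false result stops the three-second timer and a true result starts it if it isn't already running.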
"But where do you 'click?'" Excellent question. Right here:
void delayToClick_Tick(object sender, EventArgs e)
{
    mouse_event(MOUSEEVENTF_LEFTDOWN, 0, 0, 0, IntPtr.Zero);
    mouse_event(MOUSEEVENTF_LEFTUP, 0, 0, 0, IntPtr.Zero);
}
It's just a handler for my timer's Tick event. I opted for the Win API to simulate the mouse click; I couldn't find anything in the framework that would offer this. (Yes, I could do a Button.PerformClick, and I actually did when I first started, but I wanted to be able to click the "OK" button on the resulting message box, and I did not want to create a custom form and do Button.PerformClick there also just for this purpose.) For those of you unfamiliar with the mouse_event Win API function, it is imported thusly:
[DllImport("user32.dll")]
private static extern void mouse_event(
    UInt32 dwFlags,     // motion and click options
    UInt32 dx,          // horizontal position or change
    UInt32 dy,          // vertical position or change
    UInt32 dwData,      // wheel movement
    IntPtr dwExtraInfo  // application-defined information
);
Of course, add a using System.Runtime.InteropServices; for the DllImport attribute. Importing this function gave me the desired functionality I sought:
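For completeness, the two flag values passed to mouse_event are standard Win32 constants (their values come from winuser.h), declared alongside the import:

```csharp
private const UInt32 MOUSEEVENTF_LEFTDOWN = 0x0002; // left button pressed
private const UInt32 MOUSEEVENTF_LEFTUP   = 0x0004; // left button released
```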
As I mentioned earlier, I wasn't content with just dancing around my form (although it was thrilling at the time), so I decided to experiment with speech recognition. This really isn't a Kinect feature; rather, you use the Kinect's microphone to receive the audio, and this data is then forwarded to the Microsoft Speech API. If you attempt this part of the project, make sure you grab the libraries listed in the requirements at the beginning of the article. Pay special attention to the note regarding the x86 versions of the libraries--this is important.
The Microsoft.Research.Kinect.Audio namespace is the container for all things Kinect audio. The KinectAudioSource class more or less represents the subsystem which acquires audio data from the Kinect. Declaring an instance of this class will give you an interface for receiving audio data from the device.
Note: much of the Speech API is not documented very well. I will do my best to describe what I interpret the following classes to do, based on the samples I've looked at and the API documentation (or lack thereof). I will try to keep an eye on the documentation for the Speech API, and if it becomes up-to-date, I will update this article accordingly.
The RecognizerInfo class contains data about a speech recognizer installed on your system. The member of this class you will care about is the Id property, which identifies a speech recognizer. At the time of this project, I believe I read that only US English was supported by the Kinect SDK; perhaps this will change with more interest in the project or an official SDK release. Also at the time of this project, the only supported recognizer is identified by the ID "SR_MS_en-US_Kinect_10.0".
You can get this recognizer from the third link under the MS Speech requirements section above. (As a side note, make sure you have VS closed when you install these libraries so they get registered with VS appropriately. I believe the file Microsoft.Speech.dll doesn't get put into the GAC (Global Assembly Cache), so you have to add this reference manually by browsing to it. On my system, this file was installed to "C:\Program Files (x86)\Microsoft Speech Platform SDK\Assembly\Microsoft.Speech.dll". Adjust your path accordingly.)
The SpeechRecognitionEngine class represents the actual recognizer installed on your system; you initialize an instance of it by passing in the ID of the recognizer as found in an instance of a RecognizerInfo object. It exposes a few events which you can use to take action when a piece of speech is identified (SpeechRecognized) or rejected (SpeechRecognitionRejected), or when your current recognition engine takes a guess at what you said (SpeechHypothesized), as well as a few other events. My experimentation only dealt with recognized text.
I'm going to have to surmise that the GrammarBuilder class is something like a StringBuilder, but for a grammar. In looking at the samples, a GrammarBuilder takes in a list of "choices" and factors in what culture those choices are described in. My guess is that it generates what each "choice" would sound like in the specified culture. That is only a guess.
The "choices" the GrammarBuilder uses to generate a word's sound are basically a list of words, each its own string added to a Choices object. The Grammar class uses the GrammarBuilder to generate the sound representations, and subsequently stores those representations internally. Once these representations are created, the Grammar can be loaded into the instance of the SpeechRecognitionEngine.
Enough gory details. Now for the fun!
Speak and Ye Shall Receive
My example is going to differ a bit from the SDK examples, as I used a Forms app (the same one as the NUI demo above, in fact) for my experimentation. The only real difference is that the SDK example used a Console application and did everything inside the Main method, and thus had local scope on everything. For my demo, I used some form-level variables to track my SpeechRecognitionEngine instance, in addition to a couple of other instance members. One very important thing to take note of if you do a Forms app: you must use the MTAThread attribute on your Main method in order to prevent a nasty exception from cropping up. The API itself apparently has some threading going on, and you need this attribute to account for it.
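As a sketch, the entry point of a Forms app would carry the attribute like this (MainForm is a placeholder for your own form class):

```csharp
static class Program
{
    [MTAThread] // replaces the default [STAThread]; the Kinect audio stack needs an MTA thread
    static void Main()
    {
        Application.EnableVisualStyles();
        Application.SetCompatibleTextRenderingDefault(false);
        Application.Run(new MainForm());
    }
}
```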
Wait no longer. Here is the code I used for speech recognition:
private void InitializeAudio()
{
    audio = new KinectAudioSource();
    audio.FeatureMode = true;
    audio.AutomaticGainControl = false; // Per sample documentation, "Important to turn this off for speech recognition"
    audio.SystemMode = SystemMode.OptibeamArrayOnly;
}

private void InitializeRecognizer()
{
    RecognizerInfo ri = SpeechRecognitionEngine.InstalledRecognizers().Where(r => r.Id == "SR_MS_en-US_Kinect_10.0").FirstOrDefault();

    if (ri == null)
        throw new ApplicationException("Could not locate speech recognizer.");

    recognizer = new SpeechRecognitionEngine(ri.Id);

    GrammarBuilder builder = new GrammarBuilder();
    Choices whatIRecognize = new Choices();
    whatIRecognize.Add("you", "are", "officially", "Kinect-ed");

    builder.Culture = ri.Culture;
    builder.Append(whatIRecognize);

    Grammar grammar = new Grammar(builder);
    recognizer.LoadGrammar(grammar);

    recognizer.SpeechRecognized += new EventHandler<SpeechRecognizedEventArgs>(recognizer_SpeechRecognized);

    recogStream = audio.Start();
    recognizer.SetInputToAudioStream(recogStream, new SpeechAudioFormatInfo(EncodingFormat.Pcm, 16000, 16, 1, 32000, 2, null));
    recognizer.RecognizeAsync(RecognizeMode.Multiple);
}
...and here's what the heck is going on:
I initialized a new instance of the KinectAudioSource class so that I can interact with the audio device. The next few assignments are taken straight from the example in the SDK. Setting FeatureMode allows modification of some of the device's features. One such feature is "Automatic Gain Control," which, according to the comments in the sample, needs to be off for speech recognition. The assignment of OptibeamArrayOnly means that "Acoustic Echo Cancellation" is not being used (the other option is OptibeamArrayAndAec, i.e. with "AEC"). The other options under SystemMode refer to "SingleChannel," so I assume that "Optibeam" refers to the device's microphone array, but don't quote me on that. Once I initialize the KinectAudioSource instance with these settings, initialization of the recognizer can begin.
I started by searching the installed recognizers on the system. As previously mentioned, at the time of this writing only the recognizer identified as "SR_MS_en-US_Kinect_10.0" is supported by the Kinect SDK. If that recognizer is not installed, an exception is raised; this case is handled gracefully above by means of the call to FirstOrDefault and the subsequent check for null. If the required recognizer is found, then a new instance of the SpeechRecognitionEngine can be initialized, passing in the ID of the recognizer that was just found. Then a Choices object and a GrammarBuilder object are created and initialized. To the Choices object, I added the words I'd like to recognize. I also assign the culture specified by the RecognizerInfo object to the GrammarBuilder object. Then a new Grammar object is created, and I pass it the GrammarBuilder object. Subsequently, this Grammar object is loaded into the recognizer. I then add an event handler to the SpeechRecognized event of the recognizer. This handler will take care of displaying the words which were spoken and recognized. Only a few steps remain before conversing with the Kinect can begin.
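A minimal version of that handler might look like this (displaying the result in a message box is my choice, not a requirement of the API):

```csharp
// Display the recognized phrase and the engine's confidence in it.
void recognizer_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
{
    MessageBox.Show(string.Format("Recognized: {0} (confidence {1:0.00})",
        e.Result.Text, e.Result.Confidence));
}
```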
The last steps for this speech recognition demo are to tell the KinectAudioSource to start "listening" to me; thus we call the Start method. This method returns a Stream object, and a reference to this object should be captured so that it can be passed to the recognizer. This Stream needs to be properly closed when the application finishes, which is another reason for maintaining the reference. The next call, to the recognizer's SetInputToAudioStream method, takes in the stream just created and also sets up some sampling information for acquiring audio data. I copied this from the sample directly because I am not terribly familiar with the aspects of audio capturing; you can experiment as you see fit. Once this point is reached, all that remains is to tell the recognizer to start recognizing, which can be done, conveniently, with a call to the recognizer's RecognizeAsync method. I used the "async" version of the Recognize family of calls so that my constructor would return and not block. You can use the synchronous (blocking) methods if you like, but make sure you do it properly (i.e. don't block where blocking wouldn't make sense, like in a constructor).
I maintained references to the following for easy cleanup once the form was told to close: the KinectAudioSource and the Stream returned by the call to KinectAudioSource.Start.
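A sketch of that cleanup, run when the form closes (the field names match those above; the exact teardown order is my assumption):

```csharp
// Tear down speech, audio, and the NUI runtime when the form closes.
void MainForm_FormClosing(object sender, FormClosingEventArgs e)
{
    if (recognizer != null)
        recognizer.RecognizeAsyncStop(); // stop recognition before touching the stream

    if (recogStream != null)
        recogStream.Close();            // close the stream from KinectAudioSource.Start

    if (audio != null)
        audio.Dispose();                // release the Kinect audio source

    if (nui != null)
        nui.Uninitialize();             // release the NUI runtime
}
```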
So how does it work? Let's put it all together:
Wrapping it Up With an Eye to the Future
And so ends my first dabble into the world that is Kinect. I have to say that I had a BLAST doing this project, even given how trivial it is. My hope with this article is to pique the interest of all from novice to expert. This project was developed in the course of one day. The possibilities of this API are quite expansive, given the proper amount of time and design. As I play with the device and the API more, I will try to post new articles regarding usage, such as interacting with the video camera. I'm sure as interest in the project continues to escalate, so too will the quality of the API. The tools are out there. Don't be intimidated; give your imagination a workout. Who knows what Kinect-ions you might create = )
Project Source Code
Yes, the project is called "Minority Report." The reason is because I told everyone at my office that I was playing with the Kinect API and that my goal was to give my next demo using the Kinect and have a Minority Report-like interface!
Kinect SDK API Reference (included in SDK download)
Kinect SDK Samples (included in SDK download)
Kinect for Windows Programming Guide (File Download Below)