Most of this week has been spent working on the Interface library; you can see a demo of it here:
I want to use this week's blog post to set the record straight on the camera. I feel that a lot of false expectations have been created about what it can and cannot do. You may call this my "review".
Let's start with the API. The SDK is a bit of a mess: it crashes if you do the wrong thing (in ways that are not documented), and it promises things it doesn't really deliver. Case in point: it says it has head tracking, but it really doesn't, since the head tracking only uses the color image and not the depth image and therefore gets unusable results. The voice recognition is cool, but it requires the user to have a paid-for license already installed on the machine, making it useless for anyone building something commercial.
So let's do what I did: avoid the API's supposed features and go straight for what the hardware outputs. The depth camera values come out in millimeters and are fairly accurate. However, to work the camera needs something to reflect against, so it handles dark skin and hair poorly. But the real limiting factor is the output resolution of only 320 by 240 pixels.
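To make the dropout problem concrete, here is a minimal sketch (not from the SDK; the function name, frame shape, and working range are my assumptions) of what consuming the raw depth frame looks like: depth arrives in millimeters, and pixels that failed to reflect the IR flash have to be masked out before you can do anything with them.

```python
import numpy as np

# Hypothetical sketch: cleaning up a raw 320x240 depth frame.
# The camera reports depth in millimeters, but pixels that fail to
# reflect (dark hair, dark skin, transparent surfaces) come back as
# zero or out-of-range values and must be masked out.

DEPTH_W, DEPTH_H = 320, 240
DEPTH_MIN_MM, DEPTH_MAX_MM = 200, 4000  # plausible working range (assumption)

def valid_depth_mask(frame_mm):
    """Return a boolean mask of pixels with a usable depth reading."""
    return (frame_mm > DEPTH_MIN_MM) & (frame_mm < DEPTH_MAX_MM)

# Simulated frame: a "hand" at 800 mm against dropout (zero) background.
frame = np.zeros((DEPTH_H, DEPTH_W), dtype=np.uint16)
frame[100:140, 150:180] = 800

mask = valid_depth_mask(frame)
print(mask.sum())  # 1200 usable pixels (40 rows x 30 columns)
```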
Let's do some math on that. If you have an HD monitor and you are trying to control a pointer, then for each pixel you move your hand, the pointer moves 6 pixels on screen, making it very tricky to hit something like a web link. And that is if you use the entire resolution of the camera, meaning the user has to reach far out to reach the edge of the screen. To make it ergonomically feasible you would probably only use half the camera's range, giving you a resolution of 12 screen pixels to every camera pixel. If you add that you need to generate some kind of event, like pinching, to actually click on something, you would need click surfaces 100+ pixels large. Add to this that the image is unstable, and you have a recipe for user frustration. It's a cool idea, but until the camera is at least HD, I would not attempt to use it for pointing.
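The arithmetic above, written out (assuming a 1920-pixel-wide HD display against the 320-pixel-wide depth image):

```python
# Mapping camera pixels to screen pixels for pointer control.

screen_w = 1920   # HD display width in pixels
camera_w = 320    # depth camera width in pixels

# Using the camera's full horizontal range:
full_range = screen_w / camera_w          # screen px moved per camera px

# Using only half the camera's range for ergonomic reach:
half_range = screen_w / (camera_w / 2)

print(full_range, half_range)  # 6.0 12.0
```

So at best the pointer jumps in 6-pixel steps, and in a realistic setup 12-pixel steps, before any click gesture or sensor noise is accounted for.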
So, over to head tracking. This was my main interest. After discarding the SDK's head tracking, this weekend I took a stab at writing my own. After about a day I had written four separate algorithms to find, track and stabilize the head position, down to sub-pixel accuracy (yes, I can detect if you move your head about a quarter of a pixel). I use no smoothing or prediction, in order to avoid adding any latency, and the result is much more predictable, precise and stable than the SDK head tracker. Even though it's worlds better than the SDK, and probably close to the limit of how well it can be done with this hardware, I'm still not sure it's good enough.
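To be clear, the four algorithms are not described in this post, so the following is not the author's method; it is just a sketch of one standard way sub-pixel position can come out of a low-resolution image: average (centroid) over all the pixels in a segmented region, so that a small movement of the object shifts the result by a fraction of a pixel.

```python
import numpy as np

# NOT the author's algorithm -- a generic illustration of sub-pixel
# position from a coarse image: take the centroid of pixels that fall
# inside a depth band. Averaging over many pixels lets the result
# move in fractions of a pixel.

def subpixel_centroid(depth_mm, near_mm, far_mm):
    """Centroid (x, y) of pixels within a depth band, in fractional pixels."""
    ys, xs = np.nonzero((depth_mm > near_mm) & (depth_mm < far_mm))
    if len(xs) == 0:
        return None
    return xs.mean(), ys.mean()

frame = np.zeros((240, 320), dtype=np.uint16)
frame[100:120, 100:120] = 900                # 20x20 "head" blob at 900 mm
x1, _ = subpixel_centroid(frame, 500, 1500)

frame[100:120, 120] = 900                    # blob grows by one column
x2, _ = subpixel_centroid(frame, 500, 1500)

print(x1, x2)  # 109.5 110.0 -- a half-pixel shift in the estimate
```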
One interesting thing is that the SDK outputs what it calls a precision map, which tells you how good the depth values are. This is a bit of a mislabeling, because what it really is, is the infrared camera image. While it's generally true that a bright surface reflects the IR flash better, a dark pixel may still have a correct reflection, and a bright pixel may still be something transparent and therefore have an erroneous depth value. One possible use for this map would be as an aid for eye tracking, since human eyes light up like deer in headlights under IR light...
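As a hypothetical first pass at that idea (nothing here comes from the SDK; the function and threshold are my assumptions), eye candidates would appear in the IR image as small, very bright spots, so a simple intensity threshold already narrows the search:

```python
import numpy as np

# Hypothetical sketch: pupils reflect the IR flash strongly, so eye
# candidates show up as near-saturated pixels in the IR image
# (the SDK's so-called "precision map"). Thresholding near the top
# of the 8-bit range is a crude way to find them.

def bright_spots(ir_image, threshold=240):
    """Return (y, x) coordinates of pixels at or above the threshold."""
    ys, xs = np.nonzero(ir_image >= threshold)
    return list(zip(ys.tolist(), xs.tolist()))

ir = np.full((240, 320), 60, dtype=np.uint8)   # dim background
ir[50, 140] = 250                              # left "eye" glint
ir[50, 180] = 252                              # right "eye" glint

print(bright_spots(ir))  # [(50, 140), (50, 180)]
```

A real detector would of course also have to reject reflective jewelry, glasses and specular highlights, which is exactly where the low resolution starts to hurt again.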
Speaking of eye tracking: when I, or the SDK, talk about eye tracking, we are talking about where the eyes are, not where they are looking. The SDK has some functionality that tries to figure out which direction the head is turned, but the most you can get out of it is "the user might be looking to the left". The idea that you could use that to control a pointer is completely ridiculous, as it is orders of magnitude less precise than using a finger as a pointer, and even that is too imprecise to be useful.
These are all pretty harsh words, so do I think the camera is worthless? No, it's just not nearly good enough to be used as an effective and reliable input device. There are other, more specialized uses, like 3D model acquisition, where it is useful. My guess is also that, with time, higher-resolution cameras will make more uses feasible, and that the current camera is a useful toy for researchers: a way to start writing the algorithms that will pay off once cameras reach a resolution high enough for consumers.
The goal of any user interface is to disappear: to connect the user's will directly to the machine. When you drive a car you don't have to think about how you turn left, you just do. You don't think about turning the steering wheel, you think about turning the car. If you add any doubt that the wheel will turn the car, the interface becomes a disaster. For an interface to disappear you must trust it 100%, and if it fails you once, it becomes worthless. This creates an incredibly high bar for tracking and voice recognition to reach, and this camera isn't there yet.