Ultimate Coder Challenge II : Lee Going Perceptual : Week Five

This week marks a gear change for me. I have finished cramming everything I could possibly manage into the ultimate Perceptual Webcamera and have started deploying my builds to other devices and usage scenarios. Here is a brief whistle-stop tour of the week’s progress:

From here on in, it’s polish, polish and polish. It is a time when one bug can eat a whole day, and one device can rip your application apart. I don’t plan on embarrassing myself at GDC (more than usual, that is), and my goal is to have an app that is not just a demonstration of Perceptual Computing, but an app you can actually use.

Voice Control

Rather than offer a huge break-down of the horrors of Voice SDK coding, I thought I would offer one solution and some advice to anyone who wants to add voice control to their apps.

WARNING: Firstly, I could not get grammar to work in Beta 3, and there is no example showing it being used in the samples folder. The dictation sample works fine, but only at high frame rates. Try to run it at 20fps and you will get spotty results and a potential waiting time of 15 seconds. Switching real-time to true will help a little!

SOLUTION 1: Instead of detecting whether a single word has been spoken, create a piece of code which shows you all the words the SDK returns when you say the same word. Write them all down and check for ALL of these words, so your app triggers the action based on the intended word. It’s crude, but effective, and with enough words even badly translated voice detection will serve your app well.
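As a sketch of this approach in standard C++ only (the variant lists below are illustrative — build your own by logging what the SDK actually returns for each command), a lookup table can map every observed mis-hearing onto the intended command:

```cpp
#include <string>
#include <unordered_map>
#include <algorithm>
#include <cwctype>

// Map every string the recognizer was observed to return onto one command id.
// -1 means "not a recognized command".
static int CommandFromDictation(std::wstring word)
{
    // Normalise case so "Host" and "host" hit the same entry
    std::transform(word.begin(), word.end(), word.begin(), ::towlower);

    // Each intended command word plus its observed mis-hearings
    static const std::unordered_map<std::wstring, int> variants = {
        { L"host", 1 }, { L"holst", 1 }, { L"cost", 1 }, { L"coast", 1 },
        { L"call", 2 }, { L"carl", 2 }, { L"count", 2 },
        { L"exit", 3 }, { L"goodbye", 3 },
    };

    auto it = variants.find(word);
    return it != variants.end() ? it->second : -1;
}
```

Plugging something like this into the recognition callback keeps the trigger logic in one place, and adding a newly observed mis-hearing becomes a one-line change.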

SOLUTION 2: Use very long and complicated single words. The voice SDK runs through a database of likely word matches. Using a word like ‘Return’ can produce a zillion variants, but a word like ‘Conference’ only returns one or two variants. Choosing the right command words in your app is essential if you want to avoid frustrated users.
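One way to sanity-check a candidate command vocabulary before testing it against the microphone (my suggested technique, not from the Voice SDK) is to measure how far apart the words are with an edit distance — commands only an edit or two apart are the ones a recognizer is most likely to confuse:

```cpp
#include <string>
#include <vector>
#include <algorithm>
#include <cstdint>

// Classic Levenshtein edit distance between two words (two-row version).
static size_t EditDistance(const std::wstring &a, const std::wstring &b)
{
    std::vector<size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (size_t j = 0; j <= b.size(); j++) prev[j] = j;
    for (size_t i = 1; i <= a.size(); i++)
    {
        cur[0] = i;
        for (size_t j = 1; j <= b.size(); j++)
        {
            size_t sub = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({ prev[j] + 1, cur[j - 1] + 1, sub });
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Smallest pairwise distance in a command set; a low value flags a
// vocabulary the recognizer is likely to confuse.
static size_t MinPairwiseDistance(const std::vector<std::wstring> &words)
{
    size_t best = SIZE_MAX;
    for (size_t i = 0; i < words.size(); i++)
        for (size_t j = i + 1; j < words.size(); j++)
            best = std::min(best, EditDistance(words[i], words[j]));
    return best;
}
```

A set like ‘host’, ‘post’, ‘cost’ scores a minimum distance of 1, which matches how often the SDK confused them for me; longer, distinctive words like ‘conference’ score much higher against everything else.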

Ultrabook Flip Back Form

I thought it would be cool to attach the perceptual camera in Flip Back Mode to get a feel for the device experience. The downside is that should the depth camera be built into a future Ultrabook in this form, the camera would sit at the ‘bottom’ of the screen, so some gesture detection will be affected.

Ultrabook Laptop Form

Traditional mounting provided the best gesture and camera data, and also offered a handy keyboard which will always find a use in modern teleconferencing software.

Remaining Challenges

On the eve of making these videos, I had planned to show a live video conferencing call through my router to a remote PC at a different IP address. Alas, a few things prevent this: no texture transfer between clients; VOIP quality too low and laggy; unreliable stability on non-Perceptual Camera devices; unreliable stability on Windows 8 devices; no port forwarding code to break through router NAT systems; no hand-shake server to orchestrate a live contact list; destination avatar vertices not stitched; overall 3D performance on the Ultrabook too low; optimisations still required for depth and texture compression over the network; and the unpredictable response of voice control at low frame rates. As you might imagine, I have my work cut out over the next few days!

Signing Off

Hopefully I did not make the blog too long this week, and I hope the videos were a good substitute for blog length. All the features I have added to the app are sympathetic to the usage scenario of someone making a hands-free conferencing call, with optional touch and gestures that are simple enough to work all of the time. My head is stuffed with more cool touches like speech synthesis so the software can talk back to you, hands-free contact list creation using voice and camera, and the ability to share documents and assets within the virtual world. The reality is that these could be done, but they would jeopardise the stability of a first version, and when you’re presenting at GDC, you want bucket loads of stability when fifty people are staring right at you!

Voice Recognition Source Code

Here is the source code to tame the often wild conjecture provided by the Voice SDK:


int gLatestVoiceCommand = 0;

class MyVoiceSystem
{
public:
 // Callback for recognized commands and alerts
 class MyHandler : public PXCVoiceRecognition::Recognition::Handler, public PXCVoiceRecognition::Alert::Handler
 {
 public:
  MyHandler(std::vector<pxcCHAR*> &commands) { this->commands=commands; }

  virtual void PXCAPI OnRecognized(PXCVoiceRecognition::Recognition *cmd)
  {
   // Recognise commands here, checking each intended word
   // plus every mis-hearing observed during testing
   pxcCHAR* pTheWords = cmd->dictation;
   gLatestVoiceCommand = -1;
   if ( pTheWords )
   {
    // serious commands
    if ( wcsicmp ( pTheWords, L"host" )==0
    ||   wcsicmp ( pTheWords, L"holst" )==0
    ||   wcsicmp ( pTheWords, L"cost" )==0
    ||   wcsicmp ( pTheWords, L"coast" )==0
    ||   wcsicmp ( pTheWords, L"post" )==0
    ||   wcsicmp ( pTheWords, L"pest" )==0
    ||   wcsicmp ( pTheWords, L"past" )==0
    ||   wcsicmp ( pTheWords, L"test" )==0
    ||   wcsicmp ( pTheWords, L"asked" )==0
    ||   wcsicmp ( pTheWords, L"analyst" )==0
    ||   wcsicmp ( pTheWords, L"pulsed" )==0 ) gLatestVoiceCommand = 1;

    if ( wcsicmp ( pTheWords, L"call" )==0
    ||   wcsicmp ( pTheWords, L"count" )==0
    ||   wcsicmp ( pTheWords, L"carl" )==0
    ||   wcsicmp ( pTheWords, L"quote" )==0
    ||   wcsicmp ( pTheWords, L"account" )==0 ) gLatestVoiceCommand = 2;

    if ( wcsicmp ( pTheWords, L"exit" )==0
    ||   wcsicmp ( pTheWords, L"accident" )==0
    ||   wcsicmp ( pTheWords, L"goodbye" )==0 ) gLatestVoiceCommand = 3;

    if ( wcsicmp ( pTheWords, L"confer" )==0
    ||   wcsicmp ( pTheWords, L"conference" )==0 ) gLatestVoiceCommand = 4;

    if ( wcsicmp ( pTheWords, L"rick" )==0
    ||   wcsicmp ( pTheWords, L"right" )==0
    ||   wcsicmp ( pTheWords, L"back" )==0
    ||   wcsicmp ( pTheWords, L"trick" )==0 ) gLatestVoiceCommand = 5;

    if ( wcsicmp ( pTheWords, L"lee" )==0
    ||   wcsicmp ( pTheWords, L"he" )==0
    ||   wcsicmp ( pTheWords, L"beat" )==0
    ||   wcsicmp ( pTheWords, L"you" )==0
    ||   wcsicmp ( pTheWords, L"lane" )==0
    ||   wcsicmp ( pTheWords, L"when" )==0
    ||   wcsicmp ( pTheWords, L"me" )==0 ) gLatestVoiceCommand = 6;

    if ( wcsicmp ( pTheWords, L"toggle" )==0
    ||   wcsicmp ( pTheWords, L"toddle" )==0
    ||   wcsicmp ( pTheWords, L"problem" )==0
    ||   wcsicmp ( pTheWords, L"travel" )==0
    ||   wcsicmp ( pTheWords, L"template" )==0
    ||   wcsicmp ( pTheWords, L"talk" )==0
    ||   wcsicmp ( pTheWords, L"talking" )==0
    ||   wcsicmp ( pTheWords, L"pedal" )==0 ) gLatestVoiceCommand = 7;

    if ( wcsicmp ( pTheWords, L"import" )==0
    ||   wcsicmp ( pTheWords, L"and" )==0
    ||   wcsicmp ( pTheWords, L"and part" )==0
    ||   wcsicmp ( pTheWords, L"input" )==0
    ||   wcsicmp ( pTheWords, L"them holt" )==0 ) gLatestVoiceCommand = 8;

    /* RETURN - too many variants, too much work!!
    if ( wcsicmp ( pTheWords, L"return" )==0
    ||   wcsicmp ( pTheWords, L"written" )==0
    ||   wcsicmp ( pTheWords, L"what time" )==0
    ||   wcsicmp ( pTheWords, L"we touch" )==0
    ||   wcsicmp ( pTheWords, L"with" )==0
    ||   wcsicmp ( pTheWords, L"the time" )==0
    ||   wcsicmp ( pTheWords, L"to time" )==0
    ||   wcsicmp ( pTheWords, L"witch" )==0
    ||   wcsicmp ( pTheWords, L"time" )==0
    ||   wcsicmp ( pTheWords, L"me tan" )==0
    ||   wcsicmp ( pTheWords, L"button" )==0
    ||   wcsicmp ( pTheWords, L"mountain" )==0
    ||   wcsicmp ( pTheWords, L"what's up" )==0
    ||   wcsicmp ( pTheWords, L"mattoon" )==0
    ||   wcsicmp ( pTheWords, L"ritson" )==0
    ||   wcsicmp ( pTheWords, L"mention" )==0
    ||   wcsicmp ( pTheWords, L"witch in" )==0
    ||   wcsicmp ( pTheWords, L"motel in" )==0
    ||   wcsicmp ( pTheWords, L"which" )==0 ) gLatestVoiceCommand = 4;
    */

    // fun commands
    if ( wcsicmp ( pTheWords, L"dog" )==0 ) gLatestVoiceCommand = 11;
    if ( wcsicmp ( pTheWords, L"horse" )==0 ) gLatestVoiceCommand = 12;
    if ( wcsicmp ( pTheWords, L"lion" )==0 ) gLatestVoiceCommand = 13;
    if ( wcsicmp ( pTheWords, L"wolf" )==0 ) gLatestVoiceCommand = 14;
   }
  }

  virtual void PXCAPI OnAlert(PXCVoiceRecognition::Alert *alert)
  {
   // No alert handling needed for this app
  }

  std::vector<pxcCHAR*> commands;
 };

 // Voice pipeline objects
 UtilCaptureFile* pvoicecapture;
 PXCSmartPtr<PXCVoiceRecognition> vc;
 MyHandler* phandler;
 PXCSmartPtr<PXCAudio> audio;
 PXCSmartSPArray* psps;

 MyVoiceSystem()
 {
  pvoicecapture = NULL;
  phandler = NULL;
  psps = new PXCSmartSPArray(3);
 }

 ~MyVoiceSystem()
 {
  if ( phandler!=NULL ) delete phandler;
  if ( psps!=NULL ) delete psps;
 }
};

MyVoiceSystem* pmyVS = NULL;