Can <em>your</em> phone SEE what you're saying?

Since I don't have a smartphone (and am not in the market for one), I don't typically care what new features get put into the latest models. That is, unless it is cool and interesting.

When I first saw commercials for the Apple iPhone 4S and Siri, I thought it must have been a recreation of the technology to get market buzz. When I figured out the scenes were portraits of reality, I took notice and marvelled at such a cool feature.

Today, I read some speculative articles and rumors about Google getting ready to release its competing product, Majel. It is named after Majel Barrett-Roddenberry, who provided the voice of the computer in Star Trek and Star Trek: The Next Generation. Beyond the appropriate name, I was thinking "Ho hum, another voice-activated search tool."

Then I read some of the "details" published on the site. The most interesting point for me is in the second-to-last paragraph attributed to "anonymous Googler" touting the reasons for higher expected performance of Majel over normal speech recognition: "...mostly because of the use of high quality microphones and lip-reading assistance." [Emphasis is mine]

Regardless of whether or not this turns out to be true, it's one of those cool Star Trek technology ideas! Being able to read lips with a handheld device would have so many benefits and uses.

Having the device understand spoken commands more reliably is good and functional. But why couldn't you turn the camera outward to act as an aid to the hearing-impaired? This seems like it would be a great help for older citizens who have lost their hearing, or when encountering people who do not know how to sign. I hold my device so that it can see the speaker's face when they talk, and a transcript of their words appears on the screen facing me. Maybe it could even translate from their language to mine?

And I haven't even had time to think about all the James Bond uses for such portable technology. All football coaches will need a clipboard to cover their mouths when calling in plays, to keep fans or spies from the other team from being able to discern what they say.

What's next?

  • Facial recognition? If I'm at a party and I want to be sure to meet someone that I've not met, I could put in some pictures, have my device stuck in my pocket scanning people as I walk around the party and give me a signal when I am close to the object of my search.

  • Finding separated members of your party in a crowd? If we arrange to meet at Splash Molehill at 1pm, but I can't tell if anyone else is close, I hold my device over my head and rotate slowly to scan the crowd. If anyone is recognized, I get a signal and an indication of where they are.

  • Or assessing the identity and potential value of objects in a previously locked room? (My wife and I have become fans of Storage Wars.) It would be cool if a device could scan a room, postulate the identity of the items that can be seen, and render a value for those objects from an Interwebs search.

I may not be getting a smartphone anytime soon (or ever), but I'm still fascinated by the technology being built into handheld devices that even 5 years ago seemed to require much more computational power than you could hold in the palm of your hand.


Clay B.

If you've seen any of the iPhone 4S commercials, the person talking to the phone is holding it out in front of them, somewhat like when Kirk called up to the Enterprise to get him and his team beamed up to the ship. In Star Trek they didn't hold the communicator to their ear. On ST:TNG, since they carried the communicator on their chest, that version would need to be trained to the speaker's voice. For us here in the 21st Century, a little of both is going to go a long way toward getting our devices to understand the spoken word.

Another SF example would be Adama's journal from the original Battlestar Galactica series (1978). Lorne Greene would speak his lines and the text appeared on the screen. TV "magic" of course, but from the staging of those scenes you could imagine both voice recognition and lip reading used as a combined technology.

Live Long and Prosper.

Raf Schietekat

But what am I saying: only the capture is needed on the phone, synthesis can be delegated to the cloud (can I get a patent for that?).

Raf Schietekat

I expect the device to be flexible in that regard (like you already don't need to train it to your voice anymore), but maybe not when held up to the ear (although here, unlike with human "speechreading", it's only an additional cue, so less information is needed). It would take some getting used to, always carrying that headset along, and so also finding yourself without an excuse not to go for multimedia communication (with video). Unless... maybe the device will also kindly substitute your motion-captured image with a synthesised one, shaved or made up, and with appropriate dress and background (and similarly for audio). When do you think we'll be there?

(Wow, I think when that movie came out I must have been about the same age as the character in the final scene...)

Clay B.

Raf - I figured it would require a pretty rigid configuration of camera angle and lighting and face position. It is just a cool idea to think that someone would propose adding that feature to voice recognition, even if it might not be too reliable outside ideal lab conditions (for now).

At least now we know such a thing is possible. Next time anyone is plotting to turn off the unbalanced computer that controls the air and food and communication, the conspirators will know they need to not only find a room that can be sealed off from sound, but also turn off the lights, turn their backs to any cameras, or hold up a clipboard in front of their mouths. They don't want to tip their hand too soon and wind up out in space with a severed oxygen hose.

Raf Schietekat

To be clear, the "high quality microphones and lip-reading assistance" referred to a robot setup, where the user is facing the camera. With a... well, communicator (Star Trek parlance for smart phone), you'd probably need a headset in addition to the main device for the same assistance in a noisy environment.
