Archived - Perceptual Drone Speech Recognition

The Intel® RealSense™ SDK has been discontinued. No ongoing support or updates will be available.


Controlling Drones with Speech-Recognition Applications Using the Intel® RealSense™ SDK

Every day we hear about drones in the news. With applications ranging from spying and military operations to photography and video to simply having fun, drone technology is still on the ground floor and worth looking into.

As developers, we have the ability to create applications that can control them. A drone is ultimately just a programmable device, so connecting to it and sending commands to perform desired actions can be done using a regular PC or smartphone application. For this article, I have chosen to use one of the most “hackable” drones available on the market: Parrot’s AR.Drone* 2.0.

We will see how to interact with and control this drone with a library written in C#. Using this as our basis we will add speech commands to control the drone using the Intel® RealSense™ SDK.

PARROT AR.DRONE 2.0

Among the currently marketed drones for hobbyists, one of the most interesting is the AR.Drone 2.0 model from Parrot. It includes many features and incorporates a built-in help system that provides a stabilization and calibration interface. The drone’s sturdy Styrofoam protection helps to avoid damage to the propellers or moving parts in case of falls or collisions with fixed obstacles.

The AR.Drone* 2.0 from Parrot

The hardware provides a connection with external devices over its own Wi-Fi* network, which links the drone and the connected device (smartphone, tablet, or PC). The communication protocol is based on AT-style messages (like those used to program and control telephone modems years ago).

Using this simple protocol, it is possible to send the drone all the commands needed to get it off the ground, raise or lower in altitude, and fly in different directions. It is also possible to read a stream of images taken from cameras (in HD) placed onboard the drone (one front and one facing down) to save pictures during flights or capture video.

The company provides several applications to fly the drone manually; however, it’s much more interesting to study how to autonomously control the flight. For this reason, I decided (with the help of my colleague Marco Minerva) to create an interface that would allow us to control it through different devices.

Controlling the Drone Programmatically

We said that the drone has its own Wi-Fi network, so we’ll connect to it to send control commands. The AR.Drone 2.0 developer guide gave us all the information we needed. For example, the guide says to send commands via UDP to the 192.168.1.1 address, on port 5556. These are simple strings in the AT format:

AT*REF for takeoff and landing control

AT*PCMD to move the drone (direction, speed, altitude)
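As a rough illustration of the AT*REF format, the sketch below builds takeoff and landing strings. The class and method names are hypothetical and are not the library's API. The argument values 290718208 (takeoff) and 290717696 (land) are the ones given in the AR.Drone developer guide, and the sketch assumes each command ends with a carriage return.

```csharp
using System.Globalization;

// Hypothetical helper for illustration; not part of the library described in this article.
public static class RefCommandSketch
{
    // Bits 18, 20, 22, 24 and 28 are always set in the AT*REF argument (0x11540000).
    private const int BaseValue = 290717696;

    // Bit 9 selects takeoff (set) or landing (clear).
    private const int TakeoffBit = 1 << 9;

    public static string Takeoff(uint sequenceNumber)
    {
        return Build(sequenceNumber, BaseValue | TakeoffBit);
    }

    public static string Land(uint sequenceNumber)
    {
        return Build(sequenceNumber, BaseValue);
    }

    // Formats "AT*REF=<sequence>,<argument>" followed by a carriage return.
    private static string Build(uint sequenceNumber, int argument)
    {
        return string.Format(CultureInfo.InvariantCulture, "AT*REF={0},{1}\r", sequenceNumber, argument);
    }
}
```

For example, RefCommandSketch.Takeoff(1) produces the string AT*REF=1,290718208 followed by a carriage return.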

Once we connect to the drone, we’ll create a sort of ‘game’ where we send commands to the drone based on the inputs of our application. Let's see how to create a Class Library.

First, we must connect to the device:

        public static async Task ConnectAsync(string hostName = HOST_NAME, string port = REMOTE_PORT)
        {
            // Set up the UDP connection.
            var droneIP = new HostName(hostName);

            udpSocket = new DatagramSocket();
            await udpSocket.BindServiceNameAsync(port);
            await udpSocket.ConnectAsync(droneIP, port);
            udpWriter = new DataWriter(udpSocket.OutputStream);

            // Send a first byte; the drone discards it, but it initializes the link.
            udpWriter.WriteByte(1);
            await udpWriter.StoreAsync();

            // Start the background loop that keeps sending commands to the drone.
            var loop = Task.Run(() => DroneLoop());
        }

As mentioned, we must use the UDP protocol, so we need a DatagramSocket object. After connecting with the ConnectAsync method, we create a DataWriter on the output stream to send the commands themselves. Finally, we send the first byte via Wi-Fi. It will be discarded by the drone and is only meant to initialize the system.

Let's check the command sent to the drone:

        private static async Task DroneLoop()
        {
            while (true)
            {

                var commandToSend = DroneState.GetNextCommand(sequenceNumber);
                await SendCommandAsync(commandToSend);

                sequenceNumber++;
                await Task.Delay(30);
            }
        }

The DroneState.GetNextCommand method formats the AT command string that must be sent to the device. It needs a sequence number because the drone expects each command to carry a progressively increasing number and ignores any command whose number is equal to or lower than one already received.

Then SendCommandAsync (not listed here) uses WriteString to write the command to the DataWriter created on the DatagramSocket output stream and calls StoreAsync to flush the buffer and send it. Finally, we increment the sequence number and use Task.Delay to wait 30 milliseconds before the next iteration.
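To make the sequence-number rule concrete, here is a minimal model of the behavior described above: a command is accepted only if its number is higher than any number already accepted. The class is hypothetical and models only the rule as stated in this article; it is not the drone's firmware logic.

```csharp
// Hypothetical model of the rule described in the text; not the drone's firmware.
public sealed class SequenceFilterSketch
{
    // Highest sequence number accepted so far (0 means nothing accepted yet).
    private uint lastAccepted;

    // Returns true when the command would be accepted, false when it would be ignored.
    public bool Accept(uint sequenceNumber)
    {
        if (sequenceNumber <= lastAccepted)
        {
            return false;
        }

        lastAccepted = sequenceNumber;
        return true;
    }
}
```

This is why DroneLoop increments sequenceNumber on every iteration: a repeated or stale number would cause the command to be dropped.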

The DroneState class is the one that deals with determining which command to send:

    public static class DroneState
    {
        public static double StrafeX { get; set; }
        public static double StrafeY { get; set; }
        public static double AscendY { get; set; }
        public static double RollX { get; set; }
        public static bool Flying { get; set; }
        public static bool isFlying { get; set; }

        internal static string GetNextCommand(uint sequenceNumber)
        {
            // Determine if the drone needs to take off or land
            if (Flying && !isFlying)
            {
                isFlying = true;
                return DroneMovement.GetDroneTakeoff(sequenceNumber);
            }
            else if (!Flying && isFlying)
            {
                isFlying = false;
                return DroneMovement.GetDroneLand(sequenceNumber);
            }

            // If the drone is flying, sends movement commands to it.
            if (isFlying && (StrafeX != 0 || StrafeY != 0 || AscendY != 0 || RollX != 0))
                return DroneMovement.GetDroneMove(sequenceNumber, StrafeX, StrafeY, AscendY, RollX);

            return DroneMovement.GetHoveringCommand(sequenceNumber);
        }
    }

The properties StrafeX, StrafeY, AscendY, and RollX define the drone's left/right speed, forward/backward speed, vertical (climb/descend) speed, and rotation speed, respectively. These properties are of type double and accept values between -1 and 1. For example, setting StrafeX to -0.5 moves the drone to the left at half of its maximum speed; setting it to 1 moves it to the right at full speed.

Flying is a property that expresses the requested state (take off or land), while isFlying tracks the state that has already been sent to the drone. In the GetNextCommand method we compare these values to decide which command to send to the drone. The commands themselves are built by the DroneMovement class.

Note that, if no command is specified, the last statement creates the so-called Hovering command, an empty command that keeps the communication channel open between the drone and the device. The drone needs to be constantly receiving messages from the controlling application, even when there’s no action to do and no status has changed.

The most interesting method of the DroneMovement class is GetDroneMove, which composes the movement command to send to the drone. For the other movement-related methods, please refer to the code sample.

    public static string GetDroneMove(uint sequenceNumber, double velocityX, double velocityY, double velocityAscend, double velocityRoll)
    {
        var valueX = FloatConversion(velocityX);
        var valueY = FloatConversion(velocityY);
        var valueAscend = FloatConversion(velocityAscend);
        var valueRoll = FloatConversion(velocityRoll);

        var command = string.Format("{0},{1},{2},{3}", valueX, valueY, valueAscend, valueRoll);
        return CreateATPCMDCommand(sequenceNumber, command);
    }

    private static string CreateATPCMDCommand(uint sequenceNumber, string command, int mode = 1)
    {
        return string.Format("AT*PCMD={0},{1},{2}{3}", sequenceNumber, mode, command, Environment.NewLine);
    }

The FloatConversion method is not listed here; it converts a double value between -1 and 1 into a signed integer that can be used in AT commands such as the PCMD string that controls movement.
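Since FloatConversion is not listed, the following is a possible reconstruction rather than the library's actual code. It assumes the convention from the AR.Drone developer guide, where a floating-point argument is sent as the 32-bit integer that shares its IEEE 754 single-precision bit pattern (the guide's example is -0.8, sent as -1085485875). The class name, the clamping step, and the Hover helper are assumptions added for illustration.

```csharp
using System;
using System.Globalization;

// Hypothetical reconstruction; the library's own FloatConversion may differ.
public static class FloatConversionSketch
{
    // Clamps the value to [-1, 1], narrows it to a 32-bit float, and
    // reinterprets the float's bits as a signed 32-bit integer.
    public static int ToAtInteger(double value)
    {
        double clamped = Math.Max(-1.0, Math.Min(1.0, value));
        byte[] bytes = BitConverter.GetBytes((float)clamped);
        return BitConverter.ToInt32(bytes, 0);
    }

    // Builds a movement command with mode flag 1, using the same argument
    // order as GetDroneMove in the article.
    public static string Move(uint sequenceNumber, double x, double y, double ascend, double roll)
    {
        return string.Format(CultureInfo.InvariantCulture, "AT*PCMD={0},1,{1},{2},{3},{4}\r",
            sequenceNumber, ToAtInteger(x), ToAtInteger(y), ToAtInteger(ascend), ToAtInteger(roll));
    }

    // Builds a hovering command: mode flag 0 and all movement arguments zero.
    public static string Hover(uint sequenceNumber)
    {
        return string.Format(CultureInfo.InvariantCulture, "AT*PCMD={0},0,0,0,0,0\r", sequenceNumber);
    }
}
```

With this sketch, ToAtInteger(0.5) yields 1056964608 and ToAtInteger(-0.8) yields -1085485875, so the values on the wire look nothing like the original fractions.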

The code shown here is available as a free library on NuGet, called AR.Drone 2.0 Interaction Library, which provides everything you need to control the device from takeoff to landing.

AR.Drone UI on NuGet

Thanks to this library, we can forget the implementation details and focus instead on delivering apps that, through different modes of interaction, allow us to pilot the drone.

Intel® RealSense™ SDK

Now let’s look at what is, for me, one of the greatest and easiest-to-use features of the Intel RealSense SDK: speech recognition.

The SDK offers two different approaches to speech:

  • Command recognition (from a given dictionary)
  • Free text recognition (dictation)

The first is essentially a list of commands, defined by the application, in a specified language for instructing the ‘recognizer’. Words not on the list are ignored.

The second is a sort of a recorder that “understands” any vocabulary in a free-form stream. It is ideal for transcriptions, automatic subtitling, etc.

For our project we will use the first option because we want to implement only a finite number of commands to send to the drone.

First, we need to define some variables to use:

        private PXCMSession Session;
        private PXCMSpeechRecognition SpeechRecognition;
        private PXCMAudioSource AudioSource;
        private PXCMSpeechRecognition.Handler RecognitionHandler;

Session is the instance required to access I/O and the SDK’s algorithms; all subsequent objects are created from it.

SpeechRecognition is the instance of the recognition module created with a CreateImpl function inside the Session environment.

AudioSource is the device interface to establish and select an input audio device (in our sample code we select the first audio device available to keep it simple).

RecognitionHandler is the handler object to which we assign the event handler for the OnRecognition event.

Let’s now initialize the session, the AudioSource, and the SpeechRecognition instance.

            Session = PXCMSession.CreateInstance();
            if (Session != null)
            {
                // session is a PXCMSession instance.
                AudioSource = Session.CreateAudioSource();
                // Scan and Enumerate audio devices
                AudioSource.ScanDevices();

                PXCMAudioSource.DeviceInfo dinfo = null;

                for (int d = AudioSource.QueryDeviceNum() - 1; d >= 0; d--)
                {
                    AudioSource.QueryDeviceInfo(d, out dinfo);
                }
                AudioSource.SetDevice(dinfo);

                Session.CreateImpl<PXCMSpeechRecognition>(out SpeechRecognition);

As noted before, to keep the code simple we select the first audio device available: the loop walks the device list backward, so dinfo ends up holding the entry at index 0.

                PXCMSpeechRecognition.ProfileInfo pinfo;
                SpeechRecognition.QueryProfile(0, out pinfo);
                SpeechRecognition.SetProfile(pinfo);

Then we query the system for the current configuration parameters and assign them to a variable (pinfo).

We could also change some parameters in the profile info, such as the recognized language, the recognition confidence level (a higher value requires a stronger match), the end-of-recognition timeout, and so on.

In our case we keep the default parameters of profile 0 (the first one returned by QueryProfile).

                String[] cmds = new String[] { "Takeoff", "Land", "Rotate Left", "Rotate Right", "Advance",
                    "Back", "Up", "Down", "Left", "Right", "Stop" , "Dance"};
                int[] labels = new int[] { 1, 2, 4, 5, 8, 16, 32, 64, 128, 256, 512, 1024 };
                // Build the grammar.
                SpeechRecognition.BuildGrammarFromStringList(1, cmds, labels);
                // Set the active grammar.
                SpeechRecognition.SetGrammar(1);

Next, we define the grammar dictionary that instructs the recognition system. Using BuildGrammarFromStringList we create a simple list of commands and corresponding return values, defining grammar number 1.
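Because the command strings and their labels live in two parallel arrays, it is easy for them to drift out of step. One possible way to keep them together is a single table from which both arrays are derived, as sketched below. The GrammarTableSketch class and its members are hypothetical; only the command names and label values come from the listing above.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper that keeps each spoken command next to its label.
public static class GrammarTableSketch
{
    private static readonly KeyValuePair<string, int>[] Entries =
    {
        new KeyValuePair<string, int>("Takeoff", 1),
        new KeyValuePair<string, int>("Land", 2),
        new KeyValuePair<string, int>("Rotate Left", 4),
        new KeyValuePair<string, int>("Rotate Right", 5),
        new KeyValuePair<string, int>("Advance", 8),
        new KeyValuePair<string, int>("Back", 16),
        new KeyValuePair<string, int>("Up", 32),
        new KeyValuePair<string, int>("Down", 64),
        new KeyValuePair<string, int>("Left", 128),
        new KeyValuePair<string, int>("Right", 256),
        new KeyValuePair<string, int>("Stop", 512),
        new KeyValuePair<string, int>("Dance", 1024),
    };

    // Command strings, in table order, suitable as the cmds argument.
    public static string[] Commands()
    {
        return Entries.Select(e => e.Key).ToArray();
    }

    // Labels in the same order, suitable as the labels argument.
    public static int[] Labels()
    {
        return Entries.Select(e => e.Value).ToArray();
    }

    // Returns the command text for a recognized label, or null if the label is unknown.
    public static string Lookup(int label)
    {
        foreach (var entry in Entries)
        {
            if (entry.Value == label)
            {
                return entry.Key;
            }
        }

        return null;
    }
}
```

The two arrays could then be passed to BuildGrammarFromStringList in place of the hand-written cmds and labels, and Lookup could supply the text to log when a label is recognized.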

It is possible to define multiple grammars to use in our application and activate one at a time when needed, so we could create all the different command dictionaries for all the supported languages and provide a way for the user to switch between the different languages recognized by the SDK. In this case, you must install all the corresponding DLL files for the specific language support (the default SDK setup installs only the US English support assemblies). In this sample, we use only one grammar set with the default installation of US English.

We then select which grammar to make active in the SpeechRecognition instance.

                RecognitionHandler = new PXCMSpeechRecognition.Handler();

                RecognitionHandler.onRecognition = OnRecognition;

These instructions create a new handler object and assign the OnRecognition method, defined below, to its onRecognition event:

        public void OnRecognition(PXCMSpeechRecognition.RecognitionData data)
        {
            var RecognizedValue = data.scores[0].label;
            double movement = 0.3;
            // 500 milliseconds; new TimeSpan(0, 0, 0, 500) would mean 500 seconds.
            TimeSpan duration = TimeSpan.FromMilliseconds(500);
            switch (RecognizedValue)
            {
                case 1:
                    DroneState.TakeOff();
                    WriteInList("Takeoff");
                    break;
                case 2:
                    DroneState.Land();
                    WriteInList("Land");
                    break;
                case 4:
                    DroneState.RotateLeftForAsync(movement, duration);
                    WriteInList("Rotate Left");
                    break;
                case 5:
                    DroneState.RotateRightForAsync(movement, duration);
                    WriteInList("Rotate Right");
                    break;
                case 8:
                    DroneState.GoForward(movement);
                    Thread.Sleep(500);
                    DroneState.Stop();
                    WriteInList("Advance");
                    break;
                case 16:
                    DroneState.GoBackward(movement);
                    Thread.Sleep(500);
                    DroneState.Stop();
                    WriteInList("Back");
                    break;
                case 32:
                    DroneState.GoUp(movement);
                    Thread.Sleep(500);
                    DroneState.Stop();
                    WriteInList("Up");
                    break;
                case 64:
                    DroneState.GoDown(movement);
                    Thread.Sleep(500);
                    DroneState.Stop();
                    WriteInList("Down");
                    break;
                case 128:
                    // A negative StrafeX moves the drone to the left.
                    DroneState.StrafeX = -.5;
                    Thread.Sleep(500);
                    DroneState.StrafeX = 0;
                    WriteInList("Left");
                    break;
                case 256:
                    // A positive StrafeX moves the drone to the right.
                    DroneState.StrafeX = .5;
                    Thread.Sleep(500);
                    DroneState.StrafeX = 0;
                    WriteInList("Right");
                    break;
                case 512:
                    DroneState.Stop();
                    WriteInList("Stop");
                    break;
                case 1024:
                    WriteInList("Dance");
                    DroneState.RotateLeft(movement);
                    Thread.Sleep(500);
                    DroneState.RotateRight(movement);
                    Thread.Sleep(500);
                    DroneState.RotateRight(movement);
                    Thread.Sleep(500);
                    DroneState.RotateLeft(movement);
                    Thread.Sleep(500);
                    DroneState.GoForward(movement);
                    Thread.Sleep(500);
                    DroneState.GoBackward(movement);
                    Thread.Sleep(500);
                    DroneState.Stop();
                    break;
                default:
                    break;

            }
            Debug.WriteLine(data.grammar.ToString());
            Debug.WriteLine(data.scores[0].label.ToString());
            Debug.WriteLine(data.scores[0].sentence);
            // Process Recognition Data
        }

This method reads the label returned in the recognition data and executes the corresponding command (in our case the corresponding flight instruction for the drone).

Every drone command calls a specific DroneState method (TakeOff, GoUp, GoDown, etc.), in some cases with a parameter that specifies the amount of movement or its duration.

Some commands need an explicit call to the Stop method to interrupt the current action; otherwise the drone will continue to move as instructed (refer to the previous code for those commands).

In some cases it is necessary to insert a Thread.Sleep between two commands to allow the previous operation to complete before sending the new command.
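Thread.Sleep blocks the thread that runs the recognition callback for the whole duration of the movement. One possible alternative, sketched here under the assumption that the caller can await a task, is a small helper that starts a movement, waits without blocking, and always issues the stop action. TimedMoveSketch and RunForAsync are hypothetical names, not part of the library.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper; not part of the library described in this article.
public static class TimedMoveSketch
{
    // Runs the start action, waits for the given duration without blocking
    // the calling thread, then runs the stop action even if the wait fails.
    public static async Task RunForAsync(Action start, Action stop, TimeSpan duration)
    {
        start();
        try
        {
            await Task.Delay(duration);
        }
        finally
        {
            stop();
        }
    }
}
```

With the DroneState methods used in the handler, a call might look like await TimedMoveSketch.RunForAsync(() => DroneState.GoForward(0.3), DroneState.Stop, TimeSpan.FromMilliseconds(500));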

To test the recognition even without a drone available, I’ve inserted a variable (controlled by the checkbox in the main window) that enables a drone stub mode, which creates the command but doesn’t send it.
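The stub idea can be sketched with a small interface: the real implementation would write to the UDP socket, while the stub only records what would have been sent. The names ICommandSender and StubCommandSender are assumptions made for this illustration and do not come from the sample code.

```csharp
using System.Collections.Generic;

// Hypothetical abstraction over "send this AT command to the drone".
public interface ICommandSender
{
    void Send(string command);
}

// Stub used when no drone is available: it records commands instead of sending them.
public sealed class StubCommandSender : ICommandSender
{
    private readonly List<string> sent = new List<string>();

    // Commands recorded so far, in the order they were submitted.
    public IReadOnlyList<string> Sent
    {
        get { return sent; }
    }

    public void Send(string command)
    {
        sent.Add(command);
    }
}
```

A checkbox handler could then switch between the stub and a real sender, and the recorded list makes it easy to check which commands a spoken phrase produced.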

To close the application, call the OnClosing method to close and destroy all the instances and handlers and to basically clean up the system.

In the code you can find some debug commands that print some helpful information in the Visual Studio* debug windows when testing the system.

Conclusion

In this article, we have shown how we can interact with a device as complex as a drone using a natural-language interface. We have seen how we can define a simple dictionary of commands, instruct the system to understand it, and consequently control a complex device like a drone in flight. What I show in this article is only a small fraction of the possible ways to operate the drone; many more are possible.

Photo of the flying demonstration at the .NET Campus event session in 2014

About the Author

Marco Dal Pino has worked in IT for more than 20 years, and is a Freelance Consultant working on the .NET platform. He’s part of the staff of DotNetToscana, which is a community focused on Microsoft technologies, and he is a Microsoft MVP for Windows Platform Development. He develops Mobile and Embedded applications for retail and enterprise sectors, and is also involved in developing Windows Phone and Windows 8 applications for a 3rd party company.

Marco has been Nokia Developer Champion for Windows Phone since 2013, and that same year Intel recognized him as an Intel Developer Zone Green Belt for the activity of developer support and evangelization about Perceptual and Intel RealSense technology. He’s also an Intel Software Innovator for Intel RealSense and IoT technologies.

He is a Trainer and speaks at major technical conferences.

Marco Minerva has been working on the .NET platform since its first introduction. Now he is mainly focused on designing and developing Windows Store and Windows Phone apps using Windows Azure as back-end. He is co-founder and president of DotNetToscana, the Tuscany .NET User Group. He speaks at technical conferences and writes for magazines.
