An Introduction to Threading on Intel® Atom™ and MeeGo*


February 21, 2011 11:00 PM PST


Most Intel Atom systems today offer hardware support for two or more threads of execution through multiple cores or Intel® Hyper-Threading Technology. It is worthwhile for developers to explore the advantages of multithreading in order to achieve better application performance or improve the user experience and responsiveness when running on Intel Atom. This paper describes the common Qt classes used to implement threaded applications on MeeGo, and also offers a brief summary of the performance of various threading primitives on Intel Atom processors.

QThreads

On MeeGo, the Qt framework offers a full set of both high- and low-level classes to support multithreading. The lowest-level class is QThread. This class allows the user to directly create and manipulate an OS-level thread, such as one created by pthread_create() on Linux* and MeeGo or CreateThread() on Windows*.

The advantage of using Qt is that it works cross-platform and abstracts away the underlying implementation. Using QThread to multithread code is quite easy, as the simple example below shows:

#include <QCoreApplication>
#include <QThread>
#include <QtGui/QImage>
#include <QString>
#include <QDebug>

class LoadThread : public QThread
{
public:
    LoadThread() { filename = ""; image = NULL; }
    LoadThread(QString str, QImage *img) { filename = str; image = img; }

protected:
    void run() { image->load(filename); }

private:
    QString filename;
    QImage *image;

};


int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);

    QString filename1 = "1.jpg";
    QString filename2 = "2.jpg";
    QString filename3 = "3.jpg";

    QImage image1, image2, image3;

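    // The thread objects live on the stack, so they are destroyed
    // automatically when main() returns (nothing is leaked).
    LoadThread thread1(filename1, &image1);
    LoadThread thread2(filename2, &image2);
    LoadThread thread3(filename3, &image3);

    // Start all three loads before waiting on any of them.
    thread1.start();
    thread2.start();
    thread3.start();

    // Block until each thread has finished writing its image.
    thread1.wait();
    thread2.wait();
    thread3.wait();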

    if (!image1.isNull() && !image2.isNull() && !image3.isNull()) {
        qDebug("All pictures loaded!\n");
    }

    return 0;

}

In this code example, we multithread the loading of three images. A single-threaded implementation would load them sequentially, so whenever the program blocks waiting for disk I/O, nothing else makes progress until that operation completes. By creating three threads that each load one picture, we allow the MeeGo OS to switch to another thread when one becomes blocked.

To use QThread, we create our own class that inherits from QThread and override the run() function. Calling start() on an instance of that class creates a new thread and begins executing the code within run(). The overridden run() function cannot take any parameters, but we can pass data to and from the thread by storing it in members of the derived class. In the example code, this is done by setting the filename and image through LoadThread's constructor.

After the thread starts executing, we need to wait for it to complete before we can access the image into which the thread writes its result. This is done by calling wait() on our thread object. When the program encounters a call to wait() on a thread, it blocks there until the indicated thread has finished executing, and only then continues with the next line.

QThreads require direct management: users are responsible for explicitly creating them and waiting for them before using their results. They might be useful in situations where developers are actively trying to load balance their tasks with explicit knowledge of the differing characteristics of their threads and the number of threads supported by the hardware.
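
For readers who want to try the same pattern without Qt installed, here is a minimal sketch that uses std::thread from the C++ standard library in place of QThread. The class CountThread and the function totalLength() are hypothetical names invented for this illustration, and counting characters is only a placeholder for real work such as loading an image. As with QThread, the data travels through members because run() takes no parameters.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>

// Hypothetical stand-in for the article's LoadThread, built on std::thread
// rather than QThread. Inputs and outputs are stored in members.
class CountThread
{
public:
    CountThread(const std::string &text, std::size_t *out)
        : text_(text), out_(out)
    {
        assert(out != nullptr);  // the result needs somewhere to go
    }

    // Join on destruction so a still-running thread is never abandoned.
    ~CountThread() { wait(); }

    // Comparable to QThread::start(): run() begins on a new OS-level thread.
    void start() { worker_ = std::thread(&CountThread::run, this); }

    // Comparable to QThread::wait(): block until run() has returned.
    void wait()
    {
        if (worker_.joinable())
            worker_.join();
    }

private:
    void run() { *out_ = text_.size(); }  // placeholder for the real work

    std::string text_;
    std::size_t *out_;
    std::thread worker_;
};

// Start three workers, wait for all of them, then read the results.
std::size_t totalLength(const std::string &a, const std::string &b,
                        const std::string &c)
{
    std::size_t n1 = 0, n2 = 0, n3 = 0;
    CountThread t1(a, &n1), t2(b, &n2), t3(c, &n3);
    t1.start();
    t2.start();
    t3.start();
    t1.wait();  // results are safe to read only after wait()
    t2.wait();
    t3.wait();
    return n1 + n2 + n3;
}
```

Calling totalLength("ab", "cde", "") starts three threads and returns the combined length once all three have been waited on.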

QtConcurrent

If direct management of threads is not needed, MeeGo developers can choose to use a thread pool. A thread pool lets users queue up tasks to be completed on worker threads without creating a new thread for every task. Instead, only a fixed number of threads is created, and these threads pull tasks from the queue and execute each one to completion until the queue is empty.

This technique avoids the overhead of creating more system threads than the hardware can run at once.

Qt determines the optimal number of threads to use for thread pool execution based on the number of threads supported by the underlying hardware. For example, on a hyper-threaded single-core Intel Atom system, the number of threads that would be created for the thread pool is two. The recommended way to implement task queuing is through the QtConcurrent namespace, which also offers higher-level features such as parallel maps over collections. If more control over the pool's scheduling algorithm, number of system threads created, etc. is desired, the QThreadPool object underlying the QtConcurrent framework can be accessed and modified (not demonstrated in this paper).
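
To make the idea of a thread pool concrete, the following is a minimal sketch in standard C++, not Qt's implementation. The helpers runOnPool() and defaultWorkerCount() are hypothetical names for this illustration. A fixed number of workers pull tasks from a shared queue until it is empty, and std::thread::hardware_concurrency() stands in for the hardware-based thread count that Qt chooses by default.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Runs every queued task on a fixed set of worker threads and returns how
// many tasks each worker executed. Hypothetical helper for illustration.
std::vector<int> runOnPool(std::queue<std::function<void()>> tasks,
                           unsigned workerCount)
{
    assert(workerCount > 0);
    std::mutex queueLock;                       // guards the shared task queue
    std::vector<int> executed(workerCount, 0);  // per-worker task counts
    std::vector<std::thread> workers;

    for (unsigned id = 0; id < workerCount; ++id) {
        workers.emplace_back([&tasks, &queueLock, &executed, id]() {
            for (;;) {
                std::function<void()> task;
                {
                    std::lock_guard<std::mutex> guard(queueLock);
                    if (tasks.empty())
                        return;                 // queue exhausted: worker exits
                    task = std::move(tasks.front());
                    tasks.pop();
                }
                task();                         // run the task outside the lock
                ++executed[id];                 // each worker updates only its own slot
            }
        });
    }
    for (std::thread &w : workers)
        w.join();
    return executed;
}

// Worker count derived from the hardware, falling back to 1 if unknown.
unsigned defaultWorkerCount()
{
    unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1 : n;
}
```

With three queued tasks and two workers, the two counts returned by runOnPool() always add up to three, although how the tasks are split between the workers depends on scheduling.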

Here we show an example of using QtConcurrent to load images using a thread pool:

#include <QCoreApplication>
#include <QtGui/QImage>
#include <QString>
#include <QtCore>
#include <QDebug>

bool LoadImage(QString str, QImage *img) {
    return img->load(str);
}

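// Reading a QFuture<bool> as a bool blocks until that task's result is ready.
// The bitwise & (rather than &&) evaluates all three futures, so every
// task has finished by the time the test completes.
void CheckResult(QFuture<bool> res1, QFuture<bool> res2, QFuture<bool> res3) {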
    if (res1 & res2 & res3) {
        qDebug("All pictures loaded!\n");
    }
}


int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);

    QString filename1 = "1.jpg";
    QString filename2 = "2.jpg";
    QString filename3 = "3.jpg";

    QImage image1, image2, image3;

    QFuture<bool> result1 = QtConcurrent::run(LoadImage, filename1, &image1);
    QFuture<bool> result2 = QtConcurrent::run(LoadImage, filename2, &image2);
    QFuture<bool> result3 = QtConcurrent::run(LoadImage, filename3, &image3);

    CheckResult(result1, result2, result3);

    return 0;

}

To use the thread pool in Qt, the task to be threaded is written as a function. This can be a global function or a class member. In this example, that function is LoadImage(). Unlike QThread's run() function, a function passed to QtConcurrent::run() can take arguments, and as shown, the filename and image parameters are passed along to LoadImage(). Since we have not changed the default number of threads used in the QThreadPool, on a single-core, hyper-threaded system the Qt framework will create two threads, assign the first two LoadImage() tasks to them, and then reuse one of them to complete the last LoadImage() task.

Another nice concept that the QtConcurrent framework introduces is the QFuture object. QFuture represents the eventual output of the thread, even though it might not exist yet. In this way, users can pass the QFuture object to another function or thread that might be waiting for the data, as illustrated with CheckResult() above. When the computation result stored in a QFuture object is accessed, either by an implicit cast to the result type or by an explicit call to the result() method, the Qt framework will wait for the result to be generated. In this case, in CheckResult(), the threads that are loading the images are waited on when the QFutures res1, res2 and res3 are checked. It is important to make sure all thread values are read, which is why the bitwise & was used to test the result values instead of the short-circuit boolean && operator. Again, the nice thing here is that the program will not wait for the result of the computation until it is actually needed.

Thread pools are an easy way for users to achieve multithreading without the responsibility of managing threads or scaling an implementation across varying numbers of hardware threads. Of course, if more direct management of threads is beneficial (such as knowledge that each thread does a lot of I/O and would block), users can either change the default number of threads created for the thread pool by calling setMaxThreadCount() on the global instance of QThreadPool, or manually spawn and maintain their own QThreads.
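
The blocking-on-access behavior described above can also be tried with std::future from the C++ standard library, which is used here as a stand-in for QFuture. This is only a sketch: loadPlaceholder(), allLoaded() and loadThree() are hypothetical names, the placeholder task does no real image loading, and std::async with std::launch::async starts a thread per task rather than using a pool, so only the future semantics are illustrated.

```cpp
#include <cassert>
#include <future>
#include <string>

// Placeholder standing in for LoadImage(): "succeeds" for non-empty names.
bool loadPlaceholder(const std::string &name)
{
    return !name.empty();
}

// Comparable to CheckResult(): get() blocks until each task has produced
// its value. The non-short-circuit & reads all three results even when
// one of them is false, so every task has finished when this returns.
bool allLoaded(std::future<bool> &r1, std::future<bool> &r2,
               std::future<bool> &r3)
{
    assert(r1.valid() && r2.valid() && r3.valid());
    return r1.get() & r2.get() & r3.get();
}

// Launch three tasks, then wait only at the point where results are needed.
bool loadThree(const std::string &a, const std::string &b,
               const std::string &c)
{
    std::future<bool> r1 = std::async(std::launch::async, loadPlaceholder, a);
    std::future<bool> r2 = std::async(std::launch::async, loadPlaceholder, b);
    std::future<bool> r3 = std::async(std::launch::async, loadPlaceholder, c);
    return allLoaded(r1, r2, r3);
}
```

The caller of loadThree() does not block when the tasks are launched, only inside allLoaded() when the values are read.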

Locking with QMutex

The previous examples of loading three images can be threaded without locks because the work each thread does is independent: each thread loads a separate file into a separate image. However, multithreading often requires locking to guarantee a particular sequence of events or to ensure atomic access to a shared resource. If multiple threads update a shared variable without locking, a race condition may be introduced. Depending on when each thread executes and completes, the value one thread read for its update may be stale because another thread has already written a different value to that variable. Mutexes are one of the basic mechanisms for ensuring consistency, and Qt offers a mutex implementation through the QMutex class. Here is an example using QMutex:

#include <QCoreApplication>
#include <QThread>
#include <QtGui/QImage>
#include <QString>
#include <QDebug>
#include <QMutex>

int totalByteCount;
QMutex countLock;

class LoadThread : public QThread
{
public:
    LoadThread() { filename = ""; image = NULL; }
    LoadThread(QString str, QImage *img) { filename = str; image = img; }

protected:
    void run() {
        image->load(filename);

        if (!image->isNull()) {
            countLock.lock();
            totalByteCount += image->byteCount();
            countLock.unlock();
        }
    }

private:
    QString filename;
    QImage *image;

};

int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);

    QString filename1 = "1.jpg";
    QString filename2 = "2.jpg";
    QString filename3 = "3.jpg";

    QImage image1, image2, image3;

    totalByteCount = 0;

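    // The thread objects live on the stack, so they are destroyed
    // automatically when main() returns (nothing is leaked).
    LoadThread thread1(filename1, &image1);
    LoadThread thread2(filename2, &image2);
    LoadThread thread3(filename3, &image3);

    // Start all three loads before waiting on any of them.
    thread1.start();
    thread2.start();
    thread3.start();

    // totalByteCount is complete only after every thread has finished.
    thread1.wait();
    thread2.wait();
    thread3.wait();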

    if (!image1.isNull() && !image2.isNull() && !image3.isNull()) {
        qDebug("All pictures loaded with a total byte count of %d\n", totalByteCount);
    }

    return 0;
}

Since the three threads now update the same totalByteCount variable, each thread must complete its read, add, and write of that variable without another thread modifying it in between. Otherwise, an update can be lost and the final value of totalByteCount may be wrong. We create a global QMutex called countLock to protect this variable, and each thread calls lock() on that mutex before touching it. If another thread already holds that lock, the thread blocks and waits for it to be freed (which happens when the lock-holding thread calls unlock() on the mutex) before continuing. Locking does carry an overhead (see the Atom performance notes in the last section), so care should be taken to employ it only when needed.
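
The same locking discipline can be sketched in standard C++ with std::mutex in place of QMutex. The function lockedTotal() is a hypothetical name for this illustration. It uses a scoped guard (std::lock_guard) so that the mutex is released automatically at the end of each iteration. Qt provides a similar scoped helper, QMutexLocker, although the listing above calls lock() and unlock() directly.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Each of threadCount threads adds 1 to a shared total addsPerThread times.
// The mutex makes every read-add-write of the total indivisible.
long lockedTotal(int threadCount, int addsPerThread)
{
    assert(threadCount >= 0 && addsPerThread >= 0);
    long total = 0;
    std::mutex totalLock;
    std::vector<std::thread> threads;

    for (int t = 0; t < threadCount; ++t) {
        threads.emplace_back([&total, &totalLock, addsPerThread]() {
            for (int i = 0; i < addsPerThread; ++i) {
                // Acquired here, released when guard goes out of scope.
                std::lock_guard<std::mutex> guard(totalLock);
                total += 1;
            }
        });
    }
    for (std::thread &th : threads)
        th.join();
    return total;  // safe to read: all writers have been joined
}
```

Because every update happens under the lock, lockedTotal(4, 10000) returns exactly 40000. Taking the lock once per addition is deliberately heavy-handed here; accumulating a per-thread subtotal and locking once at the end would pay the locking cost far less often.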

QtConcurrent and Qt Collections

The QtConcurrent namespace also offers support for parallel algorithms by allowing users to transform all the elements in a Qt collection class with a user-defined function in parallel. For example, given a QVector of QStrings and a user function that converts a string to uppercase, all the strings can be converted to uppercase in parallel. QtConcurrent takes care of spawning the threads and provides the user with a single QFuture object that can be tested to see if all the operations have completed. Here is an example of image loading using a QtConcurrent map:

#include <QCoreApplication>
#include <QtGui/QImage>
#include <QString>
#include <QtCore>
#include <QVector>

class MyImage {
public:
    MyImage() { filename = ""; image = NULL; }
    MyImage(QString _filename) { filename = _filename; image = NULL; }
    QString filename;
    QImage *image;
    void load() {
        if (filename != "") {
            image = new QImage(filename);
        }
    }
};

void ImageLoad(MyImage &myImg) {
    myImg.load();
}

int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);

    QVector <MyImage> images;

    images.push_back(MyImage("1.jpg"));
    images.push_back(MyImage("2.jpg"));
    images.push_back(MyImage("3.jpg"));
    QFuture<void> done = QtConcurrent::map(images, ImageLoad);

    done.waitForFinished();

    bool flag = true;

    for (int i = 0; i < images.size(); i++) {
        // image stays NULL if load() was skipped, so check the pointer first.
        if (images[i].image == NULL || images[i].image->isNull()) {
            flag = false;
            break;
        }
    }

    if (flag) {
        qDebug("All pictures loaded!\n");
    }

    return 0;
}

QtConcurrent::map requires the user-supplied map function to take one parameter by reference, whose type matches the type stored in the collection. In the above example, the ImageLoad() function is applied to each element of the images QVector and modifies the MyImage object stored there in place. The return value of the map function is not used. To wait for the function to be applied to every element in the QVector, we call waitForFinished() on the QFuture returned by QtConcurrent::map() before we can safely use the collection to get our results. This single QFuture object signals the completion of the entire operation, and it is simpler to use than creating individual threads ourselves, as we do not have to keep track of the state of each thread.
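
The uppercase conversion mentioned earlier makes a compact illustration of an in-place parallel map. The sketch below uses only the C++ standard library. parallelMapInPlace() and toUpper() are hypothetical helpers, not QtConcurrent functions, and the fixed chunking used here is an assumption of this sketch rather than a description of how QtConcurrent schedules its work.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// User function: takes one element by reference and modifies it in place.
void toUpper(std::string &s)
{
    for (char &ch : s)
        ch = static_cast<char>(std::toupper(static_cast<unsigned char>(ch)));
}

// Applies fn to every element, splitting the index range into contiguous
// chunks with one thread per chunk. Returns only after all chunks are done,
// which plays the role of waitForFinished() in the listing above.
template <typename T, typename Fn>
void parallelMapInPlace(std::vector<T> &items, Fn fn, unsigned workerCount)
{
    assert(workerCount > 0);
    const std::size_t n = items.size();
    const std::size_t chunk = (n + workerCount - 1) / workerCount;  // ceil(n / workers)
    std::vector<std::thread> workers;

    for (std::size_t begin = 0; begin < n; begin += chunk) {
        const std::size_t end = std::min(begin + chunk, n);
        workers.emplace_back([&items, fn, begin, end]() mutable {
            for (std::size_t i = begin; i < end; ++i)
                fn(items[i]);  // chunks do not overlap, so no locking is needed
        });
    }
    for (std::thread &w : workers)
        w.join();
}
```

For example, calling parallelMapInPlace(names, toUpper, 2) on a vector of strings converts every string to uppercase, and the vector is safe to read as soon as the call returns.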

Thread Primitives on Atom

Using Qt to achieve parallelism as illustrated in the above four examples is not difficult. However, multithreading does incur some overhead, and that cost must be considered when calculating the performance benefit of threading. On an N470 1.83 GHz Intel Atom system running a beta tablet build of MeeGo, we measured the overhead of some thread primitives. Creating a thread costs about 366,000 clock cycles, context switching about 2,100 clock cycles, and locking/unlocking a QMutex about 120 clock cycles. These results are not unexpected given the lower (at this time) clock frequency and IPC of the Atom CPU compared to current desktop CPUs, and they are within 3X of desktop figures.

The Atom CPU performs thread management in a manner comparable to desktop CPUs. Therefore, the same general rule for threading on any CPU applies: if the work to be accomplished takes more clock cycles than the overhead incurred by the threading primitives needed to implement multithreading, then threading would be a good way to increase performance. On MeeGo, using QThreads when direct management of threads is needed, or the QtConcurrent namespace APIs when it is not, is an easy way to introduce multithreading into programs.
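
The rule above amounts to simple arithmetic, sketched here in C++ with the three measured costs as constants. The function names, and the example counts of threads, context switches and lock operations in the usage note, are assumptions made for illustration; only the per-primitive cycle counts and the 1.83 GHz clock come from the measurements above.

```cpp
#include <cassert>

// Costs measured on the N470 1.83 GHz Intel Atom system, in clock cycles.
const long kThreadCreateCycles    = 366000;
const long kContextSwitchCycles   = 2100;
const long kMutexLockUnlockCycles = 120;

// Estimated overhead of a design that creates `threads` threads and incurs
// `switches` context switches and `locks` lock/unlock pairs.
long overheadCycles(long threads, long switches, long locks)
{
    assert(threads >= 0 && switches >= 0 && locks >= 0);
    return threads * kThreadCreateCycles
         + switches * kContextSwitchCycles
         + locks * kMutexLockUnlockCycles;
}

// Converts cycles to microseconds; a clock in MHz is cycles per microsecond.
double cyclesToMicroseconds(long cycles, double clockMHz)
{
    assert(clockMHz > 0.0);
    return cycles / clockMHz;
}

// Rule of thumb from the text: threading pays off only when the work
// being threaded costs more cycles than the threading overhead itself.
bool worthThreading(long workCycles, long overhead)
{
    return workCycles > overhead;
}
```

At 1.83 GHz (1830 MHz), creating one thread costs 366,000 / 1830 = 200 microseconds. A hypothetical design with three threads, six context switches and three lock/unlock pairs incurs 3 × 366,000 + 6 × 2,100 + 3 × 120 = 1,110,960 cycles, or about 607 microseconds, so each unit of threaded work should be comfortably larger than that before threading is likely to help.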

For further reading on the Qt classes for multithreading, Intel Atom development and hyper-threading, check out the following links:

http://doc.qt.nokia.com/latest/qthread.html
http://doc.qt.nokia.com/latest/qtconcurrent.html
http://software.intel.com/en-us/articles/developers-guide-to-atom-part-1-of-4/
http://www.intel.com/intelpress/sum_hyperthreading.htm

About the Author

Christine M. Lin is a Senior Software Engineer working in the Intel Atom Platforms Engineering group in SSG. She has a B.S. in Electrical Engineering and Computer Science from the University of California at Berkeley and has worked at Intel for 13 years. She lives in Sunnyvale, California with her husband and two young sons.