By Xiao-Feng Li
User experience evaluation is different from traditional performance evaluation. The scores of many performance benchmarks cannot reflect real user experience, because they measure only the steady state of the system, whereas user interactions usually involve system state transitions. In this document, we introduce the Android Workload Suite (AWS), a set of Android workloads that map user interactions to system behavior and use software stack metrics to measure the interaction scenarios. We expect AWS to reflect the representative usage of Android client devices, and to be used to evaluate and validate Android optimizations in performance, power, and user experience.
User interaction scenarios with client devices
To systematically evaluate the user experience of client devices, we need a set of standard workloads that represent the typical usage scenarios and return metric values to users. To construct such a workload suite, we follow these steps:
- Define the representative usage scenarios
- Map user behavior to system software operations
- Construct workloads
To define user interaction scenarios, we extensively surveyed available materials, including public documents from key industry players, popular applications from application markets, built-in applications on shipping devices, and form-factor usages of tablets and smartphones. We also investigated the user interaction life-cycle and the software design of the Android source code. We partition client usages into four categories, as shown in Figure 1. All are important for a comprehensive workload suite.
Each usage model includes multiple use scenarios, distinguished by the nature of the system computation involved. For example, the use scenarios in the browser usage life-cycle can be shown as a state transition graph, as in Figure 2.
Note that this browser interaction lifecycle covers common, general browsing and does not include HTML5 technology. For HTML5, the scenarios would need to be specific to the concrete web application developed in HTML5.
Every scenario in the client usage models can be roughly classified into one of the following three scenario categories:
- User operations.
This category mainly includes the common interaction scenarios such as browsing, gaming, authoring, setting, and configuring. The touch screen and sensors are the major input devices. This category also includes I/O and communication scenarios.
- Loading and rendering.
This category mainly includes system computation scenarios such as a browser loading a web page, an eBook reader opening a document, or a gallery displaying an image. A rendering scenario is usually part of a loading scenario, but is sometimes considered a separate scenario that follows the loading process, as with HTML5 rendering, media rendering, and graphics rendering.
- Task management.
The two categories above cover the common application scenarios. The last category includes the scenarios for application management, such as application launch, task switching, notification alerts, incoming calls, and multitasking management.
Each scenario may expose specific behaviors within the system and hence requires specific metrics to measure it. The major metrics we consider essential for user experience are classified into two kinds, with each kind having two measurement areas.
- How a user controls a device. This aspect has two main measurement areas:
  - Accuracy/fuzziness. It evaluates what accuracy, fuzziness, resolution, and range the system supports for inputs from the touch screen, sensors, and other sources. For example, how many pressure levels the system supports, how close the sampled touch event coordinates are to the fingertip's track on the screen, how many fingers can be sampled at the same time, etc.
  - Coherence. It evaluates the drag lag distance between the fingertip and the dragged graphic object on the screen. It also evaluates the coherence between the user operations and the sensor-controlled objects, e.g., the angular difference between a tilt-controlled water flow and the tilt angle of the device.
- How a device reacts to a user. This aspect also has two measurement areas:
  - Responsiveness. It evaluates the time between an input being delivered to the device and the device showing a visible response. It also includes the time spent to finish an action.
  - Smoothness. This area evaluates the smoothness of graphic transitions using maximal frame time, frame time variance, FPS, frame drop rate, etc. As we have discussed, FPS alone cannot capture the full user experience of smoothness.
Based on the methodology described so far, we can refine the browser interaction lifecycle with the measurement areas for each scenario as shown in Figure 3.
In order for the workload suite to be representative of both tablets and smartphones, we investigated their usage differences. Unsurprisingly, the key difference is size. A smartphone is usually used as a handy gadget, like a Swiss Army® knife, with the following features:
- Phone, in voice and video
- Music player, for music and podcast
- Camera, for shooting photos and video, barcode scanning, and face recognition
- Navigation, with GPS, AGPS, compass, etc.
- Communicator, for chatting over text and multimedia
- Book/News reader
- Other utilities, like flashlight, night vision, etc.
In terms of application design, smartphone applications are designed to fit the small screen size. Many smartphone games are cartoon-style and have lightweight animations. The sensor controls in games are usually simple, and many are based on the accelerometer: smartphone games more often use "shaking" as a control because it is easy to shake a phone held in one hand. By comparison, tablet games more often use slow "tilting" as a control, leveraging the gyroscope sensor. Besides the difference in sensor usage, tablets, with their larger screens, have the following additional characteristics:
- A more realistic viewing experience in graphics and actions
- Easier or richer control through touch and sensors, such as virtual controllers in games, rich editors, and handwriting
- More space for content such as news, education, and eBooks
- Support for more than one player in games and interactive education
- PC-like web access for the browser and other information portals such as RSS readers
- More small utility apps for daily use, kept on the user's desk for convenient access. By comparison, smartphones are more often carried in the user's pocket
Due to the differences between smartphones and tablets, some scenarios that are typical of one form factor may not be representative of the other. For example, smartphones usually have a status bar, whereas tablets use a system bar. The browser application on a smartphone usually switches its window when opening a new web page, while on a tablet it generally opens the page in a new tab.
At the same time, some scenarios exist on both form factors but should have design variants for each. For example, a 2D game has more animated sprites in its tablet profile than in its smartphone profile. A browser scenario can load PC web pages on a tablet, while using mobile web pages on a smartphone.
Workload construction for Android user interaction evaluation
With the use scenarios and measurement areas defined, we can construct workloads to reflect the interesting scenarios and to measure the user experience.
Before developing the workloads, we should understand the relationship between workloads and tools. Workloads characterize the representative usage model of the system, while tools analyze the system behavior. A tool does not itself represent a use case of the device; it analyzes the use case. At the same time, the common part of multiple workloads can be abstracted into a tool so that it can be reused across the workloads. Figure 4 below shows the various kinds of workloads.
The top half of Figure 4 shows the common user interaction scenarios in an Android system. The "Input" triggers the execution of "Activity 1", which in turn invokes "Service 1". Then "Service 1" communicates with "Service 2", which launches "Activity 2". The "Output" is extracted from "Activity 2".
The bottom half of Figure 4 shows the four ways in which we measure the system.
- Standalone workload. It runs as a complete workload without the user giving inputs, and outputs the result when the execution finishes
- Micro workload. It stresses only certain execution paths of the stack, and is not a complete application of the platform
- Measurement tool. It allows the engineer to provide inputs, and then returns the metrics results. It is actually a tool that can process different inputs
- Scenario driver of built-in app. It provides inputs to trigger the built-in applications and extracts outputs from them. One usage scenario is to provide standard inputs to different devices to measure their browser user experience
The workloads we construct for Android user experience evaluation include all four kinds, each for a different purpose. We expect our final workload suite to consist mostly of kind 1 and kind 4: kind 1 workloads are easy to use for white-box investigation, and kind 4 workloads are useful for investigating various devices as black boxes.
Once we decide upon a workload and its scenarios, we need to have a good understanding of the Android software stack for every scenario, and then choose the right metrics for every scenario.
Since our goal is to provide an engineering tool for engineers to evaluate and optimize Android user experience, we expect our evaluation methodology to be objective. We set up the following criteria for our measurement of user experience.
- Perceivable. The metric has to be perceivable by a human being. Otherwise, it is irrelevant to the user experience.
- Measurable. The metric should be measurable by different teams. It should not depend on special infrastructure that is available only to certain teams.
- Repeatable. The measured result should be repeatable in different measurements. Large deviations in the measurement mean that it is a bad metric.
- Comparable. The measured data should be comparable across different systems. Software engineers can use the metric to compare the different systems.
- Reasonable. The metric should help engineers reason about the causality of software stack behavior. In other words, the metric should map to the software behavior, and be computable from the software stack execution.
- Verifiable. The metric can be used to verify an optimization. The measured result before and after the optimization should reflect the change of the user experience.
- Automatable. For software engineering purposes, we expect that the metric can be measured largely unattended. This is especially useful in regression tests or pre-commit tests. This criterion is not strictly required though, because it is not directly related to user experience analysis and optimization.
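The Repeatable criterion in the list above can be checked numerically. The following is a minimal sketch, assuming the same workload is run several times and that "large deviation" is judged by the coefficient of variation; the 5% limit and the sample values are hypothetical and do not come from AWS.

```python
from statistics import mean, stdev

# Hypothetical acceptance limit: the text does not specify a number.
MAX_CV = 0.05


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean of repeated runs."""
    return stdev(samples) / mean(samples)


# Hypothetical response times (ms) from five runs of one workload.
runs_ms = [62.0, 60.0, 64.0, 61.0, 63.0]
cv = coefficient_of_variation(runs_ms)
repeatable = cv <= MAX_CV
print(f"CV = {cv:.3f}, repeatable = {repeatable}")
```

The same check can be applied to any metric the workload reports, which also helps with the Verifiable criterion: a change smaller than the run-to-run deviation cannot confirm an optimization.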
Take video playing evaluation as an example. Traditional performance benchmarks measure only video playback performance, with metrics like FPS (frames per second) or frame drop rate. This methodology has at least two problems when evaluating user experience. The first is that video playback is only part of the user interactions in playing video. A typical life-cycle of user interaction usually includes at least the following links: "launch player" → "start playing" → "seek progress" → "video playback" → "back to home screen", as shown in Figure 5. Good performance in video playback alone cannot characterize the real user experience in playing video; user interaction evaluation is a superset of traditional performance evaluation. The other problem is that FPS is not enough to evaluate the smoothness of video playback. We describe the common challenges in workload construction in the next section with a case study.
Workload construction case study
We use a browser scrolling scenario to discuss the workload construction process. Figure 6 shows the interactions when a user scrolls a page top-down in a browser.
As shown in Figure 6, at time T0, the finger presses on the screen and starts to scroll the page from position P0. When the finger reaches position P1 at time T1, the page content starts to move, following the finger. When the page content reaches position P1 at time T2, the finger has moved to position P2. That is, during the finger movement, there is a lag distance between the page content and the finger. At time T3, the finger is released from the screen, and the page content finally reaches the position where the finger was released.
In this scenario, we choose to measure the following three aspects:
- Response time - How fast the content starts to move in response to finger scrolling
- Lag distance - How far the content movement lags behind the finger
- Smoothness - How smoothly the browser animates the scrolling
To measure the response time, we need to understand the software internals of page scrolling. The scrolling process is shown in Figure 7.
The figure has three rows, for input raw events, browser events, and browser drawing respectively. The screen touch sensor detects the touch operation and generates raw events to the system. When the framework receives the raw events, it transforms them into motion events, such as ACTION_DOWN, ACTION_UP, and ACTION_MOVE. Every event has an associated coordinate pair (X, Y). The browser computes the distance between the current move event and the starting position (the coordinate of the ACTION_DOWN event). If the distance reaches a platform-specified threshold value, the browser considers the move events to be part of a scroll gesture, and starts to scroll the page content accordingly. The browser scrolls the page content by drawing new frames at the moved position on the screen. The user then sees continuous scrolling of the page content.
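The threshold test described above can be sketched as follows. This is an illustration only, not the browser's actual code: the `TouchEvent` class, the `first_scroll_event` helper, the 8-pixel threshold, and the sample events are all hypothetical, and the real threshold is platform specified.

```python
import math
from dataclasses import dataclass

# Hypothetical threshold in pixels; the real value is platform specified.
SCROLL_THRESHOLD_PX = 8.0


@dataclass
class TouchEvent:
    action: str      # "ACTION_DOWN", "ACTION_MOVE", or "ACTION_UP"
    x: float
    y: float
    time_ms: float


def first_scroll_event(events, threshold=SCROLL_THRESHOLD_PX):
    """Return the first ACTION_MOVE event whose distance from the
    ACTION_DOWN position reaches the threshold, or None if none does."""
    start = None
    for ev in events:
        if ev.action == "ACTION_DOWN":
            start = ev                      # remember the starting position
        elif ev.action == "ACTION_MOVE" and start is not None:
            if math.hypot(ev.x - start.x, ev.y - start.y) >= threshold:
                return ev                   # scroll gesture recognized here
        elif ev.action == "ACTION_UP":
            start = None                    # gesture ended without scrolling
    return None


# Hypothetical event trace: the finger moves up the screen.
events = [
    TouchEvent("ACTION_DOWN", 100, 500, 0),
    TouchEvent("ACTION_MOVE", 100, 497, 8),
    TouchEvent("ACTION_MOVE", 101, 490, 16),
    TouchEvent("ACTION_MOVE", 101, 470, 24),
    TouchEvent("ACTION_UP", 101, 470, 32),
]
scroll_start = first_scroll_event(events)
print(scroll_start)
```

In this trace the first move event is only 3 pixels from the starting position, so the gesture is recognized at the second move event.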
Figure 8 shows how we measure the responsiveness of browser scrolling. The response time of browser scrolling is the time from when the first raw event is delivered to when the first scrolling frame is drawn.
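As a minimal sketch of this definition, assuming the raw event times and the scrolling frame draw times have already been logged as millisecond timestamps on the same clock (the function name and sample values are hypothetical):

```python
def response_time_ms(raw_event_times_ms, scroll_frame_times_ms):
    """Time from the first raw event to the first scrolling frame drawn."""
    first_event = min(raw_event_times_ms)
    # Ignore any frame logged before the gesture started.
    frames_after = [t for t in scroll_frame_times_ms if t >= first_event]
    if not frames_after:
        raise ValueError("no scrolling frame was drawn after the first raw event")
    return min(frames_after) - first_event


# Hypothetical log: raw events every 8 ms, first scrolling frame at 1062 ms.
raw_events = [1000.0, 1008.0, 1016.0, 1024.0]
scroll_frames = [1062.0, 1079.0, 1095.0]
latency = response_time_ms(raw_events, scroll_frames)
print(f"response time: {latency:.1f} ms")
```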
To measure the smoothness of page scrolling, we log the time at which each scrolling frame is drawn, as shown in Figure 9. We then compute the maximal frame time, the number of frames longer than 30 ms, the FPS, and the frame time variance to represent the smoothness.
One tricky part of smoothness measurement is determining which frames are the scrolling frames. It is easy to determine the first frame. The difficulty is determining the last frame. When the finger is released from the screen, a few more frames are still drawn as a result of the scrolling momentum (unless the finger moves very slowly). We can take as the last frame either the frame drawn right before the finger is released, or the real last frame, when the browser re-renders the page. To simplify the design, we choose the former approach, based on the assumption that the smoothness before the finger is released adequately reflects the smoothness of the entire scrolling process.
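The smoothness metrics can be sketched as follows, under two assumptions that the text does not spell out: a frame time is taken to be the interval between two consecutive frame draw timestamps, and the variance is the population variance of those intervals. The function name and the sample log are hypothetical.

```python
from statistics import pvariance

LONG_FRAME_MS = 30.0  # the 30 ms limit named in the text


def smoothness(frame_times_ms, release_time_ms):
    """Smoothness metrics over the frames drawn up to the finger release."""
    # Frames drawn after the release (scrolling momentum) are excluded.
    frames = sorted(t for t in frame_times_ms if t <= release_time_ms)
    if len(frames) < 2:
        raise ValueError("need at least two scrolling frames")
    # Assumption: frame time = interval between consecutive frame draws.
    intervals = [b - a for a, b in zip(frames, frames[1:])]
    duration_ms = frames[-1] - frames[0]
    return {
        "max_frame_time_ms": max(intervals),
        "long_frames": sum(1 for d in intervals if d > LONG_FRAME_MS),
        "fps": 1000.0 * len(intervals) / duration_ms,
        "frame_time_variance": pvariance(intervals),
    }


# Hypothetical frame log (ms) with one 48 ms stall; finger released at 120 ms.
frame_log = [0.0, 16.0, 32.0, 80.0, 96.0, 112.0, 128.0, 144.0]
metrics = smoothness(frame_log, release_time_ms=120.0)
print(metrics)
```

In this example the FPS is still above 44 even though one frame took 48 ms, which illustrates why maximal frame time and variance are reported alongside FPS.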
To measure the lag distance between the fingertip and the page content, we log the coordinate values and the timestamps of all the raw events and the drawn frames. For each frame, we compute the maximal distance between the frame and the events that happen before the frame is drawn. We denote it as Distance[k] for frame k. Then, for the entire scrolling process, we compute the lag distance as the maximum of Distance[k] over all frames. The approach is illustrated in Figure 10.
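One possible reading of this computation is sketched below. It assumes that finger and content positions are both expressed as displacement along the scroll direction from the start of the gesture, and that the distance is the signed amount by which the finger leads the content, so that early events near the starting position do not dominate the result. These choices, the function name, and the sample data are assumptions, not details given in the text.

```python
def lag_distance(events, frames):
    """events: list of (time_ms, finger_displacement)
    frames: list of (time_ms, content_displacement)
    Returns (overall lag distance, list of Distance[k] per frame)."""
    distances = []
    for frame_time, content_pos in frames:
        # Only events that happened before the frame was drawn count.
        earlier = [pos for t, pos in events if t <= frame_time]
        if earlier:
            # Distance[k]: largest lead of the finger over the content.
            distances.append(max(pos - content_pos for pos in earlier))
    if not distances:
        raise ValueError("no frame was drawn after any raw event")
    return max(distances), distances


# Hypothetical logs: displacement in pixels since the gesture started.
raw_events = [(0, 0), (10, 20), (20, 45), (30, 70), (40, 90)]
drawn_frames = [(16, 5), (32, 40), (48, 85)]
lag, per_frame = lag_distance(raw_events, drawn_frames)
print(lag, per_frame)
```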
In order to make the measurement repeatable, we use the Android UXtune* toolkit to generate the standard input gesture.
We have developed two versions of the browser scrolling measurement for different purposes. One is a standalone workload that packages a browser together with the scenario driver that triggers the automatic execution and measurement. The other has only the scenario driver, which triggers the device's built-in browser to execute for the measurement. The former can be used to compare the Android frameworks of different devices, while the latter is mainly used to compare the devices' built-in browsers.
Android Workload Suite (AWS) and user experience optimization
Based on the methodology described in the preceding sections, we developed the Android Workload Suite (AWS) to drive and validate our Android user experience optimizations.
Table 1 shows the AWS 2.0 workloads. The use cases were selected based on our extensive survey of the mobile device industry, market applications, and user feedback.
Table 1. Android workload suite (AWS) v2.0
We have established a systematic methodology for Android user experience optimization. It includes the following steps.
Step 2. Define the software stack scenarios and metrics that transform the user experience issue into a software symptom
Step 3. Develop a software workload to reproduce the issue in a measurable and repeatable way. The workload reports the metric values that reflect the user experience issue
Step 4. Use the workload and related tools to analyze and optimize the software stack. The workload also verifies the optimization
Step 5. Get feedback from the users and try more applications with the optimization to confirm the user experience improvement
For step 3, we basically rely on the Android Workload Suite (AWS). For step 4, we have developed the Android UXtune toolkit to assist user interaction analysis in the software stack.
AWS is still evolving based on user feedback and Android platform changes.
Some online public websites have useful information on user interactions and experience.
The author thanks his colleagues Greg Zhu and Ke Chen for their great support in developing the methodology for Android user experience optimizations.
About the Author
Xiao-Feng Li is a software architect in the System Optimization Technology Center of the Software and Services Group of Intel Corporation. Xiao-Feng has extensive experience in parallel software design and runtime technologies. Before he joined Intel in 2001, Xiao-Feng was a manager at Nokia Research Center. Xiao-Feng enjoys ice-skating and Chinese calligraphy in his leisure time.