Last Modified On: January 29, 2009 1:05 AM PST
Processing power and memory size in the average computer have increased significantly over the years. Because most computers implement virtual memory and have large amounts of physical memory, many programmers have become desensitized to the need to conserve physical memory. Many may wonder, "Why should I care about memory? The OS takes care of everything for me." If you are writing programs targeted at "ultra" mobile devices such as mobile phones or mobile internet devices (MIDs), it is important that you care. An example of this kind of device is the Apple iPhone*. In the iPhone SDK, Apple states that only one iPhone application can run at a time. This restriction does not come from the operating system, which is based on a multitasking Unix OS; it comes from the small memory size. Desktops and laptops run multiple applications easily because they typically have a gigabyte or more of RAM, and if an application needs more, the OS uses virtual memory to page data in and out of physical memory using "fast enough" storage devices. Mobile devices may have similarly featured multi-process, multitasking operating systems and virtual memory systems, but they do not have large amounts of RAM available for applications to use. This paper presents information and specific examples on how to write applications with memory size and usage in mind so that performance does not degrade significantly when multiple applications are running at the same time.
In order to use memory efficiently in an application, one has to understand how the application uses memory. Some people confuse optimizing for memory with optimizing for the size of a data storage device such as a hard drive or solid-state drive. I bring this up because I do not want the reader to confuse the size of the program on the mass storage device with the size of the program "in memory". A perfect example: when I asked a couple of C/C++ programmers what they do to minimize the amount of memory used in a program, they said they use the compiler options to "optimize for size". It is true that this can reduce the in-memory footprint of an application, but a smaller file on disk does not necessarily mean less RAM in use. There is a relationship between on-disk size and in-RAM size, since the executable file on disk is very similar to the in-memory footprint, but I digress. Let us first look at what a program looks like "on disk".
There are predominantly two program file formats: Executable and Linking Format (ELF) and Portable Executable (PE). ELF files typically run on Linux and other Unix operating systems, while PE files typically run on Windows* operating systems. (Apple's operating systems use a third format, Mach-O, which is not covered here.)
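As a rough illustration of how the two formats can be told apart on disk, the sketch below classifies a file by its leading magic bytes. The names `BinaryFormat` and `detect_binary_format` are made up for this example; the check is deliberately minimal and is not how any particular loader validates a file.

```cpp
#include <array>
#include <fstream>
#include <string>

// Hypothetical names for this sketch; not part of any OS or toolchain API.
enum class BinaryFormat { Elf, Pe, Unknown };

// Classifies a file by the magic bytes at offset 0.
BinaryFormat detect_binary_format(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::array<char, 4> magic{};
    in.read(magic.data(), static_cast<std::streamsize>(magic.size()));
    const std::streamsize got = in.gcount();  // 0 if the file could not be opened

    // ELF files begin with the bytes 0x7F 'E' 'L' 'F'.
    if (got >= 4 && magic[0] == 0x7F && magic[1] == 'E' &&
        magic[2] == 'L' && magic[3] == 'F') {
        return BinaryFormat::Elf;
    }
    // PE files begin with the DOS header signature "MZ". A stricter check
    // would follow the offset stored at 0x3C and look for "PE\0\0" there;
    // this sketch stops at the first two bytes.
    if (got >= 2 && magic[0] == 'M' && magic[1] == 'Z') {
        return BinaryFormat::Pe;
    }
    return BinaryFormat::Unknown;
}
```

For instance, `detect_binary_format("/bin/ls")` would be expected to report `BinaryFormat::Elf` on a typical Linux system.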
This paper is not going to go into significant detail on file formats, but it will describe the basics so you can understand how executable files are loaded in memory by the operating system and how to minimize an application's runtime memory footprint.
It is important to understand the file format and what the loader has to do because the executable file on disk is very similar to the in-memory footprint after the OS has loaded it. This is intentional, so that the program loader has to do a minimal amount of work when loading the program into memory. The program loader takes the binary files and loads them as specified in the header sections. Once the program is loaded in memory, the OS can for the most part treat it like any other memory-mapped file.
Having identified file formats, let us look at the types of binary objects that compilers and linkers generate to make up the PE or ELF content. The three executable binary file types generated by a compiler/linker are executables, static libraries, and shared libraries. Executables and shared libraries are loaded by the operating system at runtime, as opposed to static libraries, which are included directly in the other two file types and "fixed at link time".
Partitioning software into classes and libraries is in general good coding practice. However, the programmer needs to be aware of the implementation. For example, static libraries can be a problem for memory use: a static library is duplicated in physical memory by every separately linked binary object that includes it. I have seen many applications whose static libraries duplicate code several times in the same process or across multiple processes. For example, an executable and a .dll or .so that both use the same static library will each contain a copy of the code, since they are linked independently.
A shared library’s implementation is different from that of a static library, and the loader has to treat it differently; hence, the ELF and PE file formats have to account for the difference. The ELF file format distinguishes the two by providing a Relocation Header Table, whereas the PE file format embeds the information in the individual sections. The most common sections of the ELF/PE formats are the following:
The sections of interest are the .text (code) and .data sections that may require fix-up code.
On Windows, a shared library is known as a .dll, or dynamic link library, and on Linux it is known as a .so, or shared object. Although the implementations differ, the characteristics of these libraries are similar. From a programmer’s point of view, shared libraries (.dll/.so files) provide the code-sharing convenience of static libraries without much of the memory duplication. Multiple applications may use the same set of libraries without increasing the total size of the text in memory when they run at the same time. The OS may account some of the same memory to each application that uses a given .dll/.so, but in reality that memory is shared with the other processes. There is only one copy of the read-only memory in physical memory, but its usage is charged against every process that has loaded the dynamic library. The actual memory in use can be calculated by subtracting out all shared memory that is counted more than once. If you are interested in calculating exact numbers, there are utilities that parse the ELF or PE sections and identify which sections are shareable.
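To make the shared-versus-private accounting concrete, here is a minimal sketch that totals the shared and private resident memory of a process from text in the format of Linux's /proc/<pid>/smaps. This is one assumed way to get at the numbers, not the utilities the paragraph above refers to, and the names `MemoryTotals` and `sum_smaps` are invented for the example.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper type for this sketch.
struct MemoryTotals {
    long shared_kb = 0;   // Shared_Clean + Shared_Dirty
    long private_kb = 0;  // Private_Clean + Private_Dirty
};

// Sums the shared and private resident sizes found in smaps-format text.
// Lines look like "Shared_Clean:   36 kB"; everything else is skipped.
MemoryTotals sum_smaps(std::istream& smaps) {
    MemoryTotals totals;
    std::string line;
    while (std::getline(smaps, line)) {
        std::istringstream fields(line);
        std::string key;
        long kb = 0;
        if (!(fields >> key >> kb)) {
            continue;  // mapping header lines and non-numeric fields land here
        }
        if (key == "Shared_Clean:" || key == "Shared_Dirty:") {
            totals.shared_kb += kb;
        } else if (key == "Private_Clean:" || key == "Private_Dirty:") {
            totals.private_kb += kb;
        }
    }
    return totals;
}
```

On Linux this could be fed from `std::ifstream("/proc/self/smaps")`. The shared total is the part that is charged to every process mapping the same pages, so it is the part to subtract when adding up several processes.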
So why would you ever use static libraries? Shared libraries are nice but they may come with a cost in startup performance. In order to use shared libraries the operating system needs to do the following when the binary is loaded:
All of this takes time and space. Both Windows and Linux also provide explicit dynamic linking routines: LoadLibrary() and GetProcAddress() on Windows, and dlopen() and dlsym() on Linux. These routines effectively call the same routines that the implicit linker calls.
On Linux the only shared library that is statically linked is glibc. Linux uses a pre-link virtual address for all other shared libraries. A compiler switch, -fPIC, helps with the fix-up by telling the compiler to generate position-independent code, which avoids referencing data by absolute address as much as possible. A developer asked me, "Why should we care? Doesn’t the operating system take care of all this?" The operating system will take care of any fix-up code for you, but the result is very inefficient if you don’t use the right options and create a shared library just for the sake of creating a shared library. For example, a text relocation is a memory address in the "read-execute" text segment of a shared library that has to be patched at load time. Say a non-PIC text segment calls into a memory location that needs to be "fixed up" by the runtime. On Linux this is performed by ld.so in glibc during the startup of the dynamically linked executable. If the developer didn’t design the code correctly, there would be significant memory and fix-up penalties associated with the code. For example, a non-PIC compiled libmpg3 library has roughly 6000 memory locations left inside the shared library to point to some 300 functions and data referred to by the instructions.
So why not use -fPIC all the time? There may be cases where you do not want to use PIC. Setting up the PIC register (typically ebx) takes about three instructions, plus an additional one to two instructions per symbol accessed in the data object. In addition, because the PIC register is in use, the compiler is not free to use it for other purposes, which can result in less than optimal code performance by limiting the number of registers available to the compiler.
On Linux, to test whether a shared object requires relocations in its text segment, run a tool such as "readelf -d binary.so" and inspect the output for any TEXTREL entry. The presence of TEXTREL indicates that text relocations exist.
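The TEXTREL check can also be done programmatically. The following is a minimal sketch, assuming a Linux system with <elf.h>, a 64-bit ELF file, and a file whose byte order matches the host; it walks the dynamic section the way "readelf -d" would list it. The names `TextRelStatus` and `check_text_relocations` are invented for this example.

```cpp
#include <elf.h>

#include <cstdint>
#include <cstring>
#include <fstream>
#include <string>

// Hypothetical result type for this sketch.
enum class TextRelStatus { Present, Absent, NotElf64 };

// Scans the dynamic section of a 64-bit ELF file for text relocations.
TextRelStatus check_text_relocations(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    Elf64_Ehdr ehdr{};
    if (!in.read(reinterpret_cast<char*>(&ehdr), sizeof ehdr)) {
        return TextRelStatus::NotElf64;  // unreadable or too short
    }
    if (std::memcmp(ehdr.e_ident, ELFMAG, SELFMAG) != 0 ||
        ehdr.e_ident[EI_CLASS] != ELFCLASS64) {
        return TextRelStatus::NotElf64;
    }

    for (unsigned i = 0; i < ehdr.e_phnum; ++i) {
        // Program headers are e_phentsize bytes apart, starting at e_phoff.
        Elf64_Phdr phdr{};
        in.seekg(static_cast<std::streamoff>(
            ehdr.e_phoff + std::uint64_t(i) * ehdr.e_phentsize));
        if (!in.read(reinterpret_cast<char*>(&phdr), sizeof phdr)) {
            return TextRelStatus::NotElf64;
        }
        if (phdr.p_type != PT_DYNAMIC) {
            continue;
        }

        // Walk the dynamic entries until DT_NULL or the end of the segment.
        const std::uint64_t count = phdr.p_filesz / sizeof(Elf64_Dyn);
        in.seekg(static_cast<std::streamoff>(phdr.p_offset));
        for (std::uint64_t j = 0; j < count; ++j) {
            Elf64_Dyn dyn{};
            if (!in.read(reinterpret_cast<char*>(&dyn), sizeof dyn)) {
                return TextRelStatus::NotElf64;
            }
            if (dyn.d_tag == DT_NULL) {
                break;
            }
            // Text relocations are flagged either by a DT_TEXTREL entry
            // or by the DF_TEXTREL bit inside DT_FLAGS.
            if (dyn.d_tag == DT_TEXTREL ||
                (dyn.d_tag == DT_FLAGS && (dyn.d_un.d_val & DF_TEXTREL))) {
                return TextRelStatus::Present;
            }
        }
        return TextRelStatus::Absent;
    }
    return TextRelStatus::Absent;  // no dynamic section, e.g. a static binary
}
```

A 32-bit library such as the x86 case discussed above would need the Elf32_* structures instead; that variant is left out to keep the sketch short.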
On Windows, code that is compiled into a DLL uses the define _WINDLL to give the compiler hints to avoid position-dependent code, minimizing fix-ups. Windows also allows a DLL to be rebased, which improves the load time of the shared code and minimizes the size of the image directory table, since the loader will first try to load the DLL at the address specified by the rebase address.
So what is a "fix-up"? Fix-ups are adjustments to specific addresses that are not relocatable. For example, say I have an STL string:
string MyString="Initial Value";
The loader has to allocate the string "Initial Value" in the data segment and initialize a pointer to it. If the shared library needs to be relocated or rebased, the value of the pointer needs to be fixed up at runtime to point to the new address, in effect creating a pointer to a pointer. The same thing happens with functions.
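One commonly cited way to avoid this kind of fix-up for constant text is to store the characters in an array instead of behind a pointer. The sketch below contrasts the two declarations; the variable names are invented, and whether a relocation is actually emitted depends on the compiler, its options, and the platform, so treat the comments as the typical case and not a guarantee.

```cpp
#include <cstring>

// A pointer variable initialized with the address of a literal. In a shared
// library the stored address is not known until load time, so this typically
// costs one load-time fix-up plus writable storage for the pointer itself.
const char* g_message_ptr = "Initial Value";

// An array holding the characters directly. No address is stored, so there
// is typically nothing to fix up and the data can stay read-only and shared.
const char g_message_arr[] = "Initial Value";

// Both forms hold the same text; only their load-time cost differs.
bool same_text() {
    return std::strcmp(g_message_ptr, g_message_arr) == 0;
}
```

On Linux, "readelf -r" on the resulting shared object is one way to see whether a relocation was generated for the pointer.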
In general, a well-designed application can save significant memory by using shared libraries. Let us look at an example of an Adobe Air application using the common Air runtime. It runs in a working set of 58,508 KB with 14,348 KB of shareable memory unshared. Now let us launch a second Air application. It runs in a 43,864 KB working set with 13,748 KB of shareable memory unshared. When each of these two applications runs by itself, there is already some sharing with the operating system, but when they run together the amount of memory shared increases by 12,880 KB, a memory savings of 22 percent in the first application and 29 percent in the second.
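The percentages follow from dividing the newly shared 12,880 KB by each application's working set. A small sketch of that arithmetic, with `savings_percent` as a name invented for this example:

```cpp
#include <cmath>

// Share of a working set, in percent, that becomes shared with another process.
double savings_percent(double newly_shared_kb, double working_set_kb) {
    return 100.0 * newly_shared_kb / working_set_kb;
}

// Figures from the two-application measurement above, rounded to whole percent.
long first_app_savings() {
    return std::lround(savings_percent(12880.0, 58508.0));   // about 22
}
long second_app_savings() {
    return std::lround(savings_percent(12880.0, 43864.0));   // about 29
}
```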
Although it may seem obvious to use shared libraries to minimize memory usage, here are some things to think about as part of your application design. The more shared libraries you use, the more fragmented your memory space can become because of code alignment, which may result in wasted memory. You can dynamically load and unload shared libraries, which may slow down specific tasks but significantly reduce memory use. For example, if you have code that performs a specific task infrequently, such as converting a file from one format to another, explicitly load a shared library to do the conversion only when the user requests it, and then unload the library.
Now we know we should look very closely at using shared libraries where possible. This includes using the C runtime libraries as much as possible, since they are most likely already loaded. If your application does not dynamically load and unload the shared library, sharing may not reduce the amount of memory your application appears to take when running by itself, but it will significantly help the platform run multiple applications at a time. In the example of the two Air applications, the 29 percent savings may make the difference between running several applications and placing an artificial limit of one application at a time on a system. So what else can you do to optimize your application for memory? Let us look at specific optimizations that help reduce the size of the application, going back to the comment about "just setting the compiler to optimize for size".
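The load-on-demand pattern can be sketched on Linux with dlopen(), dlsym(), and dlclose(). The wrapper class `ScopedLibrary` and the function `cosine_on_demand` are invented for this example, and glibc's math library ("libm.so.6", an assumed file name that varies by system) stands in for a rarely used conversion library. On older glibc versions the program has to be linked with -ldl.

```cpp
#include <dlfcn.h>

#include <stdexcept>
#include <string>

// Hypothetical RAII wrapper: loads a shared library on construction and
// releases it on destruction.
class ScopedLibrary {
public:
    explicit ScopedLibrary(const std::string& path)
        : handle_(dlopen(path.c_str(), RTLD_NOW | RTLD_LOCAL)) {
        if (!handle_) {
            throw std::runtime_error("cannot load library: " + path);
        }
    }
    ~ScopedLibrary() { dlclose(handle_); }

    ScopedLibrary(const ScopedLibrary&) = delete;
    ScopedLibrary& operator=(const ScopedLibrary&) = delete;

    // Looks up a symbol and casts it to the requested function pointer type.
    template <typename Fn>
    Fn symbol(const char* name) const {
        void* address = dlsym(handle_, name);
        if (!address) {
            throw std::runtime_error(std::string("missing symbol: ") + name);
        }
        return reinterpret_cast<Fn>(address);
    }

private:
    void* handle_;
};

// Stand-in for an infrequent task: the library is held only for this call.
double cosine_on_demand(double x) {
    ScopedLibrary libm("libm.so.6");  // assumed library name (glibc)
    auto cosine = libm.symbol<double (*)(double)>("cos");
    return cosine(x);
}  // dlclose() runs here, dropping this reference to the library
```

Note that dlclose() only drops a reference; the library is unmapped when the last reference goes away, so a library that the process already uses implicitly (as is likely for libm in a C++ program) stays loaded. On Windows the same shape would use LoadLibrary(), GetProcAddress(), and FreeLibrary().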
Most compilers come with the option of optimizing for size. Typically, what this does is some or all of the following:
What I have found with 10 different client applications of varying types is that code size does not improve significantly between the "optimize for size" and the "full optimization" options: the difference is typically less than 5 percent, with a few exceptions in specific cases where significant vectorization or loop unrolling takes place. What I did find is that compilers vary widely in the size of the code they generate. In some cases I saw code size differ by as much as a factor of two. The main reason for the swings in code size is the performance optimizations stated earlier. I compared the binary that was twice the size with the smaller one, and it was almost 4 times as fast: a size versus speed tradeoff. Many compilers provide optimizations for specific processors. If you know your target machine, target the processor you are running on; this reduces code size by eliminating the code paths optimized for other processors. The lesson learned here is to look at the compiler and options you select, and not to be afraid to look at other compilers and options. There is no magic dust that works in all cases for all code. The bottom line is that not all compilers are the same. So what about linkers?
Check for Incremental Linking and Debug Information
Some linkers provide an option to link files incrementally. Although incremental linking is typically turned on for debug builds and off for release builds, make sure it is off in your release code. Incremental linking is nice for developers who change code all the time but is very bad when it comes to releasing code. Incremental linking works by padding each segment of code with int 3 instructions so that, if the code changes, the linker only has to re-link the affected section of code up to the padded int 3 region. In large executables the int 3 padding can easily add up to hundreds of kilobytes or even megabytes. Also, remember to remove any debug symbol settings that may have been inadvertently left behind in the linker options.
So where does this leave us? I hope that you are a little more informed on how to write programs with tight memory requirements. Designing your applications to take full advantage of shared memory seems to give the biggest benefit. Determine which functions are needed in multiple places and share them. Where possible, take advantage of explicit loading of shared libraries for large sections of code that rarely get used. You should not only experiment with different compiler options but also look at different compilers and see what each does with your specific code. There is no one-size-fits-all answer here. The following list contains a few of my findings, and as you design your application, don’t forget to think about memory.

Richard Winterton (Intel)