Processing power and memory size in the average computer have increased significantly over the years. The fact that most computers implement virtual memory and have large amounts of physical memory has desensitized many programmers to the need to conserve physical memory. Many programmers may wonder, "Why should I care about memory? The OS takes care of everything for me." If you are writing programs targeted at "ultra" mobile devices such as mobile phones or mobile internet devices (MIDs), it is important that you care. An example of this kind of device is the Apple iPhone*. In the iPhone SDK, Apple states that only one iPhone application can run at a time. This restriction is not placed on the iPhone because of the operating system, since it is based on a multitasking Unix OS; the restriction exists because of the small memory size. Desktops and laptops run multiple applications easily because they typically have a gigabyte or more of RAM, and if an application needs more, the OS uses virtual memory to page data in and out of physical memory using "fast enough" storage devices. Mobile devices may have similarly featured multi-process, multitasking operating systems and virtual memory systems, but they do not have large amounts of RAM available for applications to use. This paper presents information and specific examples on how to write applications with memory size and usage in mind, so that performance does not degrade significantly when multiple applications run at the same time.
In order to use memory efficiently in an application, you have to understand how the application uses memory. Some people confuse optimizing for memory with optimizing for the size of the program on a data storage device such as a hard drive or solid state storage, so it is worth separating the size of the program on mass storage from the size of the program "in memory". A good example: when I asked a couple of C/C++ programmers what they do to minimize the amount of memory used by a program, they said they use the compiler option to "optimize for size". Although this option can shrink the in-memory footprint of an application, a smaller file on disk does not by itself mean less RAM in use. There is a relationship between on-disk size and in-RAM size, since the executable file on disk is very similar to the in-memory footprint, but the two are not the same thing. Let us first look at what a program looks like "on disk".
Applications on Disk (File Formats)
There are predominately two program file formats: Executable and Linking Format (ELF) and Portable Executable (PE) format. ELF files typically run on Linux/Unix systems (Apple's operating systems use their own, similar Mach-O format), while PE files typically run on a Windows* OS.
This paper is not going to go into significant detail on file formats, but it will describe the basics so you can understand how executable files are loaded into memory by the operating system and how to minimize an application's runtime memory footprint.
The reason it is important to understand the file format and what the loader has to do is that the executable file on disk is very similar to the in-memory footprint after the OS has loaded it. This is intentional, so that the program loader has to do a minimal amount of work when loading the program into memory. The program loader takes the binary file and maps it as specified in the header sections. Once the program is loaded in memory, the OS can for the most part treat it like any other memory-mapped file.
Having identified the file formats, let us look at the types of binary objects that compilers and linkers generate to make up PE or ELF content. The three binary file types generated by a compiler/linker are executables, static libraries, and shared libraries. Executables and shared libraries are loaded by the operating system at runtime, as opposed to static libraries, which are copied directly into the other two file types and "fixed at link time".
Static Libraries or Shared Libraries
Partitioning software into classes and libraries is, in general, good coding practice. However, the programmer needs to be aware of the implementation. For example, static libraries can be a problem in memory: a static library's code is duplicated in physical memory once for each separately linked binary object that includes it. I have seen many applications that use static libraries duplicate code several times within the same process or across multiple processes. For example, an executable and a .dll or .so that both use the same static library will each carry a copy of its code, since they are linked independently. A sketch of how this happens follows.
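As an illustration (the file and library names here are hypothetical), suppose a logging routine is built into a static library that both the main executable and a plug-in shared library link against:

/* logger.c -- compiled into the static library liblogger.a */
#include <stdio.h>
void log_message(const char *msg) { printf("log: %s\n", msg); }

/* app.c    -- main executable:   cc app.c -L. -llogger               */
/* plugin.c -- shared library:    cc -shared -fPIC plugin.c -L. -llogger */

/* Because app and libplugin.so are linked independently, each receives its
   own copy of log_message in its .text section, and both copies occupy
   physical memory at runtime. Moving log_message into a shared library
   instead would leave a single copy that both binaries map. */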
A shared library’s implementation is different from that of a static library, and it has to be treated differently by the loader; hence, the ELF and PE file formats have to account for the difference. The ELF file format distinguishes the two by providing a relocation header table, whereas the PE file format embeds the information in the individual sections. The most common content of the ELF/PE formats is divided into the following sections:
- .text – Contains all of the "code"
- .data – Contains all of the initialized data
- .idata – Contains import data – names of other files and functions called
- .edata – Contains export data – names of files and functions available to other modules
- .reloc – Contains relocation information
The sections of interest are the .text (code) and .data sections, since these are the ones that may require fix-up code.
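If you want to see these sections in your own binaries, the standard platform tools will list them; a quick sketch (the binary names are placeholders, and exact output varies by tool version):

/* Linux:
     readelf -S mybinary          -- list the ELF section headers
     size mybinary                -- summarize text/data/bss sizes
   Windows:
     dumpbin /HEADERS mybin.exe   -- list the PE section headers */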
Share and Share Kind of Alike (.dll and .so files)
On Windows, a shared library is known as a .dll, or dynamic link library; on Linux, it is known as a .so, or shared object. Although the implementations differ, the characteristics of these libraries are similar. From a programmer's point of view, shared libraries (.dll/.so files) provide the code-sharing convenience of static libraries without most of the memory duplication. Multiple applications may use the same set of libraries without increasing the total size of the text in memory when those applications run at the same time. Each process using the same .dll/.so may be charged by the OS as owning some of that memory, but in reality the memory is shared with the other processes. There is only one copy of the read-only memory in physical memory, yet its usage is counted against every process that has loaded the dynamic library. The actual memory in use can be calculated by subtracting out all shared memory that is counted more than once. If you are interested in calculating exact numbers, there are utilities that parse the ELF or PE sections and identify which sections are shareable.
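For example, with hypothetical numbers: if process A reports a 30,000 KB working set and process B reports 25,000 KB, and 10,000 KB of that is the same shared library text counted in both, the true combined physical footprint is 30,000 + 25,000 - 10,000 = 45,000 KB, not 55,000 KB.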
So why would you ever use static libraries? Shared libraries are nice, but they may come with a cost in startup performance. In order to use a shared library, the operating system needs to do the following when the binary is loaded:
- Locate the shared library on disk
- Check to determine if the shared library is already loaded in the process space
- Allocate memory for the shared library
- Resolve fix-ups for the .text and .data sections
This all takes time and space. Both Windows and Linux provide explicit dynamic-linking routines: LoadLibrary() and GetProcAddress() on Windows, dlopen() and dlsym() on Linux. These routines effectively call the same code that the implicit linker calls.
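Here is a minimal sketch of explicit loading on Linux; the library name libconvert.so and the function convert_file are hypothetical placeholders (build with cc app.c -ldl):

#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    /* Load the shared object only when it is actually needed */
    void *handle = dlopen("libconvert.so", RTLD_LAZY);
    if (!handle) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    /* Look up the symbol by name and call through the returned pointer */
    int (*convert_file)(const char *, const char *) =
        (int (*)(const char *, const char *))dlsym(handle, "convert_file");
    if (convert_file)
        convert_file("in.wav", "out.ogg");

    /* Unload when done so the library's memory can be released */
    dlclose(handle);
    return 0;
}

On Windows the equivalent sequence uses LoadLibrary(), GetProcAddress(), and FreeLibrary().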
On Linux, glibc is the only shared library given a fixed link address; Linux uses pre-linked virtual addresses for all other shared libraries. A compiler switch (-fPIC) helps with the fix-up problem by telling the compiler to generate position-independent code, which avoids referencing data by absolute address as much as possible. A developer once asked me why we should care; doesn't the operating system take care of all this? The operating system will take care of the fix-up code for you, but it is very inefficient if you don't use the right options and create a shared library just for the sake of creating a shared library. A text relocation is a memory address embedded in the "read-execute" text segment of a shared library. Say a non-PIC text segment calls into a memory location that needs to be "fixed up" by the runtime; on Linux this is performed by ld.so in glibc during the startup of the dynamically linked executable. If the developer didn't build the code correctly, there would be significant memory and fix-up penalties. For example, a non-PIC compiled libmpg3 library has roughly 6,000 memory locations left inside the shared library pointing to some 300 functions and data objects referenced by the instructions. So why not use -fPIC all the time? There may be cases where you do not want PIC. The code to set up the PIC register (typically ebx) takes about three instructions, plus an additional one to two instructions per symbol accessed in the data object. In addition, because the PIC register is reserved, the compiler is not free to use it for other purposes, which can result in less than optimal performance by limiting the number of registers available to the compiler. On Linux, to test whether a shared object requires relocations in its text segment, run "readelf -d binary.so" and inspect the output for a TEXTREL entry; the presence of TEXTREL indicates that text relocations exist.
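To make the PIC tradeoff concrete, here is a small sketch (the file and library names are illustrative) showing how the build flag changes whether the text segment stays shareable:

/* example.c -- source for a tiny shared library */
int counter = 0;
int increment(void) { return ++counter; }

/* Built without -fPIC:  cc -shared -o libexample.so example.c
     The absolute address of counter is patched into the text segment at
     load time (a TEXTREL); the patched pages are written to, so they can
     no longer be shared between processes.
   Built with -fPIC:     cc -shared -fPIC -o libexample.so example.c
     counter is reached through the global offset table (GOT); the text
     segment needs no fix-ups and stays read-only and shareable.
   Check the result:     readelf -d libexample.so    (look for TEXTREL) */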
On Windows, code that is compiled into a DLL uses the _WINDLL define to give the compiler hints to avoid position-dependent code, minimizing fix-ups. Windows also allows a DLL to be rebased, which improves the load time of the shared code and minimizes the size of the image directory table, since the loader will first try to load the DLL at the address specified by the rebase setting.
So what is a "fix-up"? Fix-ups are adjustments to hard-coded addresses that must change when a module is not loaded at the address for which it was linked. For example, say I have an STL string:
#include <string>
std::string MyString = "Initial Value";
The loader has to allocate the string "Initial Value" in the data segment and initialize a pointer to that value. If the shared library is relocated or rebased, the value of the pointer needs to be fixed up at runtime to point to the new address, effectively creating a pointer to a pointer. The same thing happens with functions.
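The same effect is easy to see in plain C; a minimal sketch (the names are arbitrary):

/* fixup.c -- why initialized pointers cost load-time fix-ups */
extern void on_event(void);          /* hypothetical function in this module */

const char *greeting = "hello";      /* pointer in .data holding the string's address */
void (*handler)(void) = on_event;    /* pointer in .data holding a function's address */

/* If this module is loaded at a base address other than the one it was
   linked for, the loader must rewrite greeting and handler -- one fix-up
   per initialized pointer -- before the code runs. */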
In general, a well-designed application can save significant memory by using shared libraries. Let us look at an example of two Adobe Air applications using the common Air runtime. The first runs in a working set of 58,508 KB with 14,348 KB of shareable memory unshared. Now let us launch a second Air application. It runs in a 43,864 KB working set with 13,748 KB of shareable memory unshared. Running by themselves, each application already shares some memory with the operating system; running together, they share an additional 12,880 KB, a memory savings of 22 percent for the first application and 29 percent for the second.
Given that it seems obvious to use shared libraries to minimize memory usage, here are some things to think about as part of your application design. The more shared libraries you use, the more fragmented your memory space can become because of code alignment, which may waste memory. You can also dynamically load and unload shared libraries, which may slow down specific tasks but significantly improve memory usage. For example, if you have code that performs an infrequent task, such as converting a file from one format to another, explicitly load a shared library to do the conversion only when the user requests it, and then unload the library, exactly the pattern shown in the dlopen() sketch earlier.
Now we know we should look very closely at using shared libraries where possible. This includes using the C runtime libraries as much as possible, since they are most likely already loaded. If your application does not dynamically load and unload shared libraries, sharing may not reduce the amount of memory your application appears to take when running by itself, but it will significantly help the platform run multiple applications at a time. In the example of the two Air applications, the 29 percent savings may make the difference between running both applications and placing an artificial limit of one application at a time on the system. So what else can you do to optimize your application for memory? Let us look at specific optimizations that help reduce the size of an application, starting with the earlier comment about "just setting the compiler to optimize for size".
Compiler Optimizations for Size
Most compilers come with an option to optimize for size. Typically, this performs some or all of the following (a sketch of the typical command-line options follows the list):
- Disables function, jump, loop, and label alignment (removing the memory gaps created by alignment padding)
- Disables prefetching of loop arrays
- Does not perform loop unrolling
- Disables inlining of functions
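The exact spelling of these options varies by toolchain; as a sketch (check your compiler's documentation for the precise effects):

/* GCC and compatible compilers:
     cc -Os file.c                       -- optimize for size: most of -O2, minus
                                            alignment padding and size-increasing passes
     cc -O2 -fno-unroll-loops -fno-inline file.c    -- a hand-picked middle ground
   Microsoft Visual C++:
     cl /O1 file.c                       -- favor small code (versus /O2, favor fast code) */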
What I have found with ten different client applications of varying types is that code size isn't significantly different between the "optimize for size" and "full optimization" options, typically less than 5 percent, with a few exceptions in specific cases where significant vectorization or loop unrolling is taking place. What I did find is that compilers vary widely: in some cases I saw code size differences of as much as two times between compilers. The main reason for these swings is the performance optimizations described earlier. I compared the binary that was twice the size against the smaller one, and it was almost four times as fast: a size-versus-speed tradeoff. Many compilers also provide optimizations for specific processors. If you know your target machine, target that processor; this reduces code size by eliminating the branches of code optimized for other processors. The lesson learned here is to look closely at the compiler and options you select, and don't be afraid to evaluate other compilers and options. There is no magic dust that works in all cases for all code. The bottom line is that not all compilers are the same; so what about linkers?
Check for Incremental Linking and Debug Information
Some linkers provide an option to link files incrementally. Build systems typically turn incremental linking on for debug builds and off for release builds, but make sure it is off in your release code. Incremental linking is convenient for developers who are changing code all the time, but it is very bad in released code. The way incremental linking works is that each section of code is padded with int 3 instructions, so that when the code changes, the linker only has to re-link the affected section up to the padded int 3 region. In large executables, the int 3 padding can easily add hundreds of kilobytes or even megabytes. Also, remember to remove from the linker options any debug symbols that may have been left behind inadvertently.
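As a sketch of where to look, assuming the usual toolchains:

/* Microsoft linker:
     link /INCREMENTAL:NO ...        -- disable incremental linking (no int 3 padding)
     (also confirm /DEBUG is not set on release binaries)
   GNU toolchain:
     strip --strip-debug myapp       -- remove debug sections left in a built binary */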
So where does this leave us? I hope you are now a little better informed about how to write programs with tight memory requirements. Designing your application to take full advantage of shared memory gives the biggest benefit. Determine which functions are needed in multiple places and share them. Where possible, take advantage of explicitly loading shared libraries for large sections of code that are rarely used. Experiment not only with different compiler options, but also with different compilers, and see what each does to your specific code. There is no one size fits all here; the following list summarizes a few of my findings. As you design your application, don't forget to think about memory.
- Share as much memory as possible within and outside of your application. I have found this usually saves more memory than all of the other optimization techniques.
- Look at how you partition functions or classes. Try to make code reusable as much as possible, and use shared libraries as the packaging for that reusable code.
- Determine how best to take advantage of other shared libraries that are already loaded on the system. Many applications use much of the same functionality yours does and may be running simultaneously. For example, a C runtime is most likely already loaded into memory. If you are considering an application built on another runtime, such as a Java runtime or the Adobe Air runtime, some of that runtime's code may be shared with other applications running simultaneously as well.
- Experiment with the compiler and its options.
- As stated earlier, there is no one size fits all for compilers and their options. Most compilers have an optimize-for-size option; however, I have found that in many cases full optimization produces very similar code, perhaps only slightly larger and slightly faster. You almost have to take it on a case-by-case basis.
- When selecting compiler options, if you know the target you are compiling for and do not plan on sharing the binary across platforms, set the compiler option for the targeted platform. This provides about a 5 percent improvement in size and can significantly improve performance, since the compiler can be more specific in generating code optimized for the target platform.
- Watch for options that may take up significant memory, such as loop unrolling, and for linker options designed for debugging, such as incremental linking. Loop unrolling may not be expensive in some applications, but in others it can be significant. Moreover, from what I have seen, incremental linking is always an expensive option and provides no runtime benefit.
- Watch out for segment alignment. Some compilers align text segments on large boundaries, sometimes up to the 4096-byte page size. This can waste a lot of memory just to align a text or data segment.
- Use the stack often, but be careful with static variables and unnecessary initializations. Remember that if you initialize a variable with a value, a copy of that value must in many cases be kept in a read-only data section of the image (see the sketch below).
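A minimal illustration (the names are arbitrary):

/* Initialized data must be stored in the image and loaded into memory: */
static char table[4096] = {1};    /* 4 KB of initializer data carried in the image */

/* Zero-initialized data lives in .bss: no bytes in the file, and pages are
   allocated (zeroed) on demand: */
static char scratch[4096];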