I'm working on securing access to the L1 cache by locking it line by line. Is there any way to do this? For example, when two threads access the L1, the lines each thread touches would stay locked to that thread for a certain time.
What are the operations you wish to perform during the lock?
You are aware that multiple addresses map to the same cache line. View this like a shared parking slot. Only one car at a time can park in slot N, though their license plate number (memory address) will identify who's in the slot. The rules for the parking slot are "If other car in slot, push it out of the slot".
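To make the parking-slot picture concrete, here is a minimal sketch of how an address is commonly mapped to a cache set. The geometry (32 KiB, 8 ways, 64-byte lines) and the function names are assumptions chosen for illustration, not figures from this thread, and real processors differ in which address bits they use. Note that with a set-associative cache each "slot" holds several cars (one per way) before one gets pushed out.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed geometry for illustration only: 32 KiB, 8-way, 64-byte lines. */
enum { LINE_BYTES = 64, WAYS = 8, CACHE_BYTES = 32 * 1024 };
enum { NUM_SETS = CACHE_BYTES / (LINE_BYTES * WAYS) };   /* 64 sets */

/* Index of the set (the "parking slot") that an address maps to. */
unsigned cache_set_index(uintptr_t addr)
{
    /* The modulo below assumes a power-of-two number of sets. */
    assert((NUM_SETS & (NUM_SETS - 1)) == 0);
    return (unsigned)((addr / LINE_BYTES) % NUM_SETS);
}

/* Two addresses compete for the same set when their indices match. */
int same_set(uintptr_t a, uintptr_t b)
{
    return cache_set_index(a) == cache_set_index(b);
}
```

Under these assumed numbers, addresses 0x1000 and 0x2000 land in the same set, while 0x1000 and 0x1040 do not.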
Answering the first question may yield a solution that you haven't thought of.
Thank you for replying. I don't care what kind of operations take place. I'm trying to prevent threads from gaining any information about which addresses the victim thread has accessed. In my view, if I can lock the cache lines that a thread accesses for a certain time and then flush them after a predefined time, other threads can't gain any information about the accessed addresses and launch the attack.
If the threads are in a different process, then the virtual address spaces and physical address spaces (at any one time) preclude sharing of L1 cache. *** subject to your process not setting up shared memory between processes ***
You may have multiple threads within the same process (sharing the same virtual memory) whereby each thread can access all of the process's virtual memory. In this case, the multiple threads from the same process can share the same cache line.
If you want to exclude this from happening, then split your program into multiple processes.
You can use various inter-process messaging techniques and/or have one or more blocks of shared memory between processes. The information you want to hide from the other process is not to be placed into the shared memory block(s).
Are you working on cache-timing side-channel attacks on cryptographic keys? Well-known refereed papers in this field could give you enough hints.
I should add that, for the sensitive data, you should allocate what is called (on Windows) non-pageable memory, and verify that the non-pageable memory does not reside in, or more precisely is never written to, the system page file, as the page file can potentially be read by other processes. If the memory management protection is weak, then a different process, service, driver, or filter (virus) might snatch data you place into non-pageable memory. This leaves your "keep" (the place where you store your valuables) limited to the register set. Yet even this isn't entirely safe unless you can shield your process, or more precisely the hardware thread, from system interrupts, because at interrupt time the register set, or portions thereof, may get saved to RAM and then potentially be seen.
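As a rough illustration of keeping sensitive data out of the page file, here is a minimal sketch using the POSIX mlock call. This is an assumption on my part about the platform: on Windows the nearest user-mode call is VirtualLock, which is not quite the same thing as kernel non-paged memory. The helper names are hypothetical. Locking can fail when the process lacks the privilege or quota, so the sketch reports whether it succeeded rather than assuming it did.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Zero a buffer through a volatile pointer so the stores are not optimized away. */
void secure_wipe(void *p, size_t n)
{
    volatile unsigned char *v = p;
    while (n--) *v++ = 0;
}

/* Allocate a page-aligned buffer and ask the kernel to keep it resident.
   *locked is set to 1 if mlock succeeded, 0 otherwise. Returns NULL on
   allocation failure. */
void *alloc_locked(size_t n, int *locked)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t rounded = (n + (size_t)page - 1) / (size_t)page * (size_t)page;
    void *p = NULL;
    if (posix_memalign(&p, (size_t)page, rounded) != 0)
        return NULL;
    *locked = (mlock(p, rounded) == 0);
    return p;
}

/* Wipe the secret before unlocking and freeing the buffer. */
void free_locked(void *p, size_t n, int locked)
{
    secure_wipe(p, n);
    if (locked)
        munlock(p, n);
    free(p);
}
```

Even when the lock succeeds, this only addresses paging. It does nothing about other privileged code reading the memory, which is the point made above.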
Thank you so much for all the responses. I found ARM code that can be used to lock CPU L1 cache lines, but I can't find the same for Intel. So, is there any direct or indirect way to lock CPU cache lines for a process, thread, or even a VM on Intel processors?
On the Intel IA-32 and Intel-64 processors the L1 and L2 caches are typically confined to each core within the processor. Some of the processor core designs permit reading another core's L2 cache without going through RAM, but I am not aware of any way to directly read another core's L1 (HT siblings can read their own core's L1). Any core can potentially cause an eviction from another core's L1; however, this comes with the restriction that the other core also has to map the same physical address. That said, the Intel IA-32 and Intel-64 processors do not have a means that I am aware of to write to L1 cache while inhibiting the written data from being enqueued to RAM. ARM may treat L1 cache as an extended register set; the Intel cache design treats it as a remembrance of data written to RAM.
This leaves as your "only" recourse the register set. On Intel-64 you have a sizable number of registers per hardware thread ~13 x 64-bit, 16 x 128-bit or 256-bit. This is per hardware thread. You also have the x87 FPU stack to store stuff into.
The "only" above can be circumvented if you have available a set of physical addresses that you can map to that shares the characteristics that it is cacheable .AND. appears writeable, but in fact is non-writable. You would also like it not to be located on a bus that can be snooped.
The "trick" then is for you to interact with the O/S to constrict your (some of your) software threads to specific hardware threads (affinity binding) .AND. exclude the selected hardware threads (and core) from participating in interrupt handling .AND. exclude the O/S from scheduling other software threads to those hardware threads. Some O/S's may have Real-Time support API's that permit you to do this.
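The affinity-binding part of this can be sketched as follows, assuming Linux and glibc (sched_setaffinity is Linux-specific, and the helper names are hypothetical). It covers only the first of the three conditions: keeping interrupts and other software threads off the chosen hardware thread needs separate O/S configuration that this sketch does not attempt.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Pin the calling thread to one logical CPU.
   Returns 0 on success, -1 on error (errno is set by the kernel). */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set);   /* 0 = calling thread */
}

/* First logical CPU the calling thread is currently allowed to run on,
   or -1 if the affinity mask cannot be read. */
int first_allowed_cpu(void)
{
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof set, &set) != 0)
        return -1;
    for (int i = 0; i < CPU_SETSIZE; i++)
        if (CPU_ISSET(i, &set))
            return i;
    return -1;
}
```

A caller would typically do pin_to_cpu(first_allowed_cpu()) or pass a CPU number chosen by the administrator.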
Thank you Jim for replying. It's really appreciated.
>>...I'm working on securing access to L1 cache by locking it line by line. Is there any way to do it?
Try to boost a priority of a thread that will do main processing to Real-Time. ( Note: Used that technique in a financial system when some cryptography software subsystem had to do some task ).
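For reference, a minimal sketch of requesting a real-time scheduling class, assuming a POSIX system (on Windows the equivalent would go through the process priority class instead). The function name is hypothetical. The call normally needs elevated privileges, so the sketch returns the error code rather than assuming success, and it asks for the lowest FIFO priority to limit the risk of starving other threads.

```c
#include <assert.h>
#include <errno.h>
#include <sched.h>

/* Ask the scheduler to run the calling process under SCHED_FIFO.
   Returns 0 on success, otherwise the errno value (commonly EPERM). */
int request_realtime_fifo(void)
{
    struct sched_param sp = { 0 };
    int prio = sched_get_priority_min(SCHED_FIFO);
    if (prio < 0)
        return errno;
    sp.sched_priority = prio;
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        return errno;
    assert(sched_getscheduler(0) == SCHED_FIFO);   /* confirm it took effect */
    return 0;
}
```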
Boosting the priority might not exclude it from being interrupted. You would need an O/S feature that would permit a (privileged) thread to request, and get acknowledgement of that request, to permit it to run continuously. While this would be similar to requesting real-time priority, real-time priority might not preclude preemption. Example: over subscribing the number of threads requesting real-time priority. If granted, the O/S would time-slice the threads. If not granted, then the threads granted rights might be given full runtime. Note, this thread, could contain call-back functions run by other threads (e.g. driver), but the thread must not be interrupted even at the request of system shutdown. Presumably a call back function, run by different thread, would write a flag in memory indicating the shutdown request. Periodically the secured thread would poll the shutdown request and destroy any sensitive information held in registers, then terminate itself.
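The polling scheme described here could look roughly like the sketch below, using C11 atomics. All names are hypothetical, and the XOR step is only a stand-in for real work. One caveat: C gives no control over which values live in registers, so this only shows the flag-and-wipe control flow, not register-only storage.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Written by some other thread (for example a shutdown callback). */
atomic_int shutdown_requested;

void request_shutdown(void)
{
    atomic_store(&shutdown_requested, 1);
}

/* Zero the key through a volatile pointer so the stores are kept. */
void wipe_key(volatile uint64_t *p, size_t n)
{
    while (n--) *p++ = 0;
}

/* Run up to max_rounds of placeholder work, polling the flag before each
   round. The key is wiped on every exit path. Returns rounds completed. */
size_t secured_loop(uint64_t key[4], size_t max_rounds)
{
    size_t done = 0;
    assert(key != NULL);
    while (done < max_rounds && !atomic_load(&shutdown_requested)) {
        key[done % 4] ^= 0x9E3779B97F4A7C15ULL * (done + 1);  /* stand-in work */
        done++;
    }
    wipe_key(key, 4);
    return done;
}
```

The polling interval here is one round of work. A real implementation would have to choose it so that a shutdown request is honored quickly enough.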
In addition to this, the program would have to be written to not call any O/S function or local function that would save the registers that are required to be un-snoopable.
If the above is followed, there still may be a very small chance for reverse engineering the protected information if the code is not written to take this into consideration. This is, while the registers can be protected by the above (if provided for), what is not protected would be the memory fetches used by the code. Additionally, the performance counters of that thread might be readable. If the code space is somehow readable by a different thread (it will be to some threads), then the combination of the performance counter and memory fetches might yield some insight as to what were the initial inputs to the protected code, and which the spying program may have a copy of. The spying code could then re-run the code with the results now visible to itself. The protected code would have to be written to circumvent this type of attack.
The kind of locking that ARM (optionally) supports is not supported by most general-purpose processors. I have not run across such functionality while working in the design teams at SGI (MIPS, Itanium), IBM (POWER), or AMD, and I have not seen any indication that Intel supports such a feature either.
Cache locking or similar functions seem to be limited to processors targeting embedded markets. ARM supports several types of cache locking, while TI processors (DSPs) support configuring the SRAM as partly cache and partly locally controlled memory. For example, a chip with a 64 KiB "level-1 SRAM" could configure 0,16,32,48, or 64 KiB as cache, with the remainder as explicitly controlled local memory.
It is essentially impossible for unprivileged code to gain specific information about the memory locations accessed by another thread, and surprisingly difficult to get even general information. If a system is configured for time-sharing and allows an "attacker" task to request services from a "target" task, and provides a high-resolution timer, then some information about how long it takes to complete the task(s) can be obtained. This is typically used for timing attacks against compute-intensive services -- it is much more difficult to learn anything about memory accesses because there are so many different ways for the "target" process to get the same average memory latency. Latency for a load can take almost any value between ~4 cycles and well over 500 cycles, making it effectively impossible to fit an unambiguous model for more than a handful of accesses. (I know this because I have spent much of the last 15 years building and testing models for understanding memory accesses in cases where I control everything, and it is really hard work -- even with perfect control over the code being executed, the process placement, the page size(s), and with full access to hardware performance counters.)
Boosting a thread to Real-Time priority, as Jim said, will not keep the thread from being interrupted, either by an ISR/DPC routine or by a system thread running at the same priority. Moreover, there are "housekeeping" system threads that run at lower priority, and those threads could also be preempted, which can cause system instability.
Hi M. Younis A,
Can I have your email address so that we can discuss this interesting subject? I'm working on it too.