Technical Prerequisites
Linux Paging, DRAM Concepts
Intro & History of RowHammer
What is RowHammer?
RowHammer is the name of a critical vulnerability found in DRAM. It relies on repeatedly accessing the same locations in memory. As DRAM chips hold more and more memory, manufacturers build them with smaller cells packed closer together. This proximity introduces electromagnetic coupling effects between cells, causing them to interact in undesirable ways: repeated accesses to one location may induce bit flips in adjacent locations. The RowHammer problem was first described in a 2014 paper by Yoongu Kim et al. The team demonstrated that most of the DRAM modules on the market at the time were affected by this vulnerability. They also discussed several solutions to these disturbances, some in hardware and others in software.
Google Project Zero
The paper by Kim et al. served as a reference for Google’s Project Zero team. They built a working privilege escalation exploit that uses RowHammer-induced bit flips to gain kernel privileges on x86-64 Linux when run as an unprivileged userland process. The exploit induces bit flips in page table entries (PTEs). A PTE is an entry in the page table that stores information about a page of memory; on x86-64 it occupies 8 bytes. Each PTE contains the physical address of the page, whether the page is present in memory, whether it is writable, and other access permissions. The Project Zero team induced bit flips in a PTE to obtain write access to the process’s own page table, and from there gained read-write access to all of physical memory.
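To make the role of a PTE more concrete, here is a minimal sketch that decodes a few fields of a PTE and models a single bit flip. It assumes the standard 64-bit x86-64 layout for 4 KiB pages (present, writable, and user flags in bits 0 to 2, physical frame in bits 12 to 51). The example PTE value and the helper names are hypothetical and only serve the illustration.

```python
# Bit positions of a few x86-64 page table entry fields (4 KiB pages).
PTE_PRESENT = 1 << 0    # page is present in memory
PTE_WRITABLE = 1 << 1   # page may be written
PTE_USER = 1 << 2       # page is accessible from user mode
PTE_FRAME_MASK = 0x000F_FFFF_FFFF_F000  # bits 12..51: physical frame address


def decode_pte(pte: int) -> dict:
    """Split a 64-bit PTE value into the fields discussed above."""
    return {
        "present": bool(pte & PTE_PRESENT),
        "writable": bool(pte & PTE_WRITABLE),
        "user": bool(pte & PTE_USER),
        "frame": (pte & PTE_FRAME_MASK) >> 12,
    }


def flip_bit(value: int, bit: int) -> int:
    """Model a RowHammer bit flip by toggling one bit of a 64-bit word."""
    return value ^ (1 << bit)


# Hypothetical PTE: user-accessible, writable page in physical frame 0x12345.
original = (0x12345 << 12) | PTE_PRESENT | PTE_WRITABLE | PTE_USER
corrupted = flip_bit(original, 20)  # flip one bit inside the frame field

print(decode_pte(original))
print(decode_pte(corrupted))  # same flags, but a different physical frame
```

A single flipped bit in the frame field makes the entry point to a different physical page while the permission flags stay intact, which is the property the exploit relies on.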
Causing bit flips in PTEs is only one way of exploiting RowHammer; other ways of exploiting bit flips can be practical too.
To make such a routine work, Project Zero notes that there are several methods for selecting the addresses to hammer. In every case, the CPU cache must be flushed after each access so that the next access reaches DRAM instead of being served from the cache.
VULDB’s article
RowHammer is a security threat because it breaks memory isolation: accesses to one row (for example, within a user-level memory page) can modify the data stored in another row. The vulnerability has been tracked as CVE-2015-0565 on the VulDB website since 01/06/2015. It has a CVSS Meta Temp Score of 8.7. The Common Vulnerability Scoring System (CVSS) produces a numerical score that maps to a qualitative severity rating. CVSS consists of three metric groups: Base, Temporal, and Environmental.
A score between 7.0 and 8.9 corresponds to a high-risk vulnerability.
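The mapping from score to rating can be written down in a few lines. The sketch below assumes the qualitative severity bands of CVSS v3.x (None, Low, Medium, High, Critical); the function name is hypothetical.

```python
def cvss_rating(score: float) -> str:
    """Map a CVSS v3.x score (0.0 to 10.0) to its qualitative severity rating."""
    if not 0.0 <= score <= 10.0:
        raise ValueError("CVSS scores range from 0.0 to 10.0")
    if score == 0.0:
        return "None"
    if score < 4.0:
        return "Low"
    if score < 7.0:
        return "Medium"
    if score < 9.0:
        return "High"
    return "Critical"


print(cvss_rating(8.7))  # the score quoted above falls in the "High" band
```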
As exploitation is known to be difficult and is still at the proof-of-concept stage, attacker interest remains quite low. The website also mentions that all devices using DRAM, and some server racks, are affected by RowHammer.
DRAM Structure & Usage
Cells, rows, banks, and ranks
As mentioned earlier, DRAM is made of cells arranged in rows. Each cell consists of a capacitor and a transistor, the latter also called the access transistor. If the capacitor is charged, the cell is in a charged state. Conversely, if the capacitor’s voltage is low, the cell is in a discharged state. A charged state doesn’t always represent a bit value of 1, and a discharged state doesn’t always represent a bit value of 0. The mapping depends on the orientation of the cell, which is chosen by the manufacturer of the DRAM module.
In DRAM modules, cells are organized in two-dimensional arrays. The cells of a row are connected by a wire called the wordline, and the cells of a column by a bitline. When the voltage of a wordline is raised, every access transistor on that row is enabled, which allows the charge in the capacitors to be transferred to a buffer called the row buffer. The content is then copied back to the cells by recharging them. When the wordline voltage is lowered again, the capacitors are disconnected from the row buffer, which is then ready to receive a new row.
These arrays of rows are grouped into sets called banks, and each bank has its own row buffer. Finally, multiple DRAM chips, each containing several banks, come together to form a rank. It is important to note that the same bank location is accessed simultaneously across all the DRAM chips within a rank. In a way, organizing the DRAM chips into ranks gives the memory controller the illusion that it interacts with a single high-capacity DRAM chip. For example, consider a rank of 4 DRAM chips from which 64 bytes of data need to be read at a location in Bank 2. In this case, each DRAM chip contributes 16 bytes of data from its own Bank 2.
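The arithmetic of this example can be checked with a short sketch. It assumes that each of the 4 chips is 16 bits wide and that one access is a burst of 8 transfers; these two parameters are illustrative assumptions and are not stated in the example above.

```python
def bytes_per_chip(chip_width_bits: int, burst_length: int) -> int:
    """Bytes a single chip transfers during one burst."""
    return chip_width_bits * burst_length // 8


def bytes_per_access(chips_per_rank: int, chip_width_bits: int, burst_length: int) -> int:
    """Bytes the whole rank transfers during one burst."""
    return chips_per_rank * bytes_per_chip(chip_width_bits, burst_length)


# Assumed geometry: 4 chips, each 16 bits wide, burst of 8 transfers.
print(bytes_per_chip(16, 8))       # bytes contributed by each chip
print(bytes_per_access(4, 16, 8))  # bytes delivered by the whole rank
```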
Discharging and recharging the cells repeatedly can induce disturbances in adjacent rows.
Refreshing DRAM
As detailed earlier, each cell is made of a capacitor and a transistor. The capacitor holds a charge that represents a state. Because of leakage, capacitors lose their charge over time. If a capacitor’s charge becomes too low, its state changes and data is lost. The amount of time a cell can hold its charge is called the retention time, and DRAM is typically specified so that it is at least 64 ms.
To prevent data loss, the memory controller periodically reads each row to restore the original charge of its capacitors. This process, although it works the same way as reading a row, is called refreshing. As the retention time is 64 ms, the memory controller refreshes every cell at least once within that interval.
To trigger bit flips, the hammering must therefore perform enough accesses between two refreshes of the victim row, that is, within a single 64 ms window.
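A rough budget shows how many activations fit between two refreshes. In this sketch the row cycle time of 50 ns and the threshold of 139,000 activations are illustrative assumptions; real values depend on the module.

```python
def max_activations(refresh_window_ms: int, row_cycle_ns: int) -> int:
    """Upper bound on row activations that fit in one refresh window."""
    window_ns = refresh_window_ms * 1_000_000
    return window_ns // row_cycle_ns


def can_reach(threshold: int, refresh_window_ms: int, row_cycle_ns: int) -> bool:
    """True if an attacker alternating between two rows can activate each of
    them at least `threshold` times before the next refresh."""
    per_row = max_activations(refresh_window_ms, row_cycle_ns) // 2
    return per_row >= threshold


ASSUMED_ROW_CYCLE_NS = 50   # assumed time between two activations in a bank
ASSUMED_THRESHOLD = 139_000  # assumed number of activations needed for a flip

print(max_activations(64, ASSUMED_ROW_CYCLE_NS))
print(can_reach(ASSUMED_THRESHOLD, 64, ASSUMED_ROW_CYCLE_NS))
```

Under these assumptions, more than a million activations fit in one 64 ms window, which leaves an attacker a comfortable margin.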
Accessing DRAM
Kim et al.’s paper includes a short snippet of x86 assembly:
  code1a:
    mov (X), %eax
    mov (Y), %ebx
    clflush (X)
    clflush (Y)
    mfence
    jmp code1a
The mov instructions read from DRAM at addresses X and Y and load the data into a register and into the CPU’s cache. The two clflush instructions then evict that data from the cache, and the mfence instruction makes sure the flushes have completed. Finally, the code jumps back to the first instruction for another iteration of reading from DRAM.
The code1a loop can cause bit flips if the addresses X and Y are well chosen: they must be in the same bank but not in the same row. If X and Y pointed to the same row, the loop would read the data from the row buffer only and would never trigger the discharging and recharging of the cells. If X and Y are in two different rows of the same bank, code1a causes these rows to be activated repeatedly, one after the other. This routine is called row hammering.
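The effect of the address choice can be illustrated with a toy model of a bank and its row buffer. This is only a sketch: the class and function names are hypothetical, the CPU cache is ignored (as if every read were followed by a flush), and no real memory is hammered.

```python
class Bank:
    """Toy model of one DRAM bank with a single row buffer."""

    def __init__(self) -> None:
        self.open_row = None   # row currently held in the row buffer
        self.activations = {}  # row number -> number of activations

    def read(self, row: int) -> None:
        """Read from `row`, activating it only on a row-buffer miss."""
        if row != self.open_row:
            self.open_row = row
            self.activations[row] = self.activations.get(row, 0) + 1


def hammer(bank: Bank, row_x: int, row_y: int, iterations: int) -> None:
    """Alternate reads to two rows, like the two mov instructions of the loop."""
    for _ in range(iterations):
        bank.read(row_x)
        bank.read(row_y)


same_row = Bank()
hammer(same_row, 5, 5, 1000)
different_rows = Bank()
hammer(different_rows, 5, 7, 1000)

print(same_row.activations)        # {5: 1}: served from the row buffer
print(different_rows.activations)  # {5: 1000, 7: 1000}: constant re-activation
```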
To pick the right addresses, the Project Zero team explored several techniques. The first one is also the one used in Yoongu Kim’s paper: relying on knowledge of the physical address mapping. On Linux, physical addresses can be obtained from /proc/PID/pagemap. Kim et al. chose Y = X + 8 MB, an offset that comes from knowledge of how the memory controllers in Intel and AMD CPUs map physical addresses to DRAM. Huge pages can also be useful: in most DRAM modules on the market, row buffers are 8192 bytes wide, so a regular 4 KB page is not big enough to cover even one row, whereas a huge page provides a large physically contiguous region.
Another technique relies on random address selection: simply allocate a large block of memory and pick two addresses at random. On a machine with 16 DRAM banks, there is a 1/16 chance that the two addresses fall in the same bank, which is quite high.
It is also possible to hammer both neighbors of a victim row. This is called double-sided hammering, and it greatly increases the chances of getting bit flips. It requires the attacker to know or guess the address offset between adjacent rows of the same bank.
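As an illustration of the first technique, the sketch below translates a virtual address into a physical one through /proc/PID/pagemap. It assumes the documented Linux layout of a pagemap entry (64 bits per virtual page, frame number in bits 0 to 54, "present" flag in bit 63) and a 4 KiB page size. The function names are hypothetical. Note that recent kernels hide the frame number from unprivileged processes, in which case the lookup returns None.

```python
import struct

PAGEMAP_ENTRY_BYTES = 8
PFN_MASK = (1 << 55) - 1  # bits 0..54 hold the page frame number
PRESENT_BIT = 1 << 63     # bit 63: the page is present in RAM


def decode_pagemap_entry(entry: int):
    """Return the page frame number, or None if it is unavailable."""
    if not entry & PRESENT_BIT:
        return None
    pfn = entry & PFN_MASK
    if pfn == 0:  # zeroed by the kernel for unprivileged readers
        return None
    return pfn


def physical_address(pid, vaddr: int, page_size: int = 4096):
    """Look up the physical address behind `vaddr` in process `pid` (Linux only)."""
    offset = (vaddr // page_size) * PAGEMAP_ENTRY_BYTES
    with open(f"/proc/{pid}/pagemap", "rb") as pagemap:
        pagemap.seek(offset)
        raw = pagemap.read(PAGEMAP_ENTRY_BYTES)
    if len(raw) != PAGEMAP_ENTRY_BYTES:
        return None
    (entry,) = struct.unpack("=Q", raw)  # entries use the host byte order
    pfn = decode_pagemap_entry(entry)
    if pfn is None:
        return None
    return pfn * page_size + vaddr % page_size
```

A caller would pass "self" as the pid and the address of a buffer it has allocated and touched, then compare the physical addresses of two candidate locations.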
Disturbances on DRAM
In general, disturbance errors happen when circuit components such as capacitors, transistors, and wires interact because they aren’t isolated well enough from each other. Depending on the interaction, many different types of disturbances are possible. They result in two kinds of errors: ‘0’ -> ‘1’ and ‘1’ -> ‘0’. The “direction” of an error depends on the orientation of the cell (see DRAM Structure & Usage).
Here, when a wordline’s voltage is toggled on and off repeatedly (typically when we access data and it is loaded into the row buffer), some cells in nearby rows leak their charge much faster. These cells can no longer hold their charge for the whole interval between two refreshes, which is usually 64 ms. Three main causes for these disturbances have been identified at the circuit level.
The first one is electromagnetic coupling, which results from alternating current flowing through conductors. The current induces an electromagnetic field that constantly expands and contracts because of its alternating nature. Whenever electromagnetic lines of force cut through another conductor, a voltage is induced in that conductor. This can lead to unwanted voltage in nearby cells. The second one is bridges, which are unwanted connections between two nodes in the memory. They appear when the isolation layer between cells is not fabricated properly, leaving a small conductive path. This weakens isolation and increases the risk of disturbance errors.
The last one is hot-carrier injection: toggling a wordline for hundreds of hours can damage it. For example, a capacitor damaged by hot-carrier injection can leak its charge faster.
As components are miniaturized, it becomes harder to keep DRAM capacitors well isolated.
Solutions to disturbances
To overcome this problem, researchers from the SAFARI research group considered several solutions. The main difficulty is that the simplest solutions are very costly, either in performance or in energy. Two viable solutions are finally discussed.
Increasing the refresh rate
Disturbances occur because a cell can leak its charge faster than the normal retention time allows. Reducing the refresh interval could therefore be a solution: even if cells lose charge faster, they are refreshed before an error occurs. The drawback is that this consumes a lot of energy and significantly reduces the performance of the module. Kim et al. found that a refresh period of 8.2 ms, instead of the original 64 ms, would be needed to counteract RowHammer. DRAM modules would then spend much more time refreshing their cells (from 11% to 35% instead of 1% to 4%). This is why it isn’t considered a viable solution.
ECC module
ECC (Error-Correcting Code) memory modules might be a solution. These modules store a few redundancy bits alongside the data, computed with a parity-based code such as a Hamming code. If one of the bits flips, the error can be detected and the correct data reported back to the host computer. However, such modules cannot correct multi-bit disturbance errors, because Hamming codes aren’t effective when several bits flip in the same word. Due to their cost, ECC modules are rarely seen in typical consumer systems, even though they are common in servers. This solution is interesting but too expensive to be accessible to everyone.
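The principle can be shown with the small textbook Hamming(7,4) code, which protects 4 data bits with 3 parity bits. This is only a sketch of the idea: real ECC modules use wider codes over 64-bit words, and the function names here are hypothetical.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] as a 7-bit codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of the error, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


data = [1, 0, 1, 1]
stored = hamming74_encode(data)

one_flip = list(stored)
one_flip[4] ^= 1  # a single disturbance error
two_flips = list(one_flip)
two_flips[6] ^= 1  # a second error in the same word

print(hamming74_decode(one_flip) == data)   # True: the flip is corrected
print(hamming74_decode(two_flips) == data)  # False: the word is miscorrected
```

With two flips in the same word, the decoder "corrects" a third, innocent bit and silently returns wrong data, which is why multi-bit RowHammer errors defeat this kind of code.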
Retire cells
Victim cells could also be detected and then remapped or even disabled. This can be done by the manufacturer or by the user. The problem for the manufacturer is that detecting them would take a long time (several days at least). Remapping would not always be possible because the DRAM might not have enough spare space. A user could use system-level techniques to detect and remap faulty cells. On the other hand, this long process would seriously degrade the efficiency of the banks or groups of rows being remapped or disabled.
Refresh neighbors
Another idea is to identify sensitive cells (some cells are more sensitive than others) and refresh their neighbors. The difficulty is mainly to detect the faulty rows efficiently without using a lot of resources. The researchers considered compact approximate data structures for this, such as Bloom filters and Morris counters. Here another limitation appears: a Bloom filter relies on hash functions, which are subject to collisions. In some situations, this can lead to tagging a large number of healthy rows as sensitive, triggering many useless refreshes that cost energy and performance. So again, it seems very difficult to find a protection without drawbacks.
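The collision problem is easy to reproduce with a tiny Bloom filter. The sketch below is a generic Bloom filter, not the structure used by the researchers; the filter size, the number of hash functions, and the row numbers are arbitrary assumptions chosen to make false positives visible.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over integer row numbers."""

    def __init__(self, size_bits: int, num_hashes: int) -> None:
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, row: int):
        # Derive `num_hashes` bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{row}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, row: int) -> None:
        for pos in self._positions(row):
            self.bits[pos] = True

    def __contains__(self, row: int) -> bool:
        return all(self.bits[pos] for pos in self._positions(row))


sensitive = BloomFilter(size_bits=64, num_hashes=2)
for row in range(0, 40, 2):  # tag 20 even-numbered rows as sensitive
    sensitive.add(row)

# Odd-numbered rows were never added, so every hit is a false positive.
false_positives = [row for row in range(1, 2000, 2) if row in sensitive]
print(len(false_positives), "healthy rows wrongly tagged as sensitive")
```

A row that was added is always found, but healthy rows whose bit positions happen to collide with those of sensitive rows are reported as sensitive too, and would be refreshed for nothing.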
PARA
The researchers proposed a method called PARA, for probabilistic adjacent row activation. The concept is quite simple. Each time a row is accessed, there is a small chance that its neighboring rows are opened too. Opening a neighbor also refreshes it. RowHammer attacks rely on hammering, meaning that a lot of accesses are made in a short amount of time. The more often a row is opened, the higher the probability that its neighbors get opened, and therefore refreshed. What makes this technique interesting is that its impact is very small in normal operation. It also doesn’t require heavy additional hardware like ECC.
As it is a probabilistic solution, it might not prevent all errors. It can still significantly lower the probability that a disturbance error occurs. Kim and his team suggest implementing it directly in the memory controller. To do so, the memory controller needs to know which rows are physically adjacent. Without this information, it is difficult to set up PARA. The researchers therefore ask manufacturers to disclose the row mapping function. It could be stored in a small ROM (Read-Only Memory) that already exists on DRAM modules. Nine years after the publication of this paper, some manufacturers have implemented this solution, and the refresh probability can even be configured in some BIOSes.
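The strength of PARA comes from simple probability: if each access refreshes the neighbors with probability p, the chance that N consecutive accesses all fail to do so is (1 - p)^N. The sketch below computes this value and simulates the mechanism. The probability of 0.001 and the count of 139,000 accesses are illustrative assumptions, and the function names are hypothetical.

```python
import random


def miss_probability(p: float, activations: int) -> float:
    """Probability that none of `activations` accesses refreshes the neighbors."""
    return (1.0 - p) ** activations


def longest_unprotected_run(p: float, activations: int, rng: random.Random) -> int:
    """Simulate PARA and return the longest run of accesses to a row
    during which its neighbors were never refreshed."""
    longest = current = 0
    for _ in range(activations):
        if rng.random() < p:  # PARA decides to open (refresh) the neighbors
            current = 0
        else:
            current += 1
            longest = max(longest, current)
    return longest


ASSUMED_P = 0.001           # assumed refresh probability per access
ASSUMED_ACCESSES = 139_000  # assumed number of accesses needed for a bit flip

print(miss_probability(ASSUMED_P, ASSUMED_ACCESSES))
print(longest_unprotected_run(ASSUMED_P, ASSUMED_ACCESSES, random.Random(0)))
```

Even with such a small p, the probability of hammering long enough without a single neighbor refresh is vanishingly small, while a normal workload pays for only about one extra activation per thousand accesses.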
BlockHammer
In 2021, a new solution was introduced by A. Giray Yaglikci et al.: BlockHammer. They chose to prevent RowHammer by blacklisting rapidly accessed DRAM rows. The technique tries to address two key challenges:
- Scalability: the vulnerability has worsened with DDR4 and the newer LPDDR4, so the solution must keep its performance and energy cost acceptable.
- Compatibility with new DRAM chips: the authors don’t want to modify the internals of the chips, yet the concept must work with any kind of DRAM, even though vendors keep the address mapping of their products confidential.
The concept is to block rows that are toggled suspiciously often. Rows that are activated very frequently are detected and blacklisted using a Bloom filter. If the activation rate of a row is too high, BlockHammer blocks further accesses to that row at the level of the memory scheduler. The team’s evaluation shows that this solution has a negligible energy and performance cost when there is no attack. Under a RowHammer attack, BlockHammer provides significantly higher performance (71% on average) and lower energy consumption (32% on average) than existing solutions.
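The blacklisting idea can be sketched as a simple rate limiter. This is a deliberately simplified model and not BlockHammer’s actual design: it uses an exact per-row counter where BlockHammer uses a Bloom filter, and the class name, the threshold, and the notion of a fixed time window are assumptions made for the illustration.

```python
from collections import Counter


class ActivationThrottler:
    """Block rows that are activated too often within one time window."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold  # activations allowed per row and window
        self.counts = Counter()     # exact counts (stand-in for a Bloom filter)

    def request(self, row: int) -> bool:
        """Return True if the activation is allowed, False if it is blocked."""
        if self.counts[row] >= self.threshold:
            return False  # the row is blacklisted for the rest of the window
        self.counts[row] += 1
        return True

    def new_window(self) -> None:
        """Forget all counts at the start of a new time window."""
        self.counts.clear()


throttler = ActivationThrottler(threshold=3)
print([throttler.request(7) for _ in range(5)])  # hammered row gets blocked
print(throttler.request(8))                      # other rows are unaffected
```

A benign program rarely reaches the threshold and is never delayed, while a hammering loop is cut off before it can reach the number of activations needed for a bit flip.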
Conclusion
We introduced the RowHammer vulnerability and briefly described the key points of its history. We explained the phenomenon of disturbance errors and looked into the structure and operation of DRAM. This article also covers how a RowHammer attack is carried out and why it works. Finally, several solutions were discussed, two of which stand out: PARA and BlockHammer. Even though ingenious solutions have been found, modern DRAM chips aren’t completely protected, and more research is still needed.
Sources
Z. Al-Ars et al. DRAM-Specific Space of Memory Tests. In ITC, 2006
3Blue1Brown. How to send a self-correcting message (Hamming codes), 2021