MCU land, part 6: DMA on Cortex-M7
Exploring the DMA subsystem of Microchip S70 / E70 / V70 chips.
One of the unsung heroes of modern high-performance computing is the Direct Memory Access (DMA) controller. These dedicated devices accept instructions to move data between memory locations or I/O peripherals, and then carry out their orders without the involvement of the CPU. This frees up the processor to focus on real computation, rather than tending to every single byte that’s flowing in or out.
To hobbyists, DMA controllers are mythical creatures: understood to be powerful yet firmly out of reach. A quick Google search reveals a total of one third-party GitHub repository using the DMA subsystem on the popular SAM S70 series MCU, and that’s since the chip’s introduction in 2015. The code in the repository is oddly formatted, sparsely commented, and difficult to reuse.
In an earlier project, I came across at least two situations where a DMA controller would have helped: recording and playing back audio samples, and streaming data to an external memory module. The 8-bit MCU I used at the time did not have DMA support. That said, it might be instructive to explore this subsystem on the aforementioned Microchip SAM E70 / S70 / V70 series MCU; as it turns out, it’s not all that hard to use.
Simple memory transfers
Like some other non-essential peripherals on the die, the chip’s 24-channel Extended DMA Controller (XDMAC) is powered down after boot. To enable it, we need to set the corresponding bit in PMC_PCER1, one of the power management registers. XDMAC is peripheral ID 58 and PMC_PCER1 covers IDs 32 through 63, which works out to bit 26:
void turn_on_dma() { PMC->PMC_PCER1 = 1 << 26; }
When XDMAC is in its boot-up configuration, straightforward memory transfers can be accomplished by writing the source address to the XDMAC_CSA register, putting the destination in XDMAC_CDA, and then specifying the length in XDMAC_CUBC.
The next stop is the XDMAC_CC configuration register, where we can specify data transfer units (DWIDTH=0/1/2 for 8, 16, or 32 bits). We also need to indicate that both the source and the destination pointer should be advanced after every write (SAM=1 and DAM=1) and tell the controller to read data via the second DMA interface (SIF=1), which gives it access to SRAM, flash, and peripherals. The first (zeroth?) interface is connected only to SRAM. Since that’s the most likely memory destination in most applications, we set DIF=0.
With this in place, we can trigger the transfer by enabling a bit in XDMAC_GE.
void dma_move_mem(uint8_t dma_ch, const void* src, volatile void* dst, uint32_t len) {
  XDMAC->XdmacChid[dma_ch].XDMAC_CSA = (uint32_t)src;
  XDMAC->XdmacChid[dma_ch].XDMAC_CDA = (uint32_t)dst;
  XDMAC->XdmacChid[dma_ch].XDMAC_CUBC = len;
  XDMAC->XdmacChid[dma_ch].XDMAC_CC =
    (0 << 11) /* DWIDTH: bytes */ |
    (1 << 13) /* SIF: source on the second DMA iface */ |
    (0 << 14) /* DIF: destination on the first DMA iface */ |
    (1 << 16) /* SAM: source address increment */ |
    (1 << 18) /* DAM: destination address increment */;
  XDMAC->XDMAC_GE = (1 << dma_ch); /* Go! */
}
That’s all we need to carry out a simple test:
const uint8_t src[4] = { 1, 2, 3, 4 };
volatile uint8_t dst[4];

int main() {
  turn_on_dma();
  dma_move_mem(0, src, (void*)dst, sizeof(dst));
  sleep(1); /* Implementation from an earlier article */
  if (dst[0] == 1 && dst[1] == 2 && dst[2] == 3 && dst[3] == 4) {
    /* Make happy sounds, take a victory lap */
  }
}
Seeing is believing — although in general, a better way to confirm the completion of a transfer is to check the BIS flag in the DMA channel’s XDMAC_CIS register (or to enable the corresponding interrupt).
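To make the BIS check concrete, here is a minimal polling sketch. The helper name dma_wait_done() and the poll budget are my own inventions for illustration, not part of any vendor library; BIS is assumed to sit at bit 0 of XDMAC_CIS, so it's worth confirming against the datasheet. The function takes a pointer to the status register, which also makes it easy to try out off-target:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed position of the end-of-block (BIS) flag in XDMAC_CIS. */
#define XDMAC_CIS_BIS (1u << 0)

/* Hypothetical helper: polls a channel status register until BIS is set
   or the poll budget runs out. Returns true if the flag was seen and
   false on timeout. */
bool dma_wait_done(volatile const uint32_t* cis_reg, uint32_t max_polls) {
  while (max_polls--) {
    if (*cis_reg & XDMAC_CIS_BIS) return true;
  }
  return false;
}
```

On the chip, the call would look something like dma_wait_done(&XDMAC->XdmacChid[0].XDMAC_CIS, 1000000). The bounded loop is a deliberate choice: a misconfigured channel may never raise the flag, and a timeout beats a silent hang.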
DMA-assisted I/O
Of course, the DMA controller is more than just a fancy version of memcpy(). You can specify any I/O port as the source or the destination of a data transfer. Such transfers can be clocked off a signal provided by a serial bus controller or in a handful of other ways.
Recall from the previous article that on Cortex chips, registers are memory-mapped, so their addresses can be passed to the DMA subsystem as-is. That said, we need to make some tweaks to the configuration register: this time, we set TYPE=1 to indicate a peripheral transfer; DSYNC=1 to indicate that writes (vs reads) should be synchronized to a clock; and DAM=0 to keep the destination address unchanged. We also set DIF=1 to use the DMA bus interface connected to I/O ports.
For the initial experiment, let’s also set SWREQ=1, telling the DMA controller that we’ll generate timing signals in software. This saves us the effort of building a more elaborate hardware-clocked setup:
void dma_send_to_pio(uint8_t dma_ch, const void* src, volatile void* pio, uint32_t len) {
  XDMAC->XdmacChid[dma_ch].XDMAC_CSA = (uint32_t)src;
  XDMAC->XdmacChid[dma_ch].XDMAC_CDA = (uint32_t)pio;
  XDMAC->XdmacChid[dma_ch].XDMAC_CUBC = len;
  XDMAC->XdmacChid[dma_ch].XDMAC_CC =
    (1) /* TYPE: memory-to-peripheral */ |
    (1 << 4) /* DSYNC: sync before write */ |
    (1 << 6) /* SWREQ: software-controlled sync (XXX) */ |
    (0 << 11) /* DWIDTH: bytes */ |
    (1 << 13) /* SIF: source on the second DMA iface */ |
    (1 << 14) /* DIF: destination on the second DMA iface */ |
    (1 << 16) /* SAM: source address increment */ |
    (0 << 18) /* DAM: destination address fixed */;
  XDMAC->XDMAC_GE = (1 << dma_ch); /* Go! */
}
If you have an LED connected to PA3, as shown in one of the earlier articles, you should now be able to run this code to play back a sequence of blinks:
#define PA_LED (1 << 3)

const uint8_t led_sequence[10] = { 0, PA_LED, 0, PA_LED, PA_LED, 0, PA_LED, PA_LED, PA_LED, 0 };

int main() {
  /* Configure LED pin */
  PIOA->PIO_OER = PA_LED;
  PIOA->PIO_PUDR = PA_LED;
  PIOA->PIO_OWER = PA_LED;
  turn_on_dma();
  dma_send_to_pio(0, led_sequence, &PIOA->PIO_ODSR, sizeof(led_sequence));
  while (1) {
    sleep(1);
    XDMAC->XDMAC_GSWR = 1; /* DMA timing pulse for channel 0 */
  }
}
For truly asynchronous I/O, one would turn off the SWREQ flag in the XDMAC_CC register, and then configure the nearby PERID bitfield to point to a hardware timing source. Given what we discussed here and in the previous article, this should be a fairly straightforward task.
It must be said that XDMAC is a complex beast; it has a plethora of other modes, including vectored transfers of non-contiguous memory regions, burst transfer features, and more. That said, unless you’re writing an operating system, linear DMA operations usually do the trick.
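As a rough sketch of that last step, the function below builds an XDMAC_CC value for a hardware-paced memory-to-peripheral transfer. The bit positions mirror the ones used in dma_send_to_pio(); the request-line field is assumed to occupy bits 24 through 30 (it's listed as PERID in the datasheet, as far as I can tell), and the function name is made up for this example. The actual request-line number for a given peripheral has to come from the datasheet, so it's left as a parameter:

```c
#include <stdint.h>

/* Hypothetical helper: XDMAC_CC value for a hardware-synchronized,
   byte-wide, memory-to-peripheral transfer. The request-line field at
   bits 24..30 is an assumption to verify against the datasheet. */
uint32_t xdmac_cc_mem_to_periph_hw(uint8_t req_line) {
  return (1u << 0)   /* TYPE: memory-to-peripheral */
       | (1u << 4)   /* DSYNC: sync before write */
       | (0u << 6)   /* SWREQ: off, hardware request line paces the transfer */
       | (0u << 11)  /* DWIDTH: bytes */
       | (1u << 13)  /* SIF: source on the second DMA iface */
       | (1u << 14)  /* DIF: destination on the second DMA iface */
       | (1u << 16)  /* SAM: source address increment */
       | (0u << 18)  /* DAM: destination address fixed */
       | ((uint32_t)(req_line & 0x7Fu) << 24); /* request line, 7 bits */
}
```

The result would replace the hand-assembled constant in dma_send_to_pio(), with the peripheral itself configured to issue DMA requests.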
A note about data caching
An interesting complication may arise if the MCU is running with data caching enabled. This is not the boot-up state on the SAM S70, but it is a performance-enhancing tweak that can be turned on via SCB_EnableDCache(). It is often combined with a related SCB_EnableICache() call that controls the instruction cache.
In this state, it is possible for the CPU to retain a stale local copy of the DMA destination buffer in its cache, not realizing that the DMA controller updated the main memory in the meantime. From the perspective of your program, it will appear that the DMA transfer partly or fully failed.
That’s not all: a related issue may happen due to write caching. Consider that the CPU cache might contain pending writes to the DMA source region that have not yet been reconciled with the main memory; or pending writes to the destination region that might be belatedly finalized in the middle of the DMA.
But wait, there’s more! Even if you address these issues, you’re still not in the clear. The processor might speculatively prefetch memory into cache based on the instructions it is expecting to execute down the line, even if it hasn’t gotten to that code yet. If this happens before or in the middle of a DMA, subsequent access to prefetched memory will have hilarious results.
Luckily, there is a solution. If you’re transferring data from memory, you should call SCB_CleanDCache_by_Addr(src_buf, src_len) to force pending writes and ensure coherency before scheduling the DMA. Conversely, if you’re performing a transfer into a memory buffer, you should either refrain from modifying that region beforehand, or discard pending changes by calling SCB_InvalidateDCache_by_Addr(dest_buf, dest_len). Finally, to address the last issue — read prefetching — you must call SCB_InvalidateDCache_by_Addr(dest_buf, dest_len) once more after the transfer is complete.
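One practical wrinkle with the by-address calls: the Cortex-M7 data cache works in 32-byte lines, so cleaning or invalidating a buffer affects every line it touches, including any unrelated variables that happen to share those lines. A minimal sketch for working out the affected span follows; cache_line_span() and cache_range_t are hypothetical names for this example, not CMSIS functions:

```c
#include <stddef.h>
#include <stdint.h>

#define DCACHE_LINE 32u /* Cortex-M7 data cache line size, in bytes */

/* Hypothetical type: a memory range rounded out to whole cache lines. */
typedef struct {
  uintptr_t start; /* First byte of the first affected line */
  size_t len;      /* Total size of all affected lines */
} cache_range_t;

/* Rounds [addr, addr + len) outward to cache-line boundaries. */
cache_range_t cache_line_span(uintptr_t addr, size_t len) {
  uintptr_t mask = ~(uintptr_t)(DCACHE_LINE - 1);
  uintptr_t start = addr & mask;
  uintptr_t end = (addr + len + DCACHE_LINE - 1) & mask;
  cache_range_t r = { start, (size_t)(end - start) };
  return r;
}
```

If the computed span is larger than the buffer itself, invalidating it can throw away pending writes to neighboring data. The simplest way I know to sidestep this is to align DMA buffers to 32 bytes and pad them to a multiple of 32.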