MCU land, part 3: baby steps with Cortex-M7

A gentle introduction to Microchip SAM S70 / E70 / V7x chips, aka PIC32CZ CA70 - from "hello world" to clock settings and performance tweaks.

Jan 28, 2023

In my earlier posts, I advocated for weaning yourself off embedded Linux platforms such as Raspberry Pi. To make my case, I talked about the ease of tackling seemingly “OS-grade” problems — such as driving a graphical display or communicating with external storage devices — with simple code running on a bare MCU.

My examples relied on 8-bit AVR microcontrollers, so I danced around another pervasive myth: that if you need high performance, the complexity of wiring up, bootstrapping, and programming a modern 32-bit device is beyond the skill of most hobbyists.

For the most part, this belief is bunk. To illustrate, let’s take the Microchip-made SAM S70 / E70 / V7x series MCUs; a slightly refreshed version of these chips is also marketed as PIC32CZ CA70. The devices are based on the ARM Cortex-M7 architecture. My pick of the litter is ATSAMS70J21, an LQFP-64 chip with a 300 MHz clock, an onboard FPU, 2 MB of flash memory, and 384 kB of SRAM. The PIC32CZ equivalent part is PIC32CZ2051CA70064; it’s largely identical, save for having 512 kB of RAM.

*Basic LQFP-64 pinout on the SAM S70, by author.*

A simple $0.90 breakout board can be used to mount the chip for prototyping (link). Soldering QFP chips to breakout boards is fairly easy; if you don’t have a microscope, the most foolproof technique is to align and secure the chip in place, drag a good amount of solder and flux across to wet all the pins, and then collect the excess with a clean tip or a soldering wick. That said, ready-made breakout boards for Cortex-M7 or other Cortex family chips can be purchased, too.

*LQFP-64 on a breakout board. There’s no such thing as too much flux.*

With the chip mounted, the next task is to provide it with a supply of between 1.7 and 3.6V. In my drawing, the ground pins are marked GND (6, 14, 31), and the positive rail goes to three groups of pins: Vdd (4), I/O Vdd (10, 42, 58), and USB Vdd (64). These three groups power the main voltage regulator, the I/O subsystem, and the USB subsystem, respectively.

There is also a second class of pins, Vcore (13, 24, 61) and PLL Vcore (52), that need 1.2V and can be damaged by a higher voltage. Luckily, the MCU has a built-in regulator generating 1.2V on the pin marked Vcore OUT (3); your job is just to route it back to the MCU, with a filtering capacitor hooked up.

To program the device, you can use the Serial Wire Debug (SWD) protocol. This requires connecting SWCLK (39), SWDIO (35), and RESET (36) pins to a programmer. Cheap $15 programmers compliant with the CMSIS-DAP protocol and supporting SWD transport work fine; so does Atmel-ICE, a universal programmer and debugger that supports a variety of MCUs, including all generations of 8-bit AVR chips. (For tips for the Linux toolchain, click here.)

In the end, a rudimentary circuit to boot, program, and test a SAM S70 is as simple as it gets:

*Getting SAM S70 off the ground, by author.*

When driving more peripherals, it is important to connect all the power pins; for a quick demo, though, this half-baked setup will suffice.

There is a relative dearth of tutorials for programming 32-bit MCUs, but the environment is in many respects analogous to the more familiar 8-bit chips. For example, the following “hello world” program strobes an LED connected to PA3 (pin 40) at about 2.5 Hz:

#include <sam.h>

void short_delay() {
  uint32_t i = 1000000;
  while (i--) asm("nop");
}

int main(void) {

  /* Disable watchdog to avoid a reboot every 16 seconds */
  WDT->WDT_MR = 1 << 15;

  /* Configure output pins */
  PIOA->PIO_OER  = 1 << 3;
  PIOA->PIO_PUDR = 1 << 3;

  /* Toggle bits and spin wheels */
  while (1) {
    PIOA->PIO_SODR = 1 << 3; short_delay();
    PIOA->PIO_CODR = 1 << 3; short_delay();
  }

}

In this snippet, PIOA stands for the I/O controller for port A on SAM series chips; PIO_OER is the output enable register (at boot, most pins are configured as inputs); PIO_PUDR is the pull-up resistor disable register; and PIO_SODR and PIO_CODR are registers for setting and clearing output bits.

(If the code uploads but doesn’t run, use the programmer to set the BOOT_MODE flag to 1 in the GPNVM. This enables booting from flash. Also note that on PIC32CZ CA70, the naming convention for the registers is slightly different: you'd use PIOA_REGS instead of PIOA, WDT_REGS instead of WDT, and so on.)

Using pins as inputs is about as simple and involves reading the PIO_PDSR register for the corresponding port. That said, while output operations are “free”, for input, one needs to power up the I/O controller first. For port A, the relevant operation is:

PMC->PMC_PCER0  = 1 << 10;
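
As a minimal sketch of the input path, the two steps can be split into small helpers. They take register pointers as arguments rather than naming the registers directly, so the logic can be exercised off-target. The function names and parameters here are illustrative and not part of any vendor header; on the chip, one would presumably pass &PMC->PMC_PCER0 and &PIOA->PIO_PDSR.

```c
#include <stdint.h>

/* Peripheral ID of the port A I/O controller, matching the PMC write above. */
#define PIOA_PERIPH_ID 10u

/* Hypothetical helper: power up the port A I/O controller.
   pcer0 points at PMC_PCER0 (or a stand-in variable when testing). */
void pioa_clock_on(volatile uint32_t *pcer0) {
  *pcer0 = 1u << PIOA_PERIPH_ID;
}

/* Hypothetical helper: return 1 if the given pin reads high, 0 otherwise.
   pdsr points at the port's PIO_PDSR register (or a stand-in variable). */
int pin_is_high(const volatile uint32_t *pdsr, unsigned pin) {
  return (int)((*pdsr >> pin) & 1u);
}
```

On the target, a call such as pin_is_high(&PIOA->PIO_PDSR, 3) would then report the state of PA3, provided pioa_clock_on() ran first.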

That’s not to say that 32-bit MCUs are always a walk in the park. Consider that this particular chip boots up with a clock of 12 MHz, a fraction of what you are paying for. On an 8-bit MCU, adjusting the clock is trivial; on the SAM S70, you can try some online examples, realize that they’re all wrong, and then spend several hours nose down in the 2,000-page datasheet. This diagram foreshadows the adventure ahead:

*From the SAM S70 spec. Color highlights added by author.*

The first revelation is that 300 MHz is the maximum clock for the processor core (HCLK), not the entire device. There is an internal bus used by the flash memory controller and other built-in peripherals, and that bus can’t run faster than 150 MHz; this fact is helpfully mentioned on page 1852 of the spec.

Given the diagram above, the way to get the CPU core to its maximum speed is to coax the clock generator module (left block on the diagram) into producing a 300 MHz signal; but prior to that, we need to configure a divider (MDIV) in the main programmable clock controller (PMC, roughly center of the diagram) to halve the passthrough frequency it delivers to the bus clock (MCK) branch.

The other gotcha is the onboard flash memory: non-volatile memories are fairly slow, and if you jack up the bus clock, the controller won’t be able to complete reads and writes in the allotted number of cycles. But don’t worry: flip to page 1805 of the spec and have a look at table 57-50, which recommends the value to put in the FWS bits of the flash controller register (EEFC_FMR). For a 150 MHz bus, it’s 6.

After setting MDIV and FWS values, your MCU will run slower, not faster; after all, the clock source is still the 12 MHz RC oscillator. The next step is to configure an internal phase-locked loop circuit (PLLA) to take that 12 MHz and multiply it by 25. This is done by accessing the multiplier (MULA) field in the CKGR_PLLAR register of the PMC.
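
The arithmetic behind these settings is easy to sanity-check off-target. The sketch below assumes the PLLA output equals the input frequency multiplied by (MULA + 1) and divided by an input divider (DIVA), which is why a x25 multiplier is written as 24. The function names are made up for illustration.

```c
#include <stdint.h>

/* PLLA output in Hz, assuming f_out = f_in * (MULA + 1) / DIVA.
   A DIVA of 0 is treated as "PLL disabled" and yields 0. */
uint32_t plla_hz(uint32_t input_hz, uint32_t mula, uint32_t diva) {
  if (diva == 0u) return 0u;
  /* 64-bit intermediate avoids overflow for large multipliers. */
  return (uint32_t)((uint64_t)input_hz * (mula + 1u) / diva);
}

/* Bus clock (MCK) in Hz, derived from the core clock by a plain divisor. */
uint32_t mck_hz(uint32_t hclk_hz, uint32_t divisor) {
  return hclk_hz / divisor;
}
```

With the 12 MHz RC oscillator as input, plla_hz(12000000, 24, 1) gives 300 MHz for the core, and mck_hz(300000000, 2) gives the 150 MHz bus clock that the flash wait state setting was chosen for.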

We’re still not done! The final step is to actually switch the clock input from the RC oscillator to the PLL-multiplied clock (blue to red line on the diagram). This is accomplished by changing the clock select (CSS) field in the PMC_MCKR register. Critically (and this is what most online examples get wrong), this can’t be done in tandem with tweaking any other PMC_MCKR values. You need to change the register step by step, waiting for an acknowledgment flag along the way. It’s mentioned on page 248 of the spec.

In the end, the code is simple, but not trivial to arrive at without understanding some of the finer points of the architecture of the chip — and the sheer size of the spec makes the relevant information hard to find:

void turbo_mode() {

  /* Set flash wait state suitable for the new clock (FWS@8 = 6). */
  EFC->EEFC_FMR = 6 << 8;

  /* Set PLL multiplier to x25 (MULA@16 = 24), wait for ACK. */
  PMC->CKGR_PLLAR = (1 << 29) | (24 << 16) | (0x3f << 8) | 1;
  while (!(PMC->PMC_SR & PMC_SR_LOCKA));

  /* Set system bus clock divider to 2 (MDIV@8 = 1), wait for ACK. */
  PMC->PMC_MCKR = (0b01 << 8) | 1;
  while (!(PMC->PMC_SR & PMC_SR_MCKRDY));

  /* Toggle PMC clock source (CSS@0 = 2), wait for ACK. */
  PMC->PMC_MCKR = (0b01 << 8) | 2;
  while (!(PMC->PMC_SR & PMC_SR_MCKRDY));

}

(On PIC32CZ CA70, use PMC_SR_LOCKA_Msk instead of PMC_SR_LOCKA; and PMC_SR_MCKRDY_Msk instead of PMC_SR_MCKRDY.)
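
The "one field at a time" rule can also be captured in a small helper. This is a sketch, not the vendor's API: the function name is hypothetical, and the register pointers and ready mask are passed in so that nothing about the header layout is assumed. It changes a single field, leaves the remaining bits alone, and spins until the ready flag is set.

```c
#include <stdint.h>

/* Hypothetical helper: update one field of a PMC_MCKR-style register
   without touching the other bits, then wait for the ready flag in the
   status register. All pointers and masks are supplied by the caller. */
void mckr_update_field(volatile uint32_t *mckr,
                       const volatile uint32_t *sr,
                       uint32_t ready_mask,
                       uint32_t field_mask,
                       uint32_t field_value) {
  *mckr = (*mckr & ~field_mask) | (field_value & field_mask);
  while (!(*sr & ready_mask)) {
    /* spin until the clock controller acknowledges the change */
  }
}
```

On the target, the last two steps of turbo_mode() would then read roughly as mckr_update_field(&PMC->PMC_MCKR, &PMC->PMC_SR, PMC_SR_MCKRDY, 3 << 8, 1 << 8) followed by the same call with a mask of 3 and a value of 2.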

Another peculiarity of the platform is that it features a hardware floating-point unit (FPU), but that unit is turned off by default, costing you quite a bit if you want to do certain types of calculations. The way to turn it on is not documented in the Microchip spec at all, but it is outlined in the ARMv7-M Architecture Reference Manual. In the end, it’s a matter of setting four bits corresponding to “coprocessor 10” and “coprocessor 11” in the ARM Cortex “coprocessor access control register” (CPACR):

SCB->CPACR |= (0b1111 << 20);

This is the only documented function of the CPACR register; the other “coprocessors” are either reserved or vendor-specific.
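
For readers who prefer named constants to a bare 0b1111, the same value can be spelled out as two 2-bit fields. The macro and function names below are my own and not taken from CMSIS; the sketch only computes the register value, so it runs anywhere.

```c
#include <stdint.h>

/* Each coprocessor gets a 2-bit access field; 0b11 means full access. */
#define CPACR_CP10_FULL (0x3u << 20)  /* CP10 field, bits 20-21 */
#define CPACR_CP11_FULL (0x3u << 22)  /* CP11 field, bits 22-23 */

/* Return the given CPACR value with both FPU coprocessor fields enabled. */
uint32_t cpacr_with_fpu(uint32_t cpacr) {
  return cpacr | CPACR_CP10_FULL | CPACR_CP11_FULL;
}
```

On the target, this would be used as SCB->CPACR = cpacr_with_fpu(SCB->CPACR). ARM's own examples follow the write with data and instruction barriers (__DSB() and __ISB() in CMSIS) so that the change takes effect before the next floating-point instruction.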

In any case, flipping the bits does not by itself speed up your code: your compiler is still not generating the opcodes for the FPU. This can be fixed by appending “-mfloat-abi=softfp -mfpu=fpv5-d16” to your CFLAGS. But if you do that, your program might hang, if not right now, then as you add more code to the project. What gives?

The problem is that after the change, the compiler is free to output code that tries to use floating-point operations — and it might choose to do so before the FPU is turned on. This might happen even before the first line of main() is executed.

The surest fix is to make your initialization routine a “constructor” function. If declared this way, it will not need to be explicitly called, and it will run before almost anything else:

__attribute__((constructor(101))) static void enable_fpu() {
  SCB->CPACR |= (0b1111 << 20);
}
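
The ordering guarantee is easy to observe on a desktop machine with GCC or Clang. In the sketch below, fake_cpacr is a stand-in for SCB->CPACR and the function name is arbitrary; the point is only that the variable is already set by the time main() starts.

```c
#include <stdint.h>

/* Stand-in for SCB->CPACR so the sketch runs off-target. */
volatile uint32_t fake_cpacr;

/* Runs before main(). Lower priority numbers run earlier; values up to 100
   are reserved for the implementation, so 101 is the earliest usable slot. */
__attribute__((constructor(101))) static void enable_fpu_early(void) {
  fake_cpacr |= (0xFu << 20);
}
```

On a bare-metal target, this relies on the startup code walking the constructor table before jumping to main(), which stock startup files typically do by calling __libc_init_array().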

While you're at it, you might also want to call SCB_EnableDCache() and SCB_EnableICache(). They enable memory caching, giving most code a major performance boost.

All this goes back to my original point: in many applications, 8-bit MCUs are cheaper, more robust, and simpler to use. That said, if you have a computationally-intensive application, it doesn’t make sense to shy away from a 32-bit chip — and you don’t need an operating system or a costly, component-packed evaluation board to make it come alive.

👉 Continue to the next article in the series: digital signal processing with an MCU. To review the entire series of articles on digital and analog electronics, check out this page.


lcamtuf

Jan 20

FWIW: I'm using the SAM S70 series because I think they're some of the most user-friendly high-speed MCUs out there. They come in packages that are easy to hand-solder, have internal flash and SRAM, come with uncomplicated power supply requirements, and have a decent free IDE, good (if voluminous) docs, and a well-maintained open-source toolchain.

Most other Cortex-M7 chips fail on one or more of these fronts. For example, most STM32H7 chips come only in non-leaded packages; I think there are just two TQFP-64 varieties, and their availability is hit-and-miss. NXP has some comparable products, but from what I recall, their ecosystem is just not as hobbyist-friendly.


skybrian

Feb 1, 2023

You mentioned ready-made breakout boards in passing, but it seems like buying a board from a vendor like Teensy or Adafruit and using Arduino or PlatformIO would be the most common way to approach this for hobbyists? Certainly there’s no reason to spend $250 when these boards go for $6-30. (Whether an Arduino implementation counts as an OS is debatable, but it does handle booting and some device drivers for you, sort of like MS-DOS did.)

My recent projects have used Raspberry Pi Picos after a friend gave me a few. The most difficult part of getting started was deciding among the many ways of writing a program for it. Settled on this one: https://github.com/earlephilhower/arduino-pico

When programming at this level of abstraction, the complexity of 32-bit microcontrollers is hidden and they can be had at similar prices to 8-bit boards, so I don’t see any reason not to use 32-bit MCUs all the time.
