MCU #13: building a graphics stack
Going from pixels to usable graphics with an ST7789 display hooked up to a generic microcontroller.
Over the past year or so, I developed a series of handheld games running on custom hardware. It started as an attempt to introduce my kids to a couple of early arcade classics — but then quickly spiraled out of control with a sophisticated Sokoban clone on an AVR Dx microcontroller, followed by an investigation of modern 2D game aesthetics on a Cortex-M7 MCU.
Right now, I’m working on what might be the series’ pièce de résistance — a reimagining of Lemmings, a blockbuster game from the early 1990s:
Developing games on bare-metal platforms is a welcome detox from the paradigms of modern software development. Contrary to what most people suspect, the hardware part is easy; the challenge is in the software abstractions you build. When it comes to graphics, you’re no longer forced to put up with the quirks of OpenGL or Unity — but then, if you end up with an unworkable API, you have no one else to blame.
Not long ago, I talked about the audio stack for my latest Lemmings-themed project, dubbed Over the Edge. In today’s episode, I wanted to take a quick look at the graphics API.
The hardware side
Most microcontrollers available on the market today lack any graphics facilities; this is in stark contrast not only to modern PCs, but even to the 8-bit microcomputers of the 1980s. The machines of that era featured hardware character generators, specialized sprite compositing circuitry, and a variety of ROM drawing routines.
On the flip side, contemporary 8-bit MCUs are orders of magnitude faster than their 1980s counterparts, so it’s possible to do 2D graphics in software. It helps that many displays designed for embedded applications have simple digital interfaces and built-in framebuffers; you no longer have to fiddle with analog video or synchronize data transfers to the pixel clock.
In Over the Edge, I am using a 2.8” IPS display from Newhaven (NHD-2.8-240320AF-CSXP-F). The device comes with an embedded ST7789Vi controller, which takes care of the physics of driving an LCD matrix, and has some onboard SRAM. I talk to the panel using a 16-bit wide parallel bus running at 15 MHz; the resulting data rate is 240 Mbit/sec, or about 200 frames per second. More pragmatically, the approach lets me maintain a sensible refresh rate while not taking up too much CPU time.
I covered the ST7789 protocol and the 16-bit RGB565 pixel format in an earlier article; the core pixel-pushing logic — which operates on hardware I/O lines via memory-mapped registers — is trivial:
/* Send command byte to panel (WR edge while RS low). */
static void tft_cmd(u8 cmd) {
  PIOB->PIO_CODR = 0b001;   /* RS- low */
  PIOB->PIO_CODR = 0b010;   /* WR- low */
  tft_slight_delay();       /* ~60 ns */
  PIOD->PIO_ODSR = cmd;
  PIOB->PIO_SODR = 0b010;   /* WR- high */
  PIOB->PIO_SODR = 0b001;   /* RS- high */
}

/* Send data word to panel (WR edge while RS high). */
static void tft_data(u16 data) {
  PIOB->PIO_CODR = 0b010;   /* WR- low */
  tft_slight_delay();       /* ~60 ns */
  PIOD->PIO_ODSR = data;
  PIOB->PIO_SODR = 0b010;   /* WR- high */
}

/* Get the screen in a predictable state. */
void tft_init() {
  tft_cmd(0x11);                   /* Exit sleep mode. */
  tft_cmd(0x3a); tft_data(0x05);   /* Pixel format: 16bpp (RGB565) */
  tft_cmd(0x21);                   /* IPS inversion mode */
  tft_cmd(0x29);                   /* Display on */
}

/* Send 320 x 240 x 16bpp bitmap to screen. */
void tft_send_bitmap(const u16* bitmap) {
  for (u32 i = 0; i < 320 * 240; i++) tft_data(*(bitmap++));
}
Some microcontrollers might support offloading tft_send_bitmap() to the DMA controller. That said, this is not necessary in my use case; more about that soon.
The screen compositor
In the 1980s and early 1990s, many graphics libraries were content to give you a way to paint on the screen: that is, to bake text or geometric shapes onto the output bitmap, with little regard for what these pixels represent or what might need to happen next.
The problem with this approach was that you sometimes needed to remove objects, too. In a text editor or a rudimentary computer game, you could simply memset() the unwanted section to a uniform background color; but in more sophisticated cases, you could be dealing with background images and overlapping foreground elements. All too often, you had to painstakingly recreate the painted-over visuals while the graphics library just sat there and stared blankly at you.
To be more useful, a modern graphics stack also needs to do compositing: it must be able to keep track of logical objects on the screen and then dynamically calculate screen frames based on their momentary state. To that effect, all the synchronous graphics API calls in Over the Edge just tweak attributes of screen objects, and don’t touch pixel-level data at all, e.g.:
void put_char(const u8 chr, u8 cx, u8 cy, u16 color) {
  screen[cy][cx].spr_id = (chr >= 33 && chr <= 122) ? chr : 0;
  screen[cy][cx].color  = color;
  gcc_barrier();
  cy_dirty[cy] = 1;
}
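The underlying data structures aren't shown in the article; here's a minimal sketch of what they might look like. The field layout, the packed attribute, the volatile qualifier, and the gcc_barrier() definition are my assumptions:

#define X_CELLS 40   /* 320 / 8 columns */
#define Y_CELLS 30   /* 240 / 8 rows */

/* One character cell: a glyph / sprite ID plus an RGB565 color.
   Packed to three bytes per cell, matching the ~3.5 kB figure
   quoted later for the whole 40 x 30 overlay. */
struct screen_cell {
  u8  spr_id;   /* 0 = empty, 33-122 = ASCII glyph, 128-255 = fixed sprite */
  u16 color;    /* RGB565 color attribute for text */
} __attribute__((packed));

static struct screen_cell screen[Y_CELLS][X_CELLS];
static volatile u8 cy_dirty[Y_CELLS];   /* One flag per 320 x 8 stripe. */

/* Compiler-level memory barrier: ensures the cell is fully written
   before the dirty flag becomes visible to the display interrupt. */
#define gcc_barrier() asm volatile("" ::: "memory")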
Similarly to the design of my earlier game — Bob the Cat — the task of actually computing the bitmap is handled in the background via a 1 kHz hardware timer interrupt. The interrupt reads the list of “dirty” screen regions — stripes of 320×8 pixels — and then selectively calculates the stripes and sends them to the LCD:
void display_interrupt() {
  for (u8 cy = 0; cy < Y_CELLS; cy++) {
    if (cy_dirty[cy]) {
      calculate_cy_bitmap(cy);
      tft_send_bitmap(cy_bitmap, 0, cy * 8, X_RESOLUTION, 8);
      cy_dirty[cy] = 0;
      return;
    }
  }
}
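The interrupt handler calls a windowed variant of tft_send_bitmap() that takes a target rectangle; this presumably supersedes the full-screen version shown earlier. A plausible sketch, assuming the standard ST7789 CASET (0x2a), RASET (0x2b), and RAMWR (0x2c) commands:

/* Send a w x h bitmap to the rectangle at (x, y). Command parameters
   are single bytes even on the 16-bit bus; pixel data is one 16-bit
   word per write cycle. */
void tft_send_bitmap(const u16* bitmap, u16 x, u16 y, u16 w, u16 h) {
  tft_cmd(0x2a);                                        /* CASET: column range */
  tft_data(x >> 8);           tft_data(x & 0xff);
  tft_data((x + w - 1) >> 8); tft_data((x + w - 1) & 0xff);
  tft_cmd(0x2b);                                        /* RASET: row range */
  tft_data(y >> 8);           tft_data(y & 0xff);
  tft_data((y + h - 1) >> 8); tft_data((y + h - 1) & 0xff);
  tft_cmd(0x2c);                                        /* RAMWR: pixel stream */
  for (u32 i = 0; i < (u32)w * h; i++) tft_data(*(bitmap++));
}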
Only a single stripe is redrawn per interrupt; with thirty 320×8 stripes serviced at 1 kHz, this maintains a ~30 fps maximum screen refresh rate while limiting any jitter introduced by the IRQ to less than 200 µs. We get seamless, “magical” screen updates for the price of a 5 kB stripe buffer — compared to the 150 kB that would have been needed for the entire screen.
The use of such a small transfer unit, coupled with the relative simplicity of game logic, also explains why it’s not worth the hassle to offload this work to DMA.
Layer 1: background bitmaps
The first compositing layer in Over the Edge consists of full-screen background bitmaps. At 320×240×16bpp, a single uncompressed bitmap takes up about 150 kB of flash memory, so I employed a variant of the QOI algorithm, as designed by Dominic Szablewski, to losslessly reduce the size of the images by 60-70%.
The QOI algorithm is remarkably simple, encoding pixels as one of five possible output values:
A single-byte lookup into a hashed array of 64 recent colors (00nnnnnn),
A single byte with -2 … +1 deltas from last pixel’s R, G, and B (01rrggbb),
A two-byte “luminance” delta: -32 … +31 for the green component, and a G-relative -8 … +7 deltas for R and B (10gggggg rrrrbbbb),
A byte for up to 62 repetitions of the previous pixel (11nnnnnn, n = 0 … 61),
A four-byte raw value, RGB prefixed by a marker (11111110 <r> <g> <b>).
I reimplemented the algorithm for RGB565 images, improved compression by getting rid of alpha channels and the associated encodings, and removed checks for malformed data. The resulting decoding routine fits on a napkin:
void nqoi_decode(struct bgr565* out, const u8 *in, u32 len) {

  struct bgr565 in_px = { v: 0 }, px_index[64];
  memset(px_index, 0, sizeof(px_index));

  while (len--) {

    u8 in_b1 = *(in++), run_len, in_b2;
    s8 delta_g;

    if (in_b1 == NQOI_RGB) {
      (out++)->v = in_px.v = ((struct bgr565*)in)->v;
      px_index[NQOI_HASH(in_px)].v = in_px.v;
      in += 2; len -= 2;
      continue;
    }

    switch (in_b1 & 0b11000000) {

      case NQOI_INDEX:
        (out++)->v = in_px.v = px_index[in_b1].v;
        break;

      case NQOI_DELTA:
        in_px.r += ((in_b1 >> 4) & 0b11) - 2;
        in_px.g += ((in_b1 >> 2) & 0b11) - 2;
        in_px.b += (in_b1 & 0b11) - 2;
        (out++)->v = px_index[NQOI_HASH(in_px)].v = in_px.v;
        break;

      case NQOI_LUMA_D:
        in_b2 = *(in++); len--;
        delta_g = (in_b1 & 0b111111) - 32;
        in_px.r += delta_g + (in_b2 >> 4) - 8;
        in_px.g += delta_g;
        in_px.b += delta_g + (in_b2 & 0b1111) - 8;
        (out++)->v = px_index[NQOI_HASH(in_px)].v = in_px.v;
        break;

      case NQOI_RUN:
        run_len = 1 + (in_b1 & 0b111111);
        while (run_len--) (out++)->v = in_px.v;

    }
  }
}
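The routine leans on a few definitions that aren't shown; the reconstruction below is my best guess, modeled on the original QOI spec. The hash multipliers and the bitfield layout in particular are assumptions:

typedef uint8_t  u8;  typedef int8_t  s8;
typedef uint16_t u16; typedef int16_t s16;
typedef uint32_t u32;

/* RGB565 pixel, accessible as a whole or per channel. Per-channel
   deltas wrap around naturally thanks to bitfield truncation. */
struct bgr565 {
  union {
    u16 v;
    struct { u16 b : 5, g : 6, r : 5; };
  };
};

/* Two-bit opcode prefixes, mirroring QOI. */
#define NQOI_INDEX  0b00000000   /* 00nnnnnn: recent-color lookup      */
#define NQOI_DELTA  0b01000000   /* 01rrggbb: -2 ... +1 channel deltas */
#define NQOI_LUMA_D 0b10000000   /* 10gggggg rrrrbbbb: luma delta      */
#define NQOI_RUN    0b11000000   /* 11nnnnnn: run of previous pixel    */
#define NQOI_RGB    0xfe         /* Literal RGB565 value (2 bytes)     */

/* Recent-color hash; the multipliers are borrowed from QOI. */
#define NQOI_HASH(px) (((px).r * 3 + (px).g * 5 + (px).b * 7) & 63)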
Instead of being processed as a single blob, the bitmaps are split into 8-pixel-tall (320×8) slices, each slice processed separately. This facilitates fast selective redraw while having a pretty inconsequential effect on compression; for full-screen images, the state of the encoder is reset once every 2,560 pixels.
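In terms of bookkeeping, this presumably boils down to a per-slice offset table generated alongside the compressed blob. A sketch, with names matching the compositing routine shown at the end; the exact types are my guess:

/* Compressed background for the current level; NULL means "plain black". */
static const u8* nqoi_background;

/* Byte offset of every 320 x 8 slice within the blob, Y_CELLS + 1
   entries; slice cy spans nqoi_bg_offset[cy] ... nqoi_bg_offset[cy + 1]. */
static const u32* nqoi_bg_offset;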
Layer 2A: text
The next compositing layer is a 40×30 matrix of character cells, essentially equivalent to text (or “terminal”) modes on personal computers. Each character cell consists of one ASCII value plus an RGB565 color attribute; this overlay takes up about 3.5 kB and is remarkably easy to work with. For example, the following function prints text, supporting newlines and line wrap:
void put_text(const u8* str, u8 cx, u8 cy, u16 color) {
  while (*str) {
    if (*str == '\n' || cx == X_CELLS) {
      cx = 0;
      if (++cy == Y_CELLS) cy = 0;
    }
    if (*str != '\n') {
      screen[cy][cx].spr_id = (*str >= 33 && *str <= 122) ? *str : 0;
      screen[cy][cx].color = color;
      cx++;
    }
    gcc_barrier();
    cy_dirty[cy] = 1;
    str++;
  }
}
Because the MCU doesn’t have any sort of a character ROM, I reused my own 8×8 font originally developed for Bob the Cat and vaguely inspired by the ZX Spectrum typeface:
The actual IRQ-driven rendering routine for text cells is similarly uncomplicated. Once the background bitmap is in place, the code reads the character’s 1-bpp bitmap and then sets the non-zero pixels to the specified color, leaving the zeros transparent:
static inline void render_single_char(u16* out_bmp, u8 chr, u16 color) {
  const u8* fb = font_off33[chr - 33];
  for (u8 y = 0; y < 8; y++) {
    for (u8 x = 0; x < 8; x++) {
      if (*fb & (128 >> x)) *out_bmp = color;
      out_bmp++;
    }
    fb++;
    out_bmp += X_RESOLUTION - 8;
  }
}
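The routine assumes a font table along these lines; the layout (one byte per row, most significant bit leftmost) follows from the 128 >> x test above, but the table itself is my illustration:

/* 8 x 8 glyphs for the printable range '!' (33) through 'z' (122);
   90 entries, 8 bytes each, MSB = leftmost pixel. */
static const u8 font_off33[122 - 33 + 1][8] = {
  { 0x18, 0x18, 0x18, 0x18, 0x18, 0x00, 0x18, 0x00 },   /* '!' */
  /* ... remaining glyphs ... */
};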
Layer 2B: fixed sprites
In many video games, certain UI or playfield elements are laid out on a regular grid and don’t need to move freely across the screen:
To accommodate this use case essentially for free, I repurposed the high-bit (128-255) values on the character grid, interpreting them as indexes into a sprite_bmp[] array. The array contains full-color 8×8 pixel sprites in the chip’s non-volatile memory; the rendering code is quite similar to that used for text, with zero-value pixels encoding transparency:
static inline void memcpy16_alpha(u16* dst, const u16* src, u8 len2) {
  while (len2--) {
    if (*src) *dst = *src;
    src++; dst++;
  }
}

static inline void render_sprite(u16* out_bmp, u8 bmp_id) {
  const u16* spr_bmp = sprite_bmp[bmp_id];
  for (u8 y = 0; y < 8; y++) {
    memcpy16_alpha(out_bmp, spr_bmp, 8);
    spr_bmp += 8;
    out_bmp += X_RESOLUTION;
  }
}
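Placing a fixed sprite then amounts to poking the character grid, much like put_char() did for text. A hypothetical helper:

/* Put fixed sprite #spr into cell (cx, cy); spr indexes sprite_bmp[]. */
void put_sprite(u8 spr, u8 cx, u8 cy) {
  screen[cy][cx].spr_id = 128 + spr;
  gcc_barrier();
  cy_dirty[cy] = 1;
}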
These “fixed” sprites can still be animated, deleted, or replaced with ease; the main constraint is that they can’t be easily moved in increments smaller than eight pixels. They also can’t occupy the same cell as text.
In one of my earlier games — Sir Box-a-Lot — I worked around the first limitation by allowing the sprite’s rendering location to be offset by +/- 7 pixels, essentially making it easy to fake smooth transitions from cell to cell. That said, the approach complicated the selective redraw code while being fairly cumbersome for the caller, too. In Over the Edge, I decided to do better than that.
Layer 3: floating sprites
The final compositing layer in the game consists of “floaties”, or floating sprites; these can have arbitrary dimensions and can be placed anywhere on the screen.
The main trade-off with floating sprites is that because they can be anywhere on the screen and might only partly overlap with screen refresh stripes, they are far more computationally involved to handle. This is the floatie compositing code, sans comments — see if you can make sense of this:
static inline void render_floatie(u16* out_bmp, u8 slot, u8 cy) {

  const u16* spr_bmp = floatie[slot].bmp;
  s16 f_st_x = floatie[slot].x, f_xlen = floatie[slot].xlen;

  if (f_st_x < 0) {
    spr_bmp -= f_st_x;
    f_xlen  += f_st_x;
    f_st_x   = 0;
  } else if (f_st_x + f_xlen > X_RESOLUTION)
    f_xlen = X_RESOLUTION - f_st_x;

  if (f_xlen < 1) return;
  out_bmp += f_st_x;

  u16 stripe_y = cy * 8;
  s16 f_st_y = floatie[slot].y, f_ylen = floatie[slot].ylen, stripe_len = 8;

  if (f_st_y > stripe_y) {
    out_bmp    += X_RESOLUTION * (f_st_y - stripe_y);
    stripe_len -= (f_st_y - stripe_y);
  } else if (f_st_y < stripe_y) {
    spr_bmp += floatie[slot].xlen * (stripe_y - f_st_y);
    f_ylen  -= (stripe_y - f_st_y);
  }

  stripe_len = MIN(f_ylen, stripe_len);
  if (stripe_len < 1) return;

  while (stripe_len--) {
    memcpy16_alpha(out_bmp, spr_bmp, f_xlen);
    spr_bmp += floatie[slot].xlen;
    out_bmp += X_RESOLUTION;
  }
}
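For reference, the per-floatie bookkeeping used above and in the compositor below might look as follows; the field types and the MAX_FLOATIES value are my assumptions. The st_cy / en_cy fields cache the range of 8-pixel stripes the sprite overlaps, letting the compositor skip non-overlapping floaties cheaply:

#define MAX_FLOATIES 32

struct floatie_s {
  const u16* bmp;      /* RGB565 bitmap, 0 = transparent; NULL = slot unused */
  s16 x, y;            /* Top-left corner; may lie partly off-screen. */
  s16 xlen, ylen;      /* Sprite dimensions in pixels. */
  s8  st_cy, en_cy;    /* First and last 320 x 8 stripe overlapped. */
};

static struct floatie_s floatie[MAX_FLOATIES];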
It follows that floaties are the tool of last resort; in the game, I’m using them to animate creatures and to place several larger bitmaps on the screen, but the total is expected to stay under 30 or so — compared to up to 300 fixed sprites.
Putting it all together
With all these rendering functions implemented, the final compositing routine — called from the IRQ handler — is quite straightforward:
static u16 cy_bitmap[X_RESOLUTION * 8];

static void calculate_cy_bitmap(u8 cy) {

  /* Render background. */
  if (nqoi_background) {
    nqoi_decode((struct bgr565*)cy_bitmap,
                nqoi_background + nqoi_bg_offset[cy],
                nqoi_bg_offset[cy + 1] - nqoi_bg_offset[cy]);
  } else memset(cy_bitmap, 0, sizeof(cy_bitmap));

  /* Overlay static sprites. */
  for (u8 cx = 0; cx < X_CELLS; cx++) {
    u8 cv = screen[cy][cx].spr_id;
    if (!cv) continue;
    if (cv < 128) render_single_char(cy_bitmap + 8 * cx, cv, screen[cy][cx].color);
    else render_sprite(cy_bitmap + 8 * cx, cv - 128);
  }

  /* Incorporate floaties. */
  for (u8 fl = 0; fl < MAX_FLOATIES; fl++)
    if (floatie[fl].bmp && floatie[fl].st_cy <= (s8)cy && floatie[fl].en_cy >= (s8)cy)
      render_floatie(cy_bitmap, fl, cy);
}
The end result is a responsive and flexible graphics pipeline — done on a piece of hardware that knows nothing about graphics at all.
You can check out the WIP version of Over the Edge by clicking here.