By Chris Lomont, Nov 2013

(See part II at http://hypnocube.com/2014/07/design-and-implementation-of-serial-led-gadgets-part-ii/)

Gene Foulk and I obtained a batch of WS2812 LED strips a few months ago, and wanted to play with them, but their unique timing protocol made it somewhat of a hassle. We wanted to drive large numbers of LEDs (on the order of 10,000), and many of the open source libraries would make the driving hardware messy and/or expensive, so we decided to look at what we could create that would allow easy control of large numbers of LEDs at fast frame rates. We decided on making a driver module that was the blend of powerful and cheap that suited our purposes. The result is our upcoming HypnoLSD module.

This article is a summary of some things we did and learned about WS2812 LED strips that should make your designs easier to complete.

Background

The popular WS2812 LED strips take the WS2811 LED driver and place them into a 5050 sized (5×5 mm) RGB LED module to create a tiny, serial driven 24-bit LED module. They are cheap (around $0.15 or less per module), readily available [Alibaba], and most of all, make neat gadgets.

Other devices

There are quite a few similar chips and systems similar to the WS2811/WS2812/WS2812B. Some are interesting because they are predecessors from the same company and imply design information useful for understanding features needed for large scale construction.

Here is a summary of some other devices. We have not worked with them, so this is sourced from various datasheets and comments found on the internet.

Titanmech makes the TM1804 [TM1804] and TM1809 [TM1809], driving 3 and 9 channels respectively. They operate at 6-24V, have a 1800 nanosecond (ns throughout) bit-length, and seem to use a protocol similar to the WS2811, but may be SPI. There are also TM1803 and TM1812 modules which we did not look up yet.

LPD8803, LPD8806, LD8809, and LPD8812 are LED drivers that run 3, 6, 9, and 12 LED channels respectively, so can drive 1, 2, 3, or 4 RGB LEDs per module. They also claim a built in 1.2MHz oscillation circuit, use a 2-wire input control up to 20MHz, do 24-bit color, and support voltage up to 12V. They claim support for over 2000 length cascades.

Worldsemi [Semi] makes several LED driving chips and modules, including the WS2812 which we are investigating.

The WS2801 uses a SPI input up to 25 MHz, does three channel 8-bit output (i.e., 24-bit color), and interestingly states it also has an internal 1.2 MHz oscillator, giving a refresh rate approximating 2.5 kHz. It runs on a 3.3-5.5V power supply, provides 200mA per (three?) LEDs, and has signal shaping. Note that 1.2MHz/(256*2) is almost 2.5kHz, so it’s reasonable to think this 1.2MHz counter drives a counter that turns on/off the LED for PWM. We think this is how the WS2811/WS2812 operates.

The WS2803 also has an SPI input up to 25MHZ input, does 18 channels (6 RGB LEDs worth), 8 bit output per channel, refresh rate approximately 2.5 kHz (implying another 1.2 MHz internal oscillator?), operates on 3.3-5.5V supply voltage, has PWM output (and also says 18-bit constant current output?), has signal shaping, drives 30mA per channel, and has built-in over-temperature protection (the LED shuts off).

For completeness, and to transition to our needs, the WS2811 claims a non-return-to-zero (NRZ) input (but it seems to be a return to zero in my opinion) at 400kbps or 800 kbps, 24-bit output (8 bits on each of on three channels), has an “internal precision oscillator” and refresh rate approximately of 2.5 kHz (implying the same internal 1.2 MHz oscillator?) , a 5-12V supply voltage, PWM output, signal shaping, drives 30mA per channel, and has an on chip regulator.

Adafruit [Adafruit] has open source libraries to drive various chips from various microcontrollers. Look around at their site to find them.

WS2811, WS2812, WS281B

The Worldsemi WS2811 chip is a LED driver, upon which the WS2812 added a LED module. The WS2812 is a 6 pin chip running on 6-7V. The WS2812B [WS2812B] is a modification of the WS2812, giving a newer 4 pin version (Vss, Vdd, Data In, Data Out) that runs on 3.5-5.3V. The marketing claims he WS2812B “inherited all the good qualities of the WS2812,” improved the IC mechanical arrangement outside to the structure inside, and further enhanced the stability and efficiency. It claims better, brighter color, and more uniform color. One good improvement is making reverse power connection not damaging the device. [WS2812+WS2812B]

These format devices, using a 2-wire protocol, have no timing line, thus end up being cheaper at the downside of using a weird timing protocol. Some microcontrollers use SPI at 4MHz with tricks to write to these devices, by banding bits together to make long and short high pulses.

All support a wide range of operating temperatures, with junction temperature ranges listed as -25 to 80 degrees Celsius.

Most of what follows is both for the WS2811, WS2812, and WS2812B, except where marked.

Timing

The WS2811/WS2182/WS2812B modules are driven by a single data line, with the chips connected in series; the Data Out pin of one is connected to the Data In of the next. Color information is sent to the first in the chain, one bit at a time. Each module “absorbs” 24 bits of color information to drive three output LEDs, usually red, green, and blue (but could support whatever someone makes), then succeeding bits are passed on down the chain for subsequent modules to use. In this manner 24 bits can be sent to each LED module in the chain. A final low signal, held long enough, signals all modules to latch the color data to LEDs, using pulse width modulation (PWM) to hold that color until a new set of colors is loaded and latched.

Bits are modulated in the following manner: the input line starts low (0 volts, say), and toggles high (5V, say), signaling the start of a bit. These voltages depend on the reference voltage powering the module, and we have found them somewhat tricky if you do not follow the specs carefully. This line is held high either a short time T0H (marking a 0 bit to be transferred) or a longer time T1H (marking a 1 bit to be transferred), then dropped low again. This process repeats, sending all bits. Each bit has the same length overall. The time between low-to-high transitions determines the bit rate.

This timing is captured in the following figure:

The supported data rates are 800 kilobits per second (kbps) or 400 kbps, meaning each bit length is 1250 ns or 2500 ns, respectively. For the rest of this document we’re only going to discuss the 800kbps rate because it allows the most LEDs in a single strand to be driven at a high frame rate, and is what we investigated and implemented.

The timing rates for the internal high and low timing lengths, and the longer reset timing signaling a latch, are listed at various places with different values, sometimes self-contradictory. Our testing shows quite a range is possible, which we detail below. Most places list the overall bit length as 1250ns (which is 1/800K) and the timing error to be plus or minus 600ns (giving rates from 540kbps to 1.5Mbps), and each of the four timing components to be plus or minus 150ns. Since each bit only uses two of the four timing components, this doesn’t add up. I suspect someone saw the four timing numbers, each allowing error of 150ns, multiplied by 4, and thought the overall timing can be off by 600ns. Using a more likely overall error of plus or minus 300ns, the bit rates will be in 645kbps to 1Mbps. We have found longer times work fine, but no LEDs worked at a 650ns rate. Next, timings in some datasheets don’t add up to the overall listed bit length. As a result, we tested quite a lot of timing scenarios, and find the timing can be pretty sloppy and still work (until you start dealing with lots of LEDs. More on that later.) For most of our development we chose to break the overall interval into thirds, partly because it made the timing easier to put cleanly into code. For the table, we put all times in nanoseconds to show that the reset is on the order of 40 bits worth of time. In practice this can be much shorter.

Source	TH+TL	T0H	T0L	T1H	T1L	Treset
WS2811 data sheet [WS2811]	1250ns	250ns	1000ns	600ns	650ns	50,000ns
WS2812 data sheet [WS2812]	1250ns	350ns	800ns	700ns	600ns	50,000ns
WS2812B data sheet [WS2812B]	1250ns	400ns	850ns	800ns	45ns	50,000ns
www.aliexpress.com	1250ns	350ns	800ns	700ns	600ns	50,000ns
www.doityourselfchristmas.com	1250ns	250ns	1000ns	1000ns	250ns	50,000ns
www.sparkfun.com [SparkFun]	1250ns	350	800ns	700ns	600ns	50,000ns

As the signal travels from one module to the next it is modified by the signal shaping. The WS2812 datasheet states “Built-in signal reshaping circuit, after wave reshaping to the next driver, ensure wave-form distortion not accumulate.” This, coupled with variability in module performance, has implications for large designs. It turns out that the signal shaping causes bits to be dropped for large designs. We will spend a lot of time below analyzing signal shaping.

Colors

Although the datasheet of the WS2811 shows data being sent in red, green, then blue order (RGB order), all WS2812 strips we’ve seen have the LEDs attached such that data must be sent in green, then red, then blue (GRB) order. Each color byte is sent most significant bit first, i.e., left to right:

Even though the WS2812B marketing material claims them to be brighter than the WS2812, the higher operating voltage of the WS2812 allows brighter LEDs, which is supported by the datasheets themselves. Here is color frequency, brightness, current, and voltage for each LED color.

	Red nm	Green nm	Blue nm	Red mcd	Green mcd	Blue mcd	current	Rv	Gv	Bv
WS2812	620-630	515-530	465-475	550-700	1100-1400	200-400	20mv	1.8-2.2	3.0-3.2	3.2-3.4
WS2812B	620-625	522-525	465-467	390-420	660-720	180-200		2.0-2.2	3.0-3.4	3.0-3.4

To gamma correct from the input number 0-255 to the output perceived brightness, one person [Gamma] empirically found this relation: outputValue = 255.0*(inputValue/255.0)^(1.0/0.45).

Electronics

For simple use, buy a strip of around 100 LEDs, hook up a good 5V power supply, attach a ground and data line to your chip, and bang away. There are many libraries for popular chips so you won’t have to do much work. [PJRC] has some detailed information on a lot of WS2811 related topics.

Module design

We wanted to drive as many WS2812 LEDs as possible for a reasonable cost. Initial estimates led us to believe we could run 10,000 LEDs on a low cost PIC32 microcontroller.

Output

We wanted to drive a lot of them at decent frame rates (at least 30 frames per second), so we focused on the 800kbps module version. At this rate, each bit uses 1/800000 = 1250 ns to transmit. A low signal of 50us, which is about 40 bits of timing, is used to signal the end of the image, at which point all the LEDs latch the image to their outputs. Each LED uses 24 bits to set its own color, then passes successive bit signals down the line. Thus a single strand running at 30 frames per second has maximum length of N=1109 LEDs. (Solve 30*(1250*24*N + 50000)=10^9).

So to run more LEDs, we need more strands; we need more output pins to output in parallel. To get 10000 LEDs running at 30+ fps, we need 10 pins.

Input

To drive 10000 LEDs at 30+ fps, we need an input bit rate of at least 10000*3*30*8=7.2 megabits (Mbps)=240,000 bytes/second. USB, serial, SPI, and I2C are all common protocols able to supply this rate.

Usability

We wanted the device to be easy to use, and usable from low power devices and low speed microcontrollers, so we decided that a simple byte input format with a simple protocol would be best. If we had decided that the input should be carefully packed in weird formats on the user side, we could have slightly reduced the final requirements, but we opted against this on usability requirements. We also wanted some control protocols and other niceties to support robust testing, performance, and customizability for end users, all of which made the protocol slightly more complex. So we wanted a simple image input format of RGB tuples that could be generated on the fly by users. That meant our device would take this input, lay it out in RAM as needed, and play it back on the output pins.

CPU

We need to input and output 240000 bytes per second, so we must move 480,000 bytes per second. To get decent fps output, we need to stream the data across many pins, and the fastest way to do this is to split the bits needing output into streams. We wanted to output multiple pins worth of data with one write to a hardware port, and this data needed bits from multiple color bytes, so some bit fiddling needed done, packing (as it turned out) 16 bits from different input bytes per write.

Requirements

So we have the minimum requirements

30,000 bytes of image RAM.

Serial, USB, or SPI inputs, at a rate at least 7.2 Mbps.
Multiple pins (10+) for output.
Enough CPU power to process 240000 bytes per second of throughput, including processing of these bytes.

Selection

After evaluating several microcontrollers, we decided to use a PIC32 150F128B, which is a 32 bit processor, MIPS 32 core, up to 50MHz clock. Since USB and serial both had better supporting external hardware interfaces at a 48 MHz internal clock, we settled on that clock speed. We needed an external crystal for stability.

From past work on USB, we knew the USB stack would take too much CPU time, so we decided against on chip USB. Since UART is more common than SPI or I2C, very easy to work with, and there are nice USB to UART bridge chips, we decided on a serial solution. This way small microcontrollers could use serial to communicate with the device, and a PC could use USB to UART bridges. FTDI has a powerful 12 Mbps USB to serial chip (the FT232H) as well as lower cost 3Mbaud versions (such as the FT232R) [FTDI]. We have tested both extensively. Having never worked with serial at this speed (12Mbaud before), we found it amazing robust, not dropping even a single bit over hundreds of gigabytes tested, even using haphazard wiring.

Using serial, at 12Mbaud serial, with 8N1 signaling, it takes 10 baud to transmit a byte, so we can move 1.2M bytes per second, which is 9.6Mbps, enough to meet our 7.2Mbps requirement.

Hardware

The PIC has 28 pins, and laying out how they are used to maximize features is always one of the fun parts of design like this.

From the design, we needed at least 10 general purpose input/output (GPIO) pins for output. On the PIC we chose one of the 16-bit ports (pins RB0-RB15) for output, so we could output on all 16 pins in parallel with a single CPU instruction.

To keep the timing stable, for both the LED output and the serial input, we needed an external crystal. This took two pins, which overlap GPIO pins RA2 and RA3.

For data input, we chose the UART1 port, which used GPIO pins RA0 and RA4.

For programming during development, there are three sets of programmer pins, all of which overlap with our output pins RB0-RB15. To make things as easy to use as possible, we chose the only pair not on the lower byte (RB0-RB7) of output, that is, the PGED/PGEC pair on RB10 and RB11. The programmer also requires the MCLR pin and Vdd and Vss.

After adding the final power and ground signals, the chip has precisely one unused pin, RA1, which we put to dual purpose for a flash settings reset on boot and for a LED for messaging during deployment.

This pin selection allowed us to use SPI on pins RB11, RB13, and RB15 for a second SD card driven version that has 8 output pins instead of 16.

Software

Getting the software to handle the design requirements on such a low end microcontroller was the hardest task.

We wanted to allow users to have designs of 1-16 strands of LEDs, and for each we wanted to use RAM as efficiently as possible.

For example, using 16 strands, we can output one 16-bit word on the port at a time, and the 30,000 byte RAM buffer is treated as a 625 word tall buffer of 16-bit words. This allows 10,000 LEDs.

We also wanted to be able to do input and output continuously into the same buffer to maximize throughput. Alternating them would slow down the process, since writing to LEDs is slow yet must happen at very precise intervals.

To illustrate, consider a 3 strand design. Then we output 3 bits at a time, and we treat RAM as a 625 tall buffer again, where bits 0-2 are the 3 strands, then we return to buffer top and do bits 3-5, then bits 6-8, etc., until the last pass uses bits 12-14. This allows 5 passes through the 625 tall buffer, giving 5*625=3125 long logical strands. The last bit in each word, bit 15, is unused, resulting in some unused space, but packing more strands into that last bit would make the code even messier than it turned out to be. The resulting usable space is as follows:

strands	max height	pixels	used bytes	unused bytes	% used
1	10000	10000	30000	0	100.00
2	5000	10000	30000	0	100.00
3	3125	9375	28125	1875	93.75
4	2500	10000	30000	0	100.00
5	1875	9375	28125	1875	93.75
6	1250	7500	22500	7500	75.00
7	1250	8750	26250	3750	87.50
8	1250	10000	30000	0	100.00
9	625	5625	16875	13125	56.25
10	625	6250	18750	11250	62.50
11	625	6875	20625	9375	68.75
12	625	7500	22500	7500	75.00
13	625	8125	24375	5625	81.25
14	625	8750	26250	3750	87.50
15	625	9375	28125	1875	93.75
16	625	10000	30000	0	100.00

The MIPS core has 32 registers, once of which always contains zero, so we have at most 31 registers to work with. It has an internal counter that ticks once per two CPU clocks, which can be used to check code timings. Instruction timing follows the following rules, with a few exceptions and oddities:

One instruction per clock, exceptions below.
Reading from flash causes a one clock delay, during which no instruction processes.
The result of reading into a register is not available the next instruction.
The instruction following a branch is always executed (called a branch slot).
Some “pseudo” instructions actually expand into multiple instructions, and can be disabled with a GNU assembler directive.

An interesting side-note: on first pass, I was unaware of the Flash load delay rule, so the overall timing ended up being about 15% over spec, and all worked well. Only by trying to find some other bug did I find this out, by adding instruction timing counters to the code to track every part. This helped narrow down the wrong timing assumptions and end up with rock solid timing in all code paths.

We chose a 48 MHz clock, instead of the maximal 50 MHZ, since the UART top speed is ¼ of the clock speed, and a 12 MHz rate was much easier to find interface solutions than 12.5 MHz if we used the top speed of 50MHz. Plus the desired output rate aligned better to instruction boundaries (60 instructions per bit instead of 62.5 instructions).

Thus for each output bit period (1250ns) we had 60 clock cycles. We also needed to input data at a max rate of 12 Mbaud, which is 1,200,000 bytes per second (using 8N1 signaling), requiring an average of 1.5 bytes of input processed per 60 instruction block.

Luckily the PIC has an 8-byte deep FIFO queue on input (and output), allowing us some slack, but requiring that we check in often and on average fast enough to process all bytes.

Note that each output bit to a strand takes a few decision points: at the start, a pin has to go from low to high, then at the right time, a 0 or 1 is output on this pin to denote a 0 or 1, and at a later time the pin must be set to 0 always. Note the color bit can be output directly at the second point, avoiding some decision to be made with a branch to different code points, but at the cost of requiring writing the last 0 signal in all cases (which is redundant for the 0 bit). However this results in the least amount of instructions, but requires three writes per bit.

Thus, during each 120 instructions, we had to process 2 output writes (each is three possible high/low transitions) and input 3 bytes of serial input. This entailed:

Each of the 2 outputs required packing 16 bits into the output word, thus would (naively) require 16 reads, 16 masking operations, 16 shifting operations, and 3 writes to construct the outputs. This is already 2*(16+16+16+3) = 102 instructions, most of the 120 instruction budget.
Each of the bytes read would require a check to see if a byte was available (read, mask, compare, branch), if so read it (read), check some control byte issues (a few compares and branches), log any transmission errors (a mask and write) , and write it to memory (a write), so would use around 4+1+2*2+2+1=12 per byte, or 24 instructions.
Pointers would need to be updated on where to read to and where to write to, with logic to handle the logical sizes of the output compared to how it is packed into RAM, easily another 5-10 instructions.
Latching would need done on signal bottom or some command or some counter, another few instructions to decide.
State needed kept to know which color byte was coming next, a little state for the protocol, etc., which is a few more instructions.

So we have used approximately 102+24+5+4+4=139 instructions out of our budget of 120. Only 15% over budget, blah!

There is no way to write this in C; it had to be done in assembler.

The end protocol supports a synchronization byte, latching control, logs timing and flow data, had programmable delays, returns status bytes to the controller, supports strand widths of 1-16 and the maximal lengths in the above table, and even had a little space left over. The tightest parts ended up with 17 unused instruction slots out of 480, of which there were about 9 such blocks.

The way this was done involved a TON of tricks, the hardest one was writing a custom assembler that took templates of code, evaluated the timing and parametric parts, and then did a fitting algorithm that tried to find a solution to the needs of the output timing while meeting all the requirements we wanted. For example, branch paths had to take the same time in either part, lots of state was stored in code location (which duplicated some code paths, freeing registers and removing reads and writes), some clever table driven tricks, a completely non-obvious internal representation (sort of a meet-in-the-middle form of the input doing some work on transforming, and the output doing some of the work…. No other combination I tested would fit), using a dynamic register allocation scheme, and using some of the weirder MIPS instructions to squeeze every cycle out of the drawing routines. Here is a picture of the assembler I wrote, which color coded things, allowed jumps and timing info to be checked, and logged other interesting things I needed:

Code looked like this, with embedded timing info, symbols explaining branch sizes, symbols to be expanded and replaced locally and globally, and atomic blocks (which cannot be split when interleaving the pieces).

“C// do [COL]. “,

” lw r2,UARTSTA(r1) // byte ready? r1=UART base “,

“F lw r5,BITTABLE(r7) // get the bit expansion table. “,

” or r60,r2 // log any UART errors “,

” andi r2,r2,1 // ready mask “,

“{ beq r2,zero,[WAIT_X+] // exit to wait for byte. Note: next instruction always executed”,

” addi r3,zero,SYNCVAL // restore Sync value “,

“}[WAIT_R+] // Wait must return here”,

” lbu r2,UARTRX(r1) // get the data byte in r2, offset r1 to U1RXREG “,

” nop // todo – this can be anything “,

“{ bne r2,r3,[SYN_J+] // 1: if r2 != sync, continue with UART “,

” nop // 2: todo – can place optional stuff here? “,

” addiu r21,1 // 3: increment sync counter, always executed “,

” beq zero,zero,[SYN_X+] // 4: if sync byte (in r3), jump to sync operation “,

” sb r21,UARTTX(r1) // 5: output count, always executed in branch slot”,

“-3[SYN_R+] sll r2,2 // 3: multiply by 4 to get 4 32-bit table entries. Sync must return here “,

” add r2,r5 // 4: add table base (in r5) to reversed, packed entries “,

“} lw r3,[PASS_RAM](r4) // 5: get data to modify from RAM buffer “,

” addiu r5,255*4 // prepare to make bitmask mask”,

“F lw r5,(r5) // get inverse of bitmask “,

“F lw r2,(r2) // get table entry, with 8 bits of data, one per nibble. FLASH clock penalty “,

” nor r5,zero,r5 // invert to get bitmask “,

” and r3,r5 // zero eight bits from RAM “,

” or r3,r2 // or in eight bits from UART “,

” sw r3,[PASS_RAM](r4) // write all bits back to RAM buffer “,

The final output is a single routine consisting of about 8,000 lines of assembly code, a massive rat’s nest of gotos (no subroutines). It is so massive that trying to get the popular disassembler IdaPRO to make a flow graph it crashes the tool.

We ended up with a cheap device that can handle about 40 fps and 10,000 LEDs, as desired.

Enough about that.

Power

During testing, we noticed that if you had a strand of about 240 LEDs in a row, powered from one end, and put it all on bright white (color 255,255,255), then the far end would be much dimmer than the powered end due to power draw. So it might be important to add power taps often to the LED strands to meet your needs. Some strands may vary.

Too get some idea of power draw, we measured draw on strands of lengths 1, 25,40, 100, and 240, and tested many color combinations including all combinations of 0 and 255 for R,G,B, and also R=G=B for values 0 to 255 in steps of size 16 (using 255 instead of 256). The conclusion is a good guideline is each LED draws around 10-13mA on full brightness, so a RGB module can draw up to 3*13=39mA, and this draw scales linearly with color value until about color 16, when it drops slightly more quickly to 0.

Shorter strands drew slightly more current per LED than longer strands; probably we were not driving some LEDs as bright as others due to length related dropouts.

For example, a strand of length 240 with red and green on full and blue on half should stay under 240*(13+13+13*128/255) = 7806 mA = 7.8A.

We have fried more LED modules than we cared to by connecting them wrong (mistakenly), so be careful. Some brands seem surprisingly fragile. We have found, though, if you fry a strand, cutting off the first one or two LEDs allows the rest of the strand to work, so you can treat the modules as $0.15 fuses if needed.

Signal voltages also drop along the length of a strand, and sometimes modules will miss bits if the power supply is not sufficient. If you’re having trouble with long strands, be sure to measure various components and ensure they meet specs. Signal voltages depend on the reference voltage powering the module, and we have found them somewhat tricky if you do not follow the specs carefully.

There still seems to be some weird power issues when building large strands, some of which we’ll detail below. If we get more solid knowledge and experience that clarifies the issues we’ll update this document.

Testing

When we started testing, we ran into some image issues at long (> 5,000) length strands. There was flickering, lost bits (resulting in shifting colors), and other behavior that made us wonder if the power was flaky or if the LED internal timings were causing problems.

Here is about 11,000 LEDs for testing our module design. Gene is in the process of making a cool cylindrical LED brain melter out of these at this point, for me to test the module on.

Our original timings were over by 15% due to the code penalty for flash reads, and after fixing this, using an internal counter to validate, we finally checked with an external logic analyzer to ensure our timings were perfect.

Some flickering persisted, and spending time determining if it was flaky power or our timing errors led us to some interesting analysis.

After testing the power we were led to believe the signals were corrupted, so we decided to look at the signals using a Seleae [Seleae] logic probe, which is an inexpensive 8-channel digital signal capture tool, capable of 24MHz sampling, which gives a timing resolution of 41.67 ns. We hooked the signals to the output from our device, which was at this point flawless, and then to various points on the LED strands. We found that the signal was indeed getting corrupted, but the behavior was sufficiently random and hard to reproduce that we deduced it was related to the signal shaping in the modules combined with clock fluctuations.

The behavior exhibited random noise, which made us wonder if there were an internal clock in each module, and if clock skew were adding up somehow and messing the signal up. We had read there was some form of signal shaping done by the LEDs from one to the next, and we wondered how this worked, and if it was somehow responsible, so we decided to try and measure it.

Since the modules are cheap, and behaved as if they were self-clocked (which turned out to be true, and is likely based on preceding models in the family stating their internal clock speeds), it seemed they’d use the cheapest clocks they could, which is likely some RC circuit like the old-school 555 timer [555]. These clocks have skew of a 1-10% based on manufacturing, component variance, temperature, and non-linarites based on power draw. The behavior was definitely worse when trying to update a bright strand than a dim strand, which could have been a power issue or a heat issue.

We used our module, which has programmable timings, to test the boundaries of timings, looking for where things fail. We captured signals between LEDs to investigate how the signal shaping was modifying or interfering with the signal.

We first captured data spread out on a few long strands, but trying to understand long strand behavior with only 8 taps was messy, so we went to an 8 LED strand, tapping the output of our module and the output of each of the first 7 LEDs, and the 8^th LED showed the final color seen. We also got to use some neat standalone LEDs based on the WS2812, pictured below. This made adding taps easier.

Next we wrote a tool to analyze the data flows and related timings. All captures were done at 24MHz, the limit of the Seleae tool, (we would love to have a much higher rate!). For most of the tests we sent 16 LEDs of color down the line to see how longer signals were handled, even though we could only capture 8 LEDs of timing.

The next picture shows how each LED consumes 24 bits, then passes the rest through, leading to the stair step signal. The top row is our signal; the next 7 channels are outputs of successive LED modules. This is 16 LEDs worth of color 254,254,254 sent using 1250ns length bits and equally spaced timing cutoffs.

Here is a close-up. The top signal is the one we generate, then the signal cascades down. The next picture has green connectors tracking how a bit flows down the LEDs. Here is a bit highlighted, with each succeeding bit delayed by the LED modules as part of their signal shaping, each delay is around 200 ns (to be determined better soon). We noticed that the signal shaping was changing the lengths of the highs and lows a little, so we wanted to investigate that.

For each test we recorded all transitions, then counted the high lengths, low lengths, variations between the top bit (our generator) and the successive bits (LED module shaped), all sorts of timing information like cascade delays, missing bits, how lengths changed as they moved from one LED to the next, etc. As we gathered more insight we added more analysis capability until we answered (almost) all of our questions.

WS2812 Model

Before we detail things we found, it might be best to give the final understanding we have, so you can follow along. We’ll then cover the measurements that led us to this model of the WS2812 LED modules.

This figure shows how we think of the timings the LED modules follow.

Figure 1- Timing Labels

Symbol	Description	Length
b	Bit length	Varies based on input, always > 800ns or so.
h	High part for a 0 bit	382 ns
m	Medium part, for a 1 bit	382 ns
w	Low part, needed to set next high part	As low as we could measure, perhaps < 1ns. Follows from b=h+m+w
d	Delay between input and pass-through	206 ns
s	Time when sample is taken	d+h=588ns
L	Minimum latch time	9370 ns

Figure 1 shows what we think is an accurate model of how the LEDs behave. The model is based on the measurements and experiments we do below.

The LED modules act as follows:

Each module has a free running oscillator, not determined by the input data rates.
Each module samples incoming bits, starting a timer when it sees a low to high transition.
If the module has consumed 24 bits, it passes signals through after signal shaping, so…
1. After some delay time d, the low to high edge is passed on to the next module.
After time s, the line is sampled again.
1. If the module has not consumed 24 bits, the sample bit is used for color, and one less bit is needed.
2. If the module has consumed 24 bits, it sets the output to the sample value.
After some time high edges are watched for again. NOTE THERE IS A DEFINITE MINIMUM TIME BEFORE A RISING EDGE IS DETECTED!
If a long enough low is detected, the colors valued are latched to the output LEDs, and each module then looks for another 24 bits of data, and the process repeats.

Measurements

We labeled the data runs in the form G-R-B-H-M-L, where G, R, and B are the byte colors sent in that order (shows up as GRB). The H, M, and L were the lengths of the high part, the optional middle part, and the low part making the total bit length. The H, M, L values are PIC32 instruction counts, so each value is 125/6=21.8333… nanoseconds. A total length of 60 is thus 1250ns, the recommended bit length. 20-20-20 timing is equally spaced, and seems to work fine for strands up to a few thousand in length. We considered converting numbers to nanoseconds, but this implies more timing precision than we could accurately measure or create. All measured counts are at 24MHz, the sampling rate of the logic analyzer, so each measured tick is 125/3=41.6666…. Sorry for the multiple value sizes, but this keeps all inputs and outputs as measured, and we write the conclusion in nanoseconds. Hopefully each usage will be clear from context.

We experimented with all sorts of bit patterns and timing combinations, recording tens of thousands of bit cascades. Our custom analyzer then extracted the data so we could analyze it visually and numerically and test hypotheses against the information. This led to more tests to refine our ideas. Here we record the conclusions and some relevant data.

For many tests, we used 255-0-170 as a bit pattern with a good mix of 0’s, 1’s, runs, and transitions. To test high bits we sent 255-255-255 colors, and 0-0-0 for low bit studies. We often used 254-254-254 to see if bits were lost while allowing eyeballing of byte ends.

All errors we were able to uncover were of one type, which was a module missing the low to high transition of a parent, and then waiting until the next low to high happened, effectively losing a bit in the sequence. This can happen when the input bits are too short, but also happens as variation in clock skew along the strand sometimes bunches bits too close together.

For example, when we shortened the overall bit length to 255-0-170-20-20-14 (which corresponds to a total bit length of 1125ns, but with 433ns length high and middle parts), the test dropped some bits, but only after the first 8 LEDs obtained the correct colors. Here is a screen capture showing a bit getting squeezed out in the lower 2 strands.

At 20-20-10 timing, errors are rampant. You can see many dropped bits. Here are some stats: top bit lengths 25 (except two of them are 26), Top bit high are good, top bit low are short (5,6 and 15,16).

We investigated a large range of timing and bit combinations, always fine tuning the tests around edge cases where things went wrong. From all the timings, we saw that several components of the LED timing were relatively constant: the length of high parts for the 0 and 1 bits, and the inter-LED delay. Across 13605 short high bit samples, we saw an average of 9.18 (382.6 ns) with min of 8 and max of 11, so we conclude that h should be 382.6 ns. Across 16632 long high bit samples, we saw an average of 18.34 (764.3 ns) with a min of 17 and a max of 21. Thus we conclude that m should be 764.3-382.6=381.7 ns. Note this is very close to the short bit. Coincidence? Probably not. For the inter-LED delay, across 36999 samples we got an average of 4.95 (206.2 ns), with a min of 2 (only 3 samples, this needs more checking) and a high of 7 (only 7 samples). The majority (36771/36999=99.38%) were in lengths 4-6. Thus d should be around 206.2.

These timings are a bit longer than some reported at [Beaklow], who measured a single LED. Perhaps we’re using different manufacturers, or even seeing variations in batches.

We note that the datasheet has a stated transmission delay time max of 300 ns; we have seen no events even near this long, and we obtained 206 ns as a very good value for the propagation delay. At 206 ns, a 10,000 long strand should propagate a signal in around 2 milliseconds, which is around 1/485 of a second.

A quick note on statistics: the underlying internal clocks can be modeled as a Gaussian (normal) distribution around the central clock rate, and likely temperature and manufacturing and ageing cause some variance in the clock rate. However we can only measure on discrete boundaries, which means we’re measuring the normal plus a uniform (spread over 1 tick length). However, the mean of the Gaussian is still the mean of the measured lengths (check it), so talking about averages (technically, means) is sufficient. A better model would take into account variations induced by temperature and power draw, which is a function of history and needs more data. We think there is some power or temperature component to the behavior, but we have not isolated it.

The overall bit lengths are dependent on the top signal – when we lengthened the input bit length to around 1800 ns, everything transmitted fine, and each child bit just lengthened the low part of the signal. The above timings were all still stable. When we shortened the bits as low as 800 ns, the modules lost every other bit, but again the numbers above stayed stable for the bits obtained. So this provides evidence against the modules obtaining clocking from the input signal, and implies an internal clock, as expected. We were able to drop the bit length to around 1150 and keep the other lengths and still work for our 8 LED test bench, but we expect longer strands will fail.

From the way bits are dropped, it appears that once a bit is started, the module ignores transitions on input until the module has finished with its bit processing, so there is a minimum bit time. Again, across all the many tests we ran, we observed that the longest bit length that was dropped was 19 (7 occurrences, 199 occurrences for length 18, many for even shorter), and the shortest bit that was passed correctly was also 19 (1 occurrence, 12 of length 20, and 301 of length 21, out of 47403). This implies that the shortest bit possible for one module to transmit to the next module is around length 19 (791.7 ns) or 20 (833.3 ns).

The shortest low part seen was 1 tick, but we recorded several events where the high part seemed to be multiple ones concatenated together, meaning we did not capture the drop to low and subsequent rise to high. This means that the high to low and low to high can be below 1 tick (41.66 ns) and still transmit, but it’s asking for trouble.

Low length for the child bits were all over the map, so it is unlikely that these are clocked by the modules. They seem to depend only on the input timings.

To determine when the sampling to differentiate a 0 from a 1 bit takes place (or if there is multisampling and majority vote, or some other scheme), we had our software analyze all transitions to determine how it looked, and we found a few events such as this:

Here the top bit drops high to low at tick 9227, and the child bit drops at 9227 also. So the module had to be sent a low at this tick, and in the same tick, it dropped its output. The bit started on tick 9218, so after 9 ticks it looked at its input and set the output to that value. (It could not have sampled sooner or it would have seen a high. It certainly could not sample later than its output transition).

Thus sampling seems to occur once at the end of the high length h, and the output is done immediately. At a design level this saves having to store the sample and output later, so makes sense for simplicity.

Finally, we timed the minimum latch length as follows. We put a dynamic pattern on the LEDs, and shortened the latch time until the image flickered, meaning it was not latching correctly. We found that we obtained stable images using a latch length of 9375 ns, and an unstable image at 9333 ns. Again, we sampled with the logic analyzer, so we’re accurate to within 41.666… ns. Datasheets recommend 50,000 ns, so we suspect you can go much lower and still perform well.

So we have evidence for all the numbers and behavior in out model.

There was one last really odd behavior we could not figure out. With perfect 20-20-20 timing, well within spec, and the 8 LEDs above, we noticed that sending the same color to each channel (such as 255-255-255) and then waiting would cause all the LEDs to go black after a while (1-5 seconds). This only happened if all three colors were the same, testing 255-255-254 would not do it. It happened at 1-1-1 color, 5-5-5 color, 254-254-254 color, and any other combinations we tested had the same behavior. If the colors were different, then there was no cutoff (we tested to 60 seconds in each case). 1-1-0 and other combinations did not do it. The cutoff times were all in the 1-7 second range, with one outlier taking 11 seconds. This was all at 5 V power. At 7V power it did not happen. When we added a 0.1uF capacitor to the far end of the chain, it went away. During all this we collected signal traces, and saw no anomalies or signals between the LEDs after they were latched.

So – is there some weird design bug where the same value on all channels causes power instability? We don’t yet know.

Speculation

The modules appear to have an internal clock, and it seems likely that the simplest circuit showing the behavior we observed is the circuit used. As such, it seems a low to high transition starts a counter, and at certain counts different actions are taken. A 200ns delay requires around a 5MHz clock, quite a bit faster than the supposed 1.2MHz. To make the whole thing run on the slowest clock possible would require the other events (d=206, h=382, m=382, s=588) to be integral multiples of the smallest, and they are close.

For example, if we took d=191, h=382, m=382, s=573, and a clock rate of 5.235MHz, then on clock ticks 1,2,3,4 the various events would happen, all on successive ticks. Nice, huh? But it requires a fast clock, and is the non-illuminating multiple 4.36 of the 1.2MHz suspected clock…. Of course all the timing may just be CMOS delays, with no high speed clock needed. It would be interesting to know.

For a final try to gather some clock information, we could record various color outputs at 30fps (or higher if possible), mess with the color values, and try to determine the LED pulse width modulation frequencies. We could play different values on the three channels against each other, obtaining “beats” like one does when tuning a piano. The PWN likely works as follows (since it is simple in hardware). The color c for a channel is added periodically to an 8-bit counter, and on overflow the LED is turned on, else it is turned off. Adding color 0 will never overflow, so never turn on. Color 1 will overflow 1 out of 256 ticks, so be on 1/256 of the time. Color 128 will overflow every other tick, so be on half the time. Color 255 will overflow 255 times out of 256, so will be on all but 1 tick out of 256. Ideally we’d like the 255 on all the time, but the minimal circuit for that (that still handles the other cases nicely) is much more complicated. It would be fun to see if this hypothesis is true by determining flicker at color 255 (with a FAST camera), and thereby measuring the speed. I played a little with a 240 fps Canon S100, but have not reached any conclusions, other than I do see the PWM at work on lower colors.

Conclusion

We obtained pretty solid evidence on how the signal shaping of the WS2811/WS2812/WS2812B modules work, and will use this knowledge to fix the flicker on long strands (over 5,000, up to 10,000 modules). There is still some work that can be done to figure out the internal clock rates as well as try to measure clock variability, skew with current/temperature/history.

We’re getting ready to test some of these ideas on getting stable 10K long strand images (just to see if it works!), and will post results.

If you find errors or have more information to add, let us know, and we’ll add it to this document.

Thanks for reading this long write-up.