If anyone wants to flex their asm-hacking skills, there's some draft
code here.  If not, don't panic; this is primarily notes to myself that
someone else might be interested in.

To finish long long support in printf, I need to print 64-bit numbers
(preferably, arbitrary-precision numbers) in octal, and it seems to be
taking as long to write as the decimal code.

Hex is easy to do a byte at a time, but octal is messy because 3 doesn't
divide 8.  It's not that it's difficult, per se, or that I need to
stress about efficiency (to a first approximation, nobody uses octal,
so it matters even less than usual).  The problem is that messy
translates to large code (boo!), and it's hard to see how large an
alternative will be without writing it in detail.

The old code just masked and printed the low-order bits, did a
full-width shift by 1 bit, repeated it 3x, and continued as long as the
result wasn't zero.  It did that with a 12-instruction helper routine
to shift the buffer right by cnt bits:

.L_lsr_4:
        ldi     cnt, 4
.L_lsr:
        lsr     v_fifth
        ror     v_hhi
        ror     v_hlo
        ror     v_hi
        ror     v_lo
        dec     cnt
        brne    .L_lsr
        ; tst
        sbiw    v_hlo, 0        ; Only Z flag is needed
        cpc     v_lo, rzero
        cpc     v_hi, rzero
        ret

This was shared between the hex/octal print code and the decimal print
code, so the rest of the hex/octal print code was only 20 instructions
(+4 to branch to it depending on "base" on function entry).

The equivalent for a variable-length memory buffer is rather more
awkward: 14 instructions which *aren't* shared with the decimal print
code:

.L_lsr_4:
        ldi     cnt, 4
.L_lsr:
        movw    ZL, r24
        sub     merge, merge    ; Set to zero and clear carry
        mov     tlen, len
1:      ld      r0, -Z
        ror     r0
        st      Z, r0
        or      merge, r0
        dec     tlen
        brne    1b
        dec     cnt
        brne    .L_lsr
        tst     merge
        ret

Worse, the inner loop is 9 cycles to shift 1 byte by 1 bit, and the
outer loop is another 5 per bit.  Shifting an 8-byte value is
9 * 8 + 5 = 77 cycles per bit.  If I have a 64-bit input to print, then
that's 77 * 64 = 4928 cycles just spent shifting!  Given that I can
print a *decimal* 64-bit number in 4103 cycles, that seems Just Wrong.
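To double-check the cycle arithmetic, here is the variable-length
buffer shift modeled in C (a sketch; the name lsr_n and its interface
are mine, not avr-libc's):

```c
#include <assert.h>
#include <stdint.h>

/* C model of the memory-buffer shift helper above: shift the len-byte
 * little-endian buffer at p right by cnt bits, one full pass over the
 * buffer per bit, OR-ing the result bytes into "merge" so the caller
 * can test the whole buffer for zero. */
static uint8_t lsr_n(uint8_t *p, uint8_t len, uint8_t cnt)
{
    uint8_t merge = 0;
    while (cnt--) {
        uint8_t carry = 0;          /* "sub merge, merge" clears carry */
        merge = 0;
        for (uint8_t i = len; i-- > 0; ) {  /* msbyte first, like ld r0,-Z */
            uint8_t next = p[i] & 1;        /* bit shifted out */
            p[i] = (uint8_t)((p[i] >> 1) | (carry << 7));  /* ror */
            carry = next;
            merge |= p[i];
        }
    }
    return merge;   /* zero iff the buffer is now all-zero */
}
```

The inner loop runs len times for every single bit of shift, which is
exactly why the cycle count above blows up for 64-bit inputs.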
And even assuming I could keep the rest to 20 instructions, that's a
total of 34.  (It's not obvious I can, because while the shift leaves
the lsbyte in r0 ready for printing, the first digit doesn't have a
shift, so it needs to be handled specially.)

After a lot of messing about, I came up with the following O(n) scheme.
It buffers two registers, keeping 8..15 bits of the input buffered at
any given time.  Shifting is a bit at a time, with the buffer refilled
every 8 bits, and a digit printed every 3.  (Or 4.  The code is not
only shared with hex, but can handle bases 2, 4 or 32 as well.)

A trick I use repeatedly is bit shifting for loop counters.  Rather
than initialize a counter to "3" and decrement it each iteration, I
initialize a counter to "7" (a value I already have available for
digit-extraction purposes) and shift it right each iteration.

For the bit buffer, I have 8..15 bits of buffered data, and rather than
have a separate counter, I mark the end of the valid data with a 1 bit.
So the data starts out like "1abcdefg hijklmno" and shifts down until
it's "00000001 abcdefgh".  At that point, we start the next shift, but
the msbyte becomes zero (with the 1 bit stored in the carry).  That's
the signal to fetch another byte.  This reduces the loop overhead in
both time and code size.

Here's the current draft.  Suggestions solicited!

Registers:
  len:   Length of input (in bytes).  Shared code with the decimal
         print has stripped off leading zero bytes.
  X:     Points to output buffer
  Z:     Points to input buffer
  flags: Input flags distinguishing decimal, octal, and upper/lower
         case hex
  msb, lsb: Current bit buffer.  Most significant set bit of msb counts
         bits per byte.
  mask:  A bitmask the size of the current base's digits (7 or 15)
  tmask: A loop counter initialized to "mask" which counts bits per
         digit
  digit: A temporary used to form the ASCII digit
  delta: The amount to add to digits past '9' to make them start at
         'a' or 'A'.
.L_octhex:
        ldi     mask, 7
        lsr     flags           ; Bit 0 to carry, bit 1 to register
        breq    1f
        ldi     mask, 15
        ldi     delta, 'A'-'0'-10
        brcc    1f
        ldi     delta, 'a'-'0'-10
1:      ldi     msb, 1
        ldi     tmask, 1
        ld      lsb, Z+
        rjmp    2f

.L_bitloop:                     ; Main loop
        ror     lsb
2:      lsr     tmask           ; Entry point, tmask >>= 1
        brne    4f              ; if (tmask == 0) {
        ; Spit out a digit of (lsb & mask)
        mov     digit, lsb
        and     digit, mask
        cpi     digit, 10
        brcs    3f
        add     digit, delta
3:      subi    digit, -'0'     ; I.e. add '0'
        st      X+, digit
        mov     tmask, mask
4:                              ; }
        lsr     msb
        brne    .L_bitloop      ; End of main loop

        ; Load another byte
        ; Note: we know that carry is SET
        dec     len
        breq    .L_lastbyte
        ld      msb, Z+
        ror     msb             ; Shift carry=1 into msbit
        rjmp    .L_bitloop

.L_lastbyte:
        adc     msb, tmask      ; Pad until end of digit (and clear carry)
        inc     len
        cpse    lsb, __zero_reg__
        rjmp    4b
        movw    r24, r26        ; Copy X to return value
        ret

This is 35 instructions in its current state.  I can probably shave one
off the end by sharing the epilogue with the decimal print code, and
I'm thinking of making the "mask" value be what's passed in from the
caller to select the base, which will simplify the prologue.

The main loop has two conditional branches, either of which could be
the loop end.  I could save a cycle (but cost an instruction) in the
main loop by placing both conditions out of line.

The other tricky thing is the "lastbyte" section, invoked when we need
another byte but len == 0.  When we get to the end of the input, we
need to pad with zeros until the significant digits have all been
printed.  This involves setting msb to some suitable all-zero value,
and keeping going until lsb == 0 as well.  We stop when len == 0 and
lsb == 0.

len is decremented *after* using the byte, so when len is decremented
to 0, the last byte has already been used and we need to load an msbyte
of padding.  There are three ways to check for this, and I *think* I've
picked the one with minimum code size (and fastest among those), but
let me analyze them in detail.
In the first two cases, we can pad with a full byte of 8 zero bits;
then the "lastbyte" code is simply "ror msb" (which sets msb to 0x80
and clears the carry in one instruction, since we know we started with
msb = 0 and carry = 1) and "rjmp .L_bitloop".  Oh!  Those instructions
already exist at the end of the "load another byte" branch, so maybe
the savings is more than I thought.

The three ways of checking, and their additional costs, are:

1) Check each iteration of the main bit loop.  This is slowest, so only
   use it if it has no rivals for smallest.  This can be placed after
   the "ror lsb", adding three more instructions: branch if not equal,
   test len == 0, and branch if equal to epilogue code.

2) Check with each digit printed.  This is identical code to the above,
   but inside the "if (!tmask)" condition to reduce overhead.  The
   problem is that we don't get the test of lsb for free any more, so I
   think it's four instructions.  No!  We can do it in three:

        cp      lsb, __zero_reg__
        cpc     len, __zero_reg__
        breq    .L_epilogue

   CPC ANDs its result with the previous zero flag.  *If* the zero flag
   was set (the only case we care about), then the carry flag is clear,
   and the cpc will compute the test we want.  If the carry is set, the
   cpc will do the wrong thing, but that doesn't matter because zero is
   already clear.

3) The way I did it: checking with each byte fetched.  We have to test
   len == 0 anyway, so we just have to check lsb == 0.  But we need
   four additional instructions:
   - We have to use "adc msb, tmask" to set msb correctly, so we'll
     fall back into the byte-load check after only one more digit;
   - We have to "inc len" so it will decrement again properly;
   - And we need two instructions to test lsb == 0.

I thought option 3 was smallest because I thought the "ror msb" the
other cases required was the same size as "adc msb, tmask".  I didn't
notice the 2-instruction code sharing.  So now it seems that option 2
is the smallest.
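As a sanity check on option 2's flag reasoning, here is a small C model
of the cp/cpc sequence (my own sketch of the AVR flag semantics, not
avr-libc code):

```c
#include <assert.h>
#include <stdint.h>

/* Model of "cp lsb,__zero_reg__ ; cpc len,__zero_reg__ ; breq":
 * CP against zero sets Z iff its operand is zero and can never borrow,
 * so carry is clear going into CPC.  CPC subtracts the carry as well
 * and ANDs its zero result with the previous Z flag, so the final Z
 * flag (and hence breq) tests (lsb == 0 && len == 0). */
static int breq_taken(uint8_t lsb, uint8_t len)
{
    int z = (lsb == 0);             /* cp  lsb, 0: Z = (lsb == 0)     */
    int c = 0;                      /* lsb - 0 can never borrow       */
    uint8_t r = (uint8_t)(len - c); /* cpc len, 0: subtract carry too */
    z = (r == 0) && z;              /* CPC: Z = (result == 0) && old Z */
    return z;
}
```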
Here it is in 34 instructions:

.L_octhex:
        ldi     mask, 7
        lsr     flags           ; Bit 0 to carry, bit 1 to register
        breq    1f
        ldi     mask, 15
        ldi     delta, 'A'-'0'-10
        brcc    1f
        ldi     delta, 'a'-'0'-10
1:      ldi     msb, 1
        ldi     tmask, 1
        ld      lsb, Z+
        rjmp    2f

.L_bitloop:                     ; Main loop
        ror     lsb
2:      lsr     tmask           ; Entry point, tmask >>= 1
        brne    4f              ; if (tmask == 0) {
        ; Check if we're done
        cp      lsb, __zero_reg__
        cpc     len, __zero_reg__
        breq    .L_epilogue
        ; Spit out a digit of (lsb & mask)
        mov     digit, lsb
        and     digit, mask
        cpi     digit, 10
        brcs    3f
        add     digit, delta
3:      subi    digit, -'0'     ; I.e. add '0'
        st      X+, digit
        mov     tmask, mask
4:                              ; }
        lsr     msb
        brne    .L_bitloop      ; End of main loop

        ; Load another byte
        ; Note: we know that msb == 0 and carry is SET
        dec     len
        breq    5f
        ld      msb, Z+
5:      ror     msb             ; Shift carry=1 into msbit
        rjmp    .L_bitloop

.L_epilogue:
        movw    r24, r26        ; Copy X to return value
        ret

This is reasonably small (although there is definitely a size hit
relative to the previous code, sigh) and reasonably efficient.  As I
said, the 2-instruction epilogue can be shared with the decimal code,
and the prologue might be shortened by 3 instructions by choosing the
"flags" values so they can be used directly as "mask".  That would cut
it to 29.  (Plus two in the common code to branch to this depending on
the flags.)

I have to code it up and test it, but I wanted to write down my
thoughts about it first, and that seemed post-worthy.

_______________________________________________
AVR-libc-dev mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/avr-libc-dev
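The sentinel-bit buffer at the heart of the scheme can be isolated as a
tiny C bit-stream reader (a sketch; next_bit and its interface are
mine, not the actual library code):

```c
#include <assert.h>
#include <stdint.h>

/* The delimiter-bit trick in miniature: *msb holds data bits below a
 * single marker 1 bit (initialize *msb = 1 for "empty").  Each call
 * shifts out one bit, lsb first.  When the shift leaves *msb == 0, the
 * bit shifted out was the marker, so we refill from the byte stream,
 * "ror"-ing the marker back in at the top.  Returns -1 at end of input. */
static int next_bit(const uint8_t **pp, const uint8_t *end, uint8_t *msb)
{
    uint8_t carry = *msb & 1;   /* lsr msb: bit 0 to carry */
    *msb >>= 1;
    if (*msb == 0) {            /* marker fell out (carry == 1) */
        if (*pp == end)
            return -1;          /* no data left to deliver */
        uint8_t byte = *(*pp)++;
        *msb = (uint8_t)((byte >> 1) | 0x80);   /* ror: marker to msbit */
        carry = byte & 1;       /* this byte's bit 0 is the bit we emit */
    }
    return carry;
}
```

No bit counter appears anywhere: the single 1 bit is both the data
delimiter and the refill trigger, which is where the code-size saving
comes from.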
(Am I annoying everyone by using this mailing list as my personal
coding blog?)

After considerable rearranging (and fixing one nasty logic bug in the
first algorithm posted), I have octal conversion working to my
satisfaction.

The logic bug was that I assumed I'd need at most one byte of
zero-padding to print a number.  But I was checking for termination
before printing a digit.  That ended up not working with 1-byte octal
numbers where the top digit is non-zero.  By the time I was ready to
print the fourth digit (when the termination check would fire), the
lsbyte wanted to hold bits 9..16, and that meant loading a *second*
byte (bits 16..23).

So I changed to checking for termination *after* printing a digit,
which I knew would save time, but I unexpectedly found additional
space savings, too.

Not counting preamble code shared with decimal printing (all the stuff
before the label "3:"), it's down to 29 instructions.  Still a bit more
than 20, but I'm satisfied.

It's even slightly faster than the previous code:

Bits    Old     New
  0      56      42
  8     144     113
 16     276     232
 24     364     314
 32     496     430
 40             546
 48             628
 56             744
 64             860

/* Arguments */
#define out     X       /* Arrives in r24:r25, but we move it immediately */
#define out_lo  r26
#define out_hi  r27
#define bin     Z       /* Arrives in r22:r23, but we move it immediately */
#define bin_lo  r30
#define bin_hi  r31
#define len     r20
#define flags   r18     /* Mask, after removing two lsbits */

/* Local variables */
#define msb     r25     /* Overlaps input */
#define lsb     r24     /* Overlaps input */
#define digit   r23     /* Overlaps input */
#define delta   r22     /* Overlaps input */
#define tmask   r21
// len = r20
#define k       r19
// flags = r18

        .text
        .global binprint
        .type   binprint, @function
binprint:
        movw    out_lo, r24
        movw    bin_lo, r22
#if 1
        add     bin_lo, len
        adc     bin_hi, __zero_reg__
#else
        mov     tmask, len
        ; Conditional negate using the standard identity -x = ~x + 1.
        ; Given mask of -1 or 0, (x ^ mask) - mask returns -x or x.
        ; However, we would need the carry bit clear to start this, and
        ; forming "mask" from the carry bit in one instruction preserves
        ; the carry bit.  So instead add zero with carry.
        lsr     flags           ; Lsbit is negate flag
        sbc     k, k            ; Set to 0 or -1, carry preserved
1:      ld      __tmp_reg__, bin
        eor     __tmp_reg__, k
        adc     __tmp_reg__, __zero_reg__
        st      bin+, __tmp_reg__
        dec     tmask
        brne    1b
#endif
        ; Strip trailing (most-significant) zeros from bin
2:      dec     len
        breq    3f              ; If we've reached the end, stop
        ld      __tmp_reg__, -bin
        or      __tmp_reg__, __tmp_reg__
        breq    2b              ; Continue as long as bytes are zero

3:      movw    bin_lo, r22     ; Reset bin to lsbyte
        ; Len is now pre-decremented

        ; Done with args in r22-r25; now allowed to use delta, digit, lsb, msb
        ldi     delta, 'A'-'0'-10
        lsr     flags
        brcc    4f
        ldi     delta, 'a'-'0'-10
4:      ldi     msb, 1
        ld      lsb, bin+

.L_digit_out:                   ; Spit out a digit
        mov     digit, lsb
        and     digit, flags
        cpi     digit, 10
        brcs    5f
        add     digit, delta    ; Hex digit > 9
5:      subi    digit, -'0'
        st      X+, digit
        ; Check for done: is len:lsb < 0:flags?
        cp      flags, lsb
        cpc     __zero_reg__, len
        brcc    .L_epilogue     ; if (!lsb && !len) return X
        mov     tmask, flags
.L_bitloop:
        lsr     msb
        brne    7f              ; if ((msb >>= 1) == 0) get another byte
        ; Fetch another byte
        or      len, len        ; Preserves carry
        breq    6f
        dec     len             ; Preserves carry
        ld      msb, Z+
6:      ror     msb             ; Shift carry=1 into msbit
7:      ror     lsb
        lsr     tmask
        brne    .L_bitloop      ; while ((tmask >>= 1) != 0)
        rjmp    .L_digit_out

.L_epilogue:
        movw    r24, out_lo     ; Copy X to return value
        ret
        .size   binprint, .-binprint
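The carry-driven conditional negate in the disabled #else branch is
worth a C model on its own (my sketch; cond_negate is a made-up name):

```c
#include <assert.h>
#include <stdint.h>

/* Conditional negate of a little-endian byte array, as the asm does
 * it: k = 0x00 or 0xFF from "sbc k,k", then each byte becomes (b ^ k)
 * plus the running carry.  The carry starts out equal to the negate
 * flag itself (the +1 of -x = ~x + 1) and propagates through the
 * array via adc. */
static void cond_negate(uint8_t *p, uint8_t len, int negate)
{
    uint8_t k = negate ? 0xFF : 0x00;   /* sbc k,k: 0 or -1 */
    unsigned carry = negate ? 1 : 0;    /* carry from "lsr flags" */
    for (uint8_t i = 0; i < len; i++) {
        unsigned t = (unsigned)(uint8_t)(p[i] ^ k) + carry; /* eor; adc */
        p[i] = (uint8_t)t;
        carry = t >> 8;
    }
}
```

When the negate flag is clear, k and the carry are both zero, so the
same pass over the array is a no-op, which is what makes the trick
conditional without a branch in the loop.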
Hi George,
> (Am I annoying everyone by using this mailing list as my personal
> coding blog?)

No.  But I speak for myself.

Some of us are interested and reading but don't have time to comment or
to work through the code.

Your efforts are appreciated.

> After considerable rearranging (and fixing one nasty logic bug in
> the first algorithm posted), I have octal conversion working to my
> satisfaction.
>
> The logic bug was that I assumed I'd need at most one byte of
> zero-padding to print a number.  But I was checking for termination
> before printing a digit.  That ended up not working with 1-byte octal
> numbers where the top digit is non-zero.  By the time I was ready to
> print the fourth digit (when the termination check would fire), the
> lsbyte wanted to hold bits 9..16, and that meant loading a *second*
> byte (bits 16..23).
>
> So I changed to checking for termination *after* printing a digit,
> which I knew would save time, but I unexpectedly found additional
> space savings, too.
>
> Not counting preamble code shared with decimal printing (all the
> stuff before the label "3:"), it's down to 29 instructions.  Still
> a bit more than 20, but I'm satisfied.
On 16/12/16 09:56, Dale Whitfield wrote:
> Hi George,
>
>> (Am I annoying everyone by using this mailing list as my personal
>> coding blog?)
>>
> No.  But I speak for myself.
>
> Some of us are interested and reading but don't have time to comment
> or to work through the code.
>
> Your efforts are appreciated.

My feelings are much the same.

I would not put too much effort into making this code fast (except for
the fun of it, of course).  Printing in octal is necessary to make
printf long long support complete, but it is unlikely to be useful in
practice - people very rarely use octal.  I can think of only 4
use-cases:

1. chmod parameters on *nix.
2. 0 is an octal constant in C, and is often quite handy.
3. "uint8_t month = 09;" is a fine way to get accidental compile-time
   errors.
4. "uint16_t year_of_nicaea = 0325;" is a fine way to get accidental
   run-time errors.

Still, as long as octal is in the C standards, it is great that you are
doing this work to finish the standard support in printf.
In reply to this post by George Spelvin
tl;dr: Beta code included below; it passes my initial testing. I named
it _itoa_engine, following "ftoa_engine", but suggestions are welcome.
It is unfortunately one instruction longer than ultoa_invert:

   text    data     bss     dec     hex filename
    188       0       0     188      bc ultoa_invert.o
    190       0       0     190      be itoa_engine.o
    208       0       0     208      d0 itoa_engine.o (+OPTIMIZE_SPEED)
    212       0       0     212      d4 itoa_engine.o (without HAVE_MUL)
    230       0       0     230      e6 itoa_engine.o (+OPTIMIZE_SPEED)

but I'm reasonably satisfied.

Well, it turned out that my great idea from last time had a nasty bug
in it.  There's no really good way to check for termination after
printing each digit.

Recall that my bit buffer only has room for 15 bits of the number, so I
arrange for it to always hold 8 <= n <= 15 bits.  When it's about to
fall from 8 to 7, I fetch another byte and decrement it from 16 to 15
instead.

The problem is that there are two ways to handle the length, and both
of them have problems.

One is to count so that len == 1 after fetching the last byte of input,
and len == 0 after fetching the first byte of padding.  In that case,
consider printing one byte of zero in hex.  I start with 8 bits in the
buffer (msbyte = 0x01), and len = 1.  I print a '0', and check if
len == 0.  It's not, so I keep going and print a second zero before
realizing that I should stop.

Okay, that doesn't work, so what if I arrange for len to count the
bytes remaining?  len == 0 after fetching the last byte, and it holds
there after fetching padding.  In that case, there are no extra digits,
but there are missing digits.  Consider printing 2^15 = 0x8000.  I
start out with len = 1 and the last byte in the bit buffer.  I print
the final 0, then have to fetch another byte to start shifting bits
into the lsbyte.  So the bit buffer holds 0x800 (msbyte = 0x18,
lsbyte = 0x00, len = 0) when I print the second 0.  At this point,
len == 0 and lsbyte == 0, so I should stop printing.  But there are
more bits in the msbyte.

There are various tricks you can try in hex, but they break down in
octal.  Checking for extra bits in the msbyte makes for messy (= large)
code.
So I went back to my original idea of checking for termination when
fetching padding.  It's a couple of instructions longer, but it works.

Anyway, here's the current state of the function I intend to rewrite
vfprintf() around.  Any comments?

I know about a few mechanical things: I need to integrate it into
avr-libc and #include the appropriate headers rather than #define
things myself.  Likewise X_movw and so on.

/*
 * bum
 *
 * 1. vt. (obs.) To make highly efficient, either in time or space,
 *    often at the expense of clarity.
 *
 * The core integer-to-ASCII code.  This takes as input arbitrary-
 * precision byte arrays, and outputs ASCII in bases 10 or 2^k.
 * (With mostly separate code.)
 *
 * This code is small and fast, but difficult to understand.  It uses
 * tricky number theory (dividing by multiplying by the 2-adic inverse)
 * and low-level bit hacks to implement it.  Sorry about that.
 *
 * The arguments are:
 * char *out:
 *	Buffer in which the ASCII digits are placed, in little-endian
 *	order.  Must be reversed by the caller for printing.  The
 *	caller is also responsible for ensuring that it is big enough.
 * uint8_t *bin:
 *	Buffer holding the input value.  This is NOT const; it is used
 *	as working storage by the algorithm.  More specifically,
 *	negation is done in place if requested, after which decimal
 *	conversion leaves the buffer filled with zero, and binary-base
 *	conversion leaves the buffer unmodified.
 * uint8_t len:
 *	The length of the data at "bin".  Must be between 1 and 255;
 *	if you pass in 0, Bad Things will happen.
 * uint8_t flags:
 *	A variable whose bits determine the form of the output.
 *	flags bit 0: If set, the input is negated before printing.
 *	    This is a convenience to the caller for printing signed
 *	    numbers.
 *	flags bit 1: If set, digits greater than 9 (a, b, c, ...)
 *	    are printed in lower case.  If clear, they are printed in
 *	    upper case.  This has no effect if the base is 10 or less.
 *	flags bits 2..7: A contiguous bit mask (of the form
 *	    (1 << k) - 1) selecting the base to print.  If zero,
 *	    decimal is output.  If non-zero, a mask of the binary base
 *	    to print (1 for binary, 7 for octal, 15 for hex).  No
 *	    error-checking is done; if you pass in something else,
 *	    you'll get peculiar output.
 *
 * The return value is a pointer to just past the end of the output.
 *
 * An overview of the code is as follows:
 * - The conditional negate is fairly straightforward.  It's convenient
 *   to do here because it leaves the pointer at the msbyte for the
 *   next step.
 * - Then we strip off most significant zero bytes until we get to a
 *   non-zero byte or only one byte is left.  (The latter ensures we
 *   print "0" for zero, rather than the null string.)
 * - Then the code splits depending on whether the remaining flags are
 *   zero (decimal) or not.
 *
 * - The decimal code alternates two phases: finding the last digit
 *   (the input mod 5), and then dividing the input by 10.  Each
 *   involves a pass over the input, although conveniently in opposite
 *   directions.
 * - Actually, the first pass (msbyte-to-lsbyte) divides the input by
 *   2, and stores the lsbit.  It also computes the quotient mod 255,
 *   which is then reduced mod 5.  Combining (x % 2) and (x/2 % 5)
 *   gives the digit.
 * - The second pass (lsbyte-to-msbyte) then divides the input by 5.
 *   This is done by subtracting the remainder mod 5 to produce an
 *   exact multiple of 5, then multiplying the result by the 2-adic
 *   expansion of -1/5 (which has a particularly simple form
 *   0x...33333) and negating.
 * - If the second pass produces an msbyte of 0, the length is reduced.
 *   When the length reaches zero, conversion is done.
 *
 * - The binary code does one lsbit-to-msbit pass over the input.
 * - It mostly counts loop iterations with right shifts rather than
 *   counters.  When the shift produces a zero result (and a set carry,
 *   since it must have been non-zero before), the count has expired.
 * - The input value is kept in a 2-byte shift register which holds
 *   8..15 bits of the input, with a 1 bit at the high end delimiting
 *   the valid data.
 * - The main loop shifts the 16-bit buffer right one bit at a time
 *   until it's time to print a digit, or the buffer is about to
 *   underflow to 7 bits, meaning it's time to load another byte into
 *   the high half.
 * - The "digit" value is a mask of "bits already printed".  We shift
 *   it down until it's zero, then print the digit (lsb & mask), then
 *   reload it with the mask.
 * - When the right shift of the msbyte of the buffer produces zero
 *   (the delimiter bit was just shifted out), we load a new msbyte and
 *   retry the shift.
 * - When we run out of input bytes, there's some tricky code to add
 *   enough padding msbits for any remaining bits in the buffer to be
 *   printed, stopping when there are no more.
 *
 * The code often sets the carry bit several instructions before using
 * it.  When reading, it's important to know that ld, st, mov, inc,
 * dec, and logical operations do not modify the carry bit.
 */

#ifndef __tmp_reg__
# define __tmp_reg__ r0
#endif
#ifndef __zero_reg__
# define __zero_reg__ r1
#endif

//#undef __AVR_HAVE_MUL__
#define __AVR_HAVE_MUL__ 1
#define OPTIMIZE_SPEED 1

/* Arguments */
#define out     X       /* Arrives in r24:r25, but we move it immediately */
#define out_lo  r26
#define out_hi  r27
#define bin     Z       /* Arrives in r22:r23, but we move it immediately */
#define bin_lo  r30
#define bin_hi  r31
#define len     r20
#define flags   r18     /* Mask, after removing two lsbits */

/* Local variables.  The first four overlap with arguments.
 */
#define msb     r25     /* 16-bit accumulator */
#define lsb     r24
#define rem     r23     /* Remainder mod 255, or mod 5 */
#define digit   r22     /* Current digit */
#define tlen    r21     /* Loop counter, copy of len or flags */
#define k       r19     /* Constant value, usually the multiplier 0x33 */
#if __AVR_HAVE_MUL__
#define zero    r18     /* Used instead of r1 to free up multiplier */
#else
#define zero    __zero_reg__
#endif

        .text
        .global _itoa_engine
        .type   _itoa_engine, @function
_itoa_engine:
        movw    out_lo, r24
        movw    bin_lo, r22
        lsr     flags           ; Lsbit is negate flag
#if OPTIMIZE_SPEED
        ; 4 more instructions, but avoids a pass over the input (saving
        ; 9*len - 4 cycles) in the common case when no negation is
        ; desired.
        brcs    1f
        add     bin_lo, len
        adc     bin_hi, __zero_reg__
        rjmp    3f
#endif
        ;; Conditional negate, depending on carry.  The standard
        ;; identity -x = ~x + 1 can be extended to do a conditional
        ;; negate.  Given a mask of 0 or -1, (x^mask) - mask returns
        ;; x or -x.  However, we would need the carry bit clear to
        ;; start this, and forming "mask" from the carry bit in one
        ;; instruction preserves the carry bit.  So instead do the
        ;; conditional increment using add zero with carry.
        ;;
        ;; It turns out there's no faster way to do an unconditional
        ;; negate.  Multi-precision negate is awkward on AVR, which
        ;; lacks negate-with-carry or any form of dest = src-dest
        ;; instruction.  So it's two instructions minimum (mov+sbc or
        ;; invert+adc), and either way we need one instruction of
        ;; setup, to clear the carry or create a constant of 0xff.
        ;; The latter because the COM instruction overwrites carry,
        ;; and there's no EORI, so we need to load a constant 0xff
        ;; into a register.
1:      mov     tlen, len
        sbc     k, k            ; Set to 0 or -1, carry preserved
2:      ld      __tmp_reg__, bin
        eor     __tmp_reg__, k
        adc     __tmp_reg__, __zero_reg__
        st      bin+, __tmp_reg__
        dec     tlen
        brne    2b

        ; Strip redundant trailing (most-significant) zeros from bin.
3:      ld      __tmp_reg__, -bin
        or      __tmp_reg__, __tmp_reg__
        brne    4f
        dec     len
        brne    3b
        inc     len

        ; Split into two branches based on msbits of flags
4:      lsr     flags           ; Lower/upper case flag to carry
        brne    .L_binary

        ;; The longest form of the decimal code (OPTIMIZE_SPEED &&
        ;; !HAVE_MUL) is exactly 63 instructions (126 bytes) long,
        ;; which is the extreme limit of this conditional branch.  If
        ;; it gets any longer, you'll have to add an instruction.  I
        ;; suggest swapping the cases and dealing with the fact that
        ;; the .L_epilogue branch is now out of range by placing it at
        ;; the end of the binary code and having the decimal code end
        ;; in an rjmp to it.
.L_decimal:
        adiw    bin_lo, 1
#if __AVR_HAVE_MUL__
        clr     zero
        ldi     k, 0x33
#endif

        ; The main loop, repeated while len > 0
.L_decimal_loop:
#if OPTIMIZE_SPEED
        ldi     digit, '0'+10   ; Adjust for later offset on rem
#else
        ldi     digit, '0'      ; Sneakily preload the ASCII offset
#endif
        ser     rem             ; Sum of all bytes, mod 255
        mov     tlen, len

        ;; Pass 1, msb-to-lsb: finding the input mod 10.
        ;;
        ;; We do two things here: divide by 2 (saving the lsbit), and
        ;; sum the result mod 255.  This is then used to compute the
        ;; result mod 5, which combined with the lsbit gives the
        ;; decimal digit we want.  The awkward thing is that we want
        ;; *two* carry bits in this loop (one for the shift and the
        ;; other for summing the bytes), so we use the lsbit of digit
        ;; for one of them.
.L_mod10:
        ld      __tmp_reg__, -bin
        lsr     digit           ; Lsbit to carry bit
        ror     __tmp_reg__
        st      bin, __tmp_reg__
        rol     digit           ; Carry bit to lsbit
        add     rem, __tmp_reg__
        adc     rem, zero       ; End-around carry for mod 255
        dec     tlen
        brne    .L_mod10

#if OPTIMIZE_SPEED
        ; Reduce rem mod 15 (from 1 <= rem <= 255 to 1 <= rem <= 15)
        mov     __tmp_reg__, rem
        swap    __tmp_reg__
        andi    rem, 0xf0
        add     rem, __tmp_reg__ ; Add high halves to get carry bit
        andi    rem, 0xf0
        swap    rem
        adc     rem, zero       ; End-around carry
        ; Reduce rem mod 5 (average of 2.2 subtracts)
0:      subi    rem, 5
        brcc    0b
        ; We are left with rem-5, which can be compensated for
        ; elsewhere.
#else
        ;; Reduce rem mod 35, using a loop.
        ;;
        ;; This is a bit slower (22.82 cycles average to the end of the
        ;; mod-5 loop, vs. 12.6 cycles using the above), but saves 5
        ;; instructions.  Any multiple of 5 will work, but 35 is
        ;; optimal.
0:      subi    rem, 35
        brcc    0b
        ; Now add back mod 5 until we go positive again.
0:      subi    rem, -5
        brcs    0b
#endif

        ; Form and store ASCII digit (2*rem + digit)
        add     digit, rem
        add     digit, rem
        st      out+, digit

        ;; Pass 2, lsb-to-msb: dividing by 5.
        ;;
        ;; Rather than do a general divide by 5, we can subtract rem
        ;; to produce a multiple of 5, and then do an exact division by
        ;; multiplying by the 2-adic expansion of 1/5, 0xCCC...CCD.
        ;;
        ;; To get this into an even simpler form, we multiply by -1/5 =
        ;; 0x333...333 and negate.  Each byte is multiplied by 0x33 and
        ;; added to an accumulator to be used for each higher byte.
        ;;
        ;; The accumulator has to be 16 bits wide, but after storing
        ;; each output byte, we can fold the msbyte into the lsbyte.
        ;;
        ;; Negating the output can be "complement and add one", but
        ;; we do it as "subtract one and complement", initializing the
        ;; accumulator to 0xff, then complementing before storing.
        ;;
        ;; To subtract rem without a separate carry propagation pass,
        ;; subtract 0x33 times rem from the accumulator to start.
        ;; (Since 0 <= rem <= 4, this is very easy.)

        ; msb:lsb = 255 - (rem * 0x33).  Guaranteed to fit into 8 bits.
#if !OPTIMIZE_SPEED
        ldi     lsb, 255
#elif __AVR_HAVE_MUL__
        ldi     lsb, 0          ; Adjust for pre-decrement of rem by 5
#else
        ldi     lsb, 45         ; Adjust for pre-decrement of rem by 5
#endif
#if __AVR_HAVE_MUL__
        mul     rem, k
        sub     lsb, r0
#else
        ; Multiply by 0x11
        mov     __tmp_reg__, rem
        swap    __tmp_reg__     ; rem < 16, so this is rem << 4
        add     __tmp_reg__, rem
        ; Multiply by 3
        sub     lsb, __tmp_reg__
        sub     lsb, __tmp_reg__
        sub     lsb, __tmp_reg__
#endif
        clr     msb

        ; Here is the actual divide loop
        mov     tlen, len
.L_divide:
        ld      __tmp_reg__, bin
        ; acc += 0x33 * r0.  This product could be up to 14 bits long.
#if __AVR_HAVE_MUL__
        mul     __tmp_reg__, k
        add     lsb, r0
        adc     msb, r1
#else
        ; Let rem:digit = __tmp_reg__ << 4 = __tmp_reg__ * 0x10
        mov     digit, __tmp_reg__
        swap    digit
        mov     rem, digit
        andi    rem, 15         ; Mask off high 4 bits
        eor     digit, rem      ; Mask off low 4 bits
        ; Add __tmp_reg__ to get __tmp_reg__ * 0x11
        add     digit, __tmp_reg__
        adc     rem, zero
        ; Now add it to the accumulator 3 times (there is no faster way)
        add     lsb, digit
        adc     msb, rem
        add     lsb, digit
        adc     msb, rem
        add     lsb, digit
        adc     msb, rem
#endif
        ; Store the complemented accumulator (*bin++ = ~lsb)
        mov     __tmp_reg__, lsb
        com     __tmp_reg__
        st      bin+, __tmp_reg__
        ; Fold the accumulator: msb:lsb = msb + lsb
        add     lsb, msb
        clr     msb
        adc     msb, zero
        dec     tlen
        brne    .L_divide

        ;; End of main loop: check if the new msbyte was zero.
        ;; If so, drop it (decrement len), and stop when len == 0.
        cpse    r0, zero
        rjmp    .L_decimal_loop
        sbiw    bin_lo, 1
        dec     len
        brne    .L_decimal_loop
#if __AVR_HAVE_MUL__
        clr     __zero_reg__
#endif
        ; Finally copy the out pointer to r24 and return it.
.L_epilogue:
        movw    r24, out_lo
        ret

.L_binary:
        movw    bin_lo, r22     ; Reset bin to lsbyte
        ; Done with args in r22-r25; now allowed to use digit, rem, lsb, msb
        ; Load startup values.
        ldi     rem, 'A'-'0'-10 ; ASCII adjustment for hex digits past 9
        brcc    0f
        ldi     rem, 'a'-'0'-10
0:      ldi     msb, 1
        ld      lsb, bin+

.L_digit_out:                   ; Spit out a digit
        mov     digit, flags
        and     digit, lsb
        cpi     digit, 10
        brcs    0f
        add     digit, rem      ; Hex digit > 9
0:      subi    digit, -'0'
        st      X+, digit
        mov     digit, flags
.L_bitloop:
        lsr     msb
        brne    6f
        ; msb is empty; fetch another byte
        dec     len             ; Preserves carry
        breq    7f              ; No more input
        ld      msb, Z+
5:      ror     msb             ; Shift carry=1 into msbit
6:      ror     lsb
        lsr     digit
        brne    .L_bitloop      ; while ((digit >>= 1) != 0)
        rjmp    .L_digit_out

        ;; We've run out of bytes.  If all 1 bits in lsb have been
        ;; printed (i.e. lsb <= digit), stop.  If not, pad with enough
        ;; zeros to print one more digit, and so we'll check again
        ;; after printing it.
7: cp digit, lsb ; if (lsb <= digit) return X brcc .L_epilogue inc len adc msb, digit ; msb = digit + 1, and clear carry rjmp 5b .size _itoa_engine, .-_itoa_engine _______________________________________________ AVR-libc-dev mailing list [hidden email] https://lists.nongnu.org/mailman/listinfo/avr-libc-dev |
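The "multiply by -1/5 and complement" pass described in the comments above is easier to check in a high-level model. The following Python sketch mirrors the byte-serial flow of the assembly (0x33 per byte, a 16-bit accumulator, folding the high byte into the low byte after each store, complemented output); the function name and argument layout are mine, not part of the library:

```python
def div5_bytes(buf, rem):
    """Byte-serial (N - rem) // 5 for a little-endian buffer holding N,
    where rem = N % 5 was found by the msb-to-lsb pass.  Multiplies by
    -1/5 mod 2**8 (0x33 per byte) and complements, as in the assembly."""
    lsb = 255 - 0x33 * rem      # "subtract one and complement" init,
    msb = 0                     # pre-biased by -0x33*rem to cancel rem
    out = bytearray()
    for b in buf:
        acc = lsb + (msb << 8) + 0x33 * b   # up to a 14-bit product
        lsb, msb = acc & 0xFF, acc >> 8
        out.append(~lsb & 0xFF)             # store complemented low byte
        fold = lsb + msb                    # fold msbyte into lsbyte
        lsb, msb = fold & 0xFF, fold >> 8
    return bytes(out)
```

The surrounding code gets a full divide-by-10 out of this by handling the factor of 2 separately (the "2*rem + digit" store), which is why only an exact divide-by-5 is needed here.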
George Spelvin wrote:
> tl;dr: Beta code included below; it passes my initial testing. I named
> it _itoa_engine, following "ftoa_engine", but suggestions are welcome.
> It is unfortunately one instruction longer than ultoa_invert:
>
>    text    data     bss     dec     hex filename
>     188       0       0     188      bc ultoa_invert.o
>     190       0       0     190      be itoa_engine.o
>     208       0       0     208      d0 itoa_engine.o  (+OPTIMIZE_SPEED)
>     212       0       0     212      d4 itoa_engine.o  (without HAVE_MUL)

hmmm, that's quite some increase in code size...

IMO octal conversion is so remote from any practical purposes that we
should use an algorithm that works for all radices. This can be done in
less than 70 bytes (a variable length / radix algorithm that operates on
memory, no need for MUL; needs strrev).

So it all boils down to the question of what execution time is
acceptable or not:

Is 8000 ticks too slow?

Is 3000 ticks acceptable? And for what reason? Are 3000 acceptable just
because we have an algorithm that performs in 3000 ticks?

My strong preference is still to have a one-fits-all algorithm that
might very well be slower than an optimal one. But hey, an ordinary
division of a 64-bit value by 10 already costs 2300 cycles, so why
should we hunt cycles just for printf...?

Also I think it's a good idea to have conversion routines like [u]lltoa,
so we need generic conversions anyway. The current __ultoa_invert is
quite verbose and also adds a noticeable amount of code. It performs
faster than ultoa from stdlib.h, but is this essential?

I played around some more; attached is a version that operates on
memory, with variable length and radices, and doesn't need MUL. It's
called mem_toa and needs < 70 bytes / 8000 ticks for decimal.
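The generic algorithm being described here is classic restoring long division, run bit-serially over the whole value once per output digit. A rough Python model of the idea (a sketch, not a transcription of the mem_toa assembly; names are mine):

```python
def mem_toa_model(n, radix):
    """Digit-at-a-time conversion by repeated bit-serial long division:
    each pass divides n by radix one bit at a time, and the final
    remainder of the pass is the next (least significant) digit."""
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    out = []
    while True:
        rem, quot = 0, 0
        for bit in range(n.bit_length() - 1, -1, -1):   # msb to lsb
            rem = (rem << 1) | ((n >> bit) & 1)
            quot <<= 1
            if rem >= radix:        # trial subtraction succeeded
                rem -= radix
                quot |= 1
        out.append(digits[rem])
        n = quot
        if n == 0:
            return "".join(reversed(out))
```

The cost grows as (bits per pass) times (number of digits), which is where a figure on the order of 8000 ticks for a 64-bit decimal conversion comes from.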
The signed version adds yet another 40 bytes: #define __zero_reg__ r1 #define __tmp_reg__ r0 .macro DEFUN name .global \name .func \name \name: .endm .macro ENDF name .size \name, .-\name .endfunc .endm #if defined (__AVR_HAVE_JMP_CALL__) #define XCALL call #define XJMP jmp #else #define XCALL rcall #define XJMP rjmp #endif .macro wmov r_dest, r_src #if defined (__AVR_HAVE_MOVW__) movw \r_dest, \r_src #else mov \r_dest, \r_src mov \r_dest+1, \r_src+1 #endif .endm ;; End preamble #define pNum 26 #define pBuf 30 #define nBytes 20 #define Radix 18 #define Num 19 #define Count 21 #define nBits 22 #define Rem 23 ;; extern void mem_toa (char *pBuf, void *pNum, uint8_t nBytes, uint8_t Radix); .section .text.mem_toa,"ax",@progbits DEFUN mem_toa wmov pBuf, 24 wmov pNum, 22 .LnextDigit: add pNum, nBytes adc pNum+1, __zero_reg__ mov Count, nBytes ;; nBytes < 0 <==> nBytes is unknown ldi nBytes, -1 clr Rem .LnextByte: ld Num, -X ldi nBits, 8 .LnextBit: ;; Bit-wise construct (complement of) quotient into Num. ;; Bit-wise reconstruct Num into Rem. rol Num rol Rem ;; Reducing Rem mod Radix; C = 0 <==> reduction applied cp Rem, Radix brcs 1f sub Rem, Radix 1: dec nBits brne .LnextBit rol Num com Num st X, Num breq 2f sbrc nBytes, 7 mov nBytes, Count 2: dec Count brne .LnextByte subi Rem, -'0' cpi Rem, '9' + 1 brlt 3f subi Rem, '9' + 1 - 'a' 3: st Z+, Rem sbrs nBytes, 7 rjmp .LnextDigit st Z, __zero_reg__ ;; pBuf survived in R25:R24 XJMP strrev ENDF mem_toa #define Minus1 pBuf ;; extern void mem_toa_signed (char *pBuf, void *pNum, uint8_t nBytes, uint8_t Radix); .section .text.mem_toa_signed,"ax",@progbits DEFUN mem_toa_signed wmov pNum, 22 add pNum, nBytes adc pNum+1, __zero_reg__ ld Num, -X tst Num brpl 9f ;; Negate pNum[] according to -N = ~N + 1 = ~N - (-1). ;; Carry is clear now because ADC above didn't overflow. wmov pNum, 22 mov Count, nBytes ldi Minus1, -1 1: ld Num, X ;; Must use EOR for complement because COM clobbers carry. 
eor Num, Minus1 sbci Num, -1 st X+, Num dec Count brne 1b wmov pBuf, 24 ldi Num, '-' st Z, Num adiw 24, 1 9: XJMP mem_toa ENDF mem_toa_signed #undef Minus1 #undef pNum #undef pBuf #undef nBytes #undef Radix #undef Rem #undef Num #undef Count #undef nBits For purely decimal there's a version consuming 90 bytes only, requiring MUL and which is only 10% slower than yours: (Using the same preamble as the code above) .section .text.u64toa_nibbles,"ax",@progbits #ifdef __AVR_HAVE_MUL__ #define SPEED 1 #else #define SPEED 0 #endif #define pBuf 26 #define pNum 30 #define Length r20 #define Rem r19 #define Quot r21 #define Count r22 #define Num r23 #if SPEED # define Ten r16 # define Ox67 r18 #endif ;; extern void u64toa_nibbles (char *pBuf, void *pNum, uint8_t Length) ;; Length in [1, 127] DEFUN u64toa_nibbles wmov pBuf, 24 wmov pNum, 22 #if SPEED push Ten ldi Ten, 10 ;; For / 10 by multiply we would usually use a value of 0xcd ~ 0x800/10, ;; however Rem below will be in [0, 0x9f] so that ~ 1/2 of that value ;; also works, and that saves us one "lsr r1" per division. ldi Ox67, 0x67 clr Count #endif .LoopDigits: add pNum, Length #if SPEED ;; Count is 0. adc pNum+1, Count #else adc pNum+1, __zero_reg__ #endif mov Count, Length clt ; Length unknown clr Rem .LoopBytes: ld Num, -Z #if SPEED ;; Rem:Num <<= 4 ;; "andi Rem, 0xf" not needed because Rem < 10. eor Rem, Num andi Num, 0x0f eor Rem, Num swap Rem ;; Quot := Rem / 10 mul Rem, Ox67 lsr r1 lsr r1 mov Quot, r1 ;; Rem := Rem - 10 * (Rem / 10) = Rem % 10 mul r1, Ten sub Rem, r0 ;; ...and (mostly) the same again for the low nibble. ;; Quot <<= 4 swap Quot ;; Rem:Num <<= 4 swap Rem eor Rem, Num ;; Quot += Rem / 10 mul Rem, Ox67 lsr r1 lsr r1 add Quot, r1 brts 0f ;; Quot = 0 ==> R1 = 0, hence we can also skip the MUL + SUB below. breq 1f ;; Current Length not yet known: Set it according to highest non-zero byte. 
set mov Length, Count 0: ;; Rem := Rem - 10 * (Rem / 10) = Rem % 10 mul r1, Ten sub Rem, r0 1: #else // !SPEED clr Quot .LoopNibbles: ;; Quot <<= 4 swap Quot ;; Rem:Num <<= 4 swap Num swap Rem ;; "andi Rem, 0xf0" not needed because Rem was < 10. eor Rem, Num andi Num, 0xf0 eor Rem, Num 1: subi Quot, -4 subi Rem, 40 brcc 1b subi Rem, -40 ;; "+1" avoids "dec Quot" after the next loop. subi Quot, 4 + 1 2: inc Quot subi Rem, 10 brcc 2b subi Rem, -10 ;; dec Quot subi Count, 0x80 brmi .LoopNibbles brts 0f tst Quot breq 0f ;; Current Length not yet known: Set it according to highest non-zero byte. set mov Length, Count 0: #endif // SPEED st Z, Quot dec Count brne .LoopBytes subi Rem, -'0' st X+, Rem ;; Only continue if we determined Length (T=1). ;; T=0 means all bytes are zero, hence we are done then. brts .LoopDigits #if SPEED pop Ten clr __zero_reg__ #endif st X, __zero_reg__ ;; pBuf survived in R25:R24 XJMP strrev ENDF u64toa_nibbles #undef Count #undef pNum #undef pBuf #undef Rem #undef Num #undef Quot #undef Ten #undef Ox67 #undef Length #undef SPEED > 230 0 0 230 e6 itoa_engine.o (+OPTIMIZE_SPEED) > > but I'm reasonably satisfied. > > > Well, it turned out that my great idea from last time had a nasty > bug in it. There's no really good way to check for termination after > printing each digit. > > Recall that my bit buffer only has room for 15 bits of the number, so I > arrange for it to always hold 8 <= n <= 15 bits. When it's about to fall > from 8 to 7, I fetch another byte and decrement it from 16 to 15 instead. > > The problem is that there are two ways to handle the length, and > both of them have problems. > > One is to count so len=1 after fetching the last byte of input, and > len=0 after fetching the first byte of padding. > > In that case, consider printing one byte of zero in hex. > > I start with 8 bits in the buffer (msbyte = 0x01), and len = 1. I print a > '0', and check if len == 0. 
It's not, so I keep going and print a second
> zero before realizing that I should stop.
>
>
> Okay, that doesn't work, so what if I arrange so len counts the bytes
> remaining? len=0 after fetching the last byte, and it holds there
> after fetching padding.
>
> In that case, there are no extra digits, but there are missing
> digits. Consider printing 2^15 = 0x8000. I start out with
> len=1 and the last byte in the bit buffer. I print the final 0, then
> have to fetch another byte to start shifting bits into the lsbyte.
>
> So the bit buffer holds 0x800 (msbyte = 0x18, lsbyte = 0x00, len = 0) when
> I print the second 0. At this point, len == 0 and lsbyte == 0, so
> I should stop printing. But there are more bits in the msbyte.
>
> There are various tricks you can try in hex, but they break down in
> octal. Checking for extra bits in the msbyte makes for messy
> (= large) code.
>
>
> So I went back to my original idea of checking for termination when
> fetching padding. It's a couple of instructions longer, but it works.
>
> Anyway, here's the current state of the function I intend to rewrite
> vfprintf() around. Any comments? I know about a few mechanical things:
> I need to integrate it into avr-libc and #include the appropriate headers
> rather than #define things myself. Likewise X_movw and so on.
>
>
> [...]
>
> or __tmp_reg__, __tmp_reg__

"tst" is actually encoded as "AND Reg, Reg", so for any practical purposes it has the same effect (and purpose) as "OR Reg, Reg".
Johann #define __zero_reg__ r1 #define __tmp_reg__ r0 .macro DEFUN name .global \name .func \name \name: .endm .macro ENDF name .size \name, .-\name .endfunc .endm #if defined (__AVR_HAVE_JMP_CALL__) #define XCALL call #define XJMP jmp #else #define XCALL rcall #define XJMP rjmp #endif .macro wmov r_dest, r_src #if defined (__AVR_HAVE_MOVW__) movw \r_dest, \r_src #else mov \r_dest, \r_src mov \r_dest+1, \r_src+1 #endif .endm .text .section .text.u64toa_nibbles,"ax",@progbits #ifdef __AVR_HAVE_MUL__ #define SPEED 1 #else #define SPEED 0 #endif #define pBuf 26 #define pNum 30 #define Length r20 #define Rem r19 #define Quot r21 #define Count r22 #define Num r23 #if SPEED # define Ten r16 # define Ox67 r18 #endif ;; extern void u64toa_nibbles (char *pBuf, void *pNum, uint8_t Length) ;; Length in [1, 127] DEFUN u64toa_nibbles wmov pBuf, 24 wmov pNum, 22 #if SPEED push Ten ldi Ten, 10 ;; For / 10 by multiply we would usually use a value of 0xcd ~ 0x800/10, ;; however Rem below will be in [0, 0x9f] so that ~ 1/2 of that value ;; also works, and that saves us one "lsr r1" per division. ldi Ox67, 0x67 clr Count #endif .LoopDigits: add pNum, Length #if SPEED ;; Count is 0. adc pNum+1, Count #else adc pNum+1, __zero_reg__ #endif mov Count, Length clt ; Length unknown clr Rem .LoopBytes: ld Num, -Z #if SPEED ;; Rem:Num <<= 4 ;; "andi Rem, 0xf" not needed because Rem < 10. eor Rem, Num andi Num, 0x0f eor Rem, Num swap Rem ;; Quot := Rem / 10 mul Rem, Ox67 lsr r1 lsr r1 mov Quot, r1 ;; Rem := Rem - 10 * (Rem / 10) = Rem % 10 mul r1, Ten sub Rem, r0 ;; ...and (mostly) the same again for the low nibble. ;; Quot <<= 4 swap Quot ;; Rem:Num <<= 4 swap Rem eor Rem, Num ;; Quot += Rem / 10 mul Rem, Ox67 lsr r1 lsr r1 add Quot, r1 brts 0f ;; Quot = 0 ==> R1 = 0, hence we can also skip the MUL + SUB below. breq 1f ;; Current Length not yet known: Set it according to highest non-zero byte. 
set mov Length, Count 0: ;; Rem := Rem - 10 * (Rem / 10) = Rem % 10 mul r1, Ten sub Rem, r0 1: #else // !SPEED clr Quot .LoopNibbles: ;; Quot <<= 4 swap Quot ;; Rem:Num <<= 4 swap Num swap Rem ;; "andi Rem, 0xf0" not needed because Rem was < 10. eor Rem, Num andi Num, 0xf0 eor Rem, Num 1: subi Quot, -4 subi Rem, 40 brcc 1b subi Rem, -40 ;; "+1" avoids "dec Quot" after the next loop. subi Quot, 4 + 1 2: inc Quot subi Rem, 10 brcc 2b subi Rem, -10 ;; dec Quot subi Count, 0x80 brmi .LoopNibbles brts 0f tst Quot breq 0f ;; Current Length not yet known: Set it according to highest non-zero byte. set mov Length, Count 0: #endif // SPEED st Z, Quot dec Count brne .LoopBytes subi Rem, -'0' st X+, Rem ;; Only continue if we determined Length (T=1). ;; T=0 means all bytes are zero, hence we are done then. brts .LoopDigits #if SPEED pop Ten clr __zero_reg__ #endif st X, __zero_reg__ ;; pBuf survived in R25:R24 XJMP strrev ENDF u64toa_nibbles #undef Count #undef pNum #undef pBuf #undef Rem #undef Num #undef Quot #undef Ten #undef Ox67 #undef Length #undef SPEED ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; #define pNum 26 #define pBuf 30 #define nBytes 20 #define Radix 18 #define Num 19 #define Count 21 #define nBits 22 #define Rem 23 ;; extern void mem_toa (char *pBuf, void *pNum, uint8_t nBytes, uint8_t Radix); .section .text.mem_toa,"ax",@progbits DEFUN mem_toa wmov pBuf, 24 wmov pNum, 22 .LnextDigit: add pNum, nBytes adc pNum+1, __zero_reg__ mov Count, nBytes ;; nBytes < 0 <==> nBytes is unknown ldi nBytes, -1 clr Rem .LnextByte: ld Num, -X ldi nBits, 8 .LnextBit: ;; Bit-wise construct (complement of) quotient into Num. ;; Bit-wise reconstruct Num into Rem. 
rol Num rol Rem ;; Reducing Rem mod Radix; C = 0 <==> reduction applied cp Rem, Radix brcs 1f sub Rem, Radix 1: dec nBits brne .LnextBit rol Num com Num st X, Num breq 2f sbrc nBytes, 7 mov nBytes, Count 2: dec Count brne .LnextByte subi Rem, -'0' cpi Rem, '9' + 1 brlt 3f subi Rem, '9' + 1 - 'a' 3: st Z+, Rem sbrs nBytes, 7 rjmp .LnextDigit st Z, __zero_reg__ ;; pBuf survived in R25:R24 XJMP strrev ENDF mem_toa #define Minus1 pBuf ;; extern void mem_toa_signed (char *pBuf, void *pNum, uint8_t nBytes, uint8_t Radix); .section .text.mem_toa_signed,"ax",@progbits DEFUN mem_toa_signed wmov pNum, 22 add pNum, nBytes adc pNum+1, __zero_reg__ ld Num, -X tst Num brpl 9f ;; Negate pNum[] according to -N = ~N + 1 = ~N - (-1). ;; Carry is clear now because ADC above didn't overflow. wmov pNum, 22 mov Count, nBytes ldi Minus1, -1 1: ld Num, X ;; Must use EOR for complement because COM clobbers carry. eor Num, Minus1 sbci Num, -1 st X+, Num dec Count brne 1b wmov pBuf, 24 ldi Num, '-' st Z, Num adiw 24, 1 9: XJMP mem_toa ENDF mem_toa_signed #undef Minus1 #undef pNum #undef pBuf #undef nBytes #undef Radix #undef Rem #undef Num #undef Count #undef nBits ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; .section .text.put64,"ax",@progbits #define A0 18 #define A1 A0 + 1 #define A2 A0 + 2 #define A3 A0 + 3 #define A4 A0 + 4 #define A5 A0 + 5 #define A6 A0 + 6 #define A7 A0 + 7 #define BUF0 26 #define BUF1 BUF0 + 1 #define Digit 31 #define Carry 31 #define Ten 30 #define Count 29 #define nonZero 28 ;;; extern void put64 (uint64_t A, char *buf); ;;; This function writes up to 21 characters (including final '\0') to buf. DEFUN put64 push r28 push r29 wmov BUF0, 16 ldi Count, 19 ldi nonZero, '0' .Loop: ;; A[] -= 1.000.000.000.000.000.000 as often as we can. ;; Digit will hold the net number of subtractions (plus '0'). 
ldi Digit, '0'-1 1: inc Digit ;; 1.000.000.000.000.000.000 = 0DE0 B6B3 A764 0000 subi A2, 0x64 sbci A3, 0xa7 sbci A4, 0xb3 sbci A5, 0xb6 sbci A6, 0xe0 sbci A7, 0xd brcc 1b ;; -1.000.000.000.000.000.000 = F21F 494C 589C 0000 ;; Undo the underflow from above subi A2, 0x9c sbci A3, 0x58 sbci A4, 0x4c sbci A5, 0x49 sbci A6, 0x1f sbci A7, 0xf2 ;; As ffff ffff ffff ffff = 18.***.***.***.***.***.*** we can get ;; Digit up to 18+'0' in the first round. cpi Digit, '0'+10 brlo 2f ;; If that is the case, split it into 2 digits: a '1' and the ;; low part in Digit. ldi nonZero, '1' st X+, nonZero subi Digit, 10 2: ;; nonZero "accumulates" digits: it holds a value != '0' if we ever ;; output a digit not equal to '0'. or nonZero, Digit ;; We only output digits if we saw a digit != '0', i.e. strip leading '0's. cpi nonZero, '0' breq 3f st X+, Digit 3: ;; A[] *= 10 ldi Ten, 10 clr Carry .Lrot: mul A0, Ten add r0, Carry mov Carry, r1 clr __zero_reg__ adc Carry, __zero_reg__ ;; A[] >>>= 8 ;; Rotate 8 times so that we can loop the *= 10 over A[]. ;; The current value of A0 resides in r0, hence no "mov r0,A0" needed ;; to start the rotation. mov A0,A1 $ mov A1,A2 $ mov A2,A3 $ mov A3,A4 mov A4,A5 $ mov A5,A6 $ mov A6,A7 $ mov A7,r0 ;; Use the upper 3 bits of Count as loop counter for 8 loops. subi Count, -32 brcs .Lrot dec Count brne .Loop ;; If all digits were '0', we had A[] = 0: Output one '0'. cpi nonZero, '0' brne 4f st X+, nonZero 4: ;; String end and epilogue. st X+, __zero_reg__ pop r29 pop r28 ret ENDF put64 .section .text.put64s,"ax",@progbits ;;; extern void put64s (int64_t A, char *buf); ;;; Writes up to 21 characters (including '-' and final '\0') to buf. 
DEFUN put64s #ifdef __AVR_ERRATA_SKIP_JMP_CALL__ tst A7 brmi 0f #else sbrs A7, 7 #endif XJMP put64 0: push r16 push r17 XCALL __negdi2 ; Assumes avr-gcc 4.7+ wmov BUF0, 16 ldi Digit, '-' st X+, Digit wmov 16, BUF0 XCALL put64 pop r17 pop r16 ret ENDF put64s #undef A0 #undef A1 #undef A2 #undef A3 #undef A4 #undef A5 #undef A6 #undef A7 #undef Digit #undef BUF0 #undef Ten #undef Count #undef Carry #undef nonZero _______________________________________________ AVR-libc-dev mailing list [hidden email] https://lists.nongnu.org/mailman/listinfo/avr-libc-dev |
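put64 above works most-significant digit first: the leading digit of the 64-bit value is extracted by repeatedly subtracting 10^18, then the remainder is multiplied by 10 and the process repeats, 19 times in all. A hedged Python model of that digit loop (it ignores the byte-rotation trick the assembly uses to keep the multiply an 8-bit loop, and the function name is mine):

```python
def put64_model(a):
    """Decimal conversion of a 64-bit value, msd first, by repeated
    subtraction of 10**18 followed by a *= 10, as put64 does."""
    assert 0 <= a < 2**64
    out = []
    for _ in range(19):
        digit = 0
        while a >= 10**18:      # net number of subtractions
            a -= 10**18
            digit += 1
        if digit >= 10:         # only possible on the first round,
            out.append("1")     # since 2**64-1 = 18.446...e18
            digit -= 10
        out.append(str(digit))
        a *= 10                 # a < 10**18, so a*10 < 10**19 < 2**64
    return "".join(out).lstrip("0") or "0"
```

Because the remainder after the subtraction loop is below 10^18, the multiply by 10 never overflows 64 bits, which is what makes the in-register loop safe.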
Longer reply coming; I'm benchmarking.
But the tl;dr is that damn it, you may be right and I've wasted several days of effort. :-( I can even shave a few cycles (and one instruction) off of your code. At less than half the size and slightly better than half the speed, it's hard to argue with. While the optimal tradeoff depends on the application, and all we can be sure of is that it's somewhere on the convex hull of the space/time plot, optimizing time^k * space for some exponent k is usually good, and you're beating me with k=1.
Georg-Johann Lay wrote:
> George Spelvin wrote:
>> It is unfortunately one instruction longer than ultoa_invert:
>>
>>    text    data     bss     dec     hex filename
>>     188       0       0     188      bc ultoa_invert.o
>>     190       0       0     190      be itoa_engine.o
>>     208       0       0     208      d0 itoa_engine.o  (+OPTIMIZE_SPEED)
>>     212       0       0     212      d4 itoa_engine.o  (without HAVE_MUL)

> hmmm, that's quite some increase in code size...

Um... just how serious are you being? I didn't think two bytes was
"quite some increase". But compared to your alternatives... sigh,
I may have been wasting my time. :-(

> IMO octal conversion is so remote from any practical purposes that we
> should use an algorithm that works for all radices.

Well, I sort of did that. I did one that works for all power-of-two
radices, and used it for both hex and octal.

Hex alone can be done much faster, but it isn't bad with the more
generic code.

...but you don't mean limited to powers of 2. Just plain old binary long
division. As I said when I started this, it's hard to escape certain
optimization habits.

> This can be done in less than 70 bytes (variable length / radix
> algorithm that operates on memory, no need for MUL, needs strrev.

H'm... let me have a poke at that.

The actual code changes are at the end, but I found 90 cycles
in one place, and 91 cycles + 1 instruction in another.

With those changes, here's a timing table, for values of the form 2^k - 1:

             Decimal           Hex          mod100
Input    mem_toa   itoa   mem_toa  itoa    mem_toa
 8 bit       293    217       194    98        194
16 bit       700    451       527   187        440
24 bit      1342    838      1008   276        748
32 bit      2119   1167      1637   365       1142
40 bit      3143   1649      2414   454       1697
48 bit      4290   2127      3339   543       2301
56 bit      5585   2733      4412   632       2991
64 bit      7115   3420      5633   721       3743

That's really attractive for decimal. Also, for not much extra code,
we could do the long division in base 100 (timing for only this part is
above), and then split that into 2 digits.

For hex, it's a bit painful.

> So all is boiling down to the question what execution time is
> acceptable or not:
>
> Is 8000 ticks too slow?
> > Is 3000 ticks acceptable? And for what reason? Are 3000 acceptable just > because we have an algorithm that performs in 3000 ticks? The answer, of course, is the same as to "Is 190 bytes too big?": It Depends On The Application. And as library writers, we don't *know* the application. For many applications, an AVR is too slow and they need an ARM or a DSP. "Best" is somewhere on the convex hull of the space/time plot, but I can't say where. I was designing a helper function for printf, and that's a certain size: text data bss dec hex filename 1648 0 0 1648 670 ./avr/lib/avr4/vfprintf_flt.o 948 0 0 948 3b4 ./avr/lib/avr4/vfprintf_std.o 505 0 0 505 1f9 ./avr/lib/avr4/vfprintf_min.o That, and the initial size of __ultoa_invert, kind of set the size scale of what I was aiming for. I have a faster algorithm that uses a 100-byte binary-to-BCD lookup table to do everything two digits at a time. Some applications might want that. > My strong preference is still to have a one-fits-all algorithm that > might very well be slower than an optimal one. But hey, an ordinary > division of a 64-bit value by 10 already costs 2300 cycles, so why > should we hunt cycles just for printf...? There is that. But an intelligent programmer might only use 64 bits for a final accumulation (of, say, 32x32-bit products), avoiding even 64-bit multiplies. > I played around some more; attached is a version that operates on > memory, variable length and radices, and doesn't need MUL. It's > called mem_toa and needs < 70 bytes / 8000 ticks for decimal. > The signed version adds yet another 40 bytes: I made the following changes: Since you already have pointers to the beginning and end of the string, which is what the first half of strrev finds, a slight rearrangement of that code (which saves space and 1 cycle per byte at the expense of one extra loop iteration for odd-length inputs) would let you jump to the middle of it. 
Copying that code to the end of this to make my benchmarking simpler, I get 7296 cycles to print 2^64-1. One speedup that leaps to mind is that you can negate the radix and do the trial subtraction using an add. That will produce the quotient with the right polarity. Moves one instruction out of the middle loop to the setup code. That saves 90 cycles, taking it to 7206. Another possibility is some bit-shifting tricks. It's of limited use here becaue it messes up your trick of using Num for both the input number and the output quotient. But consider the following, with "nBits" renamed to "Quot". It takes another cycle out of the second loop, saving another 91 cycles (7115). It starts "Quot" out equal to 1, and detect the completion of 8 bit shifts when that bit ends up in the carry flag. So "rol Quot" replaces "dec nBits", and thereby avoids the need for the second "rol Num" just before the store. DEFUN mem_toa wmov pBuf, 24 wmov pNum, 22 neg Radix .LnextDigit: add pNum, nBytes adc pNum+1, __zero_reg__ mov Count, nBytes ;; nBytes < 0 <==> nBytes is unknown ldi nBytes, -1 clr Rem .LnextByte: ld Num, -X ldi Quot, 1 .LnextBit: ;; Bit-wise construct quotient into Quot. ;; Bit-wise reconstruct Num into Rem. lsl Num rol Rem ;; Reducing Rem mod Radix; C = 1 <==> reduction applied add Rem, Radix brcs 1f sub Rem, Radix 1: rol Quot brcc .LnextBit st X, Quot breq 2f sbrc nBytes, 7 mov nBytes, Count 2: dec Count brne .LnextByte subi Rem, -'0' cpi Rem, '9' + 1 brlt 3f subi Rem, '9' + 1 - 'a' 3: st Z+, Rem sbrs nBytes, 7 rjmp .LnextDigit st Z, __zero_reg__ ;; pBuf survived in R25:R24 # XJMP strrev wmov pNum, 24 2: ld Num, X ld Count, -Z st X+, Count st Z, Num /* Loop while X < Z */ 3: cp XL, ZL cpc XH, ZH brlo 2b ret ENDF mem_toa _______________________________________________ AVR-libc-dev mailing list [hidden email] https://lists.nongnu.org/mailman/listinfo/avr-libc-dev |
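The two tweaks described here can be modeled compactly: negating the radix turns the trial subtraction into an addition whose carry-out is exactly the quotient bit, and seeding the quotient register with 1 means the "we have shifted 8 bits" condition falls out of the shift itself. A Python sketch of one .LnextByte pass, with 8-bit arithmetic emulated by masking (this illustrates the flag logic, not the exact assembly; the function name is mine):

```python
def div_byte(num, rem, neg_radix):
    """Divide the 16-bit value rem:num by radix (passed pre-negated as
    neg_radix = (-radix) & 0xFF), returning (quotient_byte, remainder).
    Requires rem < radix so the quotient fits in 8 bits."""
    quot = 1                                # sentinel bit: after 8 shifts
    while True:                             # it pops out as the "carry"
        hi = num >> 7                       # lsl Num
        num = (num << 1) & 0xFF
        rem = ((rem << 1) | hi) & 0xFF      # rol Rem
        t = rem + neg_radix                 # add Rem, Radix
        if t >= 256:                        # carry set <=> reduction applies
            rem = t & 0xFF                  # keep the reduced remainder
            qbit = 1
        else:
            qbit = 0                        # else "sub Rem, Radix" restores
        done = quot >> 7                    # rol Quot: sentinel reaches carry
        quot = ((quot << 1) | qbit) & 0xFF
        if done:
            return quot, rem
```

Note how `t >= 256` plays the role of the AVR carry flag: adding the negated radix carries out exactly when the remainder is large enough to reduce.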
Georg-Johann Lay wrote:
> Is 8000 ticks too slow?
>
> Is 3000 ticks acceptable? And for what reason? Are 3000 acceptable just
> because we have an algorithm that performs in 3000 ticks?
>
> My strong preference is still to have a one-fits-all algorithm that
> might very well be slower than an optimal one. But hey, an ordinary
> division of a 64-bit value by 10 already costs 2300 cycles, so why
> should we hunt cycles just for printf...?

Well, I went and asked the customer. As I mentioned, the motivating
application is the TAPR time interval counter (TICC).

Info: http://tapr.org/kits_ticc.html
Source: https://github.com/TAPR/TICC (Not up to date.)
Manual: http://www.tapr.org/~n8ur/TICC_Manual.pdf

Basically, it timestamps input events to sub-nanosecond resolution. It
prints them with picosecond (12 decimal place) resolution. E.g. fed a
1 Hz input signal, it might print:

104.897999794440
105.897999794492
106.897999794549
107.897999794551
108.897999794553
109.897999794552
110.897999794667

It would like to be able to run until 2^64 picoseconds wrap around in
213 days.

Anyway, although it only prints every input transition, the main
processing loop has a 1 ms schedule to meet (it *can* print at up to
1 kHz, synchronized with the USB polling interval), and of the 16,000
clock cycles available in that ms, 8000 are currently spoken for. 8000
are available for formatting and output device drivers.

So yeah, they'd definitely prefer 4000 cycles to 8000. But they're going
to use custom code *anyway*, since they don't want to wait for an
avr-libc release, so that doesn't have to determine what avr-libc does.
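As a quick sanity check on the wraparound figure quoted above (2^64 picoseconds until the 64-bit timestamp overflows):

```python
# 2**64 picoseconds expressed in days: about 213.5, matching "213 days"
ps = 2**64
seconds = ps / 10**12
days = seconds / 86400
assert 213 < days < 214
```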
I implemented the idea I had, of using mem_toa with base 100
and then printing two digits. Omitting the strrev code to make
everything equal, mem_toa is 64 bytes, and the modified decimal-only
version I call mem_tod is 80 bytes. u64toa_nibbles is 90 bytes. (That's
assuming a ret of the reversed string or a 2-byte rjmp to strrev; add
2 bytes if using JMP.)

Here's the timing:

                     Decimal                    Hex
Input    mem_toa  mem_tod   itoa  nibbles   mem_toa  itoa
 8 bit       269      220    217      141       194    98
16 bit       664      462    451      321       527   187
24 bit      1294      783    838      608      1008   276
32 bit      2059     1219   1167      948      1637   365
40 bit      3059     1775   1649     1395      2414   454
48 bit      4194     2373   2127     1895      3339   543
56 bit      5477     3069   2733     2459      4412   632
64 bit      6995     3822   3420     3130      5633   721

mem_tod has the advantage that it does no multiplies, so it works fine
on all devices. The nibbles code is the fastest, and I haven't tried to
improve it yet.
Oops, I meant to include this last time. Note how the mod-10 reduction
works, subtracting 30s until the remainder goes negative, then adding back 10s until it goes positive again. That takes an average of 12.2 cycles (averaged over all inputs 00..99). #define __zero_reg__ r1 #define __tmp_reg__ r0 .macro DEFUN name .global \name .func \name \name: .endm .macro ENDF name .size \name, .-\name .endfunc .endm #if defined (__AVR_HAVE_JMP_CALL__) #define XCALL call #define XJMP jmp #else #define XCALL rcall #define XJMP rjmp #endif .macro wmov r_dest, r_src #if defined (__AVR_HAVE_MOVW__) movw \r_dest, \r_src #else mov \r_dest, \r_src mov \r_dest+1, \r_src+1 #endif .endm ;; End preamble #define pNum 26 #define pBuf 30 #define nBytes 20 #define Radix 18 #define Num 19 #define Count 21 #define Quot 22 #define Rem 23 ;; extern void mem_tod (char *pBuf, void *pNum, uint8_t nBytes); .section .text.mem_tod,"ax",@progbits DEFUN mem_tod wmov pBuf, 24 _mem_tod_aux: wmov pNum, 22 ldi Radix, -100 .LnextDigit: add pNum, nBytes adc pNum+1, __zero_reg__ mov Count, nBytes ;; nBytes < 0 <==> nBytes is unknown ldi nBytes, -1 clr Rem .LnextByte: ld Num, -X ldi Quot, 1 .LnextBit: ;; Bit-wise construct quotient into Quot. ;; Bit-wise reconstruct Num into Rem. 
lsl Num rol Rem ;; Reducing Rem mod Radix; C=1 <==> reduction applied add Rem, Radix brcs 1f sub Rem, Radix 1: rol Quot brcc .LnextBit st X, Quot breq 2f sbrc nBytes, 7 mov nBytes, Count 2: dec Count brne .LnextByte ;; Now break Rem apart into two base-10 digits ;; Use Num, because we know it's zero 3: subi Num, -3 subi Rem, 30 brcc 3b 4: dec Num subi Rem, -10 brcs 4b subi Rem, -'0' st Z+, Rem subi Num, -'0' st Z+, Num sbrs nBytes, 7 rjmp .LnextDigit ;; Chop off redundant trailing zero cpi Num, '0' brne 5f st -Z, __zero_reg__ 5: st Z, __zero_reg__ ;; pBuf survived in R25:R24 #if 0 # XJMP strrev wmov pNum, 24 7: ld Num, X ld Count, -Z st X+, Count st Z, Num /* Loop while X < Z */ cp XL, ZL cpc XH, ZH brlo 7b #endif ret ENDF mem_tod #define Minus1 Num ;; extern void mem_toa_signed (char *pBuf, void *pNum, uint8_t nBytes, uint8_t Radix); .section .text.mem_toa_signed,"ax",@progbits DEFUN mem_tod_signed wmov pBuf, 24 wmov pNum, 22 add pNum, nBytes adc pNum+1, __zero_reg__ ld __tmp_reg__, -X cp __tmp_reg__, __zero_reg__ brpl 9f ;; Negate pNum[] according to -N = ~N + 1 = ~N - (-1). ;; Carry is clear now because ADC above didn't overflow. wmov pNum, 22 mov Count, nBytes ldi Minus1, -1 1: ld __tmp_reg__, X ;; Must use EOR for complement because COM clobbers carry. eor __tmp_reg__, Minus1 sbc __tmp_reg__, Minus1 st X+, __tmp_reg__ dec Count brne 1b ldi Num, '-' st Z+, Num 9: XJMP mem_tod_aux ENDF mem_tod_signed #undef Minus1 #undef pNum #undef pBuf #undef nBytes #undef Radix #undef Rem #undef Num #undef Count #undef nBits _______________________________________________ AVR-libc-dev mailing list [hidden email] https://lists.nongnu.org/mailman/listinfo/avr-libc-dev |
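The two-digit split at the end of mem_tod is the same subtract-high/add-back trick as the mod-10 reduction described above: over-subtract 30s while counting 3 tens per step, then add back 10s while counting down, leaving the tens digit in one register and the units in the other. A Python model, with my own names:

```python
def split_base100(rem):
    """Split 0 <= rem <= 99 into (tens, units) the way mem_tod does."""
    tens = 0
    while True:          # 3: subi Num, -3 / subi Rem, 30 / brcc 3b
        tens += 3        # three tens per subtraction of 30
        rem -= 30
        if rem < 0:      # overshot: remainder went negative
            break
    while True:          # 4: dec Num / subi Rem, -10 / brcs 4b
        tens -= 1        # back off one ten per add of 10
        rem += 10
        if rem >= 0:     # non-negative again: done
            break
    return tens, rem
```

The subtract-by-30 phase converges three times faster than subtracting 10s would, which is where the 12.2-cycle average comes from.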
On 19.12.2016 23:05, George Spelvin wrote:
> Georg-Johann Lay wrote: >> George Spelvin schrieb: >>> It is unfortunately one instruction longer than ultoa_invert: >>> >>> text data bss dec hex filename >>> 188 0 0 188 bc ultoa_invert.o >>> 190 0 0 190 be itoa_engine.o >>> 208 0 0 208 d0 itoa_engine.o (+OPTIMIZE_SPEED) >>> 212 0 0 212 d4 itoa_engine.o (without HAVE_MUL) > >> hmmm, that's quite some increase in code size... > > Um... just how serious are you being? I didn't think two bytes was > "quite some increase". But compared to your alternatives... sigh, > I may have been wasting my time. :-( oh, I was just misreading this table and thought that it means yet another 200 bytes atop of ultoa_invert (just demonstrating that it isn't worse than ultoa_invert). But it appears you are intending to drop ultoa_invert which is great! >> IMO octal conversion is so remote from any practical purposes that we >> should use an algorithm that works for all radices. > > Well, I sort of did that. I did one that works for all power-of-two > sizes, and used it for both hex and octal. > > Hex alone can be done much faster, but it isn't bad with the more > generic code. > > ...but you don't mean limited to powers of 2. Just plain old binary long > division. As I said when I started this, it's hard to escape certain > optimization habits. > >> This can be done in less than 70 bytes (variable length / radix >> algorithm that operates on memory, no need for MUL, needs strrev. > > H'm... let me have a poke at that. > > The actual code changes are at the end, but I found 90 cycles > in one place, and 91 cycles + 1 instruction in another. 
> > With those changes, here's a timing table, for values of the from 2^k - 1: > > Decimal Hex mod100 > Input mem_toa itoa mem_toa itoa mem_toa > 8 bit 293 217 194 98 194 > 16 bit 700 451 527 187 440 > 24 bit 1342 838 1008 276 748 > 32 bit 2119 1167 1637 365 1142 > 40 bit 3143 1649 2414 454 1697 > 48 bit 4290 2127 3339 543 2301 > 56 bit 5585 2733 4412 632 2991 > 64 bit 7115 3420 5633 721 3743 Really funny results (just noticed you are using TABs and therefore the whole table is rigged ;-)) > That's really attractive for decimal. Also, for not much extra code, > we could do the long division in base 100 (timing for only this part is > above), and then split that into 2 digits. > > For hex, it's a bit painful. > >> So all is boiling down to the question what execution time is >> acceptable or not: >> >> Is 8000 ticks too slow? >> >> Is 3000 ticks acceptable? And for what reason? Are 3000 acceptable just >> because we have an algorithm that performs in 3000 ticks? > > The answer, of course, is the same as to "Is 190 bytes too big?": > It Depends On The Application. > > And as library writers, we don't *know* the application. For many > applications, an AVR is too slow and they need an ARM or a DSP. > > "Best" is somewhere on the convex hull of the space/time plot, but I > can't say where. > > I was designing a helper function for printf, and that's a certain size: > > text data bss dec hex filename > 1648 0 0 1648 670 ./avr/lib/avr4/vfprintf_flt.o > 948 0 0 948 3b4 ./avr/lib/avr4/vfprintf_std.o > 505 0 0 505 1f9 ./avr/lib/avr4/vfprintf_min.o > > That, and the initial size of __ultoa_invert, kind of set the size scale > of what I was aiming for. > > I have a faster algorithm that uses a 100-byte binary-to-BCD lookup table > to do everything two digits at a time. Some applications might want that. > >> My strong preference is still to have a one-fits-all algorithm that >> might very well be slower than an optimal one. 
>> But hey, an ordinary
>> division of a 64-bit value by 10 already costs 2300 cycles, so why
>> should we hunt cycles just for printf...?
>
> There is that.  But an intelligent programmer might only use 64 bits
> for a final accumulation (of, say, 32x32-bit products), avoiding
> even 64-bit multiplies.

Often programmers are bitten by their smartness when they observe that
avr-gcc generates "some" extra instructions around the core 64-bit
arithmetic.  But that's a different story...

>> I played around some more; attached is a version that operates on
>> memory, variable length and radices, and doesn't need MUL.  It's
>> called mem_toa and needs < 70 bytes / 8000 ticks for decimal.
>> The signed version adds yet another 40 bytes:
>
> I made the following changes:
>
> Since you already have pointers to the beginning and end of the string,
> which is what the first half of strrev finds, a slight rearrangement of
> that code (which saves space and 1 cycle per byte at the expense of one
> extra loop iteration for odd-length inputs) would let you jump to the
> middle of it.
>
> Copying that code to the end of this to make my benchmarking
> simpler, I get 7296 cycles to print 2^64-1.

IMO including the time of strrev doesn't add much to the execution
times; and if strrev is invoked it actually adds to the conversion
times anyway.

> One speedup that leaps to mind is that you can negate the radix and
> do the trial subtraction using an add.  That will produce the quotient
> with the right polarity.  Moves one instruction out of the middle loop
> to the setup code.
>
> That saves 90 cycles, taking it to 7206.

Just a small speed-up, but really cool idea :-)

> Another possibility is some bit-shifting tricks.  It's of limited use
> here because it messes up your trick of using Num for both the input
> number and the output quotient.  But consider the following, with
> "nBits" renamed to "Quot".  It takes another cycle out of the second
> loop, saving another 91 cycles (7115).
> It starts "Quot" out equal to 1, and detects the completion of 8 bit
> shifts when that bit ends up in the carry flag.  So "rol Quot" replaces
> "dec nBits", and thereby avoids the need for the second "rol Num" just
> before the store.
>
> DEFUN mem_toa
>         wmov    pBuf, 24
>         wmov    pNum, 22
>         neg     Radix
>
> .LnextDigit:
>         add     pNum, nBytes
>         adc     pNum+1, __zero_reg__
>         mov     Count, nBytes
>         ;; nBytes < 0 <==> nBytes is unknown
>         ldi     nBytes, -1
>         clr     Rem
>
> .LnextByte:
>         ld      Num, -X
>         ldi     Quot, 1
>
> .LnextBit:
>         ;; Bit-wise construct quotient into Quot.
>         ;; Bit-wise reconstruct Num into Rem.
>         lsl     Num
>         rol     Rem
>
>         ;; Reducing Rem mod Radix; C = 1 <==> reduction applied
>         add     Rem, Radix
>         brcs    1f
>         sub     Rem, Radix
> 1:
>         rol     Quot
>         brcc    .LnextBit
>
>         st      X, Quot
>
>         breq    2f
>         sbrc    nBytes, 7
>         mov     nBytes, Count
> 2:
>         dec     Count
>         brne    .LnextByte
>
>         subi    Rem, -'0'
>         cpi     Rem, '9' + 1
>         brlt    3f
>         subi    Rem, '9' + 1 - 'a'
> 3:
>         st      Z+, Rem
>
>         sbrs    nBytes, 7
>         rjmp    .LnextDigit
>
>         st      Z, __zero_reg__
>
>         ;; pBuf survived in R25:R24
> # XJMP strrev
>         wmov    pNum, 24
> 2:      ld      Num, X
>         ld      Count, -Z
>         st      X+, Count
>         st      Z, Num
>         /* Loop while X < Z */
> 3:      cp      XL, ZL
>         cpc     XH, ZH
>         brlo    2b
>
>         ret
> ENDF mem_toa
_______________________________________________
AVR-libc-dev mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/avr-libc-dev
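For anyone who finds the register-level asm hard to follow, the algorithm mem_toa implements can be modeled in a few lines of C. This is an editorial sketch, not the posted code: the negated-radix/add trick is modeled as an ordinary compare-and-subtract, and the in-loop length tracking is kept so the quadratic behavior is visible (each digit pass scans only the surviving non-zero bytes).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* C model of mem_toa's bitwise long division (names are made up):
   num[] is the value, little-endian, len bytes; one outer pass
   divides the whole buffer by radix and emits one digit, least
   significant first, then the string is reversed as strrev would. */
static void mem_toa_model(char *buf, uint8_t *num, unsigned len, unsigned radix)
{
    char *p = buf;
    do {
        unsigned rem = 0;
        unsigned newlen = 0;        /* highest non-zero quotient byte + 1 */
        for (unsigned i = len; i-- > 0; ) {   /* msbyte first */
            unsigned quot = 0;
            unsigned byte = num[i];
            for (int bit = 0; bit < 8; bit++) {
                rem = (rem << 1) | (byte >> 7);   /* lsl Num / rol Rem */
                byte = (byte << 1) & 0xff;
                quot <<= 1;
                if (rem >= radix) {               /* trial subtraction */
                    rem -= radix;
                    quot |= 1;
                }
            }
            num[i] = (uint8_t)quot;               /* quotient replaces input */
            if (quot && !newlen)
                newlen = i + 1;
        }
        *p++ = (char)(rem < 10 ? '0' + rem : 'a' + rem - 10);
        len = newlen;               /* shrink: skip leading zero bytes */
    } while (len);
    *p = 0;
    for (char *lo = buf, *hi = p - 1; lo < hi; lo++, hi--) {
        char t = *lo; *lo = *hi; *hi = t;         /* strrev */
    }
}
```

Each digit costs one full scan of the remaining bytes at 8 bits each, which is where the O(n^2) cycle counts in the tables above come from.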
On 20.12.2016 00:51, George Spelvin wrote:
>> Is 8000 ticks too slow?
>>
>> Is 3000 ticks acceptable? And for what reason? Are 3000 acceptable
>> just because we have an algorithm that performs in 3000 ticks?
>>
>> My strong preference is still to have a one-fits-all algorithm that
>> might very well be slower than an optimal one. But hey, an ordinary
>> division of a 64-bit value by 10 already costs 2300 cycles, so why
>> should we hunt cycles just for printf...?
>
> Well, I went and asked the customer.
>
> As I mentioned, the motivating application is the TAPR time interval
> counter (TICC).
> Info: http://tapr.org/kits_ticc.html
> Source: https://github.com/TAPR/TICC (Not up to date.)
> Manual: http://www.tapr.org/~n8ur/TICC_Manual.pdf
>
> Basically, it timestamps input events to sub-nanosecond resolution.
> It prints them with picosecond (12 decimal place) resolution.
>
> E.g. fed a 1 Hz input signal, it might print:
>
> 104.897999794440
> 105.897999794492
> 106.897999794549
> 107.897999794551
> 108.897999794553
> 109.897999794552
> 110.897999794667
>
> It would like to be able to run until 2^64 picoseconds wrap around in
> 213 days.

Just out of interest: Isn't this overflow a problem? Hence isn't a
different representation warranted that doesn't overflow so soon?

> Anyway, although it only prints every input transition, the main
> processing loop has a 1 ms schedule to meet (it *can* print at up to
> 1 kHz, synchronized with the USB polling interval), and of the 16,000
> clock cycles available in that ms, 8000 are currently spoken for.
> 8000 are available for formatting and output device drivers.
>
> So yeah, they'd definitely prefer 4000 cycles to 8000.
>
> But they're going to use custom code *anyway*, since they don't want
> to wait for an avr-libc release, so that doesn't have to determine
> what avr-libc does.

IMO that's a sound approach -- not only because of the avr-libc release
schedule but also because you actually have to guarantee some upper
bound for the execution time (of conversion etc.).
What consumes the other 8000 cycles? Arithmetic?

Johann
> What consumes the other 8000 cycles? Arithmetic?
I gave a link to the code (which I haven't read either, so I'm not
blaming you), but I expect it's a combination of dealing with the
TDC7200 time-to-digital converter chip (it's an SPI peripheral) and
arithmetic.

The chip is a combination of a 10 MHz "coarse" counter clocked by an
external quartz reference, and a 17 GHz ring oscillator.  The latter is
subject to drift, so every input measurement is followed by a
calibration cycle which measures one cycle of the coarse counter.  Then
you have to do some math to translate the ring oscillator cycle count
to fractions of the 10 MHz timebase.

While I expect there's slack in that code which could be taken out,
just knowing that there's a 16000 clock cycle budget for *everything*
suggests that maybe I shouldn't be cavalier about 4000 cycles.
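The interpolation math described above can be sketched in a few lines. This is a deliberately simplified illustration of the idea (calibrate the ring-oscillator LSB against one coarse period, then scale the fine count by it); the function name, scaling, and fixed-point choices here are assumptions, not the TICC's or the TDC7200 datasheet's actual formula.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: convert a coarse count (100 ns ticks of the
   10 MHz reference) plus a fine ring-oscillator count into
   picoseconds.  cal_count is the number of ring-oscillator cycles
   measured across one coarse period during calibration, so one ring
   cycle represents COARSE_PS / cal_count picoseconds. */
static uint64_t interval_ps(uint32_t coarse, uint32_t fine, uint32_t cal_count)
{
    const uint64_t COARSE_PS = 100000;      /* 100 ns in picoseconds */
    return (uint64_t)coarse * COARSE_PS
         + ((uint64_t)fine * COARSE_PS) / cal_count;
}
```

With a ~17 GHz ring oscillator, cal_count is around 1700, so the fine term resolves the coarse tick to roughly 59 ps; the 64-bit accumulation is what eventually lands in the printf path being discussed.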
Georg-Johann Lay wrote:
> oh, I was just misreading this table and thought that it means yet
> another 200 bytes atop of ultoa_invert (just demonstrating that
> it isn't worse than ultoa_invert).
>
> But it appears you are intending to drop ultoa_invert which is great!

*Whew*!  No, I'm comparing it because it's a *replacement* for
ultoa_invert.o.  I wanted smaller *and* faster *and* arbitrary-length.
I was feeling very deflated by your complaints.

> Really funny results (just noticed you are using TABs and therefore
> the whole table is rigged ;-))

Here is the updated table (which consistently omits the string
reverse) in a form that can take one level of quoting:

Input          Decimal                       Hex
bits    mem_toa mem_tod  itoa nibbles   mem_toa  itoa
  8         269     220   217     141       194    98
 16         664     462   451     321       527   187
 24        1294     783   838     608      1008   276
 32        2059    1219  1167     948      1637   365
 40        3059    1775  1649    1395      2414   454
 48        4194    2373  2127    1895      3339   543
 56        5477    3069  2733    2459      4412   632
 64        6995    3822  3420    3130      5633   721

For binary bases, you're seeing here the difference between an O(n)
algorithm (89*n + 9 cycles) and an O(n^2) one (74*n^2 + 111*n + 9).
I'm really not happy with that.

For decimal, however, mem_tod without a multiplier is almost as fast
as my decimal code *with* one, and is 80 bytes long.  That's the code
I'd like to use on multiplierless machines.  mem_toa is 64 bytes, but
the almost 2x speed difference is worth the 16 bytes, IMHO.

Your u64toa_nibbles code (after I tweaked it a bit) is 90 bytes and
the fastest of all.  There, it's only about 75% the time of the
multiplierless code, so whether the speed is worth it is more of a
question.  One thing that justifies it to me is that the enhanced
cores tend to come with more flash, so an additional 10 bytes is
affordable.  Also contributing to the affordability is that the
enhanced cores tend to generate smaller code overall.

On the other hand, maintaining two completely different code paths is
a bother.  There's a lot to be said for just one.
One alternative I mentioned earlier, which I'm thinking seriously
about, is to reorganize the code into two phases:

1) Convert decimal and octal to little-endian BCD.  (%x would just
   find the length.)
2) Print little-endian hex as ASCII.

That would enlarge the code somewhat, but reduce stack usage by 11
bytes.  As I noted, ROM:RAM is usually 16:1, so I could argue that
those 16 bytes of RAM are "worth" 176 bytes of code, which is far more
than the code size increase.

Note that mem_tod produces digits two at a time anyway, so it's a
natural fit.  I have some ideas for how to adapt the u64toa_nibbles
code, and octal wouldn't be too hard.

I'd really appreciate your opinion of the idea.

> Often programmers are bitten by their smartness when they observe
> that avr-gcc generates "some" extra instructions around the core
> 64-bit arithmetic.  But that's a different story...

I don't quite know what you're alluding to.  What frustrates me about
avr-gcc is code like time/gm_sidereal.c, where it's doing a 32x32->64
bit multiply, but if !__AVR_HAVE_MUL__, then gcc "knows" that there's
no __umulsidi3 function, and generates an absolutely massive spill &
fill sequence to call __muldi3, which ends up being a lot bigger &
slower than a call to the __umulsidi3 wrapper which totally exists.

>> That saves 90 cycles, taking it to 7206.
>
> Just a small speed-up, but really cool idea :-)

You had some awfully cool ideas yourself, particularly integrating the
length-finding into the main loop.
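The two-phase split proposed above can be sketched in C. This is an editorial illustration with made-up names, and phase 1 uses plain C division rather than the byte-at-a-time asm; the point is the data flow: a packed-BCD scratch buffer holds two digits per byte (10 bytes for a 20-digit 64-bit value instead of a 20-character string), and phase 2 prints it exactly like a little-endian hex number.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Phase 1: binary -> packed BCD, little-endian, two digits per byte.
   Returns the number of BCD bytes used. */
static unsigned to_bcd(uint8_t *bcd, uint64_t x)
{
    unsigned n = 0;
    do {
        bcd[n++] = (uint8_t)((x % 10) | ((x / 10 % 10) << 4));
        x /= 100;
    } while (x);
    return n;
}

/* Phase 2: print the BCD buffer msbyte-first, nibble by nibble,
   just as hex output would, then drop a leading zero digit. */
static void bcd_to_ascii(char *buf, const uint8_t *bcd, unsigned n)
{
    char *p = buf;
    for (unsigned i = n; i-- > 0; ) {
        *p++ = (char)('0' + (bcd[i] >> 4));
        *p++ = (char)('0' + (bcd[i] & 0x0f));
    }
    if (buf[0] == '0' && p > buf + 1) {     /* odd digit count */
        memmove(buf, buf + 1, (size_t)(p - buf) - 1);
        p--;
    }
    *p = 0;
}
```

Because the digits are generated most-significant-last in phase 1 but consumed most-significant-first in phase 2, no strrev pass is needed either.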
I mentioned last time trying something like this, and now
I have it coded up (as utoa_bytes()).

Input       Decimal
bits    mem_toa mem_tod  itoa nibbles  bytes
  8         269     220   217     141    143
 16         664     462   451     321    273
 24        1294     783   838     608    432
 32        2059    1219  1167     948    666
 40        3059    1775  1649    1395    941
 48        4194    2373  2127    1895   1217
 56        5477    3069  2733    2459   1551
 64        6995    3822  3420    3130   1902

It's larger (120 bytes, vs. 90 for the nibbles code), and has a bit
more startup overhead, but is quite fast.  I also have a 122-byte
variant which saves 1 cycle per pair of digits.  (1895 cycles for
2^64-1.)

I'm interested in this both for speed, and because it's a good fit to
my suggestion of saving RAM space by buffering the converted digits in
BCD.

So now that we have several good candidates, how to proceed?
What size/speed tradeoff should be the final choice?

Personally, I'm not worried about 30 bytes of code on enhanced AVRs
with a multiplier, and although I haven't coded up the combined
algorithm, I believe the whole thing is smaller than the ultoa_invert
code it replaces.

But this is a judgement call, and I'd appreciate some other voices.
#define __zero_reg__ r1
#define __tmp_reg__ r0

.macro DEFUN name
.section .text.\name,"ax",@progbits
.global \name
.func \name
\name:
.endm

.macro ENDF name
.size \name, .-\name
.endfunc
.endm

#if defined (__AVR_HAVE_JMP_CALL__)
#define XCALL call
#define XJMP jmp
#else
#define XCALL rcall
#define XJMP rjmp
#endif

.macro wmov r_dest, r_src
#if defined (__AVR_HAVE_MOVW__)
        movw    \r_dest, \r_src
#else
        mov     \r_dest, \r_src
        mov     \r_dest+1, \r_src+1
#endif
.endm

.text

#define pBuf    26
#define pNum    30
#define Length  r20
#define Rem     r19
#define Quot    r21
#define Count   r22
#define Num     r23
#define Ten     r16
#define Ox67    r18

;; extern void u64toa_nibbles (char *pBuf, void *pNum, uint8_t Length)
;; Length in [1, 127]
DEFUN u64toa_nibbles
        wmov    pBuf, 24
        wmov    pNum, 22
        push    Ten
        ldi     Ten, 10
        ;; For / 10 by multiply we would usually use a value of
        ;; 0xcd ~ 0x800/10, however Rem below will be in [0, 0x9f] so
        ;; that ~ 1/2 of that value also works, and that saves us one
        ;; "lsr r1" per division.
        ldi     Ox67, 0x67
        clr     Count

.LoopDigits:
        add     pNum, Length    ;; Count is 0.
        adc     pNum+1, Count
        mov     Count, Length
        ldi     Length, -1
        clr     Rem

.LoopBytes:
        ld      Num, -Z
        ;; Rem:Num <<= 4
        ;; "andi Rem, 0xf" not needed because Rem < 10.
        eor     Rem, Num
        andi    Num, 0x0f
        eor     Rem, Num
        swap    Rem
        ;; Quot := Rem / 10
        mul     Rem, Ox67
        lsr     r1
        lsr     r1
        mov     Quot, r1
        ;; Rem := Rem - 10 * (Rem / 10) = Rem % 10
        mul     r1, Ten
        sub     Rem, r0
        ;; ...and (mostly) the same again for the low nibble.
        ;; Quot <<= 4
        swap    Quot
        ;; Rem:Num <<= 4
        swap    Rem
        eor     Rem, Num
        ;; Quot += Rem / 10
        mul     Rem, Ox67
        lsr     r1
        lsr     r1
        add     Quot, r1
        ;; Quot = 0 ==> R1 = 0, hence we can also skip the MUL + SUB below.
        breq    1f
        sbrc    Length, 7       ;; Current Length not yet known:
        mov     Length, Count   ;; Set it according to highest non-zero byte.
        ;; Rem := Rem - 10 * (Rem / 10) = Rem % 10
        mul     r1, Ten
        sub     Rem, r0
1:
        st      Z, Quot
        dec     Count
        brne    .LoopBytes

        subi    Rem, -'0'
        st      X+, Rem

        ;; Only continue if we determined Length (Length > 0)
        ;; Length < 0 means all bytes are zero, hence we are done then.
        sbrs    Length, 7
        rjmp    .LoopDigits

        pop     Ten
        clr     __zero_reg__
        st      X, __zero_reg__

        ;; pBuf survived in R25:R24
#if 0
# XJMP strrev
        wmov    ZL, 24
7:      ld      Count, -X
        ld      Num, Z
        st      Z+, Count
        st      X, Num
        /* Loop while Z < X */
        cp      ZL, XL
        cpc     ZH, XH
        brlo    7b
#endif
        ret
ENDF u64toa_nibbles

#undef Count
#undef pNum
#undef pBuf
#undef Rem
#undef Num
#undef Quot
#undef Ten
#undef Ox67
#undef Length

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; This works similarly, but in base 100.  Every step, we take a
;; remainder <= 99, multiply by 256, and add the next byte (<= 255).
;; The result is in the range 0 <= x < 25600.
;;
;; To divide by 100 with multiply:
;; x * 0x29   >> 12 == x/100 for 0 <= x < 1099
;; x * 0x51f  >> 17 == x/100 for 0 <= x < 4699
;; x * 0x147b >> 19 == x/100 for 0 <= x < 43699
;; x * 0x28f6 >> 20 == x/100 for 0 <= x < 43699
;;
;; Using a multiplier one bit larger than strictly necessary gives us
;; a more convenient shift amount.  This is still small enough that we
;; can compute the high bits of the product we need in only two
;; registers.
;;
;; The largest possible x = 0x63ff also has the largest possible
;; lsbyte, so it will give us the largest possible partial products.
;; Multiplying that by 0x28f6 requires four partial products:
;;
;;     FF * F6   =     F50A
;;     FF * 28-- =   27D8--
;;   63-- * F6   =   5F22--
;;   63-- * 28-- =  F78----
;;
;; The important thing is that the sum of the first three partial
;; products is 0x87ef0a, a 24-bit number.  So there is no need to
;; propagate carries into the msbyte.  We immediately discard the
;; lsbyte of the first partial product, and sum the others into a
;; 2-byte register.  Then we throw away the lsbyte of that register
;; and use it for the msbyte of the final sum.
;; extern void utoa_bytes(char *pBuf, void *pNum, uint8_t Length)
;; Length in [1, 127]
#define ZH      r31
#define ZL      r30
#define XH      r27
#define XL      r26
#define Q2      r23     /* Byte 2 of 4-byte product */
#define Q1      r22     /* Byte 1 (and 3) of 4-byte product */
#define Count   r21
#define Length  r20     /* Argument */
#define Rem     r19
#define Hundred r18     /* Constant 0x64 = 100 */
#define Khi     r17     /* Constant 0x28 = 40 */
#define Klo     r16     /* Constant 0xf6 = 246 */
#define Num     r15

DEFUN utoa_bytes
        movw    XL, r24         ; pBuf (output) to X
        movw    ZL, r22         ; pNum (input) to Z
        ldi     Hundred, 100
        push    Khi
        ldi     Khi, 0x28
        push    Klo
        ldi     Klo, 0xf6
        push    Num
        clr     Count

.LoopDigits2:
        add     ZL, Length
        adc     ZH, Count
        mov     Count, Length
        ldi     Length, -1
        clr     Rem

.LoopBytes2:
        ld      Num, -Z

        /* Multiply Rem:Num by Khi:Klo */
        mul     Num, Khi
        mov     Q1, r0
        mov     Q2, r1
        mul     Rem, Klo
        add     Q1, r0
        adc     Q2, r1          ; Cannot overflow
        mul     Num, Klo
        add     Q1, r1
        clr     Q1              ; We only need the carry bit out
        adc     Q2, Q1          ; This cannot overflow, either
        mul     Rem, Khi
        add     Q2, r0
        adc     Q1, r1

        ; We now have the high 12 bits of the 28-bit product in Q1:Q2.
        ; Shift down by 4 bits
        andi    Q2, 0xf0
        or      Q1, Q2
        swap    Q1
        ;; We now have the new quotient in "Q1".
        st      Z, Q1

        ;; If Q1 == 0, don't set length (and multiply is redundant, too)
        breq    1f
        sbrc    Length, 7       ;; Do we have a length yet?
        mov     Length, Count   ;; Remember position of highest non-zero Q

        mul     Q1, Hundred
        sub     Num, r0         ; New remainder
1:
        mov     Rem, Num
        dec     Count
        brne    .LoopBytes2

        ;; We now have the final base-100 remainder in Rem.
        ;; Break it apart into two base-10 digits in Q1:Rem
        ;; (We could use the multiplier again, but that would require
        ;; even more registers for the constants and more instructions,
        ;; and this isn't the inner loop.)
        ldi     Q1, '0'
3:      subi    Q1, -3
        subi    Rem, 30
        brcc    3b
4:      dec     Q1
        subi    Rem, -10
        brcs    4b

        subi    Rem, -'0'
        st      X+, Rem
        st      X+, Q1

        ;; Only continue if we determined Length (Length > 0)
        ;; Length < 0 means all bytes are zero, hence we are done.
        sbrs    Length, 7
        rjmp    .LoopDigits2

        ;; Chop off redundant trailing zero
        clr     __zero_reg__
        cpi     Q1, '0'
        brne    5f
        st      -X, __zero_reg__        ; Could sbiw, but this is same speed
5:      st      X, __zero_reg__

        ; Restore registers
        pop     Num
        pop     Klo
        pop     Khi

        movw    r24, r26        ; Return end of output
        ret
ENDF utoa_bytes

#undef Num
#undef Klo
#undef Khi
#undef Hundred
#undef Rem
#undef Length
#undef Count
#undef Q1
#undef Q2
#undef XL
#undef XH
#undef ZL
#undef ZH
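The divide-by-100 reciprocal constants quoted in the comment block can be checked exhaustively on a host machine. This is an editorial verification sketch, not AVR code; it simply confirms the stated validity ranges.

```c
#include <assert.h>
#include <stdint.h>

/* Return 1 iff (x * mult) >> shift equals x / 100 for all x < limit,
   i.e. iff the fixed-point reciprocal is exact over that range. */
static int check_recip(uint32_t mult, unsigned shift, uint32_t limit)
{
    for (uint32_t x = 0; x < limit; x++)
        if (((uint64_t)x * mult) >> shift != x / 100)
            return 0;
    return 1;
}
```

The bounds in the comment are tight: extending any of the ranges by one makes the first remainder-99 value past the bound round up to the wrong quotient.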
George Spelvin schrieb:
> I mentioned last time trying something like this, and now
> I have it coded up (as utoa_bytes()).
>
> [...]
>
> So now that we have several good candidates, how to proceed?
> What size/speed tradeoff should be the final choice?

After all it's you who will provide the final implementation and
testing, hence the final decision of what's appropriate, how much
effort will be put into the implementation, and what the final code
will look like is your decision, IMO.

> Personally, I'm not worried about 30 bytes of code on enhanced AVRs
> with a multiplier, and although I haven't coded up the combined
> algorithm, I believe the whole thing is smaller than the ultoa_invert
> code it replaces.

We only have multilib granularity, and there are not so many features
that are related to the flash size.  One is __AVR_HAVE_JMP_CALL__
which applies to devices with >= 16 KiB flash.  The next size
milestone is __AVR_HAVE_ELPM__ which means >= 128 KiB.  The JMP + CALL
looks reasonable to me; I used it for 64-bit divisions in libgcc
(which leads to the funny situation that a 64-bit division might run
faster than a 32-bit division for the same values).

> But this is a judgement call, and I'd appreciate some other voices.

For smaller devices (e.g.
no CALL but MUL), some bytes can be squeezed out by moving the LDI of
the constants to the inner loop, which saves the PUSH + POP.  But on
smaller devices, where xprintf is already a major code consumer, a
programmer might prefer something like ulltoa over the bloat from
xprintf.

> [...]
>
> #define Q2      r23     /* Byte 2 of 4-byte product */
> #define Q1      r22     /* Byte 1 (and 3) of 4-byte product */

Maybe it's a bit easier to grasp if

    #define Q3 Q1

and then use Q3 where byte #3 is computed (starting with "clr Q1").

>         /* Multiply Rem:Num by Khi:Klo */
>         mul     Num, Khi
>         mov     Q1, r0
>         mov     Q2, r1

Can use "wmov Q1, r0".

> [...]
Georg-Johann Lay wrote:
> George Spelvin schrieb:
>> So now that we have several good candidates, how to proceed?
>> What size/speed tradeoff should be the final choice?
>
> After all it's you who will provide the final implementation and
> testing, hence the final decision of what's appropriate, how much
> effort will be put into the implementation, and what the final code
> will look like is your decision, IMO.

Well, thank you very much, but after your "that's quite some size
increase" e-mail (and showing me better code than I'd been working on
for a couple of weeks), I'm feeling rather less confident.

(And, despite my asking, nobody's expressed any opinion at all about
my "save RAM by using BCD" suggestion.  Brilliant or crackpot?)

> We only have multilib granularity, and there are not so many features
> that are related to the flash size.  One is __AVR_HAVE_JMP_CALL__
> which applies to devices with >= 16 KiB flash.  The next size
> milestone is __AVR_HAVE_ELPM__ which means >= 128 KiB.  The JMP +
> CALL looks reasonable to me; I used it for 64-bit divisions in libgcc
> (which leads to the funny situation that 64-bit division might run
> faster than a 32-bit division for the same values).

Interesting suggestion.  I could just use the multiplierless base-100
code, which is smaller and still reasonably fast.

And thank you very much!  I knew that HAVE_JMP_CALL meant that
RJMP/RCALL range wasn't enough, which means more than 12 bits of PC
(2^13 bytes of flash), but it had gotten lost in the forest of
confusion.

I'm befuddled by all of the different architecture options and don't
understand the difference between most of them.  I've been slowly
downloading data sheets for different examples from gcc's list and
looking for differences, but it's a laborious process.  (That document
on avr-tiny started out with me documenting my realization that avr1
was something else.)

For example, does MUL support imply MOVW support?  (I've been assuming
so, but that's an easy edit.)
And what's the relationship between MOVW support and ADIW/SBIW?  Are
they the same feature, or are there processors with one and not the
other?

(For aggressive size squeezing, I've realized that a lot of code is
wasted copying pointer return values from X or Z to r24:r25, only to
have the caller copy them right back to use the pointers.  It would be
lovely if there were a way to tell gcc "expect the return value for
this function in r30:r31".  And, sometimes, "this function preserves
r18-r20".)

> For smaller devices (e.g. no CALL but MUL) some bytes can be squeezed
> out by moving LDI of the constants to the inner loop which saves
> PUSH + POP.  But on smaller devices, where xprintf is already a
> major code consumer, a programmer might prefer something like ulltoa
> over the bloat from xprintf.

Um...I see how I can swap the Hundred constant around, but Khi/Klo are
both used twice each, so loading them twice would not save as much.
(If I return the end pointer rather than returning r24:r25 to the end
for a call to strrev that's not in the current code, that avoids two
more push/pop anyway.)

The way I have the multiply organized, I have to do the two middle
partial products first, then the low, then the high.  I can swap the
middle ones around, but I can't make both constants' uses adjacent.

>> #define Q2      r23     /* Byte 2 of 4-byte product */
>> #define Q1      r22     /* Byte 1 (and 3) of 4-byte product */
>
> Maybe it's a bit easier to grasp if
>
>     #define Q3 Q1
>
> and then use Q3 where byte #3 is computed (starting with "clr Q1")

I thought about that, but remembering that two names refer to the same
register (and thus you may not rearrange code to overlap their usage)
is also a pain.  I originally called them "Qeven" and "Qodd".

Maybe I can just improve the comments...

>>         /* Multiply Rem:Num by Khi:Klo */
>>         mul     Num, Khi
>>         mov     Q1, r0
>>         mov     Q2, r1
>
> Can use "wmov Q1, r0"

Ooh, nice!  I forgot that movw isn't limited to high registers.

Let me try to improve the comments...
do you still think this would be better with Q3?  (It dawned on me
that even if the product *isn't* guaranteed to not overflow, the
structure can compute the high half of a 32-bit product in only two
accumulator registers if we add one more "ADC Q1,Q1".)

>>         mul     Rem, Klo
>>         add     Q1, r0
>>         adc     Q2, r1  ; Cannot overflow
>>         mul     Num, Klo
>>         add     Q1, r1
        clr     Q1      ; No longer need Q1; re-use register for Q3
        adc     Q2, Q1  ; Propagate carry to Q2
        ;adc    Q1, Q1  ; (Omit: no carry possible due to input range)
>>         mul     Rem, Khi
        add     Q2, r0  ; Now byte 2 (hlo, Q2) of 32-bit product
        adc     Q1, r1  ; Now msbyte (hhi, Q3) of 32-bit product
>>
>>         ; We now have the high 12 bits of the 28-bit product in Q1:Q2.
>>         ; Shift down by 4 bits
>>         andi    Q2, 0xf0
>>         or      Q1, Q2
>>         swap    Q1
>>         ;; We now have the new quotient in "Q1".
>>         st      Z, Q1
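The partial-product sequence being discussed can be modeled byte-for-byte in C and checked against straight division. This is an editorial sketch (the function name is made up): it computes bytes 1..3 of (Rem:Num) * 0x28f6 the way the MUL-based asm does, then extracts bits 20..27 as the quotient. The input range Rem:Num < 25600 is exactly what justifies the "cannot overflow" comments.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t quot_by_100(uint8_t rem, uint8_t num)
{
    const uint8_t khi = 0x28, klo = 0xf6;   /* 0x28f6 ~ 2^20 / 100 */
    uint16_t p;
    uint8_t q1, q2, q3, c;

    p = (uint16_t)num * khi;        /* partial for bytes 1..2 */
    q1 = (uint8_t)p;
    q2 = (uint8_t)(p >> 8);

    p = (uint16_t)rem * klo;        /* partial for bytes 1..2 */
    q1 += (uint8_t)p;
    c = q1 < (uint8_t)p;            /* carry out of byte 1 */
    q2 += (uint8_t)(p >> 8) + c;    /* cannot wrap in this input range */

    p = (uint16_t)num * klo;        /* partial for bytes 0..1 */
    q1 += (uint8_t)(p >> 8);        /* byte 0 is discarded exactly:
                                       nothing else contributes to it */
    q2 += q1 < (uint8_t)(p >> 8);   /* carry only */

    p = (uint16_t)rem * khi;        /* partial for bytes 2..3 */
    q2 += (uint8_t)p;
    c = q2 < (uint8_t)p;
    q3 = (uint8_t)(p >> 8) + c;

    /* Bits 20..27 of the (at most 28-bit) product: the "andi/or/swap"
       step in the asm. */
    return (uint8_t)((q3 << 4) | (q2 >> 4));
}
```

Since the full product stays below 2^28 for inputs below 25600, the top nibble q3 never exceeds 15, so the swap-and-merge recovers the quotient exactly.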
I find all this fascinating, but I'm really not the one to be commenting on
what the best approach here is.  I will say, however, that in many of
my applications, I would be more likely to choose a speed increase
over reduced memory.  I tend to live with mostly compute-bound,
control applications though.

I've not had enough time to look at the code to make fine-grained
suggestions (and I'm out of practice with AVR ASM), but I've not felt
that the comments were too bad.  I've been able to follow the code as
written easily enough.

Cheers,
Martin Jay McKee

On Sat, Dec 24, 2016 at 10:59 AM, George Spelvin <[hidden email]> wrote:

> Georg-Johann Lay wrote:
> > George Spelvin schrieb:
> >> So now that we have several good candidates, how to proceed?
> >> What size/speed tradeoff should be the final choice?
>
> > After all it's you who will provide the final implementation and
> > testing, hence the final decision of what's appropriate, how much
> > effort will be put into the implementation, and what the final
> > code will look like is your decision, IMO.
>
> Well, thank you very much, but after your "that's quite some size
> increase" e-mail (and showing me better code than I'd been working
> on for a couple of weeks), I'm feeling rather less confident.
>
> (And, despite my asking, nobody's expressed any opinion at all about
> my "save RAM by using BCD" suggestion.  Brilliant or crackpot?)
>
> > We only have multilib granularity, and there are not so many
> > features that are related to the flash size.  One is
> > __AVR_HAVE_JMP_CALL__ which applies to devices with >= 16 KiB
> > flash.  The next size milestone is __AVR_HAVE_ELPM__ which means
> > >= 128 KiB.  The JMP + CALL looks reasonable to me; I used it for
> > 64-bit divisions in libgcc (which leads to the funny situation
> > that a 64-bit division might run faster than a 32-bit division
> > for the same values).
>
> Interesting suggestion.  I could just use the multiplierless
> base-100 code, which is smaller and still reasonably fast.
>
> And thank you very much!  I knew that HAVE_JMP_CALL meant that
> RJMP/RCALL range wasn't enough, which means more than 12 bits of PC
> (2^13 bytes of flash), but it had gotten lost in the forest of
> confusion.
>
> I'm befuddled by all of the different architecture options and don't
> understand the difference between most of them.  I've been slowly
> downloading data sheets for different examples from gcc's list and
> looking for differences, but it's a laborious process.  (That
> document on avr-tiny started out with me documenting my realization
> that avr1 was something else.)
>
> For example, does MUL support imply MOVW support?  (I've been
> assuming so, but that's an easy edit.)
>
> And what's the relationship between MOVW support and ADIW/SBIW?  Are
> they the same feature, or are there processors with one and not the
> other?
>
> (For aggressive size squeezing, I've realized that a lot of code is
> wasted copying pointer return values from X or Z to r24:r25, only to
> have the caller copy them right back to use the pointers.  It would
> be lovely if there were a way to tell gcc "expect the return value
> for this function in r30:r31".  And, sometimes, "this function
> preserves r18-r20".)
>
> > For smaller devices (e.g. no CALL but MUL) some bytes can be
> > squeezed out by moving LDI of the constants to the inner loop,
> > which saves PUSH + POP.  But on smaller devices, where xprintf is
> > already a major code consumer, a programmer might prefer something
> > like ulltoa over the bloat from xprintf.
>
> Um... I see how I can swap the Hundred constant around, but Khi/Klo
> are both used twice each, so loading them twice would not save as
> much.  (If I return the end pointer rather than returning r24:r25 to
> the end for a call to strrev that's not in the current code, that
> avoids two more push/pop anyway.)
>
> The way I have the multiply organized, I have to do the two middle
> partial products first, then the low, then the high.  I can swap the
> middle ones around, but I can't make both constants' uses adjacent.
>
> >> #define Q2      r23     /* Byte 2 of 4-byte product */
> >> #define Q1      r22     /* Byte 1 (and 3) of 4-byte product */
>
> > Maybe it's a bit easier to grasp if
> >
> > #define Q3 Q1
> >
> > and then use Q3 where byte #3 is computed (starting with "clr Q1")
>
> I thought about that, but remembering that two names refer to the
> same register (and thus you may not rearrange code to overlap their
> usage) is also a pain.  I originally called them "Qeven" and "Qodd".
>
> Maybe I can just improve the comments...
>
> >>    /* Multiply Rem:Num by Khi:Klo */
> >>    mul     Num, Khi
> >>    mov     Q1, r0
> >>    mov     Q2, r1
>
> > Can use "wmov Q1, r0"
>
> Ooh, nice!  I forgot that movw isn't limited to high registers.
>
> Let me try to improve the comments... do you still think this would
> be better with Q3?  (It dawned on me that even if the product
> *isn't* guaranteed to not overflow, the structure can compute the
> high half of a 32-bit product in only two accumulator registers if
> we add one more "ADC Q1,Q1".)
>
> >>    mul     Rem, Klo
> >>    add     Q1, r0
> >>    adc     Q2, r1  ; Cannot overflow
> >>    mul     Num, Klo
> >>    add     Q1, r1
>       clr     Q1      ; No longer need Q1; re-use register for Q3
>       adc     Q2, Q1  ; Propagate carry to Q2
>       ;adc    Q1, Q1  ; (Omit: no carry possible due to input range)
> >>    mul     Rem, Khi
>       add     Q2, r0  ; Now byte 2 (hlo, Q2) of 32-bit product
>       adc     Q1, r1  ; Now msbyte (hhi, Q3) of 32-bit product
> >>
> >>    ; We now have the high 12 bits of the 28-bit product in Q1:Q2.
> >>    ; Shift down by 4 bits
> >>    andi    Q2, 0xf0
> >>    or      Q1, Q2
> >>    swap    Q1
> >>    ;; We now have the new quotient in "Q1".
> >>    st      Z, Q1

_______________________________________________
AVR-libc-dev mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/avr-libc-dev
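The reciprocal-multiply fragment quoted above is easy to sanity-check
on a PC.  Below is a hedged C model: the constant K = 10486
(= ceil(2^20 / 100)) and the input bound are my assumptions for
illustration only — the actual Khi:Klo values are not quoted in this
thread — but it shows how a divide-by-100 quotient drops out of the
top bits of the product:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reciprocal constant: ceil(2^20 / 100) = 10486 = 0x28F6,
 * i.e. Khi = 0x28, Klo = 0xF6.  (Assumed for illustration; the real
 * Khi:Klo from the patch is not shown in the thread.) */
#define K 10486u

/* C model of the asm fragment: accumulate the four 8x8 partial
 * products of (Rem:Num) * (Khi:Klo), keep the high bytes of the
 * 32-bit product, then shift down 4 more bits (the andi/or/swap
 * sequence in the asm). */
static uint8_t quotient_by_100(uint16_t rem_num)
{
    uint32_t prod = (uint32_t)rem_num * K;
    return (uint8_t)(prod >> 20);   /* high bytes, then >> 4 */
}
```

Since K/2^20 slightly overestimates 1/100 (by about 2.3e-7), the
result matches the true quotient for all inputs up to 25599, which
covers a remainder byte < 100 above a value byte — the same kind of
input-range bound the asm's "no carry possible" comment relies on.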
> I find all this fascinating, but I'm really not the one to be
> commenting on what the best approach here is.  I will say, however,
> that in many of my applications, I would be more likely to choose a
> speed increase over reduced memory.  I tend to live with mostly
> compute-bound, control applications though.

Thank you very much, this is very useful!  Even if it's just one
person's experience, at least it's *a* data point.

May I ask, what do you think of my RAM-for-ROM tradeoff idea?  Is
there one or the other that you more commonly run out of?

> I've not had enough time to look at the code to make fine-grained
> suggestions (and I'm out of practice with AVR ASM), but I've not
> felt that the comments were too bad.  I've been able to follow the
> code as written easily enough.

Yes, my original algorithm was excessively tricky.  The last (and
fastest) one is actually a lot simpler.
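For readers skimming the thread, the "base-100" conversion under
discussion is simple to express in portable C.  This is only an
illustrative sketch (the helper name, buffer layout, and leading-zero
handling are mine, not avr-libc's): peel off two decimal digits per
step, so a 64-bit value costs at most ten divisions by 100 instead of
twenty by 10:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative base-100 conversion (hypothetical helper, not the
 * actual avr-libc code).  buf must hold at least 21 bytes.  Builds
 * the string backwards from the end of the buffer, so no strrev pass
 * is needed; returns a pointer to the first digit. */
static char *u64_to_dec(uint64_t n, char *buf)
{
    char *p = buf + 20;             /* 2^64 - 1 has 20 decimal digits */
    *p = '\0';
    do {
        unsigned r = (unsigned)(n % 100);   /* two digits per division */
        n /= 100;
        *--p = (char)('0' + r % 10);
        *--p = (char)('0' + r / 10);
    } while (n);
    if (*p == '0' && p[1] != '\0')  /* drop pad zero of odd-length numbers */
        p++;
    return p;
}
```

On AVR each `n % 100` / `n /= 100` step would itself be done with the
byte-at-a-time reciprocal multiply discussed above rather than a real
64-bit division, but the digit-pair structure is the same.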