Printing octal (brain dump)

classic Classic list List threaded Threaded
24 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Printing octal (brain dump)

George Spelvin
If anyone wants to flex their asm-hacking skills, there's some draft
code here.  If not, don't panic; this is primarily notes to myself
that someone else might be interested in.

To finish long long support in printf, I need to print 64-bit numbers
(preferably, arbitrary-precision numbers) in octal, and it seems to
be taking as long to write as the decimal code.

Hex is easy to do byte at a time, but octal is messy because 3 doesn't
divide 8.  It's not that it's difficult, per se.  Or that I need to
stress about efficiency (to a first approximation, nobody uses octal,
so it matters even less than usual).

The problem is that messy translates to large code (boo!), and
it's hard to see how large an alternative will be without writing
it in detail.

The old code just masked and printed the low-order bits, did a full-width
shift by 1 bit, repeated it 3x, and continued as long as the result
wasn't zero.  It did that with a 12-instruction helper routine to shift
the buffer by cnt bits:

.L_lsr_4:
        ldi cnt, 4
.L_lsr:
        lsr v_fifth
        ror v_hhi
        ror v_hlo
        ror v_hi
        ror v_lo
        dec cnt
        brne .L_lsr
  ; tst
        sbiw v_hlo, 0 ; only Z flag is needed
        cpc v_lo, rzero
        cpc v_hi, rzero
        ret

This was shared between the hex/octal print code and the decimal print
code, so the rest of the hex/octal print code was only 20 instructions
(+4 to branch to it depending on "base" on function entry).

The equivalent for a variable-length memory buffer is rather more awkward,
14 instructions which *aren't* shared with the decimal print code:

.L_lsr_4:
        ldi cnt, 4
.L_lsr:
        movw ZL, r24
        sub merge, merge ; Set to to zero and clear carry
        mov tlen, len
1: ld r0, -Z
        ror r0
        st Z, r0
        or merge, r0
        dec tlen
        bne 1b
        dec cnt
        bne .L_lsr
        tst merge
        ret

Worse, the inner loop is 9 cycles to shift 1 byte by 1 bit, and the outer
loop is another 5 per bit.  Shifting an 8-byte value is 9 * 8 + 5 = 77
cycles per bit.

If I have a 64-bit input to print, then that's 77 * 64 = 4928 cycles
just spent shifting!  Given that I can print a *decimal* 64-bit number
in 4103 cycles, that seems Just Wrong.

And even assuming I could keep the rest to 20 instructions, that's a
total of 34.  (It's not obvious I can, because while the shift leaves the
lsbyte in r0 ready for printing, the first digit doesn't have a shift,
so it needs to be handled specially.)


After a lot of messing about, I came up with the following O(n) scheme.
It buffers two registers, keeping 8..15 bits of the input buffered at any
given time.  Shifting is bit at a time, with the buffer refilled every 8
bits, and a digit printed every 3.  (Or 4.  The code is not only shared
with hex, but can handle bases 2, 4 or 32 as well.)

A trick I use repeatedly is bit shifting for loop counters.  Rather than
initialize a counter to "3" and decrement it each iteration, I initialize
a counter to "7" (a value I already have available for digit-extraction
purposes) and shift it right each iteration.

For the bit buffer, I have 8..15 bits of buffered data, and rather than
have a separate counter, mark the end of the valid data with a 1 bit.
So the data starts out like "1abcdefg hijklmno" and shifts down until
it's "00000001 abcdefgh".  At that point, we start the next shift,
but the msbyte becomes zero (with the 1 bit stored in the carry).
That's the signal to fetch another byte.

This reduces the loop overhead in both time and code size.


Here's the current draft.  Suggestions solicited!
Registers:
        len: Length of input (in bytes).  Shared code with
                the decimal print has stripped off leading zero
                bytes.
        X: Points to output buffer
        Z: Points to input buffer
        flags: Input flags distinguishing decimal, octal,
                and upper/lower case hex
        msb, lsb: Current bit buffer.  Most significant set bit of msb
                counts bits per byte.
        mask: A bitmask the size of the current base's digits (7 or 15)
        tmask: A loop counter initialized to "mask" which counts bits per digit
        digit: a temporary used to form the ASCII digit
        delta: The amount to add to digits past '9' to make
                them start at 'a' or 'A'.

.L_octhex:
        ldi mask, 7
        lsr flags ; bit 1 in carry, but 1 in register
        breq 1f
        ldi mask, 15
        ldi delta, 'A'-'0'+10
        brcc 1f
         ldi delta, 'a'-'0'+10
1:
        ldi msb, 1
        ldi tmask, 1
        ld lsb, Z+
        rjmp 2f

.L_bitloop: ; Main loop
        ror lsb
2: lsr tmask ; Entry point, tmask >>= 1
        brne 4f ; if (tmask == 0) {
        ; Spit out a digit of (lsb & mask)
        mov digit, lsb
        and digit, mask
        cpi digit, 10
        brcs 3f
         add digit, delta
3: add digit, '0'
        st X+, digit
        mov tmask, mask
4: ; }
        lsr msb
        brne .L_bitloop ; End of main loop

        ; Load another byte
        ; Note: we know that carry is SET
        dec len ; Load another byte
        breq .L_lastbyte
        ld msb, Z+
        ror msb ; Shift carry=1 into msbit
        rjmp .L_bitloop

.L_lastbyte:
lastbyte:
        adc msb, tmask ; Pad until end of digit (and clear carry)
        inc len
        cpse lsb, __zero_reg__
         rjmp 4b
        movw r24, r26 ; Copy X to return value
        ret

This is 35 instructions in its current state.  I can probably shave
one off the end by sharing the epilogue with the decimal print code,
and I'm thinking of making the "mask" value be what's passed in from
the caller to select the base, which will simplify the prologue.

The main loop has two conditional branches, either of which could be
the loop end.  I could save a cycle (but cost an instruction) in the
main loop by placing both conditions out of line.

The other tricky thing is the "lastbyte" section, invoked when we need
another byte but len == 0.  When we get to the end of the input, we need
to pad with zeros until the significant digits have all been printed.
This involves setting msb to some suitable all-zero, and keeping going
until lsb == 0 as well.

We stop when len == 0 and lsb == 0.  len is decremented *after* using
the byte, so when len is decremented to 0, the last byte has already
been used and we need to load an msbyte of padding.

There are three ways to check for this, and I *think* I've picked the one
with minimum code size (and fastest among those), but let me analyze them
in detail.

In the first two cases, we can pad with a full byte of 8 zero bits,
then the "lastbyte" code is simply "ror msb" (which sets msb to 0x80
and clears the carry in one instruction, since we know we started with
msb = 0 and carry = 1) and "rjmp .L_bitloop".  Oh!  Those instructions
already exist at the end of the "load another byte" branch, so maybe
the savings is more than I thought.

The three ways of checking and the additional costs are:

1) Check each iteration of the main bit loop.  This is slowest, so only
   use it if it has no rivals for smallest.  This can be placed after
   the "ror lsb", adding three more instructions: branch if not equal,
   test len == 0, and branch if equal to epilogue code.
2) Check with each digit printed.  This is identical code to the above,
   but inside the "if (!tmask)" condition to reduce overhead.
   The problem is that we don't get the test of lsb for free any more,
   so I think it's four instructions.  No!  We can do it in three:
        cp lsb, __zero_reg__ $ cpc len,__zero_reg__ $ breq epilogue
   CPC ANDs its result with the previous zero flag.  *If* the zero flag
   was set (the only case we care about) then the carry flag is clear,
   and the cpc will compute the test we want.  If the carry is set,
   the cpc will do the wrong thing, but that doesn't matter because
   zero is already clear.
3) The way I did it, checking with each byte fetched.  We have to
   test len == 0 anyway, so we just have to check lsb == 0.
   But we need four additional instructions:
   - We have to use "adc msb,tmask" to set the msb correctly, so we'll
     fall back into the byte-load check after only one more digit, and
   - We have to "inc len" so it will decrement again properly.
   - And we need two instructions to test lsb == 0

I thought option 3 was smallest because I thought the "rol msb" the
other cases required was the same size as "adc msb.tmask".  I didn't
notice the 2-instruction code sharing.

So now it seems that option 2 is the smallest.  Here it is in 34
instructions:
 
.L_octhex:
        ldi mask, 7
        lsr flags ; bit 1 in carry, but 1 in register
        breq 1f
        ldi mask, 15
        ldi delta, 'A'-'0'+10
        brcc 1f
         ldi delta, 'a'-'0'+10
1:
        ldi msb, 1
        ldi tmask, 1
        ld lsb, Z+
        rjmp 2f

.L_bitloop: ; Main loop
        ror lsb
2: lsr tmask ; Entry point, tmask >>= 1
        brne 4f ; if (tmask == 0) {
        ; Check if we're done
        cp lsb, __zero_reg__
        cpc len, __zero_reg__
        breq .L_epilogue
        ; Spit out a digit of (lsb & mask)
        mov digit, lsb
        and digit, mask
        cpi digit, 10
        brcs 3f
         add digit, delta
3: add digit, '0'
        st X+, digit
        mov tmask, mask
4: ; }
        lsr msb
        brne .L_bitloop ; End of main loop

        ; Load another byte
        ; Note: we know that msb == 0 and carry is SET
        dec len ; Load another byte
        breq 5f
        ld msb, Z+
5: ror msb ; Shift carry=1 into msbit
        rjmp .L_bitloop

.L_epilogue:
        movw r24, r26 ; Copy X to return value
        ret

This is reasonably small (although there is definitely a size hit
relative to the previous code, sigh) and reasonably efficient.  As I said,
the 2-instruction epilogue can be shared with the decimal code, and the
prologue code might be shortened by 3 instructions by choosing the "flags"
values so they can be used directly as "mask".  That would cut it to 29.
(Plus two in the common code to branch to this depending on the flags.)

I have to code it up and test it, but I wanted to write down my thoughts
about it first, and that seemed post-worthy.


_______________________________________________
AVR-libc-dev mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/avr-libc-dev
Reply | Threaded
Open this post in threaded view
|

Working octal code (FYI)

George Spelvin
(Am I annoying everyone by using this mailing list as my personal
coding blog?)

After considerable rearranging (and fixing one nasty logic bug in
the first algorithm posted), I have octal converison working to my
satisfaction.

The logic bug was that I assumed I'd need at most one byte of zero-padding
to print a number.  But I was checking for termination before printing
a digit.  That ended up not working with 1-byte octal numbers where the
top digit is non-zero.  By the time I was ready to print the fourth digit
(when the termination check would fire), the lsbyte wanted to hold bits
9..16, and that meant loading a *second* byte (bits 16..23).

So I changed to checking for termination *after* printing a digit,
which I knew would save time, but I unexpectedly found additional
space savings, too.

Not counting preamble code shared with decimal printing (all the
stuff before the label "3:"), it's down to 29 instructions.  Still
a bit more than 20, but I'm satisfied.

It's even slightly faster than the previous code:

Bits Old New
 0 56 42
 8 144 113
16 276 232
24 364 314
32 496 430
40 546
48 628
56 744
64 860


/* Arguments */
#define out X /* Arrives in r24:r25, but we move it immediately */
#define out_lo r26
#define out_hi r27
#define bin Z /* Arrives in r22:r23, but we move it immediately */
#define bin_lo r30
#define bin_hi r31
#define len r20
#define flags r18 /* Mask, after removing two lsbits */

/* Local variables */
#define msb r25 /* Overlaps input */
#define lsb r24 /* Overlaps input */
#define digit r23 /* Overlaps input */
#define delta r22 /* Overlaps input */
#define tmask r21
// len = r20
#define k r19
// flags = r18

        .text
        .global binprint
        .type binprint, @function
binprint:
        movw out_lo, r24
        movw bin_lo, r22
#if 1
        add bin_lo, len
        adc bin_hi, zero
#else
        mov tmask, len
        ; Conditional negate using the standard identity -x = ~x + 1.
        ; Given mask of -1 or 0, (x ^ mask) - mask returns -x or x.
        ; However, we would need the carry bit clear to start this, and
        ; forming "mask" from the carry bit in one instruction preserves
        ; the carry bit.  So instead add zero with carry.
        lsr flags ; Lsbit is negate flag
        sbc k, k ; Set to 0 or -1, carry preserved
1:
        ld __tmp_reg__, bin
        eor __tmp_reg__, k
        adc __tmp_reg__, __zero_reg__
        st bin+, __tmp_reg__
        dec tmask
        brne 1b
#endif
        ; Strip trailing (most-significant) zeros from bin */
2: dec len
        breq 3f ; If we've reached the end, stop
        ld __tmp_reg__, -bin
        or __tmp_reg__, __tmp_reg__
        breq 2b ; Continue as long as bytes are zero

3: movw bin_lo, r22 ; Reset bin to lsbyte
        ; Len is now pre-decremented

        ; Done with args in r22-r25; now allowed to use delta, digit, lsb, msb
        ldi delta, 'A'-'0'-10
        lsr flags
        brcc 4f
         ldi delta, 'a'-'0'-10
4: ldi msb, 1
        ld lsb, bin+

.L_digit_out: ; Spit out a digit
        mov digit, lsb
        and digit, flags
        cpi digit, 10
        brcs 5f
         add digit, delta ; Hex digit > 9
5: subi digit, -'0'
        st X+, digit
        ; Check for done: is len:lsb < 0:flags?
        cp flags, lsb
        cpc __zero_reg__, len
        brcc .L_epilogue ; if (!lsb && !len) return X
        mov tmask, flags
.L_bitloop:
        lsr msb
        brne 7f ; if ((msb >>= 1) == 0) get another byte
        ; Fetch another byte
        or len, len ; Preserves carry
        breq 6f
         dec len ; Preserves carry
         ld msb, Z+
6: ror msb ; Shift carry=1 into msbit
7: ror lsb
        lsr tmask
        brne .L_bitloop ; if ((tmask >>= 1)== 0) {
        rjmp .L_digit_out
.size binprint, .-binprint

_______________________________________________
AVR-libc-dev mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/avr-libc-dev
Reply | Threaded
Open this post in threaded view
|

Re: Working octal code (FYI)

Dale Whitfield-2
Hi George,

> (Am I annoying everyone by using this mailing list as my personal
> coding blog?)
>

No. But I speak for myself.

Some of us are interested and reading but don't have time to comment or
to work through the code.

Your efforts are appreciated.

> After considerable rearranging (and fixing one nasty logic bug in
> the first algorithm posted), I have octal converison working to my
> satisfaction.
>
> The logic bug was that I assumed I'd need at most one byte of zero-padding
> to print a number.  But I was checking for termination before printing
> a digit.  That ended up not working with 1-byte octal numbers where the
> top digit is non-zero.  By the time I was ready to print the fourth digit
> (when the termination check would fire), the lsbyte wanted to hold bits
> 9..16, and that meant loading a *second* byte (bits 16..23).
>
> So I changed to checking for termination *after* printing a digit,
> which I knew would save time, but I unexpectedly found additional
> space savings, too.
>
> Not counting preamble code shared with decimal printing (all the
> stuff before the label "3:"), it's down to 29 instructions.  Still
> a bit more than 20, but I'm satisfied.
>
> It's even slightly faster than the previous code:
>
> Bits Old New
>  0 56 42
>  8 144 113
> 16 276 232
> 24 364 314
> 32 496 430
> 40 546
> 48 628
> 56 744
> 64 860
>
>
> /* Arguments */
> #define out X /* Arrives in r24:r25, but we move it immediately */
> #define out_lo r26
> #define out_hi r27
> #define bin Z /* Arrives in r22:r23, but we move it immediately */
> #define bin_lo r30
> #define bin_hi r31
> #define len r20
> #define flags r18 /* Mask, after removing two lsbits */
>
> /* Local variables */
> #define msb r25 /* Overlaps input */
> #define lsb r24 /* Overlaps input */
> #define digit r23 /* Overlaps input */
> #define delta r22 /* Overlaps input */
> #define tmask r21
> // len = r20
> #define k r19
> // flags = r18
>
> .text
> .global binprint
> .type binprint, @function
> binprint:
> movw out_lo, r24
> movw bin_lo, r22
> #if 1
> add bin_lo, len
> adc bin_hi, zero
> #else
> mov tmask, len
> ; Conditional negate using the standard identity -x = ~x + 1.
> ; Given mask of -1 or 0, (x ^ mask) - mask returns -x or x.
> ; However, we would need the carry bit clear to start this, and
> ; forming "mask" from the carry bit in one instruction preserves
> ; the carry bit.  So instead add zero with carry.
> lsr flags ; Lsbit is negate flag
> sbc k, k ; Set to 0 or -1, carry preserved
> 1:
> ld __tmp_reg__, bin
> eor __tmp_reg__, k
> adc __tmp_reg__, __zero_reg__
> st bin+, __tmp_reg__
> dec tmask
> brne 1b
> #endif
> ; Strip trailing (most-significant) zeros from bin */
> 2: dec len
> breq 3f ; If we've reached the end, stop
> ld __tmp_reg__, -bin
> or __tmp_reg__, __tmp_reg__
> breq 2b ; Continue as long as bytes are zero
>
> 3: movw bin_lo, r22 ; Reset bin to lsbyte
> ; Len is now pre-decremented
>
> ; Done with args in r22-r25; now allowed to use delta, digit, lsb, msb
> ldi delta, 'A'-'0'-10
> lsr flags
> brcc 4f
> ldi delta, 'a'-'0'-10
> 4: ldi msb, 1
> ld lsb, bin+
>
> .L_digit_out: ; Spit out a digit
> mov digit, lsb
> and digit, flags
> cpi digit, 10
> brcs 5f
> add digit, delta ; Hex digit > 9
> 5: subi digit, -'0'
> st X+, digit
> ; Check for done: is len:lsb < 0:flags?
> cp flags, lsb
> cpc __zero_reg__, len
> brcc .L_epilogue ; if (!lsb && !len) return X
> mov tmask, flags
> .L_bitloop:
> lsr msb
> brne 7f ; if ((msb >>= 1) == 0) get another byte
> ; Fetch another byte
> or len, len ; Preserves carry
> breq 6f
> dec len ; Preserves carry
> ld msb, Z+
> 6: ror msb ; Shift carry=1 into msbit
> 7: ror lsb
> lsr tmask
> brne .L_bitloop ; if ((tmask >>= 1)== 0) {
> rjmp .L_digit_out
> .size binprint, .-binprint
>
> _______________________________________________
> AVR-libc-dev mailing list
> [hidden email]
> https://lists.nongnu.org/mailman/listinfo/avr-libc-dev
>

_______________________________________________
AVR-libc-dev mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/avr-libc-dev
Reply | Threaded
Open this post in threaded view
|

Re: Working octal code (FYI)

David Brown-4
On 16/12/16 09:56, Dale Whitfield wrote:

> Hi George,
>
>> (Am I annoying everyone by using this mailing list as my personal
>> coding blog?)
>>
>
> No. But I speak for myself.
>
> Some of us are interested and reading but don't have time to comment or
> to work through the code.
>
> Your efforts are appreciated.
>

My feelings are much the same.

I would not put in too much effort into making this code fast (except
for the fun of it, of course).  Printing in octal is necessary to make
printf long long support complete.  But it is unlikely to be useful in
practice - people very rarely use octal.  I can think of only 4 use-cases:

1. chmod parameters on *nix.
2. 0 is an octal constant in C, and is often quite handy.
3. "uint8_t month = 09;" is a fine way to get accidental compile-time
errors.
4. "uint16_t year_of_nicaea = 0325;" is a fine way to get accidental
run-time errors.

Still, as long as octal is in the C standards, it is great that you are
doing this work to finish the standard support in printf.



_______________________________________________
AVR-libc-dev mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/avr-libc-dev
Reply | Threaded
Open this post in threaded view
|

Re: Printing octal (brain dump)

George Spelvin
In reply to this post by George Spelvin
tl;dr: Beta code included below; it passes my initial testing.  I named
it _itoa_engine, following "ftoa_engine", but suggestions are welcome.
It is unfortunately one instruction longer than ultoa_invert:

   text   data    bss    dec    hex filename
    188      0      0    188     bc ultoa_invert.o
    190      0      0    190     be itoa_engine.o
    208      0      0    208     d0 itoa_engine.o (+OPTIMIZE_SPEED)
    212      0      0    212     d4 itoa_engine.o (without HAVE_MUL)
    230      0      0    230     e6 itoa_engine.o (+OPTIMIZE_SPEED)

but I'm reasonably satisfied.


Well, it turned out that my great idea from last time had a nasty
bug in it.  There's no really good way to check for termination after
printing each digit.

Recall that my bit buffer only has room for 15 bits of the number, so I
arrange for it to always hold 8 <= n <= 15 bits.  When it's about to fall
from 8 to 7, I fetch another byte and decrement it from 16 to 15 instead.

The problem is that there are two ways to handle the length, and
both of them have problems.

One is to count so len=1 after fetching the last byte of input, and
len=0 after fetching the first byte of padding.

In that case, consider printing one byte of zero in hex.

I start with 8 bits in the buffer (msbyte = 0x01), and len = 1.  I print a
'0', and check if len == 0.  It's not, so I keep going and print a second
zero before realizing that I should stop.


Okay, that doesn't work, so what if I arrange so len counts the bytes
remaining?  len=0 after fetching the last byte, and it holds there
after fetching padding.

In that case, there are no extra digits, but there are missing
digits.  Consider printing 2^15 = 0x8000.  I start out with
len=1 and the last byte in the bit buffer.  I print the final 0, then
have to fetch another byte to start shifting bits into the lsbyte.

So the bit buffer holds 800 (msbyte = 0x18, lsbyte = 0x00, len = 0) when
I print the second 0.  At this point, len == 0 and lsbyte == 0, so
I should stop printing.  But there are more bits in the msbyte.

There are various tricks you can try in hex, but they break down in
octal.  Checking for extra bits in the msbyte makes for messy
(= large) code.


So I went back to my original idea of checing for termination when
fetching padding.  It's a couple of instructions longer, but it works.

Anyway, here's the current state of the function I intend to rewrite
vfprintf() around.  Any comments?  I know about a few mechanical things:
I need to integrate it into avr-libc and #include the appropriate headers
rather than #define things myself.  Likewise X_movw and so on.




/*
 * bum
 *
 * 1. vt. (obs.) To make highly efficient, either in time or space,
 *   often at the expense of clarity.
 *
 * The core integer-to-ASCII code.  This takes as input arbitrary-
 * precision byte arrays, and outputs ASCII in bases 10 or 2^k.
 * (With mostly separate code.)
 *
 * This code is small and fast, but difficult to understand.  It uses
 * tricky number theory (dividing by multiplying by the 2-adic inverse)
 * and low-level bit hacks to implement it.  Sorry about that.
 *
 * The arguments are:
 * char *out:
 * Buffer in which the ASCII digits are placed, in little-endian
 * order.  Must be reversed by the caller for printing.  The caller is
 * also responsible for ensuring that it is big enough.
 * uint8_t *bin:
 * Buffer holding the input value.  This is NOT const; it is used as
 * working storage by the algorithm.  More specifically, negation
 * is done in place if requested, after which decimal conversion
 * leaves the buffer filled with zero, and binary-base conversion
 * leaves the buffer unmodified.
 * uint8_t len:
 * The length of the data at "bin".  Must be between 1 and 255;
 * if you pass in 0, Bad Things will happen.
 * uint8_t flags:
 * A variable whise bits determine the form of the output.
 * flags bit 0: If set, the input is negated before printing.
 * This is a convenience to the caller for printing signed
 * numbers.
 * flags bit 1: If set, digits greater than 10 (a, b, c, ...)
 * are printed in lower case.  If clear, they are printed in
 * upper case.  This has no effect if the base is 10 or less.
 * flags bits 2..7: A contiguous bit mask (of the form (1 << k) - 1)
 * selecting the base to print.  If zero, decimal is output.
 * If non-zero, a mask of the binary base to print (1 for
 * binary, 7 for octal, 15 for hex).  No error-checking
 * is done; if you pass in something else, you'll get
 * peculiar output.
 *
 * The return value is a pointer to just past the end of the output.
 *
 * An overview of the code is as follows:
 * - The conditional negate is fairly straightforward.  It's convenient
 *   to do here because it leaves the pointer at the msbyte for the
 *   next step.
 * - Then we strip off most significant zero bytes until we get to a
 *   non-zero byte or only one byte is left.  (The latter ensures we
 *   print "0" for zero, rather than the null string.)
 * - Then the code splits depending on whether the remaining flags are
 *   zero (decimal) or not.
 *
 * - The decimal code alternates two phases, finding the last digit (the
 *   input mod 5), and then dividing the input by 10.  Each involves a
 *   pass over the input, although conveniently in opposite directions.
 * - Actually, the first pass (msbyte-to-lsbyte) divides the input by 2,
 *   and stores the lsbit.  It also computes the quotient mod 255,
 *   which is then reduced mod 5.  Combining (x % 2) and (x/2 % 5)
 *   gives the digit.
 * - The second pass (lsbyte-to-msbyte) then divides the input by 5.
 *   This is done by subtracting the remainder mod 5 to produce an exact
 *   multiple of 5, then multiplying the result by the 2-adic expansion of
 *   -1/5 (which has a particularly simple form 0x...33333) and negating.
 * - If the second pass produces an msbyte of 0, the length is reduced.
 *   When the length reaches zero, conversion is done.
 *
 * - The binary code does one lsbit-to-msbit pass over the input.
 * - It mostly counts loop iterations with right shifts rather than
 *   counters.  When the shift produces a zero result (and a set carry,
 *   since it must have been non-zero before), the count has expired.
 * - The input value is kept in a 2-byte shift register which holds
 *   8..15 bits of the input, with a 1 bit at the high end delimiting
 *   the valid data.
 * - The main loop shifts the 16-bit buffer right one bit at a time until
 *   it's time to print a digit, or the buffer is about to underflow to
 *   7 bits, meaning it's time to load another byte into the high half.
 * - The "digit" value is a mask of "bits already printed".  We shift
 *   it down until it's zero, then print the digit (lsb & mask), then
 *   reload it with the mask.
 * - When the right shift of the msbyte of the buffer produces zero
 *   (the delimiter bit was just shifted out), we load a new msbyte and
 *   retry the shift.
 * - When we run out of input bytes, there's some tricky code to add
 *   enough padding msbits to ensure remaining bits in the buffer
 *   are printed, for any remaining bits in the buffer to be printed,
 *   stopping when there are no more.
 *
 * The code often sets the carry bit several instructions before using it.
 * When reading, it's important to know that ld, st, mov, inc, dec,
 * and logical operations do not modify the carry bit.
 */

#ifndef __tmp_reg__
# define __tmp_reg__ r0
#endif
#ifndef __zero_reg__
# define __zero_reg__ r1
#endif

//#undef __AVR_HAVE_MUL__
#define __AVR_HAVE_MUL__ 1
#define OPTIMIZE_SPEED 1

/* Arguments */
#define out X /* Arrives in r24:r25, but we move it immediately */
#define out_lo r26
#define out_hi r27
#define bin Z /* Arrives in r22:r23, but we move it immediately */
#define bin_lo r30
#define bin_hi r31
#define len r20
#define flags r18 /* Mask, after removing two lsbits */

/* Local variables.  The first four overlap with arguments. */
#define msb r25 /* 16-bit accumulator */
#define lsb r24
#define rem r23 /* Remainder mod 255, or mod 5 */
#define digit r22 /* Current digit */
#define tlen r21 /* Loop counter, copy of len or flags */
#define k r19 /* Constant value, usually the multiplier 0x33 */

#if __AVR_HAVE_MUL__
#define zero r18 /* Used instead of r1 to free up multiplier */
#else
#define zero __zero_reg__
#endif

        .text
        .global _itoa_engine
        .type _itoa_engine, @function
_itoa_engine:
        movw out_lo, r24
        movw bin_lo, r22

        lsr flags ; Lsbit is negate flag
#if OPTIMIZE_SPEED
        ; 4 more instructions, but avoids a pass over the input (saving
        ; 9*len - 4 cycles) in the common case when no negation is desired.
        brcs 1f
        add bin_lo, len
        adc bin_hi, __zero_reg__
        rjmp 3f
#endif
        ;; Conditional negate, depending on carry.  The standard identity
        ;; -x = ~x + 1 can be extended to do a conditional negate.
        ;; Given a mask of 0 or -1, (x^mask) - mask returns x or -x.
        ;; However, we would need the carry bit clear to start this,
        ;; and forming "mask" from the carry bit in one instruction
        ;; preserves the carry bit.  So instead do the conditional
        ;; increment using add zero with carry.
        ;;
        ;; It turns out there's no faster way to do an unconditional
        ;; negate.  Multi-precision negate is awkward on AVR, which lacks
        ;; negate-with-carry or any form of dest = src-dest instruction.
        ;; So it's two instructions minimum (mov+sbc or invert+adc),
        ;; and either way we need one instruction of setup, to clear
        ;; the carry or create a constant of 0xff.  The latter because
        ;; the COM instruction overwrites carry, and there's no EORI,
        ;; so we need to load a constant 0xff into a register.

1: mov tlen, len
        sbc k, k ; Set to 0 or -1, carry preserved

2: ld __tmp_reg__, bin
        eor __tmp_reg__, k
        adc __tmp_reg__, __zero_reg__
        st bin+, __tmp_reg__
        dec tlen
        brne 2b

        ; Strip redundant trailing (most-significant) zeros from bin.
3: ld __tmp_reg__, -bin
        or __tmp_reg__, __tmp_reg__
        brne 4f
        dec len
        brne 3b
        inc len

        ; Split into two branches based on msbits of flag.
4: lsr flags ; lower/upper case flag to carry
        brne .L_binary
        ;; The longest form of the decimal code (OPTIMIZE_SPEED && !HAVE_MUL)
        ;; is exactly 63 instructions (126 bytes) long, which is the extreme
        ;; limit of this conditional branch.  If it gets any longer, you'll
        ;; have add an instruction.  I suggest swapping the cases and dealing
        ;; with the fact that the .L_epilogue branch is now out of range by
        ;; placing it at the end of the binary code and having the decimal
        ;; code end in a rjmp to it.

.L_decimal:
        adiw bin_lo, 1
#if __AVR_HAVE_MUL__
        clr zero
        ldi k, 0x33
#endif

        ; The main loop, repeated while len > 0
.L_decimal_loop:
#if OPTIMIZE_SPEED
        ldi digit, '0'+10 ; Adjust for later offset on rem
#else
        ldi digit, '0' ; Sneakily preload the ASCII offset
#endif
        ser rem ; Sum of all bytes, mod 255
        mov tlen, len

        ;; Pass 1, msb-to-lsb: finding the input mod 10.
        ;;
        ;; We do two things here: divide by 2 (saving the lsbit), and sum
        ;; the result mod 255. This is then used to compute the result
        ;; mod 5, which combined with the digit gives the decimal digit
        ;; we want.  The awkward thing is that we want *two* carry bits
        ;; in this loop (one for the shift and the other for summing the
        ;; bytes), so we use the lsbit of digit for one of them.
.L_mod10:
        ld __tmp_reg__, -bin
        lsr digit ; Lsbit to carry bit
        ror __tmp_reg__
        st bin, __tmp_reg__
        rol digit ; Carry bit to lsbit
        add rem, __tmp_reg__
        adc rem, zero ; End-around carry for mod 255

        dec tlen
        brne .L_mod10

#if OPTIMIZE_SPEED
        ; Reduce rem mod 15  (from 1 <= rem <= 255 to 1 <= rem <= 15)
        mov __tmp_reg__, rem
        swap __tmp_reg__
        andi rem, 0xf0
        add rem, __tmp_reg__ ; Add high halves to get carry bit
        andi rem, 0xf0
        swap rem
        adc rem, zero ; End-around carry
        ; Reduce rem mod 5 (average of 2.2 subtracts)
0: subi rem, 5
        brcc 0b
        ; We are left with rem-5, which can be compensated for elsewhere.
#else
        ;; Reduce rem mod 35, using a loop.
        ;;
        ;; This is a bit slower (22.82 cycles average to the end of the
        ;; mod-5 loop, vs. 12.6 cycles using the above), but saves 5
        ;; instructions.  Any multiple of 5 will work, but 35 is optimal.
0: subi rem, 35
        brcc 0b
        ; Now add back mod 5 until we go positive again.
0: subi rem, -5
        brcs 0b
#endif

        ; Form and store ASCII digit (2*rem + digit)
        add digit, rem
        add digit, rem
        st out+, digit

        ;; Pass 2, lsb-to-msb: dividing by 5.
        ;;
        ;; Rather than do a general divide by 5, we can subtract rem
        ;; to produce a multiple of 5, and then do an exact division by
        ;; multiplying by the 2-adic expansion of 1/5, 0xCCC...CCD.
        ;;
        ;; To get this into an even simpler form, we multiply by -1/5 =
        ;; 0x333...333 and negate.  Each byte is multiplied by 0x33 and
        ;; added to an accumulator to be used for each higher byte.
        ;;
        ;; The accumulator has to be 16 bits wide, but after storing
        ;; each output byte, we can fold the msbyte into the lsbyte.
        ;;
        ;; Negating the output can be "complement and add one", but
        ;; we do it as "subtract one and complement", initializing the
        ;; accumulator to 0xff, then complementing before storing.
        ;;
        ;; To subtract rem without an separate carry propagation pass,
        ;; subtract 0x33 times rem from the accumulator to start.
        ;; (Since 0 <= rem <= 4, this is very easy.)

        ; msb:lsb = 255 - (rem * 0x33).  Guaranteed to fit into 8 bits.
#if !OPTIMIZE_SPEED
        ldi lsb, 255
#elif __AVR_HAVE_MUL__
        ldi lsb, 0 ; Adjust for pre-decrement of rem by 5
#else
        ldi lsb, 45 ; Adjust for pre-decrement of rem by 5
#endif
#if __AVR_HAVE_MUL__
        mul rem, k
        sub lsb, r0
#else
        ; Multiply by 0x11
        mov __tmp_reg__, rem
        swap __tmp_reg__ ; rem < 16, so this is rem << 4
        add __tmp_reg__, rem
        ; Multiply by 3
        sub lsb, __tmp_reg__
        sub lsb, __tmp_reg__
        sub lsb, __tmp_reg__
#endif
        clr msb

        ; Here is the actual divide loop
        mov tlen, len
.L_divide:
        ld __tmp_reg__, bin
        ; acc += 0x33 * r0.  This product could be up to 14 bits long.
#if __AVR_HAVE_MUL__
        mul __tmp_reg__, k
        add lsb, r0
        adc msb, r1
#else
        ; let rem:digit = __tmp_reg__ << 4 = __tmp_reg__ * 0x10
        mov digit, __tmp_reg__
        swap digit
        mov rem, digit
        andi rem, 15 ; Mask off high 4 bits
        eor digit, rem ; Mask off low 4 bits

        ; Add __tmp_reg__ to get __tmp_reg__ * 0x11
        add digit, __tmp_reg__
        adc rem, zero

        ; Now add it to the accumulator 3 times (there is no faster way)
        add lsb, digit
        adc msb, rem
        add lsb, digit
        adc msb, rem
        add lsb, digit
        adc msb, rem
#endif
        ; Store the complemented accumulator (*bin++ = ~lsb)
        mov __tmp_reg__, lsb
        com __tmp_reg__
        st bin+, __tmp_reg__

        ; Fold the accumulator: msb:lsb = msb + lsb
        add lsb, msb
        clr msb
        adc msb, zero

        dec tlen
        brne .L_divide

        ;; End of main loop: check if the new msbyte was zero.
        ;; If so, drop it (decrement len), and stop when len == 0.
        cpse r0, zero
         rjmp .L_decimal_loop
        sbiw bin_lo, 1
        dec len
        brne .L_decimal_loop

#if __AVR_HAVE_MUL__
        clr __zero_reg__
#endif
        ; Finally copy the out pointer to r24 and return it.
.L_epilogue:
        movw r24, out_lo
        ret


.L_binary:
        movw bin_lo, r22 ; Reset bin to lsbyte
        ; Done with args in r22-r25; now allowed to use digit, rem, lsb, msb

        ; Load startup values.
        ldi rem, 'A'-'0'-10 ; ASCII adjustment for hex digits past 9
        brcc 0f
         ldi rem, 'a'-'0'-10
0: ldi msb, 1
        ld lsb, bin+

.L_digit_out: ; Spit out a digit
        mov digit, flags
        and digit, lsb
        cpi digit, 10
        brcs 0f
         add digit, rem ; Hex digit > 9
0: subi digit, -'0'
        st X+, digit
        mov digit, flags
.L_bitloop:
        lsr msb
        brne 6f
         ; msb is empty; fetch another byte
         dec len ; Preserves carry
         breq 7f ; No more input
         ld msb, Z+
5: ror msb ; Shift carry=1 into msbit
6: ror lsb
        lsr digit
        brne .L_bitloop ; if ((tlen >>= 1)== 0) {
        rjmp .L_digit_out

        ;; We've run out of bytes.  If all 1 bits in lsb have been printed
        ;; (i.e. lsb <= digit), stop.  If not, pad with enough zeros to print
        ;; one more digit, and so we'll check again after printing it.
7: cp digit, lsb ; if (lsb <= digit) return X
        brcc .L_epilogue
        inc len
        adc msb, digit ; msb = digit + 1, and clear carry
        rjmp 5b
        .size _itoa_engine, .-_itoa_engine

_______________________________________________
AVR-libc-dev mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/avr-libc-dev
Reply | Threaded
Open this post in threaded view
|

Re: Printing octal (brain dump)

Georg-Johann Lay-2
George Spelvin schrieb:
> tl;dr: Beta code included below; it passes my initial testing.  I named
> it _itoa_engine, following "ftoa_engine", but suggestions are welcome.
> It is unfortunately one instruction longer than ultoa_invert:
>
>    text   data    bss    dec    hex filename
>     188      0      0    188     bc ultoa_invert.o
>     190      0      0    190     be itoa_engine.o
>     208      0      0    208     d0 itoa_engine.o (+OPTIMIZE_SPEED)
>     212      0      0    212     d4 itoa_engine.o (without HAVE_MUL)

hmmm, that's quite some increase in code size...

IMO octal conversion is so remote from any practical purposes that we
should use an algorithm that works for all radices.

This can be done in less than 70 bytes (variable length / radix
algorithm that operates on memory, no need for MUL, needs strrev.

So all is boiling down to the question what execution time is
acceptable or not:

Is 8000 ticks too slow?

Is 3000 ticks acceptable? And for what reason? Are 3000 acceptable just
because we have an algorithm that performs in 3000 ticks?

My strong preference is still to have a one-fits-all algorithm that
might very well be slower than an optimal one.  But hey, an ordinary
division of a 64-bit value by 10 already costs 2300 cycles, so why
should we hunt cycles just for printf...?

Also I think it's a good idea to have conversion routines like
[u]lltoa, so we need generic conversions anyway.

The current __ultoa_invert is quite verbose and also adds a noticeable
amount of code.  It performs faster than ultoa from stdlib.h, but is
this essential?

I played around some more; attached is a version that operates on
memory, variable length and radices, and doesn't need MUL. It's
called mem_toa and needs < 70 bytes / 8000 ticks for decimal.
The signed version adds yet another 40 bytes:

#define __zero_reg__ r1
#define __tmp_reg__ r0

.macro DEFUN name
.global \name
.func \name
\name:
.endm

.macro ENDF name
.size \name, .-\name
.endfunc
.endm

#if defined (__AVR_HAVE_JMP_CALL__)
#define XCALL call
#define XJMP  jmp
#else
#define XCALL rcall
#define XJMP  rjmp
#endif

.macro wmov  r_dest, r_src
#if defined (__AVR_HAVE_MOVW__)
     movw \r_dest,   \r_src
#else
     mov \r_dest,    \r_src
     mov \r_dest+1,  \r_src+1
#endif
.endm

;; End preamble

#define pNum    26
#define pBuf    30

#define nBytes  20
#define Radix   18

#define Num     19
#define Count   21
#define nBits   22
#define Rem     23

;; extern void mem_toa (char *pBuf, void *pNum, uint8_t nBytes, uint8_t
Radix);
.section .text.mem_toa,"ax",@progbits

DEFUN mem_toa
     wmov    pBuf, 24
     wmov    pNum, 22

.LnextDigit:
     add     pNum, nBytes
     adc     pNum+1, __zero_reg__
     mov     Count, nBytes
     ;; nBytes < 0  <==>  nBytes is unknown
     ldi     nBytes, -1
     clr     Rem

.LnextByte:
     ld      Num, -X
     ldi     nBits, 8

.LnextBit:
     ;; Bit-wise construct (complement of) quotient into Num.
     ;; Bit-wise reconstruct Num into Rem.
     rol     Num
     rol     Rem

     ;; Reducing Rem mod Radix;  C = 0  <==>  reduction applied
     cp      Rem, Radix
     brcs    1f
     sub     Rem, Radix
1:
     dec     nBits
     brne    .LnextBit

     rol     Num
     com     Num
     st      X, Num

     breq    2f
     sbrc    nBytes, 7
     mov     nBytes, Count
2:
     dec     Count
     brne    .LnextByte

     subi    Rem, -'0'
     cpi     Rem, '9' + 1
     brlt    3f
     subi    Rem, '9' + 1 - 'a'
3:
     st      Z+, Rem

     sbrs    nBytes, 7
     rjmp    .LnextDigit

     st      Z, __zero_reg__

     ;; pBuf survived in R25:R24
     XJMP    strrev

ENDF mem_toa

#define Minus1 pBuf

;; extern void mem_toa_signed (char *pBuf, void *pNum, uint8_t nBytes,
uint8_t Radix);
.section .text.mem_toa_signed,"ax",@progbits

DEFUN mem_toa_signed
     wmov    pNum, 22
     add     pNum, nBytes
     adc     pNum+1, __zero_reg__
     ld      Num, -X
     tst     Num
     brpl    9f

     ;; Negate pNum[] according to -N = ~N + 1 = ~N - (-1).
     ;; Carry is clear now because ADC above didn't overflow.
     wmov    pNum, 22
     mov     Count, nBytes
     ldi     Minus1, -1
1:
     ld      Num, X
     ;; Must use EOR for complement because COM clobbers carry.
     eor     Num, Minus1
     sbci    Num, -1
     st      X+, Num
     dec     Count
     brne    1b

     wmov    pBuf, 24
     ldi     Num, '-'
     st      Z, Num
     adiw    24, 1
9:
     XJMP    mem_toa

ENDF mem_toa_signed

#undef Minus1
#undef pNum
#undef pBuf

#undef nBytes
#undef Radix

#undef Rem
#undef Num
#undef Count
#undef nBits



For purely decimal there's a version consuming 90 bytes only,
requiring MUL and which is only 10% slower than yours:

(Using the same preamble as the code above)

.section .text.u64toa_nibbles,"ax",@progbits

#ifdef __AVR_HAVE_MUL__
#define SPEED 1
#else
#define SPEED 0
#endif

#define pBuf    26
#define pNum    30
#define Length  r20

#define Rem     r19
#define Quot    r21
#define Count   r22
#define Num     r23

#if SPEED
#   define Ten     r16
#   define Ox67    r18
#endif

;; extern void u64toa_nibbles (char *pBuf, void *pNum, uint8_t Length)
;; Length in [1, 127]

DEFUN u64toa_nibbles

     wmov    pBuf, 24
     wmov    pNum, 22

#if SPEED
     push    Ten
     ldi     Ten, 10
     ;; For / 10 by multiply we would usually use a value of 0xcd ~
0x800/10,
     ;; however Rem below will be in [0, 0x9f] so that ~ 1/2 of that value
     ;; also works, and that saves us one "lsr r1" per division.
     ldi     Ox67, 0x67
     clr     Count
#endif

.LoopDigits:

     add     pNum, Length
#if SPEED
     ;; Count is 0.
     adc     pNum+1, Count
#else
     adc     pNum+1, __zero_reg__
#endif

     mov     Count, Length
     clt     ; Length unknown
     clr     Rem

.LoopBytes:

     ld      Num, -Z

#if SPEED

     ;; Rem:Num <<= 4
     ;; "andi Rem, 0xf" not needed because Rem < 10.
     eor     Rem, Num
     andi    Num, 0x0f
     eor     Rem, Num
     swap    Rem

     ;; Quot := Rem / 10
     mul     Rem, Ox67
     lsr     r1
     lsr     r1
     mov     Quot, r1

     ;; Rem := Rem - 10 * (Rem / 10) = Rem % 10
     mul     r1, Ten
     sub     Rem, r0

     ;; ...and (mostly) the same again for the low nibble.

     ;; Quot <<= 4
     swap    Quot

     ;; Rem:Num <<= 4
     swap    Rem
     eor     Rem, Num

     ;; Quot += Rem / 10
     mul     Rem, Ox67
     lsr     r1
     lsr     r1
     add     Quot, r1

     brts    0f
     ;; Quot = 0  ==>  R1 = 0, hence we can also skip the MUL + SUB below.
     breq    1f
     ;; Current Length not yet known: Set it according to highest
non-zero byte.
     set
     mov     Length, Count
0:
     ;; Rem := Rem - 10 * (Rem / 10) = Rem % 10
     mul     r1, Ten
     sub     Rem, r0
1:
#else // !SPEED

     clr     Quot

.LoopNibbles:
     ;; Quot <<= 4
     swap    Quot

     ;; Rem:Num <<= 4
     swap    Num
     swap    Rem
     ;; "andi Rem, 0xf0" not needed because Rem was < 10.
     eor     Rem, Num
     andi    Num, 0xf0
     eor     Rem, Num

1:  subi    Quot, -4
     subi    Rem, 40
     brcc    1b
     subi    Rem, -40
     ;; "+1" avoids "dec Quot" after the next loop.
     subi    Quot, 4 + 1

2:  inc     Quot
     subi    Rem, 10
     brcc    2b
     subi    Rem, -10
     ;; dec     Quot

     subi    Count, 0x80
     brmi    .LoopNibbles

     brts    0f
     tst     Quot
     breq    0f
     ;; Current Length not yet known: Set it according to highest
non-zero byte.
     set
     mov     Length, Count
0:
#endif // SPEED

     st      Z, Quot

     dec     Count
     brne    .LoopBytes

     subi    Rem, -'0'
     st      X+, Rem

     ;; Only continue if we determined Length (T=1).
     ;; T=0 means all bytes are zero, hence we are done then.
     brts    .LoopDigits

#if SPEED
     pop     Ten
     clr     __zero_reg__
#endif

     st      X, __zero_reg__

     ;; pBuf survived in R25:R24
     XJMP    strrev
ENDF u64toa_nibbles

#undef Count
#undef pNum
#undef pBuf
#undef Rem
#undef Num
#undef Quot
#undef Ten
#undef Ox67
#undef Length
#undef SPEED





>     230      0      0    230     e6 itoa_engine.o (+OPTIMIZE_SPEED)
>
> but I'm reasonably satisfied.
>
>
> Well, it turned out that my great idea from last time had a nasty
> bug in it.  There's no really good way to check for termination after
> printing each digit.
>
> Recall that my bit buffer only has room for 15 bits of the number, so I
> arrange for it to always hold 8 <= n <= 15 bits.  When it's about to fall
> from 8 to 7, I fetch another byte and decrement it from 16 to 15 instead.
>
> The problem is that there are two ways to handle the length, and
> both of them have problems.
>
> One is to count so len=1 after fetching the last byte of input, and
> len=0 after fetching the first byte of padding.
>
> In that case, consider printing one byte of zero in hex.
>
> I start with 8 bits in the buffer (msbyte = 0x01), and len = 1.  I print a
> '0', and check if len == 0.  It's not, so I keep going and print a second
> zero before realizing that I should stop.
>
>
> Okay, that doesn't work, so what if I arrange so len counts the bytes
> remaining?  len=0 after fetching the last byte, and it holds there
> after fetching padding.
>
> In that case, there are no extra digits, but there are missing
> digits.  Consider printing 2^15 = 0x8000.  I start out with
> len=1 and the last byte in the bit buffer.  I print the final 0, then
> have to fetch another byte to start shifting bits into the lsbyte.
>
> So the bit buffer holds 800 (msbyte = 0x18, lsbyte = 0x00, len = 0) when
> I print the second 0.  At this point, len == 0 and lsbyte == 0, so
> I should stop printing.  But there are more bits in the msbyte.
>
> There are various tricks you can try in hex, but they break down in
> octal.  Checking for extra bits in the msbyte makes for messy
> (= large) code.
>
>
> So I went back to my original idea of checing for termination when
> fetching padding.  It's a couple of instructions longer, but it works.
>
> Anyway, here's the current state of the function I intend to rewrite
> vfprintf() around.  Any comments?  I know about a few mechanical things:
> I need to integrate it into avr-libc and #include the appropriate headers
> rather than #define things myself.  Likewise X_movw and so on.
>
>
> [...]
>
> or __tmp_reg__, __tmp_reg__
There's the shortcut "TST Reg, Reg" to test for sign and zero.  It's
actually encoded as "AND Reg, Reg" so for any practical purposes it has
the same effect (and purpose) as "OR Reg, Reg".

Johann

#define __zero_reg__ r1
#define __tmp_reg__ r0

.macro DEFUN name
.global \name
.func \name
\name:
.endm

.macro ENDF name
.size \name, .-\name
.endfunc
.endm

#if defined (__AVR_HAVE_JMP_CALL__)
#define XCALL call
#define XJMP  jmp
#else
#define XCALL rcall
#define XJMP  rjmp
#endif

.macro wmov  r_dest, r_src
#if defined (__AVR_HAVE_MOVW__)
    movw \r_dest,   \r_src
#else
    mov \r_dest,    \r_src
    mov \r_dest+1,  \r_src+1
#endif
.endm

.text

.section .text.u64toa_nibbles,"ax",@progbits

#ifdef __AVR_HAVE_MUL__
#define SPEED 1
#else
#define SPEED 0
#endif

#define pBuf    26
#define pNum    30
#define Length  r20

#define Rem     r19
#define Quot    r21
#define Count   r22
#define Num     r23

#if SPEED
#   define Ten     r16
#   define Ox67    r18
#endif

;; extern void u64toa_nibbles (char *pBuf, void *pNum, uint8_t Length)
;; Length in [1, 127]

DEFUN u64toa_nibbles

    wmov    pBuf, 24
    wmov    pNum, 22

#if SPEED
    push    Ten
    ldi     Ten, 10
    ;; For / 10 by multiply we would usually use a value of 0xcd ~ 0x800/10,
    ;; however Rem below will be in [0, 0x9f] so that ~ 1/2 of that value
    ;; also works, and that saves us one "lsr r1" per division.
    ldi     Ox67, 0x67
    clr     Count
#endif

.LoopDigits:

    add     pNum, Length
#if SPEED
    ;; Count is 0.
    adc     pNum+1, Count
#else
    adc     pNum+1, __zero_reg__
#endif

    mov     Count, Length
    clt     ; Length unknown
    clr     Rem

.LoopBytes:

    ld      Num, -Z

#if SPEED

    ;; Rem:Num <<= 4
    ;; "andi Rem, 0xf" not needed because Rem < 10.
    eor     Rem, Num
    andi    Num, 0x0f
    eor     Rem, Num
    swap    Rem

    ;; Quot := Rem / 10
    mul     Rem, Ox67
    lsr     r1
    lsr     r1
    mov     Quot, r1

    ;; Rem := Rem - 10 * (Rem / 10) = Rem % 10
    mul     r1, Ten
    sub     Rem, r0

    ;; ...and (mostly) the same again for the low nibble.

    ;; Quot <<= 4
    swap    Quot

    ;; Rem:Num <<= 4
    swap    Rem
    eor     Rem, Num

    ;; Quot += Rem / 10
    mul     Rem, Ox67
    lsr     r1
    lsr     r1
    add     Quot, r1

    brts    0f
    ;; Quot = 0  ==>  R1 = 0, hence we can also skip the MUL + SUB below.
    breq    1f
    ;; Current Length not yet known: Set it according to highest non-zero byte.
    set
    mov     Length, Count
0:
    ;; Rem := Rem - 10 * (Rem / 10) = Rem % 10
    mul     r1, Ten
    sub     Rem, r0
1:
#else // !SPEED

    clr     Quot

.LoopNibbles:
    ;; Quot <<= 4
    swap    Quot

    ;; Rem:Num <<= 4
    swap    Num
    swap    Rem
    ;; "andi Rem, 0xf0" not needed because Rem was < 10.
    eor     Rem, Num
    andi    Num, 0xf0
    eor     Rem, Num

1:  subi    Quot, -4
    subi    Rem, 40
    brcc    1b
    subi    Rem, -40
    ;; "+1" avoids "dec Quot" after the next loop.
    subi    Quot, 4 + 1

2:  inc     Quot
    subi    Rem, 10
    brcc    2b
    subi    Rem, -10
    ;; dec     Quot

    subi    Count, 0x80
    brmi    .LoopNibbles

    brts    0f
    tst     Quot
    breq    0f
    ;; Current Length not yet known: Set it according to highest non-zero byte.
    set
    mov     Length, Count
0:
#endif // SPEED

    st      Z, Quot

    dec     Count
    brne    .LoopBytes

    subi    Rem, -'0'
    st      X+, Rem

    ;; Only continue if we determined Length (T=1).
    ;; T=0 means all bytes are zero, hence we are done then.
    brts    .LoopDigits

#if SPEED
    pop     Ten
    clr     __zero_reg__
#endif

    st      X, __zero_reg__

    ;; pBuf survived in R25:R24
    XJMP    strrev
ENDF u64toa_nibbles

#undef Count
#undef pNum
#undef pBuf
#undef Rem
#undef Num
#undef Quot
#undef Ten
#undef Ox67
#undef Length
#undef SPEED

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;


#define pNum    26
#define pBuf    30

#define nBytes  20
#define Radix   18

#define Num     19
#define Count   21
#define nBits   22
#define Rem     23

;; extern void mem_toa (char *pBuf, void *pNum, uint8_t nBytes, uint8_t Radix);
.section .text.mem_toa,"ax",@progbits

DEFUN mem_toa
    wmov    pBuf, 24
    wmov    pNum, 22

.LnextDigit:
    add     pNum, nBytes
    adc     pNum+1, __zero_reg__
    mov     Count, nBytes
    ;; nBytes < 0  <==>  nBytes is unknown
    ldi     nBytes, -1
    clr     Rem

.LnextByte:
    ld      Num, -X
    ldi     nBits, 8

.LnextBit:
    ;; Bit-wise construct (complement of) quotient into Num.
    ;; Bit-wise reconstruct Num into Rem.
    rol     Num
    rol     Rem

    ;; Reducing Rem mod Radix;  C = 0  <==>  reduction applied
    cp      Rem, Radix
    brcs    1f
    sub     Rem, Radix
1:
    dec     nBits
    brne    .LnextBit

    rol     Num
    com     Num
    st      X, Num

    breq    2f
    sbrc    nBytes, 7
    mov     nBytes, Count
2:
    dec     Count
    brne    .LnextByte

    subi    Rem, -'0'
    cpi     Rem, '9' + 1
    brlt    3f
    subi    Rem, '9' + 1 - 'a'
3:
    st      Z+, Rem

    sbrs    nBytes, 7
    rjmp    .LnextDigit

    st      Z, __zero_reg__

    ;; pBuf survived in R25:R24
    XJMP    strrev

ENDF mem_toa

#define Minus1 pBuf

;; extern void mem_toa_signed (char *pBuf, void *pNum, uint8_t nBytes, uint8_t Radix);
.section .text.mem_toa_signed,"ax",@progbits

DEFUN mem_toa_signed
    wmov    pNum, 22
    add     pNum, nBytes
    adc     pNum+1, __zero_reg__
    ld      Num, -X
    tst     Num
    brpl    9f

    ;; Negate pNum[] according to -N = ~N + 1 = ~N - (-1).
    ;; Carry is clear now because ADC above didn't overflow.
    wmov    pNum, 22
    mov     Count, nBytes
    ldi     Minus1, -1
1:
    ld      Num, X
    ;; Must use EOR for complement because COM clobbers carry.
    eor     Num, Minus1
    sbci    Num, -1
    st      X+, Num
    dec     Count
    brne    1b

    wmov    pBuf, 24
    ldi     Num, '-'
    st      Z, Num
    adiw    24, 1
9:
    XJMP    mem_toa

ENDF mem_toa_signed

#undef Minus1
#undef pNum
#undef pBuf

#undef nBytes
#undef Radix

#undef Rem
#undef Num
#undef Count
#undef nBits

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

.section .text.put64,"ax",@progbits


#define A0 18
#define A1 A0 + 1
#define A2 A0 + 2
#define A3 A0 + 3
#define A4 A0 + 4
#define A5 A0 + 5
#define A6 A0 + 6
#define A7 A0 + 7

#define BUF0 26
#define BUF1 BUF0 + 1

#define Digit   31

#define Carry   31
#define Ten     30
#define Count   29
#define nonZero 28

;;; extern void put64 (uint64_t A, char *buf);
;;; This function writes up to 21 characters (including final '\0') to buf.

DEFUN put64
    push    r28
    push    r29

    wmov    BUF0, 16
    ldi     Count, 19
    ldi     nonZero, '0'

.Loop:
    ;; A[] -= 1.000.000.000.000.000.000  as often as we can.
    ;; Digit will hold the net number of subtractions (plus '0').
    ldi     Digit, '0'-1
1:
    inc     Digit
    ;;  1.000.000.000.000.000.000 = 0DE0 B6B3 A764 0000
    subi    A2, 0x64
    sbci    A3, 0xa7
    sbci    A4, 0xb3
    sbci    A5, 0xb6
    sbci    A6, 0xe0
    sbci    A7, 0xd
    brcc    1b

    ;; -1.000.000.000.000.000.000 = F21F 494C 589C 0000
    ;; Undo the underflow from above
    subi    A2, 0x9c
    sbci    A3, 0x58
    sbci    A4, 0x4c
    sbci    A5, 0x49
    sbci    A6, 0x1f
    sbci    A7, 0xf2

    ;; As ffff ffff ffff ffff = 18.***.***.***.***.***.***  we can get
    ;; Digit up to 18+'0' in the first round.
    cpi     Digit, '0'+10
    brlo    2f

    ;; If that is the case, split it into 2 digits: a '1' and the
    ;; low part in Digit.
    ldi     nonZero, '1'
    st      X+, nonZero
    subi    Digit, 10
2:
    ;; nonZero "accumulates" digits: it holds a value != '0' if we ever
    ;; output a digit not equal to '0'.
    or      nonZero, Digit
    ;; We only output digits if we saw a digit != '0', i.e. strip leading '0's.
    cpi     nonZero, '0'
    breq    3f
    st      X+, Digit
3:
    ;; A[] *= 10
    ldi     Ten, 10
    clr     Carry
.Lrot:
    mul     A0, Ten
    add     r0, Carry
    mov     Carry, r1
    clr     __zero_reg__
    adc     Carry, __zero_reg__
    ;; A[] >>>= 8
    ;; Rotate 8 times so that we can loop the *= 10 over A[].
    ;; The current value of A0 resides in r0, hence no "mov r0,A0" needed
    ;; to start the rotation.
    mov A0,A1  $  mov A1,A2  $  mov A2,A3  $  mov A3,A4
    mov A4,A5  $  mov A5,A6  $  mov A6,A7  $  mov A7,r0

    ;; Use the upper 3 bits of Count as loop counter for 8 loops.
    subi    Count, -32
    brcs    .Lrot

    dec     Count
    brne    .Loop

    ;; If all digits were '0', we had A[] = 0:  Output one '0'.
    cpi     nonZero, '0'
    brne    4f
    st      X+, nonZero
4:
    ;; String end and epilogue.
    st      X+, __zero_reg__
    pop     r29
    pop     r28
    ret
ENDF put64

.section .text.put64s,"ax",@progbits

;;; extern void put64s (int64_t A, char *buf);
;;; Writes up to 21 characters (including '-' and final '\0') to buf.

DEFUN put64s
#ifdef __AVR_ERRATA_SKIP_JMP_CALL__
    tst     A7
    brmi    0f
#else
    sbrs    A7, 7
#endif
    XJMP    put64
0:
    push    r16
    push    r17
    XCALL   __negdi2            ; Assumes avr-gcc 4.7+
    wmov    BUF0, 16
    ldi     Digit, '-'
    st      X+, Digit
    wmov    16, BUF0
    XCALL   put64
    pop     r17
    pop     r16
    ret
ENDF put64s


#undef A0
#undef A1
#undef A2
#undef A3
#undef A4
#undef A5
#undef A6
#undef A7

#undef Digit
#undef BUF0
#undef Ten
#undef Count
#undef Carry
#undef nonZero

_______________________________________________
AVR-libc-dev mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/avr-libc-dev
Reply | Threaded
Open this post in threaded view
|

Re: Printing octal (brain dump)

George Spelvin
Longer reply coming; I'm benchmarking.

But the tl;dr is that damn it, you may be right and I've wasted several
days of effort. :-(

I can even shave a few cycles (and one instruction) off of your code.
At less than half the size and slightly better than half the speed,
it's hard to argue with.

While the optimal tradeoff depends on the application, and all we can be
sure of is that it's somewhere on the convex hull of the space/time plot,
optimizing time^k * space for some exponent k is usually good, and you're
beating me with k=1.

_______________________________________________
AVR-libc-dev mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/avr-libc-dev
Reply | Threaded
Open this post in threaded view
|

Re: Printing octal (brain dump)

George Spelvin
In reply to this post by Georg-Johann Lay-2
Georg-Johann Lay wrote:
> George Spelvin schrieb:
>> It is unfortunately one instruction longer than ultoa_invert:
>>
>>    text   data    bss    dec    hex filename
>>     188      0      0    188     bc ultoa_invert.o
>>     190      0      0    190     be itoa_engine.o
>>     208      0      0    208     d0 itoa_engine.o (+OPTIMIZE_SPEED)
>>     212      0      0    212     d4 itoa_engine.o (without HAVE_MUL)

> hmmm, that's quite some increase in code size...

Um... just how serious are you being?  I didn't think two bytes was
"quite some increase".  But compared to your alternatives... sigh,
I may have been wasting my time. :-(

> IMO octal conversion is so remote from any practical purposes that we
> should use an algorithm that works for all radices.

Well, I sort of did that.  I did one that works for all power-of-two
sizes, and used it for both hex and octal.

Hex alone can be done much faster, but it isn't bad with the more
generic code.

...but you don't mean limited to powers of 2.  Just plain old binary long
division.  As I said when I started this, it's hard to escape certain
optimization habits.

> This can be done in less than 70 bytes (variable length / radix
> algorithm that operates on memory, no need for MUL, needs strrev.

H'm... let me have a poke at that.

The actual code changes are at the end, but I found 90 cycles
in one place, and 91 cycles + 1 instruction in another.

With those changes, here's a timing table, for values of the from 2^k - 1:

        Decimal Hex mod100
Input mem_toa itoa mem_toa itoa mem_toa
 8 bit 293 217 194  98 194
16 bit 700 451 527 187 440
24 bit 1342 838 1008 276 748
32 bit 2119 1167 1637 365 1142
40 bit 3143 1649 2414 454 1697
48 bit 4290 2127 3339 543 2301
56 bit 5585 2733 4412 632 2991
64 bit 7115 3420 5633 721 3743

That's really attractive for decimal.  Also, for not much extra code,
we could do the long division in base 100 (timing for only this part is
above), and then split that into 2 digits.

For hex, it's a bit painful.

> So all is boiling down to the question what execution time is
> acceptable or not:
>
> Is 8000 ticks too slow?
>
> Is 3000 ticks acceptable? And for what reason? Are 3000 acceptable just
> because we have an algorithm that performs in 3000 ticks?

The answer, of course, is the same as to "Is 190 bytes too big?":
It Depends On The Application.

And as library writers, we don't *know* the application.  For many
applications, an AVR is too slow and they need an ARM or a DSP.

"Best" is somewhere on the convex hull of the space/time plot, but I
can't say where.

I was designing a helper function for printf, and that's a certain size:

   text   data    bss    dec    hex filename
   1648      0      0   1648    670 ./avr/lib/avr4/vfprintf_flt.o
    948      0      0    948    3b4 ./avr/lib/avr4/vfprintf_std.o
    505      0      0    505    1f9 ./avr/lib/avr4/vfprintf_min.o

That, and the initial size of __ultoa_invert, kind of set the size scale
of what I was aiming for.

I have a faster algorithm that uses a 100-byte binary-to-BCD lookup table
to do everything two digits at a time.  Some applications might want that.

> My strong preference is still to have a one-fits-all algorithm that
> might very well be slower than an optimal one.  But hey, an ordinary
> division of a 64-bit value by 10 already costs 2300 cycles, so why
> should we hunt cycles just for printf...?

There is that.  But an intelligent programmer might only use 64 bits
for a final accumulation (of, say, 32x32-bit products), avoiding
even 64-bit multiplies.

> I played around some more; attached is a version that operates on
> memory, variable length and radices, and doesn't need MUL. It's
> called mem_toa and needs < 70 bytes / 8000 ticks for decimal.
> The signed version adds yet another 40 bytes:

I made the following changes:

Since you already have pointers to the beginning and end of the string,
which is what the first half of strrev finds, a slight rearrangement of
that code (which saves space and 1 cycle per byte at the expense of one
extra loop iteration for odd-length inputs) would let you jump to the
middle of it.

Copying that code to the end of this to make my benchmarking
simpler, I get 7296 cycles to print 2^64-1.

One speedup that leaps to mind is that you can negate the radix and
do the trial subtraction using an add.  That will produce the quotient
with the right polarity.  Moves one instruction out of the middle loop
to the setup code.

That saves 90 cycles, taking it to 7206.

Another possibility is some bit-shifting tricks.  It's of limited use here
becaue it messes up your trick of using Num for both the input number and
the output quotient.  But consider the following, with "nBits" renamed to
"Quot".  It takes another cycle out of the second loop, saving
another 91 cycles (7115).

It starts "Quot" out equal to 1, and detect the completion of 8 bit
shifts when that bit ends up in the carry flag.  So "rol Quot" replaces
"dec nBits", and thereby avoids the need for the second "rol Num" just
before the store.

DEFUN mem_toa
        wmov pBuf, 24
        wmov pNum, 22
        neg Radix

.LnextDigit:
        add pNum, nBytes
        adc pNum+1, __zero_reg__
        mov Count, nBytes
        ;; nBytes < 0  <==>  nBytes is unknown
        ldi nBytes, -1
        clr Rem

.LnextByte:
        ld Num, -X
        ldi Quot, 1

.LnextBit:
        ;; Bit-wise construct quotient into Quot.
        ;; Bit-wise reconstruct Num into Rem.
        lsl Num
        rol Rem

        ;; Reducing Rem mod Radix;  C = 1  <==>  reduction applied
        add Rem, Radix
        brcs 1f
        sub Rem, Radix
1:
        rol Quot
        brcc .LnextBit

        st X, Quot

        breq 2f
        sbrc nBytes, 7
        mov nBytes, Count
2:
        dec Count
        brne .LnextByte

        subi Rem, -'0'
        cpi Rem, '9' + 1
        brlt 3f
        subi Rem, '9' + 1 - 'a'
3:
        st Z+, Rem

        sbrs nBytes, 7
        rjmp .LnextDigit

        st Z, __zero_reg__

        ;; pBuf survived in R25:R24
        # XJMP strrev
        wmov pNum, 24
2: ld Num, X
        ld Count, -Z
        st X+, Count
        st Z, Num
        /* Loop while X < Z */
3: cp XL, ZL
        cpc XH, ZH
        brlo 2b

        ret
ENDF mem_toa

_______________________________________________
AVR-libc-dev mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/avr-libc-dev
Reply | Threaded
Open this post in threaded view
|

Re: Printing octal (brain dump)

George Spelvin
In reply to this post by Georg-Johann Lay-2
> Is 8000 ticks too slow?
>
> Is 3000 ticks acceptable? And for what reason? Are 3000 acceptable just
> because we have an algorithm that performs in 3000 ticks?
>
> My strong preference is still to have a one-fits-all algorithm that
> might very well be slower than an optimal one.  But hey, an ordinary
> division of a 64-bit value by 10 already costs 2300 cycles, so why
> should we hunt cycles just for printf...?

Well, I went and asked the customer.

As I mentioned, the motivating application is the TAPR time interval
counter (TICC).
Info:   http://tapr.org/kits_ticc.html
Source: https://github.com/TAPR/TICC                (Not up to date.)
Manual: http://www.tapr.org/~n8ur/TICC_Manual.pdf

Basically, it timestamps input events to sub-nanosecond resolution.
It prints them with picosecond (12 decimal place) resolution.

E.g. fed a 1 Hz input signal, it might print:

104.897999794440
105.897999794492
106.897999794549
107.897999794551
108.897999794553
109.897999794552
110.897999794667

It would like to be able to run until 2^64 picoseconds wrap around in
213 days.

Anyway, although it only prints every input transition, the main
processing loop has a 1 ms schedule to meet (it *can* print at up to
1 kHz, synchronized with the USB polling interval), and of the 16,000
clock cycles available in that ms, 8000 are currently spoken for.
8000 are available for formatting and output device drivers.

So yeah, they'd definitely prefer 4000 cycles to 8000.

But they're going to use custom code *anyway*, since they don't want
to wait for an avr-libc release, so that doesn't have to determine what
avr-libc does.

_______________________________________________
AVR-libc-dev mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/avr-libc-dev
Reply | Threaded
Open this post in threaded view
|

Re: Printing octal (brain dump)

George Spelvin
In reply to this post by George Spelvin
I implemented the idea I had, of using mem_itoa with base 100
and then printing two digits.

Omitting the strrev code to make everything equal, mem_toa is 64 bytes,
and the modified decimal-only version I call mem_tod is 80 bytes.
u64toa_nibbles is 90 bytes.

(That's assuming a ret of the reversed string or a 2-byte rjmp to
strrev; add 2 bytes if using JMP.)

Here's the timing:

        Decimal Hex
Input mem_toa mem_tod itoa nibbles mem_toa itoa
 8 bit 269 220 217 141 194  98
16 bit 664 462 451 321 527 187
24 bit 1294 783 838 608 1008 276
32 bit 2059 1219 1167 948 1637 365
40 bit 3059 1775 1649 1395 2414 454
48 bit 4194 2373 2127 1895 3339 543
56 bit 5477 3069 2733 2459 4412 632
64 bit 6995 3822 3420 3130 5633 721

mem_tod has the advantage that it does no multiplies, so it works fine
on all devices.  The nibbles code is the fastest, and I haven't tried
to improve it yet.

_______________________________________________
AVR-libc-dev mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/avr-libc-dev
Reply | Threaded
Open this post in threaded view
|

mem_tod.S

George Spelvin
Oops, I meant to include this last time.  Note how the mod-10 reduction
works, subtracting 30s until the remainder goes negative, then adding
back 10s until it goes positive again.  That takes an average of 12.2
cycles (averaged over all inputs 00..99).

#define __zero_reg__ r1
#define __tmp_reg__ r0

.macro DEFUN name
.global \name
.func \name
\name:
.endm

.macro ENDF name
.size \name, .-\name
.endfunc
.endm

#if defined (__AVR_HAVE_JMP_CALL__)
#define XCALL call
#define XJMP  jmp
#else
#define XCALL rcall
#define XJMP  rjmp
#endif

.macro wmov  r_dest, r_src
#if defined (__AVR_HAVE_MOVW__)
        movw \r_dest, \r_src
#else
        mov \r_dest, \r_src
        mov \r_dest+1,  \r_src+1
#endif
.endm

;; End preamble

#define pNum 26
#define pBuf 30

#define nBytes  20
#define Radix 18

#define Num 19
#define Count 21
#define Quot 22
#define Rem 23

;; extern void mem_tod (char *pBuf, void *pNum, uint8_t nBytes);
.section .text.mem_tod,"ax",@progbits

DEFUN mem_tod
        wmov pBuf, 24
_mem_tod_aux:
        wmov pNum, 22
        ldi Radix, -100

.LnextDigit:
        add pNum, nBytes
        adc pNum+1, __zero_reg__
        mov Count, nBytes
        ;; nBytes < 0  <==>  nBytes is unknown
        ldi nBytes, -1
        clr Rem

.LnextByte:
        ld Num, -X
        ldi Quot, 1

.LnextBit:
        ;; Bit-wise construct quotient into Quot.
        ;; Bit-wise reconstruct Num into Rem.
        lsl Num
        rol Rem

        ;; Reducing Rem mod Radix;  C=1  <==>  reduction applied
        add Rem, Radix
        brcs 1f
        sub Rem, Radix
1:
        rol Quot
        brcc .LnextBit

        st X, Quot

        breq 2f
        sbrc nBytes, 7
        mov nBytes, Count
2:
        dec Count
        brne .LnextByte

        ;; Now break Rem apart into two base-10 digits
        ;; Use Num, because we know it's zero
3: subi Num, -3
        subi Rem, 30
        brcc 3b
4: dec Num
        subi Rem, -10
        brcs 4b

        subi Rem, -'0'
        st Z+, Rem
        subi Num, -'0'
        st Z+, Num

        sbrs nBytes, 7
        rjmp .LnextDigit

        ;; Chop off redundant trailing zero
        cpi Num, '0'
        brne 5f
        st -Z, __zero_reg__
5: st Z, __zero_reg__

        ;; pBuf survived in R25:R24
#if 0
        # XJMP strrev
        wmov pNum, 24
7: ld Num, X
        ld Count, -Z
        st X+, Count
        st Z, Num
        /* Loop while X < Z */
        cp XL, ZL
        cpc XH, ZH
        brlo 7b
#endif
        ret
ENDF mem_tod

#define Minus1 Num

;; extern void mem_toa_signed (char *pBuf, void *pNum, uint8_t nBytes, uint8_t Radix);
.section .text.mem_toa_signed,"ax",@progbits

DEFUN mem_tod_signed
        wmov pBuf, 24
        wmov pNum, 22
        add pNum, nBytes
        adc pNum+1, __zero_reg__
        ld __tmp_reg__, -X
        cp __tmp_reg__, __zero_reg__
        brpl 9f

        ;; Negate pNum[] according to -N = ~N + 1 = ~N - (-1).
        ;; Carry is clear now because ADC above didn't overflow.
        wmov pNum, 22
        mov Count, nBytes
        ldi Minus1, -1
1:
        ld __tmp_reg__, X
        ;; Must use EOR for complement because COM clobbers carry.
        eor __tmp_reg__, Minus1
        sbc __tmp_reg__, Minus1
        st X+, __tmp_reg__
        dec Count
        brne 1b

        ldi Num, '-'
        st Z+, Num
9:
        XJMP mem_tod_aux

ENDF mem_tod_signed

#undef Minus1
#undef pNum
#undef pBuf

#undef nBytes
#undef Radix

#undef Rem
#undef Num
#undef Count
#undef nBits

_______________________________________________
AVR-libc-dev mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/avr-libc-dev
Reply | Threaded
Open this post in threaded view
|

Re: Printing octal (brain dump)

Georg-Johann Lay-2
In reply to this post by George Spelvin
On 19.12.2016 23:05, George Spelvin wrote:

> Georg-Johann Lay wrote:
>> George Spelvin schrieb:
>>> It is unfortunately one instruction longer than ultoa_invert:
>>>
>>>    text   data    bss    dec    hex filename
>>>     188      0      0    188     bc ultoa_invert.o
>>>     190      0      0    190     be itoa_engine.o
>>>     208      0      0    208     d0 itoa_engine.o (+OPTIMIZE_SPEED)
>>>     212      0      0    212     d4 itoa_engine.o (without HAVE_MUL)
>
>> hmmm, that's quite some increase in code size...
>
> Um... just how serious are you being?  I didn't think two bytes was
> "quite some increase".  But compared to your alternatives... sigh,
> I may have been wasting my time. :-(

oh, I was just misreading this table and thought that it means yet
another 200 bytes atop of ultoa_invert (just demonstrating that
it isn't worse than ultoa_invert).

But it appears you are intending to drop ultoa_invert which is great!

>> IMO octal conversion is so remote from any practical purposes that we
>> should use an algorithm that works for all radices.
>
> Well, I sort of did that.  I did one that works for all power-of-two
> sizes, and used it for both hex and octal.
>
> Hex alone can be done much faster, but it isn't bad with the more
> generic code.
>
> ...but you don't mean limited to powers of 2.  Just plain old binary long
> division.  As I said when I started this, it's hard to escape certain
> optimization habits.
>
>> This can be done in less than 70 bytes (variable length / radix
>> algorithm that operates on memory, no need for MUL, needs strrev.
>
> H'm... let me have a poke at that.
>
> The actual code changes are at the end, but I found 90 cycles
> in one place, and 91 cycles + 1 instruction in another.
>
> With those changes, here's a timing table, for values of the from 2^k - 1:
>
> Decimal Hex mod100
> Input mem_toa itoa mem_toa itoa mem_toa
>  8 bit 293 217 194  98 194
> 16 bit 700 451 527 187 440
> 24 bit 1342 838 1008 276 748
> 32 bit 2119 1167 1637 365 1142
> 40 bit 3143 1649 2414 454 1697
> 48 bit 4290 2127 3339 543 2301
> 56 bit 5585 2733 4412 632 2991
> 64 bit 7115 3420 5633 721 3743

Really funny results (just noticed you are using TABs and therefore
the whole table is rigged ;-))

> That's really attractive for decimal.  Also, for not much extra code,
> we could do the long division in base 100 (timing for only this part is
> above), and then split that into 2 digits.
>
> For hex, it's a bit painful.
>
>> So all is boiling down to the question what execution time is
>> acceptable or not:
>>
>> Is 8000 ticks too slow?
>>
>> Is 3000 ticks acceptable? And for what reason? Are 3000 acceptable just
>> because we have an algorithm that performs in 3000 ticks?
>
> The answer, of course, is the same as to "Is 190 bytes too big?":
> It Depends On The Application.
>
> And as library writers, we don't *know* the application.  For many
> applications, an AVR is too slow and they need an ARM or a DSP.
>
> "Best" is somewhere on the convex hull of the space/time plot, but I
> can't say where.
>
> I was designing a helper function for printf, and that's a certain size:
>
>    text   data    bss    dec    hex filename
>    1648      0      0   1648    670 ./avr/lib/avr4/vfprintf_flt.o
>     948      0      0    948    3b4 ./avr/lib/avr4/vfprintf_std.o
>     505      0      0    505    1f9 ./avr/lib/avr4/vfprintf_min.o
>
> That, and the initial size of __ultoa_invert, kind of set the size scale
> of what I was aiming for.
>
> I have a faster algorithm that uses a 100-byte binary-to-BCD lookup table
> to do everything two digits at a time.  Some applications might want that.
>
>> My strong preference is still to have a one-fits-all algorithm that
>> might very well be slower than an optimal one.  But hey, an ordinary
>> division of a 64-bit value by 10 already costs 2300 cycles, so why
>> should we hunt cycles just for printf...?
>
> There is that.  But an intelligent programmer might only use 64 bits
> for a final accumulation (of, say, 32x32-bit products), avoiding
> even 64-bit multiplies.

Often programmers are bitten by their smartness when they observe
that avr-gcc generates "some" extra instructions around the core
64-bit arithmetic.  But that's a different story...

>> I played around some more; attached is a version that operates on
>> memory, variable length and radices, and doesn't need MUL. It's
>> called mem_toa and needs < 70 bytes / 8000 ticks for decimal.
>> The signed version adds yet another 40 bytes:
>
> I made the following changes:
>
> Since you already have pointers to the beginning and end of the string,
> which is what the first half of strrev finds, a slight rearrangement of
> that code (which saves space and 1 cycle per byte at the expense of one
> extra loop iteration for odd-length inputs) would let you jump to the
> middle of it.
>
> Copying that code to the end of this to make my benchmarking
> simpler, I get 7296 cycles to print 2^64-1.

IMO including the time of strrev doesn't add much to the execution
times; and if strrev is invoked it actually adds to the conversion
times anyway.


> One speedup that leaps to mind is that you can negate the radix and
> do the trial subtraction using an add.  That will produce the quotient
> with the right polarity.  Moves one instruction out of the middle loop
> to the setup code.
>
> That saves 90 cycles, taking it to 7206.

Just a small speed-up, but really cool idea :-)

> Another possibility is some bit-shifting tricks.  It's of limited use here
> because it messes up your trick of using Num for both the input number and
> the output quotient.  But consider the following, with "nBits" renamed to
> "Quot".  It takes another cycle out of the second loop, saving
> another 91 cycles (7115).
>
> It starts "Quot" out equal to 1, and detect the completion of 8 bit
> shifts when that bit ends up in the carry flag.  So "rol Quot" replaces
> "dec nBits", and thereby avoids the need for the second "rol Num" just
> before the store.
>
> DEFUN mem_toa
> wmov pBuf, 24
> wmov pNum, 22
> neg Radix
>
> .LnextDigit:
> add pNum, nBytes
> adc pNum+1, __zero_reg__
> mov Count, nBytes
> ;; nBytes < 0  <==>  nBytes is unknown
> ldi nBytes, -1
> clr Rem
>
> .LnextByte:
> ld Num, -X
> ldi Quot, 1
>
> .LnextBit:
> ;; Bit-wise construct quotient into Quot.
> ;; Bit-wise reconstruct Num into Rem.
> lsl Num
> rol Rem
>
> ;; Reducing Rem mod Radix;  C = 1  <==>  reduction applied
> add Rem, Radix
> brcs 1f
> sub Rem, Radix
> 1:
> rol Quot
> brcc .LnextBit
>
> st X, Quot
>
> breq 2f
> sbrc nBytes, 7
> mov nBytes, Count
> 2:
> dec Count
> brne .LnextByte
>
> subi Rem, -'0'
> cpi Rem, '9' + 1
> brlt 3f
> subi Rem, '9' + 1 - 'a'
> 3:
> st Z+, Rem
>
> sbrs nBytes, 7
> rjmp .LnextDigit
>
> st Z, __zero_reg__
>
> ;; pBuf survived in R25:R24
> # XJMP strrev
> wmov pNum, 24
> 2: ld Num, X
> ld Count, -Z
> st X+, Count
> st Z, Num
> /* Loop while X < Z */
> 3: cp XL, ZL
> cpc XH, ZH
> brlo 2b
>
> ret
> ENDF mem_toa
>


_______________________________________________
AVR-libc-dev mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/avr-libc-dev
Reply | Threaded
Open this post in threaded view
|

Re: Printing octal (brain dump)

Georg-Johann Lay-2
In reply to this post by George Spelvin
On 20.12.2016 00:51, George Spelvin wrote:

>> Is 8000 ticks too slow?
>>
>> Is 3000 ticks acceptable? And for what reason? Are 3000 acceptable just
>> because we have an algorithm that performs in 3000 ticks?
>>
>> My strong preference is still to have a one-fits-all algorithm that
>> might very well be slower than an optimal one.  But hey, an ordinary
>> division of a 64-bit value by 10 already costs 2300 cycles, so why
>> should we hunt cycles just for printf...?
>
> Well, I went and asked the customer.
>
> As I mentioned, the motivating application is the TAPR time interval
> counter (TICC).
> Info:   http://tapr.org/kits_ticc.html
> Source: https://github.com/TAPR/TICC                (Not up to date.)
> Manual: http://www.tapr.org/~n8ur/TICC_Manual.pdf
>
> Basically, it timestamps input events to sub-nanosecond resolution.
> It prints them with picosecond (12 decimal place) resolution.
>
> E.g. fed a 1 Hz input signal, it might print:
>
> 104.897999794440
> 105.897999794492
> 106.897999794549
> 107.897999794551
> 108.897999794553
> 109.897999794552
> 110.897999794667
>
> It would like to be able to run until 2^64 picoseconds wrap around in
> 213 days.

Just out of interest:  Isn't this overflow a problem? Hence isn't a
different representation warranted that doesn't overflow so soon?

> Anyway, although it only prints every input transition, the main
> processing loop has a 1 ms schedule to meet (it *can* print at up to
> 1 kHz, synchronized with the USB polling interval), and of the 16,000
> clock cycles available in that ms, 8000 are currently spoken for.
> 8000 are available for formatting and output device drivers.
>
> So yeah, they'd definitely prefer 4000 cycles to 8000.
>
> But they're going to use custom code *anyway*, since they don't want
> to wait for an avr-libc release, so that doesn't have to determine what
> avr-libc does.

IMO that's a sound approach -- not only because of avr-libc release
schedule but also because you actually have to guarantee some upper
bound for the execution time (of conversion etc.).

What consumes the other 8000 cycles? Arithmetic?

Johann


_______________________________________________
AVR-libc-dev mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/avr-libc-dev
Reply | Threaded
Open this post in threaded view
|

Re: Printing octal (brain dump)

George Spelvin
> What consumes the other 8000 cycles? Arithmetic?

I gave a link to the code (which I haven't read either, so I'm not
blaming you), but I expect it's a combination of dealing with the TDC7200
time-to-digital converter chip (it's an SPI peripheral) and arithmetic.

The chip is a combination of a 10 MHz "coarse" counter clocked by an
external quartz reference, and a 17 GHz ring oscillator.  The latter is
subject to drift, so every input measurement is followed by a calibration
cycle which measures one cycle of the coarse counter.  Then you have to
do some math to translate the ring oscillator cycle count to fractions
of the 10 MHz timebase.

While I expect there's slac in that code which could be taken out,
just knowing that there's a 16000 clock cycle budget for *everything*
suggests that maybe I shouldn't be cavalier about 4000 cycles.

_______________________________________________
AVR-libc-dev mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/avr-libc-dev
Reply | Threaded
Open this post in threaded view
|

Re: Printing octal (brain dump)

George Spelvin
In reply to this post by Georg-Johann Lay-2
Georg-Johann Lay wrote:
> oh, I was just misreading this table and thought that it means yet
> another 200 bytes atop of ultoa_invert (just demonstrating that
> it isn't worse than ultoa_invert).
>
> But it appears you are intending to drop ultoa_invert which is great!

*Whew*!  No, I'm comparing it because it's a *replacement* for
ultoa_invert.o.  I wanted smaller *and* faster *and* arbitrary-length.

I was feeling very deflated by your complaints.

> Really funny results (just noticed you are using TABs and therefore
> the whole table is rigged ;-))

Here is the updated table (which consistently omits the string reverse)
in a form that can take one level of quoting:

Input Decimal Hex
bits mem_toa mem_tod itoa nibbles mem_toa itoa
 8 269 220 217 141 194  98
16 664 462 451 321 527 187
24 1294 783 838 608 1008 276
32 2059 1219 1167 948 1637 365
40 3059 1775 1649 1395 2414 454
48 4194 2373 2127 1895 3339 543
56 5477 3069 2733 2459 4412 632
64 6995 3822 3420 3130 5633 721

For binary bases, yu're seeing here the difference between an O(n)
algorithm (89n+9 cycles) and an O(n^2) one (74*n^2 + 111*n + 9).
I'm really not happy with that.

For decimal, however mem_tod without a multiplier is almost fast as my
decimal code *with* one, and is 80 bytes long.  That's the code I'd like
to use on multiplierless machines.  mem_toa is 64 bytes, but the almost
2x speed difference is worth the 16 bytes, IMHO.

Your u64toa_nibbles code (after I tweaked it a bit) is 90 bytes and the
fastest of all.  There, it's only about 75% the time of the multiplierless
code, so whether the speed is worth it is more of a question.

One thing that justfies it to me is that the enhanced cores tend
to come with more flash, so an additional 10 bytes is affordable.
Also contributing to eh afforadility is that the enhanced cores tend to
generate smaller code overall.

On the other hand, maintaining two completely different code paths is
a bother.  There's a lot to be said for just one.


One alternative I mentioned earlier, which I'm thinking seriously about,
is to reorganize the code into two phases:

1) Convert decimal and octal to little-endian BCD.
   (%x would just find the length.)
2) Print little-endian hex as ASCII.

That would enlarge the code somewhat, but reduce stack usage by 11 bytes.
As I noted, ROM:RAM is usually 16:1 so I could argue that those 16 bytes
of RAM are "worth" 176 bytes of code, which is far more than code size
increase.

Note that mem_tod produces digits two at a time anyway, so it's a
natural fit.  I have some ideas for how to adapt the u64toa_nibbles code,
and octal wouldn't be too hard.

I'd really appreciate your opinion of the idea.

> Often programmers are bitten by their smartness when they observe
> that avr-gcc generates "some" extra instructions around the core
> 64-bit arithmetic.  But that's a different story...

I don't quite know what you're alluding to.

What frustrates me abut avr-gcc is code like time/gm_sidereal.c,
where it's doing a 32x32->64 bit multiply, but if !__AVR_HAVE_MUL__,
then gcc "knows" that there's no __umulsidi3 function, and generates an
absolutely massive spill & fill sequence to call __muldi3, which ends
up being a lot bigger & slower than a call to the __umulsidi3 wrapper
which totally exists.

>> That saves 90 cycles, taking it to 7206.
>
> Just a small speed-up, but really cool idea :-)

You had some awfully cool ideas yourself, particularly integrating the
length-finding into the main loop.

_______________________________________________
AVR-libc-dev mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/avr-libc-dev
Reply | Threaded
Open this post in threaded view
|

Even faster decimal code

George Spelvin
I mentioned last time trying doing something like this, and now
I have it coded up (as utoa_bytes()).

Input Decimal
bits mem_toa mem_tod itoa nibbles bytes
 8 269 220 217 141 143
16 664 462 451 321 273
24 1294 783 838 608 432
32 2059 1219 1167 948 666
40 3059 1775 1649 1395 941
48 4194 2373 2127 1895 1217
56 5477 3069 2733 2459 1551
64 6995 3822 3420 3130 1902

It's larger (120 bytes, vs, 90 for the nibbles code), and has a bit more
startup overhead, but is quite fast.  I also have a 122-byte variant
which saves 1 cycle per pair of digits.  (1895 cycles for 2^64-1.)

I'm interested in this both for speed, and because it's a good fit to my
suggestion of saving RAM space buffering the converted digits in BCD.

So now that we have several good candidates, how to proceed?
What size/speed tradeoff should be the final choice?

Personally, I'm not worried about 30 bytes of code on enhanced AVRs with
a multiplier, and although I haven't coded up the combined algorithm, I
believe the whole thing is smaller than the ultoa_invert code it replaces.

But this is a judgement call, and I'd appreciate some other voices.



#define __zero_reg__ r1
#define __tmp_reg__ r0

.macro DEFUN name
.section .text.\name,"ax",@progbits
.global \name
.func \name
\name:
.endm

.macro ENDF name
.size \name, .-\name
.endfunc
.endm

#if defined (__AVR_HAVE_JMP_CALL__)
#define XCALL call
#define XJMP  jmp
#else
#define XCALL rcall
#define XJMP  rjmp
#endif

.macro wmov  r_dest, r_src
#if defined (__AVR_HAVE_MOVW__)
    movw \r_dest,   \r_src
#else
    mov \r_dest,    \r_src
    mov \r_dest+1,  \r_src+1
#endif
.endm

.text

#define pBuf    26
#define pNum    30
#define Length  r20

#define Rem     r19
#define Quot    r21
#define Count   r22
#define Num     r23

#define Ten     r16
#define Ox67    r18

;; extern void u64toa_nibbles (char *pBuf, void *pNum, uint8_t Length)
;; Length in [1, 127]

DEFUN u64toa_nibbles

    wmov    pBuf, 24
    wmov    pNum, 22

    push    Ten
    ldi     Ten, 10
    ;; For / 10 by multiply we would usually use a value of 0xcd ~ 0x800/10,
    ;; however Rem below will be in [0, 0x9f] so that ~ 1/2 of that value
    ;; also works, and that saves us one "lsr r1" per division.
    ldi     Ox67, 0x67
    clr     Count

.LoopDigits:

    add     pNum, Length
    ;; Count is 0.
    adc     pNum+1, Count

    mov     Count, Length
    ldi    Length, -1
    clr     Rem

.LoopBytes:

    ld      Num, -Z

    ;; Rem:Num <<= 4
    ;; "andi Rem, 0xf" not needed because Rem < 10.
    eor     Rem, Num
    andi    Num, 0x0f
    eor     Rem, Num
    swap    Rem

    ;; Quot := Rem / 10
    mul     Rem, Ox67
    lsr     r1
    lsr     r1
    mov     Quot, r1

    ;; Rem := Rem - 10 * (Rem / 10) = Rem % 10
    mul     r1, Ten
    sub     Rem, r0

    ;; ...and (mostly) the same again for the low nibble.

    ;; Quot <<= 4
    swap    Quot

    ;; Rem:Num <<= 4
    swap    Rem
    eor     Rem, Num

    ;; Quot += Rem / 10
    mul     Rem, Ox67
    lsr     r1
    lsr     r1
    add     Quot, r1

    ;; Quot = 0  ==>  R1 = 0, hence we can also skip the MUL + SUB below.
    breq    1f
    sbrc    Length, 7
     ;; Current Length not yet known: Set it according to highest non-zero byte.
     mov     Length, Count

    ;; Rem := Rem - 10 * (Rem / 10) = Rem % 10
    mul     r1, Ten
    sub     Rem, r0
1:

    st      Z, Quot

    dec     Count
    brne    .LoopBytes

    subi    Rem, -'0'
    st      X+, Rem

    ;; Only continue if we determined Length (Length > 0)
    ;; Length < 0 means all bytes are zero, hence we are done then.
    sbrs    Length, 7
     rjmp    .LoopDigits

    pop     Ten
    clr     __zero_reg__

    st      X, __zero_reg__

    ;; pBuf survived in R25:R24
#if 0
        # XJMP strrev
        wmov ZL, 24
7: ld Count, -X
        ld Num, Z
        st Z+, Count
        st X, Num
        /* Loop while Z < X */
        cp ZL, XL
        cpc ZH, XH
        brlo 7b
#endif
        ret
ENDF u64toa_nibbles

#undef Count
#undef pNum
#undef pBuf
#undef Rem
#undef Num
#undef Quot
#undef Ten
#undef Ox67
#undef Length

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; This works simularly, but in base 100.  Every step, we take a
;; remainder <= 99, multiply by 256, and add the next byte (<=255).
;; The result is in the range 0 <= x < 25600.
;;
;; To divide by 100 with multiply:
;; x *   0x29 >> 12 == x/100 for 0 <= x <  1099
;; x *  0x51f >> 17 == x/100 for 0 <= x <  4699
;; x * 0x147b >> 19 == x/100 for 0 <= x < 43699
;; x * 0x28f6 >> 20 == x/100 for 0 <= x < 43699
;;
;; Using a multiplier one bit larger than strictly necessary gives us
;; a more convenient shift amount.  This is still small enough that we
;; can compute the high bits of the product we need in only two registers.
;;
;; The largest possible x = 0x63ff also has the largest possible
;; lsbyte, so it will give us the largest possible partial products.
;; Multiplying that by 0x28f6 requires four partial products:
;;
;;   FF *   F6 =     F50A
;;   FF * 28-- =   27D8--
;; 63-- *   F6 =   5F22--
;; 63-- * 28-- =  F78----
;;
;; The important thing is that the sum of the first three partial products
;; is 0x87ef0a, a 24-bit number.  So there is no need to propagate
;; carries into the msbyte.  We immediately discard the lsbyte of the
;; first partial product, and sum the others into a 2-byte register.
;; Then we throw away the lsbyte of that register and use it for the msbyte
;; of the final sum.

;; extern void utoa_bytes(char *pBuf, void *pNum, uint8_t Length)
;; Length in [1, 127]
#define ZH r31
#define ZL r30
#define XH r27
#define XL r26

#define Q2 r23 /* Byte 2 of 4-byte product */
#define Q1 r22 /* Byte 1 (and 3) of 4-byte product */
#define Count r21
#define Length r20 /* Argument */
#define Rem r19
#define Hundred r18 /* Constant 0x64 = 100 */
#define Khi r17 /* Constant 0x28 =  40 */
#define Klo r16 /* Constant 0xf6 = 246 */
#define Num r15


DEFUN utoa_bytes
        movw XL, r24 ; pBuf (output) to X
        movw ZL, r22 ; pNum (input) to Z
        ldi Hundred, 100
    push Khi
        ldi Khi, 0x28
    push Klo
        ldi Klo, 0xf6
        push Num
        clr Count

.LoopDigits2:
        add ZL, Length
        adc ZH, Count

        mov Count, Length
        ldi Length, -1
        clr Rem

.LoopBytes2:
        ld Num, -Z

        /* Multiply Rem:Num by Khi:Klo */
        mul Num, Khi
        mov Q1, r0
        mov Q2, r1
        mul Rem, Klo
        add Q1, r0
        adc Q2, r1 ; Cannot overflow
        mul Num, Klo
        add Q1, r1
        clr Q1 ; We only need the carry bit out
        adc Q2, Q1 ; This cannot overflow, either
        mul Rem, Khi
        add Q2, r0
        adc Q1, r1

        ; We now have the high 12 bits of the 28-bit product in Q1:Q2.
        ; Shift down by 4 bits
        andi Q2, 0xf0
        or Q1, Q2
        swap Q1
        ;; We now have the new quotient in "Q1".
        st Z, Q1

        ;; If Q1 == 0, don't set length (and multiply is redundant, too)
        breq    1f
        sbrc    Length, 7 ;; Do we have a length yet?
         mov     Length, Count ;; Remember position of highest non-zero Q

        mul Q1, Hundred
        sub Num, r0 ; New remainder
1:
        mov Rem, Num
        dec Count
        brne .LoopBytes2

        ;; We now have the final base-100 remiander in Rem.
        ;; Break it apart into two base-10 digits in Q1:Rem
        ;; (We could use the multiplier again, but that woud require
        ;; even more registers for the constants and more instructions,
        ;; and this isn't the inner loop.)
        ldi Q1, '0'
3: subi Q1, -3
        subi Rem, 30
        brcc 3b
4: dec Q1
        subi Rem, -10
        brcs 4b

        subi Rem, -'0'
        st X+, Rem
        st X+, Q1

        ;; Only continue if we determined Length (Length > 0)
        ;; Length < 0 means all bytes are zero, hence we are done.
        sbrs Length, 7
         rjmp .LoopDigits2

        ;; Chop off redundant trailing zero
        clr __zero_reg__
        cpi Q1, '0'
        brne 5f
         st -X, __zero_reg__ ; Could sbiw, but this is same speed
5: st X, __zero_reg__

        ; Restore registers
        pop Num
        pop Klo
        pop Khi

        movw r24, r26 ; Return end of output
        ret
ENDF utoa_bytes

#undef Num
#undef Klo
#undef Khi
#undef Hundred
#undef Rem
#undef Length
#undef Count
#undef Q1
#undef Q2
#undef XL
#undef XH
#undef ZL
#undef ZH

_______________________________________________
AVR-libc-dev mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/avr-libc-dev
Reply | Threaded
Open this post in threaded view
|

Re: Even faster decimal code

Georg-Johann Lay-2
George Spelvin schrieb:

> I mentioned last time trying doing something like this, and now
> I have it coded up (as utoa_bytes()).
>
> Input Decimal
> bits mem_toa mem_tod itoa nibbles bytes
>  8 269 220 217 141 143
> 16 664 462 451 321 273
> 24 1294 783 838 608 432
> 32 2059 1219 1167 948 666
> 40 3059 1775 1649 1395 941
> 48 4194 2373 2127 1895 1217
> 56 5477 3069 2733 2459 1551
> 64 6995 3822 3420 3130 1902
>
> It's larger (120 bytes, vs, 90 for the nibbles code), and has a bit more
> startup overhead, but is quite fast.  I also have a 122-byte variant
> which saves 1 cycle per pair of digits.  (1895 cycles for 2^64-1.)
>
> I'm interested in this both for speed, and because it's a good fit to my
> suggestion of saving RAM space buffering the converted digits in BCD.
>
> So now that we have several good candidates, how to proceed?
> What size/speed tradeoff should be the final choice?

After all it's you who will provide the final implementation and
testing, hence the final decision of what's appropriate, how much
effort will be put into the implementation, and what the final code
will look like is your decision, IMO.

> Personally, I'm not worried about 30 bytes of code on enhanced AVRs with
> a multiplier, and although I haven't coded up the combined algorithm, I
> believe the whole thing is smaller than the ultoa_invert code it replaces.

We only have multilib granularity, and there are not so many features
that are related to the flash size.  One is __AVR_HAVE_JMP_CALL__ which
applies to devices with >= 16 KiB flash.  The next size milestone is
__AVR_HAVE_ELPM__ which means >= 128 KiB.  The JMP + CALL looks
reasonable to me; I used it for 64-bit divisions in libgcc (which leads
to the funny situation that 64-bit division might run faster than a
32-bit division for the same values).

> But this is a judgement call, and I'd appreciate some other voices.

For smaller devices (e.g. no CALL but MUL some bytes can be squeezed
out by moving LDI of the constants to the inner loop which saves
PUSH + POP.  But on smaller devices, where xprintf is already a
major code consumer, a programmer might prefer something like ulltoa
over the bloat from xprintf.

>
> #define __zero_reg__ r1
> #define __tmp_reg__ r0
>
> .macro DEFUN name
> .section .text.\name,"ax",@progbits
> .global \name
> .func \name
> \name:
> .endm
>
> .macro ENDF name
> .size \name, .-\name
> .endfunc
> .endm
>
> #if defined (__AVR_HAVE_JMP_CALL__)
> #define XCALL call
> #define XJMP  jmp
> #else
> #define XCALL rcall
> #define XJMP  rjmp
> #endif
>
> .macro wmov  r_dest, r_src
> #if defined (__AVR_HAVE_MOVW__)
>     movw \r_dest,   \r_src
> #else
>     mov \r_dest,    \r_src
>     mov \r_dest+1,  \r_src+1
> #endif
> .endm
>
> .text
>
> [...u64toa_nibbles...]
>
> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
>
> ;; This works simularly, but in base 100.  Every step, we take a
> ;; remainder <= 99, multiply by 256, and add the next byte (<=255).
> ;; The result is in the range 0 <= x < 25600.
> ;;
> ;; To divide by 100 with multiply:
> ;; x *   0x29 >> 12 == x/100 for 0 <= x <  1099
> ;; x *  0x51f >> 17 == x/100 for 0 <= x <  4699
> ;; x * 0x147b >> 19 == x/100 for 0 <= x < 43699
> ;; x * 0x28f6 >> 20 == x/100 for 0 <= x < 43699
> ;;
> ;; Using a multiplier one bit larger than strictly necessary gives us
> ;; a more convenient shift amount.  This is still small enough that we
> ;; can compute the high bits of the product we need in only two registers.
> ;;
> ;; The largest possible x = 0x63ff also has the largest possible
> ;; lsbyte, so it will give us the largest possible partial products.
> ;; Multiplying that by 0x28f6 requires four partial products:
> ;;
> ;;   FF *   F6 =     F50A
> ;;   FF * 28-- =   27D8--
> ;; 63-- *   F6 =   5F22--
> ;; 63-- * 28-- =  F78----
> ;;
> ;; The important thing is that the sum of the first three partial products
> ;; is 0x87ef0a, a 24-bit number.  So there is no need to propagate
> ;; carries into the msbyte.  We immediately discard the lsbyte of the
> ;; first partial product, and sum the others into a 2-byte register.
> ;; Then we throw away the lsbyte of that register and use it for the msbyte
> ;; of the final sum.
>
> ;; extern void utoa_bytes(char *pBuf, void *pNum, uint8_t Length)
> ;; Length in [1, 127]
> #define ZH r31
> #define ZL r30
> #define XH r27
> #define XL r26
>
> #define Q2 r23 /* Byte 2 of 4-byte product */
 > #define Q1 r22 /* Byte 1 (and 3) of 4-byte product */

Maybe it's a bit easier to grasp if

#define Q3    Q1

and then use Q3 where byte #3 is computed (starting with "clr Q1")

> #define Count r21
> #define Length r20 /* Argument */
> #define Rem r19
> #define Hundred r18 /* Constant 0x64 = 100 */
> #define Khi r17 /* Constant 0x28 =  40 */
> #define Klo r16 /* Constant 0xf6 = 246 */
> #define Num r15
>
>
> DEFUN utoa_bytes
> movw XL, r24 ; pBuf (output) to X
> movw ZL, r22 ; pNum (input) to Z
> ldi Hundred, 100
>     push Khi
> ldi Khi, 0x28
>     push Klo
> ldi Klo, 0xf6
> push Num
> clr Count
>
> .LoopDigits2:
> add ZL, Length
> adc ZH, Count
>
> mov Count, Length
> ldi Length, -1
> clr Rem
>
> .LoopBytes2:
> ld Num, -Z
>
> /* Multiply Rem:Num by Khi:Klo */
> mul Num, Khi
> mov Q1, r0
> mov Q2, r1

Can use "wmov Q1, r0"

> mul Rem, Klo
> add Q1, r0
> adc Q2, r1 ; Cannot overflow
> mul Num, Klo
> add Q1, r1
> clr Q1 ; We only need the carry bit out
> adc Q2, Q1 ; This cannot overflow, either
> mul Rem, Khi
> add Q2, r0
> adc Q1, r1
>
> ; We now have the high 12 bits of the 28-bit product in Q1:Q2.
> ; Shift down by 4 bits
> andi Q2, 0xf0
> or Q1, Q2
> swap Q1
> ;; We now have the new quotient in "Q1".
> st Z, Q1
>
> ;; If Q1 == 0, don't set length (and multiply is redundant, too)
> breq    1f
> sbrc    Length, 7 ;; Do we have a length yet?
> mov     Length, Count ;; Remember position of highest non-zero Q
>
> mul Q1, Hundred
> sub Num, r0 ; New remainder
> 1:
> mov Rem, Num
> dec Count
> brne .LoopBytes2
>
> ;; We now have the final base-100 remiander in Rem.
> ;; Break it apart into two base-10 digits in Q1:Rem
> ;; (We could use the multiplier again, but that woud require
> ;; even more registers for the constants and more instructions,
> ;; and this isn't the inner loop.)
> ldi Q1, '0'
> 3: subi Q1, -3
> subi Rem, 30
> brcc 3b
> 4: dec Q1
> subi Rem, -10
> brcs 4b
>
> subi Rem, -'0'
> st X+, Rem
> st X+, Q1
>
> ;; Only continue if we determined Length (Length > 0)
> ;; Length < 0 means all bytes are zero, hence we are done.
> sbrs Length, 7
> rjmp .LoopDigits2
>
> ;; Chop off redundant trailing zero
> clr __zero_reg__
> cpi Q1, '0'
> brne 5f
> st -X, __zero_reg__ ; Could sbiw, but this is same speed
> 5: st X, __zero_reg__
>
> ; Restore registers
> pop Num
> pop Klo
> pop Khi
>
> movw r24, r26 ; Return end of output
> ret
> ENDF utoa_bytes
>
> #undef Num
> #undef Klo
> #undef Khi
> #undef Hundred
> #undef Rem
> #undef Length
> #undef Count
> #undef Q1
> #undef Q2
> #undef XL
> #undef XH
> #undef ZL
> #undef ZH
>


_______________________________________________
AVR-libc-dev mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/avr-libc-dev
Reply | Threaded
Open this post in threaded view
|

Re: Even faster decimal code

George Spelvin
Georg-Johann Lay wrote:
> George Spelvin schrieb:
>> So now that we have several good candidates, how to proceed?
>> What size/speed tradeoff should be the final choice?

> After all it's you who will provide the final implementation and
> testing, hence the final decision of what's appropriate, how much
> effort will be put into the implementation, and what the final code
> will look like is your decision, IMO.

Well, thank you very much, but after your "that's quite some size
increase" e-mail (and showing me better code than I'd been working on
for a couple of weeks), I'm feeling rather less confident.

(And, despite my asking, nobody's expressed any opinion at all about
my "save RAM by using BCD" suggestion.  Brilliant or crackpot?)

> We only have multilib granularity, and there are not so many features
> that are related to the flash size.  One is __AVR_HAVE_JMP_CALL__ which
> applies to devices with >= 16 KiB flash.  The next size milestone is
> __AVR_HAVE_ELPM__ which means >= 128 KiB.  The JMP + CALL looks
> reasonable to me; I used it for 64-bit divisions in libgcc (which leads
> to the funny situation that 64-bit division might run faster than a
> 32-bit division for the same values).

Interesting suggestion.  I could just use the multiplierless base-100 code,
which is smaller and still reasonably fast.

And thank you very much!  I knew that HAVE_JUMP_CALL meant that RJMP/RCALL
range wasn't enough, which means more than 12 bits of PC (2^13 bytes of
flash), but it had gotten lost in the forest of confusion.

I'm befuddled by all of the different architecture options and don't
understand the difference between most of them.  I've been slowly
downloading data sheets for different examples from gcc's list and
looking for differences, but it's a laborious process.  (That document
on avr-tiny started out with me documenting my realization that avr1
was something else.)

For example, does MUL support imply MOVW support?  (I've been assuming
so, but that's an easy edit.)

And what's the relationship between MOVW support and ADIW/SBIW?  Are they
the same feature, or are there processors with one and not the other?


(For aggressive size squeezing, I've realized that a lot of code is wasted
copying pointer return values from X or Z to r24:r25, only to have the
caller copy them right back to use the pointers.  It would be lovely to
tell if there were a way to tell gcc "expect the resturn vaue for this
function in r30:r31".  And, sometimes, "this function preserves r18-r20".)

> For smaller devices (e.g. no CALL but MUL some bytes can be squeezed
> out by moving LDI of the constants to the inner loop which saves
> PUSH + POP.  But on smaller devices, where xprintf is already a
> major code consumer, a programmer might prefer something like ulltoa
> over the bloat from xprintf.

Um...I see how I can swap the Hundred constant around, but Khi/Klo are
both used twice each, so loading them twice would not save as much.
(If I return the end pointer rather than returning r24:r25 to the end
for a call to strrev that's not in the current code, that avoids two
more push/pop anyway.)

The way I have the multiply organized, I have to do the two middle
partial products first, then the low, then the high.  I can swap the
middle ones around, but I can't make both constants' uses adjacent.

>> #define Q2 r23 /* Byte 2 of 4-byte product */
>> #define Q1 r22 /* Byte 1 (and 3) of 4-byte product */

> Maybe it's a bit easier to grasp if
>
> #define Q3    Q1
>
> and then use Q3 where byte #3 is computed (starting with "clr Q1")

I thought about that, but remembering that two names refer to the same
register (and thus you may not rearrange code to overlap their usage)
is also a pain.  I originally called them "Qeven" and "Qodd".

Maybe I can just improve the comments...

>> /* Multiply Rem:Num by Khi:Klo */
>> mul Num, Khi
>> mov Q1, r0
>> mov Q2, r1
>
> Can use "wmov Q1, r0"

Ooh, nice!  I forgot that movw isn't limited to high registers.

Let me try to improve the comments... do you still think this would
be better with Q3?  (It dawned on me that even if the product *isn't*
guaranteed to not overflow, the structure can compute the high half of
a 32-bit product in only two accumulator registers if we add one more
"ADC Q1,Q1".)

>> mul Rem, Klo
>> add Q1, r0
>> adc Q2, r1 ; Cannot overflow
>> mul Num, Klo
>> add Q1, r1
  clr Q1 ; No longer need Q1; re-use register for Q3
  adc Q2, Q1 ; Propagate carry to Q2
        ;adc Q1, Q1 ; (Omit: no carry possible due to input range)
>> mul Rem, Khi
  add Q2, r0 ; Now byte 2 (hlo, Q2) of 32-bit product
  adc Q1, r1 ; Now msbyte (hhi, Q3) of 32-bit product
>>
>> ; We now have the high 12 bits of the 28-bit product in Q1:Q2.
>> ; Shift down by 4 bits
>> andi Q2, 0xf0
>> or Q1, Q2
>> swap Q1
>> ;; We now have the new quotient in "Q1".
>> st Z, Q1

_______________________________________________
AVR-libc-dev mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/avr-libc-dev
Reply | Threaded
Open this post in threaded view
|

Re: Even faster decimal code

Martin McKee
I find all this fascinating, but I'm really not the one to be commenting on
what the best approach here is.  I will say, however, that in many of my
applications, I would be more likely to chose a speed increase over reduced
memory.  I tend to live with mostly compute-bound, control applications
though.

I've not had enough time to look at the code to make fine-grained
suggestions (and I'm out of practice with AVR ASM), but I've not felt that
the comments were too bad.  I've been able to follow the code as written
easily enough.

Cheers,
Martin Jay McKee

On Sat, Dec 24, 2016 at 10:59 AM, George Spelvin <[hidden email]>
wrote:

> Georg-Johann Lay wrote:
> > George Spelvin schrieb:
> >> So now that we have several good candidates, how to proceed?
> >> What size/speed tradeoff should be the final choice?
>
> > After all it's you who will provide the final implementation and
> > testing, hence the final decision of what's appropriate, how much
> > effort will be put into the implementation, and what the final code
> > will look like is your decision, IMO.
>
> Well, thank you very much, but after your "that's quite some size
> increase" e-mail (and showing me better code than I'd been working on
> for a couple of weeks), I'm feeling rather less confident.
>
> (And, despite my asking, nobody's expressed any opinion at all about
> my "save RAM by using BCD" suggestion.  Brilliant or crackpot?)
>
> > We only have multilib granularity, and there are not so many features
> > that are related to the flash size.  One is __AVR_HAVE_JMP_CALL__ which
> > applies to devices with >= 16 KiB flash.  The next size milestone is
> > __AVR_HAVE_ELPM__ which means >= 128 KiB.  The JMP + CALL looks
> > reasonable to me; I used it for 64-bit divisions in libgcc (which leads
> > to the funny situation that 64-bit division might run faster than a
> > 32-bit division for the same values).
>
> Interesting suggestion.  I could just use the multiplierless base-100 code,
> which is smaller and still reasonably fast.
>
> And thank you very much!  I knew that HAVE_JUMP_CALL meant that RJMP/RCALL
> range wasn't enough, which means more than 12 bits of PC (2^13 bytes of
> flash), but it had gotten lost in the forest of confusion.
>
> I'm befuddled by all of the different architecture options and don't
> understand the difference between most of them.  I've been slowly
> downloading data sheets for different examples from gcc's list and
> looking for differences, but it's a laborious process.  (That document
> on avr-tiny started out with me documenting my realization that avr1
> was something else.)
>
> For example, does MUL support imply MOVW support?  (I've been assuming
> so, but that's an easy edit.)
>
> And what's the relationship between MOVW support and ADIW/SBIW?  Are they
> the same feature, or are there processors with one and not the other?
>
>
> (For aggressive size squeezing, I've realized that a lot of code is wasted
> copying pointer return values from X or Z to r24:r25, only to have the
> caller copy them right back to use the pointers.  It would be lovely to
> tell if there were a way to tell gcc "expect the resturn vaue for this
> function in r30:r31".  And, sometimes, "this function preserves r18-r20".)
>
> > For smaller devices (e.g. no CALL but MUL some bytes can be squeezed
> > out by moving LDI of the constants to the inner loop which saves
> > PUSH + POP.  But on smaller devices, where xprintf is already a
> > major code consumer, a programmer might prefer something like ulltoa
> > over the bloat from xprintf.
>
> Um...I see how I can swap the Hundred constant around, but Khi/Klo are
> both used twice each, so loading them twice would not save as much.
> (If I return the end pointer rather than returning r24:r25 to the end
> for a call to strrev that's not in the current code, that avoids two
> more push/pop anyway.)
>
> The way I have the multiply organized, I have to do the two middle
> partial products first, then the low, then the high.  I can swap the
> middle ones around, but I can't make both constants' uses adjacent.
>
> >> #define Q2   r23     /* Byte 2 of 4-byte product */
> >> #define Q1   r22     /* Byte 1 (and 3) of 4-byte product */
>
> > Maybe it's a bit easier to grasp if
> >
> > #define Q3    Q1
> >
> > and then use Q3 where byte #3 is computed (starting with "clr Q1")
>
> I thought about that, but remembering that two names refer to the same
> register (and thus you may not rearrange code to overlap their usage)
> is also a pain.  I originally called them "Qeven" and "Qodd".
>
> Maybe I can just improve the comments...
>
> >>      /* Multiply Rem:Num by Khi:Klo */
> >>      mul     Num, Khi
> >>      mov     Q1, r0
> >>      mov     Q2, r1
> >
> > Can use "wmov Q1, r0"
>
> Ooh, nice!  I forgot that movw isn't limited to high registers.
>
> Let me try to improve the comments... do you still think this would
> be better with Q3?  (It dawned on me that even if the product *isn't*
> guaranteed to not overflow, the structure can compute the high half of
> a 32-bit product in only two accumulator registers if we add one more
> "ADC Q1,Q1".)
>
> >>      mul     Rem, Klo
> >>      add     Q1, r0
> >>      adc     Q2, r1          ; Cannot overflow
> >>      mul     Num, Klo
> >>      add     Q1, r1
>         clr     Q1              ; No longer need Q1; re-use register for Q3
>         adc     Q2, Q1          ; Propagate carry to Q2
>         ;adc    Q1, Q1          ; (Omit: no carry possible due to input
> range)
> >>      mul     Rem, Khi
>         add     Q2, r0          ; Now byte 2 (hlo, Q2) of 32-bit product
>         adc     Q1, r1          ; Now msbyte (hhi, Q3) of 32-bit product
> >>
> >>      ; We now have the high 12 bits of the 28-bit product in Q1:Q2.
> >>      ; Shift down by 4 bits
> >>      andi    Q2, 0xf0
> >>      or      Q1, Q2
> >>      swap    Q1
> >>      ;; We now have the new quotient in "Q1".
> >>      st      Z, Q1
>
> _______________________________________________
> AVR-libc-dev mailing list
> [hidden email]
> https://lists.nongnu.org/mailman/listinfo/avr-libc-dev
>
_______________________________________________
AVR-libc-dev mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/avr-libc-dev
Reply | Threaded
Open this post in threaded view
|

Re: Even faster decimal code

George Spelvin
> I find all this fascinating, but I'm really not the one to be commenting on
> what the best approach here is.  I will say, however, that in many of my
> applications, I would be more likely to chose a speed increase over reduced
> memory.  I tend to live with mostly compute-bound, control applications
> though.

Thank you very much, this is very useful!  Even if it's just one person's
experience, at least it's *a* data point.

May I ask, whay do you think of my RAM-for-ROM tradeoff idea?
Is there one or the other that you more commonly run out of?

> I've not had enough time to look at the code to make fine-grained
> suggestions (and I'm out of practice with AVR ASM), but I've not felt that
> the comments were too bad.  I've been able to follow the code as written
> easily enough.

Yes, my original algorithm was excessively tricky.  The last (and
fastest) one is actually a lot simpler.

_______________________________________________
AVR-libc-dev mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/avr-libc-dev
12