- COMP.ARCH ----------------------------------------------------------------------

No. : 7 of 100

From : robf...@gmail.com 2:5075/128 23 Sep 23 12:12:27

To : MitchAlsup 23 Sep 23 22:16:03

Subject : Re: Solving the Floating-Point Conundrum

----------------------------------------------------------------------------------

@MSGID:
<c2f2f9ca-0789-48b5-9047-024f69e2116cn@googlegroups.com> 9fdb66b3
@REPLY:
<f2fd635d-71e6-4757-877a-5bedb276afc0n@googlegroups.com> d5e824fe
@REPLYADDR robf...@gmail.com <robfi680@gmail.com>
@REPLYTO 2:5075/128 robf...@gmail.com
@CHRS: CP866 2
@RFC: 1 0
@RFC-References:
<57c5e077-ac71-486c-8afa-edd6802cf6b1n@googlegroups.com> <8a5563da-3be8-40f7-bfb9-39eb5e889c8an@googlegroups.com>
<f097448b-e691-424b-b121-eab931c61d87n@googlegroups.com> 1@newsreader4.netcologne.de> 1@gal.iecc.com>
<9f5be6c2-afb2-452b-bd54-314fa5bed589n@googlegroups.com> 1@newsreader4.netcologne.de>
<deeae38d-da7a-4495-9558-f73a9f615f02n@googlegroups.com> <9141df99-f363-4d64-9ce3-3d3aaf0f5f40n@googlegroups.com>
<78cd4ff6-d715-4886-950d-cb1a8d3c6654n@googlegroups.com> <f2fd635d-71e6-4757-877a-5bedb276afc0n@googlegroups.com>
@RFC-Message-ID:
<c2f2f9ca-0789-48b5-9047-024f69e2116cn@googlegroups.com>
@TZUTC: -0700
@PID: G2/1.0
@TID: FIDOGATE-5.12-ge4e8b94
On Saturday, September 23, 2023 at 12:29:49 PM UTC-4, MitchAlsup wrote:
> On Friday, September 22, 2023 at 8:50:53 PM UTC-5, robf...@gmail.com wrote:
> > On Friday, September 22, 2023 at 10:26:38 AM UTC-4, MitchAlsup wrote:
> > > On Thursday, September 21, 2023 at 9:05:14 PM UTC-5, JimBrakefield wrote:
> > > > > On Wednesday, September 20, 2023 at 3:32:03 PM UTC-5, Thomas Koenig wrote:
> > > > > > MitchAlsup <Mitch...@aol.com> wrote:
> > > > > > > On Sunday, September 17, 2023 at 3:30:19 PM UTC-5, John Levine wrote:
> > > > > > >> According to Thomas Koenig <tko...@netcologne.de>:
> > > > > > >> >> That's not a power-of-two length, so how do I keep using these numbers both
> > > > > > >> >> efficient and simple?
> > > > > >> >
> > > > > > >> >Make the architecture byte-addressable, with another width for the
> > > > > > >> >bytes; possible choices are 6 and 9.
> > > > > > >> I'm pretty sure the world has spoken and we are going to use 8-bit
> > > > > > >> bytes forever. I liked the PDP-8 and PDP-10 but they are, you know, dead.
> > > > > ><
> > > > > > In addition, the world has spoken and little endian also won.
> > > > > ><
> > > > > > >> >Then make your architecture capable of misaligned loads and stores
> > > > > > >> >and an extra floating point format, maybe 45 bits, with 9 bits
> > > > > > >> >exponent and 36 bits of significand.
> > > > > > ><
> > > > > > >> If you're worried about performance, use your 45 bit format and store
> > > > > > >> it in a 64 bit word.
> > > > > ><
> > > > > > > In 1985 one could get a decent 32-bit pipelined RISC architecture in 1cm^2.
> > > > > > > Today this design is < 0.1mm^2 or you can make a GBOoO version < 2mm^2.
> > > > > > ><
> > > > > > > And you really need 5mm^2 to get enough pins on the part to feed what you
> > > > > > > can put inside; 7mm^2 makes even more sense on pins versus perf.
> > > > > ><
> > > > > > > So, why are you catering to ANY bit counts less than 64 ??
> > > > > > > Intel has versions with 512-bit data paths, GPUs generally use 1024-bits in
> > > > > > > and 1024 bits out per cycle continuously per shader core.
> > > > > > ><
> > > > > > > It is no longer 1990, adjust your thinking to the modern realities of our time !
> > > > >
> > > > > There could be a justification for an intermediate floating point
> > > > > design - memory bandwidth (and ALU width).
> > > > >
> > > > > If you look at linear algebra solvers, these are usually limited
> > > > > > by memory bandwidth. A 512-bit cache line size accommodates
> > > > > 8 64-bit numbers, 10 48-bit numbers, 12 40-bit numbers, 14
> > > > > 36-bit numbers or 16 32-bit numbers.
> > > > >
> > > > > For problems where 32 bits are not enough, but a few more bits
> > > > > might suffice, having additional intermediate floating point sizes
> > > > > could offer significant speedup.
> > > > Ugh. The business case for non-power-of-two floats:
> > > > The core count (or lane count) increases for shorter floats:
> > > > 25% increase for 48-bit floats, 60% for 40-bit floats and 75% for 36-bit floats versus 64-bit floats.
> > > > Ignoring super-linear transistor counts and logic delay, this directly translates into performance advantage.
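The percentages quoted above can be checked with a few lines of arithmetic. The sketch below is an editorial illustration and not part of the original posts; the function names are hypothetical. It computes the gain over 64-bit floats in two ways: by whole values per 512-bit cache line, and by the raw width ratio.

```python
LINE_BITS = 512  # cache line size used in the quoted post


def values_per_line(width_bits, line_bits=LINE_BITS):
    """Whole values of the given width that fit in one cache line."""
    return line_bits // width_bits


def gain_over_64(width_bits):
    """Percent gain versus 64-bit floats as (per cache line, raw width ratio)."""
    per_line = values_per_line(width_bits) / values_per_line(64) - 1
    ratio = 64 / width_bits - 1
    return round(100 * per_line, 1), round(100 * ratio, 1)


if __name__ == "__main__":
    for width in (48, 40, 36, 32):
        print(width, values_per_line(width), gain_over_64(width))
```

For 48, 40 and 36 bits this gives (25.0, 33.3), (50.0, 60.0) and (75.0, 77.8). The quoted 25% and 75% match the per-line count, while the quoted 60% matches the width ratio.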
> > > <
> > > One builds FP calculation resources as big as the longest container needed at full throughput.
> > > In a 64-bit machine, this is one with an 11-bit exponent and a 52-bit fraction.
> > > On such a machine, the latency is set by the calculations on this sized number.
> > > AND
> > > Smaller width numbers do not save any cycles.
> > > <
> > > So, the only advantage one has with 48-bit, ... numbers is memory footprint.
> > > There is NO (nada, zero, zilch) advantage in calculation latency.
> > > <
> > Does that include complicated calculations too? What about trig functions, square root, or other iterative functions?
> <
> FDIV       17 cycles        Goldschmidt with 1 Newton-Raphson iteration
> SQRT       22 cycles        Goldschmidt with 1 Newton-Raphson iteration
> Ln2        16 cycles
> ln/ln10    19 cycles
> Exp2       17 cycles
> Exp/exp10  20 cycles
> Sin/Cos    21 cycles        including argument reduction
> Tan        21 or 38 cycles  including argument reduction
> Atan       21 or 38 cycles
> Pow        36 cycles
> All double precision, all faithfully rounded, all Chebyshev polynomials except as noted.
> <
> > As I have implemented reciprocal square root in micro-code it takes longer for greater precision. Makes me think
> > there is some benefit to supporting varying precisions.
> <
> SQRT and RSQRT can be done such that precision doubles each iteration.
> as such, if you can do 32-bits in K cycles you can do 64-bits in K+3 cycles,
> there is not much room for making 48-bits faster. Oh, BTW, K = 19 and loop
> iteration is 3 cycles.
> <

I have many more than 3 cycles for an iteration. An FMA takes 8 cycles and
there are multiple per iteration. However, I should have looked at my
micro-code more closely. There is indeed no difference between calculating
out to 64 bits or 48 bits, because of the number of bits reached in each
iteration.

To get 48 bits an iteration faster would require a much more accurate
initial approximation, which is probably not practical.
// RSQRT initial approximation  0
// y = y*(1.5f - xhalf*y*y);  // first NR iteration:   9.16 bits accurate
// y = y*(1.5f - xhalf*y*y);  // second NR iteration: 17.69 bits accurate
// y = y*(1.5f - xhalf*y*y);  // third NR iteration:  35 bits accurate
// y = y*(1.5f - xhalf*y*y);  // fourth NR iteration: 70 bits accurate
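The roughly doubling bit counts in the comments above can be reproduced numerically. The following is a minimal sketch, not the micro-code itself. It assumes the well-known float32 bit-trick seed (constant 0x5F3759DF), which may differ from the initial approximation actually used here, and it runs the iterations in Python's `decimal` so accuracy beyond double precision is visible. All function names are hypothetical.

```python
import struct
from decimal import Decimal, getcontext

getcontext().prec = 60  # about 199 bits of working precision


def rsqrt_seed(x):
    """Assumed seed: the classic float32 bit trick for 1/sqrt(x)."""
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    i = 0x5F3759DF - (i >> 1)
    return struct.unpack("<f", struct.pack("<I", i))[0]


def bits_accurate(y, x):
    """Accuracy of y as -log2 of its relative error against 1/sqrt(x)."""
    exact = 1 / x.sqrt()
    rel = abs(y - exact) / exact
    return float(-(rel.ln() / Decimal(2).ln()))


def nr_history(x, iterations=4):
    """Bits of accuracy for the seed and after each Newton-Raphson step."""
    xd = Decimal(x)
    xhalf = xd / 2
    y = Decimal(rsqrt_seed(x))
    history = [bits_accurate(y, xd)]
    for _ in range(iterations):
        y = y * (Decimal("1.5") - xhalf * y * y)  # same update as the comments
        history.append(bits_accurate(y, xd))
    return history


if __name__ == "__main__":
    for step, bits in enumerate(nr_history(2.0)):
        print(f"after {step} iterations: {bits:.2f} bits")
```

Each step maps a relative error e to about 1.5*e^2, so the bit count goes to roughly twice the previous value minus 0.585. The numbers for a single input such as 2.0 will not equal the figures in the comments, which look like worst-case values over the input range, but the doubling pattern is the same. It also shows why a 48-bit result needs the same fourth iteration as a 64-bit one.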

In my case it still looks like there may be value in supporting separate
8/16/32/64/128 bit ops. I guess it depends on how fast an iteration is.
Limited parallel hardware here.

Reciprocal estimate, result bits and clocks:
16 bits: 22 clocks
32 bits: 38 clocks
64 bits: 54 clocks
16 clocks per iteration plus some overhead.
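The three data points above fit a simple linear cost model. The sketch below assumes a fixed overhead of 6 clocks (inferred as 22 minus 16, not stated in the post) and assumes each iteration doubles the result precision starting from 16 bits after the first one. The names are hypothetical.

```python
CLOCKS_PER_ITERATION = 16            # stated in the post
MEASURED = {16: 22, 32: 38, 64: 54}  # result bits -> clocks, from the post
ASSUMED_OVERHEAD = 6                 # inferred: 22 - 16, an assumption


def iterations_for(bits, first_iteration_bits=16):
    """Iterations needed if precision doubles with every iteration."""
    count = 1
    reached = first_iteration_bits
    while reached < bits:
        reached *= 2
        count += 1
    return count


def predicted_clocks(bits, overhead=ASSUMED_OVERHEAD):
    """Fixed overhead plus a constant cost per iteration."""
    return overhead + CLOCKS_PER_ITERATION * iterations_for(bits)


if __name__ == "__main__":
    for bits in (16, 32, 48, 64):
        print(bits, predicted_clocks(bits))
```

Under these assumptions the model reproduces 22, 38 and 54 clocks, and a 48-bit result costs the same 54 clocks as a 64-bit one. With 16 clocks per iteration the gap between 32 and 64 bits is far larger than with a 3-cycle loop, which is why separate narrower ops may still pay off on this hardware.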

Scratching my head over how to make the ISA less dependent on the
implementation.

> > > > As L1, L2 and L3 data caches are on chip, they can be specialized for the float size.
> > > > Data transfers between DRAM and the processor chip become more complicated, but as DRAM is much slower,
> > > > the effect is less noticeable.
> > > > The instructions for these different floating point units can remain 8-bit byte sized, e.g. employ a Harvard architecture.
> > > > (a given chip would normally support a single float size or a half or fourth thereof)
--- G2/1.0
* Origin: usenet.network (2:5075/128)