On Saturday, September 23, 2023 at 2:12:30 PM UTC-5, robf...@gmail.com wrote:
> On Saturday, September 23, 2023 at 12:29:49 PM UTC-4, MitchAlsup wrote:
> > On Friday, September 22, 2023 at 8:50:53 PM UTC-5, robf...@gmail.com wrote:
> > > On Friday, September 22, 2023 at 10:26:38 AM UTC-4, MitchAlsup wrote:
> > > > On Thursday, September 21, 2023 at 9:05:14 PM UTC-5, JimBrakefield wrote:
> > > > > On Wednesday, September 20, 2023 at 3:32:03 PM UTC-5, Thomas Koenig wrote:
> > > > > > MitchAlsup <Mitch...@aol.com> wrote:
> > > > > > > On Sunday, September 17, 2023 at 3:30:19 PM UTC-5, John Levine wrote:
> > > > > > >> According to Thomas Koenig <tko...@netcologne.de>:
> > > > > > >> >> That's not a power-of-two length, so how do I keep using these numbers both efficient and simple?
> > > > > > >> >
> > > > > > >> >Make the architecture byte-addressable, with another width for the bytes; possible choices are 6 and 9.
> > > > > > >> I'm pretty sure the world has spoken and we are going to use 8-bit
> > > > > > >> bytes forever. I liked the PDP-8 and PDP-10 but they are, you know, dead.
> > > > > > ><
> > > > > > > In addition, the world has spoken and little endian also won.
> > > > > > ><
> > > > > > >> >Then make your architecture capable of misaligned loads and stores,
> > > > > > >> >and an extra floating point format, maybe 45 bits, with 9 bits
> > > > > > >> >of exponent and 36 bits of significand.
> > > > > > ><
> > > > > > >> If you're worried about performance, use your 45 bit format and store it in a 64 bit word.
> > > > > > ><
> > > > > > > In 1985 one could get a decent 32-bit pipelined RISC architecture in 1cm^2.
> > > > > > > Today this design fits in < 0.1mm^2, or you can make a GBOoO version in < 2mm^2.
> > > > > > ><
> > > > > > > And you really need 5mm^2 to get enough pins on the part to feed what you
> > > > > > > can put inside; 7mm^2 makes even more sense on pins versus perf.
> > > > > > ><
> > > > > > > So, why are you catering to ANY bit counts less than 64 ??
> > > > > > > Intel has versions with 512-bit data paths; GPUs generally use 1024 bits in
> > > > > > > and 1024 bits out per cycle continuously per shader core.
> > > > > > ><
> > > > > > > It is no longer 1990; adjust your thinking to the modern realities of our time !
> > > > > >
> > > > > > There could be a justification for an intermediate floating point
> > > > > > design - memory bandwidth (and ALU width).
> > > > > >
> > > > > > If you look at linear algebra solvers, these are usually limited
> > > > > > by memory bandwidth. A 512-bit cache line accommodates
> > > > > > 8 64-bit numbers, 10 48-bit numbers, 12 40-bit numbers,
> > > > > > 14 36-bit numbers or 16 32-bit numbers.
> > > > > >
> > > > > > For problems where 32 bits are not enough, but a few more bits
> > > > > > might suffice, having additional intermediate floating point sizes
> > > > > > could offer significant speedup.
> > > > > Ugh. The business case for non-power-of-two floats:
> > > > > The core count (or lane count) increases for shorter floats:
> > > > > a 25% increase for 48-bit floats, 50% for 40-bit floats and 75% for 36-bit floats versus 64-bit floats.
> > > > > Ignoring super-linear transistor counts and logic delay, this directly translates into a performance advantage.
> > > > <
> > > > One builds FP calculation resources as big as the longest container needed at full throughput.
> > > > In a 64-bit machine, this is one with an 11-bit exponent and a 52-bit fraction.
> > > > On such a machine, the latency is set by the calculations on this sized number.
> > > > AND
> > > > Smaller width numbers do not save any cycles.
> > > > <
> > > > So, the only advantage one has with 48-bit, ... numbers is memory footprint.
> > > > There is NO (nada, zero, zilch) advantage in calculation latency.
> > > > <
> > > Does that include complicated calculations too? What about trig functions, square root, or other iterative functions?
> > <
> > FDIV      17 cycles        Goldschmidt with 1 Newton-Raphson iteration
> > SQRT      22 cycles        Goldschmidt with 1 Newton-Raphson iteration
> > Ln2       16 cycles
> > ln/ln10   19 cycles
> > Exp2      17 cycles
> > Exp/exp10 20 cycles
> > Sin/Cos   21 cycles        including argument reduction
> > Tan       21 or 38 cycles  including argument reduction
> > Atan      21 or 38 cycles
> > Pow       36 cycles
> > All double precision, all faithfully rounded, all Chebyshev polynomials except as noted.
> > <
> > > As I have implemented reciprocal square root in micro-code, it takes longer for greater precision.
> > > Makes me think there is some benefit to supporting varying precisions.
> > <
> > SQRT and RSQRT can be done such that precision doubles each iteration.
> > As such, if you can do 32 bits in K cycles you can do 64 bits in K+3 cycles;
> > there is not much room for making 48 bits faster. Oh, BTW, K = 19 and a loop
> > iteration is 3 cycles.
> > <
> I have many more than 3 cycles for an iteration. An FMA takes 8 cycles and there are multiple per iteration.
<
You should be able to start a new FMAC every cycle; once you can do this, the
iterations are just dropping stuff into the pipeline.
<
> However, I should have looked at my micro-code more closely. There is indeed no difference
> between calculating out to 64 bits or 48 bits, because of the number of bits reached in each iteration.
>
> To get 48 bits an iteration faster would require a much more accurate
> initial approximation, which probably is not practical.
> // RSQRT initial approximation 0
> // y = y*(1.5f - xhalf*y*y); // first NR iteration, 9.16 bits accurate
> // y = y*(1.5f - xhalf*y*y); // second NR iteration, 17.69 bits accurate
> // y = y*(1.5f - xhalf*y*y); // third NR iteration, 35 bits accurate
> // y = y*(1.5f - xhalf*y*y); // fourth NR iteration, 70 bits accurate
>
3 dependent multiplies per iteration--Goldschmidt changes this to 2
dependent multiplies per iteration.
{{Then inside the FU: the multiplies are treated as fixed point, so
the iteration latency is the height of the multiplier tree, not the
latency of the FU itself (from 360/91 FDIV)}}
<
> In my case it still looks like there may be value in supporting separate 8/16/32/64/128 bit ops.
> I guess it depends on how fast an iteration is. Limited parallel hardware here.
>
> Reciprocal estimate bits/clocks:
> 16 - 22 clocks
> 32 - 38 clocks
> 64 - 54 clocks
> 16 clocks per iteration plus some overhead.
>
> Scratching my head over how to make the ISA less dependent on the implementation.
> > > > > As L1, L2 and L3 data caches are on chip, they can be specialized for the float size.
> > > > > Data transfers between DRAM and the processor chip become more complicated,
> > > > > but as DRAM is much slower, the effect is less noticeable.
> > > > > The instructions for these different floating point units can remain
> > > > > 8-bit byte sized, e.g. employ a Harvard architecture.
> > > > > (a given chip would normally support a single float size or a half or fourth thereof)
--- G2/1.0
* Origin: usenet.network (2:5075/128)