From : BGB-Alt 2:5075/128 25 Sep 23 16:43:26

To : Stefan Monnier 25 Sep 23 00:47:03

Subject : Re: Solving the Floating-Point Conundrum

----------------------------------------------------------------------------------

On 9/25/2023 1:11 PM, Stefan Monnier wrote:
>> I am now evaluating the possible use of a 48-bit floating-point format, but
>> this is (merely) in terms of memory storage (in registers, it will still use
>> Binary64).
>
> I suspect this is indeed the only sane way to go about it.
> Also, I suspect that such 48bit floats would only be worthwhile when you
> have some large vectors/matrices and care about the 33% bandwidth
> overhead of using 64bit rather than 48bit.  So maybe the focus should be
> on "load 3 chunks, then spread turn them into 4" since the limiting
> factor would presumably be the memory bandwidth.
>

Yeah, in my experience, memory bandwidth tends to be one of the major
limiting factors for performance in many algorithms.

This is partly why I had some wonk like 3x Float21 vectors (with 64-bit
storage). And, part of why I do a lot of stuff using Binary16 (where, in
this case, both Binary16 and Binary32 have the same latency).

Well, and a lot of my 3D rendering uses RGB555 framebuffers, a
16-bit Z-buffer, and texture compression...
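
As a rough illustration of the RGB555 case, here is a minimal C sketch of packing an 8-bit-per-channel color into a 16-bit pixel and expanding it again. The bit layout (0RRRRRGGGGGBBBBB) and the function names are assumptions for the example, not taken from the renderer discussed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack 8-bit R, G, B into one 16-bit pixel by keeping the top 5 bits
 * of each channel. Assumed layout: 0RRRRRGGGGGBBBBB (top bit unused). */
static uint16_t rgb555_pack(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 10) | ((g >> 3) << 5) | (b >> 3));
}

/* Expand a 5-bit channel to 8 bits, copying the high bits into the
 * low bits so that 0 maps to 0 and 31 maps to 255. */
static uint8_t expand5(unsigned v5)
{
    return (uint8_t)((v5 << 3) | (v5 >> 2));
}

/* Unpack a 16-bit RGB555 pixel back into 8-bit channels. */
static void rgb555_unpack(uint16_t px, uint8_t *r, uint8_t *g, uint8_t *b)
{
    assert(r != NULL && g != NULL && b != NULL);
    *r = expand5((px >> 10) & 31u);
    *g = expand5((px >> 5) & 31u);
    *b = expand5(px & 31u);
}
```

A framebuffer of these pixels takes half the memory of a 32-bit RGBA one, at the cost of a few shifts and masks per pixel.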

As noted in some past examples, even desktop PCs are not immune to this,
and saving some memory via bit-twiddling can often be cheaper than using a
"less memory dense" strategy (that results in a higher number of L1 misses).

Ironically, this seems to run counter to the conventional wisdom of
saving calculations via lookup tables (depending on the algorithm, the
lookup table may only "win" if it is less than 1/4 or 1/8 the size of
the L1 cache).

Many people seem to evaluate efficiency solely in terms of how many
clock-cycles it would take to evaluate a sequence of instructions,
rather than necessarily how much memory is touched by the algorithm or
its probability of resulting in L1 misses and similar.

Granted, OoO CPUs can sort of hide away L1 miss costs to some extent
(they are a little more obvious with a strictly in-order CPU).

> E.g. load 3 chunks (C1, C2, and C3) of 256bits each using standard SIMD
> load, and then add an instruction to turn C1+C2 into two 256bit vectors
> of 4x64bit floats, and another to do the same with C2+C3 (basically, the
> same instruction except it uses the other half of the bits of C2).
>

I don't really have a good way to do 256-bit loads or work with 256-bit
vectors in my core.

But, yeah, this is how it works for things like the 3x Float21 (64-bit)
and 3x Float42 vectors (128-bit). Where, Float42 vectors would be split
up into three Binary64 values for doing math on them, and then repacked
later.
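
For illustration, the unpack / compute / repack pattern for a 3x Float21 vector might look roughly like the C sketch below. The Float21 layout is not spelled out in this post; the sketch assumes each element is simply the high 21 bits of a Binary32 value (1 sign, 8 exponent, 12 fraction bits), with the three elements in the low 63 bits of a 64-bit word. The function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a float as its 32-bit pattern and back. */
static uint32_t f32_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

static float bits_f32(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Pack three floats into one 64-bit word. ASSUMPTION: each element
 * keeps only the top 21 bits of its Binary32 pattern (truncation),
 * with x in bits 0..20, y in bits 21..41, z in bits 42..62. */
static uint64_t vec3_f21_pack(float x, float y, float z)
{
    uint64_t a = f32_bits(x) >> 11;
    uint64_t b = f32_bits(y) >> 11;
    uint64_t c = f32_bits(z) >> 11;
    return a | (b << 21) | (c << 42);
}

/* Unpack to three floats by zero-filling the dropped low 11 bits. */
static void vec3_f21_unpack(uint64_t v, float out[3])
{
    assert(out != NULL);
    for (int i = 0; i < 3; i++) {
        uint32_t u = (uint32_t)((v >> (21 * i)) & 0x1FFFFFu) << 11;
        out[i] = bits_f32(u);
    }
}
```

The math itself would then be done on the unpacked values at full register precision, with the narrowing applied only when the result is written back.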

The Float48 format is basically special-case Load/Store ops which pad
the value to 64 bits on Load, and narrow it to 48 bits on Store. These
are mostly intended for scalar operations on arrays.
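
A software sketch of what such a Load/Store pair does, in C. This assumes Float48 is the high 48 bits of a Binary64 value (1 sign, 11 exponent, 36 fraction bits), stored little-endian in 6 bytes, and that Store narrows by truncation; the rounding behavior and the helper names are assumptions for the example, not a description of the actual instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Load: widen 6 stored bytes to a double by zero-filling the low
 * 16 bits of the Binary64 pattern. */
static double f48_load(const uint8_t *p)
{
    uint64_t bits = 0;
    double d;
    assert(p != NULL);
    for (int i = 0; i < 6; i++)
        bits |= (uint64_t)p[i] << (16 + 8 * i);
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Store: narrow a double to 6 bytes by dropping the low 16 bits.
 * ASSUMPTION: plain truncation, no rounding. */
static void f48_store(uint8_t *p, double d)
{
    uint64_t bits;
    assert(p != NULL);
    memcpy(&bits, &d, sizeof bits);
    for (int i = 0; i < 6; i++)
        p[i] = (uint8_t)(bits >> (16 + 8 * i));
}

/* Array access: element i lives at byte offset i * 6, which is why
 * a power-of-two scaled index alone does not cover it. */
static double f48_get(const uint8_t *base, size_t i)
{
    return f48_load(base + i * 6);
}

static void f48_set(uint8_t *base, size_t i, double d)
{
    f48_store(base + i * 6, d);
}
```

With this layout, values whose low 16 fraction bits are zero round-trip exactly, and everything else loses at most a relative 2^-36 per store.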

There is a separate penalty for using it in arrays, in that it needs an
extra LEA.W instruction to handle the 48-bit element size (well, along
with the drawback of the Disp5 direct-displacement ops only having a
60-byte range in this case).

But emulating this without these special instructions would take a
somewhat longer instruction sequence (and ends up needing to stomp R16
and R17 for the pseudo-instructions, ...), so special ops seem
justifiable. I also added the fallback case to BGBCC.

>
>          Stefan

