On 9/26/2023 1:09 AM, Terje Mathisen wrote:
> BGB-Alt wrote:
>> On 9/25/2023 1:11 PM, Stefan Monnier wrote:
>>>> I am now evaluating the possible use of a 48-bit floating-point
>>>> format, but
>>>> this is (merely) in terms of memory storage (in registers, it will
>>>> still use
>>>> Binary64).
>>>
>>> I suspect this is indeed the only sane way to go about it.
>>> Also, I suspect that such 48bit floats would only be worthwhile when you
>>> have some large vectors/matrices and care about the 33% bandwidth
>>> overhead of using 64bit rather than 48bit. So maybe the focus should be
>>> on "load 3 chunks, then turn them into 4" since the limiting
>>> factor would presumably be the memory bandwidth.
>>>
>>
>> Yeah, memory bandwidth tends to be one of the major limiting factors
>> for performance in my experience for many algorithms.
>>
>> This is partly why I had some wonky formats like 3x Float21 vectors
>> (with 64-bit storage). And, part of why I do a lot of stuff using
>> Binary16 (where, in this case, both Binary16 and Binary32 have the
>> same latency).
>>
>> Well, and a lot of my 3D rendering uses RGB555 framebuffers, a
>> 16-bit Z-buffer, and texture compression...
>>
>>
>>
>> As noted in some past examples, even desktop PCs are not immune to
>> this, and saving some memory via bit-twiddling can often be cheaper
>> than using a "less memory dense" strategy (that results in a higher
>> number of L1 misses).
>>
>> Ironically, this seems to run counter to the conventional wisdom of
>> saving calculations via lookup tables (depending on the algorithm, the
>> lookup table may only "win" if it is less than 1/4 or 1/8 the size of
>> the L1 cache).
>
> I used to be the master of lookup tables (i.e. a Word Count clone that
> ran in 1.5 clock cycles/byte on a Pentium, using two dependent lookup
> tables and zero branching), but lately vector operations and the
> increasing cost of storage mean that re-calculating stuff can be faster
> and/or more power-efficient.
Yeah.

Or, in my core, because of the comparably high cost of L1 misses...

Using lookup tables to sidestep a divide operation can still be a win,
though: a software divide loop is slow, and the hardware DIVx.x ops are
also slow. The exception is that DIVS.L / DIVU.L can optimize a similar
subset of cases to the lookup tables (internally handling it as a
multiply-by-reciprocal). So, say, for divisors in the range of 0..63,
the DIVS.L op can handle it in 3 cycles (vs 36 cycles for most
everything else).
>>
>> Many people seem to evaluate efficiency solely in terms of how many
>> clock-cycles it would take to evaluate a sequence of instructions,
>> rather than necessarily how much memory is touched by the algorithm or
>> its probability of resulting in L1 misses and similar.
>
> BTDT, lots of times!
>
> Any serious optimization effort has to measure how each intended
> improvement actually works as part of the full program, and not in
> isolation. I have been in a situation where I really wanted to beat the
> best closed-source ogg vorbis decoder, and I was getting desperate near
> the end: About 9 out of 10 new ideas that would all work in
> theory/isolation ended up either a wash or a slowdown in the full
> program.
>
Yeah.
This relates to why some of my codecs use Rice and other non-standard
entropy coders (rather than Huffman): while in isolation Huffman is
short and fast, once one has multiple contexts (and a lot of L1
misses), performance quickly takes a hit.
So, there may be wonky encoders, like ranking all the symbols by
probability and then Rice-encoding the index (so, only ~256 bytes are
needed per context in this case), usually with the Rice coding being
length-limited (usually, Q=7 as a limit, where a run of 8x 1-bits or
similar is followed by a raw symbol).

More complex logic, but fewer L1 misses, ...
Though, Huffman with a 12-bit length limit is also a promising
compromise, as it is generally easier for an 8K lookup table to fit into
the L1 cache... And, 12-bit Huffman does still lead to better
compression than the permutation-table + Rice variants.
...
> Terje
>
--- Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.15.1
* Origin: A noiseless patient Spider (2:5075/128)