----------------------------------------------------------------------------------
@MSGID: 1@dont-email.me> be752993
@REPLYADDR BGB <cr88192@gmail.com>
@REPLYTO 2:5075/128 BGB
@CHRS: CP866 2
@RFC: 1 0
@RFC-Message-ID: 1@dont-email.me>
@TZUTC: -0500
@PID: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.15.1
@TID: FIDOGATE-5.12-ge4e8b94
I recently had an idea (one that can be tested at small scale without
redesigning my whole pipeline):
If one delays nearly all of the operations to at least a 2-cycle
latency, then the timing seemingly gets a fair bit better.
In particular, a few 1-cycle units were delayed to 2-cycle latency:
SHAD (bitwise shifts and similar):
2-cycle latency doesn't have much obvious impact;
CONV (various bit-repacking instructions):
Minor impact, which I suspect is due to delaying MOV-2R, EXTx.x,
and similar. I could special-case these in Lane 1.
There was already a slower CONV2 path which dealt with things like FPU
format conversion and other "more complicated" format converters, so
the CONV path had mostly been left for operations that just shuffle
bits around (along with the simple 2-register MOV instruction and
similar).
Note that most ALU ops were already generally 2-cycle as well.
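Why a 2-cycle latency "doesn't have much obvious impact" can be sketched with a toy issue model (a hypothetical Python sketch, not the actual core): with full forwarding, a fully dependent chain pays the latency on every op, but interleaving independent ops between dependent pairs hides most of it.

```python
def completion_cycles(ops, latency):
    """Toy in-order, single-issue model with full result forwarding.

    ops: list of (dest, srcs) register names; 'latency' is the number
    of cycles before a result can be forwarded to a consumer.
    Returns the cycle at which the last result becomes available.
    """
    ready = {}  # reg -> first cycle its value can be consumed
    issue = 0   # next free issue slot
    for dest, srcs in ops:
        # an op issues once the slot is free and all sources are ready
        start = max([issue] + [ready.get(s, 0) for s in srcs])
        ready[dest] = start + latency
        issue = start + 1
    return max(ready.values())

# A fully dependent chain pays the latency on every op ...
chain = [("r1", []), ("r2", ["r1"]), ("r3", ["r2"])]
# ... while an interleaved independent op hides most of it.
mixed = [("r1", []), ("r4", []), ("r2", ["r1"]), ("r5", ["r4"])]
```

So the penalty mostly shows up only on back-to-back dependent ops, which compilers and hand-written code can often schedule around.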
Partly this idea was based on the observation that adding the logic for
a BSWAP.Q instruction to the CONV path had a disproportionate impact on
LUT cost and timing. The actual logic in this case is very simple
(mostly shuffling the bytes around), so in theory it should not have
had as big an impact.
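For reference, the intended BSWAP.Q behavior is just a byte-order reversal of a 64-bit value; a minimal Python sketch of the semantics (not the actual Verilog):

```python
def bswap_q(x: int) -> int:
    """Reverse the byte order of a 64-bit value (BSWAP.Q semantics).

    Pure byte shuffling: byte i of the input becomes byte (7 - i)
    of the output; no arithmetic is involved.
    """
    r = 0
    for i in range(8):
        # peel off the next-lowest input byte, append it below
        # the bytes gathered so far
        r = (r << 8) | ((x >> (8 * i)) & 0xFF)
    return r
```

In hardware this amounts to rewiring (each output byte selects one fixed input byte), which is why the LUT/timing impact seemed disproportionate to the logic involved.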
Testing this idea has, thus far, not been enough to get the clock
boosted to 75MHz, but it did seemingly help, and has seemingly
redirected the "worst failing paths" from going through the
D$->EXn->RF pipeline over to going through D$->ID1.
This is along with paths from the input side to the output side of the
instruction decoder. I might also consider disabling the (mostly
unused) RISC-V decoders, and see if this can help.
I have also now disabled the LDTEX instruction, as it is "somewhat less
important" if TKRA-GL is mapped through a hardware rasterizer module.
And, thus far, unlike past attempts in this area, this approach doesn't
effectively ruin the performance of the L1 D$.
It seems like one could possibly design a core around this assumption,
avoiding any cases where combinatorial logic feeds into the
register-forwarding path (or, cheaper still, having no register
forwarding at all; though giving every op a 3-cycle latency would be a
little steep).
Though, one possibility could be to disable register forwarding from
Lane 3, in which case only interlocks would be available.
This would work partly because Lane 3 isn't used anywhere near as often
as Lanes 1 or 2.
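The cost of dropping Lane 3 forwarding can be sketched with a toy interlock model (hypothetical numbers: a 2-cycle EX latency, plus one assumed extra cycle before a writeback becomes readable): results from forwarding lanes are consumable as soon as EX finishes, while a Lane 3 result is only visible after writeback, with the interlock stalling any earlier consumer.

```python
EX_LATENCY = 2  # assumed EX latency for most ops (cycles)
WB_EXTRA = 1    # assumed extra cycles until an RF write is readable

def issue_cycles(ops, forward_lanes=frozenset({1, 2})):
    """Toy in-order model: ops is a list of (lane, dest, srcs).

    Results from lanes in 'forward_lanes' can be forwarded after
    EX_LATENCY; others are only visible after writeback, so a
    dependent consumer stalls on the interlock. Returns the cycle
    count after the last op issues.
    """
    ready = {}  # reg -> first cycle its value can be consumed
    cycle = 0
    for lane, dest, srcs in ops:
        for s in srcs:  # interlock: stall until every source is ready
            cycle = max(cycle, ready.get(s, 0))
        delay = EX_LATENCY if lane in forward_lanes else EX_LATENCY + WB_EXTRA
        ready[dest] = cycle + delay
        cycle += 1
    return cycle

# Dependent pair where the producer runs in Lane 3:
ops = [(3, "r1", []), (1, "r2", ["r1"])]
```

Under these assumptions the penalty is only the odd extra stall cycle when something immediately consumes a Lane 3 result, which should be rare if Lane 3 is lightly used.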
...
Any thoughts?...
--- Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.15.1
* Origin: A noiseless patient Spider (2:5075/128)