@REPLYADDR robf...@gmail.com <robfi680@gmail.com>
On Thursday, September 28, 2023 at 1:08:10 PM UTC-4, BGB wrote:
> I recently had an idea (one that small-scale testing doesn't require
> redesigning my whole pipeline to try):
> If one delays nearly all of the operations to at least a 2-cycle
> latency, then seemingly the timing gets a fair bit better.
>
> In particular, a few 1-cycle units:
>   SHAD (bitwise shift and similar)
>   CONV (various bit-repacking instructions)
> were delayed to 2 cycles:
>   SHAD: a 2-cycle latency doesn't have much obvious impact;
>   CONV: minor impact, which I suspect is due to delaying MOV-2R and
>     EXTx.x and similar; I could special-case these in Lane 1.
>
>
> There was already a slower CONV2 path which handled things like FPU
> format conversion and other "more complicated" format converters, so
> the CONV path had mostly been left for operations that just involve
> shuffling bits around (and the simple 2-register MOV instruction and
> similar, etc.).
>
> Note that most ALU ops were already generally 2-cycle as well.
>
>
> Partly this idea was based on the observation that adding the logic for
> a BSWAP.Q instruction to the CONV path had a disproportionate impact on
> LUT cost and timing. The actual logic in this case is very simple
> (mostly shuffling the bytes around), so theoretically should not have
> had as big of an impact.
>
>
> Testing this idea thus far isn't enough to get the clock boosted to
> 75MHz, but it did seemingly help, and has seemingly redirected the
> "worst failing paths" from going through the D$->EXn->RF pipeline
> to going through D$->ID1.
>
> Along with paths from the input side to the output side of the
> instruction decoder. Might also consider disabling the (mostly unused)
> RISC-V decoders to see if this helps.
>
> Have also now disabled the LDTEX instruction, as it is "somewhat less
> important" if TKRA-GL is mapped through a hardware rasterizer module.
>
>
> And, thus far, unlike past attempts in this area, this approach doesn't
> effectively ruin the performance of the L1 D$.
>
>
> Seems like one could possibly try to design a core around this
> assumption, avoiding any cases where combinatorial logic feeds into the
> register-forwarding path (or, cheaper still, not have any register
> forwarding; but giving every op a 3-cycle latency would be a little steep).
>
> Though, one possibility could be to disable register forwarding from
> Lane 3, in which case only interlocks would be available.
> This would work partly because Lane 3 isn't used anywhere near as often
> as Lanes 1 or 2.
>
> ...
>
>
>
> Any thoughts?...
Sounds like super-pipelining. I did this sort of thing for my PowerPC-compatible
core. Each stage was multi-cycle, but it was still an overlapped pipeline, and
it did boost the clock frequency significantly. Overall performance was not a
whole lot better, though. It might be good if a high clock frequency is
desired. I prefer to use a lower clock frequency; it consumes less power.
--- G2/1.0
* Origin: usenet.network (2:5075/128)