From : BGB 2:5075/128 30 Sep 23 12:50:47

To : EricP 30 Sep 23 20:54:03

Subject : Re: Misc: Another (possible) way to more MHz...

----------------------------------------------------------------------------------

On 9/30/2023 11:04 AM, EricP wrote:
> BGB wrote:
>> On 9/29/2023 2:02 PM, EricP wrote:
>>> BGB wrote:
>>>>>>
>>>>>> Any thoughts?...
>>>>>
>>>>> It's not just the MHz but the IPC you need to think about.
>>>>> If you are running at 50 MHz but an actual IPC of 0.1 due to
>>>>> stalls and pipeline bubbles then that's really just 5 MIPS.
>>>>>
>>>>
>>>> For stats from a running full simulation (predates these
>>>> tweaks, running GLQuake with the HW rasterizer):
>>>>   ~ 0.48 .. 0.54 bundles/clock;
>>>>   ~ 1.10 .. 1.40 instructions/bundle.
>>>>
>>>> Seems to be averaging around 29..32 MIPS at 50MHz (so, ~ 0.604
>>>> MIPS/MHz).
>>>
>>> Oh that's pretty efficient then.
>>> In the past you had made comments which made it sound like
>>> having tlb, cache, and dram controller all hung off of what
>>> you called your "ring bus", which sounded like a token ring,
>>> and that the RB consumed many cycles latency.
>>> That gave me the impression of frequent, large stalls to cache,
>>> lots of bubbles, leading to low IPC.
>>>
>>
>> It does diminish IPC, but not as much as my older bus...
>
> Oh I thought this was 1-wide but I see elsewhere that it is 3-wide.
> That's not that efficient. I was thinking you were getting an IPC
> of 0.5 out of ~0.7, the maximum possible with 1 register write port.
> A 3-wide should get an IPC > 1.0 but since you only have 1 RF write port
> that pretty much bottlenecks you at WB/Retire to < 1.0.
>

There are 3 write ports to the register file.

However, they only see much use when the code actually uses them, which
for the most part, my C compiler doesn't. It basically emits normal
1-wide RISC style code, then tries to jostle the instructions around and
put them in bundles.

Results are pretty mixed, and it only really works if the code is
written in certain ways.
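
As a rough illustration of why the results are mixed, here is a minimal sketch of packing 1-wide code into bundles. It is not the actual compiler logic: the `Op` type and the `pack` function are hypothetical, it only packs adjacent instructions (no reordering), and it assumes a bundle must not contain a RAW or WAW dependency.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Op:
    """Hypothetical 1-wide instruction: a name, one destination, some sources."""
    name: str
    dst: Optional[str]
    srcs: Tuple[str, ...] = ()


def conflicts(op: Op, bundle: List[Op]) -> bool:
    """True if op reads or overwrites a register written earlier in the bundle."""
    for prev in bundle:
        if prev.dst is not None and (prev.dst in op.srcs or prev.dst == op.dst):
            return True
    return False


def pack(ops: List[Op], width: int = 3) -> List[List[Op]]:
    """Greedily pack ops, in program order, into bundles of at most `width`."""
    bundles: List[List[Op]] = []
    current: List[Op] = []
    for op in ops:
        if len(current) == width or conflicts(op, current):
            bundles.append(current)   # close the bundle and start a new one
            current = []
        current.append(op)
    if current:
        bundles.append(current)
    return bundles


# A dependent chain cannot be bundled at all...
chain = [Op("add", "r1", ("r0",)), Op("add", "r2", ("r1",)), Op("add", "r3", ("r2",))]
# ...while independent work packs up to the bundle width.
indep = [Op("add", "r1", ("r0",)), Op("add", "r2", ("r0",)), Op("add", "r3", ("r0",))]

chain_bundles = pack(chain)
indep_bundles = pack(indep)
```

With code like `chain`, every bundle holds one instruction, which is roughly what ordinary compiled C tends to look like without careful scheduling.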

Ironically, for GLQuake, most of the ASM was in areas that dropped off
the map when switching to a hardware rasterizer; so the part of the
OpenGL pipeline that remains is mostly the stuff that was written
in C (with a few random bits of ASM thrown in).

> I suspect those ring bus induced bubbles are likely killing your IPC.
> Fiddling with the internals won't help if the pipeline is mostly empty.
>

Ringbus latency doesn't matter when there are no L1 misses...

> I suggest the primary thing to think about for the future is getting the
> pipeline as full as possible. Then consider making it more efficient
> internally, adding more write register ports so you can retire > 1.0 IPC
> (there is little point in having 3 lanes if you can only retire 1/clock).
> Then thirdly start looking at things like forwarding buses.
>

Well, would be back to a lot more fiddling with my C compiler in this case.

As noted, the ISA in question is statically scheduled, so depends mostly
on either the compiler or ASM programmer to do the work.
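
For reference, the simulation stats quoted near the top combine as bundles/clock times instructions/bundle times clock rate. A small sketch of that arithmetic (the function name is just for illustration; the input ranges are the ones from the stats, and the bounds below are the extremes of those ranges rather than measured values):

```python
def mips(bundles_per_clock: float, instrs_per_bundle: float, mhz: float) -> float:
    """Instructions per clock times clock rate in MHz gives MIPS."""
    return bundles_per_clock * instrs_per_bundle * mhz


low = mips(0.48, 1.10, 50.0)    # worst case of both quoted ranges
high = mips(0.54, 1.40, 50.0)   # best case of both quoted ranges
print(f"{low:.1f} .. {high:.1f} MIPS at 50 MHz")
```

The reported 29..32 MIPS average sits inside that envelope. Since bundles/clock is mostly set by stalls, instructions/bundle is the term that better compiler scheduling would move.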

>> It seems like, if there were no memory related overheads (if the L1
>> always hit), as is it would be in the area of 22% faster.
>>
>> L1 misses are still not good though, but this is true even on a modern
>> desktop PC.
>
> The cache miss rate may not be the primary bottleneck.
> Are you using the ring bus to talk to TLBs, I$L1, D$L1, L2, etc?
>

L1 caches are mounted directly to the pipeline, and exist in EX1..EX3
stages.

So:
   PF IF ID1 ID2 EX1 EX2 EX3 WB
Or, alternately:
   PF IF ID RF EX1 EX2 EX3 WB

So, access is like:
   EX1: Calculate address, send request to L1 cache;
   EX2: Cache checks hit/miss, extracts data for load, prepare for store.
     This is the stage where the pipeline stall is signaled on miss.
   EX3: Data fetched for Load, final cleanup.
     Final cleanup: Sign-extension, Binary32->Binary64 conversion, etc.
     Data stored back into L1 arrays here (on next clock edge).
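
The load path above can be sketched as a toy behavioral model. This is only an illustration under assumptions not stated in the post: a direct-mapped cache, 16-byte lines, aligned accesses, and hypothetical names throughout (`L1DCache`, `fill`, `ex1`..`ex3`). It models what each stage does, not the timing.

```python
LINE_BYTES = 16                       # assumed line size (not from the post)
NUM_LINES = 32 * 1024 // LINE_BYTES   # 32K L1 D$, assumed direct-mapped


class L1DCache:
    def __init__(self):
        self.tags = [None] * NUM_LINES
        self.lines = [bytes(LINE_BYTES) for _ in range(NUM_LINES)]

    def fill(self, addr, line):
        """Install a whole line, as a miss handler would after the stall."""
        index = (addr // LINE_BYTES) % NUM_LINES
        self.tags[index] = addr // (LINE_BYTES * NUM_LINES)
        self.lines[index] = bytes(line)

    def ex1(self, base, disp):
        """EX1: calculate the address and hand it to the cache."""
        return base + disp

    def ex2(self, addr, size):
        """EX2: check hit/miss and extract the raw bytes for a load."""
        index = (addr // LINE_BYTES) % NUM_LINES
        if self.tags[index] != addr // (LINE_BYTES * NUM_LINES):
            return False, None        # a real pipeline would signal a stall here
        offset = addr % LINE_BYTES
        return True, self.lines[index][offset:offset + size]

    def ex3(self, raw, signed):
        """EX3: final cleanup, here just sign- or zero-extension."""
        return int.from_bytes(raw, "little", signed=signed)


cache = L1DCache()
cache.fill(0x1000, b"\xff\xff\xff\xff" + bytes(12))
addr = cache.ex1(0x0FF0, 0x10)        # base + displacement = 0x1000
hit, raw = cache.ex2(addr, 4)
value = cache.ex3(raw, signed=True)   # 32-bit load, sign-extended
```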

> Some questions about your L1 cache:
>
> In clocks, what is I$L1 D$L1 read and write hit latency,
> and the total access latency including ring bus overhead?
> And is the D$L1 store pipelined?
>

Loads and stores are pipelined.

TLB doesn't matter yet, L1 caches are virtually indexed and tagged.

> Do you use the same basic design for your 2-way assoc. TLB
> as the L1 cache, so the same numbers apply?
>
> And do you pipeline the TLB lookup in one stage, and D$L1 access in a
> second?
>

TLB is a separate component external to the L1 caches, and performs
translation on L1 miss.

It has a roughly 3 cycle latency.
   1: Request comes in, setup for fetch from TLB arrays;
   2: Check for TLB hit/miss, raise exception on miss;
   3: Replace original request with translated request.
Output is on the clock-edge following the 3rd cycle.
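
A minimal sketch of those three steps, assuming a 2-way set-associative TLB (as mentioned in the question), 4K pages, and 64 sets; the page size, set count, replacement policy, and all names here are placeholders rather than the actual design:

```python
PAGE_SHIFT = 12   # assumed 4K pages (not stated in the post)


class TLBMiss(Exception):
    """Stands in for the exception raised on a TLB miss."""


class TLB:
    def __init__(self, sets=64):
        self.sets = sets
        # Two ways per set; each way holds (vpn, ppn) or None.
        self.ways = [[None, None] for _ in range(sets)]

    def insert(self, vpn, ppn):
        """Trivial replacement: newest entry in way 0, previous one moves to way 1."""
        entry_set = self.ways[vpn % self.sets]
        entry_set[1] = entry_set[0]
        entry_set[0] = (vpn, ppn)

    def translate(self, vaddr):
        # Cycle 1: request comes in, fetch both ways of the indexed set.
        vpn = vaddr >> PAGE_SHIFT
        entries = self.ways[vpn % self.sets]
        # Cycle 2: check for hit/miss, raise the exception on miss.
        match = next((e for e in entries if e is not None and e[0] == vpn), None)
        if match is None:
            raise TLBMiss(hex(vaddr))
        # Cycle 3: replace the original address with the translated one.
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        return (match[1] << PAGE_SHIFT) | offset


tlb = TLB()
tlb.insert(vpn=0x5, ppn=0x9)
paddr = tlb.translate(0x5123)
```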

> I'm suggesting that your primary objective is making that pathway from the
> Load Store Unit (LSU) to TLB to D$L1 as simple and efficient as possible.
> So a direct 1:1 connect, zero bus overhead and latency, just cache latency.
>
> Such that ideally it takes 2 pipelined stages for a cache read hit,
> and if the D$L1 read hit is 1 clock that the load-to-use
> latency is 2 clocks (or at least that is possible), pipelined.
>
> And that a store is passed to D$L1 in 1 clock,
> and then the LSU can continue while the cache deals with it.
> The cache bus handshake would go "busy" until the store is complete.
> Also ideally store hits would pipeline the tag and data accesses
> so back to back store hits take 1 clock (but that's getting fancy).
>

There is no LSU in this design, or effectively, the L1 cache itself
takes on this role.

>> I suspect ringbus efficiency is diminishing the efficiency of external
>> RAM access, as both the L2 memcpy and DRAM stats tend to be lower than
>> the "raw" speed of accessing the RAM chip (in the associated unit tests).
>
> At the start this ring bus might have been a handy idea by
> making it easy to experiment with different configurations, but I
> think you should be looking at direct connections whenever possible.
>

Within the core itself, everything is bolted directly to the main pipeline.

External to this, everything is on the ringbus.

As noted, when there are no cache misses and no MMIO access or similar,
the bus isn't really involved.

But, yeah, I am left to realize that, say, driving the L2 cache with a
FIFO might have been better for performance (rather than just letting
requests circle the ring until they can be handled).

>>
>>>> Top ranking uses of clock-cycles (for total stall cycles):
>>>>   L2 Miss: ~ 28%  (RAM, L2 needs to access DDR chip)
>>>>   Misc   : ~ 23%  (Misc uncategorized stalls)
>>>>   IL     : ~ 20%  (Interlock stalls)
>>>>   L1 I$  : ~ 18%  (16K L1 I$, 1)
>>>>   L1 D$  : ~  9%  (32K L1 D$)
>>>>
>>>> The IL (or Interlock) penalty is the main one that would be affected
>>>> by increasing latency.
>>>
>>> By "interlock stalls" do you mean register RAW dependency stalls?
>>
>> Yeah.
>>
>> Typically:
>> If an ALU operation happens, the result can't be used until 2 clock
>> cycles later;
>> If a Load happens, the result is not available for 3 clock cycles;
>> Trying to use the value before then stalls the frontend stages.
>
> Ok this sounds like you need more forwarding buses.
> Ideally this should allow back-to-back dependent operations.
>

Early on, I did try forwarding ADD results directly from the EX1 stage
(or, directly from the adder's combinatorial logic into the register
forwarding, which is more combinatorial logic feeding back into the ID2
stage).

FPGA timing was not so happy with this sort of thing (it is a lot
happier when there are clock-edges for everything to settle out on).
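
To put a rough number on the trade-off, here is a toy in-order model of interlock stalls. It assumes one instruction issued per cycle and reads "usable N cycles later" as: a consumer can issue no earlier than N cycles after the producer. The function and the example program are made up for illustration.

```python
def interlock_stalls(program, latency):
    """Count stall cycles for a list of (kind, dst, srcs) instructions."""
    ready = {}    # register -> first cycle at which a consumer may issue
    cycle = 0
    stalls = 0
    for kind, dst, srcs in program:
        earliest = max([ready.get(r, 0) for r in srcs], default=0)
        if earliest > cycle:
            stalls += earliest - cycle   # frontend waits for the operand
            cycle = earliest
        if dst is not None:
            ready[dst] = cycle + latency[kind]
        cycle += 1
    return stalls


# Load followed by two back-to-back dependent ALU ops.
program = [
    ("load", "r1", ("r0",)),
    ("alu", "r2", ("r1",)),
    ("alu", "r3", ("r2",)),
]

as_described = interlock_stalls(program, {"alu": 2, "load": 3})
with_ex1_forwarding = interlock_stalls(program, {"alu": 1, "load": 3})
```

In this model, direct forwarding from EX1 only removes the stall between the two dependent ALU ops; the load-use stall is unchanged. Whether that saving is worth it depends on whether the extra combinatorial path costs clock speed.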

>>> As distinct from D$L1 read access stall, if read access time > 1 clock
>>> or multi-cycle function units like integer divide.
>>>
>>
>> The L1 I$ and L1 D$ have different stats, as shown above.
>>
>> Things like DIV and FPU related stalls go in the MISC category.
>>
>> Based on emulator stats (and profiling), I can see that most of the
>> MISC overhead in GLQuake is due to FPU ops like FADD and FMUL and
>> similar.
>>
>>
>> So, somewhere between 5% and 9% of the total clock-cycles here are
>> being spent waiting for the FPU to do its thing.
>>
>> Except for the "low precision" ops, which are fully pipelined (these
>> will not result in any MISC penalty, but may result in an IL penalty
>> if the result is used too quickly).
>>
>>> IIUC you are saying your fpga takes 2 clocks at 50 MHz = 40 ns
>>> to do a 64-bit add, so it uses two pipeline stages for ALU.
>>>
>>
>> It takes roughly 1 cycle internally, so:
>>   ID2 stage: Fetch inputs for the ADD;
>>   EX1 stage: Do the ADD;
>>   EX2 stage: Make result visible to the world.
>>
>> For 1-cycle ops, it would need to forward the result directly from the
>> adder-chain logic or similar into the register forwarding logic. I
>> discovered fairly early on that for things like 64-bit ADD, this is bad.
>
> It should not be bad, you just need to sort out the clock edges and
> forwarding. In a sense these are feedback loops so they just need
> to be self re-enforcing (see below).
>
>> Most operations which "actually do something" thus sort of end up
>> needing a clock-edge for their results to come to rest (causing them
>> to effectively have a 2-cycle latency as far as the running program is
>> concerned).
>
> Ok that shouldn't happen. If your ALU is 1 clock latency then
> back-to-back execution should be possible with a forwarding bus.
>
> You are taking ALU result AFTER its stage output buffer and forwarding
> that back to register read, rather than taking the ALU result BEFORE
> the stage output buffer, and this is introducing an extra clock delay.
>

But, doing it this way makes FPGA timing constraints significantly
happier...

> I'm thinking of this organization:
>
>         Decode Logic
>              |
>              v
>     == Decode Stage Buffer ==

ID2 Stage.

>           Immediate Data
>              |
>              |  |---< Reg File
>              |  |
>              |  |  |---------------
>              v  v  v              |
>          Operand source mux       |
>              | |                  |
>              v v                  |
>     == Reg Read Stage Buffer ==   |

EX1 Stage.

>           Operand values          |
>              | |                  |
>              v v                  |
>              ALU                  |
>               |>-------------------
>               v
>      == ALU Result Stage Buffer ==

Start of EX2 stage, ALU result gets forwarded here...

>
> Notice that the clock edge locks the ALU forwarded value into
> the RR output stage, unless the RR stage output is stalled in which case
> it holds the ALU input stable and therefore also the forwarded result.
>
> This needs to be done consistently across all stages.
> It also needs to forward a load value to RR stage before WB.
>
> It also should be able to forward a register read value, or ALU result,
> or load result, or WB exception address to fetch for branches and jumps,
> again with no extra clocks. So for example
>     JMP reg
> can go directly from RR to Fetch and start fetching the next cycle.
>
> But this forwarding is gold plating, after the pipeline is full.
>

Yeah.

I was experimenting with going the way of "increasing" the effective
latency in some cases, trying to loosen timing enough that I could
hopefully boost the clock speed.

Getting more stuff to flow through the pipeline could be better, but is
the tired old path of continuing to beat on my C compiler...

>
