On Saturday, September 30, 2023 at 12:50:55 PM UTC-5, BGB wrote:
> On 9/30/2023 11:04 AM, EricP wrote:
> > BGB wrote:
> >> On 9/29/2023 2:02 PM, EricP wrote:
> >>> BGB wrote:
> >>>>>>
> >>>>>> Any thoughts?...
> >>>>>
> >>>>> It's not just the MHz but the IPC you need to think about.
> >>>>> If you are running at 50 MHz but with an actual IPC of 0.1 due to
> >>>>> stalls and pipeline bubbles, then that's really just 5 MIPS.
> >>>>>
> >>>>
> >>>> Stats from a full simulation run (predating these tweaks, running
> >>>> GLQuake with the HW rasterizer):
> >>>> ~ 0.48 .. 0.54 bundles/clock;
> >>>> ~ 1.10 .. 1.40 instructions/bundle.
> >>>>
> >>>> Seems to be averaging around 29..32 MIPs at 50MHz (so, ~ 0.604
> >>>> MIPs/MHz).
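<
A quick back-of-envelope on how those figures combine; the little C helper
below is just my arithmetic on the quoted numbers, not anything from the
stats dump::

  #include <stdio.h>

  /* MIPS ~= bundles-per-clock * instructions-per-bundle * clock (MHz) */
  static double est_mips(double bpc, double ipb, double mhz)
  {
      return bpc * ipb * mhz;
  }

  int main(void)
  {
      printf("low : %.1f MIPS\n", est_mips(0.48, 1.10, 50.0)); /* ~26.4 */
      printf("high: %.1f MIPS\n", est_mips(0.54, 1.40, 50.0)); /* ~37.8 */
      printf("mid : %.1f MIPS\n", est_mips(0.51, 1.25, 50.0)); /* ~31.9 */
      return 0;
  }

which brackets the quoted 29..32 MIPs at 50 MHz.
<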
> >>>
> >>> Oh that's pretty efficient then.
> >>> In the past you had made comments which made it sound like
> >>> the TLB, cache, and DRAM controller were all hung off of what
> >>> you called your "ring bus", which sounded like a token ring,
> >>> and that the RB consumed many cycles of latency.
> >>> That gave me the impression of frequent, large stalls to cache,
> >>> lots of bubbles, leading to low IPC.
> >>>
> >>
> >> It does diminish IPC, but not as much as my older bus...
> >
> > Oh I thought this was 1-wide but I see elsewhere that it is 3-wide.
> > That's not that efficient. I was thinking you were getting an IPC
> > of 0.5 out of ~0.7, the maximum possible with 1 register write port.
> > A 3-wide should get an IPC > 1.0 but since you only have 1 RF write port
> > that pretty much bottlenecks you at WB/Retire to < 1.0.
> >
> There are 3 write ports to the register file.
>
> However, they only see much use when the code can actually exploit them,
> which, for the most part, my C compiler doesn't. It basically emits normal
> 1-wide RISC style code, then tries to jostle the instructions around and
> put them in bundles.
>
> Results are pretty mixed, and it only really works if the code is
> written in certain ways.
>
>
> Ironically, for GLQuake, most of the ASM was in areas that dropped off
> the map when switching to a hardware rasterizer; so the part of the
> OpenGL pileline that remains, is mostly all the stuff that was written
> in C (with a few random bits of ASM thrown in).
> > I suspect those ring bus induced bubbles are likely killing your IPC.
> > Fiddling with the internals won't help if the pipeline is mostly empty.
> >
> Ringbus latency doesn't matter when there are no L1 misses...
> > I suggest the primary thing to think about for the future is getting the
> > pipeline as full as possible. Then consider making it more efficient
> > internally, adding more write register ports so you can retire > 1.0 IPC
> > (there is little point in having 3 lanes if you can only retire 1/clock).
> > Then thirdly start looking at things like forwarding buses.
> >
> Well, would be back to a lot more fiddling with my C compiler in this case.
>
> As noted, the ISA in question is statically scheduled, so depends mostly
> on either the compiler or ASM programmer to do the work.
> >> It seems like, if there were no memory-related overheads (if the L1
> >> always hit), it would, as-is, be in the area of 22% faster.
> >>
> >> L1 misses are still not good though, but this is true even on a modern
> >> desktop PC.
> >
> > The cache miss rate may not be the primary bottleneck.
> > Are you using the ring bus to talk to the TLBs, I$L1, D$L1, L2, etc.?
> >
> L1 caches are mounted directly to the pipeline, and exist in EX1..EX3
> stages.
>
> So:
> PF IF ID1 ID2 EX1 EX2 EX3 WB
> Or, alternately:
> PF IF ID RF EX1 EX2 EX3 WB
>
> So, access is like:
> EX1: Calculate address, send request to L1 cache;
> EX2: Cache checks hit/miss, extracts data for load, prepares for store.
> This is the stage where the pipeline stall is signaled on miss.
> EX3: Data fetched for Load, final cleanup.
> Final cleanup: Sign-extension, Binary32->Binary64 conversion, etc.
> Data stored back into L1 arrays here (on next clock edge).
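<
A sketch of that EX1/EX2/EX3 load sequencing as a toy C model; this is my
reading of the description above, not BGB's Verilog, and the cache geometry
and names are made up::

  #include <stdint.h>
  #include <stdio.h>

  #define LINES      512            /* e.g. 32K / 64B lines (made up)  */
  #define LINE_SHIFT 6

  typedef struct { uint32_t tag; int valid; int64_t data; } line_t;
  static line_t dcache[LINES];

  /* EX1: address generation, pick the set, send request to the L1 */
  static unsigned ex1_index(uint32_t addr) {
      return (addr >> LINE_SHIFT) % LINES;
  }

  /* EX2: tag compare -- this is where a miss raises the stall */
  static int ex2_hit(uint32_t addr, unsigned idx) {
      return dcache[idx].valid && dcache[idx].tag == (addr >> LINE_SHIFT);
  }

  /* EX3: final cleanup -- sign extension, format conversion, ... */
  static int64_t ex3_fixup(int64_t raw) {
      return (int32_t)raw;          /* e.g. a sign-extending 32-bit load */
  }

  int main(void) {
      uint32_t addr = 0x1040;
      unsigned idx  = ex1_index(addr);               /* cycle N   (EX1) */
      if (!ex2_hit(addr, idx))                       /* cycle N+1 (EX2) */
          printf("miss: stall, request goes out on the ring\n");
      else
          printf("value = %lld\n",                   /* cycle N+2 (EX3) */
                 (long long)ex3_fixup(dcache[idx].data));
      return 0;
  }
<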
> > Some questions about your L1 cache:
> >
> > In clocks, what are the I$L1 and D$L1 read and write hit latencies,
> > and the total access latency including ring bus overhead?
> > And is the D$L1 store pipelined?
> >
> Loads and stores are pipelined.
>
> TLB doesn't matter yet; the L1 caches are virtually indexed and tagged.
> > Do you use the same basic design for your 2-way assoc. TLB
> > as the L1 cache, so the same numbers apply?
> >
> > And do you pipeline the TLB lookup in one stage, and D$L1 access in a
> > second?
> >
> TLB is a separate component external to the L1 caches, and performs
> translation on L1 miss.
>
> It has a roughly 3 cycle latency.
<
So, you take a 2-cycle look at the L1 tags, and if you are going to miss,
you then take a 3-cycle access of the TLB so you can "get on" the ring-bus.
So, AGEN to PA is 5 cycles.
<
> 1: Request comes in, setup for fetch from TLB arrays;
> 2: Check for TLB hit/miss, raise exception on miss;
> 3: Replace original request with translated request.
> Output is on the clock-edge following the 3rd cycle.
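<
Spelled out as a software model, that 3-step sequence looks roughly like the
sketch below; the geometry, page size, and names are my guesses, not the
actual design::

  #include <stdint.h>
  #include <stdio.h>

  #define TLB_SETS 64
  #define TLB_WAYS 2                     /* matches "2-way assoc." below */

  typedef struct { uint64_t vpn, ppn; int valid; } tlbe_t;
  static tlbe_t tlb[TLB_SETS][TLB_WAYS];

  /* returns 1 and fills *pa on a hit; 0 means raise the TLB-miss
     exception and leave the original request stalled. */
  static int tlb_translate(uint64_t va, uint64_t *pa)
  {
      uint64_t vpn = va >> 14;                   /* assume 16K pages     */
      tlbe_t  *set = tlb[vpn % TLB_SETS];        /* cycle 1: index array */
      for (int w = 0; w < TLB_WAYS; w++)         /* cycle 2: tag compare */
          if (set[w].valid && set[w].vpn == vpn) {
              *pa = (set[w].ppn << 14) | (va & 0x3FFF);  /* cycle 3: PA  */
              return 1;
          }
      return 0;
  }

  int main(void)
  {
      tlb[1][0] = (tlbe_t){ .vpn = 1, .ppn = 0x200, .valid = 1 };
      uint64_t pa;
      if (tlb_translate(0x5040, &pa))            /* VA 0x5040 -> VPN 1 */
          printf("PA = %#llx\n", (unsigned long long)pa);
      return 0;
  }
<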
> > I'm suggesting that your primary objective is making that pathway from the
> > Load Store Unit (LSU) to TLB to D$L1 as simple and efficient as possible.
> > So a direct 1:1 connect, zero bus overhead and latency, just cache latency.
> >
> > Such that ideally it takes 2 pipelined stages for a cache read hit,
> > and if the D$L1 read hit is 1 clock that the load-to-use
> > latency is 2 clocks (or at least that is possible), pipelined.
> >
> > And that a store is passed to D$L1 in 1 clock,
> > and then the LSU can continue while the cache deals with it.
> > The cache bus handshake would go "busy" until the store is complete.
> > Also ideally store hits would pipeline the tag and data accesses
> > so back-to-back store hits take 1 clock (but that's getting fancy).
> >
> There is no LSU in this design, or effectively, the L1 cache itself
> takes on this role.
> >> I suspect ringbus overhead is diminishing the efficiency of external
> >> RAM access, as both the L2 memcpy and DRAM stats tend to be lower than
> >> the "raw" speed of accessing the RAM chip (in the associated unit tests).
> >
> > At the start this ring bus might have been a handy idea by
> > making it easy to experiment with different configurations, but I
> > think you should be looking at direct connections whenever possible.
> >
> Within the core itself, everything is bolted directly to the main pipeline.
>
> External to this, everything is on the ringbus.
>
> As noted, when there are no cache misses and no MMIO access or similar,
> the bus isn't really involved.
>
>
> But, yeah, I am left to realize that, say, driving the L2 cache with a
> FIFO might have been better for performance (rather than just letting
> requests circle the ring until they can be handled).
<
You know that this allows for un-ordered memory accesses--ACK.
PAs get to the memory banks in an order different from the order in which
the misses occurred--the CDC 6600 had these effects and the CDC 7600 got
rid of them.
<
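For what it's worth, the FIFO variant is tiny; something like the sketch
below (fields and depth are mine) queues misses in arrival order, so they
reach the banks in the order they occurred rather than in whatever order
the ring lets them off::

  #include <stdint.h>
  #include <stdio.h>

  #define QDEPTH 16u                      /* must be a power of two */

  typedef struct { uint64_t pa; int is_store; } memreq_t;

  static memreq_t q[QDEPTH];
  static unsigned head, tail;             /* head: service, tail: enqueue */

  static int l2q_push(memreq_t r)         /* called on an L1 miss */
  {
      if (tail - head == QDEPTH) return 0;  /* full: stall the requester */
      q[tail++ % QDEPTH] = r;
      return 1;
  }

  static int l2q_pop(memreq_t *r)         /* L2 services oldest first */
  {
      if (head == tail) return 0;
      *r = q[head++ % QDEPTH];
      return 1;
  }

  int main(void)
  {
      l2q_push((memreq_t){ .pa = 0x1000, .is_store = 0 });
      l2q_push((memreq_t){ .pa = 0x2040, .is_store = 1 });
      for (memreq_t r; l2q_pop(&r); )
          printf("service %s of PA %#llx\n",
                 r.is_store ? "store" : "load", (unsigned long long)r.pa);
      return 0;
  }
<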
> >>
> >>>> Top ranking uses of clock-cycles (for total stall cycles):
> >>>> L2 Miss: ~ 28% (RAM, L2 needs to access DDR chip)
> >>>> Misc : ~ 23% (Misc uncategorized stalls)
> >>>> IL : ~ 20% (Interlock stalls)
> >>>> L1 I$ : ~ 18% (16K L1 I$, 1)
> >>>> L1 D$ : ~ 9% (32K L1 D$)
> >>>>
> >>>> The IL (or Interlock) penalty is the main one that would be affected
> >>>> by increasing latency.
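<
Crude what-if arithmetic on that breakdown: if stalls are about half of all
cycles (per the ~0.5 bundles/clock figure quoted earlier), then removing a
category worth X% of the stall cycles removes roughly 0.5*X of total cycles.
The percentages below are the quoted ones; the 0.5 is my assumption::

  #include <stdio.h>

  static double speedup_if_removed(double stall_share, double stall_frac)
  {
      return 1.0 / (1.0 - stall_share * stall_frac);
  }

  int main(void)
  {
      const double stall_frac = 0.5;  /* assumed: ~half of cycles stall */
      printf("no L2 misses : %.2fx\n", speedup_if_removed(0.28, stall_frac));
      printf("no interlocks: %.2fx\n", speedup_if_removed(0.20, stall_frac));
      printf("no L1 misses : %.2fx\n",
             speedup_if_removed(0.18 + 0.09, stall_frac));
      return 0;
  }
<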
> >>>
> >>> By "interlock stalls" do you mean register RAW dependency stalls?
> >>
> >> Yeah.
> >>
> >> Typically:
> >> If an ALU operation happens, the result can't be used until 2 clock
> >> cycles later;
> >> If a Load happens, the result is not available for 3 clock cycles;
> >> Trying to use the value before then stalls the frontend stages.
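<
Those two rules are easy to put numbers on. A trivial cycle-count sketch
(latencies from the post; everything else made up) for a naive in-order
dependent chain::

  #include <stdio.h>

  #define LAT_ALU  2        /* ALU result usable 2 cycles after issue  */
  #define LAT_LOAD 3        /* load result usable 3 cycles after issue */

  int main(void)
  {
      int issue = 0, r4_ready, r6_ready;

      /* LOAD R4, [R5] */
      r4_ready = issue + LAT_LOAD;              /* R4 usable at cycle 3 */
      issue++;

      /* ADD R6, R4, 1 -- wants cycle 1, must wait for R4 */
      if (issue < r4_ready) issue = r4_ready;   /* 2 interlock cycles   */
      r6_ready = issue + LAT_ALU;
      issue++;

      /* ADD R7, R6, 2 -- wants cycle 4, must wait for R6 */
      if (issue < r6_ready) issue = r6_ready;   /* 1 interlock cycle    */
      issue++;

      printf("last op issues at cycle %d (vs 2 with no interlocks)\n",
             issue - 1);
      return 0;
  }

The static scheduler's whole job is to find independent ops to drop into
those interlock slots.
<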
> >
> > Ok this sounds like you need more forwarding buses.
> > Ideally this should allow back-to-back dependent operations.
> >
> Early on, I did try forwarding ADD results directly from the EX1 stage
> (or, directly from the adder's combinatorial logic into the register
> forwarding, which is more combinatorial logic feeding back into the ID2
> stage).
>
> FPGA timing was not so happy with this sort of thing (it is a lot
> happier when there are clock-edges for everything to settle out on).
> >>> As distinct from D$L1 read access stall, if read access time > 1 clock
> >>> or multi-cycle function units like integer divide.
> >>>
> >>
> >> The L1 I$ and L1 D$ have different stats, as shown above.
> >>
> >> Things like DIV and FPU related stalls go in the MISC category.
> >>
> >> Based on emulator stats (and profiling), I can see that most of the
> >> MISC overhead in GLQuake is due to FPU ops like FADD and FMUL and
> >> similar.
> >>
> >>
> >> So, somewhere between 5% and 9% of the total clock-cycles here are
> >> being spent waiting for the FPU to do its thing.
> >>
> >> Except for the "low precision" ops, which are fully pipelined (these
> >> will not result in any MISC penalty, but may result in an IL penalty
> >> if the result is used too quickly).
> >>
> >>> IIUC you are saying your FPGA takes 2 clocks at 50 MHz = 40 ns
> >>> to do a 64-bit add, so it uses two pipeline stages for ALU.
> >>>
> >>
> >> It takes roughly 1 cycle internally, so:
> >> ID2 stage: Fetch inputs for the ADD;
> >> EX1 stage: Do the ADD;
> >> EX2 stage: Make result visible to the world.
> >>
> >> For 1-cycle ops, it would need to forward the result directly from the
> >> adder-chain logic or similar into the register forwarding logic. I
> >> discovered fairly early on that for things like 64-bit ADD, this is bad.
> >
> > It should not be bad, you just need to sort out the clock edges and
> > forwarding. In a sense these are feedback loops so they just need
> > to be self-reinforcing (see below).
> >
> >> Most operations which "actually do something" thus sort of end up
> >> needing a clock-edge for their results to come to rest (causing them
> >> to effectively have a 2-cycle latency as far as the running program is
> >> concerned).
> >
> > Ok that shouldn't happen. If your ALU is 1 clock latency then
> > back-to-back execution should be possible with a forwarding bus.
> >
> > You are taking ALU result AFTER its stage output buffer and forwarding
> > that back to register read, rather than taking the ALU result BEFORE
> > the stage output buffer, and this is introducing an extra clock delay.
> >
> But, doing it this way makes FPGA timing constraints significantly
> happier...
> > I'm thinking of this organization:
> >
> > Decode Logic
> > |
> > v
> > == Decode Stage Buffer ==
> ID2 Stage.
> > Immediate Data
> > |
> > | |---< Reg File
> > | |
> > | | |---------------
> > v v v |
> > Operand source mux |
> > | | |
> > v v |
> > == Reg Read Stage Buffer == |
> EX1 Stage
> > Operand values |
> > | | |
> > v v |
> > ALU |
> > |>-------------------
> > v
> > == ALU Result Stage Buffer ==
> Start of EX2 stage, ALU result gets forwarded here...
<
Think of the Adder->Result->Forward->Operand loop as four 1/4-cycle units
of work. The thing is a logic loop that needs one point of clock
synchronization, and I have seen the flip-flops put in 3 of the 4 possible
places::
1) flop->operand->adder->result->forward->flop
2) operand->flop->adder->result->forward->operand
3) operand->adder->flop->result->forward->operand
but nothing rules out::
4) operand->adder->result->flop->forward->operand
<
And there are other variants when you have latches.......
<
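In C-as-pseudocode (since I can't draw RTL here), the cost difference
between forwarding from before vs. after that flop is just the effective
latency a dependent chain sees; nothing here is anyone's actual pipeline::

  #include <stdio.h>

  /* issue-to-issue latency seen by a dependent op: 1 if the forward is
     taken ahead of the stage output flop, 2 if taken behind it. */
  static int chain_cycles(int n_ops, int fwd_latency)
  {
      return (n_ops - 1) * fwd_latency + 1;
  }

  int main(void)
  {
      printf("8 dependent adds, forward pre-flop : %d cycles\n",
             chain_cycles(8, 1));   /* 8  */
      printf("8 dependent adds, forward post-flop: %d cycles\n",
             chain_cycles(8, 2));   /* 15 */
      return 0;
  }

Which is where BGB's observed 2-cycle latency "as far as the running program
is concerned" comes from, traded against FPGA timing closure.
<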
> > Notice that the clock edge locks the ALU forwarded value into
> > the RR output stage, unless the RR stage output is stalled in which case
> > it holds the ALU input stable and therefore also the forwarded result.
> >
> > This needs to be done consistently across all stages.
> > It also needs to forward a load value to RR stage before WB.
> >
> > It also should be able to forward a register read value, or ALU result,
> > or load result, or WB exception address to fetch for branches and jumps,
> > again with no extra clocks. So for example
> > JMP reg
> > can go directly from RR to Fetch and start fetching the next cycle.
> >
> > But this forwarding is gold plating, after the pipeline is full.
> >
> Yeah.
>
> I was experimenting with going the way of "increasing" the effective
> latency in some cases, trying to loosen timing enough that I could
> hopefully boost the clock speed.
>
>
> Getting more stuff to flow through the pipeline could be better, but is
> the tired old path of continuing to beat on my C compiler...
>
>
> >
--- G2/1.0
* Origin: usenet.network (2:5075/128)