On Friday, September 29, 2023 at 12:07:47 PM UTC-5, BGB wrote:
> On 9/29/2023 11:02 AM, EricP wrote:
> >
> >
> For stats from a running full simulation (these predate the tweaks;
> running GLQuake with the HW rasterizer):
> ~ 0.48 .. 0.54 bundles/clock;
> ~ 1.10 .. 1.40 instructions/bundle.
<
So, about equal to the 1-wide 1st generation RISC machines, which got
0.7 I/C {including cache misses, delay slots, interlocks, TLB misses.}
>
> Seems to be averaging around 29..32 MIPs at 50MHz (so, ~ 0.604 MIPs/MHz).
>
Probably good for a 1-wide, not so good for a 3-wide.
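As a back-of-the-envelope cross-check (just multiplying the figures
quoted above, nothing measured here), bundles/clock times
instructions/bundle times the clock rate lands in the same ballpark as
the reported MIPs:

#include <stdio.h>

/* Cross-check: bundles/clock * instrs/bundle * MHz ~= MIPs.
   The ranges are the ones quoted above, not measured here. */
int main(void) {
    double bpc_lo = 0.48, bpc_hi = 0.54;   /* bundles per clock */
    double ipb_lo = 1.10, ipb_hi = 1.40;   /* instrs per bundle */
    double mhz = 50.0;
    printf("IPC : %.2f .. %.2f\n", bpc_lo*ipb_lo, bpc_hi*ipb_hi);
    printf("MIPs: %.1f .. %.1f\n", bpc_lo*ipb_lo*mhz, bpc_hi*ipb_hi*mhz);
    /* ~0.53..0.76 IPC -> ~26..38 MIPs, bracketing the 29..32 observed */
    return 0;
}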
>
> Top ranking uses of clock-cycles (for total stall cycles):
> L2 Miss: ~ 28% (RAM, L2 needs to access DDR chip)
> Misc : ~ 23% (Misc uncategorized stalls)
> IL : ~ 20% (Interlock stalls)
> L1 I$ : ~ 18% (16K L1 I$; see note 1 below)
> L1 D$ : ~ 9% (32K L1 D$)
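FWIW, a breakdown like that typically just falls out of a set of
counters bumped once per stalled cycle in the simulation; a minimal
sketch (names invented here, not BGB's actual emulator code):

#include <stdint.h>

/* One counter per stall cause, bumped once per clock in which that
   cause is holding the pipeline. */
enum stall_cause { ST_NONE, ST_L2MISS, ST_MISC, ST_IL, ST_L1I, ST_L1D, ST_MAX };

static uint64_t stall_cycles[ST_MAX];
static uint64_t total_cycles;

void sim_clock_tick(enum stall_cause cause) {
    total_cycles++;
    if (cause != ST_NONE)
        stall_cycles[cause]++;
}

/* At the end of a run, stall_cycles[x] / total_cycles gives the
   percentages in the breakdown above. */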
>
> The IL (or Interlock) penalty is the main one that would be affected
> by increasing latency.
>
> In general, the full simulation simulates pretty much all of the
> hardware modules via Verilator (displaying the VGA output image as
> output, and handling keyboard inputs via a PS/2 interface).
>
> 1: The bigger D$ was better for Doom and similar, but GLQuake seems to
> lean a lot more heavily on the I$. Switching to a HW rasterizer seems
> to have increased this imbalance.
>
>
> At the moment, Doom tends to average slightly higher in terms of MIPs
> scores.
>
>
>
> As for emulator stats, the main instructions with high interlock
> penalties are:
> MOV.Q, MOV.L, ADD
>
> MOV.Q and MOV.L seem to be spending around half of their clock-cycles on
> interlocks, so around ~2 cycles average.
>
> ADD seems to be spending around 1/3 of its cycles on interlocks (the
> main ALU ops have had a 2-cycle cost since early on).
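The interlock cost is conceptually just "when the producer's result is
ready" vs. "when the consumer wants it"; a rough scoreboard-style
sketch of that check (the register/latency bookkeeping here is
invented for illustration):

#include <stdint.h>

#define NREGS 64

/* Cycle at which each register's pending result becomes available. */
static uint64_t reg_ready_cycle[NREGS];

/* How many cycles a consuming instruction must stall on its sources. */
uint64_t interlock_stall(uint64_t now, const int *src_regs, int n_src) {
    uint64_t stall = 0;
    for (int i = 0; i < n_src; i++) {
        uint64_t ready = reg_ready_cycle[src_regs[i]];
        if (ready > now && ready - now > stall)
            stall = ready - now;
    }
    return stall;
}

/* On issue, mark the destination with the op's latency; bumping MOV/ADD
   from 1 to 2 cycles directly grows the stalls seen above. */
void issue(uint64_t now, int dst_reg, int latency) {
    reg_ready_cycle[dst_reg] = now + latency;
}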
>
>
>
> The SHxD.x operators were increased to 2 cycles by this change, but are
> generally a bit further down the list so that they don't hurt as much.
>
> I did notice an increase in time spent (per the millisecond counters)
> in the boot-up sequence, but this seems mostly due to the CONV
> operations.
>
>
> This appears most likely due to the "2-register MOV" instruction, which
> itself uses around 3% of the total cycle budget (as a 1-cycle op), so
> likely isn't helped by being demoted to 2 cycles.
>
> I may need to add a special case here to allow for either "MOV"
> specifically, or a subset of CONV ops (such as MOV and sign/zero
> extension), to remain as 1-cycle ops (EXTS.L and EXTU.L often being
> used in place of MOV for "int" and "unsigned int" to make sure they
> remain properly extended).
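For reference, the reason EXTS.L/EXTU.L can stand in for MOV here is
that 32-bit values are expected to stay properly extended in the 64-bit
registers; in C terms the two ops behave roughly like (a sketch of the
semantics, not the actual implementation):

#include <stdint.h>

/* Take the low 32 bits of the source and sign- or zero-extend them
   into the full 64-bit destination. */
int64_t  exts_l(int64_t  src) { return (int32_t)src;  }  /* "int"          */
uint64_t extu_l(uint64_t src) { return (uint32_t)src; }  /* "unsigned int" */

/* So copying an "int" variable uses EXTS.L rather than a plain MOV,
   guaranteeing the upper 32 bits still match the sign of the low 32. */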
> > I don't know what diagnostic probes you have. I would want to see
> > what each stage is doing in real time as it single stepped each clock,
> > see which stage buffers are valid or empty,
> > where stalls are originating and propagating.
> > Essentially the same info a cycle accurate simulator would show you.
> >
> I have my emulator, which I try to keep cycle-accurate (though it
> hasn't been updated for this tweak yet).
>
> It doesn't display any real-time information, but rather dumps a big
> mess of stats at the end.
> > That information can be used to guide where, for example,
> > a limited budget of forwarding buses or extra 64-bit adders
> > might best be utilized to eliminate bubbles and increase the IPC.
> >
> > If whole-pipeline stalls are eating your IPC then maybe it doesn't
> > need elastic buffers on all stages, but maybe on just one stage
> > after RR to decouple the MEM stalls from IF-ID-RR stages.
> >
> I had considered possibly redesigning the pipeline at one point to
> allow a different mechanism for handling L1 misses: namely, marking
> the destination registers of missed loads as "not ready" and then
> stalling the Fetch/Decode stages if a bundle would depend on a
> not-ready register (injecting the fetched data back into the pipeline
> once the load completes).
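In outline, that scheme hangs a per-register "ready" bit off the
register file and only stalls when a bundle actually names a pending
register; a minimal sketch of the idea (not an actual redesign):

#include <stdint.h>
#include <stdbool.h>

/* One ready bit per register: a missed load clears its destination's
   bit, the cache fill sets it again. */
static uint64_t reg_ready = ~0ull;   /* all registers start ready */

void load_miss(int dst_reg) { reg_ready &= ~(1ull << dst_reg); }
void load_fill(int dst_reg) { reg_ready |=  (1ull << dst_reg); }

/* Fetch/Decode only stall if the bundle touches a not-ready register;
   independent work keeps flowing past the miss. */
bool bundle_must_stall(const int *regs, int n) {
    for (int i = 0; i < n; i++)
        if (!((reg_ready >> regs[i]) & 1))
            return true;
    return false;
}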
>
> However, redesigning the main pipeline would not be a small task.
>
>
>
> However, as-is, my fiddling still falls well short of being able to
> boost the clock speed to 75 MHz.
>
> Though, it looks like 75 MHz would be able to boost GLQuake mostly into
> double-digit framerate territory (assuming I can do so without wrecking
> the L1 caches or similar... which was always the problem in the past).
>
>
> If the L1 caches are dropped to 2K or so, timing gets easier, but they
> have a significantly higher miss rate, and then most of the clock
> cycles end up going into dealing with L1 misses.
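The trade-off is the usual average-memory-access-time one: the extra
misses from a tiny L1 can easily eat whatever a 50 -> 75 MHz bump buys.
Purely illustrative numbers (the miss rates and penalty below are
guesses, not measurements from the simulation):

#include <stdio.h>

/* AMAT = hit_time + miss_rate * miss_penalty, in cycles. */
int main(void) {
    double penalty  = 40.0;                   /* cycles per L1 miss (guess) */
    double amat_big = 1.0 + 0.03 * penalty;   /* ~3% miss rate, larger L1   */
    double amat_2k  = 1.0 + 0.10 * penalty;   /* ~10% miss rate, 2K L1      */
    printf("larger L1: %.1f cycles/access\n", amat_big);  /* ~2.2 */
    printf("2K L1    : %.1f cycles/access\n", amat_2k);   /* ~5.0 */
    /* A ~2x worse AMAT outweighs a 1.5x clock-speed gain. */
    return 0;
}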
>
> ...