On 9/29/2023 2:02 PM, EricP wrote:
> BGB wrote:
>> On 9/29/2023 11:02 AM, EricP wrote:
>>> BGB wrote:
>>>> I recently had an idea (one that small-scale testing doesn't require
>>>> redesigning my whole pipeline to evaluate):
>>>> If one delays nearly all of the operations to at least a 2-cycle
>>>> latency, then seemingly the timing gets a fair bit better.
>>>>
>>>> In particular, a few 1-cycle units:
>>>>   SHAD (bitwise shift and similar)
>>>>   CONV (various bit-repacking instructions)
>>>> were delayed to 2 cycles:
>>>>   SHAD: 2-cycle latency doesn't have much obvious impact;
>>>>   CONV: minor impact, I suspect due to delaying MOV-2R and EXTx.x
>>>>   and similar. I could special-case these in Lane 1.
>>>>
>>>>
>>>> There was already a slower CONV2 path which had mostly dealt with
>>>> things like FPU format conversion and other "more complicated"
>>>> format converters, so the CONV path had mostly been left for
>>>> operations that mostly involved shuffling the bits around (and the
>>>> simple case 2-register MOV instruction and similar, etc).
>>>>
>>>> Note that most ALU ops were already generally 2-cycle as well.
>>>>
>>>>
>>>> Partly this idea was based on the observation that adding the logic
>>>> for a BSWAP.Q instruction to the CONV path had a disproportionate
>>>> impact on LUT cost and timing. The actual logic in this case is very
>>>> simple (mostly shuffling the bytes around), so theoretically should
>>>> not have had as big of an impact.
>>>>
>>>>
>>>> Testing this idea, thus far, isn't enough to get the clock boosted
>>>> to 75MHz, but did seemingly help here, and has seemingly redirected
>>>> the "worst failing paths" from being through the D$->EXn->RF
>>>> pipeline, over to being D$->ID1.
>>>>
>>>> Along with paths from the input to the output side of the
>>>> instruction decoder. Might also consider disabling the (mostly not
>>>> used for much) RISC-V decoders, and see if this can help.
>>>>
>>>> Had also now disabled the LDTEX instruction, now as it is "somewhat
>>>> less important" if TKRA-GL is mapped through a hardware rasterizer
>>>> module.
>>>>
>>>>
>>>> And, thus far, unlike past attempts in this area, this approach
>>>> doesn't effectively ruin the performance of the L1 D$.
>>>>
>>>>
>>>> Seems like one could possibly try to design a core around this
>>>> assumption, avoiding any cases where combinatorial logic feeds into
>>>> the register-forwarding path (or, cheaper still, not have any
>>>> register forwarding; but giving every op a 3-cycle latency would be
>>>> a little steep).
>>>>
>>>> Though, one possibility could be to disable register forwarding from
>>>> Lane 3, in which case only interlocks would be available.
>>>> This would work partly as Lane 3 isn't used anywhere near as often
>>>> as Lanes 1 or 2.
>>>>
>>>> ....
>>>>
>>>>
>>>>
>>>> Any thoughts?...
>>>
>>> Its not just the MHz but the IPC you need to think about.
>>> If you are running at 50 MHz but an actual IPC of 0.1 due to
>>> stalls and pipeline bubbles then that's really just 5 MIPS.
>>>
>>
>> For stats from a running full simulation (predating these
>> tweaks, running GLQuake with the HW rasterizer):
>> ~ 0.48 .. 0.54 bundles/clock;
>> ~ 1.10 .. 1.40 instructions/bundle.
>>
>> Seems to be averaging around 29..32 MIPs at 50MHz (so, ~ 0.604 MIPs/MHz).
>
> Oh that's pretty efficient then.
> In the past you had made comments which made it sound like
> having tlb, cache, and dram controller all hung off of what
> you called your "ring bus", which sounded like a token ring,
> and that the RB consumed many cycles latency.
> That gave me the impression of frequent, large stalls to cache,
> lots of bubbles, leading to low IPC.
>
It does diminish IPC, but not as much as my older bus...
It seems like, if there were no memory-related overheads (if the L1
always hit), it would be in the area of 22% faster.
L1 misses are still not good, though this is true even on a modern
desktop PC.
I suspect ringbus efficiency is diminishing the efficiency of external
RAM access, as both the L2 memcpy and DRAM stats tend to be lower than
the "raw" speed of accessing the RAM chip (in the associated unit tests).
However, if L1 cache sizes are reduced, performance tends to go in the
toilet.
As-is, performance is "not awful", and I am frequently getting upwards
of 20 fps in Doom (where Doom seems to max out at 35 fps).
Earlier on, somehow I had thought Doom would also get 20-30 fps on a
386, but I have since realized that this was not the case... (hence why
Doom has a feature to make the screen size smaller).
By most metrics, it seems like I am outperforming a 386; some stats
seem to imply that I am getting performance on par with early PowerPC
systems (though, I don't have any way to test this).
Did see a video not too long ago where someone ran Doom on a "NeXTcube",
and it was pretty obvious that it was running at single-digit
framerates. So, it seems like I am not doing too badly.
In my childhood, a 486DX2-66 had no trouble running Doom, so I had sort
of thought this was the general experience (though I do remember that
trying to run Quake on the thing "sucked pretty bad"...).
But, in contrast, the original PlayStation had a 33 MHz MIPS R3000 CPU
and seemingly had no real difficulty running fully 3D games like
"MegaMan Legends" and similar.
Granted, I don't know what gamedev practices were like on the
PlayStation, like if it was C or ASM, or if people accessed the hardware
directly or went through APIs along similar lines to Windows GDI or
OpenGL or DirectX or similar...
As noted, in my case, I was going for an API design partly inspired by
the Windows GDI and also supporting OpenGL. Though, the design differs
slightly in that the practice is to render into a GL context, read the
raster image from the GL context, and then use the "GDI" calls to copy
it over into the screen "window" (by passing a "BITMAPINFOHEADER" pointer
along with a pointer to the buffer holding the raster image).
Though, it is possible things could be faster if it were possible to
draw directly into screen memory; this is foiled somewhat by me not
using a raster-oriented framebuffer for the video memory.
Possible eventual TODO would be to add some raster-oriented display
modes. Maybe also adding a call along similar lines to "SwapBuffers";
but then this would require a mechanism to change the location of the
screen's framebuffer in memory (I guess probably a better option for
performance than "memcpy'ing the screen contents or similar").
Will imagine that the original PlayStation probably did not have these
issues...
Though, did recently go and add support for multitexturing, which does
make lightmap rendering a little more practical (can use lightmaps with
less performance impact).
However, dynamic lighting effects don't work so well, as every time
GLQuake tries redrawing and reuploading the dynamic lightmaps,
performance tanks (but, setting "r_dynamic" to 0 is sort of lacking in
animated lighting effects).
I guess one possible option could be to modify it to render 2 versions
of each lightmap:
  One with dynamic light-sources on;
  One with dynamic light-sources off;
Then, weight and average the applicable "lightstyle" values, and then
use a threshold to select which of the lightmap textures to use.
This would at least deal with light styles like "flourospark", but not
so much the gentle strobe or flicker effects, but alas.
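A rough sketch of that selection logic might look like the following
(names like surf_t and stylevalue[] are hypothetical stand-ins, not
actual GLQuake code):

```c
#include <assert.h>

#define MAXLIGHTMAPS 4

/* Hypothetical surface record holding both prebuilt lightmaps. */
typedef struct {
    int      styles[MAXLIGHTMAPS]; /* lightstyle indices, -1 = unused */
    unsigned tex_static;           /* lightmap, dynamic sources off   */
    unsigned tex_dynamic;          /* lightmap, dynamic sources on    */
} surf_t;

/* stylevalue[] holds the animated per-frame brightness of each
   lightstyle, 0..255.  Average the styles affecting the surface and
   threshold at half brightness to pick which lightmap to bind. */
unsigned select_lightmap(const surf_t *s, const int *stylevalue)
{
    int sum = 0, n = 0;
    for (int i = 0; i < MAXLIGHTMAPS && s->styles[i] >= 0; i++) {
        sum += stylevalue[s->styles[i]];
        n++;
    }
    int avg = n ? (sum / n) : 0;
    return (avg >= 128) ? s->tex_dynamic : s->tex_static;
}
```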
Then again, as-is I will probably stick with vertex lighting, as this is
faster.
Also, using the "glBegin/glEnd/glVertex" rendering strategy has some
amount of overhead (vertex arrays are faster; but the parts of the
engine that were modified to use vertex arrays don`t currently support
the lightmap modes).
There is still some overhead in that the engine walks the BSP and
rebuilds the vertex arrays each frame (had considered possibly modifying
it to cache the vertex arrays, and only rebuild them when the player
moves into a different PVS, but hadn't gotten around to experimenting
with this).
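That caching idea could be sketched roughly as below (all names here are
hypothetical stand-ins, not actual engine code):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-viewpoint cache of built vertex arrays. */
typedef struct {
    int    cluster;   /* PVS cluster the arrays were built for; -1 = none */
    float *verts;     /* cached vertex array (filled in by the rebuild)   */
    int    rebuilds;  /* counts rebuild passes, for illustration          */
} va_cache_t;

/* Stand-in for the engine's BSP walk that rebuilds the arrays. */
static void rebuild_vertex_arrays(va_cache_t *c)
{
    c->rebuilds++;
    /* ... walk the BSP, append visible faces to c->verts ... */
}

/* Only rebuild when the viewpoint has moved into a different cluster. */
const float *get_vertex_arrays(va_cache_t *c, int view_cluster)
{
    if (c->cluster != view_cluster) {
        rebuild_vertex_arrays(c);
        c->cluster = view_cluster;
    }
    return c->verts;
}
```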
>> Top ranking uses of clock-cycles (for total stall cycles):
>> L2 Miss: ~ 28% (RAM, L2 needs to access DDR chip)
>> Misc : ~ 23% (Misc uncategorized stalls)
>> IL : ~ 20% (Interlock stalls)
>> L1 I$ : ~ 18% (16K L1 I$, 1)
>> L1 D$ : ~ 9% (32K L1 D$)
>>
>> The IL (or Interlock) penalty is the main one that would be affected
>> by increasing latency.
>
> By "interlock stalls" do you mean register RAW dependency stalls?
Yeah.
Typically:
If an ALU operation happens, the result can't be used until 2 clock
cycles later;
If a Load happens, the result is not available for 3 clock cycles;
Trying to use the value before then stalls the frontend stages.
> As distinct from D$L1 read access stall, if read access time > 1 clock
> or multi-cycle function units like integer divide.
>
The L1 I$ and L1 D$ have different stats, as shown above.
Things like DIV and FPU related stalls go in the MISC category.
Based on emulator stats (and profiling), I can see that most of the MISC
overhead in GLQuake is due to FPU ops like FADD and FMUL and similar.
So, somewhere between 5% and 9% of the total clock-cycles here are being
spent waiting for the FPU to do its thing.
Except for the "low precision" ops, which are fully pipelined (these
will not result in any MISC penalty, but may result in an IL penalty if
the result is used too quickly).
>> In general, the full simulation simulates pretty much all of the
>> hardware modules via Verilator (displaying the VGA output image as
>> output, and handling keyboard inputs via a PS/2 interface).
>>
>> 1: The bigger D$ was better for Doom and similar, but GLQuake seems to
>> lean in a lot more to the I$. Switching to a HW rasterizer seems to
>> have increased this imbalance.
>>
>>
>> At the moment, Doom tends to average slightly higher in terms of MIPs
>> scores.
>>
>>
>>
>> As for emulator stats, the main instructions with high interlock
>> penalties are:
>> MOV.Q, MOV.L, ADD
>>
>> MOV.Q and MOV.L seem to be spending around half of their clock-cycles
>> on interlocks, so around ~2 cycles average.
>
> I assume these are memory moves, aka LD/ST.
> In this context does interlock mean a source register RAW dependency stall?
>
Yeah.
MOV.L: Load or Store a 32-bit DWORD
MOV.Q: Load or Store a 64-bit QWORD
In the case of a memory load, there is a 3 cycle latency.
But, my compiler output has a bad habit of not waiting 3 cycles before
trying to use the result, so stalls here are common.
To avoid this, one needs to write their C code like:
  t0=ptr[i0];
  t1=ptr[i1];
  ... (find something else to do)
  i4=t0+t1;
But this is kind of a pain (and ugly), and programs like Quake and
similar were not really written in a way that treats C like it was
assembler.
Though, it is a tradeoff, as trying to write C like it was ASM also
often tends to lead to an increase in register pressure and an increase
in the number of register MOV instructions (as my compiler isn't
particularly good about avoiding unnecessary MOVs; and trying to
address this sort of thing turns into never-ending whack-a-mole).
In many cases, it is still possible to get around a 2x-3x speedup by
writing stuff in ASM (but, C is more portable and less effort).
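A slightly less ugly version of that scheduling idea is to structure a
loop so a pair of independent loads issue before either result is
consumed (purely illustrative; whether a given compiler preserves this
schedule is another matter):

```c
#include <assert.h>

/* Sum an array while keeping each load's result from being consumed
   immediately, giving a 3-cycle load latency some room to hide. */
long sum_pairs(const long *ptr, int n)
{
    long acc = 0;
    int  i   = 0;
    for (; i + 1 < n; i += 2) {
        long t0 = ptr[i];     /* two independent loads issue first */
        long t1 = ptr[i + 1];
        acc += t0 + t1;       /* results consumed a little later   */
    }
    if (i < n)
        acc += ptr[i];        /* leftover element for odd n        */
    return acc;
}
```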
>> ADD seems to be spending around 1/3 of its cycles on interlock (the
>> main ALU ops already had a 2-cycle cost since early on).
>
> IIUC you are saying your fpga takes 2 clocks at 50 MHz = 40 ns
> to do a 64-bit add, so it uses two pipeline stages for ALU.
>
It takes roughly 1 cycle internally, so:
ID2 stage: Fetch inputs for the ADD;
EX1 stage: Do the ADD;
EX2 stage: Make result visible to the world.
For 1-cycle ops, it would need to forward the result directly from the
adder-chain logic or similar into the register forwarding logic. I
discovered fairly early on that for things like 64-bit ADD, this is bad.
Most operations which "actually do something" thus sort of end up
needing a clock-edge for their results to come to rest (causing them to
effectively have a 2-cycle latency as far as the running program is
concerned).
> And by interlock you mean if an instruction following in the next 2 slots
> reads that same dest register then it must stall at RR for 1 or 2 clocks,
> (assuming you have a forwarding bus from ALU result to RR, otherwise more).
>
For ADD, it is 1 instruction:
ADD R4, 1, R5
ADD R5, 1, R6 //triggers a stall
But:
ADD R4, 1, R5
MOV R6, R7
ADD R5, 1, R6 //no stall
For operations with a 3 cycle latency, the next two instructions may
generate a stall.
So:
MOV.Q (R4, 0), R5
ADD R5, 1, R6
Causes a 2-cycle stall, MOV.Q effectively taking 3 cycles.
So:
MOV.Q (R4, 0), R5
MOV R6, R7
MOV R8, R9
ADD R5, 1, R6 //this is fine
Here, MOV.Q only takes a single clock cycle.
> So 1/3 of ADD instructions have this dependency.
> Is that correct?
>
Seemingly, around 1/3 of ADD instructions are followed immediately by an
instruction which is trying to make use of the result of the ADD.
Say:
j=(i+7)>>4;
Will very likely result in such a scenario.