From : luke.l...@gmail.com 2:5075/128 30 Sep 23 02:34:48

To : Anton Ertl 30 Sep 23 12:37:02

Subject : Re: RISCs and virtual vectors (was: Introducing ForwardCom)

----------------------------------------------------------------------------------

On Monday, September 25, 2023, 'Stefan Monnier' via comp.arch
<comp.arch@googlegroups.com> wrote:
>
> I think you can't get a good answer before clarifying what it is that
> you consider as the problem in "opcode proliferation"
>

as Mitch notes in a later post (today), the "Cartesian Product" turns out,
in say Audio DSPs doing SWAR (SIMD Within A Register) such as AndesStar,
to be an O(N^6) clusterpoop. these DSPs are under enormous cost
and power budget pressure (think "C-Media sub-$1 USB Audio PHYs with
built-in volume control": they potentially only run at 8-12 MHz, are likely
in 130 or even 180 nm, are USB 1.1, and/or have a PLL that slaves to the
host USB bus. the STM32F072 does this: really neat trick)

take even just an ADD: you would think there would only ever be one
add instruction in a RISC ISA? ehhhhm no

* 8/16 bit source selection doubles that
* 8/16 bit destination selection doubles it again
* hi/lo half on source 1 doubles it again
* hi/lo half on source 2 doubles it again
* signed and unsigned saturation triples the number of ADD operations
   (clipping in audio is important rather than getting wrapping distortion)
* "average-add" (x+y+1)>>1 (for Audio this is crucial) doubles again
  (if you only have 16 bit audio and a low-speed DSP you cannot afford
   to do that calculation as 3 separate instructions, and the intermediate
   sum would need 32-bit regs)
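
A minimal Python sketch of the semantics behind the last two bullets, for a single signed 16-bit lane. The function names are illustrative only and are not taken from any particular DSP ISA; the hi/lo-half and 8/16-bit selection bullets are not modelled here.

```python
INT16_MIN, INT16_MAX = -(1 << 15), (1 << 15) - 1

def wrap_add16(x, y):
    """Plain two's-complement add: overflow wraps around."""
    s = (x + y) & 0xFFFF
    return s - 0x10000 if s & 0x8000 else s

def sat_add16(x, y):
    """Signed saturating add: overflow clips to the int16 range."""
    return max(INT16_MIN, min(INT16_MAX, x + y))

def avg_add16(x, y):
    """Rounding average (x + y + 1) >> 1.

    The intermediate sum needs more than 16 bits, which is why a
    fused operation avoids the need for wider registers.
    """
    return (x + y + 1) >> 1

loud_a, loud_b = 30000, 10000
print(wrap_add16(loud_a, loud_b))  # wraps to a negative sample
print(sat_add16(loud_a, loud_b))   # clips at INT16_MAX
print(avg_add16(loud_a, loud_b))   # stays in range
```

With two loud samples, the wrapping add flips sign (audible distortion), the saturating add clips, and the average never leaves the 16-bit range.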

we are up to 6 dimensions and a whopping NINETY SIX *commercially necessary*
variants on what is supposed to be one simple ADD operation!
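
The count can be checked by enumerating the six axes directly. This is only a sketch: the axis names and option labels below are made up for illustration, and each axis simply mirrors one bullet in the list above.

```python
from itertools import product

# One entry per bullet above; names are illustrative labels only.
axes = {
    "src_width": ("8", "16"),
    "dst_width": ("8", "16"),
    "src1_half": ("lo", "hi"),
    "src2_half": ("lo", "hi"),
    "overflow":  ("wrap", "sat_signed", "sat_unsigned"),
    "rounding":  ("plain", "average"),
}

# Every combination of one option per axis is a distinct ADD opcode.
variants = list(product(*axes.values()))
print(len(variants))  # 2 * 2 * 2 * 2 * 3 * 2 = 96
```

Adding one more two-way axis (a seventh dimension) would double the list to 192.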

>  (after all,
> "prefixing" is just another name for variable-length instructions, and
> what is considered as "opcode" vs "immediate operand" within an
> instruction is largely philosophical).

my feeling is that if the prefix is stored in cached internal state (such
as CSRs, as you describe below) you can get the effect of variable-length
instructions without complicating multi-issue decode too much.
of course if you want interrupts in the middle then you must have
a mechanism for noting that there *is* some cached internal state
(that you must not only context-save but also recover through
rewind-replay if that cache is invalidated by servicing the interrupt).
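
A toy Python model of why that cached state has to be part of the saved context. Everything here is hypothetical: the `Core` class, the `"prefix"`/`"add"` tuples and the `sat16` modifier are invented for the sketch and do not describe any real ISA. Only the context-save half is modelled, not rewind-replay.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Core:
    regs: list = field(default_factory=lambda: [0] * 4)
    prefix: Optional[str] = None  # cached internal state set by a prefix

    def step(self, insn):
        op, *args = insn
        if op == "prefix":
            self.prefix = args[0]   # remembered for the next instruction
        elif op == "add":
            rd, ra, rb = args
            total = self.regs[ra] + self.regs[rb]
            if self.prefix == "sat16":
                total = max(-32768, min(32767, total))
            self.regs[rd] = total
            self.prefix = None      # a prefix applies to one instruction

    def save_context(self):
        return list(self.regs), self.prefix

    def restore_context(self, ctx):
        regs, prefix = ctx
        self.regs, self.prefix = list(regs), prefix

def run(program, interrupt_before, save_prefix):
    core = Core(regs=[0, 30000, 10000, 0])
    for pc, insn in enumerate(program):
        if pc == interrupt_before:
            regs, prefix = core.save_context()
            core.prefix = None           # handler starts with clean state
            core.step(("add", 3, 3, 3))  # stand-in for the handler body
            core.restore_context((regs, prefix if save_prefix else None))
        core.step(insn)
    return core.regs[0]

program = [("prefix", "sat16"), ("add", 0, 1, 2)]
print(run(program, 1, save_prefix=True))   # 32767: prefix survived
print(run(program, 1, save_prefix=False))  # 40000: prefix was lost
```

An interrupt landing between the prefix and the ADD silently changes the result unless the prefix state is saved and restored along with the registers.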

>
> As you point out, part of the complexity doesn't come from
> instruction encoding, really, but from the actual desired semantics.
>
> So in terms of instruction encoding, the main issues would be:
>
> A. Code size.
> B. Not preventing the core from working at full speed.
> C. The cost of decoding.
>
> (C) can only bite when backward compatibility forces inconvenient
> encodings, but even the amd64 architecture seems to do fine, so it
> doesn't seem to be a serious issue.

mmmm... one for another post, another time :)

>
>
> I think (B) is never an unsolvable problem.  E.g. CSR-based solutions
> sometimes require pipeline drainage, but that's usually solvable
> without changing the encoding by passing the CSR through the pipeline.
> But maybe tagged registers could be problematic because you end up
> getting this info too late, in the execution rather than in the decode?

in my designs, i am counting on the reg tags being saveable (and cacheable)
just as you suggest CSRs may be, and consequently being passed along
(somewhat naively) down through the pipeline(s). things like "option
to saturate" spring to mind, and in many FPUs it is already industry
practice to have CSRs to set "lower or higher" accuracy rather than have
"estimation assistance" FP operations, although Power ISA has both
concepts in one Architecture.

(MIPS 3D ASE did something fascinating: a pair of instructions that
*continued* the operation at higher bit-accuracy if you actually needed
it... but again this is yet another doubling in that Cartesian Product
that Mitch refers to in a later post... now we are up to O(N^7) SIMD
opcode proliferation....)

>
>
> So I'd guess it's really mostly a code-size issue (beside the semantic
> issues when you want to make it so old code can work with new sizes, of
> course, but IIUC you were talking about the scalar case where this seems
> to be too problematic to even contemplate).

multi-dimensioning *pre-existing* scalar ISAs without modifying that scalar
ISA, by adding Prefixing context of some description: this has become my
speciality for the past 5 years.

slightly confusing, i know :)

>
> My gut tells me that in terms of code size, tagged registers should be
> the better choice.  But I don't know of any actual efforts to try and
> confirm it experimentally.

am on the case. it will however be... at least a year before i can discuss
it, as i need to write up the associated patents but also, as you say,
actually put in the effort and do the experiments :)  yes, am taking a bit
of a risk on a new design, based on a similar gut feeling.

On Saturday, September 30, 2023, 'MitchAlsup' via comp.arch
<comp.arch@googlegroups.com> wrote:
>
>
> The problem is that the Cartesian-product associated with SIMD
> causes thousands of microscopic instructions to be needed (for
> example ARM has at least 1,300 SIMD instructions, others worse.)

iirc SVE/SVE2 just upped the ARM "RISC" ISA to ~7000...

> It is also my contention that nobody with more than
> 200 instructions can be called RISC.

i'd agree that's a reasonable level. tricky to meet if you introduce
BMI/TBM subsets as well...

On Saturday, September 30, 2023, 'MitchAlsup' via comp.arch
<comp.arch@googlegroups.com> wrote:
>
>
> And in comparison:: I got almost all of that capability with 2 instructions
> that guarantees forward and backwards compatibility, and scales with
> machine resources.

to be fair what you have is fully-parallel independent element-level only
(for i in 0..N: x[i] = OP(y[i])) and anything outside of that involves
pushing data through memory (to do traditional vector shuffle/permute...)
but given that the majority of ubiquitous compute is straightforward like
that (a la memcpy / strncpy / daxpy) i feel it's a really good decision.
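
The distinction can be sketched in Python. This illustrates only the access patterns, not the two-instruction scheme itself; the helper names are made up for the example.

```python
def vec_map(op, y):
    """x[i] = OP(y[i]): each element is independent of every other."""
    return [op(v) for v in y]

def daxpy(a, x, y):
    """y[i] = a*x[i] + y[i]: no lane reads another lane's data."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def reverse_via_memory(x):
    """A permute crosses lanes, so it is modelled here as a store to
    a buffer followed by a load with a different index pattern."""
    memory = list(x)                                   # store
    n = len(memory)
    return [memory[n - 1 - i] for i in range(n)]       # indexed load

print(vec_map(lambda v: v + 1, [1, 2, 3]))
print(daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
print(reverse_via_memory([1, 2, 3, 4]))
```

The first two functions fit the element-level pattern; the third is the kind of operation that has to go through memory.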

> <
> 2 versus 1300 :: Which one is really RISC ??

the one with 1300? how many guesses am i allowed here?

l.