----------------------------------------------------------------------------------
@MSGID: <e2196cd2-5268-4b04-b7ad-19b5b2a4cb8fn@googlegroups.com> 4f2b12b6
@REPLY: 1@newsreader4.netcologne.de> 9abbee92
@REPLYADDR Michael S <already5chosen@yahoo.com>
@REPLYTO 2:5075/128 Michael S
@CHRS: CP866 2
@RFC: 1 0
@RFC-References:
<57c5e077-ac71-486c-8afa-edd6802cf6b1n@googlegroups.com> <2023Sep23.123024@mips.complang.tuwien.ac.at> 2@gal.iecc.com>
<09798d75-4962-47b8-8816-d554d201a522n@googlegroups.com> 1@gal.iecc.com>
1@newsreader4.netcologne.de>
@RFC-Message-ID: <e2196cd2-5268-4b04-b7ad-19b5b2a4cb8fn@googlegroups.com>
@TZUTC: -0700
@PID: G2/1.0
@TID: FIDOGATE-5.12-ge4e8b94
On Sunday, September 24, 2023 at 10:56:08 AM UTC+3, Thomas Koenig wrote:
> John Levine <jo...@taugh.com> wrote:
> > Z implements a lot of the complex instructions in microcode which uses
> > the hardwired subset of the instruction set, which they call millicode,
> > but apparently DFP is in hardware.
> >
> > This paper describes the hardware and software support for DFP:
> >
> > https://speleotrove.com/mfc/files/schwarz2009-decimalFP-on-z10.pdf
> Very interesting link, thanks!
>
> A few interesting snippets: They give the cycle time of the z10
> as 15 FO4, which they say is much faster than prior generations.
> Not sure how that compares to current designs, but it seems
> fast to me.
>
> They also write
>
> "[...] the execution pipeline [for the IBM z] for one instruction
> includes both a memory access and an execution stage, whereas
> RISC computers require multiple instructions to accomplish the
> same task. Nevertheless, resolving memory interlock dependencies
> is a concern. Since the operands are in memory, using the result
> of a prior operation creates an interlock in memory. If the
> operations are not spaced apart in time, the load/store unit (LSU)
> or IDU must compare the full addresses to determine the interlock
> and somehow bypass the operands. The new decimal floating-point
> architecture makes dependencies easier and faster to handle because
> the interlocks are simply in registers."
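To make the interlock point concrete, here is a toy model in Python. The names MemOperand, mem_interlock and reg_interlock are made up for illustration and do not come from the paper or from any IBM design. When operands live in storage, detecting a dependency requires an address-range overlap test on full addresses. When they live in registers, the test is only a compare of small register numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemOperand:
    """Hypothetical storage operand: starting address and length in bytes."""
    addr: int
    length: int


def mem_interlock(older_dest: MemOperand, younger_src: MemOperand) -> bool:
    """True if the younger instruction reads any byte the older one writes.

    This needs a range-overlap test on full addresses, which is the
    comparison the quoted text says the LSU or IDU must perform.
    """
    return (younger_src.addr < older_dest.addr + older_dest.length
            and older_dest.addr < younger_src.addr + younger_src.length)


def reg_interlock(older_dest_reg: int, younger_src_regs: tuple[int, ...]) -> bool:
    """True if the younger instruction reads the register the older one writes.

    With 16 floating-point registers this is a compare of 4-bit numbers.
    """
    return older_dest_reg in younger_src_regs


# Older op writes 8 bytes at 0x1000; younger op reads 4 bytes from 0x1007.
print(mem_interlock(MemOperand(0x1000, 8), MemOperand(0x1007, 4)))  # True
print(mem_interlock(MemOperand(0x1000, 8), MemOperand(0x1008, 4)))  # False
print(reg_interlock(3, (1, 3)))                                      # True
```

A real pipeline also has to forward the data once it finds the hazard. This model only covers the detection step.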
>
> Fixed-point BCD operations are also (to me) surprisingly slow:
>
> "For addition and subtraction, the execution latency is seven
> cycles for operands of 8 bytes or less and nine cycles for
> operands with greater length. This includes all special cases,
> including overflow."
>
> Seven cycles (105 FO4 gate delays) seems like a lot for adding,
> but I guess that just speaks to the complexity of BCD arithmetic.
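BCD addition costs more than binary addition because a 4-bit digit may only hold 0-9. Any digit sum above 9 therefore needs a correction (add 6 to skip the unused codes 0xA-0xF) before the carry passes to the next digit. Below is a minimal digit-serial sketch in Python for unsigned digits only. bcd_add is a made-up name. The sketch ignores the sign nibble that real z/Architecture packed-decimal operands carry, and it makes no claim about the parallel carry scheme the z10 hardware actually uses.

```python
def bcd_add(a: int, b: int, digits: int = 16) -> tuple[int, int]:
    """Add two unsigned packed-BCD numbers of `digits` digits.

    Returns (sum, carry_out). Digits are processed one at a time,
    least significant first.
    """
    result = 0
    carry = 0
    for i in range(digits):
        shift = 4 * i
        da = (a >> shift) & 0xF
        db = (b >> shift) & 0xF
        s = da + db + carry      # plain binary sum of one digit pair: 0..19
        if s > 9:                # decimal correction: skip codes 0xA..0xF
            s += 6
        carry = s >> 4           # carry into the next decimal digit
        result |= (s & 0xF) << shift
    return result, carry


total, carry = bcd_add(0x1234, 0x8766)
print(hex(total), carry)         # 0x10000 0
total, carry = bcd_add(0x9999999999999999, 0x1)
print(hex(total), carry)         # 0x0 1  (decimal overflow)
```

Even in this toy version, every digit needs a compare and a conditional add on top of the binary add, and the carry ripples through all of those.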
IIRC, z10 was IBM's last "native CISC" design in the zArch series.
Starting with z196, they crack load-op instructions into 2 or more uOps,
just like the majority of x86 cores do.
It's hard to be sure, because the terminology used by IBM is so unique.
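Here is a rough toy sketch in Python of what cracking a load-op means. An RX-format instruction such as A R1,D2(X2,B2) (Add, register plus storage operand) becomes a load into a temporary register, followed by a register-register add. The uop names, the Uop/crack_rx helpers and the temp register t0 are invented for illustration. They are not meant to match IBM's internal encoding. After cracking, the dependency between the load and the add is an ordinary register dependency, which ties back to the interlock point in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Uop:
    """Hypothetical micro-op: operation, destination and sources."""
    op: str
    dest: str
    srcs: tuple[str, ...]


def crack_rx(op: str, r1: str, d2: int, x2: str, b2: str,
             temp: str = "t0") -> list[Uop]:
    """Crack a register-storage (RX-style) op into load + register op.

    The storage operand address is D2 + (X2) + (B2); the load puts the
    operand in a temporary, so the ALU uop only sees registers.
    """
    address = f"{d2}({x2},{b2})"
    load = Uop("load", temp, (address,))
    alu = Uop(op, r1, (r1, temp))
    return [load, alu]


for uop in crack_rx("add", "r1", 8, "r5", "r6"):
    print(uop)
# Uop(op='load', dest='t0', srcs=('8(r5,r6)',))
# Uop(op='add', dest='r1', srcs=('r1', 't0'))
```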
--- G2/1.0
* Origin: usenet.network (2:5075/128)