
From : Thomas Koenig 2:5075/128 24 Sep 23 07:56:04

To : John Levine 24 Sep 23 10:57:03

Subject : Re: lotsa money and data sizes, Solving the Floating-Point Conundrum

----------------------------------------------------------------------------------

John Levine <johnl@taugh.com> wrote:

> Z implements a lot of the complex instructions in microcode which uses
> the hardwired subset of the instruction set, which they call millicode
> but apparently DFP is hardware.
>
> This paper describes the hardware and software support for DFP:
>
> https://speleotrove.com/mfc/files/schwarz2009-decimalFP-on-z10.pdf

Very interesting link, thanks!

A few interesting snippets:  They give the cycle time of the z10
as 15 FO4, which they say is much faster than prior generations.
Not sure how that compares to current designs, but it seems
fast to me.

They also write

"[...]  the execution pipeline [for the IBM z] for one instruction
includes both a memory access and an execution stage, whereas
RISC computers require multiple instructions to accomplish the
same task. Nevertheless, resolving memory interlock dependencies
is a concern. Since the operands are in memory, using the result
of a prior operation creates an interlock in memory. If the
operations are not spaced apart in time, the load/store unit (LSU)
or IDU must compare the full addresses to determine the interlock
and somehow bypass the operands. The new decimal floating-point
architecture makes dependencies easier and faster to handle because
the interlocks are simply in registers."
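
To make the contrast in that quote concrete, here is a minimal sketch of the two kinds of dependency check. It is only an illustration under my own assumptions, not how the z10 LSU or IDU actually does it: the function names and the (address, length) operand representation are hypothetical. The point is that a memory interlock needs a comparison of full byte ranges, while a register interlock is a compare of small register numbers.

```python
def operands_overlap(addr_a: int, len_a: int, addr_b: int, len_b: int) -> bool:
    """Memory interlock check (hypothetical model): do the byte ranges
    [addr_a, addr_a + len_a) and [addr_b, addr_b + len_b) intersect?
    This needs comparisons on full-width addresses."""
    return addr_a < addr_b + len_b and addr_b < addr_a + len_a


def registers_conflict(dest_reg: int, src_regs: tuple[int, ...]) -> bool:
    """Register interlock check (hypothetical model): does a later
    instruction read the register an earlier one writes?
    This is an equality compare on a few bits per register number."""
    return dest_reg in src_regs


# A result stored at 0x1000 (8 bytes) and a following operand at 0x1004
# (8 bytes) overlap, so the second operation depends on the first.
print(operands_overlap(0x1000, 8, 0x1004, 8))
# Adjacent, non-overlapping operands carry no dependency.
print(operands_overlap(0x1000, 8, 0x1008, 8))
# Register case: the producer writes register 3, the consumer reads 3 and 5.
print(registers_conflict(3, (3, 5)))
```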

Fixed-point BCD operations are also (to me) surprisingly slow:

"For addition and subtraction, the execution latency is seven
cycles for operands of 8 bytes or less and nine cycles for
operands with greater length. This includes all special cases,
including overflow."

Seven cycles at 15 FO4 each (105 FO4 gate delays) seems like a lot
for an addition, but I guess that just speaks to the complexity of
BCD arithmetic.
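
For anyone who has not looked at BCD addition in a while, here is a minimal software sketch of where the extra work comes from. It assumes unsigned packed BCD with four bits per digit and ignores the sign nibble that real packed-decimal operands carry. The name bcd_add and the digits parameter are mine, and this says nothing about how the z10 hardware is actually built. It only shows that every digit needs a range check and a decimal correction on top of the binary add.

```python
def bcd_add(a: int, b: int, digits: int = 16) -> tuple[int, bool]:
    """Add two unsigned packed-BCD values, one 4-bit digit at a time.

    Returns (sum as packed BCD, overflow flag). Assumes both inputs
    contain only valid digits 0..9 in each nibble.
    """
    result = 0
    carry = 0
    for i in range(digits):
        shift = 4 * i
        digit_a = (a >> shift) & 0xF
        digit_b = (b >> shift) & 0xF
        s = digit_a + digit_b + carry
        if s > 9:
            # Decimal correction: wrap the digit and carry into the next one.
            s -= 10
            carry = 1
        else:
            carry = 0
        result |= s << shift
    # A carry out of the top digit means the sum did not fit.
    return result, bool(carry)


total, overflow = bcd_add(0x1234, 0x0876, digits=4)
print(hex(total), overflow)   # 1234 + 876 = 2110, no overflow

total, overflow = bcd_add(0x9999, 0x0001, digits=4)
print(hex(total), overflow)   # wraps to 0000 with overflow set
```

A loop like this serializes the carry through every digit. Hardware would use a faster carry scheme, but the per-digit correction and the overflow detection still have to happen somewhere.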