----------------------------------------------------------------------------------
@MSGID: <2023Sep29.100918@mips.complang.tuwien.ac.at> a659e1d7
@REPLY: 1@newsreader4.netcologne.de> 9abbee92
@REPLYADDR Anton Ertl <anton@mips.complang.tuwien.ac.at>
@REPLYTO 2:5075/128 Anton Ertl
Thomas Koenig <tkoenig@netcologne.de> writes:
>A few interesting snippets: They give the cycle time of the z10
>as 15 FO4, which they say is much faster than prior generations.
>Not sure how that compares to current designs, but it seems
>fast to me.
There is a paper which claims that "The Optimal Logic Depth per
Pipeline Stage is 6 to 8 FO4 Inverter Delays" [hrishikesh+02]. This
kind of thinking led to the Tejas successor to the Pentium 4 and to
the high-clocked K9 project that Mitch Alsup has been posting about
now and then. Both projects were cancelled; my guess is that they
both bet on advances in cooling technology that did not materialize.
@InProceedings{hrishikesh+02,
  author =    {M. S. Hrishikesh and Norman P. Jouppi and Keith
               I. Farkas and Doug Burger and Stephen W. Keckler and
               Premkishore Shivakumar},
  title =     {The Optimal Logic Depth per Pipeline Stage is 6 to 8
               FO4 Inverter Delays},
  crossref =  {isca02},
  pages =     {14--24},
  annote =    {This paper takes a low-level simulator of the 21264,
               varies the number of pipeline stages, uses this to
               run a number of workloads (actually only traces from
               them), and reports performance results for
               them. With a latch overhead of about 2 FO4
               inverters, the optimal pipeline stage length is
               about 8 FO4 inverters (with work-load-dependent
               variations). Discusses various issues involved in
               quite some depth. In particular, this paper
               discusses how to pipeline the instruction window
               design (which has been identified as a bottleneck in
               earlier papers).}
}

@Proceedings{isca02,
  title =     "$29^\textit{th}$ Annual International Symposium on Computer
               Architecture",
  booktitle = "$29^\textit{th}$ Annual International Symposium on
               Computer Architecture",
  year =      "2002",
  key =       "ISCA 29",
}
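
As a rough illustration of what these FO4 figures imply, the following sketch subtracts the latch overhead from the cycle time and reports how much of each cycle is left for logic. It takes the annotation's numbers at face value and assumes that the "about 8 FO4" stage length already includes the "about 2 FO4" latch overhead (the entry does not say which), and that the same overhead would apply to the z10's 15 FO4 cycle; both are assumptions, and the function names are made up for this example.

```python
def logic_fo4(cycle_fo4: float, latch_overhead_fo4: float = 2.0) -> float:
    """FO4 delays left for logic in one stage after subtracting latch overhead.

    The 2 FO4 default comes from the annotation above; applying it to any
    other design is an assumption of this sketch.
    """
    if cycle_fo4 <= latch_overhead_fo4:
        raise ValueError("cycle time must exceed the latch overhead")
    return cycle_fo4 - latch_overhead_fo4


def useful_fraction(cycle_fo4: float, latch_overhead_fo4: float = 2.0) -> float:
    """Fraction of the cycle that is spent in logic rather than in latches."""
    return logic_fo4(cycle_fo4, latch_overhead_fo4) / cycle_fo4


def relative_clock(cycle_a_fo4: float, cycle_b_fo4: float) -> float:
    """Clock rate of design B relative to design A in the same process."""
    return cycle_a_fo4 / cycle_b_fo4


if __name__ == "__main__":
    for name, cycle in (("15 FO4 (z10, as quoted)", 15.0),
                        ("8 FO4 (annotation)", 8.0)):
        print(f"{name}: {logic_fo4(cycle):.0f} FO4 logic, "
              f"{useful_fraction(cycle):.0%} of the cycle")
    print(f"clock ratio 8 FO4 vs. 15 FO4: {relative_clock(15.0, 8.0):.2f}x")
```

Under these assumptions the shorter stage buys a higher clock rate in the same process, but a larger share of every cycle goes to latch overhead.
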
>They also write
>
>"[...] the execution pipeline [for the IBM z] for one instruction
>includes both a memory access and an execution stage, whereas
>RISC computers require multiple instructions to accomplish the
>same task.
Yes, if you want to do a 486 that performs well for load-and-operate
instructions, you design a six-stage pipeline:

  IF-ID-MEM1-MEM2-ALU-WB

instead of (what the 486 and Pentium actually have):

  IF-ID-MEM1-MEM2/ALU-WB

IBM did the former, Intel the latter. I wonder why.
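
The difference between the two pipelines for a stream of load-and-operate instructions can be sketched with a toy throughput model. The model is an assumption made for illustration, not a description of any real implementation: every stage takes one cycle, except that a stage shared between memory access and ALU ("MEM2/ALU") is occupied for two cycles by a load-and-operate instruction, because the ALU operation has to wait for the loaded value. The instructions are treated as independent, so no other hazards are modelled, and the names below are hypothetical.

```python
SIX_STAGE = ["IF", "ID", "MEM1", "MEM2", "ALU", "WB"]
FIVE_STAGE = ["IF", "ID", "MEM1", "MEM2/ALU", "WB"]


def cycles_for_load_op_stream(stages: list[str], n: int) -> int:
    """Cycles to complete n independent load-and-operate instructions.

    Assumption of this sketch: a shared "MEM2/ALU" stage is held for two
    cycles (memory access, then ALU operation); all other stages for one.
    """
    if n < 1:
        raise ValueError("need at least one instruction")
    occupancy = [2 if stage == "MEM2/ALU" else 1 for stage in stages]
    first = sum(occupancy)        # latency of the first instruction
    interval = max(occupancy)     # the slowest stage limits throughput
    return first + (n - 1) * interval


if __name__ == "__main__":
    for name, stages in (("six-stage", SIX_STAGE), ("five-stage", FIVE_STAGE)):
        print(name, cycles_for_load_op_stream(stages, 100))
```

In this model both pipelines have the same latency for a single load-and-operate instruction, but the six-stage pipeline sustains one such instruction per cycle while the five-stage one sustains one every two cycles. The model leaves out the costs of the longer pipeline, such as an ALU result becoming available one stage later for a following address computation or branch.
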
>Fixed-point BCD operations are also (to me) surprisingly slow:
>
>"For addition and subtraction, the execution latency is seven
>cycles for operands of 8 bytes or less and nine cycles for
>operands with greater length. This includes all special cases,
>including overflow."
>
>Seven cycles (105 FO4 gate delays) seems like a lot for adding,
>but I guess that just speaks to the complexity of BCD arithmetic.
My guess is that the extra hardware they invested was relatively
little, and they need a number of microoperations to actually get the
desired result. Compare with the decimal sequences on PA-RISC that we
have discussed here some time ago
<2022Dec7.100130@mips.complang.tuwien.ac.at>.
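
To show why decimal addition on mostly binary hardware tends to decompose into several dependent steps, here is a sketch of a well-known software technique for adding packed BCD numbers with binary adds and bitwise operations only. It is neither the z10 microcode nor the PA-RISC sequence from that thread; the function names and the `digits` parameter are made up for this example.

```python
def to_bcd(n: int) -> int:
    """Pack a non-negative integer as BCD, one decimal digit per nibble."""
    return int(str(n), 16)


def from_bcd(x: int) -> int:
    """Unpack a valid packed-BCD value into an ordinary integer."""
    return int(format(x, "x"))


def bcd_add(a: int, b: int, digits: int = 16) -> int:
    """Add two packed-BCD operands of at most `digits` digits each.

    The result may occupy one extra nibble for the carry out.
    """
    sixes = int("6" * digits, 16)             # bias each digit by 6
    carry_mask = int("1" * digits, 16) << 4   # bit just above each digit
    t1 = a + sixes       # biased first operand
    t2 = t1 ^ b          # bitwise sum, ignoring carries
    t1 = t1 + b          # binary sum; a digit sum >= 10 now carries out
    t2 = t1 ^ t2         # bits that received a carry
    t2 = ~t2 & carry_mask            # digits that did NOT carry out
    t2 = (t2 >> 2) | (t2 >> 3)       # a 6 in each of those digits
    return t1 - t2       # remove the bias where no decimal carry occurred


if __name__ == "__main__":
    print(format(bcd_add(to_bcd(58), to_bcd(47), digits=2), "x"))
```

Each step depends on the previous one, so even with every step being a simple full-width operation the addition takes a chain of operations rather than a single pass through an adder. Dedicated digit-correction hardware shortens that chain, which is where the amount of extra hardware invested matters.
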
- anton
--
`Anyone trying for "industrial quality" ISA should avoid undefined behavior.`
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
--- xrn 10.11
* Origin: Institut fuer Computersprachen, Technische Universitaet (2:5075/128)