The basic measurement of a computing system is speed. Speed is measured by:
- Execution time (or latency)
- Throughput

Throughput is usually relevant to I/O, particularly in large systems that handle many jobs.
Reducing the execution time will nearly always improve throughput, but the reverse is not true.
There are three components that determine execution time:
Execution Time = Instructions Executed × CPI × Clock Cycle Time
Instructions Executed
This is the dynamic instruction count. We are not interested in the static instruction count (how many lines of code are in a program); instead we care about the number of instructions actually executed when the program runs. A five-line assembly loop can expand into thousands of executed instructions.
CPI
The average number of clock cycles per instruction. This depends on the architecture the program runs on, and of course on the program itself. The latest Intel architectures can execute up to 4 instructions per clock, meaning a CPI of 0.25 (super-scalar). CPI can also be higher than 1, due to memory stalls and slow instructions.
Clock Cycle Time
One “cycle” is the minimum unit of time in which the CPU can do work. The clock period is simply the length of a cycle; its inverse is the clock frequency. Generally, a higher frequency is better. A 500 MHz processor has a cycle time of 2 ns.
To improve performance, we just need to make any of these components smaller. Since clock cycle time is usually a hardware limitation, we should focus on instructions executed and CPI.
For instructions executed, the best approach is to change the program logic: usually you need to investigate the compiler output and change the program's behavior so that it consumes fewer instructions. Sometimes you'll need to use intrinsics, or write assembly code, to force more efficient instructions, as the compiler might not always choose them.
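As a sketch of the intrinsics point (assuming GCC or Clang; both function names here are my own), counting set bits is a classic case: the naive loop executes dozens of instructions per call, while the builtin typically compiles down to a single POPCNT instruction on x86 targets that support it:

```c
#include <stdint.h>

/* Naive bit count: the compiler emits a loop, one or more
   instructions per bit examined. */
int popcount_naive(uint32_t x) {
    int count = 0;
    while (x) {
        count += x & 1;
        x >>= 1;
    }
    return count;
}

/* GCC/Clang builtin: on x86 with -mpopcnt this typically becomes
   a single POPCNT instruction. */
int popcount_intrinsic(uint32_t x) {
    return __builtin_popcount(x);
}
```

Both return the same result; the difference is purely in the dynamic instruction count of the generated code.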
CPI – the paradox
Improving CPI means choosing instructions that consume fewer cycles. This often means those instructions do less work, so you need to increase the instructions executed – leading you back to zero improvement. Decreasing the instructions executed means each instruction does more work, hence CPI will increase – again leading to zero improvement.
A smart compiler will decrease CPI by choosing the right kind of instructions, without a large increase in instruction count. Compiler implementation has a major impact on program performance.
To compare program performance, the industry has invented additional metrics such as MIPS, MFLOPS, and SPEC.
Speeding Up Multi-Core Applications
The potential speedup of multi-core applications is governed by Amdahl's Law. From Wikipedia:
Amdahl’s law, also known as Amdahl’s argument, is used to find the maximum expected improvement to an overall system when only part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup using multiple processors. The law is named after computer architect Gene Amdahl, and was presented at the AFIPS Spring Joint Computer Conference in 1967.
The speedup of a program using multiple processors in parallel computing is limited by the sequential fraction of the program. For example, if 95% of the program can be parallelized, the theoretical maximum speedup using parallel computing would be 20×, no matter how many processors are used.
This is a very important law, as performance improvements usually involve a great amount of time and resources. We should concentrate our efforts on the problems and areas where we will benefit the most. Regarding multi-core programming, adding cores is not always the right way to go; there are many multi-core bottlenecks which should be eliminated before increasing the core count. This is out of the scope of this post, and I'll dedicate one to that later.