Computer Architecture Study 8
Computer Architecture
The computers we use every day are built from genuinely complex systems.
Parallel processor
To achieve higher performance by connecting multiple computers, we need to understand multiprocessors along with scalability, availability, and power efficiency.
Scalability is about whether performance increases in proportion to the resources added.
Most computer systems today are made up of multiple processors (cores).
Hardware and Software
- Hardware
    - Serial
    - Parallel
- Software
    - Sequential
    - Concurrent
Hardware is either serial or parallel, and software is either sequential or concurrent. Since software is typically written concurrently rather than strictly in parallel, even parallel hardware may be used differently at the software level. Hardware and software must always be considered as a matched combination.
Parallel Programming
- Parallel software is the problem
- Need to get significant performance improvement
    - Otherwise, just use a faster uniprocessor, since it’s easier
- Difficulties
    - Partitioning
    - Coordination
    - Communications overhead
Using 4 cores does not quadruple performance, and the difficulties listed above get in the way.
Amdahl’s Law
- Sequential part can limit speedup
- Example: 100 processors, 90x speedup?
    - $T_{new} = T_{parallelizable}/100 + T_{sequential}$
    - Speedup = $\frac{1}{(1-F_{parallelizable})+F_{parallelizable}/100}=90$
    - Solving: $F_{parallelizable} = 0.999$
That is, the sequential part must take up only 0.1% of the execution time; the step below works this out.
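Worked out explicitly, writing $F$ for $F_{parallelizable}$:

$$\frac{1}{(1-F) + F/100} = 90 \quad\Rightarrow\quad 1 - \frac{99}{100}F = \frac{1}{90} \quad\Rightarrow\quad F = \frac{100}{99}\cdot\frac{89}{90} \approx 0.999$$

Even $F = 0.99$ (a 1% sequential part) would give a speedup of only $1/(0.01 + 0.99/100) \approx 50$, far short of 90x.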
Scaling Example
- Workload: sum of 10 scalars, and 10 × 10 matrix sum
- Speed up from 10 to 100 processors
- Single processor: $Time = (10 + 100) \times t_{add}$
- 10 processors
    - Time = $10 \times t_{add} + 100/10 \times t_{add} = 20 \times t_{add}$
    - Speedup = $110/20 = 5.5$ (55% of potential)
- 100 processors
    - Time = $10 \times t_{add} + 100/100 \times t_{add} = 11 \times t_{add}$
    - Speedup = $110/11 = 10$ (10% of potential)
More resources do not automatically mean linearly better performance, but when the parallelizable portion dominates, performance can scale close to linearly. The sketch below reproduces the numbers above.
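A minimal C sketch of this strong-scaling model, normalizing $t_{add}$ to 1 (the workload split is the one from the example):

```c
#include <stdio.h>

/* Strong-scaling model for the workload above:
 * 10 scalar adds (sequential) + a 10x10 matrix sum (parallelizable). */
int main(void) {
    const double seq = 10.0;         /* sequential adds, in units of t_add */
    const double par = 100.0;        /* parallelizable adds (10x10 matrix) */
    const double single = seq + par; /* single-processor time = 110 t_add */

    int procs[] = {1, 10, 100};
    for (int i = 0; i < 3; i++) {
        int p = procs[i];
        double time = seq + par / p;      /* sequential part never shrinks */
        double speedup = single / time;
        printf("p=%3d  time=%6.1f t_add  speedup=%5.2f  (%.0f%% of potential)\n",
               p, time, speedup, 100.0 * speedup / p);
    }
    return 0;
}
```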
Instruction and Data Stream
- Four categories: SISD, SIMD, MISD, and MIMD (Flynn's taxonomy)
SIMD
- Operate elementwise on vectors of data
    - Multiple data elements in 128-bit wide registers
- All processors execute the same instruction at the same time
    - Each with different data address, etc
- Simplifies synchronization
- Reduced instruction control hardware
- Works best for highly data-parallel applications
Widely used in deep learning accelerators.
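A minimal sketch of the idea in C with SSE intrinsics (an illustrative choice; the notes only say "128-bit wide registers"): one `_mm_add_ps` performs four float additions with a single instruction.

```c
#include <stdio.h>
#include <xmmintrin.h>  /* SSE: 128-bit registers hold 4 floats each */

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];

    /* Same instruction, multiple data: each _mm_add_ps does
     * four float additions at once. */
    for (int i = 0; i < 8; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
    }

    for (int i = 0; i < 8; i++) printf("%.0f ", c[i]);
    printf("\n");  /* prints: 9 9 9 9 9 9 9 9 */
    return 0;
}
```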
Multithreading
Link to a good blog post
- Performing multiple threads of execution in parallel
    - Replicate registers, PC, etc
    - Fast switching between threads
- Fine-grain multithreading
    - Switch threads after each cycle
    - Interleave instruction execution
    - If one thread stalls, others are executed
- Coarse-grain multithreading
    - Only switch on long stall
    - Simplifies hardware, but doesn’t hide short stalls
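Hardware multithreading itself is transparent to software, but a minimal pthreads sketch (thread count and workload are arbitrary assumptions) shows what "multiple threads of execution" look like from the software side:

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4  /* arbitrary; real hardware thread counts vary */

/* Each software thread gets its own PC and register state when mapped
 * onto a hardware thread; the core decides (fine- or coarse-grain)
 * how the threads share the pipeline. */
void *worker(void *arg) {
    long id = (long)arg;
    long sum = 0;
    for (long i = 0; i < 1000000; i++)  /* independent busy work */
        sum += i;
    printf("thread %ld done (sum=%ld)\n", id, sum);
    return NULL;
}

int main(void) {  /* build with: cc -pthread */
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```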
Modeling Performance
- Assume performance metric of interest is achievable GFLOPs/sec
    - Measured using computational kernels from Berkeley Design Patterns
- Arithmetic intensity of a kernel
    - FLOPs per byte of memory accessed
- For a given computer, determine
    - Peak GFLOPS (from data sheet)
    - Peak memory bytes/sec (using Stream benchmark)
FLOPS = FLOPs/sec, FLOPs = Floating Point Operations
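As a concrete example (illustrative, not from the notes), a SAXPY-style loop has a low, fixed arithmetic intensity:

```c
/* y[i] = a * x[i] + y[i]
 * Per element: 2 FLOPs (one multiply, one add) against
 * 12 bytes of traffic (load x, load y, store y; 4-byte floats),
 * so arithmetic intensity = 2/12 ≈ 0.17 FLOPs/byte. */
void saxpy(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```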
Roofline Diagram
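The diagram plots attainable GFLOPs/sec against arithmetic intensity, and the bound it draws combines the two peaks above:

$$\text{Attainable GFLOPs/sec} = \min\left(\text{Peak GFLOPs/sec},\ \text{Peak memory bytes/sec} \times \text{Arithmetic intensity}\right)$$

Kernels left of the ridge point are memory-bound; kernels to the right are compute-bound.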
Optimizing Performance
- Optimize FP performance
    - Balance adds & multiplies
    - Improve superscalar ILP and use of SIMD instructions
- Optimize memory usage
    - Software prefetch
        - Avoid load stalls
    - Memory affinity
        - Avoid non-local data accesses
- Choice of optimization depends on arithmetic intensity of code
- Arithmetic intensity is not always fixed
    - May scale with problem size
    - Caching reduces memory accesses
        - Increases arithmetic intensity
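A minimal sketch of the last point, assuming a 512×512 double-precision matrix multiply with a tile size of 32 (both hypothetical tuning choices): blocking reuses each tile from cache many times, so fewer bytes cross the memory bus per FLOP.

```c
#define N 512
#define B 32  /* tile size; pick so three BxB tiles fit in cache */

/* C += A * Bm, blocked (tiled) so each loaded tile is reused B times
 * from cache, cutting memory traffic and raising FLOPs per byte. */
void matmul_blocked(double A[N][N], double Bm[N][N], double C[N][N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++) {
                        double sum = C[i][j];
                        for (int k = kk; k < kk + B; k++)
                            sum += A[i][k] * Bm[k][j];
                        C[i][j] = sum;
                    }
}
```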