You are given two **N** x **N** matrices, _matA_ and _matB_.
Consider the following two scenarios:
- Scenario 1: reading columns of matrix _A_ and writing rows of matrix _B_.
- Scenario 2: reading rows of matrix _A_ and writing columns of matrix _B_.
Assume **N** is always greater than 0.
- Note: There is the _same total number of cache hits and misses_ overall, but _that does not mean performance will be identical_.
>[!note] Extra Credit will be awarded based on the completeness and level of detail of your answers. **Be very specific**.
>[!note] *Hint:* What happens as **N** changes? Consider a wide range of values for **N**.
### Question 1
Describe and explain in detail how and why these scenarios differ in performance.
Your explanation should:
- Compare access patterns and their interaction with:
- **Spatial and temporal locality**
- **Cache line size and cache associativity**
- **Write policies** (write-allocate vs write-no-allocate)
- **Hardware prefetching behavior**
- **TLB (Translation Lookaside Buffer) misses**
- **Memory bandwidth saturation**
- Describe _conditions_ under which:
- Scenario 1 might outperform Scenario 2
- Scenario 2 might outperform Scenario 1
- They may perform roughly the same (and why)
- Discuss how **data layout** (row-major vs column-major) affects these results.
- Optionally, consider how:
- **SIMD vectorization** (auto or manual)
- **Multithreading and NUMA locality**
- **Matrix alignment and padding**
might influence either case.
> Extra credit is awarded based on _completeness_, _technical accuracy_, and _depth of insight_.
> Consider exploring both _software-level_ (loop order, compiler optimization) and _hardware-level_ (cache behavior, prefetchers) reasoning.
---
Scenario 1 and Scenario 2 are two different implementations of transposing a given matrix $A$ and storing the result in a given matrix $B$. Assuming the matrices are stored in row-major order, Scenario 1 reads $A$ column by column (strided accesses) while writing $B$ row by row (sequential accesses), whereas Scenario 2 reads $A$ row by row while writing $B$ column by column. Based on this, my hypothesis is that Scenario 1 will run faster than Scenario 2 once $N \cdot \mathrm{sizeof}(R)$ (where $R$ is the element type) exceeds the cache line size of the target hardware, since at that stride each column-wise access touches a different cache line, and under a write-allocate policy the strided writes of Scenario 2 pay for both the miss and the line fill.
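To make the two access patterns concrete, here is a minimal sketch of the two loop orders, assuming row-major storage; the element type `R` is fixed to `double` purely for illustration, and the function names are my own:

```cpp
#include <cstddef>

using R = double;  // assumed element type for illustration

// Scenario 1: read A column by column (stride-N reads),
// write B row by row (sequential writes).
void transpose_scenario1(const R* A, R* B, std::size_t N) {
    for (std::size_t i = 0; i < N; ++i)       // row i of B
        for (std::size_t j = 0; j < N; ++j)   // walks down column i of A
            B[i * N + j] = A[j * N + i];      // strided read, sequential write
}

// Scenario 2: read A row by row (sequential reads),
// write B column by column (stride-N writes).
void transpose_scenario2(const R* A, R* B, std::size_t N) {
    for (std::size_t i = 0; i < N; ++i)       // row i of A
        for (std::size_t j = 0; j < N; ++j)   // walks down column i of B
            B[j * N + i] = A[i * N + j];      // sequential read, strided write
}
```

Both functions compute the same transpose; only the order in which the address space is traversed differs, which is exactly what the hypothesis above is about.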
### Question 2
Describe how you tested your hypotheses.
Your explanation should:
- Include **exact code**, **compiler flags**, and **timing methods** used
(e.g., `std::chrono`, `rdtsc`, or performance counters like `perf` / VTune / uProf)
- Identify the **CPU and cache characteristics** of your test system
(L1/L2/L3 sizes, cache line width, NUMA node, SMT status, etc.)
- Specify the **compiler optimizations** enabled (`-O2`, `-O3`, `/O2`, `/arch:AVX2`, etc.)
- Explain how you ensured:
- Warm vs cold cache conditions
- Large `N` values that exceed L1/L2/L3 cache
- Fair measurement (e.g., multiple runs, averaged results)
- Present and interpret your results:
- Show timing tables, graphs, or screenshots
- Interpret _why_ the observed performance matches (or doesn’t match) your expectations
> You may include screenshots, benchmark logs, profiler traces, or GitHub links.
> Credit will be awarded based on **experimental rigor** and **clarity of reasoning**.
---