You are given two **N** x **N** matrices, _matA_ and _matB_. Consider the following two scenarios:

- Scenario 1: Reading columns of matrix _A_ and writing rows of matrix _B_
- Scenario 2: Reading rows of matrix _A_ and writing columns of matrix _B_

Assume **N** is always greater than 0.

- Note: There are the _same number of total cache hits and misses_ overall, but _that does not mean performance will be identical_.

>[!note] Extra Credit will be awarded based on the completeness and level of detail of your answers. **Be very specific**.

>[!note] *Hint*: What happens as **N** changes? Consider a wide range of values for **N**.

### Question 1

Describe and explain in detail how and why these scenarios differ in performance. Your explanation should:

- Compare access patterns and their interaction with:
  - **Spatial and temporal locality**
  - **Cache line size and cache associativity**
  - **Write policies** (write-allocate vs write-no-allocate)
  - **Hardware prefetching behavior**
  - **TLB (Translation Lookaside Buffer) misses**
  - **Memory bandwidth saturation**
- Describe _scenarios_ where:
  - Scenario 1 might outperform Scenario 2
  - Scenario 2 might outperform Scenario 1
  - They may perform roughly the same (and why)
- Discuss how **data layout** (row-major vs column-major) affects these results.
- Optionally, consider how:
  - **SIMD vectorization** (auto or manual)
  - **Multithreading and NUMA locality**
  - **Matrix alignment and padding**

  might influence either case.

> Extra credit is awarded based on _completeness_, _technical accuracy_, and _depth of insight_.
> Consider exploring both _software-level_ (loop order, compiler optimization) and _hardware-level_ (cache behavior, prefetchers) reasoning.

---

Scenario 1 and scenario 2 are two different implementations of transposing a given matrix $A$ and storing the result into a given matrix $B$.
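Under a row-major layout, the two scenarios correspond to the following loop orders (a sketch assuming `double` elements; the function names are illustrative, not part of the assignment):

```cpp
#include <cstddef>
#include <vector>

// Scenario 1: read columns of A, write rows of B (B = A^T).
// Under row-major storage the reads are strided, the writes sequential.
void transpose_read_cols(const std::vector<double>& A,
                         std::vector<double>& B, std::size_t N) {
    for (std::size_t i = 0; i < N; ++i)      // row i of B
        for (std::size_t j = 0; j < N; ++j)  // walk down column i of A
            B[i * N + j] = A[j * N + i];
}

// Scenario 2: read rows of A, write columns of B (also B = A^T).
// Under row-major storage the reads are sequential, the writes strided.
void transpose_read_rows(const std::vector<double>& A,
                         std::vector<double>& B, std::size_t N) {
    for (std::size_t i = 0; i < N; ++i)      // walk along row i of A
        for (std::size_t j = 0; j < N; ++j)  // column i of B
            B[j * N + i] = A[i * N + j];
}
```

Both produce the same result; only the order in which cache lines of $A$ and $B$ are touched differs.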
Assuming the matrices are stored in row-major order, scenario 1 reads $A$ column by column while writing $B$ row by row, and scenario 2 reads $A$ row by row while writing $B$ column by column. Based on this, my hypothesis is that scenario 1 will run faster than scenario 2 once $N \cdot \mathrm{sizeof}(R)$ (where $R$ is the element type) exceeds the cache line size of the target hardware, because beyond that point each strided access touches a different cache line.

### Question 2

Describe how you tested your hypotheses. Your explanation should:

- Include **exact code**, **compiler flags**, and **timing methods** used (e.g., `std::chrono`, `rdtsc`, or performance counters like `perf` / VTune / uProf)
- Identify the **CPU and cache characteristics** of your test system (L1/L2/L3 sizes, cache line width, NUMA node, SMT status, etc.)
- Specify the **compiler optimizations** enabled (`-O2`, `-O3`, `/O2`, `/arch:AVX2`, etc.)
- Explain how you ensured:
  - Warm vs cold cache conditions
  - Large `N` values that exceed L1/L2/L3 cache
  - Fair measurement (e.g., multiple runs, averaged results)
- Present and interpret your results:
  - Show timing tables, graphs, or screenshots
  - Interpret _why_ the observed performance matches (or doesn't match) your expectations

> You may include screenshots, benchmark logs, profiler traces, or GitHub links.
> Credit will be awarded based on **experimental rigor** and **clarity of reasoning**.

---
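One way to measure this is a small `std::chrono` harness (one of the timing methods the question lists). The sketch below is illustrative: `time_transpose_ms` is an assumed name, the warm-up pass and run-averaging implement the "warm cache" and "fair measurement" points, and it should be compiled with optimizations enabled (e.g. `g++ -O2`):

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Returns the average wall-clock time, in milliseconds, of one N x N
// transpose over `runs` repetitions (steady_clock is monotonic, so it is
// suitable for interval timing).
double time_transpose_ms(std::size_t N, int runs) {
    std::vector<double> A(N * N, 1.0), B(N * N, 0.0);

    auto transpose = [&](std::size_t n) {
        // Scenario 2 loop order shown; swap the two index expressions
        // to time scenario 1 instead.
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                B[j * n + i] = A[i * n + j];
    };

    transpose(N);  // warm-up pass: touches every page, fills caches/TLB

    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < runs; ++r)
        transpose(N);  // repeat and average for a fairer measurement
    const auto t1 = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::milli>(t1 - t0).count() / runs;
}
```

To test cold-cache behavior instead, the warm-up call would be dropped and the buffers reallocated (or flushed) between runs; sweeping `N` from well inside L1 to well past L3 exercises the hypothesis across the cache hierarchy.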