Memory model related:
Threads and memory model for C++
The JSR-133 Cookbook for Compiler Writers
The JSR-133 Cookbook
Links related to Java's Memory Model
Coding related:
Blog entries on Linkers from Ian Lance Taylor
The GNU C Library
GCC-Atomic-Builtins
Const versus volatile
Assembly related:
GCC-Inline-Assembly-HOWTO
Gentle Intro to x86-64
Linux related:
How the Kernel Manages Your Memory
Git - SVN Crash Course
Understanding /proc/cpuinfo
Signal Handling
The Linux Kernel
Debugging with GDB
Description of 4 level Page table implemented in Linux kernel for x86_64
How to read your *.so file
Architecture related:
Detailed Architecture of AMD's Opteron
Nehalem White Paper (pdf)
Tom's Hardware review on Nehalem
Nehalem configuration (ramie):
16 virtual processors
2 physical processors with 4 cores each
Hyper-threading, 2 threads per core
Private L1 I-cache (per core): 32KB
Private L1 D-cache (per core): 32KB
Private L2 cache (per core, for both I and D): 256KB
Shared L3 cache (for both I and D): 8MB for all cores, inclusive (i.e.
contains data that are in L1 / L2)
True 2-level TLB
Level1 D-TLB, 64 entries for 4K pages, 32 entries for 2M/4M (large) pages
Level1 I-TLB, 128 entries for 4K pages, 7 entries for 2M/4M (large) pages
Level2 TLB (unified), 512 entries for 4K pages
L1 D-TLB and L2 TLB are shared dynamically between 2 hyper threads in one core
L1 I-TLB is statically partitioned for 4K pages, and entirely replicated
for 2M/4M (large) pages
QuickPath Interconnect, 6.4GT/sec; 12.8GB/sec each direction, 25.6GB/sec total
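As a rough illustration of what the TLB sizes above imply, the sketch below computes TLB reach (entries times page size) from the Nehalem figures listed. The helper name `tlb_reach_bytes` is hypothetical, and the large-page case assumes 2M pages rather than 4M.

```python
KIB = 1024
MIB = 1024 * KIB

def tlb_reach_bytes(entries, page_size):
    """Memory covered by a fully populated TLB: entries * page size."""
    return entries * page_size

# Entry counts are taken from the Nehalem list above.
l1_dtlb_4k = tlb_reach_bytes(64, 4 * KIB)    # L1 D-TLB, 4K pages
l1_dtlb_2m = tlb_reach_bytes(32, 2 * MIB)    # L1 D-TLB, assuming 2M pages
l2_tlb_4k = tlb_reach_bytes(512, 4 * KIB)    # unified L2 TLB, 4K pages

print("L1 D-TLB, 4K pages:", l1_dtlb_4k // KIB, "KiB")
print("L1 D-TLB, 2M pages:", l1_dtlb_2m // MIB, "MiB")
print("L2 TLB,   4K pages:", l2_tlb_4k // MIB, "MiB")
```

With 4K pages the L2 TLB covers 2 MiB, which is smaller than the 8MB shared L3, so a working set that fits in L3 can still take TLB misses.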
Barcelona configuration (velour):
16 virtual processors
4 physical processors with 4 cores each
Private L1 I-cache (per core): 64KB
Private L1 D-cache (per core): 64KB (1024 entries, 2-way set associative)
Private L2 cache (per core): 512KB
Shared L3 cache (shared among 4 cores): 2MB (so 8 MB total for 4 procs)
Level1 DTLB, 32 entries for 4K pages, 8 entries for 2M/4M pages
Level2 DTLB, 512 entries for 4K, 4 way set-associative
HyperTransport links (up to 8GB/sec)
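The L1 D-cache figures above (64KB, 1024 entries, 2-way set associative) can be cross-checked with a little arithmetic. The sketch below assumes that "entries" means cache lines; the helper name `cache_geometry` is hypothetical.

```python
def cache_geometry(size_bytes, lines, ways):
    """Derive line size and set count from total size, line count, and associativity."""
    line_size = size_bytes // lines   # bytes per cache line
    sets = lines // ways              # each set holds `ways` lines
    return line_size, sets

# Barcelona L1 D-cache: 64KB, 1024 lines (assumed meaning of "entries"), 2-way.
line_size, sets = cache_geometry(64 * 1024, 1024, 2)
print("line size:", line_size, "bytes")
print("sets:", sets)
```

Under that assumption the numbers are self-consistent: 64-byte lines arranged as 512 sets of 2 ways.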
Cloud machine configuration (via TB):
2 physical processors with 12 cores each
Private L1 I-cache (per core): 32KB
Private L1 D-cache (per core): 32KB
Private L2 cache (per core, for both I and D): 256KB
Shared L3 cache (for both I and D): 24MB for all cores, 12MB per socket
Performance measuring related:
BIOS and Kernel Developer's Guide (BKDG) for AMD Family 10h (pdf)
(For Hardware event counters, see section 3.14)
Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3A:
System Programming Guide Part 1 (pdf)
Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3B:
System Programming Guide Part 2 (pdf)
(For Hardware event counters, see Appendix A in vol. 3B)