Memory model related:

Threads and memory model for C++
The JSR-133 Cookbook for Compiler Writers
The JSR-133 Cookbook
Links related to Java's Memory Model

Coding related:

Blog entries on Linkers from Ian Lance Taylor
The GNU C Library
GCC-Atomic-Builtins
Const versus volatile


Assembly related:

GCC-Inline-Assembly-HOWTO
Gentle Intro to x86-64


Linux related:

How the Kernel Manages Your Memory
Git - SVN Crash Course
Understanding /proc/cpuinfo
Signal Handling
The Linux Kernel
Debugging with GDB
Description of 4 level Page table implemented in Linux kernel for x86_64


How to read your *.so file


Architecture related:

Detailed Architecture of AMD's Opteron
Nehalem White Paper (pdf)
Tom's Hardware review on Nehalem

Nehalem configuration (ramie):

16 virtual processors
2 physical processors with 4 cores each
Hyper-threading, 2 threads per core
Private L1 I-cache (per core): 32KB
Private L1 D-cache (per core): 32KB
Private L2 cache (per core, for both I and D): 256KB
Shared L3 cache (for both I and D): 8MB for all cores, inclusive (i.e. it also holds the data cached in L1 / L2)
True 2-level TLB
Level1 D-TLB, 64 entries for 4K pages, 32 entries for 2M/4M (large) pages
Level1 I-TLB, 128 entries for 4K pages, 7 entries for 2M/4M (large) pages
Level2 TLB (unified), 512 entries for 4K pages
L1 D-TLB and L2 TLB are shared dynamically between 2 hyper threads in one core
L1 I-TLB is statically partitioned for 4K pages, and entirely replicated for 2M/4M (large) pages
QuickPath Interconnect, 6.4GB/sec each direction; 12.8GB/sec total
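
The processor counts above (2 sockets x 4 cores x 2 hyper-threads = 16 virtual processors) can be cross-checked against /proc/cpuinfo. The following is a minimal sketch, assuming the x86 Linux field names "processor", "physical id" and "core id"; the function name summarize_cpuinfo is hypothetical and chosen only for illustration.

```python
def summarize_cpuinfo(text):
    """Count logical processors, sockets and physical cores
    in /proc/cpuinfo-style text (x86 Linux field names assumed)."""
    logical = 0
    sockets = set()
    cores = set()
    physical_id = None  # socket of the entry currently being parsed
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line between entries
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1
            physical_id = None
        elif key == "physical id":
            physical_id = value
            sockets.add(value)
        elif key == "core id":
            # A core is identified by its (socket, core id) pair.
            cores.add((physical_id, value))
    return {"logical": logical, "sockets": len(sockets), "cores": len(cores)}


# On a Linux machine (not executed here):
#   with open("/proc/cpuinfo") as f:
#       print(summarize_cpuinfo(f.read()))
```

If the numbers listed for ramie are right, this should report 16 logical processors, 2 sockets and 8 cores there. Kernels that omit "physical id" or "core id" would yield zero sockets and cores.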

Barcelona configuration (velour):

16 virtual processors
4 physical processors with 4 cores each
Private L1 I-cache (per core): 64KB
Private L1 D-cache (per core): 64KB (1024 entries, 2-way set associative)
Private L2 cache (per core): 512KB
Shared L3 cache (shared among 4 cores): 2MB (so 8MB total for 4 processors)
Level1 DTLB, 32 entries for 4K pages, 8 entries for 2M/4M pages
Level2 DTLB, 512 entries for 4K pages, 4-way set associative
HyperTransport links (up to 8GB/sec)
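
The cache and TLB figures above imply a few derived quantities that are handy when sizing benchmarks. This is a rough sketch of the arithmetic, assuming that "1024 entries" for the L1 D-cache means 1024 cache lines; both function names are hypothetical.

```python
def tlb_reach_bytes(entries, page_size_bytes):
    """Memory covered by a TLB when every entry maps a distinct page."""
    return entries * page_size_bytes


def cache_geometry(size_bytes, lines, ways):
    """Return (line size in bytes, number of sets) for a set-associative cache."""
    line_size = size_bytes // lines
    sets = lines // ways
    return line_size, sets


KB = 1024

# Barcelona L1 DTLB: 32 entries for 4K pages
l1_reach = tlb_reach_bytes(32, 4 * KB)       # 128KB
# Barcelona L2 DTLB: 512 entries for 4K pages
l2_reach = tlb_reach_bytes(512, 4 * KB)      # 2MB
# Barcelona L1 D-cache: 64KB, 1024 lines (assumed), 2-way
line_size, sets = cache_geometry(64 * KB, 1024, 2)

print(l1_reach // KB, "KB L1 DTLB reach")
print(l2_reach // KB, "KB L2 DTLB reach")
print(line_size, "byte lines,", sets, "sets")
```

Under these assumptions, the Level2 DTLB reach for 4K pages (2MB) equals the size of one socket's shared L3 cache.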

Cloud machine configuration (via TB):

2 physical processors with 12 cores each
Private L1 I-cache (per core): 32KB
Private L1 D-cache (per core): 32KB
Private L2 cache (per core, for both I and D): 256KB
Shared L3 cache (for both I and D): 24MB total, 12MB per socket


Performance measuring related:

BIOS and Kernel Developer's Guide (BKDG) for AMD Family 10h Processors (pdf)
(For Hardware event counters, see section 3.14)

Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3A: System Programming Guide Part 1 (pdf)

Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3B: System Programming Guide Part 2 (pdf)
(For Hardware event counters, see Appendix A in vol. 3B)