EricAndreychek and I are looking at putting a 1RU server together, and naturally we want the most bang for our buck. Given that the box will be a webserver, we want HTTPd processes and their associated data to always be either in RAM or on the PCI bus (I/O). Since the standard PC architecture is:
The L1 cache is on-die (built into the same piece of silicon as the processor core), and operates at the clock frequency of the processor. Since most processors nowadays also have on-die L2 caches, the backside bus runs at processor speed (Xeon) or half-processor speed (plain Pentium?). This leaves the Frontside Bus as the last determiner of throughput.
In order to compare performance, we need to know the typical cache-miss rate for our applications (e.g. the memory our processes touch may be so much larger than 1 MB that the L2 cache is only marginally effective). Luckily, there is published research on average cache-miss rates over a wide range of applications. Here we use the database model, 255.vortex; although this data is for the Alpha AXP, the cache-miss rates should be analogous to those on Intel chips:
-------------------------------------------------------------------------------
| U-cache misses/inst: 544,083,218,265 unified refs (1.392152-/inst);         |
|-----------------------------------------------------------------------------|
| 348,800,683,706 U-cache 64-Byte block accesses (0.891942-/inst)             |
|-----------------------------------------------------------------------------|
| Size  | Direct      | 2-way LRU   | 4-way LRU   | 8-way LRU   | Full LRU    |
|-------+-------------+-------------+-------------+-------------+-------------|
| 1KB   | 0.26780097- | 0.22677614- | 0.21181042- | 0.20626928- | 0.20403274- |
| 2KB   | 0.20561816- | 0.16472790- | 0.15549086- | 0.14867232- | 0.14565999- |
| 4KB   | 0.14643086- | 0.10978899- | 0.10100589- | 0.09772142- | 0.09402592- |
| 8KB   | 0.10829900- | 0.06585053- | 0.05598715- | 0.05277136- | 0.04964558- |
| 16KB  | 0.07532274- | 0.03921008- | 0.02633094- | 0.02032304- | 0.01536376- |
| 32KB  | 0.03393586- | 0.02226750- | 0.01298755- | 0.01003383- | 0.00876593- |
| 64KB  | 0.01894350- | 0.00906869- | 0.00676262- | 0.00584272- | 0.00525280- |
| 128KB | 0.01062717- | 0.00440508- | 0.00357717- | 0.00332806- | 0.00297348- |
| 256KB | 0.00513303- | 0.00222342- | 0.00175441- | 0.00156760- | 0.00129169- |
| 512KB | 0.00308863- | 0.00128724- | 0.00103971- | 0.00096630- | 0.00088150- |
| 1MB   | 0.00168843- | 0.00086016- | 0.00074038- | 0.00071857- | 0.00067457- |
-------------------------------------------------------------------------------

Intel non-ia64 chips are 4-way LRU on the L2 cache.
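To make the two rows we care about easier to compare, here is a small Python sketch that copies the 4-way LRU column from the table above and looks up the miss rates for a 128KB and a 1MB cache. The names `MISS_RATE_4WAY` and `miss_ratio` are made up for illustration; the numbers are taken straight from the table.

```python
# 4-way LRU miss rates per instruction for 255.vortex, copied from the table above.
MISS_RATE_4WAY = {
    "1KB": 0.21181042,
    "2KB": 0.15549086,
    "4KB": 0.10100589,
    "8KB": 0.05598715,
    "16KB": 0.02633094,
    "32KB": 0.01298755,
    "64KB": 0.00676262,
    "128KB": 0.00357717,
    "256KB": 0.00175441,
    "512KB": 0.00103971,
    "1MB": 0.00074038,
}


def miss_ratio(small, large):
    """How many times more often the smaller cache misses than the larger one."""
    return MISS_RATE_4WAY[small] / MISS_RATE_4WAY[large]


if __name__ == "__main__":
    for size in ("128KB", "1MB"):
        rate = MISS_RATE_4WAY[size]
        print(f"{size:>5}: miss rate {rate:.8f}, hit rate {1 - rate:.8f}")
    print(f"128KB misses {miss_ratio('128KB', '1MB'):.2f}x as often as 1MB")
```

The 128KB cache misses several times as often as the 1MB cache, but both hit rates are still above 99%, which matters for the comparison that follows.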
To compare bang for the buck, we get some pricing from AccuPC, and they say that on July 4th, 2004, a P4 2.4GHz 1MB L2 533MHz FSB was $139.00, and a Celeron 2.4GHz 128KB L2 400MHz FSB was $76.50. If the overall performance for in-memory processes is more than 2x faster on the P4, we're getting the P4.
The equation for average Hz per instruction (AHpI) is:
AHpI = (cache-miss rate * FSB MHz / 2 trips) + (cache-hit rate * BSB MHz / 2 trips)

for the P4:
AHpI = (0.00074038 * 533 / 2) + ((1 - 0.00074038) * (2400 / 2 / 2)) = 599.75

for the Celeron:
AHpI = (0.00357717 * 400 / 2) + ((1 - 0.00357717) * (2400 / 2 / 2)) = 598.57

(AHpI(P4) - AHpI(Celeron)) / AHpI(P4) = 0.0020
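The arithmetic above can be re-checked with a short Python sketch. It simply evaluates the AHpI formula as written on this page. The function name `ahpi` and the constant names are made up for illustration, and the half-speed backside bus (2400 / 2) is the same assumption the worked numbers use for both chips. The relative gap is taken as (P4 - Celeron) / P4 so that it comes out positive.

```python
def ahpi(miss_rate, fsb_mhz, bsb_mhz, trips=2):
    """Average Hz per instruction, using the formula from this page."""
    hit_rate = 1.0 - miss_rate
    return (miss_rate * fsb_mhz / trips) + (hit_rate * bsb_mhz / trips)


CPU_MHZ = 2400
BSB_MHZ = CPU_MHZ / 2  # assumption carried over from the page: half-speed backside bus

# 4-way LRU miss rates from the 255.vortex table: 1MB (P4) and 128KB (Celeron).
p4 = ahpi(miss_rate=0.00074038, fsb_mhz=533, bsb_mhz=BSB_MHZ)
celeron = ahpi(miss_rate=0.00357717, fsb_mhz=400, bsb_mhz=BSB_MHZ)

# Relative performance gap and relative price saving (AccuPC prices quoted above).
gap = (p4 - celeron) / p4
saving = 1 - 76.50 / 139.00

if __name__ == "__main__":
    print(f"P4      AHpI: {p4:.2f}")
    print(f"Celeron AHpI: {celeron:.2f}")
    print(f"Celeron is {gap:.2%} slower and {saving:.0%} cheaper")
```

Changing `BSB_MHZ` or the miss rates is an easy way to see how sensitive the result is to each assumption; with hit rates this close to 1, the backside-bus term dominates either way.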
Since the Celeron is only about 0.2% slower than the P4 for an analogous application (well under 1%), but is ~45% cheaper, we should buy the Celeron. This is because the average time per instruction is dominated by cache-hit time, not cache-miss time, even for database operations. Such a small performance penalty does not seem to fit "anecdotal" evidence, so now I need another analysis to determine which is wrong: