When is a Celeron a better buy than a Pentium4?

EricAndreychek and I are looking at putting a 1RU server together, and naturally we want the most bang for our buck. Given that the box will be a webserver, we want HTTPd processes and their associated data to always be either in RAM or on the PCI bus (I/O). The standard PC architecture is:

Processor -> L1 Cache -> Backside Bus -> L2 Cache -> Frontside Bus -> RAM

The L1 cache is on-die [say what?], and operates at the clock frequency of the processor. Since most processors nowadays have on-die L2 caches, the backside bus also runs at processor speed (Xeon) or half-processor speed (plain Pentium?). This leaves the Frontside Bus as the last factor determining throughput.

In order to compare performance, we need to know the typical cache-miss rate for our applications (e.g. maybe our processes' Locality of Reference is so much larger than a MB that the L2 cache is only marginally effective). Luckily, we have the following research on average cache-miss rates over a wide range of applications. Here we use the database model, 255.vortex. Although this data is for the Alpha AXP, the cache-miss rates should be analogous for the same applications on Intel chips:


 -----------------------------------------------------------------------------
| U-cache misses/inst: 544,083,218,265 unified refs (1.392152-/inst);         |
|-----------------------------------------------------------------------------|
| 348,800,683,706 U-cache 64-Byte block accesses (0.891942-/inst)             |
|-----------------------------------------------------------------------------|
|  Size |   Direct    |  2-way LRU  |  4-way LRU  |  8-way LRU  |  Full LRU   |
|-------+-------------+-------------+-------------+-------------+-------------|
|   1KB | 0.26780097- | 0.22677614- | 0.21181042- | 0.20626928- | 0.20403274- |
|   2KB | 0.20561816- | 0.16472790- | 0.15549086- | 0.14867232- | 0.14565999- |
|   4KB | 0.14643086- | 0.10978899- | 0.10100589- | 0.09772142- | 0.09402592- |
|   8KB | 0.10829900- | 0.06585053- | 0.05598715- | 0.05277136- | 0.04964558- |
|  16KB | 0.07532274- | 0.03921008- | 0.02633094- | 0.02032304- | 0.01536376- |
|  32KB | 0.03393586- | 0.02226750- | 0.01298755- | 0.01003383- | 0.00876593- |
|  64KB | 0.01894350- | 0.00906869- | 0.00676262- | 0.00584272- | 0.00525280- |
| 128KB | 0.01062717- | 0.00440508- | 0.00357717- | 0.00332806- | 0.00297348- |
| 256KB | 0.00513303- | 0.00222342- | 0.00175441- | 0.00156760- | 0.00129169- |
| 512KB | 0.00308863- | 0.00128724- | 0.00103971- | 0.00096630- | 0.00088150- |
|   1MB | 0.00168843- | 0.00086016- | 0.00074038- | 0.00071857- | 0.00067457- |
 -----------------------------------------------------------------------------

Intel non-ia64 chips are 4-way LRU on the L2 cache.
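
As a quick sanity check on which table entries matter, here is a minimal sketch that pulls the 4-way LRU figures for the two L2 sizes being compared (128KB and 1MB). The dictionary name and layout are just illustrative; the numbers are copied from the 4-way LRU column above and are treated as miss rates, as the rest of this page does.

```python
# 4-way LRU column of the table above (values copied as-is).
FOUR_WAY_LRU = {
    "128KB": 0.00357717,
    "256KB": 0.00175441,
    "512KB": 0.00103971,
    "1MB": 0.00074038,
}

celeron_miss = FOUR_WAY_LRU["128KB"]  # Celeron: 128KB L2
p4_miss = FOUR_WAY_LRU["1MB"]         # P4: 1MB L2

# How many times more often the smaller cache misses.
miss_ratio = celeron_miss / p4_miss
print(f"Celeron misses {miss_ratio:.1f}x as often as the P4")
```

So the smaller cache misses several times as often, but both rates are well under 1%, which is what the comparison further down hinges on.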

To compare bang for the buck, we got some pricing from AccuPC: on July 4th, 2004, a P4 2.4GHz 1MB L2 533MHz FSB was $139.00, and a Celeron 2.4GHz 128KB L2 400MHz FSB was $76.50. If the overall performance for in-memory processes is more than 2x faster on the P4, we're getting the P4.

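The equation for average Hz per instruction (AHpI), expressed in MHz, is:

	AHpI = (cache-miss rate * FSB MHz / 2 trips) + (cache-hit rate * BSB MHz / 2 trips)

	where cache-hit rate = 1 - cache-miss rate, and BSB MHz is taken as half the processor speed (2400 / 2)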

	for the P4:
	AHpI = (0.00074038 * 533 / 2) + ((1 - 0.00074038) * (2400 / 2 / 2))
	     = 599.75 

	for the Celeron:
	AHpI = (0.00357717 * 400 / 2) + ((1 - 0.00357717) * (2400 / 2 / 2))
	     = 598.57

	(AHpI(P4) - AHpI(Celeron)) / AHpI(P4) = 0.0020
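
The arithmetic above can be re-checked with a short script. This is only a sketch of the formula as written on this page; the function and variable names are made up for illustration, and it carries over the same assumptions (half-speed backside bus, 2 trips, 4-way LRU miss rates from the table, AccuPC prices).

```python
def ahpi(miss_rate, fsb_mhz, bsb_mhz, trips=2):
    """Average MHz per instruction, using this page's formula."""
    hit_rate = 1 - miss_rate
    return miss_rate * fsb_mhz / trips + hit_rate * bsb_mhz / trips

CPU_MHZ = 2400
BSB_MHZ = CPU_MHZ / 2  # assumption from the worked numbers: half-speed backside bus

p4 = ahpi(0.00074038, fsb_mhz=533, bsb_mhz=BSB_MHZ)       # 1MB L2
celeron = ahpi(0.00357717, fsb_mhz=400, bsb_mhz=BSB_MHZ)  # 128KB L2

# Relative slowdown of the Celeron, and how much cheaper it is.
slowdown = (p4 - celeron) / p4
saving = 1 - 76.50 / 139.00

print(f"P4 AHpI:      {p4:.2f}")
print(f"Celeron AHpI: {celeron:.2f}")
print(f"Celeron slowdown: {slowdown:.2%}, price saving: {saving:.0%}")
```

Swapping in a different column of the table (say, 8-way LRU) or a full-speed backside bus only needs the arguments changed, which makes it easy to see how sensitive the conclusion is to those assumptions.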

Since the Celeron is only about 0.2% slower than the P4 for an analogous application, but is ~45% cheaper, we should buy the Celeron. This is because the average time per instruction is dominated by the cache-hit times, not the cache-miss times, even for database operations. Such a small performance penalty does not seem to fit the "anecdotal" evidence, so now I need another analysis to determine which of the following is wrong:

The research data
My assumptions
My analysis
Received wisdom regarding Celerons

Dell has quite different numbers re: cache miss rates for a database application: