June 12th, 2019 ~ by admin

Xeon Overclocking: Making Gallatin Gallop

This article is part of The CPU Shack’s continued partnership with guest author max1024, hailing from Belarus. I have provided some minor edits/tweaks in the translation from Belorussian to English.

If you remember the days of the Pentium 4 on Socket 478 with the Northwood, Prescott and Gallatin cores, then you should also remember how these cores differed from each other. Northwood was fast like a mountain doe thanks to a shorter 20-stage pipeline, which let it perform many operations very quickly without heavy losses from branch mispredictions, but its overclocking frequency potential was inferior to Prescott's. Prescott, in turn, was as strong as a buffalo thanks to twice the L2 cache (1 MB vs 512 KB) and a finer process (90 nm vs 130 nm). But like any hoofed animal it was not agile: to achieve higher clock speeds its pipeline was extended to 31 stages. In some cases it outperformed Northwood clock for clock, but it did so at the expense of a great deal of heat.

A separate niche in the food chain was occupied by “Gallatin”, which combined the properties of the two previous cores: the shorter 20-stage pipeline with the high clock speed of Prescott. Its arsenal also held a formidable weapon, an additional 2 MB L3 cache. The price of owning this “beast” was high in the literal sense of the word: like any other member of the Extreme Edition series, it cost $999. I resisted this extreme processor and chose a hero from AMD instead, the FX-51, which I consider one of the most outstanding processors of all time.

Xeon Universal Chip Analyzer by x86.fr

What could be better, cooler or faster? I looked for an answer to this question for a long time, until I became acquainted with the Intel Xeon server processors on Socket 604, and in particular with the processors based on the Prescott 2M core, which have twice the cache of their desktop counterparts and can run on ASUS production boards.

As everybody knows, the top desktop flagships of both processor manufacturers originate from the server segment. The Opteron gave rise to the AMD Athlon FX-51, and the Intel Xeon MP to the Pentium 4 Extreme Edition. This pattern has held to this day.

Xeon Gallatin MP

The server Intel Xeon processors on the Gallatin core are divided into two branches: Xeon MP (Gallatin) and plain Xeon (Gallatin). They differ in the number of processors supported in one system: Xeon MP supported up to four processors, while the regular Xeon could only be installed in pairs. The stepping of the processor core also differs. Let me remind you that the desktop version of “Gallatin” used the M0 stepping, just like the regular Intel Xeon series.

The Xeon MP line, by contrast, is based on earlier steppings, from A0 to C0. Among the M0 stepping parts you can find four Xeon (Gallatin) models with 1 MB of L3 cache, at frequencies from 2.4 GHz to 3.2 GHz, and one model with the L3 cache doubled to 2 MB, much the same as a Pentium 4 Extreme Edition. This model gave rise to the first “extreme” Pentium.

The most powerful Netburst processor on the Gallatin core belongs to the Xeon MP line. That line also has models with a 2 MB L3 cache, but one model has twice the L3 cache of the Pentium 4 Extreme Edition: the 3 GHz Xeon MP (Gallatin) with sSpec code SL79V. Such a “Gallatin” with 4 MB of L3 deserves to be called “Ultra Gallatin”!

I will dwell a little more on this processor. It is based on the C0 stepping and runs at 3 GHz on a 400 MHz FSB (quad-pumped, so a 100 MHz base clock). It can work alongside three identical counterparts in Socket 603 motherboards. The price in batches of 1,000 started at $3,692. For that money you could buy as many as three Intel Pentium 4 Extreme Editions and still have money left over for a mainboard with RAM. This processor was introduced on March 2, 2004, while the 3.2 GHz Pentium 4 Extreme Edition, on a more recent stepping, arrived a little earlier, on November 3, 2003.

Apparently, producing such a monster was not easy for Intel. The processor is visually different from all the other Xeons: it is much heavier and has a more massive heat spreader. Unfortunately, it does not have its part number on top (Xeons of this era had their markings on the BOTTOM, on the PCB). For clarity, I took a photo of five similar processors from my collection. In the top row, on the left is the classic Socket 423 Intel Pentium 4, and on the right a Socket 603 Intel Xeon DP with the Prestonia core, a sort of Northwood in server form.

In the bottom row, from left to right: today’s hero, the Xeon MP (Gallatin) with 4 MB of L3, then a Socket 604 Intel Xeon on the Irwindale core (essentially a 2M Prescott), and an ordinary Socket 478 processor. Socket 423, 478, 603 and 604 all host similar Netburst architectures. But even with the naked eye you can see that the heat spreader of the Xeon MP (Gallatin) is simply huge. It is also taller than that of any other Intel Xeon.

Even if you put any Socket 478 processor on the cover of the Xeon MP, it will fit in its entirety.

On the reverse side, the processors look like this (after Socket 423, Intel stopped using the staggered PGA arrangement and went for a higher-density non-staggered one):

True, this processor has one weak point. No, it is not the cost, although at launch a ready-to-go four-processor server cost tens of thousands of dollars, though still less than a system built around six Pentium Pros. The Achilles heel was the FSB frequency. I have yet to meet a single processor where everything is perfect; there is always at least one minus, such as the AMD Athlon FX-51 supporting only registered memory. The FSB ran at a modest 400 MHz, whereas the Pentium 4 Extreme Edition had 800 MHz.

Naturally, there were limitations of a purely technical nature. For a single processor an 800 MHz FSB was possible, but designing a 4-way system with such an FSB was not possible at the time because of the architecture. From April 2003, the Intel Xeon line faced a very serious competitor: the first-generation 64-bit AMD Opteron server processor, whose specifications were superior to its competitor's. If you look at the organization of a four-processor AMD Opteron-based server, you can see that there are no bottlenecks in the interaction of processors, chipset and RAM.

The AMD Opteron was an advanced processor for its time. Each Opteron had its own built-in north bridge and memory controller, so there were no problems with interaction with an external chipset. In addition, each Opteron had three point-to-point HyperTransport links providing 3.2 GB/s of bandwidth in each direction (6.4 GB/s full duplex). This interconnect scheme delivered the main advantage, scalability, which is a very serious argument when building any server platform.

Now let’s look at a similar scheme for Intel server processors:

At that time a “chipset” still meant what the word originally implied, and the processor core held nothing “superfluous” beyond the ALU, FPU and cache. All interaction with the north bridge, memory and other processors therefore went over a single 64-bit bus with 3.2 GB/s of bandwidth. With one processor there were no particular problems with bandwidth. With a pair it was tolerable. But with four Xeon MPs installed, each processor had to make do with about 800 MB/s, which was clearly not enough. That is precisely why Intel decided to increase the cache, to avoid frequent trips to RAM as far as possible, and at the time this was essentially the only right decision.
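The bandwidth figures above can be reproduced with a short sketch. It assumes a 64-bit data bus, a 100 MHz base clock with four transfers per clock, and an even split of the bus between processors, which is a simplification since real traffic is bursty. The function names are illustrative only.

```python
# Peak bandwidth of a shared front-side bus, split evenly between CPUs.
# Assumptions: 64-bit (8-byte) data bus, quad-pumped (4 transfers per
# clock), and an even split of the bus between all installed processors.


def fsb_bandwidth_mb_s(base_clock_mhz: float,
                       transfers_per_clock: int = 4,
                       bus_width_bits: int = 64) -> float:
    """Peak bus bandwidth in MB/s (1 MB = 10**6 bytes)."""
    return base_clock_mhz * transfers_per_clock * bus_width_bits / 8


def per_cpu_share_mb_s(total_mb_s: float, cpu_count: int) -> float:
    """Even share of the bus per processor (a simplification)."""
    return total_mb_s / cpu_count


if __name__ == "__main__":
    total = fsb_bandwidth_mb_s(100)  # the "400 MHz" FSB of the Xeon MP
    for cpus in (1, 2, 4):
        share = per_cpu_share_mb_s(total, cpus)
        print(f"{cpus} CPU(s): {share:.0f} MB/s each")
```

Passing a 200 MHz base clock instead models the 800 MHz FSB of the Pentium 4 Extreme Edition, which doubles the peak figure for a single processor.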

Looking ahead, I should note that I do not have a four-processor server, but I do have a couple of processors with the SL79V marking. I had wanted to compare the 4 MB L3 “Gallatin” with the desktop extreme “Gallatin” for a long time, and now the opportunity has come. It is a good thing that computer components become obsolete so rapidly: a processor that once cost $3,692 can now be had for a mere fraction of that, about $5 apiece.

Motherboard Selection

As is well known, the vast majority of server and workstation motherboards are of no interest to enthusiasts because their BIOS offers no tuning options at all, and AGP or PCI-Express slots are often missing. Below is an example of such a boring motherboard.

The Xeon MP “Gallatin” processors with 4 MB of L3 are not difficult to find; finding the right motherboard is not so simple. Sometimes it takes a year or more to find the right hardware, whether a specific motherboard or a rare processor. For this project I was looking for just one motherboard model, the ASUS PC-DL Deluxe. It is ideally suited for overclocking, fits my concept for such a system, and also supports the Xeon MP “Gallatin” with 4 MB of L3, although this is not mentioned anywhere, even on the official Asus website.

ASUS PC-DL Deluxe – Dual Gallatin Xeon MP Support (unofficially)

I searched for it for more than a year, and in the end two of my comrades helped me get a brand-new old-stock board in its box. As for motherboards that can actually overclock, my list of recommended Socket 603/604 boards is as follows: ASUS PC-DL Deluxe, ASUS NCT-D, ASUS NCCH-DL and Iwill DH800. That’s all. Only four models, three of which are based on the familiar Intel 875P chipset from desktop Socket 478 motherboards and just one, the ASUS NCT-D, on the Intel E7525. So if someone wants to build such a monster, these are your reference points for the search ;-)

When the board is powered on with a single 4 MB L3 Xeon MP “Gallatin”, the following POST screen appears:

I used the latest BIOS, version 1009, which correctly identifies the fastest Gallatin, showing the processor’s clock speed and its distinctive feature, a very large third-level cache. Because this processor supports Hyper-Threading, the line with the model name appears twice on the screen. With two processors installed, the number of entries doubles.

Pressing the “Del” key enters BIOS Setup. Here are the main settings screens that are interesting from an enthusiast’s point of view.

The “Advanced” menu contains all motherboard devices that are implemented using separate controllers, or are part of the system logic of the motherboard and which can be turned off or on if necessary.

All the most interesting settings are concentrated in the section – Advanced Chipset Features:

This section contains settings that affect the operation of the processor and memory subsystem.

The Asus motherboard allows you to raise the FSB up to 165 MHz through BIOS Setup. The board also has a jumper that selects an FSB frequency of 100 or 133 MHz. If the jumper is set to 100 MHz, the maximum FSB frequency available in the BIOS is 132 MHz. This is sometimes enough to overclock 400 FSB Xeons, since they have high multipliers; my 3 GHz processor, for example, has a x30 multiplier. But sometimes it is not enough, and then the SetFSB overclocking utility offers a way out.

Because its chipset makes this motherboard closer to desktop Socket 478 boards on the Intel 875P, the special version of the utility for the Asus P4C800 motherboard works here and lets you push the FSB beyond what the board itself offers. If you are wondering, “What if I simply use the jumper to switch the FSB to 133 MHz?”, the answer is that the board refuses to start at all, although with SetFSB I did overclock the bus to 140 MHz.

The Xeon MP processor allows you to change the multiplier in the BIOS, so purely in theory, at a system bus frequency of 140 MHz and the x30 multiplier, its clock would be 4200 MHz, but in reality this is not achievable (something would probably melt if it were). Moving on to the RAM settings: as you could already see above, the Advanced Chipset Features section lets you change the four primary memory timings, but as for the memory clock frequency, the choice is not great:

The entire choice comes down to DDR266 and Auto, which also corresponds to PC2100 or DDR266 mode. Still, you can squeeze a little more out of the board by resorting to software overclocking, and another utility, MemSet, helps with the memory settings.

Although it misreports the RAM frequency, it does make it possible to change the timings on the fly in the OS. By default, the board takes the information from the SPD (Serial Presence Detect, a way for DIMMs to tell the BIOS what they can do), but for this test I used a couple of sticks based on Winbond BH-5 chips, which are known for their ability to work at the lowest possible timings. So naturally I set all the memory timings to twos. Summing up, this motherboard lets you configure the system flexibly, and for a dual-processor board this is the exception rather than the rule. The only thing it lacks is a setting for the processor supply voltage.


Gallatin the Great

For air cooling I used a classic cooler, the Thermaltake Big Typhoon, on the S478 P4 EE, and an aluminum server heatsink on the Socket 603 processor. I also had a copper heatsink inherited from the ASUS NCT-D on Socket 604, but the mounting holes on the ASUS PC-DL Deluxe are spaced differently and it simply does not fit; we will call it a nuance. Overclocking was performed using SetFSB, MemSet and BIOS Setup.

This is how the processor is seen by the CPU-Z and AIDA64 utilities.

And this is how the pair of Gallatins appear, one on Socket 478 and the other on Socket 603. The strength of one is its larger third-level cache; the other has an FSB frequency twice as high.

And the complete system configuration according to the CPU-Z utility:

The system turned out to be interesting. It remains to test its overclocking and, at the same time, to find out which matters more, system bus frequency or cache size. The two test systems cannot run at the same bus speed, primarily because of their different RAM frequencies, but it will be all the more interesting to find out where the bottleneck is in each system.

As I mentioned above, the system bus frequency with the fastest Gallatin peaked at 140 MHz. The maximum validated processor frequency was 3597 MHz. There is no question of stability at that speed, but the figure gives an idea of the overclocking potential. The system could pass short tests at 3500 MHz, and was completely stable at around 3400 MHz. In principle, this lets us compare both Gallatins at equal frequencies.

To get 3200 MHz, the FSB must be raised to 106.8 MHz, and to get 3400 MHz, to 113.4 MHz. The RAM then runs in DDR285 mode in the first case and DDR302 in the second.
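As a rough check of these figures, the sketch below derives the core clock and the effective DDR rating from the FSB base clock. The x30 multiplier comes from the article; the 4:3 memory-to-FSB clock ratio is an assumption inferred from DDR266 being the memory mode at a 100 MHz FSB, and the function names are illustrative.

```python
# Core clock and effective DDR rating derived from the FSB base clock.
# Assumptions: fixed x30 multiplier (from the article) and a 4:3
# memory-to-FSB clock ratio, inferred from DDR266 at a 100 MHz FSB.

MULTIPLIER = 30
MEM_TO_FSB_RATIO = 4 / 3  # assumed, not read from the board


def core_clock_mhz(fsb_mhz: float, multiplier: int = MULTIPLIER) -> float:
    """Core clock is the FSB base clock times the CPU multiplier."""
    return fsb_mhz * multiplier


def ddr_rate(fsb_mhz: float, ratio: float = MEM_TO_FSB_RATIO) -> float:
    """Effective DDR rating: memory clock times two transfers per clock."""
    return fsb_mhz * ratio * 2


if __name__ == "__main__":
    for fsb in (106.8, 113.4, 140.0):
        print(f"FSB {fsb:5.1f} MHz -> core {core_clock_mhz(fsb):6.0f} MHz, "
              f"DDR{ddr_rate(fsb):.0f}")
```

The 140 MHz row is the theoretical x30 case only; as noted earlier, that clock was not reachable in practice.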

Test Configuration

The main components of the system:
Processor:
• Intel Xeon MP, 3000 MHz “Gallatin”, L3 = 4 MB, SL79V;
Motherboard:
• ASUS PC-DL Deluxe, Socket 603, Intel 875P chipset;
Memory:
• DDR1 SDRAM, 2 × 256 MB, 400 MHz (Winbond BH-5), CL = 2;
Videocard:
• Gainward GeForce 6800 Ultra, AGP, 256 MB (Forceware 81.85).

Testing was conducted in Windows XP SP3 and Windows 7 x64 using the following software:
• Super Pi mod. 1.5XS (task 1M);
• PiFast v.4.1;
• wPrime v.1.43;
• AIDA64 5.50.3600;
• WinRAR x86 v. 5.40;
• Cinebench 2003;
• Cinebench 11.5;
• PCMark 2005 v.1.20;
• 3DMark 2001SE Pro b330;
• 3DMark 2005 v.1.3.1;
• DOOM III;
• Far Cry

Tests

Super Pi mod. 1.5XS (task 1M)
Seconds (less is better)

In a single-threaded test, the impressive cache, from which the data is clearly being drawn, gives an obvious advantage. The 3 GHz Xeon MP outpaced the higher-clocked Pentium 4 Extreme Edition and, with it, all other Socket 478 representatives as well as the other Xeons. Overclocked to 3400 MHz, the Xeon MP also leaves all of them behind: to catch up with it you would need either a Pentium 4 Extreme Edition running at ~3700 MHz or a ~3.9 GHz Prescott. The AMD representatives deserve a separate mention, as they look very good: to catch up with the AMD FX-51, an Intel processor needs an extra 1 GHz.

PiFast v.4.1
Seconds (less is better)

The situation in this test changes a bit: the Pentium 4 Extreme Edition overtakes the Xeon MP at an equivalent frequency, so apparently the cache size is not critical here. At 3.4 GHz the Xeon MP is no longer dominant and occupies a middle position.

AIDA64 5.50.3600
Reading from memory, MB/s
AIDA64 5.50.3600
Writing to memory, MB/s

So we have reached the weak point, the memory subsystem. In terms of memory speed, the Xeon MP-based system is nothing special, and if it were given full-fledged DDR400, the situation could be completely different …


Cache and Memory benchmark from the AIDA64 test package: Pentium 4 Extreme Edition and Xeon MP with L3 = 4 MB at the same clock frequency – 3.2 GHz.

The latency of the first- and second-level caches is identical for both processors. But the third-level cache of the Xeon MP is a little slower than that of its desktop brother, and in RAM speed there is a wide gap between them: the desktop system is almost twice as fast as the server one.

And relative results to other platforms
(Reading from memory, MB/s)
And relative results to other platforms
(Write to memory, MB/s)

AIDA64 5.50.3600 CPU Queen Test
score (more is better)

WinRAR x86 v. 5.40
Kb / sec (more is better)

The large third-level cache helps to some extent, but RAM frequency is also very important in this test.

PCMark 2005 v.1.20
score (more is better)

The Xeon MP is clearly not the best candidate for the PCMark 2005 test, although at the same frequency as the Pentium 4 Extreme Edition it scores almost the same number of points. You could say that bus versus cache ends in a draw.

3DMark 2001SE Pro b330
Total Score (more is better)

But this popular “game” from 2001 took a liking to the “Gallatin 4M”. At an equal frequency it overtook both the Pentium 4 Extreme Edition and the other representatives, despite the low RAM frequency.

3DMark 2005 v.1.3.1
Total Score (more is better)

In this more modern test, the clock frequency and memory bandwidth are not enough: despite all the advantages of a large cache, the result was the worst among equals.

DOOM III (1024×768, High Quality, AA4x, timedemo1, 3x loop)
Average result, frames per second

Is something missing for this system in DOOM III? Is the game unable to use the entire cache, or is there not enough clock frequency? Raising the frequency from 3000 to 3400 MHz (a 13% increase) yields only a ~4% increase in frame rate. This frequency increase also raises the FSB, so if memory were the bottleneck we would expect performance to rise in line with it. There must be another limit here (perhaps AGP bandwidth or something similar).

Far Cry (1024×768, Max Quality, demo 3DNews – Research, 3x loop)
Average result, frames per second

In Far Cry the situation is not much better, with literally a couple of FPS of difference. Here we start to see similar diminishing returns even for the P4EE: going from 3600 to 4100 MHz (~14%) results in only a 5.4% gain. Video card bandwidth may be the limiting factor here.
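The diminishing returns in both games can be put into numbers with a small sketch. The clock speeds and frame-rate gains are the ones quoted in the text; the "scaling efficiency" metric and the function names are illustrative assumptions, not something the benchmarks report.

```python
# How much of a clock increase shows up as extra frame rate.
# Clock speeds and FPS gains are taken from the text above; the
# efficiency metric itself is an illustrative assumption.


def percent_gain(before: float, after: float) -> float:
    """Relative increase from `before` to `after`, in percent."""
    return (after - before) / before * 100


def scaling_efficiency(clock_gain_pct: float, fps_gain_pct: float) -> float:
    """1.0 would mean frame rate scales perfectly with clock speed."""
    return fps_gain_pct / clock_gain_pct


if __name__ == "__main__":
    cases = [
        ("DOOM III, Xeon MP", 3000, 3400, 4.0),
        ("Far Cry, P4 EE", 3600, 4100, 5.4),
    ]
    for name, mhz_before, mhz_after, fps_gain in cases:
        clock_gain = percent_gain(mhz_before, mhz_after)
        eff = scaling_efficiency(clock_gain, fps_gain)
        print(f"{name}: +{clock_gain:.1f}% clock, +{fps_gain}% FPS, "
              f"efficiency {eff:.2f}")
```

An efficiency well below 1.0 in both cases is consistent with something other than the CPU clock limiting the frame rate.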

We now turn to the multi-threaded tests to see what a single Xeon MP (Gallatin) with 4 MB of L3 is capable of, and how two processors work as a pair.

wPrime v.1.43
Seconds (less is better)

This simple multi-threaded test loves clock frequency above all, so a single Xeon MP lost to everyone; even the 4 MB L3 cache did not save it. At 3.2 GHz, the larger cache allowed it to become one of the leaders among the equal-frequency models, which means the cache still brings a certain advantage. There is a critical point here where the clock speed becomes high enough to make the cache important. At lower speeds the CPU does not complete instructions fast enough to drain the L1/L2 caches, but above 3 GHz it can process more data than the L1/L2 alone can supply, making the larger L3 cache helpful once again. Along the same lines, dual 3400 MHz Gallatins are almost twice as fast as a single one. Slow them down to 3000 MHz and the gap, while small, widens a bit (from 0.3% to 0.7%).

Cinebench 2003
points (more is better)

In this 2003 rendering test the Xeon MP also looks decent. A pair of processors at 3.4 GHz comes close to the next-generation Xeon “Irwindale” at 3.8 GHz, and in a uniprocessor configuration it actually beats the 3.8 GHz Irwindale.

Cinebench 11.5
points (more is better)

In the 2010 release of Cinebench, the Gallatin, while still quick, is not quite as close as in the 2003 version, but it still manages to compete with higher-clocked Irwindales in dual configurations.

Conclusion

Undoubtedly, the system based on two Xeon MP “Gallatin” processors and the ASUS PC-DL Deluxe is of interest. If I were asked which system is better and more interesting as a retro build, a pair of Xeon MP “Gallatin” with the ASUS PC-DL Deluxe and AGP, or a pair of Intel Xeon DP with the ASUS NCT-D and PCI-Express, I would answer without hesitation: the first. It may lose in pure performance, but it is much more interesting to work with. What remains is to do a pin-mod and bring the system to a new level of performance, but that is for another time. Once again I can say that I am not ashamed of this Asus: the convenient layout, thoughtful design and overclocking abilities of this two-socket board make it the best choice for retro-clocking.

As for the processor itself, it both surprised me and left a good impression: despite such an impressive third-level cache, it overclocked just fine. The only thing I would still like to have is a photograph of what is under such a massive heat spreader. (coming soon 🙂 )

2 Responses to Xeon Overclocking: Making Gallatin Gallop

  1. Max

    Great work!

  2. Andy

    Good article. I was thinking about picking up the 3.2 Ghz/ 533 FSB / 2 MB Gallatins for a retro dual 604 build.

    On a side note, do you happen to know if any of the dual 604 boards supports Tulsa Xeons? They are 800 FSB and have 16MB L3 cache in the highest configuration.
