Bug Forces Intel to Halt Some Xeon Sapphire Rapids Shipments

Sapphire Rapids
(Image credit: Tom's Hardware)

Intel has confirmed that it has paused shipments of some of its fourth-gen Xeon Sapphire Rapids processors due to a newly discovered bug, and it hasn't set a specific date for shipments to resume. We received a tip that Intel had paused the shipments and, following up, learned several details about the issue from Dylan Patel, Chief Analyst at SemiAnalysis, who says shipments have been paused for certain SKUs since mid-June. We also asked Intel about the matter, and the company issued the following statement to Tom's Hardware:

"We became aware of an issue on a subset of 4th Generation Intel Xeon Medium Core Count Processors (SPR-MCC) that could interrupt system operation under certain conditions and are actively investigating. This issue was not observed when running commercially available software, and other 4th Generation Intel Xeon processor SKUs (i.e., XCC and HBM) have not exhibited the issue. Out of an abundance of caution, we did temporarily pause some SPR MCC shipments while we gained confidence in the expected firmware mitigation and expect to release remaining shipments shortly." — Intel Spokesperson to Tom's Hardware.

In response to a follow-up question, Intel also told us that it doesn't expect the firmware mitigation to have an impact on performance.

Intel's oft-delayed Sapphire Rapids processors are created using two types of underlying designs: the XCC package, which employs four compute tiles (dies) to create a single chip, and the MCC package, which uses a single monolithic die. As shown in the slides above, the MCC design is used for chips with up to 32 cores, which are the source of high-volume sales for Intel, while the XCC variants are used for the halo chips with between 36 and 60 cores.

"Intel has faced another crop of design issues related to Sapphire Rapids MCC, the highest volume version of Sapphire Rapids. The 2-socket and 4-socket SKUs have paused shipments due to a timing issue since mid-June," Patel said.

Intel hasn't confirmed that the issue is confined to dual- and quad-socket SKUs, instead classifying this issue as limited to a 'subset' of the SKUs, and hasn't stated when the pause in shipments began. Intel also hasn't confirmed Patel's assertions that the bug is timing-related, or given us any clarification on the nature of the issue.

A timing issue could stem from any number of sources, ranging from the UPI interconnect to instruction timing, so the true nature of the bug remains unclear for now. We do know that Intel can correct the issue with a firmware fix that apparently is still in validation, so it will not require a redesign or a new revision/stepping. Additionally, since new firmware is an adequate fix, Intel might not be required to replace any processors already in the field, although the update could pose a validation headache for its customers.

Intel has earned plenty of criticism not only for its process node missteps with the oft-delayed Sapphire Rapids, but also for the issues in its design and validation methodology that led to further delays and numerous new steppings (a typically minor redesign that requires a new version of silicon to correct an issue). Sapphire Rapids has been dogged by rumors that its design/verification missteps led to 12 steppings for some configs, an unusually large number given that most chips see three steppings at most. Naturally, that led to severe production delays and missed launch dates.

The company has since communicated that it plans to take a different approach to its design, simulation, and validation flow that will correct those issues. Intel says those adjustments will kick in fully in the next generation of Emerald Rapids Xeon processors.

Intel says this new Sapphire Rapids bug wasn't encountered while "running commercially available software" (perhaps this was a hyperscaler's custom application), and it obviously wasn't caught during validation. This type of situation isn't entirely unheard of; nearly all complex chips have both known and unknown errata and bugs that are addressed with firmware, driver, and software workarounds that can reduce or eliminate those issues, and they ship that way — that's the very nature of modern semiconductor design and production.

For example, Intel's Skylake generation of processors shipped with 53 known errata, and six months later, Intel listed another 40 errata. Another example is the recent discovery that AMD's EPYC Rome chips crash after 1,044 days of uptime. Some bugs are simply left unfixed, as they aren't deemed critical enough to fix, or they are fixed with a combination of firmware and software. The most critical bugs sometimes require a new stepping to correct, which is the worst-case scenario. Luckily for Intel, that doesn't seem to be the case here.

However, while bugs aren't uncommon, it is uncommon for those types of bugs to lead to a halt in shipments, implying that this is more than a garden-variety errata. Intel hasn't clarified when it plans to resume shipments for its Sapphire Rapids MCC chips, but we'll update our coverage as we learn more.

Paul Alcorn
Deputy Managing Editor

Paul Alcorn is the Deputy Managing Editor for Tom's Hardware US. He writes news and reviews on CPUs, storage and enterprise hardware.

  • bit_user
    Just wow... this CPU seems destined to gain a reputation up there with the most cursed projects at Intel.

    On the flip side, hopefully Emerald Rapids will benefit from all the debugging and iterating that has gone into Sapphire Rapids.

    The real irony is that you'd expect the XCC to be the one with the late issue(s) - not the MCC. However, maybe because the XCC is more complex, it had more scrutiny and troubleshooting early-on.
  • IamNotChatGpt
    Credit where credit is due, at least Intel gives a flying chicken about its server chips unlike AMD which basically told all of its customers to go ..... themselves.
  • bit_user
    IamNotChatGpt said:
    Credit where credit is due, at least Intel gives a flying chicken about its server chips unlike AMD which basically told all of its customers to go ..... themselves.
    You're talking about the Rome 1044-day bug? That's apples-and-oranges.
    Rome is already 4 years old and 2 generations behind, whereas Sapphire Rapids just started shipping earlier this year.
    The mitigation for the 1044-day bug is just reboot at least once every 2.86 years, which nearly all server operators will already be doing for software upgrades & maintenance.
    It's not like Intel CPUs don't have plenty of errata, including side-channel vulnerabilities on older Xeons they didn't even bother to release mitigations for.
    Intel has even removed features in shipping CPUs, like when they withdrew TSX via microcode updates!
    People in glass houses shouldn't throw stones.
  • Ravestein NL
    It's getting "normal" to market new stuff without extensive testing. Quality is less important than making the deadlines these days. These days it's in all the tech departments not only in chip and computer branches.
    And we the customers are the ones who pay for this, in my opinion, through the losses we take if something goes wrong with new tech. We pay a lot of money for it, but there seem to be no guarantees. It's saddening.
  • TerryLaze
    bit_user said:
    You're talking about the Rome 1044-day bug? That's apples-and-oranges.

    People in glass houses shouldn't throw stones.
    More like the exploding CPUs where AMD did not stop sending out their CPUs but rather opted to blame everybody else.
    AMD should have recalled every CPU and fixed their microcode or temperature sensor or whatever the problem is, just like Intel does here.
    They opt to not send out CPUs they know have an issue, and if they can, they fix them before sending them out.

    This is not even making the CPUs explode it just could interrupt system operation under certain conditions, as always you are the one throwing stones around.

    "We became aware of an issue on a subset of 4th Generation Intel Xeon Medium Core Count Processors (SPR-MCC) that could interrupt system operation under certain conditions
    ...
    ...
    Out of an abundance of caution, we did temporarily pause some SPR MCC shipments while we gained confidence in the expected firmware mitigation and expect to release remaining shipments shortly."
  • pbfonseca
    Just a correction: "Errata" is a plural, meaning "errors" or "list of errors". The singular is "erratum". Companies like Intel publish ONE Errata for each product, which gets updated with new entries over time.

    To be fair most pundits and even tech company representatives get this wrong nowadays.
  • tommo1982
    Years of Bulldozer and Piledriver made Intel complacent. For a company of their size, it will take some time before they resolve their issues. In the case of GPUs, I hope it's sooner rather than later. I'm looking forward to the next generation. Arc A750 is a decent GPU.
  • bit_user
    TerryLaze said:
    More like the exploding CPUs where AMD did not stop sending out their CPUs
    Fixed in firmware, just as Intel is stating it will do in this case.

    TerryLaze said:
    AMD should have recalled every CPU and fixed their microcode or temperature sensor or whatever the problem is, just like Intel does here.
    No, the article very clearly does not say that Intel is actually recalling anything. It just says they stopped shipment, pending a fix.

    TerryLaze said:
    They opt to not send out CPUs they know have an issue, and if they can, they fix them before sending them out.
    I think the difference is that Intel probably knows the problem could manifest quite frequently, if some software tickles it in just the right way.

    Another key difference is that AMD was a lot quicker in diagnosing the root cause and issuing a BIOS fix. If they stopped shipment of new CPUs, it'd have only been for a few days (and who knows? maybe they did!).

    TerryLaze said:
    This is not even making the CPUs explode it just could interrupt system operation under certain conditions,
    "Interrupt system operations" = system hang, which likely means a hard reboot + potential for data corruption.

    The real question is about the relative frequency. We only know of a handful of AMD CPUs that actually failed in the wild. If Intel believes its problem can manifest quite frequently, then the calculus is different.

    TerryLaze said:
    as always you are the one throwing stones around.
    No, I'm not the partisan operative, here.
  • bit_user
    Ravestein NL said:
    It's getting "normal" to market new stuff without extensive testing. Quality is less important than making the deadlines these days.
    What's ironic about that statement is that Sapphire Rapids was years late to market, depending on which roadmap you look at.

    Even as of quite recently, a volume ramp was expected in '22, but it didn't happen until earlier this year. Here's a slide Intel published in mid-Feb '22, showing they still expected a 2022 launch:
  • Elusive Ruse
    TerryLaze said:
    More like the exploding CPUs where AMD did not stop sending out their CPUs but rather opted to blame everybody else.
    AMD should have recalled every CPU and fixed their microcode or temperature sensor or whatever the problem is, just like Intel does here.
    They opt to not send out CPUs they know have an issue, and if they can, they fix them before sending them out.

    This is not even making the CPUs explode it just could interrupt system operation under certain conditions, as always you are the one throwing stones around.

    "We became aware of an issue on a subset of 4th Generation Intel Xeon Medium Core Count Processors (SPR-MCC) that could interrupt system operation under certain conditions
    ...
    ...
    Out of an abundance of caution, we did temporarily pause some SPR MCC shipments while we gained confidence in the expected firmware mitigation and expect to release remaining shipments shortly."
    Is it Intel's official stance on the matter, or your personal take? As an Intel employee you should always make that clear, because this load of horsecrap wouldn't even fly on a private Slack channel.