If you lower your expectations, of course. Think more Llama2-7B, less GPT-4
Popular generative AI chatbots and services like ChatGPT or Gemini mostly run on GPUs or other dedicated accelerators, but as smaller models are more widely deployed in the enterprise, CPU-makers Intel and Ampere are suggesting their wares can do the job too – and their arguments aren't entirely without merit.
While slow compared to modern GPUs, it's still a sizeable improvement over Chipzilla's 5th-gen Xeon processors launched in December, which only managed 151ms of second token latency. Ampere, meanwhile, has shown the smaller Llama2-7B model running on its Altra CPUs: pitting its 64-core OCI A1 instance against a 4-bit quantized version of the model, Oracle was able to achieve between 33 and 119 tokens per second of throughput at batch sizes of 1 and 16, respectively.
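To put those figures on a common footing, a quick back-of-envelope conversion helps: single-stream generation rate is just the reciprocal of second-token latency, and aggregate batch throughput divided by batch size gives the rate each concurrent user sees. A minimal sketch using only the numbers quoted above:

```python
# Back-of-envelope conversions for the figures quoted in the article.
# The only formula here is the standard reciprocal relationship
# between per-token latency and token generation rate.

def tokens_per_sec(second_token_latency_ms: float) -> float:
    """Steady-state single-stream generation rate implied by a latency."""
    return 1000.0 / second_token_latency_ms

# 5th-gen Xeon's 151 ms second-token latency works out to roughly 6.6 tok/s:
xeon5_rate = tokens_per_sec(151)

# Oracle's A1 result: 119 tok/s aggregate at batch 16 means each of the
# 16 concurrent streams sees about 7.4 tok/s.
a1_per_user = 119 / 16

print(f"{xeon5_rate:.1f} tok/s single stream, {a1_per_user:.1f} tok/s per user")
```

In other words, both CPU platforms land in the same mid-single-digit tokens-per-second-per-user range – readable speed, but nothing like GPU-class serving.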
While Intel and Ampere have demonstrated LLMs running on their respective CPU platforms, it's worth noting that various compute and memory bottlenecks mean they won't replace GPUs or dedicated accelerators for larger models. As the name suggests, Intel's AMX (Advanced Matrix Extensions) are designed to accelerate the kinds of matrix math calculations common in deep learning workloads. Intel has since beefed up its AMX engines to achieve higher performance on larger models – a trend that appears to continue with its Xeon 6 processors, due out later this year.
Now that might sound fast – certainly way speedier than an SSD – but the eight HBM modules found on AMD's MI300X and Nvidia's upcoming accelerators are capable of speeds of 5.3 TB/sec and 8 TB/sec, respectively. The main drawback is a maximum of 192GB of capacity. Wittich notes Ampere is also looking at MCR DIMMs, but didn't say when we might see the tech employed in silicon.
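Memory bandwidth matters because LLM token generation is usually bandwidth bound: in the simplest model, every parameter must be streamed from memory once per generated token, so bandwidth divided by weight size gives a rough ceiling on decode rate. A hedged sketch of that arithmetic – it deliberately ignores KV-cache traffic, compute limits, and batching, so real numbers will be lower:

```python
# Rough bandwidth-bound ceiling on single-stream decode rate.
# Simplifying assumption: each of the model's weights is read from
# memory exactly once per generated token (ignores KV cache, compute).

def decode_ceiling(mem_bw_tb_s: float, params_b: float,
                   bytes_per_param: float) -> float:
    """Upper bound on tokens/sec given memory bandwidth (TB/s),
    parameter count (billions), and bytes per parameter."""
    weight_bytes = params_b * 1e9 * bytes_per_param
    return mem_bw_tb_s * 1e12 / weight_bytes

# Llama2-7B quantized to 4 bits (~0.5 bytes/param) against the
# HBM bandwidth figures quoted above:
print(decode_ceiling(5.3, 7, 0.5))   # MI300X-class HBM, ~1514 tok/s
print(decode_ceiling(8.0, 7, 0.5))   # 8 TB/sec-class HBM, ~2286 tok/s
```

The same formula explains why a few hundred GB/sec of DDR5 or MCR DIMM bandwidth caps a CPU at tens of tokens per second on a 7B model – and why larger models fall off so quickly.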
"The sweet spot right now from a customer perspective is that 7–13 billion parameter model. That's where we put most of our focus today," Wittich declared. For its part, Intel says Xeon 6 performs quite well when running smaller models, with second token latencies as low as 20ms for a dual-socket config running the more challenging BF16 data type.