A Redditor has caught the tech community's attention by running a 1-trillion-parameter large language model (LLM) locally on a system with just one GPU. The key to the feat was 768GB of Intel Optane Persistent Memory (PMem) DIMMs, repurposed to function as system RAM.
Running an LLM of this size locally would normally require a very large amount of conventional high-speed RAM, often paired with multiple A6000 or A100 GPUs. The cost and complexity of such a setup usually confine these models to cloud or data-center environments. This Redditor's approach points to a more accessible, if unconventional, path.
Intel Optane PMem DIMMs are slower than standard DDR4 or DDR5 RAM, but they offer significantly higher capacities at a much lower price per gigabyte. By configuring a workstation to use these DIMMs as memory, the user built a pool large enough to hold the 1-trillion-parameter model. The model in question was a local Kimi K2.5 install, which shows that practical inference is achievable even with Optane's slower memory access.
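To put the capacity requirement in perspective, here is a rough back-of-the-envelope sketch of how much memory the weights alone would need at different precisions. The parameter count and the 768GB pool come from the article; the bit widths are hypothetical, since the article does not say which quantization was used, and the estimate ignores the KV cache, activations, and operating system overhead.

```python
def weights_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 1e12   # "1-trillion-parameter", as stated in the article
POOL_GB = 768     # Optane PMem capacity, as stated in the article

# Hypothetical precisions: the article does not state the quantization used.
for bits in (16, 8, 4):
    need = weights_gb(N_PARAMS, bits)
    verdict = "fits" if need <= POOL_GB else "does not fit"
    print(f"{bits:>2}-bit weights: {need:,.0f} GB -> {verdict} in {POOL_GB} GB")

# Largest average bits per parameter that still fits in the pool (weights only).
max_bits = POOL_GB * 1e9 * 8 / N_PARAMS
print(f"Break-even: about {max_bits:.2f} bits per parameter")
```

Under these assumptions, the weights only fit in the 768GB pool when stored at an average of roughly 6 bits per parameter or less, which suggests some form of quantization was involved.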
The observed performance, estimated at roughly four tokens per second, is respectable for a single-GPU setup given the model's size. The experiment opens up intriguing possibilities for researchers and enthusiasts who want to run large models without the prohibitive cost of top-tier, specialized hardware. It also underscores the potential of repurposing enterprise-grade memory for memory-hungry consumer workloads, and it resets expectations of what is possible on a more modest budget.
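For a sense of what four tokens per second means in practice, the sketch below converts that rate into response time and into the memory bandwidth it would imply if generation were limited purely by reading weights. Only the four tokens per second figure comes from the article; the 16 GB read per token and the 500-token response length are hypothetical values chosen for illustration, and the model ignores caching, GPU-resident layers, and compute time.

```python
def generation_seconds(n_tokens: int, tokens_per_s: float = 4.0) -> float:
    """Time to generate n_tokens at a steady rate (default: the article's ~4 tok/s)."""
    return n_tokens / tokens_per_s


def implied_bandwidth_gbs(tokens_per_s: float, gb_read_per_token: float) -> float:
    """Bandwidth (GB/s) implied if each token requires reading gb_read_per_token of weights."""
    return tokens_per_s * gb_read_per_token


# Hypothetical response length; not a figure from the article.
print(f"500-token reply: {generation_seconds(500):.0f} s")

# Hypothetical 16 GB of weights touched per token; the article gives no such number.
print(f"Implied bandwidth: {implied_bandwidth_gbs(4.0, 16.0):.0f} GB/s")
```

The useful takeaway is the relationship rather than the specific numbers: at a fixed memory bandwidth, throughput falls in proportion to how much weight data each token has to pull from the slower memory tier.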



