DeepSeek on Apple Silicon: In-Depth Test on 4 MacBooks

Hey everyone, it’s Rupnath back with another deep dive into the world of local LLMs. This time, we’re looking at DeepSeek R1, the latest top-tier large language model creating a buzz. It’s free, but remember: if you don’t pay, you are the product. I value my privacy, so I’m exploring how to run DeepSeek locally on Apple Silicon. And trust me, the hardware really matters. I’ve got four MacBooks ready to be put to the test: M1, M2, M3, and M4 Max. Let’s see how they handle the power of DeepSeek.

Why Local LLMs Matter (and Why Hardware is King)

Running LLMs locally is becoming increasingly popular, and for good reason. You get the power of a state-of-the-art model without sending your data to some far-off server. Plus, you’re not at the mercy of internet connectivity. But here’s the catch: hardware is the bottleneck. You can run DeepSeek on a Raspberry Pi, like Jeff Geerling showed. You can also run it on a Jetson Nano; I made a video about that recently. But if you want speed and performance, you need something more robust. That’s where my collection of MacBooks comes in. We’re going to look at how much performance varies among different Apple Silicon chips.

The Tools of the Trade: Ollama and LM Studio

I’m a developer, so I’m comfortable with command-line tools like llama.cpp (check out my previous video on that if you’re interested). But for this post, I want to show user-friendly options that work for everyone, no matter their technical skill level. That’s where Ollama and LM Studio shine.

  • Ollama: A simple, cross-platform tool that lets you download and run LLMs with minimal fuss. It even runs a local server that developers can build against!

    • Installation: Super easy. Download the macOS version from ollama.com, drag it to Applications, and run it from the terminal.

    • Running a Model: Use the ollama run command followed by the model name (more on that below).

  • LM Studio: Offers even more versatility and control over your models. Plus, it has a slick graphical user interface.

    • Installation: Similar to Ollama – download, drag to Applications, and run.

    • Model Management: LM Studio makes it easy to find, download, and switch models.

Both tools make it easy to run DeepSeek locally, and both are also available on Windows and Linux, so you’re not locked into a Mac.
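
To make that concrete, here’s a minimal terminal sketch of both paths, assuming common defaults: deepseek-r1:1.5b is the Ollama library tag I’d expect for the 1.5B R1 distill, Ollama’s server listens on port 11434, and LM Studio’s optional local server usually sits on port 1234 with OpenAI-compatible endpoints. Adjust the names and ports to whatever your installs actually show.

    # Ollama: pull and chat with the 1.5B DeepSeek R1 distill (tag assumed)
    ollama pull deepseek-r1:1.5b
    ollama run deepseek-r1:1.5b "Explain quantization in one paragraph."

    # Ollama also exposes a local HTTP API for developers (default port 11434)
    curl http://localhost:11434/api/generate \
      -d '{"model": "deepseek-r1:1.5b", "prompt": "Hello!", "stream": false}'

    # LM Studio: after loading a model in the GUI, enable its local server and
    # query the OpenAI-compatible endpoint (model id is whatever LM Studio shows)
    curl http://localhost:1234/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "deepseek-r1-distill-qwen-1.5b", "messages": [{"role": "user", "content": "Hello!"}]}'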

Quantization: Shrinking Models for Smaller Hardware

Before we jump into the benchmarks, let’s talk about quantization. It’s a method of compressing a model by lowering the numerical precision of its weights (for example, from 16-bit floats down to 4-bit integers). This makes the model smaller and faster, but it can also impact the quality of its output.

  • Understanding Quantization Levels: You’ll often see quantizations like Q4 and Q8, where the number roughly corresponds to bits per weight. Lower numbers mean more aggressive quantization: smaller files and less RAM needed, but potentially lower output quality.

  • Hugging Face: Browse Hugging Face for quantized versions of DeepSeek and other models. You’ll find file sizes and quality notes, thanks to community contributors like bartowski, which makes it much easier to choose the right build for your hardware.

Ollama often picks a default quantization for you. But it’s helpful to know what options you have and how they impact performance. I’ve got a separate video coming out on quantizations, so stay tuned for a deeper dive into this topic!
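
To show what that looks like in practice, here’s a hedged sketch of grabbing a specific quantization instead of the default. The exact tag and repo names are assumptions based on how Ollama and bartowski’s GGUF repos are usually named, so check the actual model pages before copying them.

    # Ollama's default deepseek-r1:1.5b tag is a Q4_K_M build; other quantizations
    # are published under separate tags (tag name assumed; check the library page)
    ollama pull deepseek-r1:1.5b-qwen-distill-q8_0

    # Or download a specific GGUF file from Hugging Face (repo and file names assumed)
    pip install -U "huggingface_hub[cli]"
    huggingface-cli download bartowski/DeepSeek-R1-Distill-Qwen-1.5B-GGUF \
      DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf --local-dir ./models

LM Studio handles this step for you: its download panel lists the available quantizations with their file sizes, so you can pick one that fits your RAM.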

Benchmarking DeepSeek on Apple Silicon

Now for the main event: let’s see how DeepSeek performs on my four MacBooks.
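
If you want to reproduce these numbers yourself, a minimal approach with Ollama is its --verbose flag, which prints timing statistics (including an "eval rate" in tokens/second) after each response; the prompt here is just an example. LM Studio shows an equivalent tokens-per-second readout in its chat window.

    # Look for the "eval rate: ... tokens/s" line printed after the response
    ollama run deepseek-r1:1.5b --verbose "Write a haiku about Apple Silicon."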

M1 MacBook Air (8GB RAM)

  • Model: DeepSeek R1 1.5B (Q4_K_M)

  • Ollama Performance: ~33 tokens/second

  • LM Studio Performance: ~58 tokens/second

  • Observations: The M1 held its own with the smaller 1.5B model, especially using LM Studio. Usable, but not blazing fast. Trying to run larger models led to memory issues.

M2 MacBook Air (8GB RAM)

  • Model: DeepSeek R1 1.5B (Q4_K_M)

  • Ollama Performance: ~50 tokens/second

  • LM Studio Performance: ~54 tokens/second

  • Observations: Ollama throughput improved noticeably over the M1, while LM Studio landed in the same range. Again, the machine was limited by RAM when trying larger models; 8GB is definitely a constraint here.

M3 MacBook Air (8GB RAM)

  • Model: DeepSeek R1 1.5B (Q4_K_M)

  • Ollama Performance: ~45 tokens/second

  • LM Studio Performance: ~54 tokens/second

  • Observations: Like the M2, the 8GB of RAM keeps this machine restricted to the smaller models. Larger models like the 8B struggled badly, even with aggressive quantization and MLX optimisation. The MLX build improved speed and memory stability over the GGUF version, but it still didn’t make the larger models fully usable.

M4 Max MacBook Pro (128GB RAM)

  • Model: DeepSeek R1 1.5B (Q4_K_M)

  • Ollama Performance: ~162 tokens/second

  • LM Studio Performance: ~182 tokens/second

  • Observations: The M4 Max is a beast! It tore through the 1.5B model. Even with the 70B model, it managed a respectable (though not ideal) 9-10 tokens/second. This really showcases the impact of having ample RAM and a powerful GPU. I’ve got a 4TB SSD in this machine, but even that is filling up fast with all the models I’ve downloaded (see the quick cleanup sketch below). More testing to come on this powerhouse!
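
Since model files pile up quickly, here’s a small housekeeping sketch using Ollama’s built-in commands; the 70B tag is just an example of something you might want to clear out. LM Studio has an equivalent model-management view in its GUI.

    # See what's downloaded and how much disk space each model takes
    ollama list

    # Remove a model you no longer need (tag assumed as an example)
    ollama rm deepseek-r1:70b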

Key Takeaways from the Benchmarks

  • RAM is Crucial: The 8GB RAM in the M1, M2, and M3 severely limited the size of the models they could run effectively.

  • M4 Max Dominates: The M4 Max, with its 128GB of RAM, showed the true potential of running large LLMs locally.

  • LM Studio’s Edge: LM Studio consistently delivered higher token rates than Ollama in these tests, especially on the M1.

  • MLX Optimization: The MLX format on Apple Silicon showed great improvements, especially with bigger models. This highlights the benefits of optimising for specific hardware (see the sketch after this list).

  • Quantization is a Balancing Act: Aggressive quantization can shrink model sizes. But if you push it too hard, the output quality can suffer. I learned this with the heavily quantized 14B model.
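
For anyone who wants to try the MLX path outside of LM Studio, here’s a hedged sketch using the mlx-lm Python package’s command-line interface. The mlx-community model name is an assumption about what’s published on Hugging Face, and the CLI invocation can differ slightly between mlx-lm versions, so treat this as a starting point.

    # Install Apple's MLX LLM tooling
    pip install mlx-lm

    # Generate with a 4-bit MLX conversion of the 1.5B distill (model repo assumed)
    python -m mlx_lm.generate \
      --model mlx-community/DeepSeek-R1-Distill-Qwen-1.5B-4bit \
      --prompt "Summarize what MLX is in two sentences."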

Why You Shouldn’t Run DeepSeek on the Official Website

I can’t stress this enough: avoid sending sensitive data to the official DeepSeek website. You’re sending your information to a server in China, and your privacy is at risk. If you can run it locally, do it.

Final Thoughts and Future Explorations

This exploration has just scratched the surface. I’m planning more in-depth testing on the M4 Max with larger models and various quantization levels. I also want to delve deeper into the creative capabilities of these different models.

Local LLMs are an exciting frontier, and tools like Ollama and LM Studio make them accessible to everyone. The hardware you choose is paramount. If you’re serious about running large LLMs, invest in a machine with ample RAM and a powerful GPU. The results are well worth it.

Stay tuned for more insights into AI and machine learning! Also, check out my other articles. Subscribe to the blog for more content like this.

Read Also:

How I Would Learn Python with ChatGPT in 2025

DeepSeek vs. OpenAI-01 for Data Science Tasks: Which One is Better?

4 Micro-SaaS Ideas You Can Build to Make $100k/Month
