TECHNOLOGY · LOCAL AI

DeepSeek locally, but the real challenge is prefill

Salvatore Sanfilippo shows DeepSeek v4 Flash running on a MacBook Pro with 128 GB of RAM and uses the demo to argue that the bottleneck is no longer the model, but the way it gets into context.

Salvatore Sanfilippo · May 2, 2026 · 6 min read

On his MacBook Pro with 128 GB of RAM, Salvatore Sanfilippo shows a 270-billion-parameter model that, he says, really does run locally. The scene is not a polished demo, but a test of patience: fast startup, convincing answers, then a brutal slowdown as context grows. The thesis running through the whole demonstration is simple and uncomfortable for the AI market: the problem is no longer just fitting the model into the machine, but making it part of everyday work without prefill becoming an unsustainable tax. For Sanfilippo, that is where the product is decided.

Local is not a toy

Sanfilippo opens with his idea of local inference: not a hobby for tinkerers, but a way to actually use a frontier model on your own machine. He says, however, that talking about democratization would be an exaggeration, because a MacBook with 128 GB of RAM still costs too much to count as accessible.

It’s the most incredible, most frontier model possible to run with 128 GB of RAM on home computers, with some compromises in speed.

It’s not exactly a very democratizing operation for AI, because you still have to shell out quite a bit of money for a machine like that.

His argument is not that the desktop market has suddenly become fair, but that for a segment of users local inference has stopped being an experiment. That distinction matters: changing the toy is not enough; you have to change the workflow.

Quality beats purity

The second part of the argument concerns the new project Sanfilippo is preparing, still not on GitHub. Here the issue is not only technical, but also political: the llama.cpp project does not accept contributions written mostly with AI, while he prefers the inference engine code to be produced with GPT 5.5 under tighter guidance.

To me, that is a perspective that makes no sense.

If I have to have a hand-written llama.cpp that still doesn’t have DeepSeek’s inference, I’d rather have much tighter control over the product and the product ideas.

His objection is not unconditional praise of automation. He says the opposite: without knowing how to direct the model, the result is mediocre, even if GPT 5.5 is used. In this story, code purity matters less than the quality of the final product and the ability to enforce coherent choices on templates, sampling, and usage.

The real bottleneck

The most convincing demonstration comes when Sanfilippo shows that DeepSeek v4 Flash uses only a little over an extra gigabyte for a 32K-token context window, even though it starts from an 81 GB GGUF file. With a 250K-token window, he says, the model still works, but memory cost rises and speed begins to drop.

This stuff is incredible, the GGUF file is 81 GB. With 82 GB we can fit this context window in.
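The arithmetic behind those figures can be sketched in a few lines of Python. This is a rough estimate, not a measurement: it rounds "a little over an extra gigabyte" to exactly 1 GiB, treats 32K as 32 × 1024 tokens, and assumes that cache memory grows linearly with context length, none of which the demo states explicitly. The function names are hypothetical.

```python
# Back-of-the-envelope context memory, based on the figures quoted in the demo.
# Assumption: cache memory grows linearly with the number of tokens.

GIB = 1024**3


def per_token_bytes(extra_bytes: float, context_tokens: int) -> float:
    """Average cache cost per token implied by an observed memory overhead."""
    return extra_bytes / context_tokens


def projected_context_gib(per_token: float, context_tokens: int) -> float:
    """Linear projection of cache size, in GiB, for a given window."""
    return per_token * context_tokens / GIB


# About 1 GiB extra for a 32K window (rounded down from "a little over").
cost = per_token_bytes(1 * GIB, 32 * 1024)
print(f"{cost / 1024:.0f} KiB per token")            # 32 KiB per token
print(f"{projected_context_gib(cost, 250_000):.1f} GiB at 250K tokens")  # 7.6 GiB
```

Under these assumptions a 250K-token window would add roughly 7.6 GiB on top of the 81 GB of weights, which is consistent with the claim that memory cost rises but the model still fits in 128 GB.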

Performance drops are truly catastrophic as the context window grows.

Here his enthusiasm shifts from power to practicality. The model, he says, remains very fast in short answers and early interactions, but when context explodes the system starts paying the price of its own memory. That is why he insists on prefill, meaning the phase in which the prompt is processed before actual generation begins.
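A minimal sketch can show why prefill dominates the wait as prompts grow. The throughput numbers below are invented for illustration and are not taken from the demo. The model also assumes constant rates, which is optimistic given that Sanfilippo reports speed dropping as context grows.

```python
from dataclasses import dataclass


@dataclass
class Throughput:
    prefill_tps: float  # prompt tokens processed per second (hypothetical)
    decode_tps: float   # tokens generated per second (hypothetical)


def time_to_first_token(prompt_tokens: int, t: Throughput) -> float:
    """Seconds spent in prefill before the first output token appears."""
    return prompt_tokens / t.prefill_tps


def total_time(prompt_tokens: int, output_tokens: int, t: Throughput) -> float:
    """Prefill time plus generation time, assuming constant rates."""
    return time_to_first_token(prompt_tokens, t) + output_tokens / t.decode_tps


rates = Throughput(prefill_tps=200.0, decode_tps=20.0)  # made-up numbers
for prompt in (500, 32_000, 250_000):
    wait = time_to_first_token(prompt, rates)
    total = total_time(prompt, 200, rates)
    print(f"{prompt:>7} prompt tokens: {wait:7.1f}s prefill, {total:7.1f}s total")
```

With these placeholder rates, a 200-token answer always takes 10 seconds to generate, while the prefill wait goes from a few seconds to many minutes as the prompt grows. That asymmetry is the point of the argument, whatever the real numbers turn out to be.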

Benchmarks, shell, and agents

Sanfilippo uses very concrete examples to argue that the model does not behave like many systems that “overthink”. He asks it about the presidents of the Italian Republic, then asks which one was the most controversial, and DeepSeek answers with Francesco Cossiga and a coherent summary of his political periods.

It doesn’t do constant overthinking, that kind of overthinking for every answer that we’re used to seeing with certain models.

List the last presidents of the Italian Republic.

For him, the point is that benchmarks no longer describe overall model quality very well. DeepSeek, he argues, shows a density of knowledge that goes beyond the metric, and that makes it suitable for practical uses such as code reading and shell assistance.

This program is a simple interactive command interpreter, a sort of mini-shell that uses linenoise to handle terminal input.

The problem is that tool output is very large.

This is where his second target comes in: the layering of tools, agents, and system prompts that makes everything heavier than necessary. He shows that Open Code can use up to 11,000 tokens of system prompt, a figure he considers disproportionate and a symptom of an architecture that does not really know the model it is working with.
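To put the 11,000-token figure in proportion, the short calculation below compares it with the 32K window mentioned earlier. It assumes 32K means 32 × 1024 tokens, and the function name is hypothetical.

```python
def prompt_share(system_prompt_tokens: int, window_tokens: int) -> float:
    """Fraction of the context window used before the user types anything."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    return system_prompt_tokens / window_tokens


window = 32 * 1024  # assumption: "32K" read as 32,768 tokens
share = prompt_share(11_000, window)
print(f"{share:.0%} of the window")            # 34% of the window
print(f"{window - 11_000} tokens left over")   # 21768 tokens left over
```

On that reading, about a third of a 32K window is spent on instructions before any code or tool output enters the context, and all of it has to go through prefill.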

The real bet is prefill

At the end, Sanfilippo narrows the focus: generation itself can remain acceptable; the problem is speeding up prefill and handling context compaction better. He says the next few days and weeks will be about improving DS4 kernels, because that is where he expects significant gains.

We can keep generation speed as it is. The point is prefill.

It would be nice to be able, especially in prefill, to do a bit more.

His idea is that a 2-bit model on high-end hardware can already be useful, but the real leap comes when you can compress context without losing the conversation’s memory. It is no longer enough to get the LLM running, or even to have it answer a single prompt well: it has to fit into a continuous workflow, with acceptable timing and tools that do not become bottlenecks.
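One common way to compress context while keeping some memory of the conversation is to replace older turns with a summary once a token budget is exceeded. The sketch below illustrates that general idea only. It is not Sanfilippo's implementation, the word-count tokenizer and the placeholder summarizer are stand-ins, and all names are hypothetical.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def compact(
    turns: List[str],
    budget: int,
    keep_recent: int,
    summarize: Callable[[List[str]], str],
) -> List[str]:
    """Replace older turns with one summary when the history exceeds the budget."""
    total = sum(count_tokens(t) for t in turns)
    if total <= budget or len(turns) <= keep_recent:
        return turns  # nothing to compact
    cut = len(turns) - keep_recent
    old, recent = turns[:cut], turns[cut:]
    return [summarize(old)] + recent


def naive_summary(old: List[str]) -> str:
    # Placeholder: a real system would ask the model to write the summary.
    return f"[summary of {len(old)} earlier turns]"


history = ["a b c", "d e f", "g h", "i"]
print(compact(history, budget=5, keep_recent=2, summarize=naive_summary))
# ['[summary of 2 earlier turns]', 'g h', 'i']
```

The trade-off is that the compacted history is a new prompt, so it has to be prefilled again from the summary onward. That is one reason compaction and prefill speed are tied together in the argument above.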

There will be waiting, and then you absolutely have to launch prompt compaction.

This model is truly a beautiful model; for asking fairly interesting things that it answers quickly, it is ideal.

The demonstration ends with a double conclusion. DeepSeek v4 Flash, in Sanfilippo’s view, is already strong enough to be usable locally; the still-unsolved part is context engineering, meaning everything that happens before and around the answer. That is where, he says, we will learn whether local AI will really be a work environment or just a demo that went well.

FAQ

Why does Sanfilippo use DeepSeek locally?

He uses it to have direct control over inference and the product. According to him, on certain machines local AI is no longer a toy.

What is the main limit he showed?

The main limit is the slowdown as context grows. Sanfilippo says performance drops become “catastrophic” as the window gets longer.

Why does he criticize Open Code?

Because, in his view, it uses 11,000 tokens of system prompt. For him, that is a sign of an architecture that is too heavy and not sufficiently aware of the underlying model.

What would he like to improve in the next few days?

He would like to improve the DS4 kernels, especially prefill. He says that is where significant gains can be made.

AI-assisted summary of Salvatore Sanfilippo's podcast, verified against the original transcript.