aron hsiao was here

 

So the state of things in local inference on my non-sleek non-Strix Halo non-Mac Studio build is that I’ve settled on:

  • The dual Radeon v620s running the chat and orchestrator model, which right now is Qwen 3.6 35B A3B

  • The empty space on the RX 6700xt being used for local embedding model, which right now is qwen3-embedding

The reason for this is that basically the 6700xt is slower than the v620s, and we’re using layer splits (more on this in a moment), which means that when part of the pipeline runs through the 6700xt, it slows down the output. In fact, including the 6700xt loses me about 10 t/s in inference speed while in practice only adding about 6-8GB to the inference VRAM pool (since it’s a 12GB card that’s also driving three displays).

With that said, I’ve continued to play with ROCm and try to learn WTF is going on, because:

  • ROCm works with a wider variety of components

  • ROCm holds out the promise of tensor parallelism (more on this in a moment as well)

And coming back to ROCm a few weeks later, with more experience under my belt, I’m better able to intuitively grok what’s happening and then confirm my suspicions as I go. So here’s what I’ve learned.

— § —

First, ROCm is about 30% faster than Vulkan on my hardware (dual v620s on PCI x16 that’s topologically just run through the Z390 chipset). That’s not nothing. So if possible, ROCm is to be preferred.

Second, ROCm is actually verrry ragged when it comes to stability. Things I started to suspect that I then was able to confirm with web searches (that LLMs didn’t proactively provide to me when I was trying to solve the ROCm problem before):

  • ROCm doesn’t want any card in the pool to run beyond about 85% VRAM usage as shown by rocm-smi. If you pass 85% you’re into crashy territory and if you hit 90% a crash is almost certainly imminent.

  • ROCm really isn’t built for llama.cpp or layer splits; it’s designed for tensor parallelism with a high degree of symmetry. Almost any tensor split value for layer splits other than 1:1 (i.e. split evenly across all cards) will cause big stability problems. As in, crashes every 2-3 prompts.

  • In general, ROCm also doesn’t really love MoE models. In most cases you’ll still get some crashiness even if everything else is perfect, but if you can solve everything else, it’ll be reduced to a crash every few dozen prompts. This can be helped by running llama-server under a watchdog so that upon crash it comes back up and we continue seamlessly, just with a slower response for that turn.
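
A minimal sketch of the watchdog idea, assuming nothing more than "restart the process whenever it exits with an error." The function name, the restart cap, and the backoff delay are all illustrative choices, not anything llama.cpp provides. A systemd unit with Restart=on-failure would do the same job.

```python
import subprocess
import sys
import time


def run_with_watchdog(cmd, max_restarts=5, backoff_s=2.0):
    """Run cmd and restart it whenever it exits with a non-zero status.

    Returns the number of restarts performed. Stops when the process exits
    cleanly (status 0) or when max_restarts has been used up.
    """
    restarts = 0
    while True:
        proc = subprocess.run(cmd)  # blocks until the child exits
        if proc.returncode == 0:
            return restarts  # clean shutdown, nothing to recover from
        if restarts >= max_restarts:
            return restarts  # give up rather than loop forever
        restarts += 1
        print(
            f"exited with {proc.returncode}; restart {restarts}/{max_restarts}",
            file=sys.stderr,
        )
        time.sleep(backoff_s)  # brief pause before bringing it back up
```

Usage would look something like run_with_watchdog(["llama-server", "--model", "/path/to/model.gguf"]), where the model path is a placeholder. A crash killed by a signal shows up as a negative return code, which this treats the same as any other failure.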

What I once thought mattered that probably actually didn’t matter:

  • Being extraordinarily picky and suspicious about PCI-e address space under linux; letting Linux map it with 4G+ and BAR enabled is likely enough to address the cards.

  • ROCm versions actually don’t seem to matter that much, gfx1030 is pretty well supported.

  • All kinds of tweaks and environment variables that cause LLMs to repeatedly say “Aha, I found it! You need to…” but then don’t solve the problem.

Basically the two sins I was committing were:

  • Trying to squeeze the nicest quant and biggest context I could into the VRAM pool, i.e. if I had 85% full I was like “Oh I can get a bigger quant, I still have 10GB free!” (Nope, doesn’t work that way.)

  • Trying to run 3 cards in splits and trying to tune those splits so that all cards would fill up at the same time. (Miracle it ever worked at all while I was trying that.)
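
To put numbers on the first sin, here is a rough sketch of the budget arithmetic. It assumes two 32GB cards (the 64GB pool mentioned elsewhere in this post) and treats the 85% and 90% figures above as hard thresholds, which is a simplification of what are really rules of thumb. The function names are made up for illustration.

```python
def vram_budget_gb(card_sizes_gb, ceiling=0.85):
    """Usable VRAM per card and in total if every card stays under `ceiling`."""
    per_card = [size * ceiling for size in card_sizes_gb]
    return per_card, sum(per_card)


def headroom_status(used_gb, total_gb):
    """Classify one card's usage against the 85% / 90% rules of thumb."""
    frac = used_gb / total_gb
    if frac >= 0.90:
        return "crash likely"
    if frac >= 0.85:
        return "unstable"
    return "ok"


per_card, total = vram_budget_gb([32, 32])
print(f"budget per card: {per_card[0]:.1f} GB, pool: {total:.1f} GB")
print(f"'free' space that is not really free: {64 - total:.1f} GB")
```

On a 64GB pool the ceiling works out to roughly 54GB of usable space, and the roughly 10GB above that is the headroom that looked free but wasn't.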

So while before I was trying to run Qwen 3.5 35B A3B at Q8_XL, and trying to tune --tensor-split so that all those % used counters matched exactly, now I’m strictly at 1:1 and I’ve had enough runs to see that if I set to anything other than 1:1 we’re essentially guaranteed to crash within the first three turns.
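
For reference, a small sketch of what "strictly 1:1" looks like as a llama-server argument list. The --model, --split-mode and --tensor-split flags are llama.cpp's own; the helper name and the model path are placeholders, and a real invocation would carry more options (context size, GPU layer count, and so on).

```python
def even_split_args(model_path, n_cards):
    """Build a llama-server argv that splits layers evenly across n_cards."""
    split = ",".join(["1"] * n_cards)  # e.g. "1,1" for two cards
    return [
        "llama-server",
        "--model", model_path,
        "--split-mode", "layer",   # layer split, not row split
        "--tensor-split", split,   # equal proportions for every card
    ]


print(" ".join(even_split_args("/path/to/model.gguf", 2)))
```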

— § —

What still doesn’t work?

Sadly, vllm. It should be possible to run vllm with tensor splits, which would theoretically give me better multi-context and a better /responses/ API, but there was a regression in the most recent versions that causes it to punt unless you can stand up P2P between the cards.

And, just as importantly, P2P between the cards. Two things on that point:

  1. I now understand that I have, fundamentally, the wrong CPU/mainboard architecture for this, because the Intel platforms at this price point only have 16 lanes to the CPU from PCI-e and only one slot runs direct; the rest of the x16 slots run through the chipset and are essentially x4 under the hood. So for a while, I was considering swapping out the Z390 and Core i9-9900k for an AMD Threadripper setup, though I think I’ve backed off of that. Threadripper gives each slot dedicated lanes to CPU. It also enables P2P between cards.

  2. Happily (and unhappily), before I pulled the trigger, I learned that the v620 / Radeon Pro “Navi” cards were really for data center fractional gaming provisioning, and not machine learning workloads, and thus they actually lack the hardware for P2P anyway. Not the end of the world, especially when you consider the price/performance here. I was able to put together 64GB of VRAM with compute and memory bandwidth that’s like double the speed of a 6700XT, and all for like $500 in cash. That’s a tremendously good deal, even if it won’t reach the same performance level as true machine learning / inference hardware.

Note that there may still be some benefit to Threadripper, even without P2P, as the dedicated x16 lanes to CPU for the two cards have far more bandwidth than the x4 lanes shared amongst the entire chipset-attached PCI-e bus (i.e. almost everything in the system that isn’t the 6700XT). However:

  • I’m not sure exactly how much benefit there will be to making that round trip happen on a true, unshared x16 pipe vs. an x4 pipe, so it’s hard to measure value or ROI.

  • The cheapo Threadripper on the market (i.e. X399/TR4, last generation) is only PCI-e 3.0, which has half the bandwidth of PCI-e 4.0. So I’m not that inclined to shell out for PCI-e 3.0 for undetermined benefits, but I’m also not inclined to shell out for 4.0 at a much higher price for, still, undetermined benefits. So we wait.
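
For a sense of scale, the theoretical link numbers can be worked out from the per-lane signalling rates (8 GT/s for PCI-e 3.0, 16 GT/s for 4.0, both with 128b/130b encoding). This is a back-of-envelope sketch: it gives one-direction peak bandwidth before protocol overhead, and says nothing about how much of that bandwidth layer-split inference actually uses, which is the open question above.

```python
# Raw per-lane signalling rate in GT/s for each PCI-e generation.
GT_PER_S = {"3.0": 8.0, "4.0": 16.0}


def pcie_gb_per_s(gen, lanes):
    """Theoretical one-direction bandwidth in GB/s for a PCI-e link.

    Applies the 128b/130b line encoding used by PCI-e 3.0 and 4.0 and
    converts bits to bytes; real-world throughput is lower.
    """
    return GT_PER_S[gen] * (128 / 130) / 8 * lanes


for gen, lanes in [("3.0", 4), ("3.0", 16), ("4.0", 16)]:
    print(f"PCI-e {gen} x{lanes}: {pcie_gb_per_s(gen, lanes):.1f} GB/s")
```

So a chipset-shared x4 link at 3.0 tops out just under 4 GB/s, a dedicated 3.0 x16 slot at about four times that, and a 4.0 x16 slot at double again.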

— § —

So that’s the state of things. If I had it all to do again, what would I do differently?

  • Get on Threadripper at the last rebuild (when I moved from an i7-3770k to the i9-9900k and to the Z390 chipset). I was tempted, but I stuck with Intel for the faster single-thread interactive (web, photos, etc.). Who knew that a few months later LLMs would hit the mainstream? But in any case, the AMD platform is obviously better for local inference; Intel consumer is hobbled.

  • Consider a different family of retired server hardware (Instinct or similar) on eBay. The AMD data center hardware is still the right move; it’s dirt cheap and readily available if you’re willing to do a bit of hacking. However, for inference, having more modern hardware with faster compute and higher bandwidth is offset by the ability to run P2P with tensor parallelism on slower, cheaper cards. So there’s no reason, if you’re doing multi-card, not to go for the slower, cheaper, older hardware, which, since you’re able to run P2P with parallelism, will end up at the same speed as a couple of v620s that can’t.

  • Not bother replacing the old RX480 with a 6700XT, since the RX480 could also have run an embedding model and it proved not to be practical or worthwhile to bother with adding the 6700XT to the pool. From the outside before this all started I was thinking, in part with help from LLMs, that it would be good to have three cards that were the same compute architecture (Navi / gfx103x) and the 6700XT with 12GB would add yet a few more GB to the pool. In practice, the LLMs were exactly wrong; there is basically no advantage to the 6700XT and adding it to the pool makes things either slower or less stable or both.

  • Not listen to LLMs so much or use them for search so much. My real unlock came when I started to Google search and skip past AI results. AI has a lot of opinions about AI, but they’re all wrong. Even when you ask it to do web search. Better just to hang out in the repos and on GitHub and read the interactions.

And finally, for anyone looking to run v620s on Linux for inference, my kernel command line is:

pci=realloc,earlydump amdgpu.gpu_recovery=1 amdgpu.noretry=1 amdgpu.ras_enable=0 amdgpu.mcbp=0 iommu=pt intel_iommu=on pci=big_root_window pcie_aspm=off amdgpu.runpm=0 pcie_port_pm=off amdttm.pages_limit=16777216 ttm.pages_limit=16777216 amdttm.page_pool_size=1048576 ttm.page_pool_size=1048576 amdgpu.gartsize=4096

Pair this with BIOS settings that enable addressing beyond 4GB and that enable BAR and VT-d/IOMMU and they’ll get seen. Crazy to remember that I spent the first day just trying to get the cards to (first off) POST, and then after that, (next) be seen by the Linux kernel.
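
If you want to confirm that the parameters actually made it through the bootloader, a quick check against /proc/cmdline is enough. This is a sketch, assuming a plain token-by-token comparison is good enough; the function name is made up, and it only confirms the tokens are present, not that the kernel or driver honored them.

```python
# The kernel command line from this post, split into individual tokens.
EXPECTED = (
    "pci=realloc,earlydump amdgpu.gpu_recovery=1 amdgpu.noretry=1 "
    "amdgpu.ras_enable=0 amdgpu.mcbp=0 iommu=pt intel_iommu=on "
    "pci=big_root_window pcie_aspm=off amdgpu.runpm=0 pcie_port_pm=off "
    "amdttm.pages_limit=16777216 ttm.pages_limit=16777216 "
    "amdttm.page_pool_size=1048576 ttm.page_pool_size=1048576 "
    "amdgpu.gartsize=4096"
).split()


def missing_params(cmdline_text, expected=EXPECTED):
    """Return the expected tokens that do not appear in cmdline_text."""
    present = set(cmdline_text.split())
    return [param for param in expected if param not in present]


# On the running system:
#   with open("/proc/cmdline") as f:
#       print(missing_params(f.read()) or "all parameters present")
```

Tokens are compared whole rather than parsed into key/value pairs because pci= appears twice with different values.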

I’ve learned a lot. Not sure how transferrable it is, but it’s nice to be in a space where the smoke has cleared.

— § —

Bonus note:

I actually can run Qwen 3.5 122B A10B well on the two v620s at (say) Unsloth UD_IQ3 and I like its output a lot, better than Qwen 3.5 A3B at Q6_XL. So if you’re wanting to run a “big” model like that (at least, big for home office purposes), it’s totally doable. I get about 27 tokens/sec on inference, which is quite respectable. I have to do it with Vulkan, though, where I can push the memory use right to “full”; on ROCm we just don’t have enough space given that ideally we need to stay below 80-85% use for stability purposes, and I don’t want to go more compressed than Q3.

Thing is, Qwen 3.6 35B A3B at Q6_XL with ROCm delivers ~55 tokens/sec, no MTP. Twice as fast. It’s really, really hard to sit and be patient for 122B when 35B is twice as fast and still… acceptable. So that’s where I am now. But if you’re wanting to run 122B or similar biggish MoE, UD_IQ3 and 27 tokens/sec is pretty damned good.