耀
a
r
o
6
e
d
g
2
l
p
a
n

a
r
o
n
h
s
i
a
o
w
a
s
h
e
r
e

 

 

So the state of things in local inference on my non-sleek non-Strix Halo non-Mac Studio build is that I’ve settled on:

  • The dual Radeon v620s running the chat and orchestrator model, which right now is Qwen 3.6 35B A3B

  • The empty space on the RX 6700xt being used for local embedding model, which right now is qwen3-embedding

The reason for this is that basically the 6700xt is slower than the v620s, and we’re using layer splits (more on this in a moment) which means that when part of the pipeline runs through the 6700xt, it slows down the output. In fact, including the 6700xt loses me about 10t/s in inference speed while in practice only adding about 6-8MB to the inference VRAM pool (since it’s a 12GB card that’s also driving three displays).

With that said, I’ve continued to play with ROCm and try to learn WTF is going on, becasue:

  • ROCm works with a wider variety of components

  • ROCm holds out the promise of tensor parallelism (more on this in a moment as well)

And coming back to ROCm a few weeks later, with more experience under my belt, I’m better able to intuitively grok what’s happening and then confirm my suspicions as I go. So here’s what I’ve learned.

— § —

First, ROCm is about 30% faster that Vulkan on my hardware (dual v620s on PCI x16 that’s topologically just run through the Z390 chipset). That’s not nothing. So if possible, ROCm is to be preferred.

Second, ROCm is actually verrry ragged when it comes to stability. Things I started to suspect that I then was able to confirm with web searches (that LLMs didn’t proactively provide to me when I was trying to solve the ROCm problem before):

  • ROCm doesn’t want any card in the pool to run beyond about 85% VRAM usage as shown by rocm-smi. If you pass 85% you’re into crashy territory and if you hit 90% a crash is almost certainly imminent.

  • ROCm really isn’t built for llama.cpp or layer splits; it’s designed for tensor parallelism with a high degree of symmetry. Almost any tensor split value for layer splits other than 1:1 (i.e. split evenly across all cards) will cause big stability problems. As in, crashes every 2-3 prompts.

  • In general, ROCm also doesn’t really love MoE models, you’ll still get some crashiness even if everything else is perfect in most cases, but if you can solve everything else, it’ll be reduced, i.e. every few dozen prompts. This can be helped by running llama-server under a watchdog so that upon crash it comes back up and we continue seamlessly, just with a slower response for that turn.

What I once thought mattered that probably actually didn’t matter:

  • Being extraordinarily picky and suspicious about PCI-e address space under linux; letting Linux map it with 4G+ and BAR enabled is likely enough to address the cards.

  • ROCm versions actually don’t seem to matter that much, gfx1030 is pretty well supported.

  • All kinds of tweaks and environment variables that cause LLMs to repeatedly say “Aha, I found it! You need to…” but then don’t solve the problem.

Basically the two sins I was committing was:

  • Trying to squeeze the nicest quant and biggest context I could into the VRAM pool, i.e if I had 85% full I was like “Oh I can get a bigger quant, I still have 10GB free! (Nope, doesn’t work that way.)

  • Trying to run 3 cards in splits and trying to tune those splits so that all cards would fill up at the same time. (Miracle it ever worked at all while I was trying that.)

So while before I was trying to run Qwen 3.5 35B A3B at Q8_XL, and trying to tune –tensor-split so that all those % used counters matched exactly, now I’m strictly at 1:1 and I’ve had enough runs to see that if I set to anything other than 1:1 we’re essentially guaranteed to crash within the first three turns.

— § —

What still doesn’t work?

Sadly, vllm. It should be possible to run vllm with tensor splits, which would theoretically give me better multi-context and a better /responses/ API, but there was a regression in the most recent versions that causes it to punt unless you can stand up P2P between the cards.

And, just as importantly, P2P between the cards. Two things on that point:

  1. I now understand that I have, fundamentally, the wrong CPU/mainboard architecture for this, because the Intel platforms at this price point only have 16 lanes to the CPU from PCI-e and only one slot runs direct; the rest of the x16 slots run through the chipset and are essentialy x4 under the hood. So for a while, I was considering swapping out the Z390 and Core i9-9900k for an AMD Threadripper setup, though I think I’ve backed off of that. Threadripper gives each slot dedicated lines to CPU. It also enables P2P between cards.

  2. Happily (and unhappily), before I pulled the trigger, I learned that the v620 / Radion Pro “Navi” cards were really for data center fractional gaming provisioning, and not machine learning workloads, and thus they actually lack the hardware for P2P anyway. Not the end of the world, especially when you consider the value of the price/performance, here—I was able to put together 64GB of VRAM with compute and memory bandwidth that’s like double the speed of a 6700XT, and all for like $500 in cash. That’s a tremendously good deal, even if it won’t reach the same performance level as true machine learning / inference hardware.

Note that there may still be some benefit to Threadripper, even without P2P, as the dedicated x16 lines to CPU for the two cards have far more bandwidth than the x4 lines shared amongst the entire chipset-attached PCI-e bus (i.e. almost everything in the system that isn’t the 6700XT). However:

  • I’m not sure exactly how much benefit there will be to making that round trip happen on a true, unshared x16 pipe vs. an x4 pipe, so it’s hard to measure value or ROI.

  • The cheapo Threadripper on the market (i.e. X399/TR4, last generation) is only PCI-e 3.0 which has half the bandwidth of PCI-e 4.0. So I’m not that inclined to shell out for PCI-e 3.0 for undetermined benefits, but I’m also not inclined to shell out for 4.0 at a much higher price for, still, undetermined benefits. So we wait.

— § —

So that’s the state of things. If I had it all to do again, what would I do differently?

  • Get on Threadripper at the last rebuild (when I moved from an i7-3770k to the i9-9900k and to the Z390 chipset). I was tempted, but I stuck with Intel for the faster single-thread interactive (web, photos, etc.). Who knew that a few months later LLMs would hit the mainstream? But in any case, the AMD platform is obviously better for local inference; Intel consumer is hobbled.

  • Consider a different family of retired server hardware (Insight or similar) on eBay. The AMD data center hardware is still the right move; it’s dirt cheap and readily available if you’re willing to do a bit of hacking. However, for inference, having more modern hardware with faster compute and higher bandwidth is offset by the ability to run P2P with tensor parallelism on slower, cheaper cards. So there’s no reason, if you’re doing multi-card, not to go for the slower, cheaper, older hardware, which, since you’re able to run P2P with parallelism, will end up at the same speed as a couple of v620s that can’t.

  • Not bother replacing the old RX480 with a 6700XT, since the RX480 could also have run an embedding model and it proved not to be practical or worthwhile to bother with adding the 6700XT to the pool. From the outside before this all started I was thinking, in part with help from LLMs, that it would be good to have three cards that were the same compute architecture (Navi / gfx103x) and the 6700XT with 12GB would add yet a few more GB to the pool. In practice, the LLMs were exactly wrong; there is basically no advantage to the 6700XT and adding it to the pool makes things either slower or less stable or both.

  • Not listening to LLMs so much or using them for search so much. My real unlock came when I started to Google search and skip past AI results. AI has a lot of opinions about AI, but they’re all wrong. Even when you ask it to do web search. Better just to hang out in the repos and on Github and read the interactions.

And finally, for anyone looking to run v620s on Linux for inference, my kernel command line is:

pci=realloc,earlydump amdgpu.gpu_recovery=1 amdgpu.noretry=1 amdgpu.ras_enable=0 amdgpu.mcbp=0 iommu=pt intel_iommu=on pci=big_root_window pcie_aspm=off amdgpu.runpm=0 pcie_port_pm=off amdttm.pages_limit=16777216 ttm.pages_limit=16777216 amdttm.page_pool_size=1048576 ttm.page_pool_size=1048576 amdgpu.gartsize=4096

Pair this with BIOS settings that enable addressing beyond 4GB and that enable BAR and VT-d/IOMMU and they’ll get seen. Crazy to remember that I spent the first day just trying to get the cards to (first off) post, and then after that, (next) be seen by the Linux kernel.

I’ve learned a lot. Not sure how transferrable it is, but it’s nice to be in a space where the smoke has cleared.

— § —

Bonus note:

I actually can run Qwen 3.5 122B A10B well on the two v620s at (say) Unsloth UD_IQ3 and I like its output a lot, better than Qwen 3.5 A3B at Q6_XL. So if you’re wanting to run a “big” model like that (at least, big for home office purposes), it’s totally doable. I get about 27 tokens/sec on inference, which is quite respectable. I have to do it with Vulkan, though, where I can push the memory use right to “full”; on ROCm we just don’t have enough space given that ideally we need to stay below 80-85% use for stability purposes, and I don’t want to go more compressed than Q3.

Thing is, Qwen 3.6 35B A3B at Q6_XL with ROCm delivers ~55 tokens/sec, no MTP. Twice as fast. It’s really, really hard to sit and be patient for 122B when 35B is twice as fast and still… acceptable. So that’s where I am now. But if you’re wanting to run 122B or similar biggish MoE, UD_IQ3 and 27 tokens/sec is pretty damned good.

Archives »

May 2026
April 2026
March 2026
February 2026
January 2026
December 2025
July 2025
May 2025
April 2025
February 2025
January 2025
December 2024
October 2024
September 2024
August 2024
July 2024
June 2024
May 2024
April 2024
March 2024
February 2024
January 2024
December 2023
November 2023
October 2023
September 2023
May 2023
April 2023
March 2023
January 2023
December 2022
November 2022
August 2022
June 2022
May 2022
April 2022
March 2022
January 2022
December 2021
November 2021
September 2021
April 2021
March 2021
February 2021
January 2021
December 2020
November 2020
October 2020
September 2020
August 2020
July 2020
June 2020
May 2020
April 2020
March 2020
February 2020
January 2020
December 2019
November 2019
October 2019
September 2019
August 2019
July 2019
May 2019
April 2019
March 2019
February 2019
January 2019
December 2018
November 2018
October 2018
September 2018
August 2018
July 2018
June 2018
May 2018
April 2018
March 2018
February 2018
January 2018
December 2017
November 2017
October 2017
September 2017
August 2017
July 2017
June 2017
May 2017
April 2017
March 2017
February 2017
January 2017
December 2016
November 2016
October 2016
September 2016
August 2016
July 2016
June 2016
May 2016
April 2016
March 2016
February 2016
January 2016
December 2015
June 2015
February 2015
January 2015
December 2014
October 2014
September 2014
August 2014
July 2014
June 2014
May 2014
April 2014
March 2014
February 2014
January 2014
December 2013
November 2013
September 2013
August 2013
July 2013
June 2013
May 2013
April 2013
March 2013
December 2012
November 2012
October 2012
August 2012
July 2012
June 2012
May 2012
March 2012
December 2011
October 2011
September 2011
August 2011
July 2011
June 2011
May 2011
April 2011
March 2011
February 2011
December 2010
November 2010
October 2010
September 2010
August 2010
July 2010
June 2010
May 2010
April 2010
March 2010
February 2010
January 2010
December 2009
November 2009
October 2009
September 2009
August 2009
July 2009
June 2009
May 2009
April 2009
March 2009
February 2009
January 2009
December 2008
November 2008
October 2008
September 2008
August 2008
July 2008
June 2008
May 2008
April 2008
March 2008
February 2008
January 2008
December 2007
November 2007
October 2007
September 2007
August 2007
July 2007
June 2007
May 2007
April 2007
March 2007
February 2007
January 2007
December 2006
November 2006
October 2006
September 2006
August 2006
July 2006
June 2006
May 2006
April 2006
March 2006
February 2006
January 2006
December 2005
November 2005
October 2005
September 2005
August 2005
July 2005
June 2005
May 2005
April 2005
March 2005
February 2005
January 2005
December 2004
August 2004
July 2004
June 2004
May 2004
April 2004
March 2004
February 2004
January 2004
December 2003
November 2003
October 2003
September 2003
August 2003
July 2003
June 2003
April 2003
March 2003
February 2003
January 2003
December 2002
November 2002
October 2002
September 2002
August 2002
May 2002
April 2002
March 2002
February 2002
January 2002
December 2001
November 2001
October 2001
September 2001
July 2001
June 2001
May 2001
April 2001
March 2001
February 2001
January 2001
December 2000
November 2000
October 2000
September 2000
August 2000
July 2000
June 2000
May 2000
April 2000
March 2000
February 2000
January 2000
December 1999
November 1999