• ImmersiveMatthew@sh.itjust.works
    2 days ago

    Agreed. I am no longer paying token fees, as I am running QWEN 3.6 27B MTP on my 4090 GPU, and for agentic coding it is as good and as fast as the frontier models.

    • tristynalxander@mander.xyz
      22 hours ago

      Same. I’m running Qwen3.6-35B-A3B-FP8 (Qwen3.6-35B-A3B-UD-IQ4_XS.gguf) via the turboquant fork of llama.cpp with a few tweaked memory settings, and I get about 40 tokens/second. None of that required special insight on my part: I just followed the instructions in a YouTube video I found via [email protected] and asked Claude to help me through the installation.

      AI has no economic moat. There’s nothing stopping anyone from running LLMs locally.

      • ImmersiveMatthew@sh.itjust.works
        22 hours ago

        I just updated my setup from LMStudio to llama.cpp with the new QWEN 3.6 27B MTP model and I am getting 80-112 tokens/second, 90 on average, which is just shocking to me. I am on a 4090 with a context window of 64k. I hardly use cloud AI anymore, as I rarely need more than 64k if I make sure my first prompt is written like a design document. Multiple prompts are not great, so I often just figure out where my initial prompt went wrong, adjust it, and try again in a fresh session. It is way faster this way too. It has really worked out well for me, as I am getting just as much done locally for free as I was with hundreds of dollars a month on cloud AI. I am still shocked and grateful it went this way.
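
To put the 80-112 tokens/second range in perspective, here is a rough back-of-the-envelope sketch. The three rates come from the comment above; the 2,000-token response length and the names `generation_seconds`, `RATES`, and `estimate_all` are hypothetical and only there for illustration. It ignores prompt processing time, so treat the numbers as a lower bound rather than a benchmark.

```python
# Rough estimate of how long a response takes to generate at a given speed.
# Rates are the ones quoted in the comment; the response length is an assumption.

RATES = {"low": 80.0, "average": 90.0, "high": 112.0}  # tokens/second
RESPONSE_TOKENS = 2000  # hypothetical response length, not from the thread


def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to emit num_tokens at a steady decode rate (prompt processing ignored)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


def estimate_all(num_tokens: int = RESPONSE_TOKENS) -> dict:
    """Generation time in seconds, rounded to 0.1 s, for each quoted rate."""
    return {
        label: round(generation_seconds(num_tokens, rate), 1)
        for label, rate in RATES.items()
    }


if __name__ == "__main__":
    for label, seconds in estimate_all().items():
        print(f"{label}: {seconds} s for {RESPONSE_TOKENS} tokens")
```

At these rates a response of that size would finish in roughly 18 to 25 seconds, which fits the impression that local generation feels about as fast as a cloud round trip.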

      • ImmersiveMatthew@sh.itjust.works
        22 hours ago

        I am using llama.cpp with QWEN 3.6 27B MTP and a 64k context window on a 4090. OpenCode talks to it, and OpenCode in turn talks to the Unity game engine via MCP. I am getting 80-112 tokens/second, about 90 on average, which is shocking to me, as it really does feel as fast as cloud AI (well, faster for me, as I am in Vietnam and round trips to US data centers really add up over a session). The only real issue is that you pretty much have to one-shot prompts, as follow-up prompts will easily go over the context window size. If I cannot one-shot a prompt, then I use cloud AI, but that is very rare for my use case: maybe 1 in 50 or so, and only when the task touches a lot of large scripts and scenes.
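
A minimal sketch of the kind of check that tells you whether a follow-up prompt would still fit, for anyone who wants to try the same workflow. Everything here is an assumption rather than part of the setup described above: it treats "64k" as 65,536 tokens, uses a crude 4-characters-per-token heuristic instead of the model's real tokenizer, and reserves a hypothetical 8,192 tokens for the model's output. The names `estimate_tokens` and `fits_in_context` are made up for illustration.

```python
# Crude context-budget check: will the conversation so far plus a new prompt
# still fit in the window? All constants below are assumptions, not measured.

CONTEXT_WINDOW_TOKENS = 64 * 1024  # assumes "64k" means 65,536 tokens
CHARS_PER_TOKEN = 4                # rough heuristic, not the real tokenizer
RESERVED_OUTPUT_TOKENS = 8192      # hypothetical headroom for the reply


def estimate_tokens(text: str) -> int:
    """Approximate token count by character length, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(
    texts: list,
    reserve_for_output: int = RESERVED_OUTPUT_TOKENS,
    window: int = CONTEXT_WINDOW_TOKENS,
) -> bool:
    """True if all texts plus the reserved output budget fit in the window."""
    used = sum(estimate_tokens(t) for t in texts)
    return used + reserve_for_output <= window


if __name__ == "__main__":
    design_doc_prompt = "x" * 40_000  # stand-in for a long first prompt
    prior_session = "y" * 240_000     # stand-in for an earlier prompt plus output
    print(fits_in_context([design_doc_prompt]))                 # one-shot prompt
    print(fits_in_context([prior_session, design_doc_prompt]))  # follow-up prompt
```

If the second check fails, that is the signal to rewrite the first prompt and start a fresh session instead of sending a follow-up. A real tokenizer would give tighter numbers than the character heuristic.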