I wanted to completely replace a model's identity. Not fine-tune it. Not add a personality layer on top. I wanted to take a Qwen model and turn it into something that has zero behavioral connection to Qwen. Different reasoning, different personality, different everything. Same architecture, completely different mind.
Full fine-tuning can do this. It also costs $10,000+ and needs a multi-GPU cluster running for weeks. I had one GPU with 24GB of VRAM and no budget. So I found another way.
The trick nobody tried
LoRA is designed to make small, reversible changes to a model. You train a tiny adapter, maybe 0.1% of the total parameters, and it nudges the model's behavior in a direction. Everyone uses it for surface-level stuff. "Make it talk like a pirate." "Make it better at coding." The base model stays intact underneath.
What happens if you merge that adapter permanently into the base weights, then train a fresh one on the new base, merge again, and keep going?
Each cycle makes only a small, low-rank change to the weights. But after 50 cycles, those changes add up. After 100, the original model is gone. The architecture is identical. The tokenizer works the same. Attention mechanisms haven't changed. But the weights that determine how the model thinks, what it knows, how it responds? Completely rewritten.
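The arithmetic behind that claim fits in a few lines. Each merge adds a rank-r update, W ← W + (α/r)·BA, and while any single update is rank-limited, a sum of many low-rank updates isn't. A toy sketch with random matrices standing in for trained adapters (all the dimensions here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 4, 8           # hypothetical hidden size, LoRA rank, scaling

W = rng.standard_normal((d, d))  # stand-in for one base weight matrix
W0 = W.copy()

drift = []
for cycle in range(100):
    # each "adapter" is a fresh low-rank update (random here, learned in practice)
    B = rng.standard_normal((d, r)) * 0.02
    A = rng.standard_normal((r, d)) * 0.02
    W = W + (alpha / r) * (B @ A)        # merge permanently, discard the adapter
    drift.append(np.linalg.norm(W - W0) / np.linalg.norm(W0))

print(W.shape == W0.shape)               # True: architecture unchanged
print(np.linalg.matrix_rank(W - W0) > r) # True: cumulative change is not rank-limited
print(drift[-1] > drift[0])              # True: distance from the original keeps growing
```

The point of the sketch is the shape of the thing, not the numbers: every individual update is tiny and rank-4, but the accumulated difference from the original weights is full-rank and keeps growing.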
I called it body snatching because that's what it is. The architecture stays. The learned behavior gets replaced.
How it actually works
Each cycle has three steps. Train a LoRA adapter on your data. Merge it into the base weights at full BF16 precision. Throw away the adapter and start a fresh one on the new base.
That last part is important. This is not LoRA stacking. After each merge, the adapter dissolves into the weights and ceases to exist. The next cycle trains from scratch on a base that's slightly different from last time. There's no compounding formula. No adapter pile-up. After 100 cycles you have one model with rewritten weights, not 100 adapters stacked on top of each other.
The precision thing matters too. During training I use 4-bit quantization because that's what fits in 24GB. But merging happens in BF16 on CPU. If you merge in 4-bit, quantization errors accumulate across cycles and your model degrades. Merge clean, train cheap. That asymmetry is what makes the whole thing work on consumer hardware.
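A toy illustration of why the merge precision matters. Here a crude uniform rounding step simulates re-quantizing the weights after every merge (the real 4-bit format, NF4 in bitsandbytes, is smarter than this, so treat the numbers as illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)

def fake_quant(x, bits=4):
    # crude uniform quantizer, a stand-in for real 4-bit formats (not NF4)
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

d, r, alpha, cycles = 64, 4, 8, 50
W_clean = rng.standard_normal((d, d)).astype(np.float32)
W_lossy = W_clean.copy()

for _ in range(cycles):
    B = rng.standard_normal((d, r)).astype(np.float32) * 0.02
    A = rng.standard_normal((r, d)).astype(np.float32) * 0.02
    update = (alpha / r) * (B @ A)
    W_clean = W_clean + update              # merge at full precision
    W_lossy = fake_quant(W_lossy + update)  # merge, then re-quantize every cycle

# the quantized path diverges from the clean path as rounding error compounds;
# worse, per-cycle updates smaller than the grid spacing get rounded away entirely
err = np.linalg.norm(W_lossy - W_clean) / np.linalg.norm(W_clean)
print(f"relative error after {cycles} lossy merges: {err:.3f}")
```

Two failure modes show up even in this toy: rounding error compounds across cycles, and small per-cycle updates can vanish entirely because they're below the quantization grid's resolution. Both argue for the asymmetry in the post: quantize for training, but always merge at full precision.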
The dataset strategy
You want catastrophic forgetting to hit the base model's identity, not yours. So every training batch is 50% new examples and 50% historical ones from earlier cycles. The new data pushes the model further from its original behavior. The historical data makes sure it doesn't forget what you already taught it.
Without the historical mix, the model forgets your earlier training as aggressively as it forgets the base. You end up with something that only remembers the last few cycles. With the mix, forgetting is directional. It forgets Qwen. It remembers you.
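The batching logic described above can be sketched in a few lines. The function name and data shapes here are hypothetical, not from the actual implementation:

```python
import random

def mixed_batch(new_data, replay_buffer, batch_size=8, rng=random.Random(0)):
    """Build one training batch: half fresh examples, half replayed history."""
    half = batch_size // 2
    batch = rng.sample(new_data, half)
    # before cycle 1 the replay buffer is empty; fall back to new data
    pool = replay_buffer if replay_buffer else new_data
    batch += rng.choices(pool, k=batch_size - half)
    rng.shuffle(batch)
    return batch

# after each cycle, fold that cycle's data into the replay buffer
replay = []
for cycle in range(3):
    cycle_data = [f"cycle{cycle}_ex{i}" for i in range(16)]
    batch = mixed_batch(cycle_data, replay)
    replay.extend(cycle_data)

print(len(batch))                                    # 8
print(sum(ex.startswith("cycle2") for ex in batch))  # 4 fresh examples
```

The replay half samples uniformly across all earlier cycles, so no single cycle's data dominates the historical side of the batch.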
What I observed
Around cycle 25, the model starts feeling different. The default responses change. The phrasing shifts. By cycle 50, it's fundamentally not the same model anymore. By 100, if you prompted it the same way you'd prompt base Qwen, you'd get completely different outputs. The original personality is effectively gone.
The model stops identifying as Qwen around cycle 30 to 50. Not because I told it to stop. Because the weights that encoded that identity got overwritten by weights that encode something else.
What this costs
Full fine-tuning for complete identity replacement: 4 to 8 A100 GPUs, weeks of training, $10,000 or more. Progressive LoRA Merging: one 24GB GPU, a few days, and somewhere between $100 and $500 depending on whether you're paying for electricity or renting the compute.
Same result. The model at the end is yours, not a thin wrapper around someone else's creation. The difference is who can afford to do it. Full fine-tuning is for companies with budgets. PLM is for anyone with a GPU and patience.
Why this matters
Right now, every "custom" model is really just a base model with a small personality adjustment on top. Strip away the LoRA and you're back to the original. The base model provider controls the foundation. You're renting space in their building.
Progressive LoRA Merging changes that. After enough cycles, the base provider's contribution approaches zero. What's left is your data, your reasoning patterns, your identity. You're not fine-tuning someone else's model anymore. You're growing a new one inside its skeleton.
The full implementation and paper are on GitHub. MIT licensed. Do whatever you want with it.