We noticed that mobile agents generally rely on large proprietary models that are expensive to run. Our goal was to fine-tune a model that we can run locally to cut that cost.

Challenges We Ran Into

Challenge: Integrating the ERNIE model with Unsloth’s optimization pipeline was tricky. Unsloth is built to work with models that are natively part of the Hugging Face Transformers library, but ERNIE wasn’t yet included in the latest stable release. This meant we had to rely on a custom ERNIE implementation, along with a custom trainer and data collator.

Impact: This setup caused practical problems during training, such as inefficient GPU memory usage and difficulty fitting the model in memory. Essentially, we couldn’t take full advantage of Unsloth’s optimizations because they depend on standard Transformers implementations.

Solution: We dug into the Transformers repository and found that ERNIE support already existed in the main branch, even though it hadn’t been officially released. By installing Transformers from the main branch, we could use the native ERNIE implementation. This fixed the memory issues and let us take full advantage of Unsloth’s optimizations, eliminating the need for the custom, partially compatible setup.

  • The longest examples in the dataset had contexts too long to fit in GPU memory.

Solution: We trimmed the dataset by removing the 10% of examples with the longest context.
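
The trimming step can be sketched roughly as follows. This is a minimal illustration and not our exact script: the function name drop_longest and the length_fn parameter are hypothetical, and in practice the length would more likely be a token count from the model's tokenizer than a character count.

```python
def drop_longest(examples, length_fn=len, drop_fraction=0.10):
    """Return `examples` without the longest `drop_fraction`, keeping original order.

    `length_fn` is a stand-in for whatever measures context length
    (assumed here; a tokenizer-based token count is the usual choice).
    """
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")

    n_drop = int(len(examples) * drop_fraction)
    if n_drop == 0:
        return list(examples)

    # Rank indices from longest to shortest, then mark the top n_drop for removal.
    by_length = sorted(
        range(len(examples)),
        key=lambda i: length_fn(examples[i]),
        reverse=True,
    )
    dropped = set(by_length[:n_drop])

    # Rebuild the list in its original order, skipping the dropped indices.
    return [ex for i, ex in enumerate(examples) if i not in dropped]
```

Cutting by rank rather than by a fixed length threshold keeps the amount of discarded data predictable, at the cost of not guaranteeing any particular maximum length.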

  • We had to repeat dataset collection several times because of human errors such as wrong formatting and missing screenshots.
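
A small validation pass run right after each collection session is one way to catch these errors early. The sketch below is illustrative only: the record layout (a "goal" field plus a "steps" list whose items carry "screenshot" and "action") is an assumed, hypothetical schema, not the actual format of our trajectories.

```python
from pathlib import Path

# Hypothetical schema: top-level keys every trajectory record is assumed to have.
REQUIRED_KEYS = ("goal", "steps")


def validate_trajectory(record, root):
    """Return a list of problems found in one trajectory record (empty if none).

    `root` is the directory that screenshot paths are relative to.
    """
    problems = []
    for key in REQUIRED_KEYS:
        if key not in record:
            problems.append(f"missing key: {key}")

    steps = record.get("steps", [])
    if not isinstance(steps, list):
        problems.append("steps is not a list")
        return problems

    for i, step in enumerate(steps):
        if not isinstance(step, dict):
            problems.append(f"step {i}: not a mapping")
            continue
        shot = step.get("screenshot")
        if not shot:
            problems.append(f"step {i}: no screenshot path")
        elif not (Path(root) / shot).is_file():
            problems.append(f"step {i}: screenshot not found: {shot}")
        if "action" not in step:
            problems.append(f"step {i}: missing action")
    return problems
```

Returning a list of problems instead of raising on the first one makes it possible to report everything wrong with a session in a single pass.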

How We Built It

  • We fine-tuned Baidu's ERNIE-4.5-VL-28B-A3B-Thinking model to create a specialized variant optimized for mobile automation.
  • We used the Droidrun framework for autonomous device interaction.
  • We collected a dataset of training trajectories using Google Research's android_world environment with Gemini 3 Pro.
  • We ran supervised fine-tuning (SFT) on those trajectories.

Built With

  • ernie
  • unsloth