Generative AI isn’t just for the cloud. Starting with Android 16, you can invoke Gemini Nano running on AICore via the ML Kit GenAI API, completing summarization, proofreading, or image captioning entirely on-device in a few hundred milliseconds. In the OSS world, approaches have emerged for embedding quantized Llama 3 through ggml/llama.cpp via JNI, as well as for converting models to the universal runtime LiteRT (formerly TensorFlow Lite) and running inference with NNAPI/GPU.

In this session, we integrate “offline AI chat,” “real-time text summarization,” and “real-time proofreading” into a single Compose app and benchmark three on-device LLMs with the same prompt on the same device. We compare them across five aspects:

1. Setup effort and build steps
2. Model size and RAM usage
3. Inference latency
4. Battery consumption
5. Licensing and maintenance

We walk through each implementation, looking at the power efficiency and high-level-API convenience of Gemini Nano, the flexibility and pitfalls of llama.cpp, and the versatility of LiteRT along with the challenges of quantization tuning.

On-device LLMs that respond instantly offline, keep personal data local, and cut operating costs are poised for rapid growth. By the end of this session you’ll have a concrete picture of how to build practical Android apps powered by on-device LLMs, and the real-world use cases and implementation guides will spark ideas for new applications or evolutions of existing ones.

(Translated by the DroidKaigi Committee)
daasuu Software Engineer
- Developers who want to build on-device LLMs on Android from scratch
- Anyone interested in on-device LLMs
- Those concerned about the cost or privacy issues of cloud-based LLMs
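For readers who want to try the llama.cpp route this abstract mentions ahead of the session, here is a minimal sketch of the Kotlin side of a JNI bridge. Everything below is an assumption for illustration: the names (LlamaBridge, nativeLoadModel, nativeGenerate, nativeFree), the library name llama_bridge, and the model path are hypothetical, and the corresponding C++ wrapper around llama.cpp must be written and built with the NDK/CMake separately.

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext

// Hypothetical Kotlin-side JNI bridge to a native wrapper around llama.cpp.
// The C++ side that actually calls llama.cpp's API is not shown and must be
// implemented and compiled into libllama_bridge.so with the NDK.
object LlamaBridge {
    init {
        System.loadLibrary("llama_bridge") // loads libllama_bridge.so
    }

    // Returns an opaque native handle to the loaded GGUF model, or 0 on failure.
    external fun nativeLoadModel(modelPath: String, nThreads: Int): Long

    // Runs generation for the given prompt and returns the decoded text.
    external fun nativeGenerate(handle: Long, prompt: String, maxTokens: Int): String

    // Releases the native model and context.
    external fun nativeFree(handle: Long)
}

// Usage sketch: inference is CPU-heavy, so keep it off the main thread.
suspend fun generateOffline(prompt: String): String = withContext(Dispatchers.Default) {
    val handle = LlamaBridge.nativeLoadModel(
        modelPath = "/data/data/your.app/files/llama3-q4_k_m.gguf", // hypothetical path
        nThreads = 4
    )
    check(handle != 0L) { "Failed to load model" }
    try {
        LlamaBridge.nativeGenerate(handle, prompt, maxTokens = 256)
    } finally {
        LlamaBridge.nativeFree(handle)
    }
}
```

In a real app the model would be loaded once (for example, in a repository scoped to the app) rather than per call, and a chat template for the instruction-tuned model would be applied to the prompt before generation.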
Property-Based Testing for UI: Capturing Edge Cases via LLM-Generated Property Definitions
Tetta Noguchi
#Maintenance / Operations / Testing
Android Librarian's Guide: Building Robust Libraries and SDKs
Jaewoong (skydoves)
#Other
Getting Started with Material3 Expressive
chuka
#UI・UX・Design