✦The research question
For my bachelor thesis at Howest I set out to answer one question: how does on-device AI in a Flutter app compare to cloud-based AI when it comes to giving tourists better route and location recommendations?
The field was moving fast. On-device models promise privacy and low latency. Cloud models promise depth and breadth. The question was whether either approach, or a combination of both, could actually make a travel app meaningfully smarter.
✦What I built
The answer was a proof-of-concept Flutter iOS app with four AI backends running side by side.
- GPT-4o-mini: fast, cost-efficient, reliable for structured output
- Gemini 2.5 Flash: stronger reasoning, better at geographic detail
- Hybrid pipeline: GPT-4o-mini generates the initial recommendation and Gemini refines it, a two-stage chain designed to combine the strengths of both.
- Apple Intelligence: fully on-device, no external API calls, full privacy. Available as an optional fourth tier on compatible hardware.
Users can switch models mid-session and compare two responses to the same question side by side. Every AI answer that contains location data gets parsed automatically and plotted as pins on a Mapbox map, with a route calculated between stops.
✦Architecture
The app is a monorepo: Flutter (iOS) for the frontend, .NET 9 ASP.NET Core for the backend. The backend does all the heavy lifting: it holds API keys server-side, orchestrates every AI call, and streams responses back to the Flutter client via HTTP chunked transfer encoding.
Plain HTTP streaming was a deliberate choice over WebSockets. The communication is one-directional (server to client), so a persistent bidirectional connection adds overhead without benefit. Tokens appear on screen as they are generated, which matters more for perceived performance than the total response time.
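The client side of that stream can be sketched as follows. This is a minimal illustration in Python rather than the app's Dart code, and `decode_stream` and the simulated chunks are hypothetical names, not part of the project. It shows one detail that matters when rendering text chunk by chunk: a chunk boundary can fall in the middle of a multi-byte UTF-8 character, so decoding has to be incremental.

```python
import codecs
from typing import Iterable, Iterator


def decode_stream(chunks: Iterable[bytes]) -> Iterator[str]:
    """Yield text as chunks arrive, buffering any incomplete UTF-8 sequence."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:  # empty when the chunk only held part of a character
            yield text
    tail = decoder.decode(b"", final=True)  # raises if the stream ended mid-character
    if tail:
        yield tail


# Simulated response body, split inside the two-byte character "é".
body = "Visit Café Central".encode("utf-8")
simulated_chunks = [body[:10], body[10:]]

pieces = list(decode_stream(simulated_chunks))
```

Each yielded piece can be appended to the message on screen as soon as it arrives, which is what makes the response feel fast even when the total generation time is unchanged.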
Firebase handles authentication. Cloud Firestore stores every conversation with per-message metadata: which model was used, timestamps, and response status. That metadata became the primary data source for comparing model behavior after user tests.
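As an illustration of how such per-message metadata supports comparison, here is a small sketch in Python. The field names, status values, and model labels are assumptions for the example, not the actual Firestore schema, and the records are made-up placeholders rather than data from the user tests.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List


@dataclass(frozen=True)
class MessageMeta:
    model: str         # which backend answered (label is an assumption)
    sent_at: datetime  # timestamp of the message
    status: str        # "ok" or "error" (values are assumptions)


def error_rate_by_model(messages: List[MessageMeta]) -> Dict[str, float]:
    """Share of messages per model whose status is 'error'."""
    totals = Counter(m.model for m in messages)
    errors = Counter(m.model for m in messages if m.status == "error")
    return {model: errors[model] / totals[model] for model in totals}


# Placeholder records, only to show the shape of the calculation.
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
sample = [
    MessageMeta("model-a", now, "ok"),
    MessageMeta("model-a", now, "error"),
    MessageMeta("model-b", now, "ok"),
]
rates = error_rate_by_model(sample)
```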
Mapbox was chosen over Google Maps because its Flutter SDK had better support for custom styling and route rendering at the time of development.
✦The hard parts
Unstructured AI output. Getting models to consistently embed location data in a parseable format took more iterations than expected. OpenAI's Structured Outputs feature works, but it blocks streaming and Gemini has no equivalent. The solution: instruct each model to append a JSON block inside Markdown code fences at the end of its response, then extract it with a regex parser. The system prompt went through many revisions before the output was reliable enough.
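The extraction step can be sketched like this. It is a minimal Python version of the idea, not the parser used in the app: the function name `split_response`, the `json` fence label, and the `name`/`lat`/`lng` payload shape are assumptions, and the sketch assumes exactly one fenced block at the very end of the response.

```python
import json
import re
from typing import Optional, Tuple

# Built at runtime so this example does not itself contain a literal fence.
FENCE = "`" * 3

# A fenced block labelled "json" that closes at the very end of the text.
BLOCK_RE = re.compile(FENCE + r"json\s*(.*?)\s*" + FENCE + r"\s*$", re.DOTALL)


def split_response(text: str) -> Tuple[str, Optional[list]]:
    """Return (prose, locations); locations is None if no valid block is found."""
    match = BLOCK_RE.search(text)
    if match is None:
        return text.strip(), None
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        # Malformed block: keep the full text and plot nothing.
        return text.strip(), None
    if not isinstance(payload, list):
        return text.strip(), None
    return text[: match.start()].strip(), payload


reply = (
    "Start at the first stop.\n\n"
    + FENCE + "json\n"
    + '[{"name": "Stop A", "lat": 51.0, "lng": 3.0}]\n'
    + FENCE + "\n"
)
prose, locations = split_response(reply)
```

Falling back to `None` instead of raising keeps the chat usable when a model ignores the format: the answer is still shown, it just produces no pins.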
API key security. Early prototypes stored API keys directly in the Flutter app via a .env file. Flutter compiles to native binaries that can be reverse-engineered, so those keys were effectively exposed. All AI calls were moved to the backend, and every request from the Flutter app is now authenticated with a Firebase ID token (a JWT). The fix also unlocked a side benefit: system prompts can now be updated server-side without shipping a new app version.
Apple Intelligence and the DMA. Apple is designated as a gatekeeper under the Digital Markets Act, and it responded by delaying the rollout of Apple Intelligence across the EU. On-device testing was only possible in the iOS Simulator on macOS, not on a physical device. Battery consumption, real haptic feedback, and genuine mobile usage patterns could not be measured. That limits the external validity of the on-device findings, but it is also a useful finding in itself: platform-specific AI features can disappear from a target market overnight due to regulation.
✦Feedback from the field
Two Flutter developers reviewed the architecture as part of the thesis.
Milan (WOW WONEN) confirmed the factory pattern as a sound starting point but flagged five things to improve before production: evolve toward dependency injection and Riverpod providers, keep fallback output formats consistent, use runtime capability checks instead of version checks, centralise streaming state management, and write explicit mocks for timeout and API failure scenarios.
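Two of those points, the factory pattern and runtime capability checks, can be combined in a short sketch. This is a hypothetical Python illustration, not the app's Dart code: `BackendFactory`, `is_available`, and the backend names are invented for the example. Each backend registers a probe that is called at creation time, so the decision depends on what the device can do right now rather than on an OS version number.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Backend = Callable[[str], str]  # takes a prompt, returns an answer


@dataclass
class BackendEntry:
    build: Callable[[], Backend]
    is_available: Callable[[], bool]


class BackendFactory:
    def __init__(self, fallback: str) -> None:
        self._entries: Dict[str, BackendEntry] = {}
        self._fallback = fallback

    def register(
        self,
        name: str,
        build: Callable[[], Backend],
        is_available: Callable[[], bool] = lambda: True,
    ) -> None:
        self._entries[name] = BackendEntry(build, is_available)

    def create(self, name: str) -> Backend:
        """Build the requested backend, or the fallback if its probe fails."""
        entry = self._entries.get(name)
        if entry is None or not entry.is_available():
            entry = self._entries[self._fallback]
        return entry.build()


factory = BackendFactory(fallback="cloud")
factory.register("cloud", lambda: (lambda prompt: "cloud: " + prompt))
# The probe is a stand-in for a real capability check on the device.
factory.register(
    "on-device",
    lambda: (lambda prompt: "device: " + prompt),
    is_available=lambda: False,
)

answer = factory.create("on-device")("hello")
```

Because every backend has the same call signature, the fallback returns output in the same shape as the backend it replaces, which also addresses the point about consistent fallback formats.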
Shannon (Goomyx) added two observations the architecture had missed: the cost of three sequential API calls stacks up fast at scale, and a sequential pipeline carries a cascade risk. If the output of model A is malformed or manipulated, it becomes the input of model B regardless of how well model B performs on its own. Output validation between pipeline steps is the fix.
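A minimal sketch of validation between pipeline steps, in Python with stub functions in place of the real model calls. The payload shape (`name`, `lat`, `lng`) and all names here are assumptions for illustration. The point is that the second stage never sees output that failed the check.

```python
from typing import Any, Callable, List


class StageOutputError(ValueError):
    """Raised when a pipeline stage returns output that fails validation."""


def _is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_locations(payload: Any) -> List[dict]:
    if not isinstance(payload, list) or not payload:
        raise StageOutputError("expected a non-empty list of locations")
    for item in payload:
        if not isinstance(item, dict):
            raise StageOutputError("each location must be an object")
        name, lat, lng = item.get("name"), item.get("lat"), item.get("lng")
        if not isinstance(name, str) or not name.strip():
            raise StageOutputError("location name is missing")
        if not _is_number(lat) or not -90 <= lat <= 90:
            raise StageOutputError("latitude is out of range")
        if not _is_number(lng) or not -180 <= lng <= 180:
            raise StageOutputError("longitude is out of range")
    return payload


def run_pipeline(
    question: str,
    generate: Callable[[str], Any],
    refine: Callable[[List[dict]], Any],
) -> List[dict]:
    draft = validate_locations(generate(question))  # gate before stage two
    return validate_locations(refine(draft))        # gate before the map


good = [{"name": "Stop A", "lat": 51.0, "lng": 3.0}]
result = run_pipeline("Where to?", lambda q: good, lambda locs: locs)
```

A caller can catch `StageOutputError` and fall back to a single-model answer instead of passing a bad draft down the chain.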
✦What I took away
This project taught me things that go beyond technical skills. The gap between a working prototype and a production-ready application is not just about code quality. It is about fallback design, transparency for the user, architecture that survives regulation changes, and the kind of feedback you only get by showing your work to people who build real products every day.
The reflection conversations with Milan and Shannon shaped how I think about architecture more than any tutorial did. Having experienced developers challenge my decisions was exactly the kind of feedback I was hoping for, and I am a better developer for it.
