
We’re constantly testing new models. Fast ones. Cheap ones. Reliable ones. Ones that promise the moon but quietly forget how to do basic maths. It’s the nature of working with LLMs—what’s “best” today might be outdated next week.
But here’s the catch. Playing around with AI isn’t just about finding the right model. It’s about doing it without compromising everything else: your data, your stack, your security posture, your peace of mind. So, we’ve had to figure out a system that lets us experiment at speed without the chaos.
Our top picks
Let’s start with OpenRouter.
OpenRouter is our playground. It gives us access to a heap of different LLMs through one tidy little API. Instead of setting up 15 accounts across 15 providers and burning time on authentication madness, we load in credits, swap out models, and move on. Super useful for n8n workflows where you want to see how Claude handles a task compared to GPT-4, or test the latest open model without rewriting half your integration. And since we control all the flows and don’t pipe sensitive data into these tests, we keep things clean. Fast to test. Safe to run.
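To give a feel for what that looks like outside n8n, here's a minimal sketch against OpenRouter's OpenAI-compatible endpoint. The model slugs and the task are illustrative, not lifted from our actual flows.

```typescript
import OpenAI from "openai";

// One client, one key, many models behind it.
const client = new OpenAI({
  baseURL: "https://openrouter.ai/api/v1",
  apiKey: process.env.OPENROUTER_API_KEY,
});

// Same prompt, different model: swap the slug, compare the output.
async function runOnModel(model: string, task: string): Promise<string> {
  const completion = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: task }],
  });
  return completion.choices[0].message.content ?? "";
}

async function compare() {
  const task = "Summarise this changelog in three bullet points: ...";
  const [claude, gpt] = await Promise.all([
    runOnModel("anthropic/claude-3.5-sonnet", task),
    runOnModel("openai/gpt-4o", task),
  ]);
  console.log({ claude, gpt });
}
```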
Now, if we need performance plus proper infrastructure backing, we jump to GCP. Gemini can be a weird one. Gemini 2.0 Flash is fast. Super fast. And the context window is massive, so we can dump in loads of structured data from places like Supabase and get decent output. We’re using that combo when we want to keep everything server-side but also need the model to “see” a lot in one go. That’s the magic trick: speed plus context, without making a model choke.
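Roughly, the pattern looks like this: pull a big slab of rows out of Supabase, serialise it, and hand the whole thing to Flash in one request. The table name, prompt, and direct API key are placeholders; in practice the call runs inside our GCP project rather than a sketch like this.

```typescript
import { createClient } from "@supabase/supabase-js";
import { GoogleGenerativeAI } from "@google/generative-ai";

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_SERVICE_KEY!);
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);

async function summariseOrders(): Promise<string> {
  // Pull a large chunk of structured data server-side; nothing leaves our infra
  // except the prompt we deliberately construct. "orders" is a placeholder table.
  const { data: orders, error } = await supabase
    .from("orders")
    .select("id, status, total, created_at")
    .limit(2000);
  if (error) throw error;

  // Flash's large context window means the whole set can go in one request.
  const model = genAI.getGenerativeModel({ model: "gemini-2.0-flash" });
  const result = await model.generateContent(
    `Spot anomalies in these orders and summarise them:\n${JSON.stringify(orders)}`
  );
  return result.response.text();
}
```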
Then there’s Gemini 2.5 Pro, which we’re watching closely. Once it stabilises for production, we’ll probably move more complex reasoning work over there. Right now we’re still running those through o3-mini—because it’s cheap, it reasons decently, and we’ve got enough control to shape what it does.
Azure’s our other heavyweight. GPT-4o-mini is a workhorse. Not the flashiest, but way more reliable than most mid-range models, and not too expensive to scale. It’s what we lean on when something needs to work more than it needs to dazzle. Stuff like structured data classification, light extraction, low-risk tasks that run every day. o3-mini shows up again here too. It’s got just enough reasoning ability to make it worthwhile, and the price makes it attractive for recurring flows where you don’t want to burn GPT-4 tokens every five seconds. Good enough for parsing invoices, good enough for email drafts, and when it screws up—it does so predictably, which we can work around.
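For the classification side, here's a hedged sketch of the shape of those calls on Azure OpenAI. The endpoint, API version, deployment name, and categories are stand-ins, not our production config.

```typescript
import { AzureOpenAI } from "openai";

// Endpoint, API version, and deployment name are placeholders.
const client = new AzureOpenAI({
  endpoint: process.env.AZURE_OPENAI_ENDPOINT,
  apiKey: process.env.AZURE_OPENAI_API_KEY,
  apiVersion: "2024-06-01",
});

async function classifyTicket(text: string): Promise<string> {
  const completion = await client.chat.completions.create({
    model: "gpt-4o-mini", // on Azure this maps to the deployment name
    messages: [
      {
        role: "system",
        content:
          "Classify the message into exactly one of: billing, technical, sales, other. Reply with the label only.",
      },
      { role: "user", content: text },
    ],
    temperature: 0,
  });
  // A constrained label set keeps failures predictable and easy to work around.
  return completion.choices[0].message.content?.trim() ?? "other";
}
```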
Embeddings? We use text-embedding-3-small. It’s fast and good enough for most vector search jobs. We embed queries and results, store everything in Supabase (or Postgres, depending on the stack), and handle retrieval in-house. This bit’s non-negotiable: we don’t send our data to third-party vector databases. If something needs to live in a vector index, it lives inside our own infra. That way we know exactly what’s in it, how it’s stored, and what’s hitting it.
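A rough sketch of that loop, assuming pgvector inside our own Postgres; the table and column names are made up for illustration.

```typescript
import OpenAI from "openai";
import { Pool } from "pg";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Table and column names ("documents", "content", "embedding") are illustrative.
async function searchDocuments(query: string, limit = 5) {
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  });
  const vector = `[${res.data[0].embedding.join(",")}]`; // pgvector literal format

  // Cosine-distance search runs inside our own Postgres; nothing goes to a
  // third-party vector database.
  const { rows } = await pool.query(
    `SELECT id, content, embedding <=> $1::vector AS distance
       FROM documents
      ORDER BY embedding <=> $1::vector
      LIMIT $2`,
    [vector, limit]
  );
  return rows;
}
```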
There’s also Replicate. That’s our image generator of choice right now. It runs as part of our internal blog workflow, mostly spitting out illustrations for documentation or internal comms. No sensitive data goes in. We treat it like a creative add-on, not a core system. Think fun, not function. But still—run it clean, run it controlled.
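If you're curious, the call itself is about this simple. The model identifier and prompt are illustrative, not our exact setup.

```typescript
import Replicate from "replicate";

const replicate = new Replicate({ auth: process.env.REPLICATE_API_TOKEN });

// Model identifier and prompt are illustrative; nothing sensitive goes in.
async function generateIllustration(prompt: string) {
  const output = await replicate.run("black-forest-labs/flux-schnell", {
    input: { prompt },
  });
  return output; // usually one or more image URLs we download and store ourselves
}
```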
And the glue that holds this all together? n8n. Every single call to a model goes through our automations. Which means we’re in the loop. Always. We control retries, failures, sanitisation, audit logs, everything. Users don’t hit LLMs directly. They hit our flows, which hit the models on our terms. That’s not just about security, though it’s definitely about that—it’s also about maintainability. Want to switch models? Fine. Change one node. Done. Want to introduce a new one just for a small task? Fork the flow. Test it. Keep it isolated. Safe experiments at scale.
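This isn't our n8n workflow JSON, but here's a TypeScript sketch of the pattern it buys us: the model choice lives in exactly one place per flow, so swapping it really is a one-line change. Flow names and model slugs are illustrative, and sensitive flows run through Azure or GCP rather than OpenRouter.

```typescript
import OpenAI from "openai";

type FlowName = "changelog-summary" | "draft-blurb" | "experimental-summary";

// In n8n this is "the one node you change"; here it's a lookup table. Slugs illustrative.
const MODEL_FOR_FLOW: Record<FlowName, string> = {
  "changelog-summary": "openai/gpt-4o-mini",
  "draft-blurb": "openai/o3-mini",
  "experimental-summary": "anthropic/claude-3.5-sonnet",
};

const client = new OpenAI({
  baseURL: "https://openrouter.ai/api/v1",
  apiKey: process.env.OPENROUTER_API_KEY,
});

async function runFlow(flow: FlowName, sanitisedInput: string): Promise<string> {
  const completion = await client.chat.completions.create({
    model: MODEL_FOR_FLOW[flow],
    messages: [{ role: "user", content: sanitisedInput }],
  });
  // In the real flows, retries, sanitisation, and audit logging sit around this call.
  return completion.choices[0].message.content ?? "";
}
```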
What we’ve built isn’t flashy. It’s not bleeding edge. But it works. It lets us play with the latest and greatest while keeping everything tight and under control. No mystery API calls. No random vendors holding our data. No waking up to a Slack message that says, “Hey, did we just send 500 invoices to a public image generator?”
We experiment. A lot. But we do it on rails. We’re not scared of new models—we just refuse to test them recklessly. That’s the Sync Stream way: fast, flexible, locked down where it counts.