Real-time LLM response streaming with Server-Sent Events for responsive apps

The problem: responses that freeze the interface

Ever asked an AI for a response and stared at a frozen screen for 30 seconds? That's the classic result of a traditional request/response HTTP call to an LLM. The browser or app sits waiting for the entire response, and nothing appears until it arrives complete.

If a user is waiting and the first word takes 5 seconds to appear, they think the app crashed. It's a terrible experience. The solution is simple: deliver the text as it's generated, word by word, in real time.

That's where Server-Sent Events (SSE) comes in as the hero.

Why SSE instead of WebSocket?

First, let me be clear: WebSocket works too. But for the specific case of a server continuously pushing data to the client (with little or no traffic in the other direction), SSE is lighter and is built into the browser.

SSE runs over plain HTTP and needs no extra library on the frontend. The connection stays open, and the server sends events whenever it wants. When the response finishes, the server closes the connection. Simple.

WebSocket is more powerful if you need heavy two-way communication (real-time chat with lots of rapid exchanges). For an AI generating a response, it is overkill; SSE is lighter and easier to debug.

Implementing on the backend: Python with FastAPI

I'll show you a working example. I'll use FastAPI because it's quick to write and its StreamingResponse can serve an SSE stream without extra packages.

First, you need a function that calls the LLM (I'll use OpenAI here, but it works for Claude, Ollama, etc.) and iterates over the response:

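import json
import os

# Pre-1.0 SDK interface (pip install "openai<1"); later releases removed openai.ChatCompletion
import openai
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
# Read the key from the environment instead of hard-coding it
openai.api_key = os.environ["OPENAI_API_KEY"]

# EventSource can only issue GET requests, so the route must be GET
@app.get("/stream-response")
async def stream_response(user_message: str):
    # Plain (sync) generator: StreamingResponse iterates it in a worker thread,
    # so the blocking OpenAI call does not stall the event loop
    def generate():
        response = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": user_message}],
            stream=True,
        )

        for chunk in response:
            delta = chunk["choices"][0].get("delta", {})
            if "content" in delta:
                # Format as SSE: one "data:" line followed by a blank line
                yield f"data: {json.dumps({'text': delta['content']})}\n\n"

    return StreamingResponse(
        generate(),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache"},
    )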

The important detail: media_type="text/event-stream" sets the Content-Type header that tells the browser this is an SSE stream. Each event must follow the format data: {json}\n\n. The blank line at the end (two line breaks) marks the end of the event; without it the browser keeps buffering and never fires onmessage.
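
To make the wire format concrete, here is a minimal sketch of a helper that serializes events and a simplified parser that reads them back. The names format_sse and parse_sse are hypothetical helpers, not part of FastAPI or any SSE library, and the parser only handles single-line data fields.

```python
import json
from typing import Iterator, Optional


def format_sse(payload: dict, event: Optional[str] = None) -> str:
    """Serialize one event: optional 'event:' line, one 'data:' line, blank line."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # json.dumps escapes newlines, so the payload always fits on one data line
    lines.append(f"data: {json.dumps(payload)}")
    return "\n".join(lines) + "\n\n"


def parse_sse(stream: str) -> Iterator[dict]:
    """Yield the JSON payload of each event in a raw SSE stream.

    Simplified: single-line data fields only; comment lines (": ...") are skipped.
    """
    for block in stream.split("\n\n"):
        for line in block.split("\n"):
            if line.startswith("data:"):
                yield json.loads(line[len("data:"):].strip())


# Two text events with a comment line in between
raw = format_sse({"text": "Hel"}) + ": heartbeat\n\n" + format_sse({"text": "lo"})
text = "".join(event["text"] for event in parse_sse(raw))
print(text)  # Hello
```

Encoding the payload as JSON is what keeps line breaks in the model output from being mistaken for the end of an event.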

Receiving on the frontend: plain JavaScript

Now the browser side. No massive npm install needed:

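const button = document.getElementById("send-button");
const output = document.getElementById("response-output");
const inputField = document.getElementById("user-input");

let activeSource = null;

button.addEventListener("click", () => {
    const userMessage = inputField.value;
    output.textContent = ""; // Clear previous response

    // Close a stream that is still running from an earlier click
    if (activeSource) activeSource.close();

    const eventSource = new EventSource(
        `/stream-response?user_message=${encodeURIComponent(userMessage)}`
    );
    activeSource = eventSource;

    eventSource.onmessage = (event) => {
        const data = JSON.parse(event.data);
        // textContent, not innerHTML: model output must not be parsed as HTML
        output.textContent += data.text;
    };

    // onerror also fires when the server closes the stream; without close()
    // the browser would reconnect and send the request again
    eventSource.onerror = () => {
        eventSource.close();
        output.textContent += "\n[Response completed]";
    };
});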

See? The browser listens to the stream with EventSource, and each event triggers onmessage. As the AI generates the response, the text appears on screen live. No freezing.

Gotchas and real-world pitfalls

Issue 1: CORS

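If the frontend is on a different origin (e.g., localhost:3000 calling localhost:8000), the browser will block the EventSource request with a CORS error. Solution: configure CORS in FastAPI:

from fastapi.middleware.cors import CORSMiddleware

app.add_middleware(
    CORSMiddleware,
    # List exact origins; FastAPI's docs rule out "*" wildcards when credentials are allowed
    allow_origins=["http://localhost:3000"],
    allow_credentials=True,
    allow_methods=["GET"],  # EventSource only sends GET
    allow_headers=["Last-Event-ID"],  # sent by EventSource when it reconnects
)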

Issue 2: Connection timeout

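If the server stays quiet for too long, some proxies (nginx, Cloudflare) may close the idle connection. Solution: send a heartbeat (an SSE comment line, which EventSource ignores) whenever 15 seconds pass without data:

import asyncio

HEARTBEAT_SECONDS = 15

# Nested inside stream_response(), so user_message comes from the enclosing scope
async def generate():
    # acreate is the async variant in the pre-1.0 openai SDK
    response = await openai.ChatCompletion.acreate(
        model="gpt-4",
        messages=[{"role": "user", "content": user_message}],
        stream=True,
    )
    chunks = response.__aiter__()
    pending = asyncio.ensure_future(chunks.__anext__())

    while True:
        # Wait for the next chunk, but no longer than the heartbeat interval
        done, _ = await asyncio.wait({pending}, timeout=HEARTBEAT_SECONDS)
        if not done:
            # Nothing arrived in time: send a comment line to keep proxies awake
            yield ": heartbeat\n\n"
            continue
        try:
            chunk = pending.result()
        except StopAsyncIteration:
            break
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            yield f"data: {json.dumps({'text': delta['content']})}\n\n"
        pending = asyncio.ensure_future(chunks.__anext__())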

Issue 3: Errors in the response

If the LLM API returns an error (quota exceeded, rate limit), the stream breaks midway. Wrap the loop in try/except and notify the frontend:

def generate():
    try:
        response = openai.ChatCompletion.create(..., stream=True)
        for chunk in response:
            delta = chunk["choices"][0].get("delta", {})
            if "content" in delta:
                yield f"data: {json.dumps({'text': delta['content']})}\n\n"
    except Exception as e:
        # Send the failure as a regular event so the frontend can display it
        yield f"data: {json.dumps({'error': str(e)})}\n\n"

On the JavaScript side, check each message for the error field:

eventSource.onmessage = (event) => {
    const data = JSON.parse(event.data);
    if (data.error) {
        output.textContent = `Error: ${data.error}`;
        eventSource.close(); // Stop here so the browser does not reconnect
    } else {
        output.textContent += data.text;
    }
};

Final tip: caching and rate limiting

If you're generating the same responses multiple times, caching is your friend. Redis works well. And rate limiting (max X requests per IP per minute) prevents a sneaky user from burning through your API account in seconds.
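
As a rough illustration of the rate-limiting idea, here is a minimal in-memory sliding-window limiter. The class name SlidingWindowLimiter and its parameters are hypothetical, and state lives in a single process; with several workers the counters would need shared storage such as Redis.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most max_requests per key within the last window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self.clock()
        hits = self._hits[key]
        # Drop timestamps that have left the window
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=3, window_seconds=60)
results = [limiter.allow("203.0.113.7") for _ in range(4)]
print(results)  # [True, True, True, False]
```

In the FastAPI handler, one option is to call allow() with the client IP before opening the LLM stream and return HTTP 429 when it comes back False.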

In practice, here's what I do: for each request, I cache the full response under a hash of the message for 1 hour. If the same question comes in again, I replay the cached response as a stream too (read from cache and yield).

Done. Now your app will respond in real time, without freezing, and the user will think it's magic. From here on, it's just iteration: add typing (TypeScript), better CSS for the output, auto-save history, whatever you want. The SSE foundation stays robust and ready for scale.

Tags: LLM, streaming, SSE, backend, JavaScript, Python