The problem: responses that freeze the interface
Ever asked an AI for a response and stared at a frozen screen for 30 seconds? That's the classic scenario when you make a traditional HTTP request/response to an LLM. The browser or app gets stuck waiting for the entire response, and nothing happens until it arrives complete.
If a user is waiting and the first word takes 5 seconds to appear, they think the app crashed. It's a terrible experience. The solution is simple: deliver the text as it's generated, word by word, in real time.
That's where Server-Sent Events (SSE) comes in as the hero.

Why SSE instead of WebSocket?
First, let me be clear: WebSocket works too. But for the specific case of a server continuously sending data to the client (without much heavy bidirectional interaction), SSE is lighter and comes native in the browser.
SSE uses plain HTTP and doesn't need a heavy library on the frontend. The connection stays open, and the server sends events whenever it wants. When the response finishes, the server closes the stream. Simple.
WebSocket is more powerful if you need heavy two-way communication (real-time chat with lots of rapid exchanges). For an AI generating a response? WebSocket is overkill; SSE is lighter and easier to debug.
Implementing on the backend: Python with FastAPI
I'll show you a real example that works. I'll use FastAPI because it's quick to write and its StreamingResponse can serve an SSE stream without extra dependencies.
First, you need a function that calls the LLM (I'll use OpenAI here, but it works for Claude, Ollama, etc.) and iterates over the response:
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI  # openai>=1.0 client
import json

app = FastAPI()
client = AsyncOpenAI(api_key="your-key-here")  # better: read it from an environment variable

# GET, not POST: the browser's EventSource can only issue GET requests
@app.get("/stream-response")
async def stream_response(user_message: str):
    async def generate():
        # The async client keeps the event loop free while waiting for tokens
        stream = await client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": user_message}],
            stream=True,
        )
        async for chunk in stream:
            # Some chunks carry no choices or no text (e.g. the final one)
            content = chunk.choices[0].delta.content if chunk.choices else None
            if content:
                # Format as SSE
                yield f"data: {json.dumps({'text': content})}\n\n"

    return StreamingResponse(
        generate(),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache"},
    )
The important detail: media_type="text/event-stream" sets the Content-Type header that tells the browser this is an SSE stream. And each event must follow the format data: {json}\n\n; the blank line produced by the two line breaks is what marks the end of an event.
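To make the framing rule concrete, here's a minimal sketch of a helper that builds one event. The name format_sse and the optional event parameter are my own assumptions for illustration, not something FastAPI provides:

```python
import json
from typing import Optional


def format_sse(payload: dict, event: Optional[str] = None) -> str:
    """Serialize one SSE event: optional `event:` line, a `data:` line, blank line."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # json.dumps escapes newlines inside strings, so the payload stays on one line
    lines.append(f"data: {json.dumps(payload)}")
    # The trailing blank line is what terminates the event
    return "\n".join(lines) + "\n\n"


print(repr(format_sse({"text": "Hello"})))
# 'data: {"text": "Hello"}\n\n'
```

Serializing to JSON matters here: a raw newline inside the text would otherwise end the data line early and corrupt the event.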
Receiving on the frontend: plain JavaScript
Now the browser side. No massive npm install needed:
const button = document.getElementById("send-button");
const output = document.getElementById("response-output");
const inputField = document.getElementById("user-input");

button.addEventListener("click", () => {
  const userMessage = inputField.value;
  output.textContent = ""; // Clear previous response

  const eventSource = new EventSource(
    `/stream-response?user_message=${encodeURIComponent(userMessage)}`
  );

  eventSource.onmessage = (event) => {
    const data = JSON.parse(event.data);
    // textContent, not innerHTML: model output must not be parsed as HTML
    output.textContent += data.text;
  };

  // When the server closes the stream, EventSource reports an error and
  // would reconnect on its own, so close it explicitly here
  eventSource.onerror = () => {
    eventSource.close();
    output.textContent += "\n[Response completed]";
  };
});
See? The browser listens to the stream with EventSource, and each event triggers onmessage. As the AI generates the response, the text appears on screen live. No freezing.
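If you're curious what EventSource does with the raw stream, here's a rough sketch of the parsing in Python. It's simplified on purpose (it ignores the id:, event: and retry: fields), and parse_sse is a name I made up for this example:

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each event in an SSE stream (simplified)."""
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "":
            # Blank line: dispatch the event collected so far, if any
            if data_parts:
                yield "\n".join(data_parts)
                data_parts = []
        elif line.startswith(":"):
            continue  # comment line (e.g. a heartbeat); never reaches onmessage
        elif line.startswith("data:"):
            # A single leading space after the colon is stripped
            value = line[len("data:"):]
            data_parts.append(value[1:] if value.startswith(" ") else value)


stream = [
    'data: {"text": "Hel"}\n', "\n",
    ": heartbeat\n", "\n",
    'data: {"text": "lo"}\n', "\n",
]
text = "".join(json.loads(payload)["text"] for payload in parse_sse(stream))
print(text)  # Hello
```

It's also a handy mental model for debugging: if an event never shows up in onmessage, check whether the blank line after it actually got sent.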
Gotchas and real-world pitfalls
Issue 1: CORS
If the frontend is on a different origin (e.g., localhost:3000 calling localhost:8000), the browser will block the SSE request with a CORS error. Solution: configure CORS in FastAPI:
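from fastapi.middleware.cors import CORSMiddleware

app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:3000"],  # In production: list the exact origin(s)
    allow_methods=["GET"],  # EventSource only issues GET requests
    allow_headers=["*"],
    # allow_credentials is left off: it cannot be combined with "*" wildcards,
    # and EventSource sends no cookies unless created with withCredentials
)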
Issue 2: Connection timeout
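If the LLM takes a long time to respond or the stream stays quiet for too long, some proxies (nginx, Cloudflare) might close the connection. Solution: send a heartbeat (an SSE comment line) whenever 15 seconds pass without data:
import asyncio

async def generate():
    # `client` is an openai.AsyncOpenAI instance (openai>=1.0)
    stream = await client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": user_message}],
        stream=True,
    )
    chunks = stream.__aiter__()
    pending = asyncio.ensure_future(chunks.__anext__())
    try:
        while True:
            # Wait for the next chunk, but never longer than 15 seconds
            done, _ = await asyncio.wait({pending}, timeout=15)
            if not done:
                yield ": heartbeat\n\n"  # SSE comment line; the browser ignores it
                continue
            try:
                chunk = pending.result()
            except StopAsyncIteration:
                break  # the LLM stream is finished
            pending = asyncio.ensure_future(chunks.__anext__())
            content = chunk.choices[0].delta.content if chunk.choices else None
            if content:
                yield f"data: {json.dumps({'text': content})}\n\n"
    finally:
        pending.cancel()  # stop waiting if the client disconnects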
Issue 3: Errors in the response
If the LLM API returns an error (quota exceeded, rate limit), the stream breaks midway. You need a try/except that notifies the frontend:
async def generate():
    try:
        # `client` is an openai.AsyncOpenAI instance (openai>=1.0)
        stream = await client.chat.completions.create(..., stream=True)
        async for chunk in stream:
            content = chunk.choices[0].delta.content if chunk.choices else None
            if content:
                yield f"data: {json.dumps({'text': content})}\n\n"
    except Exception as e:
        # Tell the frontend what happened instead of silently dropping the stream
        yield f"data: {json.dumps({'error': str(e)})}\n\n"
And on the JavaScript side, check for the error field:
eventSource.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.error) {
    output.textContent = `Error: ${data.error}`;
    eventSource.close(); // nothing more will arrive; prevent auto-reconnect
  } else {
    output.textContent += data.text;
  }
};
Final tip: caching and rate limiting
If you're generating the same responses multiple times, caching is your friend. Redis works well. And rate limiting (max X requests per IP per minute) prevents a sneaky user from burning through your API account in seconds.
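For the rate limiting part, here's a minimal in-memory sketch of a sliding-window limiter. The class name, the limits and the example IP are all assumptions for illustration; it lives in a single process, so with several workers you'd want a shared store such as Redis instead:

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most `max_requests` per `window_seconds` for each key (e.g. an IP)."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # key -> timestamps of recent requests

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[key]
        # Drop timestamps that have fallen out of the window
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=2, window_seconds=60)
results = [limiter.allow("203.0.113.7", now=t) for t in (0, 1, 2, 61)]
print(results)  # [True, True, False, True]
```

In the endpoint you would call allow() with the client's IP before opening the stream and answer with HTTP 429 when it returns False, so a rejected request never reaches the LLM API.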
In practice, here's what I do: for each request, I cache the response under a hash of the message for 1 hour. If the same question comes in again, I return the cached response, still as a stream (read from the cache and yield).
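Here's roughly what that looks like as code. This is a sketch under assumptions: a plain dict stands in for Redis, and cached_stream, cache_key and fake_llm are hypothetical names, with fake_llm replacing the real LLM call so the example runs on its own:

```python
import asyncio
import hashlib
import time
from typing import AsyncIterator, Callable, Dict, List, Tuple

CACHE_TTL_SECONDS = 3600  # one hour
_cache: Dict[str, Tuple[float, List[str]]] = {}  # key -> (expiry, chunks)


def cache_key(message: str) -> str:
    return hashlib.sha256(message.encode("utf-8")).hexdigest()


async def cached_stream(
    message: str, produce: Callable[[str], AsyncIterator[str]]
) -> AsyncIterator[str]:
    """Yield chunks from the cache if fresh, else from `produce`, caching them."""
    key = cache_key(message)
    entry = _cache.get(key)
    if entry and entry[0] > time.monotonic():
        for chunk in entry[1]:
            yield chunk  # replay the stored chunks as a stream
        return
    chunks: List[str] = []
    async for chunk in produce(message):
        chunks.append(chunk)
        yield chunk
    # Store only after the stream finished, so partial responses are never cached
    _cache[key] = (time.monotonic() + CACHE_TTL_SECONDS, chunks)


calls = 0


async def fake_llm(message: str) -> AsyncIterator[str]:
    """Stand-in for the real LLM call; counts how often it is invoked."""
    global calls
    calls += 1
    for word in ("Hello", " ", "world"):
        yield word


async def demo() -> Tuple[str, str]:
    first = "".join([c async for c in cached_stream("hi", fake_llm)])
    second = "".join([c async for c in cached_stream("hi", fake_llm)])
    return first, second


first, second = asyncio.run(demo())
print(first, "|", second, "|", calls)  # Hello world | Hello world | 1
```

Because the cached path is still a generator, the endpoint doesn't need to care where the chunks come from: it wraps each one in the same data: event either way.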
Done. Now your app will respond in real time, without freezing, and the user will think it's magic. From here on, it's just iteration: add typing (TypeScript), better CSS for the output, auto-save history, whatever you want. The SSE foundation stays robust and ready for scale.