Overview

Building modern AI applications goes far beyond simple API calls. You need to integrate streaming, multimodal inputs, RAG (Retrieval-Augmented Generation), and robust authentication into a cohesive product. This guide walks through building a production-grade LLM service using FastAPI on the backend and Next.js on the frontend.

1. AI App Architecture Design

The Three-Layer Architecture

A modern AI application is built around three distinct layers:

- **Frontend Layer**: Next.js App Router, Vercel AI SDK, streaming UI components

- **Backend Layer**: FastAPI, LangChain, auth middleware, caching

- **AI/Data Layer**: OpenAI/Claude APIs, vector databases, embedding models

This separation of concerns makes the system easier to scale, test, and maintain independently.

Streaming vs Batch Processing

Two primary approaches exist for handling LLM responses.

**Streaming** sends tokens to the client as they are generated. This improves perceived responsiveness and suits conversational interfaces. It is typically implemented with Server-Sent Events (SSE) or WebSockets.

**Batch processing** waits for the full response before returning it. This suits document processing, data analysis pipelines, and background jobs. It is typically orchestrated with a task queue such as Celery, using Redis as the broker.
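To make the streaming option concrete, here is a minimal sketch of the SSE wire format using only the standard library. The helper names `sse_event` and `parse_sse` are illustrative and not part of any framework. Each token is JSON-encoded before framing, on the assumption that tokens may contain newlines.

```python
import json


def sse_event(data: str) -> str:
    """Frame one payload as a Server-Sent Event.

    Every payload line gets its own "data:" field, and a blank line
    terminates the event.
    """
    return "".join(f"data: {line}\n" for line in data.split("\n")) + "\n"


def parse_sse(stream: str) -> list[str]:
    """Reassemble payloads from a complete SSE text stream."""
    events = []
    for block in stream.split("\n\n"):
        if not block:
            continue
        fields = [
            line[len("data:"):].removeprefix(" ")
            for line in block.split("\n")
            if line.startswith("data:")
        ]
        events.append("\n".join(fields))
    return events


# Simulate a server streaming four tokens, one of which contains a newline.
tokens = ["Hello", ",", " world", "!\nBye"]
wire = "".join(sse_event(json.dumps(t)) for t in tokens)

# Simulate the client decoding the stream back into text.
decoded = "".join(json.loads(e) for e in parse_sse(wire))
print(decoded)
```

JSON-encoding each token keeps leading spaces and embedded newlines intact, which a bare `data: {token}` line would not guarantee.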

Project Folder Structure

ai-app/
├── backend/
│   ├── app/
│   │   ├── main.py
│   │   ├── routers/
│   │   │   ├── chat.py
│   │   │   └── documents.py
│   │   ├── services/
│   │   │   ├── llm_service.py
│   │   │   └── vector_service.py
│   │   └── models/
│   │       └── schemas.py
│   ├── requirements.txt
│   └── Dockerfile
├── frontend/
│   ├── app/
│   │   ├── chat/
│   │   │   └── page.tsx
│   │   └── api/
│   │       └── chat/
│   │           └── route.ts
│   ├── components/
│   └── package.json
└── docker-compose.yml

2. FastAPI Backend

Installation

pip install fastapi uvicorn openai langchain langchain-openai python-dotenv

Pydantic Models for Request/Response Validation

# app/models/schemas.py
from pydantic import BaseModel, Field
from typing import List, Optional
from enum import Enum

class Role(str, Enum):
    user = "user"
    assistant = "assistant"
    system = "system"

class Message(BaseModel):
    role: Role
    content: str

class ChatRequest(BaseModel):
    messages: List[Message]
    model: str = Field(default="gpt-4o-mini")
    temperature: float = Field(default=0.7, ge=0, le=2)
    max_tokens: Optional[int] = Field(default=None)

class ChatResponse(BaseModel):
    content: str
    usage: dict

Async Streaming Endpoint

FastAPI's `StreamingResponse` pushes LLM tokens to the client in real time as they are produced.

import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from fastapi.middleware.cors import CORSMiddleware
from openai import AsyncOpenAI

from app.models.schemas import ChatRequest

app = FastAPI(title="AI App Backend")

app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:3000"],
    allow_methods=["*"],
    allow_headers=["*"],
)

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

@app.post("/api/chat/stream")
async def chat_stream(request: ChatRequest):
    async def generate():
        stream = await client.chat.completions.create(
            model=request.model,
            # Pydantic v2 API; on Pydantic v1 use m.dict() instead
            messages=[m.model_dump(mode="json") for m in request.messages],
            stream=True,
            temperature=request.temperature,
            max_tokens=request.max_tokens,
        )
        async for chunk in stream:
            # Some chunks carry no choices or an empty delta; skip them
            if not chunk.choices:
                continue
            delta = chunk.choices[0].delta.content
            if delta:
                # JSON-encode so a newline inside a token cannot break SSE framing
                yield f"data: {json.dumps(delta)}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")

Dependency Injection for Auth

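import os

from fastapi import Depends, HTTPException, status
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
from jose import ExpiredSignatureError, JWTError, jwt  # python-jose, as in section 6

# Assumption: the signing key is supplied via a JWT_SECRET_KEY environment variable
SECRET_KEY = os.environ["JWT_SECRET_KEY"]

security = HTTPBearer()

async def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
    token = credentials.credentials
    try:
        return jwt.decode(token, SECRET_KEY, algorithms=["HS256"])
    except ExpiredSignatureError:
        raise HTTPException(
            status_code=status.HTTP_401_UNAUTHORIZED,
            detail="Token has expired.",
        )
    except JWTError:
        # Bad signature, malformed token, and similar failures
        raise HTTPException(
            status_code=status.HTTP_401_UNAUTHORIZED,
            detail="Invalid token.",
        )

@app.post("/api/chat/secure")
async def secure_chat(request: ChatRequest, user=Depends(verify_token)):
    # Only accessible to authenticated users
    pass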

3. LangChain Integration

Conversation Chain with Memory

LangChain simplifies memory management, chain composition, and tool integration.

from langchain_openai import ChatOpenAI
from langchain.memory import ConversationBufferWindowMemory
from langchain.chains import ConversationChain
from langchain.prompts import PromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", streaming=True)
memory = ConversationBufferWindowMemory(k=10)

template = """You are a helpful AI assistant.

Current conversation:
{history}
Human: {input}
AI:"""

prompt = PromptTemplate(
    input_variables=["history", "input"],
    template=template,
)

chain = ConversationChain(llm=llm, memory=memory, prompt=prompt)
response = chain.predict(input="Hello, I am a Python developer.")

Building a RAG Pipeline

RAG (Retrieval-Augmented Generation) augments LLM responses by searching external documents and injecting relevant context into the prompt.

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader

# Load and chunk documents
loader = PyPDFLoader("document.pdf")
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
chunks = splitter.split_documents(documents)

# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)

# Build RAG chain
llm = ChatOpenAI(model="gpt-4o")
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
)

answer = qa_chain.invoke({"query": "What are the key points in the document?"})
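The retrieve-then-inject idea behind RAG can be sketched without any library. In the sketch below, `embed` is a toy bag-of-words stand-in for a real embedding model, and `retrieve` and `build_prompt` are hypothetical helper names. It illustrates the flow only, not how LangChain implements it.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query: str, chunks: list[str], k: int = 2) -> str:
    """Inject the retrieved chunks into the prompt as context."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


chunks = [
    "FastAPI is a high-performance API framework.",
    "pgvector adds vector search to PostgreSQL.",
    "Redis is often used for caching responses.",
]
question = "Which extension adds vector search to PostgreSQL?"
prompt = build_prompt(question, chunks, k=1)
print(prompt)
```

In a real pipeline the prompt would then be sent to the LLM. Swapping `embed` for a learned embedding model is what lets retrieval match on meaning rather than shared words.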

Creating Custom Tools

import requests
from langchain.tools import tool
from langchain.agents import initialize_agent, AgentType
from langchain_openai import ChatOpenAI

@tool
def search_database(query: str) -> str:
    """Search the database for information. The query parameter is the keyword to search."""
    # `db` is assumed to be your application's own search client, defined elsewhere
    results = db.search(query)
    return str(results)

@tool
def get_weather(city: str) -> str:
    """Get the current weather for a specific city."""
    # Placeholder URL; substitute the weather API you actually use
    response = requests.get(f"https://api.weather.com/v1/{city}", timeout=10)
    response.raise_for_status()
    return response.json()["description"]

llm = ChatOpenAI(model="gpt-4o", temperature=0)
tools = [search_database, get_weather]
agent = initialize_agent(tools, llm, agent=AgentType.OPENAI_FUNCTIONS)

4. Next.js Frontend

Streaming Chat with Vercel AI SDK

The Vercel AI SDK is Vercel's library for building streaming AI interfaces and integrates directly with Next.js route handlers and React hooks.

npm install ai @ai-sdk/openai react-markdown

// app/api/chat/route.ts
import { openai } from '@ai-sdk/openai'
import { streamText } from 'ai'

export async function POST(req: Request) {
  const { messages } = await req.json()

  const result = await streamText({
    model: openai('gpt-4o-mini'),
    messages,
    system: 'You are a helpful and friendly AI assistant.',
  })

  return result.toDataStreamResponse()
}

Chat UI Component

// app/chat/page.tsx
'use client'

// Import path assumed for the `ai` package installed above; adjust if your SDK version exposes useChat elsewhere
import { useChat } from 'ai/react'

export default function ChatPage() {
  const { messages, input, handleInputChange, handleSubmit, isLoading } = useChat({
    api: '/api/chat',
  })

  return (
    <div>
      <h1>AI Assistant</h1>
      <div>
        {messages.map(m => (
          <div
            key={m.id}
            className={`flex ${m.role === 'user' ? 'justify-end' : 'justify-start'}`}
          >
            <div
              className={`max-w-xs rounded-lg p-3 ${
                m.role === 'user'
                  ? 'bg-blue-500 text-white'
                  : 'bg-gray-100 text-gray-800'
              }`}
            >
              {m.content}
            </div>
          </div>
        ))}
        {isLoading && (
          <div>Generating response...</div>
        )}
      </div>
      <form onSubmit={handleSubmit}>
        <input
          value={input}
          onChange={handleInputChange}
          className="flex-1 border rounded-lg px-3 py-2 focus:outline-none focus:ring-2 focus:ring-blue-500"
          placeholder="Type your message..."
          disabled={isLoading}
        />
        <button
          type="submit"
          disabled={isLoading}
          className="bg-blue-500 text-white px-4 py-2 rounded-lg disabled:opacity-50"
        >
          Send
        </button>
      </form>
    </div>
  )
}

File Upload Handling

// app/upload/page.tsx
'use client'

import { useState } from 'react'

export default function UploadPage() {
  const [status, setStatus] = useState('')

  async function handleUpload(e: React.FormEvent<HTMLFormElement>) {
    e.preventDefault()
    const formData = new FormData(e.currentTarget)
    setStatus('Uploading...')

    const response = await fetch('/api/upload', {
      method: 'POST',
      body: formData,
    })

    if (response.ok) {
      const data = await response.json()
      setStatus(`Done: ${data.message}`)
    } else {
      setStatus('Upload failed')
    }
  }

  return (
    <form onSubmit={handleUpload}>
      {/* The field name "file" is an assumption; match whatever /api/upload expects */}
      <input type="file" name="file" />
      <button type="submit">Upload</button>
      {status && <p className="mt-2 text-sm">{status}</p>}
    </form>
  )
}

5. Vector Database Integration

pgvector (PostgreSQL Extension)

Using the pgvector extension allows vector search within your existing PostgreSQL database.

-- Enable the pgvector extension
CREATE EXTENSION vector;

-- Create a table with an embedding column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding vector(1536),
    metadata JSONB,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Create HNSW index for fast approximate nearest neighbor search
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

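import asyncpg
from pgvector.asyncpg import register_vector  # from the `pgvector` Python package

# DATABASE_URL must be a plain postgresql:// DSN here, not the SQLAlchemy postgresql+asyncpg:// form

async def get_connection():
    conn = await asyncpg.connect(DATABASE_URL)
    # Teach asyncpg how to encode and decode the vector type
    await register_vector(conn)
    return conn

async def store_embedding(content: str, embedding: list):
    conn = await get_connection()
    try:
        await conn.execute(
            "INSERT INTO documents (content, embedding) VALUES ($1, $2)",
            content, embedding,
        )
    finally:
        await conn.close()

async def search_similar(query_embedding: list, k: int = 5):
    conn = await get_connection()
    try:
        # <=> is pgvector's cosine distance, so 1 - distance is cosine similarity
        return await conn.fetch(
            """SELECT content, 1 - (embedding <=> $1) AS similarity
               FROM documents
               ORDER BY embedding <=> $1
               LIMIT $2""",
            query_embedding, k,
        )
    finally:
        await conn.close()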
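To see what the `1 - (embedding <=> $1)` expression in the query computes, here is a small pure-Python sketch of the same ranking done exactly, with no index. The function names and the two-dimensional sample vectors are illustrative assumptions. pgvector performs this inside PostgreSQL, and with the HNSW index it does so approximately.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance, the quantity pgvector's <=> operator returns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def search_similar_exact(query, rows, k=5):
    """Brute-force equivalent of the SQL: score every row, sort, take k."""
    scored = [(content, 1.0 - cosine_distance(query, emb)) for content, emb in rows]
    scored.sort(key=lambda r: r[1], reverse=True)  # highest similarity first
    return scored[:k]


rows = [
    ("a", [1.0, 0.0]),
    ("b", [0.0, 1.0]),
    ("c", [1.0, 1.0]),
]
top = search_similar_exact([1.0, 0.0], rows, k=2)
print(top)
```

Ordering by ascending distance, as the SQL does, is equivalent to ordering by descending similarity here.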

Chroma DB (Local Development)

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

vectorstore = Chroma(
    collection_name="my_documents",
    embedding_function=embeddings,
    persist_directory="./chroma_db",
)

# Add documents
vectorstore.add_texts(
    texts=["Python is widely used for AI development.", "FastAPI is a high-performance API framework."],
    metadatas=[{"source": "intro.txt"}, {"source": "framework.txt"}],
)

# Similarity search
results = vectorstore.similarity_search("API development", k=3)

6. Authentication and Security

JWT Token Authentication

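import os
from datetime import datetime, timedelta, timezone

from fastapi import HTTPException
from jose import JWTError, jwt
from passlib.context import CryptContext

# Assumption: load the key from a JWT_SECRET_KEY environment variable; never hard-code it
SECRET_KEY = os.environ["JWT_SECRET_KEY"]
ALGORITHM = "HS256"
ACCESS_TOKEN_EXPIRE_MINUTES = 30

pwd_context = CryptContext(schemes=["bcrypt"], deprecated="auto")

def create_access_token(data: dict):
    to_encode = data.copy()
    # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC datetime
    expire = datetime.now(timezone.utc) + timedelta(minutes=ACCESS_TOKEN_EXPIRE_MINUTES)
    to_encode.update({"exp": expire})
    return jwt.encode(to_encode, SECRET_KEY, algorithm=ALGORITHM)

def verify_token(token: str):
    try:
        payload = jwt.decode(token, SECRET_KEY, algorithms=[ALGORITHM])
    except JWTError:
        # Covers expired, malformed, and wrongly signed tokens
        raise HTTPException(status_code=401, detail="Invalid token")
    username = payload.get("sub")
    if username is None:
        raise HTTPException(status_code=401, detail="Invalid token")
    return username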

Prompt Injection Defense

import re

from fastapi import HTTPException

# A pattern blocklist is only a first-line heuristic: it is easy to bypass and
# can reject harmless input (for example "act as a reviewer").
INJECTION_PATTERNS = [
    r"ignore previous instructions",
    r"disregard all prior",
    r"you are now",
    r"act as",
    r"pretend you are",
]

def sanitize_input(user_input: str) -> str:
    lower_input = user_input.lower()
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, lower_input):
            raise HTTPException(
                status_code=400,
                detail="Potentially harmful input detected.",
            )
    if len(user_input) > 4000:
        raise HTTPException(status_code=400, detail="Input is too long.")
    return user_input.strip()

Rate Limiting

from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/api/chat")
@limiter.limit("10/minute")
async def chat(request: Request, chat_request: ChatRequest):
    # Limited to 10 requests per minute per IP.
    # slowapi needs the `request: Request` parameter to identify the caller.
    pass

7. Multimodal Input Processing

Image Analysis with GPT-4o Vision

import base64
from pathlib import Path

async def analyze_image(image_path: str, question: str) -> str:
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode("utf-8")

    ext = Path(image_path).suffix.lower()
    mime_map = {".jpg": "image/jpeg", ".png": "image/png", ".gif": "image/gif"}
    media_type = mime_map.get(ext, "image/jpeg")

    # `client` is the AsyncOpenAI instance created in section 2
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:{media_type};base64,{image_data}"
                        },
                    },
                    {"type": "text", "text": question},
                ],
            }
        ],
    )
    return response.choices[0].message.content

Audio Transcription with Whisper

async def transcribe_audio(audio_file_path: str) -> str:
    with open(audio_file_path, "rb") as audio_file:
        transcript = await client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            language="en",
        )
    return transcript.text

8. Performance Optimization

Redis Response Caching

import hashlib
import json

import redis.asyncio as redis  # async client, so cache calls do not block the event loop

redis_client = redis.Redis(host="localhost", port=6379, decode_responses=True)

def get_cache_key(messages: list) -> str:
    content = json.dumps(messages, sort_keys=True)
    return hashlib.md5(content.encode()).hexdigest()

async def cached_chat(messages: list) -> str:
    cache_key = get_cache_key(messages)
    cached = await redis_client.get(cache_key)
    if cached:
        return json.loads(cached)

    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )
    result = response.choices[0].message.content

    # Cache with 1-hour TTL
    await redis_client.setex(cache_key, 3600, json.dumps(result))
    return result

Database Connection Pooling

from sqlalchemy.ext.asyncio import create_async_engine, AsyncSession
from sqlalchemy.orm import sessionmaker

engine = create_async_engine(
    DATABASE_URL,
    pool_size=10,
    max_overflow=20,
    pool_pre_ping=True,
    echo=False,
)

AsyncSessionLocal = sessionmaker(engine, class_=AsyncSession, expire_on_commit=False)

async def get_db():
    # The async context manager closes the session on exit, so no explicit close is needed
    async with AsyncSessionLocal() as session:
        yield session

9. Docker Compose Deployment

# docker-compose.yml
version: '3.8'

services:
  backend:
    build: ./backend
    ports:
      - '8000:8000'
    environment:
      - OPENAI_API_KEY=your_key
      - DATABASE_URL=postgresql+asyncpg://user:pass@db/aiapp
      - REDIS_URL=redis://redis:6379
    depends_on:
      - db
      - redis
    restart: unless-stopped

  frontend:
    build: ./frontend
    ports:
      - '3000:3000'
    environment:
      - NEXT_PUBLIC_API_URL=http://backend:8000
    depends_on:
      - backend
    restart: unless-stopped

  db:
    image: pgvector/pgvector:pg16
    environment:
      - POSTGRES_DB=aiapp
      - POSTGRES_USER=user
      - POSTGRES_PASSWORD=pass
    volumes:
      - postgres_data:/var/lib/postgresql/data
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    volumes:
      - redis_data:/data
    restart: unless-stopped

volumes:
  postgres_data:
  redis_data:

Backend Dockerfile

FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

Frontend Dockerfile

# Requires output: 'standalone' in next.config.js so that .next/standalone is produced
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

FROM node:20-alpine AS runner
WORKDIR /app
COPY --from=builder /app/.next/standalone ./
COPY --from=builder /app/.next/static ./.next/static
EXPOSE 3000
CMD ["node", "server.js"]

Deployment Commands

# Build and start
docker-compose up --build -d

# View logs
docker-compose logs -f backend

# Scale out (3 backend instances)
docker-compose up --scale backend=3 -d

# Stop
docker-compose down

10. Quiz: Check Your Understanding

**Question**: Why is SSE often preferred over WebSockets for streaming LLM responses?

**Answer**: SSE is simpler to implement and is purpose-built for unidirectional server-to-client data push over standard HTTP, which matches the LLM token streaming pattern.

**Explanation**: WebSockets support full-duplex communication but introduce extra complexity (upgrade handshake, connection management). For streaming LLM output, you only need one direction of data flow. SSE runs over plain HTTP, generally passes through proxies without special configuration, and gets automatic reconnection from the browser's `EventSource` API. Note that `EventSource` only issues GET requests, so a POST endpoint such as `/api/chat/stream` is consumed with `fetch` and a stream reader instead.

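**Question**: What advantage does RAG have over fine-tuning for keeping a model's knowledge current?

**Answer**: RAG does not require retraining the model. It retrieves up-to-date or domain-specific documents at inference time and injects them as context, which is far cheaper and faster to update.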

**Explanation**: Fine-tuning bakes knowledge into model weights and requires expensive GPU compute. RAG keeps knowledge outside the model in a vector database. You can update your knowledge base just by adding or removing documents, without touching the model. It also reduces hallucination by grounding answers in retrieved evidence.

**Question**: What does the `chunk_overlap` parameter of the text splitter control?

**Answer**: It specifies how many characters from the end of one chunk are repeated at the start of the next chunk, preserving context across chunk boundaries.

**Explanation**: When a document is split into chunks, a sentence or concept can be cut in half at the boundary. Overlapping ensures that the surrounding context is present in adjacent chunks. For example, with chunk_size=1000 and chunk_overlap=200, each chunk shares 200 characters with its neighbor, reducing the chance of losing relevant context during retrieval.
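The overlap arithmetic can be checked with a simplified fixed-window splitter. This is an assumption-laden sketch: `chunk_text` is a hypothetical helper that cuts at fixed character offsets, whereas `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and sentences, so its real boundaries will differ.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# 2600 characters of sample text, so the windows start at 0, 800, and 1600.
text = "".join(chr(ord("a") + i % 26) for i in range(2600))
chunks = chunk_text(text)

# The tail of each chunk is repeated at the head of the next one.
shared = chunks[0][-200:] == chunks[1][:200]
print(len(chunks), shared)
```

With `chunk_size=1000` and `chunk_overlap=200`, each window advances by 800 characters, which is why a larger overlap produces more chunks for the same document.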

**Question**: Why create an HNSW index on the embedding column?

**Answer**: HNSW enables approximate nearest neighbor (ANN) search that scales to millions of vectors while remaining fast, unlike brute-force exact search, which becomes prohibitively slow at scale.

**Explanation**: An exact cosine similarity scan compares the query against every stored vector, so its cost grows linearly with the table. Over one million 1536-dimensional vectors (OpenAI embeddings), that can take on the order of seconds per query. HNSW builds a hierarchical graph structure that dramatically narrows the search space. The tradeoff is a small accuracy loss (approximate rather than exact), which is acceptable for most retrieval tasks. pgvector, Chroma, Pinecone, and Weaviate all support HNSW or similar ANN algorithms.

**Question**: What does the `k` parameter of `ConversationBufferWindowMemory` control?

**Answer**: It sets the number of recent conversation turns to retain in the context window passed to the LLM.

**Explanation**: LLMs have a finite token context limit. Keeping the entire conversation history would eventually exceed this limit and increase cost. Setting k=10 retains only the 10 most recent user-assistant exchanges. Earlier turns are dropped. This balances coherent multi-turn dialogue against token budget. For longer-term memory, consider summarization memory or a dedicated memory store.
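The windowing behavior is easy to reproduce with a bounded queue. The `WindowMemory` class below is a hypothetical stand-in written for illustration, not LangChain's implementation. It assumes one "turn" is a user message paired with the assistant's reply.

```python
from collections import deque


class WindowMemory:
    """Keep only the k most recent (user, assistant) exchanges."""

    def __init__(self, k: int):
        # A deque with maxlen silently discards the oldest entry when full.
        self.turns = deque(maxlen=k)

    def save(self, user: str, assistant: str) -> None:
        self.turns.append((user, assistant))

    def history(self) -> str:
        """Render the retained turns in the Human/AI prompt format."""
        return "\n".join(f"Human: {u}\nAI: {a}" for u, a in self.turns)


memory = WindowMemory(k=2)
for i in range(1, 4):
    memory.save(f"question {i}", f"answer {i}")

# Only the last two exchanges remain; the first has been dropped.
print(memory.history())
```

The rendered string is what would fill a `{history}` placeholder in a prompt template, so the prompt size stays bounded no matter how long the conversation runs.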
