Author: Youngju Kim (@fjvbn20031)
- Why Prompt Engineering Still Matters
- Technique 1: System Prompt Design Principles
- Technique 2: Few-Shot Prompting
- Technique 3: Chain-of-Thought (CoT) Prompting
- Technique 4: Structured Output
- Technique 5: XML Tags for Structure (Claude-specific but broadly useful)
- Technique 6: Role + Context + Constraint Pattern
- Prompt Version Control: Treat Prompts Like Code
- Model-Specific Optimization Tips
- Prompt Debugging Methodology
- Conclusion: Prompt Engineering Is an Engineering Discipline
Why Prompt Engineering Still Matters
"With LLMs getting so capable, won't prompt engineering become obsolete soon?"
I get this question often. The answer is no. Here's why:
model capability x prompt quality = output quality
The best model with a bad prompt gives bad results. A well-crafted prompt can produce 20-40% better performance from the same model. In some task-specific cases, a small model with an excellent prompt beats a large model with a sloppy one.
As better models keep arriving in 2025, good prompting remains a portable skill that transfers across model generations.
Technique 1: System Prompt Design Principles
The system prompt is the LLM's "configuration." Most developers treat it as an afterthought.
# Bad system prompt (how most people start)
bad_prompt = "You are a helpful assistant."
# Problem: so vague the model doesn't know what to do
# Good system prompt using the COSTAR framework
good_prompt = """
## Role
You are a senior customer service specialist at TechCorp.
## Context
TechCorp sells B2B SaaS software for HR management.
Customers are typically HR managers at companies with 100-5,000 employees.
## Objective
Help customers resolve issues and answer product questions.
Escalate billing issues to finance@techcorp.com.
## Style
- Professional but warm English
- Keep responses under 150 words
- Always end with a follow-up question
## Tone
Professional, empathetic, solution-oriented
## Audience
HR managers, often non-technical
## Response Format
1. Acknowledge the issue
2. Provide solution (step-by-step if needed)
3. Close with: "Is there anything else I can help you with?"
"""
The COSTAR framework: Context, Objective, Style, Tone, Audience, Response Format. Specifying all six produces predictable, consistent results.
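If you build system prompts often, the six COSTAR sections can be assembled from a small helper instead of hand-editing strings. A minimal sketch — the `build_costar_prompt` function and its parameter names are illustrative, not a standard API:

```python
def build_costar_prompt(context: str, objective: str, style: str,
                        tone: str, audience: str, response_format: str) -> str:
    """Assemble a COSTAR-style system prompt from its six components."""
    sections = [
        ("Context", context),
        ("Objective", objective),
        ("Style", style),
        ("Tone", tone),
        ("Audience", audience),
        ("Response Format", response_format),
    ]
    return "\n".join(f"## {name}\n{body}" for name, body in sections)

print(build_costar_prompt(
    context="TechCorp sells B2B SaaS software for HR management.",
    objective="Help customers resolve issues and answer product questions.",
    style="Professional but warm English, under 150 words.",
    tone="Professional, empathetic, solution-oriented",
    audience="HR managers, often non-technical",
    response_format="Acknowledge, solve, then ask a follow-up question.",
))
```

Because each section is a separate argument, you can vary one (say, Tone) while holding the rest fixed when testing.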
Technique 2: Few-Shot Prompting
Showing examples is faster than describing behavior in words.
# Zero-shot (no examples)
zero_shot = """
Classify the sentiment of this review: 'Delivery took way too long'
"""
# Works, but output format and consistency may vary
# Few-shot (with examples)
few_shot = """
Classify review sentiment as: positive / negative / neutral
Review: "Absolutely love this product!" -> positive
Review: "Shipping was slow but the product is great" -> neutral
Review: "Complete junk, want a refund" -> negative
Review: "Nothing special for the price" -> neutral
Review: "Delivery took way too long" -> """
# Output: negative (much more reliably structured)
Few-shot principles:
- Example quality: include representative cases
- Diversity: cover edge cases too
- Count: 3-8 examples is typically optimal (more isn't always better)
- Order: the last example has the most influence on output (recency bias)
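These principles are easier to follow when the examples live as data and the prompt is rendered from them — swapping, reordering, or adding examples then requires no string surgery. A minimal sketch (the function name and example set are illustrative):

```python
def build_few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Render labeled examples plus the new input in one consistent format."""
    header = "Classify review sentiment as: positive / negative / neutral\n"
    shots = "\n".join(f'Review: "{text}" -> {label}' for text, label in examples)
    return f'{header}{shots}\nReview: "{query}" -> '

examples = [
    ("Absolutely love this product!", "positive"),
    ("Shipping was slow but the product is great", "neutral"),
    ("Complete junk, want a refund", "negative"),
]
print(build_few_shot_prompt(examples, "Delivery took way too long"))
```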
Technique 3: Chain-of-Thought (CoT) Prompting
Asking the LLM to show its reasoning dramatically improves complex inference.
# Without CoT (unreliable)
no_cot = """
A train travels 120km in 2 hours, then 60km in 3 hours.
What is the average speed?
"""
# LLM might average the two speeds: (120/2 + 60/3) / 2 = 40 km/h — wrong formula, wrong answer
# With CoT (reliable)
with_cot = """
A train travels 120km in 2 hours, then 60km in 3 hours.
What is the average speed?
Let me think through this step by step:
"""
# LLM: "Total distance = 120 + 60 = 180km.
# Total time = 2 + 3 = 5 hours.
# Average speed = 180 / 5 = 36 km/h"
# Correct!
# Zero-shot CoT (simplest form — often works just as well)
zero_shot_cot = """
Problem: [complex reasoning problem]
Solution: Let's think step by step.
"""
Why it works: LLMs generate tokens autoregressively. When intermediate reasoning steps are generated as tokens, subsequent tokens can "reference" that reasoning — improving final answer accuracy significantly.
Note: Reasoning models like o1 and o3 perform CoT internally. For these models, focus on clearly stating the problem rather than prompting them to reason step by step.
Technique 4: Structured Output
If your production system parses LLM output, always enforce structured output.
import json
from openai import OpenAI
from pydantic import BaseModel
client = OpenAI()
# Method 1: JSON Mode (simple cases)
response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[{
        "role": "user",
        "content": """
Extract product info from this text and return JSON:
"iPhone 15 Pro Max, 256GB, Space Black, retail price $1,199"
Return: {"name": ..., "storage": ..., "color": ..., "price_usd": ...}
"""
    }]
)
data = json.loads(response.choices[0].message.content)
# Method 2: Structured Outputs (type-safe, more reliable)
class ProductInfo(BaseModel):
    name: str
    storage: str
    color: str
    price_usd: float
response = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[{
        "role": "user",
        "content": "iPhone 15 Pro Max, 256GB, Space Black, retail price $1,199"
    }],
    response_format=ProductInfo
)
product = response.choices[0].message.parsed
print(product.price_usd)  # 1199.0, guaranteed float type
OpenAI's Structured Outputs guarantee 100% JSON Schema compliance. Parsing errors disappear.
Technique 5: XML Tags for Structure (Claude-specific but broadly useful)
Claude responds especially well to XML-structured prompts. This is likely because Anthropic used extensive XML structure during training.
import anthropic
client = anthropic.Anthropic()
prompt = """
<task>
Summarize the following document.
</task>
<document>
[long document content here]
</document>
<constraints>
- Maximum 3 bullet points
- Each bullet: 1-2 sentences
- Focus on actionable items only
- Replace jargon with plain language
</constraints>
<output_format>
Return as JSON:
{
"summary_bullets": ["...", "...", "..."],
"key_action": "Single most important action item"
}
</output_format>
"""
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}]
)
XML tags help the LLM understand the role of each section and follow instructions more precisely. This pattern works well on other models too, not just Claude.
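The same structure can be generated programmatically when sections vary per request. A minimal sketch — the `xml_prompt` helper is illustrative and handles flat sections only, with no nesting or character escaping:

```python
def xml_prompt(sections: dict[str, str]) -> str:
    """Wrap each section body in a named XML tag, one tag pair per section."""
    return "\n".join(
        f"<{tag}>\n{body}\n</{tag}>" for tag, body in sections.items()
    )

print(xml_prompt({
    "task": "Summarize the following document.",
    "constraints": "- Maximum 3 bullet points",
}))
```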
Technique 6: Role + Context + Constraint Pattern
The most broadly applicable and reliable production prompting pattern.
def create_production_prompt(
    role: str,
    context: str,
    task: str,
    constraints: list[str],
    output_format: str
) -> str:
    constraints_text = "\n".join(f"- {c}" for c in constraints)
    return f"""# Role
{role}
# Context
{context}
# Task
{task}
# Constraints
{constraints_text}
# Output Format
{output_format}"""
# Example usage
prompt = create_production_prompt(
    role="You are a senior copywriter with 10 years of experience targeting Gen Z audiences.",
    context="Client: startup coffee brand 'BREW'. New cold brew product launch.",
    task="Write 3 versions of an Instagram caption.",
    constraints=[
        "Each caption under 50 characters",
        "Include 1-2 emojis",
        "No mention of price or promotions",
        "Do not include hashtags"
    ],
    output_format="Return as JSON array: ['caption1', 'caption2', 'caption3']"
)
Benefits of this pattern:
- Prompts can be dynamically generated in code
- Each component can be modified independently
- Easy to A/B test variations
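For the A/B testing point, one simple approach is to hash a stable user or request id to a prompt variant, so each user consistently sees the same version. A sketch under that assumption (the variant names and split logic are illustrative):

```python
import hashlib

def assign_variant(user_id: str, variants: list[str]) -> str:
    """Deterministically map a user id to one prompt variant."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

variants = ["prompt_v1", "prompt_v2"]
print(assign_variant("user-42", variants))
```

Hash-based assignment is stable across processes and deploys, unlike random assignment, so a user's results stay comparable over the life of the experiment.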
Prompt Version Control: Treat Prompts Like Code
In production, prompts are code. They need version control.
# Bad: hardcoded in business logic
def summarize(text):
    prompt = f"Summarize: {text}"  # untracked, untestable
    return call_llm(prompt)
# Good: manage prompts as separate versioned artifacts
class PromptManager:
    def __init__(self):
        self.prompts = {}
        self.active_versions = {}
    def register(self, name: str, version: str, template: str):
        self.prompts[f"{name}_v{version}"] = template
    def get(self, name: str, version: str | None = None) -> str:
        v = version or self.active_versions.get(name, "1")
        return self.prompts[f"{name}_v{v}"]
    def set_active(self, name: str, version: str):
        self.active_versions[name] = version
pm = PromptManager()
pm.register("summarize", "1", "Summarize: {text}")
pm.register("summarize", "2", "You are a business analyst. Summarize in 3 lines: {text}")
pm.set_active("summarize", "2")
For more sophisticated management, consider LangSmith or PromptLayer. They provide prompt versioning, A/B testing, and performance monitoring as an integrated platform.
Model-Specific Optimization Tips
The same prompt doesn't work equally well across all models:
GPT-4o:
- Prefers clear, direct instructions
- Follows Markdown formatting well
- Responds well to "You are a [role]" framing
Claude 3.5/3.7:
- Especially responsive to XML tag structure
- Handles long system prompts well
- Compliant with strong constraints ("you must...", "never...")
Gemini:
- Stronger at multimodal contexts
- Built-in search grounding support
Open-source (Llama, Mistral, etc.):
- Always use the model's specific chat template
- Instruction-following may be weaker than in commercial models
- Requires more explicit and specific instructions
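On chat templates: with Hugging Face `transformers`, the safe route is `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` rather than concatenating strings by hand. To illustrate why the template matters, here is roughly what a Llama-3-style template expands to — treat the exact special-token strings as illustrative and always prefer the tokenizer's own template:

```python
def llama3_style_format(messages: list[dict[str, str]]) -> str:
    """Approximate a Llama-3-style chat template by hand (illustration only)."""
    out = "<|begin_of_text|>"
    for m in messages:
        out += f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
    # Trailing assistant header cues the model to generate its reply
    out += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return out

print(llama3_style_format([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize this ticket."},
]))
```

Sending a model text that omits or mangles these tokens is a common cause of degraded open-source model output.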
Prompt Debugging Methodology
When a prompt isn't working as expected:
1. Set temperature to 0
-> Get reproducible results for diagnosis
2. Make instructions more specific
-> "Write it well" -> "Professional tone, active voice, under 50 words"
3. Add output examples (few-shot)
-> Showing is more effective than describing
4. Specify what NOT to do
-> Negative constraints reduce failure cases
5. Add chain-of-thought
-> "Let's think step by step" meaningfully improves complex reasoning
6. Optimize prompt length
-> Longer isn't always better. Remove what isn't earning its place
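Step 1 is worth automating: at temperature 0 most APIs are near-deterministic, so repeated runs should largely agree. A sketch of a reproducibility check with the model call injected as a function — the `call_model` parameter is a stand-in for your actual client call:

```python
from collections import Counter

def check_stability(call_model, prompt: str, runs: int = 5) -> float:
    """Return the fraction of runs matching the most common output."""
    outputs = [call_model(prompt) for _ in range(runs)]
    _, top_count = Counter(outputs).most_common(1)[0]
    return top_count / runs

# Stand-in for a real temperature-0 API call
stub = lambda prompt: "negative"
print(check_stability(stub, "Classify: 'Delivery took way too long'"))  # 1.0
```

A score well below 1.0 at temperature 0 suggests the prompt is ambiguous enough that the model is choosing between competing interpretations.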
Conclusion: Prompt Engineering Is an Engineering Discipline
Prompt engineering isn't magic. It's an engineering practice built on testing, measuring, and iterating.
Practical advice:
- Start simple, improve incrementally: begin with a basic prompt and refine when you see failures
- Manage prompts as code: version control, testing, deployment pipelines
- Evaluate quantitatively: measure with concrete metrics, not "feels better"
- Monitor model updates: prompt behavior can change when a model is upgraded
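Quantitative evaluation can start very small: a labeled test set and an accuracy number per prompt version. A minimal sketch — the dataset and the `classify` function are stand-ins for an LLM call using a given prompt version:

```python
def accuracy(classify, test_set: list[tuple[str, str]]) -> float:
    """Fraction of test cases where classify(input) matches the gold label."""
    correct = sum(1 for text, gold in test_set if classify(text) == gold)
    return correct / len(test_set)

# Stand-in classifier; in practice this would call the LLM
classify = lambda text: "negative" if "refund" in text else "positive"
tests = [("want a refund", "negative"), ("love it", "positive")]
print(accuracy(classify, tests))  # 1.0
```

Run the same test set against each prompt version before switching the active one, and "feels better" becomes a number you can compare.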
A good prompt is like good code: maintainable, testable, and designed to be changed.