Prompt Engineering 2025: Getting Maximum Performance from Modern LLMs


Why Prompt Engineering Still Matters

"With LLMs getting so capable, won't prompt engineering become obsolete soon?"

I get this question often. The answer is no. Here's why:

model capability x prompt quality = output quality

The best model with a bad prompt gives bad results. A well-crafted prompt can produce 20-40% better performance from the same model. In some task-specific cases, a small model with an excellent prompt beats a large model with a sloppy one.

As better models keep arriving in 2025, good prompting remains a portable skill that transfers across model generations.


Technique 1: System Prompt Design Principles

The system prompt is the LLM's "configuration." Most developers treat it as an afterthought.

# Bad system prompt (how most people start)
bad_prompt = "You are a helpful assistant."
# Problem: so vague the model doesn't know what to do

# Good system prompt using the COSTAR framework
good_prompt = """
## Role
You are a senior customer service specialist at TechCorp.

## Context
TechCorp sells B2B SaaS software for HR management.
Customers are typically HR managers at companies with 100-5,000 employees.

## Objective
Help customers resolve issues and answer product questions.
Escalate billing issues to finance@techcorp.com.

## Style
- Professional but warm English
- Keep responses under 150 words
- Always end with a follow-up question

## Tone
Professional, empathetic, solution-oriented

## Audience
HR managers, often non-technical

## Response Format
1. Acknowledge the issue
2. Provide solution (step-by-step if needed)
3. Close with: "Is there anything else I can help you with?"
"""

The COSTAR framework: Context, Objective, Style, Tone, Audience, Response format. Specifying all six produces predictable, consistent results. (The Role section above is a common extension; strictly speaking, role-setting falls under Context.)
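Since each COSTAR component is independent, the system prompt can be assembled programmatically rather than maintained as one string. A minimal sketch (the function name and defaults are illustrative, not a standard API):

```python
def build_costar_prompt(
    context: str,
    objective: str,
    style: str,
    tone: str,
    audience: str,
    response_format: str,
) -> str:
    """Assemble a COSTAR-structured system prompt from its six components."""
    sections = [
        ("Context", context),
        ("Objective", objective),
        ("Style", style),
        ("Tone", tone),
        ("Audience", audience),
        ("Response Format", response_format),
    ]
    return "\n\n".join(f"## {name}\n{body}" for name, body in sections)

system_prompt = build_costar_prompt(
    context="TechCorp sells B2B SaaS software for HR management.",
    objective="Help customers resolve issues and answer product questions.",
    style="Professional but warm English; responses under 150 words.",
    tone="Professional, empathetic, solution-oriented",
    audience="HR managers, often non-technical",
    response_format="Acknowledge, solve step-by-step, close with a follow-up question.",
)
```

This keeps each component independently editable and testable, which matters once prompts live in version control.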


Technique 2: Few-Shot Prompting

Showing examples is faster than describing behavior in words.

# Zero-shot (no examples)
zero_shot = """
Classify the sentiment of this review: 'Delivery took way too long'
"""
# Works, but output format and consistency may vary

# Few-shot (with examples)
few_shot = """
Classify review sentiment as: positive / negative / neutral

Review: "Absolutely love this product!" -> positive
Review: "Shipping was slow but the product is great" -> neutral
Review: "Complete junk, want a refund" -> negative
Review: "Nothing special for the price" -> neutral

Review: "Delivery took way too long" -> """
# Output: negative (much more reliably structured)

Few-shot principles:

  • Example quality: include representative cases
  • Diversity: cover edge cases too
  • Count: 3-8 examples is typically optimal (more isn't always better)
  • Order: the last example has the most influence on output (recency bias)
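These principles can be encoded in a small helper that renders labeled examples into a few-shot prompt, so example sets can be swapped or reordered without editing prompt strings by hand. A sketch (the helper name and format are illustrative):

```python
def build_few_shot_prompt(
    instruction: str,
    examples: list[tuple[str, str]],
    query: str,
) -> str:
    """Render labeled (text, label) examples into a few-shot classification prompt."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f'Review: "{text}" -> {label}')
    lines.append("")
    lines.append(f'Review: "{query}" -> ')
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    instruction="Classify review sentiment as: positive / negative / neutral",
    examples=[
        ("Absolutely love this product!", "positive"),
        ("Shipping was slow but the product is great", "neutral"),
        ("Complete junk, want a refund", "negative"),
    ],
    query="Delivery took way too long",
)
```

Because the last example carries the most weight, placing the example most similar to your hard cases at the end of the list is a cheap lever worth testing.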

Technique 3: Chain-of-Thought (CoT) Prompting

Asking the LLM to show its reasoning dramatically improves complex inference.

# Without CoT (unreliable)
no_cot = """
A train travels 120km in 2 hours, then 60km in 3 hours.
What is the average speed?
"""
# LLM might naively average the two segment speeds:
# (120/2 + 60/3) / 2 = 40 km/h, which uses the wrong formula.

# With CoT (reliable)
with_cot = """
A train travels 120km in 2 hours, then 60km in 3 hours.
What is the average speed?

Let me think through this step by step:
"""
# LLM: "Total distance = 120 + 60 = 180km.
#        Total time = 2 + 3 = 5 hours.
#        Average speed = 180 / 5 = 36 km/h"
# Correct!

# Zero-shot CoT (simplest form — often works just as well)
zero_shot_cot = """
Problem: [complex reasoning problem]
Solution: Let's think step by step.
"""

Why it works: LLMs generate tokens autoregressively. When intermediate reasoning steps are generated as tokens, subsequent tokens can "reference" that reasoning — improving final answer accuracy significantly.

Note: Reasoning models like o1 and o3 perform CoT internally. For these models, focus on clearly stating the problem rather than prompting them to reason step by step.


Technique 4: Structured Output

If your production system parses LLM output, always enforce structured output.

import json
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

# Method 1: JSON Mode (simple cases)
response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[{
        "role": "user",
        "content": """
        Extract product info from this text and return JSON:
        "iPhone 15 Pro Max, 256GB, Space Black, retail price $1,199"

        Return: {"name": ..., "storage": ..., "color": ..., "price_usd": ...}
        """
    }]
)
data = json.loads(response.choices[0].message.content)

# Method 2: Structured Outputs (type-safe, more reliable)
class ProductInfo(BaseModel):
    name: str
    storage: str
    color: str
    price_usd: float

response = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[{
        "role": "user",
        "content": "iPhone 15 Pro Max, 256GB, Space Black, retail price $1,199"
    }],
    response_format=ProductInfo
)
product = response.choices[0].message.parsed
print(product.price_usd)  # 1199.0, guaranteed float type

OpenAI's Structured Outputs guarantee 100% JSON Schema compliance. Parsing errors disappear.


Technique 5: XML Tags for Structure (Claude-specific but broadly useful)

Claude responds especially well to XML-structured prompts. This is likely because Anthropic used extensive XML structure during training.

import anthropic

client = anthropic.Anthropic()

prompt = """
<task>
Summarize the following document.
</task>

<document>
[long document content here]
</document>

<constraints>
- Maximum 3 bullet points
- Each bullet: 1-2 sentences
- Focus on actionable items only
- Replace jargon with plain language
</constraints>

<output_format>
Return as JSON:
{
  "summary_bullets": ["...", "...", "..."],
  "key_action": "Single most important action item"
}
</output_format>
"""

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}]
)

XML tags help the LLM understand the role of each section and follow instructions more precisely. This pattern works well on other models too, not just Claude.


Technique 6: Role + Context + Constraint Pattern

The most broadly applicable and reliable production prompting pattern.

def create_production_prompt(
    role: str,
    context: str,
    task: str,
    constraints: list[str],
    output_format: str
) -> str:
    constraints_text = "\n".join(f"- {c}" for c in constraints)
    return f"""# Role
{role}

# Context
{context}

# Task
{task}

# Constraints
{constraints_text}

# Output Format
{output_format}"""

# Example usage
prompt = create_production_prompt(
    role="You are a senior copywriter with 10 years of experience targeting Gen Z audiences.",
    context="Client: startup coffee brand 'BREW'. New cold brew product launch.",
    task="Write 3 versions of an Instagram caption.",
    constraints=[
        "Each caption under 50 characters",
        "Include 1-2 emojis",
        "No mention of price or promotions",
        "Do not include hashtags"
    ],
    output_format="Return as JSON array: ['caption1', 'caption2', 'caption3']"
)

Benefits of this pattern:

  • Prompts can be dynamically generated in code
  • Each component can be modified independently
  • Easy to A/B test variations
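A/B testing prompt variants works best with stable assignment, so the same user always sees the same variant across sessions. One common approach hashes a user id into a bucket; a sketch (the function and experiment name are illustrative):

```python
import hashlib

def assign_variant(user_id: str, variants: list[str], experiment: str = "caption_prompt") -> str:
    """Deterministically map a user to one prompt variant by hashing id + experiment name."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

variants = ["prompt_v1", "prompt_v2"]
chosen = assign_variant("user-42", variants)
```

Including the experiment name in the hash means reassignments are independent across experiments, so a user isn't stuck in the same bucket everywhere.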

Prompt Version Control: Treat Prompts Like Code

In production, prompts are code. They need version control.

# Bad: hardcoded in business logic
def summarize(text):
    prompt = f"Summarize: {text}"  # untracked, untestable
    return call_llm(prompt)

# Good: manage prompts as separate versioned artifacts
class PromptManager:
    def __init__(self):
        self.prompts = {}
        self.active_versions = {}

    def register(self, name: str, version: str, template: str):
        self.prompts[f"{name}_v{version}"] = template

    def get(self, name: str, version: str | None = None) -> str:
        v = version or self.active_versions.get(name, "1")
        return self.prompts[f"{name}_v{v}"]

    def set_active(self, name: str, version: str):
        self.active_versions[name] = version

pm = PromptManager()
pm.register("summarize", "1", "Summarize: {text}")
pm.register("summarize", "2", "You are a business analyst. Summarize in 3 lines: {text}")
pm.set_active("summarize", "2")

For more sophisticated management, consider LangSmith or PromptLayer. They provide prompt versioning, A/B testing, and performance monitoring as an integrated platform.


Model-Specific Optimization Tips

The same prompt doesn't work equally well across all models:

GPT-4o:
- Prefers clear, direct instructions
- Follows Markdown formatting well
- Responds well to "You are a [role]" framing

Claude 3.5/3.7:
- Especially responsive to XML tag structure
- Handles long system prompts well
- Compliant with strong constraints ("you must...", "never...")

Gemini:
- Stronger at multimodal contexts
- Built-in search grounding support

Open-source (Llama, Mistral, etc.):
- Always use the model's specific chat template
- Instruction-following may be weaker than commercial models
- Requires more explicit and specific instructions
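These differences can be absorbed behind a small formatting layer so business logic stays model-agnostic. A sketch (the family names and formatting choices are illustrative defaults, not vendor requirements):

```python
def format_for_model(model_family: str, task: str, document: str) -> str:
    """Wrap the same task in the structure each model family tends to follow best."""
    if model_family == "claude":
        # Claude: XML tag structure
        return f"<task>\n{task}\n</task>\n\n<document>\n{document}\n</document>"
    if model_family == "gpt":
        # GPT-4o: Markdown headings
        return f"# Task\n{task}\n\n# Document\n{document}"
    # Open-source models: maximally explicit, direct instructions
    return f"Instruction: {task}\n\nInput:\n{document}\n\nRespond with the output only."

claude_prompt = format_for_model("claude", "Summarize the document.", "...")
```

Centralizing this also gives you one place to update when a model upgrade changes which structure works best.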

Prompt Debugging Methodology

When a prompt isn't working as expected:

1. Set temperature to 0
   -> Get reproducible results for diagnosis

2. Make instructions more specific
   -> "Write it well" -> "Professional tone, active voice, under 50 words"

3. Add output examples (few-shot)
   -> Showing is more effective than describing

4. Specify what NOT to do
   -> Negative constraints reduce failure cases

5. Add chain-of-thought
   -> "Let's think step by step" meaningfully improves complex reasoning

6. Optimize prompt length
   -> Longer isn't always better. Remove what isn't earning its place
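Steps 1 and 6 pay off most when combined with a small regression harness: run each prompt version over a fixed labeled set at temperature 0 and compare accuracy before switching versions. A sketch with a stubbed model call (call_llm here is a stand-in for a real client, not an actual API):

```python
def evaluate_prompt(prompt_template: str, test_cases: list[tuple[str, str]], call_llm) -> float:
    """Score a prompt template against labeled cases; returns accuracy in [0, 1]."""
    correct = 0
    for text, expected in test_cases:
        output = call_llm(prompt_template.format(text=text))  # use temperature=0 in practice
        if output.strip().lower() == expected:
            correct += 1
    return correct / len(test_cases)

# Stubbed model call for illustration; swap in a real client call at temperature=0.
def fake_llm(prompt: str) -> str:
    return "negative" if "too long" in prompt else "positive"

cases = [
    ("Delivery took way too long", "negative"),
    ("Great product, works perfectly", "positive"),
]
accuracy = evaluate_prompt("Classify sentiment: {text}", cases, fake_llm)
print(accuracy)  # 1.0
```

With this in place, "make instructions more specific" becomes a measured change rather than a hunch: edit the template, rerun the harness, keep the version that scores higher.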

Conclusion: Prompt Engineering Is an Engineering Discipline

Prompt engineering isn't magic. It's an engineering practice built on testing, measuring, and iterating.

Practical advice:

  1. Start simple, improve incrementally: begin with a basic prompt and refine when you see failures
  2. Manage prompts as code: version control, testing, deployment pipelines
  3. Evaluate quantitatively: measure with concrete metrics, not "feels better"
  4. Monitor model updates: prompt behavior can change when a model is upgraded

A good prompt is like good code: maintainable, testable, and designed to be changed.