AISecurityWeb Development

AI Security 101 — Prompt Injection, Jailbreaks & How to Defend Your LLM Apps 🔒

April 9, 20267 min read

Building AI apps without understanding prompt injection is like building a website without knowing XSS. Here's the security playbook every developer needs.

If You're Building AI Apps, Read This First ⚠️

Here's a scary truth: most AI-powered apps in production right now are vulnerable to prompt injection. It's the XSS of the AI era — easy to exploit, devastating in impact, and most developers don't even know it exists.

I've been building LLM-powered features for production apps, and I've seen (and accidentally created 😬) some genuinely terrifying vulnerabilities. Let me save you from making the same mistakes.

🎯 What Is Prompt Injection?

Prompt injection is when a user manipulates an AI system by inserting instructions that override or modify the original system prompt.

Simple Example

Your app has a customer support chatbot with this system prompt:

A user types:

And if your LLM isn't properly secured... it complies. 💀

🗂️ The Prompt Injection Taxonomy

| Attack Type | What It Does | Severity |

|------------|-------------|----------|

| Direct Injection | User input overrides system prompt | 🔴 Critical |

| Indirect Injection | Malicious instructions hidden in data the AI processes | 🔴 Critical |

| Jailbreak | Tricks AI into bypassing safety guidelines | 🟡 Medium |

| Prompt Leaking | Extracts the system prompt | 🟡 Medium |

| Data Exfiltration | Uses AI to send data to external endpoints | 🔴 Critical |

| Privilege Escalation | Makes AI perform unauthorized actions | 🔴 Critical |

The Really Scary One: Indirect Injection

This is where it gets wild. Imagine your AI assistant can read emails:

The AI reads the email, sees what looks like a system instruction, and follows it. The user never typed anything malicious — the attack was in the data.

🛡️ Defense Strategies (What Actually Works)

Layer 1: Input Sanitization

⚠️ **Warning:** Pattern matching alone is NOT sufficient. Attackers will find new patterns. This is just the first layer.

Layer 2: Prompt Architecture

Structure your prompts to be injection-resistant:

The key: separate system instructions from user input using clear delimiters that the model recognizes.

Layer 3: Output Validation

Don't trust the AI's output blindly:

Layer 4: Least Privilege

| Principle | Implementation |

|-----------|---------------|

| Don't give AI database access | Use read-only API endpoints |

| Don't give AI admin permissions | Separate user/admin contexts |

| Don't let AI send emails | Queue actions for human approval |

| Don't let AI execute code | Sandbox if absolutely necessary |

| Don't let AI access secrets | Environment isolation |

🔥 **Golden rule:** If your AI can do it, an attacker can make it do it. Give your AI the minimum permissions possible.

🔬 Real-World Attack Patterns I've Seen

1. The "DAN" Pattern (Do Anything Now)

Defense: Strong system prompt that acknowledges this pattern and refuses.

2. The Base64 Trick

Defense: Don't let your AI decode and execute arbitrary encoded content.

3. The Markdown Image Exfiltration

If the frontend renders markdown images, the browser makes a request to the attacker's server with the data.

Defense: Strip or sanitize markdown image URLs in AI output.

4. The Multi-Turn Erosion

Each turn slightly pushes the boundary. By turn 4, the context has shifted enough to bypass safeguards.

Defense: Re-inject system prompt at regular intervals in long conversations.

📋 The Security Checklist for AI Apps

Before shipping any AI-powered feature, run through this:

[ ] Input is sanitized before reaching the LLM
[ ] System prompt is separated from user input with clear delimiters
[ ] AI has minimum necessary permissions (no DB write, no email send)
[ ] Output is validated before displaying to users
[ ] Rate limiting is in place (prevents brute-force attacks)
[ ] Conversation length is limited (prevents multi-turn erosion)
[ ] System prompt re-injection happens every N turns
[ ] Markdown/HTML in AI output is sanitized
[ ] AI cannot access environment variables or secrets
[ ] Logging captures suspicious prompts for review
[ ] Human-in-the-loop for high-risk actions (payments, deletions)

🧪 How to Test Your Own App

Manual Testing Prompts

Try these against your AI feature:

If ANY of these work, you have a problem.

Automated Testing

🔮 The Future of AI Security

| Trend | Status | Impact |

|-------|--------|--------|

| Model-level defenses | Improving fast | Claude and GPT are getting better at refusing injection |

| Constitutional AI | Active research | Models that self-police |

| Formal verification | Early stage | Mathematically prove prompt safety |

| Industry standards | Emerging | OWASP AI Top 10 is a thing now |

| Regulation | Coming | EU AI Act will enforce security standards |

🎬 Key Takeaways

**Prompt injection is the #1 AI security risk** — treat it like XSS
**Defense in depth** — no single layer is enough
**Least privilege** — give your AI the minimum permissions possible
**Test aggressively** — automated injection testing before every deploy
**Indirect injection** is scarier than direct — data your AI processes can contain attacks
**Don't trust AI output** — validate before displaying or acting on it

The AI security landscape is evolving fast. What works today might not work tomorrow. Stay paranoid, stay updated, and always assume your AI can be manipulated. Because it can. 🔒