Understanding the Generative AI Attack Surface


The rise of generative AI and large foundation models has unlocked capabilities that were unimaginable just a few years ago—while simultaneously opening a new frontier of security risks. Generative AI, especially large language models (LLMs), represents only one branch of the broader AI ecosystem, but it’s the branch that has reshaped how modern enterprises operate.

Today, organizations rely on LLM-powered agents to automate workflows, provide 24/7 customer interaction, support internal teams, and accelerate decision-making. This shift marks one of the most significant technological leaps of our time. At this moment, generative AI isn’t just transforming industries—it’s making everyday work faster, simpler, and more accessible for millions of people.

What is Gen AI?

Generative AI (GenAI) refers to a class of AI systems—typically large foundation models—that learn patterns across massive datasets and use that learned representation to generate new content such as text, code, images, audio, or structured data.
Instead of only making predictions or classifications, GenAI models produce novel outputs that resemble the data they were trained on.

How Gen AI works?

The Gen AI works by predicting what token comes next using preceding tokens. Preceding tokens are nothing but words which helps providing enough context that can help Gen AI generate the next possible outcome or even have a conversation with you.

At its core, Generative AI relies on probabilistic next-token prediction. A model ingests billions—or even trillions—of tokens across text, code, and other modalities. By learning the statistical relationships between these tokens, it becomes capable of predicting what should come next in a sequence.

A “token” can be a word, part of a word, or even a symbol. When an LLM generates text, it does not retrieve information from a stored database; rather, it uses the patterns it has learned to produce the most likely next token given the context. With enough parameters, training data, and compute, these models can produce coherent paragraphs, execute multi-step reasoning, write code, summarize complex documents, or hold long, context-aware conversations.

This process can be summarized in three stages:

  1. Training – The model learns patterns, relationships, and structure across massive datasets.( Even the unsafe datasets)
  2. Inference – When given a prompt, the model uses that learned representation to predict the next token repeatedly, generating responses.
  3. Reinforcement & Alignment (optional) – Additional tuning ensures the model behaves safely, follows instructions, and aligns with enterprise policies.

Stages of the AI Attack Surface

1. Data Collection

The data collection phase is where raw training data is gathered from various sources—such as internal databases, web scraping, partnered datasets, public corpora, sensors, or user-generated content. This phase is foundational because the quality, origin, and integrity of the data directly shape the model’s behavior, capabilities, and risks.
Attacks:

  • Data Poisoning: Injecting malicious or misleading samples.
  • Backdoor Injection: Embedding hidden triggers in the dataset.
  • Data Contamination (Unintentional): Sensitive or copyrighted data may enter the pipeline, creating legal or privacy risks even without malicious actors.

2. Data Preprocessing

Data preprocessing is the stage where raw collected data is cleaned, normalized, labeled, filtered, and structured so it can be used for model training. This phase transforms messy, inconsistent input into a format that the model can reliably learn from.
Attacks:

  • Label-Flipping: Corrupting labels to degrade model behavior.
  • Pipeline Tampering: Manipulating preprocessing scripts or transforms.

3. Base Model Training

Large-scale training of foundational models.
Attacks:

  • Training-Time Backdoors: Triggered behaviors learned from poisoned samples.
  • Hyperparameter/Config Manipulation: Reducing robustness or skewing outcomes.
  • Infrastructure Compromise: Tampering with training servers or checkpoints.

4. Fine-Tuning

Adapting a base model to domain-specific data.
Attacks:

  • Fine-Tuning Poisoning: Malicious examples inserted into fine-tune data.
  • Alignment Drift / Intentional Misalignment — occurs even without poisoning.
  • Rapid Backdoor Injection: Small datasets make backdoors easier to implant.

5. Evaluation & Testing

Assessing safety, accuracy, and robustness.
Attacks:

  • Evaluation Set Poisoning: Hiding failures or weaknesses.
  • Metric/Script Manipulation: Tampering with evaluation pipelines.

6. Model Storage & Supply Chain

Weights, artifacts, and deployment files.
Attacks:

  • Checkpoint Tampering: Injecting trojans or modifying weights.
  • Supply Chain Attacks: Compromised libraries, dependencies, or build systems.

7. Inference & Deployment (APIs/UI/Apps)

Systems that serve model outputs to users. This is where the users will make use of the Agent.
Attacks:

  • Prompt Injection / Indirect Injection: Overriding system instructions.
  • Model Extraction: Reconstructing model behavior via queries.
  • Membership Inference: Determining if specific data was trained on.
  • Model Inversion: Recovering sensitive features from outputs.

8. RAG & Vector Stores

Embedding-based retrieval integrated with LLMs.
Attacks:

  • Embedding/Document Poisoning: Manipulating retrieval rankings.
  • RAG Data Exfiltration: Coaxing the model to leak stored content.
  • Cross-Document Injection: Malicious stored text triggering harmful actions.

9. Agentic Systems / Tool Use

LLMs connected to tools, APIs, and agent chains.
Attacks:

  • Tool-Use Escalation: Prompts that trigger unauthorized tool actions.
  • Task Injection: Malicious text creates false or harmful tasks.

10. Application Layer

UI, workflows, and app-level integrations.
Attacks:

  • Client-Side Prompt Injection: User input becomes part of system prompts.
  • Context Bleeding: Unintended reuse or mixing of user data.

11. Credentials & Orchestration

API keys, roles, environment secrets.
Attacks:

  • API Key Theft: Exposure through logs, repos, or config files.
  • Privilege Escalation: Misconfigured roles enabling excessive access.

12. Logging & Monitoring

Recorded prompts, outputs, and traces.
Attacks:

  • Prompt/Output Leakage: Logs unintentionally store sensitive data.
  • Log Poisoning: Inputs crafted to corrupt or hide activity.

13. Model Updates / Versioning

Patches, improvements, and re-training.
Attacks:

  • Malicious Model Update: Compromised weights or reverted versions.
  • Shadow Fine-Tuning: Unauthorized modifications introducing backdoors.

14. Infrastructure / Cloud Environment

Compute, storage, containers, and orchestration.
Attacks:

Side-Channel Attacks: Leakage via timing, memory, or resource patterns.

This list isn’t a complete catalog of every possible attack, but it does cover the full range of attack surfaces involved in building and operating a GenAI system. These points don’t imply that GenAI is inherently unreliable, they simply highlight the areas you should carefully evaluate.