Traditional tooling for machine-learning workflows does not address the unique quality-assurance needs of modern LLM applications. LLM systems require far more than numeric metrics — they need semantic correctness checks, hallucination detection, RAG grounding evaluation, security and injection testing, bias auditing, multi-provider comparison, and token-level cost tracking. Existing tools rarely capture these dimensions, leaving teams blind to subtle output regressions, safety vulnerabilities, and unexpected billing spikes. Without these capabilities, it becomes difficult to ensure that prompts behave reliably, consistently, and securely across different models and datasets.
Prompt Evaluator fills this gap by providing a dedicated, test-driven framework for validating and benchmarking prompts at scale. It offers assertion-based testing, multi-provider evaluations, OWASP LLM Top-10 security checks, drift detection, and detailed cost/latency tracking — all wrapped in an intuitive desktop GUI that non-technical stakeholders can use without writing code. Built on the flexible Promptfoo engine, it supports local execution, CI/CD integration, extensible evaluators, and privacy-first workflows. Compared to research-focused or cloud-locked alternatives, Prompt Evaluator delivers a complete, enterprise-ready QA layer that ensures prompts are reliable, safe, and cost-efficient before they reach production.
🧩 Step 1 – Install Node.js (Required)
Node.js is the foundation that powers Promptfoo, which Prompt Evaluator depends on.
- Go to https://nodejs.org/en
- Download the LTS version for your OS (Windows / macOS / Linux)
- Run the installer and finish the setup
- Confirm it works by opening your terminal and running:
node -v
npm -v
You should see version numbers for both commands, confirming Node.js is installed correctly.
📦 Step 2 – Install Promptfoo (CLI)
Prompt Evaluator expects Promptfoo to already be installed on your system.
- Open Terminal (macOS) or Command Prompt / PowerShell (Windows)
- Install Promptfoo globally using npm:
npm install -g promptfoo
- Verify the installation:
promptfoo --version
Once this command shows the version number, you're ready to install the desktop app.
💻 Step 3 – Install Prompt Evaluator Desktop App
Download the App
- Go to the official releases page: https://github.com/syamsasi99/prompt-evaluator/releases/tag/v0.0.1
- Download the appropriate installer for your OS:
- macOS: Download the .dmg file
- Windows: Download the .exe file
- Ubuntu/Linux: Download the .AppImage or .deb file
Installation & Opening Self-Signed Apps
Since this app is self-signed, you'll need to follow these steps to open it on your operating system:
⚠️ Note: Prompt Evaluator is self-signed, meaning its builds haven't been notarized by Apple or signed with a Microsoft-trusted certificate. This is normal for open-source desktop applications. The steps below will help you safely open the app.
For macOS Users:
- Open the downloaded .dmg file and drag Prompt Evaluator to your Applications folder
- When you try to open it for the first time, you'll see a warning: "Prompt Evaluator cannot be opened because it is from an unidentified developer"
- To bypass this:
- Option 1 (Recommended): Right-click (or Control-click) on the app → Select Open → Click Open in the dialog
- Option 2: Go to System Settings → Privacy & Security → Scroll down to find "Prompt Evaluator was blocked" → Click Open Anyway
- On first launch you'll see a "Setting up Prompt Evaluator" screen
- It will check for Promptfoo and other dependencies
- When you see "All dependencies verified successfully!" → click Continue
For Windows Users:
- Run the downloaded .exe installer
- If you see the "Windows protected your PC" warning:
- Click More info
- Then click Run anyway
- Follow the installation wizard to complete the setup
- Launch the app from Start Menu or Desktop shortcut
- On first launch you'll see a "Setting up Prompt Evaluator" screen
- It will check for Promptfoo and other dependencies
- When you see "All dependencies verified successfully!" → click Continue
For Ubuntu/Linux Users:
Using AppImage (Recommended):
- Download the .AppImage file
- Make it executable:
chmod +x prompt-evaluator-*.AppImage
- Run the AppImage:
./prompt-evaluator-*.AppImage
Using .deb package:
- Download the .deb file
- Install it:
sudo dpkg -i prompt-evaluator-*.deb
sudo apt-get install -f # Fix any dependency issues
- Launch from the Applications menu or run prompt-evaluator in a terminal
- On first launch you'll see a "Setting up Prompt Evaluator" screen
- It will check for Promptfoo and other dependencies
- When you see "All dependencies verified successfully!" → click Continue
🗂 Step 4 – Create Your First Project
- On the top-left, click the project dropdown
- Use the default "My First Project" or create a new one
This project will hold:
- Prompts
- Models (Providers)
- Dataset
- Assertions
- Results & History
💬 Step 5 – Add a Prompt
- In the left sidebar, click Prompts
- Create a sample system/user prompt, for example:
You are a helpful assistant. Please answer the following question in JSON format:
{"answer": "<your answer here>"}
Use a variable like {{question}} if you plan to use a dataset column later.
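For example, a prompt that pulls the question from a dataset column could look like the sketch below (the wording is illustrative; only the {{question}} placeholder matters):

You are a helpful assistant. Answer the question below in JSON format:
{"answer": "<your answer here>"}
Question: {{question}}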
🔐 Step 6 – Configure Model Providers (OpenAI, Gemini, etc.)
- Click Providers in the left sidebar
- Click + Add Provider
For OpenAI:
- Set Provider ID to something like openai:gpt-4o
- Paste your OpenAI API key in the API Key (Optional) field
- Click Test → wait for "API Key Validated Successfully"
- Choose OK if you want it saved into your .env file
For Google Gemini:
- Add another provider with an ID like google:gemini-2.5-flash-lite (or any supported model)
- Paste your Gemini API key
- Click Test → validate → save to .env if you like
💡 Pro Tip: You can keep up to 10 providers. The app will compare them on the same dataset and assertions.
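If you prefer to manage keys outside the GUI, the .env file the app saves keys into typically contains entries like the sketch below. The exact variable names the app reads are an assumption here; OPENAI_API_KEY is the standard name for OpenAI, and Google/Gemini providers usually read GOOGLE_API_KEY or GEMINI_API_KEY:

OPENAI_API_KEY=sk-your-openai-key
GEMINI_API_KEY=your-gemini-key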
📊 Step 7 – Create a Dataset
- Click Dataset in the sidebar
- You have three options:
Option A – Paste Table Data
Paste a simple CSV-like block:
question
What is the largest planet in our solar system?
Click Parse & Load.
Option B – Generate with AI
- Click Generate with AI (top-right in Dataset view)
- The app will auto-generate test rows based on your prompt variables
Option C – Upload CSV
Use Choose File to upload a CSV with headers matching your variables (e.g., question).
(Optional) Click Generate Expected Output to let AI create an expected_output field (e.g. {"answer": "Jupiter"}) for factuality checks.
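For example, a minimal dataset with both the input variable and an expected output column could look like this:

question,expected_output
What is the largest planet in our solar system?,"{""answer"": ""Jupiter""}"
Which planet is known as the Red Planet?,"{""answer"": ""Mars""}"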
✅ Step 8 – Add Assertions (How You Judge Outputs)
- Click Assertions in the sidebar
- Either:
- Click + Add Assertion to manually choose from:
- Is JSON
- Equals
- Contains
- Regex Match
- LLM Rubric
- Factuality
- Custom Code (JavaScript)
- Or click Generate with AI to auto-generate a set of assertions based on your prompt & dataset
Example combo (see the sketch below):
- Is JSON
- Schema: require property "answer" with "type": "string"
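For reference, that combo maps onto a Promptfoo-style assertion block roughly like the sketch below. This assumes the is-json assertion accepts a JSON Schema as its value; treat the exact shape as illustrative and check it against your Promptfoo version:

assert:
  - type: is-json
    value:
      type: object
      required: ["answer"]
      properties:
        answer:
          type: string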
▶️ Step 9 – Run an Evaluation
- Click the Run Evaluation button (top-right of the app)
- A "Evaluation Logs" dialog appears:
- Shows Promptfoo version
- Shows how many test cases & providers are running
- Streams logs while evaluations are executed
- When finished, you'll be taken to Results automatically (or you can click Results in the sidebar)
📈 Step 10 – View Test Results
In Results → Test Results:
Top summary cards:
- Pass Rate
- Passed / Failed
- Total Tests
- Total Tokens
For each Test Case:
- See question + expected_output
- For each model (provider):
- Score
- Latency
- Cost
- Prompt & output details
- PASS / FAIL status per assertion
This is where you can visually inspect how GPT-4o vs Gemini 2.5 Flash Lite performed on the same test.
🔢 Step 11 – Inspect Token Usage
- In Results, click the Token Usage tab
- You'll see:
- Total Tokens
- Breakdown: Evaluation Tokens vs Assertion Tokens
- Distribution charts for:
- Evaluation vs Assertions
- Prompt vs Completion tokens
This helps you see where most tokens (and therefore cost) are going.
🤖 Step 12 – Use AI-Powered Model Comparison
- In Results, click the AI Comparison tab
- Click Start AI Analysis
- The built-in analysis agent will:
- Summarize which model is recommended (e.g., "Gemini 2.5 Flash Lite" due to lower latency & cost)
- Show Model Performance Metrics:
- Pass Rate
- Avg Score
- Severe Failures
- Latency
- Estimated Cost
This uses the AI Analysis Prompt (configurable under Settings → AI Prompts, see Step 15) to reason about metrics and tie-breakers.
⏱ Step 13 – Compare Different Runs Over Time
- Click History in the sidebar
- You'll see past runs for the same project with:
- Pass Rate
- Avg Score
- Duration
- Passed/Failed counts
- Select two runs → click Compare Selected
- In Run Comparison you get:
- Run 1 vs Run 2 metrics
- Comparison summary:
- Consistency
- Improvements
- Regressions
- Score changes
- Visual graphs: Quality trends, Cost per run, etc.
You can even open the AI Analysis Assistant to ask questions like:
- "How many tokens were consumed?"
- "What improved between these runs?"
💾 Step 14 – Export & Automation Hooks
Save Project
- Click the Save Project icon (top bar)
- Export it as, for example, my-first-project.json
- This file can be version-controlled in Git
JSON Output Path
- Go to Settings → General
- Set JSON Output Path (e.g. /Users/.../prompt-evaluator/output.json)
- Prompt Evaluator will store raw Promptfoo results here for:
- CI integration
- Dashboards
- Custom scripts
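As a sketch of a custom hook, a CI step could read that file and fail the build when any test failed. The exact JSON schema depends on your Promptfoo version, so the jq filter below is an assumption you'd adjust:

# Hypothetical check; point it at your configured JSON Output Path.
FAILED=$(jq '[.. | .success? | select(. == false)] | length' output.json)
test "$FAILED" -eq 0 || { echo "Prompt evaluation failures: $FAILED"; exit 1; }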
⚙️ Step 15 – Fine-Tune AI Config & Prompts (Advanced)
1. AI Configuration
- Go to Settings → AI Configuration
- Choose which model powers Generate with AI features (e.g. Gemini 2.5 Pro – Best for complex analysis)
- Set defaults like dataset row count
2. AI Prompts
- Go to Settings → AI Prompts
- AI Analysis Prompt – defines how runs are scored & explained
- Assertion Generation Prompt – controls how auto-assertions are created
- Dataset Generation Prompt – controls how test cases are generated
- Column Generation Prompt – for dataset column creation
You can customize these to match enterprise rules, cost policies, or hallucination tolerance.
🔒 Step 16 – Security Testing with OWASP LLM Top 10
Prompt Evaluator includes built-in security testing capabilities to help you identify vulnerabilities in your LLM applications. The tool integrates OWASP LLM Top 10 security checks that can be easily enabled from the settings page.
Enable OWASP LLM Security Tests
- Go to Settings → Security
- Enable the OWASP LLM Top 10 security tests
- These tests will automatically check for common LLM vulnerabilities including:
- Prompt injection attacks
- Insecure output handling
- Training data poisoning
- Model denial of service
- Supply chain vulnerabilities
- Sensitive information disclosure
- Insecure plugin design
- Excessive agency
- Overreliance on LLM outputs
- Model theft
Running these security tests helps ensure your prompts are robust against common attack vectors and potential misuse before deploying to production.
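Under the hood, these categories line up with Promptfoo's red-team plugin collections. If you want to reproduce a similar scan directly from the CLI, a rough sketch of a standalone red-team config is shown below; the owasp:llm plugin IDs and the jailbreak strategy name are assumptions based on Promptfoo's red-team documentation and may differ in your installed version:

# Sketch only: verify target, plugin, and strategy names against your Promptfoo version.
targets:
  - openai:gpt-4o
redteam:
  purpose: "Helpful assistant that answers factual questions in JSON"
  plugins:
    - owasp:llm:01   # prompt injection
    - owasp:llm:06   # sensitive information disclosure
  strategies:
    - jailbreak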
🎁 Bonus – Preview & Export Promptfoo YAML Configuration
One of the most powerful features of Prompt Evaluator is that it generates standard Promptfoo YAML configuration files behind the scenes. This means you can preview the exact configuration and export it for use in CI/CD pipelines, command-line automation, or sharing with your team.
Preview the YAML Configuration
- Click on the Preview button in the toolbar
- This opens a read-only view of your project's Promptfoo YAML configuration
- You can see exactly how your prompts, providers, test cases, and assertions are structured
- This is helpful for understanding the underlying configuration and learning Promptfoo's YAML syntax
View the Generated YAML
Once you click Preview, you'll see the complete YAML configuration that Prompt Evaluator has generated based on your project settings. This includes:
- All your prompts and their configurations
- Provider settings (models, API key references)
- Test dataset with inputs and expected outputs
- Assertions and evaluation criteria
- Any custom configurations you've set
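For orientation, the config generated for the walkthrough project in this guide might look roughly like the sketch below (values are illustrative, and the app may emit additional fields):

description: My First Project
prompts:
  - |-
    You are a helpful assistant. Answer the question below in JSON format:
    {"answer": "<your answer here>"}
    Question: {{question}}
providers:
  - openai:gpt-4o
  - google:gemini-2.5-flash-lite
tests:
  - vars:
      question: What is the largest planet in our solar system?
    assert:
      - type: is-json
      - type: contains
        value: Jupiter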
Export for Automation & CI/CD
The real power comes when you want to integrate your prompt testing into automated workflows:
- Click the Export button in the preview window
- Save the YAML file to your local machine
- Use this file with the Promptfoo CLI in your terminal or CI/CD pipeline
- Run automated tests with:
promptfoo eval -c your-config.yaml
- Integrate it into GitHub Actions, Jenkins, or any other CI/CD system (see the example workflow below)
This workflow bridges the gap between visual development in Prompt Evaluator and automated testing in your production environment. You get the best of both worlds: a user-friendly GUI for setup and configuration, and standard YAML files for automation and version control.
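For example, a minimal GitHub Actions workflow built around the exported file might look like this (the config file name, Node version, and secret names are placeholders to adapt):

name: prompt-evaluation
on: [push]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm install -g promptfoo
      - run: promptfoo eval -c your-config.yaml -o results.json
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}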
🎉 Congratulations!
You've successfully set up Prompt Evaluator and learned how to use all its powerful features. You can now systematically compare AI models, run security tests, track improvements over time, export configurations for automation, and make data-driven decisions about which models to use in your projects.