Traditional tooling for machine-learning workflows does not address the unique quality-assurance needs of modern LLM applications. LLM systems require far more than numeric metrics — they need semantic correctness checks, hallucination detection, RAG grounding evaluation, security and injection testing, bias auditing, multi-provider comparison, and token-level cost tracking. Existing tools rarely capture these dimensions, leaving teams blind to subtle output regressions, safety vulnerabilities, and unexpected billing spikes. Without these capabilities, it becomes difficult to ensure that prompts behave reliably, consistently, and securely across different models and datasets.
Prompt Evaluator fills this gap by providing a dedicated, test-driven framework for validating and benchmarking prompts at scale. It offers assertion-based testing, multi-provider evaluations, OWASP LLM Top-10 security checks, drift detection, and detailed cost/latency tracking — all wrapped in an intuitive desktop GUI that non-technical stakeholders can use without writing code. Built on the flexible Promptfoo engine, it supports local execution, CI/CD integration, extensible evaluators, and privacy-first workflows. Compared to research-focused or cloud-locked alternatives, Prompt Evaluator delivers a complete, enterprise-ready QA layer that ensures prompts are reliable, safe, and cost-efficient before they reach production.
🧩 Step 1 – Install Node.js (Required)
Node.js is the foundation that powers Promptfoo, which Prompt Evaluator depends on.
- Go to https://nodejs.org/en
- Download the LTS version for your OS (Windows / macOS / Linux)
- Run the installer and finish the setup
- Confirm it works by opening your terminal and running:
node -v
npm -v
You should see version numbers for both commands, confirming Node.js is installed correctly.
📦 Step 2 – Install Promptfoo (CLI)
Prompt Evaluator expects Promptfoo to already be installed on your system.
- Open Terminal (macOS) or Command Prompt / PowerShell (Windows)
- Install Promptfoo globally using npm:
npm install -g promptfoo
- Verify the installation:
promptfoo --version
Once this command shows the version number, you're ready to install the desktop app.
💻 Step 3 – Install Prompt Evaluator Desktop App
Download the App
- Go to the official releases page: https://github.com/syamsasi99/prompt-evaluator/releases/tag/v0.0.1
- Download the appropriate installer for your OS:
- macOS: Download the .dmg file
- Windows: Download the .exe file
- Ubuntu/Linux: Download the .AppImage or .deb file
Installation & Opening Self-Signed Apps
Since this app is self-signed, you'll need to follow these steps to open it on your operating system:
⚠️ Note: Prompt Evaluator is self-signed, meaning its builds haven't been notarized by Apple or signed with a Microsoft-trusted certificate. This is normal for open-source desktop applications. The steps below will help you safely open the app.
For macOS Users:
- Open the downloaded .dmg file and drag Prompt Evaluator to your Applications folder
- When you try to open it for the first time, you'll see a warning: "Prompt Evaluator cannot be opened because it is from an unidentified developer"
- To bypass this:
- Option 1 (Recommended): Right-click (or Control-click) on the app → Select Open → Click Open in the dialog
- Option 2: Go to System Settings → Privacy & Security → Scroll down to find "Prompt Evaluator was blocked" → Click Open Anyway
- On first launch you'll see a "Setting up Prompt Evaluator" screen
- It will check for Promptfoo and other dependencies
- When you see "All dependencies verified successfully!" → click Continue
For Windows Users:
- Run the downloaded .exe installer
- If you see the "Windows protected your PC" warning:
- Click More info
- Then click Run anyway
- Follow the installation wizard to complete the setup
- Launch the app from Start Menu or Desktop shortcut
- On first launch you'll see a "Setting up Prompt Evaluator" screen
- It will check for Promptfoo and other dependencies
- When you see "All dependencies verified successfully!" → click Continue
For Ubuntu/Linux Users:
Using AppImage (Recommended):
- Download the .AppImage file
- Make it executable:
chmod +x prompt-evaluator-*.AppImage
- Run the AppImage:
./prompt-evaluator-*.AppImage
Using .deb package:
- Download the .deb file
- Install it:
sudo dpkg -i prompt-evaluator-*.deb
sudo apt-get install -f # Fix any dependency issues
- Launch from the Applications menu or run prompt-evaluator in a terminal
- On first launch you'll see a "Setting up Prompt Evaluator" screen
- It will check for Promptfoo and other dependencies
- When you see "All dependencies verified successfully!" → click Continue
🗂 Step 4 – Create Your First Project
- On the top-left, click the project dropdown
- Use the default "My First Project" or create a new one
This project will hold:
- Prompts
- Models (Providers)
- Dataset
- Assertions
- Results & History
💬 Step 5 – Add a Prompt
- In the left sidebar, click Prompts
- Create a sample system/user prompt, for example:
You are a helpful assistant. Please answer the following question in JSON format:
{"answer": "<your answer here>"}
Use a variable like {{question}} if you plan to use a dataset column later.
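For example, a prompt that pulls the question from a dataset column could look like the sketch below (the wording is illustrative; only the {{question}} placeholder matters):

You are a helpful assistant. Answer the question below in JSON format:
{"answer": "<your answer here>"}
Question: {{question}}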
🔐 Step 6 – Configure Model Providers (OpenAI, Gemini, etc.)
- Click Providers in the left sidebar
- Click + Add Provider
For OpenAI:
- Set Provider ID to something like openai:gpt-4o
- Paste your OpenAI API key in the API Key (Optional) field
- Click Test → wait for "API Key Validated Successfully"
- Choose OK if you want it saved into your .env file
For Google Gemini:
- Add another provider with an ID like google:gemini-2.5-flash-lite (or any supported model)
- Paste your Gemini API key
- Click Test → validate → save to .env if you like
💡 Pro Tip: You can keep up to 10 providers. The app will compare them on the same dataset and assertions.
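If you prefer to manage keys outside the GUI, the .env file the app saves keys into typically contains entries like the sketch below. The exact variable names the app reads are an assumption here; OPENAI_API_KEY is the standard name for OpenAI, and Google/Gemini providers usually read GOOGLE_API_KEY or GEMINI_API_KEY:

OPENAI_API_KEY=sk-your-openai-key
GEMINI_API_KEY=your-gemini-key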
📊 Step 7 – Create a Dataset
- Click Dataset in the sidebar
- You have three options:
Option A – Paste Table Data
Paste a simple CSV-like block:
question
What is the largest planet in our solar system?
Click Parse & Load.
Option B – Generate with AI
- Click Generate with AI (top-right in Dataset view)
- The app will auto-generate test rows based on your prompt variables
Option C – Upload CSV
Use Choose File to upload a CSV with headers matching your variables (e.g., question).
(Optional) Click Generate Expected Output to let AI create an expected_output field (e.g. {"answer": "Jupiter"}) for factuality checks.
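For example, a minimal dataset with both the input variable and an expected output column could look like this:

question,expected_output
What is the largest planet in our solar system?,"{""answer"": ""Jupiter""}"
Which planet is known as the Red Planet?,"{""answer"": ""Mars""}"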
✅ Step 8 – Add Assertions (How You Judge Outputs)
- Click Assertions in the sidebar
- Either:
- Click + Add Assertion to manually choose from:
- Is JSON
- Equals
- Contains
- Regex Match
- LLM Rubric
- Factuality
- Custom Code (JavaScript)
- Or click Generate with AI to auto-generate a set of assertions based on your prompt & dataset
Example combo (see the sketch below):
- Is JSON
- Schema: require property "answer" with "type": "string"
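For reference, that combo maps onto a Promptfoo-style assertion block roughly like the sketch below. This assumes the is-json assertion accepts a JSON Schema as its value; treat the exact shape as illustrative and check it against your Promptfoo version:

assert:
  - type: is-json
    value:
      type: object
      required: ["answer"]
      properties:
        answer:
          type: string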
▶️ Step 9 – Run an Evaluation
- Click the Run Evaluation button (top-right of the app)
- A "Evaluation Logs" dialog appears:
- Shows Promptfoo version
- Shows how many test cases & providers are running
- Streams logs while evaluations are executed
- When finished, you'll be taken to Results automatically (or you can click Results in the sidebar)
📈 Step 10 – View Test Results
In Results → Test Results:
Top summary cards:
- Pass Rate
- Passed / Failed
- Total Tests
- Total Tokens
For each Test Case:
- See question + expected_output
- For each model (provider):
- Score
- Latency
- Cost
- Prompt & output details
- PASS / FAIL status per assertion
This is where you can visually inspect how GPT-4o vs Gemini 2.5 Flash Lite performed on the same test.
🔢 Step 11 – Inspect Token Usage
- In Results, click the Token Usage tab
- You'll see:
- Total Tokens
- Breakdown: Evaluation Tokens vs Assertion Tokens
- Distribution charts for:
- Evaluation vs Assertions
- Prompt vs Completion tokens
This helps you see where most tokens (and therefore cost) are going.
🤖 Step 12 – Use AI-Powered Model Comparison
- In Results, click the AI Comparison tab
- Click Start AI Analysis
- The built-in analysis agent will:
- Summarize which model is recommended (e.g., "Gemini 2.5 Flash Lite" due to lower latency & cost)
- Show Model Performance Metrics:
- Pass Rate
- Avg Score
- Severe Failures
- Latency
- Estimated Cost
This uses the AI Analysis Prompt (configurable under Settings → AI Prompts, see Step 15) to reason about metrics and tie-breakers.
⏱ Step 13 – Compare Different Runs Over Time
- Click History in the sidebar
- You'll see past runs for the same project with:
- Pass Rate
- Avg Score
- Duration
- Passed/Failed counts
- Select two runs → click Compare Selected
- In Run Comparison you get:
- Run 1 vs Run 2 metrics
- Comparison summary:
- Consistency
- Improvements
- Regressions
- Score changes
- Visual graphs: Quality trends, Cost per run, etc.
You can even open the AI Analysis Assistant to ask questions like:
- "How many tokens were consumed?"
- "What improved between these runs?"
💾 Step 14 – Export & Automation Hooks
Save Project
- Click the Save Project icon (top bar)
- Export it as, for example, my-first-project.json
- This file can be version-controlled in Git
JSON Output Path
- Go to Settings → General
- Set JSON Output Path (e.g. /Users/.../prompt-evaluator/output.json)
- Prompt Evaluator will store raw Promptfoo results here for:
- CI integration
- Dashboards
- Custom scripts
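As a sketch of a custom hook, a CI step could read that file and fail the build when any test failed. The exact JSON schema depends on your Promptfoo version, so the jq filter below is an assumption you'd adjust:

# Hypothetical check; point it at your configured JSON Output Path.
FAILED=$(jq '[.. | .success? | select(. == false)] | length' output.json)
test "$FAILED" -eq 0 || { echo "Prompt evaluation failures: $FAILED"; exit 1; }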
⚙️ Step 15 – Fine-Tune AI Config & Prompts (Advanced)
1. AI Configuration
- Go to Settings → AI Configuration
- Choose which model powers Generate with AI features (e.g. Gemini 2.5 Pro – Best for complex analysis)
- Set defaults like dataset row count
2. AI Prompts
- Go to Settings → AI Prompts
- AI Analysis Prompt – defines how runs are scored & explained
- Assertion Generation Prompt – controls how auto-assertions are created
- Dataset Generation Prompt – controls how test cases are generated
- Column Generation Prompt – for dataset column creation
You can customize these to match enterprise rules, cost policies, or hallucination tolerance.
🔒 Step 16 – Security Testing with OWASP LLM Top 10
Prompt Evaluator includes built-in security testing capabilities to help you identify vulnerabilities in your LLM applications. The tool integrates OWASP LLM Top 10 security checks that can be easily enabled from the settings page.
Enable OWASP LLM Security Tests
- Go to Settings → Security
- Enable the OWASP LLM Top 10 security tests
- These tests will automatically check for common LLM vulnerabilities including:
- Prompt injection attacks
- Insecure output handling
- Training data poisoning
- Model denial of service
- Supply chain vulnerabilities
- Sensitive information disclosure
- Insecure plugin design
- Excessive agency
- Overreliance on LLM outputs
- Model theft
Running these security tests helps ensure your prompts are robust against common attack vectors and potential misuse before deploying to production.
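Under the hood, these categories line up with Promptfoo's red-team plugin collections. If you want to reproduce a similar scan directly from the CLI, a rough sketch of a standalone red-team config is shown below; the owasp:llm plugin IDs and the jailbreak strategy name are assumptions based on Promptfoo's red-team documentation and may differ in your installed version:

# Sketch only: verify target, plugin, and strategy names against your Promptfoo version.
targets:
  - openai:gpt-4o
redteam:
  purpose: "Helpful assistant that answers factual questions in JSON"
  plugins:
    - owasp:llm:01   # prompt injection
    - owasp:llm:06   # sensitive information disclosure
  strategies:
    - jailbreak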
🎁 Bonus – Preview & Export Promptfoo YAML Configuration
One of the most powerful features of Prompt Evaluator is that it generates standard Promptfoo YAML configuration files behind the scenes. This means you can preview the exact configuration and export it for use in CI/CD pipelines, command-line automation, or sharing with your team.
Preview the YAML Configuration
- Click on the Preview button in the toolbar
- This opens a read-only view of your project's Promptfoo YAML configuration
- You can see exactly how your prompts, providers, test cases, and assertions are structured
- This is helpful for understanding the underlying configuration and learning Promptfoo's YAML syntax
View the Generated YAML
Once you click Preview, you'll see the complete YAML configuration that Prompt Evaluator has generated based on your project settings. This includes:
- All your prompts and their configurations
- Provider settings (models, API key references)
- Test dataset with inputs and expected outputs
- Assertions and evaluation criteria
- Any custom configurations you've set
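For orientation, the config generated for the walkthrough project in this guide might look roughly like the sketch below (values are illustrative, and the app may emit additional fields):

description: My First Project
prompts:
  - |-
    You are a helpful assistant. Answer the question below in JSON format:
    {"answer": "<your answer here>"}
    Question: {{question}}
providers:
  - openai:gpt-4o
  - google:gemini-2.5-flash-lite
tests:
  - vars:
      question: What is the largest planet in our solar system?
    assert:
      - type: is-json
      - type: contains
        value: Jupiter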
Export for Automation & CI/CD
The real power comes when you want to integrate your prompt testing into automated workflows:
- Click the Export button in the preview window
- Save the YAML file to your local machine
- Use this file with the Promptfoo CLI in your terminal or CI/CD pipeline
- Run automated tests with:
promptfoo eval -c your-config.yaml
- Integrate it into GitHub Actions, Jenkins, or any other CI/CD system (see the example workflow below)
This workflow bridges the gap between visual development in Prompt Evaluator and automated testing in your production environment. You get the best of both worlds: a user-friendly GUI for setup and configuration, and standard YAML files for automation and version control.
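For example, a minimal GitHub Actions workflow built around the exported file might look like this (the config file name, Node version, and secret names are placeholders to adapt):

name: prompt-evaluation
on: [push]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm install -g promptfoo
      - run: promptfoo eval -c your-config.yaml -o results.json
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}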
🎉 Congratulations!
You've successfully set up Prompt Evaluator and learned how to use all its powerful features. You can now systematically compare AI models, run security tests, track improvements over time, export configurations for automation, and make data-driven decisions about which models to use in your projects.