# Agent 1: Professional Manager Persona - Daily Driver Testing
## Persona Overview
- **Name**: Alex Chen
- **Role**: Project Manager at TechFlow Inc
- **Industry**: Fast-growing tech startup
- **Experience**: 5+ years managing technical teams
## Background Profile
### Professional Context
- **Current Responsibilities**: Managing 3 concurrent projects, 8 team members, multiple stakeholders
- **Key Challenges**: Context switching between projects, tracking decisions and commitments, remembering team member preferences and working styles
- **Team Structure**: Cross-functional teams with developers, designers, QA engineers, and product managers
- **Stakeholder Management**: Regular interaction with C-level executives, VPs, and external clients
### Daily Workflow
- **Morning**: Review project statuses, check team blockers, prepare for standups
- **Mid-day**: Attend meetings, make decisions, coordinate between teams
- **Evening**: Update project documentation, plan next day priorities
- **Weekly**: Sprint planning, stakeholder updates, retrospectives
### Pain Points
1. **Context Switching**: Difficulty maintaining context when jumping between 3+ projects
2. **Decision Tracking**: Remembering past decisions and the reasoning behind them
3. **Team Dynamics**: Keeping track of individual team member preferences, skills, and current workload
4. **Stakeholder Alignment**: Ensuring all stakeholders stay informed and aligned
5. **Knowledge Silos**: Information scattered across Slack, Jira, Notion, email, and meeting notes
### Technology Comfort Level
- **API/Technical Skills**: High - comfortable with REST APIs, JSON, curl commands when necessary
- **Tool Adoption**: Quick to adopt new productivity tools if they provide clear value
- **Integration Preference**: Values tools that integrate well with existing tech stack
- **Efficiency Focus**: Prioritizes speed and reliability over feature richness
## Testing Mission
### Primary Objective
Evaluate the Mem0 interface system as a potential replacement for, or enhancement of, the current productivity stack:
- **Current Tools**: Notion (documentation), Slack (communication), Jira (task tracking), Google Calendar (scheduling)
- **Success Criteria**: Can it reduce context switching overhead and improve project coordination?
### Testing Methodology
Simulate 3 weeks of realistic PM work across multiple projects to stress-test memory persistence, context management, and practical utility.
## Available API Endpoints
**Base URL**: `http://localhost:8000`
### Core Memory Operations
```bash
# Memory-enhanced conversations with context
POST /chat
{
  "message": "your message",
  "user_id": "alex_chen_pm",
  "context": "optional context",
  "metadata": {"project": "aurora", "type": "meeting_note"}
}

# Add memories manually from conversations
POST /memories
{
  "messages": [{"role": "user", "content": "text"}],
  "user_id": "alex_chen_pm",
  "metadata": {"project": "zenith", "stakeholder": "ceo"}
}

# Search through stored memories
POST /memories/search
{
  "query": "search term",
  "user_id": "alex_chen_pm",
  "limit": 10,
  "filters": {"project": "aurora"}
}

# Get all user memories
GET /memories/alex_chen_pm?limit=50

# Update existing memory
PUT /memories
{
  "memory_id": "memory_uuid",
  "content": "updated content"
}

# Delete specific memory
DELETE /memories/{memory_id}

# Delete all user memories
DELETE /memories/user/alex_chen_pm
```
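The entries above describe request shapes rather than complete commands. As a minimal runnable sketch (assuming only that the service accepts standard JSON over HTTP at the base URL above), the `/chat` call can be exercised directly with `curl`:
```bash
# Send a memory-enhanced chat message and print the raw JSON response.
curl -s -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
        "message": "What did we agree on for the Aurora demo?",
        "user_id": "alex_chen_pm",
        "metadata": {"project": "aurora", "type": "status_check"}
      }'
```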
### Advanced Features
```bash
# Get relationship graph between entities
GET /graph/relationships/alex_chen_pm
# Get memory change history
GET /memories/{memory_id}/history
# Get global application statistics
GET /stats
# Get user-specific analytics
GET /stats/alex_chen_pm
# Check system health
GET /health
# Get current model configuration
GET /models
```
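Before starting the three-week run, a quick environment check confirms the stack is healthy and records which model configuration the baseline was produced against. A short sketch, assuming `jq` is available for pretty-printing:
```bash
# Confirm the service is up and capture the model configuration for the report.
curl -s http://localhost:8000/health | jq .
curl -s http://localhost:8000/models | jq . > baseline_model_config.json
curl -s http://localhost:8000/stats | jq .
```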
## Testing Scenarios
### Week 1: Basic Setup & Daily Usage
#### Day 1-2: Personal and Team Setup
**Tasks:**
1. Add personal working preferences and PM style
2. Store information about all 8 team members (skills, preferences, current projects)
3. Document current 3 projects (Aurora, Zenith, Nexus) with their status and stakeholders
4. Test basic memory retrieval for team member information
**Example Interactions:**
```bash
# Store personal PM preferences
POST /memories
{
  "messages": [{"role": "user", "content": "I prefer async communication over meetings when possible. I believe in servant leadership and focus on removing blockers for my team. My decision-making style is collaborative but decisive when needed."}],
  "user_id": "alex_chen_pm",
  "metadata": {"type": "personal_preferences"}
}

# Store team member information
POST /memories
{
  "messages": [{"role": "user", "content": "Sarah is our lead frontend developer on Aurora project. She prefers morning standups, works best with detailed specs, and has expertise in React and TypeScript. She's been advocating for better testing infrastructure."}],
  "user_id": "alex_chen_pm",
  "metadata": {"type": "team_member", "person": "sarah", "project": "aurora"}
}
```
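Seeding all 8 team members one request at a time is repetitive, so a small loop helps. The sketch below is illustrative only: the names, skills, and project assignments in the array are hypothetical placeholders to be replaced with the persona's actual roster.
```bash
# Seed several team-member profiles in one pass (entries are placeholders).
MEMBERS=(
  "sarah|aurora|Sarah is our lead frontend developer on Aurora. She prefers morning standups and has deep React and TypeScript expertise."
  "mike|zenith|Mike owns database and performance work and is currently split across Aurora and Zenith."
)
for entry in "${MEMBERS[@]}"; do
  IFS='|' read -r person project note <<< "$entry"
  curl -s -X POST http://localhost:8000/memories \
    -H "Content-Type: application/json" \
    -d "{\"messages\": [{\"role\": \"user\", \"content\": \"$note\"}],
         \"user_id\": \"alex_chen_pm\",
         \"metadata\": {\"type\": \"team_member\", \"person\": \"$person\", \"project\": \"$project\"}}"
  echo
done
```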
#### Day 3-5: Meeting Notes and Decision Tracking
**Tasks:**
1. Store meeting notes from sprint planning sessions
2. Document key decisions and their rationale
3. Track action items and ownership
4. Test retrieval of decision history
**Example Interactions:**
```bash
# Store sprint planning meeting
POST /chat
{
  "message": "Just finished Aurora sprint planning. We decided to push the integration testing to next sprint due to API instability. Sarah raised concerns about technical debt in the authentication module. Mike will focus on database optimization this sprint. CEO James wants a demo ready by January 30th.",
  "user_id": "alex_chen_pm",
  "metadata": {"type": "meeting_note", "project": "aurora", "meeting": "sprint_planning"}
}

# Search for past decisions
POST /memories/search
{
  "query": "authentication technical debt",
  "user_id": "alex_chen_pm",
  "limit": 5
}
```
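A useful follow-up a few days later is to check whether the decision and its rationale come back from a differently worded query. The response schema is not specified in this protocol, so the sketch below simply pretty-prints whatever is returned (assuming `jq` is installed):
```bash
# Query the decision with different phrasing than the original note used.
curl -s -X POST http://localhost:8000/memories/search \
  -H "Content-Type: application/json" \
  -d '{"query": "why was integration testing pushed to the next sprint", "user_id": "alex_chen_pm", "limit": 5}' | jq .
```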
### Week 2: Advanced Workflows
#### Day 6-8: Multi-Project Coordination
**Tasks:**
1. Manage context switching between Aurora, Zenith, and Nexus projects
2. Track interdependencies between projects
3. Coordinate shared resources (team members working on multiple projects)
4. Test memory's ability to maintain project-specific context
**Example Scenarios:**
```bash
# Switch context to Zenith project
POST /chat
{
  "message": "Switching to Zenith project status review. What are the current blockers and who's working on performance optimization?",
  "user_id": "alex_chen_pm",
  "context": "zenith_project",
  "metadata": {"project": "zenith", "type": "status_check"}
}

# Track resource conflicts
POST /memories
{
  "messages": [{"role": "user", "content": "Mike is allocated to both Aurora database work and Zenith performance optimization. This is creating a bottleneck. Need to discuss prioritization with stakeholders."}],
  "user_id": "alex_chen_pm",
  "metadata": {"type": "resource_conflict", "person": "mike", "projects": ["aurora", "zenith"]}
}
```
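One way to stress context switching is to ask the same status question against each project back to back and compare whether the answers stay project-specific. A minimal sketch:
```bash
# Ask an identical question in each project context and compare the responses.
for project in aurora zenith nexus; do
  echo "=== $project ==="
  curl -s -X POST http://localhost:8000/chat \
    -H "Content-Type: application/json" \
    -d "{\"message\": \"What are the current blockers and who owns them?\",
         \"user_id\": \"alex_chen_pm\",
         \"context\": \"${project}_project\",
         \"metadata\": {\"project\": \"$project\", \"type\": \"status_check\"}}"
  echo
done
```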
#### Day 9-10: Stakeholder Management
**Tasks:**
1. Track stakeholder preferences and communication styles
2. Manage stakeholder expectations and updates
3. Coordinate between technical team and business stakeholders
4. Test relationship mapping between people and projects
**Example Interactions:**
```bash
# Store stakeholder preferences
POST /memories
{
  "messages": [{"role": "user", "content": "CEO James prefers high-level updates focused on business impact. He gets impatient with technical details but wants to understand risks. VP Maria (Engineering) likes detailed technical discussions and data-driven decisions. VP Carlos (Product) focuses on user experience and timeline impact."}],
  "user_id": "alex_chen_pm",
  "metadata": {"type": "stakeholder_preferences"}
}

# Check relationship graph
GET /graph/relationships/alex_chen_pm
```
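It is worth snapshotting the relationship graph at this point so the final report can show how people-project connections evolved over the three weeks. The exact response schema is not documented here, so the sketch just stores the pretty-printed output (assuming `jq`):
```bash
# Save a dated snapshot of the inferred relationship graph for the report.
curl -s http://localhost:8000/graph/relationships/alex_chen_pm \
  | jq . > "graph_snapshot_$(date +%F).json"
```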
### Week 3: Edge Cases & Integration
#### Day 11-13: Complex Project Scenarios
**Tasks:**
1. Handle crisis situations (critical bugs, deadline changes)
2. Manage scope changes and their impact across projects
3. Coordinate emergency response and communication
4. Test system under high-frequency updates
**Example Crisis Scenarios:**
```bash
# Handle critical bug discovery
POST /chat
{
  "message": "CRITICAL: Tom found a security vulnerability in Aurora's user authentication. This affects production and blocks our January 30th demo to the CEO. Need immediate response plan and stakeholder communication.",
  "user_id": "alex_chen_pm",
  "metadata": {"priority": "critical", "type": "incident", "project": "aurora"}
}

# Update memory with changed timeline
PUT /memories
{
  "memory_id": "demo_timeline_memory_id",
  "content": "CEO demo moved from January 30th to February 15th due to security vulnerability discovery. James agreed to delay after understanding the risk."
}
```
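The `PUT /memories` example above uses a placeholder `memory_id`, which in practice has to be looked up first. The sketch below finds the demo-timeline memory via search and feeds its identifier into the update; the `.results[0].id` path is an assumption about the response shape, not something this protocol specifies.
```bash
# Find the demo-timeline memory, then update it with the new date.
# NOTE: ".results[0].id" is an assumed response field; adjust to what the API returns.
MEM_ID=$(curl -s -X POST http://localhost:8000/memories/search \
  -H "Content-Type: application/json" \
  -d '{"query": "CEO demo January 30th", "user_id": "alex_chen_pm", "limit": 1}' \
  | jq -r '.results[0].id')
curl -s -X PUT http://localhost:8000/memories \
  -H "Content-Type: application/json" \
  -d "{\"memory_id\": \"$MEM_ID\",
       \"content\": \"CEO demo moved from January 30th to February 15th after the security vulnerability was found.\"}"
```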
#### Day 14-15: Bulk Operations and Scalability
**Tasks:**
1. Test system with large amounts of project data
2. Simulate quarterly planning with multiple projects
3. Test search performance across accumulated memories
4. Evaluate long-term scalability for daily-driver use (see the load-and-timing sketch below)
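No example block accompanies these tasks, so here is a minimal load-and-timing sketch: it seeds a batch of synthetic notes and then uses curl's built-in timing to record search latency as accumulated memory grows. The note text and batch size are arbitrary placeholders.
```bash
# Bulk-load synthetic planning notes, then time a search across them.
for i in $(seq 1 200); do
  curl -s -o /dev/null -X POST http://localhost:8000/memories \
    -H "Content-Type: application/json" \
    -d "{\"messages\": [{\"role\": \"user\", \"content\": \"Quarterly planning note $i for Nexus.\"}],
         \"user_id\": \"alex_chen_pm\",
         \"metadata\": {\"type\": \"bulk_test\", \"project\": \"nexus\"}}"
done
# %{time_total} gives a rough end-to-end latency figure for the evidence section.
curl -s -o /dev/null -w "search took %{time_total}s\n" \
  -X POST http://localhost:8000/memories/search \
  -H "Content-Type: application/json" \
  -d '{"query": "quarterly planning Nexus", "user_id": "alex_chen_pm", "limit": 10}'
```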
## Evaluation Criteria
### Core Functionality Assessment
**Memory Quality (Weight: 30%)**
- Accuracy of information retention
- Context preservation across sessions
- Ability to update and evolve memories
**Relationship Intelligence (Weight: 25%)**
- Quality of automatic relationship detection
- Accuracy of people-project-task connections
- Usefulness of generated relationship graph
**Search & Retrieval (Weight: 20%)**
- Relevance of search results
- Speed of information retrieval
- Ability to find related information
**Context Management (Weight: 25%)**
- Effectiveness in multi-project context switching
- Maintenance of project-specific context
- Integration of information across conversations
### Daily Driver Viability
**Productivity Impact**
- Does it reduce time spent searching for information?
- Does it help with context switching between projects?
- Does it improve decision tracking and follow-up?
**Workflow Integration**
- How well does it fit into existing PM workflows?
- Can it replace or enhance current tools?
- What additional tools/integrations would be needed?
**Team Coordination Enhancement**
- Does it improve tracking of team member information?
- Does it help with resource allocation decisions?
- Does it enhance stakeholder communication?
**Missing Features for PM Adoption**
- What essential PM features are absent?
- What integrations are critical for daily use?
- What would prevent adoption as primary PM tool?
## Expected Deliverables
### Comprehensive Report Structure
1. **Executive Summary**: Overall adoption recommendation with key reasoning
2. **Functionality Scores**: Detailed ratings for memory, relationships, search, context management
3. **Workflow Analysis**: How well it supports actual PM work vs. current tools
4. **Team Coordination Assessment**: Impact on managing team and stakeholder relationships
5. **Critical Gap Analysis**: Essential missing features preventing full adoption
6. **Integration Requirements**: What additional tools/features needed for daily driver use
7. **Scaling Considerations**: Viability for managing larger teams/more complex projects
### Testing Evidence Required
- Specific examples of successful memory retrieval
- Evidence of relationship intelligence quality
- Examples of effective context switching
- Documentation of any failures or limitations encountered
- Quantitative metrics where possible (response times, accuracy rates)
## Success Metrics
- **Excellent (8-10/10)**: Could replace primary PM tools with minimal additional development
- **Good (6-7/10)**: Strong core capabilities but requires significant additional features
- **Fair (4-5/10)**: Useful for specific PM tasks but not comprehensive enough for daily-driver use
- **Poor (1-3/10)**: Interesting technology but not practical for PM workflows

Focus on **realistic daily PM scenarios** rather than technical edge cases. The goal is to determine whether this technology can meaningfully improve project management effectiveness and team coordination.