TheAgentCompany is a comprehensive benchmark for evaluating AI agents in realistic workplace scenarios: a multi-agent environment for collaborative problem-solving and development, built on the OpenHands agent framework.
Environment Architecture
Local Workspace
- Docker-based sandboxed environment for safe execution (see the launch sketch after this list)
- Pre-installed software tools and development environment
- Isolated from the evaluation machine for security
- Browser (Playwright), code editor, and Linux terminal access
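A minimal sketch of how such a sandbox might be launched from Python; the image name, network name, and keep-alive command are assumptions for illustration, not TheAgentCompany's actual configuration:

```python
import subprocess

def launch_workspace(image: str = "agent-workspace:latest") -> str:
    """Start an isolated container for one agent session (image name is hypothetical)."""
    result = subprocess.run(
        [
            "docker", "run", "-d", "--rm",
            "--network", "intranet",  # hypothetical network that reaches only the intranet services
            image,
            "sleep", "infinity",      # keep the container alive for the session
        ],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()  # detached mode prints the new container's ID
```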
Intranet Services
- GitLab: Code repositories and tech-oriented wiki pages
- OwnCloud: Document storage and collaborative editing
- Plane: Issue tracking, sprint cycles, product roadmaps
- RocketChat: Internal real-time messaging and collaboration
- All services are reproducible and resettable with mock data
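Resetting to a known-good state between runs could look like the sketch below, assuming the services are defined in a Docker Compose file; the file name and seeding mechanism are assumptions:

```python
import subprocess

def reset_intranet(compose_file: str = "docker-compose.yml") -> None:
    """Destroy mutated service state and recreate the stack from seeded mock data."""
    base = ["docker", "compose", "-f", compose_file]
    subprocess.run(base + ["down", "--volumes"], check=True)  # drop containers and data volumes
    subprocess.run(base + ["up", "-d"], check=True)           # fresh containers, re-seeded mock data
```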
Simulated Colleagues
- Built on the Sotopia platform for human-like interactions
- Detailed profiles including name, role, responsibilities, project affiliations
- Backed by Claude-3.5-Sonnet for consistent behavior
- Support for direct messages and channel communications
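RocketChat exposes a REST API, so an agent (or a test harness) can message a simulated colleague roughly as sketched below; the host URL, credentials, and username are placeholders:

```python
import requests

ROCKETCHAT_URL = "http://localhost:3000"  # placeholder host

def send_direct_message(auth_token: str, user_id: str, username: str, text: str) -> None:
    """DM a simulated colleague via RocketChat's chat.postMessage endpoint."""
    resp = requests.post(
        f"{ROCKETCHAT_URL}/api/v1/chat.postMessage",
        headers={"X-Auth-Token": auth_token, "X-User-Id": user_id},
        json={"channel": f"@{username}", "text": text},  # "@name" targets a DM, "#name" a channel
        timeout=10,
    )
    resp.raise_for_status()

# e.g. send_direct_message(token, uid, "alice", "Could you review my merge request?")
```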
Task Implementation
Task Components
- Detailed task intent in natural language
- Multiple checkpoints representing milestones
- Programmatic evaluators for verification
- Environment initialization and cleanup code
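One plausible way to model these four components in code; the names and types are illustrative, not the benchmark's actual classes:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Checkpoint:
    description: str              # the milestone, stated in natural language
    points: int                   # weight toward partial credit
    evaluate: Callable[[], bool]  # programmatic check against environment state

@dataclass
class Task:
    intent: str                                    # detailed natural-language instruction
    checkpoints: list[Checkpoint] = field(default_factory=list)
    initialize: Callable[[], None] = lambda: None  # seed the environment before the run
    cleanup: Callable[[], None] = lambda: None     # restore the environment afterward
```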
Checkpoint System
- Action Completion: Tool usage, navigation, data collection
- Data Accuracy: Output correctness and completeness
- Collaboration: Quality of colleague interactions
- Point-based scoring for partial completion
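Extending the Checkpoint sketch above, the three categories might be tagged with an enum so results can be reported per category; the example task and point values are invented:

```python
from enum import Enum

class CheckpointKind(Enum):
    ACTION_COMPLETION = "action_completion"  # tool usage, navigation, data collection
    DATA_ACCURACY = "data_accuracy"          # output correctness and completeness
    COLLABORATION = "collaboration"          # quality of colleague interactions

# A hypothetical 4-point task spread across the three categories:
checkpoints = [
    ("opened the right Plane issue", CheckpointKind.ACTION_COMPLETION, 1),
    ("asked the PM for the sprint deadline", CheckpointKind.COLLABORATION, 1),
    ("final report figures are correct", CheckpointKind.DATA_ACCURACY, 2),
]
total_points = sum(points for _, _, points in checkpoints)  # 4
```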
Evaluation Methods
Deterministic Evaluators
- Python functions for objective checks (see the examples after this list)
- Environment state verification
- File system change monitoring
- Browser history tracking
- Action sequence validation
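Two hedged examples of what such checks might look like, one against the file system and one against GitLab's v4 REST API; the deliverable path, project ID, and token are placeholders:

```python
import os
import requests

def check_report_written(workspace: str) -> bool:
    """File-system check: the agent produced report.csv (hypothetical deliverable)."""
    return os.path.isfile(os.path.join(workspace, "report.csv"))

def check_branch_pushed(gitlab_url: str, token: str, project_id: int, branch: str) -> bool:
    """Environment-state check: the expected branch exists on the GitLab server."""
    resp = requests.get(
        f"{gitlab_url}/api/v4/projects/{project_id}/repository/branches/{branch}",
        headers={"PRIVATE-TOKEN": token},
        timeout=10,
    )
    return resp.status_code == 200
```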
LLM-based Evaluators
- Complex deliverable assessment
- Predefined evaluation rubrics (see the sketch after this list)
- Reference output comparison
- Subjective quality measurement
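A minimal LLM-as-judge sketch; `call_llm` stands in for whatever completion function the harness uses, and the rubric and passing threshold are invented for illustration:

```python
from typing import Callable

RUBRIC = (
    "Score the deliverable against the reference on a 0-10 scale:\n"
    "- factual agreement with the reference (0-5)\n"
    "- completeness and formatting (0-5)\n"
    "Reply with a single integer."
)

def grade_deliverable(call_llm: Callable[[str], str],
                      deliverable: str, reference: str, threshold: int = 7) -> bool:
    """Pass/fail judgment from a rubric-guided LLM comparison against a reference output."""
    prompt = f"{RUBRIC}\n\nReference:\n{reference}\n\nDeliverable:\n{deliverable}"
    score = int(call_llm(prompt).strip())  # assumes the model obeys the integer-only format
    return score >= threshold
```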
Scoring Implementation
Full Completion Score
S_full = 1 if all checkpoints pass, else 0
Partial Completion Score
S_partial = 0.5 * (points_achieved / total_points) + 0.5 * S_full
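In code, the two formulas above reduce to the following; a direct transcription, not the benchmark's actual implementation:

```python
def full_score(passed: list[bool]) -> int:
    """S_full: 1 only when every checkpoint passes."""
    return 1 if all(passed) else 0

def partial_score(points_achieved: int, total_points: int, s_full: int) -> float:
    """S_partial = 0.5 * (points_achieved / total_points) + 0.5 * S_full."""
    return 0.5 * (points_achieved / total_points) + 0.5 * s_full

# A run that earns 3 of 4 points but fails one checkpoint:
# partial_score(3, 4, full_score([True, True, False]))  -> 0.375
```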
Efficiency Metrics
- Number of LLM calls per task
- Token usage and associated costs
- Step count tracking
- Execution time monitoring
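A sketch of a per-task tracker for these counters; the field names and per-token rate parameters are illustrative:

```python
import time
from dataclasses import dataclass, field

@dataclass
class RunMetrics:
    llm_calls: int = 0
    prompt_tokens: int = 0
    completion_tokens: int = 0
    steps: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def record_llm_call(self, prompt_tokens: int, completion_tokens: int) -> None:
        self.llm_calls += 1
        self.prompt_tokens += prompt_tokens
        self.completion_tokens += completion_tokens

    def cost_usd(self, input_rate: float, output_rate: float) -> float:
        """Dollar cost given per-token prices for the backing model."""
        return self.prompt_tokens * input_rate + self.completion_tokens * output_rate

    def elapsed_seconds(self) -> float:
        return time.monotonic() - self.started_at
```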
Common Failure Categories
Common Sense Deficits
- Missing implicit assumptions
- File type inference failures
- Context understanding issues
- Basic workflow comprehension gaps
Social Interaction Issues
- Incomplete communication flows
- Missed social cues
- Follow-up failures
- Context switching problems
Technical Challenges
- Complex UI navigation
- Popup handling difficulties
- Multi-step process management
- Tool integration issues
Task Execution Problems
- Taking invalid shortcuts instead of completing required steps
- Critical step omission
- Incorrect assumption chains
- Resource management issues
Performance Metrics
- Success rate across different task types
- Platform-specific performance analysis
- Cost-efficiency measurements
- Step count optimization
- Token usage efficiency