TheAgentCompany

TheAgentCompany provides a multi-agent environment for collaborative problem-solving and development, and implements a comprehensive benchmark for evaluating AI agents in realistic workplace scenarios. It is built on the OpenHands agent framework.

Environment Architecture

  1. Local Workspace

    • Docker-based sandboxed environment for safe execution
    • Pre-installed software tools and development environment
    • Isolated from the evaluation machine for security
    • Browser (Playwright), code editor, and Linux terminal access
  2. Intranet Services

    • GitLab: Code repositories and tech-oriented wiki pages
    • OwnCloud: Document storage and collaborative editing
    • Plane: Issue tracking, sprint cycles, product roadmaps
    • RocketChat: Internal real-time messaging and collaboration
    • All services are reproducible and resettable, seeded with mock data
  3. Simulated Colleagues

    • Built on the Sotopia platform for human-like interactions
    • Detailed profiles including name, role, responsibilities, and project affiliations (see the sketch after this list)
    • Backed by Claude-3.5-Sonnet for consistent behavior
    • Support for direct messages and channel communications
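
To make the colleague setup concrete, here is a minimal sketch of how a simulated-colleague profile might be represented and turned into a persona prompt. The ColleagueProfile class, its fields, and the example persona are illustrative assumptions, not TheAgentCompany's or Sotopia's actual schema.

    from dataclasses import dataclass

    @dataclass
    class ColleagueProfile:
        """One simulated colleague (illustrative fields, not the real schema)."""
        name: str
        role: str
        responsibilities: list[str]
        project: str
        backend_model: str = "claude-3-5-sonnet"  # LLM backing the persona

    def build_persona_prompt(profile: ColleagueProfile) -> str:
        """Turn a profile into a system prompt for the backing LLM."""
        return (
            f"You are {profile.name}, a {profile.role} on the {profile.project} project. "
            f"Your responsibilities: {', '.join(profile.responsibilities)}. "
            "Reply to direct messages and channel mentions in character."
        )

    # Hypothetical project manager reachable over RocketChat.
    pm = ColleagueProfile(
        name="Sarah Chen",  # hypothetical persona
        role="Project Manager",
        responsibilities=["sprint planning", "roadmap updates"],
        project="internal web app",  # hypothetical affiliation
    )
    print(build_persona_prompt(pm))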

Task Implementation

  1. Task Components

    • Detailed task intent in natural language
    • Multiple checkpoints representing milestones
    • Programmatic evaluators for verification (sketched in code after this list)
    • Environment initialization and cleanup code
  2. Checkpoint System

    • Action Completion: Tool usage, navigation, data collection
    • Data Accuracy: Output correctness and completeness
    • Collaboration: Quality of colleague interactions
    • Point-based scoring for partial completion
  3. Evaluation Methods

    • Deterministic Evaluators

      • Python functions for objective checks
      • Environment state verification
      • File system change monitoring
      • Browser history tracking
      • Action sequence validation
    • LLM-based Evaluators

      • Complex deliverable assessment
      • Predefined evaluation rubrics
      • Reference output comparison
      • Subjective quality measurement
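
As a concrete illustration, the sketch below defines a checkpoint structure with one deterministic grader and one stand-in for an LLM-based grader. The Checkpoint class, the grader signature, the file path, and the keyword heuristic are assumptions made for illustration; they are not TheAgentCompany's actual evaluation API.

    import os
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Checkpoint:
        """One graded milestone of a task (illustrative structure)."""
        description: str
        points: int
        grader: Callable[[], int]  # returns points earned, between 0 and `points`

    def check_report_exists() -> int:
        """Deterministic evaluator: verify the agent created the expected file."""
        return 1 if os.path.isfile("/workspace/reports/q3_summary.md") else 0  # hypothetical path

    def check_report_quality() -> int:
        """Stand-in for an LLM-based evaluator: a real one would send the
        deliverable and a rubric to a judge model; a keyword heuristic keeps
        this sketch self-contained."""
        try:
            text = open("/workspace/reports/q3_summary.md").read().lower()
        except FileNotFoundError:
            return 0
        rubric_keywords = ("revenue", "headcount", "risk")
        return 2 if all(k in text for k in rubric_keywords) else 0

    checkpoints = [
        Checkpoint("Report file created in the shared folder", 1, check_report_exists),
        Checkpoint("Report content satisfies the rubric", 2, check_report_quality),
    ]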

Scoring Implementation

  1. Full Completion Score

    S_full = 1 if all checkpoints passed, else 0
    

  2. Partial Completion Score (combined with S_full in the sketch after this list)

    S_partial = 0.5 * (points_achieved / total_points) + 0.5 * S_full
    

  3. Efficiency Metrics

    • Number of LLM calls per task
    • Token usage and associated costs
    • Step count tracking
    • Execution time monitoring
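
A minimal sketch of how the two scores combine, assuming each checkpoint reports earned and total points; the score_task function and its input format are illustrative, not the benchmark's actual scoring code.

    def score_task(results: list[tuple[int, int]]) -> tuple[float, float]:
        """Compute (S_full, S_partial) from (earned, total) point pairs,
        one pair per checkpoint, following the formulas above."""
        points_achieved = sum(earned for earned, _ in results)
        total_points = sum(total for _, total in results)
        s_full = 1.0 if all(earned == total for earned, total in results) else 0.0
        s_partial = 0.5 * (points_achieved / total_points) + 0.5 * s_full
        return s_full, s_partial

    # Example: three of four checkpoints fully passed, 5 of 7 points earned.
    print(score_task([(1, 1), (2, 2), (2, 2), (0, 2)]))  # -> (0.0, 0.357...)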

Common Failure Categories

  1. Common Sense Deficits

    • Missing implicit assumptions
    • File type inference failures
    • Context understanding issues
    • Basic workflow comprehension gaps
  2. Social Interaction Issues

    • Incomplete communication flows
    • Missed social cues
    • Follow-up failures
    • Context switching problems
  3. Technical Challenges

    • Complex UI navigation
    • Popup handling difficulties
    • Multi-step process management
    • Tool integration issues
  4. Task Execution Problems

    • Invalid shortcut creation
    • Critical step omission
    • Incorrect assumption chains
    • Resource management issues

Performance Metrics

Resources
