Skip to content

Roadmap

Reactive-Agents Framework Roadmap

Current Version: 0.1.0a6 (Alpha) Last Updated: January 11, 2026 Status: Active Development - Breaking changes expected Real-World Test Success Rate: 80% (⅘ tests passing)


Version Milestones

Version Status Focus
0.1.0a6 ✅ Current Core refactoring, provider architecture, builder pattern, streaming support
0.1.0a7 🔄 Next Memory & context optimization (2-3x efficiency gain), Google SDK migration
0.1.0a8 📋 Planned Complete strategy implementations, test coverage
0.1.0a9 📋 Planned Token counting, advanced reasoning patterns
0.1.0b1 📋 Planned Beta - Production features (caching, rate limiting)
0.1.0 🎯 Target Stable release - Full feature parity

Current State (v0.1.0a6)

Real-World Performance (January 2026)

Playground Test Results: 80% success rate (⅘ tests passing)

Agent Strategy Result Iterations Duration Efficiency
Data Analysis plan_execute_reflect ✅ PASS 6 92.52s 0.17
Research Assistant plan_execute_reflect ✅ PASS 5 98.03s 0.20
Code Reviewer reflect_decide_act ✅ PASS 4 72.29s 0.25
Task Automation plan_execute_reflect ✅ PASS 6 89.62s 0.17
Customer Support reflect_decide_act ❌ FAIL* 3 54.38s -

*False negative - agent actually succeeded but validation missed explicit keyword

Key Findings: - ✅ Strategies work correctly - All tasks completed successfully - ✅ Zero tool failures - Reliable execution - ⚠️ Low efficiency (17-25%) - Taking 2x more iterations than optimal - ⚠️ Memory not consulted - No cross-session learning - ⚠️ Tool redundancy - Same tools called multiple times

Analysis: Framework is production-ready for common use cases but leaving significant performance on the table due to dormant memory system. See Phase 1.5 for critical improvements.

Recent Improvements

The framework has undergone significant refactoring with these key improvements:

  • Provider Architecture: Unified dual-parameter system with OpenAI-style interface
  • Builder Pattern: Type-safe ReactiveAgentBuilder with fluent API
  • Component Factory: Dependency injection for all components
  • Type System: Comprehensive Pydantic models (~3,900 lines)
  • Event System: Type-safe EventBus with async support
  • MCP Integration: First-class Model Context Protocol support

Test Coverage

Category Coverage Status
Overall 61% Needs improvement
Core Engine 87% Good
Tool Manager 87% Good
Event Bus 100% Excellent
Strategies 36-96% Mixed
Providers 14-64% Needs work

Provider Support

Provider Completion Tools Structured Output Streaming
OpenAI
Anthropic
Google ⚠️
Groq ⚠️ ⚠️
Ollama ⚠️ ⚠️

Phase 1: v0.1.0a7 - Critical Fixes

Timeline: 1-2 weeks Goal: Migrate Google SDK, improve test coverage

1.1 Fix Failing Tests ✅ COMPLETED

Issue: Google provider test was failing

TestGoogleModelProvider::test_get_completion
AssertionError: Expected 'get_chat_completion' to have been called once. Called 0 times.

Resolution: Fixed mock target from get_chat_completion to _get_provider_chat_completion in test file.


1.2 Google SDK Migration ✅ COMPLETED

Issue: Deprecated SDK warning

FutureWarning: All support for the `google.generativeai` package has ended.
Switch to the `google.genai` package.

Migration Steps:

  1. Update pyproject.toml:

    google-genai = "^1.5.0"  # Replace google-generativeai
    instructor = {extras = ["anthropic", "google-genai"], version = "^1.10.0"}
    

  2. Update imports in reactive_agents/providers/llm/google.py:

    # Before
    import google.generativeai as genai
    
    # After
    from google import genai
    

  3. Update API calls to match new SDK patterns (client-based architecture)

Action Items:

  • Update dependencies in pyproject.toml
  • Refactor GoogleModelProvider for new SDK
  • Update type hints and response handling
  • Test all Google functionality (21/21 tests passing)
  • Fix all diagnostic issues (0 errors, 0 warnings)

1.3 Increase Test Coverage

Priority Targets:

Component Current Target Priority
task_classifier.py 13% 70% High
prompts/base.py 34% 60% Medium
strategies/plan_execute_reflect.py 36% 70% High
strategies/reflect_decide_act.py 43% 70% High
providers/llm/groq.py 14% 60% Medium
providers/llm/anthropic.py 40% 70% High

Action Items:

  • Write unit tests for TaskClassifier.classify_task()
  • Write unit tests for fallback classification
  • Add integration tests for strategy selection
  • Add provider-specific test cases

Phase 1 Deliverables

  • All tests passing (424/429 - 5 pre-existing failures unrelated to SDK)
  • Google SDK migrated to google.genai
  • Critical component coverage > 60%
  • No deprecation warnings

Phase 1.5: v0.1.0a7 - Memory & Context Optimization (NEW - HIGH PRIORITY)

Timeline: 1-2 weeks Goal: Unlock dormant memory system and optimize context management Impact: 2-3x efficiency improvement in agent performance Discovered: January 11, 2026 from real-world playground testing

Critical Finding: Memory System is Dormant 🔴

Real-world test analysis revealed: Memory management exists and stores data perfectly, but is never consulted during agent execution. This causes:

  • 6 iterations instead of 3-4 for common tasks (efficiency: 17% vs target 40%+)
  • Tool redundancy (Code Reviewer called check_security twice)
  • No learning curve across sessions
  • Repeated mistakes

Test Results:

✅ Data Analysis Agent: 6 iterations, efficiency 0.17 (should be 3-4 iterations, 0.40+)
✅ Task Automation: 6 iterations, efficiency 0.17 (should be 3-4 iterations, 0.40+)
✅ Code Reviewer: 4 iterations, ran same tool twice

1.5.1 Memory-Guided Execution (HIGHEST IMPACT)

Problem: Memory exists but isn't used during reasoning

Current State (in memory_manager.py): - ✅ save_memory() - Works perfectly - ✅ update_session_history() - Works perfectly - ✅ update_tool_preferences() - Works perfectly - ❌ get_similar_sessions(task) - DOESN'T EXIST - ❌ get_relevant_reflections(context) - DOESN'T EXIST - ❌ recommend_tools_for_task(task) - DOESN'T EXIST

Action Items:

  • Add get_similar_sessions() to memory_manager.py
  • Use text similarity to find past tasks
  • Return strategy used, tools, iterations, success rate
  • Priority: CRITICAL

  • Add get_relevant_reflections() to memory_manager.py

  • Filter reflections by context relevance
  • Return learnings from similar situations
  • Priority: HIGH

  • Add recommend_tools_for_task() to memory_manager.py

  • Analyze tool preferences for similar tasks
  • Return high-success-rate tools
  • Priority: HIGH

  • Integrate memory loading in engine.py

  • Call memory query before task execution
  • Surface past learnings in prompts
  • Priority: CRITICAL

  • Update all strategy initialize() methods

  • Load relevant memory before starting
  • Use past insights to inform decisions
  • Priority: HIGH

Expected Impact: - Iterations: 6 → 3-4 (33-50% reduction) - Efficiency: 17% → 35-40% (2x improvement) - Tool redundancy: Eliminated - Learning curve: Agents improve over time

Files to Modify: - reactive_agents/core/memory/memory_manager.py - Add query methods - reactive_agents/core/reasoning/engine.py - Integrate memory consultation - reactive_agents/core/reasoning/strategies/*.py - Use memory in initialization


1.5.2 LLM-Powered Context Summarization (HIGH IMPACT)

Problem: Line 544 of context_manager.py has naive placeholder implementation

Current Implementation:

def _generate_summary(self, messages, start_idx, end_idx) -> str:
    # Naive: just counts messages by role
    summary = f"[Summary of {len(messages)} messages: {role_counts}]"
    # TODO: Implement more sophisticated summarization using LLM
    return summary

This is listed as Technical Debt TD-004 but is more critical than realized

Action Items:

  • Implement LLM-powered summarization in _generate_summary()
    async def _generate_summary(self, messages, start_idx, end_idx) -> str:
        """Generate semantic summary using agent's LLM."""
        message_text = "\n".join([f"{m['role']}: {m['content'][:200]}" for m in messages])
    
        prompt = f"""Summarize this conversation segment (2-3 sentences):
        {message_text}
    
        Focus on: key decisions, important results, actionable insights."""
    
        result = await self.agent_context.model_provider.complete(
            prompt=prompt, max_tokens=150
        )
    
        return f"[Context Summary {start_idx}-{end_idx}]: {result.content}"
    
  • Priority: CRITICAL

Expected Impact: - Context efficiency: +40-50% - Token costs: -20-30% - Information retention during long conversations: Much better - Better decision quality with relevant historical context

Files to Modify: - reactive_agents/core/context/context_manager.py:524


1.5.3 Tool Redundancy Detection (MEDIUM-HIGH IMPACT)

Problem: Agents call the same tool multiple times unnecessarily

Evidence: Code Reviewer called check_security twice in 4 iterations

Action Items:

  • Add RecentToolTracker to tool_manager.py
    class RecentToolTracker:
        def __init__(self, window=5):
            self.recent_calls = []  # Last N tool calls
    
        def is_recent_duplicate(self, tool_call: Dict) -> bool:
            """Check if this exact tool call happened recently."""
            signature = self._hash_call(tool_call)
            return signature in [self._hash_call(c) for c in self.recent_calls[-3:]]
    
        def _hash_call(self, call: Dict) -> str:
            """Create signature: tool_name:params"""
            return f"{call['name']}:{json.dumps(call.get('parameters', {}))}"
    
  • Priority: MEDIUM-HIGH

  • Integrate tracker into tool execution flow

  • Log warning when duplicate detected
  • Optionally skip duplicate calls
  • Priority: MEDIUM

Expected Impact: - Tool redundancy: Eliminated - Iterations: -10-15% - Better iteration efficiency

Files to Modify: - reactive_agents/core/tools/tool_manager.py


1.5.4 Completion Prediction (MEDIUM IMPACT)

Problem: Agents don't know when they're close to completion

Action Items:

  • Add completion score estimation to engine.py
    async def predict_completion(self, task: str, progress: Dict) -> float:
        """Estimate how close we are to completion (0.0-1.0)."""
        prompt = f"""Estimate task completion:
    
        Task: {task}
        Iterations: {progress['iterations']}
        Tools Used: {progress['tools']}
    
        Return score 0.0-1.0 (0=just started, 1.0=complete):"""
    
        result = await self.think(prompt)
        return float(result.content)
    
  • Priority: MEDIUM

Expected Impact: - Earlier completion detection - Fewer unnecessary validation iterations - Better resource utilization

Files to Modify: - reactive_agents/core/reasoning/engine.py


Phase 1.5 Deliverables

  • Memory consultation integrated - Agents load similar sessions before execution
  • LLM-powered context summarization - Semantic summaries replace naive placeholders
  • Tool redundancy detection - No repeated tool calls
  • Completion prediction - Agents estimate progress
  • Efficiency improvement - Average efficiency from 17% to 35-40%
  • Iteration reduction - Common tasks: 6 iterations → 3-4

Success Metrics: - Playground test efficiency: 17% → 35%+ (2x improvement) - Average iterations for known tasks: -40-50% - Tool redundancy incidents: 0 - Cross-session learning: Measurable improvement on repeated task types


Phase 2: Streaming Support ✅ COMPLETED (v0.1.0a6)

Status: ✅ Completed Goal: Add streaming across all providers

2.1 Streaming Architecture ✅

Implemented in reactive_agents/core/types/provider_types.py and reactive_agents/providers/llm/base.py:

class StreamChunk(BaseModel):
    """Single chunk in streaming response."""
    content: str = ""
    role: Optional[str] = None
    finish_reason: Optional[str] = None
    tool_calls: Optional[List[Dict[str, Any]]] = None
    is_final: bool = False
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0
    chunk_index: int = 0
    model: Optional[str] = None

class BaseModelProvider:
    async def stream_chat_completion(
        self,
        messages: List[Dict[str, Any]],
        tools: Optional[List[Dict[str, Any]]] = None,
        options: Optional[Dict[str, Any]] = None,
        **kwargs,
    ) -> AsyncIterator[StreamChunk]:
        """Stream chat completion tokens."""
        ...

2.2 Provider Implementations ✅

All providers implemented with _stream_provider_chat_completion():

Provider Status Notes
OpenAI Native streaming with stream_options
Anthropic Event-based streaming with messages.stream()
Google generate_content() with stream=True
Groq OpenAI-compatible streaming
Ollama Native async streaming

2.3 Remaining Integration

Pending for future versions:

  • Add streaming event types
  • Integrate with ExecutionEngine
  • Add stream_run() to ReactiveAgent
  • Add streaming example (see docs/examples/streaming.md)

Phase 2 Deliverables ✅

  • Streaming works in all 5 providers
  • StreamChunk model defined
  • Token usage tracking in final chunks
  • Tool call support during streaming

Phase 3: v0.1.0a9 - Complete Strategies

Timeline: 2-3 weeks Goal: Complete all reasoning strategy implementations

3.1 PlanExecuteReflect Strategy

Current Coverage: 36%

Missing Components:

  • Plan generation with validation
  • Step-by-step execution tracking
  • Reflection after each step
  • Plan revision based on outcomes

Files:

  • reactive_agents/core/reasoning/strategies/plan_execute_reflect.py
  • reactive_agents/core/reasoning/steps/plan_execute_reflect_steps.py

Action Items:

  • Implement PlanStep.execute() with proper LLM prompting
  • Implement ExecutionStep.execute() with tool integration
  • Implement ReflectionStep.execute() with memory storage
  • Add plan validation and scoring
  • Add plan revision capability
  • Write comprehensive tests (target: 80%)

3.2 ReflectDecideAct Strategy

Current Coverage: 43%

Missing Components:

  • Proper reflection generation
  • Decision making based on reflection
  • Action selection algorithm
  • Learning from outcomes

Files:

  • reactive_agents/core/reasoning/strategies/reflect_decide_act.py
  • reactive_agents/core/reasoning/steps/reflect_decide_act_steps.py

Action Items:

  • Implement ReflectStep.execute()
  • Implement DecideStep.execute() with scoring
  • Implement ActStep.execute() with tool selection
  • Add outcome evaluation
  • Write comprehensive tests (target: 80%)

3.3 Token Counting

Add to all providers:

class BaseModelProvider:
    def count_tokens(self, text: str) -> int:
        """Count tokens using provider's tokenizer."""
        raise NotImplementedError

    def get_context_window(self) -> int:
        """Get model's context window size."""
        raise NotImplementedError

Action Items:

  • Add count_tokens() to OpenAI (tiktoken)
  • Add count_tokens() to Anthropic (anthropic-tokenizer)
  • Add count_tokens() to Google
  • Add count_tokens() to Groq
  • Add count_tokens() to Ollama
  • Add token tracking to CompletionResponse

Phase 3 Deliverables

  • PlanExecuteReflect coverage > 80%
  • ReflectDecideAct coverage > 80%
  • Token counting in all providers
  • Overall test coverage > 75%

Phase 4: v0.1.0b1 - Production Features (Beta)

Timeline: 3-4 weeks Goal: Add production-grade features

4.1 Caching System

Components:

  • LLM response cache (exact match)
  • Semantic cache (similar queries)
  • Tool result cache
  • Pluggable backends (memory, Redis, SQLite)
class CacheConfig:
    enabled: bool = True
    backend: Literal["memory", "redis", "sqlite"] = "memory"
    ttl_seconds: int = 3600
    semantic_threshold: float = 0.95

4.2 Rate Limiting

Features:

  • Per-provider rate limits
  • Token bucket algorithm
  • Automatic retry with backoff
  • Request queuing
class RateLimitConfig:
    requests_per_minute: int = 60
    tokens_per_minute: int = 100000
    concurrent_requests: int = 10

4.3 Model Fallback

Features:

  • Automatic failover on errors
  • Health-based provider ordering
  • Configurable fallback chain
agent = await (
    ReactiveAgentBuilder()
    .with_provider(Provider.OPENAI, "gpt-4")
    .with_fallback_providers([
        (Provider.ANTHROPIC, "claude-3-sonnet"),
        (Provider.GROQ, "llama-3.1-70b"),
    ])
    .build()
)

4.4 Observability

Features:

  • OpenTelemetry tracing
  • Prometheus metrics
  • Structured logging with correlation IDs
  • Grafana dashboard template

Phase 4 Deliverables

  • LLM response caching
  • Semantic caching
  • Per-provider rate limiting
  • Model fallback system
  • OpenTelemetry integration
  • Prometheus metrics
  • Grafana dashboard template

Phase 5: v0.1.0 - Stable Release

Timeline: 2-3 weeks Goal: Polish and stabilize for production use

5.1 Advanced Multi-Agent

  • Hierarchical agent orchestration
  • Agent pools with load balancing
  • Shared memory between agents
  • Enhanced A2A protocol

5.2 Tool Enhancements

  • Tool chaining/pipelines
  • Tool dependency resolution
  • Parallel tool execution improvements

5.3 Vision/Multimodal

  • Image input support (OpenAI, Anthropic, Google)
  • Multimodal tool results

5.4 Documentation & Polish

  • Complete API reference
  • Tutorial series
  • Best practices guide
  • Performance benchmarks

Technical Debt

Critical Priority (NEW - January 2026)

ID Description Location Effort Status
TD-008 Memory queries not implemented memory_manager.py Medium 🔴 CRITICAL
TD-009 Memory not consulted during execution engine.py, strategies Medium 🔴 CRITICAL
TD-010 Tool redundancy not detected tool_manager.py Small 🟡 HIGH

High Priority

ID Description Location Effort Status
TD-004 Context summarization TODO (now CRITICAL) context_manager.py:524 Medium 🔴 CRITICAL
TD-001 Google SDK deprecation providers/llm/google.py Medium Completed
TD-002 Incomplete strategies core/reasoning/strategies/ Large ⏳ Pending
TD-003 Missing streaming All providers Large Completed

Medium Priority

ID Description Location Effort
TD-005 Plugin system TODOs plugins/plugin_manager.py Medium
TD-006 Low provider coverage Multiple providers Medium
TD-007 Circular import workarounds Various Small

Success Metrics

v0.1.0a6 (Current) ✅

  • 0 failing tests (429/429 passing)
  • Streaming in 5/5 providers
  • StreamChunk model with token tracking

v0.1.0a7

  • 0 deprecation warnings (Google SDK migrated)
  • Task classifier coverage > 60%
  • Provider coverage improvement

v0.1.0a8

  • Strategy coverage > 80%
  • Overall coverage > 75%

v0.1.0b1

  • < 100ms cache hit latency
  • 0 rate limit errors in normal operation
  • Full trace visibility

v0.1.0

  • Production deployments
  • Complete documentation
  • Benchmark results published

Contributing

See CONTRIBUTING.md for guidelines.

Priority Areas (Updated January 2026)

  1. Memory system activation 🔴 CRITICAL - TD-008, TD-009
  2. Context summarization 🔴 CRITICAL - TD-004
  3. Tool redundancy detection 🟡 HIGH - TD-010
  4. Google SDK migration ✅ Completed - TD-001
  5. Strategy completeness - TD-002
  6. Test coverage
  7. Documentation
  8. Streaming implementation ✅ Completed

This roadmap is a living document updated as the project evolves.