THE DAWN OF MULTIMODAL AND AGENTIC AI ASSISTANTS
Modern enterprises are swimming in a sea of diverse data, spanning everything from traditional text documents and PDFs to dynamic presentation slides, intricate images, and crucial audio recordings. The vision of an AI assistant capable of digesting a company’s quarterly earnings call – not just reading the transcript but also discerning insights from presentation charts and understanding the CEO’s spoken remarks – is swiftly becoming a reality. Industry analysts forecast a significant surge in multimodal generative AI, projecting that the share of generative AI solutions that are multimodal will jump from roughly 1% in 2023 to 40% by 2027. This shift underscores the paramount importance of multimodal comprehension for sophisticated business applications. Achieving this necessitates a generative AI assistant that can seamlessly integrate and interpret various data types. Furthermore, it demands an agentic architecture, empowering the AI to proactively retrieve information, strategize tasks, and make intelligent decisions about tool use, rather than merely offering passive responses.
This article delves into an innovative solution designed to meet these evolving demands. Leveraging Amazon Nova Pro, AWS’s cutting-edge multimodal large language model (LLM), as the central orchestrator, alongside powerful capabilities like Amazon Bedrock Data Automation for comprehensive multimodal data processing, we explore how enterprises can build highly intelligent AI assistants. We will demonstrate how advanced agentic workflow patterns such as Retrieval Augmented Generation (RAG), sophisticated multi-tool orchestration, and conditional routing, often facilitated by frameworks like LangGraph, enable robust end-to-end solutions. This approach provides a blueprint for artificial intelligence and machine learning (AI/ML) developers and enterprise architects to adopt and customize. We’ll illustrate this with an example of a financial management AI assistant, capable of delivering quantitative research and grounded financial advice by analyzing earnings call audio and presentation slides, complemented by real-time financial data feeds. The patterns discussed are broadly applicable across diverse sectors including finance, healthcare, and manufacturing.
UNPACKING THE AGENTIC WORKFLOW PATTERN
The essence of an agentic pattern lies in its iterative decision-making cycle, which mirrors human problem-solving. This core cycle consists of four distinct stages:
- Reason – The AI agent, typically an LLM, meticulously analyzes the user’s request, considering the prevailing context and current state. It then determines the optimal next action, whether that’s directly formulating an answer or invoking a specific tool or sub-task to acquire additional information.
- Act – Following its reasoning, the agent executes the chosen step. This might involve calling a specialized tool or function, such as performing a web search, executing a database query, or conducting an in-depth document analysis utilizing services like Amazon Bedrock Data Automation.
- Observe – The agent then processes the outcome of its action. For instance, it reads the retrieved text or analyzes the data returned by the tool.
- Loop – Armed with new information, the agent re-enters the reasoning phase, evaluating whether the task is complete or if further steps are required. This continuous loop persists until the agent is confident it can generate a definitive answer for the user.
This dynamic, iterative process empowers agents to tackle complex requests that are beyond the scope of a single prompt. However, implementing agentic systems presents its own set of challenges, primarily increased complexity in control flow. Naive agents can be inefficient, making unnecessary tool calls or entering redundant loops, and can become difficult to manage at scale. This is where structured frameworks like LangGraph become invaluable. LangGraph enables the definition of a directed graph or state machine of potential actions, complete with well-defined nodes (e.g., “Report Writer,” “Query Knowledge Base”) and edges (representing permissible transitions). While the agent’s internal reasoning still dictates the path, LangGraph ensures the overall process remains structured, manageable, and transparent. This controlled flexibility grants the AI assistant the necessary autonomy to handle diverse tasks while maintaining workflow stability and predictability.
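To make this concrete, the following minimal sketch shows how such a graph can be declared with LangGraph: a router node performs the reasoning step, the conditional edges constrain which transitions are permitted, and the loop back from the retrieval node implements the observe-and-re-reason cycle. The node names and stubbed functions are illustrative assumptions, not the exact implementation from the solution notebook.

```python
# Minimal LangGraph sketch of the reason-act-observe loop. Node functions are
# hypothetical placeholders; each receives and returns the shared state.
from typing import List, TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    question: str
    retrieved_docs: List[str]
    answer: str
    next_step: str

def router(state: AgentState) -> AgentState:
    # Reason: decide the next action (in practice, an LLM call sets next_step).
    state["next_step"] = "query_kb" if not state["retrieved_docs"] else "report_writer"
    return state

def query_knowledge_base(state: AgentState) -> AgentState:
    # Act + Observe: fetch supporting context (stubbed here).
    state["retrieved_docs"] = ["<retrieved chunk>"]
    return state

def report_writer(state: AgentState) -> AgentState:
    # Synthesize a grounded answer from the retrieved context.
    state["answer"] = f"Answer based on {len(state['retrieved_docs'])} sources."
    return state

graph = StateGraph(AgentState)
graph.add_node("router", router)
graph.add_node("query_kb", query_knowledge_base)
graph.add_node("report_writer", report_writer)
graph.set_entry_point("router")
# Conditional edges: the router's decision selects the permissible transition.
graph.add_conditional_edges("router", lambda s: s["next_step"],
                            {"query_kb": "query_kb", "report_writer": "report_writer"})
graph.add_edge("query_kb", "router")   # loop back to re-reason with new context
graph.add_edge("report_writer", END)

app = graph.compile()
result = app.invoke({"question": "Summarize Q3 risks",
                     "retrieved_docs": [], "answer": "", "next_step": ""})
```

Because the graph, not the prompt, defines the allowed transitions, the loop terminates deterministically once the report writer node runs, even though the routing decision itself is made by the model.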
SOLUTION ARCHITECTURE: A MULTIMODAL AI ASSISTANT
Our proposed financial management AI assistant is engineered to empower analysts in querying portfolios, conducting company analyses, and generating insightful reports. Central to this solution is Amazon Nova, a family of multimodal LLMs whose Pro model serves as the inference engine. Amazon Nova adeptly processes text, images, and documents (like earnings call slides), making dynamic decisions on which tools to employ to fulfill requests. Optimized for demanding enterprise tasks, Amazon Nova supports function calling, enabling the model to plan actions and invoke tools in a structured manner. Its expansive context window (up to 300,000 tokens in Amazon Nova Lite and Pro) allows it to effectively manage lengthy documents and extensive conversation histories during complex reasoning processes.
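The snippet below is a minimal sketch of how an orchestrator can expose a tool to Amazon Nova Pro through the Amazon Bedrock Converse API so that the model can plan a structured tool call; the Region, inference profile ID, and the getStockPerformance tool schema are illustrative assumptions.

```python
# Sketch: letting Amazon Nova Pro plan a tool call via the Bedrock Converse API.
# The model ID and the getStockPerformance tool definition are illustrative.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

tool_config = {
    "tools": [{
        "toolSpec": {
            "name": "getStockPerformance",
            "description": "Return year-to-date stock performance for a ticker symbol.",
            "inputSchema": {"json": {
                "type": "object",
                "properties": {"symbol": {"type": "string"}},
                "required": ["symbol"],
            }},
        }
    }]
}

response = bedrock.converse(
    modelId="us.amazon.nova-pro-v1:0",   # inference profile ID; adjust for your Region
    messages=[{"role": "user", "content": [{"text": "How has XXX performed this year?"}]}],
    toolConfig=tool_config,
)

# If the model decides a tool is needed, the response contains a toolUse block
# that the orchestrator executes before returning the result to the model.
for block in response["output"]["message"]["content"]:
    if "toolUse" in block:
        print("Model requested:", block["toolUse"]["name"], block["toolUse"]["input"])
```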
The workflow integrates several key components:
KNOWLEDGE BASE RETRIEVAL
Earnings call audio files and PowerPoint presentations are meticulously processed by Amazon Bedrock Data Automation. This managed service efficiently extracts text, transcribes audio and video content, and prepares data for in-depth analysis. For PowerPoint files, the system converts each slide into an image (PNG) to facilitate efficient searching and analysis, a technique inspired by advanced generative AI applications. Essentially, Amazon Bedrock Data Automation acts as an out-of-the-box multimodal AI pipeline, bridging the gap between raw data and the agentic workflow. The extracted data chunks are then transformed into vector embeddings by Amazon Bedrock Knowledge Bases, using Amazon Titan Text Embeddings V2, and stored in an Amazon OpenSearch Serverless database.
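After ingestion, downstream agents can query the knowledge base with a straightforward semantic search call. The sketch below uses the Retrieve API of the Amazon Bedrock agent runtime and assumes a placeholder knowledge base ID.

```python
# Sketch: semantic retrieval from an Amazon Bedrock knowledge base populated by
# Amazon Bedrock Data Automation output. The knowledge base ID is a placeholder.
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = agent_runtime.retrieve(
    knowledgeBaseId="KB123EXAMPLE",          # hypothetical ID
    retrievalQuery={"text": "key risks mentioned in the Q3 earnings call"},
    retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 5}},
)

for hit in response["retrievalResults"]:
    # Each result carries the chunk text, its source location, and a relevance score.
    print(hit["content"]["text"][:120], hit.get("score"))
```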
ROUTER AGENT
When a user initiates a query, such as “Summarize the key risks in this Q3 earnings report,” Amazon Nova, acting as the router agent, first determines if the task necessitates data retrieval, file processing, or direct response generation. It intelligently maintains a dialogue memory, interprets the user’s intent, and devises a plan of action. The “Memory & Planning” module depicted in the solution architecture indicates the router agent’s capacity to leverage conversation history and chain-of-thought (CoT) prompting for optimal next step determination. Crucially, this agent decides whether the query can be resolved using internal company data or if it requires external information and tools.
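One simple way to implement this routing decision, sketched below under the assumption of a three-label scheme (knowledge_base, multi_tool, direct) and an illustrative model ID, is to ask a lightweight Amazon Nova model to return a single routing label that then drives the workflow's conditional edges.

```python
# Sketch: a routing step that asks the model for one label, which then selects
# the next node in the workflow graph. Labels and prompt wording are illustrative.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

ROUTER_PROMPT = (
    "Classify the user request into exactly one label:\n"
    "knowledge_base - answerable from indexed earnings material\n"
    "multi_tool     - needs live market data or web search\n"
    "direct         - answerable from conversation history alone\n"
    "Respond with the label only.\n\nRequest: {question}"
)

def route(question: str) -> str:
    response = bedrock.converse(
        modelId="us.amazon.nova-lite-v1:0",  # a lighter model is enough for routing
        messages=[{"role": "user",
                   "content": [{"text": ROUTER_PROMPT.format(question=question)}]}],
        inferenceConfig={"maxTokens": 10, "temperature": 0.0},
    )
    label = response["output"]["message"]["content"][0]["text"].strip().lower()
    # Fall back to retrieval if the model returns anything unexpected.
    return label if label in {"knowledge_base", "multi_tool", "direct"} else "knowledge_base"

print(route("Summarize the key risks in this Q3 earnings report"))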
MULTIMODAL RAG AGENT
For queries involving audio or visual information, Amazon Bedrock Data Automation utilizes a unified API to extract insights from multimedia data, which are then stored in Amazon Bedrock Knowledge Bases. Amazon Nova employs these knowledge bases to retrieve factual answers through semantic search. This ensures that responses are grounded in real data, significantly reducing the likelihood of hallucination. Should Amazon Nova generate an answer, a secondary hallucination check rigorously cross-references the response against trusted sources to identify any unsupported claims.
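For the retrieve-then-generate step, the managed RetrieveAndGenerate API can fetch relevant chunks and produce a grounded answer with citations in a single call. In the sketch below, the knowledge base ID and model ARN are placeholders.

```python
# Sketch: grounded question answering over the multimodal knowledge base using
# the managed RetrieveAndGenerate API. IDs and the model ARN are placeholders.
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = agent_runtime.retrieve_and_generate(
    input={"text": "What guidance did management give on next quarter's margins?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB123EXAMPLE",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-pro-v1:0",
        },
    },
)

print(response["output"]["text"])            # grounded answer
for citation in response.get("citations", []):
    # Each citation links a span of the answer back to retrieved source chunks.
    for ref in citation["retrievedReferences"]:
        print("source:", ref["location"])
```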
HALLUCINATION CHECK (QUALITY GATE)
To further bolster reliability, the workflow incorporates a post-processing step that can utilize a different foundation model (FM) outside the Amazon Nova family, such as Anthropic’s Claude, Mistral, or Meta’s Llama, to grade the answer’s faithfulness. For example, after Amazon Nova generates a response, a dedicated hallucination detector model or function can compare the answer against retrieved sources or known facts. If a potential hallucination is detected (meaning the answer lacks support from reference data), the agent can opt for additional retrieval, refine the answer, or escalate the query to a human operator.
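A lightweight way to implement this quality gate is to ask a second foundation model to grade whether every claim in the draft answer is supported by the retrieved passages. The sketch below assumes Anthropic's Claude as the judge and a simple SUPPORTED/UNSUPPORTED verdict format; both are illustrative choices.

```python
# Sketch: a faithfulness check that asks a second foundation model to grade the
# draft answer against the retrieved sources. Judge model and verdict format
# are illustrative assumptions.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

JUDGE_PROMPT = (
    "You are a strict fact checker. Given SOURCES and ANSWER, reply with "
    "SUPPORTED if every claim in ANSWER is backed by SOURCES, otherwise UNSUPPORTED.\n\n"
    "SOURCES:\n{sources}\n\nANSWER:\n{answer}"
)

def is_grounded(answer: str, sources: list) -> bool:
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # example judge model
        messages=[{"role": "user", "content": [{
            "text": JUDGE_PROMPT.format(sources="\n---\n".join(sources), answer=answer)
        }]}],
        inferenceConfig={"maxTokens": 10, "temperature": 0.0},
    )
    verdict = response["output"]["message"]["content"][0]["text"].strip().upper()
    return verdict.startswith("SUPPORTED")

# If the check fails, the workflow can retrieve more context, regenerate the
# answer, or escalate to a human operator.
```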
MULTI-TOOL COLLABORATION
This pivotal feature enables the AI to not only find information but also execute actions before formulating a final answer. This introduces a spectrum of multi-tool options. A supervisor agent might initiate or coordinate multiple tool-specific agents – for instance, a web search agent for general information, a stock search agent for market data, or specialized agents for company financial metrics or industry news. Each agent performs a focused task (e.g., calling an API or querying the internet) and relays its findings back to the supervisor agent. Amazon Nova Pro’s robust reasoning capabilities allow the supervisor agent to seamlessly merge these disparate findings. This multi-agent approach adheres to the principle of dividing complex tasks among specialist agents, thereby enhancing efficiency and reliability for intricate queries.
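The sketch below illustrates one possible supervisor loop built on the Converse API: the model requests tools, the supervisor executes the corresponding specialist functions, and the results are returned to the model until it emits a final answer. The tool names and their stub implementations are hypothetical.

```python
# Sketch: a supervisor loop that executes the model's tool requests and feeds
# the results back until a final text answer is produced. Tools are stubs.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def web_search(query: str) -> dict:           # hypothetical specialist tools
    return {"ticker": "XXX"}

def stock_performance(symbol: str) -> dict:
    return {"ytdChange": "+12%"}

TOOL_FUNCTIONS = {"webSearch": web_search, "stockPerformance": stock_performance}

def run_supervisor(messages, tool_config, model_id="us.amazon.nova-pro-v1:0"):
    while True:
        response = bedrock.converse(modelId=model_id, messages=messages,
                                    toolConfig=tool_config)
        message = response["output"]["message"]
        messages.append(message)
        if response["stopReason"] != "tool_use":
            # The model is done reasoning; return its final text blocks.
            return "".join(b["text"] for b in message["content"] if "text" in b)
        # Execute every requested tool and send the results back to the model.
        results = []
        for block in message["content"]:
            if "toolUse" in block:
                tool = block["toolUse"]
                output = TOOL_FUNCTIONS[tool["name"]](**tool["input"])
                results.append({"toolResult": {
                    "toolUseId": tool["toolUseId"],
                    "content": [{"json": output}],
                }})
        messages.append({"role": "user", "content": results})
```

Keeping each specialist behind a plain Python function makes it straightforward to add or replace agents (for example, a company metrics or industry news agent) without touching the loop itself.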
REPORT CREATION AGENT
A standout feature of this architecture is its structured report generation, inspired by the Amazon Nova Canvas concept. Although Amazon Nova Canvas is primarily an image-generation model within the Amazon Nova family, the “canvas” idea is applied figuratively here to represent a structured template or format for generated content. For example, a predefined template for an “investor report” could be populated by the assistant, comprising sections like “Key Highlights” (bullet points), “Financial Summary” (table of figures), and “Notable Quotes.” The agent can guide Amazon Nova to fill such a template by providing a system prompt with the desired format (akin to few-shot prompting where the layout is provided). The outcome is an assistant that not only answers ad-hoc questions but can also produce comprehensive, professional-looking reports, combining text, images, and references to visuals, as if prepared by a human analyst.
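As an illustration, the report template can simply be supplied through the system prompt of a Converse call, as in the hedged sketch below; the section layout and model ID are made-up examples rather than the notebook's exact template.

```python
# Sketch: steering the model toward a fixed "investor report" layout by placing
# the template in the system prompt. The section names are illustrative.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

REPORT_TEMPLATE = """Write the answer as an investor report with exactly these sections:
## Key Highlights
- three to five bullet points
## Financial Summary
| Metric | Value | YoY Change |
## Notable Quotes
- direct quotes with speaker attribution"""

response = bedrock.converse(
    modelId="us.amazon.nova-pro-v1:0",        # illustrative model ID
    system=[{"text": REPORT_TEMPLATE}],
    messages=[{"role": "user", "content": [{
        "text": "Draft an investor report from the retrieved Q3 earnings material."
    }]}],
)
print(response["output"]["message"]["content"][0]["text"])
```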
These components are intricately orchestrated within an agentic workflow. Instead of following a rigid script, the solution employs a dynamic decision graph (implemented using the open-source LangGraph library in the provided notebook solution) to intelligently route between steps. The result is an AI assistant that behaves less like a conventional chatbot and more like a collaborative analyst—one capable of parsing earnings call audio, critiquing slide decks, or drafting investor memos with minimal human intervention.
EXAMPLE: MULTI-TOOL COLLABORATION IN ACTION
To illuminate the multi-tool collaboration agent workflow, let’s trace a typical question-answer interaction within our deployed system:
- User Prompt – An end-user in the chat interface asks: “What is XXX’s stock performance this year, and how does it compare to its rideshare-industry peers?”
- Agent Initial Response – The agent (Amazon Nova FM orchestrator) receives the question and responds with: “Received your question. Routing to the reasoning engine…”
- Planning and Tool Selection – The agent determines the precise information required:
- The ticker symbol for the company (XXX).
- Real-time stock price and Year-to-Date (YTD) changes.
- Key financial metrics (revenue, net income, price-earnings ratio).
- Industry benchmarks (peer YTD performance, average revenue growth).
- Plan Execution Using Tool Calls – The agent then systematically calls the required tools and observes their outputs:
- Look up ticker symbol:
  Agent → WebSearchTool.lookupTicker("XXX Inc")
  WebSearchTool → Agent: returns "XXX"
- Fetch real-time stock performance:
  Agent → StockAnalysisTool.getPerformance(symbol="XXX", period="YTD")
  StockAnalysisTool → Agent: {currentPrice: [value], ytdChange: [value], 52wkRange: [value], volume: [value]}
- Retrieve company financial metrics:
  Agent → CompanyFinancialAnalysisTool.getMetrics("XXX")
  CompanyFinancialAnalysisTool → Agent: {revenueQ4_2024: xxx B, netIncomeQ4_2024: xxx M, peRatio: xxx}
- Gather industry benchmark data:
  Agent → IndustryAnalysisTool.comparePeers(symbol="XXX", sector="Rideshare")
  IndustryAnalysisTool → Agent: {avgPeerYTD: [value], avgRevenueGrowth: [value]}
- Validation Loop – The agent executes a crucial validation step: Agent: validate()
- Are all four data points present?
- Ticker: ✔
- Stock performance: ✔
- Financial metrics: ✔
- Industry benchmark: ✔
- All set—no retry needed.
If any data is missing or a tool encounters an error, the FM orchestrator triggers an error handler (with up to three retries), then resumes the plan from the failed step; a minimal sketch of this validation-and-retry logic appears after this list.
- Synthesis and Final Answer – The agent utilizes Amazon Nova Pro to synthesize all collected data points and generate a comprehensive, grounded final answer.
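The validation-and-retry behavior from the walkthrough can be expressed as a small wrapper around each tool call plus a completeness check before synthesis. The sketch below mirrors the three-attempt limit described above; the field names and stubbed tool call are placeholders.

```python
# Sketch: validating required data points and retrying failed tool calls
# (up to three attempts each) before synthesis. Tool functions are placeholders.
import time

REQUIRED_FIELDS = ["ticker", "stock_performance", "financial_metrics", "industry_benchmark"]

def call_with_retries(tool_fn, max_attempts=3, **kwargs):
    for attempt in range(1, max_attempts + 1):
        try:
            return tool_fn(**kwargs)
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(attempt)               # simple backoff before resuming the plan

def validate(collected: dict) -> list:
    # Return the data points that are still missing after plan execution.
    return [field for field in REQUIRED_FIELDS if not collected.get(field)]

collected = {}
collected["ticker"] = call_with_retries(lambda company: "XXX", company="XXX Inc")
# ... remaining tool calls gather performance, metrics, and benchmarks ...

missing = validate(collected)
if missing:
    print("Re-planning for missing data points:", missing)
else:
    print("All set; hand off to Amazon Nova Pro for synthesis.")
```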
THE ADVANTAGES OF AMAZON BEDROCK FOR SCALABLE GENERATIVE AI WORKFLOWS
This sophisticated solution is built on Amazon Bedrock’s integrated ecosystem, which offers several concrete advantages for building and scaling such advanced solutions:
- Managed Foundation Models: Amazon Bedrock provides access to top-tier FMs like Amazon Nova, complete with managed infrastructure. This eliminates the need for provisioning GPU servers or grappling with scaling complexities.
- Out-of-the-Box Multimodal Data Processing: Amazon Bedrock Data Automation stands as a ready-to-use solution for processing diverse data types—documents, images, audio, and video—into actionable insights. It can convert presentation slides to images, transcribe audio, perform OCR, and generate textual summaries or captions, all seamlessly indexed in an Amazon Bedrock knowledge base.
- Robust Knowledge Bases: Amazon Bedrock Knowledge Bases efficiently store embeddings from unstructured data and support advanced retrieval operations via similarity search, ensuring responses are factual and well-grounded.
- Flexible Agent Orchestration: Beyond LangGraph (demonstrated in this solution), developers can also leverage Amazon Bedrock Agents to construct agentic workflows. Amazon Bedrock Agents simplifies the configuration of tool flows and action groups, enabling declarative management of agentic processes.
- Seamless AWS Integration: Applications developed with open-source frameworks like LangGraph (an extension of LangChain) can effortlessly run and scale on AWS infrastructure, including Amazon Elastic Compute Cloud (Amazon EC2) or Amazon SageMaker instances. This facilitates defining directed graphs for agent orchestration, making multi-step reasoning and tool chaining straightforward.
Together, these services provide an integrated foundation for generative AI workflows, negating the need to piece together a dozen disparate systems.
CONSIDERATIONS FOR CUSTOMIZATION AND PRODUCTION READINESS
The architecture described exemplifies exceptional flexibility through its modular design. At its heart, the system uses Amazon Nova FMs, which can be precisely selected based on task complexity. Amazon Nova Micro efficiently handles straightforward tasks like classification with minimal latency. Amazon Nova Lite manages moderately complex operations with balanced performance, while Amazon Nova Pro excels at sophisticated tasks demanding advanced reasoning or the generation of comprehensive responses.
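A minimal sketch of this tiered model selection is shown below; the complexity labels and model IDs are assumptions and should be mapped to the identifiers available in your Region.

```python
# Sketch: choosing an Amazon Nova model tier by task complexity.
# The complexity labels and model IDs are illustrative assumptions.
MODEL_BY_COMPLEXITY = {
    "simple": "us.amazon.nova-micro-v1:0",    # classification, routing, short answers
    "moderate": "us.amazon.nova-lite-v1:0",   # multimodal extraction, summarization
    "complex": "us.amazon.nova-pro-v1:0",     # multi-step reasoning, report drafting
}

def select_model(task_complexity: str) -> str:
    # Default to the most capable tier when the complexity label is unknown.
    return MODEL_BY_COMPLEXITY.get(task_complexity, MODEL_BY_COMPLEXITY["complex"])

print(select_model("simple"))
```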
The inherent modularity of the solution—encompassing Amazon Nova, various tools, knowledge bases, and Amazon Bedrock Data Automation—means that each component can be independently swapped or adjusted without necessitating a complete system overhaul. Solution architects can leverage this reference architecture as a robust foundation, implementing custom integrations as required. New capabilities can be seamlessly integrated through AWS Lambda functions for specialized operations, and the LangGraph orchestration facilitates dynamic model selection and complex routing logic. This architectural approach guarantees that the system can evolve organically while maintaining peak operational efficiency and cost-effectiveness.
Transitioning to production demands meticulous design, but AWS offers inherent scalability, security, and reliability. Critical considerations include securing knowledge base content with encryption and robust access control, integrating the agent with AWS Identity and Access Management (IAM) to ensure it performs only authorized actions (e.g., verifying user permissions for access to sensitive financial data), and diligently monitoring costs (tracking Amazon Bedrock pricing and tool usage, potentially using Provisioned Throughput for consistent high-volume usage). Moreover, AWS enables a seamless transition from a notebook experiment to a full production deployment, utilizing the same foundational building blocks, integrated with appropriate AWS infrastructure like Amazon API Gateway or Lambda when deployed as a service.
TRANSFORMATIVE IMPACT: VERTICAL INDUSTRY APPLICATIONS
The versatile architecture discussed has the potential to drive significant value across various industries:
FINANCIAL SERVICES
In finance, the solution integrates multimedia RAG to unify earnings call transcripts, presentation slides (converted into searchable images), and real-time market feeds into a cohesive analytical framework. Multi-agent collaboration empowers Amazon Nova to orchestrate tools like Amazon Bedrock Data Automation for slide text extraction, semantic search for regulatory filings, and live data APIs for trend detection. This enables the system to generate actionable insights—such as identifying portfolio risks or recommending sector rebalancing—while automating content creation for investor reports or trade approvals (with human oversight). By emulating an analyst’s ability to cross-reference diverse data types, the AI assistant transforms fragmented inputs into coherent strategies.
HEALTHCARE
Healthcare workflows leverage multimedia RAG to process clinical notes, lab PDFs, and X-rays, grounding responses in peer-reviewed literature and patient audio interviews. Multi-agent collaboration excels in scenarios like triage: Amazon Nova interprets symptom descriptions, Amazon Bedrock Data Automation extracts text from scanned documents, and integrated APIs check for drug interactions, all while meticulously validating outputs against trusted sources. Content creation ranges from succinct patient summaries (“Severe pneumonia, treated with levofloxacin”) to evidence-based answers for complex queries, such as summarizing diabetes guidelines. The architecture’s stringent hallucination checks and source citations are crucial for maintaining trust in medical decision-making.
MANUFACTURING
Industrial teams can utilize multimedia RAG to index equipment manuals, sensor logs, worker audio conversations, and intricate schematic diagrams, facilitating rapid troubleshooting. Multi-agent collaboration allows Amazon Nova to correlate sensor anomalies with manual excerpts, and Amazon Bedrock Data Automation highlights faulty parts in technical drawings. The system can generate precise repair guides (e.g., “Replace valve Part 4 in schematic”) or contextualize historical maintenance data, bridging the knowledge gap between seasoned veterans and new technicians. By unifying text, images, and time-series data into actionable content, the assistant significantly reduces downtime and preserves invaluable institutional knowledge—proving that even in hardware-centric fields, AI-driven insights can dramatically enhance efficiency.
These examples underscore a common, powerful pattern: the synergy of advanced data automation, powerful multimodal models, and intelligent agentic orchestration culminates in solutions that closely mimic the assistance of a human expert. The financial AI assistant cross-checks figures and explanations with the diligence of an analyst; the clinical AI assistant correlates images and notes with the thoroughness of a diligent doctor; and the industrial AI assistant recalls diagrams and logs with the precision of a veteran engineer. All of this is underpinned by the innovative architecture we’ve explored.
CONCLUSION
The era of siloed AI models, limited to handling a single input type, is rapidly concluding. As demonstrated, the convergence of multimodal AI with an agentic workflow unleashes an unprecedented level of capability for enterprise applications. This article has illustrated how to construct such a sophisticated workflow using AWS services:
- Amazon Nova serves as the central AI orchestrator, offering its powerful multimodal and agent-friendly capabilities.
- Amazon Bedrock Data Automation automates the ingestion and indexing of complex data (documents, slides, audio) into Amazon Bedrock Knowledge Bases.
- The concept of an agentic workflow graph for reasoning and conditional logic (using frameworks like LangChain or LangGraph) orchestrates multi-step reasoning and precise tool usage.
The ultimate outcome is an AI assistant that operates with the diligence of a seasoned analyst: researching, meticulously cross-checking multiple sources, and delivering profound insights—but at machine speed and scale. This solution unequivocally demonstrates that building a sophisticated agentic AI system is no longer an academic aspiration; it is practical and entirely achievable with today’s AWS technologies. By leveraging Amazon Nova as a robust multimodal LLM and Amazon Bedrock Data Automation for comprehensive multimodal data processing, coupled with frameworks for tool orchestration like LangGraph (or Amazon Bedrock Agents), developers gain a significant head start. Many common challenges, such as OCR, intricate document parsing, or complex conversational orchestration, are effectively handled by these managed services or libraries, allowing developers to concentrate their efforts on core business logic and specific domain needs.
The sample notebook, BDA_nova_agentic, serves as an excellent starting point for experimenting with these transformative ideas. We encourage you to explore it, extend its functionalities, and tailor it to your organization’s unique requirements. The techniques discussed here represent just a fraction of the immense possibilities unlocked when combining diverse modalities with intelligent agents.