
Benchmarking document information localization with Amazon Nova

By Advanced AI Editor | August 19, 2025


Every day, enterprises process thousands of documents containing critical business information. From invoices and purchase orders to forms and contracts, accurately locating and extracting specific fields has traditionally been one of the most complex challenges in document processing pipelines. Although optical character recognition (OCR) can tell us what text exists in a document, determining where specific information is located has required sophisticated computer vision solutions.

The evolution of this field illustrates the complexity of the challenge. Early object detection approaches like YOLO (You Only Look Once) revolutionized the field by reformulating object detection as a regression problem, enabling real-time detection. RetinaNet advanced this further by addressing class imbalance issues through Focal Loss, and DETR introduced transformer-based architectures to minimize hand-designed components. However, these approaches shared common limitations: they required extensive training data, complex model architectures, and significant expertise to implement and maintain.

The emergence of multimodal large language models (LLMs) represents a paradigm shift in document processing. These models combine advanced vision understanding with natural language processing capabilities, offering several groundbreaking advantages:

Minimized use of specialized computer vision architectures
Zero-shot capabilities without the need for supervised learning
Natural language interfaces for specifying location tasks
Flexible adaptation to different document types

This post demonstrates how to use foundation models (FMs) in Amazon Bedrock, specifically Amazon Nova Pro, to achieve high-accuracy document field localization while dramatically simplifying implementation. We show how these models can precisely locate and interpret document fields with minimal up-front effort, reducing processing errors and manual intervention. Through comprehensive benchmarking on the FATURA dataset, we provide performance results and practical implementation guidance.

Understanding document information localization

Document information localization goes beyond traditional text extraction by identifying the precise spatial position of information within documents. Although OCR tells us what text exists, localization tells us where specific information resides—a crucial distinction for modern document processing workflows. This capability enables critical business operations ranging from automated quality checks and sensitive data redaction to intelligent document comparison and validation.

Traditional approaches to this challenge relied on a combination of rule-based systems and specialized computer vision models. These solutions often required extensive training data, careful template matching, and continuous maintenance to handle document variations. Financial institutions, for instance, would need separate models and rules for each type of invoice or form they processed, making scalability a significant challenge.

Multimodal models with localization capabilities, such as those available on Amazon Bedrock, fundamentally change this paradigm. Rather than requiring complex computer vision architectures or extensive training data, these multimodal LLMs can understand both the visual layout and semantic meaning of documents through natural language interactions. By using models with the capability to localize, organizations can implement robust document localization with significantly reduced technical overhead and greater adaptability to new document types.

Solution overview

We designed a simple localization solution that takes a document image and text prompt as input, processes it through selected FMs on Amazon Bedrock, and returns the field locations using either absolute or normalized coordinates. The solution implements two distinct prompting strategies for document field localization:

Image dimension strategy – Works with absolute pixel coordinates, providing explicit image dimensions and requesting bounding box locations based on the document’s actual size
Scaled coordinate strategy – Uses a normalized 0–1000 coordinate system, making it more flexible across different document sizes and formats

The solution has a modular design to allow for straightforward extension to support custom field schemas through configuration updates rather than code changes. This flexibility, combined with the scalability of Amazon Bedrock, makes the solution suitable for both small-scale document processing and enterprise-wide deployment. In the following sections, we demonstrate the setup and implementation strategies used in our solution for document field localization using Amazon Bedrock FMs. You can see more details in our GitHub repository.
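To make the two coordinate conventions concrete, the following minimal sketch (illustrative only, not part of the repository code) converts a bounding box returned under the scaled coordinate strategy back to absolute pixels for a given page size:

def scaled_to_pixels(bbox, image_width, image_height, norm=1000):
    """Convert an [x1, y1, x2, y2] box on a 0-norm scale to absolute pixel coordinates."""
    x1, y1, x2, y2 = bbox
    return [
        round(x1 / norm * image_width),
        round(y1 / norm * image_height),
        round(x2 / norm * image_width),
        round(y2 / norm * image_height),
    ]

# Example: a predicted box on an A4 page rendered at 300 DPI (2480 x 3508 pixels)
print(scaled_to_pixels([120, 85, 430, 140], 2480, 3508))  # [298, 298, 1066, 491]

Boxes produced under the image dimension strategy are already expressed in pixels, so no conversion is needed there.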

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account with Amazon Bedrock access
Permissions to use Amazon Nova Pro
Python 3.8+ with the boto3 library installed

Initial setup

Complete the following setup steps:

Configure the Amazon Bedrock runtime client with appropriate retry logic and timeout settings:

import boto3
from botocore.config import Config

# Configure Bedrock client with retry logic
BEDROCK_CONFIG = Config(
    region_name="us-west-2",
    signature_version="v4",
    read_timeout=500,
    retries={
        "max_attempts": 10,
        "mode": "adaptive"
    }
)

# Initialize Bedrock runtime client
bedrock_runtime = boto3.client("bedrock-runtime", config=BEDROCK_CONFIG)

Define your field configuration to specify which elements to locate in your documents:

# Sample field configuration
field_config = {
    "invoice_number": {"type": "string", "required": True},
    "total_amount": {"type": "currency", "required": True},
    "date": {"type": "date", "required": True}
}

Initialize the BoundingBoxExtractor with your chosen model and strategy:

extractor = BoundingBoxExtractor(
    model_id=NOVA_PRO_MODEL_ID,  # or other FMs on Amazon Bedrock
    prompt_template_path="path/to/prompt/template",
    field_config=field_config,
    norm=None  # Set to 1000 for the scaled coordinate strategy
)

# Process a document
bboxes, metadata = extractor.get_bboxes(
    document_image=document_image,
    document_key="invoice_001"  # Optional tracking key
)
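The BoundingBoxExtractor class is provided in the GitHub repository. For orientation only, the call it makes to Amazon Nova Pro can be approximated with the Amazon Bedrock Converse API roughly as in the following sketch; the model ID, helper name, and JSON handling are illustrative assumptions, not the repository implementation:

import json

def locate_fields(image_bytes, prompt, model_id="us.amazon.nova-pro-v1:0"):
    """Send one document image plus a localization prompt and parse the JSON reply."""
    response = bedrock_runtime.converse(
        modelId=model_id,
        messages=[{
            "role": "user",
            "content": [
                {"image": {"format": "jpeg", "source": {"bytes": image_bytes}}},
                {"text": prompt},
            ],
        }],
        inferenceConfig={"maxTokens": 2000, "temperature": 0},
    )
    output_text = response["output"]["message"]["content"][0]["text"]
    # The prompts below request JSON only; production code should also handle
    # fenced or malformed output before parsing
    return json.loads(output_text), response.get("usage", {})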

Implement prompting strategies

We test two prompt strategies in this workflow: image dimension and scaled coordinate.

The following is a sample prompt template for the image dimension strategy:

"""
Your task is to detect and localize objects in images with high precision.
Analyze each provided image (width = {w} pixels, height = {h} pixels) and return only a JSON object with bounding box data for detected objects.

Output Requirements:
1. Use absolute pixel coordinates based on provided width and height.
2. Ensure high accuracy and tight-fitting bounding boxes.

Detected Object Structure:
- "element": Use one of these labels exactly: {elements}
- "bbox": Array with coordinates [x1, y1, x2, y2] in absolute pixel values.

JSON Structure:
```json
{schema}
```

Provide only the specified JSON format without extra information.
"""

The following is a sample prompt template for the scaled coordinate strategy:

"""
Your task is to detect and localize objects in images with high precision.
Analyze each provided image and return only a JSON object with bounding box data for detected objects.

Output Requirements:
Use (x1, y1, x2, y2) format for bounding box coordinates, scaled between 0 and 1000.

Detected Object Structure:
- "element": Use one of these labels exactly: {elements}
- "bbox": Array [x1, y1, x2, y2] scaled between 0 and 1000.

JSON Structure:
```json
{schema}
```

Provide only the specified JSON format without extra information.
"""
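Both templates leave {elements} and {schema} (plus {w} and {h} for the image dimension strategy) as placeholders. A minimal, illustrative way to fill them from the field configuration is shown below; the schema shape and the scaled_prompt_template variable are assumptions for this example, not the exact artifacts in the repository:

import json

field_names = list(field_config.keys())  # ["invoice_number", "total_amount", "date"]

# Hypothetical output schema: one entry per detected object
schema_example = json.dumps(
    [{"element": "<label>", "bbox": ["<x1>", "<y1>", "<x2>", "<y2>"]}],
    indent=2,
)

prompt = scaled_prompt_template.format(
    elements=", ".join(field_names),
    schema=schema_example,
)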

Evaluate performance

We implement evaluation metrics to monitor accuracy:

evaluator = BBoxEvaluator(field_config=field_config)
evaluator.set_iou_threshold(0.5)  # Adjust based on requirements
evaluator.set_margin_percent(5)   # Tolerance for position matching

# Evaluate predictions
results = evaluator.evaluate(predictions, ground_truth)
print(f"Mean Average Precision: {results['mean_ap']:.4f}")

This implementation provides a robust foundation for document field localization while maintaining flexibility for different use cases and document types. The choice between image dimension and scaled coordinate strategies depends on your specific accuracy requirements and document variation.
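For reference, the Intersection over Union (IoU) metric behind the 0.5 matching threshold measures how much a predicted box overlaps its ground-truth box. The following is a self-contained sketch, independent of the BBoxEvaluator class in the repository:

def iou(box_a, box_b):
    """Intersection over Union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# With a 0.5 threshold, this prediction would count as a correct localization
print(iou([100, 100, 300, 200], [110, 105, 310, 210]))  # ~0.79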

Benchmarking results

We conducted our benchmarking study using FATURA, a public invoice dataset specifically designed for document understanding tasks. The dataset comprises 10,000 single-page invoices saved as JPEG images, representing 50 distinct layout templates with 200 invoices per template. Each document is annotated with 24 key fields, including invoice numbers, dates, line items, and total amounts. The annotations provide both the text values and precise bounding box coordinates in JSON format, making it ideal for evaluating field localization tasks. The dataset has the following key characteristics:

Documents: 10,000 invoices (JPEG format)
Templates: 50 distinct layouts (200 documents each)
Fields per document: 24 annotated fields
Annotation format: JSON with bounding boxes and text values
Field types: Invoice numbers, dates, addresses, line items, amounts, taxes, totals
Image resolution: Standard A4 size at 300 DPI
Language: English

The following figure shows sample invoice templates showcasing layout variation.

The following figure is an example of annotation visualization.


Before conducting the full-scale benchmark, we performed an initial experiment to determine the optimal prompting strategy. We selected a representative subset of 50 images, comprising 5 samples from 10 different templates, and evaluated three distinct approaches:

Image dimension:

Method: Provides explicit pixel dimensions and requests absolute coordinate bounding boxes
Input: Image bytes, image dimensions, output schema

Scaled coordinate:

Method: Uses normalized 0-1000 coordinate system
Input: Image bytes, output schema

Added gridlines:

Method: Enhances image with visual gridlines at fixed intervals
Input: Modified image with gridlines bytes, image dimensions, output schema
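In the added gridlines approach, the image is preprocessed before being sent to the model. A minimal Pillow sketch of such preprocessing follows; the interval and line styling are illustrative, not the exact values used in the experiment:

from PIL import Image, ImageDraw

def add_gridlines(image, interval=100):
    """Overlay light gridlines every `interval` pixels as a visual coordinate aid."""
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    w, h = annotated.size
    for x in range(0, w, interval):
        draw.line([(x, 0), (x, h)], fill="red", width=1)
    for y in range(0, h, interval):
        draw.line([(0, y), (w, y)], fill="red", width=1)
    return annotated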

The following figure compares performance for different approaches for Mean Average Precision (mAP).

Mean AP Distribution

Building on insights from our initial strategy evaluation, we conducted benchmarking using the complete FATURA dataset of 10,000 documents. We employed the scaled coordinate approach for Amazon Nova models, based on their respective optimal performance characteristics from our initial testing. Our evaluation framework assessed Amazon Nova Pro through standard metrics, including Intersection over Union (IoU) and Average Precision (AP). The evaluation spanned all 50 distinct invoice templates, using an IoU threshold of 0.5 and a 5% margin tolerance for field positioning.

The following are our sample results in JSON:

{
  "template": "template1",
  "instance": "Instance0",
  "metrics": {
    "mean_ap": 0.8421052631578947,
    "field_scores": {
      "TABLE": [0.9771107575829314, 1.0, 1.0, 1.0, 1.0],
      "BUYER": [0.3842328422050217, 0.0, 0.0, 0, 0.0],
      "DATE": [0.9415158516000428, 1.0, 1.0, 1.0, 1.0],
      "DISCOUNT": [0.8773709977744115, 1.0, 1.0, 1.0, 1.0],
      "DUE_DATE": [0.9338410331219548, 1.0, 1.0, 1.0, 1.0],
      "GSTIN_BUYER": [0.8868145680064249, 1.0, 1.0, 1.0, 1.0],
      "NOTE": [0.7926162009357707, 1.0, 1.0, 1.0, 1.0],
      "PAYMENT_DETAILS": [0.9517931284002012, 1.0, 1.0, 1.0, 1.0],
      "PO_NUMBER": [0.8454266053075804, 1.0, 1.0, 1.0, 1.0],
      "SELLER_ADDRESS": [0.9687004508445741, 1.0, 1.0, 1.0, 1.0],
      "SELLER_EMAIL": [0.8771026147909002, 1.0, 1.0, 1.0, 1.0],
      "SELLER_SITE": [0.8715647216012751, 1.0, 1.0, 1.0, 1.0],
      "SUB_TOTAL": [0.8049954543667662, 1.0, 1.0, 1.0, 1.0],
      "TAX": [0.8751563641702513, 1.0, 1.0, 1.0, 1.0],
      "TITLE": [0.850667327423512, 1.0, 1.0, 1.0, 1.0],
      "TOTAL": [0.7226784112051814, 1.0, 1.0, 1.0, 1.0],
      "TOTAL_WORDS": [0.9099353099528785, 1.0, 1.0, 1.0, 1.0],
      "GSTIN_SELLER": [0.87170328009624, 1.0, 1.0, 1.0, 1.0],
      "LOGO": [0.679425211111111, 1.0, 1.0, 1.0, 1.0]
    }
  },
  "metadata": {
    "usage": {
      "inputTokens": 2250,
      "outputTokens": 639,
      "totalTokens": 2889
    },
    "metrics": {
      "latencyMs": 17535
    }
  }
}
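Each processed document yields a record like the one above. Aggregating per-document scores into a dataset-level figure is then a simple average; the sketch below assumes one JSON file per document with the structure shown (the directory layout is illustrative):

import json
from pathlib import Path
from statistics import mean

def overall_map(results_dir):
    """Average per-document mean_ap across a directory of result JSON files."""
    scores = [
        json.loads(path.read_text())["metrics"]["mean_ap"]
        for path in Path(results_dir).glob("*.json")
    ]
    return mean(scores) if scores else 0.0

print(f"Dataset-level mAP: {overall_map('results/'):.4f}")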

The following figure is an example of successful localization for Amazon Nova Pro.

The results demonstrate Amazon Nova Pro’s strong performance in document field localization. Amazon Nova Pro achieved a mAP of 0.8305. It demonstrated consistent performance across various document layouts, achieving a mAP above 0.80 across 45 of 50 templates, with the lowest template-specific mAP being 0.665. Although Amazon Nova Pro showed relatively high processing failures (170 out of 10,000 images), it still maintained high overall performance. Most low AP results were attributed to either complete processing failures (particularly over-refusal by its guardrail filters and malformed JSON output) or field misclassifications (particularly confusion between similar fields, such as buyer vs. seller addresses).

The following table summarizes the overall performance metrics.

Model              Mean IoU    Mean AP
Amazon Nova Pro    0.7423      0.8331

The following graph shows the performance distribution for individual field extractions, covering approximately 20 labels across the 10,000 documents.

Field-specific analysis reveals that Amazon Nova Pro excels at locating structured fields like invoice numbers and dates, consistently achieving precision and recall scores above 0.85. It demonstrates particularly strong performance with text fields, maintaining robust accuracy even when dealing with varying currency formats and decimal representations. This resilience to format variations makes it especially valuable for processing documents from multiple sources or regions.

The following graph summarizes field-specific performance. It shows the AP success percentage for each label, across all documents for each model, sorted from highest to lowest success.

Conclusion

This benchmarking study demonstrates the significant advances in document field localization by multimodal FMs. Through comprehensive testing on the FATURA dataset, we’ve shown that these models can effectively locate and extract document fields with minimal setup effort, dramatically simplifying traditional computer vision workflows. Amazon Nova Pro emerges as an excellent choice for enterprise document processing, delivering a mAP of 0.8305 with consistent performance across diverse document types. Looking ahead, we see several promising directions for further optimization. Future work could explore extending the solution in agentic workflows to support more complex document types and field relationships.

To get started with your own implementation, you can find the complete solution code in our GitHub repository. We also recommend reviewing the Amazon Bedrock documentation for the latest model capabilities and best practices.

About the authors

Ryan Razkenari is a Deep Learning Architect at the AWS Generative AI Innovation Center, where he uses his expertise to create cutting-edge AI solutions. With a strong background in AI and analytics, he is passionate about building innovative technologies that address real-world challenges for AWS customers.

Harpreet Cheema is a Deep Learning Architect at the AWS Generative AI Innovation Center. He is very passionate about the field of machine learning and about tackling different problems in the ML domain. In his role, he focuses on developing and delivering generative AI-focused solutions for real-world applications.

Spencer Romo is a Senior Data Scientist with extensive experience in deep learning applications. He specializes in intelligent document processing while maintaining broad expertise in computer vision, natural language processing, and signal processing. Spencer’s innovative work in remote sensing has resulted in multiple patents. Based in Austin, Texas, Spencer loves working directly with customers to understand their unique problems and identify impactful AI solutions. Outside of work, Spencer competes in the 24 Hours of Lemons racing series, embracing the challenge of high-performance driving on a budget.

Mun Kim is a Machine Learning Engineer at the AWS Generative AI Innovation Center. Mun brings expertise in building machine learning science and platforms that help customers harness the power of generative AI technologies. He works closely with AWS customers to accelerate their AI adoption journey and unlock new business value.

Wan Chen is an Applied Science Manager at the Generative AI Innovation Center. As an ML/AI veteran in the tech industry, she has a wide range of expertise spanning traditional machine learning, recommender systems, deep learning, and generative AI. She is a strong believer in superintelligence and is passionate about pushing the boundaries of AI research and application to enhance human life and drive business growth. She holds a Ph.D. in Applied Mathematics from the University of British Columbia and worked as a postdoctoral fellow at the University of Oxford.


