An IMF working paper, "How Effectively Can Current LLMs Analyze Macrofinancial Issues?" (WP/26/35), evaluates the capability of advanced Large Language Models (LLMs), including GPT-o1, GPT-4.1, and GPT-5, to analyze macrofinancial coverage in Article IV staff reports. These reports are the cornerstone of IMF surveillance: they result from the annual "Article IV consultations" in which IMF economists visit member countries to assess economic and financial developments and provide policy advice.
Tested on a dataset of 543 reports (2016-2024), the latest models achieve 71-75% accuracy on qualitative ratings and 76-81% on binary questions relative to human benchmarks. While the models exhibit high reproducibility (roughly 88%) and handle structured factual extraction effectively, they show a consistent optimistic bias and lower variance in their ratings than human reviewers.
Key Findings on LLM Performance in Economic Analysis
Model Hierarchy: Advanced models like GPT-5 (unified router) and GPT-o1 (reasoning-optimized) substantially outperform earlier iterations in interpreting complex macrofinancial surveillance.
Accuracy Benchmarks: On 2024 data, models reached 74-75% accuracy on ratings with refined prompting, though exact matches on granular scales remain low.
Optimistic Skew: LLMs tend to rate report quality more favorably than humans, potentially overlooking weaker analyses or missing subtle context that an expert would flag.
High Consistency: Models show an 88% consistency rate on repeated runs, suggesting they can provide a reliable baseline for human reviewers.
Justification Features: Integrating "justification" requirements in prompts helps human supervisors detect where an LLM’s interpretation of specialized content diverges from expert meaning.
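The headline metrics in the findings above (accuracy against human benchmarks, optimistic skew, run-to-run consistency) are straightforward to compute once human and model ratings are paired. A minimal sketch using hypothetical ratings on an assumed 1-5 scale (the paper's actual data and scoring rules are not reproduced here):

```python
from statistics import mean

# Hypothetical paired ratings on a 1-5 scale (illustrative, not the paper's data).
human = [3, 4, 2, 5, 3, 4, 3, 2]
llm_run1 = [3, 4, 3, 5, 4, 4, 3, 4]
llm_run2 = [3, 4, 3, 5, 4, 4, 4, 4]

# Accuracy: share of model ratings within one point of the human benchmark
# (exact matches on granular scales are much rarer, as the study notes).
accuracy = mean(abs(h - m) <= 1 for h, m in zip(human, llm_run1))

# Optimistic skew: mean signed gap (positive = model rates more favorably).
skew = mean(m - h for h, m in zip(human, llm_run1))

# Reproducibility: share of identical ratings across two repeated runs.
reproducibility = mean(a == b for a, b in zip(llm_run1, llm_run2))

print(accuracy, skew, reproducibility)
```

With these toy numbers the model is within one point of the human rating 87.5% of the time, rates reports 0.5 points higher on average, and repeats itself on 87.5% of items, mirroring the pattern (high consistency, mild optimism) the study reports.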
What is the "Unified Router" (GPT-5) in Economic Tasks? In the context of this study, the "unified router" refers to GPT-5's architecture, which directs each query to specialized sub-modes: it handles macrofinancial material by switching between a reasoning-optimized mode and a fact-extraction mode depending on the complexity of the staff report content. The study finds this architecture improves the match rate with human economists on nuanced binary questions where standard models often fail.
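The paper does not publish the router's internals, so the following is a purely illustrative sketch of the routing idea: dispatch each query to a reasoning path or a fast extraction path based on a crude complexity signal. The cue list, function name, and mode labels are all hypothetical.

```python
def route_query(question: str) -> str:
    """Hypothetical dispatcher: send nuanced analytical questions to a
    reasoning-optimized mode and simple lookups to a fact-extraction mode."""
    # Crude heuristic standing in for whatever signal a real router uses.
    analytical_cues = ("why", "assess", "compare", "risks", "consistent")
    if any(cue in question.lower() for cue in analytical_cues):
        return "reasoning-optimized"
    return "fact-extraction"

print(route_query("What is the reported fiscal deficit for 2024?"))
# -> fact-extraction
print(route_query("Assess whether the report's risk discussion is internally consistent."))
# -> reasoning-optimized
```

The design point is that factual extraction (where the study finds LLMs already strong) does not need the expensive reasoning path, while nuanced binary judgments benefit from it.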
Policy Relevance: Digital Transformation and Indian Surveillance
Operationalizing "Second Opinion" Protocols: For the Ministry of Finance and RBI, the 88% reproducibility rate offers a practical basis to institutionalize AI as a routine consistency check on domestic economic reports before they are finalized for international surveillance.
Reducing the Documentation Load: Given India's rank as the 3rd most diverse trade partner in the Global South, LLMs offer a practical means to rapidly triage cross-border financial reports and flag "high-risk" surveillance gaps that require human deep-dives.
Link to GDP Re-benchmarking: As India moves to the 2022-23 GDP base year, LLMs can be used to audit the technical fidelity with which new data sources (such as ASUSE and PLFS) are integrated into macrofinancial models.
Mitigating "Optimism Bias" in Growth Projections: Indian policymakers must account for the LLM's inherent optimistic skew. In practice, this means AI-generated economic outlooks should be reviewed and adjusted by human experts so that projections, such as the 7.6% growth projected for FY 2025-26, remain grounded in "off-text" reality.
Standardizing Digital Identity in Finance: The use of LLMs to construct sentiment and risk measures from firm-level financial statements aligns with SEBI's social media disclosure mandates, creating a verified digital audit trail for market participants.
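One hedged way to operationalize the human adjustment for optimism bias described above is a simple calibration step: estimate the LLM's average over-rating on a validation set that has expert benchmarks, then subtract that offset from new AI-generated ratings before expert review. The data, function name, and scale below are illustrative assumptions, not a prescribed methodology.

```python
from statistics import mean

# Hypothetical validation set: expert vs. LLM ratings on a 1-5 scale.
expert = [3, 2, 4, 3, 3]
llm = [4, 3, 4, 4, 3]

# Estimated optimism offset (positive means the LLM over-rates on average).
offset = mean(m - e for m, e in zip(llm, expert))

def debias(rating: float) -> float:
    """Subtract the estimated optimism offset; experts still review the result."""
    return rating - offset

print(offset, debias(4.0))
```

This keeps the human in the loop: the calibration only shifts the starting point of the review, it does not replace expert judgment on individual reports.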
Follow the full research here: IMF Working Paper: LLMs for Macrofinancial Issues