Microsoft 365 Copilot Blog

LLMs can read, but can they understand Wall Street? Benchmarking their financial IQ

juhisingh, Microsoft
May 15, 2025

Authors:
Dominick Kubica (a, +), Dylan T. Gordon (a, +), Nanami Emura (a, +), Derleen Saini (a, +), and Charlie Goldenberg (a, +, *)

(a) Department of Business Analytics, Santa Clara University – Leavey School of Business, Santa Clara, California 95053, United States

(+) These authors contributed equally.

This research was conducted as part of a Microsoft-sponsored Capstone Project, led by Juhi Singh and Bonnie Ao from the Microsoft MCAPS AI Transformation Office.

Generative AI’s role in high-stakes domains like finance will become more prevalent as AI becomes increasingly embedded in professional workflows. Financial language is uniquely complex because it is charged with forward-looking statements, hedged language, and subtle cues that challenge current models. Can today’s Large Language Models (LLMs) understand this kind of nuance?  

This question drove our collaborative research project between Santa Clara University and the Microsoft AI Transformation team. We set out to evaluate whether LLMs could outperform traditional natural language processing (NLP) tools in financial sentiment analysis, and whether they could generate useful insights when applied to real-world financial reporting such as quarterly earnings calls.

Our approach had three parts: 

  1. Benchmarking LLMs and traditional NLP tools on a standardized financial dataset.
  2. Applying these models to Microsoft’s quarterly earnings transcripts to break down sentiment by business line and to better understand the insights that can be extracted from earnings call transcripts.
  3. Analyzing results to identify optimization opportunities and assess how well sentiment analysis correlates with actual stock performance. 

What we discovered was both encouraging and eye-opening: while LLMs significantly outperformed traditional tools in grasping nuanced sentiment, they still face performance challenges. In this blog, we’ll discuss our benchmarking process, real-world findings, and recommendations to enhance tools like Microsoft Copilot. 

Evaluating the Accuracy of Models Through Benchmarking 

An objective benchmarking process is essential to evaluate performance differences between LLMs and traditional NLP tools. First, we conducted a standardized evaluation to measure how accurately various models interpret sentiment in financial texts. Given the complexities of financial language, this comparison highlights how effectively each model captures tone and nuance, offering insights for both tool selection and future model development.  

Accuracy testing was conducted using the Financial Phrase Bank dataset, developed by researchers at Aalto University. It consists of financial and earnings-related news headlines labeled as positive, neutral, or negative based on market sentiment. Nine models were compared:  

  • LLM-based/cloud platforms: Microsoft Copilot Desktop App, Copilot via Microsoft 365, Copilot App Online, ChatGPT-4o, and Google Gemini 2.0 Flash
  • Cloud-based NLP service: Azure Language AI
  • Python libraries: FinBERT (a Transformer model), NLTK, and TextBlob (the library Microsoft 365 Copilot defaults to for sentiment analysis)

Each model classified the same sentences from the dataset. Financial sentences were preprocessed for the traditional NLP libraries to ensure formatting consistency, while identical prompts were used for the LLM-based tools to reflect real-world application. After each model returned a sentiment for each sentence, accuracy was measured as the percentage of correct classifications against the pre-labeled dataset. Both the Copilot desktop app and the Copilot chat interface, whether run locally or accessed via the web, were used with the “Think Deeper” capability; Copilot via Microsoft 365 does not offer Think Deeper.
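To make the scoring concrete, here is a minimal sketch of the accuracy computation for one of the traditional baselines, TextBlob. The file name, column names, and polarity thresholds are illustrative assumptions, not details from the study.

```python
import pandas as pd
from textblob import TextBlob

# Labeled Financial Phrase Bank sentences; file and column names are illustrative.
df = pd.read_csv("financial_phrase_bank.csv")  # columns: sentence, label

def textblob_sentiment(text: str) -> str:
    """Map TextBlob polarity (-1 to 1) onto the dataset's three labels."""
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0.1:    # illustrative cutoff
        return "positive"
    if polarity < -0.1:   # illustrative cutoff
        return "negative"
    return "neutral"

df["predicted"] = df["sentence"].apply(textblob_sentiment)
accuracy = (df["predicted"] == df["label"]).mean()
print(f"TextBlob accuracy: {accuracy:.1%}")
```

The same loop applies to any of the nine models: collect one predicted label per sentence, then compare against the gold labels.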

[Figure: Sentiment classification accuracy across the nine models]

Benchmarking revealed significant differences in sentiment analysis accuracy across models. The Copilot App (both Online and Local) led the field at 82.0% accuracy, followed by ChatGPT-4o (77.6%), prompt-engineered ChatGPT (75.6%), and Gemini (68.0%). Notably, large language models working on uncleaned sentences demonstrated stronger performance, particularly in detecting nuance and hedged expressions.

Copilot through Microsoft 365 exhibited lower accuracy than the other LLM-based sentiment models in our benchmarking. We discovered that Copilot 365 defaults to the TextBlob Python library as its primary sentiment-analysis tool, whereas both the desktop app and web-hosted chat versions rely on their full LLM-based computational capabilities. This outcome is understandable given that Copilot 365 is designed primarily to optimize and extend the core capabilities of the Microsoft 365 suite.

Our earlier research focused exclusively on Microsoft 365 Copilot, where sentiment analysis results were limited in scope and accuracy. Follow-up testing with the Copilot App in both its local and online versions, however, revealed significantly better performance: 82.0% accuracy with the Think Deeper mode, the top benchmark result across all models evaluated. This finding underscores the importance of distinguishing between different Copilot interfaces when assessing their capabilities.

While the app-based versions demonstrated strong performance, there are still concerns around usability and accessibility. Currently, there is limited communication to users about the differences in performance and functionality between the Microsoft 365 version and the standalone Copilot App. Without this context, users may not know when each version is the appropriate choice.

During testing, all versions of Copilot required extra attention when processing structured data such as CSV files. In many cases, data had to be manually converted to plain text to ensure accurate interpretation. Additionally, we observed a pattern of hallucinations during post-processing, particularly in tasks involving label formatting or cleaning. For example, when asked to convert sentiment labels to lowercase, some models returned unexpected or inconsistent results, leading to mismatches in evaluation. 
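For illustration, the workaround looked roughly like this: flatten the CSV into numbered plain text before pasting it into the prompt. The file and column names below are assumptions for the sketch.

```python
import pandas as pd

# Flatten structured rows into numbered plain text before prompting.
df = pd.read_csv("earnings_sentences.csv")  # illustrative file with a "sentence" column
numbered = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(df["sentence"]))

prompt = (
    "Classify each numbered sentence below as positive, neutral, or negative. "
    "Return one lowercase label per line, in the same order.\n\n" + numbered
)
```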

Despite these challenges, Microsoft 365 Copilot showed strong performance in adjacent natural language tasks such as summarization and transcript segmentation. These strengths demonstrate its value for broader enterprise use cases even if sentiment classification remains a secondary function. 

We see a valuable opportunity for Microsoft to enhance user experience by improving transparency and documentation around Copilot’s underlying tools. Providing clear information about which models or engines are used in different contexts would help users quickly learn how to best apply the system. Additionally, if Microsoft were to relax the current restrictions preventing deeper model integration within 365 Copilot, this could significantly improve its performance in specialized domains such as financial sentiment analysis. 

Across the board, large language models performed better than traditional sentiment engines in identifying implied or nuanced sentiment. ChatGPT and Gemini delivered strong results, while FinBERT was particularly effective for finance-specific cases. However, the standout performance of the Copilot App confirms that Microsoft has already developed highly capable models. Expanding access to these capabilities within the Microsoft 365 environment could further support professionals in making sense of complex and nuanced financial data. 

Real-World Application: Business Line Sentiment vs. Stock Prediction

While benchmarking LLMs on a standardized dataset offered valuable insight into model accuracy, we wanted to assess whether these tools could convert benchmarking results into actionable insights in a real-world financial context. Specifically, can business line sentiment from earnings call transcripts provide additional business insights or competitive data, and correlate with stock price movements? 

First, we used Copilot to separate Microsoft’s quarterly earnings call transcripts, segmenting them by business line: Devices, Dynamics, Gaming, Office Commercial, Search & News Advertising, Server Products and Cloud Services, Q&A, and Overall Sentiment. Each segment was then processed using ChatGPT 4o to evaluate sentiment at the business-line level. ChatGPT-4o was selected as the sentiment analysis tool due to its consistent output and its ability to process multiple quarters of transcripts efficiently through a single standardized Python workflow. While Copilot demonstrated the highest sentiment accuracy in our evaluation, its lack of API access and reliance on manual input made it impractical for large-scale analysis. Given ChatGPT-4o’s close performance and superior accessibility, it was the most viable choice for our analysis. 
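As a rough sketch of that workflow, the per-segment scoring step might look like the following with the OpenAI Python SDK. The prompt wording and the -1 to 1 scoring scale are illustrative, not the exact prompts used in the study.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def score_segment(segment: str, excerpt: str) -> str:
    """Ask GPT-4o for a single sentiment score for one transcript segment."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # favor consistent output across quarters
        messages=[
            {"role": "system",
             "content": ("Rate the sentiment of the earnings-call excerpt on a "
                         "scale from -1 (very negative) to 1 (very positive). "
                         "Reply with the number only.")},
            {"role": "user", "content": f"Business line: {segment}\n\n{excerpt}"},
        ],
    )
    return response.choices[0].message.content.strip()
```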

[Figure: Box-and-whisker chart of quarterly business-line sentiment, with arrows marking next-day stock movement]

To read the box and whisker chart: the blue shaded box for each quarter shows the interquartile range of the sentiment data, and the black horizontal lines mark the minimum and maximum sentiment values, excluding outliers. The arrows below each quarter indicate whether the stock price rose or fell the day after that earnings call.
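A chart along these lines can be reproduced with matplotlib; the sketch below uses synthetic sentiment scores and placeholder quarters and arrow directions, not our actual results.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
quarters = ["Q3 2023", "Q4 2023", "Q1 2024", "Q2 2024"]  # placeholders
sentiment = [rng.uniform(-1, 1, 8) for _ in quarters]    # synthetic per-segment scores
stock_up = [True, False, True, True]                     # illustrative next-day direction

fig, ax = plt.subplots()
ax.boxplot(sentiment, patch_artist=True)
ax.set_xticks(range(1, len(quarters) + 1), labels=quarters)
for i, up in enumerate(stock_up, start=1):
    ax.annotate("↑" if up else "↓", (i, -1.2), ha="center", fontsize=14)
ax.set_ylim(-1.35, 1.1)
ax.set_ylabel("Sentiment score")
plt.show()
```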

Analyzing overall transcript sentiment provided limited insight into stock movement, but breaking down sentiment by business segment revealed meaningful patterns that would otherwise be overlooked. For instance, high positive sentiment in the “Search and News Advertising” segment during Q1 2025 was associated with a notable drop in stock price following the call. Conversely, in Q3 2023, Devices sentiment spiked positively and was followed by a significant rise in Microsoft’s share price. These findings suggest that sentiment within specific business segments may have a greater impact on market response than the overall tone of each earnings call. 

[Figure: SHAP bee-swarm plot of business-line sentiment contributions to stock-movement predictions]

To better understand this relationship and visualize potential causal patterns, we used a SHAP (SHapley Additive exPlanations) bee-swarm plot to map sentiment direction against stock movement. In this graph, each dot represents a business segment within a specific quarter. Red indicates high positive sentiment, while blue indicates low sentiment. The horizontal axis represents the SHAP value, indicating the contribution of each business line’s sentiment to the model’s stock price change prediction, with values further from zero signifying greater influence. 
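A bee-swarm plot of this kind can be produced with the shap library. The sketch below substitutes synthetic data and a random-forest regressor as stand-ins; it is not the study's exact model or feature set.

```python
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-ins: rows = quarters, columns = per-business-line sentiment.
rng = np.random.default_rng(0)
segments = ["Devices", "Gaming", "Search & News Advertising", "Q&A"]
X = pd.DataFrame(rng.uniform(-1, 1, size=(12, len(segments))), columns=segments)
y = rng.normal(size=12)  # stand-in for next-day stock price change

model = RandomForestRegressor(random_state=0).fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)  # bee-swarm: red = high sentiment, blue = low
```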

Importantly, dots for Search and News Advertising cluster toward the left despite being red, that is, despite carrying high positive sentiment. This reinforces our hypothesis that a positive tone in this segment may raise investor skepticism or signal over-optimism, resulting in sell-offs.

This inverse correlation raises a critical insight: not all positive sentiment within transcripts translates to positive investor sentiment. We observed similar sentiment inversions in segments like Gaming and Q&A, where optimistic statements were followed by negative stock movement. These cases illustrate a core challenge in sentiment analysis: tone alone is not a reliable predictor of investor response, which likely depends on additional context, expectations, and broader market narratives.

Ultimately, breaking down transcripts by business line was key to revealing meaningful insights, transforming high-level analysis into more granular interpretation. Integrating LLMs into workflows offers a valuable opportunity to improve forecasting accuracy and contextual understanding. Our findings suggest that large language models, paired with domain expertise, are effective for uncovering sentiment patterns that highlight promising areas for further exploration, even if definitive conclusions about stock movements can't yet be made.

Findings and Optimization Recommendations 

Reviewing our benchmark results alongside the Microsoft earnings case study, several key insights emerged: 

  • LLMs clearly outperformed traditional tools in financial sentiment analysis, especially in reading hedged language and subtle cues.
  • Traditional models require aggressive text cleaning, which often strips away nuance. LLMs were able to draw on the full context, including filler wording, that cleaning removes.
  • Despite their edge, LLMs still failed to exceed 85% accuracy. Financial experts should be able to surpass this with their tailored domain knowledge.
  • LLMs remain expensive and computationally intensive, limiting scalability for smaller teams.
  • Human creativity is still essential. While LLMs supported tasks like code generation and visuals, they lacked the intuition to guide the project’s direction. The key questions, analysis choices, and meaningful visualizations came from the data scientists. AI can assist, but it doesn’t have the perspective or creative insight to see a project through from start to finish. 

These findings reinforce that LLMs are powerful tools, but not replacements for domain experts. 

Optimization Recommendations for Microsoft Copilot 

Our hands-on testing across browser-based language models and multiple Copilot platforms surfaced key opportunities to improve Microsoft Copilot's performance and usability for financial applications. 

Optimization Opportunity 1 – Performance Transparency: Provide users with clear, real-time transparency when task handling is offloaded from large language models to simpler tools like TextBlob. Include interface cues or summary messages that explain limitations or rerouting, helping users adjust expectations and workflows accordingly. Additionally, performance differences between models should be clearly communicated so that professionals can confidently choose the right tool for the specific NLP task they are working on. 

Optimization Opportunity 2 – Improve CSV Usability: Improve native handling of CSV and tabular inputs by allowing users to paste or upload structured content without compromising analysis accuracy. Manual conversion of structured input into plain text was often necessary for accurate results. Reducing these friction points would significantly enhance Copilot’s utility in data-heavy workflows. 

Optimization Opportunity 3 – Reducing Hallucinations in Basic NLP Tasks: Focus on increasing reliability in simple NLP functions to ensure that core analysis outputs, such as sentiment classification, remain accurate and uncorrupted during formatting or conversion steps. This safeguards the value of Copilot’s more advanced capabilities by improving trust in foundational outputs.
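A simple guardrail in this spirit: validate model-returned labels after every formatting step, failing loudly if anything was corrupted. The label set matches our benchmark; the helper itself is our illustration.

```python
VALID_LABELS = {"positive", "neutral", "negative"}

def normalize_labels(labels: list[str]) -> list[str]:
    """Lowercase model-returned labels, failing loudly if any were corrupted."""
    cleaned = [label.strip().lower() for label in labels]
    unexpected = [label for label in cleaned if label not in VALID_LABELS]
    if unexpected:
        raise ValueError(f"Unexpected labels after cleaning: {unexpected}")
    return cleaned
```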

Conclusion 

Financial language is layered with strategy, hedging, and nuance, making it a stress test for any sentiment analysis tool. Our research found that while LLMs outperform traditional NLP libraries in detecting financial sentiment, they still face architectural, economic, and reliability barriers to adoption at scale. 

When used carefully, however, LLMs can help highlight patterns or shifts that may complement traditional analysis, especially when applied at the business-line level. Our case study on Microsoft earnings calls showed how a segmented, model-enhanced approach can connect executive tone with investor behavior in powerful ways. 

Looking forward, tools like Microsoft Copilot have immense potential, but unlocking that potential will require deeper integration with LLM capabilities, better transparency, and a stronger focus on enterprise needs. 

LLMs, paired with human intelligence, will reshape the future of financial analysis, not by replacing experts, but by providing them with a powerful, new toolset. 

Note: ChatGPT is developed by OpenAI and operates on Microsoft’s Azure supercomputing infrastructure. While Microsoft and OpenAI collaborate in the development and delivery of AI services, OpenAI remains an independent entity. Azure OpenAI Service provides enterprise-grade access to OpenAI models. 

Updated May 14, 2025
Version 1.0