Semantic Embeddings: Do News Sentiment or Informational Content Drive Real Estate Prices?
- thomasssschannnn
- Oct 25, 2025
- 7 min read
Updated: Nov 19, 2025
This is a brief introduction to one of my ongoing research projects:
Rationale: While a growing literature has shown that financial news sentiment can predict real estate investment returns, little evidence identifies the underlying mechanism: whether market reactions are driven by the emotional tone of news or by its informational content. To address this, I develop a transfer-learning-based semantic embedding framework leveraging state-of-the-art large language models (LLMs), applied to 900,000 Wall Street Journal articles and state-level news (1990–2025), moving beyond pre-LLM bag-of-words sentiment approaches to capture richer, analyst-like semantic representations. By building two parallel models (one rational, one sentiment-driven) based on semantic embeddings and evaluating their relative explanatory power, the baseline analysis provides the first direct evidence on whether U.S. real estate market reactions are primarily the result of rational, information-based decisions or driven by non-rational, sentiment-driven behavior. In doing so, this framework also offers the first large-scale empirical test of whether the U.S. real estate market conforms to the (weak or) semi-strong form of market efficiency as defined by Fama (1970, 1991), thereby contributing to the long-standing debate between the efficient market hypothesis (EMH) and behavioral interpretations of real estate dynamics. Further, I will examine three pivotal periods (the 2008 crisis, the 2020 surge, and the post-AI era) to compare shifts in real estate pricing mechanisms and identify key drivers of market mispricing. Lastly, as an additional test of practical relevance, I implement a cross-sectional portfolio strategy based on semantic embeddings to forecast real estate returns, offering preliminary evidence of their applicability to asset pricing.
Key Research Questions:
Which plays a more dominant role in shaping real estate pricing — the emotional tone of news or the informational content embedded within it? In other words, is the U.S. housing market primarily efficient and information-based, or does it behave more like a sentiment-driven, behavioral market?
Research Gaps
Mechanism Gap: Existing studies of real estate and investor sentiment predominantly focus on predictive power, yet the transmission mechanisms through which investor sentiment influences market reactions remain largely underexplored. We build on the debate regarding whether real estate market movements are driven by contextual fundamentals or news sentiment, as highlighted in Soo (2018).
Large-scale, direct empirical evidence testing the Efficient Market Hypothesis (EMH) against behavioral explanations in U.S. real estate markets remains scarce. Real estate has long served as a principal arena for testing and critiquing the EMH, especially in the aftermath of the 2008 financial crisis, with Shiller’s (2015) Nobel-winning work offering influential critiques. Yet, due to methodological limitations, the existing literature relies primarily on indirect market proxies or qualitative survey-based approaches, leaving a gap for large-scale empirical validation.
Technically, existing studies of news sentiment mainly rely on three types of pre-LLM approaches: (i) “bag-of-words” approaches such as Loughran–McDonald (2011), Harvard IV-4 (Stone et al., 1966), or the topic-based SESTM in Ke, Kelly, and Xiu (2020); (ii) “static word embedding” approaches such as Word2Vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014); and (iii) “early contextual embedding” approaches such as BERT (Devlin et al., 2018) or FinBERT (Yang et al., 2020), which require extensive fine-tuning. These approaches often oversimplify the nuanced information in financial news or lack robustness across contexts, limiting their effectiveness in capturing complex market dynamics.
Real estate gap in modern EMH theory: In modern Machine Learning–based asset pricing frameworks, investor learning dynamics have increasingly become a core component of how informational efficiency is conceptualized (Nagel, 2025; Molavi et al., 2024; Da, Nagel & Xiu, 2024). In particular, Martin & Nagel (2022) demonstrate that, after controlling for dynamic belief updating, cross-sectional return differentials largely vanish after 2019, indicating increased informational efficiency and reduced arbitrage opportunities in equity markets in recent years. Building on this insight, we examine whether similar dynamics are emerging in the real estate market, specifically whether housing markets are undergoing a fundamental shift in pricing mechanisms compared to the pre-2008 era. In the AI era, investors’ capacity to process rational information has grown rapidly, powered by AI and high-dimensional data. We therefore ask whether sentiment continues to dominate real estate pricing, or whether rational information processing now plays a more central role.
Overview & Contributions
· First application of semantic embeddings to real estate
Introduces a novel framework using state-of-the-art LLM-based semantic embeddings to extract rich, analyst-like semantic signals from 900,000 Wall Street Journal and local newspaper articles (1990–2025), moving beyond traditional bag-of-words sentiment proxies, which oversimplify textual information and fail to capture complex semantic relationships. Unlike prompt-based approaches that use LLMs to generate scores, our framework operates directly at the vector level, leveraging the latent semantic structure encoded in embeddings. This design supports reproducibility, stability, and fuller preservation of semantic meaning.
· First large-scale empirical test of the EMH vs. behavioral finance in real estate
Real estate has long served as a central battleground in the debate over the efficient market hypothesis (EMH), with Shiller’s Nobel-winning work offering influential but largely qualitative critiques. Our study provides the first large-scale, direct empirical evidence on whether U.S. housing market reactions are driven by rational information processing or sentiment-driven behavior, contributing directly to the real estate market efficiency literature.
· Uncovering underlying mechanism of news sentiment
We provide the first systematic examination of how news sentiment transmits into housing price movements, addressing the mechanism gap highlighted by Soo (2018). This gap persists because real estate, in contrast to equity markets, lacks direct and robust proxies for sentiment and rationality. We address this by leveraging semantic embeddings to build two parallel models (one rational, one sentiment-driven), drawing on nearly all publicly available U.S. financial news from 1990 to 2025. Our Machine Learning (ML) approach establishes a replicable foundation for future research aiming to detect sentiment-driven versus information-driven pricing dynamics.
· Novel Machine Learning framework
We establish a replicable foundation for future research aiming to detect “sentiment-driven” versus “information-driven” pricing dynamics in real estate. The Machine Learning mechanism proceeds in two steps:
(i) We employ state-of-the-art LLMs, including Qwen3 (2025), OpenAI text-embedding-3-large (2024), and the Gemini API (2025), to transform financial news into high-dimensional semantic vectors (ranging from 768 to 4,096 dimensions) that capture both emotional tone and informational content.
(ii) We construct two parallel econometric models that link these vectors to subsequent housing price movements: one driven primarily by rational, information-based content, and the other focusing exclusively on sentiment. Together, they allow both explanatory and predictive analysis.
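The two steps above can be illustrated with a minimal sketch. The post does not specify how the sentiment and informational components are separated, so everything here is an assumption made for illustration: the orthogonal-projection split, the `sentiment_axis` and `information_axis` directions, the toy 3-dimensional vectors, and the made-up returns. It only shows the general idea of scoring each article twice and comparing the explanatory power (R²) of the two resulting models.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def split_embedding(vec, axis):
    """Return (score along axis, residual orthogonal to axis)."""
    norm = math.sqrt(dot(axis, axis))
    unit = [x / norm for x in axis]
    score = dot(vec, unit)
    residual = [v - score * u for v, u in zip(vec, unit)]
    return score, residual

def r_squared(x, y):
    """R^2 from a univariate OLS regression of y on x (with intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy ** 2 / (sxx * syy)

# Toy 3-dimensional "embeddings"; real ones would have 768 to 4,096 dimensions.
embeddings = [[1.0, 0.0, 0.2], [-1.0, 1.0, 0.1], [1.0, 2.0, 0.0], [-1.0, 3.0, 0.3]]
next_returns = [0.1, 0.9, 2.1, 3.0]   # made-up subsequent price changes
sentiment_axis = [1.0, 0.0, 0.0]      # hypothetical tone direction (assumption)
information_axis = [0.0, 1.0, 0.0]    # hypothetical fundamentals direction (assumption)

sentiment_scores, information_scores = [], []
for vec in embeddings:
    score, residual = split_embedding(vec, sentiment_axis)
    sentiment_scores.append(score)
    # Score the sentiment-free residual along the information axis.
    information_scores.append(dot(residual, information_axis))

r2_sentiment = r_squared(sentiment_scores, next_returns)
r2_information = r_squared(information_scores, next_returns)
print(f"R^2 sentiment model:   {r2_sentiment:.3f}")
print(f"R^2 information model: {r2_information:.3f}")
```

In this toy data the information score explains more of the return variation than the sentiment score, but only because the numbers were constructed that way; it says nothing about the actual results.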
· Event study
Reexamines (i) the 2008 subprime crisis and (ii) the 2020 COVID-19 housing surge to assess whether real estate mispricing stems from rational information processing or sentiment-driven distortions, and whether the big-data era (2020) marks a structural shift in market dynamics compared to pre-2008. By uncovering the transmission patterns of sentiment during crises, we provide a foundational framework for anticipating future housing market disruptions.
· Heterogeneity analysis (Temporal & Spatial)
Tests whether semantic signals vary across market phases (upturns vs. downturns), regions, demographics, credit conditions, and historical periods, offering a nuanced view of pricing dynamics and informational efficiency.
· Model comparison for embedding utilization
Evaluates neural networks, tree-based methods, and simple OLS regression to identify the most effective approach for extracting predictive signals from the high-dimensional vectors produced by semantic embeddings, as motivated by Giglio et al. (2020) and Chen et al. (2023).
· Practical relevance (long-short strategy)
Beyond pure empirics, we test the practical relevance of semantic embeddings by implementing a long-short strategy across rental returns in 50 U.S. states, evaluating whether these signals generate economically meaningful returns and function as asset-pricing factors. Our cross-sectional long-short tests are not intended to reject the EMH, but to examine whether rational information processing leads to systematic pricing differentials across U.S. states.
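As a rough illustration of what a cross-sectional long-short test looks like, the sketch below ranks states by a signal, goes long the top group and short the bottom group, and reports the equal-weighted return spread for one period. The function name `long_short_return`, the quantile cutoff, the equal weighting, and the five-state toy numbers are all assumptions for illustration, not the design used in the study.

```python
def long_short_return(signals, next_returns, quantile=0.2):
    """Equal-weighted long-short spread for a single period.

    signals:      dict mapping state -> signal value known at time t
    next_returns: dict mapping state -> realized return over t+1
    quantile:     fraction of states placed in each leg
    """
    ranked = sorted(signals, key=signals.get)      # lowest signal first
    k = max(1, int(len(ranked) * quantile))        # states per leg
    short_leg = ranked[:k]
    long_leg = ranked[-k:]
    long_ret = sum(next_returns[s] for s in long_leg) / k
    short_ret = sum(next_returns[s] for s in short_leg) / k
    return long_ret - short_ret

# Made-up embedding-based signals and next-period rental returns.
signals = {"CA": 0.9, "TX": 0.4, "NY": 0.1, "FL": -0.3, "OH": -0.8}
rental_returns = {"CA": 0.03, "TX": 0.02, "NY": 0.01, "FL": 0.00, "OH": -0.01}

spread = long_short_return(signals, rental_returns)
print(f"Long-short spread: {spread:.4f}")
```

In a real test this spread would be computed every period and its time series evaluated, with signals built only from news available before the return window.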
· Robustness via placebo tests
Implements (i) temporal shuffling, (ii) random embedding replacement, (iii) irrelevant news injection, and (iv) synthetic event windows to ensure that semantic signals reflect economically meaningful information rather than spurious correlations. These placebo designs serve as our core robustness tests to validate model integrity and causal interpretation.
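To make the first placebo design concrete, here is a minimal sketch of a temporal-shuffling test: the signal is randomly reordered in time many times, and the observed signal-return correlation is compared with the shuffled distribution. The helper names, the number of shuffles, and the toy series are assumptions for illustration; the study's actual test statistic may differ.

```python
import math
import random

def correlation(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def shuffle_placebo_pvalue(signal, returns, n_shuffles=1000, seed=0):
    """Share of time-shuffled signals whose |correlation| with returns
    is at least as large as the observed one (permutation p-value)."""
    rng = random.Random(seed)
    observed = abs(correlation(signal, returns))
    shuffled = list(signal)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break the time alignment
        if abs(correlation(shuffled, returns)) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_shuffles + 1)

# Toy monthly series: returns track the signal plus alternating noise.
signal = [0.1 * i for i in range(12)]
returns = [0.02 * i + (0.01 if i % 2 else -0.01) for i in range(12)]

p_value = shuffle_placebo_pvalue(signal, returns)
print(f"Placebo p-value: {p_value:.4f}")
```

A small p-value means the shuffled signals rarely match the real one, which is what one would hope to see if the semantic signal carries timing-specific information rather than a spurious pattern.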
· “Look-ahead” bias in LLMs
Conducts five out-of-sample tests to address look-ahead bias in LLMs and to reinforce the credibility of belief updating under high-dimensional information, in line with modern learning dynamics (Da et al., 2024). They include: (i) a chronologically consistent design following He et al. (2023, 2025), (ii) cross-model temporal comparison, (iii) entity neutering, (iv) post-cutoff validation, and (v) event-study validation.
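Two of these checks, the chronologically consistent design and post-cutoff validation, come down to careful date filtering. The sketch below shows one simple way to express them; the article records, the `published` field, and both cutoff dates are hypothetical, and the study's actual procedure is not described in this post.

```python
from datetime import date

def chronological_split(articles, train_end):
    """Split articles so every training article predates every test article."""
    ordered = sorted(articles, key=lambda a: a["published"])
    train = [a for a in ordered if a["published"] <= train_end]
    test = [a for a in ordered if a["published"] > train_end]
    return train, test

def post_cutoff_articles(articles, model_cutoff):
    """Keep only articles published after the LLM's training-data cutoff,
    so the model cannot have seen them (or their aftermath) in training."""
    return [a for a in articles if a["published"] > model_cutoff]

# Hypothetical article records.
articles = [
    {"id": 1, "published": date(2007, 6, 1)},
    {"id": 2, "published": date(2019, 3, 15)},
    {"id": 3, "published": date(2024, 11, 2)},
    {"id": 4, "published": date(2025, 5, 20)},
]

train, test = chronological_split(articles, train_end=date(2019, 12, 31))
unseen = post_cutoff_articles(articles, model_cutoff=date(2024, 12, 31))  # assumed cutoff
print([a["id"] for a in train], [a["id"] for a in test], [a["id"] for a in unseen])
```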
· Monitoring “Black-Box”
While LLMs and Machine Learning models inherently encode both informational and emotional content, our empirical design explicitly separates the rational (information-based) and sentiment-driven components. Therefore, rather than relying on post-hoc interpretability tools such as Shapley values or Integrated Gradients, we introduce a framework in which the model's decisions are based solely on sentiment-driven (or economic-rational) signals. This approach provides a more direct and economically interpretable distinction between rational expectations and sentiment.
· External benchmarking with industry sentiment indices
Through the author’s established connections with (i) Cindy K. Soo, former associate professor at the University of Michigan and founder of the Housing Media Sentiment Index v2.0, and (ii) Prof. Oguzhan Cepni (CBS, Copenhagen), who developed the U.S. Housing-Media-Attention Indices, both indices will serve as external benchmarks for validating our findings and supporting their credibility within the real estate research community.
· Modern EMH theory & Post-2019 AI shift
Building on Martin & Nagel’s (2022) “high-dimensional EMH,” which suggests that post-2019 equity markets have become increasingly efficient because learning effects have diminished return predictability, we extend this framework to real estate. Our study tests whether U.S. housing markets show similar patterns of informational efficiency in the era of AI. This allows us to evaluate whether behavioral interpretations of pre-2008 real estate markets, as emphasized by Shiller (2015), have been supplanted by advanced rational processing capabilities in post-2019 AI-driven market regimes.
· Future mispricing prediction
At its core, the study of real estate pricing mechanisms is an investigation into the origins of mispricing. By disentangling emotional and informational drivers in the 2008 subprime crisis and the 2020 housing surge, our framework aims to identify early-stage sentiment signals predictive of market distortions. We introduce a Dynamic Rationality Index (DRI) that serves as a real-time indicator of potential mispricing and enables the early detection of potential housing market crises. This is especially critical in real estate, where the absence of short-selling and limited arbitrage make mispricing harder to correct, which has historically led to severe consequences, as seen in the 1930s and 2008.

Figure 1. Flowchart of Baseline Model

Figure 2. A three-dimensional illustrative example of the embedding space
The baseline results are coming soon...


