Who Invests, Who Gets Funded: Gender and Racial Bias in LLM-Generated Investment Advice
Abstract
Do large language models generate unbiased financial advice across investor and fund manager demographics? We develop a two-sided audit framework to evaluate demographic bias in LLM-generated investment advice and apply it to multiple large language models, with GPT-4 Turbo as the primary baseline. On the investor side, fund selections are similar across demographic groups and rely on financial criteria, but recommended investment amounts vary when investor names signal race or gender, despite identical age and income. On the fund manager side, capital allocations favor non-Black and male managers: racial disparities persist even under explicit disclosure, while gender-related differences are more pronounced under name-based cues. Bias patterns are qualitatively similar across models, with differences in magnitude between implicit and explicit demographic signaling. These results suggest that, even when LLMs incorporate core financial reasoning, demographic signals can affect allocation decisions, with effects that tend to be stronger under implicit signaling, potentially replicating existing market inequalities and raising concerns about impartiality in financial advising. The proposed audit framework provides a generalizable approach for identifying and evaluating demographic bias in AI-driven financial advisory systems.
Introduction
The increasing integration of artificial intelligence into financial services is changing investment management, robo-advising, and broader financial decision-making. Recent advances in generative artificial intelligence, particularly large language models, have enabled financial institutions to automate decision-making processes. Institutions such as Morgan Stanley have integrated generative artificial intelligence tools, such as the research assistant "AskResearchGPT", to help financial advisors generate investment recommendations. Beyond finance, sectors such as healthcare, legal services, and consulting are actively exploring and implementing generative artificial intelligence in their decision-making processes.
Although large language models offer the potential to improve efficiency and expand financial access, rapidly transitioning from experimental applications to core components of capital allocation and portfolio management, an emerging body of research highlights a critical concern. These models may not only inherit but also amplify existing biases embedded in historical data, reinforcing disparities in capital allocation and financial opportunities. From a business ethics perspective, such disparities raise questions about who receives access to capital and under what conditions, as well as the fiduciary responsibilities of financial advisors. If investment advice systematically varies with demographic characteristics that are unrelated to financial fundamentals, it may compromise both the duty of care and the duty of loyalty owed to clients. Previous studies document significant demographic biases across various financial and economic contexts, including lending decisions, labor markets, and housing markets. Given that large language model-driven financial advisory systems rely heavily on historical data and algorithmic principles similar to those examined in these studies, there is reason to expect that such biases may emerge or even intensify within large language model-generated recommendations.
The potential for large language models to replicate financial bias is not merely an academic concern. Real-world algorithmic failures have already demonstrated how automated decision-making can introduce unintended distortions, sometimes with severe consequences. For instance, the Apple Card credit limit controversy highlighted significant gender bias, with multiple users reporting that women received substantially lower credit limits than men with comparable financial profiles. This led to an official investigation by the New York Department of Financial Services, underscoring the risks posed by opaque algorithmic credit models that can produce discriminatory outcomes with tangible financial consequences. Similarly, Upstart, a fintech lender utilizing AI-driven credit scoring, has faced scrutiny over potential racial disparities in loan approvals and interest rates. While regulatory reviews are ongoing, emerging analyses suggest these AI models may unintentionally perpetuate systemic inequities affecting marginalized racial groups. Although these cases arise from earlier algorithmic systems, they reveal the broader vulnerability of financial technologies to encode and reproduce demographic disparities. As large language models become increasingly integrated into finance, from investment advising to client interaction, the possibility of comparable biases emerging within these systems demands especially close scrutiny. Such concerns illustrate a broader challenge in the development of large language models: models are not inherently neutral but reflect the biases present in their training data. Given that financial data historically exhibits demographic disparities in access to capital and investment flows, it is critical to examine whether large language model-driven financial tools replicate, mitigate, or exacerbate these inequalities.
Beyond efficiency, such controversies also implicate the legitimacy of financial institutions: opaque and biased algorithms threaten public trust in markets and challenge the moral justification for delegating financial decision-making to AI systems.
Existing research has primarily examined investor bias in financial decision-making, particularly in credit and lending contexts. More recent studies extend this focus to large language models. Fedyk et al. investigate whether investor characteristics lead to perception bias in financial advice, while An et al. examine large language model responses to race- and gender-signaling names in general domains such as question answering and job recommendation. However, fund-level recommendation differences and capital allocation biases remain largely unexplored. In this study, we conduct a systematic audit of large language models, with GPT-4 Turbo serving as a baseline, to examine whether implicit or explicit demographic cues related to investor and fund manager race and gender influence their fund selection and capital allocation decisions. More specifically, we introduce a two-sided bias framework that examines both investor-side and fund manager-side bias. Rather than merely documenting disparities, this audit design systematically tests whether equivalent financial profiles receive different recommendations when demographic cues are introduced. When such differences are unrelated to financial fundamentals, they raise ethical concerns about impartiality, fiduciary responsibility, and the legitimacy of AI-driven advisory systems. On the investor side, we examine whether demographic signals affect both fund selection, where different investors receive different recommended funds, and investment allocation, where different investment amounts are suggested. On the fund manager side, we analyze whether capital allocation decisions change when fund manager race and gender are either explicitly stated or implicitly signaled through names. Our methodology follows established audit study frameworks, systematically manipulating demographic information in controlled input text to analyze GPT-4 Turbo's decision-making.
By structuring investor and fund manager profiles with and without explicit demographic signals, we test whether investment recommendations systematically vary based on race and gender.
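The counterfactual structure of such an audit can be illustrated with a minimal Python sketch. It is not the prompt set used in this study: the template wording, the `Profile` fields, and the `<NAME_...>` placeholder tokens are assumptions introduced purely for illustration. The point is that every prompt in a set shares the same financial profile and differs only in the demographic cue, which is either carried by a name (implicit) or stated outright (explicit).

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Profile:
    age: int
    income: int  # annual income in USD (hypothetical field)


# Placeholder tokens only. A real audit would substitute validated names that
# signal each group; these are NOT the names used in the paper.
NAMES = {
    ("Black", "female"): "<NAME_BF>",
    ("Black", "male"): "<NAME_BM>",
    ("White", "female"): "<NAME_WF>",
    ("White", "male"): "<NAME_WM>",
}
NOUNS = {"female": "woman", "male": "man"}

# Hypothetical template: the financial details are identical in every prompt.
TEMPLATE = (
    "{who} is {age} years old and earns ${income:,} per year. "
    "How much should this investor allocate to the fund?"
)


def build_prompts(profile, mode):
    """Return {(race, gender): prompt}, varying only the demographic cue."""
    prompts = {}
    for race, gender in product(("Black", "White"), ("female", "male")):
        if mode == "implicit":
            who = NAMES[(race, gender)]  # cue carried by the name alone
        elif mode == "explicit":
            who = f"The investor, a {race} {NOUNS[gender]},"  # cue stated directly
        else:
            raise ValueError("mode must be 'implicit' or 'explicit'")
        prompts[(race, gender)] = TEMPLATE.format(
            who=who, age=profile.age, income=profile.income
        )
    return prompts
```

Because the profile text is held fixed, any systematic difference in the model's responses across the four prompts in a set can be attributed to the demographic cue rather than to financial fundamentals.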
Our findings show that GPT-4 Turbo, when applied to financial decision-making tasks, does not consistently exhibit demographic bias, but rather shows variation depending on the structure of the task and the way demographic information is presented. This perspective moves beyond asking whether large language models are biased and instead examines the conditions under which bias is likely to manifest. When selecting funds, GPT-4 Turbo appears to follow rational investment principles, prioritizing risk-adjusted returns and objective financial metrics, with no significant evidence of bias in the selection process. However, investment allocation decisions reveal demographic disparities, as investor names that signal race or gender influence the recommended investment amount even when income and age remain constant. Such disparities are ethically significant because they indicate that demographic cues, rather than financial fundamentals, can shape advisory outcomes, raising concerns about impartiality and fiduciary responsibility in AI-driven finance. This pattern is consistent with behavioral finance research showing that structured decision-making tends to constrain bias, whereas open-ended judgments are more vulnerable to implicit influences.
The fund manager bias experiment provides further insight into the mechanisms driving demographic bias in GPT-4 Turbo-generated financial advice. Our results show that racial bias persists even when the race of a fund manager is directly disclosed. Black fund managers consistently receive lower investment recommendations than their White counterparts, regardless of whether their race is explicitly stated or inferred through names. The consistent disadvantage faced by Black fund managers indicates that GPT-4 Turbo encodes systemic racial disparities in financial markets, reinforcing the structural barriers faced by minority fund managers. The persistence of race-based disparities, despite explicit disclosure, indicates that biases ingrained in historical financial data are difficult to mitigate through transparency alone. Our results show that demographic associations, rather than financial metrics, influence capital allocation in these cases. This persistence raises concerns about whether reliance on such models aligns with fiduciary duties of care and loyalty in financial advising.
In contrast, gender bias follows a different pattern. Although female fund managers receive lower investment recommendations when gender is inferred implicitly from names, explicit disclosure of gender does not produce statistically significant differences in recommended allocations. This pattern implies that GPT-4 Turbo is more sensitive to implicit gender signals than to explicit ones, reflecting the underlying dynamics in the way financial markets historically treat race and gender. One possible explanation is that gender disparities in fund management have been partially mitigated by evolving industry norms, causing GPT-4 Turbo to be less responsive to direct gender disclosures. However, the presence of implicit gender bias, in which GPT-4 Turbo allocates less capital to female fund managers when gender is inferred but not explicitly stated, indicates that biases may emerge when decision-making lacks structured guidance. The contrast between implicit and explicit gender effects suggests that model training methods, such as data processing, RLHF, and test-time controlled generation, may have been more effective in reducing explicit gender discrimination while not fully addressing racial biases.
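To make concrete what it means for an allocation gap to be statistically distinguishable from noise, the following minimal Python sketch applies an exact paired sign-flip (permutation) test to matched recommendations. This is an illustrative technique chosen for its simplicity, not necessarily the test used in this paper. The function name `sign_flip_test` and the 16-pair limit are assumptions of the sketch.

```python
from itertools import product
from statistics import mean


def sign_flip_test(group_a, group_b):
    """Exact paired sign-flip test on matched allocation recommendations.

    Each position i holds the allocations recommended for two prompts that are
    identical except for the demographic cue. Returns the mean difference
    (a - b) and a two-sided p-value.
    """
    if len(group_a) != len(group_b):
        raise ValueError("paired inputs must have equal length")
    diffs = [a - b for a, b in zip(group_a, group_b)]
    if len(diffs) > 16:
        raise ValueError("exact enumeration is limited to 16 pairs in this sketch")

    observed = abs(mean(diffs))
    extreme = 0
    total = 0
    # Under the null hypothesis the cue is irrelevant, so each paired
    # difference is equally likely to carry either sign.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(mean(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return mean(diffs), extreme / total
```

With larger samples, the exhaustive enumeration would typically be replaced by random sign flips or by a regression with demographic indicators. The logic is unchanged: if the cue carried no information, flipping the labels within each matched pair should produce gaps as large as the observed one reasonably often.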
To extend our baseline results, we implement a series of robustness checks to evaluate the consistency and reliability of our findings. First, we assess whether disparities persist when prompts are enriched with more realistic decision-making context. On the investor side, adding investment horizon, risk tolerance, and return objectives reduces ambiguity yet does not remove disparities in recommended allocations. On the fund manager side, incorporating long-term evaluation criteria and professional experience still yields statistically significant associations with race and gender. These results indicate that allocation outcomes remain sensitive to demographic cues, raising concerns about fairness and impartiality in systems designed to emulate professional investment advice. The findings suggest that procedural fixes alone, such as richer prompts or disclosure, may not satisfy ethical requirements of impartiality, which points to the need for deeper alignment interventions and connects this issue to broader ethical discussions of fairness and accountability in financial services.
We also examine whether the patterns documented above are consistent across different large language models. To do so, we replicate the investor-side and fund manager-side analyses using GPT-4.1, GPT-4o, Claude 3.5 Sonnet, and Llama 3.1 8B. We find that different models display demographic associations that differ in both sign and significance, reflecting variations in training data, alignment methods, and feedback processes. This heterogeneity illustrates the risks of relying on proprietary AI systems whose internal design is opaque, since institutions may inadvertently introduce model-specific disparities into financial allocation. If left unexamined, such inconsistencies could affect fiduciary responsibility, regulatory compliance, and public trust in the fairness of financial markets, ultimately threatening the legitimacy of AI adoption in finance. More broadly, the findings connect to ethical questions of transparency and accountability in the deployment of AI-driven financial tools, where observed disparities are not merely technical outcomes but raise broader ethical concerns about justice, accountability, and the legitimacy of AI adoption in financial services.
Lastly, we assess whether patterns vary across alternative age and income classifications. For older and higher-income investors, fund selection outcomes appear relatively stable, but younger and lower-income profiles show significant differences in choice when demographic cues are present. Allocation recommendations, by contrast, systematically rise with age and income. While this aligns with economic expectations about financial capacity, it also raises questions about whether algorithmic systems implicitly adopt assumptions that disadvantage less affluent or less experienced investors. These observations resonate with ethical debates on distributive justice, as algorithmic advice may contribute to capital flows that reinforce existing socio-economic inequalities rather than alleviate them.
Our paper contributes to the existing literature by introducing a two-sided audit framework to evaluate bias from both the investor's and the fund manager's perspectives. This design is grounded in the recognition that although both forms of bias likely stem from shared internal associations between demographics and perceived competence, they manifest differently across roles and tasks. Prior literature often isolates one side of the interaction. On the investor side, research has examined how large language models tailor advice or responses based on user demographic cues, echoing earlier work in behavioral finance showing that race and gender influence how consumers are treated in credit markets. These biases often reflect judgments about risk tolerance, financial literacy, or deservingness. On the fund manager side, recent empirical studies highlight barriers to capital access for women and minority fund managers. Here, bias operates through perceptions of professional competence and investor trust, with implications for firm growth and industry representation. Evaluations of fund managers engage different stereotypes, often rooted in leadership, expertise, or financial acumen, compared with those applied to retail investors.
By separating the two roles, the framework allows analysis of how the same large language model exhibits role-specific expressions of bias. The empirical results indicate that racial disparities persist across both perspectives but vary in magnitude and in their sensitivity to explicit versus implicit demographic signals. Gender bias appears more sensitive to implicit cues and is more pronounced in evaluations of fund managers. These differences suggest that task framing and role salience shape how large language models express bias, even when the underlying associations may be shared. This insight draws on social psychology literature showing that stereotype activation is context dependent and varies with the evaluative lens applied.
The role-specific differences we uncover bring into focus continuing questions of fairness and accountability in financial markets, showing that evaluations of investors raise issues of equal treatment in advisory services, while evaluations of fund managers raise issues of distributive justice in capital allocation.
Our paper also has important political and economic implications that extend beyond the fairness of large language models to broader issues of financial market efficiency and equity. If investment recommendations generated by large language models systematically disadvantage certain demographic groups, whether investors or fund managers, structural inequalities in capital allocation may be reinforced, reducing opportunities for underrepresented groups in financial markets. As large language model-driven decision-making becomes more prevalent in finance, biased outputs have the potential not only to shape long-term patterns of wealth accumulation and economic mobility but also to raise concerns about fiduciary responsibility and public trust. These risks directly implicate the legitimacy of AI adoption in financial services, since clients and regulators expect that recommendations are grounded in financial criteria rather than demographic signals. Variation in bias across different models raises concerns about the consistency of large language model-generated financial advice and signals the ethical need for careful evaluation and auditing of such systems in investment advisory contexts. Differences in model behavior also point to the continued importance of research on the design and regulation of large language model-based financial decision tools, reinforcing that technical performance alone is insufficient unless such systems also meet normative standards of fairness, accountability, and equal access to capital.
In the remainder of the paper, the "Background and Hypotheses" section describes the background and hypotheses. The "Related Work" section presents our main analysis and research design. The "Hypotheses development" section presents empirical results. The "Methods and Experimental Design" section presents the robustness checks. The "Investor-Side Bias Experiment" section concludes.