What a study of AI copilots for lawyers says about the future of AI for everyone


Hello and welcome to Eye on AI.

Name a profession and there’s almost certainly someone building a generative AI copilot for it. Accountants, lawyers, doctors, architects, financial advisors, marketing copywriters, software programmers, cybersecurity experts, salespeople—there are copilots already in the market for all of these roles.

AI copilots differ from general-purpose LLM-based chatbots built on models such as OpenAI’s GPT series, although some copilots use one of those general-purpose models as their central component. Copilots have user interfaces, and usually backend processes, specifically tailored to the tasks someone in that profession would want assistance with—whether that is crafting an Excel spreadsheet formula for an accountant or, for a salesperson, figuring out the best wording to convince a customer to close a complex deal. Many copilots also rely on a process called RAG—retrieval augmented generation—to boost the accuracy of the information they output and reduce the tendency of LLMs to hallucinate, that is, produce superficially plausible but inaccurate information.
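To make the RAG idea concrete, here is a minimal, hypothetical sketch of the pattern in Python: retrieve passages relevant to the user’s question first, then ask the model to answer only from those passages and to cite them. The tiny document store, the keyword-overlap retriever, and the placeholder generate() function are illustrative assumptions, not a description of how any of the products discussed here actually work.

    # A minimal, hypothetical sketch of the retrieval-augmented generation (RAG)
    # pattern. The document store, scoring method, and generate() placeholder are
    # simplifications for illustration, not any vendor's actual pipeline.

    DOCUMENTS = [
        "Smith v. Jones (2019): the appeals court held the clause enforceable.",
        "Statute 12-34: a non-compete clause is unenforceable beyond two years.",
        "Doe v. Acme (2021): ruling later overturned on appeal in 2023.",
    ]

    def retrieve(query: str, k: int = 2) -> list[str]:
        """Rank stored documents by naive keyword overlap with the query."""
        terms = set(query.lower().split())
        ranked = sorted(
            DOCUMENTS,
            key=lambda doc: len(terms & set(doc.lower().split())),
            reverse=True,
        )
        return ranked[:k]

    def generate(prompt: str) -> str:
        """Stand-in for an LLM call; a real copilot would query a model here."""
        return f"[model answer grounded in a prompt of {len(prompt)} characters]"

    def answer(question: str) -> str:
        # Retrieve supporting passages first, then instruct the model to answer
        # only from that material and to cite it, rather than relying solely on
        # whatever it memorized during training.
        context = "\n".join(retrieve(question))
        prompt = (
            "Answer the question using only the sources below, and cite them. "
            "If the sources do not support an answer, say so.\n\n"
            f"Sources:\n{context}\n\nQuestion: {question}"
        )
        return generate(prompt)

    if __name__ == "__main__":
        print(answer("Was Doe v. Acme overturned on appeal?"))

The point of the retrieval step is to ground the model in specific sources it can cite, which is why RAG reduces hallucinations but, as the study discussed below shows, does not eliminate them.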

Perhaps no profession save software developers has embraced experimentation with copilots as enthusiastically as the law. There have already been several instances where lawyers—including former Trump lawyer-turned-star-witness-for-the-prosecution Michael Cohen—have been reprimanded and fined by judges for naively (or very lazily) using ChatGPT for legal research and writing without checking the case citations it produced, which in some cases turned out to be completely invented. The legal copilots, however, are supposed to be much better than ChatGPT at completing legal tasks and answering legal questions.


But are they? The answer matters because lawyers’ experience with these copilots may foretell what will happen in other professions in the coming years. In that context, a study published last month (and updated Friday) by researchers affiliated with Stanford University’s Institute for Human-Centered AI (HAI) sounded an important caution—not just for the legal profession but for copilots as a whole.

The HAI researchers, who included a Stanford Law professor, created a dataset of 200 questions designed to mimic the kinds of questions a lawyer might ask a legal research copilot. The Stanford team claims these questions are a better test of how legal copilots will perform in real-world settings than bar exam questions—especially because many existing sets of bar exam questions have already been memorized by LLMs trained on vast amounts of data scraped from the internet. The dataset also includes some particularly tricky questions built on a false premise. Such questions often lead LLMs astray: trained to be helpful and agreeable, they frequently accept the false premise and then invent information to justify it, rather than telling the user the premise is wrong.

The researchers then tested several prominent legal research copilots, including one from LexisNexis (Lexis+ AI) and two from Thomson Reuters (Ask Practical Law AI and Westlaw’s AI-Assisted Research) on this dataset. They used OpenAI’s GPT-4 as a kind of control, to see how well an LLM would do without RAG and without any of the other backend processing that had been geared just for legal research. The answers were evaluated by human experts.

For lawyers—and everyone hopeful that RAG would eliminate hallucinations—there was a little bit of good news and quite a lot of not-so-good news in the results. The good news is that RAG did indeed reduce hallucination rates significantly: GPT-4 had a hallucination rate of 43%, while the worst of the three legal copilots had a rate of 33%. The bad news is that the hallucination rates were still much higher than you’d want; the best two copilots still made up information in about one out of six instances. Worse still, the RAG-based legal copilots often omitted key information from their answers, with between nearly a fifth and well over half of responses judged incomplete by the human evaluators. By contrast, fewer than one in 10 of GPT-4’s responses failed on this metric. The study also found that LexisNexis’s copilot provided legal citations for all the information it gave, but that the cited cases sometimes did not say what the copilot claimed they did. The researchers noted that this kind of error can be particularly dangerous: a citation to a real case can make lawyers complacent, letting errors slip past.

LexisNexis and Thomson Reuters have both said that the accuracy figures in the HAI study were significantly lower than what they’ve found in their own internal performance testing and in feedback from customers. “Our thorough internal testing of AI-Assisted Research shows an accuracy rate of approximately 90% based on how our customers use it, and we’ve been very clear with customers that the product can produce inaccuracies,” Mike Dahn, head of Westlaw Product Management at Thomson Reuters, wrote in a blog response to the HAI study.

“LexisNexis has extensive programs and system measures in place to improve the accuracy of responses over time, including the validation of citing authority references to mitigate hallucination risk in our product,” Jeff Pfeifer, LexisNexis chief product officer for the U.S., Canada, Ireland, and the U.K., wrote in a statement provided to the newsletter LegalDive.

The blog post HAI wrote to accompany the research pointed to a recent story by Bloomberg Law that also could give people pause. It looked at the experience of Paul Weiss Rifkind Wharton & Garrison—among the 50 largest U.S. law firms, with close to 1,000 attorneys—with a legal copilot from the startup Harvey. Paul Weiss told the news organization that it wasn’t using quantitative metrics to assess the copilot because, according to Bloomberg, “the importance of reviewing and verifying the accuracy of the output, including checking the AI’s answers against other sources, makes any efficiency gains difficult to measure.” The copilot’s answers could also be inconsistent—with the same query yielding different results at different times—or extremely sensitive to seemingly inconsequential changes in the wording of a prompt. As a result, Paul Weiss said it wasn’t yet in a position to determine the return on investment from using Harvey.

Instead, Paul Weiss was evaluating the copilot on qualitative measures, such as how much attorneys enjoyed using it. And here, there were some interesting anecdotes. It turned out that while junior lawyers might not see much in the way of time savings from using the AI copilot for research, because of the need to verify its answers, more senior lawyers found it a very useful tool for brainstorming possible legal arguments. The firm also noted that the copilot could do certain things—such as evaluate every single contract in a huge database in minutes—that humans simply could not. In the past, firms had to rely on some sort of statistical sampling of the contracts, and even then the process might take days or weeks.

Pablo Arredondo, cofounder of CoCounsel, a legal copilot now owned by Thomson Reuters that was not part of the HAI study, told me that the study and the Bloomberg story reinforce the point that all generative AI legal copilots need oversight (as do junior associates at law firms). Some of the areas where the copilots stumbled in the HAI study, such as determining when a case had subsequently been overturned by a higher court, are also areas where different legal research companies often provide conflicting information, he noted.

Taken together, I think the Stanford study and the Bloomberg Law story say a lot about where AI copilots are today and how we should think about where they are heading. Some AI researchers and skeptics of the current hype around generative AI have seized on the HAI paper as evidence that LLMs are entering the “trough of disillusionment” and that perhaps the entire field is about to enter another “AI winter.” I don’t think that’s quite right. Yes, the Stanford paper points to serious weaknesses in AI copilots. And yes, RAG will not cure hallucinations. But I think we will keep finding ways to minimize hallucinations (longer context windows are one of them), and people will continue to use copilots.

The HAI paper makes a great case for rigorous testing—and for sharing that performance data with users. Professionals must have a clear sense of copilots’ capabilities and weaknesses and need to understand how they are likely to fail. Having this mental model of how a particular copilot works is essential for any professional working alongside one. Also, as the Bloomberg Law story suggests, many professionals will come to find copilots useful and helpful even when they aren’t entirely accurate, and even when the efficiency gains from such a system are hard to measure. It’s not about whether the copilot can do well enough on its own to replace human workers. It’s about whether the human working with the copilot can perform better than they could alone—just as the senior Paul Weiss lawyers said it helped them think through legal arguments.

Arredondo said that Thomson Reuters is in early discussions with Stanford about forming a consortium of legal tech companies, law firms, and other academic institutions to develop and maintain benchmarks for legal copilots. Ideally, he said, those benchmarks would measure how human lawyers perform on the same tests, both on their own and when assisted by AI tools, rather than evaluating the systems only against one another and without the human oversight they still need.

We don’t have very good benchmarks for human-AI teaming. It’s time to create some.

There’s more AI news below...But first, if you want to find out more about working alongside AI copilots, I’ve got some news of my own: My book Mastering AI: A Survival Guide to Our Superpowered Future is now available for pre-order in the U.S. and the U.K.! The book has a chapter on how AI will transform the way we work. But Mastering AI goes well beyond that to reveal how AI will change and challenge our democracy, our society, and even ourselves. AI presents tremendous opportunities in science, education, and business, but we must urgently address the substantial risks this technology poses. In Mastering AI I explain how. If you enjoy this newsletter, I know you’ll find the book valuable. Please consider pre-ordering your copy today.

Jeremy Kahn
jeremy.kahn@fortune.com
@jeremyakahn

Correction, June 4: An earlier version of this story misspelled the full name of the law firm Paul Weiss Rifkind Wharton & Garrison.

This story was originally featured on Fortune.com