Five Questions We Didn't Get to Answer During our EDRM Webinar ChatGPT 101
By Dr. William Webber and John Tredennick.
Our ChatGPT webinar had almost 200 participants, many of whom came with questions to ask. We didn’t have a chance to answer them all during the session.
Here are selected questions with our answers, which we thought might be helpful to others. You can watch the recording here: https://www.merlin.tech/chatgpt-101-how-do-i-use-it-and-why/
1. Are queries and the responses thereto confidential? How does ChatGPT fit into the concept of user privacy?
Answer: We would liken confidentiality with ChatGPT to confidentiality in using Lexis, Westlaw or even Casetext. While you send information to OpenAI for the purpose of prompting an answer, the system doesn’t use that information for training or otherwise retain it, except as we describe next. Confidentiality in this case should be maintained through NDAs and similar contract provisions, much like those we currently use with ediscovery hosting providers, or even with Microsoft and Google for their office products.
That said, OpenAI keeps a log of interactions with GPT for 30 days so that it can review them for “abuse” of the system (attempts to generate grossly inappropriate text or to elicit dangerous information).
Interactions with GPT are not otherwise used by OpenAI or shared with others. In particular, they are not used to train the GPT system. Of course, OpenAI’s terms of use could change in the future, but we doubt any changes will result in less security for information submitted as prompts.
We should note that chat history is stored in your account until you choose to delete it.
2. Can you comment on early usage of GPT in ediscovery platforms? Does the Hallucination factor limit its use in the Enterprise where defensibility is critical?
Answer: One vendor announced an early beta just before LegalWeek and provided a video demo. The vendor did not directly reveal what approach it was using to support question answering on an e-discovery corpus (such as the Enron corpus used in the demo). Thus we could not tell whether the vendor was training the model directly on the e-discovery corpus to create a “private” model, or taking Bing’s “search-then-synthesize” approach: using the question to search for documents, then having GPT read those documents and answer the question based on that reading.
The fact that the responses list “support documents by relevance” (together with our doubts about the effectiveness of the former approach) suggests to us that the vendor is using the latter method.
We think the search-then-synthesize method is a valid approach to search and question answering on e-discovery collections, and indeed we have developed a prototype ourselves. In this approach, “hallucinations” are less of an issue, because GPT is presented directly with the documents (the sources) on which to base its answers, and the user can check those answers against the sources.
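To make the pattern concrete, here is a minimal sketch of a search-then-synthesize loop in Python. The search_index object and its search() method are hypothetical stand-ins for whatever retrieval engine a platform uses, and the OpenAI call reflects the chat completions API as it stood when this was written; this is a sketch of the general technique, not any vendor’s implementation.

```python
# Minimal search-then-synthesize sketch. `search_index` and its `search()`
# method are hypothetical stand-ins for a retrieval engine.
import openai  # pre-1.0 openai Python client

def answer_from_corpus(question, search_index, top_k=5):
    """Answer a question from a document collection by searching first."""
    # Step 1: use the question as a query against the collection.
    hits = search_index.search(question, top_k=top_k)  # hypothetical API

    # Step 2: hand the retrieved text to GPT and ask it to answer *only*
    # from those sources, which limits the room for hallucination.
    sources = "\n\n".join(f"[Doc {i + 1}] {doc.text}" for i, doc in enumerate(hits))
    prompt = (
        "Answer the question using ONLY the documents below, citing the "
        "supporting document numbers. If the documents do not contain the "
        "answer, say so.\n\n"
        f"Documents:\n{sources}\n\nQuestion: {question}"
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output is easier to check against sources
    )
    return response["choices"][0]["message"]["content"]
```

Because the answer is grounded in the retrieved documents, the user can verify each claim against the cited sources, which is what makes hallucinations easier to detect in this setting.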
Hallucinations could still occur if follow-up questions were answered without fetching new search results. For instance, if “Melanie Smith” is mentioned in a document and you ask “Who is Melanie Smith?”, the retrieved documents might not contain enough information to answer that question accurately. In that case, GPT might hallucinate an answer, making assumptions that are not justified by the sources.
We note that in the demo on YouTube, each follow-up question is accompanied by a “supporting documents” list, suggesting that search results are updated as the conversation continues.
Question answering, however, does not solve the core e-discovery task of document production, even if it might be useful in early case analysis or in interrogating a production (on the receiving party’s side). Rather, question answering provides a convenient summary of top-ranking search results, an alternative to snippets or to simply reading the documents.
For production, we have to find substantially all documents that are responsive to an issue in a production request. We (Merlin) have proposed and prototyped an approach to the review task in which GPT (or another LLM) directly reviews documents for relevance to a description of the production issue, and in which the system supports the subject-matter expert in interactively developing and refining that issue description based on a sample of the review results. Initial experiments suggest that this is a promising approach.
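Purely as an illustration, and not a description of our actual system, the core review step might look something like the sketch below; the issue description, model choice, and answer parsing are all simplified assumptions.

```python
# Illustrative sketch of prompt-driven first-pass review: the LLM tags one
# document against an issue description. Not Merlin's actual implementation.
import openai  # pre-1.0 openai Python client

# Hypothetical issue description; in practice the subject-matter expert
# develops and refines this iteratively against a sample of review results.
ISSUE = (
    "Documents discussing the manipulation of California electricity "
    "markets, including trading strategies designed to exploit price caps."
)

def review_document(doc_text, issue=ISSUE):
    """Ask the LLM to tag one document as Relevant or NotRelevant."""
    prompt = (
        f"Issue for production:\n{issue}\n\n"
        f"Document:\n{doc_text}\n\n"
        "Is this document relevant to the issue? Answer with exactly one "
        "word, 'Relevant' or 'NotRelevant', followed by a one-sentence reason."
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reproducible tags matter for defensibility
    )
    answer = response["choices"][0]["message"]["content"]
    tag = answer.split()[0].strip(".,:")
    return tag == "Relevant", answer  # (boolean tag, full rationale)
```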
By the way, the demo at issue was based on the Enron collection, and the nature of this collection merits care in interpreting the apparent success of the vendor’s integration of GPT for question answering.
First, the Enron case is well known, and there is a great deal of information about it in news articles and on the web, which will have been scraped for GPT’s pre-training data. GPT, therefore, would be able to answer many questions about the Enron case without having to look at any documents from the Enron corpus.
Second, Enron’s actions and fortunes were highly topical even when Enron was still operating, and the Enron email corpus contains many news articles about Enron and the current affairs it was involved in, forwarded from one Enron employee to another. Such articles provide a concentrated and digested source of information about Enron’s public activities, which makes question-answering for GPT much easier (in many cases, GPT will simply need to summarize the contents of a news article returned by the search).
In contrast, inferring this sort of information from a heterogeneous set of emails that, though perhaps relevant to an issue, do not directly and explicitly describe that issue would be a much more challenging task.
We don’t say the above to criticize the vendor. It is almost impossible to find a realistic e-discovery corpus that can be demoed publicly but doesn’t contain publicly known information. But legal professionals should seek to test these methods on private data before reaching conclusions on their effectiveness.
3. How soon will this be available to use in the legal field for accurate case references?
Answer: Other firms are working on this. We are not certain it will prove feasible to extend an LLM’s pre-training with a collection of legal case data and have the LLM answer questions directly (as ChatGPT does). A widely used alternative, and the one we are pursuing, is to connect GPT to a search system that can surface relevant legal cases, and then have the LLM analyze the search results and synthesize an answer (as Bing Chat does), much as in the search-then-synthesize sketch above.
4. Where is GPT heading next? Will it be developed to replace the current document review platforms like Relativity?
Answer: We expect that GPT will remain a general-purpose natural language AI tool that others will build systems around to accomplish specific tasks. Existing document review platforms work by using keywords and predictive coding to identify a set of possibly relevant documents, which can then be reviewed by humans or, rarely, produced unreviewed.
We believe that the natural-language understanding capabilities of GPT (and competing LLMs) allow for a different processing model, one in which a small team of experts works with GPT, possibly through limited rounds of iterative testing, and then asks GPT to review and tag documents based on an issue description contained in the prompt (much as in the review sketch under question 2).
Some may see this as a return to linear review, but with an AI system rather than a human doing the reviewing. We would instead suggest pairing it with a CAL-based review system, which would make the GPT-aided review even more efficient and less costly.
Such a system would offer significant savings in review costs, a great reduction in turnaround time, greater accuracy and reproducibility of review results, and a much greater ability to change the direction of review mid-course or even after completion. We have done some initial but promising experimental work on this approach (see attached) and are working on building such a system.
5. What’s causing these limitations, e.g., 3,000 words at a time for GPT-3.5? Can those limitations be resolved by tech advancement/development in the near future?
Answer: GPT is trained as follows. Take a large amount of text (billions of documents). Tokenize the text (i.e., split it into words and punctuation). Take a sequence of N-1 tokens, and try to predict (using a predictive neural network model) the next token that follows. Reveal the actual token, and update the predictive model based on what the token was and whether the prediction was correct. Repeat this many billions of times, using tens of thousands of computers, over many weeks.
The core GPT system then works essentially by continuing to make these predictions, as a sort of hyper-intelligent auto-complete. Take the user’s prompt, occupying (say) N-X tokens. Predict, based on the trained model, what the next X tokens should be, and generate these as the response.
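To make that train-then-predict structure concrete, here is a toy illustration. A real LLM uses a large neural network updated by gradient descent over billions of examples; the simple counting model below merely stands in for it so the example can run on its own.

```python
# Toy illustration of next-token training and autocomplete-style generation.
# A counting model stands in for the neural network of a real LLM.
import random
from collections import Counter, defaultdict

N = 3  # context window: predict each token from the previous N-1 tokens

def tokenize(text):
    return text.split()  # real systems use subword tokenizers, not whitespace

def train(corpus):
    """Count how often each token follows each (N-1)-token context."""
    model = defaultdict(Counter)
    tokens = tokenize(corpus)
    for i in range(len(tokens) - (N - 1)):
        context = tuple(tokens[i : i + N - 1])
        model[context][tokens[i + N - 1]] += 1  # the "update" step of training
    return model

def generate(model, prompt, num_tokens=10):
    """Autocomplete: repeatedly predict a next token from the last N-1."""
    tokens = tokenize(prompt)
    for _ in range(num_tokens):
        context = tuple(tokens[-(N - 1):])
        candidates = model.get(context)
        if not candidates:
            break  # unseen context; a real LLM generalizes instead
        choices, weights = zip(*candidates.items())
        tokens.append(random.choices(choices, weights=weights)[0])
    return " ".join(tokens)

model = train("the cat sat on the mat and the cat sat on the chair")
print(generate(model, "the cat"))  # e.g. "the cat sat on the mat and ..."
```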
The “N” here is the context window size of the model. It is deeply embedded in how the model is trained, and defines the maximum number of “slots” the model has for reading in context at prediction time (and remembering context during a conversation). There is simply no room in the model for more data than this.
It is, of course, possible to train the model with a larger N, i.e. a larger context window, and this has certainly been one of the key improvements in these models over time.
Larger context windows take more computing resources to train and more resources to run. Calls to the GPT-4 32k API cost 30 times as much per token as GPT-3.5, and the window holds 8 times as many tokens, so a full context window costs 240 times as much. An API call with a full GPT-3.5 context window costs the caller 0.8 US cents (that is, less than a cent), whereas a full request to the GPT-4 32k API costs about $2.00. (Two dollars is not much in itself, but if you want to summarize every document in your collection, and you have a million documents, the cost starts to add up.)
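For readers who want the arithmetic behind those figures, it can be reproduced as follows; the per-token prices are our assumptions based on OpenAI’s published pricing at the time of writing, and may change.

```python
# Back-of-the-envelope cost comparison (prices are assumptions from
# OpenAI's published per-token pricing at the time of writing).
GPT35_PRICE_PER_1K = 0.002    # USD per 1,000 tokens, gpt-3.5-turbo
GPT4_32K_PRICE_PER_1K = 0.06  # USD per 1,000 prompt tokens, gpt-4-32k

gpt35_window = 4_096   # tokens in a full GPT-3.5 context window
gpt4_window = 32_768   # tokens in a full GPT-4 32k context window

full_gpt35 = gpt35_window / 1000 * GPT35_PRICE_PER_1K   # ~$0.008, i.e. 0.8 cents
full_gpt4 = gpt4_window / 1000 * GPT4_32K_PRICE_PER_1K  # ~$1.97, i.e. about $2

print(f"Per-token price ratio:  {GPT4_32K_PRICE_PER_1K / GPT35_PRICE_PER_1K:.0f}x")  # 30x
print(f"Full-window cost ratio: {full_gpt4 / full_gpt35:.0f}x")                      # 240x
# Summarizing a million documents with full GPT-4 32k windows: ~$2 million.
```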
In addition, a larger context window requires more training data; otherwise, the gains in language understanding diminish. And with LLM training sets already containing billions of documents crawled from the web, some commentators argue that we will soon face a shortage of good-quality training data.
For these reasons, it would not surprise us if LLMs did not progress far beyond GPT-4’s 32k window for some time to come, perhaps not before a different approach to language AI is developed. In addition, 32k tokens is already quite a large window for most uses (it is roughly the length, for instance, of Luke’s Gospel, the longest of the gospels).
Ultimately, even if the context window size were to continue growing rapidly, it will never be feasible to send an entire document collection in a single request and ask GPT to process it all at once, so some method of document selection will be required (or an alternative approach taken).
John Tredennick is the CEO and founder of Merlin Search Technologies.
Dr. William Webber is the Chief Data Scientist of Merlin Search Technologies.