Unlike structured data or FAQs, unstructured documents are made up of written text, such as blog posts, help articles, and product manuals. In order to find information within them, you either need to read them or search them. Widely used strategies, such as TF-IDF, fall short by only identifying relevant documents. Answers will use Extractive QA to scan your unstructured documents and find the answers to the questions your customers ask, not just the documents that may hold those answers.
To search unstructured documents, most search engines use TF-IDF, which stands for term frequency–inverse document frequency. At its core, TF-IDF finds overlapping keywords between the query and the documents, and ranks the documents by how often those keywords appear, giving more weight to terms that are rare across the collection. This algorithm has three major issues:
1. Keywords not Intent
The biggest issue with this approach is that it focuses on keywords, not intent. If the words used in the search don't overlap with those in the document, the algorithm won't identify a match. Humans phrase the same query in a multitude of ways, and a good search algorithm needs to take this into account.
2. Inflated Recall
Any search engine using this algorithm will often return tens, hundreds, even thousands of documents for a single search term, because every document with any keyword overlap is returned as a possible match. This creates noise and pushes more relevant results down the page.
3. Bad User Experience
This approach returns a list of documents, at best. Google and other leading search engines don't just return links to documents. They extract answers from those documents and do the work for the user. Your internal search engine needs to do the same.
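To make the first two issues concrete, here is a minimal sketch of keyword-based TF-IDF scoring in plain Python. It is an illustration only, not the code any particular search engine runs: the function names, the sample documents, and the smoothed IDF formula (one common variant among several) are all assumptions chosen for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; real engines also stem and drop stop words.
    return text.lower().split()


def tfidf_scores(query, docs):
    """Score each document by the summed TF-IDF weight of the query terms it contains."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term_freq[term]:
                # Smoothed inverse document frequency (an assumed variant).
                idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
                score += (term_freq[term] / len(tokens)) * idf
        scores.append(score)
    return scores


# Hypothetical help-center documents.
docs = [
    "reset your password from the account settings page",
    "change your login credentials in the profile menu",
    "our store hours are nine to five on weekdays",
]

# Exact keywords: only the first document scores above zero.
print(tfidf_scores("reset password", docs))

# Same intent, different words: every score is 0.0 because no keyword overlaps.
print(tfidf_scores("forgot my passcode", docs))

# Common words like "your" and "are" give every document a non-zero score.
print(tfidf_scores("what are your hours", docs))
```

The second query shows the keyword problem: a customer who says "passcode" instead of "password" gets nothing back. The third shows inflated recall: all three documents match, even though only one is about store hours.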
To solve these issues, Answers will take a new approach to document search: Extractive QA. Extractive QA has two specific advantages over TF-IDF.
1. Intent not Keywords
Just like Semantic Text Search, Extractive QA embeds both the search term and the documents in vector space to match based on intent, not keywords. The embedding process is slightly different than that for FAQs, but the overall result is the same: find the right documents, no matter what words are used.
2. Snippets not Documents
Answers will use Dense Passage Retrieval, a technique that locates the passages of unstructured text most likely to contain the answer. The most helpful snippets from those passages are then surfaced as direct answers. This greatly improves the user experience by eliminating the need to dig into documents. Best of all, this extraction happens automatically.
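The two advantages above can be sketched with a toy example. This is not how Answers or Dense Passage Retrieval is actually implemented: a real system uses a trained neural encoder to produce embeddings and a reader model to pick the answer span. Here, a hand-written `CONCEPTS` table (purely hypothetical) stands in for the encoder, and "extraction" is simplified to choosing the single best-matching sentence.

```python
import math

# Hypothetical stand-in for a learned encoder: words with similar meaning
# share a dimension, so different phrasings land near each other.
CONCEPTS = {
    "password": 0, "passcode": 0, "credentials": 0,
    "reset": 1, "change": 1, "forgot": 1,
    "hours": 2, "open": 2, "closed": 2,
}


def embed(text):
    # Map text to a 3-dimensional "concept" vector (toy embedding).
    vec = [0.0, 0.0, 0.0]
    for word in text.lower().replace("?", " ").replace(".", " ").split():
        if word in CONCEPTS:
            vec[CONCEPTS[word]] += 1.0
    return vec


def cosine(a, b):
    # Cosine similarity; 0.0 when either vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_snippet(query, docs):
    # Split every document into sentence-level passages, then return the
    # passage whose embedding is closest to the query embedding.
    query_vec = embed(query)
    passages = [s.strip() for d in docs for s in d.split(".") if s.strip()]
    return max(passages, key=lambda p: cosine(query_vec, embed(p)))


# Hypothetical help-center documents.
docs = [
    "Welcome to the help center. To change your credentials visit the "
    "profile menu. Contact support for anything else.",
    "Our store hours are nine to five. We are closed on public holidays.",
]

print(best_snippet("I forgot my passcode", docs))
```

The query and the returned sentence share no keywords, yet they match because both map to the same concept dimensions. The result is also a single sentence rather than a whole document, which is the "snippets not documents" idea in miniature.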