Large-Scale Knowledge Extraction

Hands-On Approach to Large-Scale Knowledge Extraction

Companies often struggle to derive meaningful insights from vast amounts of unstructured data spread across thousands or even millions of documents.

Common approaches to Retrieval-Augmented Generation (RAG), which typically find a handful of relevant documents to inform an AI's answer, fail when the required information is distributed across an entire corpus.

We focus on practical solutions to this problem and combine two key strategies: hierarchical prompting and dynamic attribute mapping. Together, these techniques make it possible to analyze every document in a large set and extract the relevant information, so companies can answer complex queries that span their entire knowledge base.

Hierarchical Prompting: Scaling Information Extraction

Hierarchical prompting is our solution for processing corpora that exceed the context window of a single language model call. In a Map-Reduce-like process, the question is answered separately for each batch of source documents, and the partial answers are then aggregated into a final answer.

Document Set Division: The large corpus is divided into manageable document sets that can be processed within the context window of the language model.
Parallel Processing: Each document set is processed independently with the same query or prompt.
Iterative Aggregation: The results of individual document sets are combined at increasingly higher levels until a final answer is created that covers the entire dataset.

This method allows us to extract relevant information from every document in the corpus, rather than limiting the analysis to a small subset of documents.
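The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not our production pipeline: `hierarchical_answer`, `batched`, and the `ask_model` callback are hypothetical names, and `keyword_model` is a toy stand-in for a real language model call so that the example runs without any API access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical signature for a language model call: (question, texts) -> answer.
AskModel = Callable[[str, List[str]], str]


def batched(items: List[str], size: int) -> List[List[str]]:
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def hierarchical_answer(
    documents: List[str],
    question: str,
    ask_model: AskModel,
    batch_size: int = 4,
) -> str:
    """Answer `question` over all documents in repeated map and reduce rounds."""
    if not documents:
        raise ValueError("documents must not be empty")
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2 so each round shrinks the input")

    level = documents
    while True:
        # Document set division: group the current level into batches.
        groups = batched(level, batch_size)
        # Parallel processing: the same question is asked of every batch.
        with ThreadPoolExecutor() as pool:
            level = list(pool.map(lambda group: ask_model(question, group), groups))
        # Iterative aggregation: partial answers become the next level's input.
        if len(level) == 1:
            return level[0]


def keyword_model(question: str, texts: List[str]) -> str:
    """Toy stand-in for an LLM: keeps every text that mentions "Drug A".

    It ignores `question`; a real implementation would send both the
    question and the texts to a language model.
    """
    hits = [t for t in texts if "drug a" in t.lower()]
    return " | ".join(hits)
```

In a real system, `ask_model` would wrap a language model API and `batch_size` would be chosen so that each batch fits into the model's context window. The loop runs until a single answer remains, so the number of aggregation levels grows only logarithmically with the size of the corpus.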

Dynamic Attribute Mapping: Structured Knowledge Extraction

Because repeated hierarchical prompting can become expensive, we additionally build a knowledge store that grows with each new query. Dynamic attribute mapping focuses on organizing and storing extracted information so that it can be reused.

Attribute Identification: Before processing documents, the system uses the language model to determine which attributes should be extracted based on the user's question and existing attributes.
Information Extraction: As document sets are processed, the system extracts the identified attributes and their relationships.
Structured Storage: Extracted information is stored in a format similar to a JSON object or dictionary.
Continuous Updates: The structure evolves as new queries are processed and more documents are analyzed, with the language model continuously refining which attributes are relevant.

This dynamic approach allows the system to adapt to new types of information and queries over time, creating an increasingly comprehensive and nuanced representation of the data.
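One possible shape for such a store is sketched below. The class `AttributeStore` and its methods are hypothetical names chosen for illustration. The sketch also assumes that the attribute identification step (a language model call in the real system) has already produced a list of requested attribute names; the store then reports which of them still have to be extracted from the corpus.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


class AttributeStore:
    """Dictionary-like store: entity -> attribute -> list of extracted values."""

    def __init__(self) -> None:
        self._data: Dict[str, Dict[str, List[str]]] = defaultdict(dict)
        self.known_attributes: Set[str] = set()

    def missing(self, requested: Iterable[str]) -> List[str]:
        """Return the requested attributes that have not been extracted yet."""
        return [a for a in requested if a not in self.known_attributes]

    def add(self, entity: str, attribute: str, value: str) -> None:
        """Record an extracted value, skipping exact duplicates."""
        self.known_attributes.add(attribute)
        values = self._data[entity].setdefault(attribute, [])
        if value not in values:
            values.append(value)

    def get(self, entity: str, attribute: str) -> List[str]:
        """Return a copy of the stored values, or an empty list if none exist."""
        return list(self._data.get(entity, {}).get(attribute, []))


store = AttributeStore()
# First query: nothing is known yet, so both attributes must be extracted.
to_extract = store.missing(["interactions", "dosage"])
store.add("Drug A", "interactions", "SSRI: increased risk of serotonin syndrome")
# Later query: only attributes not yet in the store require a new corpus pass.
still_missing = store.missing(["interactions", "dosage"])
```

The design choice here is that only attributes missing from the store trigger another hierarchical prompting run; everything already extracted is served directly from the structure.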

Example: Analysis of Drug Interactions

Let's consider a pharmaceutical company using this approach to analyze drug interactions across millions of research papers and clinical studies.

Initial Query: "What are the possible interactions between Drug A and commonly prescribed antidepressants?"
Hierarchical Prompting Process:
- Divides the corpus into batches (e.g., 1000 papers per batch)
- Processes each batch to extract mentions of Drug A, antidepressants, and observed interactions
- Aggregates results across all batches

Dynamic Attribute Mapping:
- Creates a structured representation:

    {
      "Drug A": {
        "Interactions": {
          "SSRI": ["increased risk of serotonin syndrome", "20% of papers"],
          "MAOI": ["contraindicated", "15% of papers"],
          "SNRI": ["potential blood pressure effects", "10% of papers"]
        }
      }
    }

- Updates this structure with each new analysis

Result: The system provides a comprehensive answer about Drug A's interactions, including the frequency of reported interactions across the entire dataset. This structured data can be quickly retrieved for future related queries without having to reprocess all documents.

This approach allows the pharmaceutical company to gain insights that conventional RAG methods might overlook, because retrieving only a handful of documents can miss important interactions mentioned in less prominent papers.
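The frequency figures in the structure above come from the aggregation step. The following sketch shows one way to combine per-batch counts into corpus-wide shares. The function name `merge_batches` and the shape of a batch result are assumptions, and the counts are invented solely so that the output matches the percentages in the example; they are not real study data.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Assumed batch result: (papers in batch, papers mentioning each drug class).
BatchResult = Tuple[int, Dict[str, int]]


def merge_batches(batches: List[BatchResult]) -> Dict[str, str]:
    """Combine per-batch mention counts into a share of all processed papers."""
    total_papers = 0
    mentions: Counter = Counter()
    for n_papers, counts in batches:
        total_papers += n_papers
        mentions.update(counts)  # adds the counts per drug class
    if total_papers == 0:
        return {}
    return {
        drug_class: f"{100 * n / total_papers:.0f}% of papers"
        for drug_class, n in mentions.items()
    }


# Invented counts for two batches of 1000 papers each.
example_batches = [
    (1000, {"SSRI": 250, "MAOI": 100}),
    (1000, {"SSRI": 150, "MAOI": 200, "SNRI": 200}),
]
frequencies = merge_batches(example_batches)
```

Because counts rather than percentages are passed between levels, batches of different sizes can be merged without distorting the final shares.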

Conclusion

The hands-on use of hierarchical prompting and dynamic attribute mapping represents a practical solution for deriving insights from large, complex datasets. By enabling truly comprehensive analysis and efficient knowledge structuring, this approach allows companies to harness the full potential of their information assets, make more informed decisions, and uncover valuable insights across their entire data corpus.

The combination of hierarchical prompting and dynamic attribute mapping brings several crucial advantages:

- A thorough analysis of complete datasets
- Scalability that goes beyond the capacities of individual models
- An evolving knowledge structure that improves efficiency and robustness and makes answers more deterministic
- Efficient answering of future queries from the stored structure

Next Steps

Schedule a conversation to discuss your specific challenges in knowledge extraction. We want to show you how we can give your engineers, lawyers, managers, and employees superpowers within a week.
