Project Title: Intelligent PDF Data Extraction and Database Creation

Project Objective: To create a system that extracts structured and unstructured data from vendor-uploaded PDFs and stores this data in a database for indexing and querying. The system should also support a chatbot capable of answering questions related to the PDF content.

Project Details:

  1. Input Requirements:

    • PDFs with diverse structures, including plain text, headings, paragraphs, tables, and bullet points.

    • Examples include Requests for Quotations (RFQs), contracts, manuals, and reports.

  2. Key Features:

    • Extract all relevant data from PDFs, ignoring irrelevant sections like headers and footers.

    • Recognize and structure tables accurately, associating each table with its title or caption (typically set in bold text and followed by a colon). Handle nested data within tables where present.

    • Identify and extract bullet points within paragraphs and organize them as nested lists.

    • Dynamically structure text content using headings as keys and their corresponding text as values.

    • Clean extracted data by removing unnecessary symbols and normalizing spaces.
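One way the cleaning and heading-keyed structuring could be sketched is shown below. It assumes the extraction step already yields `(text, is_heading)` pairs; how a line is flagged as a heading (font size, bold, numbering) is left open. The names `clean_text` and `structure_lines` and the allowed-symbol list are illustrative assumptions, not part of the existing scripts.

```python
import re

# Leading bullet markers: "•", "*", "-", or a number such as "1." / "2)".
BULLET_RE = re.compile(r"^\s*(?:[•*-]|\d+[.)])\s+")


def clean_text(text: str) -> str:
    """Replace symbols outside a small allow-list with spaces, then collapse whitespace."""
    text = re.sub(r"[^\w\s.,:;()%/&$'-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def structure_lines(lines):
    """Group lines under the most recent heading.

    lines: iterable of (text, is_heading) pairs in reading order.
    Returns {heading: str | list[str] | {"text": str, "bullets": list[str]}}.
    """
    sections = {}
    current = None
    for raw, is_heading in lines:
        if is_heading:
            current = clean_text(raw)
            sections[current] = {"text": [], "bullets": []}
        elif current is None:
            continue  # text before the first heading is skipped in this sketch
        elif BULLET_RE.match(raw):
            sections[current]["bullets"].append(clean_text(BULLET_RE.sub("", raw)))
        else:
            sections[current]["text"].append(clean_text(raw))

    result = {}
    for heading, body in sections.items():
        text = " ".join(body["text"])
        if body["bullets"] and text:
            result[heading] = {"text": text, "bullets": body["bullets"]}
        elif body["bullets"]:
            result[heading] = body["bullets"]
        else:
            result[heading] = text
    return result
```

The bullet marker is matched before cleaning because the cleaning step would otherwise strip the "•" that identifies the line as a bullet.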

  3. Data Storage and Querying:

    • Store extracted data in Elasticsearch for efficient indexing and search capabilities.

    • Ensure the index mapping supports both structured data (e.g., tables) and unstructured text.
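A possible index mapping for mixed content is sketched below as a plain Python dict. All field names (`doc_id`, `heading`, `content_type`, `page`, `text`, `rows`) are assumptions made for illustration. The `flattened` type for table rows assumes Elasticsearch 7.3 or later and is one option for tables whose column names vary per document; whether it suits real vendor tables would need testing.

```python
# Hypothetical mapping: one Elasticsearch document per extracted section or table.
INDEX_MAPPING = {
    "mappings": {
        "properties": {
            "doc_id": {"type": "keyword"},        # identifier of the source PDF
            "heading": {
                "type": "text",                   # full-text search on headings
                "fields": {"raw": {"type": "keyword"}},  # exact match / aggregations
            },
            "content_type": {"type": "keyword"},  # "text", "bullets", or "table"
            "page": {"type": "integer"},          # page reference for chatbot answers
            "text": {"type": "text"},             # paragraph text or a rendered table
            "rows": {"type": "flattened"},        # table rows with arbitrary columns
        }
    }
}


def unmapped_fields(doc: dict) -> list:
    """Return the keys of a document that the mapping does not declare."""
    return sorted(set(doc) - set(INDEX_MAPPING["mappings"]["properties"]))
```

Keeping a rendered copy of each table in `text` alongside the raw `rows` means a single full-text query can reach both paragraphs and tables.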

  4. Chatbot Integration:

    • Implement a chatbot that can:

      • Answer specific queries related to extracted data.

      • Support questions about tables, specific headings, and bullet points.

      • Provide references to the page number or section of the document for context.
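The retrieval half of such a chatbot could look roughly like the sketch below: build an Elasticsearch query body from the user's question, then turn the hits into answers that carry a section and page reference. The field names (`heading`, `text`, `page`, `content_type`) are assumed, and the functions only build and read plain dicts, so the actual search call is left to whichever client version is in use.

```python
def build_query(question: str, content_type: str = None, size: int = 3) -> dict:
    """Build a search body; optionally restrict to "table", "bullets", or "text"."""
    must = [{"multi_match": {"query": question, "fields": ["heading^2", "text"]}}]
    filters = [{"term": {"content_type": content_type}}] if content_type else []
    return {
        "size": size,
        "query": {"bool": {"must": must, "filter": filters}},
        "highlight": {"fields": {"text": {}}},
    }


def format_hits(response: dict) -> list:
    """Convert a search response into answers with a heading and page reference."""
    answers = []
    for hit in response["hits"]["hits"]:
        source = hit["_source"]
        answers.append({
            "answer": source.get("text", ""),
            "reference": f'{source["heading"]} (page {source["page"]})',
        })
    return answers
```

Boosting `heading` (the `^2` suffix) is a guess at a reasonable default for questions that name a section; the weight would need tuning against real queries.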

  5. Technical Challenges:

    • Data Accuracy: Ensuring tables, bullet points, and text are extracted correctly and associated with the right headings.

    • Header/Footer Removal: Dynamically ignoring irrelevant header/footer content without affecting the core data.

    • Title Detection for Tables: Associating tables with the correct titles using proximity and formatting cues.

    • Nested Content: Structuring paragraphs containing bullet points into hierarchical formats for better clarity.
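For the header/footer challenge, a common heuristic is to treat a line as boilerplate only if it recurs near the top or bottom of most pages. The sketch below assumes each page is already available as a list of text lines in top-to-bottom order; `depth` and `min_ratio` are illustrative parameters rather than tested values.

```python
import re
from collections import Counter


def _normalize(line: str) -> str:
    # Replace digit runs so "Page 3 of 12" and "Page 4 of 12" compare equal.
    return re.sub(r"\d+", "#", line.strip().lower())


def find_repeated_margins(pages, depth=2, min_ratio=0.6):
    """Return normalized lines that recur in the top/bottom `depth` lines of pages."""
    counts = Counter()
    for lines in pages:
        candidates = lines[:depth] + lines[-depth:]
        # A set ensures each page votes at most once per distinct line.
        counts.update({_normalize(line) for line in candidates if line.strip()})
    threshold = max(2, int(len(pages) * min_ratio))
    return {key for key, seen in counts.items() if seen >= threshold}


def strip_margins(pages, depth=2, min_ratio=0.6):
    """Drop repeated margin lines while leaving body text untouched."""
    repeated = find_repeated_margins(pages, depth, min_ratio)
    cleaned = []
    for lines in pages:
        kept = []
        for i, line in enumerate(lines):
            in_margin = i < depth or i >= len(lines) - depth
            if in_margin and _normalize(line) in repeated:
                continue
            kept.append(line)
        cleaned.append(kept)
    return cleaned
```

Because a line is removed only when it is both repeated and located in the margin zone, a body sentence that happens to match a header elsewhere on the page is kept.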

  6. Desired Outcome:

    • A script or pipeline that can process a PDF to output structured JSON data. Example format:

        {
            "Heading 1": "Text under heading 1",
            "Heading 2": [
                "Bullet point 1",
                "Bullet point 2",
                "Bullet point 3"
            ],
            "Table Title": [
                {"Column 1": "Value 1", "Column 2": "Value 2"},
                {"Column 1": "Value 3", "Column 2": "Value 4"}
            ]
        }
      
    • Integration with Elasticsearch to index this structured data.

    • A chatbot API capable of answering natural language questions about the extracted data.
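To connect the JSON format above to indexing, the structured output could be split into one document per heading. The sketch below is one possible shape under assumed field names (`doc_id`, `heading`, `content_type`, `page`, `text`, `rows`); the optional `pages` lookup from heading to page number is hypothetical and would come from the extraction step.

```python
def to_index_docs(doc_id: str, structured: dict, pages: dict = None) -> list:
    """Turn {heading: str | list[str] | list[dict]} into flat index documents."""
    pages = pages or {}
    docs = []
    for heading, value in structured.items():
        doc = {"doc_id": doc_id, "heading": heading, "page": pages.get(heading)}
        if isinstance(value, str):
            doc.update(content_type="text", text=value)
        elif value and all(isinstance(item, dict) for item in value):
            # Table: keep the raw rows and add a text rendering for full-text search.
            rendered = "; ".join(
                ", ".join(f"{col}: {cell}" for col, cell in row.items())
                for row in value
            )
            doc.update(content_type="table", rows=value, text=rendered)
        else:
            doc.update(content_type="bullets", text="\n".join(value))
        docs.append(doc)
    return docs
```

For example, the "Table Title" entry from the format above would become a document with `content_type` set to `"table"` and a `text` value of `"Column 1: Value 1, Column 2: Value 2; Column 1: Value 3, Column 2: Value 4"`.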

  7. Current Progress:

    • Developed base Python scripts using pdfplumber and Apache Tika for text and table extraction.

    • Implemented logic to remove headers and footers and validate extracted tables.

    • Structured data into key-value pairs using headings as keys and nested bullet points as values.

Help Needed:

  • Enhancing the table extraction logic to:

    • Ensure accurate table title detection from bold text.

    • Handle complex tables with merged cells or irregular structures.

  • Optimizing the removal of headers/footers to ensure no relevant data is lost.

  • Recommendations for integrating the chatbot with Elasticsearch for effective querying.

  • Best practices for handling large PDFs with complex structures.
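As a starting point for the table-related items, the sketch below covers two heuristics: picking the nearest bold line above a table as its title, and forward-filling cells left empty by vertical merges. It assumes words arrive as dicts with `text`, `fontname`, `x0`, `top`, and `bottom` keys (the shape pdfplumber produces from `page.extract_words(extra_attrs=["fontname"])`) and that the table bounding box is `(x0, top, x1, bottom)` measured from the top of the page. Treating any font name containing "bold" as bold, the `max_gap` and `line_tol` values, and the assumption that merged continuation cells come back as `None` are all guesses to verify against real files.

```python
def find_table_title(words, table_bbox, max_gap=40.0, line_tol=3.0):
    """Return the closest bold line within `max_gap` points above the table, or None."""
    _, table_top, _, _ = table_bbox
    bold = [
        w for w in words
        if "bold" in w["fontname"].lower()
        and 0 <= table_top - w["bottom"] <= max_gap
    ]
    if not bold:
        return None
    nearest_bottom = max(w["bottom"] for w in bold)
    # Keep only the words on the lowest bold line, in left-to-right order.
    line = sorted(
        (w for w in bold if nearest_bottom - w["bottom"] <= line_tol),
        key=lambda w: w["x0"],
    )
    return " ".join(w["text"] for w in line).rstrip(":").strip()


def fill_merged_cells(rows):
    """Copy the value from the row above into cells that are None."""
    filled = []
    for row in rows:
        previous = filled[-1] if filled else []
        filled.append([
            cell if cell is not None
            else (previous[i] if i < len(previous) else None)
            for i, cell in enumerate(row)
        ])
    return filled
```

Forward-filling cannot tell a merged cell from a genuinely missing value, so it is safest to apply it only to columns known to contain merges.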

Expected Community Support: Looking for code samples, architecture recommendations, and best practices to:

  • Refine PDF data extraction (focus on accuracy and efficiency).

  • Improve the organization of nested and tabular data.

  • Scale the solution for high volumes of data.

  • Enhance the chatbot's ability to interpret and answer queries effectively.
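On the scaling point, one approach worth considering is to stream pages through generators so that a large PDF is never held in memory as a whole, and to give each indexed document a deterministic ID so that re-running the pipeline overwrites rather than duplicates. The sketch below assumes page texts are produced lazily by the extraction step; the function names, field names, and ID scheme are hypothetical.

```python
def iter_page_docs(doc_id, page_texts):
    """Yield one document per non-empty page, consuming `page_texts` lazily."""
    for number, text in enumerate(page_texts, start=1):
        text = text.strip()
        if text:
            yield {"doc_id": doc_id, "page": number, "text": text}


def bulk_actions(docs, index_name):
    """Wrap documents in the action format used by Elasticsearch bulk helpers."""
    for doc in docs:
        yield {
            "_index": index_name,
            "_id": f'{doc["doc_id"]}-p{doc["page"]}',  # stable ID makes re-indexing idempotent
            "_source": doc,
        }
```

The resulting generator can be handed to `elasticsearch.helpers.bulk(client, actions)`, which accepts any iterable of actions and sends them in chunks.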