workin-with-pdfs: Extracting Insights from Scientific Papers

This repository streamlines the extraction and structuring of information from scientific papers (PDFs) for use in a Retrieval-Augmented Generation (RAG) pipeline. It builds on the Python library unstructured, the grobid document-analysis service, and optionally Large Language Models (LLMs).

Key Features:

  1. Automated PDF Processing: Effortlessly extract text and metadata from PDFs of research papers.
  2. Structure Inference: Identify and segment documents into meaningful sections (e.g., Abstract, Introduction, Methods, Results, Conclusion).
  3. Markdown Conversion: Transform extracted content into a clean and readable Markdown format for easy editing and integration with other tools.
  4. RAG Pipeline Foundation: Prepare the processed documents for efficient indexing and retrieval, crucial for building a robust RAG system.
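The Markdown conversion step (feature 3) can be sketched as follows. This is a minimal illustration, not the repository's actual implementation: the `(category, text)` element structure and the category names are assumptions made for the example.

```python
# Hypothetical sketch of Markdown conversion. The element model
# (a list of (category, text) pairs) is an assumption for
# illustration, not this repository's actual data structures.

def elements_to_markdown(elements):
    """Render (category, text) pairs as a Markdown string."""
    lines = []
    for category, text in elements:
        if category == "Title":
            lines.append(f"# {text}")
        elif category == "SectionHeader":
            lines.append(f"## {text}")
        elif category == "ListItem":
            lines.append(f"- {text}")
        else:  # narrative text and anything unrecognized
            lines.append(text)
        lines.append("")  # blank line between blocks
    return "\n".join(lines).strip()

demo = [
    ("Title", "A Study of PDF Extraction"),
    ("SectionHeader", "Abstract"),
    ("NarrativeText", "We extract structure from papers."),
]
print(elements_to_markdown(demo))
```

A real pipeline would map whatever element types the extraction library emits onto these Markdown constructs.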

Technology Stack

  1. Python: The core programming language for this project.
  2. unstructured: A powerful library for extracting text and structure from various document formats, including PDFs.
  3. grobid: (Optional) A specialized tool designed for extracting and structuring bibliographic information and citations from scientific articles.
  4. Large Language Model (LLM): (Optional) LLMs can assist with tasks such as section classification, content summarization, and keyword extraction.
  5. Markdown: The chosen format for representing structured information, ensuring human readability and compatibility.
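Structure inference (feature 2 above) often starts from a simple heading heuristic before any LLM is involved. The sketch below shows one such heuristic; the specific section list and regular expression are assumptions for illustration, not the project's actual logic.

```python
import re

# Hypothetical heading heuristic: recognize common section titles
# of a scientific paper, optionally preceded by a number ("3. Results").
# The section vocabulary here is an assumption, not the repo's.
SECTION_PATTERN = re.compile(
    r"^\s*(?:\d+[\.\)]?\s*)?"
    r"(abstract|introduction|methods?|results|discussion|conclusions?|references)\s*$",
    re.IGNORECASE,
)

def classify_heading(line):
    """Return the canonical section name if the line looks like a heading, else None."""
    m = SECTION_PATTERN.match(line)
    return m.group(1).lower() if m else None

print(classify_heading("3. Results"))      # results
print(classify_heading("Some body text"))  # None
```

Lines tagged this way can then delimit the segments (Abstract, Introduction, and so on) that get converted to Markdown and indexed.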

Setup and Usage

  1. Clone the Repository:

    git clone https://github.com/LoopBraker/working-with-pdfs.git
    cd working-with-pdfs
  2. Create a Virtual Environment (Recommended):

    python -m venv .venv
    source .venv/bin/activate
  3. Install Dependencies:

    pip install -r requirements.txt
  4. (Optional) Set up grobid:

    # grobid is a Java service, not a pip package; one common route is
    # to build it from source (see the grobid docs for alternatives such as Docker)
    git clone https://github.com/kermitt2/grobid.git
    cd grobid
    ./gradlew clean install

  5. Prepare your PDFs:

    • Place the PDFs you want to process in a dedicated directory (e.g., ./data/pdfs).
  6. Run the Extraction Script:

    python extract_and_structure.py --pdf_dir ./data/pdfs --output_dir ./data/processed
    • Replace ./data/pdfs and ./data/processed with your desired input and output directories.
    • Use the --grobid_path argument to specify the path to your grobid.sh script if you choose to use grobid.
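The command-line interface described in step 6 can be sketched with `argparse`. This is a hedged mock-up of the documented flags (`--pdf_dir`, `--output_dir`, `--grobid_path`); the actual `extract_and_structure.py` may define them differently.

```python
import argparse
from pathlib import Path

# Sketch of the CLI described above. The flag names come from the
# usage text; defaults and types here are assumptions.
def build_parser():
    parser = argparse.ArgumentParser(
        description="Extract and structure text from a directory of PDFs."
    )
    parser.add_argument("--pdf_dir", type=Path, required=True,
                        help="Directory containing the input PDFs")
    parser.add_argument("--output_dir", type=Path, required=True,
                        help="Directory for the processed output")
    parser.add_argument("--grobid_path", type=Path, default=None,
                        help="Optional path to the grobid.sh script")
    return parser

args = build_parser().parse_args(
    ["--pdf_dir", "./data/pdfs", "--output_dir", "./data/processed"]
)
print(args.pdf_dir, args.output_dir, args.grobid_path)
```

When `--grobid_path` is omitted, the script would fall back to unstructured-only extraction.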

Contributing

We welcome contributions! If you have ideas for improvements, new features, or bug fixes, feel free to open an issue or submit a pull request.

License

This project is licensed under the MIT License.
