This project was developed as part of my internship. The objective was to create a web application that applies Optical Character Recognition (OCR) to scanned PDF files in order to extract laws and their respective articles. The extracted information (law numbers, titles, and the full text of each article) is cleaned and stored in a structured format in a MongoDB database. The application is designed to make legal documents easier to manage and search for both users and administrators.
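As a rough illustration of the cleaning and structuring step described above, the sketch below turns raw OCR text into a dictionary that could be stored as a MongoDB document. The header patterns ("Law No. ..." and "Article N"), the function names `clean_text` and `parse_law`, and the field names are all assumptions made for this example; the project's actual rules and schema may differ.

```python
import re

# Hypothetical header patterns; real scanned laws may be worded differently.
LAW_HEADER = re.compile(
    r"Law\s+No\.?\s*(?P<number>[\w./-]+)\s*[:\-]?\s*(?P<title>.*)",
    re.IGNORECASE,
)
ARTICLE_HEADER = re.compile(
    r"^Article\s+(?P<number>\d+)\b[.:]?\s*",
    re.IGNORECASE | re.MULTILINE,
)


def clean_text(raw: str) -> str:
    """Normalize whitespace and rejoin words hyphenated across line breaks."""
    text = raw.replace("\x0c", "\n")        # form feeds that separate OCR pages
    text = re.sub(r"-\n(?=\w)", "", text)   # "pro-\ntects" -> "protects"
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)    # trim spaces around line breaks
    text = re.sub(r"\n{2,}", "\n", text)    # drop blank lines
    return text.strip()


def parse_law(raw: str) -> dict:
    """Split cleaned OCR text into a law header and a list of articles."""
    text = clean_text(raw)
    header = LAW_HEADER.search(text)
    matches = list(ARTICLE_HEADER.finditer(text))
    articles = []
    for i, match in enumerate(matches):
        # An article body runs until the next article header (or end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end]
        articles.append({
            "number": int(match.group("number")),
            "text": " ".join(body.split()),
        })
    return {
        "number": header.group("number") if header else None,
        "title": header.group("title").strip() if header else None,
        "articles": articles,
    }
```

The returned dictionary is plain Python data, so under these assumptions it could be passed directly to a PyMongo `insert_one` call.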
- User: Can search for laws and articles, and extract content from specific PDF files.
- Admin: Can add new laws via file uploads, manage existing laws and articles, and oversee other administrators.
- OCR processing of scanned PDFs to extract text.
- Search functionality for laws and articles.
- User-friendly interface for both users and admins.
- Data storage in MongoDB for efficient retrieval and management.
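The OCR feature listed above can be sketched with the `pdf2image` and `pytesseract` libraries named in the technology list. This is a minimal sketch, not the project's actual code: the function name `ocr_pdf`, the default `lang="eng"`, and the 300 DPI setting are assumptions, and the `render` and `recognize` parameters exist only so the function can be exercised without the OCR stack installed.

```python
from typing import Any, Callable, Iterable, List, Optional


def ocr_pdf(
    pdf_path: str,
    lang: str = "eng",   # assumption: use the Tesseract language pack matching the documents
    dpi: int = 300,      # assumption: a common resolution for OCR on scans
    render: Optional[Callable[[str], Iterable[Any]]] = None,
    recognize: Optional[Callable[[Any], str]] = None,
) -> List[str]:
    """Return the recognized text of each page of a scanned PDF."""
    if render is None:
        # Imported lazily; pdf2image also needs Poppler installed on the system.
        from pdf2image import convert_from_path

        def render(path: str) -> Iterable[Any]:
            return convert_from_path(path, dpi=dpi)

    if recognize is None:
        # Imported lazily; pytesseract needs the Tesseract binary on the PATH.
        import pytesseract

        def recognize(image: Any) -> str:
            return pytesseract.image_to_string(image, lang=lang)

    # One string per page, with surrounding whitespace removed.
    return [recognize(page).strip() for page in render(pdf_path)]
```

Calling `ocr_pdf("scan.pdf")` with a hypothetical file name would rasterize each page and run Tesseract on it. Returning one string per page keeps page boundaries available for later cleaning.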
- User Dashboard
- Extract Law
- Cleaning the Extracted Data (Explanation)
Before you begin, ensure you have met the following requirements:
- You have installed Python (version 3.7 or higher).
- You have installed Node.js (version 12 or higher) and npm.
- You have installed MongoDB.
- You have installed Tesseract OCR (required for text extraction from images).
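The requirements above can be checked quickly with a small script such as the one below. It is only a sketch: the executable names `node`, `npm`, `mongod`, and `tesseract` are the usual ones but may differ on your system, the function name `check_prerequisites` is made up for this example, and the Node.js version is not verified. If MongoDB runs on a remote server, `mongod` will legitimately be reported as missing.

```python
import shutil
import sys

# Usual executable names; adjust if your installation uses different ones.
REQUIRED_TOOLS = ("node", "npm", "mongod", "tesseract")


def check_prerequisites(min_python=(3, 7), tools=REQUIRED_TOOLS, which=shutil.which):
    """Return a dict mapping each requirement to True (found) or False."""
    report = {"python": sys.version_info[:2] >= min_python}
    for tool in tools:
        # shutil.which returns None when the executable is not on the PATH.
        report[tool] = which(tool) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```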
git clone https://github.com/abdelmaoulagr/pdf_ocr_app.git
cd pdf_ocr_app
- Create a virtual environment and activate it:
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
- Install the required Python packages:
pip install -r requirements.txt
- Start the Flask server:
export FLASK_APP=app.py # On Windows use `set FLASK_APP=app.py`
flask run
- Navigate to the frontend directory:
cd frontend
- Install the required npm packages:
npm install
- Start the React development server:
npm start
Ensure MongoDB is running on your local machine or a remote server. Configure the connection settings in your backend configuration to point to your MongoDB instance.
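One possible way to configure the connection is to read it from environment variables, as sketched below. The variable names `MONGO_URI` and `MONGO_DB_NAME`, the default database name `laws_db`, and both function names are assumptions for this example; check the backend code for the settings it actually reads.

```python
import os


def mongo_settings(env=None):
    """Read connection settings, falling back to a local MongoDB instance."""
    env = os.environ if env is None else env
    return {
        # 27017 is MongoDB's default port.
        "uri": env.get("MONGO_URI", "mongodb://localhost:27017"),
        # Hypothetical database name used only in this sketch.
        "db_name": env.get("MONGO_DB_NAME", "laws_db"),
    }


def get_database(settings=None):
    """Connect with PyMongo and fail fast if the server is unreachable."""
    from pymongo import MongoClient  # imported lazily so settings can be read without it

    settings = settings or mongo_settings()
    client = MongoClient(settings["uri"], serverSelectionTimeoutMS=5000)
    client.admin.command("ping")  # raises if no server answers within 5 seconds
    return client[settings["db_name"]]
```

With this approach, pointing the backend at a remote server only requires setting `MONGO_URI` before starting Flask.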
Install Tesseract OCR by following the instructions for your operating system in the official Tesseract installation documentation.
- Languages: HTML, CSS, JavaScript, TypeScript, Python
- Frontend Framework: React with Chakra UI and Bootstrap
- Backend Framework: Flask with libraries such as json, base64, PyMongo, pytesseract, pdf2image
- Database: MongoDB