Web application that applies OCR on scanned PDF files

abdelmaoulagr/pdf_ocr_app

OCR Law Extraction Web Application

Description

This project was developed as part of my internship. Its goal was to build a web application that applies Optical Character Recognition (OCR) to scanned PDF files in order to extract laws and their articles. The extracted information (law numbers, titles, and the full text of each article) is cleaned and stored in a structured format in a MongoDB database. The application is designed to make legal documents easier to manage and search for both users and administrators.
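To illustrate what "cleaned and stored in a structured format" can look like, here is a minimal sketch that turns raw OCR text into a law number, a title, and a list of articles. The header patterns (`Law No. ...`, `Article N`) and the field names are assumptions made for this example; the real documents may use a different language or layout, and this is not necessarily how the project's own cleaning step works.

```python
import re
from typing import Any, Dict, List

# Hypothetical header patterns, chosen only for illustration.
LAW_HEADER = re.compile(
    r"^Law\s+No\.?\s*(?P<number>[\w./-]+)\s*(?P<title>.*)$", re.IGNORECASE
)
ARTICLE_HEADER = re.compile(
    r"^Article\s+(?P<number>\d+)\s*[:.\-]?\s*(?P<rest>.*)$", re.IGNORECASE
)


def clean_text(raw: str) -> List[str]:
    """Collapse runs of whitespace and drop the empty lines OCR tends to leave."""
    lines = (re.sub(r"\s+", " ", line).strip() for line in raw.splitlines())
    return [line for line in lines if line]


def parse_law(raw: str) -> Dict[str, Any]:
    """Split OCR text into a law header and its articles (assumed layout)."""
    law: Dict[str, Any] = {"number": None, "title": None, "articles": []}
    current = None  # the article that continuation lines are appended to
    for line in clean_text(raw):
        header = LAW_HEADER.match(line)
        article = ARTICLE_HEADER.match(line)
        if header and law["number"] is None:
            law["number"] = header["number"]
            law["title"] = header["title"] or None
        elif article:
            current = {"number": int(article["number"]), "text": article["rest"]}
            law["articles"].append(current)
        elif current is not None:
            # A line without a header continues the previous article.
            current["text"] = f"{current['text']} {line}".strip()
    return law
```

The resulting dictionary has the shape of a document that could be inserted into a MongoDB collection as-is, with the articles embedded in the law.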

Project Overview

User Roles

  • User: Can search for laws and articles, and extract content from specific PDF files.
  • Admin: Can add new laws via file uploads, manage existing laws and articles, and oversee other administrators.
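As a rough idea of how the search available to users could be expressed against MongoDB, the sketch below builds a case-insensitive filter over laws and their articles. The field names (`number`, `title`, `articles.text`) are hypothetical and only mirror the kind of data described in this README; the actual schema may differ.

```python
import re
from typing import Any, Dict


def build_search_query(term: str) -> Dict[str, Any]:
    """Build a case-insensitive MongoDB filter matching laws or their articles."""
    term = term.strip()
    if not term:
        return {}  # an empty filter matches every document
    # Escape the term so user input is matched literally, not as a regex.
    pattern = {"$regex": re.escape(term), "$options": "i"}
    return {
        "$or": [
            {"number": pattern},         # hypothetical field: law number
            {"title": pattern},          # hypothetical field: law title
            {"articles.text": pattern},  # hypothetical field: embedded articles
        ]
    }
```

With PyMongo, the filter would be passed to `find`, for example `db.laws.find(build_search_query("tax"))`, where `laws` is an assumed collection name.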

Features

  • OCR processing of scanned PDFs to extract text.
  • Search functionality for laws and articles.
  • User-friendly interface for both users and admins.
  • Data storage in MongoDB for efficient retrieval and management.
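The OCR step listed above can be sketched with the two libraries named in this README, pdf2image and pytesseract: each PDF page is rendered to an image, then passed to Tesseract. This is a minimal sketch rather than the application's actual code; the DPI value, the `eng` language code, and the function names are assumptions. Note that pdf2image relies on Poppler being installed on the system.

```python
from typing import Any, Callable, Iterable, Optional


def _render_pages(pdf_path: str) -> Iterable[Any]:
    # Imported lazily so this module can be loaded without pdf2image installed.
    from pdf2image import convert_from_path

    return convert_from_path(pdf_path, dpi=300)  # assumed resolution


def _recognize(image: Any) -> str:
    # Imported lazily for the same reason; needs the Tesseract binary at runtime.
    import pytesseract

    return pytesseract.image_to_string(image, lang="eng")  # assumed language


def ocr_pdf(
    pdf_path: str,
    render: Optional[Callable[[str], Iterable[Any]]] = None,
    recognize: Optional[Callable[[Any], str]] = None,
) -> str:
    """Return the text of every page, with pages separated by a blank line."""
    render = render or _render_pages
    recognize = recognize or _recognize
    pages = [recognize(image).strip() for image in render(pdf_path)]
    # Skip pages where OCR found no text at all.
    return "\n\n".join(page for page in pages if page)
```

The `render` and `recognize` parameters exist only so the pipeline can be exercised without real scans; calling `ocr_pdf("scan.pdf")` uses pdf2image and pytesseract directly.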

Screenshots

  1. User Dashboard

  2. Extract Law

  3. Cleaning the Extracted Data (Explanation)

Installation

Prerequisites

Before you begin, ensure you have met the following requirements:

  • You have installed Python (version 3.7 or higher).
  • You have installed Node.js (version 12 or higher) and npm.
  • You have installed MongoDB.
  • You have installed Tesseract OCR (required for text extraction from images).

Clone the Repository

git clone https://github.com/abdelmaoulagr/pdf_ocr_app.git
cd pdf_ocr_app

Backend Setup

  1. Create a virtual environment and activate it:
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  2. Install the required Python packages:
pip install -r requirements.txt
  3. Start the Flask server:
export FLASK_APP=app.py  # On Windows use `set FLASK_APP=app.py`
flask run

Frontend Setup

  1. In a new terminal, navigate to the frontend directory:
cd frontend
  2. Install the required npm packages:
npm install
  3. Start the React development server:
npm start

MongoDB Setup

Ensure MongoDB is running on your local machine or a remote server. Configure the connection settings in your backend configuration to point to your MongoDB instance.
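One possible way to keep the connection settings configurable is to read them from environment variables and fall back to a local instance. The variable names `MONGO_URI` and `MONGO_DB_NAME`, the default database name, and both helper functions are assumptions made for this sketch; check the backend code for the settings it actually reads.

```python
import os
from typing import Dict, Mapping, Optional

DEFAULT_URI = "mongodb://localhost:27017"  # default port of a local MongoDB
DEFAULT_DB = "pdf_ocr_app"                 # hypothetical database name


def mongo_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Read connection settings from the environment, with local defaults."""
    env = os.environ if env is None else env
    return {
        "uri": env.get("MONGO_URI", DEFAULT_URI),
        "db_name": env.get("MONGO_DB_NAME", DEFAULT_DB),
    }


def get_database(env: Optional[Mapping[str, str]] = None):
    """Connect with PyMongo and fail fast if the server is unreachable."""
    from pymongo import MongoClient  # imported lazily; requires pymongo

    settings = mongo_settings(env)
    client = MongoClient(settings["uri"], serverSelectionTimeoutMS=5000)
    client.admin.command("ping")  # raises if no server answers in time
    return client[settings["db_name"]]
```

Pointing the backend at a remote server then only requires setting `MONGO_URI` before starting Flask.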

Tesseract OCR Installation

Follow the installation instructions for Tesseract OCR based on your operating system from the official documentation: Tesseract Installation
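After installing Tesseract, a quick check that the executable is reachable can save debugging time later. The sketch below uses only the standard library; `find_tesseract` is a hypothetical helper, not part of this project.

```python
import shutil
from typing import Callable, Optional


def find_tesseract(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Return the path of the tesseract executable, or raise if it is missing."""
    path = which("tesseract")
    if path is None:
        raise RuntimeError(
            "Tesseract executable not found on PATH; install it first."
        )
    return path
```

If Tesseract is installed outside `PATH` (common on Windows), pytesseract can be pointed at it by assigning the full path to `pytesseract.pytesseract.tesseract_cmd`.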

Technologies Used

  • Languages: HTML, CSS, JavaScript, TypeScript, Python
  • Frontend Framework: React with Chakra UI and Bootstrap
  • Backend Framework: Flask, with libraries such as json, base64, PyMongo, pytesseract, and pdf2image
  • Database: MongoDB
