OCR Law Extraction Web Application

Description

This project was developed as part of my internship. The objective was to build a web application that applies Optical Character Recognition (OCR) to scanned PDF files in order to extract laws and their articles. The extracted information (law numbers, titles, and article text) is cleaned and stored in structured form in a MongoDB database. The application is designed to make legal documents easier to manage and search for both users and administrators.
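To make the "clean and structure" step concrete, here is a minimal sketch of how raw OCR text could be turned into a law record with numbered articles. It is not the project's actual parser: the heading patterns ("Law No. ..." and "Article N"), the function names, and the field names are all assumptions chosen for illustration, and real scanned laws may use different wording or another language.

```python
import re

# Assumed heading formats (illustrative only; adjust to the real documents).
LAW_HEADER = re.compile(
    r"^Law\s+No\.?\s*(?P<number>[\w.\-/]+)\s*[:\-]?\s*(?P<title>.*)$",
    re.IGNORECASE,
)
ARTICLE_HEADER = re.compile(
    r"^Article\s+(?P<number>\d+)\s*[:.\-]?\s*(?P<rest>.*)$",
    re.IGNORECASE,
)


def clean_line(line):
    """Collapse runs of whitespace that OCR tends to produce."""
    return re.sub(r"\s+", " ", line).strip()


def parse_law(text):
    """Split OCR text into one law record with a list of articles."""
    law = {"number": None, "title": None, "articles": []}
    current = None  # article currently being filled in
    for raw in text.splitlines():
        line = clean_line(raw)
        if not line:
            continue
        header = LAW_HEADER.match(line)
        if header and law["number"] is None:
            law["number"] = header["number"]
            law["title"] = header["title"] or None
            continue
        article = ARTICLE_HEADER.match(line)
        if article:
            current = {"number": int(article["number"]), "text": article["rest"]}
            law["articles"].append(current)
        elif current is not None:
            # Continuation line: append it to the article being read.
            current["text"] = (current["text"] + " " + line).strip()
    return law


sample = (
    "Law No. 12-34: Consumer   Protection\n\n"
    "Article 1. First  line\ncontinues here\n"
    "Article 2 - Second"
)
law = parse_law(sample)
```

A dictionary shaped like this maps directly onto a MongoDB document, which is one reason a nested "law with articles" layout is a natural fit for this kind of data.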

Project Overview

User Roles

User: Can search for laws and articles, and extract content from specific PDF files.
Admin: Can add new laws via file uploads, manage existing laws and articles, and oversee other administrators.

Features

OCR processing of scanned PDFs to extract text.
Search functionality for laws and articles.
User-friendly interface for both users and admins.
Data storage in MongoDB for efficient retrieval and management.
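The OCR feature listed above can be sketched roughly as follows, assuming the backend renders each PDF page to an image with pdf2image and reads it with Pytesseract (both appear under Technologies Used). The function name `ocr_pdf` and its parameters are hypothetical, not taken from the project; the two library calls are imported lazily so the page-joining logic can be exercised without Tesseract installed.

```python
def ocr_pdf(pdf_path, lang="eng", dpi=300, render_pages=None, recognize=None):
    """Return the text of a scanned PDF, one block per page.

    render_pages and recognize are injectable (an assumption made for
    testability); by default they use pdf2image and pytesseract.
    """
    if render_pages is None:
        # Requires the pdf2image package (which relies on Poppler).
        from pdf2image import convert_from_path

        def render_pages(path):
            return convert_from_path(path, dpi=dpi)

    if recognize is None:
        # Requires the pytesseract package and the tesseract binary.
        import pytesseract

        def recognize(image):
            return pytesseract.image_to_string(image, lang=lang)

    pages = [recognize(image).strip() for image in render_pages(pdf_path)]
    return "\n\n".join(pages)


# Demonstration with stand-in callables instead of real OCR.
demo_text = ocr_pdf(
    "example.pdf",
    render_pages=lambda path: ["page one", "page two"],
    recognize=lambda image: f"  {image}  ",
)
```

With the real libraries installed, calling `ocr_pdf("scan.pdf", lang="fra")` would use the French language data, provided that language pack is installed for Tesseract.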

Screenshots

User Dashboard

Extract Law

Cleaning the Extracted Data (Explanation)

Installation

Prerequisites

Before you begin, ensure you have met the following requirements:

You have installed Python (version 3.7 or higher).
You have installed Node.js (version 12 or higher) and npm.
You have installed MongoDB.
You have installed Tesseract OCR (required for text extraction from images).
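If you want to verify these prerequisites before continuing, a small check like the one below can help. It is only a sketch: the helper names are hypothetical, and it looks for the `node`, `npm`, and `tesseract` executables on your PATH. MongoDB is left out because it may run on a remote server.

```python
import shutil
import sys

# Command-line tools the setup steps rely on (MongoDB may be remote).
REQUIRED_COMMANDS = ["node", "npm", "tesseract"]


def missing_commands(commands, which=shutil.which):
    """Return the commands that cannot be found on PATH."""
    return [name for name in commands if which(name) is None]


def python_ok(version_info=sys.version_info, minimum=(3, 7)):
    """Check that the interpreter meets the minimum Python version."""
    return tuple(version_info[:2]) >= minimum


def report():
    """Build human-readable messages describing any unmet prerequisite."""
    messages = []
    if not python_ok():
        messages.append("Python 3.7 or higher is required.")
    for name in missing_commands(REQUIRED_COMMANDS):
        messages.append(f"'{name}' was not found on PATH.")
    return messages
```

Running `print(report())` prints an empty list when everything was found. The check does not compare the Node.js version, so confirm that separately with `node --version`.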

Clone the Repository

git clone https://github.com/abdelmaoulagr/pdf_ocr_app.git
cd pdf_ocr_app

Backend Setup

Create a virtual environment and activate it:

python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

Install the required Python packages:

pip install -r requirements.txt

Start the Flask server:

export FLASK_APP=app.py  # On Windows use `set FLASK_APP=app.py`
flask run

Frontend Setup

Navigate to the frontend directory:

cd frontend

Install the required npm packages:

npm install

Start the React development server:

npm start

MongoDB Setup

Ensure MongoDB is running on your local machine or a remote server. Configure the connection settings in your backend configuration to point to your MongoDB instance.
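As an illustration of what such connection settings might look like, the sketch below reads the connection string from environment variables and falls back to a local instance. The variable names `MONGO_URI` and `MONGO_DB`, the default database name, and both function names are assumptions; check the backend code for the names it actually uses.

```python
import os

DEFAULT_URI = "mongodb://localhost:27017"  # standard local MongoDB port
DEFAULT_DB = "laws_db"  # hypothetical database name


def mongo_settings(env=None):
    """Resolve connection settings from environment variables."""
    env = os.environ if env is None else env
    return {
        "uri": env.get("MONGO_URI", DEFAULT_URI),
        "db_name": env.get("MONGO_DB", DEFAULT_DB),
    }


def connect(settings):
    """Open the database and fail fast if the server is unreachable."""
    from pymongo import MongoClient  # requires the pymongo package

    client = MongoClient(settings["uri"], serverSelectionTimeoutMS=3000)
    client.admin.command("ping")  # raises if no server responds in time
    return client[settings["db_name"]]
```

Keeping the URI in the environment rather than in source code makes it easy to switch between a local instance and a remote server without editing the backend.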

Tesseract OCR Installation

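Follow the installation instructions for your operating system in the official Tesseract OCR documentation (Tesseract Installation). Pytesseract is only a wrapper around the Tesseract command-line tool, so the tesseract executable must be installed and available on your PATH.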

Technologies Used

Languages: HTML, CSS, JavaScript, TypeScript, Python
Frontend Framework: React with Chakra UI and Bootstrap
Backend Framework: Flask, with the PyMongo, Pytesseract, and pdf2image libraries and the json and base64 modules from the Python standard library
Database: MongoDB

