OCR and Document Search Web Application

This project is a web-based application that lets users upload images or documents, extract their text with Optical Character Recognition (OCR), and search the extracted text. It uses Gradio for the interface and Pytesseract for OCR, and is suited to managing and searching text from scanned documents and images (multi-page PDF support is planned; see Future Enhancements).

Features

OCR Processing: Extract text from images or documents in multiple languages using Tesseract.
Document Search: Perform a keyword search across the extracted text to find relevant information quickly.
Gradio Interface: User-friendly web interface for uploading documents and displaying results.
Multiple Language Support: The application can handle multiple languages, including English and Hindi.
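
As an illustration of the multi-language feature, here is a minimal sketch of how several Tesseract language codes could be passed to Pytesseract. The helper names `build_lang_arg` and `extract_text` are hypothetical and not taken from this repository; `pytesseract.image_to_string` and its `lang` parameter are part of Pytesseract.

```python
from typing import Iterable


def build_lang_arg(languages: Iterable[str]) -> str:
    """Join Tesseract language codes into the 'eng+hin' form Tesseract expects."""
    codes = [code.strip() for code in languages if code.strip()]
    if not codes:
        raise ValueError("at least one language code is required")
    return "+".join(codes)


def extract_text(image, languages=("eng", "hin")) -> str:
    """Run OCR on a PIL image using the given Tesseract language codes."""
    # Imported lazily so build_lang_arg can be used without Tesseract installed.
    import pytesseract

    return pytesseract.image_to_string(image, lang=build_lang_arg(languages))
```

Each language code needs its trained data installed alongside Tesseract. On Debian or Ubuntu, Hindi is usually provided by the `tesseract-ocr-hin` package.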

Tech Stack

Gradio: For building the web-based interface.
Pytesseract: To perform OCR on uploaded images and documents.
Python: Backend processing.
Tesseract-OCR: The underlying OCR engine used for text extraction.

Requirements

Before running the application, ensure you have the following installed:

Python 3.x
Tesseract-OCR
- For Linux:
```
sudo apt-get install tesseract-ocr
```
- For Windows, download and run the Tesseract installer.
Required Python packages:
```
pip install -r requirements.txt
```
Your requirements.txt should include:
- gradio
- pytesseract
- pillow (for image handling)
- numpy

Installation

Clone the Repository:

```
git clone https://github.com/ajaynair710/OCR-and-Document-Search-Web-Application.git
cd OCR-and-Document-Search-Web-Application
```

Install Dependencies:
```
pip install -r requirements.txt
```
Install Tesseract:
- For Linux:
```
sudo apt-get install tesseract-ocr
```
- For Windows: Download and install Tesseract.
Make sure Tesseract is added to your system's PATH.
Run the Application:
```
python ocr_app.py
```
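The app runs on a local server. Open the local URL printed in the console, typically http://127.0.0.1:7860 (7860 is Gradio's default port). If the console shows http://0.0.0.0:7860, the server is listening on all network interfaces; open http://localhost:7860 in your browser instead.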
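
The launch step might look roughly like the sketch below. This is not the repository's actual script: `run_ocr`, `build_demo`, `main`, and `browser_url` are hypothetical names, and binding to `0.0.0.0` is an assumption. The Gradio calls used (`gr.Interface`, `gr.Image`, `gr.Textbox`, and `launch` with `server_name` and `server_port`) are part of Gradio's public API.

```python
def run_ocr(image):
    """Hypothetical handler: return the text Tesseract finds in the uploaded image."""
    if image is None:
        return ""
    import pytesseract  # lazy import: needs the Tesseract binary at call time

    return pytesseract.image_to_string(image)


def build_demo():
    """Assemble a simple Gradio interface around the handler above."""
    import gradio as gr

    return gr.Interface(
        fn=run_ocr,
        inputs=gr.Image(type="pil", label="Image"),
        outputs=gr.Textbox(label="Extracted text"),
        title="OCR and Document Search",
    )


def browser_url(server_name: str = "127.0.0.1", port: int = 7860) -> str:
    """Return an address a browser can open for the given bind address."""
    # 0.0.0.0 means "all interfaces" and is not a destination a browser should use.
    host = "localhost" if server_name == "0.0.0.0" else server_name
    return f"http://{host}:{port}"


def main():
    # Assumed settings: listen on all interfaces, on Gradio's default port.
    print("Open", browser_url("0.0.0.0", 7860))
    build_demo().launch(server_name="0.0.0.0", server_port=7860)
```

Calling `main()` starts the server. Leaving out `server_name` makes Gradio bind to 127.0.0.1, so the app is reachable only from the same machine.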

Usage

Upload a Document or Image:
- Open the app in your browser, and upload an image or document that contains text you want to extract.
Text Extraction:
- Once the document is uploaded, the app will process it using the Tesseract OCR engine and display the extracted text on the screen.
Search:
- Use the search bar to find specific keywords within the extracted text.
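
The search step can be pictured with a small sketch like the one below. It assumes a plain case-insensitive, line-by-line substring match; the function name `search_text` and the sample text are illustrative only, and the application's own search logic may differ.

```python
def search_text(text: str, keyword: str):
    """Return (line_number, line) pairs whose text contains the keyword, ignoring case."""
    needle = keyword.strip().lower()
    if not needle:
        return []
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if needle in line.lower()
    ]


# Made-up OCR output used only to demonstrate the call.
sample = "Invoice No. 1042\nTotal due: 250 USD\nThank you for your invoice payment"
for number, line in search_text(sample, "invoice"):
    print(f"{number}: {line}")
```

Returning line numbers with each match makes it easier to locate the hit in a long extraction result.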

Troubleshooting

TesseractNotFoundError: If you encounter the TesseractNotFoundError, ensure Tesseract is installed correctly and added to your system's PATH.

For Linux:
```
sudo apt-get install tesseract-ocr
```
For Windows: Add Tesseract's installation directory (e.g., C:\Program Files\Tesseract-OCR\) to your system's PATH.
OCR Accuracy: OCR accuracy depends on the quality of the uploaded images. Higher resolution images tend to yield better results.
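
When fixing PATH is not convenient, one possible workaround is to locate the Tesseract binary yourself and point Pytesseract at it. The sketch below is an assumption-laden example: `locate_tesseract` and `configure_pytesseract` are hypothetical helpers, and the Windows fallback path is simply the example directory mentioned above. `shutil.which` is in the standard library, and `pytesseract.pytesseract.tesseract_cmd` is the setting Pytesseract reads for the executable path.

```python
import shutil
from pathlib import Path

# Assumed default install location on Windows (the example directory named above).
WINDOWS_DEFAULT = Path(r"C:\Program Files\Tesseract-OCR") / "tesseract.exe"


def locate_tesseract(fallbacks=(WINDOWS_DEFAULT,)):
    """Return the path to the tesseract executable, or None if it cannot be found."""
    found = shutil.which("tesseract")  # searches the directories listed in PATH
    if found:
        return found
    for candidate in fallbacks:
        if Path(candidate).is_file():
            return str(candidate)
    return None


def configure_pytesseract():
    """Point pytesseract at the located binary, or raise a clear error."""
    path = locate_tesseract()
    if path is None:
        raise RuntimeError("Tesseract not found: install it or add it to PATH")
    import pytesseract

    pytesseract.pytesseract.tesseract_cmd = path
    return path
```

Calling `configure_pytesseract()` once at startup, before any OCR call, turns a later `TesseractNotFoundError` into an immediate and more descriptive failure.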

Future Enhancements

PDF Support: Add the ability to process and extract text from multi-page PDF documents.
Text Export: Allow users to download the extracted text as a .txt or .docx file.
Advanced Search: Implement advanced search filters such as case sensitivity, regex matching, and proximity searches.
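
As one possible shape for the advanced search enhancement, the sketch below adds case-sensitivity and regex options using Python's standard `re` module. It is a proposal, not existing functionality: `advanced_search` and its parameters are hypothetical, and proximity search is not covered.

```python
import re


def advanced_search(text, query, case_sensitive=False, use_regex=False):
    """Return (line_number, line) pairs matching the query under the given options."""
    if not query:
        return []
    # Escape the query unless the caller explicitly asks for regex matching.
    pattern = query if use_regex else re.escape(query)
    flags = 0 if case_sensitive else re.IGNORECASE
    compiled = re.compile(pattern, flags)
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if compiled.search(line)
    ]
```

An invalid pattern raises `re.error` when `use_regex=True`, so a user-facing version would likely catch that and show a message instead.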

Contributing

Contributions are welcome! To contribute:

Fork the repository.
Create a feature branch (git checkout -b feature-branch).
Commit your changes (git commit -m 'Add new feature').
Push to the branch (git push origin feature-branch).
Open a pull request.
