OCR-based Medical Data Extraction Project

I built this project while following Data Analytics Bootcamp 3.0. If you are interested, check out the course link below.

Data Analytics Bootcamp 3.0

Problem Statement for Python Expert

Health insurance companies need to process patient details and prescription images sent by hospitals or individual doctors to extract useful data for claim issuance. This process must comply with government regulations and be completed within 24 hours.

Currently, many insurance companies outsource this task to firms like "AtliQ Analytics", which rely on a manual process to extract information from images. Employees view scanned images, manually enter the information, and categorize the data. This manual method is prone to errors and becomes inefficient with large volumes of images, such as during a pandemic.

Task: Develop an automated solution to extract relevant data from images of patient details and prescriptions. This solution should:

Process Images: Use OCR (Optical Character Recognition) to extract text from scanned images.
Data Extraction: Identify and extract specific information, such as patient name, date, address, prescription details, etc.
Error Reduction: Minimize errors compared to the manual process.
Efficiency: Handle large volumes of images and ensure data extraction within the 24-hour timeframe.
Compliance: Ensure the solution complies with government regulations for data processing and privacy.

This upgrade aims to replace the current manual system with a more efficient, accurate, and scalable automated software solution.

Solution approach

To address these problems, we are building a program that extracts data from the images automatically. Machines cannot fully replace humans here: a person still reviews the extracted data before submitting it. Even so, this saves the considerable time previously spent typing the data in manually.

We use the Python programming language with pytesseract, a Python wrapper for Google's Tesseract OCR engine, to extract the text, and Python's regular expression module (re) to process that text and distill the desired output.

Technologies used

Python
OOP (object-oriented programming)
pdf2image
OpenCV
pytesseract
Regular expressions
pytest
Postman
FastAPI

Workflow

PDF to Image

To convert each PDF into images, we used the pdf2image library.
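The conversion step might look like the sketch below. `convert_from_path` is pdf2image's entry point for reading a PDF from disk; the function names `page_image_path` and `pdf_to_images`, the `pages` output folder, and the 200 DPI setting are illustrative assumptions rather than details taken from the project. pdf2image also needs Poppler installed on the machine.

```python
from pathlib import Path


def page_image_path(pdf_path, page_number, out_dir="pages"):
    """Build an output file name such as pages/prescription_page1.png."""
    stem = Path(pdf_path).stem
    return Path(out_dir) / f"{stem}_page{page_number}.png"


def pdf_to_images(pdf_path, out_dir="pages", dpi=200):
    """Render every page of a PDF to a PNG file and return the saved paths."""
    # Imported here so the module still loads where pdf2image is not installed.
    from pdf2image import convert_from_path

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    pages = convert_from_path(pdf_path, dpi=dpi)  # list of PIL images
    saved = []
    for number, page in enumerate(pages, start=1):
        target = page_image_path(pdf_path, number, out_dir)
        page.save(target, "PNG")
        saved.append(target)
    return saved
```

A higher DPI generally gives the OCR engine more detail to work with, at the cost of larger images and slower processing.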

Extracting data without preprocessing

We first tried extracting data from the source files without any processing. Because the scans are not in a clean, OCR-friendly form, the extracted data was not as expected.

Extracted data from the above image

Dr John Smith, M.D
2 Non-Important Street,
New York, Phone (000)-111-2222

Name: Maria Sharapova Date: 5/11/2022

Address: 9 tennis court, new Russia, DC

—momennannenncmneneunnmnnnnninsissiyoinnitnahaadaanih issn earnttneenrenen:

Prednisone 20 mg
Lialda 2.4 gram

3 days,

or 1 month

Image processing

We decided to preprocess the images with the OpenCV module before extracting data from them. We first tried simple (global) thresholding, which produced the image below.

With simple thresholding, any shadow or noise fades out the affected area, which results in loss of data.

Looking for a better approach, we decided to use adaptive thresholding. In this technique the image is divided into sub-regions and a separate threshold value is computed for each one. The end result is much better than with simple thresholding.
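To illustrate why a per-region threshold copes with shadows, here is a small pure-Python sketch of mean-based adaptive thresholding on a one-row "image". It is only a teaching aid under assumed values (the pixel row, the window size of 3, and the offset of 10 are made up); in practice OpenCV's `cv2.adaptiveThreshold` does this work, and it treats image borders differently from this sketch, which simply clips the window.

```python
def global_threshold(img, t):
    """White (255) if the pixel is brighter than one fixed threshold, else black."""
    return [[255 if p > t else 0 for p in row] for row in img]


def adaptive_mean_threshold(img, block, c):
    """Compare each pixel with the mean of its own neighbourhood minus c."""
    h, w = len(img), len(img[0])
    r = block // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Window around (y, x), clipped at the image edges.
            ys = range(max(0, y - r), min(h, y + r + 1))
            xs = range(max(0, x - r), min(w, x + r + 1))
            window = [img[j][i] for j in ys for i in xs]
            local_mean = sum(window) / len(window)
            row.append(255 if img[y][x] > local_mean - c else 0)
        out.append(row)
    return out


# Bright paper on the left, a shadow on the right; 60 and 20 are ink pixels.
page = [[200, 200, 60, 200, 180, 150, 120, 100, 20, 100]]

print(global_threshold(page, 127))
# [[255, 255, 0, 255, 255, 255, 0, 0, 0, 0]]  -> the shadow swallows the ink
print(adaptive_mean_threshold(page, block=3, c=10))
# [[255, 255, 0, 255, 255, 255, 255, 255, 0, 255]]  -> both ink pixels survive
```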

Data extraction after preprocessing

Dr John Smith, M.D
2 Non-Important Street,
New York, Phone (000)-111-2222

Name: Marta Sharapova Date: 5/11/2022

Address: 9 tennis court, new Russia, DC

K

Prednisone 20 mg
Lialda 2.4 gram

Directions:

Prednisone, Taper 5 mg every 3 days,
Finish in 2.5 weeks a
Lialda - take 2 pill everyday for 1 month
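Output like the above could be produced by chaining the preprocessing and OCR steps roughly as follows. `cv2.adaptiveThreshold` and `pytesseract.image_to_string` are the real library calls; the function names, the Gaussian method, and the `block_size=61` / `c=11` values are assumptions for illustration and would need tuning on real scans. The Tesseract binary must be installed for pytesseract to work.

```python
def ocr_prescription(image_path, block_size=61, c=11):
    """Threshold a scanned page adaptively, then run Tesseract on it."""
    # Imported lazily: both packages must be installed to call this function.
    import cv2
    import pytesseract

    image = cv2.imread(image_path)
    if image is None:  # cv2.imread returns None when the file cannot be read
        raise FileNotFoundError(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # block_size must be odd; c is subtracted from each local weighted mean.
    processed = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, block_size, c
    )
    return pytesseract.image_to_string(processed, lang="eng")


def tidy(text):
    """Drop blank lines and surrounding whitespace from raw OCR output."""
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())
```

Calling `tidy(ocr_prescription("pages/prescription_page1.png"))` (a hypothetical path) would return the recognised text with the empty lines removed, which is easier to feed into the pattern-matching step.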

Notebook

For all of the trials above, we used Jupyter notebooks to develop small pieces of functionality that could be reused later when designing the classes.

Notebook

OOP design

The code that extracts medical data from prescription and patient details documents was written using object-oriented programming (OOP) concepts.

Code
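One way such a design could be organised is a shared base class with one subclass per document type. The class names, the `first_match` helper, and the patient details layout below are hypothetical and are not taken from the project's code.

```python
import re
from abc import ABC, abstractmethod


class MedicalDocParser(ABC):
    """Base class: holds the OCR text and defines the parsing interface."""

    def __init__(self, text):
        self.text = text

    @abstractmethod
    def parse(self):
        """Return the extracted fields as a dictionary."""

    def first_match(self, pattern):
        # Shared helper: first capture group of the pattern, or None.
        match = re.search(pattern, self.text)
        return match.group(1).strip() if match else None


class PrescriptionParser(MedicalDocParser):
    def parse(self):
        return {"patient_name": self.first_match(r"Name:(.*?)Date")}


class PatientDetailsParser(MedicalDocParser):
    def parse(self):
        # The layout of this document type is assumed for illustration.
        return {"phone": self.first_match(r"(\(\d{3}\)[ -]?\d{3}-\d{4})")}


parser = PrescriptionParser("Name: Marta Sharapova Date: 5/11/2022")
print(parser.parse())  # {'patient_name': 'Marta Sharapova'}
```

Keeping the document-specific patterns in subclasses means a new document type can be supported by adding a class rather than editing existing ones.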

Regular expression

Using the regular expression module, we can match patterns and extract the data we want from the files. For this project we analysed the medical files and, since all the documents follow the same layout, wrote patterns that match only the required data. Before writing the Python code, it is advisable to practise and test the patterns on the regex101 website.

regex101
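As a rough sketch of this pattern-matching step, the snippet below runs a few patterns over an abridged copy of the preprocessed OCR output shown earlier. The patterns, field names, and the `extract_fields` function are illustrative guesses, not the project's actual expressions.

```python
import re

SAMPLE = """Name: Marta Sharapova Date: 5/11/2022

Address: 9 tennis court, new Russia, DC

Prednisone 20 mg
Lialda 2.4 gram

Directions:

Prednisone, Taper 5 mg every 3 days,
Finish in 2.5 weeks a
Lialda - take 2 pill everyday for 1 month
"""

# Each entry: field name -> (pattern with one capture group, regex flags).
PATTERNS = {
    "patient_name": (r"Name:(.*?)Date:", 0),
    "date": (r"Date:\s*(\d{1,2}/\d{1,2}/\d{4})", 0),
    "address": (r"Address:(.*)", 0),               # rest of the line
    "directions": (r"Directions:(.*)", re.DOTALL),  # rest of the document
}


def extract_fields(text):
    """Apply every pattern and collect the stripped capture groups."""
    fields = {}
    for name, (pattern, flags) in PATTERNS.items():
        match = re.search(pattern, text, flags)
        fields[name] = match.group(1).strip() if match else None
    return fields


print(extract_fields(SAMPLE)["address"])  # 9 tennis court, new Russia, DC
```

Note that the patterns lean on the fixed labels ("Name:", "Date:", "Address:", "Directions:"), so they only work as long as the documents really do share that layout.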

Test-driven development

This project was developed using the test-driven development methodology, with the pytest module used for testing. Test cases were designed for every method and for the final result, and were run continuously while the code was being developed.

Test cases
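A test in this style might look like the following. The function `get_date` and both test names are hypothetical; in a test-first workflow the two tests would be written before the function body and would fail until it is implemented.

```python
import re


def get_date(text):
    """Return the date after the 'Date:' label, or None if absent."""
    match = re.search(r"Date:\s*(\d{1,2}/\d{1,2}/\d{4})", text)
    return match.group(1) if match else None


def test_get_date_found():
    assert get_date("Name: Marta Sharapova Date: 5/11/2022") == "5/11/2022"


def test_get_date_missing():
    assert get_date("Name: Marta Sharapova") is None
```

pytest collects functions whose names start with `test_` from files named `test_*.py` or `*_test.py`, so running `pytest` in the project folder would execute both checks.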

FastAPI

FastAPI was used to serve the project's backend. As the name suggests, FastAPI helps us develop quickly; its other advantages are:

Built-in data validation
Built-in documentation
Fast execution and performance
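A minimal sketch of what such a server could look like is shown below. The route `/extract_from_doc`, the `file_format` field, the supported format names, and the response shape are all assumptions made for illustration; the OCR and parsing pipeline is left as a placeholder. Handling form fields and uploads in FastAPI additionally requires the `python-multipart` package.

```python
SUPPORTED_FORMATS = {"prescription", "patient_details"}


def normalise_format(file_format):
    """Validate the document type sent by the client."""
    cleaned = file_format.strip().lower()
    if cleaned not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported file_format: {file_format!r}")
    return cleaned


def create_app():
    """Application factory that builds the FastAPI app on demand."""
    # Imported lazily so this module loads even where FastAPI is not installed.
    from fastapi import FastAPI, File, Form, HTTPException, UploadFile

    app = FastAPI()

    @app.post("/extract_from_doc")
    async def extract_from_doc(
        file_format: str = Form(...),
        file: UploadFile = File(...),
    ):
        try:
            doc_type = normalise_format(file_format)
        except ValueError as err:
            raise HTTPException(status_code=400, detail=str(err))
        content = await file.read()
        # Placeholder: the OCR and parsing pipeline would run on `content` here.
        return {"file_format": doc_type, "size_bytes": len(content)}

    return app
```

If this were saved as `main.py`, it could be started with `uvicorn main:create_app --factory`, and the automatically generated documentation would be available at the `/docs` path.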

Postman

As this is a backend project, no frontend was developed. To check how the server responds to HTTP requests, we used Postman to send requests and inspect the responses.

Result

This backend functionality can be integrated into Mr. X Analytics' existing software so that data is extracted automatically. The extracted data may contain some errors, so the person performing the work has to correct them before submitting the response.

Benefits of the Automated Data Extraction Solution

Time Efficiency:
- The automated solution can save at least 30 seconds per document. While this might seem insignificant for a single document, the cumulative time saved across thousands of documents can be substantial. This efficiency boost allows the company to process more documents within the given timeframe, enhancing productivity and profitability.
Cost Savings:
- By automating the data extraction process, the company can handle peak periods without the need to hire additional temporary staff. This not only reduces labor costs but also alleviates the challenges associated with training and managing a seasonal workforce.
Error Reduction:
- Combining automation with manual verification significantly lowers the error rate. Automated systems can consistently extract data with high accuracy, and manual oversight ensures any anomalies are quickly corrected. This dual approach ensures data integrity and reliability, leading to fewer mistakes and improved compliance with regulatory requirements.
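To make the time saving concrete, the arithmetic can be sketched as follows. The 30 seconds per document is the figure quoted above; the volume of 10,000 documents is a made-up number used only for illustration.

```python
SECONDS_SAVED_PER_DOC = 30  # lower bound quoted in the text


def hours_saved(documents):
    """Total hours saved for a given number of processed documents."""
    return documents * SECONDS_SAVED_PER_DOC / 3600


print(round(hours_saved(10_000), 1))  # 83.3
```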

Overall, these benefits highlight the transformative impact of automation in processing patient details and prescription images, ensuring faster, more accurate, and cost-effective operations.
