A lightweight Windows desktop application that extracts text from images, PDFs, and Office documents with ease.
- Image Selection: Select images or documents (JPG, PNG, BMP, GIF, TIFF, PDF, Excel, PowerPoint (PPTX))
- OCR Processing: Extract text from images using the Tesseract OCR engine
- Text Preview: View extracted text in the application
- Word Export: Export extracted text to Microsoft Word (.docx) format
- User-Friendly Interface: Clean and intuitive Windows Forms UI built using the ReaLTaiizor UI framework.
- .NET Framework 4.8 Developer Pack (SDK) or later
- Tesseract OCR installed on your system
- Download the Tesseract OCR installer from: https://github.com/UB-Mannheim/tesseract/wiki
- Run the installer and install to the default location (usually C:\Program Files\Tesseract-OCR)
- The installer includes English language data files by default
- Alternatively, install Tesseract with Chocolatey:
choco install tesseract
- Or set it up manually: download the Tesseract binaries
- Extract them to a folder (e.g., C:\Tesseract-OCR)
- Download language data files from: https://github.com/tesseract-ocr/tessdata
- Place eng.traineddata in the tessdata folder
- Open a terminal in the project directory
- Restore NuGet packages:
dotnet restore
- Build the project:
dotnet build
- Run the application:
dotnet run
- Launch the application
- Click "Select Image/Document" to choose an image or document file
- Click "Extract Text (OCR)" to process the image and extract text
- Review the extracted text in the text box
- Click "Export to Word Document" to save the text as a .docx file
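
The "Extract Text (OCR)" step above could be implemented roughly as follows. This is a minimal sketch, not the project's actual OCRService: it assumes the dependency is the Tesseract NuGet wrapper (namespace Tesseract), and the names OcrSketch, ExtractText, and tessdataDir are hypothetical.

```csharp
using System;
using System.IO;
using Tesseract; // assumption: the "Tesseract" NuGet wrapper package

public static class OcrSketch
{
    // Hypothetical helper: runs English OCR on one image file
    // and returns the recognized text.
    public static string ExtractText(string imagePath, string tessdataDir)
    {
        // Fail early with a clear error before the engine is created.
        if (!File.Exists(imagePath))
            throw new FileNotFoundException("Image not found.", imagePath);

        // tessdataDir must contain eng.traineddata for the "eng" language.
        using (var engine = new TesseractEngine(tessdataDir, "eng", EngineMode.Default))
        using (var image = Pix.LoadFromFile(imagePath))
        using (var page = engine.Process(image))
        {
            return page.GetText();
        }
    }
}
```

A call such as OcrSketch.ExtractText(selectedFile, tessdataDir) would cover plain image formats only; PDFs and Office documents would need their own handling before or instead of this step.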
OCRTextReaderApp/
├── MainForm.cs # Main UI form
├── OCRService.cs # OCR text extraction service
├── WordExportService.cs # Word document export service
├── Program.cs # Application entry point
└── OCRTextReader.csproj # Project file
- Tesseract: OCR engine for text extraction
- DocumentFormat.OpenXml: For creating Word documents
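
As an illustration of how DocumentFormat.OpenXml can produce a .docx file, the sketch below writes each line of the extracted text as its own paragraph. It is an assumption about how an export step might look, not the contents of WordExportService.cs; the names WordExportSketch and Export are hypothetical.

```csharp
using System;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class WordExportSketch
{
    // Hypothetical helper: writes the text to a new .docx file,
    // one paragraph per input line.
    public static void Export(string text, string outputPath)
    {
        using (var doc = WordprocessingDocument.Create(
            outputPath, WordprocessingDocumentType.Document))
        {
            MainDocumentPart mainPart = doc.AddMainDocumentPart();
            var body = new Body();

            // Normalize Windows line endings, then split into lines.
            foreach (string line in text.Replace("\r\n", "\n").Split('\n'))
            {
                // Preserve leading and trailing spaces that OCR output may contain.
                var run = new Run(new Text(line) { Space = SpaceProcessingModeValues.Preserve });
                body.AppendChild(new Paragraph(run));
            }

            mainPart.Document = new Document(body);
            mainPart.Document.Save();
        }
    }
}
```

Writing one paragraph per line keeps the Word output close to the layout shown in the text preview; merging lines into flowing paragraphs would be a separate design decision.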
- Ensure Tesseract OCR is installed
- Verify that eng.traineddata exists in the tessdata folder
- Check that the tessdata path is accessible
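
The checks above can also be done in code. The sketch below looks for eng.traineddata in a list of candidate folders; the two install paths come from the installation notes, while the folder next to the executable and the names TessdataCheck, DefaultCandidates, and FindTessdata are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class TessdataCheck
{
    // Candidate tessdata folders. The application-folder entry is an assumption;
    // the other two follow the install locations mentioned in this README.
    public static string[] DefaultCandidates()
    {
        return new[]
        {
            Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "tessdata"),
            @"C:\Program Files\Tesseract-OCR\tessdata",
            @"C:\Tesseract-OCR\tessdata"
        };
    }

    // Returns the first folder that contains eng.traineddata, or null if none does.
    public static string FindTessdata(IEnumerable<string> candidateDirs)
    {
        foreach (string dir in candidateDirs)
        {
            if (File.Exists(Path.Combine(dir, "eng.traineddata")))
                return dir;
        }
        return null;
    }
}
```

Calling TessdataCheck.FindTessdata(TessdataCheck.DefaultCandidates()) and getting null back suggests that the language data is missing or installed somewhere else.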
- The image quality might be too low
- Try using higher resolution images
- Ensure the image contains clear, readable text
- Check if the text is in a supported language (English by default)
- The application currently supports English text extraction by default
- To add support for other languages, download the corresponding language data files from the Tesseract tessdata repository
- PDF files may require additional processing depending on their format
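
To illustrate the language note above: each downloaded language is a file named after its code (for example deu.traineddata), and Tesseract accepts several codes joined with "+", such as eng+deu. The sketch below lists the codes found in a tessdata folder and builds that string; LanguageSketch and its methods are hypothetical names, and whether the application exposes a language setting is not stated here.

```csharp
using System;
using System.IO;
using System.Linq;

public static class LanguageSketch
{
    // Lists the codes of all .traineddata files in the folder, sorted by name.
    // Note: non-language files such as osd.traineddata would be listed as well.
    public static string[] InstalledLanguages(string tessdataDir)
    {
        return Directory.GetFiles(tessdataDir, "*.traineddata")
            .Select(file => Path.GetFileNameWithoutExtension(file))
            .OrderBy(code => code, StringComparer.Ordinal)
            .ToArray();
    }

    // Joins language codes with '+', the form Tesseract accepts
    // when recognizing text in more than one language.
    public static string ToLanguageString(params string[] codes)
    {
        return string.Join("+", codes);
    }
}
```

The resulting string would take the place of a fixed "eng" language argument wherever the OCR engine is created.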
This project uses the following open-source libraries:
- Tesseract OCR – Licensed under the Apache License 2.0
  https://github.com/tesseract-ocr/tesseract
- DocumentFormat.OpenXml – Licensed under the MIT License
  https://github.com/OfficeDev/Open-XML-SDK
- ReaLTaiizor – Licensed under the MIT License
  https://github.com/Taiizor/ReaLTaiizor