JAVA and HADOOP: Extract Text From (Image, PDF, Image embedded in PDF)

Extract Text From (Image, PDF, Image embedded in PDF)
----------------------------------------------------------------------------------------------------
Extracting text from a regular PDF is easy, but extracting text from a scanned PDF is a bit more difficult, because each scanned page is embedded in the PDF as an image.

Logic
-------
So for this kind of PDF, we first have to extract the images from the PDF and then extract the text from those images.

Step1:

If you have a Maven project, add the dependency below to your pom. This dependency is required to extract images from the PDF.

<dependency>
<groupId>net.sourceforge.tess4j</groupId>
<artifactId>tess4j</artifactId>
<version>2.0.0</version>
</dependency>

Step2:

To extract text from the images we need to install tesseract-ocr. Download the .exe installer from the site and run it: https://code.google.com/p/tesseract-ocr/

After installing, add the installation directory to the PATH. I added the installed folder path to the PATH system variable (Properties->Advanced system settings->Environment Variables):

C:\Program Files (x86)\Tesseract-OCR

Step3:

Open the Windows command prompt and run the command "tesseract". You should not get a command-not-found error here. If you do, check how the PATH is set.
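
The same check can also be done from Java before starting the OCR run. Below is a minimal sketch, not part of the original program: the class name TesseractPathCheck and the method findOnPath are made up for illustration, and it only assumes the executable is called "tesseract" or "tesseract.exe".

```java
import java.io.File;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Tess4J): looks for an executable on the PATH.
class TesseractPathCheck {

    // Searches every directory listed in pathValue for a file called
    // name or name + ".exe" and returns the first match, if any.
    static Optional<File> findOnPath(String name, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String candidate : new String[] {name, name + ".exe"}) {
                File f = new File(dir, candidate);
                if (f.isFile()) {
                    return Optional.of(f);
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<File> exe = findOnPath("tesseract", System.getenv("PATH"));
        if (exe.isPresent()) {
            System.out.println("Found tesseract at " + exe.get().getAbsolutePath());
        } else {
            System.out.println("tesseract is not on the PATH, check the environment variable from Step2.");
        }
    }
}
```

If this prints the "not on the PATH" message even though the variable was changed, it is probably because the program was started before the change; environment variables are read when a process starts.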

Step4:
Once everything is set, restart Eclipse and run the program below with your PDF. It generates a text file for each page of the PDF in the given folder.

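import java.io.File;

import net.sourceforge.tess4j.util.PdfUtilities;

import org.apache.commons.io.FileUtils;

public class TesseractExample {

 public static void main(String[] args) {
  try {
   // Convert each page of the PDF to a PNG image
   File[] imageFiles = PdfUtilities.convertPdf2Png(new File(
     "C:/santosh/1999_001.pdf"));
   // Create the output folder, or empty it if it already exists
   File dir = new File("C:/santosh/IMAGE_TEXT");
   if (!dir.exists()) {
    if (dir.mkdir()) {
     System.out.println("Directory is created!");
    }
   } else {
    FileUtils.cleanDirectory(dir);
   }
   int i = 1;
   for (File file : imageFiles) {
    // tesseract appends ".txt" to the output name itself.
    // Passing the arguments separately keeps paths with spaces intact.
    Process process = new ProcessBuilder("tesseract", file.getAbsolutePath(),
      new File(dir, "imageText" + i).getAbsolutePath()).inheritIO().start();
    // Wait for this page to finish before starting the next one
    if (process.waitFor() != 0) {
     System.err.println("tesseract failed for " + file);
    }
    i++;
   }
  } catch (Exception e) {
   System.err.println(e.getMessage());
  }
 }
}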
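
The program leaves one text file per page (imageText1.txt, imageText2.txt, ...) in the output folder. If you want the whole document as a single text, the pages have to be joined in numeric order, because a plain alphabetical sort would put imageText10.txt before imageText2.txt. Here is a rough sketch of that step; the class PageTextMerger and its methods are my own illustration, and it assumes the files are UTF-8 and follow the imageText<N>.txt naming used above.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: joins the per-page OCR output into one string.
class PageTextMerger {

    private static final Pattern PAGE_FILE = Pattern.compile("imageText(\\d{1,9})\\.txt");

    // Returns the page number encoded in names like "imageText12.txt",
    // or -1 when the name does not follow that pattern.
    static int pageNumber(String fileName) {
        Matcher m = PAGE_FILE.matcher(fileName);
        return m.matches() ? Integer.parseInt(m.group(1)) : -1;
    }

    // Concatenates all imageText<N>.txt files found in dir, ordered by N.
    static String merge(File dir) throws IOException {
        List<File> pages = new ArrayList<>();
        File[] files = dir.listFiles();
        if (files != null) {
            for (File f : files) {
                if (pageNumber(f.getName()) >= 0) {
                    pages.add(f);
                }
            }
        }
        pages.sort(Comparator.comparingInt(f -> pageNumber(f.getName())));
        StringBuilder all = new StringBuilder();
        for (File page : pages) {
            all.append(new String(Files.readAllBytes(page.toPath()), StandardCharsets.UTF_8));
            all.append(System.lineSeparator());
        }
        return all.toString();
    }

    public static void main(String[] args) throws IOException {
        // Same output folder as in the program above
        System.out.print(merge(new File("C:/santosh/IMAGE_TEXT")));
    }
}
```

Files in the folder that do not match the naming pattern are simply ignored, so the sketch can be run on the folder as it is.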

7/02/2015
