Tesseract-ocr: convert scanned images into editable documents on Linux - Linuxaria


Tesseract-ocr: how to convert scanned documents into editable text on Ubuntu or Debian. Original article by Gabriele, published on Gmstyle (Italian blog).

From the requests I receive via email, I have learned that some of my readers use Ubuntu (or Linux in general) to work with graphics and publishing, some as a profession and some as a hobby. This article was prompted by a request from a dear member of this little web space, and I hope it brings a bit of clarity to a subject that, from what I've seen during my research on the Internet, seems to have caused difficulties for many people.

The subject is OCR (Optical Character Recognition), a technology that recognizes the text characters in an image of a paper document, previously digitized with a scanner, and turns them into editable text.

In other words, using the program Tesseract-ocr (which implements this technology), if we take a newspaper clipping and scan it, we get an image file (JPEG, TIFF, etc.) from which we can extract the text and save it as a plain .txt file that we can edit as needed.
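For readers who prefer the terminal, the same image-to-text step can be sketched with the `tesseract` command-line program, which takes an input image and an output base name and writes the recognized text to `<base>.txt`. This is only a minimal sketch, assuming Tesseract is installed and available on the PATH. The file names `out.tif` and `out`, the language code `eng`, and the helper function names are illustrative assumptions, not something taken from the original article.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, language="eng"):
    """Return the argument list for: tesseract IMAGE OUTPUT_BASE -l LANG."""
    return ["tesseract", image_path, output_base, "-l", language]


def run_ocr(image_path, output_base, language="eng"):
    """Run Tesseract if it is installed.

    Returns the path of the generated text file, or None when the
    tesseract program cannot be found on the PATH.
    """
    if shutil.which("tesseract") is None:
        return None
    command = build_tesseract_command(image_path, output_base, language)
    subprocess.run(command, check=True)
    # Tesseract appends ".txt" to the output base name by itself.
    return output_base + ".txt"


if __name__ == "__main__":
    # Hypothetical file names: a scanned page and the base name of the result.
    print(" ".join(build_tesseract_command("out.tif", "out")))
```

Building the command separately from running it makes it easy to check what will be executed before any file is touched.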

Hoping to do something useful, I tried to put together a procedure that is as simple and non-invasive as possible, drawing on some material from the web, so that anyone interested can do with Ubuntu, or Linux in general, the one thing that still keeps them tied to Windows.

In this guide I used Ubuntu 10.10. In addition to Tesseract-ocr and gImageReader, I also installed Xsane, which I will use to scan documents.

2 – Now it is time to install the graphical user interface (GUI) that makes Tesseract simple and intuitive to use: gImageReader. Download the .deb package and install it by clicking on it. After installation, you'll find its icon in Applications > Graphics.

Start Xsane from Applications > Graphics, wait for it to recognize your scanner, and configure it before you scan. Set all the parameters so that the scan of the document is as accurate as possible. The parameters to set in Xsane are the ones shown in the picture below:

c- Binary is the parameter that tells Xsane to scan the document in black & white. This step is crucial for Tesseract to recognize all the scanned text.

d- 1200 dpi resolution. According to my tests, I suggest not going below 600 dpi, because lower values cause total or partial failure to recognize the text.
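The two settings above can be summed up as a small check. The following is just a sketch of the rule of thumb from my tests (Binary mode, 1200 dpi suggested, 600 dpi as the floor); the function and constant names are hypothetical and are not part of Xsane or Tesseract.

```python
MIN_DPI = 600          # below this, my tests showed total or partial failures
SUGGESTED_DPI = 1200   # the resolution used in this guide


def check_scan_settings(mode, dpi):
    """Return a list of warnings for the given scan mode and resolution."""
    warnings = []
    if mode.lower() != "binary":
        warnings.append("scan mode should be Binary (black & white)")
    if dpi < MIN_DPI:
        warnings.append("resolution below %d dpi: text may not be recognized" % MIN_DPI)
    elif dpi < SUGGESTED_DPI:
        warnings.append("resolution is usable, but %d dpi is suggested" % SUGGESTED_DPI)
    return warnings


if __name__ == "__main__":
    for message in check_scan_settings("Color", 300):
        print("warning:", message)
```

An empty list means the settings match the ones used in this guide.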

4 – Now that everything is configured, click on “Scan” and wait for the process to finish. It ends by saving the image out.tif to the destination folder we chose earlier (Home in this case).

Now that we have our digital document, we must start Tesseract through gImageReader. Go to Applications > Graphics and launch the program.

The interface is, as I said, very intuitive and easy to use. Just click on “Open Image” to open the file out.tif created earlier, then click on “Recognize all” to begin the OCR process, and wait for it to end. At the end, as you see in the screenshot below, the contents of the file out.tif appear on the right as text.

My tests returned positive results, and the most useful finding concerns the resolution of the image file: the higher the scan quality you can get from your scanner, the better the output of the OCR process will be, and the fewer errors you'll have in the text file.
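To give an idea of what the resolution setting means in practice, here is a rough calculation of the image size in pixels at 600 and 1200 dpi. It is only a sketch: the A4 page size (210 x 297 mm) is my assumption for the example, since the size of the scanned document is not fixed by this guide, and the function name is hypothetical.

```python
MM_PER_INCH = 25.4


def pixels_at(dpi, width_mm=210.0, height_mm=297.0):
    """Return (width, height) in pixels for a page scanned at the given dpi.

    The default page size is A4, which is an assumption of this example.
    """
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px, height_px


if __name__ == "__main__":
    for dpi in (600, 1200):
        width_px, height_px = pixels_at(dpi)
        print(dpi, "dpi:", width_px, "x", height_px, "pixels")
```

Doubling the resolution doubles both dimensions, so the scan holds about four times as many pixels for Tesseract to work with, at the cost of a larger file and a slower scan.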