When scanning documents, the decision to use Optical
Character Recognition (OCR) versus labeling should revolve
around the issues of data mining and document retrieval.
For documents and/or information contained within those
documents to be searchable, electronic documents must be
indexed. The imaging industry does its customers a
disservice when it advocates OCR as the preferred method
of indexing documents for search purposes. OCR has been
promoted as a way to supposedly automate the process.
Although our document imaging systems allow you to either
label or OCR a document for indexing, the choice of method
needs to be made on a document-by-document basis. Understanding
the drawbacks associated with each method will help
clarify when OCR or labeling is preferred.
Search Results
While search engines (indices) are easy to use, the search
results are often imprecise and display irrelevant
information. For example, searching for “java” might
return documents that describe java the programming
language, java the coffee, and Java the Indonesian island.
The greater the number of documents on your server, the
greater the number of irrelevant search results.
Labeling documents will significantly improve the
accuracy of your search, so that a higher proportion of
the documents returned are relevant. This is especially
true for industries retrieving standardized documents such
as job files and invoices.
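To illustrate the difference, the following minimal Python sketch contrasts a full-text lookup with a label lookup. The documents, label fields, and function names are hypothetical and are not taken from our product; they only show why a query on a label field returns fewer irrelevant hits than a query on every word.

```python
# Hypothetical documents: free text (as OCR would produce) plus
# operator-assigned labels. None of this data comes from a real system.
documents = [
    {"id": 1, "labels": {"type": "invoice", "vendor": "java coffee co"},
     "text": "Invoice for 40 lb of java beans"},
    {"id": 2, "labels": {"type": "job file", "vendor": "acme software"},
     "text": "Job file: migrate payroll system to Java"},
    {"id": 3, "labels": {"type": "travel", "vendor": "island tours"},
     "text": "Itinerary for Java and Bali"},
]


def full_text_search(docs, term):
    """Return ids of documents whose text contains the term as a word."""
    term = term.lower()
    return [d["id"] for d in docs if term in d["text"].lower().split()]


def label_search(docs, field, value):
    """Return ids of documents whose label field equals the value exactly."""
    return [d["id"] for d in docs if d["labels"].get(field) == value]


# The word "java" appears in all three documents, whatever its meaning.
all_hits = full_text_search(documents, "java")

# A label query narrows the result to the one document filed as an invoice.
invoice_hits = label_search(documents, "type", "invoice")
```

In this toy data set the full-text query matches every document, while the label query matches only the invoice.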
For other organizations, it may be more practical to OCR
the document in order to search for a key word or phrase
contained within the text. We do not recommend running OCR
on every page of a document, because of the time required.
Rather, select for OCR the individual pages from the
document that would be relevant to a future search.
OCR is Not 100% Accurate
OCR is the process of converting text on a scanned image
into text that is searchable. One then executes a
“full-text search” on the OCR document with words and
phrases known to be included in a document. The OCR
process is extremely sensitive to the quality of the
image, as well as the font differences within the
document. As a result, the output from an OCR process is
seldom flawless.
If the OCR process claims to be 95% accurate, then on
average one character in twenty is misrecognized. Errors are
introduced when characters “bleed” and touch one another,
or when the scanner picks up “ghost” images from the
reverse side of the document. The inaccuracy of the OCR
process requires an operator to manually correct the
suspect characters. Because of these corrections, using
OCR in place of labeling often negates any time gained by
the automated process.
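The effect of character-level accuracy on searchable words can be estimated with simple arithmetic. The sketch below assumes that character errors are independent of one another and uses a hypothetical word length of five characters and a hypothetical page of 2,000 characters; real OCR errors tend to cluster, so treat the numbers as a rough illustration only.

```python
def word_accuracy(char_accuracy: float, word_length: int) -> float:
    """Probability that every character of a word is recognized correctly,
    assuming independent character errors (a simplifying assumption)."""
    return char_accuracy ** word_length


def expected_char_errors(char_accuracy: float, total_chars: int) -> float:
    """Expected number of misrecognized characters in a block of text."""
    return (1.0 - char_accuracy) * total_chars


# 95% character accuracy, five-character words (hypothetical figures).
intact_word_rate = word_accuracy(0.95, 5)

# Expected suspect characters on a hypothetical 2,000-character page.
suspect_chars = expected_char_errors(0.95, 2000)

print(f"Words recognized without error: {intact_word_rate:.1%}")
print(f"Expected suspect characters per page: {suspect_chars:.0f}")
```

Under these assumptions, roughly 77% of five-character words come through intact, and a 2,000-character page contains about 100 suspect characters for an operator to review.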
Fuzzy Search
Several companies claim that it is unnecessary to
correct the suspect characters and that document searches
remain accurate regardless. They suggest using “fuzzy search” technology
that expands queries to include terms that sound similar
or are typographically similar to the term requested.
However, fuzzy searches often produce an even larger
number of documents that are irrelevant and unnecessary.
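The way a typographic fuzzy search widens a query can be sketched with an edit-distance comparison. This is a generic illustration, not the algorithm of any particular vendor; the vocabulary and the distance threshold of one edit are assumptions chosen for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def expand_query(term, vocabulary, max_distance=1):
    """Return every indexed word within max_distance edits of the term."""
    return [w for w in vocabulary if edit_distance(term, w) <= max_distance]


# Hypothetical words found in an index built from OCR output.
vocabulary = ["java", "lava", "jaya", "jazz", "invoice"]

# The query now also matches "lava" and "jaya", which are unrelated words
# rather than OCR misreadings of "java".
expanded = expand_query("java", vocabulary)
```

Every extra term admitted by the threshold brings its own set of documents into the result list, which is how a fuzzy search can increase the number of irrelevant hits.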
Appropriate Conditions for OCR
Our document imaging system allows intra-document
searches. This is especially useful if one is data mining
or looking for information contained within a document.
One can retrieve a document and then jump directly to the
specific information needed within that document.
Navigation from occurrence to occurrence of the keyword(s)
will start with the first “hit.” Each “hit” is visibly
highlighted within the document, making the search and
retrieval process much more efficient.
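The hit-to-hit navigation described above can be sketched as follows. This is a minimal illustration using Python's standard `re` module, not the implementation used in our product; the sample page text, the function names, and the bracket markers that stand in for visual highlighting are all hypothetical.

```python
import re


def find_hits(text: str, keyword: str):
    """Return (start, end) offsets of each whole-word, case-insensitive
    occurrence of the keyword, in document order."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [m.span() for m in pattern.finditer(text)]


def highlight(text: str, hits, left: str = "[", right: str = "]") -> str:
    """Wrap each hit in marker characters to imitate on-screen highlighting."""
    pieces, cursor = [], 0
    for start, end in hits:
        pieces.append(text[cursor:start])
        pieces.append(left + text[start:end] + right)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Hypothetical OCR text of one page.
page = "Invoice 1042 covers pump repair. The pump was replaced under warranty."

hits = find_hits(page, "pump")   # navigation starts at hits[0], then hits[1]
marked = highlight(page, hits)
print(marked)
```

Stepping through the list of offsets in order corresponds to moving from the first hit to each subsequent occurrence within the document.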
Conclusion
Our solution allows you to either label or OCR a document.
We suggest labeling most documents and using OCR only on
pages that are relevant to a future search. This method
will keep your index database to a manageable size, as
each word in the OCR document is entered into your index
database.
Many of our customers find it faster and more accurate to
manually label a document with predetermined keywords,
phrases and numbers than to OCR the document and correct the
suspect characters. The freedom to use an unlimited number
of characters and numbers when labeling documents allows
operators to use existing terminology with which employees
are already familiar. This facilitates a quick and easy
transition into using digital documents in place of paper
documents throughout their organizations.