Recognition rules setup

elDoc IDP processes documents using recognition rules (also referred to as "Recognition forms" or "RecoForms") defined in the system.

This page describes how to set up RecoForms.

Recognition form creation

You can create an unlimited number of recognition forms in the system. To create a new recognition form, press the +Add button on the Recognition forms management page.

Recognition form settings

Form name (required field) - a distinctive name that clearly identifies the corresponding document.
NOTE: The form name should be unique.

Hint

It is recommended to adopt a naming convention for RecoForm names so that RecoForms remain manageable as their number grows.

A naming convention may look as follows: {docType}_vX_{remarks} (e.g.: Invoice_v1, Invoice_v2, etc.).
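A convention like this can also be checked mechanically. The following is a minimal Python sketch, not part of elDoc IDP: the pattern, the parse_recoform_name helper, and the assumption that the {remarks} part is optional are all hypothetical and only illustrate the idea.

```python
import re

# Hypothetical pattern for {docType}_vX_{remarks}.
# Assumption: docType is letters only, X is a number, remarks are optional.
NAME_PATTERN = re.compile(
    r"^(?P<doc_type>[A-Za-z]+)_v(?P<version>\d+)(?:_(?P<remarks>\w+))?$"
)


def parse_recoform_name(name):
    """Return the parts of a RecoForm name, or None if it breaks the convention."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    return {
        "doc_type": match.group("doc_type"),
        "version": int(match.group("version")),
        "remarks": match.group("remarks"),  # None when no remarks are given
    }
```

For example, parse_recoform_name("Invoice_v2_scanned") yields the document type "Invoice", version 2 and remarks "scanned", while a name such as "Invoice" without a version is rejected.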

Target document - select the Document type from the drop-down list if Conversion has to be performed after document recognition is completed. If Conversion is not required, leave the field blank.

Recognition method - defines the recognition approach to be applied for the specific RecoForm. By default it is set to "Anchor based".

  • Anchor based - recommended for documents with a standard layout or a fixed table format (e.g. invoices, purchase orders, service reports, transcripts, etc.).
  • Regex-based - recommended for documents with a non-standard layout or without a fixed table format (e.g. boarding passes, payment instructions, etc.).
  • Custom plugin based - custom RecoForms for handling non-trivial document types.

Enabled languages – select the language(s) from the drop-down list to be applied during the recognition process. By default it is set to English.

Hint

For greater accuracy, if one language is sufficient for the recognition process, it is recommended not to add extra languages, as this may complicate recognition.

Keywords (required field) - lists keywords that are associated with a particular document type and/or minus keywords that are not associated with it. Adding keywords and/or minus keywords to a RecoForm helps to optimize Recognition Queue processing. The following rules are applied during the recognition process:

  • Keywords (inclusive) - listed as normal words. All keywords have to be present on the document for this RecoForm to be included in the IDP phase.
  • Minus keywords (exclusive) - listed as words starting with a "-" minus sign (e.g.: -invoice). If any of the listed minus keywords is present on the document page, the RecoForm is excluded from the IDP phase.
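The two keyword rules above can be sketched in Python as follows. This is an illustration only: the recoform_is_candidate function is hypothetical, and case-insensitive substring matching is an assumption, since the page does not state how elDoc IDP compares keywords with the document text.

```python
def recoform_is_candidate(page_text, keywords):
    """Apply the keyword rules to the text of a document page.

    Every inclusive keyword must be present, and no minus keyword
    (a keyword written with a leading "-") may be present.
    """
    text = page_text.lower()  # assumption: matching ignores case
    inclusive = [k.lower() for k in keywords if not k.startswith("-")]
    exclusive = [k[1:].lower() for k in keywords if k.startswith("-")]

    # Any minus keyword on the page excludes the RecoForm.
    if any(word in text for word in exclusive):
        return False
    # Otherwise all inclusive keywords have to be found.
    return all(word in text for word in inclusive)
```

With the keywords ["invoice", "-credit"], a page containing "Credit note for invoice 123" is excluded, while a page containing only "Invoice 123" is kept as a candidate.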

Document sample upload

A document sample is required by the RecoForm in order to define recognition rules by marking up the document's layout.

Hint

For better results during document processing, it is recommended to use a document sample of the best available quality for the RecoForm.

General recommendation for the document sample: it should be scanned at a resolution of at least 300 DPI, be properly aligned, contain no artifacts, and contain only the target page(s).

1) Once a new RecoForm is created, you need to provide a sample of the document that the RecoForm will process and extract data from.

  • Press the +Choose button to upload a document sample;
  • Press the Remove sample button to remove the attached sample.

2) The uploaded document sample is displayed on the right side of the Recognition form page.

Recognition form layout settings

The Recognition form layout settings area is used to define the document layout and tells the elDoc IDP system which data has to be extracted from the document. To begin, press the +Add Field button, which adds a new field to the layout settings area.

Pressing the +Add Field button also adds two rectangular markers to the Document sample preview area. These should be placed over the field value region (pink filled) and the field anchor (green filled).

Field name – defines a user-friendly field name that describes the value to be extracted from the document (e.g. InvoiceNumber, IssueDate, TotalAmount, etc.).

Tags - defines field tag(s) for programmatic access to the retrieved values via the API. If the Conversion phase is enabled for this RecoForm (Target document is set), the tag(s) should match the respective tag(s) on the Document form defined via the "Document form -> Form builder" page.

Confidence threshold - defines the desired minimum confidence level for the field (as a percentage from 0 to 100). For critical fields it is recommended to set a value above 85-90. For optional fields this value can be set to 0.

Note

The confidence threshold is one of the most important measures used by the elDoc IDP system to decide on the document quality and the further processing route for the document. It works as follows: during the IDP phase, the confidence level of the data retrieved from the document for this field is compared with the defined Confidence threshold. If the confidence level is below the threshold, the elDoc IDP system routes the document to the validation stage.
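The comparison described in this note can be sketched in Python as follows. The function names, the dictionary layout, and the "continue" route label are hypothetical. The page only states that a document is routed to the validation stage when a field's confidence is below its threshold.

```python
def fields_below_threshold(confidences, thresholds):
    """Return the names of fields whose confidence is below their threshold.

    Both arguments map a field name to a percentage from 0 to 100.
    A field without a configured threshold is treated as threshold 0.
    """
    return [
        name
        for name, confidence in confidences.items()
        if confidence < thresholds.get(name, 0)
    ]


def route_document(confidences, thresholds):
    """Return "validation" if any field is below its threshold."""
    if fields_below_threshold(confidences, thresholds):
        return "validation"
    return "continue"  # hypothetical label for the non-validation route
```

For instance, with thresholds of 90 for InvoiceNumber and 85 for TotalAmount, a document recognized with confidences of 92 and 78 is routed to validation because TotalAmount falls below its threshold.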

Field type – defines the type of the field. By default it is set to "TEXT".

  • TEXT - regular text field.
  • OMR - stands for Optical Mark Recognition and defines fields with optical marks in the form of check-boxes and circles.
  • TABLE - defines a table field and is used for locating and capturing data from tables.

Anchor text – defines the field's anchor and should contain the value (text) exactly as it appears on the Document sample.

Note

An anchor is any static label or image that can be found in every document copy of this type and is used for locating the value region of the defined field.

The Anchor text field supports both plain text values and regex-based values. A regex-based value has to start with the "regex:" prefix (e.g.: the value "regex:INV(O|0)ICE" will serve as anchor text for both variants of the label: INVOICE and INV0ICE).
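The difference between plain and regex-based anchor values can be illustrated with a short Python sketch. The anchor_matches helper is hypothetical, and comparing plain values by substring is an assumption, since the page does not describe how elDoc IDP matches anchors internally.

```python
import re

REGEX_PREFIX = "regex:"


def anchor_matches(anchor_text, label):
    """Check whether a label found on the document satisfies the anchor text."""
    if anchor_text.startswith(REGEX_PREFIX):
        # Everything after the prefix is treated as a regular expression.
        pattern = anchor_text[len(REGEX_PREFIX):]
        return re.search(pattern, label) is not None
    # Assumption: a plain value matches when it occurs in the label as-is.
    return anchor_text in label
```

Here the anchor "regex:INV(O|0)ICE" accepts both "INVOICE" and the misread "INV0ICE", whereas the plain anchor "INVOICE" accepts only the first.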

Apply regex – defines whether to apply a regular expression to extract part of the field value (e.g. only numbers without symbols, etc.). The regular expression is entered into the input field next to the checkbox. A named capturing group with the name "value" should be used, e.g.: "(?<value>X)", where X is the target value to be stored as the field's value (see more in the "Regular-expression tester" part). The following regex types are available:

  • Standard - uses standard regex syntax rules
  • Bitap - uses embedded Bitap syntax rules (available only for Regex based RecoForm types)

Anchor region - displays the anchor's coordinates (not editable).

Anchor region extension (in percent) - defines the percentage by which the area for locating the anchor is extended around the position defined on the document sample. Values are given in the following order from left to right: Top, Right, Bottom, Left. The default values are: Top (180%), Right (25%), Bottom (180%), Left (25%).
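One possible reading of these percentages is sketched below in Python. This is an assumption, not documented elDoc IDP behavior: the sketch takes each percentage relative to the anchor's own height (Top, Bottom) or width (Left, Right), with coordinates measured from the top-left corner. The Region class and extend_region function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Region:
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def extend_region(anchor, top=180, right=25, bottom=180, left=25):
    """Grow an anchor region by the given percentages (defaults from the page).

    Assumption: top/bottom are percentages of the anchor height and
    left/right are percentages of the anchor width.
    """
    grow_left = anchor.width * left / 100
    grow_right = anchor.width * right / 100
    grow_top = anchor.height * top / 100
    grow_bottom = anchor.height * bottom / 100
    return Region(
        x=anchor.x - grow_left,
        y=anchor.y - grow_top,
        width=anchor.width + grow_left + grow_right,
        height=anchor.height + grow_top + grow_bottom,
    )
```

Under this reading, an anchor of 80 x 20 units placed at (100, 200) would be searched for within a region of 120 x 92 units starting at (80, 164) when the default values are used.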

Regular-expression tester

This button opens the regular expression (regex) tester area.

You can use the regex tester area to test regular expression syntax before adding it to the field.

To clean and retrieve field data using a regular expression, a named capturing group with the name "value" should be used, e.g.: "(?<value>X)", where X is the target value to be stored as the field's value.
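The effect of the "value" group can also be tried outside the tester. The Python sketch below is an illustration only. Python's re module writes named groups as "(?P<value>X)" rather than the "(?<value>X)" form shown above, so patterns have to be adapted when moving between the two. The extract_value helper and the sample strings are hypothetical.

```python
import re


def extract_value(pattern, raw_text):
    """Return the text captured by the named group "value", or None."""
    match = re.search(pattern, raw_text)
    if match is None:
        return None
    return match.group("value")


# Keep only the digits that follow the "INV-" prefix.
invoice_number = extract_value(r"INV-(?P<value>\d+)", "Invoice No: INV-000123")

# Keep only the amount, dropping the label and the currency code.
total = extract_value(r"(?P<value>\d+\.\d{2})", "Total: USD 1250.00")
```

Only the part of the match inside the "value" group is kept, so surrounding labels, prefixes, and symbols can be part of the pattern without ending up in the field value.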

Last modified: April 28, 2023