get_prep_info_ocr_class_svm (Operator)
Name
get_prep_info_ocr_class_svm
— Compute the information content of the preprocessed feature vectors
of an SVM-based OCR classifier.
Signature
void GetPrepInfoOcrClassSvm(const HTuple& OCRHandle, const HTuple& TrainingFile, const HTuple& Preprocessing, HTuple* InformationCont, HTuple* CumInformationCont)
HTuple HOCRSvm::GetPrepInfoOcrClassSvm(const HTuple& TrainingFile, const HString& Preprocessing, HTuple* CumInformationCont) const
HTuple HOCRSvm::GetPrepInfoOcrClassSvm(const HString& TrainingFile, const HString& Preprocessing, HTuple* CumInformationCont) const
HTuple HOCRSvm::GetPrepInfoOcrClassSvm(const char* TrainingFile, const char* Preprocessing, HTuple* CumInformationCont) const
HTuple HOCRSvm::GetPrepInfoOcrClassSvm(const wchar_t* TrainingFile, const wchar_t* Preprocessing, HTuple* CumInformationCont) const (Windows only)
Description
get_prep_info_ocr_class_svm computes the information content of the
training vectors that have been transformed with the preprocessing
given by Preprocessing. Preprocessing can be set to
'principal_components' or 'canonical_variates'. The OCR classifier
OCRHandle must have been created with create_ocr_class_svm. The
preprocessing methods are described with create_class_svm. The
information content is derived from the variations of the transformed
components of the feature vector, i.e., it is computed solely based on
the training data, independent of any error rate on the training data.
The information content is computed for all relevant components of the
transformed feature vectors (NumFeatures for 'principal_components'
and min(NumClasses - 1, NumFeatures) for 'canonical_variates', see
create_class_svm), and is returned in InformationCont as a number
between 0 and 1. To convert the information content into a percentage,
it simply needs to be multiplied by 100. The cumulative information
content of the first n components is returned in the n-th component of
CumInformationCont, i.e., CumInformationCont contains the sums of the
first n elements of InformationCont. To use
get_prep_info_ocr_class_svm, a sufficient number of samples must be
stored in the training files given by TrainingFile (see
write_ocr_trainf).
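
The relation between the two output tuples can be illustrated with a
small plain-Python sketch. It does not call HALCON; the helper names
cumulative_information and to_percent and the sample values are
made up for illustration only.

```python
from itertools import accumulate


def cumulative_information(information_cont):
    """Running sums: entry n (counting from 1) is the sum of the first n inputs."""
    return list(accumulate(information_cont))


def to_percent(values):
    """Convert fractions between 0 and 1 into percentages."""
    return [100.0 * v for v in values]


# Hypothetical per-component information content (not real operator output).
information_cont = [0.5, 0.25, 0.125, 0.125]

cum_information_cont = cumulative_information(information_cont)
print(cum_information_cont)
print(to_percent(information_cont))
```

With real data, the two lists would come from InformationCont and
CumInformationCont, so the cumulative sums would not need to be
recomputed by hand.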
InformationCont and CumInformationCont can be used to decide how many
components of the transformed feature vectors contain relevant
information. An often used criterion is to require that the
transformed data must represent x% (e.g., 90%) of the total data. This
can be decided easily from the first value of CumInformationCont that
lies above x% (i.e., above x/100, since the values lie between 0 and
1). The position n of that value (counting from 1) can be used as the
value for NumComponents in a new call to create_ocr_class_svm. The
call to get_prep_info_ocr_class_svm already requires the creation of a
classifier, and hence the setting of NumComponents in
create_ocr_class_svm to an initial value. However, when
get_prep_info_ocr_class_svm is called, it is typically not known how
many components are relevant, and hence how to set NumComponents in
this call. Therefore, the following two-step approach should typically
be used to select NumComponents: In a first step, a classifier with
the maximum number for NumComponents is created (NumFeatures for
'principal_components' and min(NumClasses - 1, NumFeatures) for
'canonical_variates'). Then, the training samples are saved in a
training file using write_ocr_trainf. Subsequently,
get_prep_info_ocr_class_svm is used to determine the information
content of the components, and with this NumComponents. After this, a
new classifier with the desired number of components is created, and
the classifier is trained with trainf_ocr_class_svm.
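
The selection logic of the two-step approach can be sketched in plain
Python as follows. This is only an illustration under assumptions: the
helper names max_num_components and select_num_components are
hypothetical, no HALCON call is made, and the fallback to the full
number of components when no value exceeds the threshold is a choice
made for this sketch, not documented operator behavior.

```python
def max_num_components(preprocessing, num_features, num_classes):
    """Largest meaningful NumComponents for the first (exploratory) classifier."""
    if preprocessing == 'principal_components':
        return num_features
    if preprocessing == 'canonical_variates':
        return min(num_classes - 1, num_features)
    raise ValueError('unsupported preprocessing: %r' % (preprocessing,))


def select_num_components(cum_information_cont, fraction=0.9):
    """Position (counting from 1) of the first cumulative value above fraction."""
    for n, cum in enumerate(cum_information_cont, start=1):
        if cum > fraction:
            return n
    # Assumption of this sketch: keep all components if nothing exceeds fraction.
    return len(cum_information_cont)


# Hypothetical cumulative values, standing in for CumInformationCont.
cum_information_cont = [0.5, 0.75, 0.875, 1.0]
print(select_num_components(cum_information_cont, fraction=0.8))
print(max_num_components('canonical_variates', num_features=81, num_classes=36))
```

The value returned by select_num_components would then be passed as
NumComponents when the final classifier is created.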
Execution Information
- Multithreading type: reentrant (runs in parallel with non-exclusive operators).
- Multithreading scope: global (may be called from any thread).
- Processed without parallelization.
Parameters
OCRHandle (input_control) ocr_svm → HOCRSvm, HTuple (handle)
Handle of the OCR classifier.
TrainingFile (input_control) filename.read(-array) → HTuple (string)
Names of the training files.
Default value: 'ocr.trf'
File extension: .trf, .otr
Preprocessing (input_control) string → HTuple (string)
Type of preprocessing used to transform the feature vectors.
Default value: 'principal_components'
List of values: 'canonical_variates', 'principal_components'
InformationCont (output_control) real-array → HTuple (real)
Relative information content of the transformed feature vectors.
CumInformationCont (output_control) real-array → HTuple (real)
Cumulative information content of the transformed feature vectors.
Example (HDevelop)
* Create the initial OCR classifier.
read_ocr_trainf_names ('ocr.trf', CharacterNames, CharacterCount)
create_ocr_class_svm (8, 10, 'constant', 'default', CharacterNames, \
                      'rbf', 0.01, 0.01, 'one-versus-one', \
                      'principal_components', 81, OCRHandle)
* Get the information content of the transformed feature vectors.
get_prep_info_ocr_class_svm (OCRHandle, 'ocr.trf', 'principal_components', \
                             InformationCont, CumInformationCont)
* Determine the number of transformed components.
* NumComp = [...]
* Create the final OCR classifier.
create_ocr_class_svm (8, 10, 'constant', 'default', CharacterNames, \
                      'rbf', 0.01, 0.01, 'one-versus-one', \
                      'principal_components', NumComp, OCRHandle)
* Train the final classifier.
trainf_ocr_class_svm (OCRHandle, 'ocr.trf', 0.001, 'default')
write_ocr_class_svm (OCRHandle, 'ocr.osc')
Result
If the parameters are valid, the operator get_prep_info_ocr_class_svm
returns the value 2 (H_MSG_TRUE). If necessary, an exception is
raised.
get_prep_info_ocr_class_svm may return the error 9211 (Matrix is not
positive definite) if Preprocessing = 'canonical_variates' is used.
This typically indicates that not enough training samples have been
stored for each class.
Possible Predecessors
create_ocr_class_svm, write_ocr_trainf, append_ocr_trainf, write_ocr_trainf_image
Possible Successors
clear_ocr_class_svm, create_ocr_class_svm
Module
OCR/OCV