DynaPDF Manual - Page 612


Function Reference
TSetTextScale
TSetWordSpacing
TShowTextArrayA or   // Preferred for text search algorithms
TShowTextArrayW      // Preferred for text extraction algorithms
Which text callback function to use depends on the requirements of the algorithm. If the
coordinates of arbitrary substrings must be computed, the TShowTextArrayA callback function
in combination with TranslateRawCode() is recommended. If substring processing is not
required, TShowTextArrayW should be used.
Patterns, shadings, images, and vector graphics can be ignored when extracting text, so the
related callback functions do not need to be set.
The graphics state should contain these variables:
Parameter      Type                        Initial Value
CharSpacing    float                       0.0f
FillColor      UI32 (see comment below)    Black (optional)
Font           IFont*                      NULL
FontSize       double                      0.0
FontType       TFontType                   -
Leading        float                       0.0f (optional)
Matrix         TCTM                        {1, 0, 0, 1, 0, 0}
SpaceWidth     float                       0.0f
StrokeColor    UI32 (see comment below)    Black (optional)
TextDrawMode   TDrawMode                   dmNormal
TextScale      float                       100.0f
WordSpacing    float                       0.0f
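The table above can be sketched as a plain C++ structure with default member initializers. This is only an illustration: the names FontHandle, Ctm, FontKind, DrawMode, fkUnset, and TextGState are hypothetical stand-ins so that the sketch compiles without the DynaPDF header. In real code the types UI32, IFont, TCTM, TFontType, and TDrawMode come from that header. Storing black as 0 assumes a packed DeviceGray or DeviceRGB value.

```cpp
#include <cstdint>

// Stand-ins so the sketch compiles on its own (assumptions, not DynaPDF code).
struct FontHandle;                          // stands in for IFont
struct Ctm { double a, b, c, d, x, y; };    // stands in for TCTM
enum FontKind { fkUnset };                  // stands in for TFontType; the table
                                            // gives no initial value ("-")
enum DrawMode { dmNormal };                 // stands in for TDrawMode

// Graphics state for text extraction, initialized as listed in the table.
struct TextGState {
    float             CharSpacing  = 0.0f;
    std::uint32_t     FillColor    = 0;        // black (optional member)
    const FontHandle* Font         = nullptr;
    double            FontSize     = 0.0;
    FontKind          FontType     = fkUnset;  // placeholder until a font is set
    float             Leading      = 0.0f;     // optional member
    Ctm               Matrix       = {1, 0, 0, 1, 0, 0};  // identity matrix
    float             SpaceWidth   = 0.0f;
    std::uint32_t     StrokeColor  = 0;        // black (optional member)
    DrawMode          TextDrawMode = dmNormal;
    float             TextScale    = 100.0f;
    float             WordSpacing  = 0.0f;
};
```

A default-constructed TextGState then holds exactly the initial values from the table.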
The text color is normally not required when extracting text. If it must be considered, the current
fill and stroke colors must be available in the graphics state and the corresponding callback
functions must be set. Whether the current fill or stroke color is used as the text color depends on
the text draw mode (see also SetTextDrawMode()).
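The dependency on the text draw mode can be sketched as follows. The enumeration TextMode is a hypothetical stand-in for TDrawMode; its eight values simply follow the PDF text rendering modes 0 through 7. ColorSource, TextColorSource, and PickTextColor are illustrative names as well. Treating the fill color as the text color when a glyph is both filled and stroked is a design choice of this sketch, not something the manual prescribes.

```cpp
#include <cstdint>

// Hypothetical stand-in for TDrawMode, ordered like PDF rendering modes 0..7.
enum class TextMode {
    Fill, Stroke, FillStroke, Invisible,
    FillClip, StrokeClip, FillStrokeClip, Clip
};

enum class ColorSource { None, Fill, Stroke };

// Decides which color of the graphics state paints the glyphs.
constexpr ColorSource TextColorSource(TextMode mode) {
    switch (mode) {
        case TextMode::Fill:
        case TextMode::FillStroke:      // assumption: the interior dominates
        case TextMode::FillClip:
        case TextMode::FillStrokeClip:
            return ColorSource::Fill;
        case TextMode::Stroke:
        case TextMode::StrokeClip:
            return ColorSource::Stroke;
        case TextMode::Invisible:       // nothing is painted
        case TextMode::Clip:            // glyphs only extend the clipping path
            return ColorSource::None;
    }
    return ColorSource::None;
}

// Returns the visible text color, or `fallback` if the text paints nothing.
constexpr std::uint32_t PickTextColor(TextMode mode, std::uint32_t fill,
                                      std::uint32_t stroke,
                                      std::uint32_t fallback) {
    return TextColorSource(mode) == ColorSource::Fill   ? fill
         : TextColorSource(mode) == ColorSource::Stroke ? stroke
                                                        : fallback;
}
```

An extraction algorithm that wants to skip invisible text can test for ColorSource::None before it records a string.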
Note that colors in PDF are represented by an array of doubles in which each component ranges from
0.0 through 1.0 in the corresponding color space. However, colors are normally processed in a
single device color space such as DeviceGray, DeviceRGB, or DeviceCMYK because applications
that use the content parser normally have no built-in support for PDF color spaces. Colors that
are set by the TSetFillColor and TSetStrokeColor callback functions should be converted to a device
color space with ConvColor(). The resulting device color must then be stored in the graphics state.
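As an illustration of the last step, the sketch below packs device color components in the range 0.0 through 1.0 into a 32-bit value that fits the UI32 members of the graphics state. It starts after the conversion to a device color space; the call to ConvColor() is left out because its signature is not shown on this page. The helper names ToByte, PackRGB, and PackGray are hypothetical, and the byte order (red in the lowest byte) is an assumption that should be checked against the color macros of the library.

```cpp
#include <cstdint>

// Maps a component in 0.0..1.0 to 0..255 with rounding. Out-of-range input
// is clamped, and NaN is treated as 0.
constexpr std::uint32_t ToByte(double v) {
    return !(v > 0.0) ? 0u
         : v >= 1.0   ? 255u
                      : static_cast<std::uint32_t>(v * 255.0 + 0.5);
}

// Packs DeviceRGB components; assumed layout: red in bits 0..7,
// green in bits 8..15, blue in bits 16..23.
constexpr std::uint32_t PackRGB(double r, double g, double b) {
    return ToByte(r) | (ToByte(g) << 8) | (ToByte(b) << 16);
}

// Stores a DeviceGray value as an RGB triple with three equal components.
constexpr std::uint32_t PackGray(double gray) {
    return PackRGB(gray, gray, gray);
}
```

With this layout black is 0 for both DeviceGray and DeviceRGB input. DeviceCMYK would need four bytes and a different value for black, so it is omitted here.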
The TEndTemplate callback function is optional because no specific operation needs to be
executed when a template is left.
Unicode conversion
The extraction of human-readable text requires a conversion to a well-known encoding such as
Unicode because PDF strings are not necessarily human readable.
 
