DynaPDF Manual - Page 505


Text objects use a separate coordinate system, called text space, which is represented by the text
transformation matrix tm. All text properties such as font size, text width, and so on are calculated
in text space. The PDF format also supports several text positioning operators that reduce the size
of a text object. To make the function easier to use, DynaPDF already includes all text positioning
operators in the text transformation matrix tm.
To calculate the text position, the text coordinate system must be transformed to user space by
multiplying the text matrix with the current transformation matrix cm. The combined matrix must be
recalculated each time GetPageText() returns a new text object.
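
The multiplication of tm with cm can be sketched as follows. This is only an illustration in Python, not DynaPDF code: the six-value tuple layout (a, b, c, d, x, y), the helper names multiply() and transform(), and the sample values are assumptions made for this example. The sketch uses the usual PDF row-vector convention.

```python
from typing import Tuple

# Hypothetical matrix layout for this sketch: (a, b, c, d, x, y).
Matrix = Tuple[float, float, float, float, float, float]


def multiply(m1: Matrix, m2: Matrix) -> Matrix:
    """Return m1 x m2 for affine matrices in the PDF row-vector convention."""
    a1, b1, c1, d1, x1, y1 = m1
    a2, b2, c2, d2, x2, y2 = m2
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        x1 * a2 + y1 * c2 + x2,
        x1 * b2 + y1 * d2 + y2,
    )


def transform(m: Matrix, px: float, py: float) -> Tuple[float, float]:
    """Map the point (px, py) through the matrix m."""
    a, b, c, d, x, y = m
    return (a * px + c * py + x, b * px + d * py + y)


# Sample values (assumed): tm moves the text to (10, 20) in text space,
# cm scales user space by a factor of 2.
tm = (1.0, 0.0, 0.0, 1.0, 10.0, 20.0)
cm = (2.0, 0.0, 0.0, 2.0, 0.0, 0.0)

combined = multiply(tm, cm)              # text space -> user space
origin = transform(combined, 0.0, 0.0)   # text origin in user space
```

The order of the factors matters: tm is applied first, then cm. Swapping the arguments yields a different position as soon as cm contains scaling or rotation.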
As mentioned earlier, a content stream is not organized into text lines, and the order in which text
objects occur is essentially arbitrary. A text record can occur in two different formats: as an array
or as one coherent text string. The array form enables the definition of kerning between characters in
a compact format, since PDF viewers ignore any kerning information available in a font resource. The
strings in a kerning array always lie on the same text line.
The kerning array is also often used to emulate space characters because word spacing does not
work with CID fonts. Most PDF drivers use the same algorithm to format text of single-byte and
multi-byte fonts; that is the reason why space characters are very often emulated with kerning space.
However, it is quite easy to determine whether a space character is emulated at a given position: if
the displacement is larger than half the space width, we can assume that a space character was
emulated at this position. Half the space width should be used because the fonts of documents which
emulate space characters with kerning space often contain no space character. In this case DynaPDF
sets a default space width, which can be too large if a condensed font is used.
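
The half space width rule can be sketched as follows. This is a simplified illustration and does not use the DynaPDF data structures: it assumes that each entry of the kerning array is available as a pair (advance, text), where advance is the displacement in front of the string, measured in the same units as the space width and positive for a gap. The function name join_kerned() is hypothetical.

```python
from typing import Iterable, Tuple


def join_kerned(records: Iterable[Tuple[float, str]], space_width: float) -> str:
    """Join the strings of a kerning array and restore emulated spaces.

    records:     assumed pairs of (advance before the string, string)
    space_width: width of the space character, or the default width
                 if the font contains no space character
    """
    half_space = space_width / 2.0
    parts = []
    for advance, text in records:
        # A displacement larger than half the space width is treated
        # as an emulated space character; smaller values are kerning.
        if parts and advance > half_space:
            parts.append(" ")
        parts.append(text)
    return "".join(parts)


# Sample values (assumed): 2.0 exceeds half of 2.5, 0.3 is ordinary kerning.
line = join_kerned([(0.0, "The"), (2.0, "fox"), (0.3, "es")], 2.5)
```

The displacement in front of the first string is ignored here because it only positions the start of the record and does not separate two words within it.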
However, the array form is just one possible format to enable kerning between characters. For
several reasons the array form is sometimes not used. Many PDF drivers update the text position
with text positioning operators instead. This technique not only produces much larger content
streams, it also splits text records into separate ones. This complicates the identification of word
boundaries a lot because each record is returned in a separate GetPageText() call. We now need the
coordinates to determine whether the text must be assigned to the same line. If the text is not
rotated, this is not a big deal, but if the coordinate system is rotated or contains other
transformations, some further math is required to determine whether a text record must be assigned to
the current line.
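
One possible way to do this math is sketched below. It is not the algorithm used by DynaPDF: the sketch assumes that the combined matrix (tm multiplied with cm) of the current line and of the new record is available as a tuple (a, b, c, d, x, y), and it measures the perpendicular distance of the new record's origin from the baseline of the current line. The function name same_line() and the tolerance values are hypothetical.

```python
import math
from typing import Tuple

# Hypothetical matrix layout for this sketch: (a, b, c, d, x, y).
Matrix = Tuple[float, float, float, float, float, float]


def same_line(line_m: Matrix, rec_m: Matrix, tolerance: float) -> bool:
    """Return True if the record starts on the baseline of the current line.

    line_m, rec_m: combined matrices (text space -> user space)
    tolerance:     maximum allowed distance from the baseline in user
                   space units, e.g. a fraction of the font size
    """
    # The baseline direction is the image of the text space x-axis.
    lx, ly = line_m[0], line_m[1]
    rx, ry = rec_m[0], rec_m[1]
    l_len = math.hypot(lx, ly)
    r_len = math.hypot(rx, ry)
    if l_len == 0.0 or r_len == 0.0:
        return False  # degenerate matrix, no usable direction
    lx, ly, rx, ry = lx / l_len, ly / l_len, rx / r_len, ry / r_len

    # Both records must run in the same direction.
    cross = lx * ry - ly * rx
    dot = lx * rx + ly * ry
    if abs(cross) > 1e-3 or dot <= 0.0:
        return False

    # Perpendicular distance of the record origin from the baseline.
    dx = rec_m[4] - line_m[4]
    dy = rec_m[5] - line_m[5]
    return abs(lx * dy - ly * dx) <= tolerance


# Sample values (assumed): text rotated by 90 degrees, running upwards.
current = (0.0, 1.0, -1.0, 0.0, 100.0, 100.0)
follows = (0.0, 1.0, -1.0, 0.0, 100.0, 150.0)   # further along the baseline
other = (0.0, 1.0, -1.0, 0.0, 120.0, 150.0)     # shifted off the baseline
```

With unrotated text the distance reduces to the difference of the y-coordinates, so the same function covers the simple case as well.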
Let us now take a look at a PDF content stream to see how an arbitrary text can be stored in a PDF
file. The following text can be stored in many different ways, and it is important to understand that
many variants are possible and exist in real PDF files.
The rendered result of the string "The fox eats the lazy mouse." looks quite normal:
The fox eats the lazy mouse.
However, a PDF driver does not necessarily store this text in one record; there are many possible
variants:
%This is the easiest variant, one record contains the entire text line.
 
