Monkeybread Software - DynaPDF Manual

DynaPDF Manual - Page 483

Function Reference

replaced in the middle or at the end of a kerning array. To make text replacement

easier, an arbitrary number of kerning records can be preserved from deletion.

The value of DeleteKerningAt is the first array index that should be deleted; all

kerning records above this index are deleted as well. See the demo

examples/edit_text, which is delivered with DynaPDF, for how this member can be used.
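
The truncation rule described above can be illustrated with a short sketch. This is not DynaPDF code: apply_delete_kerning_at is a hypothetical helper, the kerning records are modeled as plain (advance, text) tuples, and a zero-based array index is assumed.

```python
def apply_delete_kerning_at(records, delete_kerning_at):
    """Keep the records below the index; drop the record at the index and all above it.

    Hypothetical helper for illustration only (zero-based index assumed).
    """
    if delete_kerning_at < 0:
        raise ValueError("delete_kerning_at must be non-negative")
    return records[:delete_kerning_at]


# Four illustrative kerning records: (advance, text)
records = [(0.0, "Hel"), (-20.0, "lo"), (0.0, " Wor"), (-15.0, "ld")]

# Preserve the first two records, delete index 2 and everything above it.
kept = apply_delete_kerning_at(records, 2)
print(kept)
```

An index equal to or larger than the array length leaves the array unchanged in this sketch; how DynaPDF treats that case is shown in the edit_text demo, not here.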

FontFlags

The font flags describe important characteristics of the current font:

• 0x00001 // Fixed pitch font

• 0x00002 // Serif style

• 0x00004 // Symbol font

• 0x00008 // Script style

• 0x00020 // Non-symbolic font

• 0x00040 // Italic style

• 0x40000 // Force Bold (Type1 fonts only)
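
Because the flags are individual bits, several of them can be set at once and are tested with a bitwise AND. The following minimal sketch decodes a flags value using only the constants listed above; describe_font_flags is a hypothetical helper name and not part of DynaPDF.

```python
# Bit values copied from the FontFlags list above.
FONT_FLAGS = {
    0x00001: "Fixed pitch font",
    0x00002: "Serif style",
    0x00004: "Symbol font",
    0x00008: "Script style",
    0x00020: "Non-symbolic font",
    0x00040: "Italic style",
    0x40000: "Force Bold (Type1 fonts only)",
}


def describe_font_flags(flags):
    """Return the descriptions of all bits that are set in flags."""
    return [name for bit, name in FONT_FLAGS.items() if flags & bit]


# Example: a non-symbolic serif font (0x00002 | 0x00020).
print(describe_font_flags(0x00022))
```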

External CMaps

A widely used technique to reduce the amount of data that must be stored in a PDF file is the use

of non-embedded CID fonts. CID fonts, whether embedded or not, can depend on external CMaps

that must be available at runtime.

To process strings of such fonts correctly, DynaPDF must be able to load the required CMap files

when necessary. DynaPDF is therefore delivered with the most important CMap files, which are provided

by Adobe Systems. These CMaps can be found in the DynaPDF installation directory at

/Resource/CMap/. Applications that extract text from PDF files should include these CMaps so

that they can be loaded at runtime.

The search path to external CMaps must be set with SetCMapDir() before GetPageText() is executed

for the first time. The function creates a CMap cache that is held in memory until the PDF instance

is deleted. The search path(s) to external CMap files should be set only once per PDF instance,

and one PDF instance should be used to process as many PDF files as possible. This can significantly

improve processing speed.
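
Why reusing one instance helps can be seen in a generic caching sketch. The class below is a hypothetical stand-in and does not reflect how DynaPDF implements its cache: CMapCache, load(), and the disk_reads counter are illustrative names, and the file content is dummy data written to a temporary directory.

```python
import os
import tempfile


class CMapCache:
    """Hypothetical per-instance cache: each CMap file is read from disk once."""

    def __init__(self, search_dirs):
        self.search_dirs = list(search_dirs)
        self._cache = {}
        self.disk_reads = 0  # counts real file reads, for demonstration

    def load(self, name):
        # Serve repeated requests from memory.
        if name in self._cache:
            return self._cache[name]
        # Otherwise search the configured directories in order.
        for directory in self.search_dirs:
            path = os.path.join(directory, name)
            if os.path.isfile(path):
                with open(path, "rb") as handle:
                    data = handle.read()
                self.disk_reads += 1
                self._cache[name] = data
                return data
        raise FileNotFoundError(name)


with tempfile.TemporaryDirectory() as cmap_dir:
    # Dummy file standing in for a real CMap resource.
    with open(os.path.join(cmap_dir, "UniJIS-UCS2-H"), "wb") as handle:
        handle.write(b"dummy cmap data")

    cache = CMapCache([cmap_dir])
    first = cache.load("UniJIS-UCS2-H")   # read from disk
    second = cache.load("UniJIS-UCS2-H")  # served from memory
    reads = cache.disk_reads

print(reads)
```

A cache like this is discarded together with its owner, which is why creating a new instance for every file would repeat the disk reads.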

Order of Text records

GetPageText() returns each time a text-showing operator is found. The returned text is therefore

not necessarily a text line. It can be anything from a single character to a complete text line, depending on

how the text is stored in the PDF file.

The order in which text is returned is essentially arbitrary. It depends on the file creator whether

text is stored in the logical reading order. For example, most PDF drivers convert headers and

footers first. Such strings then appear at the beginning of the content stream. All other strings are

likewise not necessarily ordered, and one text line can be stored in several different text objects.
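
Since one text line can be split across several text objects, a search has to look across record boundaries. The following is a minimal sketch of one possible approach, not the method used by DynaPDF: it assumes the fragments have already been collected as plain strings and brought into reading order, and find_across_fragments is a hypothetical helper name.

```python
def find_across_fragments(fragments, needle):
    """Return (first_fragment, last_fragment) index pairs for every match.

    The fragments are joined into one string, and each character remembers
    which fragment it came from, so a match can be mapped back to the
    text records that contain it.
    """
    if not needle:
        return []
    text = "".join(fragments)
    # owner[i] is the index of the fragment that contributed character i.
    owner = []
    for index, fragment in enumerate(fragments):
        owner.extend([index] * len(fragment))

    matches = []
    start = text.find(needle)
    while start != -1:
        end = start + len(needle) - 1
        matches.append((owner[start], owner[end]))
        start = text.find(needle, start + 1)
    return matches


# One line stored as four separate text records.
fragments = ["Dyna", "P", "DF extracts te", "xt"]
print(find_across_fragments(fragments, "DynaPDF"))
print(find_across_fragments(fragments, "text"))
```

The sketch ignores spacing between records; real content streams often encode word gaps as positioning offsets rather than space characters, so a production search would also need to take record positions into account.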

A text search or text replacement algorithm must correctly handle cases in which a word or sentence

is split across different text objects. In the worst case, GetPageText() always returns only a single