CMap

class CMap

CMap parser for PDF /ToUnicode streams (ISO 32000-1 §9.10.3, Adobe Tech Note 5014).

A ToUnicode CMap is a small PostScript-like program that tells the reader: "given the source character codes that appear in this content stream, here are the Unicode code points to use when extracting or copying text." It uses three section types we care about:

begincodespacerange … endcodespacerange: declares 1-byte vs 2-byte codes

beginbfchar … endbfchar: <src> <utf16BE-bytes> pairs

beginbfrange … endbfrange: either <srcLo> <srcHi> <utf16BE-start>, or <srcLo> <srcHi> [ <utf> <utf> … ]

Source codes are 1-byte (most simple fonts), 2-byte (CIDFonts, Identity-H), or rarely 3/4-byte. Destination codes are always UTF-16BE byte strings, possibly multi-character for ligatures.
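To illustrate the destination side, here is a minimal sketch of turning the hex digits of a UTF-16BE destination string into a Kotlin String. The helper name utf16BeHexToString is hypothetical and not part of kitepdf; it assumes the angle brackets have already been stripped and that the input is a whole number of 2-byte units.

```kotlin
// Hypothetical helper (not kitepdf API): decode the hex digits of a
// UTF-16BE destination such as "00660069" into a Kotlin String.
fun utf16BeHexToString(hex: String): String {
    require(hex.length % 4 == 0) { "UTF-16BE needs whole 2-byte units" }
    val sb = StringBuilder()
    for (i in hex.indices step 4) {
        // Each group of 4 hex digits is one big-endian UTF-16 code unit.
        val unit = hex.substring(i, i + 4).toInt(16)
        sb.append(unit.toChar())
    }
    return sb.toString()
}
```

Because Kotlin strings are UTF-16 internally, a ligature destination such as 00660069 simply becomes the two-character string "fi", and a surrogate pair passes through unchanged.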

We mirror MuPDF's pdf-cmap-parse.c approach: tokenize with our regular PDF Lexer, then walk section keywords (bfchar, bfrange, codespacerange). Anything outside known sections (CIDSystemInfo, /Registry strings, etc.) is silently skipped — robustness over strictness, per the spec recommendation.
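The keyword-driven walk described above can be sketched roughly as follows. This is not kitepdf's parser: a regular expression stands in for the PDF Lexer, source codes are stored as plain Int keys with their byte width ignored, well-formed input is assumed, and the names hexBody, hexToText, and parseToUnicodeSketch are all hypothetical.

```kotlin
// Strip the angle brackets and any whitespace from a hex token like "<00 41>".
fun hexBody(tok: String): String = tok.filter { it.isLetterOrDigit() }

// Decode a hex token as UTF-16BE code units.
fun hexToText(tok: String): String =
    hexBody(tok).chunked(4).map { it.toInt(16).toChar() }.joinToString("")

fun parseToUnicodeSketch(src: String): Map<Int, String> {
    // Tokens: hex strings, array brackets, or bare words. Anything else
    // (for example the << >> of a dictionary) is not matched and is dropped.
    val tokens = Regex("""<[0-9A-Fa-f\s]*>|\[|\]|[^\s<>\[\]]+""")
        .findAll(src).map { it.value }.toList()
    val map = mutableMapOf<Int, String>()
    var i = 0
    while (i < tokens.size) {
        when (tokens[i++]) {
            "beginbfchar" -> while (i + 1 < tokens.size && tokens[i] != "endbfchar") {
                map[hexBody(tokens[i]).toInt(16)] = hexToText(tokens[i + 1])
                i += 2
            }
            "beginbfrange" -> while (i + 2 < tokens.size && tokens[i] != "endbfrange") {
                val lo = hexBody(tokens[i]).toInt(16)
                val hi = hexBody(tokens[i + 1]).toInt(16)
                i += 2
                if (tokens[i] == "[") {
                    // Array form: one destination per consecutive source code.
                    i++
                    var code = lo
                    while (i < tokens.size && tokens[i] != "]") {
                        map[code++] = hexToText(tokens[i++])
                    }
                    i++ // skip the closing "]"
                } else {
                    // Start form: increment the last UTF-16 unit for each code.
                    val start = hexToText(tokens[i++])
                    for (code in lo..hi) {
                        map[code] = start.dropLast(1) + (start.last() + (code - lo))
                    }
                }
            }
            // Every other token is outside a known section and is skipped.
        }
    }
    return map
}
```

Tokens that fall outside the two bf sections, including the codespacerange entries in this simplified version, pass through the outer loop without effect, which is the "skip what you do not recognize" behavior in miniature.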

Types

Companion

object Companion

Properties

codeWidth

val codeWidth: Int

Maximum byte length of one source code (1–4). Inferred from codespacerange.
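One plausible way to infer this value, shown only as a sketch: take the widest bound across all codespacerange entries. The function inferCodeWidth and its input shape (pairs of hex-digit strings without angle brackets) are hypothetical, and the fallback of 1 for an empty list is an assumption rather than documented behavior.

```kotlin
// Hypothetical sketch (not kitepdf API): each pair holds the (lo, hi) hex
// digits of one codespacerange entry, e.g. "0000" to "FFFF".
fun inferCodeWidth(ranges: List<Pair<String, String>>): Int {
    // Two hex digits per byte; round up so an odd digit count still counts.
    val widest = ranges.maxOfOrNull { (lo, hi) ->
        (maxOf(lo.length, hi.length) + 1) / 2
    } ?: 1 // assumption: default to 1-byte codes when nothing is declared
    return widest.coerceIn(1, 4)
}
```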

Functions

decode

fun decode(bytes: ByteArray, offset: Int): Pair<String, Int>?

Decode a single character code starting at offset in bytes; returns the (text, advance-in-bytes) pair, or null if no mapping exists.

decodeAll

fun decodeAll(bytes: ByteArray): String

Decode an entire byte string to a Kotlin String via repeated decode.
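The "repeated decode" loop might look like the sketch below. It is written against a function parameter, decodeOne, that stands in for decode and is assumed to return a Pair<String, Int>? of (text, advance). How the real class handles an unmapped code is not documented here; skipping one byte is purely an assumption of this sketch.

```kotlin
// Sketch only: decodeOne stands in for CMap.decode.
fun decodeAllSketch(
    bytes: ByteArray,
    decodeOne: (ByteArray, Int) -> Pair<String, Int>?,
): String {
    val out = StringBuilder()
    var pos = 0
    while (pos < bytes.size) {
        val hit = decodeOne(bytes, pos)
        if (hit == null) {
            pos += 1 // assumption: skip one byte when no mapping exists
            continue
        }
        out.append(hit.first)
        pos += maxOf(hit.second, 1) // guard against a zero advance
    }
    return out.toString()
}
```

Passing the single-code decoder in as a parameter keeps the loop testable with a toy lookup table, independent of any real CMap instance.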