CMap

class CMap

CMap parser for PDF /ToUnicode streams (ISO 32000-1 §9.10.3, Adobe Tech Note 5014).

A ToUnicode CMap is a small PostScript-like program that tells the reader which Unicode code points to use, when extracting or copying text, for the source character codes that appear in a content stream. It uses three section types we care about:

begincodespacerange … endcodespacerange: declares 1-byte vs 2-byte codes
beginbfchar … endbfchar: <src> <utf16BE-bytes> pairs
beginbfrange … endbfrange: <srcLo> <srcHi> <utf16BE-start>, or <srcLo> <srcHi> [ <utf> <utf> … ]
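The two bfrange forms can be sketched as follows. This is a minimal illustration, not the parser's actual code: expandBfRange and expandBfRangeArray are hypothetical names, and the sketch assumes that consecutive codes in the first form are produced by incrementing the last UTF-16 code unit of the start value.

```kotlin
// Hypothetical helpers, not part of the documented CMap API.

// Form 1: <srcLo> <srcHi> <utf16BE-start>
// Assumption: each successive source code maps to the start text with its
// last UTF-16 code unit incremented by the distance from srcLo.
fun expandBfRange(srcLo: Int, srcHi: Int, dstStart: ByteArray): Map<Int, String> {
    require(srcLo <= srcHi) { "empty range" }
    require(dstStart.size >= 2 && dstStart.size % 2 == 0) { "destination must be UTF-16BE" }
    val base = dstStart.toString(Charsets.UTF_16BE)
    val out = LinkedHashMap<Int, String>()
    for (code in srcLo..srcHi) {
        val last = (base.last().code + (code - srcLo)).toChar()
        out[code] = base.dropLast(1) + last
    }
    return out
}

// Form 2: <srcLo> <srcHi> [ <utf> <utf> … ]
// Each array element is the complete destination for srcLo + index.
fun expandBfRangeArray(srcLo: Int, dsts: List<ByteArray>): Map<Int, String> =
    dsts.withIndex().associate { (i, d) -> (srcLo + i) to d.toString(Charsets.UTF_16BE) }
```

For example, the range <01> <03> <0041> would expand to 1 → "A", 2 → "B", 3 → "C" under these assumptions.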

Source codes are 1-byte (most simple fonts), 2-byte (CIDFonts, Identity-H), or rarely 3/4-byte. Destination codes are always UTF-16BE byte strings, possibly multi-character for ligatures.
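As a rough sketch of what these two encodings mean in practice: a source code can be read as a big-endian integer of the declared length, and a destination is decoded as UTF-16BE. The names sourceCode and destinationText are illustrative assumptions, not members of CMap.

```kotlin
// Hypothetical helpers, not part of the documented CMap API.

// Pack `length` bytes starting at `offset` into one big-endian integer.
// With 4-byte codes the result may be negative, but it is still unique per code.
fun sourceCode(bytes: ByteArray, offset: Int, length: Int): Int {
    require(length in 1..4) { "source codes are 1 to 4 bytes" }
    var code = 0
    for (i in 0 until length) {
        code = (code shl 8) or (bytes[offset + i].toInt() and 0xFF)
    }
    return code
}

// Decode a destination byte string; a ligature yields several characters.
fun destinationText(utf16be: ByteArray): String = utf16be.toString(Charsets.UTF_16BE)
```

Under this sketch, the destination bytes 00 66 00 69 decode to the two-character string "fi".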

We mirror MuPDF's pdf-cmap-parse.c approach: tokenize with our regular PDF Lexer, then walk section keywords (bfchar, bfrange, codespacerange). Anything outside known sections (CIDSystemInfo, /Registry strings, etc.) is silently skipped — robustness over strictness, per the spec recommendation.
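The keyword walk might look roughly like the sketch below, shown for bfchar only. It does not use the real PDF Lexer: it assumes a hypothetical pre-tokenized list in which hex strings appear as "<…>" and keywords as bare words, and parseBfChars and hexBytes are illustrative names.

```kotlin
// Hypothetical sketch; the real parser works on the project's PDF Lexer tokens.

// Parse a "<hex>" token into bytes, or return null if it is not a valid hex string.
fun hexBytes(token: String): ByteArray? {
    if (token.length < 2 || !token.startsWith("<") || !token.endsWith(">")) return null
    val hex = token.substring(1, token.length - 1)
    if (hex.isEmpty() || hex.length % 2 != 0) return null
    if (!hex.all { it in '0'..'9' || it.lowercaseChar() in 'a'..'f' }) return null
    return ByteArray(hex.length / 2) { hex.substring(2 * it, 2 * it + 2).toInt(16).toByte() }
}

fun parseBfChars(tokens: List<String>): Map<Int, String> {
    val out = LinkedHashMap<Int, String>()
    var i = 0
    while (i < tokens.size) {
        // Anything outside a known section is skipped silently.
        if (tokens[i] != "beginbfchar") { i++; continue }
        i++
        while (i < tokens.size && tokens[i] != "endbfchar") {
            val src = hexBytes(tokens[i])
            val dst = tokens.getOrNull(i + 1)?.let(::hexBytes)
            if (src != null && dst != null) {
                val code = src.fold(0) { acc, b -> (acc shl 8) or (b.toInt() and 0xFF) }
                out[code] = dst.toString(Charsets.UTF_16BE)
                i += 2
            } else {
                i++ // malformed entry: skip one token and keep going
            }
        }
    }
    return out
}
```

Skipping a malformed token instead of throwing is one way to favor robustness over strictness; the actual recovery policy is not stated here.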

Types

object Companion

Properties

Maximum byte length of one source code (1–4). Inferred from codespacerange.
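One plausible way to infer this value is to take the longest bound seen in any codespacerange entry. The sketch below is an assumption about the approach, with a hypothetical helper name and a fallback of 1 when no range was declared.

```kotlin
// Hypothetical helper, not part of the documented CMap API.
// `ranges` holds the (lo, hi) byte strings of each codespacerange entry.
fun maxCodeLength(ranges: List<Pair<ByteArray, ByteArray>>): Int =
    (ranges.maxOfOrNull { (lo, hi) -> maxOf(lo.size, hi.size) } ?: 1) // assumed default: 1 byte
        .coerceIn(1, 4)
```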

Functions

fun decode(bytes: ByteArray, offset: Int): Pair<String, Int>?

Decode a single character code starting at offset in bytes; returns the (text, advance-in-bytes) pair, or null if no mapping exists.

Decode an entire byte string to a Kotlin String via repeated decode.
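The repeated-decode loop could be sketched like this. It is not the class's implementation: decodeAll and toyDecoder are hypothetical, the single-code decoder is passed in as a function returning a kotlin.Pair, and emitting U+FFFD and advancing one byte on an unmapped code is an assumption.

```kotlin
// Hypothetical sketch of a whole-string decode built on a single-code decoder.
fun decodeAll(bytes: ByteArray, decodeOne: (ByteArray, Int) -> Pair<String, Int>?): String {
    val sb = StringBuilder()
    var offset = 0
    while (offset < bytes.size) {
        val hit = decodeOne(bytes, offset)
        if (hit == null) {
            // Assumption: no mapping means emit U+FFFD and skip one byte.
            sb.append('\uFFFD')
            offset += 1
        } else {
            sb.append(hit.first)
            offset += maxOf(hit.second, 1) // guard against a zero advance
        }
    }
    return sb.toString()
}

// Toy 1-byte decoder standing in for a real CMap instance.
val toyTable = mapOf(0x01 to "A", 0x02 to "fi")
val toyDecoder: (ByteArray, Int) -> Pair<String, Int>? = { bytes, offset ->
    toyTable[bytes[offset].toInt() and 0xFF]?.let { it to 1 }
}
```

With the toy decoder, the bytes 01 02 7F would produce "A", then "fi", then the replacement character.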