CMap
CMap parser for PDF /ToUnicode streams (ISO 32000-1 §9.10.3, Adobe Tech Note 5014).
A ToUnicode CMap is a small PostScript-ish program that tells the reader "given the source character codes I show in this content stream, here are the Unicode code points to use when extracting/copying text." It uses three section types we care about:
  begincodespacerange … endcodespacerange: declares 1-byte vs 2-byte codes
  beginbfchar … endbfchar: <src> <utf16BE-bytes> pairs
  beginbfrange … endbfrange: <srcLo> <srcHi> <utf16BE-start>
                             or <srcLo> <srcHi> [ <utf> <utf> … ]
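As a rough illustration of the two bfrange forms, here is a minimal Python sketch. The function name `expand_bfrange` and its argument shapes are hypothetical, not part of this parser. For the start-value form it increments the last 16-bit code unit, which is an assumption; some readers increment only the last byte.

```python
def expand_bfrange(src_lo: str, src_hi: str, dst) -> dict[int, str]:
    """Expand one bfrange entry into {source code: Unicode string}.

    src_lo/src_hi are hex digits without angle brackets. dst is either one
    hex string (start value) or a list of hex strings (array form).
    """
    lo, hi = int(src_lo, 16), int(src_hi, 16)
    mapping: dict[int, str] = {}
    if isinstance(dst, list):
        # Array form: the i-th element is the destination for code lo + i.
        for offset, hex_dst in enumerate(dst[: hi - lo + 1]):
            mapping[lo + offset] = bytes.fromhex(hex_dst).decode(
                "utf-16-be", errors="replace"
            )
        return mapping
    # Start form: keep any leading code units, bump the last one per code.
    units = bytes.fromhex(dst)
    prefix, last = units[:-2], int.from_bytes(units[-2:], "big")
    for offset in range(hi - lo + 1):
        unit = (last + offset) & 0xFFFF  # assumption: wrap instead of failing
        raw = prefix + unit.to_bytes(2, "big")
        mapping[lo + offset] = raw.decode("utf-16-be", errors="replace")
    return mapping
```

For example, `expand_bfrange("0041", "0043", "0061")` maps codes 0x41 through 0x43 to "a", "b", "c", while the array form assigns each listed destination to consecutive codes.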
Source codes are 1-byte (most simple fonts), 2-byte (CIDFonts, Identity-H), or rarely 3/4-byte. Destination codes are always UTF-16BE byte strings, possibly multi-character for ligatures.
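To make the code widths and destination encoding concrete, the sketch below splits raw string bytes into source codes using codespace ranges and decodes a destination as UTF-16BE. The names `split_codes` and `decode_destination` are hypothetical, and falling back to a single byte when no range matches is an assumption rather than behavior taken from this parser.

```python
def decode_destination(hex_digits: str) -> str:
    """Decode destination hex digits as UTF-16BE (may yield several chars)."""
    return bytes.fromhex(hex_digits).decode("utf-16-be")


def split_codes(data: bytes, codespaces: list[tuple[bytes, bytes]]) -> list[int]:
    """Split raw string bytes into source codes using codespace ranges.

    Each codespace is (lo, hi) with equal-length byte strings; a candidate
    matches when every byte lies within the corresponding [lo, hi] byte.
    """
    # Try shorter ranges first; ignore empty ranges so we always advance.
    ordered = sorted((r for r in codespaces if r[0]), key=lambda r: len(r[0]))
    codes, pos = [], 0
    while pos < len(data):
        for lo, hi in ordered:
            chunk = data[pos : pos + len(lo)]
            in_range = all(a <= c <= b for a, c, b in zip(lo, chunk, hi))
            if len(chunk) == len(lo) and in_range:
                codes.append(int.from_bytes(chunk, "big"))
                pos += len(lo)
                break
        else:
            # Assumption: no range matched, so consume one byte and move on.
            codes.append(data[pos])
            pos += 1
    return codes
```

With a single `<0000> <FFFF>` codespace, `split_codes(b"\x00\x41\x00\x42", [(b"\x00\x00", b"\xff\xff")])` yields the two codes 0x41 and 0x42, and `decode_destination("00660069")` yields the two-character string "fi".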
We mirror MuPDF's pdf-cmap-parse.c approach: tokenize with our regular PDF Lexer, then walk section keywords (bfchar, bfrange, codespacerange). Anything outside known sections (CIDSystemInfo, /Registry strings, etc.) is silently skipped, favoring robustness over strictness, per the spec recommendation.
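The section walk can be sketched roughly as follows. This is not the actual implementation: a small regex stands in for the PDF Lexer, `walk_sections` and `ARITY` are hypothetical names, and `%` comments and literal strings containing angle brackets are not handled. The point is only to show collecting operands inside known sections and ignoring everything else.

```python
import re

# Stand-in tokenizer (assumption): hex strings, array brackets, bare words.
TOKEN = re.compile(r"<([0-9A-Fa-f\s]*)>|([\[\]])|([^\s<>\[\]]+)")
# Number of operands that make up one entry in each known section.
ARITY = {"codespacerange": 2, "bfchar": 2, "bfrange": 3}


def walk_sections(text: str) -> dict[str, list[list]]:
    """Collect raw hex operands per known section; skip everything else."""
    entries: dict[str, list[list]] = {name: [] for name in ARITY}
    section, operands, array = None, [], None
    for match in TOKEN.finditer(text):
        hex_digits, bracket, word = match.groups()
        if word is not None:
            if word.startswith("begin") and word[5:] in ARITY:
                section, operands, array = word[5:], [], None
            elif section and word == "end" + section:
                section = None
            continue  # counts, names, def, etc. are ignored
        if section is None:
            continue  # token outside a known section: silently skipped
        if bracket == "[":
            array = []
        elif bracket == "]":
            if array is not None:
                operands.append(array)
                array = None
        elif array is not None:
            array.append("".join(hex_digits.split()))
        else:
            operands.append("".join(hex_digits.split()))
        if len(operands) == ARITY[section]:
            entries[section].append(operands)
            operands = []
    return entries
```

The result is a dictionary keyed by section name, where each entry is a list of hex-digit strings (with a nested list for the array form of bfrange). Interpreting those operands is left to later steps.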