Add documentation for CID text extraction flag
JorjMcKie authored and jamie-lemon committed Mar 28, 2024
1 parent bfbeef3 commit d543903
Showing 2 changed files with 25 additions and 19 deletions.
23 changes: 12 additions & 11 deletions docs/app1.rst
@@ -283,24 +283,25 @@ Text Extraction Flags Defaults
`flags = TEXTFLAGS_SEARCH & ~TEXT_DEHYPHENATE`


=================== ==== ==== ===== === ==== ======= ===== ====== ======
Indicator           text html xhtml xml dict rawdict words blocks search
=================== ==== ==== ===== === ==== ======= ===== ====== ======
preserve ligatures  1    1    1     1   1    1       1     1      1
preserve whitespace 1    1    1     1   1    1       1     1      1
preserve images     n/a  1    1     n/a 1    1       n/a   0      0
inhibit spaces      0    0    0     0   0    0       0     0      0
dehyphenate         0    0    0     0   0    0       0     0      1
clip to mediabox    1    1    1     1   1    1       1     1      1
=================== ==== ==== ===== === ==== ======= ===== ====== ======
========================= ==== ==== ===== === ==== ======= ===== ====== ======
Indicator                 text html xhtml xml dict rawdict words blocks search
========================= ==== ==== ===== === ==== ======= ===== ====== ======
preserve ligatures        1    1    1     1   1    1       1     1      1
preserve whitespace       1    1    1     1   1    1       1     1      1
preserve images           n/a  1    1     n/a 1    1       n/a   0      0
inhibit spaces            0    0    0     0   0    0       0     0      0
dehyphenate               0    0    0     0   0    0       0     0      1
clip to mediabox          1    1    1     1   1    1       1     1      1
use CID instead of U+FFFD 1    1    1     1   1    1       1     1      0
========================= ==== ==== ===== === ==== ======= ===== ====== ======

* **search** refers to the text search function.
* **"json"** is handled exactly like **"dict"** and is hence left out.
* **"rawjson"** is handled exactly like **"rawdict"** and is hence left out.
* "n/a" means the bit is 0 by default. Setting it never changes the output, but it does slow down extraction.
* If you do not need images from an output variant that includes them by default, switch the respective bit off: extraction becomes faster and the output needs much less space.
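
Switching the image bit off comes down to one bitwise operation. The sketch below uses plain integers in place of the PyMuPDF constants so that it runs without the library. The values 1, 2, 4, 64 and 128 are assumed to mirror `TEXT_PRESERVE_LIGATURES`, `TEXT_PRESERVE_WHITESPACE`, `TEXT_PRESERVE_IMAGES`, `TEXT_MEDIABOX_CLIP` and `TEXT_CID_FOR_UNKNOWN_UNICODE`, and `without_images` is a hypothetical helper; real code should use the library's named constants.

```python
# Stand-ins for the PyMuPDF constants (assumed values; prefer the named
# constants from the library in real code).
TEXT_PRESERVE_LIGATURES = 1
TEXT_PRESERVE_WHITESPACE = 2
TEXT_PRESERVE_IMAGES = 4
TEXT_MEDIABOX_CLIP = 64
TEXT_CID_FOR_UNKNOWN_UNICODE = 128

# Default combination for the "dict" output, following the table above.
TEXTFLAGS_DICT = (
    TEXT_PRESERVE_LIGATURES
    | TEXT_PRESERVE_WHITESPACE
    | TEXT_PRESERVE_IMAGES
    | TEXT_MEDIABOX_CLIP
    | TEXT_CID_FOR_UNKNOWN_UNICODE
)


def without_images(flags: int) -> int:
    """Return flags with the image bit cleared and all other bits unchanged."""
    return flags & ~TEXT_PRESERVE_IMAGES


flags_no_images = without_images(TEXTFLAGS_DICT)
```

The result would then be passed as the `flags` argument, for example `page.get_text("dict", flags=flags_no_images)`.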

To show the effect of *TEXT_INHIBIT_SPACES* have a look at this example::
To show the effect of `TEXT_INHIBIT_SPACES` have a look at this example::

>>> print(page.get_text("text"))
H a l l o !
21 changes: 13 additions & 8 deletions docs/vars.rst
@@ -199,39 +199,44 @@ For the PyMuPDF programmer, some combination (using Python's `|` operator, or si
64 -- If set, characters entirely outside a page's **mediabox** will be ignored. This is default in PyMuPDF.

.. py:data:: TEXT_CID_FOR_UNKNOWN_UNICODE
128 -- If set, use raw character codes instead of U+FFFD. This is the default for **text extraction** in PyMuPDF. If you **want to detect** when encoding information is missing or uncertain, switch this flag off and scan the resulting text for U+FFFD (= `chr(0xfffd)`) code points.
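
The detection idea described for this flag might look like the following sketch. `find_unknown_chars` is a hypothetical helper, not part of PyMuPDF. It assumes that `page` offers `get_text("text", flags=...)`, and the integer values other than 64 and 128 are assumed stand-ins for the named constants, which real code should import from the library.

```python
REPLACEMENT_CHAR = chr(0xFFFD)

# Assumed stand-ins for the PyMuPDF constants of the same names.
TEXT_PRESERVE_LIGATURES = 1
TEXT_PRESERVE_WHITESPACE = 2
TEXT_MEDIABOX_CLIP = 64
TEXT_CID_FOR_UNKNOWN_UNICODE = 128

# Default combination for plain text extraction.
TEXTFLAGS_TEXT = (
    TEXT_PRESERVE_LIGATURES
    | TEXT_PRESERVE_WHITESPACE
    | TEXT_MEDIABOX_CLIP
    | TEXT_CID_FOR_UNKNOWN_UNICODE
)


def find_unknown_chars(page, flags=TEXTFLAGS_TEXT):
    """Return the positions of U+FFFD in the page text (hypothetical helper)."""
    # Clear the CID bit so that unknown characters come back as U+FFFD
    # rather than as raw character codes.
    text = page.get_text("text", flags=flags & ~TEXT_CID_FOR_UNKNOWN_UNICODE)
    return [i for i, ch in enumerate(text) if ch == REPLACEMENT_CHAR]
```

An empty list suggests that every character on the page could be mapped to Unicode. Any returned index marks a spot where encoding information was missing or uncertain.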


The following constants represent the default combinations of the above for text extraction and searching:

.. py:data:: TEXTFLAGS_TEXT
`TEXT_PRESERVE_LIGATURES | TEXT_PRESERVE_WHITESPACE | TEXT_MEDIABOX_CLIP`
`TEXT_PRESERVE_LIGATURES | TEXT_PRESERVE_WHITESPACE | TEXT_MEDIABOX_CLIP | TEXT_CID_FOR_UNKNOWN_UNICODE`

.. py:data:: TEXTFLAGS_WORDS
`TEXT_PRESERVE_LIGATURES | TEXT_PRESERVE_WHITESPACE | TEXT_MEDIABOX_CLIP`
`TEXT_PRESERVE_LIGATURES | TEXT_PRESERVE_WHITESPACE | TEXT_MEDIABOX_CLIP | TEXT_CID_FOR_UNKNOWN_UNICODE`

.. py:data:: TEXTFLAGS_BLOCKS
`TEXT_PRESERVE_LIGATURES | TEXT_PRESERVE_WHITESPACE | TEXT_MEDIABOX_CLIP`
`TEXT_PRESERVE_LIGATURES | TEXT_PRESERVE_WHITESPACE | TEXT_MEDIABOX_CLIP | TEXT_CID_FOR_UNKNOWN_UNICODE`

.. py:data:: TEXTFLAGS_DICT
`TEXT_PRESERVE_LIGATURES | TEXT_PRESERVE_WHITESPACE | TEXT_MEDIABOX_CLIP | TEXT_PRESERVE_IMAGES`
`TEXT_PRESERVE_LIGATURES | TEXT_PRESERVE_WHITESPACE | TEXT_MEDIABOX_CLIP | TEXT_PRESERVE_IMAGES | TEXT_CID_FOR_UNKNOWN_UNICODE`

.. py:data:: TEXTFLAGS_RAWDICT
`TEXT_PRESERVE_LIGATURES | TEXT_PRESERVE_WHITESPACE | TEXT_MEDIABOX_CLIP | TEXT_PRESERVE_IMAGES`
`TEXT_PRESERVE_LIGATURES | TEXT_PRESERVE_WHITESPACE | TEXT_MEDIABOX_CLIP | TEXT_PRESERVE_IMAGES | TEXT_CID_FOR_UNKNOWN_UNICODE`

.. py:data:: TEXTFLAGS_HTML
`TEXT_PRESERVE_LIGATURES | TEXT_PRESERVE_WHITESPACE | TEXT_MEDIABOX_CLIP | TEXT_PRESERVE_IMAGES`
`TEXT_PRESERVE_LIGATURES | TEXT_PRESERVE_WHITESPACE | TEXT_MEDIABOX_CLIP | TEXT_PRESERVE_IMAGES | TEXT_CID_FOR_UNKNOWN_UNICODE`

.. py:data:: TEXTFLAGS_XHTML
`TEXT_PRESERVE_LIGATURES | TEXT_PRESERVE_WHITESPACE | TEXT_MEDIABOX_CLIP | TEXT_PRESERVE_IMAGES`
`TEXT_PRESERVE_LIGATURES | TEXT_PRESERVE_WHITESPACE | TEXT_MEDIABOX_CLIP | TEXT_PRESERVE_IMAGES | TEXT_CID_FOR_UNKNOWN_UNICODE`

.. py:data:: TEXTFLAGS_XML
`TEXT_PRESERVE_LIGATURES | TEXT_PRESERVE_WHITESPACE | TEXT_MEDIABOX_CLIP`
`TEXT_PRESERVE_LIGATURES | TEXT_PRESERVE_WHITESPACE | TEXT_MEDIABOX_CLIP | TEXT_CID_FOR_UNKNOWN_UNICODE`

.. py:data:: TEXTFLAGS_SEARCH