ourongxing/web-printer

Web Printer

A printer that can print multiple web pages as one pretty PDF
with outlines, without distractions
and learn in depth


Warning

Respect the copyright please! Do not share non-public content on the Internet, especially paid content!

Features

Playwright is used to print PDFs, similar to printing in Chrome, but with the added ability to print multiple web pages into one seamless PDF automatically.

  • Fully customizable as it is a Node.js library.
  • Universal compatibility with any website through plugins.
  • Unique feature to replace internal website links with internal PDF links, supporting hash positioning.
  • Automatically generates PDF outlines, with support for different levels and collapsed statuses.
  • Easy to remove distracting elements, leaving only pure knowledge.

Installation

Warning

Web Printer is a Node.js library, not an application. If you're new to Node.js/TypeScript/JavaScript, Web Printer might be challenging to use. An app is currently being developed for general use. Please follow @pbkapp for updates.

If you're not a beginner, feel free to proceed as you would with any npm package installation.

pnpm i playwright @web-printer/core
# Web Printer uses Chrome by default. Other supported browsers are listed in PrinterOption.channel.
# If Chrome is already installed, you can skip this step.
pnpm exec playwright install chrome
# Install the plugins you need
pnpm i @web-printer/vitepress

Then create a .ts file with the following content:

import { Printer } from "@web-printer/core"
// Import the plugin you have installed
import vitepress from "@web-printer/vitepress"

// Opens a browser so you can log in, if the site requires it.
// new Printer().login(url)

new Printer()
  .use(
    vitepress({
      url: {
        Guide: "https://vuejs.org/guide/introduction.html",
        API: "https://vuejs.org/api/application.html"
      }
    })
  )
  .print("Vue 3.2 Documentation")

Run it with tsx. Running it in other ways may throw errors, which have not been fixed yet.

If you are new to this, the following steps may be easier.

First, install pnpm (with Node.js) and VS Code (with TypeScript support).

pnpm create printer@latest

# or complete in one step. https://github.com/ourongxing/web-printer/tree/main/packages/create-printer
pnpm create printer@latest web-printer -p vitepress -c chrome

Then follow the prompts. After customizing, run pnpm print to print. A pretty PDF will appear in ./output.

Options

@web-printer/core provides a Printer object, some types, and some utilities.

import { Printer } from "@web-printer/core"
import type { Plugin, PrinterOption, PrinterPrintOption } from "@web-printer/core"

// Opens a browser so you can log in, if the site requires it.
// new Printer().login(url)

new Printer({} as PrinterOption)
  .use({} as Plugin)
  .print("PDF name", {} as PrinterPrintOption)

PrinterOption extends Playwright browserType.launchPersistentContext options.

{
  /**
   * Chromium distribution channel. Choose one that you have installed.
   * @default "chrome"
   * */
  channel?: "chromium" | "chrome" | "chrome-beta" | "chrome-dev" | "chrome-canary" | "msedge" | "msedge-beta" | "msedge-dev" | "msedge-canary"
  /**
   * Chrome user data directory. Using your system's Chrome user data directory is not recommended.
   * @default "./userData"
   */
  userDataDir?: string
  /**
   * Output directory for the generated PDFs.
   * @default "./output"
   */
  outputDir?: string
  /**
   * Number of threads used for printing. More threads speed up printing.
   * @default 1
   */
  threads?: number
}
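
As a rough illustration of how the documented defaults combine with your own settings, the sketch below mirrors only the four fields listed above in a local type. PrinterOptionSketch and withPrinterDefaults are hypothetical names that are not part of @web-printer/core, and the library may resolve its defaults differently.

```typescript
// Local stand-in for the four Web Printer specific fields documented above.
// The real PrinterOption also accepts Playwright launchPersistentContext options.
type Channel =
  | "chromium" | "chrome" | "chrome-beta" | "chrome-dev" | "chrome-canary"
  | "msedge" | "msedge-beta" | "msedge-dev" | "msedge-canary"

interface PrinterOptionSketch {
  channel?: Channel
  userDataDir?: string
  outputDir?: string
  threads?: number
}

// Hypothetical helper: fills in the defaults stated in the doc comments.
function withPrinterDefaults(option: PrinterOptionSketch = {}): Required<PrinterOptionSketch> {
  return {
    channel: option.channel ?? "chrome",
    userDataDir: option.userDataDir ?? "./userData",
    outputDir: option.outputDir ?? "./output",
    threads: option.threads ?? 1
  }
}

// Only the fields you set are overridden; the rest keep their defaults.
const resolved = withPrinterDefaults({ channel: "msedge", threads: 4 })
console.log(resolved)
```

In real code you would pass the partial object straight to new Printer(...), as in the snippet above.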

PrinterPrintOption extends Playwright page.pdf() options.

{
  /**
   * Used for the outline. If given, Printer fetches subtitles and adds them to the outline.
   * @default 0, which means subtitles are not added to the outline.
   */
  subTitleOutline?: number
  /**
   * Make a test print: only two pages are printed, and "test: " is added to the name.
   * @default false
   */
  test?: boolean
  /**
   * Filter the pages you want
   */
  filter?: PageFilter
  /**
   * Reverse the printing order.
   * If the outline has different levels, outline may be confused.
   */
  reverse?: boolean
  /**
   * A local cover pdf path.
   * You may be able to use it to merge an existing PDF, but outlines cannot be merged.
   */
  coverPath?: string
  /**
   * Inject additional CSS.
   */
  style?: string | (false | undefined | string)[]
  /**
   * Set the top and bottom margins of all pages except the first page of each article to zero.
   * @default false
   */
  continuous?: boolean
  /**
   * Replace website links with PDF links.
   * @default false
   */
  replaceLink?: boolean
  /**
   * Add page numbers to the bottom center of the page.
   * @default false
   * @requires PrinterPrintOption.continuous = false
   */
  addPageNumber?: boolean
  /**
   * Margins of each page
   * @default
   * {
   *    top: 60,
   *    right: 55,
   *    bottom: 60,
   *    left: 55,
   * }
   */
  margin?: {
    /**
     * @default 60
     */
    top?: string | number
    /**
     * @default 55
     */
    right?: string | number

    /**
     * @default 60
     */
    bottom?: string | number
    /**
     * @default 55
     */
    left?: string | number
  }
  /**
   * Paper format. If set, takes priority over `width` or `height` options.
   * @default "A4"
   */
  format?: "A0" | "A1" | "A2" | "A3" | "A4" | "A5" | "Legal" | "Letter" | "Tabloid"
}
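
The style option accepts an array that may contain false or undefined entries, which is convenient for conditional CSS snippets, and addPageNumber is documented as requiring continuous to be false. The sketch below shows one way these rules could be applied. PrintOptionSketch, flattenStyle and resolvePrintOption are hypothetical names used for illustration only, and the way the library actually merges styles is an assumption.

```typescript
// Local stand-in for a few PrinterPrintOption fields documented above.
interface PrintOptionSketch {
  style?: string | (false | undefined | string)[]
  continuous?: boolean
  addPageNumber?: boolean
  margin?: {
    top?: string | number
    right?: string | number
    bottom?: string | number
    left?: string | number
  }
}

// Hypothetical helper: joins the usable entries of `style` into one CSS string.
function flattenStyle(style: PrintOptionSketch["style"]): string {
  if (style === undefined) return ""
  const parts: (false | undefined | string)[] = Array.isArray(style) ? style : [style]
  return parts
    .filter((part): part is string => typeof part === "string" && part.length > 0)
    .join("\n")
}

// Hypothetical helper: applies the documented defaults and checks the documented
// constraint that addPageNumber requires continuous to be false.
function resolvePrintOption(option: PrintOptionSketch = {}) {
  const continuous = option.continuous ?? false
  const addPageNumber = option.addPageNumber ?? false
  if (addPageNumber && continuous) {
    throw new Error("addPageNumber requires continuous to be false")
  }
  return {
    continuous,
    addPageNumber,
    style: flattenStyle(option.style),
    margin: { top: 60, right: 55, bottom: 60, left: 55, ...option.margin }
  }
}

// A falsy condition would simply drop the second snippet.
const hideAds = true
const printOption = resolvePrintOption({
  style: ["body { font-size: 14px; }", hideAds && ".ad { display: none; }"],
  margin: { top: 40 }
})
console.log(printOption.style, printOption.margin)
```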

Plugins

Plugins in Web Printer are only used to adapt to different websites.

A plugin has five methods:

  • fetchPagesInfo: Fetches a list of page URLs and titles, and must return that list.
  • injectStyle: Removes distracting elements and makes web pages more PDF-friendly.
  • onPageLoaded: Runs after the page has loaded.
  • onPageWillPrint: Runs before the page is printed.
  • otherParams: Holds other useful parameters.

Official plugins

How to write a plugin

In fact, a plugin just uses Playwright to inject JS and CSS into the page. You can read the code of the official plugins to learn how to write one. It's pretty simple most of the time.

Let's make some rules

  • Use a function to return a plugin.
  • The function parameter is an options object.
  • If there are many pages to fetch and fetching them is slow, provide a maxPages option, especially for sites with endless loading.
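
The three rules above can be sketched as a small plugin factory. Everything here is a self-contained stand-in: PageInfoSketch, PluginSketch, BlogPluginOptions and blogPlugin are hypothetical names, the URLs are invented, and the real fetchPagesInfo receives a Playwright BrowserContext and may return a promise, whereas this version is synchronous to keep the example short.

```typescript
// Local stand-ins so the sketch runs without @web-printer/core or Playwright.
interface PageInfoSketch {
  url: string
  title: string
}

interface PluginSketch {
  fetchPagesInfo: () => PageInfoSketch[]
  injectStyle: () => { contentSelector?: string; style?: string }
}

// Rule 2: the function parameter is an options object.
// Rule 3: maxPages bounds a large or endless list.
interface BlogPluginOptions {
  url: string
  maxPages?: number
}

// Rule 1: a function returns the plugin.
function blogPlugin(options: BlogPluginOptions): PluginSketch {
  const { url, maxPages = Infinity } = options
  return {
    fetchPagesInfo() {
      const pages: PageInfoSketch[] = []
      // Stand-in for an endlessly loading list with 1000 entries:
      // stop as soon as maxPages entries have been collected.
      for (let n = 1; n <= 1000 && pages.length < maxPages; n++) {
        pages.push({ url: `${url}/post-${n}`, title: `Post ${n}` })
      }
      return pages
    },
    injectStyle() {
      return { contentSelector: "article" }
    }
  }
}

const plugin = blogPlugin({ url: "https://example.com/blog", maxPages: 3 })
console.log(plugin.fetchPagesInfo().map(page => page.url))
```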

fetchPagesInfo

Fetches a list of page URLs and titles, and must return that list. This usually means parsing the sidebar outline. Web Printer can then restore the hierarchy and collapsed state of the original outline perfectly.

type fetchPagesInfo = (params: {context: BrowserContext}) => MaybePromise<PageInfoWithoutIndex[]>
interface PageInfoWithoutIndex {
  url: string
  title: string
  /**
  * Outer ... Inner
  */
  groups?: (
    | {
        name: string
        collapsed?: boolean
      }
    | string
  )[]
  /**
   * Set when this item is a group that also has its own link and content.
   */
  selfGroup?: boolean
  collapsed?: boolean
}

The returned page info should look like this:

// https://javascript.info/
[
  {
    title: "Manuals and specifications",
    url: "https://javascript.info/manuals-specifications",
    groups: [
      {
        name: "The JavaScript language"
      },
      {
        name: "An introduction"
      }
    ]
  },
  ...
]
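
To show how a nested sidebar could be turned into this flat shape, here is a minimal sketch. SidebarNode and flattenSidebar are hypothetical and assume the plugin has already scraped the navigation into a tree. Groups that also carry their own link (the selfGroup case) are left out for brevity.

```typescript
// Shape documented above, reduced to the fields used here.
interface PageInfoWithoutIndex {
  url: string
  title: string
  groups?: ({ name: string; collapsed?: boolean } | string)[]
}

// Hypothetical sidebar tree, as a plugin might scrape it from a site's navigation.
interface SidebarNode {
  title: string
  url?: string
  collapsed?: boolean
  children?: SidebarNode[]
}

type Group = { name: string; collapsed?: boolean }

// Walks the tree depth-first and records the chain of parent groups,
// ordered from outermost to innermost, for every leaf that has a URL.
function flattenSidebar(nodes: SidebarNode[], parents: Group[] = []): PageInfoWithoutIndex[] {
  const pages: PageInfoWithoutIndex[] = []
  for (const node of nodes) {
    if (node.children && node.children.length > 0) {
      const group: Group =
        node.collapsed === undefined
          ? { name: node.title }
          : { name: node.title, collapsed: node.collapsed }
      pages.push(...flattenSidebar(node.children, [...parents, group]))
    } else if (node.url) {
      pages.push({ title: node.title, url: node.url, groups: [...parents] })
    }
  }
  return pages
}

const sidebarTree: SidebarNode[] = [
  {
    title: "The JavaScript language",
    children: [
      {
        title: "An introduction",
        children: [
          {
            title: "Manuals and specifications",
            url: "https://javascript.info/manuals-specifications"
          }
        ]
      }
    ]
  }
]
console.log(JSON.stringify(flattenSidebar(sidebarTree), null, 2))
```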

Examples

injectStyle

Used to remove distracting elements and make web pages more PDF-friendly.

type injectStyle = (params: { url: string; printOption: PrinterPrintOption }) => MaybePromise<{
  style?: string
  contentSelector?: string
  titleSelector?: string
  avoidBreakSelector?: string
}>

Let's make some rules:

  • Hide all elements but content.
  • Set the margin of the content element and its ancestor elements to zero.

That way, everyone can set the same margin for any website.

Don't worry, it's easy. You only need to provide a contentSelector, which may be a selector list. Web Printer then hides every element except the matched content and automatically sets the margin of the content and its ancestor elements to zero.

This does not work on every website. Sometimes you still need to write CSS yourself and return it in the style property.

When you set PrinterPrintOption.continuous to true, Web Printer sets the top and bottom margins of all pages, except the first page of each article, to zero.

The titleSelector marks the title element, which is the only element that receives a top margin. It defaults to contentSelector if contentSelector is not empty. If contentSelector is a selector list (contains a comma), Printer uses the first selector. If titleSelector and contentSelector are both empty, the default is body, but setting a top margin on the body may sometimes result in extra white space.

The avoidBreakSelector is used to avoid page breaks inside certain elements. The default value is pre,blockquote,tbody tr.
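
The fallback order for titleSelector described above can be written down as a small function. This is only a sketch of the documented behaviour: resolveTitleSelector is a hypothetical helper, not an export of @web-printer/core, and the selectors in the usage lines are invented.

```typescript
// Hypothetical helper that mirrors the documented fallback order:
// 1. an explicit titleSelector, 2. the first selector of contentSelector, 3. "body".
// Splitting on "," is a simplification: it would mishandle commas nested
// inside functional pseudo-classes such as :is(a, b).
function resolveTitleSelector(titleSelector?: string, contentSelector?: string): string {
  if (titleSelector && titleSelector.trim()) return titleSelector.trim()
  const [first = ""] = (contentSelector ?? "").split(",")
  return first.trim() || "body"
}

// A plugin's injectStyle could then return just a selector list.
const styleResult = { contentSelector: "main article, main .content" }
console.log(resolveTitleSelector(undefined, styleResult.contentSelector))
console.log(resolveTitleSelector("h1.title", styleResult.contentSelector))
console.log(resolveTitleSelector())
```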

onPageLoaded

Runs after the page has loaded. Usually used to wait for images to load, especially lazy-loaded images.

type onPageLoaded = (params: { page: Page; pageInfo: PageInfo; printOption: PrinterPrintOption }) => MaybePromise<void>

Web Printer provides two methods to handle image loading:

  • type evaluateWaitForImgLoad = (page: Page, imgSelector = "img"): Promise<void>
  • type evaluateWaitForImgLoadLazy = ( page: Page, imgSelector = "img", waitingTime = 200 ): Promise<void>

onPageWillPrint

Runs before the page is printed.

type onPageWillPrint = (params: { page: Page; pageInfo: PageInfo; printOption: PrinterPrintOption }) => MaybePromise<void>

otherParams

Holds other useful parameters.

type otherParams = (params: { page: Page; pageInfo: PageInfo; printOption: PrinterPrintOption }) => MaybePromise<{
  hashIDSelector: string
}>

Some sites, such as Wikipedia, use a hash id to jump to a specific element. If you provide hashIDSelector and PrinterPrintOption.replaceLink is true, Printer can replace the hash in a URL with a position in the PDF. The default value is h2[id],h3[id],h4[id],h5[id].
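
To make the idea of link replacement concrete, the sketch below splits a link into its page URL and hash id and looks the page up among the printed pages. It is an assumption about the general approach, not the library's implementation: PrintedPage, splitHash and resolveInternalLink are hypothetical names and the page numbers are invented. In practice the hash id would then be matched against the elements selected by hashIDSelector.

```typescript
// Hypothetical record of where each printed page starts in the merged PDF.
interface PrintedPage {
  url: string
  firstPdfPage: number
}

// Splits "https://site/page#section" into the page URL and the decoded hash id.
function splitHash(href: string): { base: string; hashId: string } {
  const at = href.indexOf("#")
  if (at === -1) return { base: href, hashId: "" }
  return { base: href.slice(0, at), hashId: decodeURIComponent(href.slice(at + 1)) }
}

// Hypothetical helper: returns the PDF page for links that point at a printed
// page, plus the hash id if there is one, or undefined for external links.
function resolveInternalLink(
  href: string,
  pages: PrintedPage[]
): { firstPdfPage: number; hashId?: string } | undefined {
  const { base, hashId } = splitHash(href)
  for (const printed of pages) {
    if (splitHash(printed.url).base === base) {
      return hashId
        ? { firstPdfPage: printed.firstPdfPage, hashId }
        : { firstPdfPage: printed.firstPdfPage }
    }
  }
  return undefined
}

const printedPages: PrintedPage[] = [
  { url: "https://en.wikipedia.org/wiki/PDF", firstPdfPage: 1 },
  { url: "https://en.wikipedia.org/wiki/PostScript", firstPdfPage: 12 }
]
console.log(resolveInternalLink("https://en.wikipedia.org/wiki/PostScript#History", printedPages))
console.log(resolveInternalLink("https://example.com/elsewhere", printedPages))
```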

Shrink PDF

PDFs generated by Web Printer may need to be shrunk in size manually.

Acknowledgements

License

MIT ©