Danny Lin edited this page Oct 10, 2024 · 22 revisions

Basic usage

Three principal approaches

WebScrapBook provides various options and many ways to use them. Below are three principal approaches:

1. Independent files approach

This mimics the native browser saving functionality and saves each captured page as an independent file, which can then be found and viewed from the file manager. The save path for each captured page can be specified if the browser is configured to ask for the save path of each downloaded file.

  • Set Save captured data to: to File.

  • Set Save captured data as: to the desired saving format. Single HTML is generally the most convenient. Additional configuration is required to open archive files directly from the file manager if set to HTZ package or MAFF package.

  • Optionally set Filename to save: to a desired default filename for the captured page, such as %title%.

2. Directory approach

This approach stores every captured page under a specified directory, where it can then be found and viewed from the file manager. It supports saving the captured page as a folder and can be configured to generate subdirectories. However, due to browser security restrictions, only a subdirectory of <default download folder> can be specified as the target.

  • Set Save captured data to: to Scrapbook folder.

  • Set Save captured data as: to the desired saving format. Folder supports the most features, while Single HTML is more convenient. Additional configuration is required to open archive files directly from the file manager if set to HTZ package or MAFF package.

    For Google Chrome or some Chromium-based browsers with this option set to Folder, it's recommended to uncheck the Ask where to save each file before downloading browser option (located at chrome://settings/downloads), or every file to be saved will trigger a prompt. (See Known issues for details.)

  • Optionally set Scrapbook folder: to put the captured pages under a different directory within <default download folder>.

  • Optionally set Filename to save: to a desired default filename for the captured page, such as %title%, or %create-Y%/%create-m%/%title% if organizing files by time is desired.

3. Browser sidebar approach

This approach requires setting up a backend server. Captured files will be directly saved to the backend server and can be accessed through the sidebar (toolbar button > Open scrapbook) for a browser with WebScrapBook extension installed.

  • Install PyWebScrapBook package.

  • Follow the instruction to set up the backend server.

    For example, to host C:\Users\MyUserName\WebScrapBook as a scrapbook, change the working directory to it in the command prompt and run wsb config -ba to generate config files. Then run C:\Users\MyUserName\WebScrapBook\.wsb\serve.py to start the server (do not close the prompt window unless you intend to shut down the server).

    For more advanced configuration of the backend server, see here.

  • Enter WebScrapBook options and set Address, User, and Password of the Backend server. (Defaults to http://localhost:8080/ with a blank user and password if the backend server is not otherwise configured.)

  • Set Save captured data to: to Backend server.

  • Set Save captured data as: to the desired saving format. Folder is generally the most recommended, while HTZ package or MAFF package can be used to save space and reduce the number of files. This only affects newly captured web pages; existing files are not affected, and data in different saving formats can coexist without problems.

    Although Single HTML is supported, it takes more space, cannot preserve certain complicated information, and does not support some advanced features; it is therefore generally not recommended.

    PyWebScrapBook provides a wsb convert utility that can convert the file format on demand.

  • Optionally set Filename to save: to a desired value, though the default %ID% is recommended to prevent potential compatibility issues.

  • Start the backend server before usage, and then capture a web page or access captured data via the sidebar.

    To start the backend server automatically when the device boots or when the user logs in, register the starting file .wsb/serve.py as a system service or add it to something like the Startup folder on Windows.

    To hide the command prompt of the backend server on Windows, rename .wsb/serve.py to .wsb/serve.pyw. To shut down a backend server with hidden command prompt, use the task manager.

  • Optionally generate a site index on the backend server (toolbar button > Options > Run indexer) for clients without WebScrapBook to browse the captured pages.

Capture web page(s)

  • Open a web page and wait until it's completely loaded and ready for a capture.

    NOTE: A web page may load resources dynamically using scripts, so it may be necessary to wait for a while, scroll down the page, or perform some interactive operations to ensure that the wanted contents and resources are loaded.

  • Click the toolbar button (also called the browser action button) and select capture tab from the dropdown list. A capture dialog will appear and the capture will start. After the capture succeeds, data will be saved using the previously configured method (see the section above for details).

  • If there's a selection in the web page, capture tab will capture only the selected range. Some browsers (such as Firefox) support selecting multiple ranges by holding Ctrl, and all of them will be captured.

  • Capture tab captures the currently shown content on the screen (and the tab cannot be closed before the capture completes). Capture tab (source) captures the original web page HTML content before it is processed by scripts. Capture tab (bookmark) captures the page as a bookmark file (or a bookmark item in a scrapbook), which can then be opened to visit the source web page. Capture as... prompts a dialog for customization before a capture.

  • Use Edit tab to start annotating or editing the web page. The edited web page can be captured afterwards.

  • Various web page elements can be captured individually through the context menu. For example, right-click on a link to capture the linked web page, right-click on an image or media to capture it, right-click on a frame to capture the frame page, right-click in a page or a frame page with something selected to capture only the selected range, etc.

  • The keyboard shortcut for various WebScrapBook functions can be customized using the built-in browser shortcuts manager.

  • With a backend server configured, a capture with positioning can be invoked by dragging the capture tab (or similar) command button onto a desired position in the sidebar.

    Capture tabs through dragging and dropping

  • Similarly, a capture with positioning can be invoked through dragging a link, image, etc., in a page.

  • When clicking or dragging a toolbar button command like capture tab, hold Shift to toggle between capturing tab or source, hold Alt to capture as bookmark, or hold Ctrl to open a dialog for customization.

Batch capture

A batch capture can be invoked to capture multiple web pages one by one through the following ways:

  • Hold Ctrl or Shift to select multiple tabs and invoke a batch capture through the capture tab (or alike) from the toolbar button.

  • Invoke a batch capture through the batch capture all tabs from the toolbar button. All tabs will be pre-filled in the prompted dialog for later manipulation.

  • Invoke a batch capture through the batch capture selected links from the toolbar button or context menu. All links or all selected links in the page will be pre-filled in the prompted dialog for later manipulation.

Capture linked files

To capture images, audio, or other resource files attached using links in a web page, go to the capture - capture links options, set download linked files to match URL file extension or match HTTP header and file extension, and set appropriate conditions in the included file types for downloading linked files. WebScrapBook will save those linked files together when capturing the web page.

Capture linked web pages (in-depth capture)

Go to capture - capture links, set depth to capture linked pages to a positive integer, and configure the filter rules in included URLs for capturing linked pages. WebScrapBook will then capture linked web pages that match the rules when capturing a web page, rebuild the interlinks, and generate a resource map. An in-depth captured item will be marked as a "site" type.

For example, for a web page at http://example.com/foo having a hyperlink targeting http://example.com/bar, use the filter http://example.com/bar to additionally capture that page, or use the filter /^http://example\.com// to additionally capture the linked web pages under the same domain.

Hint: depth to capture linked pages can also be set to 0 to generate a resource map while capturing only the current page. A "merge-capture" mentioned later can be performed to add other web pages to the item.

NOTE: It's recommended to set Save captured data as to folder when performing an in-depth capture. Although HTZ package and MAFF package are also permitted, capturing too many pages may cause a capture failure due to memory exhaustion, or poor performance when annotating a page in a large archive file; additionally, merge-capture is not available for archive formats.

Re-capture

If the backend server is configured, a re-capture can be invoked from the context menu of a web page, site, bookmark, etc., item. Alternatively, in the capture tab as... dialog, set capture type to re-capture and select a suitable target item.

A re-capture replaces the content of the original item; updates its item type, index, modified time, source URL, favicon, title, and comment; and attempts to copy annotations from the original web page (this may fail if the two versions differ significantly). The original web page will be moved to the backup directory (by default .wsb/backup/) after the re-capture succeeds (the automatic backup can be disabled through the backup when capturing again option).

Merge-capture

If the backend server is configured, this can be performed to capture a web page and merge it into a previously captured item. To do this, in the capture tab as... dialog, set capture type to merge-capture and select a suitable target item.

The target item for a merge-capture requires a resource map (index.json), which is generated only for a site, i.e. when depth to capture linked pages has been set to 0 or more. Additionally, its captured data should have been saved as folder.

Except for the main page, a merge-capture determines whether a resource (page or file) already exists via the resource map. An existing resource will not be downloaded again, even if it has changed on the original site. To avoid inconsistencies (e.g. a new version of a page referencing an old version of a resource), it's generally not recommended to perform a merge-capture long after the original capture.

To update resources, edit the resource map file and delete their entries from the files property, and perform a merge-capture for the referencing page.

A merge-capture may miss redirects. For example, a merge-capture on the page http://example.com/redirected cannot rewrite a hyperlink http://example.com/link that redirects to it in the already captured page http://example.com/main. To fix this issue, edit the resource map and add ["http://example.com/link", "http://example.com/redirected"] to the redirects property, and perform a merge-capture on http://example.com/main (or another page) to trigger link rebuilding.
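
The resource map is a JSON file. As a rough sketch of what it may contain, based only on the files and redirects properties described above (the exact schema and any additional fields may differ, and the file paths here are hypothetical):

{
  "files": {
    "index.html": {},
    "resource.png": {}
  },
  "redirects": [
    ["http://example.com/link", "http://example.com/redirected"]
  ]
}

In this sketch, deleting the "resource.png" entry from files and performing a merge-capture for the referencing page would cause that resource to be downloaded again.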

Capture helpers

Capture helpers allow customization for specific sites. Check enable capture helpers in the options and set an adequate JSON config to make it work. Below are some usage examples:

Capture deferred loading images (which record true URLs with data-*)

[
  {
    "name": "DeferredImageFixer",
    "description": "Save deferred images defined by data-*",
    "commands": [
      ["attr", {"css": "img[data-src]"}, "src", ["get_attr", null, "data-src"]],
      ["attr", {"css": "img[data-srcset]"}, "srcset", ["get_attr", null, "data-srcset"]]
    ]
  }
]

Default not to capture images for a specific site

[
  {
    "description": "Don't capture images on this site",
    "pattern": "/^https?://example\\.com//i",
    "options": {
      "capture.image": "blank",
      "capture.imageBackground": "blank"
    }
  }
]
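
The fields shown separately above can presumably be combined in a single helper, for example to apply the deferred-image fix only on a specific site (a sketch assuming pattern and commands compose within one helper; verify against the capture helpers documentation):

[
  {
    "name": "ExampleSiteImageFixer",
    "description": "Fix deferred images on example.com only",
    "pattern": "/^https?://example\\.com//i",
    "commands": [
      ["attr", {"css": "img[data-src]"}, "src", ["get_attr", null, "data-src"]]
    ]
  }
]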

Auto-capture

Check enable auto-capture in the options and set an adequate JSON config to make it work. Below are some usage examples:

Auto-capture any web page

[{}]

Auto-capture any web page under example.com

[{"pattern": "/^https?://example\\.com//"}]

Auto-capture any web page 10 seconds after loading complete

[{"delay": 10000}]

Auto-capture a new version of the web page every 60 seconds

[{"repeat": 60000}]

Auto-capture as a bookmark

[{"taskInfo": {"mode": "bookmark"}}]

Put the auto-captured web page under the autocaptures subdirectory

  • Requires Save captured data to: set to Scrapbook folder and Scrapbook folder: set to WebScrapBook/data.
[{"taskInfo": {"options": {"capture.saveFolder": "WebScrapBook/data/autocaptures"}}}]

Put the auto-captured web page items under a specific item

  • Requires Save captured data to: set to Backend server.
[{"taskInfo": {"parentId": "20200101020304567"}}]

Put the auto-captured web page item as the first item under its parent

  • Requires Save captured data to: set to Backend server.
[{"taskInfo": {"index": 0}}]

Add an #autocapture hashtag in the comment for the auto-captured web page item

  • Requires Save captured data to: set to Backend server.
[{"eachTaskInfo": {"comment": "#autocapture"}}]
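
The options above can presumably be combined in one rule. For example, to auto-capture pages under example.com 10 seconds after loading, place them under a specific item, and tag each comment (a sketch assuming the fields compose; requires Save captured data to: set to Backend server, and the parent item ID here is hypothetical):

[
  {
    "pattern": "/^https?://example\\.com//",
    "delay": 10000,
    "taskInfo": {"parentId": "20200101020304567"},
    "eachTaskInfo": {"comment": "#autocapture"}
  }
]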