Dataset for pdf-struct (Logical structure analysis tool for visually structured documents)

This is the dataset to reproduce the experiments for pdf-struct, a tool for extracting fine-grained logical structures (such as boundaries and their hierarchies) from visually structured documents (VSDs) such as PDFs.

Please visit the website at https://github.com/stanfordnlp/pdf-struct for data specifications and other details.

Dataset specification

Our dataset consists of the following files:

pdf-struct-dataset/
  - contract_pdf_en/
    - raw/
    - anno/
    - source.tsv
  - contract_pdf_ja/
  - contract_text_en/
  - legistration_pdf_en/
  - LICENSE
  - README.md

Each root-level directory represents a single type of dataset:

  • contract_pdf_en/: English NDAs in PDF format
  • legistration_pdf_en/: English executive orders from local authorities
  • contract_text_en/: English NDAs in visually structured plain text format
  • contract_pdf_ja/: Japanese NDAs in PDF format

Raw documents are located in the raw/ directory. Annotated data (generated by pdf-struct init-dataset and then annotated by hand) is located in the anno/ directory. source.tsv lists the URL from which each document was obtained.
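The layout above can be inspected with a short script. The following is a minimal sketch, not part of pdf-struct itself: the function names `summarize_subset` and `summarize_dataset` are hypothetical, and it assumes that every dataset directory follows the same raw/, anno/, source.tsv layout shown for contract_pdf_en/. The column layout of source.tsv is not specified here, so rows are returned as plain lists of strings.

```python
import csv
from pathlib import Path

# Directory names as listed in the dataset specification above.
SUBSETS = [
    "contract_pdf_en",
    "contract_pdf_ja",
    "contract_text_en",
    "legistration_pdf_en",
]


def summarize_subset(subset_dir):
    """List files in raw/ and anno/ and read source.tsv as raw rows."""
    subset_dir = Path(subset_dir)
    source_path = subset_dir / "source.tsv"

    def list_files(directory):
        # A missing directory yields an empty list rather than an error.
        if not directory.is_dir():
            return []
        return sorted(p.name for p in directory.iterdir() if p.is_file())

    rows = []
    if source_path.is_file():
        with source_path.open(newline="", encoding="utf-8") as f:
            # Assumption: tab-separated values; columns are not interpreted.
            rows = [row for row in csv.reader(f, delimiter="\t") if row]

    return {
        "raw": list_files(subset_dir / "raw"),
        "anno": list_files(subset_dir / "anno"),
        "sources": rows,
    }


def summarize_dataset(root):
    """Summarize every known subset that exists under the dataset root."""
    root = Path(root)
    return {
        name: summarize_subset(root / name)
        for name in SUBSETS
        if (root / name).is_dir()
    }
```

Calling `summarize_dataset("pdf-struct-dataset")` returns a dictionary keyed by dataset directory name, which makes it easy to check that each raw document has a matching annotation and source entry.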

License

Our dataset is released under CC BY 4.0. Please refer to the attached "LICENSE" or https://creativecommons.org/licenses/by/4.0/ for the exact terms.

When you use our dataset in your work, please cite our paper:

@inproceedings{koreeda-manning-2021-capturing,
    title = "Capturing Logical Structure of Visually Structured Documents with Multimodal Transition Parser",
    author = "Koreeda, Yuta  and
      Manning, Christopher D.",
    booktitle = "Natural Legal Language Processing Workshop",
    year = "2021",
    publisher = "Association for Computational Linguistics"
}

Download

Terms and Conditions of Use

By clicking “DOWNLOAD” below and accessing, reviewing, downloading or otherwise making use of the Hitachi America, Ltd. dataset, entitled “Dataset for pdf-struct (Logical structure analysis tool for visually structured documents)” (the “Dataset”), or any portion thereof, you are agreeing to these Terms and Conditions of Use.

If you do not agree to these Terms and Conditions of Use in their entirety, please do not access, review, download or otherwise make use of the Dataset.

The Dataset is being made available to you pursuant to, and you understand and agree that you will have the rights to access, review, download or otherwise make use of the Dataset, in accordance with the terms and conditions of the Creative Commons Attribution 4.0 International Public License.

THIRD PARTY WEBSITES, PLATFORMS AND NETWORKS: To the extent the Dataset contains any links to third party websites, social media platforms or other Internet-based networks, you understand and agree that Hitachi America, Ltd. does not own, operate, control or endorse such third party websites, platforms or networks. Accordingly, these Terms and Conditions do not apply to any such third party websites, platforms or networks or govern such third parties’ collection, use, storage, disclosure, or other processing of your personal information. Such third party websites, platforms or networks will be subject to their own terms and conditions of use and privacy policies.

NOTICES OF COPYRIGHT INFRINGEMENT: In the event you believe that the Dataset incorporates any of your work or materials in a way that may constitute infringement of your copyright or any other intellectual property right, Hitachi America, Ltd. will take appropriate actions in response to your notification thereof.

Pursuant to Title 17, United States Code, Section 512(c)(3), a notification of claimed infringement must be a written communication addressed to the designated agent as set forth below (the “Notice”), and must include substantially all of the following:

(a) a physical or electronic signature of the person authorized to act on behalf of the owner of the copyright interest that is alleged to have been infringed;

(b) a description of the copyrighted work or works that you claim have been infringed (“infringed work”) and identification of what material in such work(s) is claimed to be infringing (“infringing work”) and which you request to be removed or access to which is to be disabled;

(c) a description of the exact name of the infringing work that is being used in the Dataset (and the location of the infringing work, if it appears in the Dataset);

(d) information sufficient to permit Hitachi America, Ltd. to contact you, such as your physical address, telephone number, and email address;

(e) a statement by you that you have a good faith belief that the use of the material identified in your Notice in the manner complained of is not authorized by the copyright owner, its agent, or the law;

(f) a statement by you that the information in your Notice is accurate and, under penalty of perjury, that you are the copyright owner or authorized to act on the copyright owner’s behalf.

To reach Hitachi America, Ltd.’s Copyright Agent for Notice of claims of copyright infringement:

Hitachi America, Ltd.
2535 Augustine Drive, Third Floor
Santa Clara, California 95054
Attn.: Legal Dept.

The Copyright Agent should only be contacted if you believe that your work has been used or copied in a way that constitutes copyright infringement and such infringement is occurring on or through the Dataset. The Copyright Agent will not respond to any other inquiries.

DOWNLOAD

Changelog and release notes

  • 11/7/2021: Initial release