This is a template for creating datapackage repositories for Government of Maharashtra orders/GRs (Government Resolutions) found at https://gr.maharashtra.gov.in/1145/Government-Resolutions. If you are looking for a generic datapackage template, visit http://github.com/orgpedia/template.datapackage.
You need to have the copier package installed to use this template.
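Copier is distributed as a Python package, so one common way to install it (assuming pip is available; pipx works as well) is:
pip install copier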
- Create an empty repository in GitHub. Make sure you choose a short name <org_code> that begins with mah, for example mahedu or mahhome.
- Clone the newly created repository using the command given below. If you are on Windows, you should enable sym-links; check this link to know more. Note that since there are a lot of PDFs, the PDF files are not checked in.
  git clone http://github.com/orgpedia/<org_code>
- Populate the <org_code> directory using this template by running the command:
  copier https://github.com/orgpedia/template.mahdept.git path/to/<org_code>
- Commit the changes in the git repository (the full sequence of commands is sketched below).
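Put together, a typical setup session might look like this (mahedu is only an illustrative org code; adjust the path to wherever you cloned the repository):
git clone http://github.com/orgpedia/mahedu
copier https://github.com/orgpedia/template.mahdept.git mahedu
cd mahedu
git add .
git commit -m "Populate repository from template.mahdept"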
The directory layout of this repository follows the ideas mentioned in this video: Principled Data Processing by Patrick Ball. The idea is to follow a fixed and well understood layout to minimize the need for documentation.
There are 3 main directories: import, flow and export. A simple makefile orchestrates the document flow across these folders; run make help to find out more about the commands.
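A typical end-to-end run, assuming the targets are invoked in the order listed below (see the make help output at the end of this section), would be:
$ make install
$ make import
$ make flow
$ make export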
The import directory contains the documents you want to process and other artifacts that are needed for executing the document flow. Most of this content is either on GitHub LFS or imported at the start and never checked in.
$ tree import
import
├── data_packages # Data (orders, officer_infos) imported from other data-packages
│ └── README.md
├── docker # Generated Docker images and mounted dirs
│ └── README.md
├── documents # Documents that are processed, with normalized names
│ ├── README.md
│ ├── dept-1.pdf -> ../.../upload_3456_12022023.pdf # short name
│ ├── dept-2.pdf -> ../.../upload_3456_12022024.pdf
│ └── documents.yml
├── models # Models needed for this data package
│ ├── huggingface
│ │ └── README.md
│ └── models.yml
├── src # code for downloading docs
│ └── README.md
└── websites # Actual document pdfs (stored on GitHub LFS)
├── README.md
└── dept.gov.in
└── uploads
├── upload_3456_12022023.pdf
└── upload_3456_12022024.pdf
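For illustration, the normalized names under import/documents are just sym-links back to the raw uploads under import/websites. Creating one by hand would look something like this (in practice the import scripts handle the naming, so treat this only as a sketch):
$ cd import/documents
$ ln -s ../websites/dept.gov.in/uploads/upload_3456_12022023.pdf dept-1.pdf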
The flow directory contains a series of task directories that process the documents and generate the data; this is the main part of the repository. Note that the data (both intermediate and final) is checked into the repository.
$ tree flow
flow
├── subFlows # Generate supporting data/conf files
│ └── README.md
└── task_ # Main processing directory for document-flow
├── conf # Contains configuration for the files in task
├── input # contains input to the task
│ ├── dept-1.pdf -> ../../../import/documents/dept-1.pdf
│ └── dept-2.pdf -> ../../../import/documents/dept-2.pdf
├── logs # logs generated in the task
├── output # output generated by this task
│ ├── dept-1.order.json
│ ├── dept-2.order.json
│ ├── orders.json
│ └── tenures.json
└── src # code for this task
└── gen_order.py
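Each task is self-contained: the code in src reads from input and conf and writes to output and logs. Tasks are normally executed via make flow, but during debugging you could run one directly, roughly along these lines (the actual script name and arguments may differ from task to task; this is only a sketch):
$ cd flow/task_
$ python src/gen_order.py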
The generated data is stored in the export/data folder as sym-links. The export directory also contains other intermediate data folders that are needed. Most of these directories are not checked into the repository.
$ tree export
export
├── data # This contains the final data generated
│ ├── README.md
│ ├── orders
│ │ ├── README.md
│ │ ├── dept-1.order.json -> .../flow/task_/output/dept-1.order.json
│ │ └── dept-2.order.json -> .../flow/task_/output/dept-2.order.json
│ ├── orders.json -> ../../flow/task_/output/orders.json
│ ├── schema
│ │ └── README.md
│ └── tenures.json -> ../../flow/task_/output/tenures.json
├── html # Temporary folder for HTML used while debugging
│ └── README.md
├── orgpedia_{{org_code}} # Folder for creating python package
│ ├── __init__.py
│ └── datapackage.json.jinja
└── page_images # images of pages for vision algorithms
└── README.md
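Once make export has run, the sym-links under export/data point at the latest task outputs, so the results can be inspected directly. For example, assuming orders.json is plain JSON as the layout above suggests:
$ ls -l export/data
$ python -m json.tool export/data/orders.json | head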
Check the makefile to know more about the commands:
$ make help
Please use 'make <target>', where <target> is one of
install install packages and prepare software environment
import import data required for processing document flow
flow execute the tasks in the document flow
export export the data generated by the document flow
check check if files are correctly laid in the flow directory
readme generate the readme for the flow/task directories
lint run the code linters
format reformat code
pre-commit run pre-commit checks, runs yaml lint, you need pre-commit
Check the makefile to know exactly what each target is doing.