A template repository for creating the data package repository of a department in the Maharashtra Government

orgpedia/template.mahdept

Template: template.mahdept

This is a template for creating datapackage repositories for the Government of Maharashtra, covering orders/GRs found at https://gr.maharashtra.gov.in/1145/Government-Resolutions. If you are looking for a generic datapackage template, visit http://github.com/orgpedia/template.datapackage.

You need to have the copier package installed to use this template.

Creating a Maharashtra department data package repository

  1. Create an empty repository on GitHub. Make sure you choose a short name <org_code> that begins with mah, for example mahedu or mahhome.

  2. Clone the newly created repository using the command given below. If you are on Windows, you should enable sym-links first. Check this link to know more. Note that because there are a lot of PDFs, the PDF files are not checked in.

    git clone http://github.com/orgpedia/<org_code>

  3. Populate the <org_code> directory from this template by running the command below.

    copier https://github.com/orgpedia/template.mahdept.git path/to/<org_code>

  4. Commit the changes in the git repository.
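Before running the steps above, it can help to confirm that the chosen name fits the convention and that sym-links can be created on your machine. The following is a minimal sketch, not part of the template: `looks_like_org_code` and `symlinks_supported` are hypothetical helpers, and the rule that an org code is lowercase alphanumeric is an assumption (the text only requires a short name beginning with mah).

```python
import os
import tempfile
from pathlib import Path


def looks_like_org_code(name: str) -> bool:
    """Check the naming convention from step 1: a name starting with 'mah'.

    Assumption: org codes are lowercase alphanumeric, e.g. 'mahedu'.
    """
    return len(name) > 3 and name.startswith("mah") and name.isalnum() and name.islower()


def symlinks_supported() -> bool:
    """Return True if the current user can create a symbolic link in a temp dir."""
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "target.txt"
        target.write_text("ok")
        link = Path(tmp) / "link.txt"
        try:
            os.symlink(target, link)
        except (OSError, NotImplementedError):
            # Typically raised on Windows when sym-links are not enabled.
            return False
        return link.is_symlink()
```

If `symlinks_supported()` returns `False` on Windows, enable sym-links before cloning, as noted in step 2.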

Directory Layout

The directory layout of this repository follows the ideas mentioned in this video: Principled Data Processing by Patrick Ball. The idea is to follow a fixed and well understood layout to minimize the need for documentation.

There are 3 main directories: import, flow and export. A simple makefile orchestrates the document flow across these folders; run make help to find out more about the commands.

import

This contains the documents you want to process and other artifacts needed for executing the document flow. Most of this content is either on GitHub LFS or imported at the start and never checked in.

$ tree import
import
├── data_packages      # Data (orders, officer_infos) from other data packages
│   └── README.md
├── docker             # Generated Docker images and mounted dirs
│   └── README.md
├── documents          # Documents that are processed, with normalized names
│   ├── README.md
│   ├── dept-1.pdf -> ../.../upload_3456_12022023.pdf # short name
│   ├── dept-2.pdf -> ../.../upload_3456_12022024.pdf
│   └── documents.yml
├── models             # Models needed for this data package
│   ├── huggingface
│   │   └── README.md
│   └── models.yml
├── src                # code for downloading docs
│   └── README.md
└── websites           # Actual document pdfs (stored on GitHub LFS)
    ├── README.md
    └── dept.gov.in
        └── uploads
            ├── upload_3456_12022023.pdf
            └── upload_3456_12022024.pdf
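The layout above pairs each downloaded PDF under import/websites with a short, normalized sym-link under import/documents. The sketch below shows one way such links could be created; `link_documents` is a hypothetical helper, and numbering the files in sorted order with a `dept` prefix is an assumption made for illustration, not the template's actual naming logic.

```python
import os
from pathlib import Path


def link_documents(websites_dir: Path, documents_dir: Path, prefix: str = "dept") -> list[Path]:
    """Create <prefix>-N.pdf sym-links in documents_dir for every PDF under websites_dir."""
    documents_dir.mkdir(parents=True, exist_ok=True)
    links = []
    # Sorting keeps the numbering stable between runs (assumed scheme).
    for number, pdf in enumerate(sorted(websites_dir.rglob("*.pdf")), start=1):
        link = documents_dir / f"{prefix}-{number}.pdf"
        if link.is_symlink() or link.exists():
            link.unlink()
        # A relative target keeps the link valid wherever the repository is cloned.
        link.symlink_to(os.path.relpath(pdf, start=documents_dir))
        links.append(link)
    return links
```

Relative targets matter here because the PDFs themselves are not checked in, so the links must resolve on any machine once the files are downloaded.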

flow

This contains a series of task directories that process the documents and generate the data; this is the main part of the repository. Note that the data (both intermediate and final) is checked into the repository.

$ tree flow
flow
├── subFlows           # Generate supporting data/conf files
│   └── README.md
└── task_              # Main processing directory for document-flow
    ├── conf           # Contains configuration for the files in task
    ├── input          # contains input to the task
    │   ├── dept-1.pdf -> ../../../import/documents/dept-1.pdf
    │   └── dept-2.pdf -> ../../../import/documents/dept-2.pdf
    ├── logs           # logs generated in the task
    ├── output         # output generated by this task
    │   ├── dept-1.order.json
    │   ├── dept-2.order.json
    │   ├── orders.json
    │   └── tenures.json
    └── src            # code for this task
        └── gen_order.py
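Because every task directory follows the same fixed layout, the structure is easy to verify mechanically. The sketch below checks a task directory for the five sub-directories shown in the tree; `missing_task_dirs` is a hypothetical helper written for illustration and is not necessarily how the `check` make target is implemented.

```python
from pathlib import Path

# Sub-directories shown in the flow tree for each task directory.
EXPECTED_SUBDIRS = ("conf", "input", "logs", "output", "src")


def missing_task_dirs(task_dir: Path) -> list[str]:
    """Return the expected sub-directories that are absent from task_dir."""
    return [name for name in EXPECTED_SUBDIRS if not (task_dir / name).is_dir()]
```

An empty list means the task directory matches the layout; any returned names are the folders still to be created.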

export

The generated data is stored in the export/data folder as sym-links. The export directory also contains other intermediate data folders that are needed. Most of these directories are not checked into the repository.

$ tree export
export
├── data               # This contains the final data generated
│   ├── README.md
│   ├── orders
│   │   ├── README.md
│   │   ├── dept-1.order.json -> .../flow/task_/output/dept-1.order.json
│   │   └── dept-2.order.json -> .../flow/task_/output/dept-2.order.json
│   ├── orders.json -> ../../flow/task_/output/orders.json
│   ├── schema
│   │   └── README.md
│   └── tenures.json -> ../../flow/task_/output/tenures.json
├── html               # Temporary HTML folder for debugging
│   └── README.md
├── orgpedia_{{org_code}} # Folder for creating python package
│   ├── __init__.py
│   └── datapackage.json.jinja
└── page_images        # images of pages for vision algorithms
    └── README.md
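Since export/data consists of sym-links into flow, a link can silently break if a task output is renamed or has not been generated yet. The following sketch lists such dangling links; `dangling_links` is a hypothetical helper and is not part of the template.

```python
from pathlib import Path


def dangling_links(data_dir: Path) -> list[Path]:
    """Return sym-links under data_dir whose targets do not exist."""
    return sorted(
        path
        for path in data_dir.rglob("*")
        # exists() follows the link, so it is False for a broken sym-link.
        if path.is_symlink() and not path.exists()
    )
```

Running it on export/data after the flow has executed should return an empty list.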

make help

Run make help to list the available targets.

$ make help
Please use 'make <target>', where <target> is one of

install     install packages and prepare software environment

import      import data required for processing document flow
flow        execute the tasks in the document flow
export      export the data generated by the document flow

check       check if files are correctly laid in the flow directory
readme      generate the readme for the flow/task directories

lint        run the code linters
format      reformat code
pre-commit  run pre-commit checks, runs yaml lint, you need pre-commit

Check the makefile to know exactly what each target is doing.
