Singapore Hansard Data

Parliament records from 2005-01-12 to 2021-03-05 were scraped and are stored in hansard_full.zip.
Each session is captured in a JSON file:
- Before 2012-09-10, the JSON format differs: the entire HTML file is stored in the JSON instead.
- From 2012-09-10, the JSON file contains the following:
  - Metadata including date of session, start time, session name
  - Attendance List
  - Permission for MPs to be absent
  - Assent to Bills Passed
  - Actual discussion by MPs with appropriate title and rich formatting
Some example Bills, Motions and Oral Answers are included in the Bill, Motion and OralAnswer folders.
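Because the JSON layout depends on the session date, it can help to check which layout to expect before parsing. The following is a minimal sketch, not part of the repository: the helper names `is_legacy_html` and `iter_sessions` are hypothetical, and it assumes hansard_full.zip contains one `.json` member per session. It makes no assumption about the keys inside each JSON file.

```python
import json
import zipfile
from datetime import date

# Cutover date stated above: sessions from this date onward use the structured layout.
STRUCTURED_FROM = date(2012, 9, 10)


def is_legacy_html(session_date: date) -> bool:
    """Return True if a session predates the structured JSON layout."""
    return session_date < STRUCTURED_FROM


def iter_sessions(zip_path: str):
    """Yield (member_name, parsed_json) for every .json member of the archive.

    Assumption: each session is stored as its own .json member.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith(".json"):
                with archive.open(name) as handle:
                    yield name, json.load(handle)
```

A caller could loop over `iter_sessions("hansard_full.zip")` and branch on `is_legacy_html` once the session date is known, sending older sessions to an HTML parser and newer ones to structured handling.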

Cleaning of data

The format_hansard.py script extracts speeches and standardises the formatting of the JSON files. The formatted JSON files are written into an output folder. By default, with no flags, all peripheral text is removed. Additional flags handle partial Hansard sessions (e.g. Bills, Motions and Oral Answers), and the -g flag enables more granular settings.

Annotation of data

The formatted_to_txt.py script converts a formatted JSON file into plain text for annotation.

The Annotated folder contains a config.json file to be used with ner-annotator. The annotated files are also in this folder, with names ending in _annotated.json.

The annotation workflow consisted of the following steps:

1. Download the JSON file.
2. Format the JSON file using format_hansard.py: python format_hansard.py step_1_file.json -f, which will give you a step_2_formatted.json file.
3. Convert to text: python formatted_to_txt.py step_2_formatted.json, which will give you a step_3_text.txt file.
4. Use ner_annotator: ner_annotator step_3_text.txt -c config.json -m hansard
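The steps above can be chained in a small Python driver. This is only a sketch under stated assumptions: `build_commands` and `run_workflow` are hypothetical helpers, not part of the repository, and the intermediate file names are passed in explicitly because how the scripts name their outputs is not specified here. The command-line arguments are copied from the steps above; `sys.executable` stands in for `python`.

```python
import subprocess
import sys


def build_commands(raw_json, formatted_json, text_file, config="config.json"):
    """Return the three commands from the annotation workflow, in order."""
    return [
        # Step 2: format the downloaded JSON file.
        [sys.executable, "format_hansard.py", raw_json, "-f"],
        # Step 3: convert the formatted JSON to plain text.
        [sys.executable, "formatted_to_txt.py", formatted_json],
        # Step 4: launch the annotator on the text file.
        ["ner_annotator", text_file, "-c", config, "-m", "hansard"],
    ]


def run_workflow(raw_json, formatted_json, text_file, config="config.json"):
    """Run each step in turn, stopping at the first one that fails."""
    for command in build_commands(raw_json, formatted_json, text_file, config):
        subprocess.run(command, check=True)
```

With the file names used above, the call would be `run_workflow("step_1_file.json", "step_2_formatted.json", "step_3_text.txt")`, run from the repository root so the scripts and config.json can be found.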

