Congressional Portal Project

Overview

The Congressional Portal Project repository holds workflows, methodologies, instructional materials, controlled vocabularies, and related resources. It was created to house methods for running large-scale projects efficiently, to store automated workflows, and to document strategies throughout the project timeline. The project focuses on materials relating to the American Congress held by the Carl Albert Congressional Research and Studies Center Archives.

The scripts and batch processes in this repository automate large-scale archival workflows. The primary goal is to streamline the extraction, analysis, and enrichment of metadata from archival text files using Natural Language Processing (NLP) techniques and OpenAI's GPT models. This automation helps manage extensive archival collections efficiently, keep metadata quality consistent, and support advanced data analysis.

Background

Partnerships: Collaborated with West Virginia University Libraries, the Robert J. Dole Institute of Politics, the Robert C. Byrd Center for Congressional History and Education, The Dirksen Congressional Center, the University of Hawai'i at Manoa, and the Richard B. Russell Library for Political Research and Studies to create the American Congress Digital Archives Portal.
Objective: Address challenges in using congressional archives, which are large, complex, and dispersed across various institutions.
Focus: Highlight materials related to American Indian sovereignty, providing insights into the struggles and achievements of American Indian communities.
Results: The project began with several independent scripts handling different tasks across stages of transcription, metadata extraction, and face recognition. Over time, through multiple iterations and refinements, the workflow has evolved into a more streamlined and efficient pipeline. This pipeline now integrates advanced natural language and facial recognition tools, controlled lists, feedback loops, and parallel processing via MPI to enhance both accuracy and speed.
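
The parallel processing step mentioned above could divide work among MPI processes roughly as follows. This is a minimal sketch, not the project's actual code: it assumes the mpi4py package, and the names `shard`, `get_rank_and_size`, and `files_for_this_process` are hypothetical.

```python
from pathlib import Path


def shard(items, rank, size):
    """Return the items assigned to one process, dealt out round-robin."""
    if size < 1 or not 0 <= rank < size:
        raise ValueError("rank must satisfy 0 <= rank < size")
    return items[rank::size]


def get_rank_and_size():
    """Ask MPI for this process's rank; fall back to a single process."""
    try:
        from mpi4py import MPI  # assumption: mpi4py is the MPI binding in use
    except ImportError:
        return 0, 1
    comm = MPI.COMM_WORLD
    return comm.Get_rank(), comm.Get_size()


def files_for_this_process(folder):
    """List the OCR text files that the current process should handle."""
    # Sorting gives every process the same ordering, so shards never overlap.
    files = sorted(Path(folder).glob("*.txt"))
    rank, size = get_rank_and_size()
    return shard(files, rank, size)
```

Launched under an MPI runner such as `mpiexec -n 4 python process_files.py` (script name hypothetical), each of the four processes would receive a disjoint share of the files. Round-robin slicing keeps the shares within one item of each other without any communication between processes.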

Key Tasks

Develop Adaptive Learning Models: Create and train models within a controlled environment to ensure accuracy by minimizing unreliable external information.
Automate Text Recognition and Metadata Extraction: Use NLP techniques to recognize text in archival documents and extract relevant metadata.
Implement Feedback Loop Mechanisms: Establish cyclical processes that allow models to learn from each processed document, improving the controlled vocabulary and the identification of terminology and entities.
Enhance Standardization and Consistency: Ensure all archival entries adhere to consistent standards, improving data quality and reliability across multiple projects.
Integrate Human-Centered Design and Linear Reciprocity Model: Incorporate user-centric approaches and streamline data flow in the archival process.

Automating Archival Processes

Machine Learning Algorithms

Named Entity Recognition (NER): Identify and categorize proper nouns like people, organizations, and locations across large datasets.
Topic Modeling: Uncover themes and patterns in document collections by grouping related terms and phrases.
Text Classification: Connect documents to predefined terminologies and mappings based on their content.
Sentiment Analysis: Assess the tone and sentiment of textual data to understand the emotional context.
Entity Linking: Connect entities mentioned in the text with existing knowledge bases, enhancing accuracy.
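
As a rough illustration of entity recognition combined with entity linking, the sketch below matches names from a small lookup table and attaches a knowledge-base identifier to each hit. It is a dictionary-based stand-in for a trained NER model, and the table contents, the `kb:` identifiers, and the function name `link_entities` are all assumptions made for the example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    surface: str  # text exactly as it appears in the document
    label: str    # coarse category such as PERSON or ORG
    kb_id: str    # identifier in the (hypothetical) knowledge base
    start: int    # character offset where the match begins
    end: int      # character offset just past the match


# Hypothetical lookup table: surface form -> (label, knowledge-base id)
KNOWLEDGE_BASE = {
    "Carl Albert": ("PERSON", "kb:person/carl-albert"),
    "Robert L. Owen": ("PERSON", "kb:person/robert-l-owen"),
    "Cherokee Nation": ("ORG", "kb:org/cherokee-nation"),
}


def link_entities(text, kb=KNOWLEDGE_BASE):
    """Find known names in text and attach their knowledge-base ids."""
    entities = []
    # Try longer names first so they win over shorter overlapping ones.
    for surface in sorted(kb, key=len, reverse=True):
        label, kb_id = kb[surface]
        # Lookarounds keep a name from matching inside a longer word.
        pattern = r"(?<!\w)" + re.escape(surface) + r"(?!\w)"
        for match in re.finditer(pattern, text):
            overlaps = any(
                match.start() < e.end and e.start < match.end()
                for e in entities
            )
            if not overlaps:
                entities.append(
                    Entity(match.group(0), label, kb_id,
                           match.start(), match.end())
                )
    return sorted(entities, key=lambda e: e.start)
```

Calling `link_entities("Robert L. Owen was a member of the Cherokee Nation.")` returns two `Entity` records in document order, each carrying the identifier from the lookup table. A statistical model would generalize to unseen names, whereas this lookup only finds names already on the list.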

Feedback-Loop Controls

Continuous Improvement: Use feedback loops to refine the model's understanding of text complexities, including context, sentiment, and nuanced language.
Contextual Analysis: Differentiate between literal and figurative language by analyzing terms and phrases within context.
Verification Steps: Implement multi-step verification to accurately identify entities, dates, and relationships.
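
A multi-step verification pass for dates might look like the sketch below: a pattern proposes candidates, a parser confirms each one is a real calendar date, and a range check rejects implausible values. The function name `verify_dates`, the single supported date style ("January 4, 1975"), and the default lower bound of 1789 are assumptions chosen for illustration, and month names are parsed in the English locale.

```python
import re
from datetime import date, datetime

# Step 1: propose candidates that look like "Month D, YYYY".
DATE_PATTERN = re.compile(r"\b([A-Z][a-z]+ \d{1,2}, \d{4})\b")


def verify_dates(text, earliest=date(1789, 1, 1), latest=None):
    """Return (accepted dates, rejected candidates with a reason)."""
    latest = latest or date.today()
    accepted, rejected = [], []
    for candidate in DATE_PATTERN.findall(text):
        # Step 2: confirm the candidate is a real calendar date.
        try:
            parsed = datetime.strptime(candidate, "%B %d, %Y").date()
        except ValueError:
            rejected.append((candidate, "not a valid calendar date"))
            continue
        # Step 3: confirm the date is plausible for the collection.
        if not earliest <= parsed <= latest:
            rejected.append((candidate, "outside plausible range"))
            continue
        accepted.append(parsed)
    return accepted, rejected
```

Rejected candidates keep their reason, so a reviewer can decide whether the OCR text or the rule needs correcting.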

Controlled Vocabularies

Standardization: Utilize standardized terms to ensure consistent data annotation and retrieval.
Enhanced Discoverability: Cross-reference and validate data to produce accurate records, highlighting important historical contributions.
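
Standardization against a controlled vocabulary can be sketched as a lookup from variant spellings to one preferred term. The variant table and the names `normalize` and `standardize` below are hypothetical; the preferred terms are borrowed from the topic labels used in this project.

```python
from collections import Counter

# Hypothetical variant table: normalized variant -> preferred term
VARIANTS = {
    "congress as policymaker": "Congress as policy-maker",
    "congress as policy-maker": "Congress as policy-maker",
    "leaders & parties": "Leaders and parties",
    "leaders and parties": "Leaders and parties",
}


def normalize(term):
    """Lowercase a term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def standardize(terms, variants=VARIANTS):
    """Map raw terms to preferred terms and count the ones left unmapped."""
    preferred, unmapped = [], Counter()
    for term in terms:
        key = normalize(term)
        if key in variants:
            # Record each preferred term once, in first-seen order.
            if variants[key] not in preferred:
                preferred.append(variants[key])
        else:
            unmapped[key] += 1
    return preferred, unmapped
```

The unmapped counter is the part that supports review: terms that recur across many documents are natural candidates for an archivist to add to the vocabulary.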

Folders

documentation-applications-lists: Contains project worksheets, collection indexes, training models, and controlled vocabularies.
workflows: Contains packaged workflows with either executable portable applications or consolidated/compiled scripts for OCR, assigning controlled metadata, and extracting specific text from OCR text.
deprecated-packages: Contains outdated scripts and notes that have been replaced by newer versions.

Content Overview

The Center concentrates on content related to four curated collections, encompassing over 75,677 individual items from the CAC Archives. Additional digital files are available on our Digital Archives Platform.

Collection	Type	Topics	Subtopics	Significance	Extent	Formats
Indian Self-Determination	Topical	Congress as policy-maker, Leaders and parties	Types of decisions, Committee leadership, Policymaking in committee, Constituent communications	Congressional offices hold correspondence showcasing intricate strategies used by tribal entities and congressional members.	23 collections	PDF/A, PDF/E, or PDF with original file, TIFF
Robert L. Owen Collection	Collection-Whole	Congress as policy-maker, Leaders and parties, Congress and the courts	Cultural norms	Robert L. Owen was a member of the Cherokee Nation and represented the Five Civilized Tribes as a federal Indian agent before entering politics as a Progressive Democrat.	199 items	PDF/A, PDF/E, or PDF with original file, TIFF
U.S. House of Representatives Offices Campaign Ads	Collection-Whole	Leaders and parties, Elections, Congress and interest groups, Congress history - general	Leadership activities, Determinants of voting, Tactics, Electoral outcomes	Collection of television and radio political advertisements.	24,678 items	Motion JPEG 2000, MOV, AVI
Carl Albert Photograph Collection	Collection-Whole	Leaders and parties	Party leadership files	Carl Albert’s personal photograph collection, spanning his entire career.	11,000 items	TIFF

Acknowledgements

Carl Albert Congressional Research and Studies Center Archives

See acknowledgements.md for student staff and collaborators.

See collaborative-partners.md for project partners.

Authors

JA Pryse - Senior Archivist III

License

See LICENSE for more information.

Feel free to contribute to this project by submitting issues or pull requests. Your feedback and enhancements are welcome!
