Unified Corpus Explorer

Making UIMA-annotated corpora tangible, searchable and vivid.

We introduce the Unified Corpus Explorer (UCE), a standardized, dockerized, and dynamic Natural Language Processing (NLP) application designed for flexible and scalable corpus navigation. UCE takes the UIMA format for NLP annotations as its standardized input, builds its interfaces and features around those annotations, and adapts dynamically to each corpus and the annotations extracted from it.

Quick Start

Clone this repository:

git clone https://github.com/texttechnologylab/UCE.git

Start the docker containers:

docker-compose up

The web instance is, by default, reachable at http://localhost:8008. If you're looking for a small demo without setting one up yourself, please check our open demo.
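
Because the containers can take a while to come up, it may help to poll the portal before opening it in a browser. The following is a minimal sketch, not part of UCE: the function name `wait_for_portal`, its parameters, and the retry values are assumptions chosen for illustration; only the default URL is taken from the text above.

```python
import time
import urllib.error
import urllib.request


def wait_for_portal(url="http://localhost:8008", attempts=30, delay=2.0,
                    opener=urllib.request.urlopen):
    """Poll `url` until a request succeeds or the attempts run out.

    `opener` is injectable so the function can be exercised without a
    running instance; by default it is urllib's urlopen.
    """
    for _ in range(attempts):
        try:
            # urlopen raises for connection errors and HTTP error statuses,
            # so reaching the body of the `with` means the portal answered.
            with opener(url, timeout=5):
                return True
        except (urllib.error.URLError, OSError):
            pass  # assumption: the containers are still starting
        time.sleep(delay)
    return False
```

Calling `wait_for_portal()` after `docker-compose up` returns `True` once the default address answers, and `False` if it never does within the given number of attempts.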

We are currently creating a dedicated Documentation Page, which will be up soon and will explain in more detail how to configure and customize UCE.

About

UCE is customizable in terms of annotations imported, corporate identity used, and background information added. It allows the creation of a specific UCE instance for your project, regardless of the domain. It does so by utilizing UIMA-annotated corpora, with the primary tool for creating those being the Docker Unified UIMA Interface (DUUI). Hence, you would gather your corpus, use DUUI to annotate whatever you want to annotate, and finally import those annotations into UCE to host them.

Microservices

UCE consists of several microservices, each dockerized and built on distinct technologies, as outlined below:

Microservice	Description
A: Corpus-Importer	UCE is based on Corpus-Importer, a Java application that reads UIMA-annotated documents from a specified path, along with a corresponding corpus-configuration JSON file. The importer extracts the raw data and the configured annotations, applying its own post-processing to set up the environment, which includes text segmentation, database indexing, keyword extraction, and the creation of various embedding spaces, before finally storing each processed document in a PostgreSQL database (B).
B: Relational Database	As our primary database, we opted for a relational PostgreSQL database, as UCE requires a structured and standardized database schema that can be extended if necessary. Additionally, its compatibility with the pgvector extension enables efficient vector operations directly within the database engine. This allows us to store high-dimensional vector embeddings within relational data tables while also enabling fast vector operations and searches.
C: Graph Database	In addition to a relational database (B), UCE utilizes an Apache Jena SPARQL database to incorporate basic semantic searches in the Resource Description Framework (RDF) and Web Ontology Language (OWL) data formats. This integration enables the incorporation of domain-specific ontologies (e.g., biological taxonomy) into the UCE environment, further enriching its search capabilities.
D: Python Webserver	Within UCE, we also utilize a Python web service to provide an interface to machine learning and AI models, as these are primarily accessible through Python. In this context, the web server facilitates access to the generation of embedding vectors, their dimensionality reduction methods, such as t-SNE and PCA, and the inference of (Large) Language Models. The web server is accessible via a REST API and is utilized by services (A) and (E).
E: UCE Web Portal	The user interacts with UCE and all of its features through a web portal implemented in Java. This service communicates with all other services except for (B), providing a variety of search methods, visualization features, and different ways to interact with the underlying information units, as outlined in detail in Section 3.2.
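
To make the idea of vector search over stored embeddings (service B) more concrete, here is a small pure-Python sketch of a nearest-neighbour lookup by cosine distance. It is an illustration only: pgvector offers cosine distance as its `<=>` operator, but whether UCE ranks by cosine distance is an assumption, and the function names and toy data below are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query, rows, k=2):
    """Return the ids of the k rows whose embedding is closest to `query`."""
    ranked = sorted(rows, key=lambda row: cosine_distance(query, row[1]))
    return [row_id for row_id, _ in ranked[:k]]


# Hypothetical toy data: (document id, embedding) pairs.
rows = [
    ("doc-a", [1.0, 0.0]),
    ("doc-b", [0.0, 1.0]),
    ("doc-c", [0.9, 0.1]),
]

print(nearest([1.0, 0.0], rows))
```

In a database with pgvector the same ranking would be done inside the engine rather than in application code, which is the benefit the table describes.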

In Medias Res

Some, but not all of the search and visualization features within UCE:

Annotations

Currently supported annotations within UCE are outlined in the following table:

Annotation	Description
Sentence	Divides the documents into their respective sentences.
Named-Entity	Extracts named entities from a document, categorizing them into four types: organization (ORG), person (PER), location (LOC), and miscellaneous (MISC).
Lemma and POS	Lemmatization reduces inflected words to their root form. Within UCE, searches are enhanced by considering these root forms.
Semantic Role Labels (SRL)	SRL identifies semantic relations between the lexical constituents of a sentence, assigning labels to words or phrases that indicate their semantic roles, such as agent, goal, or result.
Time	Extracts temporal expressions, including time and date formats, from a document, analogous to Named-Entity Recognition tasks.
Taxon	Recognizes taxa, that is, unambiguous names of biological entities.
WikiLinks	Maps potential words and phrases to their corresponding Wikidata URLs, facilitating the retrieval and access of additional information.
OCR	Since much of the literature has yet to be digitized, UCE provides support for corpora containing documents that have undergone Optical Character Recognition (OCR) extraction. These annotations assist in reconstructing the physical layout of the pages within UCE.
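
UIMA annotations are stand-off: each one refers to a character range of the document text via begin and end offsets. The sketch below illustrates that idea with two of the annotation types from the table. It is a simplified model under stated assumptions: the `Span` class, its field names, and the helper functions are hypothetical and are not UCE's or UIMA's actual data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    kind: str        # annotation type, e.g. "Sentence" (hypothetical label)
    begin: int       # inclusive character offset into the document text
    end: int         # exclusive character offset into the document text
    value: str = ""  # optional label, e.g. the named-entity category


def covered_text(text, span):
    """Return the part of `text` that the annotation points at."""
    return text[span.begin:span.end]


def spans_within(spans, outer, kind):
    """Return all annotations of `kind` lying fully inside `outer`."""
    return [s for s in spans
            if s.kind == kind and outer.begin <= s.begin and s.end <= outer.end]


text = "Goethe was born in Frankfurt. He studied law."
spans = [
    Span("Sentence", 0, 29),
    Span("Sentence", 30, 45),
    Span("NamedEntity", 0, 6, "PER"),
    Span("NamedEntity", 19, 28, "LOC"),
]

first_sentence = spans[0]
for entity in spans_within(spans, first_sentence, "NamedEntity"):
    print(entity.value, covered_text(text, entity))
```

Keeping annotations separate from the text in this way is what lets different annotation layers (sentences, entities, lemmas, and so on) be combined over the same document.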
