Impresso Multilingual Text Embedder

This repository offers tools for embedding texts in multiple languages with an efficient workflow. It uses the transformers library by Hugging Face and the make tool to manage large datasets. Make ensures robust and incremental processing, allowing you to handle new data, resume tasks, and run processes across different machines, all while avoiding redundant work.

Features

Efficient Storage Management: Minimal local storage is required as necessary data for each year of a newspaper are downloaded on-the-fly and truncated after uploading.
Parallel Processing: Processes run in parallel to optimize throughput.
Selective Processing: Only the necessary processing steps are executed, ensuring efficiency by not reprocessing existing outputs on S3.
S3 Integration: Integration with S3 for storing and resuming processing. The system ensures no overwriting of files or partial uploads due to interruptions. It is also posssible to run everything locally without S3.
Custom Embedding Options: Flexible configurations via normal environment variables or make variables, including the ability to specify model versions and filter text data.

Missing Features

Batch processing of texts is not yet implemented. This will be added in a future release.
Installing specialized xformer implementation for sparse attention inference is not yet imlemented. This shoulld be added in a future release for faster inference.

Concepts

Storage Locations

Local Storage: Temporary disk space used for processing tasks. This disk space can be fast storage.
S3 Storage: Permanent storage where final results are stored. Processing can be resumed from this storage.

File Stamps

To manage dependencies, local file stamps are used:

Input Stamps (.stamp): Indicate the status of input files on S3.
Output Stamps (no extension or .done): Indicate the completion status of output files on S3. Make uses these to determine if a file needs to be processed or not.

Local stamps help make determine which files need to be processed or skipped. The helper script lib/sync_s3_filestamps.py manages these stamps by syncing them with S3.

File Organization

The processing follows a structured organization:

S3 Buckets: Data is organized by processing steps and, in some cases, versions.
Build Directory: A local mirror of the S3 storage, structured similarly for consistency.

# Example directory structure
BUILD_DIR/BUCKET/NEWSPAPER/<NEWSPAPER-YEAR>.jsonl.bz2
BUILD_DIR/BUCKET/PROCESSING_TYPE/VERSION/NEWSPAPER/<NEWSPAPER-YEAR>.jsonl.bz2

Setup

Clone the Repository:

git clone git@github.com:impresso/impresso-text-embedder.git
cd impresso-text-embedder

Configure S3 Credentials: Copy the dotenv.sample file to .env. Modify the .env file to include your AWS credentials:
```
SE_ACCESS_KEY=<your-access-key>
SE_SECRET_KEY=<your-secret-key>
SE_HOST_URL=<your host name>
```
Install Dependencies: Ensure make and python3 and pip3 are installed. For GPU support of pytorch, go to https://pytorch.org/get-started/locally/ and get your installation command. Run:
```
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
pip3 install -r requirements.txt

# or pipenv install
```

Setup Environment and make variables:

cp dotenv.sample .env  # edit .env with your S3 credentials
cp local.config.sample.mk local.config.mk  # edit local.config.mk with your local settings

Setup Directories and Model: Create necessary directories, the list of newspapers to process and download the Hugging Face model:
```
make setup
```

Usage

Makefile Targets

make help

Running the Embedder

Process a single newspaper: You can specify a list of newspapers to process using the NEWSPAPER_LIST_FILE. The default list is generated automatically from the S3 bucket:
```
make newspaper
```
Parallel Processing of each newspaper: To process newspapers in parallel, use:
```
make each
```

Data flow overview:

flowchart LR
    %% Nodes
    %% External Entities



    subgraph cluster_local ["Local Machine"]
        style cluster_local fill:#FFF3E0,stroke:#FF6F00,stroke-width:1px

        %% Processes
        F{{"Text Embedding Processor"}}

        %% Data Stores
        B[("Local Rebuilt Data")]
        E[("Text Embedding Model")]
        C[("Processed Output Data")]


    end
    subgraph cluster_s3 ["S3 Storage"]
        style cluster_s3 fill:#E0F7FA,stroke:#0097A7,stroke-width:1px
        A[/"Rebuilt Data"/]
        D[/"Processed Data"/]


                %% Data Flows
        A -->|Sync Input| B
        B -->|Data| F
        E -->|Model| F
        F -->|Output| C
        C -->|Upload Output| D
        D -->|Sync Output| C
    end

About

Impresso

Impresso - Media Monitoring of the Past is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. CRSII5_173719 and the second project (2023-2027) by the SNSF under grant No. CRSII5_213585) and the Luxembourg National Research Fund under grant No. 17498891.

Copyrights

Copyright (C) 2018-2024 The Impresso team.
Contributors to this program include: Simon Clematide

License

This program is provided as open source under the GNU Affero General Public License v3 or later.

Name		Name	Last commit message	Last commit date
Latest commit History 60 Commits
lib		lib
.flake8		.flake8
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
Pipfile		Pipfile
README.md		README.md
config.local.sample.mk		config.local.sample.mk
dotenv.sample		dotenv.sample
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Impresso Multilingual Text Embedder

Features

Missing Features

Concepts

Storage Locations

File Stamps

File Organization

Setup

Usage

Makefile Targets

Running the Embedder

Data flow overview:

About

Impresso

Copyrights

License

About

Releases

Packages

Contributors 2

Languages

License

impresso/impresso-text-embedder

Folders and files

Latest commit

History

Repository files navigation

Impresso Multilingual Text Embedder

Features

Missing Features

Concepts

Storage Locations

File Stamps

File Organization

Setup

Usage

Makefile Targets

Running the Embedder

Data flow overview:

About

Impresso

Copyrights

License

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages