Chairum Corpus

A corpus of publicly available speeches from Mexican presidents:

Andres Manuel Lopez Obrador
Claudia Sheinbaum Pardo

Data is currently sourced exclusively from YouTube. For some videos the automatically generated subtitles could not be retrieved; in those cases the transcription is produced with OpenAI Whisper.
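The fallback described above can be sketched as follows. This is a minimal illustration and not the repository's code: `fetch_captions` and `run_whisper` are hypothetical callables standing in for the caption download and the Whisper transcription, and the "OpenAI Whisper" source label is an assumed value (only "YouTube auto-generated captions" appears in the sample record in the Data section).

```python
from typing import Callable, Optional


def transcribe_video(
    video_id: str,
    fetch_captions: Callable[[str], Optional[str]],
    run_whisper: Callable[[str], str],
) -> dict:
    """Return the transcription and its source, preferring YouTube captions."""
    # Hypothetical helper: expected to return None when no captions exist.
    text = fetch_captions(video_id)
    if text is not None:
        return {
            "transcription_text": text,
            "transcription_source": "YouTube auto-generated captions",
        }
    # Fall back to speech-to-text; the label below is an assumed value.
    return {
        "transcription_text": run_whisper(video_id),
        "transcription_source": "OpenAI Whisper",
    }
```

Passing the two steps in as callables keeps the sketch independent of any particular caption or speech-to-text library.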

There is currently no interface or API for querying the data (planned for future iterations), but the files are easy to search with a text editor such as Visual Studio.

Data

Individual records are provided as JSON files under the data folder. A script is also provided to generate a single CSV file with all records. Sample record:

{
    "video_id": "_uNpYoBHukM",
    "video_thumbnail_url": "https://i.ytimg.com/vi/_uNpYoBHukM/hqdefault.jpg?sqp=-oaymwEcCNACELwBSFXyq4qpAw4IARUAAIhCGAFwAcABBg==&rs=AOn4CLBiA5GPXPQfIJ7UxkMLQKQY9gKhhQ",
    "video_url": "https://www.youtube.com/watch?v=_uNpYoBHukM",
    "video_title": "M\u00e9xico garantiza derecho de asilo a solicitantes de Nicaragua. Conferencia presidente AMLO",
    "video_length_seconds": 10097,
    "transcription_with_timestamps": [
        {
            "text": "el INE no se toca",
            "start": 1803.179,
            "duration": 5.761
        },
        {
            "text": "pero tambi\u00e9n",
            "start": 1806.6,
            "duration": 5.959
        },
        {
            "text": "Garc\u00eda Luna no se toca",
            "start": 1808.94,
            "duration": 3.619
        },
        {
            "text": "y en el fondo es",
            "start": 1812.779,
            "duration": 3.081
        },
        {
            "text": "el r\u00e9gimen",
            "start": 1816.159,
            "duration": 6.781
        },
        {
            "text": "corrupto y conservador no se toca",
            "start": 1818.26,
            "duration": 4.68
        },
        {
            "text": "para eso es pero es bueno",
            "start": 1826.039,
            "duration": 4.941
        }
    ],
    "transcription_text": " el INE no se toca pero tambi\u00e9n Garc\u00eda Luna no se toca y en el fondo es el r\u00e9gimen corrupto y conservador no se toca para eso es pero es bueno",
    "transcription_source": "YouTube auto-generated captions",
    "playlist_id": "PLRnlRGar-_296KTsVL0R6MEbpwJzD8ppA",
    "playlist_title": "Conferencias de prensa matutinas",
    "published_time_text": "Streamed 6 months ago",
    "retrieved_time": "2023-09-07 20:16:50.123990"
}
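As a sketch of how records like the sample above could be queried without an API, the snippet below scans the timestamped segments for a phrase. It assumes every file under the data folder is a single JSON object with the same fields as the sample; the function name `search_segments` is hypothetical and not part of the repository.

```python
import json
from pathlib import Path


def search_segments(data_dir, phrase):
    """Yield (video_url, start_seconds, text) for segments containing phrase."""
    needle = phrase.casefold()
    for path in sorted(Path(data_dir).glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        for segment in record.get("transcription_with_timestamps") or []:
            if needle in segment["text"].casefold():
                # Link to the moment in the video, truncated to whole seconds.
                url = f"{record['video_url']}&t={int(segment['start'])}s"
                yield url, segment["start"], segment["text"]
```

For example, `list(search_segments("data", "no se toca"))` would collect every matching segment. A phrase that spans two segments will not match this way; searching the `transcription_text` field instead covers that case at the cost of losing the timestamp.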

How to run?

Install requirements:

pip3 install -r requirements.txt

Get a YouTube Data API v3 key and export it as an environment variable:

export YOUTUBE_V3_API_KEY={YOUR_TOKEN}

Run:

python process.py && python transcribe.py

To generate a single CSV file for the dataset run:

python generate_csv.py
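For readers curious what such a flattening step involves, here is a rough sketch. It is not the contents of generate_csv.py: the function name, the choice to drop the nested segment list, and the one-record-per-file layout are all assumptions made for illustration.

```python
import csv
import json
from pathlib import Path


def records_to_csv(data_dir, csv_path):
    """Write one CSV row per JSON record; return the number of rows written."""
    records = [
        json.loads(p.read_text(encoding="utf-8"))
        for p in sorted(Path(data_dir).glob("*.json"))
    ]
    # The nested segment list does not fit a flat row, so it is left out here.
    skip = {"transcription_with_timestamps"}
    # Union of keys in first-seen order, in case some records lack a field.
    fields = list(dict.fromkeys(k for r in records for k in r if k not in skip))
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(records)
    return len(records)
```

Fields missing from a record are written as empty cells, which is the default behaviour of `csv.DictWriter`.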

Future work

Add persistence (db backend)
Add API
Handle phonetic coincidences gracefully (Krauze, Krause, Kraus, Krauz) using something like Metaphone or Beider-Morse
Add simple app to search and query the data
Add new field with transcribed text without stop words
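The last item could look roughly like the sketch below. The stop word list is a tiny hand-picked sample rather than a complete Spanish list, and the field name `transcription_text_no_stop_words` is hypothetical; neither comes from the repository.

```python
import re

# Tiny illustrative sample; a real Spanish stop word list would be far longer.
SPANISH_STOP_WORDS = {
    "el", "la", "los", "las", "de", "y", "en",
    "se", "no", "es", "pero", "para", "eso",
}


def remove_stop_words(text):
    """Return text lowercased with stop words removed, tokens joined by spaces."""
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(t for t in tokens if t not in SPANISH_STOP_WORDS)


def add_stop_word_free_field(record):
    """Return a copy of record with the hypothetical extra field added."""
    enriched = dict(record)
    enriched["transcription_text_no_stop_words"] = remove_stop_words(
        record["transcription_text"]
    )
    return enriched
```

Returning a copy leaves the original record untouched, so the new field can be added at export time without rewriting the files under data.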

