Author: Oweys Momenzada
FOR DEEPER INSIGHT INTO THE WORK AND APPROACH, ALL NOTEBOOKS ARE WELL DOCUMENTED AND PROVIDED ON THIS GITHUB REPOSITORY.
In my time at SCHICKLER I was allowed to work on the award-winning DRIVE-Project. The Drive project has data from various regional publishers throughout Germany. One of my tasks was to develop an API in which customers search their archives for Evergreen articles. In addition, authors want to check how much evergreen character their written text contains.
Evergreen content is content that remains relevant regardless of the season or the time-frame (click here for more). Thus, publishers can always use these articles without creating new ones.
The challenge here is that such a project does not yet exist for German text corpus (it is also poorly documented for English text corpus). Therefore, this repository goes into detail about the technical approach.
The Dataset has been labeled manually by the publishers. Therefore, I can not provide a dataset to work on. However, there is a dataset for English Evergreens by StumbleUpon. You should be able to apply my approach to the StumbleUpon Dataset.
As mentioned, the data is manually labeled. Only the text and the article-ID were used as dataset. For EDA purposes, further data, such as genre, publisher, accesses, etc., were taken from Google BigQuery. A labeled article could look as follows:
ID | Text | Publisher | pageview_start | pageview_end | genre | topic | label |
---|---|---|---|---|---|---|---|
55312 | Experte gibt Tipps für... | Publisher 1 | 00:00:00 UTC | 00:00:20 UTC | Kultur | Tipps | Evergreen |
55442 | Zwei Schwerverletzte bei Unfall... | Publisher 3 | 03:00:10 UTC | 03:00:50 UTC | Gesellschaft | Nachrichten | Ephemeral |
Initially, a distinction was made between Evergreen-Seasonal, Evergreen-Forever, Evergeen-Event and Ephemeral. However, after EDA (see "/EDA/Evergreen EDA.ipynb"), a too large disbalance of the data was noticed, which would have had an high impact on the accuracy of the model. Therefore, we only distinguish between Evergreens and Ephemeral or Non-Evergreens.
After the EDA (see "EDA/Evergreen EDA.ipynb"), we could see that Evergreen articles behave differently in time than other articles. Evergreen articles have been more consistent in their views over time than other articles. Other articles have a high number of views in the first days and then drop significantly in the following days. Thus, you can classify Evergreen articles according to their behavior based on time. The problem is that, according to the results, the classification only can be reliable after 80 days of observation (see "/EDA/Timebased Clf.ipynb").
Therefore, we classify articles based on their content or text corpus. For the classification we will use the State-of-Art Model: BERT. The advantage here is that a classification can be performed immediately. We could reach an accuracy of over 83% (see "model/Model training.ipynb").
A Real World Application on some articles can be seen here "Results and Examples.ipynb"
This will be provided for SCHICKLERS customers based on an API. We first store the trained model into a Bucket in Google Cloud Storage and than load it into GCP AI Platform. We then implement Textcleaning and other Feature Engineering steps and also the communcation with the trained model on AI platform on a different .py-file (see "Application - API/main.py"). In addition, we use FLASK for our RESTful API. For our API we implement POST requests to get the text of our customers. We then finally deploy our API on APP Engine to provide for our customers online.
Cite the authors of the BERT Model.
@misc{devlin2019bert,
title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding},
author={Jacob Devlin and Ming-Wei Chang and Kenton Lee and Kristina Toutanova},
year={2019},
eprint={1810.04805},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Please cite this GitHub if you use this work.
@misc{momenzada_schickler_2021_evergreen,
title={Evergreen recognition for German text},
author={Momenzada, Oweys and SCHICKLER},
url={https://github.com/OweysMomenzada/Evergreen-Content-Classifier-for-german-Text},
journal={Github},
year={2021},
month={Sep}
}