I am pleased to present our graduation project, built by a team of 9 members: "AI-generated media (Audio, Image, and Text) Detection", or as we call it, "Catch the AI".
Catch The AI is a complete system that lets you create an account, analyze different kinds of media (text, audio, and image), and keep a record of everything you have previously checked. Each media type is handled by its own self-contained model.
Register now to be able to catch the AI:
- Demo : catchtheai
- Main repo on GitHub : private repo
- You can try all of the DAIGT models (RoBERTa, DeBERTa, DistilBERT, BERT, and FeedForwardWithRoBERTaDeBERTa) on Hugging Face Spaces : DAIGT Space
This repository covers my part of the graduation project: the Detect AI-Generated Text (DAIGT) model.
About this repo: here I present everything related to my work on the text model, from data collection through experiments with several models to the final model.
One of the goals of Large Language Models (LLMs) is to produce text that reads like human writing. With the many LLMs now available and ready to use, such as GPT-4 and Gemini, as well as open-source models you can fine-tune on your own data and task, such as Mistral, and with how heavily we rely on them, it is becoming difficult to tell whether a given text was written by a human or by a model.
This affects many things. For example, students may use LLMs to do their homework, which hurts their academic progress and makes it hard for a teacher to judge their real level, so the teacher's assessment of those students will be wrong. Another example, which we ran into during data collection, is the loss of trust in articles: we kept wondering who actually wrote a given article. If we assume it was written by an LLM and our suspicion is wrong, we lower the quality of the data the model will be trained on.
These are just a few simple examples of situations where you would want the DAIGT model, and "Catch The AI" more broadly, to step in and help you.
After many stumbles and experiments to find suitable data and an architecture capable of meeting our goal of a robust, generalizable model, we trained many models: some from scratch (Bi-LSTM, Conv1D, etc.) with different tokenizers such as the ELMo model and the BERT tokenizer, and pre-trained models such as Mistral-7B, BERT, DistilBERT, RoBERTa, and DeBERTa.
- Here is an explanation of the latest architecture:
In the DAIGT model we relied on two models, RoBERTa and DeBERTa, which proved their efficiency on data they were not trained on. We therefore decided to use them together in an ensemble: a feedforward layer (ReLU activation function) of 32 neurons trained on the outputs coming out of RoBERTa and DeBERTa, as sketched below.
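The following is a minimal PyTorch sketch of that ensemble head, not the project's exact code. It assumes the two backbones each produce 2-class logits that are concatenated and fed to the 32-neuron feedforward layer; the checkpoint names, the features passed to the layer, and the training setup are assumptions for illustration.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification


class DAIGTEnsemble(nn.Module):
    """RoBERTa + DeBERTa ensemble joined by a small feedforward layer."""

    def __init__(self,
                 roberta_name: str = "roberta-base",          # placeholder checkpoint
                 deberta_name: str = "microsoft/deberta-v3-base"):  # placeholder checkpoint
        super().__init__()
        # Two fine-tuned backbones, each a binary (human vs. AI) classifier.
        self.roberta = AutoModelForSequenceClassification.from_pretrained(roberta_name, num_labels=2)
        self.deberta = AutoModelForSequenceClassification.from_pretrained(deberta_name, num_labels=2)
        # Feedforward ensemble layer: 32 neurons with ReLU, trained on the
        # concatenated outputs of the two backbones.
        self.ensemble = nn.Sequential(
            nn.Linear(2 + 2, 32),
            nn.ReLU(),
            nn.Linear(32, 2),
        )

    def forward(self, roberta_inputs: dict, deberta_inputs: dict) -> torch.Tensor:
        # Each backbone receives text tokenized with its own tokenizer.
        r_logits = self.roberta(**roberta_inputs).logits   # (batch, 2)
        d_logits = self.deberta(**deberta_inputs).logits   # (batch, 2)
        features = torch.cat([r_logits, d_logits], dim=-1)  # (batch, 4)
        return self.ensemble(features)                       # final human-vs-AI logits
```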
⏲️ Time-Line : Timeline. All results and notebooks can be accessed there.
🗞️ All details about the final version are here:
Document of the text model
🔗 Links to the notebooks and dataset (final version):
Data was collected from different sources on Kaggle and Hugging Face, and you can access it through this link:
- DAIGT | Catch The AI
- DAIGT | EDA
- DAIGT | BERT
- DAIGT | DistilBERT
- DAIGT | RoBERTa
- DAIGT | DeBERTa
- DAIGT | Model Analysis
- You can use all of them or retrain them. You will find them on my account on HuggingFace: zeyadusf
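As a quick usage sketch (not an official snippet from the project), any of the fine-tuned checkpoints on the Hub can be loaded with the transformers pipeline API. The repo id below is a hypothetical placeholder; replace it with an actual model name listed on the zeyadusf account.

```python
from transformers import pipeline

# Placeholder repo id -- substitute a real checkpoint name from the
# zeyadusf Hugging Face account (e.g. the RoBERTa or DeBERTa DAIGT model).
MODEL_ID = "zeyadusf/<daigt-checkpoint>"

# Binary text classifier: predicts whether a passage is human- or AI-written.
detector = pipeline("text-classification", model=MODEL_ID)

print(detector("Replace this with the passage you want to check."))
```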
We did not just work as a team; we were a family. These people are truly skilled and creative. Follow them and look forward to their wonderful projects, from which they learn a lot and which benefit many people. ❤️
Romani Nasrat Shawqi | Abdalla Mohammed |
Mohannad Ayman | Mohammed Abdeldayem |
Ahmed Abo-Elkassem | Sara Reda |
Reham Mostafa | Rawan Aziz |