Dataset for S2AMP including training/classification and inference data. This is the dataset repo for "S2AMP: A High-Coverage Dataset of Scholarly Mentorship Inferred from Publications".
@inproceedings{10.1145/3529372.3533283,
author = {Rohatgi, Shaurya and Downey, Doug and King, Daniel and Feldman, Sergey},
title = {S2AMP: A High-Coverage Dataset of Scholarly Mentorship Inferred from Publications},
year = {2022},
isbn = {9781450393454},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3529372.3533283},
doi = {10.1145/3529372.3533283},
booktitle = {Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries},
articleno = {44},
numpages = {5},
keywords = {mentorship, academic knowledge graph, relationship mining},
location = {Cologne, Germany},
series = {JCDL '22}
}
To obtain the S2AMP dataset, run the following command (expected download size: ~160 GiB):
aws s3 sync --no-sign-request s3://ai2-s2-research-public/s2amp/ data/
The data is explored in the notebook notebooks/S2AMP_demo.ipynb, included in this repo. To run it, set up the environment first:
conda create -n s2amp python=3.7.10
conda activate s2amp
pip install -r requirements.txt
The notebook will download the required data from the S3 bucket.
s2amp/
├── [2.1G] gold
│   ├── [812M] first_stage_features
│   │   ├── [163M] test.csv
│   │   ├── [485M] train.csv
│   │   └── [164M] val.csv
│   ├── [6.6M] lgb_first.stage.model.pkl
│   ├── [6.3M] lgb_second.stage.model.pkl
│   ├── [ 15M] S2AMP_matched_pairs.csv
│   └── [1.3G] second_stage_features
│       ├── [269M] test.csv
│       ├── [817M] train.csv
│       └── [272M] val.csv
└── [155.4G] inferred
    ├── [1.3G] mentors_s2_fos_scores.csv
    ├── [8.1G] s2amp_predictions_with_names.csv
    ├── [ 52G] first_stage_features
    │   ├── [203M] features.0.csv
    │   ├── [203M] features...csv
    │   └── [144M] features.199.csv
    └── [ 94G] second_stage_features
        ├── [129M] features.0.csv
        ├── [129M] features...csv
        └── [ 71M] features.799.csv

160G used
- Number of mentor-mentee pairs : 137 million
- Number of scholars : 24 million
- Feature count : 65
Mentor-mentee true pairs with S2 ids.
S2AMP_matched_pairs.csv
- mentee_ai2_id
- mentor_ai2_id
- mentor_fname : mentor's first name
- mentor_lname : mentor's last name
- mentee_fname : mentee's first name
- mentee_lname : mentee's last name
- num_papers_cowritten : number of co-authored papers
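As a minimal sketch of reading this file, the snippet below groups mentee ids under each mentor id using only the standard library. It assumes the CSV header uses exactly the column names listed above; the sample rows are made up for illustration and do not come from the dataset.

```python
import csv
import io
from collections import defaultdict


def mentees_by_mentor(csv_file):
    """Group mentee ids under each mentor id from an open CSV file."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row["mentor_ai2_id"]].append(row["mentee_ai2_id"])
    return dict(groups)


# Made-up sample using the documented column names (assumed to be the header).
sample = io.StringIO(
    "mentee_ai2_id,mentor_ai2_id,mentor_fname,mentor_lname,"
    "mentee_fname,mentee_lname,num_papers_cowritten\n"
    "11,1,Ada,Alpha,Bo,Beta,4\n"
    "12,1,Ada,Alpha,Cy,Gamma,2\n"
    "13,2,Di,Delta,Ed,Epsilon,7\n"
)
groups = mentees_by_mentor(sample)
print(groups)
```

With the real data, pass `open("data/gold/S2AMP_matched_pairs.csv", newline="")` instead of the in-memory sample; that path assumes the `aws s3 sync` target `data/` from the download command.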
Train data
- is_mentor : flag for true pair (1) and false pair (0)
- First stage
  - first_stage_features/train.csv
  - first_stage_features/val.csv
  - first_stage_features/test.csv
- Second stage
  - second_stage_features/train.csv
  - second_stage_features/val.csv
  - second_stage_features/test.csv

More details about the features are in README_features.md.
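A quick sanity check on any of these splits is to count how many rows carry each is_mentor label. The sketch below streams the file row by row, so it should also cope with the larger train.csv files. The feature column names in the sample (`feature_a`, `feature_b`) are hypothetical placeholders; the real names are documented in README_features.md.

```python
import csv
import io
from collections import Counter


def label_counts(csv_file, label_column="is_mentor"):
    """Count rows per label value without loading the whole file."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # int(float(...)) accepts both "1" and "1.0" style labels.
        counts[int(float(row[label_column]))] += 1
    return counts


# Made-up rows; feature_a and feature_b are placeholder column names.
sample = io.StringIO(
    "feature_a,feature_b,is_mentor\n"
    "0.2,3,1\n"
    "0.9,1,0\n"
    "0.4,7,0\n"
    "0.1,2,1.0\n"
    "0.5,5,0\n"
)
counts = label_counts(sample)
print(counts)
```

For the real data, open a split such as `data/gold/first_stage_features/train.csv` with `newline=""` and pass the handle to `label_counts`.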
- First stage model : LightGBM model trained on first_stage_features (lgb_first.stage.model.pkl)
- Second stage model : LightGBM model trained on second_stage_features (lgb_second.stage.model.pkl)
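Both models are shipped as pickle files, so loading one is a plain unpickle. The sketch below shows the pattern with a stand-in object so that it runs without the dataset. `load_model` is a hypothetical helper name, not part of this repo. Unpickling the real files requires `lightgbm` to be installed, and pickle files should only be loaded from sources you trust.

```python
import pickle
import tempfile
from pathlib import Path


def load_model(path):
    """Unpickle a model file; the required libraries must be importable."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


# Stand-in demo: pickle a plain dict, since the real model is not bundled here.
with tempfile.TemporaryDirectory() as tmp:
    stand_in_path = Path(tmp) / "stand_in.pkl"
    stand_in_path.write_bytes(pickle.dumps({"stage": "first"}))
    restored = load_model(stand_in_path)

print(type(restored).__name__)
```

With the real data this would be `load_model("data/gold/lgb_first.stage.model.pkl")`. This README does not say whether the pickled object is a raw LightGBM booster or a scikit-learn style wrapper, so inspect `type(model)` before choosing a prediction method.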
README_features.md includes details about all the features extracted for each mentor-mentee pair.
- Mentor-mentee pairs with scores
s2amp_predictions_with_names.csv
- mentee_ai2id
- mentor_ai2id
- pred_prob : mentorship score (scores below 0.1 can be ignored)
- mentee_name
- mentor_name
- Mentors with author details and mentorship scores
mentors_s2_fos_scores.csv
- authors_ai2_id : ai2id of the author
- h_index
- paper_count
- citation_count
- affiliations
- mentorship_score : sum of mentorship scores from mentorship graph
- mentorship_score_mean : mean of mentorship scores
- menteeship_score : sum of menteeship scores from mentorship graph
- menteeship_score_mean : mean of menteeship scores
- mentee_count : count of mentees mentored
- mentor_count : count of mentors of the author
- fos : field of study of the author
- log_mentee_count
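To illustrate how the two files relate, the sketch below streams pair-level predictions, drops pairs with pred_prob below 0.1, and aggregates a per-mentor sum, mean, and mentee count. This only mirrors the column descriptions above. Whether mentors_s2_fos_scores.csv was produced with the same threshold is an assumption, and the sample rows are made up.

```python
import csv
import io
from collections import defaultdict

MIN_SCORE = 0.1  # scores below 0.1 can be ignored, per the column notes


def mentor_totals(csv_file, min_score=MIN_SCORE):
    """Aggregate kept prediction scores per mentor id."""
    totals = defaultdict(lambda: {"score_sum": 0.0, "mentee_count": 0})
    for row in csv.DictReader(csv_file):
        score = float(row["pred_prob"])
        if score < min_score:
            continue  # skip low-confidence pairs
        entry = totals[row["mentor_ai2id"]]
        entry["score_sum"] += score
        entry["mentee_count"] += 1
    return {
        mentor: {**entry, "score_mean": entry["score_sum"] / entry["mentee_count"]}
        for mentor, entry in totals.items()
    }


# Made-up sample using the documented column names (assumed to be the header).
sample = io.StringIO(
    "mentee_ai2id,mentor_ai2id,pred_prob,mentee_name,mentor_name\n"
    "11,1,0.75,Bo Beta,Ada Alpha\n"
    "12,1,0.25,Cy Gamma,Ada Alpha\n"
    "13,1,0.05,Ed Epsilon,Ada Alpha\n"
    "14,2,0.5,Fay Zeta,Di Delta\n"
)
totals = mentor_totals(sample)
print(totals)
```

Because the function reads one row at a time, the same call can be pointed at the 8.1G s2amp_predictions_with_names.csv without loading it into memory; only the per-mentor totals are kept.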
The sharded files under first_stage_features/ (features.0.csv through features.199.csv) and second_stage_features/ (features.0.csv through features.799.csv) contain the features for all mentor-mentee pairs in Semantic Scholar.
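Since these features are split across many numbered files, it helps to walk the shards in numeric order (so features.10.csv comes after features.2.csv) and stream rows one shard at a time. The sketch below assumes each shard carries its own header row. The `pair_id` column in the demo is a hypothetical placeholder, not a real feature name.

```python
import csv
import re
import tempfile
from pathlib import Path

SHARD_PATTERN = re.compile(r"features\.(\d+)\.csv$")


def sorted_shards(directory):
    """Return features.N.csv paths ordered by their integer N."""
    shards = []
    for path in Path(directory).glob("features.*.csv"):
        match = SHARD_PATTERN.search(path.name)
        if match:
            shards.append((int(match.group(1)), path))
    return [path for _, path in sorted(shards)]


def iter_rows(directory):
    """Yield rows from every shard in order, one shard open at a time."""
    for shard in sorted_shards(directory):
        with shard.open(newline="") as handle:
            yield from csv.DictReader(handle)


# Demo on throwaway shards; pair_id is a placeholder column name.
with tempfile.TemporaryDirectory() as tmp:
    for index in (10, 0, 2):
        Path(tmp, f"features.{index}.csv").write_text(f"pair_id\np{index}\n")
    shard_names = [path.name for path in sorted_shards(tmp)]
    pair_ids = [row["pair_id"] for row in iter_rows(tmp)]

print(shard_names)
print(pair_ids)
```

For the real data, point `iter_rows` at `data/inferred/first_stage_features` or `data/inferred/second_stage_features`, assuming the download target `data/` from the sync command.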