S2AMP data

This is the dataset repo for "S2AMP: A High-Coverage Dataset of Scholarly Mentorship Inferred from Publications". It includes the training/classification data and the inference data.

Citation

@inproceedings{10.1145/3529372.3533283,
author = {Rohatgi, Shaurya and Downey, Doug and King, Daniel and Feldman, Sergey},
title = {S2AMP: A High-Coverage Dataset of Scholarly Mentorship Inferred from Publications},
year = {2022},
isbn = {9781450393454},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3529372.3533283},
doi = {10.1145/3529372.3533283},
booktitle = {Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries},
articleno = {44},
numpages = {5},
keywords = {mentorship, academic knowledge graph, relationship mining},
location = {Cologne, Germany},
series = {JCDL '22}
}

Download Instructions

To obtain the S2AMP dataset, run the following command (expected download size: ~160 GiB):

aws s3 sync --no-sign-request s3://ai2-s2-research-public/s2amp/ data/
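The full sync transfers about 160 GiB. When only part of the data is needed, the same command can in principle be pointed at a sub-prefix. The sketch below builds that command in Python. It assumes the bucket's sub-prefixes mirror the directory tree listed under "Details about the dataset" (for example `gold/`), which this README does not confirm; `build_sync_command` and `download_subset` are hypothetical helper names.

```python
import subprocess

BUCKET_ROOT = "s3://ai2-s2-research-public/s2amp/"  # taken from the command above


def build_sync_command(subdir="", dest="data/"):
    """Return the `aws s3 sync` argument list for the whole dataset or one sub-prefix.

    Assumption: sub-prefixes such as "gold/" exist in the bucket under the
    same names as in the directory tree shown in this README.
    """
    subdir = subdir.strip("/")
    suffix = subdir + "/" if subdir else ""
    source = BUCKET_ROOT + suffix
    target = dest.rstrip("/") + "/" + suffix
    return ["aws", "s3", "sync", "--no-sign-request", source, target]


def download_subset(subdir="", dest="data/"):
    """Run the sync; requires the AWS CLI to be installed and on PATH."""
    subprocess.run(build_sync_command(subdir, dest), check=True)
```

Calling `download_subset("gold")` would then fetch only the gold data into `data/gold/`, provided the assumption about the bucket layout holds.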

The data is explored in the notebook notebooks/S2AMP_demo.ipynb included in this repo.

Running the notebook

conda create -n s2amp python=3.7.10
conda activate s2amp
pip install -r requirements.txt

The notebook downloads the data it needs from the S3 bucket.

Details about the dataset

s2amp/
├── [2.1G]  gold
│   ├── [812M]  first_stage_features
│   │   ├── [163M]  test.csv
│   │   ├── [485M]  train.csv
│   │   └── [164M]  val.csv
│   ├── [6.6M]  lgb_first.stage.model.pkl
│   ├── [6.3M]  lgb_second.stage.model.pkl
│   ├── [ 15M]  S2AMP_matched_pairs.csv
│   └── [1.3G]  second_stage_features
│       ├── [269M]  test.csv
│       ├── [817M]  train.csv
│       └── [272M]  val.csv
└── [155.4G]  inferred
    ├── [1.3G]  mentors_s2_fos_scores.csv
    ├── [8.1G]  s2amp_predictions_with_names.csv
    ├── [ 52G]  first_stage_features
    │   ├── [203M]  features.0.csv
    │   ├── [203M]  features...csv
    │   └── [144M]  features.199.csv
    └── [ 94G]  second_stage_features
        ├── [129M]  features.0.csv
        ├── [129M]  features...csv
        └── [ 71M]  features.799.csv
              
  160G used

Quick stats about the inferred S2 features data:

Number of mentor-mentee pairs : 137 million
Number of scholars : 24 million
Feature count : 65

S2AMP Gold

True mentor-mentee pairs with S2 IDs.
- S2AMP_matched_pairs.csv
  - mentee_ai2_id
  - mentor_ai2_id
  - mentor_fname : mentor's first name
  - mentor_lname : mentor's last name
  - mentee_fname : mentee's first name
  - mentee_lname : mentee's last name
  - num_papers_cowritten : number of co-authored papers
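
As an illustration of the gold pairs file, here is a minimal sketch that counts distinct gold mentees per mentor using only the standard library. It assumes the CSV has a header row with the column names listed above; `mentees_per_mentor` is a hypothetical helper name.

```python
import csv
from collections import defaultdict


def mentees_per_mentor(csv_file):
    """Map each mentor_ai2_id to its number of distinct gold mentees.

    `csv_file` is any iterable of text lines, e.g. an open file handle.
    Assumption: the file has a header row with the documented column names.
    """
    mentees = defaultdict(set)
    for row in csv.DictReader(csv_file):
        mentees[row["mentor_ai2_id"]].add(row["mentee_ai2_id"])
    return {mentor: len(ids) for mentor, ids in mentees.items()}
```

With the data synced to `data/`, the file would be opened as `open("data/gold/S2AMP_matched_pairs.csv", newline="", encoding="utf-8")` and passed to the function. IDs are kept as strings, exactly as they appear in the file.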

Train data
- is_mentor : label flag, 1 for a true pair and 0 for a false pair
- First Stage
  - first_stage_features/train.csv
  - first_stage_features/val.csv
  - first_stage_features/test.csv
- Second Stage
  - second_stage_features/train.csv
  - second_stage_features/val.csv
  - second_stage_features/test.csv
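
Before training on these splits it can help to check how balanced the `is_mentor` label is. The sketch below streams a split and counts each label value. It assumes every split has a header row that includes `is_mentor`; the feature columns are not touched. `label_balance` is a hypothetical helper name.

```python
import csv
from collections import Counter


def label_balance(csv_file, label_column="is_mentor"):
    """Count rows per label value (1 = true pair, 0 = false pair).

    `csv_file` is any iterable of text lines, e.g. an open file handle.
    Labels are parsed through float() so that "1" and "1.0" both count as 1.
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts[int(float(row[label_column]))] += 1
    return counts
```

Streaming row by row keeps memory use flat, which matters for the larger splits such as the 817M second-stage train.csv.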

README_features.md describes every feature extracted for each mentor-mentee pair.

First stage model : LightGBM model trained on first_stage_features
- lgb_first.stage.model.pkl
Second stage model : LightGBM model trained on second_stage_features
- lgb_second.stage.model.pkl
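
A minimal sketch of loading one of the pickled models and scoring feature rows. This README does not say whether the pickles hold a scikit-learn style `LGBMClassifier` or a raw `lightgbm.Booster`, so the code handles both and that choice is an assumption; `load_model` and `mentorship_scores` are hypothetical helper names. Unpickling requires `lightgbm` to be installed, and pickle files should only be loaded from a trusted source.

```python
import pickle


def load_model(path):
    """Unpickle a model file. Only unpickle files from a source you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


def mentorship_scores(model, feature_rows):
    """Return one positive-class score per feature row.

    Assumption: the model is either a classifier exposing predict_proba
    (scikit-learn style) or an object whose predict() already returns
    probabilities, as a binary-objective lightgbm.Booster does.
    """
    if hasattr(model, "predict_proba"):
        # Column 1 holds the probability of the positive class.
        return [float(p[1]) for p in model.predict_proba(feature_rows)]
    return [float(p) for p in model.predict(feature_rows)]
```

The feature rows must use the same columns, in the same order, as the matching training file; this sketch does not check that.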

S2AMP Inferred

Mentor-mentee pairs with scores
- s2amp_predictions_with_names.csv
  - mentee_ai2id
  - mentor_ai2id
  - pred_prob : predicted mentorship score (scores below 0.1 can be ignored)
  - mentee_name
  - mentor_name
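
Because scores below 0.1 can be ignored, a first pass over the predictions file is usually a threshold filter. The sketch below streams the CSV instead of loading all 8.1G at once. It assumes a header row with the column names listed above; `confident_pairs` is a hypothetical helper name.

```python
import csv


def confident_pairs(csv_file, threshold=0.1):
    """Yield prediction rows whose pred_prob is at least `threshold`.

    `csv_file` is any iterable of text lines, e.g. an open file handle.
    Rows are yielded one at a time as dicts keyed by column name.
    """
    for row in csv.DictReader(csv_file):
        if float(row["pred_prob"]) >= threshold:
            yield row
```

Raising `threshold` trades coverage for precision; the default mirrors the 0.1 cutoff mentioned above.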
Mentors with author details and mentorship scores
- mentors_s2_fos_scores.csv
  - authors_ai2_id : ai2id of the author
  - h_index
  - paper_count
  - citation_count
  - affiliations
  - mentorship_score : sum of mentorship scores from mentorship graph
  - mentorship_score_mean : mean of mentorship scores
  - menteeship_score : sum of menteeship scores from mentorship graph
  - menteeship_score_mean : mean of menteeship scores
  - mentee_count : count of mentees mentored
  - mentor_count : count of mentors of the author
  - fos : field of study of the author
  - log_mentee_count
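
One way to use the mentor table is to rank authors by their summed mentorship score. The sketch below returns the top `k` rows and skips rows with missing values. It assumes a header row with the column names listed above; `top_mentors` and the `min_mentees` filter are illustrative choices, not part of the dataset.

```python
import csv
import heapq


def top_mentors(csv_file, k=10, min_mentees=1):
    """Return the k rows with the highest mentorship_score.

    `csv_file` is any iterable of text lines, e.g. an open file handle.
    Rows with an empty score or mentee count, or with fewer than
    `min_mentees` mentees, are skipped.
    """
    def usable(row):
        # Skip rows with missing values instead of failing on float("").
        if not row["mentorship_score"] or not row["mentee_count"]:
            return False
        return float(row["mentee_count"]) >= min_mentees

    rows = (row for row in csv.DictReader(csv_file) if usable(row))
    return heapq.nlargest(k, rows, key=lambda row: float(row["mentorship_score"]))
```

`heapq.nlargest` keeps only `k` rows in memory at a time, so the 1.3G file never has to be loaded in full.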

The shards first_stage_features/features.0.csv through features.199.csv and second_stage_features/features.0.csv through features.799.csv together hold the features for all mentor-mentee pairs in Semantic Scholar.
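
The directory tree above shows the inferred features split into numbered shards. A minimal sketch for listing the shards of one directory in numeric order, so that `features.2.csv` comes before `features.10.csv`, follows. It assumes every shard is named `features.<index>.csv`, as the listed first and last files suggest; `list_shards` is a hypothetical helper name.

```python
import re
from pathlib import Path

# Assumed shard naming: features.<non-negative integer>.csv
_SHARD_NAME = re.compile(r"^features\.(\d+)\.csv$")


def list_shards(directory):
    """Return shard paths in `directory`, sorted by their numeric index."""
    shards = []
    for path in Path(directory).iterdir():
        match = _SHARD_NAME.match(path.name)
        if match:
            shards.append((int(match.group(1)), path))
    return [path for _, path in sorted(shards)]
```

Processing one shard at a time keeps memory bounded, since the two feature directories total 52G and 94G.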
