S2AMP data

Dataset for S2AMP including training/classification and inference data. This is the dataset repo for "S2AMP: A High-Coverage Dataset of Scholarly Mentorship Inferred from Publications".

Citation

@inproceedings{10.1145/3529372.3533283,
author = {Rohatgi, Shaurya and Downey, Doug and King, Daniel and Feldman, Sergey},
title = {S2AMP: A High-Coverage Dataset of Scholarly Mentorship Inferred from Publications},
year = {2022},
isbn = {9781450393454},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3529372.3533283},
doi = {10.1145/3529372.3533283},
booktitle = {Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries},
articleno = {44},
numpages = {5},
keywords = {mentorship, academic knowledge graph, relationship mining},
location = {Cologne, Germany},
series = {JCDL '22}
}

Download Instructions

To obtain the S2AMP dataset, run the following command (expected download size: ~160 GiB):

aws s3 sync --no-sign-request s3://ai2-s2-research-public/s2amp/ data/

The notebook notebooks/S2AMP_demo.ipynb, included in this repo, explores the data.

Running the notebook

conda create -n s2amp python=3.7.10
conda activate s2amp
pip install -r requirements.txt

The notebook will download required data from the s3 bucket.

Details about the dataset

s2amp/
├── [2.1G]  gold
│   ├── [812M]  first_stage_features
│   │   ├── [163M]  test.csv
│   │   ├── [485M]  train.csv
│   │   └── [164M]  val.csv
│   ├── [6.6M]  lgb_first.stage.model.pkl
│   ├── [6.3M]  lgb_second.stage.model.pkl
│   ├── [ 15M]  S2AMP_matched_pairs.csv
│   └── [1.3G]  second_stage_features
│       ├── [269M]  test.csv
│       ├── [817M]  train.csv
│       └── [272M]  val.csv
└── [155.4G]  inferred
    ├── [1.3G]  mentors_s2_fos_scores.csv
    ├── [8.1G]  s2amp_predictions_with_names.csv
    ├── [ 52G]  first_stage_features
    │   ├── [203M]  features.0.csv
    │   ├── [203M]  features...csv
    │   └── [144M]  features.199.csv
    └── [ 94G]  second_stage_features
        ├── [129M]  features.0.csv
        ├── [129M]  features...csv
        └── [ 71M]  features.799.csv
              
  160G used

Quick stats about the inferred S2 features data:

  • Number of mentor-mentee pairs : 137 million
  • Number of scholars : 24 million
  • Feature count : 65

S2AMP Gold

  • True mentor-mentee pairs with S2 IDs.

    • S2AMP_matched_pairs.csv
      • mentee_ai2_id
      • mentor_ai2_id
      • mentor_fname : mentor's first name
      • mentor_lname : mentor's last name
      • mentee_fname : mentee's first name
      • mentee_lname : mentee's last name
      • num_papers_cowritten : number of co-authored papers
  • Train data

    • is_mentor : label flag, 1 for a true pair and 0 for a false pair

    • First Stage

      • first_stage_features/train.csv
      • first_stage_features/val.csv
      • first_stage_features/test.csv
    • Second Stage

      • second_stage_features/train.csv
      • second_stage_features/val.csv
      • second_stage_features/test.csv

More details about the features are in README_features.md

  • First stage model : LightGBM model trained on first_stage_features
    • lgb_first.stage.model.pkl
  • Second stage model : LightGBM model trained on second_stage_features
    • lgb_second.stage.model.pkl

README_features.md includes details about all the features extracted for each mentor-mentee pair.
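
As a quick sanity check on the gold splits, the sketch below counts how many rows carry each value of the is_mentor label. It is a minimal example using only the standard library; the function name label_counts and the local path in the usage comment are assumptions for illustration, and the label values are kept as the raw strings found in the file.

```python
import csv
from collections import Counter


def label_counts(csv_path, label_column="is_mentor"):
    """Count how many rows carry each raw value of the label column."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # DictReader uses the first row of the file as the column names.
        for row in csv.DictReader(handle):
            counts[row[label_column]] += 1
    return counts


# Example usage (path assumes the download step synced into data/):
# print(label_counts("data/gold/first_stage_features/train.csv"))
```

Reading row by row keeps memory use flat, which matters for the larger second-stage files.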

S2AMP Inferred

  • Mentor-mentee pairs with scores
    • s2amp_predictions_with_names.csv
      • mentee_ai2id
      • mentor_ai2id
      • pred_prob : predicted mentorship score (scores below 0.1 can be ignored)
      • mentee_name
      • mentor_name
  • Mentors with author details and mentorship scores
    • mentors_s2_fos_scores.csv
      • authors_ai2_id : ai2id of the author
      • h_index
      • paper_count
      • citation_count
      • affiliations
      • mentorship_score : sum of mentorship scores from mentorship graph
      • mentorship_score_mean : mean of mentorship scores
      • menteeship_score : sum of menteeship scores from mentorship graph
      • menteeship_score_mean : mean of menteeship scores
      • mentee_count : count of mentees mentored
      • mentor_count : count of mentors of the author
      • fos : field of study of the author
      • log_mentee_count
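
Since pairs with a pred_prob below 0.1 can be ignored, one way to work with s2amp_predictions_with_names.csv is to stream it and keep only the rows at or above that threshold. The sketch below assumes the column names listed above appear in the file's header row; the function name iter_confident_pairs and the path in the usage comment are hypothetical.

```python
import csv


def iter_confident_pairs(csv_path, min_score=0.1):
    """Yield (mentee_id, mentor_id, score) for rows with pred_prob >= min_score."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            score = float(row["pred_prob"])
            if score >= min_score:
                yield row["mentee_ai2id"], row["mentor_ai2id"], score


# Example usage (path assumes the download step synced into data/):
# for mentee, mentor, score in iter_confident_pairs(
#         "data/inferred/s2amp_predictions_with_names.csv"):
#     ...
```

The generator never holds the whole 8.1G file in memory, so it can feed a filtered copy or a graph builder directly.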

The shards under inferred/first_stage_features (features.0.csv through features.199.csv) and inferred/second_stage_features (features.0.csv through features.799.csv) together hold the features for all mentor-mentee pairs in Semantic Scholar.
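
Because the inferred features are split across many numbered shard files, a small helper that discovers the shards on disk and reads them in numeric order can be convenient. The sketch below is one possible approach, not part of the repo: the function names are hypothetical, and it assumes every shard starts with its own header row.

```python
import csv
import re
from pathlib import Path

# Matches shard names such as features.0.csv or features.199.csv.
SHARD_PATTERN = re.compile(r"^features\.(\d+)\.csv$")


def list_feature_shards(stage_dir):
    """Return the shard paths in stage_dir, sorted by their numeric index."""
    indexed = []
    for path in Path(stage_dir).iterdir():
        match = SHARD_PATTERN.match(path.name)
        if match:
            indexed.append((int(match.group(1)), path))
    return [path for _, path in sorted(indexed)]


def iter_feature_rows(stage_dir):
    """Yield one dict per feature row, shard by shard.

    Assumption: each shard carries its own header row.
    """
    for shard in list_feature_shards(stage_dir):
        with shard.open(newline="", encoding="utf-8") as handle:
            yield from csv.DictReader(handle)


# Example usage (path assumes the download step synced into data/):
# for row in iter_feature_rows("data/inferred/first_stage_features"):
#     ...
```

Sorting on the parsed integer avoids the lexicographic order in which features.10.csv would come before features.2.csv.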