Merge pull request #3 from adamoyoung/master
update to version 0.4.0
adamoyoung authored Jan 24, 2024
2 parents 2b13d60 + 7c8f6a8 commit 2d3b40c
Showing 29 changed files with 3,951 additions and 1,177 deletions.
26 changes: 15 additions & 11 deletions README.md
@@ -126,9 +126,9 @@ mol: 964, 138, 274, 1376
> test, mol_loss_obj_mean = 0.4836
```

-`mol_loss_obj_mean` is the loss averaged over molecules (instead of individual spectra) on a heldout portion of the MoNA dataset. See the [config](config/demo/demo_eval.yml), the [loss definitions](src/massformer/losses.py), and the [runner script](src/massformer/runner.py) for more detailed information about metrics.
+`mol_loss_obj_mean` is the loss averaged over molecules (instead of individual spectra) on a heldout portion of the MoNA dataset. See the [config](config/demo/demo_eval.yml), the [loss definitions](src/massformer/losses.py), and the [runner file](src/massformer/runner.py) for more detailed information about metrics.

-*Note: none of the steps below are required to run the demo, but are helpful for reproducing results from the paper*
+*Note: none of the steps below are required to run the demo, but are helpful for reproducing results from the paper.*

## Downloading the Raw Spectrum Data

@@ -139,13 +139,13 @@ bash download_scripts/download_mona_raw.sh
bash download_script/download_casmi_raw.sh
```

-The [NIST 2020 LC-MS/MS library](https://www.nist.gov/programs-projects/nist20-updates-nist-tandem-and-electron-ionization-spectral-libraries) is not available for download directly, but can be purchased from an authorized distributor and exported using the instructions below.
+The [NIST 2020 MS/MS Library](https://www.nist.gov/programs-projects/nist20-updates-nist-tandem-and-electron-ionization-spectral-libraries) is not available for download directly, but can be purchased from an authorized distributor and exported using the instructions below.

## Exporting the NIST Data

-*Note: this step requires a Windows System or Virtual Machine*
+*Note: this step requires a Windows System or Virtual Machine.*

-*Note: these instructions are for NIST 2020, the NIST 2023 LC-MS/MS library does not support plain text export with lib2nist*
+*Note: these instructions are for NIST 2020, the NIST 2023 MS/MS Library does not support export with lib2nist.*

The spectra and associated compounds can be exported to MSP/MOL format using the free [lib2nist software](https://chemdata.nist.gov/mass-spc/ms-search/Library_conversion_tool.html). The resulting export will contain a single MSP file with all of the mass spectra, and multiple MOL files which include the molecular structure information (linked to the spectra by ID). The screenshot below indicates appropriate lib2nist export settings.

@@ -250,10 +250,10 @@ The model configs for the experiments are stored in the [config](config/) direct

Configurations exist of MassFormer (MF) and the baseline methods (FP, WLN, and CFM).

-To train and evluate a model, simply choose a configuration and pass it to the runner.py script. For example, to run the [mona_scaffold_all_MF](config/all_prec_type/mona_scaffold_MF.yml) experiment (train MassFormer on NIST data, and evaluate on MoNA using a scaffold split), you can use the following command:
+To train and evaluate a model, simply choose a configuration and pass it to the [run_train_eval.py](scripts/run_train_eval.py) script. For example, to run the [mona_scaffold_all_MF](config/all_prec_type/mona_scaffold_MF.yml) experiment (train MassFormer on NIST data, and evaluate on MoNA using a scaffold split), you can use the following command:

```
-python src/massformer/runner.py -c config/all_prec_type/mona_scaffold_all_MF.yml -w online
+python scripts/run_train_eval.py -c config/all_prec_type/mona_scaffold_all_MF.yml -w online
```

The `-w` argument controls the wandb logging (online, offline, or off). Note that training a model without a GPU will be very time-consuming.
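
A command-line interface with the two flags described above could be declared roughly as follows. This is a hypothetical sketch using only the standard library's `argparse`; the function name `build_parser`, the long option names, and the default value are assumptions for illustration and are not taken from `scripts/run_train_eval.py`.

```python
import argparse


def build_parser():
    # Hypothetical parser mirroring the two flags mentioned in the README.
    parser = argparse.ArgumentParser(description="Train and evaluate a model")
    # -c: path to the YAML experiment config.
    parser.add_argument("-c", "--config", required=True,
                        help="path to a YAML experiment config")
    # -w: wandb logging mode, restricted to the three documented values.
    parser.add_argument("-w", "--wandb-mode", default="off",
                        choices=["online", "offline", "off"],
                        help="wandb logging mode")
    return parser


args = build_parser().parse_args(
    ["-c", "config/all_prec_type/mona_scaffold_all_MF.yml", "-w", "offline"]
)
print(args.config, args.wandb_mode)
```

Restricting `-w` with `choices` makes a mistyped mode fail at parse time rather than partway through a long training run.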
@@ -324,7 +324,7 @@ run:
do_casmi22: False
save_state: True
save_media: True
-log_auxiliary: False
+log_auxiliary: True
train_seed: 5585
split_seed: 420
split_key: "scaffold"
@@ -349,7 +349,7 @@ The [inference script](scripts/run_inference.py) allows a pretrained model to ma
python scripts/run_inference.py -c config/demo/demo_eval.yml -s predictions/example_smiles.csv -o predictions/example_predictions.csv -d 0
```

-*Note: if you are using the MF-CPU environment, replace the `-d 0` argument with `-d -1`*
+*Note: if you are using the MF-CPU environment, replace the `-d 0` argument with `-d -1`.*

The smiles file (`-s` argument, see [this file](predictions/example_smiles.csv) for an example) is a csv file with two columns: the first column is a molecule id, and the second column is the SMILES string.
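
A two-column file of this shape can be produced with the standard `csv` module. The sketch below is only illustrative: the molecules are made up, and both the presence of a header row and the column names `mol_id` and `smiles` are assumptions, so check them against `predictions/example_smiles.csv` before relying on this layout.

```python
import csv
import io

# Hypothetical input: (molecule id, SMILES string) pairs.
molecules = [(0, "CCO"), (1, "c1ccccc1")]


def write_smiles_csv(rows, handle):
    """Write an id column followed by a SMILES column."""
    writer = csv.writer(handle)
    writer.writerow(["mol_id", "smiles"])  # assumed header names
    writer.writerows(rows)


# An in-memory buffer stands in for a real file opened with newline="".
buffer = io.StringIO()
write_smiles_csv(molecules, buffer)
print(buffer.getvalue())
```

Using `csv.writer` rather than manual string joins keeps quoting correct if an id or SMILES string ever contains a comma.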

@@ -359,10 +359,14 @@ The precursor adducts and normalized collision energies can be controlled via co

## Checkpoints for Models from the Manuscript

-All models in the manuscript (except CFM) are trained on NIST 2020 LC-MS/MS library. NIST does not support redistribution of parameters for models trained on this library. As such, we cannot provide model checkpoints.
+All models in the manuscript (except CFM) are trained on NIST 2020 MS/MS Library. NIST does not support redistribution of parameters for models trained on this library. As such, we cannot provide model checkpoints.

However, it should be possible to reproduce our models by following the instructions to [export](#exporting-the-nist-data) and [preprocess](#preprocessing-the-spectrum-datasets) the data, and training a model using the [training script](scripts/run_train_eval.py) with the appropriate config.

If you are having trouble, feel free to create a GitHub issue or send an email to ayoung [AT] cs [DOT] toronto [DOT] edu.

-*Note: there are not any CFM checkpoints, since we use pre-computed CFM predictions (see [previous section](#downloading-the-cfm-predictions)) in our experiments. If you want to use a pretrained CFM model on your own data, visit [their website](https://cfmid.wishartlab.com/)*
+*Note: there are not any CFM checkpoints, since we use pre-computed CFM predictions (see [previous section](#downloading-the-cfm-predictions)) in our experiments. If you want to use a pretrained CFM model on your own data, visit [their website](https://cfmid.wishartlab.com/)*.
+
+## Training a Model on All of the Data
+
+It may be useful to get a version of MassFormer that is trained on all available data (i.e. both NIST 2020 and MoNA spectra). We have provided config files for such models in the [train_both](config/train_both) subdirectory. There are four different configs: two for training with all supported precursor adducts ([1.0 Da bin size](config/train_both/train_both_all_MF.yml), [0.1 Da bin size](config/train_both/train_both_all_MF_hr.yml)) and two for training with \[M+H\]+ precursor adducts ([1.0 Da bin size](config/train_both/train_both_mh_MF.yml), [0.1 Da bin size](config/train_both/train_both_mh_MF_hr.yml)).
1 change: 0 additions & 1 deletion config/ablations/ablations_large.yml
@@ -1,4 +1,3 @@
-
run_name:
data:
num_entries: -1
1 change: 0 additions & 1 deletion config/ablations/ablations_large_pt.yml
@@ -1,4 +1,3 @@
-
run_name:
data:
num_entries: -1
1 change: 0 additions & 1 deletion config/ablations/ablations_large_pt_ln.yml
@@ -1,4 +1,3 @@
-
run_name:
data:
num_entries: -1
1 change: 0 additions & 1 deletion config/ablations/ablations_large_pt_tl.yml
@@ -1,4 +1,3 @@
-
run_name:
data:
num_entries: -1
1 change: 0 additions & 1 deletion config/ablations/ablations_large_pt_tl_ln.yml
@@ -1,4 +1,3 @@
-
run_name:
data:
num_entries: -1
1 change: 0 additions & 1 deletion config/ablations/ablations_small.yml
@@ -1,4 +1,3 @@
-
run_name:
data:
num_entries: -1
59 changes: 0 additions & 59 deletions config/all_train.yml

This file was deleted.

20 changes: 10 additions & 10 deletions predictions/example_predictions.csv
@@ -1,11 +1,11 @@
spec_id,mol_id,group_id,prec_type,prec_mz,nce,inst_type,frag_mode,spec_type,ion_mode,peaks
0,0,0,[M+H]+,495.1138657639999,20.0,FT,HCD,MS2,P,"[(203.0, 0.002313667), (273.0, 0.002760048), (287.0, 0.008741894), (315.0, 0.1204835), (327.0, 0.06440582), (421.0, 0.0017913928), (477.0, 0.0019702096), (495.0, 0.7975335)]"
1,0,0,[M+H]+,495.1138657639999,40.0,FT,HCD,MS2,P,"[(203.0, 0.0020261826), (241.0, 0.025527652), (253.0, 0.0002888203), (273.0, 0.0013525537), (287.0, 0.103123665), (310.0, 0.00015975084), (315.0, 0.37091643), (327.0, 0.4273775), (421.0, 0.00042286966), (427.0, 6.215199e-05), (477.0, 0.0014547931), (495.0, 0.067287534)]"
2,0,0,[M+H]+,495.1138657639999,60.0,FT,HCD,MS2,P,"[(69.0, 2.4936304e-05), (119.0, 0.0015158347), (137.0, 0.00019018288), (149.0, 0.00072512997), (165.0, 0.00033318455), (203.0, 2.2514763e-05), (241.0, 0.8219682), (253.0, 0.0027942518), (287.0, 0.0670574), (310.0, 0.0026385773), (315.0, 0.020984074), (327.0, 0.081094824), (423.0, 0.00060890603), (427.0, 4.1945103e-05)]"
3,0,0,[M+H]+,495.1138657639999,80.0,FT,HCD,MS2,P,"[(94.0, 5.8393285e-05), (103.0, 0.0001759864), (119.0, 0.0019250469), (145.0, 0.00021896785), (149.0, 0.0021489789), (165.0, 0.0070218206), (241.0, 0.95762587), (253.0, 0.004113863), (281.0, 0.0004195166), (287.0, 0.011860776), (310.0, 0.0020834995), (315.0, 0.0012405076), (327.0, 0.011106766)]"
4,0,0,[M+H]+,495.1138657639999,100.0,FT,HCD,MS2,P,"[(65.0, 0.0015538434), (77.0, 0.00013693911), (89.0, 0.00017702878), (91.0, 0.003677364), (94.0, 0.012639048), (103.0, 0.00088350615), (105.0, 4.9430153e-05), (119.0, 0.0017848618), (131.0, 0.00046574083), (145.0, 0.004003338), (149.0, 0.00533638), (165.0, 0.05408701), (202.0, 0.0005469444), (241.0, 0.9063027), (253.0, 0.004608855), (281.0, 9.752709e-05), (287.0, 0.0015124343), (310.0, 0.0008274026), (327.0, 0.0013095455)]"
6,2,1,[M+H]+,281.14206597200007,20.0,FT,HCD,MS2,P,"[(82.0, 9.8669036e-05), (154.0, 0.00090298394), (194.0, 1.7881455e-05), (195.0, 0.00030222317), (281.0, 0.99867827)]"
7,2,1,[M+H]+,281.14206597200007,40.0,FT,HCD,MS2,P,"[(58.0, 0.0034104374), (69.0, 0.0019534626), (71.0, 0.0014802404), (82.0, 0.045510236), (95.0, 0.0003118592), (107.0, 0.0072222934), (110.0, 4.184326e-05), (125.0, 0.00028859248), (126.0, 0.0058972333), (154.0, 0.5900758), (168.0, 0.00057692145), (194.0, 0.003579222), (195.0, 0.0017105744), (196.0, 0.0061774957), (212.0, 0.0003587251), (279.0, 0.0006355032), (281.0, 0.3307695)]"
8,2,1,[M+H]+,281.14206597200007,60.0,FT,HCD,MS2,P,"[(56.0, 0.000571173), (58.0, 0.01858286), (69.0, 0.012174619), (71.0, 0.002456917), (82.0, 0.05944455), (91.0, 0.0023840687), (95.0, 0.004808746), (98.0, 3.9496335e-05), (107.0, 0.013895194), (110.0, 0.0021513093), (112.0, 0.0022901376), (114.0, 0.00014672062), (125.0, 0.007438548), (126.0, 0.01628112), (138.0, 0.0008138842), (154.0, 0.83046794), (168.0, 0.0065943873), (172.0, 0.0009739474), (179.0, 0.000264092), (194.0, 0.0043559736), (195.0, 5.2802425e-05), (196.0, 0.013811533)]"
9,2,1,[M+H]+,281.14206597200007,80.0,FT,HCD,MS2,P,"[(56.0, 0.0020196445), (58.0, 0.048166003), (69.0, 0.035398502), (70.0, 0.0064075175), (71.0, 0.0028679024), (82.0, 0.08426448), (90.0, 0.00013975982), (91.0, 0.00628303), (95.0, 0.016496617), (98.0, 0.007761548), (107.0, 0.011278744), (110.0, 0.0032603815), (112.0, 0.0064628953), (114.0, 0.0013717974), (122.0, 0.0014642046), (123.0, 0.00043327318), (125.0, 0.025516672), (126.0, 0.024463804), (138.0, 0.0014940419), (140.0, 0.00015327956), (154.0, 0.6835278), (166.0, 0.0005148152), (168.0, 0.015206274), (172.0, 0.0014064693), (179.0, 0.0018037586), (194.0, 0.0020256918), (196.0, 0.009811029)]"
10,2,1,[M+H]+,281.14206597200007,100.0,FT,HCD,MS2,P,"[(55.0, 8.587345e-05), (56.0, 0.0034195774), (58.0, 0.05135388), (69.0, 0.14643352), (70.0, 0.06591061), (71.0, 0.0042483946), (81.0, 0.00041718603), (82.0, 0.1493269), (83.0, 0.0005412484), (90.0, 0.0017571332), (91.0, 0.00912976), (93.0, 1.996728e-06), (95.0, 0.049992178), (98.0, 0.07067497), (107.0, 0.006684196), (110.0, 0.0020596164), (112.0, 0.010208446), (114.0, 0.0017411129), (121.0, 0.00028147304), (122.0, 0.001073316), (123.0, 0.0005710973), (125.0, 0.04284465), (126.0, 0.014451538), (138.0, 0.00076676573), (140.0, 0.0006966512), (154.0, 0.33809566), (165.0, 3.405598e-05), (166.0, 0.00043869112), (168.0, 0.020234268), (172.0, 0.00049967965), (179.0, 0.0033012703), (194.0, 0.0003462039), (196.0, 0.0023779853)]"
0,0,0,[M+H]+,495.1138657639999,20.0,FT,HCD,MS2,P,"[(203.0, 0.0023136702), (273.0, 0.0027600487), (287.0, 0.008741906), (315.0, 0.12048378), (327.0, 0.0644058), (421.0, 0.0017914006), (477.0, 0.001970213), (495.0, 0.79753315)]"
1,0,0,[M+H]+,495.1138657639999,40.0,FT,HCD,MS2,P,"[(203.0, 0.0020261896), (241.0, 0.025527699), (253.0, 0.00028882286), (273.0, 0.001352553), (287.0, 0.103123665), (310.0, 0.00015975104), (315.0, 0.3709158), (327.0, 0.42737842), (421.0, 0.00042286539), (427.0, 6.215274e-05), (477.0, 0.001454794), (495.0, 0.06728712)]"
2,0,0,[M+H]+,495.1138657639999,60.0,FT,HCD,MS2,P,"[(69.0, 2.4937186e-05), (119.0, 0.0015158347), (137.0, 0.00019018025), (149.0, 0.00072513754), (165.0, 0.0003331884), (203.0, 2.2516233e-05), (241.0, 0.8219682), (253.0, 0.0027942578), (287.0, 0.06705724), (310.0, 0.0026385689), (315.0, 0.020984096), (327.0, 0.08109496), (423.0, 0.00060890184), (427.0, 4.1945303e-05)]"
3,0,0,[M+H]+,495.1138657639999,80.0,FT,HCD,MS2,P,"[(94.0, 5.8393056e-05), (103.0, 0.00017598867), (119.0, 0.0019250412), (145.0, 0.00021897002), (149.0, 0.0021489812), (165.0, 0.0070218244), (241.0, 0.95762587), (253.0, 0.0041138683), (281.0, 0.00041951673), (287.0, 0.011860752), (310.0, 0.0020834962), (315.0, 0.0012405002), (327.0, 0.011106785)]"
4,0,0,[M+H]+,495.1138657639999,100.0,FT,HCD,MS2,P,"[(65.0, 0.0015538357), (77.0, 0.00013693754), (89.0, 0.00017702857), (91.0, 0.0036773828), (94.0, 0.012639074), (103.0, 0.000883507), (105.0, 4.94285e-05), (119.0, 0.001784856), (131.0, 0.0004657451), (145.0, 0.00400336), (149.0, 0.0053363917), (165.0, 0.054087486), (202.0, 0.00054694316), (241.0, 0.90630215), (253.0, 0.004608861), (281.0, 9.7528005e-05), (287.0, 0.001512429), (310.0, 0.00082739757), (327.0, 0.0013095518)]"
6,2,1,[M+H]+,281.14206597200007,20.0,FT,HCD,MS2,P,"[(82.0, 9.8669865e-05), (154.0, 0.00090298615), (194.0, 1.7882887e-05), (195.0, 0.0003022246), (281.0, 0.99867827)]"
7,2,1,[M+H]+,281.14206597200007,40.0,FT,HCD,MS2,P,"[(58.0, 0.003410421), (69.0, 0.0019534694), (71.0, 0.0014802454), (82.0, 0.045510184), (95.0, 0.00031186), (107.0, 0.0072222594), (110.0, 4.1845993e-05), (125.0, 0.00028859096), (126.0, 0.0058972086), (154.0, 0.5900737), (168.0, 0.00057692884), (194.0, 0.0035792445), (195.0, 0.0017105845), (196.0, 0.0061774994), (212.0, 0.00035872572), (279.0, 0.00063549954), (281.0, 0.33077177)]"
8,2,1,[M+H]+,281.14206597200007,60.0,FT,HCD,MS2,P,"[(56.0, 0.00057117664), (58.0, 0.018582838), (69.0, 0.012174658), (71.0, 0.0024569195), (82.0, 0.05944453), (91.0, 0.00238408), (95.0, 0.0048087426), (98.0, 3.9486917e-05), (107.0, 0.013895153), (110.0, 0.0021513195), (112.0, 0.0022901394), (114.0, 0.00014672345), (125.0, 0.00743856), (126.0, 0.016281147), (138.0, 0.00081389025), (154.0, 0.8304677), (168.0, 0.006594447), (172.0, 0.00097395515), (179.0, 0.0002640964), (194.0, 0.0043559973), (195.0, 5.2808846e-05), (196.0, 0.013811602)]"
9,2,1,[M+H]+,281.14206597200007,80.0,FT,HCD,MS2,P,"[(56.0, 0.002019643), (58.0, 0.048165746), (69.0, 0.035398677), (70.0, 0.0064075775), (71.0, 0.0028679068), (82.0, 0.08426436), (90.0, 0.00013975425), (91.0, 0.0062830322), (95.0, 0.016496567), (98.0, 0.007761437), (107.0, 0.011278713), (110.0, 0.003260405), (112.0, 0.006462905), (114.0, 0.0013717976), (122.0, 0.0014642079), (123.0, 0.0004332722), (125.0, 0.025516616), (126.0, 0.02446379), (138.0, 0.0014940557), (140.0, 0.00015328344), (154.0, 0.68352807), (166.0, 0.0005148166), (168.0, 0.015206392), (172.0, 0.001406477), (179.0, 0.0018037639), (194.0, 0.0020257111), (196.0, 0.009811046)]"
10,2,1,[M+H]+,281.14206597200007,100.0,FT,HCD,MS2,P,"[(55.0, 8.5875545e-05), (56.0, 0.0034195974), (58.0, 0.051353574), (69.0, 0.14643396), (70.0, 0.06591069), (71.0, 0.0042484067), (81.0, 0.00041719206), (82.0, 0.14932694), (83.0, 0.00054125534), (90.0, 0.0017571158), (91.0, 0.009129739), (93.0, 1.9976633e-06), (95.0, 0.049991935), (98.0, 0.070673585), (107.0, 0.006684202), (110.0, 0.0020596257), (112.0, 0.010208496), (114.0, 0.0017411148), (121.0, 0.0002814762), (122.0, 0.0010733175), (123.0, 0.0005711019), (125.0, 0.04284435), (126.0, 0.014451534), (138.0, 0.00076677446), (140.0, 0.0006966561), (154.0, 0.33809704), (165.0, 3.4057048e-05), (166.0, 0.00043869554), (168.0, 0.020234488), (172.0, 0.00049968675), (179.0, 0.0033012885), (194.0, 0.0003462086), (196.0, 0.0023779923)]"
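
Each `spec_id` above appears twice, first with the pre-commit intensities and then with the post-commit ones, and the two versions differ only in trailing digits. A minimal sketch of how the `peaks` column could be parsed and compared is shown below; it assumes each cell is a Python-literal list of `(m/z bin, intensity)` tuples, as the rows suggest, and the helper name `parse_peaks` is hypothetical. The two strings are copied from the rows for `spec_id` 6.

```python
import ast


def parse_peaks(cell):
    """Parse one `peaks` cell into a dict mapping m/z bin -> intensity."""
    # literal_eval only evaluates Python literals, so it is safe on file data.
    return dict(ast.literal_eval(cell))


# `peaks` cells for spec_id 6, before and after the commit.
old = parse_peaks(
    "[(82.0, 9.8669036e-05), (154.0, 0.00090298394), (194.0, 1.7881455e-05), "
    "(195.0, 0.00030222317), (281.0, 0.99867827)]"
)
new = parse_peaks(
    "[(82.0, 9.8669865e-05), (154.0, 0.00090298615), (194.0, 1.7882887e-05), "
    "(195.0, 0.0003022246), (281.0, 0.99867827)]"
)

# Both versions predict the same set of bins; only the intensities moved.
assert old.keys() == new.keys()
max_diff = max(abs(old[mz] - new[mz]) for mz in old)
print(f"largest intensity change: {max_diff:.2e}")
```

For this spectrum the largest change is far below the precision at which the intensities are usually compared, which is consistent with numerical noise rather than a change in model behavior.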
2 changes: 1 addition & 1 deletion preproc_scripts/prepare_casmi22_data.py
@@ -7,7 +7,7 @@
import argparse

from massformer.casmi_utils import common_filter, load_mw_cand, prepare_casmi_mol_df, prepare_casmi_cand_df, prepare_casmi_spec_df, proc_cand_smiles
-from massformer.data_utils import par_apply_series, mol_from_smiles, mol_to_smiles, mol_to_mol_weight, check_mol_props, get_res, H_MASS, O_MASS, NA_MASS, N_MASS, C_MASS
+from massformer.data_utils import par_apply_series, mol_from_smiles, mol_to_smiles, mol_to_mol_weight, get_res, H_MASS, NA_MASS


def calculate_total_spec_ints(peaks):
Expand Down