CSV annotations export
- group, datasetName, datasetId - Metadata about the dataset this annotation belongs to.
- formula - The formula for the base molecule of the annotated ion, e.g. H2O for water.
- adduct - The adduct applied to the formula for this annotation, e.g. M+H for protonation.
- chemMod - The formula for the chemical modification (if any). Normally this will be empty unless chemical modifications were added in the annotation settings.
- ion - A combination of the formula, adduct, chemMod, and charge of this annotation that can be used to uniquely identify this annotation within the dataset.
- mz - The theoretical m/z of the first peak of the ion (including the adduct mass and with an electron mass added/removed). This mass may differ very slightly from the exact theoretical mass, as a centroiding simulation is done on the theoretical peaks, which introduces a small amount of numerical error (generally <0.1ppm).
- fdr - Global False Discovery Rate for this annotation. "Global" here means that, e.g., out of the set of all annotations with fdr <= 0.1, 10% of them are expected to be false discoveries.
- msm, rhoSpatial, rhoSpectral, rhoChaos - Scores used to evaluate the annotation's FDR. See this figure for more details.
- moleculeNames, moleculeIds - Names and database IDs for the putative molecules that match this annotation's formula. See the note below for how to programmatically split these lists if needed.
- minIntensity, maxIntensity - The lowest and highest intensity in the first ion image.
- totalIntensity - Sum of intensities in the first ion image.
- colocalizationCoeff - If a "colocalized with" filter is applied, this column contains the colocalization coefficient to the molecule used for comparison.
- offSample - Result from running the OffsampleAI image classification model to check whether the ion image looks like it is off-sample. true means that it looks off-sample, false means that it looks on-sample.
- rawOffSampleProb - The predicted probability that the image is off-sample. Values higher than 0.5 are considered off-sample.
- isomerIons - A comma-separated list of the ion values of other annotations that are isomeric (i.e. identical isotopic m/zs).
- isobarIons - A comma-separated list of the ion values of other annotations that are isobaric (i.e. different but overlapping isotopic m/zs).
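The columns above are often combined when narrowing down a result list. The following is a minimal sketch, using only the standard library, of keeping annotations at or below an FDR cutoff and dropping those flagged as off-sample. The helper name filter_annotations, the sample rows, and the ion strings are made up for illustration, and the code assumes that offSample is written as the text true or false.

```python
import csv
import io


def filter_annotations(rows, max_fdr=0.1, drop_off_sample=True):
    """Keep rows with fdr <= max_fdr, optionally dropping off-sample ones."""
    kept = []
    for row in rows:
        if float(row["fdr"]) > max_fdr:
            continue
        # Assumption: offSample is stored as the text "true" or "false".
        if drop_off_sample and row.get("offSample", "").strip().lower() == "true":
            continue
        kept.append(row)
    return kept


# Tiny made-up sample for illustration only (not real METASPACE output).
sample = io.StringIO(
    "ion,fdr,offSample\n"
    "H2O+H+,0.05,false\n"
    "C6H12O6+Na+,0.2,false\n"
    "C2H6O+K+,0.1,true\n"
)
rows = list(csv.DictReader(sample))
selected = filter_annotations(rows)
print([row["ion"] for row in selected])
```

Values read through the csv module are strings, which is why fdr is converted with float() before the comparison.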
The CSV files exported by METASPACE contain a timestamp and a link to the source data in the first two lines of the file. This can cause issues with some CSV loaders.
To load a METASPACE CSV export with Pandas, use the skiprows=2 argument:
    import pandas as pd
    annotations = pd.read_csv('metaspace_annotations.csv', skiprows=2)
In plain Python:
    from csv import DictReader
    # Open as UTF-8 and skip the two preamble lines (timestamp and source link)
    with open('metaspace_annotations.csv', newline='', encoding='utf-8') as f:
        next(f)
        next(f)
        annotations = list(DictReader(f))
Some spreadsheet programs occasionally misdetect the character encoding of the CSV files, causing names such as 8-hydroxy-2-phenyl-1λ⁴-chromen-1-ylium to appear mangled, e.g. as "8-hydroxy-2-phenyl-1λâ�´-chromen-1-ylium". To fix this, close the file, reopen it, and select Unicode (UTF-8) or UTF-8 as the character encoding when prompted.
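If reopening the file with an explicit encoding is inconvenient, one possible workaround is to save a copy that starts with a UTF-8 byte order mark (BOM), which some spreadsheet programs, Excel among them, treat as a hint to decode the file as UTF-8. The sketch below assumes the export is UTF-8 without a BOM; the helper name add_utf8_bom and the file names are hypothetical.

```python
import tempfile
from pathlib import Path

UTF8_BOM = b"\xef\xbb\xbf"


def add_utf8_bom(src, dst):
    """Copy a UTF-8 text file to dst, prepending a BOM if it is missing."""
    data = Path(src).read_bytes()
    if not data.startswith(UTF8_BOM):
        data = UTF8_BOM + data
    Path(dst).write_bytes(data)


# Demonstration on a temporary file standing in for a real export.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "metaspace_annotations.csv"
    dst = Path(tmp) / "metaspace_annotations_bom.csv"
    src.write_text(
        "moleculeNames\n8-hydroxy-2-phenyl-1λ⁴-chromen-1-ylium\n",
        encoding="utf-8",
    )
    add_utf8_bom(src, dst)
    has_bom = dst.read_bytes().startswith(UTF8_BOM)
    # "utf-8-sig" strips the BOM again when reading the copy back in Python.
    round_trip = dst.read_text(encoding="utf-8-sig")
```

The copy keeps the two leading timestamp and link lines, so loaders still need to skip them.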
In the Annotations CSV file, each row may specify multiple values in the moleculeNames and moleculeIds columns.
The items in these lists are delimited by ", " (comma then space). If a molecule name naturally contains a comma followed by one or more spaces, as the name of HMDB0032389 does, the spaces are removed so that the list remains unambiguously parseable, e.g. "Methyl acrylate-divinylbenzene, completely hydrolyzed, copolymer" would become "Methyl acrylate-divinylbenzene,completely hydrolyzed,copolymer".
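Because of this rule, splitting on the two-character delimiter is enough to recover the individual items. A minimal sketch follows; the helper name split_molecule_list is hypothetical, and the second name in the sample cell ("Water") is added only for illustration.

```python
def split_molecule_list(cell):
    """Split a moleculeNames or moleculeIds cell on the ', ' delimiter."""
    # An empty cell yields an empty list rather than [''].
    return cell.split(", ") if cell else []


# The first name keeps its internal commas because they have no trailing space.
cell = "Methyl acrylate-divinylbenzene,completely hydrolyzed,copolymer, Water"
names = split_molecule_list(cell)
print(names)
```

With a Pandas DataFrame loaded as shown earlier, annotations['moleculeNames'].str.split(', ') should give the same per-row lists.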
The rawOffSampleProb column contains the unmodified raw output of the off-sample prediction model as a number from 0.0 to 1.0. For most datasets this number is close to the probability that the ion image is off-sample, but it is not guaranteed to be accurate. Predictions are less accurate for datasets that are too dissimilar to those the model was trained on. Furthermore, the model tends to be overconfident toward the extremes of the scale: for example, a prediction with a rawOffSampleProb value of 0.0000001 may still have a 1-10% chance of being off-sample.
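Since the raw value is only an approximate probability, it can be useful to apply a cutoff explicitly instead of relying on the offSample column. The sketch below uses the 0.5 cutoff given in the column description as its default and also shows a stricter, purely illustrative cutoff of 0.9; the helper name is_off_sample and the sample values are hypothetical.

```python
def is_off_sample(raw_prob, threshold=0.5):
    """Return True when the raw model output is above the threshold."""
    # CSV values arrive as text, so convert before comparing.
    return float(raw_prob) > threshold


# Made-up rawOffSampleProb values as they would be read from a CSV file.
probs = ["0.0000001", "0.62", "0.97"]
default_flags = [is_off_sample(p) for p in probs]
strict_flags = [is_off_sample(p, threshold=0.9) for p in probs]
print(default_flags, strict_flags)
```

Changing the threshold only shifts the decision boundary. It does not correct the overconfidence described above.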