Glossary
This glossary explains specific functions and other items of RMassBank.
A compound list in CSV format is required to identify all compounds unambiguously.
The CSV file must contain at least the following columns, which are used for further processing and must be named exactly as shown (in any order): ID, Name, SMILES, RT, CAS. The columns ID and SMILES must be filled; the other columns must be present in the file but may be left empty.
ID specifies an (arbitrary) numeric ID code of at most 4 digits; SMILES specifies a SMILES code with the chemical structure of the compound (used to extract the molecular formula, to calculate molecular masses, for database searching in CTS, etc.).
Although the columns Name, RT and CAS have to be present, their contents are only used if the cells are filled. RT, if present, specifies the retention time (in minutes); CAS and Name are used as additional information while retrieving annotations from CTS.
The compound list doesn't have to be ordered in any particular way. It can contain large numbers of compounds, even compounds which will not be actively used by the script. (Note: unused compounds don't require a SMILES code, since they will not be accessed.)
An example list is provided with the RMassBankData package, and can be copied into a local folder, viewed and edited:
file.copy(system.file("list/NarcoticsDataset.csv",
package="RMassBankData"), "./Compoundlist.csv")
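The column rules above can be checked before a list is handed to RMassBank. The following is a minimal sketch in base R; `validate_compound_list` is a hypothetical helper, not an RMassBank function, and it assumes that a valid ID is a whole number between 0 and 9999. It flags every empty SMILES cell, which only matters for compounds that are actually processed.

```r
# Hypothetical helper (not part of RMassBank): check a compound list
# against the column rules described above.
validate_compound_list <- function(cpds) {
  required <- c("ID", "Name", "SMILES", "RT", "CAS")
  problems <- character(0)

  missing_cols <- setdiff(required, names(cpds))
  if (length(missing_cols) > 0) {
    # Without the required columns the remaining checks cannot run.
    return(paste("missing column:", missing_cols))
  }

  # Assumption: IDs are whole numbers with at most 4 digits.
  ids <- suppressWarnings(as.integer(as.character(cpds$ID)))
  bad_id <- is.na(ids) | ids < 0 | ids > 9999
  if (any(bad_id)) {
    problems <- c(problems, paste("invalid ID in row", which(bad_id)))
  }

  # SMILES must be filled for every compound that will be processed.
  smiles <- as.character(cpds$SMILES)
  no_smiles <- is.na(smiles) | trimws(smiles) == ""
  if (any(no_smiles)) {
    problems <- c(problems, paste("empty SMILES in row", which(no_smiles)))
  }
  problems
}

# Small in-memory example instead of a file on disk.
example_list <- data.frame(
  ID = c(1, 22, 12345),
  Name = c("Caffeine", "", "Unused"),
  SMILES = c("CN1C=NC2=C1C(=O)N(C(=O)N2C)C", "CCO", ""),
  RT = c(5.1, NA, NA),
  CAS = c("58-08-2", "", ""),
  stringsAsFactors = FALSE
)
result <- validate_compound_list(example_list)
print(result)
```

For a list on disk, the same function could be applied to the result of `read.csv("./Compoundlist.csv", stringsAsFactors = FALSE)`.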
A number of different settings influence RMassBank. They are partly parameters for data processing and partly constants used for annotation.
A settings template file, to be edited by hand, can be generated using
RmbSettingsTemplate("mysettings.ini")
where mysettings.ini is the name of the file to be generated. Important settings are:
- deprofile: Whether to use a deprofiling algorithm to work with profile-mode data. Default is NA for use with centroid-mode data. Allowed settings for profile-mode data include deprofile.fwhm (full-width half-maximum algorithm), deprofile.spline (cubic spline algorithm) and deprofile.localmax (local maximum). See the respective help pages for detailed information.
- rtMargin: The deviation allowed for retention times (in minutes) when extracting spectra from raw data files.
- rtShift: The systematic retention time shift (in minutes) in the LC-MS data compared to the values in the compound list.
- babeldir: The directory pointing to the OpenBabel binaries.
- use_version: Which MassBank data format to use. The default is the newer version 2; alternatively, the (deprecated) version 1 can be specified for MassBank servers running old versions of the server software.
- use_rean_peaks: Whether or not peaks from reanalysis should be used (see below for details).
- add_annotation: Whether or not fragments should be annotated with the (tentative) molecular formula in MassBank records.
- annotations: A list of annotation data used in the MassBank records.
  - authors, copyright, publication, license, instrument, instrument_type, compound_class: Values for the corresponding MassBank fields.
  - confidence_comment: A comment field about "compound confidence", which is added to the MassBank record in the form "COMMENT: CONFIDENCE standard compound".
  - internal_id_fieldname: The name of an internal ID field in the MassBank record in which the compound ID (from the compound list) is stored. For internal_id_fieldname = "MY_ID", the ID will be stored like "COMMENT: MY_ID 1234".
  - entry_prefix: The prefix for MassBank accession IDs.
  - ms_type, ionization, lc_*: Annotations for the LC and MS information fields in the MassBank records.
  - ms_dataprocessing: Tags added to describe the data processing. In addition to the tags specified here, MS$DATA_PROCESSING: WHOLE RMassBank will be added (corresponding to a list("WHOLE" = "RMassBank") entry for this option).
- annotator: For advanced users: option to select your own custom annotator. Check ?annotator.default and the source code for details.
- spectraList: The list of data-dependent scans triggered by an MS1 scan, in their order; used for annotation of MassBank records. See the template file for a description.
- accessionBuilderType: A string (either "standard", "simple" or "selfDefined") that determines how MassBank record accession numbers are generated (optional, default: "standard"). RMassBank generates an accession number for each record. The structure and generation of this number vary based on accessionBuilderType.
  - "standard": Accession numbers consisting of an arbitrary number of letters followed by a 6-digit code are generated. The letter code is defined by annotations$entry_prefix; the first four digits are given by the compound ID. The last two digits are generated from the position of the spectrum in spectraList and the shift defined in accessionNumberShifts for the selected ion type. (Example: the compound with ID 2112, processed in "pNa" mode ([M+Na]+), will have accession numbers XX211233, XX211234, etc. for the first, second, ... spectrum in the data-dependent scan if the "pNa" shift is set to 32.)
  - "simple": Accession numbers consisting of an arbitrary number of letters followed by a 6-digit code are generated. The letter code is defined by annotations$entry_prefix; the 6-digit code is generated from the position of the spectrum in spectraList and the shift given in accessionNumberStart. Leading zeros are added if necessary. (Example: accession numbers XX000043, XX000045, ... will be generated for the first, second, ... spectrum in the data-dependent scan if accessionNumberStart is set to 32.)
  - "selfDefined": Accession numbers are generated by a user-defined function given in accessionBuilderFile. In particular, there is no constraint on the prefix, and annotations$entry_prefix will be ignored if this option is chosen. The function definition must be of the form accessionBuilder <- function(cpd, spectrm, subscan). Note: This functionality is quite advanced. If you really want to specify your own accessionBuilder instead of using the "simple" or "standard" option, we highly encourage you to first familiarize yourself with the source code of the function .buildRecord.RmbSpectraSet in buildRecord.R.
- accessionNumberShifts: A list defining the starting points for generating MassBank record accession numbers. This will be used if accessionBuilderType is unspecified or "standard" (see accessionBuilderType above).
- accessionBuilderFile: A file with a user-defined function to generate MassBank record accession numbers. This will be used if accessionBuilderType is "selfDefined" (see accessionBuilderType above).
- accessionNumberStart: An integer < 1000000 defining the starting point of MassBank record accession numbers. This will be used if accessionBuilderType is "simple" (see accessionBuilderType above).
- project: A string giving the project tag (optional). If present, this will be included in the PROJECT field of the record.
- recalibrateBy: Which parameter to use for recalibration: dppm (recalibrate the deviation in ppm) or dmz (recalibrate the m/z deviation).
- recalibrateMS1: Whether to recalibrate MS1 data points separately from MS2 data points ("separate"), with the same recalibration curve as the MS2 data points ("common") or not at all ("none"). Note that the MS1 data points will be used to generate the MS2 recalibration curve in all cases (since this makes the recalibration curve in high-m/z regions better defined) but may be recalibrated independently themselves, if desired.
- recalibrator: Sets the functions to use for recalibration. Defaults to list(MS1="recalibrate.loess", MS2="recalibrate.loess"), which uses a Loess non-parametric fit to generate a recalibration curve. Any custom function may be specified. The function is expected to accept a dataset with variables recalfield and mzFound and to return an object which can be used with predict(). The input recalfield is the value to be estimated by recalibration; it will contain either delta ppm values or absolute mass deviations, depending on the setting for recalibrateBy. In addition to recalibrate.loess, recalibrate.MS1 is predefined, which uses a GAM model for recalibration and appears to work well for pure MS1 data points. However, common recalibration for MS1 and MS2 appears to be the best option in general.
- multiplicityFilter: Defines the multiplicity filtering level. Default is 2; a value of 1 turns filtering off, and values >2 filter more harshly.
- titleFormat: The title of MassBank records is a mini-summary of the record, for example "Dinotefuran; LC-ESI-QFT; MS2; CE: 35%; R=35000; [M+H]+". By default, the first compound name (CH$NAME), instrument type (AC$INSTRUMENT_TYPE), MS/MS type (AC$MASS_SPECTROMETRY: MS_TYPE), collision energy (RECORD_TITLE_CE), resolution (AC$MASS_SPECTROMETRY: RESOLUTION) and precursor (MS$FOCUSED_ION: PRECURSOR_TYPE) are used. If alternative information is relevant to differentiate acquired spectra, the title should be adjusted. For example, many TOFs do not have a resolution setting. See the MassBank documentation for more.
- filterSettings: A list of settings that affect the MS/MS processing.
  - ppmHighMass, ppmLowMass: Values for pre-processing, prior to recalibration. The default settings (e.g. for Orbitrap) are 10 ppm for the high mass range and 15 ppm for the low mass range (the ranges are defined by massRangeDivision).
  - massRangeDivision: The m/z value defining the split between ppmHighMass and ppmLowMass above. The default m/z 120 is recommended for Orbitraps.
  - ppmFine: The ppm cut-off applied after recalibration. The default value of 5 ppm is recommended for Orbitraps.
  - prelimCut, prelimCutRatio: Intensity cut-off and cut-off ratio (in % of the most intense peak) for pre-processing. Affects peak selection for the recalibration only. Careful: the default 1e4 for Orbitrap LTQ positive could remove all peaks for TOF data and will remove too many peaks for Orbitrap LTQ negative mode spectra!
  - specOkLimit: An MS/MS spectrum must have at least one peak above this limit to be processed.
  - dbeMinLimit: The minimum ring and double bond equivalent (DBE) allowed for assigned formulas. Assumes maximum valences for elements with multiple possible valences. Default is -0.5 (accounting for fragment peaks being ions).
  - satelliteMzLimit, satelliteIntLimit: Cut-off m/z and intensity values for satellite peak removal. All peaks within the m/z (default 0.5) and intensity ratio (default 0.05 or 5 %) of the respective peak will be removed. Applicable to Fourier Transform instruments (e.g. Orbitrap).
- findMsMsRawSettings: Parameters for adjusting the raw data retrieval.
  - ppmFine: The ppm error allowed when looking for the precursor in the MS1 (parent) spectrum. Default is 10 ppm for Orbitrap.
  - mzCoarse: The error allowed when searching for the precursor specification in the MS2 spectrum. This value is often saved to only 2 decimal places and is thus inaccurate; it may also depend on the isolation window. The default setting (e.g. for Orbitrap) is m/z = 0.5.
  - fillPrecursorScan: The default value (FALSE) assumes that all necessary precursor information is available in the mzML file. A setting of TRUE tries to fill in the precursor scan number if it is missing. Only tested on one case study so far.
- logging_file: The file that logs should be written to. By default, logging_file is not specified and all logging information is written to STDOUT. Note: This setting causes a static package variable to hold the logging file. The logging functions check this variable rather than the setting. Hence, changing the setting manually afterwards will not change the logging file.
See also the manpage ?RmbSettings for a description of all RMassBank settings.
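As an illustration of the recalibrator contract described above (a function that accepts a dataset with the variables recalfield and mzFound and returns an object usable with predict()), here is a minimal sketch. `recalibrate.linear` is a hypothetical name and a straight-line fit is chosen only for simplicity; the sketch assumes the data arrive as a single data frame argument, and the data shown are synthetic.

```r
# Hypothetical custom recalibrator: a straight-line fit instead of Loess.
# Assumption: the function receives one data frame with the columns
# recalfield and mzFound.
recalibrate.linear <- function(rcdata) {
  # lm() returns a model object that works with predict()
  lm(recalfield ~ mzFound, data = rcdata)
}

# Synthetic data for illustration only: deviation grows linearly with m/z
rcdata <- data.frame(mzFound = c(100, 200, 300, 400))
rcdata$recalfield <- 1 + 0.01 * rcdata$mzFound

model <- recalibrate.linear(rcdata)
predicted <- predict(model, newdata = data.frame(mzFound = 250))
print(unname(predicted))
```

Such a function would presumably be referenced by name in the recalibrator setting, in the same way as recalibrate.loess in the default value; check ?RmbSettings before relying on this.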
Copyright (C) MassBank Consortium 2023
Authors: Michael Stravs, Tobias Schulze