Mathematical Analysis of Mass Spectrometry Data
Comprehensive Analysis of Lipid Changes under Stimulation. The goal of the computational analysis is the construction of an array containing the m/z ratios for peaks observed within a mass spectrometry experiment that displays the comprehensive changes in these species between two experimental conditions (e.g., addition of a ligand at a given concentration) over the defined time course. In order to accomplish this goal, computer algorithms were developed by our group to achieve the following: (i) smooth raw mass spectrometry data to remove extraneous portions; (ii) identify peaks within each data set and normalize their signal intensities; (iii) create baseline profiles of the transformed signal at each peak identified; (iv) statistically compare these baseline conditions with the observed stimulated results; and (v) handle exceptional cases in which assumptions are not validated by the data. All data analysis programs were written in the S-Plus V3.3 for Windows environment. These steps are outlined in Fig. 2.
![]() |
| Fig. 2. Flowchart for the analysis of mass spectrometry data in the creation of lipid arrays. |
The data analysis begins with the conversion of Xcalibur raw files into text for loading into S-Plus. After the files are loaded, each data set is smoothed using a kernel regression estimator (24). The effect of this smoothing is the removal of shoulders from peaks within the data set, which reduces the overall number of peaks to be analyzed. This smoothed data set is then transformed in the second portion of the analysis.
Since the absolute signal intensity at a particular m/z value exhibits a high variability, even between apparent exact replicates, the development of a unitless number was desirable for the comparison of this data. The primary characteristic for this transformation was that it should be a more robust measure of the signal strength at a particular m/z value with respect to the overall pattern observed. Our software is set up for two distinct methods for transforming the data. In the first method, the mean and standard deviation of a function of the observed intensities for the complete spectra are calculated. These statistics are then used to transform the signal intensity at each of the m/z values as: I* = (I - mean)/SD (i.e., the transformed intensity is represented as the number of standard deviations the signal occurs above or below the mean signal strength). Thus, a signal with intensity equal to the mean intensity of the data set would receive a score of zero, and any signal with intensity below the mean would receive a negative score. The second method involves using the rank of the signal in comparison with the other points in the data set as the transformed intensity. This method has proven to be highly robust against the wide changes in signal magnitude observed. These transformed intensity signals are carried into the next level of analysis.
After the data have been smoothed and transformed, the algorithm identifies potential peaks by parsing for low-high-low patterns in the data set. These peaks are collected and concatenated with other data sets from the same condition. During this concatenation, the algorithm averages the locations of peaks identified within different data sets to compensate for m/z measurement error. For example, peaks identified at m/z ratios of 768.4 and 768.6 are placed at 768.5 for the analysis.
In the next stage of the analysis, the data from the basal condition are used to create a Shewhart control chart for the means of the transformed data. The object of this construction is the identification of m/z ratios for which the signal remains stable in the basal condition during the course of the experiment so that the information can be pooled for subsequent comparison with the stimulated condition. In the analysis of the WEHI-231 cells, we have n = 4 samples at each time point. The mean of these values is plotted along the time axis and a set of control limits are calculated for this statistic. The area between these limits represents the expected variability in the mean of four observations of the process, not the individual measurements. The limits are computed from process output, assuming the underlying distribution remains stable. Thus, a kind of running hypothesis test is constructed. An example of this stage of the analysis is shown in the left panel of Fig. 3.
![]() |
| Fig. 3. Shewhart control chart. The left panel of the figure represents a control chart for the sample mean of the transformed data at a specific m/z value constructed from the basal case at five time points with four measurements each. The y-axis is unitless. The means of the sets are connected with a solid line. The chart also shows the grand mean as well as the lower and upper control limits, LCL and UCL, respectively. The area between the grand mean and the control limits is divided into zones labeled A, B, and C, as they proceed toward the center of the control chart. These zones represent the 3s, 2s, and 1s distance from the grand mean and are used to test for time-dependent nonrandom patterns. The right panel of the figure shows the data for the stimulated condition, compared against the basal control limits, indicating an out-of-control condition at the third and fourth time points. |
A process is said to be "in control" if it exhibits only random variation, that is, all points (means in this case) are within the control limits and no nonrandom patterns are present. Our algorithm uses the zones labeled A, B, and C in the control chart to examine the time series for nonrandom patterns. At all m/z values at which the basal data remain in control over the course of the experiment, it is assumed that the variability in the signal output is appropriately represented by the control limits. These m/z values represent molecules in which metabolic cellular events are negligible, as measured by mass spectrometry, in the nonstimulated condition. An example of this can be seen in the left panel of Fig. 3, in which the signal denoted by the means of the four measurements is seen to be in control. In this case, these limits are used for comparison with the observed means at this m/z ratio in the stimulated condition, and these results are collected for processing into the lipid arrays. This analysis includes parsing for time points beyond the control limits as well as searching for patterns that can be deduced from the control chart zones. In Fig. 3, the third and fourth time points in the stimulated condition fail a nonrandom pattern test (two of three consecutive points in zone A or beyond; the fourth time point is beyond the lower control limit as well) and are flagged by the algorithm as having decreased in the stimulated condition.
The final portion of the computational analysis is the handling of exceptions to the above discussion. Two possibilities require elucidation here. In the first case, the basal data behave in an "out-of-control" manner, that is, they contain some nonrandom time-related variation. When the basal condition exhibits out-of-control variation, extending the control limits for comparison with the stimulated condition would be inappropriate. In this instance, a Welch-modified two-sample t-test is performed at each of the time points to determine if differences exist in the means between the two conditions at the given time. Thus, the algorithm performs an alternative statistical test at each time point for every m/z ratio in which the basal signal is found to be out of control. The second possibility involves peaks that appear in different frequencies within the basal and stimulated conditions. In this case, a binomial test is performed, with the null hypothesis that a peak has an equal chance of appearing in either of the two conditions, to determine if the observed difference in the number of occurrences in the two conditions is significant.
Lipid Arrays. At the conclusion of the analysis, the results are grouped into a comprehensive array containing the m/z values observed as peaks on the vertical and the time points on the horizontal axes. Lipid species that have been identified by CID MS/MS techniques as being present in the sample are assigned their corresponding m/z values. Each m/z and time point combination found to be increasing is scored with a one (1), while those decreasing are assigned a negative one (-1). Statistically stable combinations are scored with a zero (0). These arrays are color coded to enhance readability and in many cases provide a striking display of cellular lipid changes in time after challenge with a biological agonist. An excerpt of a lipid array is shown in Fig. 4.
![]() |
| Fig. 4. An excerpt from a lipid array in the m/z range of 885.6 to 892.5. Data were collected from WEHI-231 cells challenged with 0.13 mM anti-IgM. Analysis using CID MS/MS determined that this area contained phosphatidylinositol lipids with 38 carbons in several double-bond configurations. The array shows these species decreasing over the time course after the stimulation, as indicated by the negative score. |
The number of peak/time point combinations examined in the system can create a significant opportunity for false-positives. This is illustrated by considering that if 1000 different peaks are analyzed over 5 time points, 5000 chances for a false-positive are created. If the alpha value is set at 0.05, one would anticipate 250 false indicators on a lipid array of this size occurring by chance alone. This effect can be countered by repeating the entire experiment multiple times and summing the cells from the resulting arrays. Thus, if the experiment is repeated five times, each cell in the summary array will have a score between -5 and 5. Scores occurring toward the extremes (-5 and 5) indicate species that are fluctuating under stimulation with high statistical significance. Since random errors are unlikely to occur in the same position, after several repetitions the result is seen to converge to a stable map of lipid changes.