Introduction
Pioneer and its companion tool Altimeter are an open-source and performant solution for analysis of protein MS data acquired by data-independent acquisition (DIA). Pioneer includes routines for searching DIA experiments from Thermo and Sciex instruments and for building spectral libraries using the Koina interface. Given a spectral library of precursor fragment ion intensities and retention time estimates, Pioneer identifies and quantifies peptides from the library in the data.
Design Goals
- Open-Source: Pioneer is completely open source.
- Cross-Platform: Pioneer and the vendor-specific file conversion tool run on Linux, macOS, and Windows.
- High-Performance: Pioneer achieves high sensitivity, FDR control, and quantitative precision and accuracy on benchmark datasets.
- Scalability: Memory consumption and speed should remain constant as the number of raw files in an analysis grows. Pioneer should scale to very large experiments with hundreds to thousands of raw files (experimental).
- Fast: Pioneer searches data several times faster than it can be acquired and faster than state-of-the-art search tools.
Features
Pioneer and Altimeter build on previous search engines and introduce several new concepts:
Spectral Library Prediction with Koina: Using Koina, Pioneer can construct fully predicted spectral libraries given an internet connection and a FASTA file with protein sequences. Pioneer uses Chronologer to predict peptide retention times and is optimized to use Altimeter for fragment ion intensity predictions.
Collision Energy Independent Spectral Libraries: Rather than predicting a single intensity value for each fragment ion, Altimeter predicts 4 B-spline coefficients. Evaluating the fragment splines at a given collision energy gives a fragment ion intensity. Pioneer calibrates the library to find the optimal collision energy value to use for each MS data file in an experiment. In this way, it is possible to use a single spectral library for different instruments and scan settings.
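As a rough illustration of what "evaluating the fragment splines at a given collision energy" can look like, the sketch below evaluates a cubic B-spline defined by 4 coefficients using the Cox-de Boor recursion. The spline degree, the uniform knot vector, the coefficient values, and the names bspline_basis and spline_intensity are assumptions made for this example; they are not taken from Altimeter or Pioneer.

```julia
# Cox-de Boor recursion: value at x of the i-th B-spline basis function of degree p.
function bspline_basis(i::Int, p::Int, knots::AbstractVector{<:Real}, x::Real)
    if p == 0
        return knots[i] <= x < knots[i + 1] ? 1.0 : 0.0
    end
    left_den = knots[i + p] - knots[i]
    right_den = knots[i + p + 1] - knots[i + 1]
    left = left_den > 0 ? (x - knots[i]) / left_den * bspline_basis(i, p - 1, knots, x) : 0.0
    right = right_den > 0 ? (knots[i + p + 1] - x) / right_den * bspline_basis(i + 1, p - 1, knots, x) : 0.0
    return left + right
end

# Fragment intensity at collision energy `nce` as a weighted sum of basis functions.
function spline_intensity(coefs::AbstractVector{<:Real}, knots::AbstractVector{<:Real},
                          nce::Real; degree::Int = 3)
    @assert length(knots) == length(coefs) + degree + 1
    return sum(coefs[i] * bspline_basis(i, degree, knots, nce) for i in eachindex(coefs))
end

# Hypothetical uniform knots (8 values). With 4 coefficients and degree 3, all four
# basis functions are nonzero only on [knots[4], knots[5]) = [27, 34).
knots = collect(6.0:7.0:55.0)
coefs = [0.10, 0.45, 0.80, 0.30]   # made-up coefficients for one fragment ion

for nce in (28.0, 30.0, 32.0)
    println("NCE = ", nce, " -> intensity = ", spline_intensity(coefs, knots, nce))
end
```

Because only the coefficients are stored, the same library entry yields a different intensity for each collision energy, which is what allows one library to serve several instruments and scan settings.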
Fragment Isotope Correction: Fragment isotope distributions depend on precursor isotope distributions as distorted by quadrupole mass filtering. Altimeter addresses this by predicting total fragment ion intensities rather than monoisotopic ones. Pioneer then accurately re-isotopes these library spectra using methods from Goldfarb et al. This is particularly important for narrow-window DIA methods where precursor isotopic envelopes frequently straddle multiple windows.
Quadrupole Transmission Modeling: For narrow-window data, Pioneer can optionally estimate a quadrupole-transmission efficiency function for more accurate re-isotoping.
Intensity-Aware Fragment Index Search: Pioneer implements a fast fragment index search inspired by MSFragger and Sage. Pioneer's implementation uniquely leverages accurate fragment intensity predictions from in silico libraries to improve both speed and specificity of the search.
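The general idea of a fragment index can be sketched as follows: library fragments are sorted by m/z so that each observed peak can be matched by binary search, and matches are tallied per precursor. This is a minimal sketch and does not reproduce Pioneer's implementation. In particular, the rank-based weight (1 / rank), the 20 ppm tolerance, and the names build_index and score_spectrum are hypothetical choices used only to show how predicted intensities could inform the score.

```julia
# Each library fragment is a NamedTuple: m/z, owning precursor, and predicted
# intensity rank within that precursor (1 = most intense). Values are made up.
library_fragments = [
    (mz = 500.25, precursor_id = 1, rank = 1),
    (mz = 300.10, precursor_id = 1, rank = 2),
    (mz = 500.30, precursor_id = 2, rank = 1),
    (mz = 700.40, precursor_id = 2, rank = 2),
]

# Sort by m/z so that tolerance windows can be located with binary search.
build_index(fragments) = sort(fragments; by = f -> f.mz)

function score_spectrum(index, peaks::Vector{Float64}; ppm::Float64 = 20.0)
    mzs = [f.mz for f in index]
    scores = Dict{Int, Float64}()
    for peak in peaks
        tol = peak * ppm * 1e-6
        lo = searchsortedfirst(mzs, peak - tol)   # first fragment inside the window
        hi = searchsortedlast(mzs, peak + tol)    # last fragment inside the window
        for k in lo:hi
            f = index[k]
            # Hypothetical intensity-aware weight: higher-ranked fragments count more.
            scores[f.precursor_id] = get(scores, f.precursor_id, 0.0) + 1 / f.rank
        end
    end
    return scores
end

index = build_index(library_fragments)
scores = score_spectrum(index, [300.10, 500.25])
println(scores)
```

Precursors whose accumulated score passes a threshold would then be handed to the more expensive downstream scoring steps.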
Linear Regression onto Library Templates: Pioneer explains each observed mass spectrum as a linear combination of template spectra from the library. To reduce quantitative bias from interfering signals, Pioneer minimizes the pseudo-Huber loss rather than squared error. This provides robust quantification even in complex spectra. For other examples of linear regression applied to DIA analyses, see Specter and Chimerys.
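To make the regression idea concrete, here is a minimal sketch that fits template weights by minimizing the pseudo-Huber loss, δ²(√(1 + (r/δ)²) − 1), with projected gradient descent. The solver, the non-negativity constraint on the weights, the value of δ, the step-size rule, and the names pseudo_huber and fit_templates are assumptions for illustration; the source does not describe how Pioneer solves this problem.

```julia
# Pseudo-Huber loss of a residual r: quadratic near zero, roughly linear for |r| >> δ.
pseudo_huber(r, δ) = δ^2 * (sqrt(1 + (r / δ)^2) - 1)

# Derivative of the pseudo-Huber loss with respect to the residual.
pseudo_huber_grad(r, δ) = r / sqrt(1 + (r / δ)^2)

# Columns of A are template spectra, y is the observed spectrum.
# Returns non-negative weights w such that A * w approximates y.
function fit_templates(A::Matrix{Float64}, y::Vector{Float64}; δ::Float64 = 1.0, iters::Int = 5000)
    w = zeros(size(A, 2))
    # The loss curvature is at most 1, so 1 / ‖A‖_F² is a safe (conservative) step size.
    step = 1 / sum(abs2, A)
    for _ in 1:iters
        r = A * w .- y                        # residuals per fragment peak
        g = A' * pseudo_huber_grad.(r, δ)     # gradient with respect to w
        w = max.(w .- step .* g, 0.0)         # gradient step, then project onto w ≥ 0
    end
    return w
end

# Two made-up templates that share the third peak.
A = [1.0 0.0;
     0.0 1.0;
     1.0 1.0]
y = A * [2.0, 3.0]          # observed spectrum with no interference
w = fit_templates(A, y)
println(w)

# Adding a large interfering signal to one peak changes the fit much less than
# it would under squared error, because large residuals are penalized linearly.
y_interfered = y .+ [0.0, 0.0, 50.0]
println(fit_templates(A, y_interfered))
```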
Scalability: Pioneer was designed to scale to large experiments with many MS data files. Memory consumption remains constant as the number of data files in an experiment grows large.
Authors and Development
Pioneer is developed and maintained by:
- Nathan Wamsley (Major Lab/Goldfarb Lab, Washington University)
- Dennis Goldfarb (Goldfarb Lab, Washington University)
Contact
For questions about Pioneer or to collaborate, please contact:
- Nathan Wamsley (wamsleynathan@gmail.com)
- Dennis Goldfarb (dennis.goldfarb@wustl.edu)
For troubleshooting, use the Issues page on GitHub. To critique methods or propose features, use the Discussions page.
Exported Methods
Pioneer.BuildSpecLib
Pioneer.GetBuildLibParams
Pioneer.GetSearchParams
Pioneer.SearchDIA
Pioneer.convertMzML
Pioneer.SearchDIA (Function)
SearchDIA(params_path::String)
Main entry point for the DIA (Data-Independent Acquisition) search workflow. Executes a series of SearchMethods and generates performance metrics.
Parameters:
- params_path: Path to JSON configuration file containing search parameters
Output:
- Generates a log file in the results directory
- Long- and wide-format tables (.arrow and .tsv) with protein-group and precursor-level identifications and quantitation
- Reports timing and memory usage statistics
Example:
julia> SearchDIA("/path/to/config.json")
==========================================================================================
Sarting SearchDIA
==========================================================================================
Starting search at: 2024-12-30T14:01:01.510
Output directory: ./../data/ecoli_test/ecoli_test_results
[ Info: Loading Parameters...
[ Info: Loading Spectral Library...
.
.
.
If it does not already exist, SearchDIA creates the user-specified results_dir and generates quality control plots, data tables, and logs.
results_dir/
├── pioneer_search_log.txt
├── qc_plots/
│   ├── collision_energy_alignment/
│   │   └── nce_alignment_plots.pdf
│   ├── quad_transmission_model/
│   │   ├── quad_data/
│   │   │   └── quad_data_plots.pdf
│   │   └── quad_models/
│   │       └── quad_model_plots.pdf
│   ├── rt_alignment_plots/
│   │   └── rt_alignment_plots.pdf
│   ├── mass_error_plots/
│   │   └── mass_error_plots.pdf
│   └── QC_PLOTS.pdf
├── precursors_long.arrow
├── precursors_long.tsv
├── precursors_wide.arrow
├── precursors_wide.tsv
├── protein_groups_long.arrow
├── protein_groups_long.tsv
├── protein_groups_wide.arrow
└── protein_groups_wide.tsv
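After a search finishes, a quick sanity check is to confirm that the log and table files from the listing are present. The sketch below uses only the file names shown in the results_dir listing; the helper name missing_outputs is hypothetical and is not part of Pioneer.

```julia
# Log and table files from the results_dir listing.
expected_outputs = [
    "pioneer_search_log.txt",
    "precursors_long.arrow", "precursors_long.tsv",
    "precursors_wide.arrow", "precursors_wide.tsv",
    "protein_groups_long.arrow", "protein_groups_long.tsv",
    "protein_groups_wide.arrow", "protein_groups_wide.tsv",
]

# Hypothetical helper: names of expected outputs that are absent from results_dir.
missing_outputs(results_dir::AbstractString) =
    [f for f in expected_outputs if !isfile(joinpath(results_dir, f))]

# Demonstration on a temporary directory that contains only the log file.
demo_dir = mktempdir()
touch(joinpath(demo_dir, "pioneer_search_log.txt"))
println(missing_outputs(demo_dir))
```

An empty result means every listed file was found.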
Pioneer.GetSearchParams (Function)
GetSearchParams(lib_path::String, ms_data_path::String, results_path::String;
                params_path::Union{String, Missing} = missing,
                simplified::Bool = true)
Creates a search parameter configuration file with user-specified paths.
The function loads default parameters from either the simplified or full JSON template (from assets/example_config/) and creates a customized parameter file with the user's file paths. All other parameters retain their default values and can be modified later.
Arguments:
- lib_path: Path to the spectral library file (.poin)
- ms_data_path: Path to the MS data directory
- results_path: Path where search results will be stored
- params_path: Output path for the parameter file. Can be a directory (creates search_parameters.json) or a full file path. Defaults to "search_parameters.json" in the current directory.
- simplified: If true (default), uses simplified template with essential parameters only. If false, uses full template with all advanced options.
Returns:
- String: Path to the newly created search parameters file
Templates used:
- Simplified: defaultSearchParamsSimplified.json (basic parameters)
- Full: defaultSearchParams.json (all advanced parameters)
Example:
# Create simplified parameter file
output_path = GetSearchParams(
    "/path/to/speclib.poin",
    "/path/to/ms/data/dir",
    "/path/to/results/dir"
)
# Create full parameter file with custom output location
output_path = GetSearchParams(
    "/path/to/speclib.poin",
    "/path/to/ms/data/dir",
    "/path/to/results/dir";
    params_path = "/custom/path/my_params.json",
    simplified = false
)
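The three documented cases for params_path (missing, a directory, or a full file path) can be pictured with the small sketch below. The helper resolve_params_path is hypothetical and only mirrors the behavior described in the argument list; it is not Pioneer's code.

```julia
# Hypothetical helper mirroring the documented params_path behavior.
function resolve_params_path(params_path::Union{String, Missing}, default_name::String)
    # No path given: use the default file name in the current directory.
    ismissing(params_path) && return joinpath(pwd(), default_name)
    # Existing directory: place the default file name inside it.
    isdir(params_path) && return joinpath(params_path, default_name)
    # Anything else is treated as a full file path.
    return params_path
end

demo_dir = mktempdir()
println(resolve_params_path(missing, "search_parameters.json"))
println(resolve_params_path(demo_dir, "search_parameters.json"))
println(resolve_params_path(joinpath(demo_dir, "my_params.json"), "search_parameters.json"))
```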
Pioneer.BuildSpecLib (Function)
BuildSpecLib(params_path::String)
Main function to build a spectral library from parameters. Executes a series of steps:
- Parameter validation and directory setup
- Fragment bound detection
- Retention time prediction (optional)
- Fragment prediction (optional)
- Library index building
Parameters:
- params_path: Path to JSON configuration file containing library building parameters
Output:
- Generates a spectral library in the specified output directory
- Creates a detailed log file with timing and performance metrics
- Returns nothing
Pioneer.GetBuildLibParams (Function)
GetBuildLibParams(out_dir::String, lib_name::String, fasta_inputs;
                  params_path::Union{String, Missing} = missing,
                  regex_codes::Union{Missing, Dict, Vector} = missing,
                  simplified::Bool = true)
Creates a library building parameter configuration file with user-specified paths and FASTA files.
The function loads default parameters from either the simplified or full JSON template (from assets/example_config/) and creates a customized parameter file with the user's paths and automatically discovered FASTA files. All other parameters retain their default values and can be modified later.
Arguments:
- out_dir: Output directory path where the library will be built
- lib_name: Name for the spectral library (used for directory and file naming)
- fasta_inputs: FASTA file specification. Can be:
  - A single directory path (String) - searches for .fasta/.fasta.gz files
  - A single FASTA file path (String)
  - An array of directories and/or FASTA file paths
- params_path: Output path for the parameter file. Can be a directory (creates buildspeclib_params.json) or a full file path. Defaults to "buildspeclib_params.json" in the current directory.
- regex_codes: Optional FASTA header regex patterns for protein annotation extraction. Can be:
  - A single Dict with keys: "accessions", "genes", "proteins", "organisms" (applied to all FASTA files)
  - A Vector of Dicts for positional mapping to fasta_inputs
  - If missing, uses default patterns from the template
- simplified: If true (default), uses simplified template with essential parameters only. If false, uses full template with all advanced library building options.
Returns:
- String: Path to the newly created library building parameters file
Templates used:
- Simplified: defaultBuildLibParamsSimplified.json (basic parameters)
- Full: defaultBuildLibParams.json (all advanced parameters)
The function automatically:
- Discovers FASTA files in specified directories
- Generates appropriate library names from FASTA filenames
- Expands regex patterns to match the number of FASTA files found
- Validates that all specified paths exist and are accessible
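The discovery step can be pictured with the sketch below, which expands a mix of directories and file paths into a flat list of FASTA files. The helper names is_fasta and discover_fasta, and the choice to raise an error for anything that is neither a directory nor a FASTA file, are assumptions for illustration rather than Pioneer's actual behavior.

```julia
# A path counts as FASTA if it ends in .fasta or .fasta.gz.
is_fasta(path::AbstractString) = endswith(path, ".fasta") || endswith(path, ".fasta.gz")

# Hypothetical helper: expand directories and file paths into a list of FASTA files.
function discover_fasta(inputs::Union{String, Vector{String}})
    paths = inputs isa String ? [inputs] : inputs
    found = String[]
    for p in paths
        if isdir(p)
            # readdir returns sorted entries; join = true yields full paths.
            append!(found, filter(is_fasta, readdir(p; join = true)))
        elseif isfile(p) && is_fasta(p)
            push!(found, p)
        else
            error("Not a FASTA file or directory: $p")
        end
    end
    return found
end

# Demonstration with empty placeholder files in a temporary directory.
fasta_dir = mktempdir()
for name in ("human.fasta", "yeast.fasta.gz", "notes.txt")
    touch(joinpath(fasta_dir, name))
end
println(basename.(discover_fasta(fasta_dir)))
```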
Example:
# Create simplified parameter file with directory of FASTA files
output_path = GetBuildLibParams(
    "/path/to/output",
    "my_library",
    "/path/to/fasta/directory"
)
# Create full parameter file with specific FASTA files and custom regex
# (backslashes must be escaped inside ordinary Julia string literals)
output_path = GetBuildLibParams(
    "/path/to/output",
    "my_library",
    ["/path/to/human.fasta", "/path/to/yeast.fasta"];
    params_path = "/custom/path/build_params.json",
    regex_codes = Dict("accessions" => "^sp\\|(\\w+)\\|", "genes" => " GN=(\\S+)"),
    simplified = false
)
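To see what patterns like these capture, the sketch below applies them to a UniProt-style header with Julia's built-in Regex support. It assumes the patterns are matched against the header text without the leading ">" and that the first capture group holds the annotation; the helper extract_field is hypothetical and the header is only an illustrative example.

```julia
# UniProt-style header with the leading ">" removed (illustrative entry).
header = "sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2"

# Patterns stored as plain strings, so each backslash is escaped.
regex_codes = Dict(
    "accessions" => "^sp\\|(\\w+)\\|",
    "genes"      => " GN=(\\S+)",
)

# Hypothetical helper: first capture group of the pattern, or nothing if there is no match.
function extract_field(pattern::String, header::String)
    m = match(Regex(pattern), header)
    return m === nothing ? nothing : String(m.captures[1])
end

println(extract_field(regex_codes["accessions"], header))  # P69905
println(extract_field(regex_codes["genes"], header))       # HBA1
```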
Pioneer.convertMzML (Function)
convertMzML(mzml_dir::String; skip_scan_header::Bool=true)
Convert mzML mass spectrometry data files to Arrow IPC format.
Takes either a directory containing mzML files or a path to a single mzML file and converts them to Arrow format, preserving scan data including m/z arrays, intensity arrays, and scan metadata.
Arguments
- mzml_dir::String: Path to either a directory containing mzML files or a single mzML file
- skip_scan_header::Bool=true: When true, omits scan header information from the output to reduce file size
Returns
nothing
Output
Creates Arrow (.arrow) files in the same directory as the input mzML files and with the same base filename.
Examples
# Convert all mzML files in a directory
convertMzML("path/to/mzml/files")
# Convert a single mzML file
convertMzML("path/to/single/file.mzML")
# Include scan headers in output
convertMzML("path/to/mzml/files", skip_scan_header=false)
Notes
Each mzML file is converted to a corresponding Arrow IPC (.arrow) file in the same directory. This is particularly useful for Sciex data, where direct .wiff/.wiff2 conversion is not supported.
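The naming rule (same directory, same base filename, .arrow extension) can be expressed with a short sketch. The helpers arrow_path and expected_arrow_outputs are hypothetical, and matching the .mzML extension case-insensitively is an assumption; neither is taken from Pioneer.

```julia
# Hypothetical helper: the .arrow path that corresponds to one mzML file.
arrow_path(mzml_path::AbstractString) = splitext(mzml_path)[1] * ".arrow"

# Hypothetical helper: expected outputs for a single file or a directory of mzML files.
function expected_arrow_outputs(path::AbstractString)
    if isdir(path)
        files = filter(f -> lowercase(splitext(f)[2]) == ".mzml", readdir(path; join = true))
        return arrow_path.(files)
    end
    return [arrow_path(path)]
end

println(expected_arrow_outputs("path/to/single/file.mzML"))
```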