Introduction

Pioneer and its companion tool Altimeter are an open-source, high-performance solution for analyzing protein MS data acquired by data-independent acquisition (DIA). Pioneer includes routines for searching DIA experiments from Thermo and Sciex instruments and for building spectral libraries using the Koina interface. Given a spectral library of precursor fragment ion intensities and retention time estimates, Pioneer identifies and quantifies peptides from the library in the data.

Design Goals

  • Open-Source: Pioneer is completely open source.
  • Cross-Platform: Pioneer and the vendor-specific file conversion tool run on Linux, macOS, and Windows.
  • High-Performance: Pioneer achieves high sensitivity, FDR control, quantitative precision, and accuracy on benchmark datasets.
  • Scalability: Memory consumption and speed should remain constant as the number of raw files in an analysis grows. Pioneer should scale to very large experiments with hundreds to thousands of raw files (experimental).
  • Fast: Pioneer searches data several times faster than it can be acquired and faster than state-of-the-art search tools.

Features

Pioneer and Altimeter build on previous search engines and introduce several new concepts:

  • Spectral Library Prediction with Koina: Using Koina, Pioneer can construct fully predicted spectral libraries given an internet connection and a FASTA file with protein sequences. Pioneer uses Chronologer to predict peptide retention times and is optimized to use Altimeter for fragment ion intensity predictions.

  • Collision Energy Independent Spectral Libraries: Rather than predicting a single intensity value for each fragment ion, Altimeter predicts four B-spline coefficients. Evaluating the fragment splines at a given collision energy gives a fragment ion intensity. Pioneer calibrates the library to find the optimal collision energy value to use for each MS data file in an experiment. In this way, it is possible to use a single spectral library for different instruments and scan settings.

  • Fragment Isotope Correction: Fragment isotope distributions depend on precursor isotope distributions as distorted by quadrupole mass filtering. Altimeter addresses this by predicting total fragment ion intensities rather than monoisotopic ones. Pioneer then accurately re-isotopes these library spectra using methods from Goldfarb et al. This is particularly important for narrow-window DIA methods, where precursor isotopic envelopes frequently straddle multiple windows.

  • Quadrupole Transmission Modeling: For narrow-window data, Pioneer can optionally estimate a quadrupole-transmission efficiency function for more accurate re-isotoping.

  • Intensity-Aware Fragment Index Search: Pioneer implements a fast fragment index search inspired by MSFragger and Sage. Pioneer's implementation uniquely leverages accurate fragment intensity predictions from in silico libraries to improve both speed and specificity of the search.

  • Linear Regression onto Library Templates: Pioneer explains each observed mass spectrum as a linear combination of template spectra from the library. To reduce quantitative bias from interfering signals, Pioneer minimizes the pseudo-Huber loss rather than squared error. This provides robust quantification even in complex spectra. For other examples of linear regression applied to DIA analyses, see Specter and Chimerys.

  • Scalability: Pioneer was designed to scale to large experiments with many MS data files. Memory consumption remains constant as the number of data files in an experiment grows large.
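
As a rough illustration of the collision-energy-independent library idea, the sketch below evaluates a cubic B-spline with four coefficients at a chosen collision energy using the Cox-de Boor recursion. The knot vector, the coefficient values, and the names bspline_basis and spline_intensity are assumptions made for illustration; they are not Altimeter's actual knots or Pioneer's API.

```julia
# Cox-de Boor recursion: value at x of the i-th B-spline basis function
# of degree p defined on the given knot vector (1-based indexing).
function bspline_basis(i::Int, p::Int, knots::AbstractVector{<:Real}, x::Real)
    if p == 0
        return (knots[i] <= x < knots[i + 1]) ? 1.0 : 0.0
    end
    left_den = knots[i + p] - knots[i]
    right_den = knots[i + p + 1] - knots[i + 1]
    # Terms with a zero-width knot span contribute nothing.
    left = left_den > 0 ? (x - knots[i]) / left_den * bspline_basis(i, p - 1, knots, x) : 0.0
    right = right_den > 0 ? (knots[i + p + 1] - x) / right_den * bspline_basis(i + 1, p - 1, knots, x) : 0.0
    return left + right
end

# Evaluate one fragment's spline at a collision energy `nce`.
function spline_intensity(coefs::AbstractVector{<:Real}, knots::AbstractVector{<:Real},
                          nce::Real; degree::Int = 3)
    length(knots) == length(coefs) + degree + 1 ||
        throw(ArgumentError("need length(coefs) + degree + 1 knots"))
    knots[degree + 1] <= nce < knots[end - degree] ||
        throw(DomainError(nce, "collision energy is outside the spline's valid range"))
    return sum(coefs[i] * bspline_basis(i, degree, knots, nce) for i in eachindex(coefs))
end

# Hypothetical values: four coefficients for one fragment and clamped knots
# spanning collision energies 20 to 40.
coefs = [0.2, 0.8, 1.0, 0.6]
knots = [20.0, 20.0, 20.0, 20.0, 40.0, 40.0, 40.0, 40.0]
intensity_at_30 = spline_intensity(coefs, knots, 30.0)
```

Because the intensity is a smooth function of collision energy, the same four coefficients can be re-evaluated at whatever energy best matches a given data file.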
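
To make the robust regression idea concrete, here is a minimal sketch of the pseudo-Huber loss applied to the residuals between an observed spectrum and a weighted sum of library templates. The function names, the template matrix H, the weights w, and the scale parameter δ are all hypothetical; this shows the loss itself, not Pioneer's solver.

```julia
using LinearAlgebra

# Pseudo-Huber loss for a single residual r with scale δ > 0.
# Roughly r^2 / 2 for small residuals and δ * |r| for large ones.
pseudo_huber(r::Real, δ::Real) = δ^2 * (sqrt(1 + (r / δ)^2) - 1)

# Total loss for an observed spectrum x modeled as H * w, where each
# column of H is a library template and w holds the template weights.
function template_loss(H::AbstractMatrix, w::AbstractVector, x::AbstractVector, δ::Real)
    residuals = x .- H * w
    return sum(r -> pseudo_huber(r, δ), residuals)
end

# Hypothetical example: three fragment peaks, two overlapping templates.
H = [1.0 0.0;
     0.5 1.0;
     0.0 0.5]
w = [2.0, 4.0]

# Add an interfering signal to the second peak only.
observed = H * w .+ [0.0, 50.0, 0.0]

robust = template_loss(H, w, observed, 1.0)
quadratic = 0.5 * sum(abs2, observed .- H * w)
```

The interfering peak contributes far less to the pseudo-Huber loss than to the squared error, which is why a single contaminated fragment has less influence on the fitted weights.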

Authors and Development

Pioneer is developed and maintained by:

Contact

For questions about Pioneer or to collaborate, please contact:

  • Nathan Wamsley (wamsleynathan@gmail.com)
  • Dennis Goldfarb (dennis.goldfarb@wustl.edu)

For troubleshooting, use the Issues page on GitHub. To critique methods or propose features, use the Discussions page.

Exported Methods

Pioneer.SearchDIA - Function
SearchDIA(params_path::String)

Main entry point for the DIA (Data-Independent Acquisition) search workflow. Executes a series of SearchMethods and generates performance metrics.

Parameters:

  • params_path: Path to JSON configuration file containing search parameters

Output:

  • Generates a log file in the results directory
  • Long- and wide-format tables (.arrow and .csv) with protein-group and precursor-level identifications and quantitation
  • Reports timing and memory usage statistics

Example:

julia> SearchDIA("/path/to/config.json")
==========================================================================================
Sarting SearchDIA
==========================================================================================

Starting search at: 2024-12-30T14:01:01.510
Output directory: ./../data/ecoli_test/ecoli_test_results
[ Info: Loading Parameters...
[ Info: Loading Spectral Library...
 .
 .
 .

If it does not already exist, SearchDIA creates the user-specified results_dir and generates quality control plots, data tables, and logs.

results_dir/
├── pioneer_search_log.txt
├── qc_plots/
│   ├── collision_energy_alignment/
│   │   └── nce_alignment_plots.pdf
│   ├── quad_transmission_model/
│   │   ├── quad_data/
│   │   │   └── quad_data_plots.pdf
│   │   └── quad_models/
│   │       └── quad_model_plots.pdf
│   ├── rt_alignment_plots/
│   │   └── rt_alignment_plots.pdf
│   ├── mass_error_plots/
│   │   └── mass_error_plots.pdf
│   └── QC_PLOTS.pdf
├── precursors_long.arrow
├── precursors_long.tsv
├── precursors_wide.arrow
├── precursors_wide.tsv
├── protein_groups_long.arrow
├── protein_groups_long.tsv
├── protein_groups_wide.arrow
└── protein_groups_wide.tsv
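
After a run, it can be handy to confirm that the expected tables were written. The sketch below is a hypothetical helper, not part of Pioneer: it uses only Julia's standard library and takes its file names from the directory listing above, assuming the wide precursor table follows the same naming pattern as the other tables.

```julia
# Table names as shown in the results directory listing.
const EXPECTED_TABLES = [
    "precursors_long.arrow",
    "precursors_long.tsv",
    "precursors_wide.arrow",
    "precursors_wide.tsv",
    "protein_groups_long.arrow",
    "protein_groups_long.tsv",
    "protein_groups_wide.arrow",
    "protein_groups_wide.tsv",
]

# Return the expected table files that are absent from `results_dir`.
function missing_tables(results_dir::AbstractString)
    isdir(results_dir) || throw(ArgumentError("not a directory: $results_dir"))
    return [f for f in EXPECTED_TABLES if !isfile(joinpath(results_dir, f))]
end
```

Calling missing_tables("/path/to/results_dir") returns an empty vector when every table is present.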

Pioneer.GetSearchParams - Function
GetSearchParams(template_path::String, lib_path::String, ms_data_path::String, results_path::String)

Creates a new search parameter file based on a template, with updated file paths.

The function reads a template JSON configuration file and creates a new 'search_parameters.json' in the current working directory with updated paths while preserving all other settings.

Arguments:

  • lib_path: Path to the library file (.poin)
  • ms_data_path: Path to the MS data directory
  • results_path: Path where results will be stored
  • params_path: Path to folder or .json file in which to write the template parameters file. Defaults to joinpath(pwd(), "./search_parameters.json")

Returns:

  • String: Path to the newly created search parameters file

Example:

# Write the search parameters file and capture its path
output_path = GetSearchParams(
    "/path/to/speclib.poin",   # lib_path
    "/path/to/ms/data/dir",    # ms_data_path
    "/path/to/output/dir"      # results_path
)

Pioneer.BuildSpecLib - Function
BuildSpecLib(params_path::String)

Main function to build a spectral library from parameters. Executes a series of steps:

  1. Parameter validation and directory setup
  2. Fragment bound detection
  3. Retention time prediction (optional)
  4. Fragment prediction (optional)
  5. Library index building

Parameters:

  • params_path: Path to JSON configuration file containing library building parameters

Output:

  • Generates a spectral library in the specified output directory
  • Creates a detailed log file with timing and performance metrics
  • Returns nothing

Pioneer.GetBuildLibParams - Function
GetBuildLibParams(out_dir::String, lib_name::String, fasta_dir::String)

Creates a new library build parameter file with updated paths and automatically discovered FASTA files. Uses a default template from data/example_config/defaultBuildLibParams.json.

Arguments:

  • out_dir: Output directory path
  • lib_name: Library name path
  • fasta_dir: Directory to search for FASTA files
  • params_path: Path to folder or .json file in which to write the template parameters file. Defaults to joinpath(pwd(), "./buildspeclibparams.json")

Returns:

  • String: Path to the newly created parameters file
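
The idea of discovering FASTA files in a directory can be sketched with the standard library alone. The helper name find_fasta_files and the accepted extensions below are assumptions for illustration; the source does not state which extensions GetBuildLibParams recognizes.

```julia
# Hypothetical helper, not part of Pioneer's API: list files in `fasta_dir`
# whose names end in one of the given extensions (case-insensitive).
function find_fasta_files(fasta_dir::AbstractString;
                          extensions = (".fasta", ".fasta.gz"))
    isdir(fasta_dir) || throw(ArgumentError("not a directory: $fasta_dir"))
    paths = readdir(fasta_dir; join = true)  # full paths, sorted by name
    return filter(paths) do p
        isfile(p) && any(ext -> endswith(lowercase(p), ext), extensions)
    end
end
```

Checking the result of such a helper before building a library is a cheap way to catch an empty or mistyped FASTA directory.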

Pioneer.ParseSpecLib - Function
ParseSpecLib(params_path::String)

Main function to build a Pioneer-formatted spectral library from an empirical library.

Parameters:

  • params_path: Path to JSON configuration file containing library building parameters

Output:

  • Generates a spectral library in the specified output directory
  • Creates a detailed log file with timing and performance metrics
  • Returns the built spectral library

Pioneer.convertMzML - Function
convertMzML(mzml_dir::String; skip_scan_header::Bool=true)

Convert mzML mass spectrometry data files to Arrow IPC format.

Takes either a directory containing mzML files or a path to a single mzML file and converts them to Arrow format, preserving scan data including m/z arrays, intensity arrays, and scan metadata.

Arguments

  • mzml_dir::String: Path to either a directory containing mzML files or a path to a single mzML file
  • skip_scan_header::Bool=true: When true, omits scan header information from the output to reduce file size

Returns

nothing

Output

Creates Arrow (.arrow) files in the same directory as the input mzML files and with the same base filename.

Examples

# Convert all mzML files in a directory
convertMzML("path/to/mzml/files")

# Convert a single mzML file
convertMzML("path/to/single/file.mzML")

# Include scan headers in output
convertMzML("path/to/mzml/files", skip_scan_header=false)

Notes

Each mzML file is converted to a corresponding Arrow IPC (.arrow) file in the same directory. This is particularly useful for Sciex data, where direct .wiff/.wiff2 conversion is not supported.
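
Since each output keeps the input's directory and base filename, the expected output paths can be computed ahead of time. The following sketch uses hypothetical helper names (arrow_path, planned_conversions) and only mirrors the naming rule described above; it does not call or reproduce convertMzML itself.

```julia
# Output path implied by the naming rule: same directory, same base name,
# with the extension replaced by .arrow.
arrow_path(mzml_path::AbstractString) = splitext(mzml_path)[1] * ".arrow"

# For a directory, pair every .mzML file with its expected .arrow path;
# for a single file path, return just that one pair.
function planned_conversions(input::AbstractString)
    files = if isdir(input)
        filter(p -> endswith(lowercase(p), ".mzml"), readdir(input; join = true))
    else
        [input]
    end
    return [f => arrow_path(f) for f in files]
end
```

Comparing these planned paths against the files on disk is one way to see which conversions have already been done.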
