Introduction

Pioneer and its companion tool Altimeter are an open-source, performant solution for analyzing protein MS data acquired by data-independent acquisition (DIA). Pioneer includes routines for searching DIA experiments from Thermo and Sciex instruments and for building spectral libraries using the Koina interface. Given a spectral library of precursor fragment ion intensities and retention time estimates, Pioneer identifies and quantifies peptides from the library in the data.

Key Features

  • Isotope-Aware DIA Analysis: Narrow isolation windows distort fragment ion isotope distributions because the quadrupole partially transmits precursor isotopic envelopes. Pioneer addresses this by estimating a quadrupole transmission efficiency function for each scan and re-isotoping library spectra accordingly, using methods from Goldfarb et al. This correction is critical for accurate matching and quantification in narrow-window DIA.

  • Altimeter: Collision Energy-Independent Spectral Libraries: Altimeter predicts coefficients for B-splines that model total rather than monoisotopic fragment ion intensities as a function of normalized collision energy (NCE). Evaluating the splines at a given NCE produces a complete spectrum, so a single library works across different instruments and acquisition settings. Pioneer calibrates the optimal NCE per data file automatically.

  • Intensity-Aware Fragment Index: Pioneer implements a fast fragment index search inspired by MSFragger and Sage. Pioneer's implementation uniquely leverages accurate fragment intensity predictions from in silico libraries—indexing only the highest-ranked fragments—to improve both speed and specificity of candidate identification.

  • Spectral Deconvolution with Robust Regression: Pioneer explains each observed mass spectrum as a linear combination of template spectra from the library. To reduce quantitative bias from interfering signals in chimeric spectra, Pioneer minimizes the pseudo-Huber loss rather than squared error. For other examples of linear regression applied to DIA analyses, see Specter and Chimerys.

  • Dual-Window Quantification: In narrow-window DIA, a precursor's isotopic envelope is split across adjacent windows. Pioneer normalizes quantification by the isolated precursor fraction and combines signal from adjacent windows for denser chromatographic sampling and improved quantitative accuracy.

  • Match Between Runs: Pioneer transfers peptide identifications across runs with false transfer rate (FTR) control, increasing coverage in large-scale experiments.

  • Spectral Library Prediction via Koina: Using Koina, Pioneer constructs fully predicted spectral libraries given only a FASTA file and an internet connection. Pioneer uses Chronologer for retention time prediction and Altimeter for fragment ion intensity prediction.
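
To illustrate the robust regression idea from the Spectral Deconvolution feature, the following is a minimal sketch of the pseudo-Huber loss applied to a toy deconvolution problem. It is not Pioneer's solver: the function names, the matrix layout (one column per template spectrum), and the scale parameter `delta` are assumptions made for illustration.

```julia
using LinearAlgebra

# Pseudo-Huber loss for a residual r and scale delta > 0:
#   L(r) = delta^2 * (sqrt(1 + (r / delta)^2) - 1)
# It behaves like r^2 / 2 for small residuals and like delta * |r| for large ones.
pseudo_huber(r::Real, delta::Real) = delta^2 * (sqrt(1 + (r / delta)^2) - 1)

# Total loss of a candidate fit, observed ≈ templates * weights.
# `templates` holds one column per library spectrum (hypothetical toy layout).
function deconvolution_loss(templates::AbstractMatrix, weights::AbstractVector,
                            observed::AbstractVector, delta::Real)
    residuals = templates * weights .- observed
    return sum(r -> pseudo_huber(r, delta), residuals)
end

# Toy example: two template spectra over three fragment bins.
templates = [1.0 0.0;
             0.5 0.5;
             0.0 1.0]
observed = [2.0, 2.5, 3.0]   # exactly 2 * template 1 + 3 * template 2
perfect = deconvolution_loss(templates, [2.0, 3.0], observed, 1.0)

# A large residual from an interfering peak is penalized roughly linearly,
# whereas squared error penalizes it quadratically.
outlier = 100.0
squared = outlier^2 / 2
robust = pseudo_huber(outlier, 1.0)
```

Because the penalty grows only linearly for large residuals, a single interfering peak pulls the fitted weights far less than it would under squared error.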

Performance

  • Speed: 2–6x faster than DIA-NN and AlphaDIA on benchmark datasets
  • FDR Control: Conservative false discovery rate control validated by entrapment analysis
  • Scalability: Memory consumption remains constant as the number of raw files grows, scaling to experiments with hundreds of runs

Current Limitations

  • Variable modifications: Only oxidation of methionine (Unimod:35) is currently supported as a variable PTM
  • Digestion: Fully enzymatic digestion only (no semi-enzymatic or non-specific searches)
  • Interface: Command-line only; no graphical user interface yet

Authors and Development

Pioneer is developed and maintained by:

Citation

If you use Pioneer or Altimeter in your research, please cite:

Wamsley, N. T., Wilkerson, E. M., Major, M., & Goldfarb, D. "Pioneer and Altimeter: Fast Analysis of DIA Proteomics Data Optimized for Narrow Isolation Windows." bioRxiv (2025). DOI: [forthcoming]

Contact

For questions about Pioneer or to collaborate, please contact:

  • Nathan Wamsley (wamsleynathan@gmail.com)
  • Dennis Goldfarb (dennis.goldfarb@wustl.edu)

For troubleshooting, use the Issues page on GitHub. To critique methods or propose features, use the Discussions page.

Exported Methods

Pioneer.SearchDIA - Function
SearchDIA(params_path::String)

Main entry point for the DIA (Data-Independent Acquisition) search workflow. Executes a series of SearchMethods and generates performance metrics.

Parameters:

  • params_path: Path to JSON configuration file containing search parameters

Output:

  • Generates a log file in the results directory
  • Long- and wide-formatted tables (.arrow and .tsv) with protein-group and precursor-level identifications and quantitation
  • Reports timing and memory usage statistics

Example:

julia> SearchDIA("/path/to/config.json")
==========================================================================================
Starting SearchDIA
==========================================================================================

Starting search at: 2024-12-30T14:01:01.510
Output directory: ./../data/ecoli_test/ecoli_test_results
[ Info: Loading Parameters...
[ Info: Loading Spectral Library...
 .
 .
 .

If it does not already exist, SearchDIA creates the user-specified results_dir and generates quality control plots, data tables, and logs.

results_dir/
├── pioneer_search_log.txt
├── qc_plots/
│   ├── collision_energy_alignment/
│   │   └── nce_alignment_plots.pdf
│   ├── quad_transmission_model/
│   │   ├── quad_data/
│   │   │   └── quad_data_plots.pdf
│   │   └── quad_models/
│   │       └── quad_model_plots.pdf
│   ├── rt_alignment_plots/
│   │   └── rt_alignment_plots.pdf
│   ├── mass_error_plots/
│   │   └── mass_error_plots.pdf
│   └── QC_PLOTS.pdf
├── precursors_long.arrow
├── precursors_long.tsv
├── precursors_wide.arrow
├── precursors_wide.tsv
├── protein_groups_long.arrow
├── protein_groups_long.tsv
├── protein_groups_wide.arrow
└── protein_groups_wide.tsv
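
After a search finishes, it can be useful to confirm that the result tables were written. The sketch below uses only the Julia standard library and checks for the table names shown in the listing above (assuming consistent spelling of the file names). `expected_tables` and `missing_tables` are hypothetical helpers, not part of Pioneer.

```julia
# Table file names taken from the results_dir listing.
expected_tables = [
    "precursors_long.arrow", "precursors_long.tsv",
    "precursors_wide.arrow", "precursors_wide.tsv",
    "protein_groups_long.arrow", "protein_groups_long.tsv",
    "protein_groups_wide.arrow", "protein_groups_wide.tsv",
]

# Return the expected tables that are not present in `results_dir`.
# Hypothetical helper for illustration only.
function missing_tables(results_dir::AbstractString; tables=expected_tables)
    return [f for f in tables if !isfile(joinpath(results_dir, f))]
end
```

An empty return value from `missing_tables("/path/to/results/dir")` would indicate that every listed table exists.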
Pioneer.GetSearchParams - Function
GetSearchParams(lib_path::String, ms_data_path::String, results_path::String; 
               params_path::Union{String, Missing} = missing,
               simplified::Bool = true)

Creates a search parameter configuration file with user-specified paths.

The function loads default parameters from either the simplified or full JSON template (from assets/example_config/) and creates a customized parameter file with the user's file paths. All other parameters retain their default values and can be modified later.

Arguments:

  • lib_path: Path to the spectral library file (.poin)
  • ms_data_path: Path to the MS data directory
  • results_path: Path where search results will be stored
  • params_path: Output path for the parameter file. Can be a directory (creates search_parameters.json) or a full file path. Defaults to "search_parameters.json" in the current directory.
  • simplified: If true (default), uses simplified template with essential parameters only. If false, uses full template with all advanced options.

Returns:

  • String: Path to the newly created search parameters file

Templates used:

  • Simplified: defaultSearchParamsSimplified.json (basic parameters)
  • Full: defaultSearchParams.json (all advanced parameters)

Example:

# Create simplified parameter file
output_path = GetSearchParams(
    "/path/to/speclib.poin",
    "/path/to/ms/data/dir", 
    "/path/to/results/dir"
)

# Create full parameter file with custom output location
output_path = GetSearchParams(
    "/path/to/speclib.poin",
    "/path/to/ms/data/dir",
    "/path/to/results/dir";
    params_path = "/custom/path/my_params.json",
    simplified = false
)
Pioneer.BuildSpecLib - Function
BuildSpecLib(params_path::String)

Main function to build a spectral library from parameters. Executes a series of steps:

  1. Parameter validation and directory setup
  2. Fragment bound detection
  3. Retention time prediction (optional)
  4. Fragment prediction (optional)
  5. Library index building

Parameters:

  • params_path: Path to JSON configuration file containing library building parameters

Output:

  • Generates a spectral library in the specified output directory
  • Creates a detailed log file with timing and performance metrics
  • Returns nothing
Pioneer.GetBuildLibParams - Function
GetBuildLibParams(out_dir::String, lib_name::String, fasta_inputs; 
                 params_path::Union{String, Missing} = missing,
                 regex_codes::Union{Missing, Dict, Vector} = missing,
                 simplified::Bool = true)

Creates a library building parameter configuration file with user-specified paths and FASTA files.

The function loads default parameters from either the simplified or full JSON template (from assets/example_config/) and creates a customized parameter file with the user's paths and automatically discovered FASTA files. All other parameters retain their default values and can be modified later.

Arguments:

  • out_dir: Output directory path where the library will be built
  • lib_name: Name for the spectral library (used for directory and file naming)
  • fasta_inputs: FASTA file specification. Can be:
    • A single directory path (String) - searches for .fasta/.fasta.gz files
    • A single FASTA file path (String)
    • An array of directories and/or FASTA file paths
  • params_path: Output path for the parameter file. Can be a directory (creates buildspeclib_params.json) or a full file path. Defaults to "buildspeclib_params.json" in the current directory.
  • regex_codes: Optional FASTA header regex patterns for protein annotation extraction. Can be:
    • A single Dict with keys: "accessions", "genes", "proteins", "organisms" (applied to all FASTA files)
    • A Vector of Dicts for positional mapping to fasta_inputs
    • If missing, uses default patterns from the template
  • simplified: If true (default), uses simplified template with essential parameters only. If false, uses full template with all advanced library building options.

Returns:

  • String: Path to the newly created library building parameters file

Templates used:

  • Simplified: defaultBuildLibParamsSimplified.json (basic parameters)
  • Full: defaultBuildLibParams.json (all advanced parameters)

The function automatically:

  • Discovers FASTA files in specified directories
  • Generates appropriate library names from FASTA filenames
  • Expands regex patterns to match the number of FASTA files found
  • Validates that all specified paths exist and are accessible
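
The discovery and expansion steps listed above can be sketched as follows. This is an illustrative approximation using only the Julia standard library, not Pioneer's implementation: `is_fasta` and `discover_fastas` are hypothetical names, and the rule that a Dict given for a directory applies to every FASTA file found in that directory is an assumption.

```julia
# True for paths ending in .fasta or .fasta.gz (case-insensitive).
is_fasta(path::AbstractString) =
    endswith(lowercase(path), ".fasta") || endswith(lowercase(path), ".fasta.gz")

# Hypothetical helper: expand directories and files into a flat list of FASTA
# paths, and expand the regex codes to one Dict per discovered file.
function discover_fastas(fasta_inputs, regex_codes::Union{AbstractDict, AbstractVector})
    entries = fasta_inputs isa AbstractString ? [fasta_inputs] : collect(fasta_inputs)
    # A single Dict applies to every input; a Vector maps positionally.
    codes = regex_codes isa AbstractDict ? fill(regex_codes, length(entries)) : regex_codes
    length(codes) == length(entries) ||
        error("expected one regex Dict per FASTA input")
    files = String[]
    file_codes = AbstractDict[]
    for (entry, code) in zip(entries, codes)
        if isdir(entry)
            paths = filter(is_fasta, readdir(entry; join=true))
        elseif isfile(entry) && is_fasta(entry)
            paths = [entry]
        else
            error("not a FASTA file or directory: $entry")
        end
        append!(files, paths)
        # Assumption: files found in a directory share that directory's Dict.
        append!(file_codes, fill(code, length(paths)))
    end
    return files, file_codes
end
```

Calling `discover_fastas("/path/to/fasta/directory", Dict("accessions" => raw"^sp\|(\w+)\|"))` would return the FASTA paths in that directory together with one copy of the Dict per file.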

Example:

# Create simplified parameter file with directory of FASTA files
output_path = GetBuildLibParams(
    "/path/to/output", 
    "my_library",
    "/path/to/fasta/directory"
)

# Create full parameter file with specific FASTA files and custom regex
output_path = GetBuildLibParams(
    "/path/to/output",
    "my_library", 
    ["/path/to/human.fasta", "/path/to/yeast.fasta"];
    params_path = "/custom/path/build_params.json",
    regex_codes = Dict("accessions" => raw"^sp\|(\w+)\|", "genes" => raw" GN=(\S+)"),
    simplified = false
)
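
Before building a library, it may help to try header patterns against a sample FASTA header. The sketch below uses Julia's built-in `Regex` and `match`. The patterns are written as raw strings so that backslashes reach the regex engine unchanged. The header is a hypothetical UniProt-style line shown without the leading `>`, which is an assumption about how the patterns are applied, and `first_capture` is an illustrative helper rather than a Pioneer function.

```julia
# Header patterns as raw strings.
regex_codes = Dict(
    "accessions" => raw"^sp\|(\w+)\|",
    "genes"      => raw" GN=(\S+)",
)

# Hypothetical UniProt-style header, without the leading '>' (assumed).
header = "sp|P00533|EGFR_HUMAN Epidermal growth factor receptor OS=Homo sapiens OX=9606 GN=EGFR PE=1 SV=2"

# Return the first capture group of `pattern` in `header`, or `missing`
# when the pattern does not match.
function first_capture(pattern::AbstractString, header::AbstractString)
    m = match(Regex(pattern), header)
    return m === nothing ? missing : String(m.captures[1])
end

accession = first_capture(regex_codes["accessions"], header)
gene = first_capture(regex_codes["genes"], header)
```

A `missing` result signals that a pattern does not fit the header format of a given FASTA file and should be adjusted before the library build.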
Pioneer.convertMzML - Function
convertMzML(mzml_path::String; skip_scan_header::Bool=true, output_dir::String="", skip_existing::Bool=false, concurrent_files::Int=2)

Convert mzML mass spectrometry data files to Arrow IPC format.

Takes either a directory containing mzML files or a path to a single mzML file and converts them to Arrow format, preserving scan data including m/z arrays, intensity arrays, and scan metadata.

Arguments

  • mzml_path::String: Path to either a directory containing mzML files or a path to a single mzML file
  • skip_scan_header::Bool=true: When true, omits scan header information from the output to reduce file size
  • output_dir::String="": Output directory for .arrow files. Defaults to <input_dir>/arrow_out
  • skip_existing::Bool=false: When true, skip files whose existing .arrow output appears complete
  • concurrent_files::Int=2: Number of mzML files to convert at the same time

Returns

nothing

Output

Creates Arrow (.arrow) files in the requested output directory with the same base filename as the input mzML files.

Requirements

Only centroided mzML is supported. Files containing profile-mode spectra are skipped during conversion.
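
Since profile-mode files are skipped, a quick pre-check can flag them before conversion. The sketch below is a rough heuristic, not part of Pioneer: it scans an uncompressed mzML file as text for the PSI-MS controlled vocabulary accession MS:1000128 ("profile spectrum"). `has_profile_spectra` is a hypothetical helper, and the approach assumes the accession appears literally in the file.

```julia
# Hypothetical pre-check: report whether an uncompressed mzML file mentions
# the PSI-MS accession for profile spectra anywhere in its text.
# MS:1000127 is "centroid spectrum"; MS:1000128 is "profile spectrum".
function has_profile_spectra(mzml_path::AbstractString; profile_cv::AbstractString="MS:1000128")
    for line in eachline(mzml_path)
        occursin(profile_cv, line) && return true
    end
    return false
end
```

A file for which `has_profile_spectra` returns `true` would likely need to be centroided (for example, by peak picking during vendor-to-mzML conversion) before calling `convertMzML`.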

Examples

# Convert all mzML files in a directory
convertMzML("path/to/mzml/files")

# Convert a single mzML file
convertMzML("path/to/single/file.mzML")

# Convert to a custom output directory
convertMzML("path/to/mzml/files", output_dir="path/to/arrow_out")

# Skip files that already have complete Arrow outputs
convertMzML("path/to/mzml/files", skip_existing=true)

# Include scan headers in output
convertMzML("path/to/mzml/files", skip_scan_header=false)

Notes

Each mzML file is converted to a corresponding Arrow IPC (.arrow) file in the output directory. This is particularly useful for Sciex data, where direct .wiff/.wiff2 conversion is not supported.
