Re-implement spooker in python #71

Open: wants to merge 50 commits into base: main

Conversation

kelly-sovacool
Member

@kelly-sovacool kelly-sovacool commented May 6, 2025

Changes

  • spooker has the same usage, with important changes:
    • re-implemented in Python
    • writes all metadata to a single gzipped JSON file rather than bundling everything as a tar archive
  • jobby now optionally includes job out/err log files (see --outerr and --include-completed)
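A minimal sketch of the "single gzipped JSON file" idea, using only the standard library. The function names `write_gzipped_json` and `read_gzipped_json` are illustrative and are not taken from the spooker source.

```python
import gzip
import json
from pathlib import Path


def write_gzipped_json(data: dict, path: Path) -> None:
    """Serialize `data` as JSON and gzip-compress it into one file."""
    # "wt" opens the gzip stream in text mode so json.dump can write str.
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        json.dump(data, fh, indent=2)


def read_gzipped_json(path: Path) -> dict:
    """Load a JSON document back from a gzip-compressed file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)
```

Compared with a tar archive, a single compressed JSON document can be loaded in one call without unpacking anything to disk.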

Issues

PR Checklist

(Strikethrough any points that are not applicable.)

  • This comment contains a description of changes with justifications, with any relevant issues linked.
  • Write unit tests for any new features, bug fixes, or other code changes.
  • Update docs if there are any API changes.
  • Update CHANGELOG.md with a short description of any user-facing changes and reference the PR number. Guidelines: https://keepachangelog.com/en/1.1.0/


codecov bot commented May 6, 2025

Codecov Report

Attention: Patch coverage is 77.23577% with 56 lines in your changes missing coverage. Please review.

Project coverage is 75.16%. Comparing base (3713f8d) to head (2cb9235).

Files with missing lines              Patch %   Lines
src/ccbr_tools/jobby.py               61.33%    29 Missing ⚠️
src/ccbr_tools/paths.py               62.50%    15 Missing ⚠️
src/ccbr_tools/pipeline/__init__.py   79.59%    10 Missing ⚠️
src/ccbr_tools/spooker.py             96.22%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #71      +/-   ##
==========================================
+ Coverage   73.79%   75.16%   +1.37%     
==========================================
  Files          22       25       +3     
  Lines        1595     1784     +189     
==========================================
+ Hits         1177     1341     +164     
- Misses        418      443      +25     

@kelly-sovacool
Member Author

just use tree to determine number of files & directory size instead of du

@kelly-sovacool
Member Author

for all master & slurm jobs that did not complete, get the log out/err files and include the text in the json

@kopardev
Member

> for all master & slurm jobs that did not complete, get the log out/err files and include the text in the json

Actually, if you get the Slurm job IDs with status != COMPLETED, we can get the paths of the .err and .out files from the tree itself, so there is no need to glob.
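A rough sketch of looking up log paths in the parsed `tree -J` output rather than globbing. It assumes each log file name contains the job ID and ends in `.out` or `.err` (for example `slurm-12345.out`); that naming is an assumption, and `iter_file_paths` and `find_job_logs` are hypothetical helpers.

```python
from typing import Iterator


def iter_file_paths(node: dict, parent: str = "") -> Iterator[str]:
    """Yield the full path of every file below a `tree -J` node."""
    name = node.get("name", "")
    path = f"{parent}/{name}" if parent else name
    if node.get("type") == "file":
        yield path
    # Directories list their children under "contents"; files have none.
    for child in node.get("contents", []):
        yield from iter_file_paths(child, path)


def find_job_logs(tree: list, job_id: str) -> dict:
    """Map 'out'/'err' to the first matching log path for `job_id`."""
    logs = {}
    for top in tree:
        for path in iter_file_paths(top):
            filename = path.rsplit("/", 1)[-1]
            for ext in ("out", "err"):
                # Assumed naming: job ID appears in the file name.
                if job_id in filename and filename.endswith(f".{ext}"):
                    logs.setdefault(ext, path)
    return logs
```

Plain substring matching would also match a longer ID that contains `job_id`, so a stricter pattern may be needed in practice.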

@kopardev
Member

  • add argparse or click for better argument parsing.
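One way the argparse option might look. The flag names `--outdir`, `--name`, and `--version` are hypothetical; they are chosen here only because the pipeline name, version, and output directory are described in this thread as parsed inputs.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a command-line parser; all flag names are illustrative."""
    parser = argparse.ArgumentParser(
        prog="spooker",
        description="Collect pipeline run metadata into one JSON file.",
    )
    parser.add_argument("--outdir", required=True, help="pipeline output directory")
    parser.add_argument("--name", required=True, help="pipeline name")
    parser.add_argument("--version", required=True, help="pipeline version")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.outdir, args.name, args.version)
```

The same interface could be written with click decorators; argparse is shown only because it needs no extra dependency.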

@kopardev
Member

kopardev commented May 12, 2025

@kelly-sovacool The output should be a nested JSON file:

{
  "pipeline_metadata": {
    "pipeline_name": "XYZ (parsed as input)",
    "pipeline_path": "/path/to/pipeline (how are we getting this?)",
    "pipeline_outdir": "/path/to/output (parsed as input)",
    "pipeline_outdir_size": "(from tree JSON; look for type:report)",
    "pipeline_version": "1.0.0 (parsed as input)",
    "user": "user_name (from os.environ['USER'])",
    "groups": "group1 group2 (from `groups` command)",
    "date": "2025-05-12T15:37:48 (ISO 8601 format)",
    "nsamples": "(lookup via pipeline_regex.JSON and apply regex)"
  },
  "jobby": {
    "example_key": "example_value (output from `jobby --json`)"
  },
  "outdir_tree": {
    "example_tree": "output of `tree -J` on the output directory"
  },
  "master_job_log": {
    "txt": ""
  },
  "failed_jobs": {
    "12345": {
      "logfilepath": "/path/to/logfile (derived from tree)",
      "logfiletxt": "Content of log file here",
      "errfilepath": "/path/to/errfile (derived from tree)",
      "errfiletxt": "Content of error file here"
    }
  }
}

We can then gzip this JSON and move it to the user-staging folder under /data/CCBR_Pipeliner.
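As a sketch, some of the `pipeline_metadata` fields in the proposal above could be gathered like this. `collect_metadata` and `get_groups` are hypothetical names, and the fallback value `"unknown"` for a missing `USER` variable is an assumption rather than part of the proposal.

```python
import os
import subprocess
from datetime import datetime


def get_groups() -> str:
    """Return the output of the `groups` command, or '' if unavailable."""
    try:
        result = subprocess.run(
            ["groups"], capture_output=True, text=True, check=False
        )
    except OSError:
        # The command may not exist on every system.
        return ""
    return result.stdout.strip()


def collect_metadata(name: str, version: str, outdir: str) -> dict:
    """Assemble part of the pipeline_metadata block."""
    return {
        "pipeline_name": name,
        "pipeline_outdir": outdir,
        "pipeline_version": version,
        # Assumption: fall back to "unknown" when USER is not set.
        "user": os.environ.get("USER", "unknown"),
        "groups": get_groups(),
        # ISO 8601 timestamp with second precision.
        "date": datetime.now().isoformat(timespec="seconds"),
    }
```

The size and sample-count fields are left out here because they depend on the tree output and on pipeline-specific rules.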

@kopardev
Member

@kelly-sovacool add --du to tree ... something like tree -Ja --du <dirpath>
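A sketch of running that command and reading the totals from its JSON. It assumes the output is a list whose last entry has `"type": "report"` and carries `size`, `files`, and `directories` keys when `--du` is given; the exact keys may differ between tree versions. `run_tree` and `summarize_tree` are hypothetical names.

```python
import json
import subprocess


def run_tree(dirpath: str) -> list:
    """Run `tree -Ja --du` on a directory and parse its JSON output."""
    result = subprocess.run(
        ["tree", "-Ja", "--du", dirpath],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


def summarize_tree(tree: list) -> dict:
    """Pull size and counts from the entry whose type is 'report'."""
    report = next((node for node in tree if node.get("type") == "report"), {})
    return {
        "size": report.get("size"),  # assumed present only with --du
        "files": report.get("files"),
        "directories": report.get("directories"),
    }
```

Keeping the parsing separate from the subprocess call makes it possible to test `summarize_tree` on saved JSON without tree installed.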

@kelly-sovacool
Member Author

kelly-sovacool commented May 14, 2025

Current version is working. Need to test it on a run that had some jobs fail. Also need to test on nextflow.

@kelly-sovacool
Member Author

kelly-sovacool commented May 14, 2025

Example output json (with tree, jobby, & master job log omitted for brevity)

{
    "outdir_tree": "...",
    "pipeline_metadata": {
        "pipeline_name": "RENEE",
        "pipeline_path": "/data/CCBR_Pipeliner/Pipelines/RENEE/renee-dev-sovacool",
        "pipeline_outdir": "/data/sovacoolkl/renee_test_hg38_48",
        "pipeline_outdir_size": 2840578746,
        "pipeline_version": "v2.6.7-dev",
        "ccbrpipeliner_module": "unknown",
        "user": "sovacoolkl",
        "uid": "60731",
        "groups": "CCBR CCBR_Pipeliner SCLCgenomics Ziegelbauer_lab sovacoolkl NCI-workbench-users SCLC_scRNA",
        "date": "2025-05-14T17-52-37",
        "nsamples": 4
    },
    "jobby": "...",
    "master_job_log": {
        "txt": "..."
    },
    "failed_jobs": {}
}
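The `nsamples` value is described earlier in this thread as coming from a per-pipeline regex lookup. A sketch of that idea follows; the table entry, the pipeline name `EXAMPLE_PIPELINE`, and the file-name pattern are all invented for illustration and do not reflect the real lookup table.

```python
import re

# Hypothetical lookup table: pipeline name -> regex with a "sample" group.
SAMPLE_PATTERNS = {
    "EXAMPLE_PIPELINE": r"(?P<sample>[^/]+)\.R1\.fastq\.gz$",
}


def count_samples(pipeline_name: str, file_paths: list) -> int:
    """Count distinct sample names matched by the pipeline's regex."""
    pattern = SAMPLE_PATTERNS.get(pipeline_name.upper())
    if pattern is None:
        # No pattern registered for this pipeline.
        return 0
    regex = re.compile(pattern)
    samples = set()
    for path in file_paths:
        match = regex.search(path)
        if match:
            samples.add(match.group("sample"))
    return len(samples)
```

The file paths could come from the same tree output that is already stored in the JSON, so no extra directory walk would be needed.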

@kelly-sovacool
Member Author

kelly-sovacool commented May 15, 2025

For champagne. I added job name, state, & exit code to the failed jobs dict for better context.

{
    "outdir_tree": "...",
    "pipeline_metadata": {
        "pipeline_name": "CHAMPAGNE",
        "pipeline_path": "/data/CCBR_Pipeliner/Pipelines/CHAMPAGNE/champagne-dev-sovacool",
        "pipeline_outdir": "/data/CCBR_Pipeliner/Pipelines/CHAMPAGNE/champagne-dev-sovacool",
        "pipeline_outdir_size": 370213210424,
        "pipeline_version": "v0.4.1-dev",
        "ccbrpipeliner_module": "unknown",
        "user": "sovacoolkl",
        "uid": "60731",
        "groups": "CCBR CCBR_Pipeliner SCLCgenomics Ziegelbauer_lab sovacoolkl NCI-workbench-users SCLC_scRNA",
        "date": "2025-05-15T09-33-30",
        "nsamples": 0
    },
    "jobby": "...",
    "master_job_log": {
        "txt": "..."
    },
    "failed_jobs": {
        "57140712": {
            "JobName": "nf-CHIPSEQ_QC_PRESEQ_(SPT5_T0_1)",
            "JobState": "FAILED",
            "ExitCode": 11,
            "log_out_path": "/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHAMPAGNE/champagne-dev-sovacool/work/0f/2dec15cccbbe8eb2982f720e37cda5/.command.out",
            "log_out_txt": "",
            "log_err_path": "/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHAMPAGNE/champagne-dev-sovacool/work/0f/2dec15cccbbe8eb2982f720e37cda5/.command.err",
            "log_err_txt": "WARNING: Not virtualizing pid namespace by configuration\n/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHAMPAGNE/champagne-dev-sovacool/work/0f/2dec15cccbbe8eb2982f720e37cda5/.command.sh: line 3: 3058827 Segmentation fault      (core dumped) preseq lc_extrap -B -D -o SPT5_T0_1.lc_extrap.txt SPT5_T0_1.filtered.bam -seed 12345 -v -l 100000000000 2> SPT5_T0_1.preseq.log\n"
        },
        "57140974": {
            "JobName": "nf-CHIPSEQ_QC_PRESEQ_(SPT5_T0_2)",
            "JobState": "FAILED",
            "ExitCode": 11,
            "log_out_path": "/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHAMPAGNE/champagne-dev-sovacool/work/3a/2913ebefce8ee7fe2c953d7c8b4f3e/.command.out",
            "log_out_txt": "",
            "log_err_path": "/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHAMPAGNE/champagne-dev-sovacool/work/3a/2913ebefce8ee7fe2c953d7c8b4f3e/.command.err",
            "log_err_txt": "WARNING: Not virtualizing pid namespace by configuration\n/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHAMPAGNE/champagne-dev-sovacool/work/3a/2913ebefce8ee7fe2c953d7c8b4f3e/.command.sh: line 3: 1580954 Segmentation fault      (core dumped) preseq lc_extrap -B -D -o SPT5_T0_2.lc_extrap.txt SPT5_T0_2.filtered.bam -seed 12345 -v -l 100000000000 2> SPT5_T0_2.preseq.log\n"
        },
        "57141181": {
            "JobName": "nf-CHIPSEQ_QC_PRESEQ_(SPT5_INPUT)",
            "JobState": "FAILED",
            "ExitCode": 1,
            "log_out_path": "/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHAMPAGNE/champagne-dev-sovacool/work/f7/824c11f58831bea0eea62183897593/.command.out",
            "log_out_txt": "",
            "log_err_path": "/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHAMPAGNE/champagne-dev-sovacool/work/f7/824c11f58831bea0eea62183897593/.command.err",
            "log_err_txt": "WARNING: Not virtualizing pid namespace by configuration\n"
        },
        "57142656": {
            "JobName": "nf-CHIPSEQ_PHANTOM_PEAKS_(SPT5_INPUT)",
            "JobState": "FAILED",
            "ExitCode": 1,
            "log_out_path": "/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHAMPAGNE/champagne-dev-sovacool/work/8f/e3f6d397128b65cb34864742b24737/.command.out",
            "log_out_txt": "################\nChIP data: SPT5_INPUT.filtered.dedup.sort.f66.bam \nControl data: NA \nstrandshift(min): -500 \nstrandshift(step): 5 \nstrandshift(max) 1500 \nuser-defined peak shift NA \nexclusion(min): 10 \nexclusion(max): NaN \nnum parallel nodes: NA \nFDR threshold: 0.01 \nNumPeaks Threshold: NA \nOutput Directory: . \nnarrowPeak output file name: NA \nregionPeak output file name: NA \nRdata filename: NA \nplot pdf filename: SPT5_INPUT.ppqt.pdf \nresult filename: SPT5_INPUT.spp.out \nOverwrite files?: FALSE\n\n[1] TRUE\nReading ChIP tagAlign/BAM file SPT5_INPUT.filtered.dedup.sort.f66.bam \nopened /tmp/RtmpCMj6B7/SPT5_INPUT.filtered.dedup.sort.f66.tagAlign1b57d369129cea\ndone. read 0 fragments\n",
            "log_err_path": "/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHAMPAGNE/champagne-dev-sovacool/work/8f/e3f6d397128b65cb34864742b24737/.command.err",
            "log_err_txt": "WARNING: Not virtualizing pid namespace by configuration\nLoading required package: Rcpp\nError in read.table(bam2align.filename, nrows = 500) : \n  no lines available in input\nCalls: read.align -> read.table\nExecution halted\n"
        },
        "57143272": {
            "JobName": "nf-CHIPSEQ_CALL_PEAKS_GEM_(SPT5_T0_1)",
            "JobState": "FAILED",
            "ExitCode": 1,
            "log_out_path": "/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHAMPAGNE/champagne-dev-sovacool/work/46/de0f89421c920f106e09570fbe4b65/.command.out",
            "log_out_txt": "\nGEM (version 3.4)!\n\nPlease cite: \nYuchun Guo, Shaun Mahony, David K. Gifford (2012) PLoS Computational Biology 8(8): e1002638. \nHigh Resolution Genome Wide Binding Event Finding and Motif Discovery Reveals Transcription Factor Spatial Binding Constraints. \ndoi:10.1371/journal.pcbi.1002638\n\nGifford Laboratory at MIT (http://cgs.csail.mit.edu/gem/).\n\n----------------------------------\n\nStart time: 2025/05/14 18:16:14\n\nLoading data...\n    Loading reads from: SPT5_T0_1.filtered.dedup.sort.bam ... Loaded\n    Loading reads from: SPT5_INPUT.filtered.dedup.sort.bam ... Loaded\n",
            "log_err_path": "/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHAMPAGNE/champagne-dev-sovacool/work/46/de0f89421c920f106e09570fbe4b65/.command.err",
            "log_err_txt": "WARNING: Not virtualizing pid namespace by configuration\nException in thread \"main\" java.lang.NullPointerException\n\tat edu.mit.csail.cgs.deepseq.utilities.AlignmentFileReader.populateArrays(AlignmentFileReader.java:260)\n\tat edu.mit.csail.cgs.deepseq.utilities.SAMReader.countReads(SAMReader.java:86)\n\tat edu.mit.csail.cgs.deepseq.utilities.AlignmentFileReader.getTotalHits(AlignmentFileReader.java:310)\n\tat edu.mit.csail.cgs.deepseq.utilities.FileReadLoader.<init>(FileReadLoader.java:147)\n\tat edu.mit.csail.cgs.deepseq.DeepSeqExpt.<init>(DeepSeqExpt.java:84)\n\tat edu.mit.csail.cgs.deepseq.DeepSeqExpt.<init>(DeepSeqExpt.java:78)\n\tat edu.mit.csail.cgs.deepseq.discovery.GEM.<init>(GEM.java:127)\n\tat edu.mit.csail.cgs.deepseq.discovery.GEM.main(GEM.java:343)\n"
        }
    }
}
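Entries shaped like the `failed_jobs` block above could be assembled roughly as follows. The input record keys `JobId`, `std_out`, and `std_err` are assumptions about what a jobby record might contain, and `collect_failed_jobs` is a hypothetical helper, not the code in this PR.

```python
from pathlib import Path
from typing import Optional


def read_text_or_empty(path: Optional[str]) -> str:
    """Return a file's text, or '' if the path is missing or unreadable."""
    if not path:
        return ""
    try:
        return Path(path).read_text(errors="replace")
    except OSError:
        return ""


def collect_failed_jobs(jobs: list) -> dict:
    """Build a failed_jobs mapping keyed by job ID."""
    failed = {}
    for job in jobs:
        if job.get("JobState") == "COMPLETED":
            continue
        out_path = job.get("std_out", "")  # assumed key name
        err_path = job.get("std_err", "")  # assumed key name
        failed[str(job["JobId"])] = {
            "JobName": job.get("JobName"),
            "JobState": job.get("JobState"),
            "ExitCode": job.get("ExitCode"),
            "log_out_path": out_path,
            "log_out_txt": read_text_or_empty(out_path),
            "log_err_path": err_path,
            "log_err_txt": read_text_or_empty(err_path),
        }
    return failed
```

Unreadable or missing log files yield empty strings here, so one bad path does not abort the whole report.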

@kelly-sovacool kelly-sovacool marked this pull request as ready for review May 16, 2025 13:31
Successfully merging this pull request may close these issues.

  • spooker: lookup table of regexes to determine nsamples for each pipeline
  • refactor spooker cli
2 participants