Re-implement spooker in python #71

kelly-sovacool · 2025-05-06T14:12:17Z

Changes

spooker has the same usage, with important changes:
- re-implemented in Python
- writes all metadata to a single gzipped JSON file rather than bundling everything as a tar archive

jobby now optionally includes job out/err log files (see --outerr and --include-completed)

Issues

resolves refactor spooker cli #70
resolves spooker: lookup table of regexes to determine nsamples for each pipeline #72

PR Checklist

(~~Strikethrough~~ any points that are not applicable.)

- This comment contains a description of changes with justifications, with any relevant issues linked.
- Write unit tests for any new features, bug fixes, or other code changes.
- Update docs if there are any API changes.
- Update CHANGELOG.md with a short description of any user-facing changes and reference the PR number. Guidelines: https://keepachangelog.com/en/1.1.0/

codecov · 2025-05-06T14:13:21Z

Codecov Report

Attention: Patch coverage is 77.23577% with 56 lines in your changes missing coverage. Please review.

Project coverage is 75.16%. Comparing base (3713f8d) to head (2cb9235).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/ccbr_tools/jobby.py | 61.33% | 29 Missing ⚠️ |
| src/ccbr_tools/paths.py | 62.50% | 15 Missing ⚠️ |
| src/ccbr_tools/pipeline/__init__.py | 79.59% | 10 Missing ⚠️ |
| src/ccbr_tools/spooker.py | 96.22% | 2 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #71      +/-   ##
==========================================
+ Coverage   73.79%   75.16%   +1.37%     
==========================================
  Files          22       25       +3     
  Lines        1595     1784     +189     
==========================================
+ Hits         1177     1341     +164     
- Misses        418      443      +25

kelly-sovacool · 2025-05-12T18:17:00Z

just use tree to determine number of files & directory size instead of du

kelly-sovacool · 2025-05-12T18:23:11Z

for all master & slurm jobs that did not complete, get the log out/err files and include the text in the json

kopardev · 2025-05-12T19:25:28Z

> for all master & slurm jobs that did not complete, get the log out/err files and include the text in the json

Actually, if you get the Slurm job IDs with status != COMPLETED, we can get the file paths of the .err or .out files from the tree itself; no need to glob.
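A minimal sketch of that lookup, assuming the tree has already been parsed from `tree -J` output into Python objects. The function name `find_job_logs` and the assumption that log file names contain the Slurm job ID (for example `slurm-12345.out`) are illustrative only; pipelines that name their logs differently would need another matching rule.

```python
from pathlib import PurePosixPath


def find_job_logs(tree, job_id, suffixes=(".out", ".err")):
    """Return paths of log files whose name contains job_id.

    `tree` is the list produced by parsing `tree -J` output.
    Matching on the file name is an assumption, not a documented convention.
    """
    matches = []

    def walk(node, parent):
        path = parent / node["name"]
        if node.get("type") == "file":
            if job_id in node["name"] and node["name"].endswith(tuple(suffixes)):
                matches.append(str(path))
        # Directories carry their children under "contents".
        for child in node.get("contents", []):
            walk(child, path)

    for entry in tree:
        if entry.get("type") == "report":
            continue  # the trailing summary entry has no files
        walk(entry, PurePosixPath(""))
    return sorted(matches)


# Made-up tree for illustration.
sample_tree = [
    {"type": "directory", "name": "/data/run", "contents": [
        {"type": "directory", "name": "logs", "contents": [
            {"type": "file", "name": "slurm-12345.out"},
            {"type": "file", "name": "slurm-12345.err"},
            {"type": "file", "name": "slurm-99999.out"},
        ]},
        {"type": "file", "name": "config.json"},
    ]},
    {"type": "report", "directories": 1, "files": 4},
]

log_paths = find_job_logs(sample_tree, "12345")
```

Because the paths are rebuilt from the nested `name` fields, no second pass over the filesystem is needed.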

kopardev · 2025-05-12T19:28:17Z

add argparse or click for better argument parsing.
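As a rough illustration of the argparse option, the sketch below defines three positional arguments. The argument names and their order are hypothetical and are not taken from the actual spooker CLI.

```python
import argparse


def build_parser():
    """Build a parser with positional arguments (names are illustrative)."""
    parser = argparse.ArgumentParser(
        prog="spooker",
        description="Collect metadata about a pipeline run.",
    )
    parser.add_argument("outdir", help="pipeline output directory")
    parser.add_argument("pipeline_name", help="name of the pipeline")
    parser.add_argument("pipeline_version", help="version of the pipeline")
    return parser


# Parse an explicit list so the example does not depend on sys.argv.
args = build_parser().parse_args(["/path/to/output", "XYZ", "1.0.0"])
```

With click, the same interface would be declared with decorators instead; either way, `--help` output and argument validation come for free.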

kopardev · 2025-05-12T19:49:26Z

@kelly-sovacool The output should be a nested JSON file:

{
  "pipeline_metadata": {
    "pipeline_name": "XYZ (parsed as input)",
    "pipeline_path": "/path/to/pipeline (how are we getting this?)",
    "pipeline_outdir": "/path/to/output (parsed as input)",
    "pipeline_outdir_size": "(from tree JSON; look for type:report)",
    "pipeline_version": "1.0.0 (parsed as input)",
    "user": "user_name (from os.environ['USER'])",
    "groups": "group1 group2 (from `groups` command)",
    "date": "2025-05-12T15:37:48 (ISO 8601 format)",
    "nsamples": "(lookup via pipeline_regex.JSON and apply regex)"
  },
  "jobby": {
    "example_key": "example_value (output from `jobby --json`)"
  },
  "outdir_tree": {
    "example_tree": "output of `tree -J` on the output directory"
  },
  "master_job_log": {
    "txt": ""
  },
  "failed_jobs": {
    "12345": {
      "logfilepath": "/path/to/logfile (derived from tree)",
      "logfiletxt": "Content of log file here",
      "errfilepath": "/path/to/errfile (derived from tree)",
      "errfiletxt": "Content of error file here"
    }
  }
}

We can then gzip this JSON and move it to the user-staging folder under /data/CCBR_Pipeliner.
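The gzip step could look something like the sketch below, which uses only the standard library. The helper names and the `spooker.json.gz` file name are placeholders, and the example writes to a temporary directory rather than the staging folder.

```python
import gzip
import json
import pathlib
import tempfile


def write_gzipped_json(metadata, path):
    """Serialize a dict to a gzip-compressed JSON file and return its path."""
    path = pathlib.Path(path)
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        json.dump(metadata, handle, indent=2)
    return path


def read_gzipped_json(path):
    """Load a dict back from a gzip-compressed JSON file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


# Small made-up payload following the nested layout proposed above.
example = {
    "pipeline_metadata": {"pipeline_name": "XYZ", "nsamples": 4},
    "failed_jobs": {},
}

with tempfile.TemporaryDirectory() as tmpdir:
    out_path = write_gzipped_json(example, pathlib.Path(tmpdir) / "spooker.json.gz")
    round_trip = read_gzipped_json(out_path)
```

Moving the resulting file into the staging folder could then be done with `shutil.move` or whatever copy mechanism the cluster requires.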

kopardev · 2025-05-12T21:49:41Z

@kelly-sovacool add --du to tree ... something like tree -Ja --du <dirpath>
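A sketch of how that command might be run and summarized from Python. `run_tree` simply shells out to `tree` and is not called here, since `tree` may not be installed; `summarize_tree` works on already-parsed output. Both function names are hypothetical, and reading `size` from the report entry with a fallback to the root directory entry is an assumption about the JSON layout that should be checked against the installed `tree` version.

```python
import json
import subprocess


def run_tree(dirpath):
    """Run `tree -Ja --du` on dirpath and return the parsed JSON."""
    result = subprocess.run(
        ["tree", "-Ja", "--du", str(dirpath)],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


def summarize_tree(tree):
    """Return (n_files, n_directories, size_bytes) from parsed tree output."""
    report = next((e for e in tree if e.get("type") == "report"), {})
    roots = [e for e in tree if e.get("type") == "directory"]
    size = report.get("size")
    if size is None and roots:
        # Fall back to the size recorded on the top-level directory entry.
        size = roots[0].get("size")
    return report.get("files"), report.get("directories"), size


# Made-up output for illustration; real sizes include directory entries.
sample = [
    {"type": "directory", "name": "outdir", "size": 3072, "contents": [
        {"type": "file", "name": ".hidden.log", "size": 1024},
        {"type": "file", "name": "results.txt", "size": 2048},
    ]},
    {"type": "report", "size": 3072, "directories": 0, "files": 2},
]

n_files, n_dirs, size_bytes = summarize_tree(sample)
```

Getting the file count and size from the same `tree` call avoids a separate `du` invocation.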

…to spooker-cli

kelly-sovacool · 2025-05-14T22:01:07Z

Current version is working. Need to test it on a run that had some jobs fail. Also need to test on nextflow.

kelly-sovacool · 2025-05-14T22:03:22Z

Example output json (with tree, jobby, & master job log omitted for brevity)

{
    "outdir_tree": "...",
    "pipeline_metadata": {
        "pipeline_name": "RENEE",
        "pipeline_path": "/data/CCBR_Pipeliner/Pipelines/RENEE/renee-dev-sovacool",
        "pipeline_outdir": "/data/sovacoolkl/renee_test_hg38_48",
        "pipeline_outdir_size": 2840578746,
        "pipeline_version": "v2.6.7-dev",
        "ccbrpipeliner_module": "unknown",
        "user": "sovacoolkl",
        "uid": "60731",
        "groups": "CCBR CCBR_Pipeliner SCLCgenomics Ziegelbauer_lab sovacoolkl NCI-workbench-users SCLC_scRNA",
        "date": "2025-05-14T17-52-37",
        "nsamples": 4
    },
    "jobby": "...",
    "master_job_log": {
        "txt": "..."
    },
    "failed_jobs": {}
}
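The `nsamples` field comes from a per-pipeline lookup table of regexes (#72). The sketch below shows one way such a lookup could work; the pipeline keys, the patterns, and the `count_samples` function are all hypothetical and do not reflect the patterns used for any real pipeline.

```python
import re

# Hypothetical lookup table: pipeline name -> regex with a "sample" group.
SAMPLE_PATTERNS = {
    "pipeline_a": r"^bams/(?P<sample>[^/]+)\.bam$",
    "pipeline_b": r"^results/(?P<sample>[^/]+)/qc\.html$",
}


def count_samples(pipeline_name, file_paths, patterns=SAMPLE_PATTERNS):
    """Return (number of unique samples, sorted sample names)."""
    pattern = patterns.get(pipeline_name.lower())
    if pattern is None:
        # No pattern registered for this pipeline.
        return 0, []
    regex = re.compile(pattern)
    samples = set()
    for path in file_paths:
        match = regex.search(path)
        if match:
            samples.add(match.group("sample"))
    return len(samples), sorted(samples)


# Relative paths as they might be reconstructed from the tree output.
paths = ["bams/S1.bam", "bams/S2.bam", "bams/S2.bam.bai", "logs/run.log"]
n_samples, sample_names = count_samples("pipeline_a", paths)
```

Returning the sample names alongside the count makes it easier to debug a pattern that matches too much or too little.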

kelly-sovacool · 2025-05-15T13:37:42Z

Example for CHAMPAGNE. I added the job name, state, & exit code to the failed jobs dict for better context.

{
    "outdir_tree": "...",
    "pipeline_metadata": {
        "pipeline_name": "CHAMPAGNE",
        "pipeline_path": "/data/CCBR_Pipeliner/Pipelines/CHAMPAGNE/champagne-dev-sovacool",
        "pipeline_outdir": "/data/CCBR_Pipeliner/Pipelines/CHAMPAGNE/champagne-dev-sovacool",
        "pipeline_outdir_size": 370213210424,
        "pipeline_version": "v0.4.1-dev",
        "ccbrpipeliner_module": "unknown",
        "user": "sovacoolkl",
        "uid": "60731",
        "groups": "CCBR CCBR_Pipeliner SCLCgenomics Ziegelbauer_lab sovacoolkl NCI-workbench-users SCLC_scRNA",
        "date": "2025-05-15T09-33-30",
        "nsamples": 0
    },
    "jobby": "...",
    "master_job_log": {
        "txt": "..."
    },
    "failed_jobs": {
        "57140712": {
            "JobName": "nf-CHIPSEQ_QC_PRESEQ_(SPT5_T0_1)",
            "JobState": "FAILED",
            "ExitCode": 11,
            "log_out_path": "/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHAMPAGNE/champagne-dev-sovacool/work/0f/2dec15cccbbe8eb2982f720e37cda5/.command.out",
            "log_out_txt": "",
            "log_err_path": "/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHAMPAGNE/champagne-dev-sovacool/work/0f/2dec15cccbbe8eb2982f720e37cda5/.command.err",
            "log_err_txt": "WARNING: Not virtualizing pid namespace by configuration\n/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHAMPAGNE/champagne-dev-sovacool/work/0f/2dec15cccbbe8eb2982f720e37cda5/.command.sh: line 3: 3058827 Segmentation fault      (core dumped) preseq lc_extrap -B -D -o SPT5_T0_1.lc_extrap.txt SPT5_T0_1.filtered.bam -seed 12345 -v -l 100000000000 2> SPT5_T0_1.preseq.log\n"
        },
        "57140974": {
            "JobName": "nf-CHIPSEQ_QC_PRESEQ_(SPT5_T0_2)",
            "JobState": "FAILED",
            "ExitCode": 11,
            "log_out_path": "/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHAMPAGNE/champagne-dev-sovacool/work/3a/2913ebefce8ee7fe2c953d7c8b4f3e/.command.out",
            "log_out_txt": "",
            "log_err_path": "/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHAMPAGNE/champagne-dev-sovacool/work/3a/2913ebefce8ee7fe2c953d7c8b4f3e/.command.err",
            "log_err_txt": "WARNING: Not virtualizing pid namespace by configuration\n/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHAMPAGNE/champagne-dev-sovacool/work/3a/2913ebefce8ee7fe2c953d7c8b4f3e/.command.sh: line 3: 1580954 Segmentation fault      (core dumped) preseq lc_extrap -B -D -o SPT5_T0_2.lc_extrap.txt SPT5_T0_2.filtered.bam -seed 12345 -v -l 100000000000 2> SPT5_T0_2.preseq.log\n"
        },
        "57141181": {
            "JobName": "nf-CHIPSEQ_QC_PRESEQ_(SPT5_INPUT)",
            "JobState": "FAILED",
            "ExitCode": 1,
            "log_out_path": "/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHAMPAGNE/champagne-dev-sovacool/work/f7/824c11f58831bea0eea62183897593/.command.out",
            "log_out_txt": "",
            "log_err_path": "/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHAMPAGNE/champagne-dev-sovacool/work/f7/824c11f58831bea0eea62183897593/.command.err",
            "log_err_txt": "WARNING: Not virtualizing pid namespace by configuration\n"
        },
        "57142656": {
            "JobName": "nf-CHIPSEQ_PHANTOM_PEAKS_(SPT5_INPUT)",
            "JobState": "FAILED",
            "ExitCode": 1,
            "log_out_path": "/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHAMPAGNE/champagne-dev-sovacool/work/8f/e3f6d397128b65cb34864742b24737/.command.out",
            "log_out_txt": "################\nChIP data: SPT5_INPUT.filtered.dedup.sort.f66.bam \nControl data: NA \nstrandshift(min): -500 \nstrandshift(step): 5 \nstrandshift(max) 1500 \nuser-defined peak shift NA \nexclusion(min): 10 \nexclusion(max): NaN \nnum parallel nodes: NA \nFDR threshold: 0.01 \nNumPeaks Threshold: NA \nOutput Directory: . \nnarrowPeak output file name: NA \nregionPeak output file name: NA \nRdata filename: NA \nplot pdf filename: SPT5_INPUT.ppqt.pdf \nresult filename: SPT5_INPUT.spp.out \nOverwrite files?: FALSE\n\n[1] TRUE\nReading ChIP tagAlign/BAM file SPT5_INPUT.filtered.dedup.sort.f66.bam \nopened /tmp/RtmpCMj6B7/SPT5_INPUT.filtered.dedup.sort.f66.tagAlign1b57d369129cea\ndone. read 0 fragments\n",
            "log_err_path": "/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHAMPAGNE/champagne-dev-sovacool/work/8f/e3f6d397128b65cb34864742b24737/.command.err",
            "log_err_txt": "WARNING: Not virtualizing pid namespace by configuration\nLoading required package: Rcpp\nError in read.table(bam2align.filename, nrows = 500) : \n  no lines available in input\nCalls: read.align -> read.table\nExecution halted\n"
        },
        "57143272": {
            "JobName": "nf-CHIPSEQ_CALL_PEAKS_GEM_(SPT5_T0_1)",
            "JobState": "FAILED",
            "ExitCode": 1,
            "log_out_path": "/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHAMPAGNE/champagne-dev-sovacool/work/46/de0f89421c920f106e09570fbe4b65/.command.out",
            "log_out_txt": "\nGEM (version 3.4)!\n\nPlease cite: \nYuchun Guo, Shaun Mahony, David K. Gifford (2012) PLoS Computational Biology 8(8): e1002638. \nHigh Resolution Genome Wide Binding Event Finding and Motif Discovery Reveals Transcription Factor Spatial Binding Constraints. \ndoi:10.1371/journal.pcbi.1002638\n\nGifford Laboratory at MIT (http://cgs.csail.mit.edu/gem/).\n\n----------------------------------\n\nStart time: 2025/05/14 18:16:14\n\nLoading data...\n    Loading reads from: SPT5_T0_1.filtered.dedup.sort.bam ... Loaded\n    Loading reads from: SPT5_INPUT.filtered.dedup.sort.bam ... Loaded\n",
            "log_err_path": "/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/CHAMPAGNE/champagne-dev-sovacool/work/46/de0f89421c920f106e09570fbe4b65/.command.err",
            "log_err_txt": "WARNING: Not virtualizing pid namespace by configuration\nException in thread \"main\" java.lang.NullPointerException\n\tat edu.mit.csail.cgs.deepseq.utilities.AlignmentFileReader.populateArrays(AlignmentFileReader.java:260)\n\tat edu.mit.csail.cgs.deepseq.utilities.SAMReader.countReads(SAMReader.java:86)\n\tat edu.mit.csail.cgs.deepseq.utilities.AlignmentFileReader.getTotalHits(AlignmentFileReader.java:310)\n\tat edu.mit.csail.cgs.deepseq.utilities.FileReadLoader.<init>(FileReadLoader.java:147)\n\tat edu.mit.csail.cgs.deepseq.DeepSeqExpt.<init>(DeepSeqExpt.java:84)\n\tat edu.mit.csail.cgs.deepseq.DeepSeqExpt.<init>(DeepSeqExpt.java:78)\n\tat edu.mit.csail.cgs.deepseq.discovery.GEM.<init>(GEM.java:127)\n\tat edu.mit.csail.cgs.deepseq.discovery.GEM.main(GEM.java:343)\n"
        }
    }
}
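For a quick overview of a `failed_jobs` dict shaped like the one above, something like the following sketch could be used. The function name `summarize_failed_jobs` is illustrative; the sample data is abridged from two of the entries above.

```python
def summarize_failed_jobs(failed_jobs, max_chars=80):
    """Return one (job_id, job_name, exit_code, last_stderr_line) row per job."""
    rows = []
    for job_id, info in failed_jobs.items():
        err_lines = [
            line for line in info.get("log_err_txt", "").splitlines() if line.strip()
        ]
        last_line = err_lines[-1] if err_lines else ""
        rows.append(
            (job_id, info.get("JobName", ""), info.get("ExitCode"), last_line[:max_chars])
        )
    return rows


failed_jobs = {
    "57141181": {
        "JobName": "nf-CHIPSEQ_QC_PRESEQ_(SPT5_INPUT)",
        "JobState": "FAILED",
        "ExitCode": 1,
        "log_err_txt": "WARNING: Not virtualizing pid namespace by configuration\n",
    },
    "57142656": {
        "JobName": "nf-CHIPSEQ_PHANTOM_PEAKS_(SPT5_INPUT)",
        "JobState": "FAILED",
        "ExitCode": 1,
        "log_err_txt": (
            "WARNING: Not virtualizing pid namespace by configuration\n"
            "Loading required package: Rcpp\n"
            "Error in read.table(bam2align.filename, nrows = 500) : \n"
            "  no lines available in input\n"
            "Calls: read.align -> read.table\n"
            "Execution halted\n"
        ),
    },
}

rows = summarize_failed_jobs(failed_jobs)
```

The last non-empty stderr line is only a heuristic; for some tools the informative message appears earlier in the log.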

kelly-sovacool added 3 commits May 5, 2025 18:06

feat: implement spooker in python

f72e9f8

refactor: return generic cluster if unknown

49728d2

feat: add spooker to ccbr_tools cli

0cc2fd1

kelly-sovacool and others added 18 commits May 6, 2025 11:55

refactor: get_random_string() and get_timestamp() in pkg_util module

3929450

refactor: create spook() in hpc class

11f2421

refactor: spooker CLI uses positional args just like original

5438230

refactor: move spook() to spooker.py

0d79e3d

feat: run jobby --json during spooker

f791c7e

fix: use hyphens to separate time in timestamp

e5ad742

test: basic spooker test for GHA

d352598

ci: 🤖 render readme

e3e252b

test(spooker): fix tests for GHA

8f5a782

chore: delete legacy spooker

e1d3d06

chore: Merge branch 'spooker-cli' of https://github.com/CCBR/Tools in…

aabef4b

…to spooker-cli

docs: add spooker module

468498d

fix: use spook on biowulf, simple copy otherwise

c680bed

feat: write single JSON with all info bundled

71c44f6

no separate files for tree, jobby, etc

docs: write docstring w/ help from copilot

60d484c

chore: update CHANGELOG.md

45ee849

test: fix spooker test

d5ebde0

feat: [WIP] count samples for xavier & renee

3698bca

towards #72

kelly-sovacool force-pushed the spooker-cli branch from fbaec15 to 3698bca Compare May 8, 2025 18:09

kelly-sovacool added 2 commits May 12, 2025 15:54

test: add tree json from champagne with -a

ce5bfea

feat: include hidden files in tree output

e0a3802

kelly-sovacool and others added 10 commits May 14, 2025 11:19

feat: parse tree with ast when json fails; get size from tree or du

c36289f

fix: cannot access dict before creation

b0d66ba

ci: 🤖 render readme

661bcf2

feat: count nsamples for all pipelines but escape & logan

a222f5f

chore: Merge branch 'spooker-cli' of https://github.com/CCBR/Tools in…

3a1253c

…to spooker-cli

docs: update module docstrings

1b7fa3f

fix: imports

1633595

feat: add tree nsamples pattern for logan

84df3ab

feat: collect failed job logs

41bfd69

fix: syntax errors; add assertions

1aed66c

feat: add tree nsamples pattern for escape

ec8e914

kelly-sovacool added 14 commits May 15, 2025 10:06

feat: add job name & state to failed jobs dict

6410106

refactor: break out spook metadata collection & file writing to separ…

8e2af7f

…ate functions

refactor: keep jobby & tree as python dicts

af9efb0

feat: keep sample names for debugging

428d377

feat: keep sample names for debugging

d69b18a

feat: --outerr and --include-completed options for jobby

6ffaf14

chore: Merge branch 'main' into spooker-cli

56ae250

fix: handle missing records; do not pass to pandas

ff004e9

test: update tests for new jobby & spooker behavior

42a2174

test: drop outdir size from spooker test

645c5e8

test: drop outdir size from spooker test

07189aa

test: jobby cli with invalid log

3f21c4a

chore: Merge branch 'spooker-cli' of https://github.com/CCBR/Tools in…

41441e7

…to spooker-cli

test: fix test for gha

2cb9235

kelly-sovacool marked this pull request as ready for review May 16, 2025 13:31
