Meeting Notes
Documenting progress and issues from each week and planning tasks for the following week
Sensitive volume over time run.
- We saw that at the most significant FARs, there were periods where the original model was outperforming the retrained model
- Hypothesis: Periods where the original model outperforms the retrained model correspond to periods where the glitch rate of the original training period is higher than the glitch rate of the retrained period. The rationale being that because the original model was able to train on more glitches, it was more robust. There is some evidence of this by looking at the glitch rate ratio vs the SV ratio:
- Our intuition that the retrained model should perform better is exhibited when we look at slightly lower FARs. Possible that uncertainty in our FAR estimate at the most significant FARs is the culprit.
Consensus: In the paper we will report our original model's performance over time without the retrained model's performance. This investigation will be left for future work.
TODO: Alec will get benchmarking and MDC code into the paper branch. We will need to run this analysis and add the plot to the paper
Finishing final production experiments for paper.
- Headline results/re-seeded
- Both produced nearly identical performance - good
- Both had dips at lowest FARs, seems indicative of loud background events
- Questions/AIs:
- Inference sampling rate
- Results came out mostly as expected
- Questions/AIs:
- #439 again
- Intervals/Longevity
- Fine-tuning produced models that rarely outperformed the initial model on the same validation set
- Ethan's running trainings from scratch to compare, but preliminary results indicate they'll come out roughly the same
- Could indicate that initial training is good enough for extended deployment
- Means fluctuations over time are just the result of local noise levels
- Curious why all interval runs have lower validation AUROC than initial training run, but there are too many variables at play here to sort out, and it's not worth the time before publication. From-scratch results are conclusive enough for us
- Questions/AIs
- Wait for Ethan's current runs to complete and then draw conclusions (#443)
- Compare our fluctuations in sensitivity over O3 to those experienced by other pipelines. Are they similar?
- Offline throughput
- #442 outlines how we should compare our offline throughput per GPU to Eliu's from 2017
- This paper is worth comparing to because, outside of topline results, we're also outlining a software platform for BBH detection, and this is the only other paper I'm aware of that attempts something similar
- Questions/AIs
- We don't know if the server stats for our current 1-year and 2-month runs correspond to 7 or 8 GPUs, so once Ethan is done with his retraining runs on Hanford, Alec will re-run these using all 8 GPUs to compare their total throughputs.
- Code Alec needs to get into paper branch
Next steps beyond this are
- Get online deployment going again
- Alec to take some initial stabs at implementing some of our inference changes since the last one, hand off to Will mid-next week to take over and get a training run going. (#447)
Paper text is being worked on online, in parallel with code development. In the meeting, we mainly discussed outstanding code issues required to re-run all experiments to generate "final" results
- Data generation
- Setting seeds during data generation
- Switching to using open segments for background
- Switch to 8s waveforms to avoid the tail end of the signal getting tacked onto the coalescence
- Once these fixes are in, generate:
- 1-year dataset
- Interval datasets
- A separate 2-month dataset for smaller experiments (using different seed from 1-year dataset)
- MDC test set with parameters as close as possible to AresGW paper, explicitly record command
- Training
- Seed everything in training
- See if we can find a utility for this, otherwise follow instructions here
- Note that this will require a slight reformatting of the `ml4gw` dataloader in order to allow multiprocessed dataloading to be deterministic (see the sketch after this list)
- Fix bug in validation where predictions on different injection views were being aggregated incorrectly. The actual issue came from how the PSD was being calculated; this was corrected by changing the averaging method to "median".
- Inference
- Switch to an inference sampling rate of 4
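As a reference point for the seeding items above, here is a minimal sketch of what "seed everything" plus deterministic multiprocessed dataloading could look like in PyTorch. This is not the aframe/`ml4gw` implementation; the function names and the seed value are illustrative.

```python
import os
import random

import numpy as np
import torch


def seed_everything(seed: int) -> None:
    """Seed every RNG we rely on so that training runs are reproducible."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # trade a bit of speed for deterministic cuDNN kernels
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


def worker_init_fn(worker_id: int) -> None:
    """Give each dataloader worker a derived, but deterministic, seed."""
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


# usage: seed once up front, then pass a seeded generator + init fn to the loader
seed_everything(42)  # arbitrary seed for illustration
generator = torch.Generator().manual_seed(42)
# loader = torch.utils.data.DataLoader(
#     dataset, num_workers=4, worker_init_fn=worker_init_fn, generator=generator
# )
```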
The final set of experiments we'll need to run, and which we should have pipelines for, are:
- Vanilla sandbox with 1 year worth of background (1-year dataset)
- Intervals, both with initial model and with retraining (or fine-tuning)
- Do inference on 1-year with median and worst non-nan hyperparameters from search to show benefit of search
- Sandbox 1-year trained using a different seed
- Inference sampling rate comparisons (using 2 month-background dataset)
- Sandbox with all augmentations turned off
- MDC
Will has done a lot of good work trying to construct posteriors on the foreground and background counts described in this paper for developing a `p_astro` in the context of aframe, but without clearer steps from any of the pipelines we're having difficulty getting things to come out sensibly (a schematic of the quantity we're after is sketched after this list). Next steps here are to:
- Reach out to someone from one of the pipelines directly to see if we can get 30 minutes of their time to have them walk us through how they go about this. Preferably PyCBC since it seems simpler, but if no one on the group call tomorrow has any contacts on that team we'll reach out to Cody with GstLAL
- If the implementation will take anything more than a notebook cell, put this on hold for a follow up paper where we actually run over 0-lag O3 data
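For context, a schematic of the standard two-component mixture that `p_astro` reduces to once the foreground and background counts are in hand; the hard part for us is constraining those rates and densities, which is what the posteriors above are for. The function below is a sketch with illustrative argument names, not any pipeline's implementation.

```python
def p_astro(stat, rate_fg, rate_bg, fg_density, bg_density):
    """Two-component mixture estimate of the probability that a candidate
    with detection statistic `stat` is astrophysical.

    fg_density / bg_density: callables giving the (normalized) density of the
        detection statistic under the signal and noise hypotheses
    rate_fg / rate_bg: expected counts of signal and noise candidates over the
        analyzed livetime (the quantities the posteriors would constrain)
    """
    fg = rate_fg * fg_density(stat)
    bg = rate_bg * bg_density(stat)
    return fg / (fg + bg)
```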
Discussed remaining tasks and best course of action for reaching publication of initial aframe paper. Decided that main figures of merit should be:
Main Section (probably in this order)
- Table showing comparison of $p_{\text{astro}}$-thresholded sensitive volumes for aframe and GWTC-3 pipelines
- Grid plot showing aframe's sensitive volume (not distance) vs. FAR for each of the 4 main mass combinations (35/35, 35/20, 20/20, 20/10), with $p_{\text{astro}}$ values plotted as horizontal lines (including aframe's)
- Sensitive volume vs. time for Ethan's interval datasets
- Online latency vs. inference sampling rate plot
Methods Section (in no particular order)
- Will's experiments showing impact of inference sampling rate on sensitive volume
- Grid plot showing comparison of sensitive volume achieved on the timeslides from 1-month after the training period vs. that achieved over Ethan's interval datasets.
- ML Mock Data Challenge results/brief discussion on issues there
- Various diagrams illustrating the aframe training and inference pipelines
To get there, there are a few outstanding action items, which can be roughly broken up into 3 categories.
- Ethan will train models using the week before his interval test sets to see how that impacts performance over time
- Will will take another look at how sensitive distance scales over time for each of the mass combinations
- Alec will re-run `ml4gw` projection vs. bilby benchmarking
- Everyone will take a look at PyCBC's $p_{\text{astro}}$ implementation and one of us will try to implement it on one of our results
- Once all code is ready, run an HP search with all augmentations turned off to get a sense of advantage
- Begin producing online triggers and get connected to GraceDB Playground
- Seeding for dataset generation to ensure reproducibility
- Ethan's interval runs
- Will's inference sampling rate experiments
- MDC train dataset generation and inference
- Online benchmarking code
- `ml4gw` benchmarking code (last two probably in the same `benchmarking` project)
- Replace `vizapp` with plotting code, stripping out the `vizapp` components necessary for generating plot data from raw data
- Add ability to apply vetoes when generating plots
- Once everything is in, do a full re-run to ensure results are reproducible
- Alec will write draft intro
- Ethan working on data, augmentations, and SV in methods section
Hyperparameter search on MDC data has produced a model that's competitive with SOA results. Given this, plan to present to CBC West call on Aug 15th to share initial findings and solicit feedback for methods paper and organizing a review committee. Key figures of merit we'd like to show are:
- Sensitive distance vs. FAR for best performing model on MDC dataset. Compare with submitted results from original MDC
- Sensitive distance vs. FAR for best performing model on 1-year worth of aframe-generated end O3 rates+pops prior timeslides. Include importance sampling to GWTC-3 mass combinations (figure 6), and plot horizontal lines for "Any" values from that plot (top left).
- Sensitive distance vs. FAR for best performing end O3 model on 2-months worth of aframe-generated timeslides from 1-week taken from intermittent months after training period (e.g. 1, 3, 6, 9 months) to show how performance changes over time and infer retraining period.
- Latency vs. inference sampling rate at different batch sizes for online timing comparison.
- At some point, investigate trend in relationship between inference sampling rate and sensitivity
Action items required to get us there:
- Alec will keep running MDC HP search until EOD tomorrow
- Before then, run inference with new best performing model to see if gains are real and get data for first figure of merit
- Will starts putting together branch for generating end O3 training dataset and training on it
- Set `fftlength` for both training and inference to 2.5s to mitigate train/inference variance disparity
- Start a training run using best HP params from MDC model for comparison
- Either Will or Alec launch HP search on remaining available GPUs on DGX boxes
- In parallel, Alec will launch generation of 1-year worth of end O3 timeslide data from 1 month after training period
- After HP search has run, at start of next week, take best performing model and run it on 2 months of 1 year dataset to sanity check
- If results look reasonable, launch 1-year inference run
- While 1-year run goes, Ethan start generating interval datasets
- Once 1-year run completes, launch inference jobs on interval datasets
- Alec do online benchmarking using in-memory model, possibly converting to TensorRT if time allows
While all of this is happening,
- Start putting slides together at start of next week to begin organizing thoughts and story
- Start digesting the `p_astro` paper and thinking about how to frame this in our context
One key thing Erik pointed out that's worth making explicit both in this talk and in the paper is that we don't plan on showing any astrophysical results - i.e. we're not running on any zero-lag data. Our goal is to lay out a robust proof-of-concept and framework for doing ML-based detection in GW, setting the stage for a follow-up paper describing real astrophysical results.
Other, less pressing action items:
- Start doc outlining skeletons of various papers/blog posts to come out of this
- Accelerate code cleanup/robustness improvements
- Explore impact of inference sampling rate on sensitivity
- Reach out to AresGW team for evaluation HDF5 file for direct comparison in paper
- Attempted direct comparison of evaluation using MDC data/sensitive distance code and using aframe data/sensitive distance code
- Still having difficulty aligning the two calculations
- At the meeting we agreed that:
- At this point, given what we know about the flaws in the MDC sensitive distance calculation, we have no real plans to use it outside of comparison purposes for this one paper
- Given that and the fact that we have code for directly evaluating on the MDC datasets, we'll just start doing that going forward for comparisons, but report our "real" numbers using the end O3 r+p distribution, with events distributed uniformly in comoving volume. This framework will be one of the contributions of whatever paper we write.
- As such the plan moving forward will be:
- Get the code for doing inference on MDC data merged into the `mdc-dev` branch
- Get the `mdc-dev` branch merged into `main`
- Begin taking multiple views of signals during validation, to account for differences in response to position in kernel, particularly for lower mass events
- Split out `main` into two branches: one for continuing to produce predictions on MDC data and cleaning up that implementation, the other for stripping out the MDC-specific infrastructure and switching back to end O3 rates and pops
- The MDC branch will be used to launch an initial HP search (once we remove hopeless SNR thresholding during validation) to try to achieve SOA on the MDC dataset
- The end O3 branch will eventually make it back to `main` to act as our default distribution going forward
- Once this is ready and the MDC HP search has completed, run a separate search on this data distribution
- Once end O3 makes it back to `main`, this all will get merged into the `online-deployment` branch to begin re-running our online triggers
- Ethan has an implementation of glitch sampling that will be interesting to run through an HP search as well
- All of this can be visualized by the following diagram of git commits
- Plan for publication is:
- Achieve best score possible on MDC for comparison purposes
- Introduce our pipeline and present results on that
- Compare to sensitive volumes in GWTC-3 paper, with the caveat that these aren't reported with (and often don't even have a definite sense of) corresponding false alarm rates
- Plan for deployment is:
- Get online pipeline running and start producing triggers. Will be the only way to get noticed
- In parallel, produce predictions on Reed's O4 MDC for direct comparison to other pipelines
- Will is going to take a look at this data to see how it would fit into our framework. We suspect this will likely be more amenable to dataloading/inference structure for MDC data.
- Immediate next steps
- Review `mdc-infer` and merge into `mdc-dev`
- Update train start/stop times to reflect the fact that the MDC `start_offset` flag represents an offset in livetime, not total time (see this Slack post; the MDC test set with `--start-offset $((3600 * 24 * 10))` begins at GPS time 1241443783)
- Final review of `mdc-dev`, merge into `main`
- Validation signal fix on `main`
- Split out MDC branch, fix SNR cutoff
- Launch HP search
- Main focus was on running experiments to compare our algorithm with the MDC
- Ethan approached the problem by implementing the MDC prior and chirp distance calculation in the aframe framework. See #361
(Plot: sensitive distance (Mpc) vs. FAR (months^-1))
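For reference, the chirp distance convention used by the MDC (and PyCBC) rescales luminosity distance by chirp mass relative to a fiducial 1.4 + 1.4 solar-mass BNS, so that equal chirp distances correspond to roughly equal SNRs across chirp masses. A minimal sketch, assuming that convention:

```python
def chirp_mass(m1, m2):
    return (m1 * m2) ** 0.6 / (m1 + m2) ** 0.2


# fiducial chirp mass of a 1.4 + 1.4 solar-mass BNS (~1.219 Msun)
MCHIRP_FID = chirp_mass(1.4, 1.4)


def chirp_distance(distance, m1, m2):
    """Luminosity distance rescaled by chirp mass relative to the fiducial BNS."""
    return distance * (MCHIRP_FID / chirp_mass(m1, m2)) ** (5 / 6)
```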
- Will produced some comparisons of the chirp mass, chirp distance, and m1/m2 distributions between our implementation and the samples produced by the MDC code, and everything looks consistent. (See plots here)
- While analyzing the output of this run, discovered a discrepancy between local torch and Triton inference - could mean the above results are not to be trusted
- Alec continued the approach of playing by the MDC paper's rules, using their data generation and evaluation script code. While the above result is worse than Ethan's result, the acceptance window for events was 0.125 seconds, while Ethan's analysis essentially had an infinite acceptance window. Alec saw that increasing that window to 0.3 s led to an increase of sensitive distance to 1290 Mpc. Additionally, Ethan's run was trained for longer and saw validation score improvements.
- A couple of inference code bugs were squashed in Ethan's MDC data branch (linked above):
- Removing bogus events from the first `psd_length` seconds of data
- An edge case where some timeshifts were failing to complete for certain segments
- Will is going to pursue an experiment that uses a uniform-in-volume distribution (as opposed to uniform-in-comoving-volume) under the standard aframe SD calculation (i.e. no chirp distance sampling, use the hopeless SNR cut, etc.) and see if this accounts for the discrepancies.
- Get to the bottom of torch vs triton discrepancy
- Continue pursuing MDC comparison
- Continued focus on infrastructure development to be able to make a submission to the ML-MDC.
- Finalized implementation of local whitening during training as well as inference. In short, the new snapshotter keeps state for both the kernel and the data used to whiten it
- Alec trained a model on O3a data near MDC testing period, generated MDC injection set, and ran inference
- Working on running evaluation script and getting a result
- Development of a revamped visualization app that performs live injection and inference
- Discovered an odd discrepancy between torch inference and Triton-based inference. In short, torch-based inference was not triggering to the same degree as Triton inference on certain background events.
- Hard to see, but there is a slight difference in the whitened data seen by the torch vs. Triton models
- Need to determine if the NN output difference is due to this data discrepancy or the model itself (e.g. some precision issue)
- Discovered another artifact where some signal outputs were creating vertical clusters - possibly related to the above issue.
- Arrive at a metric on the MDC data
- Iron out torch vs triton discrepancies
- Continued development of vizapp
Work this week has been focused on infrastructure developments required to create a submission to the ML MDC mock data challenge
- Alec implemented local whitening in the frequency domain. Data will now be whitened using the N seconds preceding each kernel, as opposed to using a fixed whitening filter fit to the training background (a rough sketch of the idea below).
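A minimal numpy/scipy sketch of the idea (not the actual `ml4gw` implementation): estimate the PSD from the stretch of data preceding the kernel and divide the kernel's FFT by the resulting ASD. The `fftlength` and the overall normalization constant are placeholders.

```python
import numpy as np
from scipy.signal import welch


def whiten(kernel, history, sample_rate, fftlength=2.0):
    """Whiten `kernel` using a PSD estimated from the `history` preceding it."""
    # estimate the PSD of the preceding data; median averaging is robust to glitches
    freqs, psd = welch(
        history,
        fs=sample_rate,
        nperseg=int(fftlength * sample_rate),
        average="median",
    )
    # interpolate the ASD onto the kernel's frequency grid and divide it out
    kernel_fft = np.fft.rfft(kernel)
    fft_freqs = np.fft.rfftfreq(len(kernel), d=1 / sample_rate)
    asd = np.interp(fft_freqs, freqs, psd) ** 0.5
    asd = np.maximum(asd, 1e-30)  # avoid dividing by ~0 at DC
    # overall normalization constant omitted for clarity
    return np.fft.irfft(kernel_fft / asd, n=len(kernel))
```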
- ONNX does not support torch FFTs, so we needed TorchScript support to be able to export the new local whitener with the model. Alec implemented this in hermes.
- Alec has implemented chunked data loading so that we can train on multiple discontiguous segments. See #352
- An initial training run with these changes and analysis of the corresponding validation outputs illustrated that things seem to be working
- One interesting tidbit discovered on the call:
- It appeared that for low-SNR events, we were performing monotonically better as a function of chirp mass
- However, we hypothesized that this was because we were evaluating the detection statistic of validation signals with the coalescence time in the center of the kernel. This means that signals with SNR content longer than `kernel_length / 2` will not be fully seen. When adjusting the evaluation kernel such that the coalescence time is at the end of the kernel, the discrepancy disappears.
- This is evidence that we may benefit from longer kernel lengths
- Alec ran some additional investigations offline that are outlined in this Slack post
- Will opened a PR that adds the dynamic input normalization (DAIN) layer used by the AresGW group (#354) and ran some experiments using the old whitening and dataloading scheme. Loss curves for that run can be found in this Slack post
- Continue focus on infrastructure developments that enable us to make an MDC submission
- Alec's first naive implementation of a snapshotter that incorporates the new local whitening scheme was extremely slow because we have to unfold multiple 10-second kernels. The solution is to use a single N-second window to whiten the entire batch.
- Begin development of Vizapp so we can diagnose failure modes
- Beginning to think about standardizing nomenclature used in the repository #348
Renamed from BBHNet to aframe based on a ranked-choice poll.
- Will submitted ticket to GraceDB to get online alerts added to Playground
- Alec created online deployment branch for generating online observations, but the network went stale after ~1 week and re-training has been bottlenecked by segfault issues with glitch generation, so the model is offline for now.
- Need to make some changes to field names in JSON event files
- Alec investigated correlations between test sensitive distance (SD) and various validation metrics by running inference with 24 checkpoints from a particular training run.
- Evaluated 4 combinations of background and signal
- Background: just coincident validation glitches, or using timeslides of validation background data
- Signal: Validation waveforms evaluated using pre-sampled sky parameters, or validation waveforms averaged over 10 different samples of sky parameters. In each case, waveforms had their SNR rescaled to be minimum 4 before evaluation.
- Validation metrics were either recall (fraction of signals above loudest noise event), or area under the ROC curve (AUROC) at different max FPR thresholds.
- Results shown below, with the y-axis in all plots being test sensitive distance and the x-axis being the corresponding validation metric at each epoch. Each group's correlation coefficient is superimposed over it in text. Measured against sensitive distances at 3 different values of test false alarm rate, indicated in the title of the top left plot in each image.
- Timeslide validation at lower FPR thresholds shows the strongest correlation; not perfect, but at least some of the noise comes from the shortness of the test livetime (~2 months)
- Double glitches just tend to grossly underestimate performance, which we thought might be helpful as a lower-bound setter, but it turns out not to be the behavior you want to explicitly optimize for.
- Going to implement timeslide-based validation using AUROC@1e-3 as our standard for the time being until we can gather more evidence.
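For the AUROC-at-max-FPR metrics above, scikit-learn supports this directly via the `max_fpr` argument (it returns a standardized partial AUC, so absolute values won't match a raw partial integral). A small sketch with placeholder scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# network outputs on timeslide background (label 0) and injections (label 1);
# the scores here are random placeholders, not real data
background_scores = np.random.randn(100_000)
signal_scores = np.random.randn(5_000) + 3.0

y_score = np.concatenate([background_scores, signal_scores])
y_true = np.concatenate(
    [np.zeros_like(background_scores), np.ones_like(signal_scores)]
)

# AUROC restricted to false positive rates below 1e-3 ("AUROC@1e-3"),
# which emphasizes the low-FAR regime we actually operate in
partial_auroc = roc_auc_score(y_true, y_score, max_fpr=1e-3)
```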
- PRs to merge
- renaming to `aframe`
- dedicated augmentation module and SNR scheduler
- Ethan's PR to this branch fixing an augmentation bug and adding tests
- Timeslide validation (no PR yet and needs some testing, but Alec has code working)
- Decided order should go
- Merge `aframe` renaming
- Alec merge Ethan's PR to his fork off of the batch augmentation branch
- Rebase merged batch augmentation branch on top of renamed `main`, merge it
- Rebase timeslide validation on top of all of these
- Discussed Will's paper draft, decided will probably make sense to break up into two papers
- One simple and straightforward, aimed at GW practitioners and in particular other search pipelines. Summarizes how pipeline works and results it achieved, with some overview of how we managed to avoid pitfalls of other ML applications
- One more in-depth on design philosophy, motivating things more from first principles. Aimed at a broader audience, and in particular ML practitioners who are interested in getting involved, to cover the requisite background and the things that need to be understood about why applying ML in this field is trickier than it seems.
- Need to start stripping out content from Will's draft and deciding what will go where.
Will had a great post in Slack outlining short, near, and long-term plans, with some further discussion in the comments.
- Get the PRs outlined above merged, then launch a hyperparameter search over the weekend that includes parameters of newly added training functionality
- Depending on results of HP search
- Begin building summary plots to include in paper
- Begin cleaning up and standardizing code
- Figure out a way to get apples-to-apples comparison with other BBH paper
- Either train around MDC data and make MDC submission, or build test dataset around MDC training times then apply their model
- Segfaults during glitch generation are holding back online deployment for now, but might not be a priority if we're trying to get paper results out the door. At some point will need to simplify this code and avoid duplicate downloading noted by Ethan in #346
- Create issue outlining changes to JSON fields that need to go in to online deployment branch
- Initial work on building online deployment as true microservice (#338 and #339)
- Decided given the amount of work involved, it made more sense to begin with a static model trained using existing pipeline on ER15 data and deploy manually, re-running as necessary
- Will give us initial infrastructure and weak points that we can integrate into true production deployment in parallel with algorithm development
- Discussed that biggest computational challenge will be building background on each retraining of model to establish detection statistic threshold for a given FAR
- Will need to be able to achieve FAR of > 1/day to get attention of searches
- Will be useful to think of ways to build background once at start of run that we can then use online
- One idea is to reintroduce normalization, to make units more objective (how many standard deviations over the mean of the last N seconds has the mean of the last second of outputs been? - see the sketch below)
- Could simulate by building threshold at start of O3, then verifying that FARs remain consistent when using threshold across rest of run
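A minimal sketch of the normalization idea above: z-score the most recent second of network outputs against the trailing N seconds. The window lengths are placeholders, not decided values.

```python
import numpy as np


def normalized_output(outputs, sample_rate, norm_window=10.0, avg_window=1.0):
    """z-score of the mean of the most recent `avg_window` seconds of network
    outputs relative to the `norm_window` seconds that precede them."""
    n_avg = int(avg_window * sample_rate)
    n_norm = int(norm_window * sample_rate)
    recent = outputs[-n_avg:]
    history = outputs[-(n_norm + n_avg):-n_avg]
    return (recent.mean() - history.mean()) / history.std()
```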
- Ethan updating config to train model on ER15 data (#340), seemed to run pretty easily and achieve decent results
- Running inference at time of writing
- Alec writing simple online deployment code based off of DeepClean implementation (#341)
- Talk with searches about getting BBHNet triggers in non-production GraceDB
- If we do this, we'll want to come up with a better name. Discussing internally and hoping to come to a decision next week
- Get online deployment running and begin submitting triggers
- Return to offline development work to improve model performance
- Reach out to searches group and figure out requirements for getting BBHNet added
- Beginning overhaul of vizapp to utilize new inference APIs and improved sensitive volume calculation (#330). An example of the main sensitive distance plot summarizing the performance of each run is shown below.
- Ethan mocked up example of end-to-end data augmentation module here
Main plans for next week will be getting online deployment off the ground. Other sections mentioned below are just for noting future plans, and may be pursued as time allows.
Given that O4 will be starting soon, decided it's high priority to get something producing triggers so we can start getting eyes on the pipeline. Discussed the necessary steps below and organized them into a milestone
- Run on live data, not replays
- Copy over DeepClean deployment code for collecting frames and launching training jobs
- How do we know which frames are usable as training data
- Query segments
- Omicron
- Where do the triggers live?
- How do we pick a trigger threshold?
- Timeslide 1 day worth of background from held-out data and use the loudest event as the FAR 1/day threshold (see the sketch after this list)
- Run local, non-IaaS inference to start with
- Where would it live? On DeepClean boxes?
- How do we recognize bad data?
- Integrate GraceDB triggers
- Run at >=1024Hz for trigger times
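A minimal sketch of the thresholding step referenced above: given clustered detection statistics from time-shifted background and the accumulated background livetime, pick the statistic value whose empirical rate matches the target FAR. With exactly one day of background this reduces to the loudest event. Names and values are illustrative.

```python
import numpy as np


def far_threshold(background_events, background_livetime_days, far_per_day=1.0):
    """Detection-statistic threshold corresponding to a target FAR.

    background_events: clustered detection statistics recovered from the
        time-shifted (signal-free) background
    background_livetime_days: total livetime accumulated across timeslides
    """
    events = np.sort(background_events)[::-1]
    # the k-th loudest background event has an empirical FAR of k / livetime
    k = max(int(round(far_per_day * background_livetime_days)), 1)
    return events[k - 1]


# with exactly 1 day of timeslid background, this is just the loudest event
threshold = far_threshold(np.random.randn(1000), background_livetime_days=1.0)
```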
- Alec - Implement SNR scheduler and integrate with Ethan's augmentation module above
- Ethan - Build validation metrics for validation/test metric correlation experiment
- Will - Run large HP sweep once we have all parameters in and have a metric we're confident in
- Finish vizapp overhaul, fork from new structure contained in #330. Consult image from last week for plots included in each page.
- Will - data summary page
- Alec - performance summary page
- Ethan - Results analysis page
None
Took a couple of weeks off of meetings for APS.
- Ethan implemented channel swapping/muting augmentations for training (#328)
- Ethan implementing ML4GW overhauls for SNR scheduling and a more robust augmentation pipeline (ML4GW#48, ML4GW#49, ML4GW#50)
- Fully fleshed out issues with sensitive volume calculation:
- End O3 prior had two large flaws
- Bilby's pdf implementation for the conditional mass distribution was not properly normalized, so when we importance sampled we had to normalize the weights explicitly by dividing by the sum of all the weights (including the rejected parameters)
- We chose to make the lower mass limit 10 solar masses to keep most of the signal SNR in our input kernel. This meant that for priors like the MDC prior (where the mass extended down to 7 solar masses), we had no samples and so our results were skewed
- For the time being, the plan is to always normalize by the sum of the weights to be safe and switch to using the MDC prior (or possibly the End O3 prior with the lower mass limit moved to 7). If we have a sufficiently well behaved prior, we could switch back to just normalizing by `num_injections`, but we can cross that bridge once everything is nailed down (see the sketch below).
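A schematic of the weight normalization described above, using the self-normalized estimator (divide by the sum of all weights, rejected injections included) rather than `num_injections`. Argument names are illustrative, not the actual aframe code.

```python
import numpy as np


def sensitive_volume(weights, recovered, v_prior):
    """Importance-sampled sensitive volume estimate.

    weights: target-density / injected-density weights for *all* injections,
        including those rejected before injection (e.g. by an SNR cut)
    recovered: boolean mask over the same injections marking which ones the
        search recovered below the FAR threshold of interest
    v_prior: volume of the region the injections were drawn over
    """
    # normalizing by the sum of all weights (rather than num_injections)
    # keeps the estimate well behaved even if the target pdf isn't normalized
    mu = weights[recovered].sum() / weights.sum()
    # effective sample size, useful for tracking the Monte Carlo uncertainty
    n_eff = weights[recovered].sum() ** 2 / (weights[recovered] ** 2).sum()
    return v_prior * mu, n_eff
```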
- Distributed inference PR (#308) merged into `inference-overhaul` branch
- Now saves rejected parameters during waveform generation to be used in the sensitive volume calculation
- Build out more robust SNR scheduling and augmentation pipeline using new ML4GW components (gist by Ethan here, crude drawing from meeting below)
- Now that sensitive volume calculation is working, do large scale validation metric experiment to see which ones correlate with sensitive volume best
- Do inference on timeslides of validation background, then pool for "events", then compute recall wrt the highest event?
- Probably won't provide enough coincident glitches
- Do just permutations of coincident validation glitches?
- Look at existing runs and see how our network output distributions compare between pure background, single glitch, and coincident glitches
- Keep looking at AUROC-adjacent metrics?
- Vizapp improvements
- Integrate new waveform generation/inference APIs into vizapp (#330; also fixes some inference issues that arose from last-minute changes to the waveform generation API)
- Data tab summarizing training and test data distributions
- Performance tab summarizing training/validation histories as well as test sensitive volume estimates vs. FAR for multiple distributions
- Analysis tab that shows more information on background events and has more robust inspection (Q-scans, saliency maps, etc.)
- Overall aesthetic improvements - larger text, image download tools, etc.
- See all ideas discussed in meeting at APS in note below
- Alec send link for Nautilus cluster sign-up via Slack
- Reverted to using the 15/15 log-normal distribution during training
- Still need to implement some form of scheduling
- As discussed in #319, Ethan implemented SNR flooring for validation waveforms. A well-trained model now performs better than random.
- E.g. here are the best performing hyperparameters from Will's last HP search
- Ethan implemented the training injection channel-swapping augmentation outlined in #318
- Improved validation AUROC at all thresholds compared to no-swapping, but test-time sensitive volume performs worse (SV down from 3.5 to 2.3 Gpc^3)
- Needs hyperparameter search, but not useful unless validation metric actually correlates with test performance
- Possible that using new SNR distribution mitigates usefulness of AUROC metric - measuring performance below loudest background, but not going to operate there
- Potentially validated by the fact that recall vs. just glitches performs better with no swapping in these runs
- Glitches are contained in validation background, so no extra data here, but recall at specificity=1 gives us fraction of events over loudest background, where we expect to operate
- Need to do a full run with a few validation metrics of interest, then do inference with checkpoints to see which correlate best
- Once we have a good metric for new validation data distribution, re-run HP search on new augmentation parameters
- Found a bug where we were measuring sensitive distance wrt detector-frame parameters, not source-frame (#325)
- Also messing up units in the `n_eff` and variance calculation
- Fixing this improves sensitive volume estimates drastically - brings Will's best HP search model up from 3.5 to 13.1 Gpc^3 on the ~10 day background dataset
- Distributed inference now working (#308), running on 62 days worth of background in ~30 minutes using 8 V100 GPUs
- Still only at ~70% GPU saturation, could add more clients to improve throughput further
- Running with Will's best model from the HP search produces SV of 6.9 Gpc^3
- As expected, some drop off due to longer background
- Alec re-ran whole pipeline with similar HP parameters, achieved model with ~7.1 Gpc^3
- Needs more extensive unit testing to validate that data is being iterated/postprocessed correctly, but sanity checks show most things are working as expected
- Ethan working on contributing some of these tests/fixes here
- Folks working on presentations for April APS meeting - aiming to finish by end-of-week for P&P approval
- Training run with multiple validation metrics for running validation/test metric correlation analysis
- Validating distributed inference code to get new pipeline merged
- Fixing vizapp to work with distributed inference code
- Alec - Get Rafia access to `deepclean` boxes at detector sites
Will ran another hyperparameter sweep successfully, produced roughly 2x SV compared to original parameters.
- Still well short of SV from existing pipelines, but planned training augmentations should help to improve performance
- Would like to compare to other BBH paper, but need to evaluate against their mass distribution, which is U(7, 50)
- HP search showed that even our best models will perform worse than random guessing according to validation AUROC. This is because we evaluate on a lot of low-SNR events we don't expect to be able to recover. We should start doing SNR rejection on at least our validation waveform dataset, if not our training waveform dataset as well.
Alec working on finishing distributed inference PR.
- Have implementation working, required reimplementing dataloading to get true asynchronous behavior
- Ran with 8 GPUs, saw linear 4x improvement over 2 GPU throughput (1h20m down to 20m for 8-10 days worth of background)
- Writing tests to verify everything is correct
- Trying to run with Will's best performing HP search model to verify that the predictions come out sane, but having trouble getting condor jobs scheduled at LHO. Will re-run at LLO
- Ethan made a PR to this branch to pull testing segments during background generation that Alec needs to merge. Then with tests this PR should be done.
Noted discrepancy in how pipelines report their performance in the catalog paper: real events are reported with the FAR for the corresponding decision statistic, but VTs are calculated using the threshold `p_astro > 0.5`, for which a corresponding FAR is not provided. Ethan thinks he can re-run the pipeline VT estimates using FAR thresholds to get a better comparison with our pipeline.
- Fix waveform data used during validation to get more meaningful estimate of performance
- Re-compute pipeline VT estimates using FAR thresholds
- Implement SNR scheduling and injection augmentations to improve robustness to glitches and lower SNR events
- Implement the U(7, 50) mass distribution and the relevant log-normal `(m1, m2)` pairs from the catalogue paper as pre-computed FAR-vs-SV curves in the vizapp, showing other pipelines' performance as individual points
- Finish testing distributed inference PR to get it merged
- Alec - Get Rafia access to `deepclean` boxes at detector sites
Not much progress to report, competing projects and LVK meeting getting in the way. Most important changes are fixes to AUROC calculation, background data loading, sensitivity calculation, and visualization app contained in #318 and Will's PR against that branch.
- New inference API now working, just need to integrate with Ethan's condor code to distribute.
- Will ran the existing pipeline using the old SNR sampling scheme and the model was able to achieve a sensitive volume of around 1700 Mpc on the current smaller test set, which is really impressive, beating results from this paper presented at LVK this week (though obviously at lower confidence given the amount of background used).
- Will has also begun adding a lot of text to the paper, which folks should take a look at when they have a chance.
Current status of path to publication:
- Merge fixes to training pipeline/sensitive volume calculation
- Re-run HP search with LogNormal distribution, get sensitive volume measurement on current smaller test set
- Run inference with a couple different HP settings to evaluate correlation with validation AUROC
- If not great, implement new timeslide-based AUROC
- Finish distributed inference PR
- Do a full run w/ month-year of background
- Use this as a higher confidence result that we can use to start seriously putting paper together
- Once paper writing is underway, look at possible additional training improvements
- SNR scheduling following the example of the LVK paper referenced above. Start with a higher SNR distribution to get the network in a good area of parameter space, then temper to the "true" distribution through the course of training
- Injection augmentation: in order to get the network to learn that signals in both IFOs must be both coincident and coherent, randomly zero out the projected waveform of one IFO pre-injection (looks like a BBH but isn't coincident), and randomly swap one IFO between some pairs of injected waveforms (coincident events that look like BBHs, but aren't coherent). Mark these augmented waveforms as 0s (a rough sketch below).
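A rough sketch of the swap/mute augmentation described above, operating on a batch of projected responses shaped `[batch, num_ifos, time]`; the fractions, shapes, and function name are assumptions rather than the actual implementation (cf. #318/#328).

```python
import torch


def swap_and_mute(waveforms, labels, mute_frac=0.1, swap_frac=0.1):
    """Coincidence/coherence augmentation on a batch of waveforms to be injected.

    waveforms: [batch, num_ifos, time] projected responses
    labels: [batch] tensor with 1 = injection, 0 = noise
    """
    waveforms = waveforms.clone()
    labels = labels.clone()
    batch, num_ifos, _ = waveforms.shape

    # mute: zero out one randomly chosen IFO's response for a random subset
    # (looks like a signal in one IFO, but isn't coincident)
    mute = torch.rand(batch) < mute_frac
    mute_ifos = torch.randint(num_ifos, (batch,))
    waveforms[mute, mute_ifos[mute]] = 0

    # swap: exchange one IFO's response between pairs of waveforms
    # (coincident "events" that look like BBHs, but aren't coherent)
    swap_idx = torch.nonzero(torch.rand(batch) < swap_frac).squeeze(-1)
    swap_idx = swap_idx[: 2 * (len(swap_idx) // 2)].view(-1, 2)
    for i, j in swap_idx.tolist():
        ifo = torch.randint(num_ifos, (1,)).item()
        waveforms[[i, j], ifo] = waveforms[[j, i], ifo]

    # anything muted or swapped should now be labeled as noise
    labels[mute] = 0
    labels[swap_idx.flatten()] = 0
    return waveforms, labels
```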
Alec will be called away to work primarily on DeepClean for the next week or two, but will get infer PR in before pausing heavy dev work.
- Alec - Get Rafia access to `deepclean` boxes at detector sites
Issues with Hanford cluster slowed down some progress this week.
- Ran hyperparameter sweep using 64 iterations of the training pipeline in serial. Results described here
- Ran inference using top performing models, but they just output constant values. Turns out there was an issue with the AUROC calculation for the case when all the outputs are identical that was causing the performance to look perfect
- Real result of the sweep is that it seems the model had trouble learning new data distribution
- Getting scaled up inference code together
- Integrated new injection API into timeslide waveform generation code, updated inference code to expect that all shifts live in the same file
- Local unit tests passing, but issues with Hanford cluster made running the pipeline difficult
- Replicated environment at Livingston, but had issues with connecting to data services
- Will pointed out that this is due to this issue; the solution is to set the environment variable `GWDATAFIND_SERVER=ldrslave.ldas.ligo-la.caltech.edu:80`
- Ethan will be attending LVK meeting at Northwestern, represent our work and keep tabs on similar work by others.
- Fix AUROC calculation by shuffling samples before sorting, that way constant values come out poor
- Diagnose training difficulties
- Run current pipeline with louder SNR distribution
- Run current pipeline with the old `nonspin_bbh` waveform dataset but the current SNR sampling scheme
- A possible solution might eventually be to start with high SNRs then temper to lower ones through the course of training (a rough sketch of such a scheduler is below)
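A rough sketch of one way such a scheduler could look: linearly anneal the lower edge of the target SNR distribution from a loud starting value down to the "true" floor over some number of epochs. The distribution shape and all numbers are placeholders, not decided values.

```python
import torch


class SnrScheduler:
    """Linearly anneal the lower edge of the SNR distribution that waveforms
    are rescaled to, starting loud and tempering toward the target floor."""

    def __init__(self, snr_min_start=12.0, snr_min_end=4.0, snr_max=100.0, decay_epochs=50):
        self.snr_min_start = snr_min_start
        self.snr_min_end = snr_min_end
        self.snr_max = snr_max
        self.decay_epochs = decay_epochs

    def snr_min(self, epoch):
        # interpolate the floor from its starting to its final value
        frac = min(epoch / self.decay_epochs, 1.0)
        return self.snr_min_start + frac * (self.snr_min_end - self.snr_min_start)

    def sample(self, n, epoch):
        """Draw n target SNRs uniformly between the current floor and snr_max."""
        lo = self.snr_min(epoch)
        return lo + (self.snr_max - lo) * torch.rand(n)
```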
- Test scaled up inference pipeline with data generated by updated waveform generation script
- Integrate with Ethan's condor code
- Integrate new inference API into vizapp
- Add a tab for data visualization
- Train/test signal parameter distributions
- Pie chart showing relative fractions of background/waveforms/glitches and combinations of the two during training
- Show full config
- Autoencoders for BNS detection
- Use latent timeseries as input to regular BBHNet framework, include encoder as part of preprocessor
- Major benefit of doing it this way rather than training end-to-end is the ease with which you can supply external information in the autoencoder problem that gets thrown away in the binary detection problem, e.g. the masses of the signals
- One idea is to regress from an injected signal to the raw waveform
- Benefit might be that including this information, and learning the ability to encode it, might make it easier to pick out these longer duration, lower SNR signals
- Heroically slogging through finding the disparities between the redshift and M1 distributions of our injections, Reed's injections, and the Rates and Pops team's injections. See #305 and #313 for more details.
- Still some slight disparities, but we're close enough and existing code isn't reproducible enough for us to keep spending time on this.
- We'll merge changes as-is and run the pipeline with them
- Saving glitch timestamps so that we can separate out glitches from the validation period during training (see here and here)
- Working through bilby solutions to prior issues. Settled on reverting to conditional priors #313
- Using Ethan's glitch timestamps to handle splitting at train time (#312)
- Issues with rebasing his branch on `main`. Think the solution will be to:
- Ethan rebase his branch on `main`
- Will fetch Ethan's branch, rebase his on top
- Push to his remote, open PR against the branch on Ethan's remote
- Ethan merge his changes
- Merge Ethan's updated branch to `main`
- Finishing feature additions to visualization app for easy comparison with existing pipelines #294
- Prepared toy example of autoencoder training for longer duration waveforms, ready to apply it to real LIGO data
- Building a more robust API for the new inference implementation to better handle chunked data loading from `mldatafind` and doing simultaneous background/foreground inference (#308)
- Glitch timestamp fixes and finalized injection parameter distributions should settle the data generation issues once and for all (Ethan + Will)
- Once these are merged, Will should be good to run a serial hyperparameter search
- Few tests remaining for new inference implementation, then need to integrate with Ethan's condorized code for small-scale distributed run good for O(months) background in 1-2 hours (Alec)