Meeting Notes
Documenting progress and issues from each week and planning tasks for the following week
Sensitive volume over time run.
- We saw that at the most significant FARs, there were periods where the original model was outperforming the retrained model
- Hypothesis: Periods where the original model outperforms the retrained model correspond to periods where the glitch rate of the original training period is higher than the glitch rate of the retrained period. The rationale being that because the original model was able to train on more glitches, it was more robust. There is some evidence of this by looking at the glitch rate ratio vs the SV ratio:
- Our intuition that the retrained model should perform better is exhibited when we look at slightly lower FARs. Possible that uncertainty in our FAR estimate at the most significant FARs is the culprit.
Consensus: In the paper we will report our original model's performance over time without the retrained model's performance. This investigation will be left for future work.
TODO: Alec will get benchmarking and MDC code into the paper branch. We will need to run this analysis and add the plot to the paper
Finishing final production experiments for paper.
- Headline results/re-seeded
- Both produced nearly identical performance - good
- Both had dips at lowest FARs, seems indicative of loud background events
- Questions/AIs:
- Inference sampling rate
- Results came out mostly as expected
- Questions/AIs:
- #439 again
- Intervals/Longevity
- Fine-tuning produced models that rarely outperformed the initial model on the same validation set
- Ethan's running trainings from scratch to compare, but preliminary results indicate they'll come out roughly the same
- Could indicate that initial training is good enough for extended deployment
- Means fluctuations over time are just the result of local noise levels
- Curious why all interval runs have lower validation AUROC than initial training run, but there are too many variables at play here to sort out, and it's not worth the time before publication. From-scratch results are conclusive enough for us
- Questions/AIs
- Wait for Ethan's current runs to complete and then draw conclusions (#443)
- Compare our fluctuations in sensitivity over O3 to those experienced by other pipelines. Are they similar?
- Offline throughput
- #442 outlines how we should compare our offline throughput per GPU to Eliu's from 2017
- This paper is worth comparing to because, outside of topline results, we're also outlining a software platform for BBH detection, and this is the only other paper I'm aware of that attempts something similar
- Questions/AIs
- We don't know if the server stats for our current 1-year and 2-month runs correspond to 7 or 8 GPUs, so once Ethan is done with his retraining runs on Hanford, Alec will re-run these using all 8 GPUs to compare their total throughputs.
- Code Alec needs to get into paper branch
Next steps beyond this are
- Get online deployment going again
- Alec to take some initial stabs at implementing some of our inference changes since the last one, hand off to Will mid-next week to take over and get a training run going. (#447)
Paper text is being worked on online, in parallel with code development. In the meeting, we mainly discussed outstanding code issues required to re-run all experiments to generate "final" results
- Data generation
- Setting seeds during data generation
- Switching to using open segments for background
- Switch to 8s waveforms to avoid the tail end of the signal getting tacked onto the coalescence
- Once these fixes are in, generate:
- 1-year dataset
- Interval datasets
- A separate 2-month dataset for smaller experiments (using different seed from 1-year dataset)
- MDC test set with parameters as close as possible to AresGW paper, explicitly record command
- Training
- Seed everything in training
- See if we can find a utility for this, otherwise follow instructions here
- Note that this will require a slight reformatting of the `ml4gw` dataloader in order to allow multiprocessed dataloading to be deterministic (see the sketch after this list)
- Fix bug in validation where predictions on different injection views were being aggregated incorrectly. The actual issue came from how the PSD was being calculated; this was corrected by changing the averaging method to "median".
- Inference
- Switch to an inference sampling rate of 4
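As a reference point for the seeding items above, here is a minimal sketch of what "seed everything" plus deterministic multiprocessed dataloading could look like in PyTorch. This is not the aframe/`ml4gw` implementation; the function names and the seed value are illustrative.

```python
import os
import random

import numpy as np
import torch


def seed_everything(seed: int) -> None:
    """Seed every RNG we rely on so that training runs are reproducible."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # trade a bit of speed for deterministic cuDNN kernels
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


def worker_init_fn(worker_id: int) -> None:
    """Give each dataloader worker a derived, but deterministic, seed."""
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


# usage: seed once up front, then pass a seeded generator + init fn to the loader
seed_everything(42)  # arbitrary seed for illustration
generator = torch.Generator().manual_seed(42)
# loader = torch.utils.data.DataLoader(
#     dataset, num_workers=4, worker_init_fn=worker_init_fn, generator=generator
# )
```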
The final set of experiments we'll need to run, and which we should have pipelines for, are:
- Vanilla sandbox with 1 year worth of background (1-year dataset)
- Intervals, both with initial model and with retraining (or fine-tuning)
- Do inference on 1-year with median and worst non-nan hyperparameters from search to show benefit of search
- Sandbox 1-year trained using a different seed
- Inference sampling rate comparisons (using 2 month-background dataset)
- Sandbox with all augmentations turned off
- MDC
Will has done a lot of good work trying to construct posteriors on the foreground and background counts described in this paper for developing a `p_astro` in the context of aframe, but without clearer steps from any of the pipelines we're having difficulty getting things to come out sensibly (a schematic of the quantity we're after is sketched after this list). Next steps here are to:
- Reach out to someone from one of the pipelines directly to see if we can get 30 minutes of their time to have them walk us through how they go about this. Preferably PyCBC since it seems simpler, but if no one on the group call tomorrow has any contacts on that team we'll reach out to Cody with GstLAL
- If the implementation will take anything more than a notebook cell, put this on hold for a follow up paper where we actually run over 0-lag O3 data
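For context, a schematic of the standard two-component mixture that `p_astro` reduces to once the foreground and background counts are in hand; the hard part for us is constraining those rates and densities, which is what the posteriors above are for. The function below is a sketch with illustrative argument names, not any pipeline's implementation.

```python
def p_astro(stat, rate_fg, rate_bg, fg_density, bg_density):
    """Two-component mixture estimate of the probability that a candidate
    with detection statistic `stat` is astrophysical.

    fg_density / bg_density: callables giving the (normalized) density of the
        detection statistic under the signal and noise hypotheses
    rate_fg / rate_bg: expected counts of signal and noise candidates over the
        analyzed livetime (the quantities the posteriors would constrain)
    """
    fg = rate_fg * fg_density(stat)
    bg = rate_bg * bg_density(stat)
    return fg / (fg + bg)
```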
Discussed remaining tasks and best course of action for reaching publication of initial aframe paper. Decided that main figures of merit should be:
Main Section (probably in this order)
- Table showing comparison of $p_{\text{astro}}$-thresholded sensitive volumes for aframe and GWTC-3 pipelines
- Grid plot showing aframe's sensitive volume (not distance) vs. FAR for each of the 4 main mass combinations (35/35, 35/20, 20/20, 20/10), with $p_{\text{astro}}$ values plotted as horizontal lines (including aframe's)
- Sensitive volume vs. time for Ethan's interval datasets
- Online latency vs. inference sampling rate plot
Methods Section (in no particular order)
- Will's experiments showing impact of inference sampling rate on sensitive volume
- Grid plot showing comparison of sensitive volume achieved on the timeslides from 1-month after the training period vs. that achieved over Ethan's interval datasets.
- ML Mock Data Challenge results/brief discussion on issues there
- Various diagrams illustrating the aframe training and inference pipelines
To get there, there are a few outstanding action items, which can be roughly broken up into 3 categories.
- Ethan will train models using the week before his interval test sets to see how that impacts performance over time
- Will will take another look at how sensitive distance scales over time for each of the mass combinations
- Alec will re-run `ml4gw` projection vs. bilby benchmarking
- Everyone will take a look at PyCBC's $p_{\text{astro}}$ implementation and one of us will try to implement it on one of our results
- Once all code is ready, run an HP search with all augmentations turned off to get a sense of advantage
- Begin producing online triggers and get connected to GraceDB Playground
- Seeding for dataset generation to ensure reproducibility
- Ethan's interval runs
- Will's inference sampling rate experiments
- MDC train dataset generation and inference
- Online benchmarking code
- `ml4gw` benchmarking code (last two probably in the same `benchmarking` project)
- Replace `vizapp` with plotting code, stripping out the `vizapp` components necessary for generating plot data from raw data
- Add ability to apply vetoes when generating plots
- Once everything is in, do a full re-run to ensure results are reproducible
- Alec will write draft intro
- Ethan working on data, augmentations, and SV in methods section
Hyperparameter search on MDC data has produced a model that's competitive with SOA results. Given this, plan to present to CBC West call on Aug 15th to share initial findings and solicit feedback for methods paper and organizing a review committee. Key figures of merit we'd like to show are:
- Sensitive distance vs. FAR for best performing model on MDC dataset. Compare with submitted results from original MDC
- Sensitive distance vs. FAR for best performing model on 1-year worth of aframe-generated end O3 rates+pops prior timeslides. Include importance sampling to GWTC-3 mass combinations (figure 6), and plot horizontal lines for "Any" values from that plot (top left).
- Sensitive distance vs. FAR for best performing end O3 model on 2-months worth of aframe-generated timeslides from 1-week taken from intermittent months after training period (e.g. 1, 3, 6, 9 months) to show how performance changes over time and infer retraining period.
- Latency vs. inference sampling rate at different batch sizes for online timing comparison.
- At some point, investigate trend in relationship between inference sampling rate and sensitivity
Action items required to get us there:
- Alec will keep running MDC HP search until EOD tomorrow
- Before then, run inference with new best performing model to see if gains are real and get data for first figure of merit
- Will starts putting together branch for generating end O3 training dataset and training on it
- Set `fftlength` for both training and inference to 2.5s to mitigate train/inference variance disparity
- Start a training run using best HP params from MDC model for comparison
- Either Will or Alec launch HP search on remaining available GPUs on DGX boxes
- In parallel, Alec will launch generation of 1-year worth of end O3 timeslide data from 1 month after training period
- After HP search has run, at start of next week, take best performing model and run it on 2 months of 1 year dataset to sanity check
- If results look reasonable, launch 1-year inference run
- While 1-year run goes, Ethan start generating interval datasets
- Once 1-year run completes, launch inference jobs on interval datasets
- Alec do online benchmarking using in-memory model, possibly converting to TensorRT if time allows
While all of this is happening,
- Start putting slides together at start of next week to begin organizing thoughts and story
- Start digesting the `p_astro` paper and thinking about how to frame this in our context
One key thing Erik pointed out that's worth making explicit both in this talk and in the paper is that we don't plan on showing any astrophysical results - i.e. we're not running on any zero-lag data. Our goal is to lay out a robust proof-of-concept and framework for doing ML-based detection in GW, setting the stage for a follow-up paper describing real astrophysical results.
Other, less pressing action items:
- Start doc outlining skeletons of various papers/blog posts to come out of this
- Accelerate code cleanup/robustness improvements
- Explore impact of inference sampling rate on sensitivity
- Reach out to AresGW team for evaluation HDF5 file for direct comparison in paper
- Attempted direct comparison of evaluation using MDC data/sensitive distance code and using aframe data/sensitive distance code
- Still having difficulty aligning the two calculations
- At the meeting we agreed that:
- At this point, given what we know about the flaws in the MDC sensitive distance calculation, we have no real plans to use it outside of comparison purposes for this one paper
- Given that and the fact that we have code for directly evaluating on the MDC datasets, we'll just start doing that going forward for comparisons, but report our "real" numbers using the end O3 r+p distribution, with events distributed uniformly in comoving volume. This framework will be one of the contributions of whatever paper we write.
- As such the plan moving forward will be:
- Get the code for doing inference on MDC data merged into the `mdc-dev` branch
- Get the `mdc-dev` branch merged into `main`
- Begin taking multiple views of signals during validation, to account for differences in response to position in kernel, particularly for lower mass events
- Split out `main` into two branches: one for continuing to produce predictions on MDC data and cleaning up that implementation, the other for stripping out the MDC-specific infrastructure and switching back to end O3 rates and pops
- The MDC branch will be used to launch an initial HP search (once we remove hopeless SNR thresholding during validation) to try to achieve SOA on the MDC dataset
- The end O3 branch will eventually make it back to `main` to act as our default distribution going forward
- Once this is ready and the MDC HP search has completed, run a separate search on this data distribution
- Once end O3 makes it back to `main`, this all will get merged into the `online-deployment` branch to begin re-running our online triggers
- Ethan has an implementation of glitch sampling that will be interesting to run through an HP search as well
- All of this can be visualized by the following diagram of git commits
- Plan for publication is:
- Achieve best score possible on MDC for comparison purposes
- Introduce our pipeline and present results on that
- Compare to sensitive volumes in GWTC-3 paper, with the caveat that these aren't reported with (and often don't even have a definite sense of) corresponding false alarm rates
- Plan for deployment is:
- Get online pipeline running and start producing triggers. Will be the only way to get noticed
- In parallel, produce predictions on Reed's O4 MDC for direct comparison to other pipelines
- Will is going to take a look at this data to see how it would fit into our framework. We suspect this will likely be more amenable to dataloading/inference structure for MDC data.
- Immediate next steps
- Review `mdc-infer` and merge into `mdc-dev`
- Update train start/stop times to reflect the fact that the MDC `start_offset` flag represents an offset in livetime, not total time (see this Slack post; the MDC test set with `--start-offset $((3600 * 24 * 10))` begins at GPS time 1241443783)
- Final review of `mdc-dev`, merge into `main`
- Validation signal fix on `main`
- Split out MDC branch, fix SNR cutoff
- Launch HP search
- Main focus was on running experiments to compare our algorithm with the MDC
- Ethan approached the problem by implementing the MDC prior and chirp distance calculation in the aframe framework. See #361
(Plot: sensitive distance (Mpc) vs. FAR (months^-1))
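For reference, the chirp distance convention used by the MDC (and PyCBC) rescales luminosity distance by chirp mass relative to a fiducial 1.4 + 1.4 solar-mass BNS, so that equal chirp distances correspond to roughly equal SNRs across chirp masses. A minimal sketch, assuming that convention:

```python
def chirp_mass(m1, m2):
    return (m1 * m2) ** 0.6 / (m1 + m2) ** 0.2


# fiducial chirp mass of a 1.4 + 1.4 solar-mass BNS (~1.219 Msun)
MCHIRP_FID = chirp_mass(1.4, 1.4)


def chirp_distance(distance, m1, m2):
    """Luminosity distance rescaled by chirp mass relative to the fiducial BNS."""
    return distance * (MCHIRP_FID / chirp_mass(m1, m2)) ** (5 / 6)
```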
- Will produced some comparisons of the chirp mass, chirp distance, and m1/m2 distributions between our implementation and the samples produced by the MDC code, and everything looks consistent. (See plots here)
- While analyzing the output of this run, discovered a discrepancy between local torch and Triton inference - could mean the above results are not to be trusted
- Alec continued the approach of playing by the MDC paper's rules, using their data generation and evaluation script code. While the above result is worse than Ethan's result, the acceptance window for events was 0.125 seconds, while Ethan's analysis essentially had an infinite acceptance window. Alec saw that increasing that window to 0.3 s led to an increase of sensitive distance to 1290 Mpc. Additionally, Ethan's run was trained for longer and saw validation score improvements.
- A couple of inference code bugs were squashed in Ethan's MDC data branch (linked above):
- Removing bogus events from the first `psd_length` seconds of data
- An edge case where some timeshifts were failing to complete for certain segments
- Will is going to pursue an experiment that uses a uniform-in-volume distribution (as opposed to uniform-in-comoving-volume) under the standard aframe SD calculation (i.e. no chirp distance sampling, use the hopeless SNR cut, etc.) and see if this accounts for the discrepancies.
- Get to the bottom of torch vs triton discrepancy
- Continue pursuing MDC comparison
- Continued focus on infrastructure development to be able to make a submission to the ML-MDC.
- Finalized implementation of local whitening during training as well as inference. In short, the new snapshotter keeps state for both the kernel and the data used to whiten it
- Alec trained a model on O3a data near MDC testing period, generated MDC injection set, and ran inference
- Working on running evaluation script and getting a result
- Development of a revamped visualization app that performs live injection and inference
- Discovered an odd discrepancy between torch inference and Triton-based inference. In short, torch-based inference was not triggering to the same degree as Triton inference on certain background events.
- Hard to see, but there is a slight difference in the whitened data seen by the torch vs. Triton models
- Need to determine if the NN output difference is due to this data discrepancy or the model itself (e.g. some precision issue)
- Discovered another artifact where some signal outputs were creating vertical clusters - possibly related to the above issue.
- Arrive at a metric on the MDC data
- Iron out torch vs triton discrepancies
- Continued development of vizapp
Work this week has been focused on infrastructure developments required to create a submission to the ML MDC mock data challenge
- Alec implemented local whitening in the frequency domain. Data will now be whitened using the N seconds preceding each kernel, as opposed to using a fixed whitening filter fit to the training background (a rough sketch of the idea below).
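A minimal numpy/scipy sketch of the idea (not the actual `ml4gw` implementation): estimate the PSD from the stretch of data preceding the kernel and divide the kernel's FFT by the resulting ASD. The `fftlength` and the overall normalization constant are placeholders.

```python
import numpy as np
from scipy.signal import welch


def whiten(kernel, history, sample_rate, fftlength=2.0):
    """Whiten `kernel` using a PSD estimated from the `history` preceding it."""
    # estimate the PSD of the preceding data; median averaging is robust to glitches
    freqs, psd = welch(
        history,
        fs=sample_rate,
        nperseg=int(fftlength * sample_rate),
        average="median",
    )
    # interpolate the ASD onto the kernel's frequency grid and divide it out
    kernel_fft = np.fft.rfft(kernel)
    fft_freqs = np.fft.rfftfreq(len(kernel), d=1 / sample_rate)
    asd = np.interp(fft_freqs, freqs, psd) ** 0.5
    asd = np.maximum(asd, 1e-30)  # avoid dividing by ~0 at DC
    # overall normalization constant omitted for clarity
    return np.fft.irfft(kernel_fft / asd, n=len(kernel))
```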
- ONNX does not support torch FFTs, so we needed TorchScript support to be able to export the new local whitener with the model. Alec implemented this in hermes.
- Alec has implemented chunked data loading so that we can train on multiple discontiguous segments. See #352
- An initial training run with these changes and analysis of the corresponding validation outputs illustrated that things seem to be working
- One interesting tidbit discovered on the call:
- It appeared that for low-SNR events, we were performing monotonically better as a function of chirp mass
- However, we hypothesized that this was because we were evaluating the detection statistic of validation signals with the coalescence time in the center of the kernel. This means that signals with SNR content longer than `kernel_length / 2` will not be fully seen. When adjusting the evaluation kernel such that the coalescence time is at the end of the kernel, the discrepancy disappears.
- This is evidence that we may benefit from longer kernel lengths
- Alec ran some additional investigations offline that are outlined in this Slack post
- Will opened a PR that adds the dynamic input normalization (DAIN) layer used by the AresGW group (#354) and ran some experiments using the old whitening and dataloading scheme. Loss curves for that run can be found in this Slack post
- Continue focus on infrastructure developments that enable us to make an MDC submission
- Alec's first naive implementation of a snapshotter that incorporates the new local whitening scheme was extremely slow because we have to unfold multiple 10-second kernels. The solution is to use a single N-second window to whiten the entire batch.
- Begin development of Vizapp so we can diagnose failure modes
- Beginning to think about standardizing nomenclature used in the repository #348
Renamed from BBHNet to aframe based on a ranked-choice poll.
- Will submitted ticket to GraceDB to get online alerts added to Playground
- Alec created online deployment branch for generating online observations, but the network went stale after ~1 week and re-training has been bottlenecked by segfault issues with glitch generation, so the model is offline for now.
- Need to make some changes to field names in JSON event files
- Alec investigated correlations between test sensitive distance (SD) and various validation metrics by running inference with 24 checkpoints from a particular training run.
- Evaluated 4 combinations of background and signal
- Background: just coincident validation glitches, or using timeslides of validation background data
- Signal: Validation waveforms evaluated using pre-sampled sky parameters, or validation waveforms averaged over 10 different samples of sky parameters. In each case, waveforms had their SNR rescaled to be minimum 4 before evaluation.
- Validation metrics were either recall (fraction of signals above loudest noise event), or area under the ROC curve (AUROC) at different max FPR thresholds.
- Results shown below, with the y-axis in all plots being test sensitive distance and the x-axis being the corresponding validation metric at each epoch. Each group's correlation coefficient is superimposed over it in text. Measured against sensitive distances at 3 different values of test false alarm rate, indicated in the title of the top left plot in each image.
- Timeslide validation at lower FPR thresholds shows the strongest correlation; not perfect, but at least some of the noise comes from the shortness of the test livetime (~2 months)
- Double glitches just tend to grossly underestimate performance, which we thought might be helpful as a lower-bound setter, but it turns out not to be the behavior you want to explicitly optimize for.
- Going to implement timeslide-based validation using AUROC@1e-3 as our standard for the time being until we can gather more evidence.
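For the AUROC-at-max-FPR metrics above, scikit-learn supports this directly via the `max_fpr` argument (it returns a standardized partial AUC, so absolute values won't match a raw partial integral). A small sketch with placeholder scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# network outputs on timeslide background (label 0) and injections (label 1);
# the scores here are random placeholders, not real data
background_scores = np.random.randn(100_000)
signal_scores = np.random.randn(5_000) + 3.0

y_score = np.concatenate([background_scores, signal_scores])
y_true = np.concatenate(
    [np.zeros_like(background_scores), np.ones_like(signal_scores)]
)

# AUROC restricted to false positive rates below 1e-3 ("AUROC@1e-3"),
# which emphasizes the low-FAR regime we actually operate in
partial_auroc = roc_auc_score(y_true, y_score, max_fpr=1e-3)
```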
- PRs to merge
- renaming to `aframe`
- dedicated augmentation module and SNR scheduler
- Ethan's PR to this branch fixing an augmentation bug and adding tests
- Timeslide validation (no PR yet and needs some testing, but Alec has code working)
- Decided order should go
- Merge `aframe` renaming
- Alec merge Ethan's PR to his fork off of the batch augmentation branch
- Rebase merged batch augmentation branch on top of renamed `main`, merge it
- Rebase timeslide validation on top of all of these
- Discussed Will's paper draft, decided will probably make sense to break up into two papers
- One simple and straightforward, aimed at GW practitioners and in particular other search pipelines. Summarizes how pipeline works and results it achieved, with some overview of how we managed to avoid pitfalls of other ML applications
- One more in-depth on design philosophy, motivating things more from first principles. Aimed at a broader audience, and in particular ML practitioners who are interested in getting involved, to cover the requisite background and the things that need to be understood about why applying ML in this field is trickier than it seems.
- Need to start stripping out content from Will's draft and deciding what will go where.
Will had a great post in Slack outlining short, near, and long-term plans, with some further discussion in the comments.
- Get the PRs outlined above merged, then launch a hyperparameter search over the weekend that includes parameters of newly added training functionality
- Depending on results of HP search
- Begin building summary plots to include in paper
- Begin cleaning up and standardizing code
- Figure out a way to get apples-to-apples comparison with other BBH paper
- Either train around MDC data and make MDC submission, or build test dataset around MDC training times then apply their model
- Segfaults during glitch generation are holding back online deployment for now, but might not be a priority if we're trying to get paper results out the door. At some point will need to simplify this code and avoid duplicate downloading noted by Ethan in #346
- Create issue outlining changes to JSON fields that need to go in to online deployment branch
- Initial work on building online deployment as true microservice (#338 and #339)
- Decided given the amount of work involved, it made more sense to begin with a static model trained using existing pipeline on ER15 data and deploy manually, re-running as necessary
- Will give us initial infrastructure and weak points that we can integrate into true production deployment in parallel with algorithm development
- Discussed that biggest computational challenge will be building background on each retraining of model to establish detection statistic threshold for a given FAR
- Will need to be able to achieve FAR of > 1/day to get attention of searches
- Will be useful to think of ways to build background once at start of run that we can then use online
- One idea is to reintroduce normalization, to make units more objective (how many standard deviations over the mean of the last N seconds has the mean of the last second of outputs been? - see the sketch below)
- Could simulate by building threshold at start of O3, then verifying that FARs remain consistent when using threshold across rest of run
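A minimal sketch of the normalization idea above: z-score the most recent second of network outputs against the trailing N seconds. The window lengths are placeholders, not decided values.

```python
import numpy as np


def normalized_output(outputs, sample_rate, norm_window=10.0, avg_window=1.0):
    """z-score of the mean of the most recent `avg_window` seconds of network
    outputs relative to the `norm_window` seconds that precede them."""
    n_avg = int(avg_window * sample_rate)
    n_norm = int(norm_window * sample_rate)
    recent = outputs[-n_avg:]
    history = outputs[-(n_norm + n_avg):-n_avg]
    return (recent.mean() - history.mean()) / history.std()
```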
- Ethan updating config to train model on ER15 data (#340), seemed to run pretty easily and achieve decent results
- Running inference at time of writing
- Alec writing simple online deployment code based off of DeepClean implementation (#341)
- Talk with searches about getting BBHNet triggers in non-production GraceDB
- If we do this, we'll want to come up with a better name. Discussing internally and hoping to come to a decision next week
- Get online deployment running and begin submitting triggers
- Return to offline development work to improve model performance
- Reach out to searches group and figure out requirements for getting BBHNet added
- Beginning overhaul of vizapp to utilize new inference APIs and improved sensitive volume calculation (#330). An example of the main sensitive distance plot summarizing the performance of each run is shown below.
- Ethan mocked up example of end-to-end data augmentation module here
Main plans for next week will be getting online deployment off the ground. Other sections mentioned below are just for noting future plans, and may be pursued as time allows.
Given that O4 will be starting soon, decided it's high priority to get something producing triggers so we can start getting eyes on the pipeline. Discussed the necessary steps below and organized them into a milestone
- Run on live data, not replays
- Copy over DeepClean deployment code for collecting frames and launching training jobs
- How do we know which frames are usable as training data
- Query segments
- Omicron
- Where do the triggers live?
- How do we pick a trigger threshold?
- Timeslide 1 day worth of background from held-out data and use the loudest event as the FAR 1/day threshold (see the sketch after this list)
- Run local, non-IaaS inference to start with
- Where would it live? On DeepClean boxes?
- How do we recognize bad data?
- Integrate GraceDB triggers
- Run at >=1024Hz for trigger times
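A minimal sketch of the thresholding step referenced above: given clustered detection statistics from time-shifted background and the accumulated background livetime, pick the statistic value whose empirical rate matches the target FAR. With exactly one day of background this reduces to the loudest event. Names and values are illustrative.

```python
import numpy as np


def far_threshold(background_events, background_livetime_days, far_per_day=1.0):
    """Detection-statistic threshold corresponding to a target FAR.

    background_events: clustered detection statistics recovered from the
        time-shifted (signal-free) background
    background_livetime_days: total livetime accumulated across timeslides
    """
    events = np.sort(background_events)[::-1]
    # the k-th loudest background event has an empirical FAR of k / livetime
    k = max(int(round(far_per_day * background_livetime_days)), 1)
    return events[k - 1]


# with exactly 1 day of timeslid background, this is just the loudest event
threshold = far_threshold(np.random.randn(1000), background_livetime_days=1.0)
```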
- Alec - Implement SNR scheduler and integrate with Ethan's augmentation module above
- Ethan - Build validation metrics for validation/test metric correlation experiment
- Will - Run large HP sweep once we have all parameters in and have a metric we're confident in
- Finish vizapp overhaul, fork from new structure contained in #330. Consult image from last week for plots included in each page.
- Will - data summary page
- Alec - performance summary page
- Ethan - Results analysis page
None
Took a couple of weeks off of meetings for APS.
- Ethan implemented channel swapping/muting augmentations for training (#328)
- Ethan implementing ML4GW overhauls for SNR scheduling and a more robust augmentation pipeline (ML4GW#48, ML4GW#49, ML4GW#50)
- Fully fleshed out issues with sensitive volume calculation:
- End O3 prior had two large flaws
- Bilby's pdf implementation for the conditional mass distribution was not properly normalized, so when we importance sampled we had to normalize the weights explicitly by dividing by the sum of all the weights (including the rejected parameters)
- We chose to make the lower mass limit 10 solar masses to keep most of the signal SNR in our input kernel. This meant that for priors like the MDC prior (where the mass extended down to 7 solar masses), we had no samples and so our results were skewed
- For the time being, the plan is to always normalize by the sum of the weights to be safe and switch to using the MDC prior (or possibly the End O3 prior with the lower mass limit moved to 7). If we have a sufficiently well behaved prior, we could switch back to just normalizing by `num_injections`, but we can cross that bridge once everything is nailed down (see the sketch below).
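A schematic of the weight normalization described above, using the self-normalized estimator (divide by the sum of all weights, rejected injections included) rather than `num_injections`. Argument names are illustrative, not the actual aframe code.

```python
import numpy as np


def sensitive_volume(weights, recovered, v_prior):
    """Importance-sampled sensitive volume estimate.

    weights: target-density / injected-density weights for *all* injections,
        including those rejected before injection (e.g. by an SNR cut)
    recovered: boolean mask over the same injections marking which ones the
        search recovered below the FAR threshold of interest
    v_prior: volume of the region the injections were drawn over
    """
    # normalizing by the sum of all weights (rather than num_injections)
    # keeps the estimate well behaved even if the target pdf isn't normalized
    mu = weights[recovered].sum() / weights.sum()
    # effective sample size, useful for tracking the Monte Carlo uncertainty
    n_eff = weights[recovered].sum() ** 2 / (weights[recovered] ** 2).sum()
    return v_prior * mu, n_eff
```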
- Distributed inference PR (#308) merged into `inference-overhaul` branch
- Now saves rejected parameters during waveform generation to be used in the sensitive volume calculation
- Build out more robust SNR scheduling and augmentation pipeline using new ML4GW components (gist by Ethan here, crude drawing from meeting below)
- Now that sensitive volume calculation is working, do large scale validation metric experiment to see which ones correlate with sensitive volume best
- Do inference on timeslides of validation background, then pool for "events", then compute recall wrt the highest event?
- Probably won't provide enough coincident glitches
- Do just permutations of coincident validation glitches?
- Look at existing runs and see how our network output distributions compare between pure background, single glitch, and coincident glitches
- Keep looking at AUROC-adjacent metrics?
- Vizapp improvements
- Integrate new waveform generation/inference APIs into vizapp (#330; also fixes some inference issues that arose from last-minute changes to the waveform generation API)
- Data tab summarizing training and test data distributions
- Performance tab summarizing training/validation histories as well as test sensitive volume estimates vs. FAR for multiple distributions
- Analysis tab that shows more information on background events and has more robust inspection (Q-scans, saliency maps, etc.)
- Overall aesthetic improvements - larger text, image download tools, etc.
- See all ideas discussed in meeting at APS in note below
- Alec send link for Nautilus cluster sign-up via Slack
- Reverted to using the 15/15 log-normal distribution during training
- Still need to implement some form of scheduling
- As discussed in #319, Ethan implemented SNR flooring for validation waveforms. A well-trained model now performs better than random.
- E.g. here are the best performing hyperparameters from Will's last HP search
- Ethan implemented the training injection channel-swapping augmentation outlined in #318
- Improved validation AUROC at all thresholds compared to no-swapping, but test-time sensitive volume performs worse (SV down from 3.5 to 2.3 Gpc^3)
- Needs hyperparameter search, but not useful unless validation metric actually correlates with test performance
- Possible that using new SNR distribution mitigates usefulness of AUROC metric - measuring performance below loudest background, but not going to operate there
- Potentially validated by the fact that recall vs. just glitches performs better with no swapping in these runs
- Glitches are contained in validation background, so no extra data here, but recall at specificity=1 gives us fraction of events over loudest background, where we expect to operate
- Need to do a full run with a few validation metrics of interest, then do inference with checkpoints to see which correlate best
- Once we have a good metric for new validation data distribution, re-run HP search on new augmentation parameters
- Found a bug where we were measuring sensitive distance wrt detector-frame parameters, not source-frame (#325)
- Also messing up units in the `n_eff` and variance calculation
- Fixing this improves sensitive volume estimates drastically - brings Will's best HP search model up from 3.5 to 13.1 Gpc^3 on the ~10 day background dataset
- Distributed inference now working (#308), running on 62 days worth of background in ~30 minutes using 8 V100 GPUs
- Still only at ~70% GPU saturation, could add more clients to improve throughput further
- Running with Will's best model from the HP search produces SV of 6.9 Gpc^3
- As expected, some drop off due to longer background
- Alec re-ran whole pipeline with similar HP parameters, achieved model with ~7.1 Gpc^3
- Needs more extensive unit testing to validate that data is being iterated/postprocessed correctly, but sanity checks show most things are working as expected
- Ethan working on contributing some of these tests/fixes here
- Folks working on presentations for April APS meeting - aiming to finish by end-of-week for P&P approval
- Training run with multiple validation metrics for running validation/test metric correlation analysis
- Validating distributed inference code to get new pipeline merged
- Fixing vizapp to work with distributed inference code
- Alec - Get Rafia access to `deepclean` boxes at detector sites
Will ran another hyperparameter sweep successfully, produced roughly 2x SV compared to original parameters.
- Still well short of SV from existing pipelines, but planned training augmentations should help to improve performance
- Would like to compare to other BBH paper, but need to evaluate against their mass distribution, which is U(7, 50)
- HP search showed that even our best models will perform worse than random guessing according to validation AUROC. This is because we evaluate on a lot of low-SNR events we don't expect to be able to recover. We should start doing SNR rejection on at least our validation waveform dataset, if not our training waveform dataset as well.
Alec working on finishing distributed inference PR.
- Have implementation working, required reimplementing dataloading to get true asynchronous behavior
- Ran with 8 GPUs, saw linear 4x improvement over 2 GPU throughput (1h20m down to 20m for 8-10 days worth of background)
- Writing tests to verify everything is correct
- Trying to run with Will's best performing HP search model to verify that the predictions come out sane, but having trouble getting condor jobs scheduled at LHO. Will re-run at LLO
- Ethan made a PR to this branch to pull testing segments during background generation that Alec needs to merge. Then with tests this PR should be done.
Noted discrepancy in how pipelines report their performance in the catalog paper: real events are reported with the FAR for the corresponding decision statistic, but VTs are calculated using the threshold `p_astro > 0.5`, for which a corresponding FAR is not provided. Ethan thinks he can re-run the pipeline VT estimates using FAR thresholds to get a better comparison with our pipeline.
- Fix waveform data used during validation to get more meaningful estimate of performance
- Re-compute pipeline VT estimates using FAR thresholds
- Implement SNR scheduling and injection augmentations to improve robustness to glitches and lower SNR events
- Implement the U(7, 50) mass distribution and the relevant log-normal `(m1, m2)` pairs from the catalogue paper as pre-computed FAR-vs-SV curves in the vizapp, showing other pipelines' performance as individual points
- Finish testing distributed inference PR to get it merged
- Alec - Get Rafia access to `deepclean` boxes at detector sites
Not much progress to report, competing projects and LVK meeting getting in the way. Most important changes are fixes to AUROC calculation, background data loading, sensitivity calculation, and visualization app contained in #318 and Will's PR against that branch.
- New inference API now working, just need to integrate with Ethan's condor code to distribute.
- Will ran the existing pipeline using the old SNR sampling scheme and the model was able to achieve a sensitive volume of around 1700 Mpc on the current smaller test set, which is really impressive, beating results from this paper presented at LVK this week (though obviously at lower confidence given the amount of background used).
- Will has also begun adding a lot of text to the paper, which folks should take a look at when they have a chance.
Current status of path to publication:
- Merge fixes to training pipeline/sensitive volume calculation
- Re-run HP search with LogNormal distribution, get sensitive volume measurement on current smaller test set
- Run inference with a couple different HP settings to evaluate correlation with validation AUROC
- If not great, implement new timeslide-based AUROC
- Finish distributed inference PR
- Do a full run w/ month-year of background
- Use this as a higher confidence result that we can use to start seriously putting paper together
- Once paper writing is underway, look at possible additional training improvements
- SNR scheduling following the example of the LVK paper referenced above. Start with a higher SNR distribution to get the network in a good area of parameter space, then temper to the "true" distribution through the course of training
- Injection augmentation: in order to get the network to learn that signals in both IFOs must be both coincident and coherent, randomly zero out the projected waveform of one IFO pre-injection (looks like a BBH but isn't coincident), and randomly swap one IFO between some pairs of injected waveforms (coincident events that look like BBHs, but aren't coherent). Mark these augmented waveforms as 0s (a rough sketch below).
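A rough sketch of the swap/mute augmentation described above, operating on a batch of projected responses shaped `[batch, num_ifos, time]`; the fractions, shapes, and function name are assumptions rather than the actual implementation (cf. #318/#328).

```python
import torch


def swap_and_mute(waveforms, labels, mute_frac=0.1, swap_frac=0.1):
    """Coincidence/coherence augmentation on a batch of waveforms to be injected.

    waveforms: [batch, num_ifos, time] projected responses
    labels: [batch] tensor with 1 = injection, 0 = noise
    """
    waveforms = waveforms.clone()
    labels = labels.clone()
    batch, num_ifos, _ = waveforms.shape

    # mute: zero out one randomly chosen IFO's response for a random subset
    # (looks like a signal in one IFO, but isn't coincident)
    mute = torch.rand(batch) < mute_frac
    mute_ifos = torch.randint(num_ifos, (batch,))
    waveforms[mute, mute_ifos[mute]] = 0

    # swap: exchange one IFO's response between pairs of waveforms
    # (coincident "events" that look like BBHs, but aren't coherent)
    swap_idx = torch.nonzero(torch.rand(batch) < swap_frac).squeeze(-1)
    swap_idx = swap_idx[: 2 * (len(swap_idx) // 2)].view(-1, 2)
    for i, j in swap_idx.tolist():
        ifo = torch.randint(num_ifos, (1,)).item()
        waveforms[[i, j], ifo] = waveforms[[j, i], ifo]

    # anything muted or swapped should now be labeled as noise
    labels[mute] = 0
    labels[swap_idx.flatten()] = 0
    return waveforms, labels
```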
Alec will be called away to work primarily on DeepClean for the next week or two, but will get infer PR in before pausing heavy dev work.
- Alec - Get Rafia access to `deepclean` boxes at detector sites
Issues with Hanford cluster slowed down some progress this week.
- Ran hyperparameter sweep using 64 iterations of the training pipeline in serial. Results described here
- Ran inference using top performing models, but they just output constant values. Turns out there was an issue with the AUROC calculation for the case when all the outputs are identical that was causing the performance to look perfect
- Real result of the sweep is that it seems the model had trouble learning new data distribution
- Getting scaled up inference code together
- Integrated new injection API into timeslide waveform generation code, updated inference code to expect that all shifts live in the same file
- Local unit tests passing, but issues with Hanford cluster made running the pipeline difficult
- Replicated environment at Livingston, but had issues with connecting to data services
- Will pointed out that this is due to this issue; the solution is to set the environment variable `GWDATAFIND_SERVER=ldrslave.ldas.ligo-la.caltech.edu:80`
- Ethan will be attending LVK meeting at Northwestern, represent our work and keep tabs on similar work by others.
- Fix AUROC calculation by shuffling samples before sorting, that way constant values come out poor
- Diagnose training difficulties
- Run current pipeline with louder SNR distribution
- Run current pipeline with the old `nonspin_bbh` waveform dataset but the current SNR sampling scheme
- A possible solution might eventually be to start with high SNRs then temper to lower ones through the course of training (a rough sketch of such a scheduler is below)
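A rough sketch of one way such a scheduler could look: linearly anneal the lower edge of the target SNR distribution from a loud starting value down to the "true" floor over some number of epochs. The distribution shape and all numbers are placeholders, not decided values.

```python
import torch


class SnrScheduler:
    """Linearly anneal the lower edge of the SNR distribution that waveforms
    are rescaled to, starting loud and tempering toward the target floor."""

    def __init__(self, snr_min_start=12.0, snr_min_end=4.0, snr_max=100.0, decay_epochs=50):
        self.snr_min_start = snr_min_start
        self.snr_min_end = snr_min_end
        self.snr_max = snr_max
        self.decay_epochs = decay_epochs

    def snr_min(self, epoch):
        # interpolate the floor from its starting to its final value
        frac = min(epoch / self.decay_epochs, 1.0)
        return self.snr_min_start + frac * (self.snr_min_end - self.snr_min_start)

    def sample(self, n, epoch):
        """Draw n target SNRs uniformly between the current floor and snr_max."""
        lo = self.snr_min(epoch)
        return lo + (self.snr_max - lo) * torch.rand(n)
```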
- Test scaled up inference pipeline with data generated by updated waveform generation script
- Integrate with Ethan's condor code
- Integrate new inference API into vizapp
- Add a tab for data visualization
- Train/test signal parameter distributions
- Pie chart showing relative fractions of background/waveforms/glitches and combinations of the two during training
- Show full config
- Autoencoders for BNS detection
- Use latent timeseries as input to regular BBHNet framework, include encoder as part of preprocessor
- Major benefit of doing it this way rather than training end-to-end is the ease with which you can supply external information in the autoencoder problem that gets thrown away in the binary detection problem, e.g. the masses of the signals
- One idea is to regress from an injected signal to the raw waveform
- Benefit might be that including this information, and learning the ability to encode it, might make it easier to pick out these longer duration, lower SNR signals
- Heroically slogging through finding the disparities between the redshift and M1 distributions of our injections, Reed's injections, and the Rates and Pops team's injections. See #305 and #313 for more details.
- Still some slight disparities, but we're close enough and existing code isn't reproducible enough for us to keep spending time on this.
- We'll merge changes as-is and run the pipeline with them
- Saving glitch timestamps so that we can separate out glitches from the validation period during training (see here and here)
- Working through bilby solutions to prior issues. Settled on reverting to conditional priors #313
- Using Ethan's glitch timestamps to handle splitting at train time (#312)
- Issues with rebasing his branch on `main`. Think the solution will be to:
- Ethan rebase his branch on `main`
- Will fetch Ethan's branch, rebase his on top
- Push to his remote, open PR against the branch on Ethan's remote
- Ethan merge his changes
- Merge Ethan's updated branch to `main`
- Finishing feature additions to visualization app for easy comparison with existing pipelines #294
- Prepared toy example of autoencoder training for longer duration waveforms, ready to apply it to real LIGO data
- Building a more robust API for the new inference implementation to better handle chunked data loading from `mldatafind` and doing simultaneous background/foreground inference (#308)
- Glitch timestamp fixes and finalized injection parameter distributions should settle the data generation issues once and for all (Ethan + Will)
- Once these are merged, Will should be good to run a serial hyperparameter search
- Few tests remaining for new inference implementation, then need to integrate with Ethan's condorized code for small-scale distributed run good for O(months) background in 1-2 hours (Alec)