feat: preparations, training, attacks, evaluation and helpers
update README with instructions (tested on RWTH HPC) and links
AnnikaStein committed Jun 26, 2022
1 parent 26818ef commit d39b62d
Showing 18 changed files with 31,233 additions and 1 deletion.
38 changes: 37 additions & 1 deletion README.md
@@ -1,2 +1,38 @@
# Adversarial-Training-for-Jet-Tagging
Code for "Improving robustness of jet tagging algorithms with adversarial training" (arXiv:2203.13890)
Code for:
> <b><a href="https://arxiv.org/abs/2203.13890" target="_blank">Improving robustness of jet tagging algorithms with adversarial training</a></b>
> A. Stein, X. Coubez, S. Mondal, A. Novak, A. Schmidt
> 2022.
<i>Jet Flavor dataset</i>

Obtained from http://mlphysics.ics.uci.edu/ and originally created for
> <b><a href="https://arxiv.org/abs/1607.08633" target="_blank">Jet Flavor Classification in High-Energy Physics with Deep Neural Networks</a></b>
> D. Guest, J. Collado, P. Baldi, S. Hsu, G. Urban, and D. Whiteson
> Physical Review D, 2016.
## Get and prepare dataset
### Download
Log in to a copy18-node of the HPC with high bandwidth (the download is about 2.2 GB)
```
wget http://mlphysics.ics.uci.edu/data/hb_jet_flavor_2016/dataset.json.gz
mkdir -p /hpcwork/<your-account>/jet_flavor_MLPhysics/dataset
mv dataset.json.gz /hpcwork/<your-account>/jet_flavor_MLPhysics/dataset
```
### Extracting the data via awkward arrays
Reading the file is not as straightforward as the ".json" ending suggests: at some point the data has to be unzipped or extracted, and the file does not hold a single JSON document but rather many JSON-like records distributed over the lines of the .json file. Consult the notebook `preparations/read_dataset.ipynb` for further details and possible alternatives for using the dataset. In the end, I settled on awkward arrays, with which the next steps become a bit easier.
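As a minimal sketch (assuming each non-empty line of the file holds one JSON record; the tested logic lives in `preparations/read_dataset.ipynb`), the data could be loaded like this:
```
import gzip
import json
import awkward as ak

# Hypothetical path, matching the download location above.
dataset_path = "/hpcwork/<your-account>/jet_flavor_MLPhysics/dataset/dataset.json.gz"

# The file is not one single JSON document but many JSON-like records,
# so one option is to parse it record by record and let awkward build
# a jagged array from them.
with gzip.open(dataset_path, "rt") as f:
    jets = ak.from_iter(json.loads(line) for line in f if line.strip())

print(len(jets), "jets loaded")
```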
### A first look at the data
Some initial investigations before proceeding to the actual framework are conducted inside `preparations/explore_dataset.ipynb`.
### Calculate defaults
To use custom default values that fit the bulk of each distribution well, preliminary studies are done inside `preparations/defaults.ipynb`. It is also the first notebook that makes use of `helpers/variables.py`.
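For illustration only (the actual per-variable choices are made in the notebook), a default could be placed just outside the bulk of a feature, for example below a low percentile:
```
import numpy as np

def bulk_default(values, unphysical_value=-999.0):
    # Hypothetical helper with a hypothetical sentinel value: put the default
    # one unit below the 1st percentile of the physical values, so it sits in
    # its own bin next to the bulk.
    physical = values[values != unphysical_value]
    return np.floor(np.percentile(physical, 1)) - 1.0
```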
### Clean samples
To avoid storing too many versions of the same data, cleaning the samples is not done as a separate step but happens later during preprocessing (scaling). There, the arrays are also flattened into their final shape, and the result should be a set of usable PyTorch tensors. During cleaning I do not cut on any variables; I only move certain unphysical values into dedicated default bins, so the fractions of jets of a given flavor in a given pt and eta bin do not change in the subsequent cleaning (and preprocessing) step.
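A minimal sketch of this idea (only non-finite entries are treated here; the per-variable definition of "unphysical" is worked out in the notebooks):
```
import numpy as np

def clean_feature(values, default):
    # Keep every jet: move non-finite (unphysical) entries into the dedicated
    # default bin instead of cutting them, so the flavor fractions per pt/eta
    # bin stay unchanged.
    return np.where(np.isfinite(values), values, default)
```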
### Calculate sample weights
Sample weights are calculated in `preparations/reweighting.ipynb`.
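A hedged sketch of one possible reweighting scheme (the actual binning and targets are defined in the notebook; `pt_bin`, `eta_bin` and `flavor` are assumed to be integer indices per jet):
```
import numpy as np

def sample_weights(pt_bin, eta_bin, flavor, n_pt, n_eta, n_flav):
    # Weight each jet by the inverse population of its (pt, eta, flavor) bin,
    # so that the flavors end up with comparable kinematic spectra.
    counts = np.zeros((n_pt, n_eta, n_flav))
    np.add.at(counts, (pt_bin, eta_bin, flavor), 1)
    weights = 1.0 / np.maximum(counts[pt_bin, eta_bin, flavor], 1)
    return weights / weights.mean()
```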
### Preprocessing
Calculate the scalers (from the training set only, ignoring defaults), apply the scalers (do _not_ ignore defaults when applying them; an alternative is to set the defaults to zero), do the train/val/test splitting and shuffling, and build the sample weights and bins. See `preparations/clean_preprocess.ipynb` for a first working example of the entire preprocessing chain. Later, `evaluate/tools.py` can be used to pass information from the preprocessing step to the training and evaluation scripts.
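A minimal sketch of the scaler logic for a single column (assuming scikit-learn's `StandardScaler`; the full chain is in `preparations/clean_preprocess.ipynb`):
```
import numpy as np
from sklearn.preprocessing import StandardScaler

def fit_and_apply_scaler(train_col, col, default):
    # Fit on the training set only, ignoring default entries, ...
    scaler = StandardScaler().fit(train_col[train_col != default].reshape(-1, 1))
    # ... then transform every entry, defaults included (alternatively the
    # defaults could be set to zero after scaling).
    return scaler, scaler.transform(col.reshape(-1, 1)).flatten()
```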
## Run framework (training, evaluation)
### Training
All relevant scripts are placed inside `training`: standalone training on the current node is done with `training.py`, and for submission to the batch system there are `training.sh` and `submit_training.py`. Both nominal and adversarial training are supported.
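For orientation, a hedged sketch of a single adversarial training step using `fgsm_attack` from `attack/disturb_inputs.py` (the model, criterion, optimizer and batch are assumed to exist already; the full logic lives in `training/training.py`):
```
# Assumes attack/ is on the Python path, as in the actual scripts.
from disturb_inputs import fgsm_attack

def adversarial_step(model, criterion, optimizer, inputs, targets, epsilon=1e-2):
    # Build FGSM-perturbed inputs from the current model, then train on them.
    adv_inputs = fgsm_attack(epsilon=epsilon, sample=inputs, targets=targets,
                             thismodel=model, thiscriterion=criterion)
    optimizer.zero_grad()
    loss = criterion(model(adv_inputs), targets).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```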
### Evaluation
ROC curves: `evaluate/eval_roc_new.py`. Training history (loss): `evaluate/plot_loss.py`. Tagger outputs and discriminator shapes: `evaluate/eval_discriminator_shapes.py`. Plotting of input variables: `evaluate/eval_inputs.py`.
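As a simple sketch of the ROC ingredients (the actual plots come from `evaluate/eval_roc_new.py`; `labels` and `scores` are assumed to be the true flavors and the network's b-jet probabilities):
```
from sklearn.metrics import roc_curve, auc

def b_vs_light_roc(labels, scores):
    # labels: 1 for b jets, 0 for light jets; scores: predicted b probability.
    fpr, tpr, _ = roc_curve(labels, scores)
    return fpr, tpr, auc(fpr, tpr)
```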
159 changes: 159 additions & 0 deletions attack/disturb_inputs.py
@@ -0,0 +1,159 @@
import numpy as np
import torch

import sys

sys.path.append("/home/um106329/aisafety/jet_flavor_MLPhysics/helpers/")
from tools import defaults_path, preprocessed_path, get_all_scalers, get_all_defaults
from variables import integer_indices, n_input_features, get_wanted_full_indices, all_factor_epsilons

all_scalers = np.array(get_all_scalers())
all_defaults_scaled = np.array(get_all_defaults(scaled=True))
all_defaults = np.array(get_all_defaults(scaled=False))

def apply_noise(sample, magn=1e-2,offset=[0], dev="cpu", filtered_indices=[i for i in range(n_input_features)],restrict_impact=-1):
    seed = 0
    np.random.seed(seed)

    if magn == 0:
        return sample
    n_Vars = len(filtered_indices)

    wanted_full_indices = get_wanted_full_indices(filtered_indices)

    scalers = all_scalers[wanted_full_indices]

    defaults_per_variable = all_defaults[wanted_full_indices]
    scaled_defaults_per_variable = all_defaults_scaled[wanted_full_indices]

    device = torch.device(dev)

    with torch.no_grad():
        noise = torch.Tensor(np.random.normal(offset,magn,(len(sample),n_Vars))).to(device)
        xadv = sample + noise

        # use full indices and check if in int.vars. or defaults
        for i in range(n_Vars):
            if wanted_full_indices[i] in integer_indices:
                xadv[:,i] = sample[:,i]
            else: # non integer, but might have defaults that should be excluded from shift
                defaults = sample[:,i].cpu() == scaled_defaults_per_variable[i]
                if torch.sum(defaults) != 0:
                    xadv[:,i][defaults] = sample[:,i][defaults]

                if restrict_impact > 0:
                    original_back = scalers[i].inverse_transform(sample[:,i])
                    difference_back = scalers[i].inverse_transform(xadv[:,i]) - original_back
                    allowed_perturbation = restrict_impact * np.abs(original_back)
                    high_impact = np.abs(difference_back) > allowed_perturbation
                    if np.sum(high_impact)!=0:
                        scaled_back_max_perturbed = torch.from_numpy(original_back[high_impact]) + torch.from_numpy(allowed_perturbation[high_impact]) * torch.sign(noise[high_impact,i])
                        xadv[high_impact,i] = torch.Tensor(scalers[i].transform(scaled_back_max_perturbed.reshape(-1,1)).flatten())

        return xadv

def fgsm_attack(epsilon=1e-2,sample=None,targets=None,thismodel=None,thiscriterion=None,reduced=True, dev="cpu", filtered_indices=[i for i in range(n_input_features)],restrict_impact=-1):
    if epsilon == 0:
        return sample
    n_Vars = len(filtered_indices)

    wanted_full_indices = get_wanted_full_indices(filtered_indices)
    scalers = all_scalers[wanted_full_indices]
    defaults_per_variable = all_defaults[wanted_full_indices]
    scaled_defaults_per_variable = all_defaults_scaled[wanted_full_indices]

    device = torch.device(dev)

    xadv = sample.clone().detach()

    # inputs need to be included when calculating gradients
    xadv.requires_grad = True

    # from the undisturbed predictions, both the model and the criterion are already available and can be used here again;
    # it's just that they were each part of a function, so not automatically in the global scope
    if thismodel==None and thiscriterion==None:
        global model
        global criterion

    # forward
    preds = thismodel(xadv)

    loss = thiscriterion(preds, targets).mean()

    thismodel.zero_grad()
    loss.backward()

    with torch.no_grad():
        # get sign of gradient
        dx = torch.sign(xadv.grad.detach())

        # add to sample
        xadv += epsilon*dx

        # remove the impact on selected variables (exclude integers, default values)
        # and limit perturbation based on original value
        if reduced:
            for i in range(n_Vars):
                if wanted_full_indices[i] in integer_indices:
                    xadv[:,i] = sample[:,i]
                    #print('integer index:', wanted_full_indices[i])
                else: # non integer, but might have defaults that should be excluded from shift
                    defaults = sample[:,i].cpu() == scaled_defaults_per_variable[i]
                    if torch.sum(defaults) != 0:
                        xadv[:,i][defaults] = sample[:,i][defaults]

                    if restrict_impact > 0:
                        original_back = scalers[i].inverse_transform(sample[:,i])
                        difference_back = scalers[i].inverse_transform(xadv.detach()[:,i]) - original_back
                        allowed_perturbation = restrict_impact * np.abs(original_back)
                        high_impact = np.abs(difference_back) > allowed_perturbation
                        if np.sum(high_impact)!=0:
                            scaled_back_max_perturbed = torch.from_numpy(original_back) + torch.from_numpy(allowed_perturbation) * dx[:,i]
                            xadv[high_impact,i] = torch.Tensor(scalers[i].transform(scaled_back_max_perturbed[high_impact].reshape(-1,1)).flatten())

        return xadv.detach()


def syst_var(epsilon=1e-2,sample=None,reduced=True, dev="cpu", filtered_indices=[i for i in range(n_input_features)],restrict_impact=-1, up=True):
    if epsilon == 0:
        return sample
    n_Vars = len(filtered_indices)

    wanted_full_indices = get_wanted_full_indices(filtered_indices)

    scalers = all_scalers[wanted_full_indices]

    defaults_per_variable = all_defaults[wanted_full_indices]
    scaled_defaults_per_variable = all_defaults_scaled[wanted_full_indices]

    device = torch.device(dev)

    with torch.no_grad():
        # variation in common direction, default is upwards
        systvar = epsilon * torch.Tensor(np.ones((len(sample),n_Vars))).to(device)
        if up == False:
            systvar *= -1.
        # scale by a factor for individual feature
        for i in range(n_Vars):
            systvar[:,i] *= all_factor_epsilons[wanted_full_indices[i]]
        xadv = sample + systvar

        # use full indices and check if in int.vars. or defaults
        for i in range(n_Vars):
            if wanted_full_indices[i] in integer_indices:
                xadv[:,i] = sample[:,i]
            else: # non integer, but might have defaults that should be excluded from shift
                defaults = sample[:,i].cpu() == scaled_defaults_per_variable[i]
                if torch.sum(defaults) != 0:
                    xadv[:,i][defaults] = sample[:,i][defaults]

                if restrict_impact > 0:
                    original_back = scalers[i].inverse_transform(sample[:,i])
                    difference_back = scalers[i].inverse_transform(xadv[:,i]) - original_back
                    allowed_perturbation = restrict_impact * np.abs(original_back)
                    high_impact = np.abs(difference_back) > allowed_perturbation
                    if np.sum(high_impact)!=0:
                        scaled_back_max_perturbed = torch.from_numpy(original_back[high_impact]) + torch.from_numpy(allowed_perturbation[high_impact]) * torch.sign(systvar[high_impact,i])
                        xadv[high_impact,i] = torch.Tensor(scalers[i].transform(scaled_back_max_perturbed.reshape(-1,1)).flatten())

        return xadv