feat: preparations, training, attacks, evaluation and helpers
update README with instructions (tested on RWTH HPC) and links
AnnikaStein committed Jun 26, 2022
1 parent 26818ef commit d39b62d
Showing 18 changed files with 31,233 additions and 1 deletion.
38 changes: 37 additions & 1 deletion README.md
@@ -1,2 +1,38 @@
# Adversarial-Training-for-Jet-Tagging
Code for "Improving robustness of jet tagging algorithms with adversarial training" (arXiv:2203.13890)
Code for:
> <b><a href="https://arxiv.org/abs/2203.13890" target="_blank">Improving robustness of jet tagging algorithms with adversarial training</a></b>
> A. Stein, X. Coubez, S. Mondal, A. Novak, A. Schmidt
> 2022.
<i>Jet Flavor dataset</i>

Obtained from http://mlphysics.ics.uci.edu/ and originally created for
> <b><a href="https://arxiv.org/abs/1607.08633" target="_blank">Jet Flavor Classification in High-Energy Physics with Deep Neural Networks</a></b>
> D. Guest, J. Collado, P. Baldi, S. Hsu, G. Urban, and D. Whiteson
> Physical Review D, 2016.
## Get and prepare dataset
### Download
Log in to a copy18-node of the HPC with high bandwidth (the download is about 2.2 GB)
```
wget http://mlphysics.ics.uci.edu/data/hb_jet_flavor_2016/dataset.json.gz
mkdir -p /hpcwork/<your-account>/jet_flavor_MLPhysics/dataset
mv dataset.json.gz /hpcwork/<your-account>/jet_flavor_MLPhysics/dataset
```
### Extracting the data via awkward arrays
Reading the file is not as straightforward as the ".json" ending suggests: at some point the data has to be unzipped or extracted, and the file does not hold a single JSON document but rather many JSON-like records distributed over the lines of the .json file. Consult the notebook `preparations/read_dataset.ipynb` for further details and possible alternatives for using the dataset. In the end, I settled on awkward arrays, with which the next steps become a bit easier.
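As a minimal sketch (assuming each non-empty line of the file holds one JSON record; the tested logic lives in `preparations/read_dataset.ipynb`), the data could be loaded like this:
```
import gzip
import json
import awkward as ak

# Hypothetical path, matching the download location above.
dataset_path = "/hpcwork/<your-account>/jet_flavor_MLPhysics/dataset/dataset.json.gz"

# The file is not one single JSON document but many JSON-like records,
# so one option is to parse it record by record and let awkward build
# a jagged array from them.
with gzip.open(dataset_path, "rt") as f:
    jets = ak.from_iter(json.loads(line) for line in f if line.strip())

print(len(jets), "jets loaded")
```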
### A first look at the data
Some initial investigations before proceeding to the actual framework are conducted inside `preparations/explore_dataset.ipynb`.
### Calculate defaults
To use custom default values that fit the bulk of each distribution well, preliminary studies are done inside `preparations/defaults.ipynb`. It is also the first notebook that makes use of `helpers/variables.py`.
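For illustration only (the actual per-variable choices are made in the notebook), a default could be placed just outside the bulk of a feature, for example below a low percentile:
```
import numpy as np

def bulk_default(values, unphysical_value=-999.0):
    # Hypothetical helper with a hypothetical sentinel value: put the default
    # one unit below the 1st percentile of the physical values, so it sits in
    # its own bin next to the bulk.
    physical = values[values != unphysical_value]
    return np.floor(np.percentile(physical, 1)) - 1.0
```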
### Clean samples
To avoid storing too many versions of the same data, cleaning the samples is not done as a separate step but happens later during preprocessing (scaling). There, the arrays are also flattened into their final shape, and the result should be a set of usable PyTorch tensors. During cleaning I do not cut on any variables; I only move certain unphysical values into dedicated default bins, so the fractions of jets of a given flavor in a given pt and eta bin do not change in the subsequent cleaning (and preprocessing) step.
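A minimal sketch of this idea (only non-finite entries are treated here; the per-variable definition of "unphysical" is worked out in the notebooks):
```
import numpy as np

def clean_feature(values, default):
    # Keep every jet: move non-finite (unphysical) entries into the dedicated
    # default bin instead of cutting them, so the flavor fractions per pt/eta
    # bin stay unchanged.
    return np.where(np.isfinite(values), values, default)
```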
### Calculate sample weights
Sample weights are calculated in `preparations/reweighting.ipynb`.
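A hedged sketch of one possible reweighting scheme (the actual binning and targets are defined in the notebook; `pt_bin`, `eta_bin` and `flavor` are assumed to be integer indices per jet):
```
import numpy as np

def sample_weights(pt_bin, eta_bin, flavor, n_pt, n_eta, n_flav):
    # Weight each jet by the inverse population of its (pt, eta, flavor) bin,
    # so that the flavors end up with comparable kinematic spectra.
    counts = np.zeros((n_pt, n_eta, n_flav))
    np.add.at(counts, (pt_bin, eta_bin, flavor), 1)
    weights = 1.0 / np.maximum(counts[pt_bin, eta_bin, flavor], 1)
    return weights / weights.mean()
```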
### Preprocessing
Calculate the scalers (from the training set only, ignoring defaults), apply the scalers (do _not_ ignore defaults when applying them; an alternative is to set the defaults to zero), do the train/val/test splitting and shuffling, and build the sample weights and bins. See `preparations/clean_preprocess.ipynb` for a first working example of the entire preprocessing chain. Later, `evaluate/tools.py` can be used to pass information from the preprocessing step to the training and evaluation scripts.
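A minimal sketch of the scaler logic for a single column (assuming scikit-learn's `StandardScaler`; the full chain is in `preparations/clean_preprocess.ipynb`):
```
import numpy as np
from sklearn.preprocessing import StandardScaler

def fit_and_apply_scaler(train_col, col, default):
    # Fit on the training set only, ignoring default entries, ...
    scaler = StandardScaler().fit(train_col[train_col != default].reshape(-1, 1))
    # ... then transform every entry, defaults included (alternatively the
    # defaults could be set to zero after scaling).
    return scaler, scaler.transform(col.reshape(-1, 1)).flatten()
```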
## Run framework (training, evaluation)
### Training
All relevant scripts are placed inside `training`: standalone training on the current node is done with `training.py`, and for submission to the batch system there are `training.sh` and `submit_training.py`. Both nominal and adversarial training are supported.
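For orientation, a hedged sketch of a single adversarial training step using `fgsm_attack` from `attack/disturb_inputs.py` (the model, criterion, optimizer and batch are assumed to exist already; the full logic lives in `training/training.py`):
```
# Assumes attack/ is on the Python path, as in the actual scripts.
from disturb_inputs import fgsm_attack

def adversarial_step(model, criterion, optimizer, inputs, targets, epsilon=1e-2):
    # Build FGSM-perturbed inputs from the current model, then train on them.
    adv_inputs = fgsm_attack(epsilon=epsilon, sample=inputs, targets=targets,
                             thismodel=model, thiscriterion=criterion)
    optimizer.zero_grad()
    loss = criterion(model(adv_inputs), targets).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```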
### Evaluation
ROC curves: `evaluate/eval_roc_new.py`. Training history (loss): `evaluate/plot_loss.py`. Tagger outputs and discriminator shapes: `evaluate/eval_discriminator_shapes.py`. Plotting of input variables: `evaluate/eval_inputs.py`.
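As a simple sketch of the ROC ingredients (the actual plots come from `evaluate/eval_roc_new.py`; `labels` and `scores` are assumed to be the true flavors and the network's b-jet probabilities):
```
from sklearn.metrics import roc_curve, auc

def b_vs_light_roc(labels, scores):
    # labels: 1 for b jets, 0 for light jets; scores: predicted b probability.
    fpr, tpr, _ = roc_curve(labels, scores)
    return fpr, tpr, auc(fpr, tpr)
```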
159 changes: 159 additions & 0 deletions attack/disturb_inputs.py
@@ -0,0 +1,159 @@
import numpy as np
import torch

import sys

sys.path.append("/home/um106329/aisafety/jet_flavor_MLPhysics/helpers/")
from tools import defaults_path, preprocessed_path, get_all_scalers, get_all_defaults
from variables import integer_indices, n_input_features, get_wanted_full_indices, all_factor_epsilons

all_scalers = np.array(get_all_scalers())
all_defaults_scaled = np.array(get_all_defaults(scaled=True))
all_defaults = np.array(get_all_defaults(scaled=False))

def apply_noise(sample, magn=1e-2,offset=[0], dev="cpu", filtered_indices=[i for i in range(n_input_features)],restrict_impact=-1):
    seed = 0
    np.random.seed(seed)

    if magn == 0:
        return sample
    n_Vars = len(filtered_indices)

    wanted_full_indices = get_wanted_full_indices(filtered_indices)

    scalers = all_scalers[wanted_full_indices]

    defaults_per_variable = all_defaults[wanted_full_indices]
    scaled_defaults_per_variable = all_defaults_scaled[wanted_full_indices]

    device = torch.device(dev)

    with torch.no_grad():
        noise = torch.Tensor(np.random.normal(offset,magn,(len(sample),n_Vars))).to(device)
        xadv = sample + noise

        # use full indices and check if in int.vars. or defaults
        for i in range(n_Vars):
            if wanted_full_indices[i] in integer_indices:
                xadv[:,i] = sample[:,i]
            else: # non integer, but might have defaults that should be excluded from shift
                defaults = sample[:,i].cpu() == scaled_defaults_per_variable[i]
                if torch.sum(defaults) != 0:
                    xadv[:,i][defaults] = sample[:,i][defaults]

                if restrict_impact > 0:
                    original_back = scalers[i].inverse_transform(sample[:,i])
                    difference_back = scalers[i].inverse_transform(xadv[:,i]) - original_back
                    allowed_perturbation = restrict_impact * np.abs(original_back)
                    high_impact = np.abs(difference_back) > allowed_perturbation
                    if np.sum(high_impact)!=0:
                        scaled_back_max_perturbed = torch.from_numpy(original_back[high_impact]) + torch.from_numpy(allowed_perturbation[high_impact]) * torch.sign(noise[high_impact,i])
                        xadv[high_impact,i] = torch.Tensor(scalers[i].transform(scaled_back_max_perturbed.reshape(-1,1)).flatten())

        return xadv

def fgsm_attack(epsilon=1e-2,sample=None,targets=None,thismodel=None,thiscriterion=None,reduced=True, dev="cpu", filtered_indices=[i for i in range(n_input_features)],restrict_impact=-1):
    if epsilon == 0:
        return sample
    n_Vars = len(filtered_indices)

    wanted_full_indices = get_wanted_full_indices(filtered_indices)
    scalers = all_scalers[wanted_full_indices]
    defaults_per_variable = all_defaults[wanted_full_indices]
    scaled_defaults_per_variable = all_defaults_scaled[wanted_full_indices]

    device = torch.device(dev)

    xadv = sample.clone().detach()

    # inputs need to be included when calculating gradients
    xadv.requires_grad = True

    # from the undisturbed predictions, both the model and the criterion are already available and can be used here again;
    # it's just that they were each part of a function, so not automatically in the global scope
    if thismodel==None and thiscriterion==None:
        global model
        global criterion

    # forward
    preds = thismodel(xadv)

    loss = thiscriterion(preds, targets).mean()

    thismodel.zero_grad()
    loss.backward()

    with torch.no_grad():
        # get sign of gradient
        dx = torch.sign(xadv.grad.detach())

        # add to sample
        xadv += epsilon*dx

        # remove the impact on selected variables (exclude integers, default values)
        # and limit perturbation based on original value
        if reduced:
            for i in range(n_Vars):
                if wanted_full_indices[i] in integer_indices:
                    xadv[:,i] = sample[:,i]
                    #print('integer index:', wanted_full_indices[i])
                else: # non integer, but might have defaults that should be excluded from shift
                    defaults = sample[:,i].cpu() == scaled_defaults_per_variable[i]
                    if torch.sum(defaults) != 0:
                        xadv[:,i][defaults] = sample[:,i][defaults]

                    if restrict_impact > 0:
                        original_back = scalers[i].inverse_transform(sample[:,i])
                        difference_back = scalers[i].inverse_transform(xadv.detach()[:,i]) - original_back
                        allowed_perturbation = restrict_impact * np.abs(original_back)
                        high_impact = np.abs(difference_back) > allowed_perturbation
                        if np.sum(high_impact)!=0:
                            scaled_back_max_perturbed = torch.from_numpy(original_back) + torch.from_numpy(allowed_perturbation) * dx[:,i]
                            xadv[high_impact,i] = torch.Tensor(scalers[i].transform(scaled_back_max_perturbed[high_impact].reshape(-1,1)).flatten())

        return xadv.detach()


def syst_var(epsilon=1e-2,sample=None,reduced=True, dev="cpu", filtered_indices=[i for i in range(n_input_features)],restrict_impact=-1, up=True):
    if epsilon == 0:
        return sample
    n_Vars = len(filtered_indices)

    wanted_full_indices = get_wanted_full_indices(filtered_indices)

    scalers = all_scalers[wanted_full_indices]

    defaults_per_variable = all_defaults[wanted_full_indices]
    scaled_defaults_per_variable = all_defaults_scaled[wanted_full_indices]

    device = torch.device(dev)

    with torch.no_grad():
        # variation in common direction, default is upwards
        systvar = epsilon * torch.Tensor(np.ones((len(sample),n_Vars))).to(device)
        if up == False:
            systvar *= -1.
        # scale by a factor for individual feature
        for i in range(n_Vars):
            systvar[:,i] *= all_factor_epsilons[wanted_full_indices[i]]
        xadv = sample + systvar

        # use full indices and check if in int.vars. or defaults
        for i in range(n_Vars):
            if wanted_full_indices[i] in integer_indices:
                xadv[:,i] = sample[:,i]
            else: # non integer, but might have defaults that should be excluded from shift
                defaults = sample[:,i].cpu() == scaled_defaults_per_variable[i]
                if torch.sum(defaults) != 0:
                    xadv[:,i][defaults] = sample[:,i][defaults]

                if restrict_impact > 0:
                    original_back = scalers[i].inverse_transform(sample[:,i])
                    difference_back = scalers[i].inverse_transform(xadv[:,i]) - original_back
                    allowed_perturbation = restrict_impact * np.abs(original_back)
                    high_impact = np.abs(difference_back) > allowed_perturbation
                    if np.sum(high_impact)!=0:
                        scaled_back_max_perturbed = torch.from_numpy(original_back[high_impact]) + torch.from_numpy(allowed_perturbation[high_impact]) * torch.sign(systvar[high_impact,i])
                        xadv[high_impact,i] = torch.Tensor(scalers[i].transform(scaled_back_max_perturbed.reshape(-1,1)).flatten())

        return xadv