Training Accuracy much lower than Validation Accuracy #9
Comments
Hi, I tried to compute the F1 score, but using the default (released) hyperparameters, the best F1 score I got with the final model is only about 92. However, when I lower the learning rate to 0.0001, the best F1 score is 93.2. Can you kindly tell me what the correct hyperparameters are to replicate the result reported in the paper? Thank you!
Trying these files instead can probably help get the correct results: As for the hyperparameters, the most important one is the learning rate. For different datasets the lr might be different, and we select the best one from [0.01, 0.001, 0.0001]. 0.001 is the default one and will always work (meaning it achieves at least decent results). On Reddit, 0.0001 might be the best one.
Thank you for the quick response. I tried the files attached, but that did not solve the problem with training accuracy. May I ask which version of TensorFlow you used to run these scripts?
Sorry, I need to take back my words. I checked our log, and the training acc is indeed around 0.65, same as yours, so it is not a problem caused by the TensorFlow version. Your result is correct. It is possible for the training accuracy to be lower than the validation accuracy, because in the current implementation the former is computed using sampling (for efficiency), but the latter is not. The sampling yields an approximation that is consistent but not unbiased, and hence the approximation may be far from the truth. In the paper we propose to use sampling only for learning the model parameters, not for computing predictions, i.e. in the test/validation phase we did not do sampling.
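To make that effect concrete, here is a small self-contained numpy sketch (synthetic data and uniform column sampling, not the repo's importance sampling) showing how accuracy measured through a sampled propagation understates accuracy under the full one:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c, rank = 1000, 16, 4, 32        # nodes, feature dim, classes, sample size

# Toy row-normalized graph and a fixed linear "model".
A = (rng.random((n, n)) < 0.01).astype(float)
A /= A.sum(axis=1, keepdims=True) + 1e-9
X = rng.standard_normal((n, d))
W = rng.standard_normal((d, c))
labels = (A @ X @ W).argmax(axis=1)    # take full-propagation predictions as truth

# Training-style estimate: keep `rank` uniformly sampled columns of A,
# rescaled by n/rank so the estimate of A @ X stays unbiased; the argmax
# (and hence the measured accuracy) is still a noisy, biased statistic.
idx = rng.choice(n, size=rank, replace=False)
preds = ((A[:, idx] * (n / rank)) @ X[idx] @ W).argmax(axis=1)
print("acc under sampled propagation:", (preds == labels).mean())  # well below 1.0
```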
Thank you! With learning rate 0.001 I successfully replicated the F1 reported in the paper. Great job! However, I tried to benchmark FastGCN and GraphSAGE on GPU (a 1080 Ti) and got some unexpected results. Using the default early-stopping criteria: Is this the correct behavior? I see in the paper that you only report wall-clock time on CPU. Did you also measure the GPU performance? Thank you again!
@Tiiiger Sorry for my late reply due to a conference deadline. We did not compare the performance on GPU. All the early experiments (including the hyperparameter tuning) were done on a machine without a GPU, so we also ran the remaining experiments on CPU later. The code and hyperparameters were not optimized for GPU.
Thank you!
@matenure
The result of 'A.dot(X.dot(w))' usually has a lot of zero rows, which means there will be some zero rows in the sampled GCN output for the batch nodes, and those rows carry no information. As a result, in each batch there are some nodes that are never used for training, so the training accuracy has an upper bound. Adjusting the sample sizes rank0 and rank1 will help, but cannot fully solve the problem. You can try it and compute the ratio of nonzero rows using 'np.count_nonzero(res[:,0])/res.shape[0]'. When rank0 >> rank1 >> batch_size, the ratio is close to 1, and then the upper bound on training accuracy is close to 1 too! Hope this helps explain the problem we mentioned~
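A self-contained numpy sketch of that check, with made-up sizes and densities:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, rank1, batch_size = 10000, 8, 100, 256   # illustrative sizes only

# One sampled layer for a batch: adjacency rows for the batch restricted to
# rank1 sampled columns. Rows of `res` that come out all-zero correspond to
# batch nodes that receive no information in this step.
A = (rng.random((batch_size, n)) < 0.001).astype(float)   # ~10 neighbors/node
idx = rng.choice(n, size=rank1, replace=False)
X = rng.standard_normal((n, d))
w = rng.standard_normal((d, 1))
res = A[:, idx] @ (X[idx] @ w)
print("nonzero-row ratio:", np.count_nonzero(res[:, 0]) / res.shape[0])
# With rank1 = 100 out of n = 10000 the ratio is small; raising rank1
# toward n pushes it to 1, matching the rank0 >> rank1 >> batch_size claim.
```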
@Zian-Zhou Thank you for the information. That is indeed a problem. Actually in our final model "train_batch_multiRank_inductive_reddit_Mixlayers_sampleA" you can see that we already solved that problem and only sampled nonzero rows. |
@matenure
"A.dot(X.dot(w))" simulate a result of GCN with sampling. Hope you check it by yourself with Pubmed dataset and Reddit~ |
@Zian-Zhou Is this result from the "appr2layers" model? Then you are right, we have not changed the code there. Actually, it is even more consistent with our theory, but in practice, as you said, avoiding zero rows is better. You can refer to the code in "**sampleA", where we sampled only non-zero rows to avoid this problem.
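A rough illustration of that restriction (a hypothetical helper sketching the idea, not the repo's exact sampleA code):

```python
import numpy as np

def sample_support_nonzero(A_batch, rank, rng=np.random.default_rng(0)):
    """Restrict sampling to columns that actually connect to the batch,
    weighted by their connection mass, so far fewer rows of
    A_batch[:, idx] come out all-zero. Hypothetical sketch only."""
    col_mass = A_batch.sum(axis=0)                 # per-column connection mass
    candidates = np.flatnonzero(col_mass)          # columns with any support
    p = col_mass[candidates] / col_mass[candidates].sum()
    size = min(rank, candidates.size)
    return rng.choice(candidates, size=size, replace=False, p=p)
```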
@matenure
The result of "res = A.dot(X.dot(w))" and "np.count_nonzero(res[:,0])/res.shape[0]" will show something. |
@matenure I built a dataset. Please have a look~
Change rank1 or batch_size and run the code. I think it will show the issue.
@Zian-Zhou I got it. Yes, you are right: there will be zero rows using our sampling method, and I think it is not easy to avoid this problem with batch training (except for batch size 1). But I do not think it should impact the training acc... Anyway, thank you for the discussion. And I suggest using "evaluate" to get the "real" training accuracy.
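A minimal one-layer sketch of the principle behind that suggestion (the repo's evaluate runs the full TF model; this just illustrates the idea on numpy arrays):

```python
import numpy as np

def real_train_accuracy(A_full, X, W, y_onehot, train_mask):
    # Measure training accuracy with the full, unsampled propagation,
    # exactly as validation/test accuracy is measured.
    logits = A_full @ (X @ W)                     # no column sampling here
    preds = logits[train_mask].argmax(axis=1)
    return (preds == y_onehot[train_mask].argmax(axis=1)).mean()
```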
@matenure I completely agree! I had intended to mask the batch nodes corresponding to the zero rows, but it is unnecessary. I will try using evaluate() to check the training accuracy, and I believe it is right. But if I want to run experiments on a larger graph, it may take some time. Thanks for your kind help! What great work!
In running the sample code, I found that the training accuracy is much lower than the validation accuracy, which is different from training GraphSAGE on Reddit in their repo. Is this normal?
For example, the logging I got:
Epoch: 0042 train_loss= 1.72849 train_acc= 0.66406 val_loss= 3.17200 val_acc= 0.90848 time per batch= 0.01104
Epoch: 0043 train_loss= 1.84603 train_acc= 0.59375 val_loss= 3.18259 val_acc= 0.90506 time per batch= 0.01108
Epoch: 0044 train_loss= 1.86952 train_acc= 0.60156 val_loss= 3.17415 val_acc= 0.90324 time per batch= 0.01116
Also, the paper reports the F1 measure. How can I get the F1 score using this codebase?
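For later readers: the paper reports micro-averaged F1, which for single-label data can be computed from argmax predictions with scikit-learn, roughly as in this sketch (the label and logit arrays here are synthetic placeholders):

```python
import numpy as np
from sklearn.metrics import f1_score

# Placeholder labels/logits; in practice these come from the model outputs.
rng = np.random.default_rng(0)
labels_onehot = np.eye(4)[rng.integers(0, 4, size=100)]
logits = rng.standard_normal((100, 4))

y_true = labels_onehot.argmax(axis=1)
y_pred = logits.argmax(axis=1)
print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
```

Note that for single-label classification, micro-F1 coincides with plain accuracy.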