Fixed save_imatrix to match old behaviour for MoE #7099
Conversation
This fix is simple and clear, but unnecessarily doubles the memory overhead...
I think in principle the …
Yeah, it's nice and clear what it's doing atm so it depends if it's worth the effort to fix and add extra complexity. The …
examples/imatrix/imatrix.cpp
// We select top-k experts, the number of calls for the expert tensors will be k times larger.
// NOTE: This will trigger the "if (e.ncall > m_last_call)" save conditional on the first active expert.
// The commented out "if (idx == t->src[0]->ne[0] - 1) ++e.ncall;" doesn't work.
if (((int32_t *) t->op_params)[0] == 0) ++e.ncall;
In the past only one expert was evaluated per mul_mat_id, and op_params was used to store the expert being evaluated, but that's no longer the case. op_params is not used anymore in mul_mat_id, so this condition doesn't really do anything; op_params will always be zero, so the condition is always true.
Ah, I didn't test it but the old if (idx == 0) did work.
What check can be used to detect that the callback is the last one for the MoE?
So in effect, this change does nothing; e.ncall is increased unconditionally as it was before. I think that increasing ncall unconditionally here is the correct thing to do, since the count is later corrected in save_imatrix with your change.
What check can be used to detect that the callback is the last one for the MoE?
Currently there is only one call to mul_mat_id
regardless of the number of experts being used. This was changed in #6505.
So in effect, this change does nothing; e.ncall is increased unconditionally as it was before. I think that increasing ncall unconditionally here is the correct thing to do, since the count is later corrected in save_imatrix with your change.
Yeah, it will still work:

(p.second.values[i] / static_cast<float>(p.second.counts[i])) * static_cast<float>(p.second.ncall)

Basically we divide down to get the actual mean based on how many values were actually added to an element values[i], and then multiply back up to get a value that can be used for the weighted combination of other imatrix files, and so that quantize gets back the original mean when it divides by the ncall stored in the file.

So having the weighted combination scaled up by num-top-k won't affect either of these.
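As a minimal standalone sketch of that rescaling (hypothetical code, not the actual imatrix.cpp implementation; the member names just mirror the snippet above): the stored value is the per-element mean times ncall, so any uniform scaling of ncall cancels out when quantize later divides by ncall again.

#include <cstddef>
#include <vector>

// Hypothetical sketch of the save-time rescaling discussed above.
struct Stats {
    std::vector<float> values;  // running sums of squared activations
    std::vector<int>   counts;  // how many activations were added per element
    int                ncall;   // how many times the tensor was evaluated
};

std::vector<float> to_saved(const Stats & e) {
    std::vector<float> out(e.values.size());
    for (size_t i = 0; i < e.values.size(); ++i) {
        const float mean = e.values[i] / static_cast<float>(e.counts[i]);  // divide down to the true mean
        out[i] = mean * static_cast<float>(e.ncall);                       // scale back up for weighted combining
    }
    return out;
}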
But this will still cause it to trigger the every-10-chunks save too often:
[1]3.4990,[2]2.7563,[3]2.8327,[4]2.8365,
save_imatrix: stored collected data after 10 chunks in wizard-lm-2:8x22b-f16.imatrix
[5]3.2415,[6]3.1667,[7]2.9011,[8]3.2475,[9]3.2100,
save_imatrix: stored collected data after 20 chunks in wizard-lm-2:8x22b-f16.imatrix
[10]3.5357,[11]3.7258,[12]3.6469,[13]3.9192,[14]4.2641,
save_imatrix: stored collected data after 30 chunks in wizard-lm-2:8x22b-f16.imatrix
[15]4.4561,[16]4.7251,[17]4.8591,[18]5.0424,[19]5.1595,
vs
[1]6.8864,[2]5.5590,[3]4.6385,[4]5.2093,[5]5.6050,[6]4.6732,[7]4.7876,[8]5.3775,[9]5.6677,
save_imatrix: stored collected data after 10 chunks in dbrx:16x12b-instruct-f16.imatrix
[10]5.4960,[11]5.8453,[12]6.4653,[13]6.7705,[14]7.1977,[15]7.3001,[16]7.4528,[17]7.6426,[18]7.2825,[19]7.3690,
save_imatrix: stored collected data after 20 chunks in dbrx:16x12b-instruct-f16.imatrix
[20]7.4835,[21]7.8310,[22]7.9035,[23]7.7323,[24]7.6813,[25]7.4121,[26]7.3496,[27]7.3934,[28]7.8041,[29]7.9666,
save_imatrix: stored collected data after 30 chunks in dbrx:16x12b-instruct-f16.imatrix
[30]8.1926,[31]8.3989,[32]8.6105,[33]8.7318,[34]8.8261,[35]8.8406,[36]8.8695,[37]9.0027,[38]9.0287,[39]8.9052,
save_imatrix: stored collected data after 40 chunks in dbrx:16x12b-instruct-f16.imatrix
and with the debug output on in quantize, the reported ncall for the expert tensors will be num-top-k times larger:
load_imatrix: loaded data (size = 172032, ncall = 364) for 'blk.38.ffn_down_exps.weight'
load_imatrix: loaded data (size = 6144, ncall = 91) for 'blk.38.ffn_gate_inp.weight'
load_imatrix: loaded data (size = 6144, ncall = 91) for 'blk.38.attn_output.weight'
load_imatrix: loaded data (size = 172032, ncall = 364) for 'blk.37.ffn_down_exps.weight'
load_imatrix: loaded data (size = 98304, ncall = 364) for 'blk.37.ffn_gate_exps.weight'
load_imatrix: loaded data (size = 6144, ncall = 91) for 'blk.37.ffn_gate_inp.weight'
load_imatrix: loaded data (size = 6144, ncall = 91) for 'blk.37.attn_output.weight'
vs
load_imatrix: loaded data (size = 172032, ncall = 91) for 'blk.38.ffn_down_exps.weight'
load_imatrix: loaded data (size = 6144, ncall = 91) for 'blk.38.ffn_gate_inp.weight'
load_imatrix: loaded data (size = 6144, ncall = 91) for 'blk.38.attn_output.weight'
load_imatrix: loaded data (size = 172032, ncall = 91) for 'blk.37.ffn_down_exps.weight'
load_imatrix: loaded data (size = 98304, ncall = 91) for 'blk.37.ffn_gate_exps.weight'
load_imatrix: loaded data (size = 6144, ncall = 91) for 'blk.37.ffn_gate_inp.weight'
load_imatrix: loaded data (size = 6144, ncall = 91) for 'blk.37.attn_output.weight'
I tested this PR with ncall
increased unconditionally with mixtral and it seems to produce the expected results:
compute_imatrix: computing over 50 chunks with batch_size 512
compute_imatrix: 1.46 seconds per pass - ETA 1.22 minutes
[1]3.3282,[2]5.5064,[3]5.7696,[4]6.0597,[5]6.6383,[6]6.4067,[7]6.0626,[8]6.1729,[9]6.3318,
save_imatrix: stored collected data after 10 chunks in imatrix.dat
[10]5.8754,[11]5.6783,[12]5.8278,[13]5.8804,[14]5.7391,[15]5.9534,[16]5.9483,[17]5.9110,[18]6.0203,[19]5.9764,
save_imatrix: stored collected data after 20 chunks in imatrix.dat
[20]5.9101,[21]5.8586,[22]5.8696,[23]5.9431,[24]5.9631,[25]6.0114,[26]6.0204,[27]5.9588,[28]5.7325,[29]5.7142,
save_imatrix: stored collected data after 30 chunks in imatrix.dat
[30]5.6387,[31]5.5779,[32]5.4650,[33]5.4179,[34]5.3390,[35]5.2645,[36]5.2147,[37]5.1724,[38]5.1585,[39]5.1434,
save_imatrix: stored collected data after 40 chunks in imatrix.dat
[40]5.1864,[41]5.1752,[42]5.1467,[43]5.0827,[44]5.0719,[45]5.0194,[46]5.0461,[47]5.0968,[48]5.1533,[49]5.1977,
save_imatrix: stored collected data after 50 chunks in imatrix.dat
[50]5.1661,
Final estimate: PPL = 5.1661 +/- 0.10175
save_imatrix: stored collected data after 50 chunks in imatrix.dat
load_imatrix: loaded data (size = 114688, ncall = 50) for 'blk.31.ffn_down_exps.weight'
load_imatrix: loaded data (size = 4096, ncall = 50) for 'blk.31.attn_k.weight'
load_imatrix: loaded data (size = 114688, ncall = 50) for 'blk.30.ffn_down_exps.weight'
load_imatrix: loaded data (size = 32768, ncall = 50) for 'blk.30.ffn_up_exps.weight'
load_imatrix: loaded data (size = 4096, ncall = 50) for 'blk.30.ffn_gate_inp.weight'
load_imatrix: loaded data (size = 4096, ncall = 50) for 'blk.30.attn_v.weight'
load_imatrix: loaded data (size = 4096, ncall = 50) for 'blk.30.attn_k.weight'
load_imatrix: loaded data (size = 4096, ncall = 50) for 'blk.30.attn_q.weight'
load_imatrix: loaded data (size = 4096, ncall = 50) for 'blk.29.ffn_gate_inp.weight'
load_imatrix: loaded data (size = 32768, ncall = 50) for 'blk.31.ffn_up_exps.weight'
load_imatrix: loaded data (size = 4096, ncall = 50) for 'blk.29.attn_v.weight'
load_imatrix: loaded data (size = 4096, ncall = 50) for 'blk.29.attn_k.weight'
load_imatrix: loaded data (size = 4096, ncall = 50) for 'blk.29.attn_q.weight'
load_imatrix: loaded data (size = 114688, ncall = 50) for 'blk.28.ffn_down_exps.weight'
load_imatrix: loaded data (size = 4096, ncall = 50) for 'blk.28.attn_v.weight'
load_imatrix: loaded data (size = 4096, ncall = 50) for 'blk.28.attn_q.weight'
load_imatrix: loaded data (size = 4096, ncall = 50) for 'blk.31.attn_q.weight'
load_imatrix: loaded data (size = 114688, ncall = 50) for 'blk.27.ffn_down_exps.weight'
load_imatrix: loaded data (size = 32768, ncall = 50) for 'blk.27.ffn_gate_exps.weight'
load_imatrix: loaded data (size = 32768, ncall = 50) for 'blk.27.ffn_up_exps.weight'
load_imatrix: loaded data (size = 4096, ncall = 50) for 'blk.27.ffn_gate_inp.weight'
load_imatrix: loaded data (size = 4096, ncall = 50) for 'blk.27.attn_v.weight'
load_imatrix: loaded data (size = 4096, ncall = 50) for 'blk.27.attn_k.weight'
load_imatrix: loaded data (size = 4096, ncall = 50) for 'blk.27.attn_output.weight'
So if I understand correctly, you based these changes on a build before #6505 was merged, and the results that show a higher number of ncall for the MoE tensors are from a build without #6505, correct?
Yeah, I used the build right after the dbrx PR was pushed, as I originally went on this search after having lots of trouble quantizing it.
I tested this PR with ncall increased unconditionally with mixtral and it seems to produce the expected results:
Yeah, it looks like that can just be left unconditional then.
It's probably worth trying to re-quantize mixtral with and without these fixes too, just in case something else has changed since then.
It's probably worth trying to re-quantize mixtral with and without these fixes too, just in case something else has changed since then.
I will give it a try with a low number of chunks, but I don't have enough VRAM to create an imatrix for mixtral with the full wiki.train.raw in a reasonable amount of time.
Co-authored-by: slaren <slarengh@gmail.com>
Yeah, I can confirm this also seems to help. I won't post the full output, but the test question (with …), as I have found in the past, is very telling of broken and/or dumb models...

USER: Can you write me a C++ program to calculate logistic regression using GSL? Write a short driver in main to test it with hard coded values.

ASSISTANT: Certainly! Below is a simple C++ program that uses the GNU Scientific Library (GSL) to perform logistic regression on a set of hardcoded data points. This example assumes you have GSL installed on your system and that you can link against it when compiling the program.

#include <iostream>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_vector.h>
#include <gsl/gsl_multifit_nlinear.h>
// The logistic function
double logistic_function(double x, const gsl_vector *beta) {
double z = gsl_vector_get(beta, 0) + gsl_vector_get(beta, 1) * x;
return 1.0 / (1.0 + exp(-z));
}
// The vectorized logistic function for a set of samples
void logistic_f(const gsl_vector *x, void *params, gsl_vector *f) {
size_t n = f->size;
double *y = (double *)params;
for (size_t i = 0; i < n; ++i) {
double fi = logistic_function(gsl_vector_get(x, i), params);
gsl_vector_set(f, i, fi - y[i]);
}
}
// The Jacobian of the logistic function
void logistic_df(const gsl_vector *x, void *params, gsl_matrix *J) {
size_t n = J->size1;
double *y = (double *)params;
for (size_t i = 0; i < n; ++i) {
double fi = logistic_function(gsl_vector_get(x, i), params);
gsl_matrix_set(J, i, 0, fi * (1.0 - fi));
gsl_matrix_set(J, i, 1, x->data[i] * fi * (1.0 - fi));
}
}
int main() {
// Hardcoded data points (features and labels)
double x[] = {1.0, 2.0, 3.0, 4.0}; // Features
double y[] = {0.1, 0.2, 0.6, 0.9}; // Labels (probabilities)
const size_t n = sizeof(x) / sizeof(x[0]); // Number of data points
const size_t p = 2; // Number of parameters (intercept + slope)
// Allocate workspace and vectors/matrices for GSL
gsl_multifit_nlinear_workspace *w;
gsl_vector *f;
gsl_matrix *J;
gsl_vector *beta;
// Initialize the non-linear least squares solver
gsl_multifit_nlinear_fdf fdf;
fdf.f = logistic_f;
fdf.df = logistic_df;
fdf.fpp = NULL; // We do not provide a second derivative function
fdf.params = y;
gsl_multifit_nlinear_parameters params = gsl_multifit_nlinear_default_parameters();
w = gsl_multifit_nlinear_alloc(gsl_multifit_nlinear_trust, n, p);
gsl_multifit_nlinear_set(w, &fdf, beta, &params);
// Initialize the starting values for the parameters
beta = gsl_vector_alloc(p);
gsl_vector_set(beta, 0, 0.0); // Intercept
gsl_vector_set(beta, 1, 0.0); // Slope
// Perform the non-linear least squares fit
int status;
size_t iter = 0;
do {
iter++;
status = gsl_multifit_nlinear_iterate(w);
if (status) {
std::cerr << "Error: " << gsl_strerror(status) << std::endl;
break;
}
status = gsl_multifit_nlinear_test_delta(w, 1e-4, 1e-4);
} while (status == GSL_CONTINUE && iter < 100);
// Retrieve the optimized parameters
gsl_vector_memcpy(beta, w->x);
std::cout << "Status: " << gsl_strerror(status) << std::endl;
std::cout << "Iterations: " << iter << std::endl;
std::cout << "Beta parameters:" << std::endl;
gsl_vector_fprintf(stdout, beta, "%g");
// Clean up
gsl_multifit_nlinear_free(w);
gsl_vector_free(beta);
return 0;
}

To compile this program, you need to link against the GSL library. If you have GSL installed in a standard location, you can compile the program with the following command:

g++ -o logistic_regression logistic_regression.cpp -lgsl -lgslcblas -lm

If GSL is not installed in a standard location, you may need to specify include directories and library paths using …

After compiling, you can run the program with:

./logistic_regression

This program will output the optimized beta parameters for the logistic regression model fitted to the hardcoded data points. The …

USER: Can you write me a C++ program to calculate logistic regression using GSL? Write a short driver in main to test it with hard coded values.

ASSISTANT: Certainly! Below is a simple C++ program that uses the GNU Scientific Library (GSL) to perform logistic regression on a set of hardcoded data points. This example assumes you have GSL installed on your system and that you can link against it when compiling the program.

#include <iostream>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_vector.h>
#include <gsl/gsl_multifit_nlinear.h>
// The logistic function
double logistic_function(double z) {
return 1.0 / (1.0 + exp(-z));
}
// The derivative of the logistic function
double dlogistic_function(double y) {
return y * (1 - y);
}
// The function to compute the residual
int logistic_f(const gsl_vector *x, void *data, gsl_vector *f) {
size_t n = ((struct data *)data)->n;
double *y = ((struct data *)data)->y;
double *X = ((struct data *)data)->X;
double sum;
for (size_t i = 0; i < n; ++i) {
sum = gsl_vector_get(x, 0); // intercept
for (size_t j = 1; j < x->size; ++j) {
sum += gsl_vector_get(x, j) * X[i * (x->size - 1) + j - 1]; // features
}
gsl_vector_set(f, i, y[i] - logistic_function(sum));
}
return GSL_SUCCESS;
}
// The function to compute the Jacobian of the residual
int logistic_df(const gsl_vector *x, void *data, gsl_matrix *J) {
size_t n = ((struct data *)data)->n;
double *X = ((struct data *)data)->X;
for (size_t i = 0; i < n; ++i) {
double y_pred = logistic_function(gsl_vector_get(x, 0)); // intercept
for (size_t j = 1; j < x->size; ++j) {
y_pred += gsl_vector_get(x, j) * X[i * (x->size - 1) + j - 1]; // features
}
double dy_pred = dlogistic_function(y_pred);
gsl_matrix_set(J, i, 0, dy_pred);
for (size_t j = 1; j < x->size; ++j) {
gsl_matrix_set(J, i, j, dy_pred * X[i * (x->size - 1) + j - 1]);
}
}
return GSL_SUCCESS;
}
// The function to compute the residual and the Jacobian
int logistic_fd(const gsl_vector *x, void *data, gsl_vector *f, gsl_matrix *J) {
logistic_f(x, data, f);
logistic_df(x, data, J);
return GSL_SUCCESS;
}
int main() {
// Hardcoded dataset: 3 samples with 2 features each (including the bias term)
const size_t n = 3; // number of samples
const size_t p = 3; // number of parameters (including intercept)
double X_data[] = {1.0, 0.5, -1.0, 1.0, 1.0, 2.0}; // feature matrix with bias term
double y_data[] = {0.0, 1.0, 1.0}; // target values
// Allocate memory for the dataset
gsl_matrix *X = gsl_matrix_alloc(n, p - 1);
gsl_vector *y = gsl_vector_alloc(n);
struct data d = {n, y_data, X_data};
// Copy the hardcoded data into GSL structures
gsl_matrix_set_all(X, 1.0); // Set the bias term (intercept) to 1
for (size_t i = 0; i < n; ++i) {
for (size_t j = 0; j < p - 1; ++j) {
gsl_matrix_set(X, i, j, X_data[i * (p - 1) + j]);
}
gsl_vector_set(y, i, y_data[i]);
}
// Initial guess for the parameters (including intercept)
gsl_vector *x = gsl_vector_alloc(p);
gsl_vector_set(x, 0, 0.0); // intercept
gsl_vector_set(x, 1, 0.0); // coefficient for feature 1
gsl_vector_set(x, 2, 0.0); // coefficient for feature 2
// Allocate space for the covariance matrix
gsl_matrix *covar = gsl_matrix_alloc(p, p);
// Perform the non-linear least squares fit
gsl_multifit_nlinear_fdf fdf;
gsl_multifit_nlinear_parameters params = gsl_multifit_nlinear_default_parameters();
fdf.f = &logistic_f;
fdf.df = &logistic_df;
fdf.fdf = &logistic_fd;
fdf.n = n;
fdf.p = p;
fdf.params = &d;
gsl_multifit_nlinear_workspace *w = gsl_multifit_nlinear_alloc(n, p);
gsl_multifit_nlinear_init(&fdf, x, w);
int status;
size_t iter = 0;
do {
iter++;
status = gsl_multifit_nlinear_iterate(w);
// Check if the iteration has converged
if (status) break;
status = gsl_multifit_nlinear_test_delta(w, 1e-4, 1e-4);
} while (status == GSL_CONTINUE && iter < 100);
// Retrieve the optimized parameters
gsl_vector_memcpy(x, w->x);
// Compute the covariance matrix
gsl_multifit_nlinear_covar(w, covar);
// Output the results
std::cout << "Best fit parameters:\n";
for (size_t i = 0; i < p; ++i) {
std::cout << "x[" << i << "] = " << gsl_vector_get(x, i) << "\n";
}
std::cout << "Covariance matrix:\n";
gsl_matrix_fprintf(stdout, covar, "%g");
// Clean up
gsl_multifit_nlinear_free(w);
gsl_matrix_free(X);
gsl_vector_free(y);
gsl_vector_free(x);
gsl_matrix_free(covar);
return 0;
}

To compile this program, you need to link against the GSL library. If you have GSL installed in a standard location, you can compile the program with the following command:

g++ -o logistic_regression logistic_regression.cpp -lgsl -lgslcblas -lm

If GSL is not installed in a standard location, you may need to specify the include and library paths using …

Please note that this example uses non-linear least squares fitting provided by GSL's …

So even though I'm going to try this on …
I tested mixtral in master and in this PR with a 200 chunk imatrix (~100k tokens) with IQ3_S with these commands:

./imatrix -f wikitext-2-raw/wiki.train.raw -m models/mixtral/ggml-model-q8_0.gguf -ngl 12 --chunks 200
./quantize --allow-requantize --imatrix imatrix.dat models/mixtral/ggml-model-q8_0.gguf iq3_s
./perplexity -f wikitext-2-raw/wiki.test.raw -m models/mixtral/ggml-model-IQ3_S.gguf -ngl 99

And got these results:

So there is a slight reduction, but within the margin of error. Maybe the imatrix is too small, or IQ3_S is too big for it to be noticeable.
Used the same imatrix to quantize and test IQ1_S:

More significant difference with IQ1_S.
@ggerganov please take a look.
It could just be that … I did notice that the models that seemed the worst affected had the highest reported PPL whilst running …

That seemed very odd to me as both …
My results are from mixtral 8x7b.
It might be interesting to have printed the counts for each expert: it could simply be that some instruct models, or base models, have a more uneven softmax gating distribution than others?
I imagine that's a possibility. Intuitively, even with an even distribution, an MoE model may require as many as the-number-of-experts times more samples to achieve the same coverage (e.g. with 8 experts routing each token to 2 of them, each expert only sees roughly a quarter of the tokens), so using only 100k tokens in my tests may also be a factor. In any case, the difference in IQ1_S seems significant to me.
Yeah, it might be interesting if @dranger003 tries again with the …
@jukofyork Do we have the updated pre-tokenizer convert script somewhere for dbrx, or did you use an old version to convert the weights?
I just used the older version from mid-April. I did get the new Python scripts working for …
Ah yes, I see you have an open issue. Happy to submit a PR but I'm not sure about the regex splits. Wouldn't having proper pre-tokenization affect perplexity in testing?
This makes sense. @dranger003 Let me push DBRX support on …
Nvm, there is some issue with DBRX using Tiktoken and it fails to convert - it will probably take some time to resolve. I will do some tests with Mixtral instead.
Yeah, it would be interesting to see what causes such a huge difference in perplexity between …

I'm not sure if the …

The …
Hold off pushing this - I've found a bug when combining! Will fix it now.
- Fixed segfault bug because the counts vector needed to be created.
- Fixed pre-existing bug that didn't actually add to the counts for the "--combine" option.
Fixed it, but also found there was a pre-existing bug that meant the "--combine" option didn't actually combine anything and just overwrote:
So only the last imatrix in the list to be combined was getting used. Will test now with some extra debugging output to double check the weighted sum is as expected.
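For reference, a minimal standalone sketch of what combining should do (hypothetical names, not the actual imatrix.cpp code): each loaded file's sums, counts and ncall are accumulated into the existing entry for that tensor name, rather than replacing it.

#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of the combine fix: add, don't overwrite.
struct Entry {
    std::vector<float> values;
    std::vector<int>   counts;
    int                ncall = 0;
};

void combine_into(std::unordered_map<std::string, Entry> & dst,
                  const std::string & name, const Entry & src) {
    Entry & e = dst[name];
    if (e.values.empty()) {
        e.values.assign(src.values.size(), 0.0f);
        e.counts.assign(src.counts.size(), 0);
    }
    for (size_t i = 0; i < src.values.size(); ++i) {
        e.values[i] += src.values[i];  // accumulate the running sums
        e.counts[i] += src.counts[i];  // and the per-element counts
    }
    e.ncall += src.ncall;              // keeps the weighted sum consistent
}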
I guess combining imatrices never worked. Isn't there a risk of loss of precision after enough steps if data keeps being accumulated into a float?
Confirmed working - printed from inside imatrix::save_imatrix() loop:
and:
So if this file were recombined, the weighted sum would be correct, and when it gets loaded inside quantize::load_imatrix() it will be divided correctly:

to get the mean of the combined values:
Yeah, there is some risk when summing squared values, but this was just an outright bug. The biggest risk is when you start taking the difference of squared values (i.e. catastrophic cancellation) or if the magnitudes of the squares are vastly different (see here for a good example), but all the norm layers in these LLMs, used to negate the "exploding gradient problem", likely avoid those sorts of problems, and just looking at the weighted sum of 2 vs 10 files, any numerical errors look like they will be pretty insignificant.
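A small standalone illustration of the two failure modes being referred to (made-up numbers, nothing to do with the imatrix code itself): tiny increments get absorbed once a float running sum is large, and the difference of two nearly equal squares cancels most of the significant digits.

#include <cstdio>

int main() {
    // 1) Absorption: each 1.0f is below float's resolution once the sum reaches 1e8,
    //    so the float total never moves while the double total gains 1000.
    float  sum_f = 1.0e8f;
    double sum_d = 1.0e8;
    for (int i = 0; i < 1000; ++i) {
        sum_f += 1.0f;
        sum_d += 1.0;
    }
    std::printf("float: %.1f  double: %.1f\n", sum_f, sum_d);

    // 2) Catastrophic cancellation: the float result is noticeably off
    //    from the true value of roughly 20.
    float a = 10000.001f;
    float b = 10000.0f;
    std::printf("a*a - b*b = %g\n", a * a - b * b);
    return 0;
}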
Using base Mixtral 8x7b and IQ3_S quantization - perplexity improves slightly using the new imatrix. Think this is good to merge - nice work!
Well, spending all night and using a much bigger sample of data for the imatrix seems to pay off:

#include <iostream>
#include <vector>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_vector.h>
#include <gsl/gsl_blas.h>
#include <gsl/gsl_multifit_nlinear.h>
// The logistic function
double logistic_function(double z) {
return 1.0 / (1.0 + exp(-z));
}
// The derivative of the logistic function
double logistic_derivative(double y) {
return y * (1.0 - y);
}
// The function to compute the residual
void logistic_f(const gsl_vector *x, void *params, gsl_vector *f) {
size_t n = ((struct data *)params)->n;
double *y = ((struct data *)params)->y;
double *X = ((struct data *)params)->X;
gsl_vector_view Xi(const_cast<double *>(X) + n * i, n);
double yi = gsl_vector_get(x, i);
double y_pred = logistic_function(gsl_vector_dot_product(&Xi.vector, x));
gsl_vector_set(f, i, yi - y_pred);
}
// The function to compute the Jacobian
void logistic_df(const gsl_vector *x, const gsl_vector *f, gsl_matrix *J, void *params) {
size_t n = ((struct data *)params)->n;
double *y = ((struct data *)params)->y;
double *X = ((struct data *)params)->X;
for (size_t i = 0; i < n; ++i) {
gsl_vector_view Xi(const_cast<double *>(X) + n * i, n);
double yi = gsl_vector_get(x, i);
double y_pred = logistic_function(gsl_vector_dot_product(&Xi.vector, x));
double dy_pred = logistic_derivative(y_pred);
for (size_t j = 0; j < n; ++j) {
gsl_matrix_set(J, i, j, -dy_pred * gsl_vector_get(&Xi.vector, j));
}
gsl_matrix_set(J, i, i, gsl_matrix_get(J, i, i) + 1.0);
}
}
// The logistic regression solver
int logistic_regression(const std::vector<std::vector<double>> &X, const std::vector<double> &y, gsl_vector *beta) {
size_t n = X.size();
size_t p = X[0].size();
// Convert data to GSL format
gsl_matrix *gslX = gsl_matrix_alloc(n, p);
gsl_vector *gslY = gsl_vector_alloc(n);
for (size_t i = 0; i < n; ++i) {
for (size_t j = 0; j < p; ++j) {
gsl_matrix_set(gslX, i, j, X[i][j]);
}
gsl_vector_set(gslY, i, y[i]);
}
// Initialize the solver parameters
const gsl_multifit_nlinear_type *T = gsl_multifit_nlinear_trust;
gsl_multifit_nlinear_workspace *w = gsl_multifit_nlinear_alloc(T, n, p);
gsl_multifit_nlinear_fdf fdf;
fdf.f = &logistic_f;
fdf.df = &logistic_df;
fdf.fvv = nullptr;
fdf.n = n;
fdf.p = p;
fdf.params = &data;
// Initialize with starting values
for (size_t i = 0; i < p; ++i) {
gsl_vector_set(beta, i, 0.01);
}
// Solve the system
int status;
size_t iter = 0;
const size_t max_iter = 100;
double chisq;
status = gsl_multifit_nlinear_solve(
&fdf,
beta,
gslX,
gslY,
w,
&chisq,
&iter,
nullptr,
nullptr
);
// Clean up
gsl_multifit_nlinear_free(w);
gsl_matrix_free(gslX);
gsl_vector_free(gslY);
return status;
}
int main() {
// Hardcoded data for testing
std::vector<std::vector<double>> X = {
{1, 0.1, 0.2},
{1, 0.2, 0.3},
{1, 0.3, 0.5},
{1, 0.4, 0.7},
{1, 0.5, 0.9},
{1, 0.6, 1.1},
{1, 0.7, 1.3},
{1, 0.8, 1.5},
{1, 0.9, 1.7}
};
std::vector<double> y = {0, 0, 0, 1, 1, 1, 1, 1, 1};
// Allocate space for the parameters
gsl_vector *beta = gsl_vector_alloc(X[0].size());
// Perform logistic regression
int status = logistic_regression(X, y, beta);
if (status) {
std::cerr << "Error: " << gsl_strerror(status) << std::endl;
} else {
// Output the fitted parameters
std::cout << "Fitted parameters:" << std::endl;
for (size_t i = 0; i < beta->size; ++i) {
std::cout << "beta[" << i << "] = " << gsl_vector_get(beta, i) << std::endl;
}
}
// Clean up
gsl_vector_free(beta);
return 0;
}

I would say this is subjectively a better attempt than that which used the imatrix file generated from just ~100 chunks of …

I can't share the dataset as it has pretty much every random bit of code and programming related material I could find all stuffed in it, but if anybody is interested then as a test I actually got …

I also ran using 512, 1024, 2048, 4096 and 8192 contexts (evenly weighted) as I have a feeling the extreme "lazy-GPTness" that showed up with this model and with …

BTW: This is the post that made me take much more interest in …
I actually have another idea on how we could potentially improve perplexity of these low-bpw MoE models and did some experiments on it here:

I originally tried to scale the … But then lost interest because it was clear at best all I was going to do was make a slightly worse model that ran slower...

Then it occurred to me that these low-bpw quants of MoE models have such a big jump in perplexity that it might actually be a useful method to help smooth out some of the noise: basically add in more (or even all) experts, then try to sharpen the gating weight softmax to the value that minimizes the PPL.

@slaren's example above shows the jump between 3-bpw and 1-bpw is massive, so there is a good chance this could improve PPL more than it hurts it.

The biggest question is how the categorical distribution will look at the optimal scale factor: if it just looks very like the top-k distribution then it won't do anything useful, but if it actually has significant contributions coming from the non-top-k bins then it might be worthwhile!

My biggest problem is that to optimize this I currently have to use Mergekit to do the scaling, then convert to fp16 GGUF, then quantize... Over and over :/
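A minimal standalone sketch of the scaling idea (hypothetical; the logit values and scale factors are made up): multiply the router logits by a scale factor before the softmax, so the distribution over all experts can be sharpened (scale > 1) or flattened (scale < 1) instead of applying a hard top-k cut.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical sketch: softmax(scale * logits) over all experts.
std::vector<float> scaled_gating(const std::vector<float> & logits, float scale) {
    const float max_l = *std::max_element(logits.begin(), logits.end());  // for numerical stability
    std::vector<float> w(logits.size());
    float sum = 0.0f;
    for (size_t i = 0; i < logits.size(); ++i) {
        w[i] = std::exp(scale * (logits[i] - max_l));
        sum += w[i];
    }
    for (float & x : w) x /= sum;  // normalize so the weights sum to 1
    return w;
}

int main() {
    const std::vector<float> logits = {2.0f, 1.5f, 0.2f, -1.0f};  // made-up router outputs for 4 experts
    for (float scale : {0.5f, 1.0f, 4.0f}) {
        std::printf("scale %.1f:", scale);
        for (float w : scaled_gating(logits, scale)) std::printf(" %.3f", w);
        std::printf("\n");  // larger scale concentrates mass on the top experts
    }
    return 0;
}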
Checks out, I came across this (same ppl after merge), but totally forgot to report or look into this, thanks :)
This fix is simple and clear, but unnecessarily doubles the memory overhead...

It can be used as a test to refactor collect_imatrix() and/or make struct Stats only use O(d + n_experts) memory.

Also, the commented out if (idx == t->src[0]->ne[0] - 1) ++e.ncall code didn't work as intended (ie: increment on the last top-k expert callback) and I had to use if (idx == 0) instead. This will have the possibly unwanted effect of triggering the if (e.ncall > m_last_call) conditional below to save the imatrix on the first top-k expert rather than the last.
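One possible reading of the O(d + n_experts) idea (purely a hypothetical sketch, not what this PR implements), assuming every element of a given expert's slice is updated together within one call: a per-expert counter could then replace the second d-sized counts vector.

#include <vector>

// Hypothetical sketch: d floats for the sums plus n_experts counters,
// instead of doubling memory with a d-sized counts vector.
struct StatsCompact {
    std::vector<float> values;        // d accumulated squared activations
    std::vector<int>   expert_calls;  // n_experts counters, one per expert slice
    int                ncall = 0;     // total number of evaluations of the tensor
};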