
Fixed save_imatrix to match old behaviour for MoE #7099

Merged
merged 6 commits into ggerganov:master on May 8, 2024

Conversation

@jukofyork (Contributor) commented May 6, 2024

This fix is simple and clear, but unnecessarily doubles the memory overhead...

It can be used as a test to refactor collect_imatrix() and/or make struct Stats only use O(d + n_experts) memory.

Also, the commented-out if (idx == t->src[0]->ne[0] - 1) ++e.ncall; code didn't work as intended (i.e., increment on the last top-k expert callback), so I had to use if (idx == 0) instead. This has the possibly unwanted effect of triggering the if (e.ncall > m_last_call) conditional below to save the imatrix on the first top-k expert rather than the last.

This fix is simple and clear, but unnecessarily doubles the memory overhead..
@slaren (Collaborator) commented May 6, 2024

I think in principle the counts vector only needs to be as big as the number of experts n_as, but is this really enough memory overhead to worry about? These vectors should be very small in comparison with the size of the model.

@jukofyork (Contributor Author)

I think in principle the counts vector only needs to be as big as the number of experts n_as, but is this really enough memory overhead to worry about? These vectors should be very small in comparison with the size of the model.

Yeah, it's nice and clear what it's doing atm, so it depends on whether it's worth the effort and extra complexity to fix.

The if (((int32_t *) t->op_params)[0] == 0) ++e.ncall bit really needs looking at, as I don't really understand what it's doing, nor could I figure out how to get the number of top-k experts to compare it against.

// We select top-k experts, the number of calls for the expert tensors will be k times larger.
// NOTE: This will trigger the "if (e.ncall > m_last_call)" save conditional on the first active expert.
// The commented out "if (idx == t->src[0]->ne[0] - 1) ++e.ncall;" doesn't work.
if (((int32_t *) t->op_params)[0] == 0) ++e.ncall;
Collaborator

In the past only one expert was evaluated per mul_mat_id, and op_params was used to store the expert being evaluated, but that's no longer the case. op_params is not used anymore in mul_mat_id, so this condition doesn't really do anything: op_params will always be zero, so it's always true.

@jukofyork (Contributor Author) May 6, 2024

Ah, I didn't test it but the old if (idx == 0) did work.

What check can be done to tell that the callback is the last one for the MoE?

Collaborator

So in effect, this change does nothing, e.ncall is increased unconditionally as it was before. I think that increasing ncall unconditionally here is the correct thing to do, since the count is later corrected in save_imatrix with your change.

@slaren (Collaborator) May 6, 2024

What test can be done to test for the callback being the last for the MoE?

Currently there is only one call to mul_mat_id regardless of the number of experts being used. This was changed in #6505.

Contributor Author

So in effect, this change does nothing, e.ncall is increased unconditionally as it was before. I think that increasing ncall unconditionally here is the correct thing to do, since the count is later corrected in save_imatrix with your change.

Yeah, it will still work:

(p.second.values[i] / static_cast<float>(p.second.counts[i])) * static_cast<float>(p.second.ncall)

Basically, we divide down to get the actual mean, based on how many values were actually added to an element values[i], and then multiply back up to get a value that can be used both for the weighted combination with other imatrix files and for quantize to recover the original mean when it divides by the ncall stored in the file.

So having the weighted combination scaled up by num-top-k won't affect either of these.

But this will still cause it to hit the 10-chunk save point too often:

[1]3.4990,[2]2.7563,[3]2.8327,[4]2.8365,
save_imatrix: stored collected data after 10 chunks in wizard-lm-2:8x22b-f16.imatrix
[5]3.2415,[6]3.1667,[7]2.9011,[8]3.2475,[9]3.2100,
save_imatrix: stored collected data after 20 chunks in wizard-lm-2:8x22b-f16.imatrix
[10]3.5357,[11]3.7258,[12]3.6469,[13]3.9192,[14]4.2641,
save_imatrix: stored collected data after 30 chunks in wizard-lm-2:8x22b-f16.imatrix
[15]4.4561,[16]4.7251,[17]4.8591,[18]5.0424,[19]5.1595,

vs

[1]6.8864,[2]5.5590,[3]4.6385,[4]5.2093,[5]5.6050,[6]4.6732,[7]4.7876,[8]5.3775,[9]5.6677,
save_imatrix: stored collected data after 10 chunks in dbrx:16x12b-instruct-f16.imatrix
[10]5.4960,[11]5.8453,[12]6.4653,[13]6.7705,[14]7.1977,[15]7.3001,[16]7.4528,[17]7.6426,[18]7.2825,[19]7.3690,
save_imatrix: stored collected data after 20 chunks in dbrx:16x12b-instruct-f16.imatrix
[20]7.4835,[21]7.8310,[22]7.9035,[23]7.7323,[24]7.6813,[25]7.4121,[26]7.3496,[27]7.3934,[28]7.8041,[29]7.9666,
save_imatrix: stored collected data after 30 chunks in dbrx:16x12b-instruct-f16.imatrix
[30]8.1926,[31]8.3989,[32]8.6105,[33]8.7318,[34]8.8261,[35]8.8406,[36]8.8695,[37]9.0027,[38]9.0287,[39]8.9052,
save_imatrix: stored collected data after 40 chunks in dbrx:16x12b-instruct-f16.imatrix

and with the debug output on in quantize, will print num-top-k times more for each ncall for the experts:

load_imatrix: loaded data (size = 172032, ncall =    364) for 'blk.38.ffn_down_exps.weight'
load_imatrix: loaded data (size =   6144, ncall =     91) for 'blk.38.ffn_gate_inp.weight'
load_imatrix: loaded data (size =   6144, ncall =     91) for 'blk.38.attn_output.weight'
load_imatrix: loaded data (size = 172032, ncall =    364) for 'blk.37.ffn_down_exps.weight'
load_imatrix: loaded data (size =  98304, ncall =    364) for 'blk.37.ffn_gate_exps.weight'
load_imatrix: loaded data (size =   6144, ncall =     91) for 'blk.37.ffn_gate_inp.weight'
load_imatrix: loaded data (size =   6144, ncall =     91) for 'blk.37.attn_output.weight'

vs

load_imatrix: loaded data (size = 172032, ncall =     91) for 'blk.38.ffn_down_exps.weight'
load_imatrix: loaded data (size =   6144, ncall =     91) for 'blk.38.ffn_gate_inp.weight'
load_imatrix: loaded data (size =   6144, ncall =     91) for 'blk.38.attn_output.weight'
load_imatrix: loaded data (size = 172032, ncall =     91) for 'blk.37.ffn_down_exps.weight'
load_imatrix: loaded data (size =  98304, ncall =     91) for 'blk.37.ffn_gate_exps.weight'
load_imatrix: loaded data (size =   6144, ncall =     91) for 'blk.37.ffn_gate_inp.weight'
load_imatrix: loaded data (size =   6144, ncall =     91) for 'blk.37.attn_output.weight'

Collaborator

So if I understand correctly, you based these changes on a build before #6505 was merged, and the results that show a higher number of ncall for the moe tensors is with a build without #6505, correct?

@slaren (Collaborator) May 6, 2024

I tested this PR with ncall increased unconditionally with mixtral and it seems to produce the expected results:

compute_imatrix: computing over 50 chunks with batch_size 512
compute_imatrix: 1.46 seconds per pass - ETA 1.22 minutes
[1]3.3282,[2]5.5064,[3]5.7696,[4]6.0597,[5]6.6383,[6]6.4067,[7]6.0626,[8]6.1729,[9]6.3318,
save_imatrix: stored collected data after 10 chunks in imatrix.dat
[10]5.8754,[11]5.6783,[12]5.8278,[13]5.8804,[14]5.7391,[15]5.9534,[16]5.9483,[17]5.9110,[18]6.0203,[19]5.9764,
save_imatrix: stored collected data after 20 chunks in imatrix.dat
[20]5.9101,[21]5.8586,[22]5.8696,[23]5.9431,[24]5.9631,[25]6.0114,[26]6.0204,[27]5.9588,[28]5.7325,[29]5.7142,
save_imatrix: stored collected data after 30 chunks in imatrix.dat
[30]5.6387,[31]5.5779,[32]5.4650,[33]5.4179,[34]5.3390,[35]5.2645,[36]5.2147,[37]5.1724,[38]5.1585,[39]5.1434,
save_imatrix: stored collected data after 40 chunks in imatrix.dat
[40]5.1864,[41]5.1752,[42]5.1467,[43]5.0827,[44]5.0719,[45]5.0194,[46]5.0461,[47]5.0968,[48]5.1533,[49]5.1977,
save_imatrix: stored collected data after 50 chunks in imatrix.dat
[50]5.1661,
Final estimate: PPL = 5.1661 +/- 0.10175

save_imatrix: stored collected data after 50 chunks in imatrix.dat
load_imatrix: loaded data (size = 114688, ncall =     50) for 'blk.31.ffn_down_exps.weight'
load_imatrix: loaded data (size =   4096, ncall =     50) for 'blk.31.attn_k.weight'
load_imatrix: loaded data (size = 114688, ncall =     50) for 'blk.30.ffn_down_exps.weight'
load_imatrix: loaded data (size =  32768, ncall =     50) for 'blk.30.ffn_up_exps.weight'
load_imatrix: loaded data (size =   4096, ncall =     50) for 'blk.30.ffn_gate_inp.weight'
load_imatrix: loaded data (size =   4096, ncall =     50) for 'blk.30.attn_v.weight'
load_imatrix: loaded data (size =   4096, ncall =     50) for 'blk.30.attn_k.weight'
load_imatrix: loaded data (size =   4096, ncall =     50) for 'blk.30.attn_q.weight'
load_imatrix: loaded data (size =   4096, ncall =     50) for 'blk.29.ffn_gate_inp.weight'
load_imatrix: loaded data (size =  32768, ncall =     50) for 'blk.31.ffn_up_exps.weight'
load_imatrix: loaded data (size =   4096, ncall =     50) for 'blk.29.attn_v.weight'
load_imatrix: loaded data (size =   4096, ncall =     50) for 'blk.29.attn_k.weight'
load_imatrix: loaded data (size =   4096, ncall =     50) for 'blk.29.attn_q.weight'
load_imatrix: loaded data (size = 114688, ncall =     50) for 'blk.28.ffn_down_exps.weight'
load_imatrix: loaded data (size =   4096, ncall =     50) for 'blk.28.attn_v.weight'
load_imatrix: loaded data (size =   4096, ncall =     50) for 'blk.28.attn_q.weight'
load_imatrix: loaded data (size =   4096, ncall =     50) for 'blk.31.attn_q.weight'
load_imatrix: loaded data (size = 114688, ncall =     50) for 'blk.27.ffn_down_exps.weight'
load_imatrix: loaded data (size =  32768, ncall =     50) for 'blk.27.ffn_gate_exps.weight'
load_imatrix: loaded data (size =  32768, ncall =     50) for 'blk.27.ffn_up_exps.weight'
load_imatrix: loaded data (size =   4096, ncall =     50) for 'blk.27.ffn_gate_inp.weight'
load_imatrix: loaded data (size =   4096, ncall =     50) for 'blk.27.attn_v.weight'
load_imatrix: loaded data (size =   4096, ncall =     50) for 'blk.27.attn_k.weight'
load_imatrix: loaded data (size =   4096, ncall =     50) for 'blk.27.attn_output.weight'

Contributor Author

So if I understand correctly, you based these changes on a build before #6505 was merged, and the results that show a higher number of ncall for the moe tensors is with a build without #6505, correct?

Yeah, I used the build right after the dbrx PR was pushed, as I originally went on this search after having lots of trouble quantizing it.

Contributor Author

I tested this PR with ncall increased unconditionally with mixtral and it seems to produce the expected results: [...]

Yeah, it looks like that can just be left unconditional then.

It's probably worth trying to re-quantize mixtral with and without these fixes too, just in case something else has changed since then.

Collaborator

It's probably worth trying to re-quantize mixtral with and without these fixes too, just in case something else has changed since then.

I will give it a try with a low number of chunks, but I don't have enough VRAM to create an imatrix for mixtral with the full wiki.train.raw in a reasonable amount of time.

Review thread on examples/imatrix/imatrix.cpp (outdated, resolved)
Co-authored-by: slaren <slarengh@gmail.com>
@jukofyork (Contributor) commented May 6, 2024

Yeah, I can confirm this seems to help wizard-lm-2 as well.

I won't post the full output, but the test question (with temperature=0) I've used is:

Can you write me a C++ program to calculate logistic regression using GSL? Write a short driver in main to test it with hard coded values.

as I've found in the past this is very telling of broken and/or dumb models...

dbrx:16x12b-instruct-q5_K_S

  • no-imatrix: gets confused and creates code for linear regression. Also doesn't know what "GSL" stands for.
  • old-imatrix: makes a reasonable go of the code, but it's not very clear and is all in 2 massive functions.
  • new-imatrix: makes a better job of the code and defines some helper functions for logistic PDF, derivatives, etc.

mixtral:8x22b-instruct-q4_K_M

  • no-imatrix: gets confused and produces terrible code.
  • old-imatrix: makes a reasonable go of the code, but has the "stopping mid-sentence" problem.
  • new-imatrix: makes a better job of the code and also defines some helper functions for logistic_f and logistic_df.

wizard-lm-2:8x22b-instruct-q4_K_M (old-imatrix first):


USER

Can you write me a C++ program to calculate logistic regression using GSL? Write a short driver in main to test it with hard coded values.

ASSISTANT

Certainly! Below is a simple C++ program that uses the GNU Scientific Library (GSL) to perform logistic regression on a set of hardcoded data points. This example assumes you have GSL installed on your system and that you can link against it when compiling the program.

#include <iostream>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_vector.h>
#include <gsl/gsl_multifit_nlinear.h>

// The logistic function
double logistic_function(double x, const gsl_vector *beta) {
    double z = gsl_vector_get(beta, 0) + gsl_vector_get(beta, 1) * x;
    return 1.0 / (1.0 + exp(-z));
}

// The vectorized logistic function for a set of samples
void logistic_f(const gsl_vector *x, void *params, gsl_vector *f) {
    size_t n = f->size;
    double *y = (double *)params;

    for (size_t i = 0; i < n; ++i) {
        double fi = logistic_function(gsl_vector_get(x, i), params);
        gsl_vector_set(f, i, fi - y[i]);
    }
}

// The Jacobian of the logistic function
void logistic_df(const gsl_vector *x, void *params, gsl_matrix *J) {
    size_t n = J->size1;
    double *y = (double *)params;

    for (size_t i = 0; i < n; ++i) {
        double fi = logistic_function(gsl_vector_get(x, i), params);
        gsl_matrix_set(J, i, 0, fi * (1.0 - fi));
        gsl_matrix_set(J, i, 1, x->data[i] * fi * (1.0 - fi));
    }
}

int main() {
    // Hardcoded data points (features and labels)
    double x[] = {1.0, 2.0, 3.0, 4.0}; // Features
    double y[] = {0.1, 0.2, 0.6, 0.9}; // Labels (probabilities)

    const size_t n = sizeof(x) / sizeof(x[0]); // Number of data points
    const size_t p = 2;                         // Number of parameters (intercept + slope)

    // Allocate workspace and vectors/matrices for GSL
    gsl_multifit_nlinear_workspace *w;
    gsl_vector *f;
    gsl_matrix *J;
    gsl_vector *beta;

    // Initialize the non-linear least squares solver
    gsl_multifit_nlinear_fdf fdf;
    fdf.f = logistic_f;
    fdf.df = logistic_df;
    fdf.fpp = NULL; // We do not provide a second derivative function
    fdf.params = y;

    gsl_multifit_nlinear_parameters params = gsl_multifit_nlinear_default_parameters();
    w = gsl_multifit_nlinear_alloc(gsl_multifit_nlinear_trust, n, p);
    gsl_multifit_nlinear_set(w, &fdf, beta, &params);

    // Initialize the starting values for the parameters
    beta = gsl_vector_alloc(p);
    gsl_vector_set(beta, 0, 0.0); // Intercept
    gsl_vector_set(beta, 1, 0.0); // Slope

    // Perform the non-linear least squares fit
    int status;
    size_t iter = 0;
    do {
        iter++;
        status = gsl_multifit_nlinear_iterate(w);

        if (status) {
            std::cerr << "Error: " << gsl_strerror(status) << std::endl;
            break;
        }

        status = gsl_multifit_nlinear_test_delta(w, 1e-4, 1e-4);
    } while (status == GSL_CONTINUE && iter < 100);

    // Retrieve the optimized parameters
    gsl_vector_memcpy(beta, w->x);

    std::cout << "Status: " << gsl_strerror(status) << std::endl;
    std::cout << "Iterations: " << iter << std::endl;
    std::cout << "Beta parameters:" << std::endl;
    gsl_vector_fprintf(stdout, beta, "%g");

    // Clean up
    gsl_multifit_nlinear_free(w);
    gsl_vector_free(beta);

    return 0;
}

To compile this program, you need to link against the GSL library. If you have GSL installed in a standard location, you can compile the program with the following command:

g++ -o logistic_regression logistic_regression.cpp -lgsl -lgslcblas -lm

If GSL is not installed in a standard location, you may need to specify include directories and library paths using -I and -L flags respectively.

After compiling, you can run the program with:

./logistic_regression

This program will output the optimized beta parameters for the logistic regression model fitted to the hardcoded data points. The logistic_f function computes the residuals between the predicted probabilities and the actual labels, while the logistic_df function computes the Jacobian matrix of the logistic function with respect to the parameters. The GSL non-linear least squares solver uses these functions to iteratively update the parameter estimates until convergence or a maximum number of iterations is reached.


USER

Can you write me a C++ program to calculate logistic regression using GSL? Write a short driver in main to test it with hard coded values.

ASSISTANT

Certainly! Below is a simple C++ program that uses the GNU Scientific Library (GSL) to perform logistic regression on a set of hardcoded data points. This example assumes you have GSL installed on your system and that you can link against it when compiling the program.

#include <iostream>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_vector.h>
#include <gsl/gsl_multifit_nlinear.h>

// The logistic function
double logistic_function(double z) {
    return 1.0 / (1.0 + exp(-z));
}

// The derivative of the logistic function
double dlogistic_function(double y) {
    return y * (1 - y);
}

// The function to compute the residual
int logistic_f(const gsl_vector *x, void *data, gsl_vector *f) {
    size_t n = ((struct data *)data)->n;
    double *y = ((struct data *)data)->y;
    double *X = ((struct data *)data)->X;
    double sum;

    for (size_t i = 0; i < n; ++i) {
        sum = gsl_vector_get(x, 0); // intercept
        for (size_t j = 1; j < x->size; ++j) {
            sum += gsl_vector_get(x, j) * X[i * (x->size - 1) + j - 1]; // features
        }
        gsl_vector_set(f, i, y[i] - logistic_function(sum));
    }

    return GSL_SUCCESS;
}

// The function to compute the Jacobian of the residual
int logistic_df(const gsl_vector *x, void *data, gsl_matrix *J) {
    size_t n = ((struct data *)data)->n;
    double *X = ((struct data *)data)->X;

    for (size_t i = 0; i < n; ++i) {
        double y_pred = logistic_function(gsl_vector_get(x, 0)); // intercept
        for (size_t j = 1; j < x->size; ++j) {
            y_pred += gsl_vector_get(x, j) * X[i * (x->size - 1) + j - 1]; // features
        }
        double dy_pred = dlogistic_function(y_pred);

        gsl_matrix_set(J, i, 0, dy_pred);
        for (size_t j = 1; j < x->size; ++j) {
            gsl_matrix_set(J, i, j, dy_pred * X[i * (x->size - 1) + j - 1]);
        }
    }

    return GSL_SUCCESS;
}

// The function to compute the residual and the Jacobian
int logistic_fd(const gsl_vector *x, void *data, gsl_vector *f, gsl_matrix *J) {
    logistic_f(x, data, f);
    logistic_df(x, data, J);
    return GSL_SUCCESS;
}

int main() {
    // Hardcoded dataset: 3 samples with 2 features each (including the bias term)
    const size_t n = 3; // number of samples
    const size_t p = 3; // number of parameters (including intercept)
    double X_data[] = {1.0, 0.5, -1.0, 1.0, 1.0, 2.0}; // feature matrix with bias term
    double y_data[] = {0.0, 1.0, 1.0}; // target values

    // Allocate memory for the dataset
    gsl_matrix *X = gsl_matrix_alloc(n, p - 1);
    gsl_vector *y = gsl_vector_alloc(n);
    struct data d = {n, y_data, X_data};

    // Copy the hardcoded data into GSL structures
    gsl_matrix_set_all(X, 1.0); // Set the bias term (intercept) to 1
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < p - 1; ++j) {
            gsl_matrix_set(X, i, j, X_data[i * (p - 1) + j]);
        }
        gsl_vector_set(y, i, y_data[i]);
    }

    // Initial guess for the parameters (including intercept)
    gsl_vector *x = gsl_vector_alloc(p);
    gsl_vector_set(x, 0, 0.0); // intercept
    gsl_vector_set(x, 1, 0.0); // coefficient for feature 1
    gsl_vector_set(x, 2, 0.0); // coefficient for feature 2

    // Allocate space for the covariance matrix
    gsl_matrix *covar = gsl_matrix_alloc(p, p);

    // Perform the non-linear least squares fit
    gsl_multifit_nlinear_fdf fdf;
    gsl_multifit_nlinear_parameters params = gsl_multifit_nlinear_default_parameters();
    fdf.f = &logistic_f;
    fdf.df = &logistic_df;
    fdf.fdf = &logistic_fd;
    fdf.n = n;
    fdf.p = p;
    fdf.params = &d;

    gsl_multifit_nlinear_workspace *w = gsl_multifit_nlinear_alloc(n, p);
    gsl_multifit_nlinear_init(&fdf, x, w);

    int status;
    size_t iter = 0;
    do {
        iter++;
        status = gsl_multifit_nlinear_iterate(w);

        // Check if the iteration has converged
        if (status) break;

        status = gsl_multifit_nlinear_test_delta(w, 1e-4, 1e-4);
    } while (status == GSL_CONTINUE && iter < 100);

    // Retrieve the optimized parameters
    gsl_vector_memcpy(x, w->x);

    // Compute the covariance matrix
    gsl_multifit_nlinear_covar(w, covar);

    // Output the results
    std::cout << "Best fit parameters:\n";
    for (size_t i = 0; i < p; ++i) {
        std::cout << "x[" << i << "] = " << gsl_vector_get(x, i) << "\n";
    }
    std::cout << "Covariance matrix:\n";
    gsl_matrix_fprintf(stdout, covar, "%g");

    // Clean up
    gsl_multifit_nlinear_free(w);
    gsl_matrix_free(X);
    gsl_vector_free(y);
    gsl_vector_free(x);
    gsl_matrix_free(covar);

    return 0;
}

To compile this program, you need to link against the GSL library. If you have GSL installed in a standard location, you can compile the program with the following command:

g++ -o logistic_regression logistic_regression.cpp -lgsl -lgslcblas -lm

If GSL is not installed in a standard location, you may need to specify the include and library paths using -I and -L flags respectively.

Please note that this example uses non-linear least squares fitting provided by GSL's gsl_multifit_nlinear module, which is suitable for logistic regression since it involves a non-linear model. The hardcoded dataset in the main function is minimal and only for demonstration purposes. In practice, you would use a larger dataset and possibly normalize your features before fitting the model.


So even though wizard-lm-2:8x22b didn't seem bad before, it clearly was affected too.

I'm going to try this on Eurux:8x22b-nca now to see if it has the same effect.

@slaren (Collaborator) commented May 6, 2024

I tested mixtral on master and on this PR with a 200-chunk imatrix (~100k tokens) and IQ3_S, with these commands:

./imatrix -f wikitext-2-raw/wiki.train.raw -m models/mixtral/ggml-model-q8_0.gguf -ngl 12 --chunks 200
./quantize --allow-requantize --imatrix imatrix.dat models/mixtral/ggml-model-q8_0.gguf iq3_s
./perplexity -f wikitext-2-raw/wiki.test.raw -m models/mixtral/ggml-model-IQ3_S.gguf -ngl 99

And got these results:
Master: Final estimate: PPL = 4.3625 +/- 0.02228
PR: Final estimate: PPL = 4.3620 +/- 0.02228

So there is a slight reduction, but within the margin of error. Maybe the imatrix is too small, or IQ3_S is too big for the difference to be noticeable.

@slaren (Collaborator) commented May 6, 2024

Used the same imatrix to quantize and test IQ1_S:
Master: Final estimate: PPL = 7.9913 +/- 0.04708
PR: Final estimate: PPL = 7.8876 +/- 0.04655

More significant difference with IQ1_S.

@slaren (Collaborator) left a comment

@ggerganov please take a look.

@slaren slaren requested a review from ggerganov May 6, 2024 15:51
@jukofyork (Contributor Author) commented May 6, 2024

Used the same imatrix to quantize and test IQ1_S:
Master: Final estimate: PPL = 7.9913 +/- 0.04708
PR: Final estimate: PPL = 7.8876 +/- 0.04655

More significant difference with IQ1_S.

It could just be that mixtral:8x7b isn't as affected as wizard-lm-2 was; or, since it isn't an instruct model, maybe instruct models are affected much more?

I did notice that the models that seemed the worst affected had the highest reported PPL whilst running imatrix on groups_merged.txt:

  • dbrx-instruct something like 9.
  • mixtral:8x22b-instruct something like 12.
  • wizard-lm-2 something like 3-4.

That seemed very odd to me, as both mixtral:8x22b-instruct and wizard-lm-2 were trained from the same base model, and mixtral:8x22b-instruct only has a handful of extra tokens in its vocab (32768 vs 32000).

@slaren (Collaborator) commented May 6, 2024

My results are from mixtral 8x7b.

@jukofyork (Contributor Author)

It might be interesting to print the counts for each expert: it could simply be that some instruct models, or base models, have a more uneven softmax gating distribution than others.

@slaren (Collaborator) commented May 6, 2024

I imagine that's a possibility. Intuitively, even with an even distribution, an MoE model may require as many samples as the number of experts times what a dense model needs to achieve the same coverage (for example, with top-2 routing over 8 experts, each expert tensor sees on average only a quarter of the tokens), so using only 100k tokens in my tests may also be a factor. In any case, the difference in IQ1_S seems significant to me.

@jukofyork (Contributor Author)

I imagine that's a possibility. Intuitively, even with an even distribution, an MoE model may require as many samples as the number of experts times what a dense model needs to achieve the same coverage, so using only 100k tokens in my tests may also be a factor. In any case, the difference in IQ1_S seems significant to me.

Yeah, it might be interesting if @dranger003 tries again with the dbrx-instruct model, as he found it didn't really seem to take well to low-bitrate quants a few weeks ago.

@dranger003 (Contributor)

@jukofyork Do we have the updated pre-tokenizer convert script somewhere for dbrx, or did you use an old version to convert the weights?

@jukofyork (Contributor Author)

@jukofyork Do we have the updated pre-tokenizer convert script somewhere for dbrx, or did you use an old version to convert the weights?

I just used the older version from mid-April. I did get the new Python scripts working for dbrx, but then saw in the command-r PR that some C++ code needed writing to work with the added dbrx tag, so didn't go any further.

@dranger003 (Contributor)

Ah yes, I see you have an open issue. Happy to submit a PR but I'm not sure about the regex splits. Wouldn't having proper pre-tokenization affect perplexity in testing?

@ggerganov (Owner)

Basically, we divide down to get the actual mean, based on how many values were actually added to an element values[i], and then multiply back up to get a value that can be used both for the weighted combination with other imatrix files and for quantize to recover the original mean when it divides by the ncall stored in the file.

This makes sense.

@dranger003 Let me push DBRX support on master and would appreciate if you can give it a go together with the changes in this PR

@ggerganov (Owner)

@dranger003 Let me push DBRX support on master and would appreciate if you can give it a go together with the changes in this PR

Nvm, there is some issue with DBRX using Tiktoken and it fails to convert - it will probably take some time to resolve. I will do some tests with Mixtral instead

@jukofyork (Contributor Author) commented May 7, 2024

Ah yes, I see you have an open issue. Happy to submit a PR but I'm not sure about the regex splits. Wouldn't having proper pre-tokenization affect perplexity in testing?

Yeah, it would be interesting to see what causes such a huge difference in perplexity between wizard-lm-2:8x22b and mixtral-instruct:8x22b.

I'm not sure if the mixtral-instruct:8x22b model is just broken in some other way though - I did notice it has "use BOS token" set to false (which wasn't the case for the base mixtral:8x22b, the original base/instruct mixtral:8x7b, nor wizard-lm-2:8x22b or any of the 3-4 available fine-tunes off the base mixtral:8x22b).

The dbrx model does seem to have a larger vocab (100k+ vs ~32k IIRC), so that is most likely explainable, but the difference between wizard-lm-2:8x22b and mixtral-instruct:8x22b is very perplexing ( 😁 ) considering they have almost the same vocab size (32k vs 32768).

@jukofyork (Contributor Author)

Hold off pushing this - I've found a bug when combining! Will fix it now.

- Fixed segfault caused by the counts vector not being created.
- Fixed a pre-existing bug that didn't actually add to the counts for the "--combine" option.
@jukofyork (Contributor Author)

Hold off pushing this - I've found a bug when combining! Will fix it now.

Fixed it, but also found there was a pre-existing bug that meant the "--combine" option didn't actually combine anything and just overwrote:

in.read((char*)e.values.data(), nval*sizeof(float));

So only the last imatrix in the list to be combined was getting used.

Will test now with some extra debugging output to double check the weighted sum is as expected.

@slaren (Collaborator) commented May 7, 2024

I guess combining imatrices never worked. Isn't there a risk of loss of precision after enough steps if data keeps being accumulated into a float?

@jukofyork (Contributor Author) commented May 7, 2024

Confirmed working - printed from inside imatrix::save_imatrix() loop:

Combining the following 2 files
    wizard-lm-2:8x22b-f16.imatrix_05
    wizard-lm-2:8x22b-f16.imatrix_01
values[i]=33.277802 counts[i]=512 tmp[i]=33.277802 ncall=512
values[i]=32.688107 counts[i]=512 tmp[i]=32.688107 ncall=512
values[i]=24.805161 counts[i]=512 tmp[i]=24.805161 ncall=512
values[i]=25.708702 counts[i]=512 tmp[i]=25.708702 ncall=512
values[i]=30.115463 counts[i]=512 tmp[i]=30.115463 ncall=512
values[i]=36.828308 counts[i]=512 tmp[i]=36.828308 ncall=512
values[i]=24.616169 counts[i]=512 tmp[i]=24.616169 ncall=512
values[i]=31.104462 counts[i]=512 tmp[i]=31.104462 ncall=512
values[i]=18.073238 counts[i]=512 tmp[i]=18.073238 ncall=512
values[i]=52.407349 counts[i]=512 tmp[i]=52.407349 ncall=512
values[i]=39.958645 counts[i]=512 tmp[i]=39.958645 ncall=512

and:

Combining the following 10 files
    wizard-lm-2:8x22b-f16.imatrix_01
    wizard-lm-2:8x22b-f16.imatrix_02
    wizard-lm-2:8x22b-f16.imatrix_03
    wizard-lm-2:8x22b-f16.imatrix_04
    wizard-lm-2:8x22b-f16.imatrix_05
    wizard-lm-2:8x22b-f16.imatrix_06
    wizard-lm-2:8x22b-f16.imatrix_07
    wizard-lm-2:8x22b-f16.imatrix_08
    wizard-lm-2:8x22b-f16.imatrix_09
    wizard-lm-2:8x22b-f16.imatrix_10
values[i]=171.538055 counts[i]=2560 tmp[i]=171.538055 ncall=2560
values[i]=179.629059 counts[i]=2560 tmp[i]=179.629059 ncall=2560
values[i]=121.026230 counts[i]=2560 tmp[i]=121.026230 ncall=2560
values[i]=145.577026 counts[i]=2560 tmp[i]=145.577026 ncall=2560
values[i]=154.203339 counts[i]=2560 tmp[i]=154.203339 ncall=2560
values[i]=212.427841 counts[i]=2560 tmp[i]=212.427826 ncall=2560
values[i]=123.816238 counts[i]=2560 tmp[i]=123.816238 ncall=2560
values[i]=170.491425 counts[i]=2560 tmp[i]=170.491425 ncall=2560
values[i]=96.285919 counts[i]=2560 tmp[i]=96.285919 ncall=2560
values[i]=249.120377 counts[i]=2560 tmp[i]=249.120392 ncall=2560
values[i]=203.492462 counts[i]=2560 tmp[i]=203.492462 ncall=2560
values[i]=134.679352 counts[i]=2560 tmp[i]=134.679352 ncall=2560
values[i]=93.139969 counts[i]=2560 tmp[i]=93.139969 ncall=2560
values[i]=147.348282 counts[i]=2560 tmp[i]=147.348282 ncall=2560
values[i]=683.955627 counts[i]=2560 tmp[i]=683.955627 ncall=2560
values[i]=187.924072 counts[i]=2560 tmp[i]=187.924072 ncall=2560
values[i]=142.019104 counts[i]=2560 tmp[i]=142.019104 ncall=2560
values[i]=217.543076 counts[i]=2560 tmp[i]=217.543060 ncall=2560

So if this file were recombined, the weighted sum would be correct.

and when it gets loaded inside quantize::load_imatrix() it will be divided correctly:

        if (ncall > 0) {
            for (auto& v : e) v /= ncall;
        }

to get the mean of the combined values:

v = 0.067007
v = 0.070168
v = 0.047276
v = 0.056866
v = 0.060236
v = 0.082980
v = 0.048366
v = 0.066598

@jukofyork (Contributor Author)

I guess combining imatrices never worked. Isn't there a risk of loss of precision after enough steps if data keeps being accumulated into a float?

Yeah, there is some risk when summing squared values, but this was just an outright bug.

The biggest risk is when you start taking the difference of squared values (i.e., catastrophic cancellation) or when the magnitudes of the squares are vastly different (see here for a good example), but all the norm layers in these LLMs, used to mitigate the "exploding gradient problem", likely avoid those sorts of problems, and just looking at the weighted sums of 2 vs 10 files, any numerical errors look pretty insignificant.

@ggerganov (Owner) left a comment

Using base Mixtral 8x7b and IQ3_S quantization - perplexity improves slightly using the new imatrix. Think this is good to merge - nice work!

@jukofyork (Contributor Author) commented May 7, 2024

wizard-lm-2:8x22b-instruct-q4_K_M (old-imatrix first): [full USER/ASSISTANT transcripts quoted verbatim from the earlier comment above]

So even though wizard-lm-2:8x22b didn't seem bad before, it clearly was affected too.

I'm going to try this on Eurux:8x22b-nca now to see if it has the same effect.

Well, spending all night and using a much bigger sample of data for the imatrix seems to have paid off:

#include <iostream>
#include <vector>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_vector.h>
#include <gsl/gsl_blas.h>
#include <gsl/gsl_multifit_nlinear.h>

// The logistic function
double logistic_function(double z) {
    return 1.0 / (1.0 + exp(-z));
}

// The derivative of the logistic function
double logistic_derivative(double y) {
    return y * (1.0 - y);
}

// The function to compute the residual
void logistic_f(const gsl_vector *x, void *params, gsl_vector *f) {
    size_t n = ((struct data *)params)->n;
    double *y = ((struct data *)params)->y;
    double *X = ((struct data *)params)->X;

    gsl_vector_view Xi(const_cast<double *>(X) + n * i, n);
    double yi = gsl_vector_get(x, i);
    double y_pred = logistic_function(gsl_vector_dot_product(&Xi.vector, x));

    gsl_vector_set(f, i, yi - y_pred);
}

// The function to compute the Jacobian
void logistic_df(const gsl_vector *x, const gsl_vector *f, gsl_matrix *J, void *params) {
    size_t n = ((struct data *)params)->n;
    double *y = ((struct data *)params)->y;
    double *X = ((struct data *)params)->X;

    for (size_t i = 0; i < n; ++i) {
        gsl_vector_view Xi(const_cast<double *>(X) + n * i, n);
        double yi = gsl_vector_get(x, i);
        double y_pred = logistic_function(gsl_vector_dot_product(&Xi.vector, x));
        double dy_pred = logistic_derivative(y_pred);

        for (size_t j = 0; j < n; ++j) {
            gsl_matrix_set(J, i, j, -dy_pred * gsl_vector_get(&Xi.vector, j));
        }

        gsl_matrix_set(J, i, i, gsl_matrix_get(J, i, i) + 1.0);
    }
}

// The logistic regression solver
int logistic_regression(const std::vector<std::vector<double>> &X, const std::vector<double> &y, gsl_vector *beta) {
    size_t n = X.size();
    size_t p = X[0].size();

    // Convert data to GSL format
    gsl_matrix *gslX = gsl_matrix_alloc(n, p);
    gsl_vector *gslY = gsl_vector_alloc(n);
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < p; ++j) {
            gsl_matrix_set(gslX, i, j, X[i][j]);
        }
        gsl_vector_set(gslY, i, y[i]);
    }

    // Initialize the solver parameters
    const gsl_multifit_nlinear_type *T = gsl_multifit_nlinear_trust;
    gsl_multifit_nlinear_workspace *w = gsl_multifit_nlinear_alloc(T, n, p);
    gsl_multifit_nlinear_fdf fdf;
    fdf.f = &logistic_f;
    fdf.df = &logistic_df;
    fdf.fvv = nullptr;
    fdf.n = n;
    fdf.p = p;
    fdf.params = &data;

    // Initialize with starting values
    for (size_t i = 0; i < p; ++i) {
        gsl_vector_set(beta, i, 0.01);
    }

    // Solve the system
    int status;
    size_t iter = 0;
    const size_t max_iter = 100;
    double chisq;

    status = gsl_multifit_nlinear_solve(
        &fdf,
        beta,
        gslX,
        gslY,
        w,
        &chisq,
        &iter,
        nullptr,
        nullptr
    );

    // Clean up
    gsl_multifit_nlinear_free(w);
    gsl_matrix_free(gslX);
    gsl_vector_free(gslY);

    return status;
}

int main() {
    // Hardcoded data for testing
    std::vector<std::vector<double>> X = {
        {1, 0.1, 0.2},
        {1, 0.2, 0.3},
        {1, 0.3, 0.5},
        {1, 0.4, 0.7},
        {1, 0.5, 0.9},
        {1, 0.6, 1.1},
        {1, 0.7, 1.3},
        {1, 0.8, 1.5},
        {1, 0.9, 1.7}
    };
    std::vector<double> y = {0, 0, 0, 1, 1, 1, 1, 1, 1};

    // Allocate space for the parameters
    gsl_vector *beta = gsl_vector_alloc(X[0].size());

    // Perform logistic regression
    int status = logistic_regression(X, y, beta);

    if (status) {
        std::cerr << "Error: " << gsl_strerror(status) << std::endl;
    } else {
        // Output the fitted parameters
        std::cout << "Fitted parameters:" << std::endl;
        for (size_t i = 0; i < beta->size; ++i) {
            std::cout << "beta[" << i << "] = " << gsl_vector_get(beta, i) << std::endl;
        }
    }

    // Clean up
    gsl_vector_free(beta);

    return 0;
}

I would say this is subjectively a better attempt than the one which used the imatrix file generated from just ~100 chunks of groups_merged: it uses vector operations in place of loops far more, it clearly separates the logistic_regression() code out of main(), and its driver is much more easily adapted via the use of std::vector rather than C-style arrays... I should add that I'm never really interested in the code actually working properly or even being runnable (show me the programmer who can write this in 1-shot with no bugs!) - I'm much more interested in the approach it takes...

I can't share the dataset as it has pretty much every random bit of code and programming-related material I could find all stuffed in it, but if anybody is interested, then as a test I actually got wizard-lm-2 to write the C++ program I used to create the pseudo-groups_merged dataset: it recursively scans a folder and fishes out anything that is text-based, then selects chunks of this to randomly glue together using a given mean and std-dev for the chunk sizes (I used: mean = 512 chars / std-dev 1024 chars).

I also ran using 512, 1024, 2048, 4096 and 8192 contexts (evenly weighted), as I have a feeling the extreme "lazy-GPTness" that showed up with this model and with dbrx might be subtly due to using only 512-token contexts (I will redo this tonight on dbrx and share the results).

BTW: This is the post that made me take much more interest in wizard-lm-2 and especially its ability with "C"-family languages like C, C++, Java, etc. (which in the past LLMs often seem to suck at):

https://old.reddit.com/r/LocalLLaMA/comments/1clfahu/we_benchmarked_30_llms_across_26_languages_using/

https://prollm.toqan.ai/leaderboard

@jukofyork (Contributor Author)

Using base Mixtral 8x7b and IQ3_S quantization - perplexity improves slightly using the new imatrix. Think this is good to merge - nice work!

I actually have another idea on how we could potentially improve perplexity of these low-bpw MoE models and did some experiments on it here:

arcee-ai/mergekit#294

I originally tried to scale the experts.w2.weight tensor from the MoE MLPs, but that didn't work well, so I then tried scaling up the block_sparse_moe.gate.weight tensor, which in turn sharpens the categorical distribution of the gating weight softmax.

But then I lost interest because it was clear that, at best, all I was going to do was make a slightly worse model that ran slower...

Then it occurred to me that these low-bpw quants of MoE models have such a big jump in perplexity that it might actually be a useful method to help smooth out some of the noise: basically, add in more (or even all) experts, then try to sharpen the gating weight softmax to the value that minimizes the PPL. @slaren's example above shows the jump between 3-bpw and 1-bpw is massive, so there is a good chance this could improve PPL more than it hurts it.

The biggest question is just how the categorical distribution will look at the optimal scale factor: if it just looks very like the top-k distribution, then it won't do anything useful, but if it actually has significant contributions coming from the non-top-k bins, then it might be worthwhile!

My biggest problem is how to optimize this: currently I have to use Mergekit to do the scaling, then convert to fp16 GGUF, then quantize... over and over :/

@slaren slaren merged commit 48b2f9c into ggerganov:master May 8, 2024
58 checks passed
@Green-Sky (Collaborator)

Fixed it, but also found there was a pre-existing bug that meant the "--combine" option didn't actually combine anything and just overwrote:

in.read((char*)e.values.data(), nval*sizeof(float));

So only the last imatrix in the list to be combined was getting used.

Checks out - I came across this (same PPL after merging) but totally forgot to report or look into it, thanks :)
