
Collect non-sensitive platform-specific config stats #56762

Closed
mshustov opened this issue Feb 4, 2020 · 31 comments
Assignees
Labels
Feature:New Platform Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments

@mshustov
Contributor

mshustov commented Feb 4, 2020

We would like to understand which platform features are used and how they are configured by users. That would help us make decisions about API completeness.
The main problem is that the platform doesn't have access to the Telemetry plugin, nor does the telemetry plugin have access to the platform config.

One item we'd like to include as well is which 3rd party plugins are installed + enabled.

Questions we'd like to answer for config things:

  • How many SavedObjects do we have?
  • Which plugins were disabled by the user?
  • Are any 3rd party plugins installed? Which ones?

Data to collect, all keys should be opt-in:

  • Disabled plugins + installed 3rd party plugins
  • HTTP config (?)
  • Elasticsearch config (?)
  • Logging config (?)
@mshustov mshustov added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Feature:New Platform labels Feb 4, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-platform (Team:Platform)

@elasticmachine
Contributor

Pinging @elastic/pulse (Team:Pulse)

@pgayvallet
Contributor

The main problem is that the platform doesn't have access to the Telemetry plugin.

The only option here would be to expose a hook-like API as we do for HTTP authentication, I think. We'd really need a good reason to do so, however, as this involves relatively heavy changes.

Nor does the telemetry plugin have access to the platform config.

Which part of the config would the plugin actually need access to? Everything?

@mshustov
Contributor Author

mshustov commented Feb 10, 2020

Which part of the config would the plugin actually need access to? Everything?

In theory yes, we might want to track any part of the config: from Elasticsearch SSL setup to logging flags.
Maybe we should start by defining the information we're interested in: config values, plugin usage stats, runtime metrics, etc. OTOH, our system should allow us to easily adjust the list of metrics.

@Bamieh
Member

Bamieh commented Feb 10, 2020

@restrry The Telemetry plugin is only responsible for sending telemetry to our cluster. The usageCollection plugin will be responsible for fetching data from the platform in this case (via the usageCollection.registerCollector method).

I feel that we can move the usageCollection to core. We do not allow disabling this plugin since we always want it to be running. Usage collection is used by monitoring, all plugins sending telemetry, and potentially pulse collectors (still under discussion).

@pgayvallet
Contributor

I feel that we can move the usageCollection to core. We do not allow disabling this plugin since we always want it to be running. Usage collection is used by monitoring, all plugins sending telemetry, and potentially pulse collectors (still under discussion).

That was my feeling too. A plugin that can't/shouldn't be disabled and is used by every other plugin should probably be part of core.

@pgayvallet
Contributor

Reopening the discussion as this is currently due for 7.9.

  • Regarding the usageCollection plugin

So, as it seems we reached a consensus on that, should we move the usageCollection feature from the current plugin to a core service?

usageCollection has no declared dependencies, so it's at least technically doable, even if this is going to be quite a lot of changes, as the plugin is currently heavily used.

BTW The plugin is currently owned by @elastic/pulse. Moving it to core would change the owner to the platform team (I guess?). Is everybody alright with that?

  • Regarding the data core wants to collect

One item we'd like to include as well is which 3rd party plugins are installed + enabled.

In theory yes, we might want to track any part of the config: from Elasticsearch SSL setup to logging flags.
maybe we should start by defining the information we're interested in: config values, plugin usage stats, runtime metrics, etc.

Any further insight on what we should collect, at least for the initial version?

@afharo
Member

afharo commented Jun 10, 2020

BTW The plugin is currently owned by @elastic/pulse. Moving it to core would change the owner to the platform team (I guess?). Is everybody alright with that?

I have no objections to that.

One item we'd like to include as well is which 3rd party plugins are installed + enabled.

I think we can implement this without moving it to core. We can create a very simple plugin where we add all the known third-party plugins to the optionalDependencies in its kibana.json. The collector will check whether they are available in the dependencies parameter and return true or false based on that. Would that work?
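A minimal sketch of what that collector's fetch logic might look like, assuming the dependency map shape described above (the plugin names and the helper function are illustrative, not real plugins or Kibana APIs):

```typescript
// Hypothetical sketch: for each known third-party plugin declared as an
// optional dependency, report whether it is actually present in the
// dependencies passed to the collector. Names are illustrative.
const KNOWN_THIRD_PARTY_PLUGINS = ['somePluginA', 'somePluginB'];

function collectThirdPartyPresence(
  availableDeps: Record<string, unknown>
): Record<string, boolean> {
  const usage: Record<string, boolean> = {};
  for (const name of KNOWN_THIRD_PARTY_PLUGINS) {
    // true if the optional dependency resolved, false if absent/disabled
    usage[name] = name in availableDeps;
  }
  return usage;
}
```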

Also, if those plugins are registered apps in Kibana, application_usage will automatically report them whenever they are used.

In theory yes, we might want to track any part of the config: from Elasticsearch SSL setup to logging flags.

"Any" seems a bit risky. I agree with only exposing a well-defined set of properties we may want to report.

Q: Can we retrieve any of those configurations via an Elasticsearch API request?

NB: Maybe core can actually define a CoreUsageCollector service that will call usageCollector.makeUsageCollector and usageCollector.registerUsageCollector. Technically, the usageCollector can receive new registrations at any given point. It's not limited to the setup lifecycle step.

maybe we should start by defining the information we're interested in: config values, plugin usage stats, runtime metrics, etc.

Please bear in mind that the current payload (it can be seen in Management > Stack Management > Kibana > Advanced Settings > Usage Data > See an example of what we collect) is already huge, and we might be duplicating an effort on something that is already reported:

For example, we already collect the overridden configurations in the UI Management section: https://github.com/elastic/kibana/blob/master/src/plugins/kibana_usage_collection/server/collectors/management/telemetry_management_collector.ts#L26-L43

Re the runtime metrics, maybe APM is the way to go? Personally, I think anything we might implement to measure them would end up looking pretty similar to APM's existing agents. There is a blocker at the moment, but I'd say we should push the effort to get it fixed rather than re-implement our own solution.

@pgayvallet
Contributor

pgayvallet commented Jun 10, 2020

I think we can implement this without moving it to core. We can create a very simple plugin where we add all the known third-party plugins to the optionalDependencies in its kibana.json

That seems like a very fragile mechanism compared to using the information from the plugin service, IMHO. Also, we won't be able to dissociate non-installed plugins from disabled ones.

NB: Maybe core can actually define a CoreUsageCollector service that will call usageCollector.makeUsageCollector and usageCollector.registerUsageCollector

core shouldn't (I won't say can't, because that's not technically entirely true, but it should be treated as such) consume any plugin API.

Please, bear in mind the current payload (it can be seen in Management > Stack Management > Kibana > Advanced Settings > Usage Data > See an example of what we collect) is already huge

Hmm, I don't know the details of how we send data, but is the whole payload containing all the metrics always sent? Don't we have an event-based API to send custom events with custom payloads?

@afharo
Member

afharo commented Jun 10, 2020

That seems like a very fragile mechanism compared to using the information from the plugin service, IMHO. Also, we won't be able to dissociate non-installed plugins from disabled ones.

Yeah, core has more visibility and can provide better insights :)

core shouldn't (I won't say can't, because that's not technically entirely true, but it should be treated as such) consume any plugin API.

Fair enough. I'm really OK with moving the usageCollection plugin to core. It was a suggestion as an optional approach we might want to explore :)

Hmm, I don't know the details of how we send data, but is the whole payload containing all the metrics always sent? Don't we have an event-based API to send custom events with custom payloads?

We don't have any event-based API. We only send the usage payload once every 24h (with a * here because it's not that strict). At the point of sending the telemetry payload, we call the fetch methods registered with the usageCollection plugin.
At the moment, we don't send telemetry in an event-triggered way. Some pieces of the telemetry may be event related (like ui_metric), but they are stored locally and sent to the remote telemetry cluster only when the fetch method is called (around once every 24h).
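The store-locally / report-on-fetch behavior described here can be sketched roughly like this (names are illustrative, not the actual ui_metric implementation):

```typescript
// Events are buffered in memory and only drained when the usage collection
// fetch runs (roughly every 24h), rather than being pushed per event.
class EventBuffer<T> {
  private events: T[] = [];

  record(event: T): void {
    this.events.push(event);
  }

  // Called by the collector's fetch method: returns and clears the buffer.
  drain(): T[] {
    const out = this.events;
    this.events = [];
    return out;
  }
}
```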

@joshdover
Contributor

The primary argument I can think of against moving usage collectors to Core is more of an organization / team design one than a technical one. In essence, I think it's simpler if all the code in a given subdirectory is fully owned by a single team. It eliminates communication overhead and allows a single team to make sweeping changes to that directory without coordination.

That said, usage collectors are such a small pattern / mechanism that it may be simple enough to just move them to Core and make the Platform team the owners. However, if there are future enhancements planned for this pattern, or we expect to use something else entirely in the near to mid term, then I think we should explore other options.

One way we could achieve what we need here without moving usage collectors is to expose an API from Core that essentially is a usage collector but isn't tied into the registry of usage collectors directly. Instead, the telemetry plugin could set up a usage collector that calls this Core API in order to grab data from Core.

Example

Extension of the Core API

interface CoreSetup {
  // Core could expose some API that is _essentially_ a usage collector
  telemetry: {
    type: 'kibana_core',
    schema: CoreTelemetrySchema;
    fetch(): Promise<CoreTelemetry>;
  }
}

Telemetry plugin uses Core API to create a core usage collector

class TelemetryPlugin {
  setup(core, plugins) {
    const coreCollector = plugins.usageCollection.makeUsageCollector(core.telemetry);
    plugins.usageCollection.registerCollector(coreCollector);
  }
}

@TinaHeiligers
Contributor

The primary argument I can think of against moving usage collectors to Core is more of an organization / team design one than a technical one

We shouldn’t restrict ourselves to the current team/organization design when it comes to making the right decisions. Teams can be restructured and need to change with Elastic’s needs 😄

I think it's simpler if all the code in a given subdirectory is fully owned by a single team. It eliminates communication overhead and allows a single team to make sweeping changes to that directory without coordination.

A lot of different teams send usage data to our telemetry cluster, and the demand for data is increasing. The data we collect from Kibana allows the business to monitor solution adoption and, IMHO, any sweeping changes made to the collection mechanism and the data we collect need to be communicated and controlled. ATM, nothing's really stopping the small kibana-telemetry team from making such changes in isolation and, with the importance of usage data increasing, it's maybe time to reconsider that approach.

However if there are future enhancements planned for this pattern or we expect to use something else entirely in the near- to mid-term, then I think we should explore other options.

Code-wise, we can certainly take the approach you suggest to get the job done. We do have a few improvement initiatives we're working on while helping Infra take on the task of the remote Elastic Telemetry service.

@joshdover
Contributor

A lot of different teams send usage data to our telemetry cluster, and the demand for data is increasing. The data we collect from Kibana allows the business to monitor solution adoption and, IMHO, any sweeping changes made to the collection mechanism and the data we collect need to be communicated and controlled. ATM, nothing's really stopping the small kibana-telemetry team from making such changes in isolation and, with the importance of usage data increasing, it's maybe time to reconsider that approach.

Sounds like there is potential for a decent amount of change coming here. We aim for Core APIs to be stable-ish, so I lean towards the approach I outlined above. If we go with this option, I think the Platform team could handle all of the changes. We can have the Pulse team review the change to the usage_collection plugin to integrate with the new Core API.

The main downside I see of this approach is that we'd be exposing internal data from core that is not normally available to plugins. However, I think we can document that this API is only valid for telemetry and should not be used for any other reason. @elastic/kibana-platform any objections?

@TinaHeiligers
Contributor

TinaHeiligers commented Jul 6, 2020

@joshdover What I was trying to say is that we need to get to a point where the usage collection and telemetry APIs are stable. The improvements we're investigating are under the condition that they won't change the usage collection APIs. We're striving to get to a point where collecting data and shipping it to the remote service is as stable and reliable as using Saved Objects or any of the other Core capabilities.

I'm by no means trying to shift the work over to another team; we're more than willing to take the task on, and @afharo's already working on a draft. What I was trying to say is that telemetry has become so important that we need to be careful about the approach from here on.

That being said, we can always reconsider the approach later if need be.
For now, my vote goes to the extension of the Core API.
I'll wait to see what others think.

@afharo
Member

afharo commented Jul 7, 2020

Thank you @TinaHeiligers and @joshdover!

I think I agree with you both: the Usage Collection APIs (not to be confused with the telemetry plugin) are key APIs, and I think they should eventually belong to core.
At the same time, we at the Kibana Telemetry team are currently planning some improvements to the existing APIs (either extensions or internal improvements), so maybe they are not mature and stable enough to make the move for now.

+1 on the core.telemetry extension, but with 2 heads up:

  1. Beware of the data publicly exposed to other plugins (maybe, for security purposes, we can limit core.telemetry.getUsageCollector() so that it can only be called once, or Kibana will crash?)
  2. The Kibana Telemetry team will aim to design the UsageCollector plugin in a way that will be easier to migrate in the future (as long as these considerations don't affect the velocity of delivering features and improvements too much) :)

Looking forward to knowing all your thoughts :)

@pgayvallet
Contributor

pgayvallet commented Jul 8, 2020

interface CoreSetup {
  // Core could expose some API that is _essentially_ a usage collector
  telemetry: {
    type: 'kibana_core',
    schema: CoreTelemetrySchema;
    fetch(): Promise<CoreTelemetry>;
  }
}

Let's be honest, it's definitely not a correct design; we are just cheating with pseudo inversion of control here. But it's the most pragmatic solution in terms of impact on the codebase and the code owners, so I'm not opposed to it.

However, I'm wondering if we will be able to achieve this option without importing (or duplicating) types from the usageCollection plugin into core. What is going to be the type of CoreSetup.telemetry?

@afharo
Member

afharo commented Jul 9, 2020

I totally understand your point, @pgayvallet!

How about meeting somewhere in the middle? What about core providing a 'stats' API (as part of the CoreStart contract instead) that works as some form of fetch method, while we register the collector in the kibana_usage_collection plugin (that's where we've defined most of the "general-purpose" usage collectors)?

@TinaHeiligers
Contributor

@joshdover the 7.9 deadline looks a little tight here. Do you still think that we can make it before FF?

@joshdover
Contributor

joshdover commented Jul 9, 2020

@TinaHeiligers I think it's likely this will slip to 7.10 at this point :-/ And that's not on y'all, we've been working on other tasks.

@joshdover joshdover self-assigned this Sep 16, 2020
@joshdover
Contributor

joshdover commented Sep 16, 2020

I've run through all of Core's configuration and come up with a potential list of non-sensitive metrics that we could collect. For sensitive configuration keys (e.g. hostname, ssl), I include fields that indicate a feature is being used rather than including the actual configuration value.
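For illustration, the redaction idea looks roughly like this (the config shape is simplified; the output field names match the `certificateConfigured` / `keyConfigured` style used in the proposal):

```typescript
// Sketch: sensitive values are collapsed into "is it configured?" booleans so
// the actual values never leave the deployment. Config shape is simplified
// and illustrative, not the real Kibana config type.
interface EsSslConfigLike {
  certificate?: string;
  key?: string;
}

function toSslTelemetry(ssl: EsSslConfigLike) {
  return {
    certificateConfigured: ssl.certificate !== undefined,
    keyConfigured: ssl.key !== undefined,
  };
}
```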

There is some overlap with some of the metrics collected by src/plugins/kibana_usage_collection/server/collectors/ops_stats/ops_stats_collector.ts, which is supposed to include many of the stats exposed by the MetricsService. However, when taking a look at our actual telemetry, I was not able to find many of these values in our dataset. Some clarification from @afharo on what is actually included and reported would be helpful here.

export interface CoreTelemetry {
  config: CoreConfigTelemetry;
  usage: CoreUsageTelemetry;
  environment: CoreEnvironmentTelemetry;
}
/**
 * Telemetry data on this cluster's usage of Core features
 */
export interface CoreUsageTelemetry {
  savedObjects: {
    totalCount: number;
    typesCount: number;
    /** Record of Kibana version -> number of failures that occurred on that version's upgrade attempt */
    migrationFailures: Record<string, number>;
  };
}
/**
 * Telemetry data on this Kibana node's runtime environment.
 */
export interface CoreEnvironmentTelemetry {
  memory: {
    heapTotalBytes: number;
    heapUsedBytes: number;
    /** V8 heap size limit */
    heapSizeLimit: number;
  };
  os: {
    platform: string;
    platformRelease: string;
    distro?: string;
    distroRelease?: string;
  };
}
/**
 * Telemetry data on this cluster's configuration of Core features
 */
export interface CoreConfigTelemetry {
  elasticsearch: {
    sniffOnStart: boolean;
    sniffIntervalMs?: number;
    sniffOnConnectionFault: boolean;
    numberOfHostsConfigured: number;
    preserveHost: boolean;
    requestHeadersWhitelistConfigured: boolean;
    customHeadersConfigured: boolean;
    shardTimeoutMs: number;
    requestTimeoutMs: number;
    pingTimeoutMs: number;
    startupTimeoutMs: number;
    logQueries: boolean;
    ssl: {
      verificationMode: 'none' | 'certificate' | 'full';
      certificateAuthoritiesConfigured: boolean;
      certificateConfigured: boolean;
      keyConfigured: boolean;
      keystoreConfigured: boolean;
      truststoreConfigured: boolean;
      alwaysPresentCertificate: boolean;
    };
    apiVersion: string;
    healthCheck: {
      delayMs: number;
    };
  };

  http: {
    basePath: string;
    maxPayloadInBytes: number;
    rewriteBasePath: boolean;
    keepaliveTimeout: number;
    socketTimeout: number;
    compression: {
      enabled: boolean;
      referrerWhitelistConfigured: boolean;
    };
    cors: {
      enabled: boolean;
    };
    xsrf: {
      disableProtection: boolean;
      whitelistConfigured: boolean;
    };
    requestId: {
      allowFromAnyIp: boolean;
      ipAllowlistConfigured: boolean;
    };
    ssl: {
      certificateAuthoritiesConfigured: boolean;
      certificateConfigured: boolean;
      cipherSuites: Array<'TLSv1' | 'TLSv1.1' | 'TLSv1.2'>;
      keyConfigured: boolean;
      keystoreConfigured: boolean;
      truststoreConfigured: boolean;
      redirectHttpFromPort?: number;
      supportedProtocols: Array<'TLSv1' | 'TLSv1.1' | 'TLSv1.2'>;
      clientAuthentication: 'none' | 'optional' | 'required';
    };
  };

  logging: {
    appendersTypesUsed: string[];
    loggersConfiguredCount: number;
  };

  plugins: {
    /** list of built-in plugins that are disabled */
    firstPartyDisabled: string[];
    /** list of third-party plugins that are installed and enabled */
    thirdParty: string[];
  };

  savedObjects: {
    maxImportPayloadBytes: number;
    maxImportExportSizeBytes: number;
    migrations: {
      batchSize: number;
      scrollDurationMs: number;
      pollIntervalMs: number;
      skip: boolean;
    };
  };

  uiSettings: {
    overridesCount: number;
  };
}

@legrego
Member

legrego commented Sep 16, 2020

@joshdover what are your thoughts on collecting ssl config for CoreConfigTelemetry['http'], similar to what we've done for CoreConfigTelemetry['elasticsearch']?

Perhaps something like (added a bonus cors setting too):

/**
 * Telemetry data on this cluster's configuration of Core features
 */
export interface CoreConfigTelemetry {
  http: {
    cors: {
      enabled: boolean;
    };
    ssl: {
      enabled: boolean;
      supportedProtocols: string[];
      keyConfigured: boolean;
      keystoreConfigured: boolean;
      truststoreConfigured: boolean;
    };
  };
} 

@joshdover
Contributor

joshdover commented Sep 16, 2020

Makes sense, not sure how I skipped http.ssl

EDIT: updated my comment with the http.ssl and http.cors fields.

@kobelb
Contributor

kobelb commented Sep 16, 2020

What about CSP? I could see it being helpful to know how many users have a custom CSP policy in place.

For SSL, when the certificate is specified, the key must also be specified. I can't think of a good name to denote both of these being configured, so perhaps that's justification for leaving them separate?

@legrego
Member

legrego commented Sep 16, 2020

What about CSP? I could see it being helpful to know how many users have a custom CSP policy in place.

CSP is already being reported here, although it'd be nice to consolidate the fields into CoreConfigTelemetry instead of having them spread out:

schema: {
  strict: {
    type: 'boolean',
  },
  warnLegacyBrowsers: {
    type: 'boolean',
  },
  rulesChangedFromDefault: {
    type: 'boolean',
  },
},

@alexfrancoeur

This is fantastic. We have so many different types of saved objects now; is there any value in getting a count by type?

@afharo
Member

afharo commented Sep 18, 2020

There is some overlap with some of the metrics collected by src/plugins/kibana_usage_collection/server/collectors/ops_stats/ops_stats_collector.ts, which is supposed to include many of the stats exposed by the MetricsService. However, when taking a look at our actual telemetry, I was not able to find many of these values in our dataset. Some clarification from @afharo on what is actually included and reported would be helpful here.

Hey @joshdover! Sorry for my late response. This collector is treated in a special way at the moment:

  1. For telemetry purposes, it only reports the OS information (code stripping kibana_stats out of the payload).
  2. In Monitoring, it's used by the legacy collector (AFAIK, to be deprecated by 8.0) to report the ops stats to the monitoring cluster.

We have so many different types of saved objects now; is there any value in getting a count by type?

@alexfrancoeur we already report the saved objects count per type here, although it's limited to these 6 types only. We might want to revisit that list?

@joshdover joshdover assigned rudolf and unassigned joshdover Sep 22, 2020
@rudolf
Contributor

rudolf commented Sep 23, 2020

@alexfrancoeur we already report the saved objects count per type here, although it's limited to these 6 types only. We might want to revisit that list?

Ideally we could generate metrics that are more dynamic so that all types are reported; that would probably be easier to do from within Core. For saved object migrations, it would also be useful to understand the size of the .kibana/.kibana_task_manager indices.
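A dynamic per-type count could be a single terms aggregation on the saved-objects index instead of a hard-coded type list; a rough sketch (the index name, `type` field, and helper function are illustrative, following Kibana's usual saved-objects conventions):

```typescript
// Builds an Elasticsearch search request that counts saved objects per type
// via a terms aggregation, so newly registered types are picked up
// automatically without updating any allowlist.
function buildPerTypeCountRequest(index: string, maxTypes = 1000) {
  return {
    index,
    body: {
      size: 0, // we only need the aggregation, not the documents
      aggs: {
        types: {
          terms: { field: 'type', size: maxTypes },
        },
      },
    },
  };
}
```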

@rudolf
Contributor

rudolf commented Sep 29, 2020

http: {
  basePath: string;

Do we care about the actual basePath, or just whether or not it was set? Should we rather use the following?

basePathConfigured: boolean

@rudolf
Contributor

rudolf commented Sep 29, 2020

I'm leaving out http.cors until we implement it #16714

@joshdover
Contributor

Do we care about the actual basepath or just whether or not it was set? Should we rather use the following?

basePathConfigured: boolean

Good catch, I think basePathConfigured is what we want.
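The mapping is trivial but worth spelling out: only the flag leaves the deployment, never the raw path (the config shape and function here are illustrative):

```typescript
// Sketch: report whether a basePath was set, never the path itself, since
// the actual value could identify the deployment.
function toHttpTelemetry(httpConfig: { basePath?: string }) {
  return { basePathConfigured: httpConfig.basePath !== undefined };
}
```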

@pgayvallet
Contributor

I'll consider this completed by #79101 and will close it. Please reopen if necessary.

Status: Done (7.13)