Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving data sources #516

Merged
merged 25 commits into from
Jun 26, 2019

Conversation

ikedas
Copy link
Member

@ikedas ikedas commented Jan 4, 2019

NOTE: Please discuss/report bugs on #693.


Changes in code:

Changes in behavior:

  • Now data sources won't overwrite display name (gecos) of existing subscriber: It is assigned only when a new user is added (included) from data source, or if it is added (subscribed) by users.
    Previously, some (all, with very earlier version) data sources overwrote.
  • Outdated users who have been included from data source will be expired only when all data sources of the list succeeds inclusion.
    Previously, users were delayed expiration when corresponding data source failed synchronization: In order to do that, entire subscribers had to be loaded in memory (this bug has been pointed out by #11: Spurious error on duplicate keys with admin sync, and changes on configuration for owners/moderators #275). By this change, if there are data sources with nosync_time_ranges set, deletion of outdated users can be delayed 24 hours at most.
  • Names of data sources will no longer be shown in subscriber table of web interface.
  • All names of data sources will no longer be shown in subscriber table of web interface. Only one that was actually included will be shown.

Other changes:

  • Database schema: included_* and include_sources_* in subscriber_table and admin_table were deprecated. inclusion_* and inclusion_ext_* fields will be used instead.
    inclusion_* holds the last time of confirmation of inclusion. Additionally, inclusion_ext_* holds the time from "external" data sources. "External" means that it is not include_sympa_list or not including list on local domain (this data source should be treated specially to detect inclusion loop).
    Table content will be updated during upgrading process.
  • include_voot_group has not been implemented, and it was invalidated.

Improvements and bug fixes:

  • Memory usage:
    • include_sql_query and include_sql_ca stores inclusion data in temporary file so that big data sources will be handled safely.
    • See also "Changes in behavior" above.
  • include_ldap_2level_ca data source was implemented.
    However it needs more tests.
  • include_remote_sympa_list data source:
    • HTTP/1.1 is supported thanks to LWP::UserAgent.
    • Now it has user, passwd, timeout, ssl_version, ssl_ciphers and ca_verify parameters.
    • host, port and path parameters were obsoleted. Use url instead.
    • cert parameter was obsoleted.
  • include_remote_file data source:
    • Now it has timeout, ssl_version, ssl_ciphers, ca_verify parameters.
    • It may use client certificate and key in cert.pem and private_key files (as include_remote_sympa_list does).
  • include_ldap_query and include_ldap_ca data sources did not implement "regex" option of "selection" subparameter. Now they support it.
  • Now include_remote_file, include_sympa_list and include_remote_sympa_list paragraphs also may have nosync_time_ranges parameter. No reason they don't have it.
    include_file still doesn't have it.
  • Consistencies on parameter names (older parameter names are also available):
    • include_remote_sympa_list:
      New parameter url replaced host, port and path.
    • include_ldap_query, include_ldap_2level_query, include_ldap_ca & include_ldap_2level_ca:
      bind_dn & bind_password replaced user & passwd, respectively.
    • include_sql_query & include_sql_ca:
      db_host, db_options, db_user & db_passwd replaced host, connect_options, user & passwd, respectively.
  • [bug] include_remote_file should limit protocols to HTTP, HTTPS and FTP.
  • [bug] LDAP multiple values in custom attribute with include_ldap_ca was broken.

Known bug:

  • If data source is a incude_sympa_list data source with the list included from the other external data source(s), it will be treated as non-external and move_user request on corresponding users will be allowed.

@ikedas ikedas added this to the 6.2.40 milestone Jan 4, 2019
@ikedas ikedas self-assigned this Jan 4, 2019
@ikedas
Copy link
Member Author

ikedas commented Jan 10, 2019

I'm planning to add some test cases for data sources. Please wait merger for a while.

@dverdin
Copy link
Contributor

dverdin commented Jan 10, 2019

That's very cool!
A remark: I think the mention of datasources in the subscribers list is usefull. It allows to understand for which reason a user is subscribed. Not a month passes without me checking that kind of information.
For what reason do you want to remove it?

@ikedas
Copy link
Member Author

ikedas commented Jan 11, 2019

@dverdin, whether a user is included from data source(s) or not may be known by checking whether inclusion_* field is non-Null or not.

For what reason do you want to remove it?

include_sources_* field held data source IDs, and has been used for several purposes differ from each other:

  • To track failure of each data source.
    However, logically, that information need not being stored by each users (rows of table).
  • To determine whether user is included from non-external source or not.
    By the problem described in below, it won't work reliablly.
  • To get names of data sources.
    It is also not reliable by the same reason.

Problems are:

  • Number of available data sources is limited.
    When width of field is 50, more than 5 sources won't be handled (This problem has been reported by some users, though I can't find conversasion with reporters by now).
  • Data source ID is not consistent.
    As it is a hash of parameters, any changes (e.g. adding or renaming parameter, even without changes by users) can break consistency, and it is inappropriate to be used for permanently unique identifier.
  • Also, it doesn't correspond to configuration.
    A data source definition file (*.incl) can contain more than one paragraphs, i.e. multiple data sources. Thus, it is generally impossible to get the "name of data source" (name defined in *.incl) from a data source ID.
  • It is not so easy to update field including multiple IDs in atomic way.
    I once have tried such way (see commits 13b206a, 7dbe22d 4e6d9b7, 266230e), but it turned out that non-portable function was required.

Additionally, determining data sources by each user is not so useful by these reason:

  • Information of a user (email and gecos) is included from only one data source at once.
  • User can change information above by their own, and after all, information in the table are not connected to particular data source.

In conclusion, I decided not to store data source IDs in the tables.


Edit: Corrected a commit ID. (again)

@dverdin
Copy link
Contributor

dverdin commented Jan 11, 2019

I understand why the current setup don't always give relevant informations and thus should not be trusted.
However, it raises a new question: how to reliably determine, from the owners point of view, from which sources a user is included. Shall we open a new issue about it?

@ikedas
Copy link
Member Author

ikedas commented Jan 11, 2019

how to reliably determine, from the owners point of view, from which sources a user is included.

  • As described in initial comment, each user will be included from single source by each: Once they are included, they will never be overwritten by the other sources.
  • And as described in my previous comment, once a user is included, their information will no longer be connected to any sources.

Consequently, it is in vain to ask which sources could include a user, at least from view of Sympa.

However, if users (owners) pleased, a name of the source by which user was actually included may be stored in a field of the table (doing this, incorrespondence between data source and configuration does not matter). However, gathering all names of data sources which possiblly could include a user is nearly impossible.

ikedas added 20 commits February 3, 2019 11:55
…t::Handler::include, Sympa::DataSource and its subclasses. Sympa::Datasource was deprecated.

And some fixes:
  * [bug] Now include_remote_file, include_sympa_list and include_remote_sympa_list paragraphs may have nosync_time_ranges subparameter.  No reason they don't have it.  include_file stil doesn't have it.
  * [-feature] include_voot_group has not been implemented, and it was invalidated.
  * [bug] include_ldap_query and include_ldap_ca data sources did not implement "regex" option of "selection" subparameter.  Now they support it.
  * [bug] LDAP multiple values in custom attribute with include_ldap_ca was broken.
  * [-feature] include_sql_query and include_sql_ca stores inclusion data in temporary file so that big data sources will be handled safely.

Tentative change:
  * include_ldap_query, include_ldap_2level and include_sql_query won't clear (overwrite) gecos if retrieved value was empty or undefined.  Now they clear (overwrite) gecos in such cases.
  - Now it have timeout, ssl_version, ssl_ciphers, ca_verify parameters.
  - Client certificate and key in cert.pem and private_key files are available.
  - HTTP/1.1 is supported thanks to LWP::UserAgent.
  - Now it has user, passwd, timeout, ssl_version, ssl_ciphers and ca_verify parameters.
  - host, port and path parameters were obsoleted.  Use url instead.
  - cert parameter was obsoleted.
It is supported by MySQL, PostgreSQL and Oracle (also MS SQL 2017) while SQLite does not.
…gned only when a new user is added (included) from data source, or when it is added (subscribed) by users.
…ata sources of the list has succeeded inclusion.

By this change, if there are data sources with nosync_time_ranges set, deletion of outdated users can be delayed 24 hours at most.
[change] Database schema: included_* and include_sources_* in subscriber_table and admin_table were deprecated.  inclusion_* fields will be used instead.
… and admin_table to give the last time of inclusion from external data sources.

"External" means that it is not include_sympa_list or not including list on local domain.
Note that if inclusion_ext_* field is updated, inclusion_* field must be updated at the same time.

Known bug:
  - If data source is a list included from the other external data source(s), it will be treated as non-external and move_user request on corresponding users will be allowed.
@ikedas ikedas force-pushed the issue-461-addition-trial5 branch from 8126821 to d5b31fe Compare February 3, 2019 04:18
@ikedas
Copy link
Member Author

ikedas commented Feb 7, 2019

... a name of the source by which user was actually included may be stored in a field of the table ...

Done. See 1b42ea0

However, gathering all names of data sources which possiblly could include a user is nearly impossible.

Conflicts:
	ext/Plugin/lib/Sympa/Plugin/ListSource.pm
	src/lib/Sympa/List.pm
	src/lib/Sympa/ListDef.pm
@ikedas
Copy link
Member Author

ikedas commented Mar 6, 2019

I was busy in the end of fiscal year. This will be postponed to the next milestone 6.2.44.

@ikedas ikedas modified the milestones: 6.2.42, 6.2.44 Mar 6, 2019
@ikedas ikedas modified the milestones: 6.2.44, 6.2.46 May 23, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants