Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: improve capacity unit calculation #339

Merged
merged 10 commits into from
Jun 18, 2019
Merged

Conversation

qinzuoyan
Copy link
Contributor

@qinzuoyan qinzuoyan commented Jun 16, 2019

What problem does this PR solve?

Current capacity unit calculation (committed in #318) has these problems:

  • not robust enough:
    • if get_apps_and_nodes returns false, then terminate, which may cause CU lost.
      if (!get_apps_and_nodes(sc, apps, nodes))
          return false;
      
      Solution: retry for some times after several seconds.
    • if decode_node_perf_counter_info failed for one node, then terminate all, which is not reasonable.
         if (!decode_node_perf_counter_info(node_addr, results[i], info))
             return false;
      
      Solution: ignore failure nodes.
  • stat result is not compact enough:
    • in the old may, the stat result is like:
      "2019-06-16 20:35:45" : "10.239.35.217:34801" => "{\"temp\":[8,11]}"
      
    • in the new way, the stat result is like (using app id instead of app name, and add prefix cu@ in sort_key to diff with storage size stat):
      "2019-06-16 20:35:45" : "cu@10.239.35.217:34801" => "{\"1\":[8,11]}"
      
      Advantage: less space; more accurate, because it is not affected by table rename.
  • lack of storage size stat:
    • so we add storage size stat as:
      "2019-06-17 18:47:15" : "ss" => "{\"1\":[8,8,0],\"2\":[8,8,476]}"
      
      the triple is [app_partition_count, stat_partition_count, storage_size_in_mb], where stat_partition_count is the count of partitions whose storage is accounted.
  • performance:
    • by adding remote command perf-counters-by-substr, we use simple sub-string search to replace regex matching, which probably reduces loading pressure of replica-server and gives better performance.

What is changed and how it works?

As mentioned above

Check List

Tests

  • No code
  • Manual test passed

@qinzuoyan qinzuoyan added the type/enhancement Indicates new feature requests label Jun 16, 2019
@qinzuoyan qinzuoyan requested a review from acelyc111 June 17, 2019 11:23
src/server/config-server.ini Outdated Show resolved Hide resolved
_cu_fetch_retry_count = 3;
_cu_fetch_retry_wait_seconds = 1;

_st_fetch_interval_seconds =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

改 _storage_fetch_interval_seconds,不然这个配置名反人类了

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

加全吧, _storage_size_fetch_interval_seconds?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

那改成storage_size_fetch_interval_seconds

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

配置文件里面该成 storage_size_fetch_interval_seconds,但是变量名不用改,毕竟也没啥歧义,不会造成困扰。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -26,6 +26,7 @@ namespace server {

DEFINE_TASK_CODE(LPC_PEGASUS_APP_STAT_TIMER, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_DEFAULT)
DEFINE_TASK_CODE(LPC_PEGASUS_CU_STAT_TIMER, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_DEFAULT)
DEFINE_TASK_CODE(LPC_PEGASUS_ST_STAT_TIMER, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_DEFAULT)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里也一样,还是不追求名字短,追求可读吧

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

src/server/config.ini Outdated Show resolved Hide resolved
src/server/info_collector.h Outdated Show resolved Hide resolved
continue;
if (pc.partition_flags != 0) // already calculated
continue;
pc.partition_flags = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partition_flags 只有这里会用到吗?如果其它地方也改了partition_flags会不会有问题?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

app_partitions 是个临时变量,partition_flags 在这个函数里根本不会用到,所以不管它有没有在其他地方使用,这里就是被用来当做一个临时flag。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

在这个函数里面,只有这里使用,用于临时flag。

_cu_fetch_retry_count = 3;
_cu_fetch_retry_wait_seconds = 1;

_st_fetch_interval_seconds =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

加全吧, _storage_size_fetch_interval_seconds?

app_stat_interval_seconds = 10

cu_stat_app = stat
cu_fetch_interval_seconds = 8
st_fetch_interval_seconds = 3600
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

跟上面的配置为什么相差这么大?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

改进点3:增加storage size统计,默认每小时统计一次,统计每个表的存储空间(和app_stat命令一样)。因为变化不大,所以没必要太频繁。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -65,6 +66,17 @@ info_collector::info_collector()
"cu_fetch_interval_seconds",
8, // default value 8s
"capacity unit fetch interval seconds");
_cu_fetch_retry_count = 3;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

写成std::min(3, _cu_fetch_interval_seconds) 吧, 避免cu_fetch_interval_seconds配的很小, 跟重试有重叠

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -65,6 +66,17 @@ info_collector::info_collector()
"cu_fetch_interval_seconds",
8, // default value 8s
"capacity unit fetch interval seconds");
_cu_fetch_retry_count = 3;
_cu_fetch_retry_wait_seconds = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个如果固定为1的话, 写成静态常量好了

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


void info_collector::on_storage_size_stat(int remaining_retry_count)
{
ddebug("start to stat storage size");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

去掉这行日志吧, 没有意义

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个还是有必要的,表示在正常工作,和available_detector里面一样。

Copy link
Member

@acelyc111 acelyc111 Jun 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我是感觉正常情况下, 这种周期性的任务就没必要打日志了, 只在异常分支打日志, 不然日志文件中会充满这类日志.
如果是debug问题, 有其他手段, 比如pstack

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

毕竟没有多少日志,这对查找问题还是有帮助的,先不改了

}
return;
}
_result_writer->set_result(st_stat.timestamp, "st", st_stat.dump_to_json());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

非得简化的话,sz可能更好一点

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

storage_size不应当是ss吗?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


std::string dump_to_json() const
{
std::map<int32_t, std::vector<int64_t>> values;
for (auto kv : cu_value_by_app) {
auto &pair = kv.second;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kv没加引用, pair也没有必要加

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

都得加

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

src/server/info_collector.cpp Outdated Show resolved Hide resolved
src/server/info_collector.h Outdated Show resolved Hide resolved
src/server/config.ini Outdated Show resolved Hide resolved
src/server/config-server.ini Outdated Show resolved Hide resolved
@@ -66,9 +66,11 @@ class info_collector
void on_app_stat();
AppStatCounters *get_app_counters(const std::string &app_name);

void on_capacity_unit_stat();
void on_capacity_unit_stat(int remaining_retry_count);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

改成 retry_count 吧, 函数本身没有remaining这层概念

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我觉得有remaining语义更清楚,表示还允许多少次重试。不然只是retry_count,那么retry_count=3可能被理解为这次调用是第3次重试。

src/server/info_collector.cpp Outdated Show resolved Hide resolved
src/server/info_collector.cpp Show resolved Hide resolved
src/shell/command_helper.h Show resolved Hide resolved
@qinzuoyan qinzuoyan merged commit c5dbab2 into apache:master Jun 18, 2019
@qinzuoyan qinzuoyan deleted the qinzuoyan branch June 18, 2019 10:08
This was referenced Jun 20, 2019
neverchanje pushed a commit to neverchanje/pegasus that referenced this pull request Jul 13, 2019
Former-commit-id: c2b15d46396af85529ba4c164e0f7c13f0f139fd [formerly c5dbab2]
Former-commit-id: 50b17f8d276466cc00d671e3d40fa3f7749aa49c
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1.11.5 type/enhancement Indicates new feature requests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants