Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lib: execute pthread_join only when tid != 0 in flb_stop. #9428

Closed

Conversation

pwhelan
Copy link
Contributor

@pwhelan pwhelan commented Sep 26, 2024

Summary

This PR wraps the call to pthread_join in flb_stop in a conditional for tid != 0.

Calling pthread_join with tid == 0 is undefined. Technically it should return ESRCH but CentOS and RHEL do not seem to conform.

The following crash occurs:

[2024/09/26 18:09:24] [engine] caught signal (SIGHUP)
[2024/09/26 18:09:24] [ info] reloading instance pid=551 tid=0x7ffff7fe5d80
[2024/09/26 18:09:24] [error] [config] section 'dummmy' tried to instance a plugin name that doesn't exist

Thread 1 "calyptia-fluent" received signal SIGSEGV, Segmentation fault.
0x00007ffff78094f0 in __pthread_timedjoin_ex () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007ffff78094f0 in __pthread_timedjoin_ex () from /lib64/libpthread.so.0
#1  0x000000000050ddac in flb_stop ()
#2  0x00000000005426b5 in flb_reload ()
#3  0x0000000000486898 in flb_main ()
#4  0x00007ffff4a23493 in __libc_start_main () from /lib64/libc.so.6
#5  0x000000000048433e in _start ()

Debugging the actual code in __pthread_timedjoin_ex we can see it is in fact a NULL dereference:

(gdb) x/4i $rip
=> 0x7ffff78094f0 <__pthread_timedjoin_ex+16>:	mov    0x2d0(%rdi),%r8d
   0x7ffff78094f7 <__pthread_timedjoin_ex+23>:	mov    %fs:0x28,%rax
   0x7ffff7809500 <__pthread_timedjoin_ex+32>:	mov    %rax,0x38(%rsp)
   0x7ffff7809505 <__pthread_timedjoin_ex+37>:	xor    %eax,%eax
(gdb) info registers
rax            0x1                 1
rbx            0x0                 0
rcx            0x1                 1
rdx            0x0                 0
rsi            0x0                 0
rdi            0x0                 0
rbp            0x0                 0x0
rsp            0x7fffffffe9e0      0x7fffffffe9e0
r8             0x2a                42

Root cause analysis

The value of tid is derived from ctx->config->worker here:

    tid = ctx->config->worker;

    if (ctx->status == FLB_LIB_NONE || ctx->status == FLB_LIB_ERROR) {
        /*
         * There is a chance the worker thread is still active while
         * the service exited for some reason (plugin action). Always
         * wait and double check that the child thread is not running.
         */
#if defined(FLB_SYSTEM_MACOS)
        pthread_cancel(tid);
#endif
        pthread_join(tid, NULL);
        return 0;
    }

The call to pthread_join checks ctx->status, which is not changed when ctx->config->worker is set:

    /* spawn worker thread */
    config = ctx->config;
    ret = mk_utils_worker_spawn(flb_lib_worker, ctx, &tid);
    if (ret == -1) {
        return -1;
    }
    config->worker = tid;

    /* Wait for the started signal so we can return to the caller */

The value of ctx->status can be set to the following values which are defined in flb_lib.h:

  • FLB_LIB_ERROR -1
  • FLB_LIB_NONE 0
  • FLB_LIB_OK 1

The places where ctx->status is set do not correlate with the value of ctx->config->worker at all:

In fact we might want to just check the value of ctx->config->worker and not ctx->status before joining it. There is a correlation between both values but I do not see it as being strong enough to depend upon one or the other.

The code that calls pthread_join behind that conditional is also meant for extraordinary or erroneous circumstances and therefore being defensive to me does not seem to be a bad policy.

References


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

  • [N/A] Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • [N/A] Run local packaging test showing all targets (including any new ones) build.
  • [N/A] Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • [N/A] Documentation required for this feature

Backporting


Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@edsiper
Copy link
Member

edsiper commented Sep 26, 2024

@pwhelan can we get more details on why CentOS and Redhat don't conform ? I cannot find a reference about it.

It seems to me that the proper approach would be to get the return of pthread_join(3) and print the errors if any:

  • EDEADLK
  • EINVAL
  • ESRCH (if this is the case and we get a repro, we need to troubleshoot "why")

a simple flb_errno() can do the job since that is set in errno

@pwhelan
Copy link
Contributor Author

pwhelan commented Sep 26, 2024

@pwhelan can we get more details on why CentOS and Redhat don't conform ? I cannot find a reference about it.

It seems to me that the proper approach would be to get the return of pthread_join(3) and print the errors if any:

  • EDEADLK
  • EINVAL
  • ESRCH (if this is the case and we get a repro, we need to troubleshoot "why")

a simple flb_errno() can do the job since that is set in errno

That would be great, but pthread_join crashes on RHEL and CentOS before it can even return when passing the first argument as zero.

@edsiper
Copy link
Member

edsiper commented Sep 26, 2024

@pwhelan what are the steps to make it crash ?

@pwhelan
Copy link
Contributor Author

pwhelan commented Sep 26, 2024

@pwhelan what are the steps to make it crash ?

You need to configure a calyptia fleet with an invalid configuration, ie: with a non-existent plugin:

Calyptia Fluent Bit v24.4, patch_level=13
* Enterprise Fluent Bit by Calyptia
* https://calyptia.com
[2024/09/17 02:59:32] [ info] [calyptia fluent bit] version=24.4 patch=13 pid=77100
[2024/09/17 02:59:32] [ info] [custom:calyptia:calyptia.0] custom initialized!
[2024/09/17 02:59:32] [ info] [storage] ver=1.5.2, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2024/09/17 02:59:32] [ info] [cmetrics] version=0.9.1
[2024/09/17 02:59:32] [ info] [ctraces ] version=0.5.1
[2024/09/17 02:59:32] [ info] [input:fluentbit_metrics:fluentbit_metrics.0] initializing
[2024/09/17 02:59:32] [ info] [input:fluentbit_metrics:fluentbit_metrics.0] storage_strategy='memory' (memory only)
[2024/09/17 02:59:32] [ info] [input:calyptia_fleet:calyptia_fleet.1] initializing
[2024/09/17 02:59:32] [ info] [input:calyptia_fleet:calyptia_fleet.1] storage_strategy='memory' (memory only)
[2024/09/17 02:59:32] [ info] [input:calyptia_fleet:calyptia_fleet.1] initializing calyptia fleet input.
[2024/09/17 02:59:32] [ info] [input:calyptia_fleet:calyptia_fleet.1] loading configuration from /tmp/calyptia-fleet/2ecbcb172eb37b742129a89e78f8be7773e81a33aafcc2f7089a0e3f9d19dd67/cba-test-1/new.conf.
[2024/09/17 02:59:32] [ info] [input:calyptia_fleet:calyptia_fleet.1] changing to config dir: /tmp/calyptia-fleet/2ecbcb172eb37b742129a89e78f8be7773e81a33aafcc2f7089a0e3f9d19dd67/cba-test-1/1726540838
[2024/09/17 02:59:32] [ info] [input:calyptia_fleet:calyptia_fleet.1] changing to config dir: /tmp/calyptia-fleet/2ecbcb172eb37b742129a89e78f8be7773e81a33aafcc2f7089a0e3f9d19dd67/cba-test-1/1726540838
[2024/09/17 02:59:32] [ info] [sp] stream processor started
[2024/09/17 02:59:37] [engine] caught signal (SIGHUP)
[2024/09/17 02:59:37] [ info] reloading instance pid=77100 tid=0x7f885548cd80
[2024/09/17 02:59:37] [error] [config] section 'dummmy' tried to instance a plugin name that doesn't exist
[2024/09/17 02:59:37] [engine] caught signal (SIGSEGV)
#0  0x50ddab            in  ???() at ???:0
#1  0x5426b4            in  ???() at ???:0
#2  0x486897            in  ???() at ???:0
#3  0x7f88531f07e4      in  ???() at ???:0
#4  0x48433d            in  ???() at ???:0
#5  0xffffffffffffffff  in  ???() at ???:0

Reproducing it is hard under other circumstances.

@edsiper
Copy link
Member

edsiper commented Sep 26, 2024

can we get a minimal repro ? it seems this needs hot reload with a speacial crafted config, something we can repro with curl ?

@pwhelan
Copy link
Contributor Author

pwhelan commented Sep 26, 2024

can we get a minimal repro ? it seems this needs hot reload with a speacial crafted config, something we can repro with curl ?

You need to run fluent-bit under the CentOS/8 container, modify the configuration file once it is running and then send it SIGHUP.

[root@316b586e3195 /]# cat fleet.conf
[INPUT]
    tag             dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f
    Name            dummy
    rate            1
    dummy           {"message":"dummy#1"}
    samples         0
    start_time_sec  -1
    start_time_nsec -1
[OUTPUT]
    name        stdout
    Match_Regex .{0,}
[root@316b586e3195 /]# /opt/calyptia-fluent-bit/bin/calyptia-fluent-bit -c fleet.conf -Y
Calyptia Fluent Bit v24.4, patch_level=13
* Enterprise Fluent Bit by Calyptia
* https://calyptia.com/
[2024/09/26 18:18:34] [ info] [calyptia fluent bit] version=24.4 patch=13 pid=598
[2024/09/26 18:18:34] [ info] [storage] ver=1.5.2, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2024/09/26 18:18:34] [ info] [cmetrics] version=0.9.1
[2024/09/26 18:18:34] [ info] [ctraces ] version=0.5.1
[2024/09/26 18:18:34] [ info] [input:dummy:dummy.0] initializing
[2024/09/26 18:18:34] [ info] [input:dummy:dummy.0] storage_strategy='memory' (memory only)
[2024/09/26 18:18:34] [ info] [sp] stream processor started
[2024/09/26 18:18:34] [ info] [output:stdout:stdout.0] worker #0 started
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727374715.121369209, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727374716.121380884, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727374717.121366447, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727374718.121378017, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727374719.121398257, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727374720.121360347, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727374721.121362824, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727374722.121359415, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727374723.121365817, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727374724.121381714, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727374725.121367499, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727374726.121360743, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727374727.121363223, {}], {"message"=>"dummy#1"}]
[2024/09/26 18:18:48] [engine] caught signal (SIGHUP)
[2024/09/26 18:18:48] [ info] reloading instance pid=598 tid=0x7ffff7fe5d80
[2024/09/26 18:18:48] [error] [config] section 'dummmy' tried to instance a plugin name that doesn't exist
[2024/09/26 18:18:48] [engine] caught signal (SIGSEGV)
#0  0x50ddab            in  flb_stop() at ???:0
#1  0x5426b4            in  flb_reload() at ???:0
#2  0x486897            in  flb_main() at ???:0
#3  0x7ffff4a23492      in  __libc_start_main() at ???:0
#4  0x48433d            in  _start() at ???:0
#5  0xffffffffffffffff  in  ???() at ???:0
Aborted (core dumped)
--- in a parallel terminal ---
[root@316b586e3195 /]# vi fleet.conf # change dummy to dummmy
[root@316b586e3195 /]# ps -fea | grep fluent-bit
root         598     561  0 18:18 pts/1    00:00:00 /opt/calyptia-fluent-bit/bin/calyptia-fluent-bit -c fleet.conf -Y
root         606       1  0 18:18 pts/0    00:00:00 grep --color=auto fluent-bit
[root@316b586e3195 /]# kill -HUP 598

@edsiper
Copy link
Member

edsiper commented Sep 27, 2024

closing in favor of #9432

@edsiper edsiper closed this Sep 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants