Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filebeat race condition start / stop harvester #3779

Closed
ruflin opened this issue Mar 21, 2017 · 0 comments
Closed

Filebeat race condition start / stop harvester #3779

ruflin opened this issue Mar 21, 2017 · 0 comments
Labels

Comments

@ruflin
Copy link
Contributor

ruflin commented Mar 21, 2017

CI on AppVeyor recently run into a race condition during the tests. The race condition seems to be cause by Stopping a harvester before it was fully started, means Wait() is called before all Add(1) are called on the waitgroup. See full log messages below.

Running test_shutdown.py:Test.test_shutdown_wait_ok can lead to the following log output / race condition:

2017/03/20 16:05:50.682951 beat.go:319: INFO Home path: [C:\gopath\src\github.com\elastic\beats\filebeat] Config path: [C:\gopath\src\github.com\elastic\beats\filebeat] Data path: [C:\gopath\src\github.com\elastic\beats\filebeat\data] Logs path: [C:\gopath\src\github.com\elastic\beats\filebeat\logs]
2017/03/20 16:05:50.682951 metrics.go:23: INFO Metrics logging every 30s
2017/03/20 16:05:50.682951 beat.go:344: INFO Beat metadata path: C:\gopath\src\github.com\elastic\beats\filebeat\data\meta.json
2017/03/20 16:05:50.683952 beat.go:326: INFO Beat UUID: a9a026d7-5da0-4904-aee5-dfb1bce68f32
2017/03/20 16:05:50.683952 beat.go:208: INFO Setup Beat: filebeat; Version: 6.0.0-alpha1
2017/03/20 16:05:50.683952 processor.go:44: DBG  Processors: 
2017/03/20 16:05:50.683952 beat.go:214: DBG  Initializing output plugins
2017/03/20 16:05:50.683952 file.go:54: INFO File output path set to: C:\gopath\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_shutdown.Test.test_shutdown_wait_ok/output
2017/03/20 16:05:50.683952 file.go:55: INFO File output base filename set to: filebeat
2017/03/20 16:05:50.683952 file.go:58: INFO Rotate every bytes set to: 1024000
2017/03/20 16:05:50.683952 file.go:62: INFO Number of files set to: 7
2017/03/20 16:05:50.683952 outputs.go:117: INFO Activated file as output plugin.
2017/03/20 16:05:50.683952 publish.go:236: DBG  Create output worker
2017/03/20 16:05:50.684952 publish.go:278: DBG  No output is defined to store the topology. The server fields might not be filled.
2017/03/20 16:05:50.684952 publish.go:293: INFO Publisher name: APPVYR-WIN
2017/03/20 16:05:50.686952 async.go:63: INFO Flush Interval set to: -1s
2017/03/20 16:05:50.686952 async.go:64: INFO Max Bulk Size set to: -1
2017/03/20 16:05:50.687952 beat.go:251: INFO filebeat start running.
2017/03/20 16:05:50.687952 service_windows.go:51: DBG  Windows is interactive: true
2017/03/20 16:05:50.687952 registrar.go:72: INFO No registry file found under: C:\gopath\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_shutdown.Test.test_shutdown_wait_ok/registry. Creating a new registry file.
2017/03/20 16:05:50.687952 registrar.go:346: DBG  Write registry file: C:\gopath\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_shutdown.Test.test_shutdown_wait_ok/registry
2017/03/20 16:05:50.741954 file_windows.go:51: DBG  delete old: remove C:\gopath\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_shutdown.Test.test_shutdown_wait_ok/registry.old: The system cannot find the file specified.
2017/03/20 16:05:50.742952 file_windows.go:57: DBG  rotate to old: rename C:\gopath\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_shutdown.Test.test_shutdown_wait_ok/registry C:\gopath\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_shutdown.Test.test_shutdown_wait_ok/registry.old: The system cannot find the file specified.
2017/03/20 16:05:50.742952 registrar.go:373: DBG  Registry file updated. 0 states written.
2017/03/20 16:05:50.742952 registrar.go:120: INFO Loading registrar data from C:\gopath\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_shutdown.Test.test_shutdown_wait_ok/registry
2017/03/20 16:05:50.742952 registrar.go:140: INFO States Loaded from registrar: 0
2017/03/20 16:05:50.742952 registrar.go:271: INFO Starting Registrar
2017/03/20 16:05:50.742952 sync.go:46: INFO Start sending events to output
2017/03/20 16:05:50.742952 spooler.go:68: INFO Starting spooler: spool_size: 2048; idle_timeout: 100ms
2017/03/20 16:05:50.742952 crawler.go:42: INFO Loading Prospectors: 1
2017/03/20 16:05:50.743951 prospector.go:90: DBG  File Configs: [C:\gopath\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_shutdown.Test.test_shutdown_wait_ok/log/*]
2017/03/20 16:05:50.743951 prospector_log.go:46: DBG  exclude_files: []
2017/03/20 16:05:50.743951 prospector_log.go:71: INFO Prospector with previous states loaded: 0
2017/03/20 16:05:50.743951 prospector.go:142: INFO Starting prospector of type: log; id: 3861952635306202170 
2017/03/20 16:05:50.744952 crawler.go:67: INFO Loading and starting Prospectors completed. Enabled prospectors: 1
2017/03/20 16:05:50.744952 prospector_log.go:77: DBG  Start next scan
2017/03/20 16:05:50.744952 prospector_log.go:267: DBG  Check file for harvesting: C:\gopath\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_shutdown.Test.test_shutdown_wait_ok\log\nasa-50k.log
2017/03/20 16:05:50.843080 spooler.go:103: DBG  Flushing spooler because of timeout. Events flushed: 0
2017/03/20 16:05:50.943941 spooler.go:103: DBG  Flushing spooler because of timeout. Events flushed: 0
2017/03/20 16:05:50.945950 service.go:33: DBG  Received sigterm/sigint, stopping
2017/03/20 16:05:50.945950 service.go:39: DBG  Received svc stop/shutdown request
2017/03/20 16:05:50.945950 filebeat.go:251: INFO Stopping filebeat
2017/03/20 16:05:50.945950 prospector.go:157: INFO Prospector channel stopped
2017/03/20 16:05:50.945950 crawler.go:109: INFO Stopping Crawler
2017/03/20 16:05:50.945950 crawler.go:122: INFO Stopping 1 prospectors
2017/03/20 16:05:50.956946 prospector_log.go:287: DBG  Start harvester for new file: C:\gopath\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_shutdown.Test.test_shutdown_wait_ok\log\nasa-50k.log
2017/03/20 16:05:50.957921 state.go:72: DBG  New state added for C:\gopath\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_shutdown.Test.test_shutdown_wait_ok\log\nasa-50k.log
2017/03/20 16:05:50.957921 log.go:341: DBG  Setting offset for file based on seek: C:\gopath\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_shutdown.Test.test_shutdown_wait_ok\log\nasa-50k.log
2017/03/20 16:05:50.957921 log.go:325: DBG  Setting offset for file: C:\gopath\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_shutdown.Test.test_shutdown_wait_ok\log\nasa-50k.log. Offset: 0 
2017/03/20 16:05:50.957921 prospector_log.go:99: DBG  Prospector states cleaned up. Before: 1, After: 1
2017/03/20 16:05:50.957921 prospector.go:208: INFO Prospector ticker stopped
2017/03/20 16:05:50.957921 prospector.go:276: INFO Stopping Prospector: 3861952635306202170
2017/03/20 16:05:50.957921 log.go:109: INFO Harvester started for file: C:\gopath\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_shutdown.Test.test_shutdown_wait_ok\log\nasa-50k.log
==================
WARNING: DATA RACE
Write at 0x00c0421b57cc by goroutine 40:
  internal/race.Write()
      C:/Users/appveyor/.gvm/versions/go1.7.4.windows.amd64/src/internal/race/race.go:41 +0x3f
  sync.(*WaitGroup).Wait()
      C:/Users/appveyor/.gvm/versions/go1.7.4.windows.amd64/src/sync/waitgroup.go:129 +0x158
  github.com/elastic/beats/filebeat/harvester.(*Harvester).Stop()
      github.com/elastic/beats/filebeat/harvester/_obj/log.go:200 +0x81
  github.com/elastic/beats/filebeat/prospector.(*harvesterRegistry).Stop.func1()
      github.com/elastic/beats/filebeat/prospector/_obj/registry.go:49 +0x6b

Previous read at 0x00c0421b57cc by goroutine 38:
  internal/race.Read()
      C:/Users/appveyor/.gvm/versions/go1.7.4.windows.amd64/src/internal/race/race.go:37 +0x3f
  sync.(*WaitGroup).Add()
      C:/Users/appveyor/.gvm/versions/go1.7.4.windows.amd64/src/sync/waitgroup.go:71 +0x299
  github.com/elastic/beats/filebeat/harvester.(*Harvester).Harvest()
      github.com/elastic/beats/filebeat/harvester/_obj/log.go:68 +0x10a
  github.com/elastic/beats/filebeat/prospector.(*harvesterRegistry).start.func1()
      github.com/elastic/beats/filebeat/prospector/_obj/registry.go:76 +0xb7

Goroutine 40 (running) created at:
  github.com/elastic/beats/filebeat/prospector.(*harvesterRegistry).Stop()
      github.com/elastic/beats/filebeat/prospector/_obj/registry.go:50 +0x17c
  github.com/elastic/beats/filebeat/prospector.(*Prospector).stop()
      github.com/elastic/beats/filebeat/prospector/_obj/prospector.go:286 +0x1e3
  github.com/elastic/beats/filebeat/prospector.(*Prospector).Start.func2.1()
      github.com/elastic/beats/filebeat/prospector/_obj/prospector.go:182 +0x81
  github.com/elastic/beats/filebeat/prospector.(*Prospector).Start.func2()
      github.com/elastic/beats/filebeat/prospector/_obj/prospector.go:187 +0x9a

Goroutine 38 (running) created at:
2017/03/20 16:05:50.958921 log.go:366: DBG  Stopping harvester for file: C:\gopath\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_shutdown.Test.test_shutdown_wait_ok\log\nasa-50k.log
  github.com/elastic/beats/filebeat/prospector.(*harvesterRegistry).start()
      github.com/elastic/beats/filebeat/prospector/_obj/registry.go:77 +0xc2
  github.com/elastic/beats/filebeat/prospector.(*Prospector).startHarvester()
      github.com/elastic/beats/filebeat/prospector/_obj/prospector.go:376 +0x5d4
  github.com/elastic/beats/filebeat/prospector.(*ProspectorLog).scan()
      github.com/elastic/beats/filebeat/prospector/_obj/prospector_log.go:288 +0x8cd
  github.com/elastic/beats/filebeat/prospector.(*ProspectorLog).Run()
      github.com/elastic/beats/filebeat/prospector/_obj/prospector_log.go:93 +0x124
  github.com/elastic/beats/filebeat/prospector.(*Prospector).Run()
      github.com/elastic/beats/filebeat/prospector/_obj/prospector.go:195 +0x9e
  github.com/elastic/beats/filebeat/prospector.(*Prospector).Start.func2()
      github.com/elastic/beats/filebeat/prospector/_obj/prospector.go:186 +0x99
==================
2017/03/20 16:05:50.958921 log.go:373: DBG  Closing file: C:\gopath\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_shutdown.Test.test_shutdown_wait_ok\log\nasa-50k.log
2017/03/20 16:05:50.958921 log.go:217: DBG  Update state: C:\gopath\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_shutdown.Test.test_shutdown_wait_ok\log\nasa-50k.log, offset: 88
2017/03/20 16:05:50.958921 crawler.go:138: INFO Crawler stopped
2017/03/20 16:05:50.958921 filebeat.go:235: INFO Shutdown output timer started. Waiting for max 10s.
2017/03/20 16:05:51.044903 spooler.go:103: DBG  Flushing spooler because of timeout. Events flushed: 1
2017/03/20 16:05:51.044903 client.go:158: DBG  No events to publish
2017/03/20 16:05:51.044903 sync.go:82: DBG  Events sent: 1
2017/03/20 16:05:51.044903 registrar.go:320: DBG  Processing 1 events
2017/03/20 16:05:51.044903 state.go:72: DBG  New state added for C:\gopath\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_shutdown.Test.test_shutdown_wait_ok\log\nasa-50k.log
2017/03/20 16:05:51.044903 registrar.go:302: DBG  Registrar states cleaned up. Before: 1, After: 1
2017/03/20 16:05:51.044903 registrar.go:346: DBG  Write registry file: C:\gopath\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_shutdown.Test.test_shutdown_wait_ok/registry
2017/03/20 16:05:51.066776 file_windows.go:51: DBG  delete old: remove C:\gopath\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_shutdown.Test.test_shutdown_wait_ok/registry.old: The system cannot find the file specified.
2017/03/20 16:05:51.067798 registrar.go:373: DBG  Registry file updated. 1 states written.
2017/03/20 16:05:51.067798 signalwait.go:94: INFO Continue shutdown: All enqueued events being published.
2017/03/20 16:05:51.067798 spooler.go:116: INFO Stopping spooler
2017/03/20 16:05:51.067798 spooler.go:121: DBG  Spooler has stopped
2017/03/20 16:05:51.067798 sync.go:54: DBG  Shutting down sync publisher
2017/03/20 16:05:51.067798 registrar.go:338: INFO Stopping Registrar
2017/03/20 16:05:51.067798 registrar.go:287: INFO Ending Registrar
2017/03/20 16:05:51.067798 registrar.go:346: DBG  Write registry file: C:\gopath\src\github.com\elastic\beats\filebeat\build\system-tests\run\test_shutdown.Test.test_shutdown_wait_ok/registry
2017/03/20 16:05:51.068774 registrar.go:373: DBG  Registry file updated. 1 states written.
2017/03/20 16:05:51.068774 metrics.go:51: INFO Total non-zero values:  beat.memstats.gc_next=4194304 beat.memstats.memory_alloc=1326664 beat.memstats.memory_total=1326664 filebeat.harvester.closed=1 filebeat.harvester.started=1 publish.events=1 registrar.states.current=1 registrar.states.update=1 registrar.writes=3
2017/03/20 16:05:51.068774 metrics.go:52: INFO Uptime: 397.8223ms
2017/03/20 16:05:51.068774 beat.go:255: INFO filebeat stopped.
PASS
coverage: 42.2% of statements in ./...
Found 1 data race(s)
@ruflin ruflin added Filebeat Filebeat review bug and removed review labels Mar 21, 2017
ruflin added a commit to ruflin/beats that referenced this issue Apr 28, 2017
There is still a case where a race condition could happen (see TODO). Registry must be improved to also handle this case.

This should not be backported as the issue can only happen on shutdown in very rare cases and should not have any side effect.

Closes elastic#3779
ruflin added a commit to ruflin/beats that referenced this issue May 15, 2017
It was possible that with reloading enabled that during the shutdown of filebeat, a new harvester was started. This is now prevent by having a lock on the starting of the harvester so no new harvesters can be started when shutdown started.

This is not backported to 5.x because it can only happen on shutdown and does not have any side affects.

Closes elastic#3779
exekias pushed a commit that referenced this issue May 15, 2017
It was possible that with reloading enabled that during the shutdown of filebeat, a new harvester was started. This is now prevent by having a lock on the starting of the harvester so no new harvesters can be started when shutdown started.

This is not backported to 5.x because it can only happen on shutdown and does not have any side affects.

Closes #3779
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant