@dlwh
Created March 4, 2024 06:07
tpu coordination service crash
(maybe from the coordinator?)
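For context, the errors below come from JAX's multi-host coordination service: the process at the coordinator address hosts the service, and every other worker sends periodic heartbeats to it, so a single task restarting (here task 11 reconnecting "with a different incarnation") shuts the service down and takes the remaining workers with it. A minimal sketch of the kind of setup involved, assuming a standard jax.distributed.initialize() call; the process count and task index are illustrative placeholders, and the coordinator address is simply the gRPC peer IP that appears in the log:

import jax

# Hypothetical multi-host setup matching the job layout in the log
# (/job:jax_worker/replica:0/task:N). On Cloud TPU pods these arguments
# can usually be omitted and are auto-detected; the explicit values here
# are placeholders for illustration only.
jax.distributed.initialize(
    coordinator_address="10.130.15.239:8476",  # the coordination leader seen as the gRPC peer in the log
    num_processes=16,                          # assumed number of hosts (not taken from the log)
    process_id=0,                              # this host's task index; the host at index 0 runs the coordination service
)

print(f"task {jax.process_index()} of {jax.process_count()}, "
      f"{jax.local_device_count()} local devices")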
2024-03-04 05:14:56.199355: E external/tsl/tsl/distributed_runtime/coordination/coordination_service.cc:584] /job:jax_worker/replica:0/task:11 unexpectedly tried to connect with a different incarnation. It has likely restarted.
2024-03-04 05:14:56.199447: E external/tsl/tsl/distributed_runtime/coordination/coordination_service.cc:992] /job:jax_worker/replica:0/task:11 has been set to ERROR in coordination service: ABORTED: /job:jax_worker/replica:0/task:11 unexpectedly tried to connect with a different incarnation. It has likely restarted. [type.googleapis.com/tensorflow.CoordinationServiceError='\"\x0e\n\njax_worker\x10\x0b']
2024-03-04 05:14:56.199460: E external/tsl/tsl/distributed_runtime/coordination/coordination_service.cc:828] Stopping coordination service as there is no service-to-client connection, but we encountered an error: ABORTED: /job:jax_worker/replica:0/task:11 unexpectedly tried to connect with a different incarnation. It has likely restarted. [type.googleapis.com/tensorflow.CoordinationServiceError='\"\x0e\n\njax_worker\x10\x0b']
2024-03-04 05:14:57.253186: E external/tsl/tsl/distributed_runtime/coordination/coordination_service_agent.cc:767] Coordination agent is set to ERROR: INVALID_ARGUMENT: Unexpected heartbeat request from task: /job:jax_worker/replica:0/task:0. This usually implies an earlier error that caused coordination service to shut down before the workers disconnect. Check the task leader's logs for an earlier error to debug the root cause.
Additional GRPC error information from remote target unknown_target_for_coordination_leader while calling /tensorflow.CoordinationService/Heartbeat:
:{"created":"@1709529297.253080681","description":"Error received from peer ipv4:10.130.15.239:8476","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unexpected heartbeat request from task: /job:jax_worker/replica:0/task:0. This usually implies an earlier error that caused coordination service to shut down before the workers disconnect. Check the task leader's logs for an earlier error to debug the root cause.","grpc_status":3} [type.googleapis.com/tensorflow.CoordinationServiceError='']
2024-03-04 05:14:57.253257: E external/xla/xla/pjrt/distributed/client.cc:86] Coordination service agent in error status: INVALID_ARGUMENT: Unexpected heartbeat request from task: /job:jax_worker/replica:0/task:0. This usually implies an earlier error that caused coordination service to shut down before the workers disconnect. Check the task leader's logs for an earlier error to debug the root cause.
Additional GRPC error information from remote target unknown_target_for_coordination_leader while calling /tensorflow.CoordinationService/Heartbeat:
:{"created":"@1709529297.253080681","description":"Error received from peer ipv4:10.130.15.239:8476","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unexpected heartbeat request from task: /job:jax_worker/replica:0/task:0. This usually implies an earlier error that caused coordination service to shut down before the workers disconnect. Check the task leader's logs for an earlier error to debug the root cause.","grpc_status":3} [type.googleapis.com/tensorflow.CoordinationServiceError='']
2024-03-04 05:14:57.253288: F external/xla/xla/pjrt/distributed/client.h:77] Terminating process because the coordinator detected missing heartbeats. This most likely indicates that another task died; see the other task logs for more details. Disable Python buffering, i.e. `python -u`, to be sure to see all the previous output. Status: INVALID_ARGUMENT: Unexpected heartbeat request from task: /job:jax_worker/replica:0/task:0. This usually implies an earlier error that caused coordination service to shut down before the workers disconnect. Check the task leader's logs for an earlier error to debug the root cause.
Additional GRPC error information from remote target unknown_target_for_coordination_leader while calling /tensorflow.CoordinationService/Heartbeat:
:{"created":"@1709529297.253080681","description":"Error received from peer ipv4:10.130.15.239:8476","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unexpected heartbeat request from task: /job:jax_worker/replica:0/task:0. This usually implies an earlier error that caused coordination service to shut down before the workers disconnect. Check the task leader's logs for an earlier error to debug the root cause.","grpc_status":3} [type.googleapis.com/tensorflow.CoordinationServiceError='']
https://symbolize.stripped_domain/r/?trace=7f5133696a7c,7f513364251f&map=
*** SIGABRT received by PID 21756 (TID 21903) on cpu 14 from PID 21756; stack trace: ***
PC: @ 0x7f5133696a7c (unknown) pthread_kill
@ 0x7f4f7f121c87 928 (unknown)
@ 0x7f5133642520 (unknown) (unknown)
https://symbolize.stripped_domain/r/?trace=7f5133696a7c,7f4f7f121c86,7f513364251f&map=
E0304 05:14:57.259477 21903 coredump_hook.cc:442] RAW: Remote crash data gathering hook invoked.
E0304 05:14:57.259492 21903 coredump_hook.cc:481] RAW: Skipping coredump since rlimit was 0 at process start.
E0304 05:14:57.259498 21903 client.cc:269] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0304 05:14:57.259510 21903 coredump_hook.cc:537] RAW: Sending fingerprint to remote end.
E0304 05:14:57.259526 21903 coredump_hook.cc:546] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0304 05:14:57.259534 21903 coredump_hook.cc:598] RAW: Dumping core locally.
E0304 05:14:57.621419 21903 process_state.cc:807] RAW: Raising signal 6 with default behavior
2024-03-04 05:15:13.538170: E external/tsl/tsl/distributed_runtime/coordination/coordination_service_agent.cc:767] Coordination agent is set to ERROR: UNAVAILABLE: failed to connect to all addresses
Additional GRPC error information from remote target unknown_target_for_coordination_leader while calling /tensorflow.CoordinationService/Heartbeat:
:{"created":"@1709529313.538096663","description":"Failed to pick subchannel","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3941,"referenced_errors":[{"created":"@1709529313.534572163","description":"failed to connect to all addresses","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":393,"grpc_status":14}]}
2024-03-04 05:15:13.538220: E external/xla/xla/pjrt/distributed/client.cc:86] Coordination service agent in error status: UNAVAILABLE: failed to connect to all addresses
Additional GRPC error information from remote target unknown_target_for_coordination_leader while calling /tensorflow.CoordinationService/Heartbeat:
:{"created":"@1709529313.538096663","description":"Failed to pick subchannel","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3941,"referenced_errors":[{"created":"@1709529313.534572163","description":"failed to connect to all addresses","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":393,"grpc_status":14}]}
2024-03-04 05:15:13.538231: F external/xla/xla/pjrt/distributed/client.h:77] Terminating process because the coordinator detected missing heartbeats. This most likely indicates that another task died; see the other task logs for more details. Disable Python buffering, i.e. `python -u`, to be sure to see all the previous output. Status: UNAVAILABLE: failed to connect to all addresses
Additional GRPC error information from remote target unknown_target_for_coordination_leader while calling /tensorflow.CoordinationService/Heartbeat:
:{"created":"@1709529313.538096663","description":"Failed to pick subchannel","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3941,"referenced_errors":[{"created":"@1709529313.534572163","description":"failed to connect to all addresses","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":393,"grpc_status":14}]}
https://symbolize.stripped_domain/r/?trace=7fe4a3096a7c,7fe4a304251f&map=
*** SIGABRT received by PID 21549 (TID 21696) on cpu 174 from PID 21549; stack trace: ***
PC: @ 0x7fe4a3096a7c (unknown) pthread_kill
@ 0x7fe30b121c87 928 (unknown)
@ 0x7fe4a3042520 (unknown) (unknown)
https://symbolize.stripped_domain/r/?trace=7fe4a3096a7c,7fe30b121c86,7fe4a304251f&map=
E0304 05:15:13.543875 21696 coredump_hook.cc:442] RAW: Remote crash data gathering hook invoked.
E0304 05:15:13.543887 21696 coredump_hook.cc:481] RAW: Skipping coredump since rlimit was 0 at process start.
E0304 05:15:13.543891 21696 client.cc:269] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0304 05:15:13.543895 21696 coredump_hook.cc:537] RAW: Sending fingerprint to remote end.
E0304 05:15:13.543909 21696 coredump_hook.cc:546] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0304 05:15:13.543913 21696 coredump_hook.cc:598] RAW: Dumping core locally.
E0304 05:15:13.926991 21696 process_state.cc:807] RAW: Raising signal 6 with default behavior
2024-03-04 05:15:13.538390: E external/tsl/tsl/distributed_runtime/coordination/coordination_service_agent.cc:767] Coordination agent is set to ERROR: UNAVAILABLE: failed to connect to all addresses
Additional GRPC error information from remote target unknown_target_for_coordination_leader while calling /tensorflow.CoordinationService/Heartbeat:
:{"created":"@1709529313.538286813","description":"Failed to pick subchannel","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3941,"referenced_errors":[{"created":"@1709529313.534708193","description":"failed to connect to all addresses","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":393,"grpc_status":14}]}
2024-03-04 05:15:13.538485: E external/xla/xla/pjrt/distributed/client.cc:86] Coordination service agent in error status: UNAVAILABLE: failed to connect to all addresses
Additional GRPC error information from remote target unknown_target_for_coordination_leader while calling /tensorflow.CoordinationService/Heartbeat:
:{"created":"@1709529313.538286813","description":"Failed to pick subchannel","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3941,"referenced_errors":[{"created":"@1709529313.534708193","description":"failed to connect to all addresses","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":393,"grpc_status":14}]}
2024-03-04 05:15:13.538497: F external/xla/xla/pjrt/distributed/client.h:77] Terminating process because the coordinator detected missing heartbeats. This most likely indicates that another task died; see the other task logs for more details. Disable Python buffering, i.e. `python -u`, to be sure to see all the previous output. Status: UNAVAILABLE: failed to connect to all addresses
Additional GRPC error information from remote target unknown_target_for_coordination_leader while calling /tensorflow.CoordinationService/Heartbeat:
:{"created":"@1709529313.538286813","description":"Failed to pick subchannel","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3941,"referenced_errors":[{"created":"@1709529313.534708193","description":"failed to connect to all addresses","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":393,"grpc_status":14}]}
https://symbolize.stripped_domain/r/?trace=7f4293a96a7c,7f4293a4251f&map=
*** SIGABRT received by PID 21429 (TID 21574) on cpu 205 from PID 21429; stack trace: ***
PC: @ 0x7f4293a96a7c (unknown) pthread_kill
@ 0x7f40fb121c87 928 (unknown)
@ 0x7f4293a42520 (unknown) (unknown)
https://symbolize.stripped_domain/r/?trace=7f4293a96a7c,7f40fb121c86,7f4293a4251f&map=
E0304 05:15:13.544341 21574 coredump_hook.cc:442] RAW: Remote crash data gathering hook invoked.
E0304 05:15:13.544351 21574 coredump_hook.cc:481] RAW: Skipping coredump since rlimit was 0 at process start.
E0304 05:15:13.544356 21574 client.cc:269] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0304 05:15:13.544359 21574 coredump_hook.cc:537] RAW: Sending fingerprint to remote end.
E0304 05:15:13.544374 21574 coredump_hook.cc:546] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0304 05:15:13.544377 21574 coredump_hook.cc:598] RAW: Dumping core locally.
E0304 05:15:13.932416 21574 process_state.cc:807] RAW: Raising signal 6 with default behavior
2024-03-04 05:14:57.493682: E external/tsl/tsl/distributed_runtime/coordination/coordination_service_agent.cc:767] Coordination agent is set to ERROR: INVALID_ARGUMENT: Unexpected heartbeat request from task: /job:jax_worker/replica:0/task:1. This usually implies an earlier error that caused coordination service to shut down before the workers disconnect. Check the task leader's logs for an earlier error to debug the root cause.
Additional GRPC error information from remote target unknown_target_for_coordination_leader while calling /tensorflow.CoordinationService/Heartbeat:
:{"created":"@1709529297.493569869","description":"Error received from peer ipv4:10.130.15.239:8476","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unexpected heartbeat request from task: /job:jax_worker/replica:0/task:1. This usually implies an earlier error that caused coordination service to shut down before the workers disconnect. Check the task leader's logs for an earlier error to debug the root cause.","grpc_status":3} [type.googleapis.com/tensorflow.CoordinationServiceError='']
2024-03-04 05:14:57.493758: E external/xla/xla/pjrt/distributed/client.cc:86] Coordination service agent in error status: INVALID_ARGUMENT: Unexpected heartbeat request from task: /job:jax_worker/replica:0/task:1. This usually implies an earlier error that caused coordination service to shut down before the workers disconnect. Check the task leader's logs for an earlier error to debug the root cause.
Additional GRPC error information from remote target unknown_target_for_coordination_leader while calling /tensorflow.CoordinationService/Heartbeat:
:{"created":"@1709529297.493569869","description":"Error received from peer ipv4:10.130.15.239:8476","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unexpected heartbeat request from task: /job:jax_worker/replica:0/task:1. This usually implies an earlier error that caused coordination service to shut down before the workers disconnect. Check the task leader's logs for an earlier error to debug the root cause.","grpc_status":3} [type.googleapis.com/tensorflow.CoordinationServiceError='']
2024-03-04 05:14:57.493769: F external/xla/xla/pjrt/distributed/client.h:77] Terminating process because the coordinator detected missing heartbeats. This most likely indicates that another task died; see the other task logs for more details. Disable Python buffering, i.e. `python -u`, to be sure to see all the previous output. Status: INVALID_ARGUMENT: Unexpected heartbeat request from task: /job:jax_worker/replica:0/task:1. This usually implies an earlier error that caused coordination service to shut down before the workers disconnect. Check the task leader's logs for an earlier error to debug the root cause.
Additional GRPC error information from remote target unknown_target_for_coordination_leader while calling /tensorflow.CoordinationService/Heartbeat:
:{"created":"@1709529297.493569869","description":"Error received from peer ipv4:10.130.15.239:8476","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unexpected heartbeat request from task: /job:jax_worker/replica:0/task:1. This usually implies an earlier error that caused coordination service to shut down before the workers disconnect. Check the task leader's logs for an earlier error to debug the root cause.","grpc_status":3} [type.googleapis.com/tensorflow.CoordinationServiceError='']
https://symbolize.stripped_domain/r/?trace=7f4adbe96a7c,7f4adbe4251f&map=
*** SIGABRT received by PID 21533 (TID 21673) on cpu 232 from PID 21533; stack trace: ***
PC: @ 0x7f4adbe96a7c (unknown) pthread_kill
@ 0x7f4943121c87 928 (unknown)
@ 0x7f4adbe42520 (unknown) (unknown)
https://symbolize.stripped_domain/r/?trace=7f4adbe96a7c,7f4943121c86,7f4adbe4251f&map=
E0304 05:14:57.499979 21673 coredump_hook.cc:442] RAW: Remote crash data gathering hook invoked.
E0304 05:14:57.499989 21673 coredump_hook.cc:481] RAW: Skipping coredump since rlimit was 0 at process start.
E0304 05:14:57.499994 21673 client.cc:269] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0304 05:14:57.499997 21673 coredump_hook.cc:537] RAW: Sending fingerprint to remote end.
E0304 05:14:57.500012 21673 coredump_hook.cc:546] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0304 05:14:57.500015 21673 coredump_hook.cc:598] RAW: Dumping core locally.
E0304 05:14:57.856100 21673 process_state.cc:807] RAW: Raising signal 6 with default behavior