An excellent tool for troubleshooting distributed systems is ss, which provides visibility into the TCP states of your live connections. Even if you know netstat, this is a command you should get acquainted with: its more powerful interface requires fewer pipes to find the connection information you need, and it is probably already installed on your favorite distribution.
In this post, we walk through a couple of examples and then show a real-world production problem that can be diagnosed by effective use of ss.
To list TCP [=-t=] connections with destination port 9200 [=dst :9200=], showing numeric ports [=-n=] and resolving addresses to DNS names [=-r=], we use the following:
ss -rnt dst :9200
The output of the above command might look something like this:
Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
0       0       localhost:28655     localhost:9200

# OR

Recv-Q  Send-Q  Local Address:Port                    Peer Address:Port
0       0       mysvc1-app3.dc1.exampleapp.com:44541  logs.dc1.exampleapp.com:9200
0       0       mysvc1-app3.dc1.exampleapp.com:33389  logs.dc1.exampleapp.com:9200
0       0       mysvc1-app3.dc1.exampleapp.com:45740  logs.dc1.exampleapp.com:9200
0       0       mysvc1-app3.dc1.exampleapp.com:53848  logs.dc1.exampleapp.com:9200
0       0       mysvc1-app3.dc1.exampleapp.com:33406  logs.dc1.exampleapp.com:9200
0       0       mysvc1-app3.dc1.exampleapp.com:33407  logs.dc1.exampleapp.com:9200
0       0       mysvc1-app3.dc1.exampleapp.com:33405  logs.dc1.exampleapp.com:9200
Now, this isn't very remarkable. Let's find TCP connections in the CLOSE_WAIT state instead (adding a dst filter like the one above would narrow this to a specific destination address and port):
$ ss -tarn state close-wait
Recv-Q  Send-Q  Local Address:Port                   Peer Address:Port
1       0       mysvc-app1.dc1.exampleapp.com:40701  mysvc-app1.dc1.exampleapp.com:50010
1       0       mysvc-app1.dc1.exampleapp.com:29470  mysvc-app1.dc1.exampleapp.com:50010
1       0       mysvc-app1.dc1.exampleapp.com:35594  mysvc-app1.dc1.exampleapp.com:50010
1       0       mysvc-app1.dc1.exampleapp.com:39683  mysvc-app1.dc1.exampleapp.com:50010
1       0       mysvc-app1.dc1.exampleapp.com:22715  mysvc-app1.dc1.exampleapp.com:50010
1       0       mysvc-app1.dc1.exampleapp.com:23824  mysvc-app1.dc1.exampleapp.com:50010
1       0       mysvc-app1.dc1.exampleapp.com:37015  mysvc-app10.dc1.exampleapp.com:50010
1       0       mysvc-app1.dc1.exampleapp.com:28927  mysvc-app1.dc1.exampleapp.com:50010
1       0       mysvc-app1.dc1.exampleapp.com:37298  mysvc-app1.dc1.exampleapp.com:50010
1       0       mysvc-app1.dc1.exampleapp.com:42596  mysvc-app20.dc1.exampleapp.com:50010
1       0       mysvc-app1.dc1.exampleapp.com:45345  mysvc-app1.dc1.exampleapp.com:50010
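With a list like this, the next question is usually which peer is holding most of the CLOSE_WAIT sockets. Here is a minimal sketch that tallies them, assuming the four-column layout shown above (Recv-Q, Send-Q, local, peer) preceded by a header line. The name count_by_peer is made up for illustration and is not part of ss; it does cost one pipe.

```shell
# count_by_peer: read ss output on stdin and print "<count> <peer>" lines,
# busiest peer first. Assumes a header on line 1 and the peer address in
# column 4, which is the layout ss prints when a "state" filter is used.
count_by_peer() {
  awk 'NR > 1 && NF >= 4 { count[$4]++ }
       END { for (peer in count) print count[peer], peer }' |
    sort -rn
}

# Usage (needs ss on the host):
#   ss -tarn state close-wait | count_by_peer
```

A peer that dominates this tally is a good first place to look for a socket that is never being closed by the local application.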
What about timers on sockets? Well, ss has what you want: list all [=-a=] TCP sockets [=-t=] with resolved DNS [=-r=] and numeric ports [=-n=] that are in the TIME_WAIT state, with timer information [=-o=]:
$ ss -arnto state time-wait
Recv-Q  Send-Q  Local Address:Port                   Peer Address:Port
0       0       mysvc-app1.dc1.exampleapp.com:21665  ldap.dc1.exampleapp.com:389          timer:(timewait,22sec,0)
0       0       mysvc-app1.dc1.exampleapp.com:45311  mysvc-app13.dc1.exampleapp.com:2191  timer:(timewait,24sec,0)
0       0       mysvc-app1.dc1.exampleapp.com:21606  ldap.dc1.exampleapp.com:389          timer:(timewait,2.730ms,0)
0       0       mysvc-app1.dc1.exampleapp.com:40319  mysvc-app2.dc1.exampleapp.com:9092   timer:(timewait,4.949ms,0)
0       0       mysvc-app1.dc1.exampleapp.com:37364  mysvc-app12.dc1.exampleapp.com:2191  timer:(timewait,28sec,0)
0       0       localhost:39282                      localhost:17123                      timer:(timewait,54sec,0)
Look ma, no pipes!
Looking for which process is listening on a port? Simple:
-l: we limit ourselves to listening sockets
-t: we limit ourselves to TCP
-n: we make sure the port is shown as a number, not a service alias
-r: we resolve the hostnames (not used in the command below)
-p: we show the processes using that socket
Remember, to get process information you usually need sudo :)
$ sudo ss -nltp state listening src :3030
Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
0       100     127.0.0.1:3030      *:*                users:(("sensu-client",46889,17))
Note the users output is a 3-tuple: the process name first, the PID second, and a third entry that the man page doesn't explain. That third entry is the file descriptor number of the socket inside that process; newer versions of ss label the fields explicitly as pid= and fd=.
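Because the users tuple appears both bare (name,pid,fd) and labeled (pid=...,fd=...) depending on the ss version, scripts that consume it tend to need a little normalizing. Below is a small sketch of that, assuming one users:(...) entry per line; socket_owner is a hypothetical helper name, and when several processes share a socket this only captures one of them.

```shell
# socket_owner: read ss -p output on stdin and print "<name> <pid> <fd>"
# for each line that carries a users:((...)) entry. Handles both the bare
# ("name",123,4) form and the labeled ("name",pid=123,fd=4) form.
socket_owner() {
  sed -n -E 's/.*users:\(\("([^"]*)",(pid=)?([0-9]+),(fd=)?([0-9]+)\).*/\1 \3 \5/p'
}

# Usage (needs ss and usually sudo):
#   sudo ss -nltp state listening src :3030 | socket_owner
```

Lines without a users entry, including the header, are simply dropped, so the output can go straight into a loop over name, PID, and file descriptor.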
Can Anybody Hear Me?
Finally, let us walk through a production issue where ss helped us get to the root of the issue faster. Recently an alert was generated when one of our services in an environment was no longer receiving Kafka messages. After taking the necessary steps to ensure that the Kafka cluster was up and functioning correctly, we went to the node that alerted. There we used ss to check for TCP connections to Kafka. We saw no established TCP connections to Kafka, but we did see a connection in TCP state SYN_SENT:
$ sudo ss -antlp state syn-sent dst :9200
Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
0       1       10.4.1.40:48928     10.4.1.80:9200     users:(("java",pid=13657,fd=3))
The output above tells us something useful: this host is trying to talk to Kafka, but it isn't able to connect to the broker. A socket that sits in SYN_SENT has sent its SYN and has not received a SYN-ACK back. This information allowed us to investigate a possible networking problem, and in this specific example the DNS for Kafka had changed, and the application had cached the old entry.
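One way to spot this kind of stale-DNS situation sooner is to compare the peers stuck in SYN_SENT with what the name resolves to right now. The following is a rough sketch rather than what we ran at the time. It assumes addr:port peers in the fourth column, and both the helper name stale_peers and the hostname kafka.example.internal are invented for illustration.

```shell
# stale_peers: read ss output on stdin and print every peer address that is
# NOT among the currently resolved addresses passed as arguments.
# Assumes a header on line 1 and "addr:port" peers in column 4.
stale_peers() {
  expected=" $* "
  awk 'NR > 1 && NF >= 4 { sub(/:[0-9]+$/, "", $4); print $4 }' |
    sort -u |
    while read -r ip; do
      case "$expected" in
        *" $ip "*) ;;          # still a valid address for the name
        *) echo "$ip" ;;       # the application is dialing an old address
      esac
    done
}

# Usage (hypothetical hostname; needs ss, getent, and usually sudo):
#   sudo ss -ntp state syn-sent dst :9200 |
#     stale_peers $(getent hosts kafka.example.internal | awk '{ print $1 }')
```

Any address it prints is one the application is still dialing even though the name no longer points there, which is the signature of a cached DNS entry.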