ssh - Jenkins Slave Offline Node: Connection timed out / closed - Docker container - Relaunch step - Configuration is picking Old Port
Jenkins version: 1.643.2
Docker plugin version: 0.16.0
In our Jenkins environment, we have a Jenkins master and 2-5 slave node servers (slave1, slave2, slave3). Each of these slaves is configured in the Jenkins global configuration using the Docker plugin. Everything is working at the minute.
I saw our monitoring system throwing alerts for high swap space usage on slave3 (for example, IP: 11.22.33.44). I SSH'ed into the machine and ran:

sudo docker ps

which gave me valid output listing the Docker containers running on the slave3 machine. By running

ps -eo pmem,pcpu,vsize,pid,cmd | sort -k 1 -nr | head -10

on the target slave's machine (where 4 containers were running), I found the top 5 processes eating RAM were the java -jar slave.jar processes running inside each container. I thought, why not restart them and recoup the memory? In the following output, you can see the state of the

sudo docker ps

command before and after the

docker restart <container_instance>

step. If you scroll right, you'll notice in the 2nd line the container ID ending in ...0a02, whose port (listed under the PORTS heading) on the host (slave3) machine is 1053 (mapped to the container's virtual IP's port 22, SSH). What this means is: in the Jenkins Manage Nodes section, if I try to relaunch that slave's container, Jenkins will try to connect to the host IP 11.22.33.44 on port 1053 and do whatever it's supposed to do to bring the slave up. So, Jenkins is holding that port (1053) somewhere.
Before the restart:

CONTAINER ID   IMAGE                                                                    COMMAND                  CREATED        STATUS             PORTS                  NAMES
ae3eb02a278d   docker.someinstance.coolcompany.com:443/jenkins-slave-stable-image:1.1   "bash -c '/usr/sbin/s"   26 hours ago   Up 26 hours        0.0.0.0:1048->22/tcp   lonely_lalande
d4745b720a02   docker.someinstance.coolcompany.com:443/jenkins-slave-stable-image:1.1   "bash -c '/usr/sbin/s"   9 days ago     Up About an hour   0.0.0.0:1053->22/tcp   cocky_yonath
bd9e451265a6   docker.someinstance.coolcompany.com:443/jenkins-slave-stable-image:1.1   "bash -c '/usr/sbin/s"   9 days ago     Up About an hour   0.0.0.0:1050->22/tcp   stoic_bell
0e905a6c3851   docker.someinstance.coolcompany.com:443/jenkins-slave-stable-image:1.1   "bash -c '/usr/sbin/s"   9 days ago     Up About an hour   0.0.0.0:1051->22/tcp   serene_tesla

sudo docker restart d4745b720a02; echo $?
d4745b720a02
0

After the restart (note that ...0a02 is now on host port 1054):

CONTAINER ID   IMAGE                                                                    COMMAND                  CREATED        STATUS             PORTS                  NAMES
ae3eb02a278d   docker.someinstance.coolcompany.com:443/jenkins-slave-stable-image:1.1   "bash -c '/usr/sbin/s"   26 hours ago   Up 26 hours        0.0.0.0:1048->22/tcp   lonely_lalande
d4745b720a02   docker.someinstance.coolcompany.com:443/jenkins-slave-stable-image:1.1   "bash -c '/usr/sbin/s"   9 days ago     Up 4 seconds       0.0.0.0:1054->22/tcp   cocky_yonath
bd9e451265a6   docker.someinstance.coolcompany.com:443/jenkins-slave-stable-image:1.1   "bash -c '/usr/sbin/s"   9 days ago     Up About an hour   0.0.0.0:1050->22/tcp   stoic_bell
0e905a6c3851   docker.someinstance.coolcompany.com:443/jenkins-slave-stable-image:1.1   "bash -c '/usr/sbin/s"   9 days ago     Up About an hour   0.0.0.0:1051->22/tcp   serene_tesla
After running

sudo docker restart <instanceIdOfContainer>

I ran free -h and grep -i swap /proc/meminfo and found that the RAM (which was earlier used up, showing only 230MB remaining free) now had 1GB free, and of the swap (1GB total, which had been fully used; I tried swappiness at both 60 and 10), 450MB was now free. The alert got resolved. Cool.
But notice in the sudo docker ps output above: after the restart step, the container with ID ...0a02 got a new port, 1054!!

When I went to Manage Nodes > brought the node offline, stopped it, and relaunched it, Jenkins did not pick up the new port (1054). It's still somehow picking the old port 1053 while trying to make the SSH connection to 11.22.33.44 (the host's IP) on port 1053 (which was previously mapped to the container's virtual IP's port 22, SSH).

How can I change the port or configuration for the Jenkins slave container so that Jenkins sees the new port and can relaunch it?
PS: Clicking "Configure" on the node shows me nothing other than the Name field. A regular slave has a lot of fields (where you can define labels, remote root directory, launch method, node properties, environment variables, and tools for the slave environment); I guess for these Docker containers I'm not seeing anything other than the Name field. Clicking Test Connection in the Jenkins global configuration (under the Docker plugin's section) shows it's finding Docker version 1.8.3.

Right now, port 1053 is not responding (telnet fails) since the container's mapping moved to 1054 (after the restart step), so the Jenkins relaunch step is failing during the SSH connection step (the first thing the SSH launch method does).
[07/27/17 17:17:19] [SSH] Opening SSH connection to 11.22.33.44:1053.
Connection timed out
ERROR: Unexpected error in launching a slave. This is probably a bug in Jenkins.
java.lang.IllegalStateException: Connection is not established!
	at com.trilead.ssh2.Connection.getRemainingAuthMethods(Connection.java:1030)
	at com.cloudbees.jenkins.plugins.sshcredentials.impl.TrileadSSHPasswordAuthenticator.canAuthenticate(TrileadSSHPasswordAuthenticator.java:82)
	at com.cloudbees.jenkins.plugins.sshcredentials.SSHAuthenticator.newInstance(SSHAuthenticator.java:207)
	at com.cloudbees.jenkins.plugins.sshcredentials.SSHAuthenticator.newInstance(SSHAuthenticator.java:169)
	at hudson.plugins.sshslaves.SSHLauncher.openConnection(SSHLauncher.java:1212)
	at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:711)
	at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:706)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
[07/27/17 17:19:26] Launch failed - cleaning up connection
[07/27/17 17:19:26] [SSH] Connection closed.
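To confirm which host port the container's SSH (port 22) is currently mapped to, you can ask Docker directly on slave3 (this just verifies the mismatch; the fix comes below):

sudo docker port d4745b720a02 22
# prints the current mapping, e.g. 0.0.0.0:1054

telnet 11.22.33.44 1054   # should connect, while 1053 times out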
OK. Zeeesus!

In JENKINS_HOME (on the master server), I searched for the config file holding the old port # info for the container node(s) showing offline. I changed directory to the nodes folder inside $JENKINS_HOME and found config.xml files there, one per node. For example:

$JENKINS_HOME/nodes/<slave3_node_ip>-d4745b720a02/config.xml
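The relevant part of that config.xml is the SSH launcher configuration. A rough sketch of what to look for (element names per the SSH Slaves plugin; your file may differ slightly):

<launcher class="hudson.plugins.sshslaves.SSHLauncher" plugin="ssh-slaves">
  <host>11.22.33.44</host>
  <!-- the stale host port Jenkins keeps trying; change it to the new one (1054) -->
  <port>1053</port>
</launcher>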
Resolution steps:
- vim-edited the file to change the old port to the new one.
- Manage Jenkins > Reload Configuration from Disk.
- Manage Nodes > selected the particular node, took it offline.
- Relaunched the slave, and this time Jenkins picked up the new port and started the container slave as expected (the SSH connection to the new port was visible after the configuration change). A scripted version of these steps is sketched below.
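For reference, the same fix can be scripted from the master. A minimal sketch, assuming the node directory name from above, and an admin user with an API token (user:APITOKEN and JENKINS_URL are placeholders, not values from this setup):

# 1) swap the stale port for the new one in the node's config.xml
sudo sed -i 's|<port>1053</port>|<port>1054</port>|' \
  "$JENKINS_HOME/nodes/<slave3_node_ip>-d4745b720a02/config.xml"

# 2) ask Jenkins to reload its configuration from disk
curl -X POST -u user:APITOKEN "$JENKINS_URL/reload"

After that, take the node offline and relaunch it from Manage Nodes as described above.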
I think the page https://my.company.jenkins.instance.com/projectinstance/docker-plugin/server/<slave3_ip>/ (the Docker plugin web page that shows, in tabular form, the containers running on a given slave machine) has a button (last column) to stop a given slave's container, but not one to start or restart it. Having a start or restart button there would have done what I did above in a much easier fashion.
Better solution:

What was happening is: the 4 long-lived container nodes running on slave3 were competing for the available RAM (11-12GB), and over time the JVM process in each individual container (the java -jar slave.jar that the relaunch step starts on the target container's virtual machine (IP) running on the slave3 server) was trying to take as much memory (RAM) as it could. That led to low free memory and swap getting used, and used to the point where the monitoring tool started screaming at us, sending notifications etc.

To fix this situation, the first thing one should do is:
1) Under the Jenkins global configuration (Manage Jenkins > Configure System > Docker plugin section), for the slave server's image / Docker template, under the Advanced Settings section, you can put JVM options to tell the container not to compete for RAM. Putting in JVM options helped; these settings try to keep the heap space of each container in a smaller box so it doesn't starve out the rest of the system (see the sketch after this list). You can start with 3-4GB depending upon how much total RAM you have on the slave machine where the container-based slave nodes run.
2) Run a recent version of slave.jar; it may have performance / maintenance enhancements in place that help.
3) Integrate whatever monitoring solution you have (Icinga etc.) to auto-launch a Jenkins job, where the job runs some piece of action (a bash one-liner, a Python script or some Groovy goodness, an Ansible playbook, etc.) to fix the issue behind such an alert (a trigger sketch follows this list).
4) Automatically have the container slave nodes relaunched (i.e., the relaunch step): take the slave offline, bring it online, relaunch it, and the relaunch step will bring the slave back to a rejuvenated state of freshness. What I have is: if a slave is idle (not running a job), take it offline > online > relaunch it using the Jenkins REST API via a small Groovy script, put that in a Jenkins job, and let it do the above if your slave nodes are long-lived (a REST sketch is below, after this list).
5) Or one can spin up container-based slaves on the fly: a use-and-throw model, one each time Jenkins queues a job to run.
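For item 1, the exact JVM options we used weren't preserved here; as an illustrative sketch only (the values are assumptions sized for an 11-12GB host running 4 container slaves, not the original settings), something along these lines in the Docker template's Advanced Settings keeps each slave JVM boxed in:

# hypothetical JVM options for the slave container's java process
-Xms256m -Xmx3g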
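For item 3, most monitoring systems can run an arbitrary command as an alert handler. A minimal sketch, assuming a Jenkins job named restart-slave-containers and an API token (both hypothetical; depending on your Jenkins security settings a CSRF crumb may also be required):

# alert handler: kick off a Jenkins job that restarts the memory-hungry containers
curl -X POST -u user:APITOKEN "$JENKINS_URL/job/restart-slave-containers/build"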
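For item 4, the offline > online > relaunch cycle can be driven over the Jenkins REST API (the author used a small Groovy script; curl works the same way). Treat this as a sketch: the node name, user, and token are placeholders.

NODE=slave3-d4745b720a02   # hypothetical node name as it appears under /computer/
AUTH=user:APITOKEN

# take the node offline (toggleOffline flips the current state)
curl -X POST -u "$AUTH" "$JENKINS_URL/computer/$NODE/toggleOffline?offlineMessage=rejuvenating"

# bring it back online (flips the state again)
curl -X POST -u "$AUTH" "$JENKINS_URL/computer/$NODE/toggleOffline"

# relaunch the slave agent (the equivalent of the Relaunch button)
curl -X POST -u "$AUTH" "$JENKINS_URL/computer/$NODE/launchSlaveAgent"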