Today has been one of those days: nothing wanted to work on the first go, everything behaved differently than expected, or information was just not where I expected it. But by the end of the day almost everything was running, or running again.
Services – Subversion, Redmine, WordPress, Git
A few of you have probably noticed performance issues with one of our services in the past few days, be it the blog, Redmine, Git or SVN. Just to let you know upfront: these were all related. As many of you hopefully already know, most of our systems run Solaris, and on these we use Solaris Zones (or containers, if you prefer) to get dedicated environments for most of our services. So even though it might not look like it, all of the above services share (or rather, shared) the same physical host.

Usually this setup works quite well for us, at least it has during the last 4-5 years, but recently trouble arose. It started out with Subversion/Apache suddenly producing log files of hundreds of megabytes to a few gigabytes a day. It was already quite clear then that the new build-servers used in class were the main reason for this, but as I couldn’t easily change the build-servers I just fixed Apache's logging and the log rotation a bit, and for a moment everything seemed fine again.

Until yesterday… wow, it’s already this late… so it was the day before yesterday, when some people started to notice long waiting times for one or another of these services. I had another look and noticed right away that the system hosting these zones had a load average of between 30 and 40, which in my opinion is just way too much in this case! So the decision was made to move some of the services to a separate host (dev on one, and git, blog and pm on the other), but first I turned my attention to the build-servers. It turns out they were polling Subversion every minute for each build project; together with the students' accesses that was just too much for our meager V210 system. So I changed this schedule to every 5 minutes (this was one of the points where stuff just didn’t want to work out, but telling that story in detail would go too far), and almost like magic the load on the system dropped and the responsiveness of the other services was way better again. Still, to avoid SVN impacting the other services again I moved them off to another host, and we’re making sure that the build-servers will start using SVN post-commit hooks instead of polling to further reduce the load.
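To make the difference between polling and hooks a bit more concrete, here is a minimal Python sketch of what such a post-commit hook could look like. Subversion does call the post-commit hook with the repository path and the new revision as arguments, but everything else here is an assumption for illustration: the trigger URL, the query parameter names and the project count of 20 are made up, and the real build-server will have its own way of being notified.

```python
#!/usr/bin/env python3
"""Sketch of a Subversion post-commit hook that notifies a build server."""
import sys
import urllib.parse
import urllib.request

# Hypothetical endpoint; the real build server's trigger URL will differ.
BUILD_SERVER_URL = "http://build.example.org/trigger"


def polls_per_hour(projects, interval_minutes):
    """Requests per hour when every project polls once per interval."""
    return projects * 60 // interval_minutes


def build_notify_url(base_url, repos_path, revision):
    """Return the URL to request, with repository and revision as query parameters."""
    query = urllib.parse.urlencode({"repos": repos_path, "rev": revision})
    return f"{base_url}?{query}"


def main(argv):
    # Subversion passes the repository path and the revision to post-commit.
    if len(argv) != 3:
        return 1
    url = build_notify_url(BUILD_SERVER_URL, argv[1], argv[2])
    try:
        # One request per commit instead of one per project per minute.
        urllib.request.urlopen(url, timeout=5).close()
    except OSError as exc:
        # The commit is already done at this point, so only report the problem.
        print(f"could not notify build server: {exc}", file=sys.stderr)
        return 1
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

With, say, 20 build projects, `polls_per_hour(20, 1)` gives 1200 requests an hour and `polls_per_hour(20, 5)` gives 240, whereas the hook only produces traffic when somebody actually commits.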
EEEnet – the Experimental Enterprise Environment Network
A few weeks ago our Lab was one of the first to be connected to the new EEEnet, a network which in the near future should connect all the different laboratories of all the school departments with as few barriers as possible. A second “Lab” that was added around the same time is the new HPC Cluster of the department of Mechanical Engineering. We’ve already been maintaining their “old” Cluster, where they have been using some of our services (proxy, mail server, ssh gw, etc.), so they would have liked to do the same on the new cluster. But it turns out that allowing this access took more than just a setting in our firewall. No, it was routing trouble again, and figuring out routing problems around firewalls, for some of which you don’t know the settings and rule sets, can be a pain in the a**, I’m telling you. This one is still an open task and a clean solution would be much appreciated, as our network can be quite confusing as it is and definitely doesn’t need any more quirks which make it harder to understand.
Let’s see what next week brings.