{"id":820,"date":"2013-12-12T08:48:09","date_gmt":"2013-12-12T06:48:09","guid":{"rendered":"\/\/www.mcgill.org.za\/stuff\/?p=820"},"modified":"2014-02-12T12:19:02","modified_gmt":"2014-02-12T10:19:02","slug":"december-is-a-bad-month-for-linux-system-administration","status":"publish","type":"post","link":"https:\/\/www.mcgill.org.za\/stuff\/archives\/820","title":{"rendered":"December is a bad month for Linux system administration"},"content":{"rendered":"<p>The system load for Linux is an indication of how busy the system is, and how long you are going to wait to get something done.\u00a0 When the load is 0.0, that means that the system is ready to do whatever you want immediately.\u00a0 If the load is 0.50, then half of the time, the system is busy with something else.\u00a0 If the load is 1.00, the system is doing as much as it can.\u00a0 If the load is higher than 1, then that&#8217;s an indication of the backlog of the amount of tasks the system has to do.\u00a0 When the system load is 3.14, that means that there is a backlog of things being done, such that there are always \u03c0 things before your thing.\u00a0 When the load hits 10, you have trouble.\u00a0 When the load hits 59, you have serious trouble.\u00a0 When the load hits 272, something is horribly broken, and the system is not going to recover by itself.<\/p>\n<p>The thing that loads a system is competition for some scarce resource, e.g:<\/p>\n<ul>\n<li>CPU time: if you have one CPU in your system, it can only do so much.\u00a0 If you have 16 CPU&#8217;s they can only do 16 times as much as that.\u00a0 There&#8217;s a limit.<\/li>\n<li>Disk time: the electronic highway that relays data to and from your disks is only so wide, and you can only go so fast on it.\u00a0 If the highway is full, then queries must wait in line.\u00a0 If you need data from the disk, there&#8217;s nothing to do but wait.<\/li>\n<li>Memory: a computer has only so much memory to do its tasks.\u00a0 If it runs out of memory, then 
it has to either stop doing some things, or use slower memory (disk access).<\/li>\n<\/ul>\n<p>If you care whether your systems are working or not, you monitor the load, and if something goes beep, you have a look. You start like this:<\/p>\n<pre>root@dogmatix [~]# <strong>uptime<\/strong>\r\n07:55:05 up 23 days, 15:36,\u00a0 1 user,\u00a0 load average: 1.05, 2.57, 2.85<\/pre>\n<p>Those are the 1-, 5- and 15-minute load averages: over the last minute the load has been 1.05, and over the last 15 minutes it&#8217;s been 2.85. This is not a serious situation. I found another one:<\/p>\n<pre>root@kiarapoint [~]# <strong>uptime<\/strong><font size=-1>\r\n 16:57:23 up 22 days, 21:42,  1 user,  load average: <strong>66.27<\/strong>, 56.02, 36.21<\/font><\/pre>\n<p>Okay, that&#8217;s pretty bad. An average of 66.27 tasks running or waiting for their turn over the last minute. Not cool. Next step is to run <code>vmstat<\/code>:<\/p>\n<pre>root@kiarapoint [~]# <strong>vmstat 3 10<\/strong><font size=-1>\r\nprocs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------\r\n r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st\r\n 2 10   3652 104448 1770072 9313432    0    0     8    63    5    4  8  1 84  8  0\r\n 0  4   3652 115168 1770076 9313420    0    0     0    35 1071  476 11  1 18 70  0\r\n 1  5   3652 137784 1770076 9313428    0    0     0    63 1086  670 12  1 36 51  0\r\n 0  5   3652 154716 1770076 9313428    0    0     0    40 1057  402  8  1 31 60  0\r\n 0  5   3652 346420 1770076 9313320    0    0     0     0 1075  529 10  2 22 66  0\r\n 0  6   3652 346880 1770076 9313148    0    0     0    28 1047  259  1  0 11 88  0\r\n 0  6   3652 346624 1770076 9313148    0    0     1     0 1036  315  5  1  0 94  0\r\n 0  6   3652 346168 1770076 9313180    0    0     8   240 1069  173  0  0  0 99  0\r\n 0  6   3652 343216 1770076 9313180    0    0     0    23 1079  304  1  1  0 99  0\r\n 0  6   3652 343832 1770076 9313180    0    0     0    23 1022   86  0  0  0 100  
0<\/font><\/pre>\n<p>If you examine that carefully, you&#8217;ll notice that it makes no sense at all. This is in part because the columns don&#8217;t line up.  This should be better:<\/p>\n<pre><font size=-1>procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------\r\n r  <strong>b<\/strong>   swpd   free   buff  cache   si   so    <strong>bi<\/strong>    <strong>bo<\/strong>   in   cs us sy id <strong>wa<\/strong> st\r\n 2 10   3652 104448 1770072          0    0     8    63    5    4  8  1 84  8  0\r\n 0  4   3652 115168 1770076          0    0     0    35 1071  476 11  1 18 70  0\r\n 1  5   3652 137784 1770076          0    0     0    63 1086  670 12  1 36 51  0\r\n 0  5   3652 154716 1770076          0    0     0    40 1057  402  8  1 31 60  0\r\n 0  5   3652 346420 1770076          0    0     0     0 1075  529 10  2 22 66  0\r\n 0  6   3652 346880 1770076          0    0     0    28 1047  259  1  0 11 88  0\r\n 0  6   3652 346624 1770076          0    0     1     0 1036  315  5  1  0 94  0\r\n 0  6   3652 346168 1770076          0    0     8   240 <strong>1069<\/strong>  173  0  0  0 <strong>99<\/strong>  0\r\n 0  6   3652 343216 1770076          0    0     0    23 <strong>1079<\/strong>  304  1  1  0 <strong>99<\/strong>  0\r\n 0  6   3652 343832 1770076          0    0     0    23 <strong>1022<\/strong>   86  0  0  0 <strong>100<\/strong> 0<\/font><\/pre>\n<p>That&#8217;s better. The columns of interest here are:<\/p>\n<ul>\n<li><b>b<\/b> &#8212; blocked processes (not running, waiting for a resource) &#8212; there are around 6 processes waiting for the disk<\/li>\n<li><b>bi<\/b> and <b>bo<\/b> &#8212; blocks in and blocks out &#8212; the actual disk traffic, in blocks (roughly kilobytes) per second<\/li>\n<li><b>wa<\/b> &#8212; I\/O wait &#8212; the percentage of CPU time spent waiting for the disk, and here that&#8217;s nearly all of it<\/li>\n<\/ul>\n<p>So this machine is running slow because its disk can only do so much. 
It&#8217;s fielding around 1000 interrupts a second but writing out only a few hundred blocks per second at peak &#8212; a couple of Mbps towards the disk. For this system, that volume does not indicate a raw disk throughput problem, so we&#8217;re just pushing too hard: too many requests at once for the disk to service.<\/p>\n<p>Now, we do this &#8212; get a list of the processes, and see which ones are in &#8220;disk wait&#8221; (uninterruptible sleep), identified as <code>D<\/code> preceded by a space &#8212; hence the <code>grep ' D'<\/code> in the command below:<\/p>\n<pre>root@kiarapoint [~]# <strong>ps uaxwf | grep ' D'<\/strong><font size=-1>\r\nroot\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 357\u00a0 0.0\u00a0 0.0\u00a0\u00a0\u00a0\u00a0\u00a0 0\u00a0\u00a0\u00a0\u00a0 0 ?\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 <strong>D<\/strong>\u00a0\u00a0\u00a0 Nov19\u00a0\u00a0 4:20\u00a0 \\_ [pdflush]\r\nroot\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 573\u00a0 0.0\u00a0 0.0\u00a0\u00a0\u00a0\u00a0\u00a0 0\u00a0\u00a0\u00a0\u00a0 0 ?\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 <strong>D<\/strong>&lt;\u00a0\u00a0 Nov19\u00a0 13:45\u00a0 \\_ [kjournald]\r\nroot\u00a0\u00a0\u00a0\u00a0 32748\u00a0 0.0\u00a0 0.0\u00a0 61248\u00a0\u00a0 756 pts\/0\u00a0\u00a0\u00a0 S+\u00a0\u00a0 16:58\u00a0\u00a0 0:00\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \\_ grep --color=auto\u00a0 D\r\nnobody\u00a0\u00a0 24458\u00a0 0.0\u00a0 0.0\u00a0 88536 14772 ?\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 S\u00a0\u00a0\u00a0 <font color=red>Dec06<\/font>\u00a0\u00a0 0:00\u00a0 \\_ \/usr\/local\/apache\/bin\/httpd -k start -DSSL\r\nmailnull 32444\u00a0 0.0\u00a0 0.0\u00a0 67568\u00a0 2696 ?\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 <strong>D<\/strong>N\u00a0\u00a0 16:55\u00a0\u00a0 0:00\u00a0 |\u00a0\u00a0 |\u00a0\u00a0 \\_ \/usr\/sbin\/sendmail -t -i\r\nnobody\u00a0\u00a0 32514\u00a0 0.0\u00a0 0.0\u00a0 88944 14756 ?\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 <strong>D<\/strong>\u00a0\u00a0\u00a0 16:56\u00a0\u00a0 0:00\u00a0 \\_ \/usr\/local\/apache\/bin\/httpd -k start -DSSL\r\nmailnull 31948\u00a0 0.0\u00a0 0.0 124048\u00a0 
4800 ?\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 <strong>D<\/strong>\u00a0\u00a0\u00a0 16:50\u00a0\u00a0 0:00\u00a0 |\u00a0\u00a0 \\_ \/usr\/sbin\/sendmail -FCronDaemon -i -odi -oem -oi -t\r\nroot\u00a0\u00a0\u00a0\u00a0\u00a0 7620\u00a0 0.0\u00a0 0.0\u00a0 44912\u00a0 1004 ?\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Ss\u00a0\u00a0 <font color=red>Dec08<\/font>\u00a0\u00a0 0:00\u00a0\u00a0\u00a0\u00a0\u00a0 \\_ (ncsa_auth) \/etc\/squid\/passwd\r\nroot\u00a0\u00a0\u00a0\u00a0\u00a0 7624\u00a0 0.0\u00a0 0.0\u00a0 44912\u00a0 1004 ?\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Ss\u00a0\u00a0 <font color=red>Dec08<\/font>\u00a0\u00a0 0:00\u00a0\u00a0\u00a0\u00a0\u00a0 \\_ (ncsa_auth) \/etc\/squid\/passwd\r\nroot\u00a0\u00a0\u00a0\u00a0\u00a0 7625\u00a0 0.0\u00a0 0.0\u00a0 44912\u00a0 1000 ?\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Ss\u00a0\u00a0 <font color=red>Dec08<\/font>\u00a0\u00a0 0:00\u00a0\u00a0\u00a0\u00a0\u00a0 \\_ (ncsa_auth) \/etc\/squid\/passwd\r\nroot\u00a0\u00a0\u00a0\u00a0\u00a0 7626\u00a0 0.0\u00a0 0.0\u00a0 44912\u00a0 1004 ?\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Ss\u00a0\u00a0 <font color=red>Dec08<\/font>\u00a0\u00a0 0:00\u00a0\u00a0\u00a0\u00a0\u00a0 \\_ (ncsa_auth) \/etc\/squid\/passwd\r\nroot\u00a0\u00a0\u00a0\u00a0\u00a0 7627\u00a0 0.0\u00a0 0.0\u00a0 44912\u00a0 1000 ?\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Ss\u00a0\u00a0 <font color=red>Dec08<\/font>\u00a0\u00a0 0:00\u00a0\u00a0\u00a0\u00a0\u00a0 \\_ (ncsa_auth) \/etc\/squid\/passwd<\/font><\/pre>\n<p><strong>HOWEVER<\/strong> it is December.  December, the only month in the year to have the bad taste to start with <code>D<\/code>.  We have processes that have been running for a few days, which started in <strong>D<\/strong>ecember.  So, for the entire month, we cannot do this:<\/p>\n<pre>ps uaxwf | grep ' D'<\/pre>\n<p>but have to do this instead:<\/p>\n<pre>ps uaxwf | grep ' D[^e]'<\/pre>\n<p>Aargh!  Bracket, shift, 6, release shift, e, bracket &#8230; Aaaaarrrgh!  
The injury!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The system load for Linux is an indication of how busy the system is, and how long you are going to wait to get something done.\u00a0 When the load is 0.0, that means that the system is ready to do &hellip; <a href=\"https:\/\/www.mcgill.org.za\/stuff\/archives\/820\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2],"tags":[130,132,37,131,17,190,11],"class_list":["post-820","post","type-post","status-publish","format-standard","hentry","category-stuff","tag-december","tag-first-world-problem","tag-linux","tag-load","tag-rants","tag-stuff","tag-stupidity"],"_links":{"self":[{"href":"https:\/\/www.mcgill.org.za\/stuff\/wp-json\/wp\/v2\/posts\/820","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.mcgill.org.za\/stuff\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.mcgill.org.za\/stuff\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.mcgill.org.za\/stuff\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.mcgill.org.za\/stuff\/wp-json\/wp\/v2\/comments?post=820"}],"version-history":[{"count":12,"href":"https:\/\/www.mcgill.org.za\/stuff\/wp-json\/wp\/v2\/posts\/820\/revisions"}],"predecessor-version":[{"id":834,"href":"https:\/\/www.mcgill.org.za\/stuff\/wp-json\/wp\/v2\/posts\/820\/revisions\/834"}],"wp:attachment":[{"href":"https:\/\/www.mcgill.org.za\/stuff\/wp-json\/wp\/v2\/media?parent=820"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.mcgill.org.za\/stuff\/wp-json\/wp\/v2\/categories?post=820"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.mcgill.org.za\/stuff\/wp-json\/wp\/v2\/tags?post=820"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","t
emplated":true}]}}
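A field-based match sidesteps the December problem described in the post entirely: instead of grepping the whole `ps` line for `' D'`, ask `ps` to print the state column explicitly and compare that field exactly, so a `Dec06` start date can never produce a false positive. This is a sketch, not the post's method; the `-eo` format specifiers are standard procps, but the sample lines below are canned, simplified stand-ins for the real `ps uaxwf` output.

```shell
# Print the state column first, then match it as a field rather than
# as a substring of the line -- works the same in December as in May.
ps -eo stat,pid,user,start,args | awk '$1 ~ /^D/'

# Why the plain grep goes wrong in December, on a canned sample
# (field 3 is the state, field 4 the start time):
sample='nobody   24458  S    Dec06  httpd
nobody   32514  D    16:56  httpd
mailnull 32444  DN   16:55  sendmail'
printf '%s\n' "$sample" | grep -c ' D'            # 3 -- the Dec06 line matches too
printf '%s\n' "$sample" | awk '$3 ~ /^D/' | wc -l # 2 -- only the D-state lines
```

The field comparison also drops the need for the `[^e]` hack, and as a side effect no longer matches the `grep` process itself, since `D` on its command line is not in the state column.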