{"id":897,"date":"2014-05-21T17:49:04","date_gmt":"2014-05-21T15:49:04","guid":{"rendered":"\/\/www.mcgill.org.za\/stuff\/?p=897"},"modified":"2014-05-22T09:09:32","modified_gmt":"2014-05-22T07:09:32","slug":"vmware-ipfix-nat-intermittent-fail","status":"publish","type":"post","link":"https:\/\/www.mcgill.org.za\/stuff\/archives\/897","title":{"rendered":"VMWare + ipfix + NAT = intermittent fail"},"content":{"rendered":"<p>So I put all of these things together a while ago for bandwidth reporting:<\/p>\n<ul>\n<li><strong>VMware<\/strong>&#8216;s netflow reporting &#8211; I configured a virtual\u00a0distributed switch to send netflow reporting to a collector. \u00a0Every time some machine runs up its internet usage, the bytes and things get tallied up and sent off for collection.<\/li>\n<li><strong>NAT<\/strong>: in order to reach the collector, the <a title=\"Netflow, according to wikipedia, that unbiased source, at least for computer technical detail\" href=\"http:\/\/en.wikipedia.org\/wiki\/NetFlow\" target=\"_blank\">netflow<\/a> packets pass through a NAT gateway, which rewrites them because the source network address is on a private network. \u00a0Okay, so that makes no sense if you don&#8217;t speak jargon, but that&#8217;s just tough. \u00a0The packets went through this thing. \u00a0Come to think of it, I didn&#8217;t even consider how the packets were getting routed from the source to the collector &#8211; they arrived in good order, so I was happy, and I thought no further.<\/li>\n<li><strong>ipfix<\/strong>: VMware doesn&#8217;t send netflow packets, but sends something like netflow called <a href=\"http:\/\/en.wikipedia.org\/wiki\/IP_Flow_Information_Export\" target=\"_blank\">ipfix<\/a>.<\/li>\n<\/ul>\n<p>And, for good measure:<\/p>\n<ul>\n<li>My preferred flow capture thing didn&#8217;t speak ipfix, so I used <a title=\"nfdump ... 
like tcpdump, just less mature\" href=\"http:\/\/nfdump.sourceforge.net\/\" target=\"_blank\">nfcapd<\/a> to translate the packets.<\/li>\n<\/ul>\n<p>And it worked wonderfully &#8211; one of those quick hacks that becomes permanent. It was perfect(ish) &#8230; until the tenth\u00a0of this month. \u00a0At that point, it started logging very odd bandwidth information:<\/p>\n<pre>srcIP    dstIP      prot  srcPort  dstPort  octets     packets\r\n0.0.0.0  0.0.0.1    0     23047    38382    285327087  43\r\n0.0.0.0  0.0.0.231  0     0        0        325        285327087\r\n<\/pre>\n<p>The famous IP address 0.0.0.0 sent 43 packets with a total of 285327087 bytes &#8230; that means that some of those packets were seriously <i>huge<\/i> &#8211; mega jumbo frames of about 6.3MiB each. Then it sent 285327087 packets totalling 325 bytes. The famous <a title=\"NULL packet RFC\" href=\"http:\/\/tools.ietf.org\/html\/rfc6592\" target=\"_blank\">NULL packet<\/a> was detected! \u00a0This information appeared in the flow-capture log &#8230; and investigation showed that it was in the nfcapd logs. \u00a0Wireshark could not understand the captures, which left the prime suspect as VMware &#8211; surely they updated something and broke everything.\u00a0Maybe a memory error on the collector, or maybe a memory error on the netflow originator &#8230; what could it be?<\/p>\n<p>It turns out that what lies at the bottom of this pile of steaming randomness is that the ipfix\/netflow streams from multiple VMware ESX servers together\u00a0look like\u00a0<em>one<\/em> stream from a single confused host, rather than discrete streams from a number of slightly less confused hosts. The collector sees them all as having the same source IP address. 
\u00a0Usually the data sent by different VMware hosts would be sent at different times, but as the load increased, and as time passed and things went in and out of sync, so it became more likely that there would be a collision, confusion and corruption.<\/p>\n<p>All I needed to do was remove the NAT, and the problem evaporated.<\/p>\n<p>The moral of the story is: netflow and NAT do not mix. \u00a0Don&#8217;t cross the beams.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>So I put all of these things together a while ago for bandwidth reporting: VMWare&#8216;s netflow reporting &#8211; I configured a virtual\u00a0distributed switch to send netflow reporting to a collector. \u00a0Every time some machine runs up its internet usage, the &hellip; <a href=\"https:\/\/www.mcgill.org.za\/stuff\/archives\/897\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2],"tags":[159,157,33,190,11,158],"class_list":["post-897","post","type-post","status-publish","format-standard","hentry","category-stuff","tag-ipfix","tag-netflow","tag-networking","tag-stuff","tag-stupidity","tag-vmware"],"_links":{"self":[{"href":"https:\/\/www.mcgill.org.za\/stuff\/wp-json\/wp\/v2\/posts\/897","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.mcgill.org.za\/stuff\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.mcgill.org.za\/stuff\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.mcgill.org.za\/stuff\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.mcgill.org.za\/stuff\/wp-json\/wp\/v2\/comments?post=897"}],"version-history":[{"count":6,"href":"https:\/\/www.mcgill.org.za\/stuff\/wp-json\/wp\/v2\/posts\/897\/revisions"}],"predecessor-version":[{"id":903,"href":"https:\/\/www.mcgill.org.za\/stuff\
/wp-json\/wp\/v2\/posts\/897\/revisions\/903"}],"wp:attachment":[{"href":"https:\/\/www.mcgill.org.za\/stuff\/wp-json\/wp\/v2\/media?parent=897"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.mcgill.org.za\/stuff\/wp-json\/wp\/v2\/categories?post=897"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.mcgill.org.za\/stuff\/wp-json\/wp\/v2\/tags?post=897"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}