A machine doing NAT would stall briefly but noticeably from time to time, yet netstat showed no packet loss, which had been puzzling for a long time.
By chance I ran netstat -s and noticed an entry in the ip section: "packets not forwardable". The count was quite large; when traffic was heavy it could grow by more than a hundred packets per second. As the name suggests, a fair number of packets were not being forwarded, and that is most likely what caused the stalls.
After going through a long list of parameters, the culprit finally turned out to be "fast forwarding".
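To watch how quickly this counter grows, a simple loop over netstat -s is enough (a minimal sketch; the grep pattern matches the counter name shown in the output below):
# while true; do netstat -sp ip | grep 'not forwardable'; sleep 1; done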
First, the experimental result:
# sysctl net.inet.ip.fastforwarding=1
One minute later:
# netstat -ssp ip
ip:
1771218 total packets received
58439 packets for this host
19 packets for unknown/unsupported protocol
1692988 packets forwarded (16079 packets fast forwarded)
483 packets not forwardable
1339 packets sent from this host
53 output packets dropped due to no bufs, etc.
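The one-minute reading above can be reproduced by taking two snapshots of the same counters and comparing them; the counters are cumulative since boot, so only the difference between the two snapshots matters (a minimal sketch, the temporary file names are arbitrary):
# netstat -ssp ip > /tmp/ip.before
# sleep 60
# netstat -ssp ip > /tmp/ip.after
# diff /tmp/ip.before /tmp/ip.after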
However, netstat -idb shows no errors at all. The reason is that netstat reports errored packets and dropped packets, and packets that were simply not forwarded fall into neither category, which is why the problem went unnoticed for such a long time.
# netstat -idbh | head
Name Mtu Network Address Ipkts Ierrs Idrop Ibytes Opkts Oerrs Obytes Coll Drop
em0 1500 <Link#1> 01:02:03:04:50:50 602M 0 0 262G 627M 0 575G 0 0
em0 1500 192.168.0.0/24 192.168.0.3 0 - - 0 0 - 0 - -
em1 1500 <Link#2> 01:02:03:04:50:51 614M 1 0 571G 555M 0 237G 0 0
em1 1500 10.10.10.1/24 10.10.10.5 7.7M - - 504M 10K - 715K - -
lo0 16384 <Link#3> 0 0 0 0 0 0 0 0 0
lo0 16384 your-net localhost 0 - - 0 0 - 0 - -
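The interface counters can also be watched live with netstat's interval mode; they stay clean here because a packet the IP layer declines to forward is counted neither as an input error nor as a driver drop (a minimal sketch using em1 from the table above; stop it with Ctrl-C):
# netstat -w 1 -I em1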
Meanwhile netstat -m also looks fairly normal and nothing has reached its limit, which makes this a rather well-hidden problem:
# netstat -m
8370/5760/14130 mbufs in use (current/cache/total)
8367/3489/11856/492680 mbuf clusters in use (current/cache/total/max)
8367/3475 mbuf+clusters out of packet secondary zone in use (current/cache)
0/7/7/246339 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/36784 9k jumbo clusters in use (current/cache/total/max)
0/0/0/20691 16k jumbo clusters in use (current/cache/total/max)
18826K/8446K/27272K bytes allocated to network (current/cache/total)
528/240/8342 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
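The mbuf zones can be cross-checked against their limits through the UMA allocator statistics as well, which tell the same story (a minimal sketch; zone names differ slightly between FreeBSD versions):
# vmstat -z | egrep 'ITEM|mbuf'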
After disabling "fast forwarding", everything went back to normal. Note that the netstat -s output no longer contains the "fast forwarded" wording:
# sysctl net.inet.ip.fastforwarding=0
# netstat -ssp ip
ip:
1825218 total packets received
xxxx packets for this host
xx packets for unknown/unsupported protocol
xxxxxx packets forwarded
xxxx packets sent from this host
xxx output packets dropped due to no bufs, etc.
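To keep fast forwarding disabled across reboots, the setting can also be written to /etc/sysctl.conf; 0 happens to be the FreeBSD default, so the entry mostly documents the decision (a minimal sketch):
# echo 'net.inet.ip.fastforwarding=0' >> /etc/sysctl.conf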
Many tuning guides on the web recommend enabling fastforward, so why does it cause problems here? After some more searching I found an article that offers an explanation:
http://alter.org.ua/soft/fbsd/netisr/
net.inet.ip.fastforwarding - process incoming packets immediately (including ipfw) in context of interrupt service routine before passing to netisr queue. Like net.isr.direct, this option should be used when number of CPU cores is less or equal than number of NICs. Impacts INCOMING traffic performance. Note, that not all NICs can correctly queue incoming packets while current packet is under processing. If you have net.inet.ip.fastforwarding enabled, you will meet the following side-effect: routing and passing packet through ipfw are processed in the same execution context inside interrupt service routine. It means, that processing of new incoming packets is blocked during this time. Such behavior is effective when you have mainly incoming traffic or have the only CPU core in system.
Roughly speaking: with fastforwarding enabled, incoming packets are no longer handed to the netisr queue but are processed directly in the interrupt context. This is useful on systems where the number of CPU cores is no greater than the number of NICs, and it can noticeably improve performance in the incoming direction (translator's note: the same page also describes an isr option that improves the outgoing direction). But beware: not every NIC can correctly queue incoming packets while the current packet is still being processed. With this option enabled you can run into a side effect: packet routing and the ipfw pass happen in the same execution context inside the interrupt service routine, which means that processing of newly arriving packets is blocked during that time. This is especially likely when a large volume of packets is coming in or the system has only one CPU.
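Whether the rule of thumb from the quote applies to a particular box is easy to check by comparing the number of CPU cores against the number of NICs and looking at the related netisr knobs (a minimal sketch; the exact net.isr sysctls vary between FreeBSD releases, e.g. net.isr.direct only exists on older ones):
# sysctl hw.ncpu
# sysctl net.isr
# sysctl net.inet.ip.fastforwarding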
The NIC here is an Intel 82574L, which apparently falls into the category hinted at above, so fastforwarding is better left disabled. Judging by the netstat -s numbers it only fast-forwards about one percent of the packets, and trading packet loss for that one percent is hardly worth it.
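For reference, the "about one percent" figure follows directly from the earlier counters: 16079 packets fast forwarded out of 1692988 packets forwarded (a quick check with bc, numbers copied from the one-minute sample above):
# echo 'scale=4; 16079*100/1692988' | bc
which works out to roughly 0.95 percent.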