Author:kosaki
The poll behavior behind select when the peer has already closed is not consistent...

Background
==

akr has raised the question of whether the behavior of select on Linux is a bug.
http://www.a-k-r.org/d/2010-07.html#a2010_07_06_1

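Let's look at what SUS specifies. (The passages quoted below are the wording for the writefds argument of select().)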
http://www.opengroup.org/onlinepubs/009695399/functions/poll.html

If the writefds argument is not a null pointer, it points to an object of type fd_set that on input specifies the file descriptors to be checked for being ready to write, and on output indicates which file descriptors are ready to write.

A descriptor shall be considered ready for writing when a call to an output function with O_NONBLOCK clear would not block, whether or not the function would transfer data successfully.



Hmm. akr looks right.

Next, let's look at the Linux implementation.

First, fs/select.c

#define POLLIN_SET (POLLRDNORM | POLLRDBAND | POLLIN | POLLHUP | POLLERR)
#define POLLOUT_SET (POLLWRBAND | POLLWRNORM | POLLOUT | POLLERR)
#define POLLEX_SET (POLLPRI)


The writable bit is turned on by any of four events: POLLWRBAND, POLLWRNORM, POLLOUT, and POLLERR.
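
To make the mapping concrete, here is a minimal userland sketch of how the three masks turn a poll event mask into the three result bits of select(). The mask definitions are copied from the fs/select.c excerpt above; `struct select_bits` and `select_bits_from_mask` are hypothetical names made up for this illustration, not kernel functions.

```c
#define _XOPEN_SOURCE 700
#include <poll.h>
#include <stdbool.h>

/* Copied from the fs/select.c excerpt quoted above. */
#define POLLIN_SET  (POLLRDNORM | POLLRDBAND | POLLIN | POLLHUP | POLLERR)
#define POLLOUT_SET (POLLWRBAND | POLLWRNORM | POLLOUT | POLLERR)
#define POLLEX_SET  (POLLPRI)

/* Hypothetical type: which of select()'s three sets would report the fd. */
struct select_bits {
    bool readable;
    bool writable;
    bool exceptional;
};

/* Hypothetical helper: classify a poll mask the way the three sets do. */
static struct select_bits select_bits_from_mask(short mask)
{
    struct select_bits r;

    r.readable = (mask & POLLIN_SET) != 0;
    r.writable = (mask & POLLOUT_SET) != 0;
    r.exceptional = (mask & POLLEX_SET) != 0;
    return r;
}
```

Feeding POLLERR alone into this helper sets both the readable and the writable bit, while POLLHUP alone sets only the readable bit. That asymmetry matters for everything that follows.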

Next, pipes: fs/pipe.c
/* No kernel lock held - fine */
static unsigned int
pipe_poll(struct file *filp, poll_table *wait)
{
    unsigned int mask;
    struct inode *inode = filp->f_path.dentry->d_inode;
    struct pipe_inode_info *pipe = inode->i_pipe;
    int nrbufs;

    poll_wait(filp, &pipe->wait, wait);

    /* Reading only -- no need for acquiring the semaphore. */
    nrbufs = pipe->nrbufs;
    mask = 0;
    if (filp->f_mode & FMODE_READ) {
        mask = (nrbufs > 0) ? POLLIN | POLLRDNORM : 0;
        if (!pipe->writers && filp->f_version != pipe->w_counter)
            mask |= POLLHUP;
    }

    if (filp->f_mode & FMODE_WRITE) {
        mask |= (nrbufs < pipe->buffers) ? POLLOUT | POLLWRNORM : 0;
        /*
         * Most Unices do not set POLLERR for FIFOs but on Linux they
         * behave exactly like pipes for poll().
         */
        if (!pipe->readers)
            mask |= POLLERR;
    }

    return mask;
}


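If the buffer has room, POLLOUT and POLLWRNORM are set regardless of the peer's state. Well, I suppose
that is necessary when you think about FIFOs. What I cannot understand at all is why POLLERR is set when
there is no reader. POLLERR turns on the readable bit as well, after all.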
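
This can be checked from userland. The following is a minimal sketch, assuming a Linux kernel whose pipe_poll() matches the excerpt above; `poll_pipe_writer_without_reader` is a hypothetical helper name. It creates a pipe, closes the read end, and polls the write end with a zero timeout so that only the current state is reported.

```c
#define _XOPEN_SOURCE 700
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper: poll the write end of a pipe whose read end has
 * been closed. Returns the revents value, or -1 if pipe()/poll() failed.
 */
static int poll_pipe_writer_without_reader(void)
{
    int fds[2];
    struct pollfd pfd;
    int n;

    if (pipe(fds) != 0)
        return -1;
    close(fds[0]);              /* no reader is left */

    pfd.fd = fds[1];
    pfd.events = POLLIN | POLLOUT;
    pfd.revents = 0;
    n = poll(&pfd, 1, 0);       /* timeout 0: do not wait */
    close(fds[1]);

    return n < 0 ? -1 : pfd.revents;
}
```

If the returned value contains POLLERR, then select() would mark the write end in readfds as well as writefds, because POLLERR belongs to both POLLIN_SET and POLLOUT_SET. Other systems may report a different set of bits here.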


Next, Unix domain sockets: net/unix/af_unix.c


static unsigned int unix_poll(struct file *file, struct socket *sock, poll_table *wait)
{
    struct sock *sk = sock->sk;
    unsigned int mask;

    sock_poll_wait(file, sk_sleep(sk), wait);
    mask = 0;

    /* exceptional events? */
    if (sk->sk_err)
        mask |= POLLERR;
    if (sk->sk_shutdown == SHUTDOWN_MASK)
        mask |= POLLHUP;
    if (sk->sk_shutdown & RCV_SHUTDOWN)
        mask |= POLLRDHUP;

    /* readable? */
    if (!skb_queue_empty(&sk->sk_receive_queue) ||
        (sk->sk_shutdown & RCV_SHUTDOWN))
        mask |= POLLIN | POLLRDNORM;

    /* Connection-based need to check for termination and startup */
    if ((sk->sk_type == SOCK_STREAM || sk->sk_type == SOCK_SEQPACKET) &&
        sk->sk_state == TCP_CLOSE)
        mask |= POLLHUP;

    /*
     * we set writable also when the other side has shut down the
     * connection. This prevents stuck sockets.
     */
    if (unix_writable(sk))
        mask |= POLLOUT | POLLWRNORM | POLLWRBAND;

    return mask;
}

static inline int unix_writable(struct sock *sk)
{
    return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
}


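POLLERR is set only when a socket error has occurred. When the peer has closed (sk_shutdown == SHUTDOWN_MASK), POLLHUP is set.
Regardless of the peer's state, POLLOUT, POLLWRNORM, and POLLWRBAND are all set as long as the buffer has room.
I don't really see why POLLWRBAND is set.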
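
The same kind of userland check works here. A minimal sketch, assuming a Linux kernel whose unix_poll() matches the excerpt above; `poll_unix_socket_after_peer_close` is a hypothetical helper name. It closes one end of a socketpair() and polls the surviving end without waiting.

```c
#define _XOPEN_SOURCE 700
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: poll one end of an AF_UNIX stream socket pair
 * after the other end has been closed. Returns revents, or -1 on failure.
 */
static int poll_unix_socket_after_peer_close(void)
{
    int sv[2];
    struct pollfd pfd;
    int n;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    close(sv[1]);               /* the peer goes away */

    pfd.fd = sv[0];
    pfd.events = POLLIN | POLLOUT | POLLWRBAND;
    pfd.revents = 0;
    n = poll(&pfd, 1, 0);       /* timeout 0: do not wait */
    close(sv[0]);

    return n < 0 ? -1 : pfd.revents;
}
```

Since nothing was ever sent, the send buffer is empty, so by the reading of unix_poll() above the write bits should appear next to POLLHUP. Testing the result against POLLHUP, POLLOUT, and POLLWRBAND shows whether the running kernel agrees.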

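My guess, given that POLLWRBAND is specified as quoted below: on OSes where sockets were implemented
on top of STREAMS, using a Unix domain socket apparently sets POLLWRBAND, so Linux may simply
have imitated that behavior.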
http://www.opengroup.org/onlinepubs/009695399/functions/poll.html


POLLWRBAND
Priority data may be written.

[XSR] [Option Start] For STREAMS, data on priority bands greater than 0 may be written without blocking. If any priority band has been written to on this STREAM, this event only examines bands that have been written to at least once. [Option End]




Finally, the poll implementation for TCP, which is the one in question: net/ipv4/tcp.c


/*
 * Wait for a TCP event.
 *
 * Note that we don't need to lock the socket, as the upper poll layers
 * take care of normal races (between the test and the event) and we don't
 * go look at any of the socket buffers directly.
 */
unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
{
    unsigned int mask;
    struct sock *sk = sock->sk;
    struct tcp_sock *tp = tcp_sk(sk);

    sock_poll_wait(file, sk_sleep(sk), wait);
    if (sk->sk_state == TCP_LISTEN)
        return inet_csk_listen_poll(sk);

    /* Socket is not locked. We are protected from async events
     * by poll logic and correct handling of state changes
     * made by other threads is impossible in any case.
     */

    mask = 0;
    if (sk->sk_err)
        mask = POLLERR;

    /*
     * POLLHUP is certainly not done right. But poll() doesn't
     * have a notion of HUP in just one direction, and for a
     * socket the read side is more interesting.
     *
     * Some poll() documentation says that POLLHUP is incompatible
     * with the POLLOUT/POLLWR flags, so somebody should check this
     * all. But careful, it tends to be safer to return too many
     * bits than too few, and you can easily break real applications
     * if you don't tell them that something has hung up!
     *
     * Check-me.
     *
     * Check number 1. POLLHUP is _UNMASKABLE_ event (see UNIX98 and
     * our fs/select.c). It means that after we received EOF,
     * poll always returns immediately, making impossible poll() on write()
     * in state CLOSE_WAIT. One solution is evident --- to set POLLHUP
     * if and only if shutdown has been made in both directions.
     * Actually, it is interesting to look how Solaris and DUX
     * solve this dilemma. I would prefer, if POLLHUP were maskable,
     * then we could set it on SND_SHUTDOWN. BTW examples given
     * in Stevens' books assume exactly this behaviour, it explains
     * why POLLHUP is incompatible with POLLOUT. --ANK
     *
     * NOTE. Check for TCP_CLOSE is added. The goal is to prevent
     * blocking on fresh not-connected or disconnected socket. --ANK
     */
    if (sk->sk_shutdown == SHUTDOWN_MASK || sk->sk_state == TCP_CLOSE)
        mask |= POLLHUP;
    if (sk->sk_shutdown & RCV_SHUTDOWN)
        mask |= POLLIN | POLLRDNORM | POLLRDHUP;

    /* Connected? */
    if ((1 << sk->sk_state) & ~(TCPF_SYN_SENT | TCPF_SYN_RECV)) {
        int target = sock_rcvlowat(sk, 0, INT_MAX);

        if (tp->urg_seq == tp->copied_seq &&
            !sock_flag(sk, SOCK_URGINLINE) &&
            tp->urg_data)
            target++;

        /* Potential race condition. If read of tp below will
         * escape above sk->sk_state, we can be illegally awaken
         * in SYN_* states. */
        if (tp->rcv_nxt - tp->copied_seq >= target)
            mask |= POLLIN | POLLRDNORM;

        if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {
            if (sk_stream_wspace(sk) >= sk_stream_min_wspace(sk)) {
                mask |= POLLOUT | POLLWRNORM;
            } else { /* send SIGIO later */
                set_bit(SOCK_ASYNC_NOSPACE,
                        &sk->sk_socket->flags);
                set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);

                /* Race breaker. If space is freed after
                 * wspace test but before the flags are set,
                 * IO signal will be lost.
                 */
                if (sk_stream_wspace(sk) >= sk_stream_min_wspace(sk))
                    mask |= POLLOUT | POLLWRNORM;
            }
        }

        if (tp->urg_data & TCP_URG_VALID)
            mask |= POLLPRI;
    }
    return mask;
}


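POLLERR is handled the same way as for Unix domain sockets: the peer merely closing does not raise ERR.
When the connection has been closed (the sk->sk_shutdown & SEND_SHUTDOWN case), none of the writable bits is set. In other words, a select for writing blocks!
Unlike Unix domain sockets, POLLWRBAND is not used.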
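
To spell that reasoning out, here is a small userland model of just the write-side branch of the tcp_poll() excerpt above. It is a sketch for illustration, not kernel code: `tcp_write_mask_model` and `select_would_report_writable` are hypothetical names, the three parameters stand in for kernel state, and the SOCK_NOSPACE race breaker is left out because it repeats the same comparison.

```c
#define _XOPEN_SOURCE 700
#include <poll.h>
#include <stdbool.h>

/*
 * Hypothetical model of the write-side branch of tcp_poll() as quoted.
 *   send_shutdown : stands in for sk->sk_shutdown & SEND_SHUTDOWN
 *   wspace        : stands in for sk_stream_wspace(sk)
 *   min_wspace    : stands in for sk_stream_min_wspace(sk)
 */
static short tcp_write_mask_model(bool send_shutdown, int wspace, int min_wspace)
{
    short mask = 0;

    if (!send_shutdown) {
        if (wspace >= min_wspace)
            mask |= POLLOUT | POLLWRNORM;
    }
    /* With SEND_SHUTDOWN set, no branch adds a write bit. */
    return mask;
}

/* select() marks an fd writable when the mask intersects POLLOUT_SET. */
static bool select_would_report_writable(short mask)
{
    return (mask & (POLLWRBAND | POLLWRNORM | POLLOUT | POLLERR)) != 0;
}
```

In this model, `tcp_write_mask_model(true, ...)` yields 0 no matter how much buffer space is free, so `select_would_report_writable()` is false unless POLLERR is raised separately through sk_err. Whether a given running kernel still behaves like the excerpt is a separate question that this model does not answer.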


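Well, I feel that tcp_poll() ought to be fixed.
Going further, Unix domain sockets decide writability by checking the buffer size, which means
they do not satisfy akr's claim either, namely that a situation where the write would fail with EPIPE
should be reported as writable. So this should be fixed at the same time.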
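
The Unix domain socket case can be probed from userland as well. The following is a minimal sketch, assuming Linux (it relies on MSG_NOSIGNAL and on a unix_poll() like the excerpt above); `struct epipe_probe` and `probe_full_unix_socket_after_shutdown` are hypothetical names. It fills the send buffer of a socketpair() end, shuts down its write side, and then records both what poll() says about POLLOUT and what send() actually does.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical result type for this probe. */
struct epipe_probe {
    int send_errno;  /* errno of send() after shutdown(SHUT_WR) */
    int revents;     /* what poll() reported for POLLOUT just before */
};

/* Returns 0 and fills *out on success, -1 if the setup itself failed. */
static int probe_full_unix_socket_after_shutdown(struct epipe_probe *out)
{
    int sv[2];
    int flags;
    int ret = -1;
    char buf[4096];
    struct pollfd pfd;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    flags = fcntl(sv[0], F_GETFL);
    if (flags < 0 || fcntl(sv[0], F_SETFL, flags | O_NONBLOCK) != 0)
        goto done;

    /* Fill the send buffer. The peer never reads, so this ends in EAGAIN. */
    memset(buf, 'x', sizeof(buf));
    while (send(sv[0], buf, sizeof(buf), MSG_NOSIGNAL) > 0)
        ;
    if (errno != EAGAIN && errno != EWOULDBLOCK)
        goto done;

    if (shutdown(sv[0], SHUT_WR) != 0)
        goto done;

    pfd.fd = sv[0];
    pfd.events = POLLOUT;
    pfd.revents = 0;
    if (poll(&pfd, 1, 0) < 0)   /* timeout 0: do not wait */
        goto done;
    out->revents = pfd.revents;

    /* MSG_NOSIGNAL turns the SIGPIPE into a plain EPIPE error return. */
    if (send(sv[0], buf, 1, MSG_NOSIGNAL) >= 0)
        goto done;
    out->send_errno = errno;
    ret = 0;

done:
    close(sv[0]);
    close(sv[1]);
    return ret;
}
```

If `send_errno` comes back as EPIPE while `revents` lacks POLLOUT, the write would have returned immediately even though poll() did not report the socket as writable, which is the mismatch with the SUS wording discussed above.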

Hmm. Is that really right? I'm not entirely confident.

linux | 【2010-07-07(Wed) 08:59:17】 | Trackback:(0) | Comments:(0)