
Linux Kernel Source Analysis: epoll

2016-11-01 posted in [tech]

Preface

In the previous article, NGINX Source Analysis: Event Model, we saw that NGINX wraps epoll as the core module of its event model, driving both its listening loop and its data send/receive loop.

epoll is the Linux kernel's improved poll, designed for handling large numbers of file descriptors. It is the enhanced version of the Linux multiplexed I/O interfaces select/poll, and it markedly improves CPU efficiency when a program handles a large number of concurrent connections of which only a few are active at any time.

A brief summary of the differences among select, poll, and epoll (a short select() sketch follows the list to make the scanning cost concrete):

  • select:

    • the number of monitored file descriptors has a hard upper limit (FD_SETSIZE, typically 1024);

    • the readiness check has to walk all monitored descriptors, which is inefficient;

    • the descriptor sets are copied between user space and kernel space on every call, which is costly;

  • poll:

    • no hard limit on the number of descriptors, since the caller passes in an arbitrarily long pollfd array rather than a fixed-size set;

    • the readiness check still has to walk all monitored descriptors, which is inefficient;

    • the descriptor array is still copied between user space and kernel space on every call, which is costly;

  • epoll:

    • no fixed limit on the number of connections; the ceiling is the maximum number of files that can be opened, roughly 100,000 on a machine with 1 GB of memory;

    • the readiness check only walks events that are already ready, which is efficient;

    • the full interest set is not copied back and forth on every call; the kernel writes only the ready events into the user-supplied buffer;

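To make the "scan everything" cost concrete, the sketch below shows roughly what one round of a select()-based loop looks like. It is only an illustration: "fds", "nfds" and "handle_ready()" are placeholders invented for this sketch, and all descriptors are assumed to be below FD_SETSIZE.


/* Illustration of why select() scales poorly: every call rebuilds the set
 * and the caller scans all monitored descriptors afterwards.
 * "fds", "nfds" and "handle_ready" are placeholders for this sketch. */
#include <sys/select.h>

void select_round(const int *fds, int nfds, void (*handle_ready)(int))
{
    fd_set rfds;
    int i, maxfd = -1;

    FD_ZERO(&rfds);
    for (i = 0; i < nfds; i++) {          /* O(n): rebuild the set on every call */
        FD_SET(fds[i], &rfds);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }

    if (select(maxfd + 1, &rfds, NULL, NULL, NULL) <= 0)
        return;

    for (i = 0; i < nfds; i++)            /* O(n): scan everything to find the ready fds */
        if (FD_ISSET(fds[i], &rfds))
            handle_ready(fds[i]);
}
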
epoll has already been written about to death, because its implementation is remarkably compact: a single file, "fs/eventpoll.c", only a little over 2000 lines. Yet it is used everywhere; the event model of essentially every high-performance web server wraps epoll in some form.

So this article is simply a memo, a quick walk through epoll's implementation in the Linux kernel source.

The implementation of epoll

epoll exposes the following three system calls:


int epoll_create(int size);
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);

  • "epoll_create()" creates an epoll handle, which itself occupies a file descriptor;

  • "epoll_ctl()" registers an event with epoll and specifies which event types to monitor;

  • "epoll_wait()" waits for and collects activated events; its "events" and "maxevents" arguments describe the buffer the kernel fills with ready events (the three calls are combined in the minimal sketch below);

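The three calls are typically combined as in the sketch below. This is only a minimal illustration, not a complete server: "listen_fd" is assumed to be an already listening, non-blocking socket created elsewhere, and error handling is reduced to the bare minimum.


/* Minimal epoll usage sketch; "listen_fd" is assumed to be a listening,
 * non-blocking socket created elsewhere. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/epoll.h>

#define MAX_EVENTS 64

static void event_loop(int listen_fd)
{
    struct epoll_event ev, events[MAX_EVENTS];
    int epfd, nready, i;

    epfd = epoll_create1(0);                      /* create the eventpoll instance */
    if (epfd < 0) {
        perror("epoll_create1");
        exit(EXIT_FAILURE);
    }

    ev.events = EPOLLIN;                          /* level-triggered read readiness */
    ev.data.fd = listen_fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev) < 0) {   /* register the fd */
        perror("epoll_ctl");
        exit(EXIT_FAILURE);
    }

    for (;;) {
        nready = epoll_wait(epfd, events, MAX_EVENTS, -1);      /* block until ready */
        if (nready < 0) {
            perror("epoll_wait");
            break;
        }
        for (i = 0; i < nready; i++) {
            if (events[i].data.fd == listen_fd) {
                /* accept() new connections and EPOLL_CTL_ADD them here */
            } else {
                /* read()/write() the ready connection here */
            }
        }
    }
    close(epfd);
}
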
Part 1: Creating an epoll instance

"epoll_create()" creates a target epoll handle; obviously, epoll has to occupy some system resources for it.

Part of those resources is in fact allocated and initialized by the Linux kernel ahead of time, during kernel initialization.


/* The kernel source version used throughout this article is linux-3.10.103 */

/* linux-3.10.103/linux-3.10.103/fs/eventpoll.c */

static int __init eventpoll_init(void)
{
    struct sysinfo si;

    si_meminfo(&si);
    /*
     * Allows top 4% of lomem to be allocated for epoll watches (per user).
     */
    max_user_watches = (((si.totalram - si.totalhigh) / 25) << PAGE_SHIFT) /
        EP_ITEM_COST;
    BUG_ON(max_user_watches < 0);

    /*
     * Initialize the structure used to perform epoll file descriptor
     * inclusion loops checks.
     */
    ep_nested_calls_init(&poll_loop_ncalls);

    /* Initialize the structure used to perform safe poll wait head wake ups */
    ep_nested_calls_init(&poll_safewake_ncalls);

    /* Initialize the structure used to perform file's f_op->poll() calls */
    ep_nested_calls_init(&poll_readywalk_ncalls);

    /*
     * We can have many thousands of epitems, so prevent this from
     * using an extra cache line on 64-bit (and smaller) CPUs
     */
    BUILD_BUG_ON(sizeof(void *) <= 8 && sizeof(struct epitem) > 128);

    /* Allocates slab cache used to allocate "struct epitem" items */
    epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem),
            0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);

    /* Allocates slab cache used to allocate "struct eppoll_entry" */
    pwq_cache = kmem_cache_create("eventpoll_pwq",
            sizeof(struct eppoll_entry), 0, SLAB_PANIC, NULL);

    return 0;
}

"eventpoll_init()" is invoked during kernel initialization to set up epoll's resources. From its source we can see that the kernel creates two slab caches for epoll, "eventpoll_epi" and "eventpoll_pwq"; they can be observed directly through the proc filesystem (the exact entries depend on the kernel version):


shell# cat /proc/slabinfo

Obviously, epoll allocates and frees these objects very frequently, and the slab caches keep that memory management as cheap as possible.

Next, the source of "epoll_create()":


/* linux-3.10.103/linux-3.10.103/fs/eventpoll.c */

/*
 * Open an eventpoll file descriptor.
 */
SYSCALL_DEFINE1(epoll_create1, int, flags)
{
    int error, fd;
    struct eventpoll *ep = NULL;
    struct file *file;

    /* Check the EPOLL_* constant for consistency.  */
    BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC);

    if (flags & ~EPOLL_CLOEXEC)
        return -EINVAL;
    /*
     * Create the internal data structure ("struct eventpoll").
     */
    error = ep_alloc(&ep);
    if (error < 0)
        return error;
    /*
     * Creates all the items needed to setup an eventpoll file. That is,
     * a file structure and a free file descriptor.
     */
    fd = get_unused_fd_flags(O_RDWR | (flags & O_CLOEXEC));
    if (fd < 0) {
        error = fd;
        goto out_free_ep;
    }
    file = anon_inode_getfile("[eventpoll]", &eventpoll_fops, ep,
                 O_RDWR | (flags & O_CLOEXEC));
    if (IS_ERR(file)) {
        error = PTR_ERR(file);
        goto out_free_fd;
    }
    ep->file = file;
    fd_install(fd, file);
    return fd;

out_free_fd:
    put_unused_fd(fd);
out_free_ep:
    ep_free(ep);
    return error;
}

As we can see, the logic of "epoll_create()" is not complicated (note that the syscall shown is epoll_create1(); in this kernel version epoll_create(size) merely checks that size is positive and forwards to it). "ep_alloc()" allocates a "struct eventpoll" object, the core data structure of epoll; it is tied together with an "fd" and a "file", and all subsequent operations on this epoll instance go through that associated "fd".

"fd" is what user space uses to operate on the epoll instance, while "file" anchors it in the Linux file machinery: an epoll instance exists as a file, but an anonymous one, since it only needs to be associated with an inode and does not need more elaborate file operations.

"anon_inode_getfile()" creates that anonymous file; just as importantly, this is where epoll stores "ep" in "file->private_data", so the eventpoll object can be recovered from the file later.

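This file association is also how the kernel later recognizes an eventpoll file (the "is_file_epoll()" checks in "epoll_ctl()" and "epoll_wait()" below); to the best of my reading of this kernel version, the helper is nothing more than a comparison of file operations:


static inline int is_file_epoll(struct file *f)
{
    return f->f_op == &eventpoll_fops;
}
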
In addition, "ep_alloc()" initializes a number of very important members when it instantiates the eventpoll object.


/* linux-3.10.103/linux-3.10.103/fs/eventpoll.c */

static int ep_alloc(struct eventpoll **pep)
{
    int error;
    struct user_struct *user;
    struct eventpoll *ep;

    user = get_current_user();
    error = -ENOMEM;
    ep = kzalloc(sizeof(*ep), GFP_KERNEL);
    if (unlikely(!ep))
        goto free_uid;

    spin_lock_init(&ep->lock);
    mutex_init(&ep->mtx);
    init_waitqueue_head(&ep->wq);
    init_waitqueue_head(&ep->poll_wait);
    INIT_LIST_HEAD(&ep->rdllist);
    ep->rbr = RB_ROOT;
    ep->ovflist = EP_UNACTIVE_PTR;
    ep->user = user;

    *pep = ep;

    return 0;

free_uid:
    free_uid(user);
    return error;
}

  • ep->wq: the wait queue on which tasks blocked in "epoll_wait()" sleep until events become ready;

  • ep->poll_wait: the wait queue used when the eventpoll file itself is polled, for example when one epoll instance is monitored by another one;

  • ep->rdllist: the ready list that collects the events that have become ready (an abridged view of the whole structure follows);

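Here is an abridged "struct eventpoll", trimmed to the fields that "ep_alloc()" touches above; the full definition in fs/eventpoll.c carries a few more members (wakeup-source and loop-check bookkeeping) that are omitted here.


/* Abridged from fs/eventpoll.c (linux-3.10.x); some fields omitted */
struct eventpoll {
    spinlock_t lock;              /* protects rdllist and ovflist */
    struct mutex mtx;             /* serializes epoll_ctl() and event transfer */
    wait_queue_head_t wq;         /* tasks sleeping in epoll_wait() */
    wait_queue_head_t poll_wait;  /* used when this eventpoll file is itself polled */
    struct list_head rdllist;     /* list of ready epitems */
    struct rb_root rbr;           /* red-black tree of all monitored epitems */
    struct epitem *ovflist;       /* side list used while events are being transferred */
    struct user_struct *user;     /* owner, charged against max_user_watches */
    struct file *file;            /* the anonymous "[eventpoll]" file */
};
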
Part 2: Registering events

"epoll_ctl()" registers an event with epoll: "epfd" is the "fd" returned by "epoll_create()", "op" selects the add/delete/modify operation, "fd" is the socket (or other descriptor) to be monitored, and "event" describes the events to watch and the triggering mode (level-triggered or edge-triggered). A hedged user-space sketch of an edge-triggered registration comes first, then the kernel source:

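As an aside on the "event" argument: with EPOLLET the application must keep reading until the call returns EAGAIN, otherwise data that is already buffered will not generate another notification. The sketch below is only an illustration; "conn_fd" is assumed to be a connected, non-blocking socket and "epfd" a valid epoll descriptor.


/* Hedged sketch: edge-triggered registration plus the drain-until-EAGAIN pattern.
 * "conn_fd" is assumed to be a connected, non-blocking socket. */
#include <errno.h>
#include <unistd.h>
#include <sys/epoll.h>

static int watch_edge_triggered(int epfd, int conn_fd)
{
    struct epoll_event ev;

    ev.events = EPOLLIN | EPOLLET;      /* edge-triggered read readiness */
    ev.data.fd = conn_fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, conn_fd, &ev);
}

static void drain(int conn_fd)
{
    char buf[4096];
    ssize_t n;

    /* With EPOLLET, keep reading until the kernel reports EAGAIN;
     * stopping earlier would leave buffered data without a new event. */
    for (;;) {
        n = read(conn_fd, buf, sizeof(buf));
        if (n > 0)
            continue;                   /* data consumed (processing omitted) */
        if (n == 0)
            break;                      /* peer closed the connection */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;                      /* drained, wait for the next edge */
        break;                          /* real error (handling omitted) */
    }
}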

/* linux-3.10.103/linux-3.10.103/fs/eventpoll.c */

/*
 * The following function implements the controller interface for
 * the eventpoll file that enables the insertion/removal/change of
 * file descriptors inside the interest set.
 */
SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
        struct epoll_event __user *, event)
{
    int error;
    int did_lock_epmutex = 0;
    struct file *file, *tfile;
    struct eventpoll *ep;
    struct epitem *epi;
    struct epoll_event epds;

    error = -EFAULT;
    if (ep_op_has_event(op) &&
        copy_from_user(&epds, event, sizeof(struct epoll_event)))
        goto error_return;

    /* Get the "struct file *" for the eventpoll file */
    error = -EBADF;
    file = fget(epfd);
    if (!file)
        goto error_return;

    /* Get the "struct file *" for the target file */
    tfile = fget(fd);
    if (!tfile)
        goto error_fput;

    /* The target file descriptor must support poll */
    error = -EPERM;
    if (!tfile->f_op || !tfile->f_op->poll)
        goto error_tgt_fput;

    /* Check if EPOLLWAKEUP is allowed */
    if ((epds.events & EPOLLWAKEUP) && !capable(CAP_BLOCK_SUSPEND))
        epds.events &= ~EPOLLWAKEUP;

    /*
     * We have to check that the file structure underneath the file descriptor
     * the user passed to us _is_ an eventpoll file. And also we do not permit
     * adding an epoll file descriptor inside itself.
     */
    error = -EINVAL;

    /* epoll_ctl() deals with two files: "file" is the epoll file itself, "tfile" is the target file being monitored */

    if (file == tfile || !is_file_epoll(file))
        goto error_tgt_fput;

    /*
     * At this point it is safe to assume that the "private_data" contains
     * our own data structure.
     */
    ep = file->private_data;

    /*
     * When we insert an epoll file descriptor, inside another epoll file
     * descriptor, there is the change of creating closed loops, which are
     * better be handled here, than in more critical paths. While we are
     * checking for loops we also determine the list of files reachable
     * and hang them on the tfile_check_list, so we can check that we
     * haven't created too many possible wakeup paths.
     *
     * We need to hold the epmutex across both ep_insert and ep_remove
     * b/c we want to make sure we are looking at a coherent view of
     * epoll network.
     */

    /* the long comment above explains how epoll guards against nested-epoll monitoring loops */

    if (op == EPOLL_CTL_ADD || op == EPOLL_CTL_DEL) {
        mutex_lock(&epmutex);
        did_lock_epmutex = 1;
    }
    if (op == EPOLL_CTL_ADD) {
        if (is_file_epoll(tfile)) {
            error = -ELOOP;
            if (ep_loop_check(ep, tfile) != 0) {
                clear_tfile_check_list();
                goto error_tgt_fput;
            }
        } else
            list_add(&tfile->f_tfile_llink, &tfile_check_list);
    }

    mutex_lock_nested(&ep->mtx, 0);

    /*
     * Try to lookup the file inside our RB tree, Since we grabbed "mtx"
     * above, we can be sure to be able to use the item looked up by
     * ep_find() till we release the mutex.
     */
    epi = ep_find(ep, tfile, fd);

    error = -EINVAL;
    switch (op) {
    case EPOLL_CTL_ADD:
        if (!epi) {
            epds.events |= POLLERR | POLLHUP;
            error = ep_insert(ep, &epds, tfile, fd);
        } else
            error = -EEXIST;
        clear_tfile_check_list();
        break;
    case EPOLL_CTL_DEL:
        if (epi)
            error = ep_remove(ep, epi);
        else
            error = -ENOENT;
        break;
    case EPOLL_CTL_MOD:
        if (epi) {
            epds.events |= POLLERR | POLLHUP;
            error = ep_modify(ep, epi, &epds);
        } else
            error = -ENOENT;
        break;
    }
    mutex_unlock(&ep->mtx);

error_tgt_fput:
    if (did_lock_epmutex)
        mutex_unlock(&epmutex);

    fput(tfile);
error_fput:
    fput(file);
error_return:

    return error;
}

The code before "ep_find()" performs the sanity checks: epoll refuses to monitor itself ("file == tfile"), and when one epoll descriptor is added to another, "ep_loop_check()" rejects configurations that would create a monitoring loop.

The core of "epoll_ctl()" comes after "ep_find()": the "switch" over EPOLL_CTL_ADD / EPOLL_CTL_DEL / EPOLL_CTL_MOD registers, removes, or modifies an event. The latter two are not complicated, so we go straight to the core registration function, "ep_insert()".


/* linux-3.10.103/linux-3.10.103/fs/eventpoll.c */

/*
 * Must be called with "mtx" held.
 */
static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
             struct file *tfile, int fd)
{
    int error, revents, pwake = 0;
    unsigned long flags;
    long user_watches;
    struct epitem *epi;
    struct ep_pqueue epq;

    user_watches = atomic_long_read(&ep->user->epoll_watches);
    if (unlikely(user_watches >= max_user_watches))
        return -ENOSPC;

    /* the slab cache keeps epitem allocation and release cheap */

    if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
        return -ENOMEM;

    /* initialize the three list heads: ready link, file link and poll-wait list */

    /* Item initialization follow here ... */
    INIT_LIST_HEAD(&epi->rdllink);
    INIT_LIST_HEAD(&epi->fllink);
    INIT_LIST_HEAD(&epi->pwqlist);

    /* tie this epitem to its owning eventpoll */

    epi->ep = ep;
    ep_set_ffd(&epi->ffd, tfile, fd);
    epi->event = *event;
    epi->nwait = 0;
    epi->next = EP_UNACTIVE_PTR;

    /* autosleep handling; see the EPOLLWAKEUP documentation if interested */

    if (epi->event.events & EPOLLWAKEUP) {
        error = ep_create_wakeup_source(epi);
        if (error)
            goto error_create_wakeup_source;
    } else {
        RCU_INIT_POINTER(epi->ws, NULL);
    }

    /* set up the poll table: register ep_ptable_queue_proc() as the callback of epq.pt */

    /* Initialize the poll table using the queue callback */
    epq.epi = epi;
    init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);

    /*
     * Attach the item to the poll hooks and get current event bits.
     * We can safely use the file* here because its usage count has
     * been increased by the caller of this function. Note that after
     * this operation completes, the poll callback can start hitting
     * the new item.
     */

    /* hand the prepared epq.pt down to the target file's poll method */
    revents = ep_item_poll(epi, &epq.pt);

    /*
     * We have to check if something went wrong during the poll wait queue
     * install process. Namely an allocation for a wait queue failed due
     * high memory pressure.
     */
    error = -ENOMEM;
    if (epi->nwait < 0)
        goto error_unregister;

    /* Add the current item to the list of active epoll hook for this file */
    spin_lock(&tfile->f_lock);
    list_add_tail(&epi->fllink, &tfile->f_ep_links);
    spin_unlock(&tfile->f_lock);

    /* insert the epitem into epoll's red-black tree */

    /*
     * Add the current item to the RB tree. All RB tree operations are
     * protected by "mtx", and ep_insert() is called with "mtx" held.
     */
    ep_rbtree_insert(ep, epi);

    /* now check if we've created too many backpaths */
    error = -EINVAL;
    if (reverse_path_check())
        goto error_remove_epi;

    /* We have to drop the new item inside our item list to keep track of it */
    spin_lock_irqsave(&ep->lock, flags);

    /* If the file is already "ready" we drop it inside the ready list */
    if ((revents & event->events) && !ep_is_linked(&epi->rdllink)) {
        list_add_tail(&epi->rdllink, &ep->rdllist);
        ep_pm_stay_awake(epi);

        /* Notify waiting tasks that events are available */
        if (waitqueue_active(&ep->wq))
            wake_up_locked(&ep->wq);
        if (waitqueue_active(&ep->poll_wait))
            pwake++;
    }

    spin_unlock_irqrestore(&ep->lock, flags);

    atomic_long_inc(&ep->user->epoll_watches);

    /* We have to call this outside the lock */
    if (pwake)
        ep_poll_safewake(&ep->poll_wait);

    return 0;

error_remove_epi:
    spin_lock(&tfile->f_lock);
    if (ep_is_linked(&epi->fllink))
        list_del_init(&epi->fllink);
    spin_unlock(&tfile->f_lock);

    rb_erase(&epi->rbn, &ep->rbr);

error_unregister:
    ep_unregister_pollwait(ep, epi);

    /*
     * We need to do this because an event could have been arrived on some
     * allocated wait queue. Note that we don't care about the ep->ovflist
     * list, since that is used/cleaned only inside a section bound by "mtx".
     * And ep_insert() is called with "mtx" held.
     */
    spin_lock_irqsave(&ep->lock, flags);
    if (ep_is_linked(&epi->rdllink))
        list_del_init(&epi->rdllink);
    spin_unlock_irqrestore(&ep->lock, flags);

    wakeup_source_unregister(ep_wakeup_source(epi));

error_create_wakeup_source:
    kmem_cache_free(epi_cache, epi);

    return error;
}

The INIT_LIST_HEAD calls initialize the three list heads of the epoll event object, "epitem":

  • rdllink: the link that puts the epitem on the ready list;

  • fllink: the link that puts the epitem on its target file's list of epoll hooks;

  • pwqlist: the list of poll-wait entries attached to this epitem;

These three lists matter: "fllink" records which file the epitem is attached to (needed when resources are released), while "rdllink" and "pwqlist" are analyzed further below; an abridged "struct epitem" is sketched next.

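The abridged structure, trimmed to the fields that "ep_insert()" initializes above; the full definition in fs/eventpoll.c carries a few additional members.


/* Abridged from fs/eventpoll.c (linux-3.10.x); some fields omitted */
struct epitem {
    struct rb_node rbn;           /* node in the eventpoll red-black tree */
    struct list_head rdllink;     /* link into ep->rdllist when ready */
    struct epitem *next;          /* link into ep->ovflist */
    struct epoll_filefd ffd;      /* the monitored file and its fd */
    int nwait;                    /* number of wait queues attached via poll */
    struct list_head pwqlist;     /* list of eppoll_entry wait entries */
    struct eventpoll *ep;         /* the owning eventpoll instance */
    struct list_head fllink;      /* link into the target file's f_ep_links */
    struct epoll_event event;     /* the events requested by the caller */
};
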
When registering an event, epoll installs "ep_ptable_queue_proc()" as the poll-table callback; the target file's poll method then calls it back to hook epoll onto the file's wait queue, and it is through that hook that later event arrivals activate the epitem.

"ep_item_poll()" invokes the corresponding "fd"'s poll operation to check whether an event is already pending; readers interested in the driver side can look at "poll_wait()", which we may cover another time.

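For reference, "ep_item_poll()" in this kernel version is, as far as I can tell, just a thin wrapper that calls the target file's poll method and masks the result with the requested events:


static inline unsigned int ep_item_poll(struct epitem *epi, poll_table *pt)
{
    pt->_key = epi->event.events;

    return epi->ffd.file->f_op->poll(epi->ffd.file, pt) & epi->event.events;
}
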
When registering an event, epoll also inserts the event object into its red-black tree "rbr", an O(log n) operation, where n is the total number of monitored events, normally the total number of connections.

"ep_insert()" additionally checks whether the file behind the current "fd" is already ready (the code after the red-black-tree insertion); if it is, the epitem goes straight onto "rdllist" and any waiting task is woken up.

Part 3: Event readiness

The "ep_ptable_queue_proc()" mentioned during event registration is what hooks the epoll event object onto the wait queue of the underlying device driver (typically the network stack on top of the NIC); when the corresponding event occurs later, the driver layer wakes that queue and epoll's callback runs.


/* linux-3.10.103/linux-3.10.103/fs/eventpoll.c */

/*
 * This is the callback that is used to add our wait queue to the
 * target file wakeup lists.
 */
static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
                 poll_table *pt)
{
    struct epitem *epi = ep_item_from_epqueue(pt);
    struct eppoll_entry *pwq;

    if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
        init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
        pwq->whead = whead;
        pwq->base = epi;
        add_wait_queue(whead, &pwq->wait);
        list_add_tail(&pwq->llink, &epi->pwqlist);
        epi->nwait++;
    } else {
        /* We have to signal that an error occurred */
        epi->nwait = -1;
    }
}

As we can see, the core of this callback is the kernel's wait-queue machinery: it allocates an "eppoll_entry", sets "ep_poll_callback()" as the wake-up function of its wait entry, adds that entry to the target file's wait queue "whead", and links the "eppoll_entry" onto the epitem's "pwqlist" for bookkeeping.

"ep_ptable_queue_proc()" works together with the driver-layer code, mainly the TCP/IP stack, which we will not expand on here (for a TCP socket, roughly speaking, incoming data ends up waking the socket's wait queue via its "sk_data_ready" hook); just assume that when data reaches the driver, "ep_poll_callback()" gets invoked.


/* linux-3.10.103/linux-3.10.103/fs/eventpoll.c */

/*
 * This is the callback that is passed to the wait queue wakeup
 * mechanism. It is called by the stored file descriptors when they
 * have events to report.
 */
static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
    int pwake = 0;
    unsigned long flags;
    struct epitem *epi = ep_item_from_wait(wait);
    struct eventpoll *ep = epi->ep;

    if ((unsigned long)key & POLLFREE) {
        ep_pwq_from_wait(wait)->whead = NULL;
        /*
         * whead = NULL above can race with ep_remove_wait_queue()
         * which can do another remove_wait_queue() after us, so we
         * can't use __remove_wait_queue(). whead->lock is held by
         * the caller.
         */
        list_del_init(&wait->task_list);
    }

    spin_lock_irqsave(&ep->lock, flags);

    /*
     * If the event mask does not contain any poll(2) event, we consider the
     * descriptor to be disabled. This condition is likely the effect of the
     * EPOLLONESHOT bit that disables the descriptor when an event is received,
     * until the next EPOLL_CTL_MOD will be issued.
     */
    if (!(epi->event.events & ~EP_PRIVATE_BITS))
        goto out_unlock;

    /*
     * Check the events coming with the callback. At this stage, not
     * every device reports the events in the "key" parameter of the
     * callback. We need to be able to handle both cases here, hence the
     * test for "key" != NULL before the event match test.
     */
    if (key && !((unsigned long) key & epi->event.events))
        goto out_unlock;

    /*
     * If we are transferring events to userspace, we can hold no locks
     * (because we're accessing user memory, and because of linux f_op->poll()
     * semantics). All the events that happen during that period of time are
     * chained in ep->ovflist and requeued later on.
     */
    if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
        if (epi->next == EP_UNACTIVE_PTR) {
            epi->next = ep->ovflist;
            ep->ovflist = epi;
            if (epi->ws) {
                /*
                 * Activate ep->ws since epi->ws may get
                 * deactivated at any time.
                 */
                __pm_stay_awake(ep->ws);
            }

        }
        goto out_unlock;
    }

    /* add the ready event to rdllist */

    /* If this file is already in the ready list we exit soon */
    if (!ep_is_linked(&epi->rdllink)) {
        list_add_tail(&epi->rdllink, &ep->rdllist);
        ep_pm_stay_awake_rcu(epi);
    }

    /*
     * Wake up ( if active ) both the eventpoll wait list and the ->poll()
     * wait list.
     */

    /* wake up tasks blocked in epoll_wait() */

    if (waitqueue_active(&ep->wq))
        wake_up_locked(&ep->wq);

    /* wake up tasks polling the eventpoll file itself */

    if (waitqueue_active(&ep->poll_wait))
        pwake++;

out_unlock:
    spin_unlock_irqrestore(&ep->lock, flags);

    /* We have to call this outside the lock */
    if (pwake)
        ep_poll_safewake(&ep->poll_wait);

    return 1;
}

"ep_poll_callback()" does two main things:

  • it adds the ready event to "rdllist";

  • it wakes up processes blocked in "epoll_wait()";

Finally, the source of "epoll_wait()":


/* linux-3.10.103/linux-3.10.103/fs/eventpoll.c */

/*
 * Implement the event wait interface for the eventpoll file. It is the kernel
 * part of the user space epoll_wait(2).
 */
SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
        int, maxevents, int, timeout)
{
    int error;
    struct fd f;
    struct eventpoll *ep;

    /* The maximum number of event must be greater than zero */
    if (maxevents <= 0 || maxevents > EP_MAX_EVENTS)
        return -EINVAL;

    /* Verify that the area passed by the user is writeable */
    if (!access_ok(VERIFY_WRITE, events, maxevents * sizeof(struct epoll_event)))
        return -EFAULT;

    /* Get the "struct file *" for the eventpoll file */
    f = fdget(epfd);
    if (!f.file)
        return -EBADF;

    /*
     * We have to check that the file structure underneath the fd
     * the user passed to us _is_ an eventpoll file.
     */
    error = -EINVAL;
    if (!is_file_epoll(f.file))
        goto error_fput;

    /*
     * At this point it is safe to assume that the "private_data" contains
     * our own data structure.
     */
    ep = f.file->private_data;

    /* Time to fish for events ... */
    error = ep_poll(ep, events, maxevents, timeout);

error_fput:
    fdput(f);
    return error;
}

This part is brief. The things to focus on are the "events" and "maxevents" arguments and the function "ep_poll()": "events" and "maxevents" describe the user-supplied buffer into which the kernel writes ready events, so only the ready events cross the user/kernel boundary instead of the whole interest set on every call.

"ep_poll()" looks like this:


/* linux-3.10.103/linux-3.10.103/fs/eventpoll.c */

/**
 * ep_poll - Retrieves ready events, and delivers them to the caller supplied
 *           event buffer.
 *
 * @ep: Pointer to the eventpoll context.
 * @events: Pointer to the userspace buffer where the ready events should be
 *          stored.
 * @maxevents: Size (in terms of number of events) of the caller event buffer.
 * @timeout: Maximum timeout for the ready events fetch operation, in
 *           milliseconds. If the @timeout is zero, the function will not block,
 *           while if the @timeout is less than zero, the function will block
 *           until at least one event has been retrieved (or an error
 *           occurred).
 *
 * Returns: Returns the number of ready events which have been fetched, or an
 *          error code, in case of error.
 */
static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
           int maxevents, long timeout)
{
    int res = 0, eavail, timed_out = 0;
    unsigned long flags;
    long slack = 0;
    wait_queue_t wait;
    ktime_t expires, *to = NULL;

    if (timeout > 0) {
        struct timespec end_time = ep_set_mstimeout(timeout);

        slack = select_estimate_accuracy(&end_time);
        to = &expires;
        *to = timespec_to_ktime(end_time);
    } else if (timeout == 0) {
        /*
         * Avoid the unnecessary trip to the wait queue loop, if the
         * caller specified a non blocking operation.
         */
        timed_out = 1;
        spin_lock_irqsave(&ep->lock, flags);
        goto check_events;
    }

fetch_events:
    spin_lock_irqsave(&ep->lock, flags);

    /* if no event is ready, keep waiting until one is activated or the timeout expires */

    if (!ep_events_available(ep)) {
        /*
         * We don't have any available event to return to the caller.
         * We need to sleep here, and we will be wake up by
         * ep_poll_callback() when events will become available.
         */

        /* the task sleeps here until "ep_poll_callback()" (see above) wakes it when an event becomes ready */

        init_waitqueue_entry(&wait, current);
        __add_wait_queue_exclusive(&ep->wq, &wait);

        for (;;) {
            /*
             * We don't want to sleep if the ep_poll_callback() sends us
             * a wakeup in between. That's why we set the task state
             * to TASK_INTERRUPTIBLE before doing the checks.
             */
            set_current_state(TASK_INTERRUPTIBLE);
            if (ep_events_available(ep) || timed_out)
                break;
            if (signal_pending(current)) {
                res = -EINTR;
                break;
            }

            spin_unlock_irqrestore(&ep->lock, flags);
            if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
                timed_out = 1;

            spin_lock_irqsave(&ep->lock, flags);
        }
        __remove_wait_queue(&ep->wq, &wait);

        set_current_state(TASK_RUNNING);
    }
check_events:
    /* Is it worth to try to dig for events ? */
    eavail = ep_events_available(ep);

    spin_unlock_irqrestore(&ep->lock, flags);

    /*
     * Try to transfer events to user space. In case we get 0 events and
     * there's still timeout left over, we go trying again in search of
     * more luck.
     */

    /* as the comment above explains, this is where ready events are transferred to user space */

    if (!res && eavail &&
        !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
        goto fetch_events;

    return res;
}

An important reason epoll is more efficient than select and poll is that its readiness check only walks events that are already ready; the walk-through above confirms this.

Inside "epoll_wait()" the kernel never scans all monitored events: the callback mechanism has already placed ready events on the ready list "rdllist".

In addition, epoll makes effective use of the kernel's wait queues, so a process blocked in "epoll_wait()" is woken promptly as soon as an event becomes ready.

Summary

The epoll code is indeed compact, but its machinery rests mainly on the kernel's wait queues, so its callback model is still not trivial to follow.

Walking through the epoll source also makes it easy to see why, and by what mechanism, it outperforms select and poll.

epoll has its own issues, though: adding and removing events relies on the red-black tree, whose insertions and deletions cost O(log n), so workloads that open and close huge numbers of connections very frequently (extremely short-lived connections) still pay a noticeable price. epoll is therefore best suited to scenarios with many connections of which only a limited number are active at any time.
