Understanding the TCP Protocol and Its Source Code in Depth

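This article starts from the basic concepts of TCP and the three-way handshake, then uses the connect, bind, listen, and accept functions of the socket API to take a closer look at how TCP works.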

I. Basic Concepts of TCP

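TCP: TCP provides a connection-oriented, reliable byte-stream service. It is designed to fit into a layered protocol hierarchy that supports multi-network applications. Pairs of processes in host computers attached to distinct but interconnected computer communication networks rely on TCP for reliable communication. TCP assumes that it can obtain a simple, potentially unreliable datagram service from the lower-level protocols. In principle, TCP should be able to operate above a wide range of communication systems, from hard-wired connections to packet-switched or circuit-switched networks.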

SYN: Synchronize Sequence Numbers. This is the handshake flag used when TCP/IP establishes a connection.

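ACK: Acknowledgment. It indicates that the data sent so far has been received correctly; the acknowledgment number names the next sequence number the receiver expects.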

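Three-way handshake: TCP is a transport-layer protocol of the Internet and establishes connections with a three-way handshake. After the active side sends a SYN connection request, it waits for the peer to answer with SYN+ACK and finally acknowledges the peer's SYN with an ACK. Establishing connections this way prevents erroneous connections, for example one triggered by a delayed duplicate SYN. Separately, TCP's flow control uses a variable-size sliding window. Once the handshake completes, the client and the server have a connection and can begin transferring data.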

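The TCP three-way handshake
1. The client sends a SYN (SEQ=x) segment to the server and enters the SYN_SENT state.
2. The server receives the SYN, replies with a SYN (SEQ=y) + ACK (ACK=x+1) segment, and enters the SYN_RECV state.
3. The client receives the server's SYN+ACK, replies with an ACK (ACK=y+1) segment, and enters the ESTABLISHED state. The server enters ESTABLISHED when that ACK arrives.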
[Figure 1]
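The sequence-number arithmetic of the three steps can be sketched in a few lines of C. This is only an illustration: struct segment and the make_* helpers are hypothetical names invented for this example and do not correspond to kernel structures.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical segment type for illustration; not a kernel structure. */
struct segment {
    int      syn;      /* SYN flag */
    int      ack;      /* ACK flag */
    uint32_t seq;      /* sequence number */
    uint32_t ack_seq;  /* acknowledgment number, meaningful when ack != 0 */
};

/* Step 1: client -> server, SYN(SEQ=x). */
struct segment make_syn(uint32_t x)
{
    struct segment s = { 1, 0, x, 0 };
    return s;
}

/* Step 2: server -> client, SYN(SEQ=y) + ACK(ACK=x+1). */
struct segment make_synack(const struct segment *syn, uint32_t y)
{
    struct segment s = { 1, 1, y, 0 };
    assert(syn->syn && !syn->ack);   /* must answer a bare SYN */
    s.ack_seq = syn->seq + 1;        /* the SYN consumes one sequence number */
    return s;
}

/* Step 3: client -> server, ACK(ACK=y+1); the client's SEQ is now x+1. */
struct segment make_ack(const struct segment *synack)
{
    struct segment s = { 0, 1, 0, 0 };
    assert(synack->syn && synack->ack);
    s.seq = synack->ack_seq;
    s.ack_seq = synack->seq + 1;
    return s;
}
```

Because the fields are unsigned 32-bit values, x+1 and y+1 wrap around to 0 after 0xFFFFFFFF, which matches how TCP sequence numbers behave.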

II. Reading the Source Code

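The TCP code lives mainly under linux-5.0.1/net/ipv4/. From the earlier analysis we know that the struct proto variable tcp_prot, defined in linux-5.0.1/net/ipv4/tcp_ipv4.c, specifies the interface functions of the TCP stack, and that the transport-layer function behind sock->ops->bind in the socket API is inet_csk_get_port. By the same reasoning, sock->ops->connect and sock->ops->accept map to tcp_v4_connect and inet_csk_accept respectively.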

[Figure 2]

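Next we analyze tcp_v4_connect and inet_csk_accept. Note that the tcp_v4_connect listing below uses names such as inet->opt and tcp_death_row.sysctl_tw_recycle, which belong to an older 2.6-era kernel rather than 5.0.1; the overall flow is similar.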

tcp_v4_connect:

int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
{
    struct inet_sock *inet = inet_sk(sk);
    struct tcp_sock *tp = tcp_sk(sk);
    struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
    // routing cache entry for the destination
    struct rtable *rt;
    __be32 daddr, nexthop;
    int tmp;
    int err;
 
    // check the address length
    if (addr_len < sizeof(struct sockaddr_in))
        return -EINVAL;
    // check the address family
    if (usin->sin_family != AF_INET)
        return -EAFNOSUPPORT;
 
    // the next hop defaults to the destination; a source-route option, if set, overrides it
    nexthop = daddr = usin->sin_addr.s_addr;
    if (inet->opt && inet->opt->srr) {
        if (!daddr)
            return -EINVAL;
        nexthop = inet->opt->faddr;
    }
 
    // look up the route; the resulting entry is stored in rt
    tmp = ip_route_connect(&rt, nexthop, inet->inet_saddr,
                   RT_CONN_FLAGS(sk), sk->sk_bound_dev_if,
                   IPPROTO_TCP,
                   inet->inet_sport, usin->sin_port, sk, 1);
    if (tmp < 0) {
        if (tmp == -ENETUNREACH)
            IP_INC_STATS_BH(sock_net(sk), IPSTATS_MIB_OUTNOROUTES);
        return tmp;
    }
 
    // reject multicast and broadcast destinations
    if (rt->rt_flags & (RTCF_MULTICAST | RTCF_BROADCAST)) {
        ip_rt_put(rt);
        return -ENETUNREACH;
    }
    // without a source-route IP option, use the destination found by the routing lookup
    if (!inet->opt || !inet->opt->srr)
        daddr = rt->rt_dst;
 
    if (!inet->inet_saddr)
        inet->inet_saddr = rt->rt_src;
    inet->inet_rcv_saddr = inet->inet_saddr;
 
    if (tp->rx_opt.ts_recent_stamp && inet->inet_daddr != daddr) {
        /* Reset inherited state */
        tp->rx_opt.ts_recent       = 0;
        tp->rx_opt.ts_recent_stamp = 0;
        tp->write_seq           = 0;
    }
 
    // with tw_recycle enabled, restore the last timestamp seen from this peer
    if (tcp_death_row.sysctl_tw_recycle &&
        !tp->rx_opt.ts_recent_stamp && rt->rt_dst == daddr) {
        struct inet_peer *peer = rt_get_peer(rt);
        /*
         * VJ's idea. We save last timestamp seen from
         * the destination in peer table, when entering state
         * TIME-WAIT * and initialize rx_opt.ts_recent from it,
         * when trying new connection.
         */
        if (peer != NULL &&
            (u32)get_seconds() - peer->tcp_ts_stamp <= TCP_PAWS_MSL) {
            tp->rx_opt.ts_recent_stamp = peer->tcp_ts_stamp;
            tp->rx_opt.ts_recent = peer->tcp_ts;
        }
    }
 
    inet->inet_dport = usin->sin_port;
    inet->inet_daddr = daddr;
 
    inet_csk(sk)->icsk_ext_hdr_len = 0;
    if (inet->opt)
        inet_csk(sk)->icsk_ext_hdr_len = inet->opt->optlen;
 
    tp->rx_opt.mss_clamp = TCP_MSS_DEFAULT;
 
    /* Socket identity is still unknown (sport may be zero).
     * However we set state to SYN-SENT and not releasing socket
     * lock select source port, enter ourselves into the hash tables and
     * complete initialization after this.
     */
     // set the socket state to TCP_SYN_SENT
    tcp_set_state(sk, TCP_SYN_SENT);
    // bind a local port if needed and insert sk into the connection hash table
    err = inet_hash_connect(&tcp_death_row, sk);
    if (err)
        goto failure;
 
    // redo the route lookup if the source or destination port changed
    err = ip_route_newports(&rt, IPPROTO_TCP,
                inet->inet_sport, inet->inet_dport, sk);
    if (err)
        goto failure;
 
    /* OK, now commit destination to socket.  */
    sk->sk_gso_type = SKB_GSO_TCPV4;
    sk_setup_caps(sk, &rt->u.dst);
 
    if (!tp->write_seq)
        // choose the initial sequence number
        tp->write_seq = secure_tcp_sequence_number(inet->inet_saddr,
                               inet->inet_daddr,
                               inet->inet_sport,
                               usin->sin_port);
 
    inet->inet_id = tp->write_seq ^ jiffies;
    // build the SYN segment; tcp_connect hands it to the IP layer via tcp_transmit_skb
    err = tcp_connect(sk);
    rt = NULL;
    if (err)
        goto failure;
 
    return 0;
 
failure:
    /*
     * This unhashes the socket and releases the local port,
     * if necessary.
     */
     // on failure, move the socket back to the CLOSE state
    tcp_set_state(sk, TCP_CLOSE);
    ip_rt_put(rt);
    sk->sk_route_caps = 0;
    inet->inet_dport = 0;
    return err;
}

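Here sk is the socket pointer, uaddr is the destination address as a struct sockaddr, and addr_len is the length of that address.

What tcp_v4_connect does:

1. Initialization

It checks the length and the address family of the destination address. If the source-route option is set and the destination address is non-zero, it takes the first-hop address from the user-supplied source route (inet->opt->faddr) as the next hop.

2. Route selection

It calls ip_route_connect with the next hop, the source and destination ports, and the bound network interface to select a route. The resulting route entry is stored in rt (rt->rt_dst holds the destination address), and the lookup itself is performed by ip_route_output_flow. If the route is a broadcast or multicast route, the function returns an error.

3. Setting the connection state

It calls tcp_set_state to move the socket to TCP_SYN_SENT. inet_hash_connect then binds an ephemeral local port if none is bound yet and adds sk to the connection hash table.

4. Sending the connection request

It initializes the first sequence number and calls tcp_connect to send the SYN: tcp_connect puts the prepared SYN segment on the socket's send queue and finally calls tcp_transmit_skb to pass the packet to the IP layer.

5. Connection failure

If the attempt fails, the TCP state is switched back to CLOSE, which removes the socket from the connection hash table and releases the local port.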
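The two argument checks at the top of tcp_v4_connect can be reproduced in userspace to see which error connect() reports for a malformed address. The following is a minimal sketch; check_connect_args is a hypothetical helper written for this article, not a kernel function.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>

/* Userspace mirror of the first two checks in tcp_v4_connect.
 * Hypothetical helper for illustration only. */
int check_connect_args(const struct sockaddr *uaddr, size_t addr_len)
{
    const struct sockaddr_in *usin = (const struct sockaddr_in *)uaddr;

    assert(uaddr != NULL);
    if (addr_len < sizeof(struct sockaddr_in))
        return -EINVAL;        /* address structure is too short */
    if (usin->sin_family != AF_INET)
        return -EAFNOSUPPORT;  /* not an IPv4 address */
    return 0;
}
```

The order matters: the length is validated first, so that reading sin_family never touches memory beyond what the caller supplied.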

 

inet_csk_accept:

struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern)
{
    struct inet_connection_sock *icsk = inet_csk(sk);
    struct request_sock_queue *queue = &icsk->icsk_accept_queue;
    struct request_sock *req;
    struct sock *newsk;
    int error;

   // take the socket lock, which sets sk->sk_lock.owned to 1
    // the lock serializes process context against softirq context

    lock_sock(sk);

    /* We need to make sure that this socket is listening,
     * and that it has something pending.
     */
   // a socket passed to accept must be in the listening state

    error = -EINVAL;
    if (sk->sk_state != TCP_LISTEN)
        goto out_err;

    /* Find already established connection */
    // if the accept queue of the listening socket is empty

    if (reqsk_queue_empty(queue)) {
     // get the receive timeout; it is 0 if O_NONBLOCK was passed to accept
        long timeo = sock_rcvtimeo(sk, flags & O_NONBLOCK);

        /* If this is a non blocking socket don't sleep */
        error = -EAGAIN;
        if (!timeo) // non-blocking mode: timeo is 0, return immediately
            goto out_err;

     // block the process until a connection completes
        error = inet_csk_wait_for_connect(sk, timeo);
        if (error) // 0 means the queue of fully established connections is no longer empty
            goto out_err;
    }

   // remove this request_sock from the accept queue and return the established sock
   // the three-way handshake itself is completed on the receive path, in tcp_v4_rcv

    req = reqsk_queue_remove(queue, sk);
    newsk = req->sk;

     // newsk is normally TCP_ESTABLISHED here; with TCP Fast Open the final ACK may still be pending
    if (sk->sk_protocol == IPPROTO_TCP &&
        tcp_rsk(req)->tfo_listener) {
        spin_lock_bh(&queue->fastopenq.lock);
        if (tcp_rsk(req)->tfo_listener) {
            /* We are still waiting for the final ACK from 3WHS
             * so can't free req now. Instead, we set req->sk to
             * NULL to signify that the child socket is taken
             * so reqsk_fastopen_remove() will free the req
             * when 3WHS finishes (or is aborted).
             */
            req->sk = NULL;
            req = NULL;
        }
        spin_unlock_bh(&queue->fastopenq.lock);
    }
out:
    release_sock(sk);
    if (req)
        reqsk_put(req);
    return newsk;
out_err:
    newsk = NULL;
    req = NULL;
    *err = error;
    goto out;
}
EXPORT_SYMBOL(inet_csk_accept);

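Here sk is the listening socket, flags carries the file flags (for example O_NONBLOCK), err receives the error code, and kern indicates whether the call originates inside the kernel.

What inet_csk_accept does:

1. Initialization

It locates the queue from which connections will be taken. icsk_accept_queue is initialized at listen time and holds connections that have completed the three-way handshake and are waiting to be accepted.

2. Checking the listening state

The socket must currently be in TCP_LISTEN; otherwise the call fails with -EINVAL.

3. Determining the timeout

If O_NONBLOCK is set and the queue is empty, the call returns -EAGAIN immediately. Otherwise inet_csk_wait_for_connect blocks waiting for a new connection until timeo expires.

4. Processing the request queue

reqsk_queue_remove takes the first request_sock off the queue, and the child socket stored in req->sk is returned to the caller as newsk.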
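The non-blocking path through these four steps can be modelled with a small FIFO. This is a toy sketch under simplifying assumptions: struct listener, struct accept_queue, queue_add, and toy_accept are hypothetical names, integers stand in for child sockets, and the blocking wait is left out entirely.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define QCAP 8  /* arbitrary capacity for the toy queue */

/* Ring buffer standing in for icsk_accept_queue. */
struct accept_queue {
    int    items[QCAP];  /* integers stand in for established child sockets */
    size_t head;
    size_t len;
};

struct listener {
    int                 listening;  /* 1 while in the LISTEN state */
    struct accept_queue q;
};

/* Called when a handshake completes: append the child to the queue. */
int queue_add(struct accept_queue *q, int child)
{
    assert(q != NULL);
    if (q->len == QCAP)
        return -1;  /* queue full */
    q->items[(q->head + q->len) % QCAP] = child;
    q->len++;
    return 0;
}

/* Non-blocking accept: returns a child id, or a negative errno value. */
int toy_accept(struct listener *l)
{
    int child;

    assert(l != NULL);
    if (!l->listening)
        return -EINVAL;  /* step 2: must be listening */
    if (l->q.len == 0)
        return -EAGAIN;  /* step 3: nothing pending, do not sleep */
    child = l->q.items[l->q.head];  /* step 4: dequeue the oldest entry */
    l->q.head = (l->q.head + 1) % QCAP;
    l->q.len--;
    return child;
}
```

Connections come out in the order in which their handshakes completed, which is the behaviour a caller of accept() observes.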

 

inet_csk_wait_for_connect:

static int inet_csk_wait_for_connect(struct sock *sk, long timeo)
{
    struct inet_connection_sock *icsk = inet_csk(sk);
    DEFINE_WAIT(wait);
    int err;
    for (;;) {
        prepare_to_wait_exclusive(sk_sleep(sk), &wait,
                      TASK_INTERRUPTIBLE);
        release_sock(sk);
        if (reqsk_queue_empty(&icsk->icsk_accept_queue))
            timeo = schedule_timeout(timeo);
        sched_annotate_sleep();
        lock_sock(sk);
        err = 0;
        if (!reqsk_queue_empty(&icsk->icsk_accept_queue))
            break;
        err = -EINVAL;
        if (sk->sk_state != TCP_LISTEN)
            break;
        err = sock_intr_errno(timeo);
        if (signal_pending(current))
            break;
        err = -EAGAIN;
        if (!timeo)
            break;
    }
    finish_wait(sk_sleep(sk), &wait);
    return err;
}

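The core of this code is the for(;;) loop. It breaks out and returns to inet_csk_accept as soon as the accept queue is non-empty, that is, once a client has completed the three-way handshake. It also exits with an error if the socket is no longer listening, a signal is pending, or the timeout expires.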
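The order of the exit checks inside the loop is easy to misread, so here is a small sketch that isolates it. struct wake_snapshot and wait_loop_step are hypothetical names for this example; the sketch also simplifies the signal case to -EINTR, whereas the kernel calls sock_intr_errno(timeo), which can return a different restart code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum { ST_LISTEN = 1, ST_CLOSE = 2 };  /* hypothetical state codes */

/* What the loop body observes after one wake-up. */
struct wake_snapshot {
    int  queue_empty;     /* 1 if the accept queue is still empty */
    int  state;           /* ST_LISTEN or ST_CLOSE */
    int  signal_pending;  /* 1 if a signal is waiting for the task */
    long timeo;           /* remaining timeout, 0 means expired */
};

/* Returns 1 and stores the result in *err if the loop would break,
 * or 0 if it would go around and sleep again. */
int wait_loop_step(const struct wake_snapshot *w, int *err)
{
    assert(w != NULL && err != NULL);
    if (!w->queue_empty) {          /* a connection is ready */
        *err = 0;
        return 1;
    }
    if (w->state != ST_LISTEN) {    /* socket stopped listening */
        *err = -EINVAL;
        return 1;
    }
    if (w->signal_pending) {        /* simplified: kernel uses sock_intr_errno() */
        *err = -EINTR;
        return 1;
    }
    if (!w->timeo) {                /* timeout expired */
        *err = -EAGAIN;
        return 1;
    }
    return 0;                       /* keep waiting */
}
```

A ready connection wins over every error condition, and a pending signal is reported before an expired timeout.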

 

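Having analyzed the code, the call relationships during connection establishment are now clear: the client's connect() call reaches tcp_v4_connect in the TCP layer, and the server's accept() call reaches inet_csk_accept, which in turn calls inet_csk_wait_for_connect to wait on the accept queue. It leaves the loop as soon as a connection arrives and otherwise keeps blocking.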

Next, we look at how TCP puts requests into that queue.

static struct net_protocol tcp_protocol = {
    .early_demux    =    tcp_v4_early_demux,
    .early_demux_handler =  tcp_v4_early_demux,
    .handler    =    tcp_v4_rcv,
    .err_handler    =    tcp_v4_err,
    .no_policy    =    1,
    .netns_ok    =    1,
    .icmp_strict_tag_validation = 1,
};

In this definition of the TCP protocol, the handler function pointer refers to tcp_v4_rcv, so we look at tcp_v4_rcv next:

tcp_v4_rcv:

 int tcp_v4_rcv(struct sk_buff *skb)
 {
 // look up the socket that owns this segment
 lookup:
     sk = __inet_lookup_skb(&tcp_hashinfo, skb, __tcp_hdrlen(th), th->source,
                    th->dest, sdif, &refcounted);
     if (!sk)
         goto no_tcp_socket;
 // handle the segment according to the socket state; here it is TCP_LISTEN
 process:
 ...
     if (sk->sk_state == TCP_LISTEN) {
         ret = tcp_v4_do_rcv(sk, skb);
         goto put_and_return;
     }
 ...
 put_and_return:
     if (refcounted)
         sock_put(sk);
 
     return ret;
 ...
 }

The key function here is tcp_v4_do_rcv, so we follow it:

tcp_v4_do_rcv:

int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
 {
     struct sock *rsk;
 ...
     if (sk->sk_state == TCP_LISTEN) {
         struct sock *nsk = tcp_v4_cookie_check(sk, skb);
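         // NULL: error, drop the packet
         // nsk == sk: no new TCB was found, so this is the SYN of the first handshake step
         // nsk != sk: a new TCB was found, so this is the ACK of the third handshake step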
         if (!nsk)
             goto discard;
         if (nsk != sk) {
             if (tcp_child_process(sk, nsk, skb)) {
                 rsk = nsk;
                 goto reset;
             }
             return 0;
         }
     } else
         sock_rps_save_rxhash(sk, skb);
     // this function processes the received segment according to the TCP state
     if (tcp_rcv_state_process(sk, skb)) {
         rsk = sk;
         goto reset;
     }
     return 0;
 // send an RST to the peer
 reset:
     tcp_v4_send_reset(rsk, skb);
 ...
 }
 EXPORT_SYMBOL(tcp_v4_do_rcv);

The key function here is tcp_rcv_state_process, so we follow it:

tcp_rcv_state_process:

int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 {
     ...
     switch (sk->sk_state) {
     ...
     case TCP_LISTEN:
         // this state only handles SYN segments; a set ACK flag means an unexpected segment,
         // and returning 1 makes the caller reply with an RST
         if (th->ack)
             return 1;
         // an RST is simply ignored
         if (th->rst)
             goto discard;
         
         if (th->syn) {
             // a SYN was received
 ...
             acceptable = icsk->icsk_af_ops->conn_request(sk, skb) >= 0;
 ...
         }
     ...
 }
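The flag checks in the TCP_LISTEN case above reduce to a short decision function. The sketch below uses hypothetical names (struct tcp_flags, enum listen_action, listen_state_action) and only models which branch is taken, not what conn_request does.

```c
#include <assert.h>
#include <stddef.h>

/* Possible outcomes for a segment that reaches a listening socket. */
enum listen_action {
    LISTEN_RESET,        /* return 1: the caller answers with an RST */
    LISTEN_DISCARD,      /* drop the segment silently */
    LISTEN_CONN_REQUEST  /* hand the SYN to conn_request */
};

/* Hypothetical stand-in for the three header bits that matter here. */
struct tcp_flags {
    unsigned ack : 1;
    unsigned rst : 1;
    unsigned syn : 1;
};

enum listen_action listen_state_action(const struct tcp_flags *th)
{
    assert(th != NULL);
    if (th->ack)
        return LISTEN_RESET;         /* ACK is never expected in LISTEN */
    if (th->rst)
        return LISTEN_DISCARD;       /* RST is ignored */
    if (th->syn)
        return LISTEN_CONN_REQUEST;  /* a genuine connection request */
    return LISTEN_DISCARD;           /* anything else is dropped */
}
```

Since ACK is tested first, a SYN+ACK sent to a listening socket is answered with an RST rather than treated as a connection request.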

The conn_request pointer is set in the ipv4_specific structure:

const struct inet_connection_sock_af_ops ipv4_specific = {
     .queue_xmit       = ip_queue_xmit,
     .send_check       = tcp_v4_send_check,
     .rebuild_header       = inet_sk_rebuild_header,
     .sk_rx_dst_set       = inet_sk_rx_dst_set,
     .conn_request       = tcp_v4_conn_request,
     .syn_recv_sock       = tcp_v4_syn_recv_sock,
     .net_header_len       = sizeof(struct iphdr),
     .setsockopt       = ip_setsockopt,
     .getsockopt       = ip_getsockopt,
     .addr2sockaddr       = inet_csk_addr2sockaddr,
     .sockaddr_len       = sizeof(struct sockaddr_in),
 #ifdef CONFIG_COMPAT
     .compat_setsockopt = compat_ip_setsockopt,
     .compat_getsockopt = compat_ip_getsockopt,
 #endif
     .mtu_reduced       = tcp_v4_mtu_reduced,
 };
 EXPORT_SYMBOL(ipv4_specific);
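The call icsk->icsk_af_ops->conn_request(sk, skb) is an ordinary dispatch through a table of function pointers. A much-reduced sketch of the pattern follows; struct af_ops, struct conn, v4_conn_request, and handle_syn are hypothetical names for this example, and the real table has many more members.

```c
#include <assert.h>
#include <stddef.h>

struct conn;  /* forward declaration */

/* Hypothetical, much-reduced analogue of inet_connection_sock_af_ops. */
struct af_ops {
    int (*conn_request)(struct conn *c, int syn_seq);
};

struct conn {
    const struct af_ops *ops;  /* plays the role of icsk->icsk_af_ops */
    int pending;               /* number of connection requests seen */
};

/* IPv4 flavour of the handler; just counts requests in this sketch. */
static int v4_conn_request(struct conn *c, int syn_seq)
{
    (void)syn_seq;
    c->pending++;
    return 0;
}

/* One table per address family, filled in with designated initializers. */
const struct af_ops v4_ops = {
    .conn_request = v4_conn_request,
};

/* Mirrors "acceptable = ops->conn_request(sk, skb) >= 0". */
int handle_syn(struct conn *c, int syn_seq)
{
    assert(c != NULL && c->ops != NULL && c->ops->conn_request != NULL);
    return c->ops->conn_request(c, syn_seq) >= 0;
}
```

The state machine stays independent of the address family: an IPv6 socket would simply point ops at a different table.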

Next, tcp_v4_conn_request:

int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 {
     /* Never answer to SYNs send to broadcast or multicast */
     if (skb_rtable(skb)->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
         goto drop;
 
     return tcp_conn_request(&tcp_request_sock_ops,
                 &tcp_request_sock_ipv4_ops, sk, skb);
 
 drop:
     tcp_listendrop(sk);
     return 0;
 }
 EXPORT_SYMBOL(tcp_v4_conn_request);

The key function is tcp_conn_request, so we continue:

int tcp_conn_request(struct request_sock_ops *rsk_ops,
              const struct tcp_request_sock_ops *af_ops,
              struct sock *sk, struct sk_buff *skb)
 {
     /* 
         Note: this function sets the request's state to TCP_NEW_SYN_RECV
      */
     req = inet_reqsk_alloc(rsk_ops, sk, !want_cookie);
     ...
 
     if (fastopen_sk) {
         /* send the SYN+ACK */
         af_ops->send_synack(fastopen_sk, dst, &fl, req,
                     &foc, TCP_SYNACK_FASTOPEN);
         /* Add the child socket directly into the accept queue */
         if (!inet_csk_reqsk_queue_add(sk, req, fastopen_sk)) {
             reqsk_fastopen_remove(fastopen_sk, req, false);
             bh_unlock_sock(fastopen_sk);
             sock_put(fastopen_sk);
             goto drop_and_free;
         }
         sk->sk_data_ready(sk);
         bh_unlock_sock(fastopen_sk);
         sock_put(fastopen_sk);
     } else {
         tcp_rsk(req)->tfo_listener = false;
         if (!want_cookie)
             inet_csk_reqsk_queue_hash_add(sk, req,
                 tcp_timeout_init((struct sock *)req));
          /* send the SYN+ACK */
         af_ops->send_synack(sk, dst, &fl, req, &foc,
                     !want_cookie ? TCP_SYNACK_NORMAL :
                            TCP_SYNACK_COOKIE);
         if (want_cookie) {
             reqsk_free(req);
             return 0;
         }
     }
     ...
 }

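We finally arrive at tcp_v4_send_synack. The listing below, with its struct open_request parameter, comes from a much older kernel than 5.0.1, but it shows the essential steps: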

static int tcp_v4_send_synack(struct sock *sk, struct open_request *req,
                    struct dst_entry *dst)
  {
      int err = -1;
      struct sk_buff * skb;
      /* First, grab a route. */
      /* find the route back to the client */
      if (!dst && (dst = tcp_v4_route_req(sk, req)) == NULL)
          goto out;  
      /* build the SYN+ACK segment from the route, the transport control block and the connection request */
      skb = tcp_make_synack(sk, dst, req);  
      if (skb) { /* the SYN+ACK segment was built successfully */
          struct tcphdr *th = skb->h.th;
          /* compute the checksum */
          th->check = tcp_v4_check(th, skb->len,
                       req->af.v4_req.loc_addr,
                       req->af.v4_req.rmt_addr,
                       csum_partial((char *)th, skb->len,
                                skb->csum));
          /* build the IP datagram and send it */
          err = ip_build_and_send_pkt(skb, sk, req->af.v4_req.loc_addr,
                          req->af.v4_req.rmt_addr,
                          req->af.v4_req.opt);
          if (err == NET_XMIT_CN)
              err = 0;
      }
  out:
      dst_release(dst);
      return err;
  }

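Here tcp_make_synack(sk, dst, req) builds the SYN+ACK segment from the route, the transport control block, and the connection request block, and ip_build_and_send_pkt() builds the IP datagram and sends it out.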

 

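After the server has sent the SYN+ACK, the state still has to change when the next segment arrives, and the entry point is again tcp_v4_do_rcv. The listing below is from an older kernel, in which tcp_v4_hnd_req appears where the listing shown earlier calls tcp_v4_cookie_check: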

int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
 {
     struct sock *rsk;
 
     if (sk->sk_state == TCP_LISTEN) {
         // NULL: error, drop the packet
         // nsk == sk: this is the SYN of the first handshake step
         // nsk != sk: this is the ACK of the third handshake step
         struct sock *nsk = tcp_v4_hnd_req(sk, skb);
         if (!nsk)
             goto discard;
 
         if (nsk != sk) {
             // called when the ACK is received
             if (tcp_child_process(sk, nsk, skb)) {
                 rsk = nsk;
                 goto reset;
             }
             return 0;
         }
     }
 reset:
     tcp_v4_send_reset(rsk, skb);
 }

We continue with the key function it calls, tcp_child_process:

int tcp_child_process(struct sock *parent, struct sock *child,
               struct sk_buff *skb)
 {
     int ret = 0;
     int state = child->sk_state;
 
     /* record NAPI ID of child */
     sk_mark_napi_id(child, skb);
 
     tcp_segs_in(tcp_sk(child), skb);
     if (!sock_owned_by_user(child)) {
         ret = tcp_rcv_state_process(child, skb);
         /* Wakeup parent, send SIGIO */
         if (state == TCP_SYN_RECV && child->sk_state != state)
             parent->sk_data_ready(parent);
     } else {
         /* Alas, it is possible again, because we do lookup
          * in main socket hash table and lock on listening
          * socket does not protect us more.
          */
         __sk_add_backlog(child, skb);
     }
 
     bh_unlock_sock(child);
     sock_put(child);
     return ret;
 }
 EXPORT_SYMBOL(tcp_child_process);

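Its key function is familiar: tcp_rcv_state_process again, this time invoked on the server's child socket. On the client side, the socket that sent the SYN is in TCP_SYN_SENT, so when the SYN+ACK arrives tcp_rcv_state_process enters the TCP_SYN_SENT case and calls tcp_rcv_synsent_state_process to handle the segment: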

static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
                     struct tcphdr *th, unsigned len)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct inet_connection_sock *icsk = inet_csk(sk);
    int saved_clamp = tp->rx_opt.mss_clamp;

    // parse the TCP options
    tcp_parse_options(skb, &tp->rx_opt, 0);

    // the segment carries the ACK flag
    if (th->ack) {
        /* rfc793:
         * "If the state is SYN-SENT then
         *    first check the ACK bit
         *      If the ACK bit is set
         *      If SEG.ACK =< ISS, or SEG.ACK > SND.NXT, send
         *        a reset (unless the RST bit is set, if so drop
         *        the segment and return)"
         *
         *  We do not send data with SYN, so that RFC-correct
         *  test reduces to:
         */
        // the segment does not acknowledge our SYN, so an RST is sent to the peer
        if (TCP_SKB_CB(skb)->ack_seq != tp->snd_nxt)
            goto reset_and_undo;
        // timestamp option check
        if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
            !between(tp->rx_opt.rcv_tsecr, tp->retrans_stamp,
                 tcp_time_stamp)) {
            NET_INC_STATS_BH(LINUX_MIB_PAWSACTIVEREJECTED);
            goto reset_and_undo;
        }

        /* Now ACK is acceptable.
         *
         * "If the RST bit is set
         *    If the ACK was acceptable then signal the user "error:
         *    connection reset", drop the segment, enter CLOSED state,
         *    delete TCB, and return."
         */
        // the checks above confirmed a valid ACK; if RST is also set, reset the connection
        if (th->rst) {
            tcp_reset(sk);
            goto discard;
        }

        /* rfc793:
         *   "fifth, if neither of the SYN or RST bits is set then
         *    drop the segment and return."
         *
         *    See note below!
         *                                        --ANK(990513)
         */
        // in this state the segment should be a SYN+ACK
        if (!th->syn)
            goto discard_and_undo;

        // initialize some fields of the TCB

        /* rfc793:
         *   "If the SYN bit is on ...
         *    are acceptable then ...
         *    (our SYN has been ACKed), change the connection
         *    state to ESTABLISHED..."
         */
        TCP_ECN_rcv_synack(tp, th);
        tp->snd_wl1 = TCP_SKB_CB(skb)->seq;
        tcp_ack(sk, skb, FLAG_SLOWPATH);

        /* Ok.. it's good. Set up sequence numbers and
         * move to established.
         */
        tp->rcv_nxt = TCP_SKB_CB(skb)->seq + 1;
        tp->rcv_wup = TCP_SKB_CB(skb)->seq + 1;

        /* RFC1323: The window in SYN & SYN/ACK segments is
         * never scaled.
         */
        tp->snd_wnd = ntohs(th->window);
        tcp_init_wl(tp, TCP_SKB_CB(skb)->ack_seq, TCP_SKB_CB(skb)->seq);

        if (!tp->rx_opt.wscale_ok) {
            tp->rx_opt.snd_wscale = tp->rx_opt.rcv_wscale = 0;
            tp->window_clamp = min(tp->window_clamp, 65535U);
        }

        if (tp->rx_opt.saw_tstamp) {
            tp->rx_opt.tstamp_ok       = 1;
            tp->tcp_header_len =
                sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED;
            tp->advmss        -= TCPOLEN_TSTAMP_ALIGNED;
            tcp_store_ts_recent(tp);
        } else {
            tp->tcp_header_len = sizeof(struct tcphdr);
        }

        if (tcp_is_sack(tp) && sysctl_tcp_fack)
            tcp_enable_fack(tp);

        tcp_mtup_init(sk);
        tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
        tcp_initialize_rcv_mss(sk);

        /* Remember, tcp_poll() does not lock socket!
         * Change state from SYN-SENT only after copied_seq
         * is initialized. */
        tp->copied_seq = tp->rcv_nxt;
        smp_mb();
        // the client can move the TCB to TCP_ESTABLISHED once the SYN+ACK is received
        tcp_set_state(sk, TCP_ESTABLISHED);

        security_inet_conn_established(sk, skb);

        /* Make sure socket is routed, for correct metrics.  */
        icsk->icsk_af_ops->rebuild_header(sk);

        tcp_init_metrics(sk);
        // initialize congestion control
        tcp_init_congestion_control(sk);

        /* Prevent spurious tcp_cwnd_restart() on first data
         * packet.
         */
        tp->lsndtime = tcp_time_stamp;
        // initialize the send and receive buffers
        tcp_init_buffer_space(sk);

        // start the keepalive timer if requested
        if (sock_flag(sk, SOCK_KEEPOPEN))
            inet_csk_reset_keepalive_timer(sk, keepalive_time_when(tp));

        // set the header prediction flags
        if (!tp->rx_opt.snd_wscale)
            __tcp_fast_path_on(tp, tp->snd_wnd);
        else
            tp->pred_flags = 0;

        // wake up the connect() system call, whose caller is probably blocked waiting
        if (!sock_flag(sk, SOCK_DEAD)) {
            sk->sk_state_change(sk);
            sk_wake_async(sk, SOCK_WAKE_IO, POLL_OUT);
        }

        // decide between a delayed ACK and an immediate ACK
        if (sk->sk_write_pending ||
            icsk->icsk_accept_queue.rskq_defer_accept ||
            icsk->icsk_ack.pingpong) {
            /* Save one ACK. Data will be ready after
             * several ticks, if write_pending is set.
             *
             * It may be deleted, but with this feature tcpdumps
             * look so _wonderfully_ clever, that I was not able
             * to stand against the temptation 8)     --ANK
             */
            inet_csk_schedule_ack(sk);
            icsk->icsk_ack.lrcvtime = tcp_time_stamp;
            icsk->icsk_ack.ato     = TCP_ATO_MIN;
            tcp_incr_quickack(sk);
            tcp_enter_quickack_mode(sk);
            inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
                          TCP_DELACK_MAX, TCP_RTO_MAX);

discard:
            __kfree_skb(skb);
            return 0;
        } else {
            // acknowledge immediately
            tcp_send_ack(sk);
        }
        return -1;
    }

    // RST without ACK: drop the segment; no RST is sent in reply
    if (th->rst) {
        /* rfc793:
         * "If the RST bit is set
         *
         *      Otherwise (no ACK) drop the segment and return."
         */

        goto discard_and_undo;
    }

    /* PAWS check. */
    if (tp->rx_opt.ts_recent_stamp && tp->rx_opt.saw_tstamp &&
        tcp_paws_check(&tp->rx_opt, 0))
        goto discard_and_undo;

    // a bare SYN was received: this is the simultaneous-open case
    if (th->syn) {
        /* We see SYN without ACK. It is attempt of
         * simultaneous connect with crossed SYNs.
         * Particularly, it can be connect to self.
         */
        // the current state is TCP_SYN_SENT; after the SYN, move to TCP_SYN_RECV
        tcp_set_state(sk, TCP_SYN_RECV);

        // reinitialize some fields of the TCB
        if (tp->rx_opt.saw_tstamp) {
            tp->rx_opt.tstamp_ok = 1;
            tcp_store_ts_recent(tp);
            tp->tcp_header_len =
                sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED;
        } else {
            tp->tcp_header_len = sizeof(struct tcphdr);
        }

        tp->rcv_nxt = TCP_SKB_CB(skb)->seq + 1;
        tp->rcv_wup = TCP_SKB_CB(skb)->seq + 1;

        /* RFC1323: The window in SYN & SYN/ACK segments is
         * never scaled.
         */
        tp->snd_wnd    = ntohs(th->window);
        tp->snd_wl1    = TCP_SKB_CB(skb)->seq;
        tp->max_window = tp->snd_wnd;

        TCP_ECN_rcv_syn(tp, th);

        tcp_mtup_init(sk);
        tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
        tcp_initialize_rcv_mss(sk);
        // send a SYN+ACK to the peer; when the peer's ACK arrives, the handshake completes and the connection is established
        tcp_send_synack(sk);
        // drop the received SYN and return 0 to stop further processing
        goto discard;
    }
    /* "fifth, if neither of the SYN or RST bits is set then
     * drop the segment and return."
     */

discard_and_undo:
    tcp_clear_options(&tp->rx_opt);
    tp->rx_opt.mss_clamp = saved_clamp;
    goto discard;

reset_and_undo:
    // clear the options; returning 1 makes the caller send an RST to the peer
    tcp_clear_options(&tp->rx_opt);
    tp->rx_opt.mss_clamp = saved_clamp;
    return 1;
}

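Here tcp_ack() processes the received ACK, and tcp_send_ack() sends the ACK to the server during an active open, completing the connection and updating the window. When tcp_rcv_synsent_state_process returns -1, tcp_rcv_state_process goes on to call tcp_urg(sk, skb, th), which handles urgent (out-of-band) data after the second handshake step, and tcp_data_snd_check(sk), which checks whether there is data waiting to be sent.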
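The branching in tcp_rcv_synsent_state_process is easier to follow once the bookkeeping is stripped away. The sketch below keeps only the flag and acknowledgment-number decisions; it deliberately leaves out the timestamp and PAWS checks, and struct seg, enum synsent_result, and synsent_classify are hypothetical names for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum synsent_result {
    SS_RESET,        /* reset_and_undo: answer with an RST */
    SS_ABORT,        /* acceptable ACK carrying RST: connection is reset */
    SS_DISCARD,      /* drop the segment silently */
    SS_ESTABLISHED,  /* SYN+ACK accepted: move to ESTABLISHED */
    SS_SYN_RECV      /* bare SYN: simultaneous open, move to SYN_RECV */
};

/* Hypothetical stand-in for the header fields that drive the decision. */
struct seg {
    int      syn, ack, rst;
    uint32_t ack_seq;
};

/* snd_nxt is the sequence number right after our own SYN. */
enum synsent_result synsent_classify(const struct seg *s, uint32_t snd_nxt)
{
    assert(s != NULL);
    if (s->ack) {
        if (s->ack_seq != snd_nxt)
            return SS_RESET;       /* does not acknowledge our SYN */
        if (s->rst)
            return SS_ABORT;       /* peer refused the connection */
        if (!s->syn)
            return SS_DISCARD;     /* ACK without SYN is not a SYN+ACK */
        return SS_ESTABLISHED;
    }
    if (s->rst)
        return SS_DISCARD;         /* RST without ACK is ignored */
    if (s->syn)
        return SS_SYN_RECV;        /* crossed SYNs */
    return SS_DISCARD;
}
```

For example, with snd_nxt equal to 101, a SYN+ACK acknowledging 101 is classified as SS_ESTABLISHED, while the same segment acknowledging 100 is classified as SS_RESET.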

 

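Finally we need the function the server calls when it receives the ACK. This is again reached through the tcp_v4_do_rcv shown earlier, so the code is not repeated here; it calls tcp_rcv_state_process() to do the work: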

int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 {
  ...
     case TCP_SYN_RECV:
         tp->delivered++; /* SYN-ACK delivery isn't tracked in tcp_ack */
         if (!tp->srtt_us)
             tcp_synack_rtt_meas(sk, req);
 
         if (req) {
             tcp_rcv_synrecv_state_fastopen(sk);
         } else {
             tcp_try_undo_spurious_syn(sk);
             tp->retrans_stamp = 0;
             tcp_init_transfer(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
             WRITE_ONCE(tp->copied_seq, tp->rcv_nxt);
         }
         smp_mb();
         // move to the ESTABLISHED state
         tcp_set_state(sk, TCP_ESTABLISHED);
         sk->sk_state_change(sk);
 
         /* Note, that this wakeup is only for marginal crossed SYN case.
          * Passively open sockets are not waked up, because
          * sk->sk_sleep == NULL and sk->sk_socket == NULL.
          */
         if (sk->sk_socket)
             sk_wake_async(sk, SOCK_WAKE_IO, POLL_OUT);
 
         tp->snd_una = TCP_SKB_CB(skb)->ack_seq;
         tp->snd_wnd = ntohs(th->window) << tp->rx_opt.snd_wscale;
         tcp_init_wl(tp, TCP_SKB_CB(skb)->seq);
 
         if (tp->rx_opt.tstamp_ok)
             tp->advmss -= TCPOLEN_TSTAMP_ALIGNED;
 
         if (!inet_csk(sk)->icsk_ca_ops->cong_control)
             tcp_update_pacing_rate(sk);
 
         /* Prevent spurious tcp_cwnd_restart() on first data packet */
         tp->lsndtime = tcp_jiffies32;
 
         tcp_initialize_rcv_mss(sk);
         tcp_fast_path_on(tp);
         break;
         ...
}

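At last ESTABLISHED appears, and the connection is set up.

We have now traced the function calls of the three-way handshake. To summarize:

Step 1: the client sends the SYN segment

It is sent through tcp_v4_connect()->tcp_connect()->tcp_transmit_skb(), and the socket is set to TCP_SYN_SENT.

Step 2: the server receives the SYN and sends the SYN/ACK

This is handled by tcp_v4_do_rcv()->tcp_rcv_state_process()->tcp_v4_conn_request()->tcp_conn_request()->tcp_v4_send_synack(), as the figure shows:

[Figure 3]

Step 3: the client replies with the ACK

This is handled by tcp_v4_do_rcv()->tcp_rcv_state_process()->tcp_rcv_synsent_state_process(). The client is in TCP_SYN_SENT and moves to TCP_ESTABLISHED.

Step 4: the server receives the ACK

This is handled by tcp_v4_do_rcv()->tcp_child_process()->tcp_rcv_state_process(). The server-side socket moves from TCP_SYN_RECV to TCP_ESTABLISHED.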
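The four steps amount to a small state machine. The sketch below models the per-connection state on each side; enum tcp_state, enum event, and next_state are hypothetical names, and the model is simplified in one important way: in the kernel the listening socket itself stays in LISTEN while a separate request or child socket carries the SYN_RECV state.

```c
#include <assert.h>

enum tcp_state { TS_CLOSED, TS_LISTEN, TS_SYN_SENT, TS_SYN_RECV, TS_ESTABLISHED };
enum event     { EV_CONNECT, EV_RCV_SYN, EV_RCV_SYNACK, EV_RCV_ACK };

/* Returns the state after event e; unexpected events leave it unchanged. */
enum tcp_state next_state(enum tcp_state s, enum event e)
{
    assert(s <= TS_ESTABLISHED);
    switch (s) {
    case TS_CLOSED:                       /* step 1: active open */
        return e == EV_CONNECT ? TS_SYN_SENT : s;
    case TS_LISTEN:                       /* step 2: passive side sees the SYN */
        return e == EV_RCV_SYN ? TS_SYN_RECV : s;
    case TS_SYN_SENT:
        if (e == EV_RCV_SYNACK)           /* step 3: client gets the SYN+ACK */
            return TS_ESTABLISHED;
        if (e == EV_RCV_SYN)              /* simultaneous open */
            return TS_SYN_RECV;
        return s;
    case TS_SYN_RECV:                     /* step 4: server gets the final ACK */
        return e == EV_RCV_ACK ? TS_ESTABLISHED : s;
    default:
        return s;
    }
}
```

Driving a client through EV_CONNECT and EV_RCV_SYNACK, and a server through EV_RCV_SYN and EV_RCV_ACK, leaves both sides in TS_ESTABLISHED.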

III. Runtime Tracing

Use gdb to set breakpoints at the functions identified above.

[Figure 4]

[Figure 5]
