Showing posts with label Network. Show all posts

Monday, March 2, 2009

FD_xxx API

 

typedef unsigned int u_int;
typedef u_int    SOCKET;

typedef struct fd_set {
    u_int   fd_count;
    SOCKET  fd_array[FD_SETSIZE];
} fd_set;

#define FD_CLR(fd, set) \
    do                                                                   \
    {                                                                    \
        u_int __i;                                                       \
        for (__i = 0; __i < ((fd_set *)(set))->fd_count; __i++)         \
        {                                                                \
            if (((fd_set *)(set))->fd_array[__i] == (fd))               \
            {                                                            \
                while (__i < ((fd_set *)(set))->fd_count - 1)           \
                {                                                        \
                    ((fd_set*)(set))->fd_array[__i] = ((fd_set*)(set))->fd_array[__i + 1]; \
                    __i++;                                               \
                }                                                        \
                ((fd_set*)(set))->fd_count--;                            \
                break;                                                   \
            }                                                            \
        }                                                                \
    } while (0)

#define FD_SET(fd, set)                                                  \
    do                                                                   \
    {                                                                    \
        u_int __i;                                                       \
        for (__i = 0; __i < ((fd_set *)(set))->fd_count; __i++)         \
        {                                                                \
            if (((fd_set *)(set))->fd_array[__i] == (fd))               \
            {                                                            \
                break;                                                   \
            }                                                            \
        }                                                                \
        if (__i == ((fd_set *)(set))->fd_count)                         \
        {                                                                \
            if (((fd_set *)(set))->fd_count < FD_SETSIZE)               \
            {                                                            \
                ((fd_set *)(set))->fd_array[__i] = (fd);                 \
                ((fd_set *)(set))->fd_count++;                           \
            }                                                            \
        }                                                                \
} while(0)

#define FD_ZERO(set) (((fd_set *)(set))->fd_count = 0)

#define FD_ISSET(fd, set) __WSAFDIsSet((SOCKET)(fd), (fd_set *)(set))
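The macros above are the Winsock flavor, where fd_set is a counted array of SOCKET values. On POSIX systems fd_set is usually a bitmask, but the calling pattern (FD_ZERO, FD_SET, select, FD_ISSET) is the same. A minimal POSIX sketch follows; the helper name fd_is_readable is made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if fd becomes readable within
 * timeout_ms milliseconds, 0 on timeout, -1 on error. */
int fd_is_readable(int fd, int timeout_ms)
{
    fd_set rset;
    struct timeval tv;
    int n;

    assert(fd >= 0 && fd < FD_SETSIZE);

    FD_ZERO(&rset);             /* start with an empty set */
    FD_SET(fd, &rset);          /* add the one descriptor we care about */

    tv.tv_sec  = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    n = select(fd + 1, &rset, NULL, NULL, &tv);
    if (n < 0)
        return -1;

    /* On timeout select() clears the set, so FD_ISSET reports 0. */
    return FD_ISSET(fd, &rset) ? 1 : 0;
}
```

Calling fd_is_readable on the read end of an empty pipe should time out and return 0; after one byte is written to the other end it should return 1.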

Thursday, February 26, 2009

Handling SIGCHLD Signals

The purpose of the zombie state is to maintain information about the child for the parent to fetch at some later time. This information includes the process ID of the child, its termination status, and information on the resource utilization of the child (CPU time, memory, etc.). If a process terminates, and that process has children in the zombie state, the parent process ID of all the zombie children is set to 1 (the init process), which will inherit the children and clean them up (i.e., init will wait for them, which removes the zombie). Some Unix systems show the COMMAND column for a zombie process as <defunct>.

Handling Zombies

Obviously we do not want to leave zombies around. They take up space in the kernel and eventually we can run out of processes. Whenever we fork children, we must wait for them to prevent them from becoming zombies. To do this, we establish a signal handler to catch SIGCHLD, and within the handler, we call wait.

We establish the signal handler by adding the function call

Signal (SIGCHLD, sig_chld);

#include     "unp.h"

void
sig_chld(int signo)
{
    pid_t   pid;
    int     stat;

    pid = wait(&stat);
    printf("child %d terminated\n", pid);
    return;
}

Warning: Calling standard I/O functions such as printf in a signal handler is not recommended. We call printf here as a diagnostic tool to see when the child terminates.

Under System V and Unix 98, the child of a process does not become a zombie if the process sets the disposition of SIGCHLD to SIG_IGN. Unfortunately, this works only under System V and Unix 98. POSIX explicitly states that this behavior is unspecified. The portable way to handle zombies is to catch SIGCHLD and call wait or waitpid.
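As a sketch of the portable approach, the handler below uses waitpid with WNOHANG in a loop rather than a single wait, so that several children terminating close together are all reaped. This is a variation on the handler shown earlier, not the same code. The names sig_chld_reap, install_sigchld_reaper, and reap_demo are made up for illustration, and sigaction is called directly instead of the Signal wrapper.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static volatile sig_atomic_t reaped = 0;    /* children reaped so far */

static void sig_chld_reap(int signo)
{
    int saved_errno = errno;    /* waitpid() may change errno */

    (void)signo;
    while (waitpid(-1, NULL, WNOHANG) > 0)
        reaped++;               /* one zombie removed per iteration */
    errno = saved_errno;
}

int install_sigchld_reaper(void)
{
    struct sigaction act;

    memset(&act, 0, sizeof(act));
    act.sa_handler = sig_chld_reap;
    sigemptyset(&act.sa_mask);
    act.sa_flags = SA_RESTART;  /* ask for restarts where supported */
    return sigaction(SIGCHLD, &act, NULL);
}

/* Fork nchildren children that exit at once; return how many were reaped. */
int reap_demo(int nchildren)
{
    struct timespec ts = { 0, 10 * 1000 * 1000 };   /* 10 ms */
    int i;

    assert(nchildren > 0);
    reaped = 0;
    if (install_sigchld_reaper() < 0)
        return -1;
    for (i = 0; i < nchildren; i++) {
        pid_t pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0)
            _exit(0);           /* child terminates immediately */
    }
    /* Poll briefly; nanosleep() may return early when SIGCHLD arrives. */
    for (i = 0; i < 500 && reaped < nchildren; i++)
        nanosleep(&ts, NULL);
    return (int)reaped;
}
```

The handler saves and restores errno because it can interrupt code in the main program that is about to inspect errno.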

#include    "unp.h"

int
main(int argc, char **argv)
{
    int                    listenfd, connfd;
    pid_t                childpid;
    socklen_t            clilen;
    struct sockaddr_in    cliaddr, servaddr;
    void                sig_chld(int);

    listenfd = Socket(AF_INET, SOCK_STREAM, 0);

    bzero(&servaddr, sizeof(servaddr));
    servaddr.sin_family      = AF_INET;
    servaddr.sin_addr.s_addr = htonl(INADDR_ANY);
    servaddr.sin_port        = htons(SERV_PORT);

    Bind(listenfd, (SA *) &servaddr, sizeof(servaddr));

    Listen(listenfd, LISTENQ);

    Signal(SIGCHLD, sig_chld);

    for ( ; ; ) {
        clilen = sizeof(cliaddr);
        if ( (connfd = accept(listenfd, (SA *) &cliaddr, &clilen)) < 0) {
            if (errno == EINTR)
                continue;        /* back to for() */
            else
                err_sys("accept error");
        }

        if ( (childpid = Fork()) == 0) {    /* child process */
            Close(listenfd);    /* close listening socket */
            str_echo(connfd);    /* process the request */
            exit(0);
        }
        Close(connfd);            /* parent closes connected socket */
    }
}

solaris % tcpserv02 & // start server in background

[2] 16939

solaris % tcpcli01 127.0.0.1 // then start client in foreground

hi there // we type this

hi there // and this is echoed

^D       // we type our EOF character

child 16942 terminated // output by printf in signal handler

accept error: Interrupted system call // main function aborts

The sequence of steps is as follows:




  1. We terminate the client by typing our EOF character. The client TCP sends a FIN to the server and the server responds with an ACK.



  2. The receipt of the FIN delivers an EOF to the child's pending readline. The child terminates.



  3. The parent is blocked in its call to accept when the SIGCHLD signal is delivered. The sig_chld function executes (our signal handler), wait fetches the child's PID and termination status, and printf is called from the signal handler. The signal handler returns.



  4. Since the signal was caught by the parent while the parent was blocked in a slow system call (accept), the kernel causes the accept to return an error of EINTR (interrupted system call). The parent does not handle this error, so it aborts.


The purpose of this example is to show that when writing network programs that catch signals, we must be cognizant of interrupted system calls, and we must handle them. In this specific example, running under Solaris 9, the signal function provided in the standard C library does not cause an interrupted system call to be automatically restarted by the kernel. That is, the SA_RESTART flag that we set is not set by the signal function in the system library. Some other systems automatically restart the interrupted system call. If we run the same example under 4.4BSD, using its library version of the signal function, the kernel restarts the interrupted system call and accept does not return an error. Handling this difference between operating systems is one reason we define our own version of the signal function, which we use throughout the text.

As part of the coding conventions used in this text, we always code an explicit return in our signal handlers, even though falling off the end of the function does the same thing for a function returning void. When reading the code, the unnecessary return statement acts as a reminder that the return may interrupt a system call.

Handling Interrupted System Calls


We used the term "slow system call" to describe accept, and we use this term for any system call that can block forever. That is, the system call need never return. Most networking functions fall into this category. For example, there is no guarantee that a server's call to accept will ever return, if there are no clients that will connect to the server. Similarly, our server's call to read will never return if the client never sends a line for the server to echo. Other examples of slow system calls are reads and writes of pipes and terminal devices. A notable exception is disk I/O, which usually returns to the caller (assuming no catastrophic hardware failure).

The basic rule that applies here is that when a process is blocked in a slow system call and the process catches a signal and the signal handler returns, the system call can return an error of EINTR. Some kernels automatically restart some interrupted system calls. For portability, when we write a program that catches signals (most concurrent servers catch SIGCHLD), we must be prepared for slow system calls to return EINTR. Portability problems are caused by the qualifiers "can" and "some," which were used earlier, and the fact that support for the POSIX SA_RESTART flag is optional. Even if an implementation supports the SA_RESTART flag, not all interrupted system calls may automatically be restarted. Most Berkeley-derived implementations, for example, never automatically restart select, and some of these implementations never restart accept or recvfrom.
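The rule can be observed with a small experiment: block in read on an empty pipe, let a timer signal arrive, and compare the result with and without SA_RESTART. The sketch below is an illustration under assumptions, not code from the text: the function name interrupted_read is made up, and the restart behavior it shows is what Linux does for a pipe read; other kernels may differ, as the paragraph above warns.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static int wake_fd = -1;    /* write end of the pipe, used by the handler */

static void on_alarm(int signo)
{
    int saved_errno = errno;

    (void)signo;
    /* write() is async-signal-safe. The byte gives a restarted read()
     * something to return. */
    if (write(wake_fd, "x", 1) < 0) {
        /* nothing useful can be done inside the handler */
    }
    errno = saved_errno;
}

/* Block in read() on an empty pipe while SIGALRM arrives.
 * Returns read()'s result; *err receives errno if it failed, else 0. */
ssize_t interrupted_read(int use_restart, int *err)
{
    int p[2];
    char c;
    ssize_t n;
    struct sigaction act, oact;
    struct itimerval it;

    assert(err != NULL);
    if (pipe(p) < 0) {
        *err = errno;
        return -1;
    }
    wake_fd = p[1];

    memset(&act, 0, sizeof(act));
    act.sa_handler = on_alarm;
    sigemptyset(&act.sa_mask);
    act.sa_flags = use_restart ? SA_RESTART : 0;
    sigaction(SIGALRM, &act, &oact);

    memset(&it, 0, sizeof(it));
    it.it_value.tv_usec = 50000;        /* one shot, 50 ms from now */
    setitimer(ITIMER_REAL, &it, NULL);

    n = read(p[0], &c, 1);              /* slow system call: blocks */
    *err = (n < 0) ? errno : 0;

    sigaction(SIGALRM, &oact, NULL);    /* restore previous disposition */
    close(p[0]);
    close(p[1]);
    return n;
}
```

Without SA_RESTART the read is expected to fail with EINTR once the handler returns. With SA_RESTART on a kernel that restarts this call, the read is reissued and returns the byte the handler wrote.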

To handle an interrupted accept, we change the call to accept, the beginning of the for loop, to the following:

    for ( ; ; ) {
        clilen = sizeof (cliaddr);
        if ( (connfd = accept (listenfd, (SA *) &cliaddr, &clilen)) < 0) {
            if (errno == EINTR)
                continue; /* back to for () */
            else
                err_sys ("accept error");
        }

Notice that we call accept and not our wrapper function Accept, since we must handle the failure of the function ourselves.

What we are doing in this piece of code is restarting the interrupted system call. This is fine for accept, along with functions such as read, write, select, and open. But there is one function that we cannot restart: connect. If this function returns EINTR, we cannot call it again, as doing so will return an immediate error. When connect is interrupted by a caught signal and is not automatically restarted, we must call select to wait for the connection to complete.
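For the connect case, one common way to wait for the connection to complete is to select on the socket for writability and then read the pending error with the SO_ERROR socket option. The sketch below assumes that approach; the function name finish_interrupted_connect is made up, and the code is an illustration rather than the text's own implementation.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Call after connect() on a blocking socket has failed with EINTR.
 * Returns 0 once the connection is established, or -1 with errno set. */
int finish_interrupted_connect(int sockfd)
{
    fd_set wset;
    int err = 0;
    socklen_t len = sizeof(err);

    assert(sockfd >= 0 && sockfd < FD_SETSIZE);

    for ( ; ; ) {
        FD_ZERO(&wset);
        FD_SET(sockfd, &wset);
        /* The socket becomes writable when the connect finishes,
         * whether it succeeded or failed. */
        if (select(sockfd + 1, NULL, &wset, NULL, NULL) < 0) {
            if (errno == EINTR)
                continue;       /* select itself can be interrupted */
            return -1;
        }
        break;
    }

    /* SO_ERROR reports (and clears) the pending error, 0 on success. */
    if (getsockopt(sockfd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
        return -1;
    if (err != 0) {
        errno = err;
        return -1;
    }
    return 0;
}
```

On a socket that is already connected, the function returns 0 immediately, which makes it easy to exercise with a socketpair.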


    POSIX Signal Handling

    A signal is a notification to a process that an event has occurred. Signals are sometimes called software interrupts. Signals usually occur asynchronously. By this we mean that a process doesn't know ahead of time exactly when a signal will occur.

    Signals can be sent

    • By one process to another process (or to itself)

    • By the kernel to a process

    The SIGCHLD signal that we described at the end of the previous section is one that is sent by the kernel whenever a process terminates, to the parent of the terminating process.

    Every signal has a disposition, which is also called the action associated with the signal. We set the disposition of a signal by calling the sigaction function (described shortly) and we have three choices for the disposition:

    1. We can provide a function that is called whenever a specific signal occurs. This function is called a signal handler and this action is called catching a signal. The two signals SIGKILL and SIGSTOP cannot be caught. Our function is called with a single integer argument that is the signal number and the function returns nothing. Its function prototype is therefore

      void handler (int signo);

      For most signals, calling sigaction and specifying a function to be called when the signal occurs is all that is required to catch a signal. But we will see later that a few signals, SIGIO, SIGPOLL, and SIGURG, all require additional actions on the part of the process to catch the signal.



    2. We can ignore a signal by setting its disposition to SIG_IGN. The two signals SIGKILL and SIGSTOP cannot be ignored.



    3. We can set the default disposition for a signal by setting its disposition to SIG_DFL. The default is normally to terminate a process on receipt of a signal, with certain signals also generating a core image of the process in its current working directory. There are a few signals whose default disposition is to be ignored: SIGCHLD and SIGURG (sent on the arrival of out-of-band data) are two that we will encounter in this text.
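The three choices can be tried in turn on a single signal. The sketch below uses SIGUSR1 and calls sigaction directly; the names set_disposition and disposition_demo are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t got_signal = 0;

static void handler(int signo)
{
    got_signal = signo;         /* record which signal was caught */
}

/* Set the disposition of signo to a handler, SIG_IGN, or SIG_DFL. */
static int set_disposition(int signo, void (*disp)(int))
{
    struct sigaction act;

    assert(signo > 0);
    memset(&act, 0, sizeof(act));
    act.sa_handler = disp;
    sigemptyset(&act.sa_mask);
    return sigaction(signo, &act, NULL);
}

/* Returns the signal number seen by the handler, or -1 on failure. */
int disposition_demo(void)
{
    got_signal = 0;

    /* Choice 2: ignore. The signal is discarded. */
    if (set_disposition(SIGUSR1, SIG_IGN) < 0)
        return -1;
    raise(SIGUSR1);
    if (got_signal != 0)
        return -1;

    /* Choice 1: catch. In a single-threaded program the handler
     * runs before raise() returns. */
    if (set_disposition(SIGUSR1, handler) < 0)
        return -1;
    raise(SIGUSR1);

    /* Choice 3: back to the default (terminate, for SIGUSR1). */
    if (set_disposition(SIGUSR1, SIG_DFL) < 0)
        return -1;
    return (int)got_signal;
}
```

The demo deliberately does not raise SIGUSR1 after restoring SIG_DFL, since the default action for that signal terminates the process.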


    signal Function


    The POSIX way to establish the disposition of a signal is to call the sigaction function. This gets complicated, however, as one argument to the function is a structure that we must allocate and fill in. An easier way to set the disposition of a signal is to call the signal function. The first argument is the signal name and the second argument is either a pointer to a function or one of the constants SIG_IGN or SIG_DFL. But, signal is an historical function that predates POSIX. Different implementations provide different signal semantics when it is called, providing backward compatibility, whereas POSIX explicitly spells out the semantics when sigaction is called. The solution is to define our own function named signal that just calls the POSIX sigaction function. This provides a simple interface with the desired POSIX semantics. We include this function in our own library, along with our err_XXX functions and our wrapper functions, for example, that we specify when building any of our programs in this text.

    1 #include    "unp.h"

    2 Sigfunc *
    3 signal (int signo, Sigfunc *func)
    4 {
    5     struct sigaction act, oact;

    6     act.sa_handler = func;
    7     sigemptyset (&act.sa_mask);
    8     act.sa_flags = 0;
    9     if (signo == SIGALRM) {
    10 #ifdef  SA_INTERRUPT
    11         act.sa_flags |= SA_INTERRUPT;     /* SunOS 4.x */
    12 #endif
    13     } else {
    14 #ifdef  SA_RESTART
    15         act.sa_flags |= SA_RESTART; /* SVR4, 4.4BSD */
    16 #endif
    17     }
    18     if (sigaction (signo, &act, &oact) < 0)
    19         return (SIG_ERR);
    20     return (oact.sa_handler);
    21 }

    Sigfunc *
    Signal(int signo, Sigfunc *func)    /* for our signal() function */
    {
        Sigfunc    *sigfunc;

        if ( (sigfunc = signal(signo, func)) == SIG_ERR)
            err_sys("signal error");
        return(sigfunc);
    }

    #define SIG_DFL (void (*)(int))0
    #define SIG_IGN (void (*)(int))1
    #define SIG_ERR (void (*)(int))-1

    Simplify function prototype using typedef


    2–3 The normal function prototype for signal is complicated by the level of nested parentheses.

    void (*signal (int signo, void (*func) (int))) (int);

    To simplify this, we define the Sigfunc type in our unp.h header as

    typedef    void    Sigfunc(int);

    stating that signal handlers are functions with an integer argument and the function returns nothing (void). The function prototype then becomes

    Sigfunc *signal (int signo, Sigfunc *func);

    A pointer to a signal handling function is the second argument to the function, as well as the return value from the function.


    Set handler


    6 The sa_handler member of the sigaction structure is set to the func argument.


    Set signal mask for handler


    7 POSIX allows us to specify a set of signals that will be blocked when our signal handler is called. Any signal that is blocked cannot be delivered to a process. We set the sa_mask member to the empty set, which means that no additional signals will be blocked while our signal handler is running. POSIX guarantees that the signal being caught is always blocked while its handler is executing.


    Set SA_RESTART flag


    8–17 SA_RESTART is an optional flag. When the flag is set, a system call interrupted by this signal will be automatically restarted by the kernel. (We will talk more about interrupted system calls in the next section when we continue our example.) If the signal being caught is not SIGALRM, we specify the SA_RESTART flag, if defined. (The reason for making a special case for SIGALRM is that the purpose of generating this signal is normally to place a timeout on an I/O operation, in which case, we want the blocked system call to be interrupted by the signal.) Some older systems, notably SunOS 4.x, automatically restart an interrupted system call by default and then define the complement of this flag as SA_INTERRUPT. If this flag is defined, we set it if the signal being caught is SIGALRM.


    Call sigaction


    18–20 We call sigaction and then return the old action for the signal as the return value of the signal function.


    POSIX Signal Semantics


    We summarize the following points about signal handling on a POSIX-compliant system:



    • Once a signal handler is installed, it remains installed. (Older systems removed the signal handler each time it was executed.)



    • While a signal handler is executing, the signal being delivered is blocked. Furthermore, any additional signals that were specified in the sa_mask signal set passed to sigaction when the handler was installed are also blocked. We set sa_mask to the empty set, meaning no additional signals are blocked other than the signal being caught.



    • If a signal is generated one or more times while it is blocked, it is normally delivered only one time after the signal is unblocked. That is, by default, Unix signals are not queued. We will see an example of this in the next section. The POSIX real-time standard, 1003.1b, defines some reliable signals that are queued, but we do not use them in this text.



    • It is possible to selectively block and unblock a set of signals using the sigprocmask function. This lets us protect a critical region of code by preventing certain signals from being caught while that region of code is executing.
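The last two points can be seen together in a short experiment: block SIGUSR1 with sigprocmask, generate it several times, then unblock it and count how often the handler runs. This is a sketch with made-up names (count_usr1, blocked_signal_demo). On systems where standard signals are not queued, such as Linux, the expected count is one.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t deliveries = 0;

static void count_usr1(int signo)
{
    (void)signo;
    deliveries++;
}

/* Generate SIGUSR1 `times` times while it is blocked, then unblock it.
 * Returns how many times the handler actually ran, or -1 on error. */
int blocked_signal_demo(int times)
{
    struct sigaction act, oact;
    sigset_t block;
    int i;

    assert(times > 0);
    deliveries = 0;

    memset(&act, 0, sizeof(act));
    act.sa_handler = count_usr1;
    sigemptyset(&act.sa_mask);
    if (sigaction(SIGUSR1, &act, &oact) < 0)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, NULL) < 0)
        return -1;

    /* Critical region: the signal stays pending, the handler does not run. */
    for (i = 0; i < times; i++)
        raise(SIGUSR1);
    if (deliveries != 0)
        return -1;

    /* The pending signal is delivered before sigprocmask() returns. */
    if (sigprocmask(SIG_UNBLOCK, &block, NULL) < 0)
        return -1;

    sigaction(SIGUSR1, &oact, NULL);    /* restore previous disposition */
    return (int)deliveries;
}
```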

    Monday, February 23, 2009

    Normal Startup

    Although our TCP example is small (about 150 lines of code for the two main functions, str_echo, str_cli, readline, and writen), it is essential that we understand how the client and server start, how they end, and most importantly, what happens when something goes wrong: the client host crashes, the client process crashes, network connectivity is lost, and so on. Only by understanding these boundary conditions, and their interaction with the TCP/IP protocols, can we write robust clients and servers that can handle these conditions.

    We first start the server in the background on the host linux.

    linux % tcpserv01 &
    [1] 17870

    When the server starts, it calls socket, bind, listen, and accept, blocking in the call to accept. (We have not started the client yet.) Before starting the client, we run the netstat program to verify the state of the server's listening socket.

    linux % netstat -a
    Active Internet connections (servers and established)
    Proto Recv-Q Send-Q Local Address Foreign Address State
    tcp 0 0 *:9877 *:* LISTEN

    Here we show only the first line of output (the heading), plus the line that we are interested in. This command shows the status of all sockets on the system, which can be lots of output. We must specify the -a flag to see listening sockets.

    The output is what we expect. A socket is in the LISTEN state with a wildcard for the local IP address and a local port of 9877. netstat prints an asterisk for an IP address of 0 (INADDR_ANY, the wildcard) or for a port of 0.

    We then start the client on the same host, specifying the server's IP address of 127.0.0.1 (the loopback address). We could have also specified the server's normal (nonloopback) IP address.

    linux % tcpcli01 127.0.0.1

    The client calls socket and connect, the latter causing TCP's three-way handshake to take place. When the three-way handshake completes, connect returns in the client and accept returns in the server. The connection is established. The following steps then take place:




    1. The client calls str_cli, which will block in the call to fgets, because we have not typed a line of input yet.



    2. When accept returns in the server, it calls fork and the child calls str_echo. This function calls readline, which calls read, which blocks while waiting for a line to be sent from the client.



    3. The server parent, on the other hand, calls accept again, and blocks while waiting for the next client connection.


    We have three processes, and all three are asleep (blocked): client, server parent, and server child.


    When the three-way handshake completes, we purposely list the client step first, and then the server steps. The reason: connect returns when the second segment of the handshake is received by the client, but accept does not return until the third segment of the handshake is received by the server, one-half of the RTT after connect returns.


    We purposely run the client and server on the same host because this is the easiest way to experiment with client/server applications. Since we are running the client and server on the same host, netstat now shows two additional lines of output, corresponding to the TCP connection:

    linux % netstat -a
    Active Internet connections (servers and established)
    Proto Recv-Q Send-Q Local Address Foreign Address State
    tcp 0 0 localhost:9877 localhost:42758 ESTABLISHED
    tcp 0 0 localhost:42758 localhost:9877 ESTABLISHED
    tcp 0 0 *:9877 *:* LISTEN

    The first of the ESTABLISHED lines corresponds to the server child's socket, since the local port is 9877. The second of the ESTABLISHED lines is the client's socket, since the local port is 42758. If we were running the client and server on different hosts, the client host would display only the client's socket, and the server host would display only the two server sockets.
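The same local and foreign port pairing that netstat prints can be read from inside a program with getsockname and getpeername. The sketch below builds one loopback connection within a single process and records the ports on each end. It binds port 0 so the kernel picks a free port instead of 9877, and the names conn_ports, port_of, and loopback_ports are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical record of the ports on each end of one connection. */
struct conn_ports {
    unsigned short listen_port;
    unsigned short server_local, server_foreign;
    unsigned short client_local, client_foreign;
};

/* Fetch the local (peer == 0) or foreign (peer != 0) port of fd. */
static int port_of(int fd, int peer, unsigned short *port)
{
    struct sockaddr_in a;
    socklen_t len = sizeof(a);
    int rc = peer ? getpeername(fd, (struct sockaddr *)&a, &len)
                  : getsockname(fd, (struct sockaddr *)&a, &len);

    if (rc == 0)
        *port = ntohs(a.sin_port);
    return rc;
}

int loopback_ports(struct conn_ports *out)
{
    struct sockaddr_in addr;
    int listenfd = -1, clifd = -1, connfd = -1, rc = -1;

    assert(out != NULL);
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0);           /* let the kernel choose a port */

    if ((listenfd = socket(AF_INET, SOCK_STREAM, 0)) < 0) goto done;
    if (bind(listenfd, (struct sockaddr *)&addr, sizeof(addr)) < 0) goto done;
    if (listen(listenfd, 1) < 0) goto done;
    if (port_of(listenfd, 0, &out->listen_port) < 0) goto done;

    /* The handshake completes in the kernel, so connect() can return
     * before accept() is called in this single process. */
    addr.sin_port = htons(out->listen_port);
    if ((clifd = socket(AF_INET, SOCK_STREAM, 0)) < 0) goto done;
    if (connect(clifd, (struct sockaddr *)&addr, sizeof(addr)) < 0) goto done;
    if ((connfd = accept(listenfd, NULL, NULL)) < 0) goto done;

    if (port_of(connfd, 0, &out->server_local) < 0) goto done;
    if (port_of(connfd, 1, &out->server_foreign) < 0) goto done;
    if (port_of(clifd, 0, &out->client_local) < 0) goto done;
    if (port_of(clifd, 1, &out->client_foreign) < 0) goto done;
    rc = 0;

done:
    if (connfd >= 0) close(connfd);
    if (clifd >= 0) close(clifd);
    if (listenfd >= 0) close(listenfd);
    return rc;
}
```

The connected server socket shares the listening socket's local port, and each end's foreign port is the other end's local port, which is the pattern in the two ESTABLISHED lines.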

    We can also use the ps command to check the status and relationship of these processes.

    linux % ps -t pts/6 -o pid,ppid,tty,stat,args,wchan
    PID PPID TT STAT COMMAND WCHAN
    22038 22036 pts/6 S -bash wait4
    17870 22038 pts/6 S ./tcpserv01 wait_for_connect
    19315 17870 pts/6 S ./tcpserv01 tcp_data_wait
    19314 22038 pts/6 S ./tcpcli01 127.0 read_chan

    (We have used very specific arguments to ps to only show us the information that pertains to this discussion.) In this output, we ran the client and server from the same window (pts/6, which stands for pseudo-terminal number 6). The PID and PPID columns show the parent and child relationships. We can tell that the first tcpserv01 line is the parent and the second tcpserv01 line is the child since the PPID of the child is the parent's PID. Also, the PPID of the parent is the shell (bash).

    The STAT column for all three of our network processes is "S," meaning the process is sleeping (waiting for something). When a process is asleep, the WCHAN column specifies the condition. Linux prints wait_for_connect when a process is blocked in either accept or connect, tcp_data_wait when a process is blocked on socket input or output, or read_chan when a process is blocked on terminal I/O. The WCHAN values for our three network processes therefore make sense.

    TCP Echo Client: str_cli Function

    This function, shown below, handles the client processing loop: It reads a line of text from standard input, writes it to the server, reads back the server's echo of the line, and outputs the echoed line to standard output.

    str_cli function: client processing loop.

    1 #include    "unp.h"

    2 void
    3 str_cli(FILE *fp, int sockfd)
    4 {
    5     char    sendline[MAXLINE], recvline[MAXLINE];

    6     while (Fgets(sendline, MAXLINE, fp) != NULL) {

    7         Writen(sockfd, sendline, strlen (sendline));

    8         if (Readline(sockfd, recvline, MAXLINE) == 0)
    9             err_quit("str_cli: server terminated prematurely");

    10         Fputs(recvline, stdout);
    11     }
    12 }

    Read a line, write to server

    6–7 fgets reads a line of text and writen sends the line to the server.

    Read echoed line from server, write to standard output

    8–10 readline reads the line echoed back from the server and fputs writes it to standard output.

    Return to main

    11–12 The loop terminates when fgets returns a null pointer, which occurs when it encounters either an end-of-file (EOF) or an error. Our Fgets wrapper function checks for an error and aborts if one occurs, so Fgets returns a null pointer only when an end-of-file is encountered.
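A wrapper with the behavior described for Fgets might look like the following sketch. The name Fgets_sketch is made up, and perror plus exit stand in for the err_sys function from unp.h, which is not shown in this post.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Like fgets(), but a read error terminates the program, so a null
 * return always means end-of-file. */
char *Fgets_sketch(char *ptr, int n, FILE *stream)
{
    char *rptr;

    assert(ptr != NULL && n > 0 && stream != NULL);

    rptr = fgets(ptr, n, stream);
    if (rptr == NULL && ferror(stream)) {
        perror("fgets error");  /* stand-in for err_sys() */
        exit(1);
    }
    return rptr;                /* NULL here can only mean EOF */
}
```

With such a wrapper, a loop of the form while (Fgets_sketch(line, size, fp) != NULL) needs no separate error check after it ends.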

    TCP Echo Server: str_echo Function

    The function str_echo performs the server processing for each client: It reads data from the client and echoes it back to the client.

    1 #include    "unp.h"

    2 void
    3 str_echo(int sockfd)
    4 {
    5     ssize_t n;
    6     char    buf[MAXLINE];

    7   again:
    8     while ( (n = read(sockfd, buf, MAXLINE)) > 0)
    9         Writen(sockfd, buf, n);

    10     if (n < 0 && errno == EINTR)
    11         goto again;
    12     else if (n < 0)
    13         err_sys("str_echo: read error");
    14 }

    Read a buffer and echo the buffer

    8–9 read reads data from the socket and the line is echoed back to the client by Writen. If the client closes the connection (the normal scenario), the receipt of the client's FIN causes the child's read to return 0. This causes the str_echo function to return, which terminates the child.

    TCP Echo Server

    We show the concurrent server program.

    1 #include      "unp.h"

    2 int
    3 main(int argc, char **argv)
    4 {
    5     int     listenfd, connfd;
    6     pid_t   childpid;
    7     socklen_t clilen;
    8     struct sockaddr_in cliaddr, servaddr;

    9     listenfd = Socket (AF_INET, SOCK_STREAM, 0);

    10     bzero(&servaddr, sizeof(servaddr));
    11     servaddr.sin_family = AF_INET;
    12     servaddr.sin_addr.s_addr = htonl (INADDR_ANY);
    13     servaddr.sin_port = htons (SERV_PORT);

    14     Bind(listenfd, (SA *) &servaddr, sizeof(servaddr));

    15     Listen(listenfd, LISTENQ);

    16     for ( ; ; )  {
    17         clilen = sizeof(cliaddr);
    18         connfd = Accept(listenfd, (SA *) &cliaddr, &clilen);

    19         if ( (childpid = Fork()) == 0) { /* child process */
    20             Close(listenfd);    /* close listening socket */
    21             str_echo(connfd);   /* process the request */
    22             exit (0);
    23         }
    24         Close(connfd);          /* parent closes connected socket */
    25     }
    26 }

    pid_t Fork(void)
    {
        pid_t pid;

        if ((pid = fork()) == -1)
            err_sys("fork error");
        return(pid);
    }

    Create socket, bind server's well-known port

    9–15 A TCP socket is created. An Internet socket address structure is filled in with the wildcard address (INADDR_ANY) and the server's well-known port (SERV_PORT, which is defined as 9877 in our unp.h header). Binding the wildcard address tells the system that we will accept a connection destined for any local interface, in case the system is multihomed. The port number should be greater than 1023 (we do not need a reserved port), greater than 5000 (to avoid conflict with the ephemeral ports allocated by many Berkeley-derived implementations), less than 49152 (to avoid conflict with the "correct" range of ephemeral ports), and it should not conflict with any registered port. The socket is converted into a listening socket by listen.

    Wait for client connection to complete

    17–18 The server blocks in the call to accept, waiting for a client connection to complete.

    Concurrent server

    19–24 For each client, fork spawns a child, and the child handles the new client. The child closes the listening socket and the parent closes the connected socket. The child then calls str_echo to handle the client.

    Sunday, February 22, 2009

    readn, writen, and readline Functions

    Stream sockets (e.g., TCP sockets) exhibit a behavior with the read and write functions that differs from normal file I/O. A read or write on a stream socket might input or output fewer bytes than requested, but this is not an error condition. The reason is that buffer limits might be reached for the socket in the kernel. All that is required to input or output the remaining bytes is for the caller to invoke the read or write function again. Some versions of Unix also exhibit this behavior when writing more than 4,096 bytes to a pipe. This scenario is always a possibility on a stream socket with read, but is normally seen with write only if the socket is nonblocking. Nevertheless, we always call our writen function instead of write, in case the implementation returns a short count.

    We provide the following three functions that we use whenever we read from or write to a stream socket:

    #include "unp.h"

    ssize_t readn(int filedes, void *buff, size_t nbytes);

    ssize_t writen(int filedes, const void *buff, size_t nbytes);

    ssize_t readline(int filedes, void *buff, size_t maxlen);

    All return: number of bytes read or written, –1 on error

    readn function: Read n bytes from a descriptor.

    lib/readn.c

    #include     "unp.h"

    ssize_t                         /* Read "n" bytes from a descriptor. */
    readn(int fd, void *vptr, size_t n)
    {
        size_t  nleft;
        ssize_t nread;
        char   *ptr;

        ptr = vptr;
        nleft = n;
        while (nleft > 0) {
            if ( (nread = read(fd, ptr, nleft)) < 0) {
                if (errno == EINTR)
                    nread = 0;      /* and call read() again */
                else
                    return (-1);
            } else if (nread == 0)
                break;              /* EOF */

            nleft -= nread;
            ptr += nread;
        }
        return (n - nleft);         /* return >= 0 */
    }


    writen function: Write n bytes to a descriptor.



    lib/writen.c



    #include    "unp.h"

    ssize_t                         /* Write "n" bytes to a descriptor. */
    writen(int fd, const void *vptr, size_t n)
    {
        size_t      nleft;
        ssize_t     nwritten;
        const char *ptr;

        ptr = vptr;
        nleft = n;
        while (nleft > 0) {
            if ( (nwritten = write(fd, ptr, nleft)) <= 0) {
                if (nwritten < 0 && errno == EINTR)
                    nwritten = 0;   /* and call write() again */
                else
                    return (-1);    /* error */
            }

            nleft -= nwritten;
            ptr += nwritten;
        }
        return (n);
    }


    readline function: Read a text line from a descriptor, one byte at a time.



    test/readline1.c



    #include     "unp.h"

    /* PAINFULLY SLOW VERSION -- example only */
    ssize_t
    readline(int fd, void *vptr, size_t maxlen)
    {
        ssize_t n, rc;
        char    c, *ptr;

        ptr = vptr;
        for (n = 1; n < maxlen; n++) {
          again:
            if ( (rc = read(fd, &c, 1)) == 1) {
                *ptr++ = c;
                if (c == '\n')
                    break;          /* newline is stored, like fgets() */
            } else if (rc == 0) {
                *ptr = 0;
                return (n - 1);     /* EOF, n - 1 bytes were read */
            } else {
                if (errno == EINTR)
                    goto again;
                return (-1);        /* error, errno set by read() */
            }
        }

        *ptr = 0;                   /* null terminate like fgets() */
        return (n);
    }


    Our three functions look for the error EINTR (the system call was interrupted by a caught signal) and continue reading or writing if the error occurs. We handle the error here, instead of forcing the caller to call readn or writen again, since the purpose of these three functions is to prevent the caller from having to handle a short count.



    The MSG_WAITALL flag can be used with the recv function to replace the need for a separate readn function.
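
    As a rough sketch of that idea: the helper below (recv_full is a hypothetical name, not one of the book's functions) asks recv to wait for the full count. Even with MSG_WAITALL, recv may return fewer bytes when a signal is caught, the peer closes the connection, or an error is pending, so the sketch still keeps a small retry loop.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: read exactly n bytes from a socket with MSG_WAITALL.
 * Returns the number of bytes read (less than n only at EOF), or -1 on error. */
ssize_t
recv_full(int sockfd, void *buf, size_t n)
{
    char   *ptr = buf;
    size_t  nleft = n;

    while (nleft > 0) {
        ssize_t nread = recv(sockfd, ptr, nleft, MSG_WAITALL);

        if (nread < 0) {
            if (errno == EINTR)
                continue;           /* interrupted before any data: retry */
            return (-1);            /* error, errno set by recv() */
        }
        if (nread == 0)
            break;                  /* EOF */
        nleft -= (size_t) nread;    /* short count is still possible */
        ptr += nread;
    }
    return ((ssize_t) (n - nleft));
}
```

    Unlike readn, this only works on sockets, since recv cannot be applied to other kinds of descriptors.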



    Note that our readline function calls the system's read function once for every byte of data. This is very inefficient, and why we've commented the code to state it is "PAINFULLY SLOW." When faced with the desire to read lines from a socket, it is quite tempting to turn to the standard I/O library (referred to as "stdio") but it can be a dangerous path. The same stdio buffering that solves this performance problem creates numerous logistical problems that can lead to well-hidden bugs in your application. The reason is that the state of the stdio buffers is not exposed. To explain this further, consider a line-based protocol between a client and a server, where several clients and servers using that protocol may be implemented over time (really quite common; for example, there are many Web browsers and Web servers independently written to the HTTP specification). Good "defensive programming" techniques require these programs to not only expect their counterparts to follow the network protocol, but to check for unexpected network traffic as well. Such protocol violations should be reported as errors so that bugs are noticed and fixed (and malicious attempts are detected as well), and also so that network applications can recover from problem traffic and continue working if possible. Using stdio to buffer data for performance flies in the face of these goals since the application has no way to tell if unexpected data is being held in the stdio buffers at any given time.



    There are many line-based network protocols such as SMTP, HTTP, the FTP control connection protocol, and finger. So, the desire to operate on lines comes up again and again. But our advice is to think in terms of buffers and not lines. Write your code to read buffers of data, and if a line is expected, check the buffer to see if it contains that line.
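
    A minimal sketch of the "check the buffer" step, assuming the caller has already read some bytes into its own buffer (line_length is a hypothetical helper name). It reports how long the first complete line is, or 0 when the newline has not arrived yet.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the length of the first complete line in
 * buf[0..len), including its '\n', or 0 if no complete line is present. */
size_t
line_length(const char *buf, size_t len)
{
    const char *nl = memchr(buf, '\n', len);

    return (nl == NULL ? 0 : (size_t) (nl - buf) + 1);
}
```

    When the result is 0, the caller simply reads more data into the buffer. Any bytes that follow the line stay in the caller's buffer, where the application can inspect them.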



    The next listing shows a faster version of the readline function, which uses its own buffering rather than stdio buffering. Most importantly, the state of readline's internal buffer is exposed, so callers have visibility into exactly what has been received. Even with this feature, readline can be problematic. System functions like select still won't know about readline's internal buffer, so a carelessly written program could easily find itself waiting in select for data already received and stored in readline's buffers. For that matter, mixing readn and readline calls will not work as expected unless readn is modified to check the internal buffer as well.



    Better version of readline function.



    lib/readline.c



     1 #include    "unp.h"

     2 static int read_cnt;
     3 static char *read_ptr;
     4 static char read_buf[MAXLINE];

     5 static ssize_t
     6 my_read(int fd, char *ptr)
     7 {

     8     if (read_cnt <= 0) {
     9       again:
    10         if ( (read_cnt = read(fd, read_buf, sizeof(read_buf))) < 0) {
    11             if (errno == EINTR)
    12                 goto again;
    13             return (-1);
    14         } else if (read_cnt == 0)
    15             return (0);
    16         read_ptr = read_buf;
    17     }

    18     read_cnt--;
    19     *ptr = *read_ptr++;
    20     return (1);
    21 }

    22 ssize_t
    23 readline(int fd, void *vptr, size_t maxlen)
    24 {
    25     ssize_t n, rc;
    26     char    c, *ptr;

    27     ptr = vptr;
    28     for (n = 1; n < maxlen; n++) {
    29         if ( (rc = my_read(fd, &c)) == 1) {
    30             *ptr++ = c;
    31             if (c == '\n')
    32                 break;          /* newline is stored, like fgets() */
    33         } else if (rc == 0) {
    34             *ptr = 0;
    35             return (n - 1);     /* EOF, n - 1 bytes were read */
    36         } else
    37             return (-1);        /* error, errno set by read() */
    38     }

    39     *ptr = 0;                   /* null terminate like fgets() */
    40     return (n);
    41 }

    42 ssize_t
    43 readlinebuf(void **vptrptr)
    44 {
    45     if (read_cnt)
    46         *vptrptr = read_ptr;
    47     return (read_cnt);
    48 }


    2–21 The internal function my_read reads up to MAXLINE characters at a time and then returns them, one at a time.



    29 The only change to the readline function itself is to call my_read instead of read.



    42–48 A new function, readlinebuf, exposes the internal buffer state so that callers can check and see if more data was received beyond a single line.




    Unfortunately, by using static variables in readline.c to maintain the state information across successive calls, the functions are not re-entrant or thread-safe.
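
    One way around this, sketched here under the assumption that the caller can keep one state object per descriptor, is to move the static variables into a structure that is passed in explicitly. The names rl_state, rl_getc, readline_r, and RL_BUFSIZE are hypothetical and are not part of the book's library.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

#define RL_BUFSIZE 4096         /* assumed buffer size for this sketch */

/* Hypothetical per-descriptor state, owned by the caller.
 * Zero-initialize it before the first call. */
struct rl_state {
    ssize_t cnt;                /* unread bytes left in buf */
    char   *ptr;                /* next unread byte */
    char    buf[RL_BUFSIZE];
};

/* Fetch one byte, refilling the caller's buffer when it is empty. */
static ssize_t
rl_getc(struct rl_state *st, int fd, char *c)
{
    while (st->cnt <= 0) {
        st->cnt = read(fd, st->buf, sizeof(st->buf));
        if (st->cnt < 0) {
            if (errno == EINTR)
                continue;       /* interrupted: call read() again */
            return (-1);
        }
        if (st->cnt == 0)
            return (0);         /* EOF */
        st->ptr = st->buf;
    }
    st->cnt--;
    *c = *st->ptr++;
    return (1);
}

/* Store up to maxlen - 1 bytes plus a terminating null.
 * Returns the number of bytes stored, 0 at EOF, -1 on error. */
ssize_t
readline_r(struct rl_state *st, int fd, void *vptr, size_t maxlen)
{
    char   *ptr = vptr;
    char    c;
    size_t  n = 0;
    ssize_t rc;

    if (maxlen == 0)
        return (0);
    while (n < maxlen - 1) {
        if ((rc = rl_getc(st, fd, &c)) < 0)
            return (-1);
        if (rc == 0)
            break;              /* EOF */
        ptr[n++] = c;
        if (c == '\n')
            break;              /* newline is stored, like fgets() */
    }
    ptr[n] = '\0';
    return ((ssize_t) n);
}
```

    Because all state lives in the caller's structure, two threads can read from different descriptors at the same time, and the unread bytes are visible through st->ptr and st->cnt.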




    TCP sockets provide a byte stream to an application: There are no record markers. The return value from a read can be less than what we asked for, but this does not indicate an error. To help read and write a byte stream, we developed three functions, readn, writen, and readline, which we will use throughout the text. However, network programs should be written to act on buffers rather than lines.

    sock_ntop and Related Functions

    A basic problem with inet_ntop is that it requires the caller to pass a pointer to a binary address. This address is normally contained in a socket address structure, requiring the caller to know the format of the structure and the address family. That is, to use it, we must write code of the form

    struct sockaddr_in   addr;

    inet_ntop(AF_INET, &addr.sin_addr, str, sizeof(str));

    for IPv4, or

    struct sockaddr_in6   addr6;

    inet_ntop(AF_INET6, &addr6.sin6_addr, str, sizeof(str));

    for IPv6. This makes our code protocol-dependent.

    To solve this, we will write our own function named sock_ntop that takes a pointer to a socket address structure, looks inside the structure, and calls the appropriate function to return the presentation format of the address.

    #include "unp.h"

    char *sock_ntop(const struct sockaddr *sockaddr, socklen_t addrlen);

    Returns: non-null pointer if OK, NULL on error


    This is the notation we use for functions of our own (nonstandard system functions) that we use throughout the book: the box around the function prototype and return value is dashed. The header that is included at the beginning is usually our unp.h header.


    sockaddr points to a socket address structure whose length is addrlen. The function uses its own static buffer to hold the result and a pointer to this buffer is the return value.


    Notice that using static storage for the result prevents the function from being re-entrant or thread-safe. We made this design decision for this function to allow us to easily call it from the simple examples in the book.


    The presentation format is the dotted-decimal form of an IPv4 address or the hex string form of an IPv6 address surrounded by brackets, followed by a terminator (we use a colon, similar to URL syntax), followed by the decimal port number, followed by a null character. Hence, the buffer size must be at least INET_ADDRSTRLEN plus 6 bytes for IPv4 (16 + 6 = 22), or INET6_ADDRSTRLEN plus 8 bytes for IPv6 (46 + 8 = 54).

    char *
    sock_ntop(const struct sockaddr *sa, socklen_t salen)
    {
        char        portstr[8];
        static char str[128];       /* Unix domain is largest */

        switch (sa->sa_family) {
        case AF_INET:{
                struct sockaddr_in *sin = (struct sockaddr_in *) sa;

                if (inet_ntop(AF_INET, &sin->sin_addr, str, sizeof(str)) == NULL)
                    return (NULL);
                if (ntohs(sin->sin_port) != 0) {
                    snprintf(portstr, sizeof(portstr), ":%d",
                             ntohs(sin->sin_port));
                    strcat(str, portstr);
                }
                return (str);
            }
        /* the remaining address families are not shown in this excerpt */
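
    For comparison, here is a minimal sketch of a variant that avoids the static buffer by letting the caller supply one. It is an illustration only, not the book's sock_ntop: the name sock_ntop_r and its signature are assumptions, and it handles just AF_INET and AF_INET6, using the bracketed IPv6 form described above.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

/* Hypothetical re-entrant variant: the caller supplies the buffer.
 * Returns out on success, NULL on error or if out is too small. */
char *
sock_ntop_r(const struct sockaddr *sa, char *out, size_t outlen)
{
    char        host[INET6_ADDRSTRLEN];
    const char *lbr = "", *rbr = "";    /* brackets, used for IPv6 only */
    unsigned    port;
    int         n;

    if (sa->sa_family == AF_INET) {
        const struct sockaddr_in *sin = (const struct sockaddr_in *) sa;

        if (inet_ntop(AF_INET, &sin->sin_addr, host, sizeof(host)) == NULL)
            return (NULL);
        port = ntohs(sin->sin_port);
    } else if (sa->sa_family == AF_INET6) {
        const struct sockaddr_in6 *sin6 = (const struct sockaddr_in6 *) sa;

        if (inet_ntop(AF_INET6, &sin6->sin6_addr, host, sizeof(host)) == NULL)
            return (NULL);
        port = ntohs(sin6->sin6_port);
        lbr = "[";
        rbr = "]";
    } else
        return (NULL);                  /* family not handled in this sketch */

    if (port == 0)                      /* no port: address only */
        n = snprintf(out, outlen, "%s", host);
    else
        n = snprintf(out, outlen, "%s%s%s:%u", lbr, host, rbr, port);
    return ((n < 0 || (size_t) n >= outlen) ? NULL : out);
}
```

    The trade-off is that every caller must now declare a buffer, which is exactly the inconvenience the static buffer was chosen to avoid in short examples.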

    There are a few other functions that we define to operate on socket address structures, and these will simplify the portability of our code between IPv4 and IPv6.

    #include "unp.h"

    int sock_bind_wild(int sockfd, int family);

    Returns: 0 if OK, -1 on error

    int sock_cmp_addr(const struct sockaddr *sockaddr1, const struct sockaddr *sockaddr2, socklen_t addrlen);

    Returns: 0 if addresses are of the same family and addresses are equal, else nonzero

    int sock_cmp_port(const struct sockaddr *sockaddr1, const struct sockaddr *sockaddr2, socklen_t addrlen);

    Returns: 0 if addresses are of the same family and ports are equal, else nonzero

    int sock_get_port(const struct sockaddr *sockaddr, socklen_t addrlen);

    Returns: non-negative port number for IPv4 or IPv6 address, else -1

    char *sock_ntop_host(const struct sockaddr *sockaddr, socklen_t addrlen);

    Returns: non-null pointer if OK, NULL on error

    void sock_set_addr(struct sockaddr *sockaddr, socklen_t addrlen, void *ptr);

    void sock_set_port(struct sockaddr *sockaddr, socklen_t addrlen, int port);

    void sock_set_wild(struct sockaddr *sockaddr, socklen_t addrlen);

    sock_bind_wild binds the wildcard address and an ephemeral port to a socket. sock_cmp_addr compares the address portion of two socket address structures, and sock_cmp_port compares the port number of two socket address structures. sock_get_port returns just the port number, and sock_ntop_host converts just the host portion of a socket address structure to presentation format (not the port number). sock_set_addr sets just the address portion of a socket address structure to the value pointed to by ptr, and sock_set_port sets just the port number of a socket address structure. sock_set_wild sets the address portion of a socket address structure to the wildcard. As with all the functions in the text, we provide a wrapper function whose name begins with "S" for all of these functions that return values other than void and normally call the wrapper function from our programs. We do not show the source code for all these functions, but it is freely available (see the Preface).

    inet_pton and inet_ntop Functions

    These two functions are new with IPv6 and work with both IPv4 and IPv6 addresses. We use these two functions throughout the text. The letters "p" and "n" stand for presentation and numeric. The presentation format for an address is often an ASCII string and the numeric format is the binary value that goes into a socket address structure.

    #include <arpa/inet.h>

    int inet_pton(int family, const char *strptr, void *addrptr);

    Returns: 1 if OK, 0 if input not a valid presentation format, -1 on error

    const char *inet_ntop(int family, const void *addrptr, char *strptr, size_t len);

    Returns: pointer to result if OK, NULL on error

    The family argument for both functions is either AF_INET or AF_INET6. If family is not supported, both functions return an error with errno set to EAFNOSUPPORT.

    The first function tries to convert the string pointed to by strptr, storing the binary result through the pointer addrptr. If successful, the return value is 1. If the input string is not a valid presentation format for the specified family, 0 is returned.

    inet_ntop does the reverse conversion, from numeric (addrptr) to presentation (strptr). The len argument is the size of the destination, to prevent the function from overflowing the caller's buffer. To help specify this size, the following two constants are defined by including the <netinet/in.h> header:

    #define INET_ADDRSTRLEN       16       /* for IPv4 dotted-decimal */
    #define INET6_ADDRSTRLEN      46       /* for IPv6 hex string */


    If len is too small to hold the resulting presentation format, including the terminating null, a null pointer is returned and errno is set to ENOSPC.

    The strptr argument to inet_ntop cannot be a null pointer. The caller must allocate memory for the destination and specify its size. On success, this pointer is the return value of the function.

    [Figure 3.1: Summary of address conversion functions]


    Even if your system does not yet include support for IPv6, you can start using these newer functions by replacing calls of the form

    foo.sin_addr.s_addr = inet_addr(cp);

    with

    inet_pton(AF_INET, cp, &foo.sin_addr);

    and replacing calls of the form

    ptr = inet_ntoa(foo.sin_addr);

    with

    char str[INET_ADDRSTRLEN];
    ptr = inet_ntop(AF_INET, &foo.sin_addr, str, sizeof(str));

    Simple version of inet_pton that supports only IPv4


    int
    inet_pton(int family, const char *strptr, void *addrptr)
    {
        if (family == AF_INET) {
            struct in_addr in_val;

            if (inet_aton(strptr, &in_val)) {
                memcpy(addrptr, &in_val, sizeof(struct in_addr));
                return (1);
            }
            return (0);
        }
        errno = EAFNOSUPPORT;
        return (-1);
    }

    Simple version of inet_ntop that supports only IPv4


    const char *
    inet_ntop(int family, const void *addrptr, char *strptr, size_t len)
    {
        const u_char *p = (const u_char *) addrptr;

        if (family == AF_INET) {
            char    temp[INET_ADDRSTRLEN];

            snprintf(temp, sizeof(temp), "%d.%d.%d.%d", p[0], p[1], p[2], p[3]);
            if (strlen(temp) >= len) {
                errno = ENOSPC;
                return (NULL);
            }
            strcpy(strptr, temp);
            return (strptr);
        }
        errno = EAFNOSUPPORT;
        return (NULL);
    }

    inet_aton, inet_addr, and inet_ntoa Functions

    We will describe two groups of address conversion functions in this section and the next. They convert Internet addresses between ASCII strings (what humans prefer to use) and network byte ordered binary values (values that are stored in socket address structures).

    1. inet_aton, inet_ntoa, and inet_addr convert an IPv4 address between a dotted-decimal string (e.g., "206.168.112.96") and its 32-bit network byte ordered binary value. You will probably encounter these functions in lots of existing code.

    2. The newer functions, inet_pton and inet_ntop, handle both IPv4 and IPv6 addresses. We describe these two functions in the next section and use them throughout the text.

    #include <arpa/inet.h>

    int inet_aton(const char *strptr, struct in_addr *addrptr);

    Returns: 1 if string was valid, 0 on error

    in_addr_t inet_addr(const char *strptr);

    Returns: 32-bit binary network byte ordered IPv4 address; INADDR_NONE if error

    char *inet_ntoa(struct in_addr inaddr);

    Returns: pointer to dotted-decimal string

    The first of these, inet_aton, converts the C character string pointed to by strptr into its 32-bit binary network byte ordered value, which is stored through the pointer addrptr. If successful, 1 is returned; otherwise, 0 is returned.

    An undocumented feature of inet_aton is that if addrptr is a null pointer, the function still performs its validation of the input string but does not store any result.

    inet_addr does the same conversion, returning the 32-bit binary network byte ordered value as the return value. The problem with this function is that all 2^32 possible binary values are valid IP addresses (0.0.0.0 through 255.255.255.255), but the function returns the constant INADDR_NONE (typically 32 one-bits) on an error. This means the dotted-decimal string 255.255.255.255 (the IPv4 limited broadcast address) cannot be handled by this function since its binary value appears to indicate failure of the function.

    A potential problem with inet_addr is that some man pages state that it returns –1 on an error, instead of INADDR_NONE. This can lead to problems, depending on the C compiler, when comparing the return value of the function (an unsigned value) to a negative constant.

    Today, inet_addr is deprecated and any new code should use inet_aton instead. Better still is to use the newer functions described in the next section, which handle both IPv4 and IPv6.
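
    The ambiguity is easy to avoid once the status and the address travel separately. A minimal sketch, assuming a hypothetical wrapper named parse_ipv4 built on inet_pton:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper: parse dotted-decimal IPv4 text.
 * Returns 1 and stores the address through out if text is valid, else 0.
 * The status is separate from the address, so 255.255.255.255 is not
 * mistaken for an error the way it is with inet_addr. */
int
parse_ipv4(const char *text, struct in_addr *out)
{
    return (inet_pton(AF_INET, text, out) == 1);
}
```

    With inet_addr, both "255.255.255.255" and an unparsable string produce INADDR_NONE, whereas parse_ipv4 returns 1 for the first and 0 for the second.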

    The inet_ntoa function converts a 32-bit binary network byte ordered IPv4 address into its corresponding dotted-decimal string. The string pointed to by the return value of the function resides in static memory. This means the function is not reentrant. Finally, notice that this function takes a structure as its argument, not a pointer to a structure.

    Functions that take actual structures as arguments are rare. It is more common to pass a pointer to the structure.

    Byte Manipulation Functions

    There are two groups of functions that operate on multibyte fields, without interpreting the data, and without assuming that the data is a null-terminated C string. We need these types of functions when dealing with socket address structures because we need to manipulate fields such as IP addresses, which can contain bytes of 0, but are not C character strings. The functions beginning with str (for string), defined by including the <string.h> header, deal with null-terminated C character strings.

    The first group of functions, whose names begin with b (for byte), are from 4.2BSD and are still provided by almost any system that supports the socket functions. The second group of functions, whose names begin with mem (for memory), are from the ANSI C standard and are provided with any system that supports an ANSI C library.

    We first show the Berkeley-derived functions, although the only one we use in this text is bzero. (We use it because it has only two arguments and is easier to remember than the three-argument memset function.) You may encounter the other two functions, bcopy and bcmp, in existing applications.

    #include <strings.h>

    void bzero(void *dest, size_t nbytes);

    void bcopy(const void *src, void *dest, size_t nbytes);

    int bcmp(const void *ptr1, const void *ptr2, size_t nbytes);

    Returns: 0 if equal, nonzero if unequal

    This is our first encounter with the ANSI C const qualifier. In the three uses here, it indicates that what is pointed to by the pointer with this qualification, src, ptr1, and ptr2, is not modified by the function. Worded another way, the memory pointed to by the const pointer is read but not modified by the function.

    bzero sets the specified number of bytes to 0 in the destination. We often use this function to initialize a socket address structure to 0. bcopy moves the specified number of bytes from the source to the destination. bcmp compares two arbitrary byte strings. The return value is zero if the two byte strings are identical; otherwise, it is nonzero.

    The following functions are the ANSI C functions:

    #include <string.h>

    void *memset(void *dest, int c, size_t len);

    void *memcpy(void *dest, const void *src, size_t nbytes);

    int memcmp(const void *ptr1, const void *ptr2, size_t nbytes);

    Returns: 0 if equal, <0 or >0 if unequal (see text)

    memset sets the specified number of bytes to the value c in the destination. memcpy is similar to bcopy, but the order of the two pointer arguments is swapped. bcopy correctly handles overlapping fields, while the behavior of memcpy is undefined if the source and destination overlap. The ANSI C memmove function must be used when the fields overlap.

    One way to remember the order of the two pointers for memcpy is to remember that they are written in the same left-to-right order as an assignment statement in C:

    dest = src;

    One way to remember the order of the final two arguments to memset is to realize that all of the ANSI C memXXX functions require a length argument, and it is always the final argument.

    memcmp compares two arbitrary byte strings and returns 0 if they are identical. If not identical, the return value is either greater than 0 or less than 0, depending on whether the first unequal byte pointed to by ptr1 is greater than or less than the corresponding byte pointed to by ptr2. The comparison is done assuming the two unequal bytes are unsigned chars.
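
    Two short sketches of these rules follow; both helper names are hypothetical. The first slides data down within a single buffer, where source and destination overlap and memmove is therefore required. The second compares two 4-byte fields that may contain bytes of 0, which a str function would treat as the end of the string.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: discard the first k bytes of buf[0..len) by sliding
 * the remainder down. Source and destination overlap, so memmove (not
 * memcpy) is used. Returns the number of bytes that remain. */
size_t
consume_front(unsigned char *buf, size_t len, size_t k)
{
    if (k > len)
        k = len;
    memmove(buf, buf + k, len - k);
    return (len - k);
}

/* Hypothetical helper: compare two 4-byte fields such as IPv4 addresses.
 * memcmp examines all four bytes even when some of them are 0. */
int
ipv4_bytes_equal(const unsigned char a[4], const unsigned char b[4])
{
    return (memcmp(a, b, 4) == 0);
}
```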

    Byte Ordering Functions

    Consider a 16-bit integer that is made up of 2 bytes. There are two ways to store the two bytes in memory: with the low-order byte at the starting address, known as little-endian byte order, or with the high-order byte at the starting address, known as big-endian byte order.

    [Figure 3.9: Little-endian byte order and big-endian byte order for a 16-bit integer]

    In this figure, we show increasing memory addresses going from right to left in the top, and from left to right in the bottom. We also show the most significant bit (MSB) as the leftmost bit of the 16-bit value and the least significant bit (LSB) as the rightmost bit.

    The terms "little-endian" and "big-endian" indicate which end of the multibyte value, the little end or the big end, is stored at the starting address of the value.

    Unfortunately, there is no standard between these two byte orderings and we encounter systems that use both formats. We refer to the byte ordering used by a given system as the host byte order.

    Program to determine host byte order

    #include     "unp.h"

    int
    main(int argc, char **argv)
    {
        union {
            short   s;
            char    c[sizeof(short)];
        } un;

        un.s = 0x0102;
        printf("%s: ", CPU_VENDOR_OS);
        if (sizeof(short) == 2) {
            if (un.c[0] == 1 && un.c[1] == 2)
                printf("big-endian\n");
            else if (un.c[0] == 2 && un.c[1] == 1)
                printf("little-endian\n");
            else
                printf("unknown\n");
        } else
            printf("sizeof(short) = %zu\n", sizeof(short));   /* sizeof yields a size_t, so %zu rather than %d */

        exit(0);
    }

    We store the two-byte value 0x0102 in the short integer and then look at the two consecutive bytes, c[0] and c[1], to determine the byte order.

    The string CPU_VENDOR_OS is determined by the GNU autoconf program when the software in this book is configured, and it identifies the CPU type, vendor, and OS release.

    freebsd4 % byteorder
    i386-unknown-freebsd4.8: little-endian

    macosx % byteorder
    powerpc-apple-darwin6.6: big-endian

    freebsd5 % byteorder
    sparc64-unknown-freebsd5.1: big-endian

    aix % byteorder
    powerpc-ibm-aix5.1.0.0: big-endian

    hpux % byteorder
    hppa1.1-hp-hpux11.11: big-endian

    linux % byteorder
    i586-pc-linux-gnu: little-endian

    solaris % byteorder
    sparc-sun-solaris2.9: big-endian

    We have talked about the byte ordering of a 16-bit integer; obviously, the same discussion applies to a 32-bit integer.

    There are currently a variety of systems that can change between little-endian and big-endian byte ordering, sometimes at system reset, sometimes at run-time.

    We must deal with these byte ordering differences as network programmers because networking protocols must specify a network byte order. For example, in a TCP segment, there is a 16-bit port number and a 32-bit IPv4 address. The sending protocol stack and the receiving protocol stack must agree on the order in which the bytes of these multibyte fields will be transmitted. The Internet protocols use big-endian (BE) byte ordering for these multibyte integers.

    In theory, an implementation could store the fields in a socket address structure in host byte order and then convert to and from the network byte order when moving the fields to and from the protocol headers, saving us from having to worry about this detail. But, both history and the POSIX specification say that certain fields in the socket address structures must be maintained in network byte order. Our concern is therefore converting between host byte order and network byte order. We use the following four functions to convert between these two byte orders.

    #include <netinet/in.h>

    uint16_t htons(uint16_t host16bitvalue);

    uint32_t htonl(uint32_t host32bitvalue);

    Both return: value in network byte order

    uint16_t ntohs(uint16_t net16bitvalue);

    uint32_t ntohl(uint32_t net32bitvalue);

    Both return: value in host byte order

    In the names of these functions, h stands for host, n stands for network, s stands for short, and l stands for long. The terms "short" and "long" are historical artifacts from the Digital VAX implementation of 4.2BSD. We should instead think of s as a 16-bit value (such as a TCP or UDP port number) and l as a 32-bit value (such as an IPv4 address). Indeed, on the 64-bit Digital Alpha, a long integer occupies 64 bits, yet the htonl and ntohl functions operate on 32-bit values.

    When using these functions, we do not care about the actual values (big-endian or little-endian) for the host byte order and the network byte order. What we must do is call the appropriate function to convert a given value between the host and network byte order. On those systems that have the same byte ordering as the Internet protocols (big-endian), these four functions are usually defined as null macros.
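
    A small sketch of what the conversion guarantees, using hypothetical helpers port_to_wire and port_from_wire: whatever the host byte order is, the bytes produced by htons appear in memory with the high-order byte first.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: store a host-order port number as the two bytes
 * that would appear on the wire (network byte order, high byte first). */
void
port_to_wire(uint16_t port, unsigned char wire[2])
{
    uint16_t net = htons(port);

    memcpy(wire, &net, sizeof(net));
}

/* Hypothetical helper: recover the host-order port from two wire bytes. */
uint16_t
port_from_wire(const unsigned char wire[2])
{
    uint16_t net;

    memcpy(&net, wire, sizeof(net));
    return (ntohs(net));
}
```

    For port 8080 (0x1F90), the wire bytes are 0x1F followed by 0x90 on both little-endian and big-endian hosts.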

    We have not yet defined the term "byte." We use the term to mean an 8-bit quantity since almost all current computer systems use 8-bit bytes. Most Internet standards use the term octet instead of byte to mean an 8-bit quantity. This started in the early days of TCP/IP because much of the early work was done on systems such as the DEC-10, which did not use 8-bit bytes.

    Another important convention in Internet standards is bit ordering. In many Internet standards, you will see "pictures" of packets that look similar to the following (this is the first 32 bits of the IPv4 header from RFC 791):

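     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |Version|  IHL  |Type of Service|          Total Length         |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+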

    This represents four bytes in the order in which they appear on the wire; the leftmost bit is the most significant. However, the numbering starts with zero assigned to the most significant bit. This is a notation that you should become familiar with to make it easier to read protocol definitions in RFCs.

    A common network programming error in the 1980s was to develop code on Sun workstations (big-endian Motorola 68000s) and forget to call any of these four functions. The code worked fine on these workstations, but would not work when ported to little-endian machines (such as VAXes).

    Value-Result Arguments

    We mentioned that when a socket address structure is passed to any socket function, it is always passed by reference. That is, a pointer to the structure is passed. The length of the structure is also passed as an argument. But the way in which the length is passed depends on which direction the structure is being passed: from the process to the kernel, or vice versa.

    1. Three functions, bind, connect, and sendto, pass a socket address structure from the process to the kernel. One argument to these three functions is the pointer to the socket address structure and another argument is the integer size of the structure, as in

    struct sockaddr_in serv;

    /* fill in serv{} */
    connect (sockfd, (SA *) &serv, sizeof(serv));

    Since the kernel is passed both the pointer and the size of what the pointer points to, it knows exactly how much data to copy from the process into the kernel.

    [Figure 3.7: Socket address structure passed from process to kernel]

    We will see in the next chapter that the datatype for the size of a socket address structure is actually socklen_t and not int, but the POSIX specification recommends that socklen_t be defined as uint32_t.

    2. Four functions, accept, recvfrom, getsockname, and getpeername, pass a socket address structure from the kernel to the process, the reverse direction from the previous scenario. Two of the arguments to these four functions are the pointer to the socket address structure along with a pointer to an integer containing the size of the structure, as in

    struct sockaddr_un cli;                 /* Unix domain */
    socklen_t len;

    len = sizeof(cli); /* len is a value */
    getpeername(unixfd, (SA *) &cli, &len); /* len may have changed */

    The size changes from an integer to a pointer to an integer because the size is both a value when the function is called (it tells the kernel the size of the structure so that the kernel does not write past the end of the structure when filling it in) and a result when the function returns (it tells the process how much information the kernel actually stored in the structure). This type of argument is called a value-result argument.


    [Figure 3.8: Socket address structure passed from kernel to process]


    We have been talking about socket address structures being passed between the process and the kernel. For an implementation such as 4.4BSD, where all the socket functions are system calls within the kernel, this is correct. But in some implementations, notably System V, socket functions are just library functions that execute as part of a normal user process. How these functions interface with the protocol stack in the kernel is an implementation detail that normally does not affect us. Nevertheless, for simplicity, we will continue to talk about these structures as being passed between the process and the kernel by functions such as bind and connect. (System V implementations do indeed pass socket address structures between processes and the kernel, but as part of STREAMS messages.)


    When using value-result arguments for the length of socket address structures, if the socket address structure is fixed-length, the value returned by the kernel will always be that fixed size: 16 for an IPv4 sockaddr_in and 28 for an IPv6 sockaddr_in6, for example. But with a variable-length socket address structure (e.g., a Unix domain sockaddr_un), the value returned can be less than the maximum size of the structure.
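
    The variable-length case can be observed with a Unix domain socket. In this sketch (local_addr_len is a hypothetical helper), the length starts as the full size of sockaddr_un and comes back as the number of bytes the kernel actually stored.

```c
#include <sys/socket.h>
#include <sys/un.h>

/* Hypothetical helper: fetch the local address of a Unix domain socket and
 * report how many bytes the kernel stored. Returns 0 if OK, -1 on error. */
int
local_addr_len(int fd, socklen_t *lenp)
{
    struct sockaddr_un addr;
    socklen_t len = sizeof(addr);       /* value: room available */

    if (getsockname(fd, (struct sockaddr *) &addr, &len) < 0)
        return (-1);
    *lenp = len;                        /* result: bytes actually stored */
    return (0);
}
```

    For an unbound socket, such as either end of a socketpair, the returned length is typically much smaller than sizeof(struct sockaddr_un), although the exact value depends on the system.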

    With network programming, the most common example of a value-result argument is the length of a returned socket address structure. But, we will encounter other value-result arguments in this text:



    • The middle three arguments for the select function



    • The length argument for the getsockopt function



    • The msg_namelen and msg_controllen members of the msghdr structure, when used with recvmsg



    • The ifc_len member of the ifconf structure



    • The first of the two length arguments for the sysctl function

    Socket Address Structures

    Most socket functions require a pointer to a socket address structure as an argument. Each supported protocol suite defines its own socket address structure. The names of these structures begin with sockaddr_ and end with a unique suffix for each protocol suite.

    IPv4 Socket Address Structure

    An IPv4 socket address structure, commonly called an "Internet socket address structure," is named sockaddr_in and is defined by including the <netinet/in.h> header.

    struct in_addr {
      in_addr_t       s_addr;         /* 32-bit IPv4 address */
                                      /* network byte ordered */
    };

    struct sockaddr_in {
      uint8_t         sin_len;        /* length of structure (16) */
      sa_family_t     sin_family;     /* AF_INET */
      in_port_t       sin_port;       /* 16-bit TCP or UDP port number */
                                      /* network byte ordered */
      struct in_addr  sin_addr;       /* 32-bit IPv4 address */
                                      /* network byte ordered */
      char            sin_zero[8];    /* unused */
    };
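
    Filling in this structure usually takes the same few steps. A minimal sketch, with fill_sockaddr_in as a hypothetical helper name: clear the whole structure, then set the family, the port in network byte order, and the address.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: fill an IPv4 socket address structure from a
 * dotted-decimal string and a host-order port number.
 * Returns 0 if OK, -1 if ip is not a valid IPv4 address. */
int
fill_sockaddr_in(struct sockaddr_in *sin, const char *ip, uint16_t port)
{
    memset(sin, 0, sizeof(*sin));   /* also clears sin_zero and, where it
                                       exists, sin_len */
    sin->sin_family = AF_INET;
    sin->sin_port = htons(port);    /* stored in network byte order */
    return (inet_pton(AF_INET, ip, &sin->sin_addr) == 1 ? 0 : -1);
}
```

    The sketch never touches sin_len directly, so it compiles the same way whether or not the implementation defines that member.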

    There are several points we need to make about socket address structures in general using this example:

    • The length member, sin_len, was added with 4.3BSD-Reno, when support for the OSI protocols was added. Before this release, the first member was sin_family, which was historically an unsigned short. (Where sin_len is present, it and sin_family together occupy those same 2 bytes.) Not all vendors support a length field for socket address structures and the POSIX specification does not require this member. The datatype that we show, uint8_t, is typical, and POSIX-compliant systems provide datatypes of this form.

    Having a length field simplifies the handling of variable-length socket address structures.

    • Even if the length field is present, we need never set it and need never examine it, unless we are dealing with routing sockets. It is used within the kernel by the routines that deal with socket address structures from various protocol families (e.g., the routing table code).

      The four socket functions that pass a socket address structure from the process to the kernel, bind, connect, sendto, and sendmsg, all go through the sockargs function in a Berkeley-derived implementation. This function copies the socket address structure from the process and explicitly sets its sin_len member to the size of the structure that was passed as an argument to these four functions. The five socket functions that pass a socket address structure from the kernel to the process, accept, recvfrom, recvmsg, getpeername, and getsockname, all set the sin_len member before returning to the process.

      Unfortunately, there is normally no simple compile-time test to determine whether an implementation defines a length field for its socket address structures. In our code, we test our own HAVE_SOCKADDR_SA_LEN constant, but whether to define this constant or not requires trying to compile a simple test program that uses this optional structure member and seeing if the compilation succeeds or not. We will see that IPv6 implementations are required to define SIN6_LEN if socket address structures have a length field. Some IPv4 implementations provide the length field of the socket address structure to the application based on a compile-time option (e.g., _SOCKADDR_LEN). This feature provides compatibility for older programs.

    • The POSIX specification requires only three members in the structure: sin_family, sin_addr, and sin_port. It is acceptable for a POSIX-compliant implementation to define additional structure members, and this is normal for an Internet socket address structure. Almost all implementations add the sin_zero member so that all socket address structures are at least 16 bytes in size.

    • We show the POSIX datatypes for the s_addr, sin_family, and sin_port members. The in_addr_t datatype must be an unsigned integer type of at least 32 bits, in_port_t must be an unsigned integer type of at least 16 bits, and sa_family_t can be any unsigned integer type. The latter is normally an 8-bit unsigned integer if the implementation supports the length field, or an unsigned 16-bit integer if the length field is not supported.

    • You will also encounter the datatypes u_char, u_short, u_int, and u_long, which are all unsigned. The POSIX specification defines these with a note that they are obsolete. They are provided for backward compatibility.

    • Both the IPv4 address and the TCP or UDP port number are always stored in the structure in network byte order (big-endian). We must be cognizant of this when using these members.

    • The 32-bit IPv4 address can be accessed in two different ways. For example, if serv is defined as an Internet socket address structure, then serv.sin_addr references the 32-bit IPv4 address as an in_addr structure, while serv.sin_addr.s_addr references the same 32-bit IPv4 address as an in_addr_t (typically an unsigned 32-bit integer). We must be certain that we are referencing the IPv4 address correctly, especially when it is used as an argument to a function, because compilers often pass structures differently from integers.

      The reason the sin_addr member is a structure, and not just an in_addr_t, is historical. Earlier releases (4.2BSD) defined the in_addr structure as a union of various structures, to allow access to each of the 4 bytes and to both of the 16-bit values contained within the 32-bit IPv4 address. This was used with class A, B, and C addresses to fetch the appropriate bytes of the address. But with the advent of subnetting and then the disappearance of the various address classes with classless addressing, the need for the union disappeared. Most systems today have done away with the union and just define in_addr as a structure with a single in_addr_t member.

    • The sin_zero member is unused, but we always set it to 0 when filling in one of these structures. By convention, we always set the entire structure to 0 before filling it in, not just the sin_zero member.

      Although most uses of the structure do not require that this member be 0, when binding a non-wildcard IPv4 address, this member must be 0.

    • Socket address structures are used only on a given host: The structure itself is not communicated between different hosts, although certain fields (e.g., the IP address and port) are used for communication.
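    The points above about zeroing the structure and storing values in network byte order can be sketched as follows. This is only a minimal illustration, not code from the text: fill_ipv4 is a hypothetical helper name, and we assume the address arrives as a dotted-decimal string and the port in host byte order.

```c
#include <arpa/inet.h>   /* htons, inet_pton */
#include <assert.h>
#include <netinet/in.h>  /* struct sockaddr_in */
#include <string.h>      /* memset */
#include <sys/socket.h>  /* AF_INET */

/* Hypothetical helper: fill *sa from a dotted-decimal address and a
 * host-byte-order port. Returns 0 on success, -1 if ip does not parse. */
int fill_ipv4(struct sockaddr_in *sa, const char *ip, unsigned short port)
{
    assert(sa != NULL && ip != NULL);
    memset(sa, 0, sizeof(*sa));     /* zeroes sin_zero along with everything else */
    sa->sin_family = AF_INET;
    sa->sin_port = htons(port);     /* host byte order -> network byte order */
    /* inet_pton stores the address in network byte order; note that we
     * pass the in_addr structure, not the in_addr_t inside it. */
    if (inet_pton(AF_INET, ip, &sa->sin_addr) != 1)
        return -1;
    return 0;
}
```

    The sin_len member is deliberately left alone, in line with the observation that applications need not set it outside of routing sockets.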

    Generic Socket Address Structure

    Socket address structures are always passed by reference when passed as an argument to any socket function. But any socket function that takes one of these pointers as an argument must deal with socket address structures from any of the supported protocol families.

    A problem arises in how to declare the type of pointer that is passed. With ANSI C, the solution is simple: void * is the generic pointer type. But, the socket functions predate ANSI C and the solution chosen in 1982 was to define a generic socket address structure in the <sys/socket.h> header.

    struct sockaddr {
      uint8_t         sa_len;
      sa_family_t  sa_family;    /* address family: AF_xxx value */
      char            sa_data[14];  /* protocol-specific address */
    };

    The socket functions are then defined as taking a pointer to the generic socket address structure, as shown here in the ANSI C function prototype for the bind function:

    int bind(int, struct sockaddr *, socklen_t);

    This requires that any calls to these functions must cast the pointer to the protocol-specific socket address structure to be a pointer to a generic socket address structure. For example,

    struct sockaddr_in  serv;      /* IPv4 socket address structure */

    /* fill in serv{} */

    bind(sockfd, (struct sockaddr *) &serv, sizeof(serv));

    If we omit the cast "(struct sockaddr *)," the C compiler generates a warning of the form "warning: passing arg 2 of 'bind' from incompatible pointer type," assuming the system's headers have an ANSI C prototype for the bind function.

    From an application programmer's point of view, the only use of these generic socket address structures is to cast pointers to protocol-specific structures.


    From the kernel's perspective, another reason for using pointers to generic socket address structures as arguments is that the kernel must take the caller's pointer, cast it to a struct sockaddr *, and then look at the value of sa_family to determine the type of the structure. But from an application programmer's perspective, it would be simpler if the pointer type was void *, omitting the need for the explicit cast.
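    The same look-at-sa_family-then-cast pattern is available to application code. The sketch below is an illustration under our own assumptions rather than anything the kernel actually runs: sockaddr_to_text is a hypothetical name, and only the two Internet families are handled.

```c
#include <arpa/inet.h>   /* inet_ntop */
#include <assert.h>
#include <netinet/in.h>  /* struct sockaddr_in, struct sockaddr_in6 */
#include <stddef.h>      /* size_t */
#include <sys/socket.h>  /* struct sockaddr, AF_INET, AF_INET6 */

/* Hypothetical helper: write the textual form of the address that sa
 * points to into buf. Returns buf on success, NULL for a family this
 * sketch does not handle or if buf is too small. */
const char *sockaddr_to_text(const struct sockaddr *sa, char *buf, size_t len)
{
    assert(sa != NULL && buf != NULL);
    switch (sa->sa_family) {          /* the family tells us the real type */
    case AF_INET: {
        const struct sockaddr_in *sin = (const struct sockaddr_in *) sa;
        return inet_ntop(AF_INET, &sin->sin_addr, buf, (socklen_t) len);
    }
    case AF_INET6: {
        const struct sockaddr_in6 *sin6 = (const struct sockaddr_in6 *) sa;
        return inet_ntop(AF_INET6, &sin6->sin6_addr, buf, (socklen_t) len);
    }
    default:
        return NULL;
    }
}
```

    A caller passes any protocol-specific structure with the usual cast, for example sockaddr_to_text((struct sockaddr *) &serv, buf, sizeof(buf)).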


    IPv6 Socket Address Structure


    struct in6_addr {
      uint8_t  s6_addr[16];          /* 128-bit IPv6 address */
                                               /* network byte ordered */
    };

    #define SIN6_LEN      /* required for compile-time tests */

    struct sockaddr_in6 {
      uint8_t            sin6_len;           /* length of this struct (28) */
      sa_family_t      sin6_family;      /* AF_INET6 */
      in_port_t         sin6_port;         /* transport layer port# */
                                                    /* network byte ordered */
      uint32_t           sin6_flowinfo;  /* flow information, undefined */
      struct in6_addr sin6_addr;        /* IPv6 address */
                                                    /* network byte ordered */
      uint32_t            sin6_scope_id; /* set of interfaces for a scope */
    };

    The extensions to the sockets API for IPv6 are defined in RFC 3493 [Gilligan et al. 2003].

    Note the following points about sockaddr_in6:



    • The SIN6_LEN constant must be defined if the system supports the length member for socket address structures.



    • The IPv6 family is AF_INET6, whereas the IPv4 family is AF_INET.



    • The members in this structure are ordered so that if the sockaddr_in6 structure is 64-bit aligned, so is the 128-bit sin6_addr member. On some 64-bit processors, data accesses of 64-bit values are optimized if stored on a 64-bit boundary.



    • The sin6_flowinfo member is divided into two fields:



      • The low-order 20 bits are the flow label



      • The high-order 12 bits are reserved


      The use of the flow label field is still a research topic.



    • The sin6_scope_id identifies the scope zone in which a scoped address is meaningful, most commonly an interface index for a link-local address.
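    As a rough sketch of how these members are typically filled in, consider the following. The helper name fill_ipv6 and its parameters are our own assumptions; the SIN6_LEN test is the compile-time check described above, so the length member is touched only where it exists.

```c
#include <arpa/inet.h>   /* htons, inet_pton */
#include <assert.h>
#include <netinet/in.h>  /* struct sockaddr_in6 */
#include <stdint.h>      /* uint32_t */
#include <string.h>      /* memset */
#include <sys/socket.h>  /* AF_INET6 */

/* Hypothetical helper: fill *sa6 from a textual IPv6 address, a
 * host-byte-order port, and a scope ID (0 for an unscoped address).
 * Returns 0 on success, -1 if ip does not parse. */
int fill_ipv6(struct sockaddr_in6 *sa6, const char *ip,
              unsigned short port, uint32_t scope_id)
{
    assert(sa6 != NULL && ip != NULL);
    memset(sa6, 0, sizeof(*sa6));       /* leaves sin6_flowinfo at 0 */
#ifdef SIN6_LEN
    sa6->sin6_len = sizeof(*sa6);       /* only where the length member exists */
#endif
    sa6->sin6_family = AF_INET6;
    sa6->sin6_port = htons(port);       /* network byte order */
    sa6->sin6_scope_id = scope_id;      /* e.g., an interface index for fe80::/10 */
    if (inet_pton(AF_INET6, ip, &sa6->sin6_addr) != 1)
        return -1;
    return 0;
}
```

    For a link-local destination, the scope ID would usually come from if_nametoindex; here it is simply passed in by the caller.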


    New Generic Socket Address Structure


    A new generic socket address structure was defined as part of the IPv6 sockets API, to overcome some of the shortcomings of the existing struct sockaddr. Unlike the struct sockaddr, the new struct sockaddr_storage is large enough to hold any socket address type supported by the system. POSIX defines the sockaddr_storage structure in the <sys/socket.h> header; on many systems, including <netinet/in.h> makes it visible as well.

    struct sockaddr_storage {
      uint8_t        ss_len;          /* length of this struct (implementation dependent) */
      sa_family_t  ss_family;    /* address family: AF_xxx value */
      /* implementation-dependent elements to provide:
       * a) alignment sufficient to fulfill the alignment requirements of
       *    all socket address types that the system supports.
       * b) enough storage to hold any type of socket address that the
       *    system supports.
       */
    };

    The sockaddr_storage type provides a generic socket address structure that is different from struct sockaddr in two ways:



    1. If any socket address structures that the system supports have alignment requirements, the sockaddr_storage provides the strictest alignment requirement.



    2. The sockaddr_storage is large enough to contain any socket address structure that the system supports.


    Note that the fields of the sockaddr_storage structure are opaque to the user, except for ss_family and ss_len (if present). The sockaddr_storage must be cast or copied to the appropriate socket address structure for the address given in ss_family to access any other fields.
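    The copy-in and cast-out usage described here might look like the following sketch. Both function names are hypothetical, and only the IPv4 and IPv6 families are handled; anything else is reported as unknown.

```c
#include <arpa/inet.h>   /* ntohs */
#include <assert.h>
#include <netinet/in.h>  /* struct sockaddr_in, struct sockaddr_in6 */
#include <string.h>      /* memset, memcpy */
#include <sys/socket.h>  /* struct sockaddr_storage */

/* Hypothetical helper: copy any socket address into generic storage. */
void storage_set(struct sockaddr_storage *ss, const void *addr, socklen_t len)
{
    assert(ss != NULL && addr != NULL);
    assert(len <= sizeof(*ss));     /* holds for every supported family */
    memset(ss, 0, sizeof(*ss));
    memcpy(ss, addr, len);
}

/* Hypothetical helper: return the port in host byte order, or -1 if
 * ss_family is neither AF_INET nor AF_INET6. Only ss_family is read
 * directly; everything else goes through a cast to the real type. */
int storage_port(const struct sockaddr_storage *ss)
{
    assert(ss != NULL);
    switch (ss->ss_family) {
    case AF_INET:
        return ntohs(((const struct sockaddr_in *) ss)->sin_port);
    case AF_INET6:
        return ntohs(((const struct sockaddr_in6 *) ss)->sin6_port);
    default:
        return -1;
    }
}
```

    In practice the storage is more often filled by the kernel, for instance by passing a sockaddr_storage (cast to struct sockaddr *) and its size to accept or getsockname.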

    Comparison of Socket Address Structures


    2.10


    The figure shows a comparison of the five socket address structures: IPv4, IPv6, Unix domain, datalink, and storage. In this figure, we assume that the socket address structures all contain a one-byte length field, that the family field also occupies one byte, and that any field that must be at least some number of bits is exactly that number of bits.

    Two of the socket address structures are fixed-length, while the Unix domain structure and the datalink structure are variable-length. To handle variable-length structures, whenever we pass a pointer to a socket address structure as an argument to one of the socket functions, we pass its length as another argument. We show the size in bytes (for the 4.4BSD implementation) of the fixed-length structures beneath each structure.


    The sockaddr_un structure itself is not variable-length, but the amount of information—the pathname within the structure—is variable-length. When passing pointers to these structures, we must be careful how we handle the length field, both the length field in the socket address structure itself (if supported by the implementation) and the length to and from the kernel.
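    To make the distinction between the fixed structure size and the variable amount of information concrete, here is a small sketch that computes the length to hand to bind or connect for a given pathname. The name fill_unix is hypothetical. The returned length excludes the terminating null byte, which is the same arithmetic the BSD SUN_LEN macro performs; many programs simply pass sizeof(struct sockaddr_un) instead.

```c
#include <assert.h>
#include <stddef.h>      /* offsetof */
#include <string.h>      /* strlen, memset, memcpy */
#include <sys/socket.h>  /* AF_UNIX, socklen_t */
#include <sys/un.h>      /* struct sockaddr_un */

/* Hypothetical helper: fill *un with path and return the address length
 * for that pathname, or 0 if path does not fit in sun_path. */
socklen_t fill_unix(struct sockaddr_un *un, const char *path)
{
    size_t n;

    assert(un != NULL && path != NULL);
    n = strlen(path);
    if (n >= sizeof(un->sun_path))      /* need room for the null byte too */
        return 0;
    memset(un, 0, sizeof(*un));
    un->sun_family = AF_UNIX;
    memcpy(un->sun_path, path, n + 1);  /* copy including the terminator */
    /* Bytes before sun_path plus the pathname itself. */
    return (socklen_t) (offsetof(struct sockaddr_un, sun_path) + n);
}
```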

    This figure shows the style that we follow throughout the text: structure names are always shown in a bolder font, followed by braces, as in sockaddr_in{}.

    We noted earlier that the length field was added to all the socket address structures with the 4.3BSD-Reno release. Had the length field been present with the original release of sockets, there would be no need for the length argument to all the socket functions: the third argument to bind and connect, for example. Instead, the size of the structure could be contained in the length field of the structure.