by Alessandro Rubini
In the Linux (or Unix) world, most network interfaces, such as
eth0 and ppp0, are associated with a physical device that is in charge
of transmitting and receiving data packets. However, there are
exceptions to this rule, and some logical network interfaces don't
feature any physical packet transmission; the best-known examples are
the shaper and eql interfaces. This article shows how such ``virtual''
interfaces attach to the kernel and to the packet transmission mechanism.
From the kernel's point of view, a network interface is a software object that can process outgoing packets, and the actual transmission mechanism remains hidden inside the interface driver. Even though most interfaces are associated with physical devices (or, for the loopback interface, with a software-only data loop), it is possible to design network interface drivers that rely on other interfaces to perform the actual packet transmission. The idea of a ``virtual'' interface can be useful to implement special-purpose processing on data packets without hacking the network subsystem of the kernel. To support this discussion with a real-world example, I wrote the insane (INterface SAmple for Network Errors) driver, available as insane.tar.gz. The interface simulates semi-random packet loss or intermittent network failures. The code fragments shown here are part of the insane driver and have been tested with Linux-2.3.41.
While the following description is rather terse, the sample code is well commented and tries to fill the gaps left open by this quick tour of the topic.
Like other kinds of device drivers, a network interface module
connects to the rest of Linux by registering its own data structure
within the kernel. The insane driver, for example, registers
itself by calling ``register_netdev(&insane_dev);''.
The device structure being registered, insane_dev, is a
struct net_device object (Linux 2.3.13 and earlier
called it struct device), and it must feature at least
two valid fields: the interface name and a pointer to its
initialization function:
	static struct net_device insane_dev = {
		name: "insane",
		init: insane_init,
	};
The init callback is meant for internal use by the driver: it usually fills other fields of the data structure with pointers to device methods, the functions that perform the real work during the interface lifetime. When an interface driver is linked into the kernel (instead of being loaded as a module), the first task of the init function is to check whether the interface hardware is there.
As you may imagine, the interface can be removed by calling unregister_netdev(), usually invoked by cleanup_module() (or not invoked at all if the driver is not modularized).
The net_device structure includes, in addition to all
the standardized fields, a ``private'' pointer (a void *)
that the driver can use as it sees fit. For virtual
interfaces, the private field is the best place to host
configuration information; the insane sample interface follows
the good practice of allocating its own priv structure at
initialization time:
    /* priv is used to host the statistics, and packet dropping policy */
    dev->priv = kmalloc(sizeof(struct insane_private), GFP_USER);
    if (!dev->priv) return -ENOMEM;
    memset(dev->priv, 0, sizeof(struct insane_private));
The allocation is released at interface shutdown (i.e., when the module is removed from the kernel).
A network interface object, like most kernel objects, exports a
list of methods so the rest of the kernel can use it. These methods
are function pointers located in fields of the object data structure,
here struct net_device.
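Since kernel code can't easily be tried out in isolation, the following user-space sketch models the idea: a structure holding function pointers, filled in by an init callback and then invoked only through those pointers. Every name here (struct toy_netdev, toy_init, toy_register, and the use_count variable standing in for MOD_INC_USE_COUNT) is hypothetical; the real struct net_device has many more fields.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for kernel types: only the fields needed here. */
struct toy_stats { unsigned long tx_packets, tx_bytes; };
struct toy_private { struct toy_stats priv_stats; };

struct toy_netdev {
    char name[16];
    int start;                              /* like dev->start in 2.3 kernels */
    void *priv;                             /* driver-private data */
    int (*init)(struct toy_netdev *dev);
    int (*open)(struct toy_netdev *dev);
    int (*stop)(struct toy_netdev *dev);
    struct toy_stats *(*get_stats)(struct toy_netdev *dev);
};

static int use_count;                       /* stands in for MOD_INC/DEC_USE_COUNT */

static int toy_open(struct toy_netdev *dev)  { dev->start = 1; use_count++; return 0; }
static int toy_close(struct toy_netdev *dev) { dev->start = 0; use_count--; return 0; }

static struct toy_stats *toy_get_stats(struct toy_netdev *dev)
{
    return &((struct toy_private *)dev->priv)->priv_stats;
}

/* The init callback wires up the methods and allocates zeroed private data. */
static int toy_init(struct toy_netdev *dev)
{
    dev->open = toy_open;
    dev->stop = toy_close;
    dev->get_stats = toy_get_stats;
    dev->priv = calloc(1, sizeof(struct toy_private));
    return dev->priv ? 0 : -1;              /* the kernel driver returns -ENOMEM */
}

/* "Registering" boils down to running init once; afterwards callers
 * reach the driver only through the function pointers. */
static int toy_register(struct toy_netdev *dev)
{
    assert(dev->init != NULL);
    return dev->init(dev);
}
```

After registration, the caller never names toy_open directly; like the kernel, it only goes through dev->open, which is what lets every driver supply its own implementation.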
An interface can be perfectly functional by exporting just a subset
of all the methods; the recommended minimum subset includes
open, stop (i.e., ``close''), do_ioctl and
get_stats. These methods are directly related to system calls
invoked by a user program (such as ifconfig). With the
exception of ioctl, which needs some detailed discussion, their
implementation is pretty trivial, and they turn out to be just a few
lines of code.
      int insane_open(struct net_device *dev)
      {
          dev->start = 1;
          MOD_INC_USE_COUNT;
          return 0;
      }
      int insane_close(struct net_device *dev)
      {
          dev->start = 0;
          MOD_DEC_USE_COUNT;
          return 0;
      }
      struct net_device_stats *insane_get_stats(struct net_device *dev)
      {
          return &((struct insane_private *)dev->priv)->priv_stats;
      }
The open method is called when you run ``ifconfig
insane up'', and close deals with ``ifconfig
insane down''; get_stats returns a pointer to the local
statistics structure and is used by ifconfig as well as by the
informative files in /proc. The driver is responsible for
filling the statistics information (although it may choose not to);
the fields are defined in <linux/netdevice.h>.
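The counters returned by get_stats are the ones that ifconfig and /proc/net/dev report. As a sketch, assuming the /proc/net/dev layout of current kernels (the interface name and a colon, eight receive counters, then the transmit counters starting with bytes and packets), this hypothetical helper extracts tx_packets for one interface from a single line of that file:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: parse one /proc/net/dev line.  Assumed layout:
 * "name: rx_bytes rx_packets errs drop fifo frame compressed multicast
 *        tx_bytes tx_packets ..."
 * Returns 0 and stores tx_packets on success, -1 if the line belongs to
 * another interface or is malformed. */
static int proc_net_dev_tx_packets(const char *line, const char *ifname,
                                   unsigned long long *tx_packets)
{
    const char *colon = strchr(line, ':');
    unsigned long long rx[8], tx_bytes;
    size_t len;

    if (!colon)
        return -1;
    while (*line == ' ')              /* names are right-aligned with spaces */
        line++;
    len = (size_t)(colon - line);
    if (len != strlen(ifname) || strncmp(line, ifname, len) != 0)
        return -1;
    if (sscanf(colon + 1, "%llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
               &rx[0], &rx[1], &rx[2], &rx[3], &rx[4], &rx[5], &rx[6], &rx[7],
               &tx_bytes, tx_packets) != 10)
        return -1;
    return 0;
}
```

On a live system you would read /proc/net/dev line by line with fgets() and pass each line to the helper; for insane, the value found this way is whatever the driver stored in priv_stats.tx_packets.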
There are other methods, more related to the low-level details of packet transmission, but they fall outside the scope of this discussion (they are shown in the source package, though). The only low-level method of interest here is hard_start_xmit, discussed later.
The do_ioctl entry point is the most important one for
virtual interfaces. When a user program configures the behavior of
the interface, it does so by invoking the ioctl() system
call. This is how shapecfg defines network shaping and how
eql_enslave attaches real interfaces to the load-balancing
interface eql. Similarly, the insanely
application configures the behavior of the insane virtual interface.
Unlike what happens for ``normal'' device drivers (char and block
drivers), the implementation of ioctl for interfaces is pretty
well-defined: the invoking file descriptor must be a socket, the
available commands are only SIOCDEVPRIVATE to
SIOCDEVPRIVATE+15, and the infamous ``third argument'' of
the system call is always a struct ifreq * pointer,
instead of the generic void * pointer. This
``restriction'' in ioctl arguments exists because socket
ioctl commands span several logical layers and several
protocols; the predefined values are reserved for device private use
and are unique throughout the protocol stack (note that no other
ioctl command will be delivered to the network interface
method, so you really cannot choose your own values). Passing a
predefined data structure to ioctl doesn't limit the
flexibility of interface configuration, as the ifreq
structure includes a data field, ifr_data, a caddr_t
value that can point to arbitrary configuration information.
Based on the information above, the insane interface can be
controlled using these commands (defined in "insane.h"):
      #define SIOCINSANESETINFO SIOCDEVPRIVATE
      #define SIOCINSANEGETINFO (SIOCDEVPRIVATE+1)
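The range rule above can be expressed as a simple check. In this sketch, SIOCDEVPRIVATE comes from <linux/sockios.h>, the two insane commands are copied from insane.h as listed above, and is_device_private_cmd() is a hypothetical helper, not a kernel function; the bound reflects the kernels discussed in this article.

```c
#include <assert.h>
#include <linux/sockios.h>    /* SIOCDEVPRIVATE */

/* The two commands defined in insane.h, as listed above. */
#define SIOCINSANESETINFO SIOCDEVPRIVATE
#define SIOCINSANEGETINFO (SIOCDEVPRIVATE+1)

/* Hypothetical helper: nonzero if cmd lies in the 16-value range that,
 * in the kernels discussed here, is passed on to a driver's do_ioctl. */
static int is_device_private_cmd(unsigned long cmd)
{
    return cmd >= SIOCDEVPRIVATE && cmd <= SIOCDEVPRIVATE + 15;
}
```

Both insane commands fall inside the range, while SIOCDEVPRIVATE+16 already falls outside it; a driver therefore has at most sixteen private commands to work with.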
Actual use of the commands within the user-space program
insanely turns out to be pretty simple:
      int sock = socket(AF_INET, SOCK_RAW, IPPROTO_RAW);
      struct insane_userinfo info; /* configuration data in/out */
      struct ifreq req;

      if (sock < 0) {
           /* deal with error (raw sockets need root privileges) */
      }
      memset(&req, 0, sizeof(req));
      strncpy(req.ifr_name, "insane", IFNAMSIZ - 1);
      req.ifr_data = (caddr_t)&info;
      /* fill info structure... */
      if (ioctl(sock, SIOCINSANESETINFO, &req) < 0) {
           /* deal with error */
      }
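On the kernel side, the do_ioctl method receives the struct ifreq pointer (ifr) and the command number (cmd), checks permissions and copies the user data into kernel space:
      struct insane_userinfo info;
      struct insane_userinfo *uptr;

      /* only authorized users can control the interface */
      if (cmd == SIOCINSANESETINFO && !capable(CAP_NET_ADMIN))
          return -EPERM;

      /* retrieve the data structure from user space */
      uptr = (struct insane_userinfo *)ifr->ifr_data;
      if (copy_from_user(&info, uptr, sizeof(info)))
          return -EFAULT; /* copy_from_user() returns the bytes left uncopied */
      /* deal with the information */
      return 0;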
The most important entry point for a network interface driver is hard_start_xmit, where ``hard'' is shorthand for hardware. This method is called whenever a network packet is routed through the interface. Unlike the methods described above (and like the ones not discussed here), it is not directly related to any system call or application; rather, it is used by the network subsystem of the Linux kernel according to its own policies.
For virtual interfaces, no actual hardware
transmission takes place in the interface itself. The interface
instead resorts to another network interface to perform
transmission. Packet passing is implemented in two steps: first
(usually at configuration time, within ioctl), the interface
must connect to another interface, the one that can transmit packets;
then, its own hard_start_xmit must take proper action to pass
the packet.
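The first step takes place within ioctl, where insane looks up the slave interface by name:
      /* look for the hardware interface */
      slave = __dev_get_by_name(info.name);
      if (!slave) return -ENODEV;
      priv->priv_device = slave;
The second step takes place in hard_start_xmit, which accounts for the packet and hands it over to the slave:
      /* .... */
      /* update your statistic counters */
      priv->priv_stats.tx_packets++;
      priv->priv_stats.tx_bytes += skb->len;
      /* assign the packet to the hw interface */
      skb->dev = priv->priv_device;
      /* and tell Linux to pass it to its device */
      dev_queue_xmit(skb);
      return 0; /* the packet has been taken care of */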
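The two steps can be modeled in user space to see the bookkeeping at work. This is a sketch under stated assumptions: struct model_dev, bind_slave() and virtual_xmit() are hypothetical; a linear search over a fixed table plays the role of __dev_get_by_name(), and bumping the slave's counters plays the role of dev_queue_xmit().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model: every device counts what it transmits; a virtual
 * device also remembers its slave, like priv->priv_device in insane. */
struct model_dev {
    const char *name;
    unsigned long tx_packets, tx_bytes;
    struct model_dev *slave;
};

static struct model_dev eth0 = { "eth0" }, ppp0 = { "ppp0" };
static struct model_dev *devices[] = { &eth0, &ppp0 };

/* Step 1 (configuration time): look the slave up by name. */
static int bind_slave(struct model_dev *vdev, const char *name)
{
    for (size_t i = 0; i < sizeof devices / sizeof devices[0]; i++)
        if (strcmp(devices[i]->name, name) == 0) {
            vdev->slave = devices[i];
            return 0;
        }
    return -1;                      /* the driver returns -ENODEV here */
}

/* Step 2 (per packet): count on the virtual device, then let the
 * slave "transmit" the same packet. */
static int virtual_xmit(struct model_dev *vdev, size_t len)
{
    if (!vdev->slave)
        return -1;                  /* not configured yet: drop */
    vdev->tx_packets++;
    vdev->tx_bytes += len;
    vdev->slave->tx_packets++;
    vdev->slave->tx_bytes += len;
    return 0;
}
```

Because both devices keep their own counters, the statistics of the virtual interface and of its slave can diverge as soon as the virtual interface starts dropping packets, which is the effect visible with tcpdump later in this article.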
In a perfect world, the virtual interface should also register a notifier callback, so that Linux tells the driver when the physical interface goes away: if the slave interface is a module, its removal will make insane unhappy. The released insane implementation doesn't register any callback, and making it saner is left as an exercise for the reader.
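In the kernel, such a callback is registered with register_netdevice_notifier() and reacts to the NETDEV_UNREGISTER event. The user-space sketch below only models the idea with a hypothetical minimal notifier chain (struct notifier, notifier_register() and notify_all() are not kernel APIs): when the slave is unregistered, the virtual interface forgets it, so later transmissions can be dropped instead of touching a stale pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, minimal notifier chain; the kernel's is richer. */
enum { EV_UNREGISTER = 1 };

struct notifier {
    int (*call)(struct notifier *nb, unsigned long event, void *dev);
    struct notifier *next;
};

static struct notifier *chain;

static void notifier_register(struct notifier *nb)
{
    nb->next = chain;
    chain = nb;
}

/* Tell every registered callback that something happened to dev. */
static void notify_all(unsigned long event, void *dev)
{
    for (struct notifier *nb = chain; nb; nb = nb->next)
        nb->call(nb, event, dev);
}

/* State of the virtual interface: the device it relays to. */
static void *slave_dev;

/* Forget the slave when it disappears; other devices are ignored. */
static int virt_dev_event(struct notifier *nb, unsigned long event, void *dev)
{
    (void)nb;
    if (event == EV_UNREGISTER && dev == slave_dev)
        slave_dev = NULL;
    return 0;
}
```

A transmit routine built on this model would then check slave_dev for NULL before relaying each packet.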
When network packets hit an interface board, they generate an interrupt so that the Operating System can handle packet arrival (the only exception is the loopback interface, whose reception mechanism is part of packet transmission).
A virtual interface, on the other hand, has no way to receive interrupts, and thus it cannot receive any network packets. This can be perceived as unfortunate, because it would be nice to attach the same software operations to both directions of data flow. But the mechanics of packet reception don't allow virtual interfaces to enter the game, and whoever needs to intercept incoming packets must use other ways to hook into the packets' path. This kind of functionality goes beyond the scope of this discussion and leans very much towards the way netfilter works.
All of this talk may look rather pointless unless we can see it
at work. The insane interface relies on an Ethernet interface
for physical transmission, and it can be configured to operate in one of
three insane modes. It can relay every packet (``pass''
mode), relay only some percentage of packets (``percent''
mode, with an integer parameter), or turn relaying on and off on a
periodic basis (``time'' mode, with two
parameters, on-time and off-time, specified as jiffy counts:
architecture-dependent time quanta that correspond to 10ms each on
the PC platform). Here are three examples of use of insanely:
      # insanely eth0 pass         ; # relay everything to eth0
      # insanely eth0 percent 80   ; # drop 20% (pseudo random)
      # insanely eth0 time 50 100  ; # relay for .5 seconds, drop for 1s
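The three modes boil down to a per-packet relay decision. The following sketch models that decision in user space; the names (enum drop_mode, struct drop_policy, should_relay()) and the use of rand() are assumptions for illustration, not the insane driver's actual code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of the three insanely modes. */
enum drop_mode { MODE_PASS, MODE_PERCENT, MODE_TIME };

struct drop_policy {
    enum drop_mode mode;
    int percent;              /* MODE_PERCENT: share of packets relayed */
    unsigned long on, off;    /* MODE_TIME: relay/drop periods in jiffies */
};

/* Return 1 to relay the packet, 0 to drop it.  `now` is the current
 * time in jiffies (10 ms each on a PC running with HZ=100). */
static int should_relay(const struct drop_policy *p, unsigned long now)
{
    switch (p->mode) {
    case MODE_PASS:
        return 1;
    case MODE_PERCENT:
        return rand() % 100 < p->percent;   /* pseudo random */
    case MODE_TIME:
        if (p->on + p->off == 0)
            return 0;                       /* avoid a division by zero */
        /* relay during the first `on` jiffies of each period */
        return now % (p->on + p->off) < p->on;
    }
    return 0;
}
```

With on=50 and off=100, packets are relayed while now % 150 is below 50 and dropped for the remaining 100 jiffies of each period, which matches the ``relay for .5 seconds, drop for 1s'' example above.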
In order to connect insane to the network, you need to
assign a ``local'' IP address to the interface (that IP address will
be used as the ``source address'' and by remote hosts to send
their replies) and route some packets through it. Current versions of
Linux automatically associate a network route with each device, and this
route cannot be removed. Therefore, we can't re-route the whole LAN
through insane at once, and the following example reroutes a
single host, called "morgana", in the routing table of the host
"borea".
      borea# insmod insane                 ; # load module
      borea# ifconfig insane borea         ; # give same IP as eth0
      borea# route add morgana dev insane  ; # re-route this host
      borea# ./insanely eth0 percent 60    ; # set dropping rate
Unfortunately, due to a glitch in Linux-2.3.41, you'll also need to
disable reverse-path filtering on the Ethernet interface used by
insane. The following command worked for me:
``echo 0 > /proc/sys/net/ipv4/conf/eth0/rp_filter''.
With this setup, you can connect to morgana with any
protocol you like and experience a 40% packet loss, although only on
the packets transmitted by borea, unless morgana runs another
instance of insane with a similar configuration.
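A quick calculation shows what the second instance would add. Assuming that drops on the two hosts are independent and that both relay 60% of their packets (the ``percent 60'' setting used above), a request/reply exchange completes only when both packets get through:

```latex
P(\text{exchange completes}) = p_{\mathrm{borea}} \cdot p_{\mathrm{morgana}} = 0.6 \times 0.6 = 0.36
```

So roughly 64% of exchanges would fail, against 40% when insane runs on borea alone.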
An interesting effect of this transmission path through two
interfaces is that you can run tcpdump on both
eth0 and insane and see different results.
While ``tcpdump -i eth0'' shows the packets actually
transmitted, ``tcpdump -i insane'' displays every packet
sent out by the protocol layers, before any dropping is applied.

rubini@gnu.org.
Verbatim copying and distribution of this entire article is permitted in any medium, provided this notice is preserved