iproute2+tc notes

The iproute2+tc package allows access to the variety of neat new networking features in the 2.2 kernels. Policy routing, NAT, QoS, advanced tunnels, RSVP and Differentiated services, are just a few of the buzzword capabilities unleashed by the ip and tc programs.


Contents

  1. Introduction
  2. Distribution
    1. Building
    2. Kernel Configuration
  3. Documentation
    1. from the Distribution
    2. from the Kernel
    3. from everywhere else
  4. Command Syntax
    1. ip command
      1. ip link
      2. ip address
      3. ip neighbor
      4. ip rule
      5. ip route
      6. ip tunnel
      7. ip maddress
      8. ip mroute
      9. ip monitor
    2. rtmon command
    3. tc command
      1. qdisc
        1. qdisc [p|b]fifo
        2. qdisc tbf
        3. qdisc cbq
        4. qdisc red
        5. qdisc sfq
        6. qdisc prio
        7. qdisc csz
      2. class
        1. class cbq
      3. estimator
      4. filter
        1. filter rsvp
        2. filter u32
        3. filter fw
        4. filter route
      5. police

Introduction

This is the second version of this document; the first began as a simple command reference for my own use and evolved into a slightly more complex document, including some usage notes and a bunch of links to documents. That version became dated rather rapidly, and so I've redone it from fresh sources. Hopefully the pace of development has slowed somewhat (at least on the user interface side of things), and this version will remain useful a bit longer.

Suggestions, additions, sources, comments, etc to dragon@snafu.freedom.org. Specifically more detailed explanations of the commands listed in the Command Syntax section, for inclusion, would be extremely welcome.

All of the information pulled from linux/ files comes from linux-2.2.14-pre9. The iproute2/ files come from the iproute2-2.2.4-now-ss991023.tar.gz distribution.


Distribution

From the Documentation/Configure.help file of the linux distribution; "QoS and/or fair queueing" item:

  To administer these schedulers, you'll need the user-level utilities
  from the package iproute2+tc at ftp://ftp.inr.ac.ru/ip-routing/

That directory is mirrored at:

  • ftp://linux.wauug.org/pub/net
  • ftp://ftp.nc.ras.ru/pub/mirrors/ftp.inr.ac.ru/ip-routing/
  • ftp://ftp.gts.cz/MIRRORS/ftp.inr.ac.ru/
  • ftp://ftp.funet.fi/pub/mirrors/ftp.inr.ac.ru/ip-routing/ (STM1 to USA)
  • ftp://sunsite.icm.edu.pl/pub/Linux/linux-iproute/
  • ftp://ftp.sunet.se/pub/Linux/ip-routing/

    It's probably worth picking up the latest version; if your distribution includes these tools at all they're likely to be a revision or two behind.

    Building

    Unpack the distribution tarball. Edit the Makefile; you need to tell it where to find the kernel include files with the KERNEL_INCLUDES variable. You might also need to tell it about libraries or something; I didn't have to. A simple "make all" compiles the programs ip/ip, ip/rtmon, and tc/tc. I didn't get but a couple of warning messages during the compile. Copy those programs to /sbin, and you're ready to play.

    Keep the unpacked distribution, it's got several examples and utility scripts which may well be useful, especially if you're going to be using the tc program to do QoS stuff. In that case, you may also need to edit a kernel header file. In the tc/ directory of the distribution there's a file named "README.last", that explains the possible values and meanings of PSCHED_CLOCK_SOURCE in the file linux/include/net/pkt_sched.h. The default "PSCHED_JIFFIES" there is probably what you want anyway.

    Kernel Configuration

    Most of the capabilities of the kernel accessed by the ip and tc programs probably aren't included in your distribution's stock kernel configuration. If you can't get a feature to work, make sure that all of the required kernel configuration values are set properly. I've tried to note what kernel configuration options are required or related in the Command Syntax section. I'm often guessing, so don't be suprised if I've missed something (please do tell me about it, though).


    Documentation

    If you can't find the answers to your questions in this document or in the documentation listed below, you might try searching through the linux-net mailing list and linux-kernel mailing list archives and/or asking the habitants of those lists.

    from the Distribution

    In the iproute2/ directory you'll find the README file, which contains build instructions; the RELNOTES file, and the README.iproute2+tc file, which discusses the tc Traffic Control tools. The README.decnet file contains some points relevant to DECnet. There's a README.last file in the iproute2/tc/ directory that discusses the clock source for the QoS code; it's a bit dated but it explains what the options are.

    In the iproute2/doc/ directory are several documents; the IP Command Reference (HTML, TeX, and Postscript); Tunnels over IP in Linux-2.2 (HTML, TeX, and Postscript); and IPv6 Flow Labels in Linux-2.2 (HTML, TeX, and Postscript).

    from the Kernel

    The kernel's Documentation/networking/ has a couple of files; the routing.txt file, and the policy-routing.txt file. These are slim and outdated, but they introduce a couple of important concepts. I've got a partial kernel Configure.help file, mostly networking options, that should be a little easier to read than via make config.

    from everywhere else

    Timur A. Bolokhov has started an "Advanced routing mini-HOWTO" (HTML,TeX and Postscript), which ain't bad at all. The date on those files from the original site (ftp://post.tepkom.ru/pub/vol2/Linux/docs/) was Sep 29 1998; they might be a little dated, but not so much so as the kernel docs.

    Saravanan Radhakrishan has put together a fine collection of links and documentation at the IITC QoS site, dealing with the implementation details of the QoS system and how to attack it from C via netlink. I haven't read it all yet but it should be helpful. Specifically see the Linux-Qos-HOWTO (local copy) and the Netlink HOWTO (local copy).

    The README.iproute2+tc file mentions "Terminology and advices about setting CBQ parameters may be found in Sally Floyd papers." Some of those can be found at her site. She 's also got pages listing resources for CBQ, and RED. Programmers looking to interface with the kernel side of the new networking capabilities may find some joy in Andi Kleen's new network man pages.

    Juergen Jaeschke pointed me to the Differentiated Services on Linux Internet Draft, which contains some good discussion of the iproute tc traffic control tools.

    Werner Almesberger has posted the slides he used for his presentations at the Singapore Linux conference '99 and the Linux Expo '99; these are PostScript files. I've got copies of both, Linux Network Traffic Control Implementation Overview slides from Linux Expo, and the QoS Networking with Linux slides from the Singapore conference. He also has an excellent Linux Traffic Control - Implementation Overview paper on his site (which I've copied).

    I've collected some mail list messages that discuss the usage of these tools. One of the mail list messages responding to a request for information on the traffic control toys said "see Cisco Docs"; the IOS v12 Configuration Guides and References and most specifically the Quality of Service Solutions Configuration Guide are likely to be helpful.

    Open Resource has an article on linux 2.2, which includes a some discussion of the new networking.


    Command Syntax


    ip command

    From file iproute2/ip/ip.c
    Usage: ip [ OPTIONS ] OBJECT { COMMAND | help }
    where  OBJECT := { link | addr | route | rule | neigh | tunnel |
                       maddr | mroute | monitor }
           OPTIONS := { -V[ersion] | -s[tatistics] | -r[esolve] |
                        -f[amily] { inet | inet6 | dnet | link } | -o[neline] }
    

    See the IP Command Reference.

    If it's called as "ipaddr", "iproute", "iprule", "ipneigh", "iplink", "iptunnel", or "ipmonitor" it runs the appropriate subcommand. So you can create a link to "iproute", and use:

    # iproute (whatever)
    
    instead of:
    # ip route (whatever)
    

    ip link

    From file iproute2/ip/iplink.c
    Usage: ip link set DEVICE { up | down | arp { on | off } |
    	                     dynamic { on | off } |
    	                     multicast { on | off } | txqueuelen PACKETS |
    	                     name NEWNAME |
    	                     address LLADDR | broadcast LLADDR |
    	                     mtu MTU }
           ip link show [ DEVICE ]
    

    See the IP Command Reference.


    ip address

    From file iproute2/ip/ipaddress.c
    Usage: ip addr {add|del} IFADDR dev STRING
           ip addr {show|flush} [ dev STRING ] [ scope SCOPE-ID ]
                                [ to PREFIX ] [ FLAG-LIST ] [ label PATTERN ]
    IFADDR := PREFIX | ADDR peer PREFIX
              [ broadcast ADDR ] [ anycast ADDR ]
              [ label STRING ] [ scope SCOPE-ID ]
    SCOPE-ID := [ host | link | global | NUMBER ]
    FLAG-LIST := [ FLAG-LIST ] FLAG
    FLAG  := [ permanent | dynamic | secondary | primary |
               tentative | deprecated ]
    

    See the IP Command Reference.


    ip neighbor

    From file iproute2/ip/ipneigh.c
    Usage: ip neigh { add | del | change | replace } { ADDR [ lladdr LLADDR ]
              [ nud { permanent | noarp | stale | reachable } ]
              | proxy ADDR } [ dev DEV ]
           ip neigh {show|flush} [ to PREFIX ] [ dev DEV ] [ nud STATE ]
    

    See the IP Command Reference.


    ip rule

    Kernel Config: IP: advanced router, IP: policy routing, IP: use TOS value as routing key, IP: use FWMARK value as routing key, IP: fast network address translation

    From file iproute2/ip/iprule.c
    Usage: ip rule [ list | add | del ] SELECTOR ACTION
    SELECTOR := [ from PREFIX ] [ to PREFIX ] [ tos TOS ] [ fwmark FWMARK ]
                [ dev STRING ] [ pref NUMBER ]
    ACTION := [ table TABLE_ID ] [ nat ADDRESS ]
              [ prohibit | reject | unreachable ]
              [ realms [SRCREALM/]DSTREALM ]
    TABLE_ID := [ local | main | default | NUMBER ]
    

    See the IP Command Reference.


    ip route

    Kernel Config: IP: advanced router, IP: policy routing, IP: equal cost multipath, IP: large routing tables, IP: fast network address translation

    From file iproute2/ip/iproute.c
    Usage: ip route { list | flush } SELECTOR
           ip route get ADDRESS [ from ADDRESS iif STRING ]
                                [ oif STRING ]  [ tos TOS ]
           ip route { add | del | replace | change | append | replace | monitor } ROUTE
    SELECTOR := [ root PREFIX ] [ match PREFIX ] [ exact PREFIX ]
                [ table TABLE_ID ] [ proto RTPROTO ]
                [ type TYPE ] [ scope SCOPE ]
    ROUTE := NODE_SPEC [ INFO_SPEC ]
    NODE_SPEC := [ TYPE ] PREFIX [ tos TOS ]
                 [ table TABLE_ID ] [ proto RTPROTO ]
                 [ scope SCOPE ] [ metric METRIC ]
    INFO_SPEC := NH OPTIONS FLAGS [ nexthop NH ]...
    NH := [ via ADDRESS ] [ dev STRING ] [ weight NUMBER ] NHFLAGS
    OPTIONS := FLAGS [ mtu NUMBER ] [ advmss NUMBER ]
               [ rtt NUMBER ] [ rttvar NUMBER ]
               [ window NUMBER] [ cwnd NUMBER ] [ ssthresh REALM ]
               [ realms REALM ]
    TYPE := [ unicast | local | broadcast | multicast | throw |
              unreachable | prohibit | blackhole | nat ]
    TABLE_ID := [ local | main | default | all | NUMBER ]
    SCOPE := [ host | link | global | NUMBER ]
    FLAGS := [ equalize ]
    NHFLAGS := [ onlink | pervasive ]
    RTPROTO := [ kernel | boot | static | NUMBER ]
    

    See the IP Command Reference.


    ip tunnel

    Kernel Config: IP: tunneling, IP: GRE tunnels over IP, IP: broadcast GRE over IP

    From file iproute2/ip/iptunnel.c
    Usage: ip tunnel { add | change | del | show } [ NAME ]
              [ mode { ipip | gre | sit } ] [ remote ADDR ] [ local ADDR ]
              [ [i|o]seq ] [ [i|o]key KEY ] [ [i|o]csum ]
              [ ttl TTL ] [ tos TOS ] [ [no]pmtudisc ] [ dev PHYS_DEV ]
    
    Where: NAME := STRING
           ADDR := { IP_ADDRESS | any }
           TOS  := { NUMBER | inherit }
           TTL  := { 1..255 | inherit }
           KEY  := { DOTTED_QUAD | NUMBER }
    

    See the IP Command Reference, also see Tunnels over IP in Linux-2.2 and my (old) tunnel notes.


    ip maddress

    From file iproute2/ip/ipmaddr.c
    Usage: ip maddr [ add | del ] MULTIADDR dev STRING
           ip maddr show [ dev STRING ]
    

    See the IP Command Reference.


    ip mroute

    From file iproute2/ip/ipmroute.c
    Usage: ip mroute show [ PREFIX ] [ from PREFIX ] [ iif DEVICE ]
    

    See the IP Command Reference.


    ip monitor

    Kernel Config: IP: verbose route monitoring

    From file iproute2/ip/ipmonitor.c
    Usage: ip monitor [ all | LISTofOBJECTS ]
    

    See the IP Command Reference.


    rtmon command

    Kernel Config: IP: verbose route monitoring

    From file iproute2/ip/rtmon.c
    Usage: rtmon file FILE [ all | LISTofOBJECTS]
    LISTofOBJECTS := [ link ] [ address ] [ route ]
    


    tc command

    Kernel Config: QoS support, QoS and/or fair queueing

    From file iproute2/tc/tc.c
    Usage: tc [ OPTIONS ] OBJECT { COMMAND | help }
    where  OBJECT := { qdisc | class | filter }
           OPTIONS := { -s[tatistics] | -d[etails] | -r[aw] }
    

    Where it's expecting a number for BPS; it understands some suffixes: kbps (*1024), mbps (*1024*1024), kbit (*1024/8), and mbit (*1024*1024/8). If I'm reading the code correctly; "BPS" means Bytes Per Second; if you give a number without a suffix it assumes you want BITS per second (it divides the number you give it by 8). It also understands bps as a suffix.

    Where it's expecting a time value, it seems it understands suffixes of s, sec, and secs for seconds, ms, msec, and msecs for milliseconds, and us, usec, and usecs for microseconds.

    Where it wants a size parameter, it assumes non-suffixed numbers to be specified in bytes. It also understands suffixes of k and kb to mean kilobytes (*1024), m and mb to mean megabytes (*1024*1024), kbit to mean kilobit (*1024/8), and mbit to mean megabits (*1024*1024/8).

    1Mbit == 128Kbps


    qdisc

    Kernel Config: QoS support, QoS and/or fair queueing

    From file iproute2/tc/tc_qdisc.c
    Usage: tc qdisc [ add | del | replace | change | get ] dev STRING
           [ handle QHANDLE ] [ root | ingress | parent CLASSID ]
           [ estimator INTERVAL TIME_CONSTANT ]
           [ [ QDISC_KIND ] [ help | OPTIONS ] ]
    
           tc qdisc show [ dev STRING ] [ingress]
    Where:
    QDISC_KIND := { [p|b]fifo | tbf | prio | cbq | red | etc. }
    OPTIONS := ... try tc qdisc add <desired QDISC_KIND> help
    

    From file linux/net/sched/sch_api.c
       Short review.
       -------------
    
       This file consists of two interrelated parts:
    
       1. queueing disciplines manager frontend.
       2. traffic classes manager frontend.
    
       Generally, queueing discipline ("qdisc") is a black box,
       which is able to enqueue packets and to dequeue them (when
       device is ready to send something) in order and at times
       determined by algorithm hidden in it.
    
       qdisc's are divided to two categories:
       - "queues", which have no internal structure visible from outside.
       - "schedulers", which split all the packets to "traffic classes",
         using "packet classifiers" (look at cls_api.c)
    
       In turn, classes may have child qdiscs (as rule, queues)
       attached to them etc. etc. etc.
    
       The goal of the routines in this file is to translate
       information supplied by user in the form of handles
       to more intelligible for kernel form, to make some sanity
       checks and part of work, which is common to all qdiscs
       and to provide rtnetlink notifications.
    
       All real intelligent work is done inside qdisc modules.
    
    
    
       Every discipline has two major routines: enqueue and dequeue.
    
       ---dequeue
    
       dequeue usually returns a skb to send. It is allowed to return NULL,
       but it does not mean that queue is empty, it just means that
       discipline does not want to send anything this time.
       Queue is really empty if q->q.qlen == 0.
       For complicated disciplines with multiple queues q->q is not
       real packet queue, but however q->q.qlen must be valid.
    
       ---enqueue
    
       enqueue returns number of enqueued packets i.e. this number is 1,
       if packet was enqueued successfully and <1 if something (not
       necessary THIS packet) was dropped.
    
       Auxiliary routines:
    
       ---requeue
    
       requeues once dequeued packet. It is used for non-standard or
       just buggy devices, which can defer output even if dev->tbusy=0.
    
       ---reset
    
       returns qdisc to initial state: purge all buffers, clear all
       timers, counters (except for statistics) etc.
    
       ---init
    
       initializes newly created qdisc.
    
       ---destroy
    
       destroys resources allocated by init and during lifetime of qdisc.
    
       ---change
    
       changes qdisc parameters.
    
    


    qdisc [p|b]fifo

    From file iproute2/tc/q_fifo.c
    Usage: ... [p|b]fifo [ limit NUMBER ]
    

    From file linux/net/sched/sch_fifo.c
    /* 1 band FIFO pseudo-"scheduler" */
    


    qdisc tbf

    Kernel Config: QoS support, QoS and/or fair queueing, TBF queue, Rate estimator

    From file iproute2/tc/
    Usage: ... tbf limit BYTES burst BYTES[/BYTES] rate KBPS [ mtu BYTES[/BYTES] ]
                   [ peakrate KBPS ] [ latency TIME ]
    

    From file linux/net/sched/sch_tbf.c
    /*	Simple Token Bucket Filter.
    	=======================================
    
    	SOURCE.
    	-------
    
    	None.
    
    	Description.
    	------------
    
    	A data flow obeys TBF with rate R and depth B, if for any
    	time interval t_i...t_f the number of transmitted bits
    	does not exceed B + R*(t_f-t_i).
    
    	Packetized version of this definition:
    	The sequence of packets of sizes s_i served at moments t_i
    	obeys TBF, if for any i<=k:
    
    	s_i+....+s_k <= B + R*(t_k - t_i)
    
    	Algorithm.
    	----------
    	
    	Let N(t_i) be B/R initially and N(t) grow continuously with time as:
    
    	N(t+delta) = min{B/R, N(t) + delta}
    
    	If the first packet in queue has length S, it may be
    	transmited only at the time t_* when S/R <= N(t_*),
    	and in this case N(t) jumps:
    
    	N(t_* + 0) = N(t_* - 0) - S/R.
    
    
    
    	Actually, QoS requires two TBF to be applied to a data stream.
    	One of them controls steady state burst size, another
    	one with rate P (peak rate) and depth M (equal to link MTU)
    	limits bursts at a smaller time scale.
    
    	It is easy to see that P>R, and B>M. If P is infinity, this double
    	TBF is equivalent to a single one.
    
    	When TBF works in reshaping mode, latency is estimated as:
    
    	lat = max ((L-B)/R, (L-M)/P)
    
    
    	NOTES.
    	------
    
    	If TBF throttles, it starts a watchdog timer, which will wake it up
    	when it is ready to transmit.
    	Note that the minimal timer resolution is 1/HZ.
    	If no new packets arrive during this period,
    	or if the device is not awaken by EOI for some previous packet,
    	TBF can stop its activity for 1/HZ.
    
    
    	This means, that with depth B, the maximal rate is
    
    	R_crit = B*HZ
    
    	F.e. for 10Mbit ethernet and HZ=100 the minimal allowed B is ~10Kbytes.
    
    	Note that the peak rate TBF is much more tough: with MTU 1500
    	P_crit = 150Kbytes/sec. So, if you need greater peak
    	rates, use alpha with HZ=1000 :-)
    */
    


    qdisc cbq

    Kernel Config: QoS support, QoS and/or fair queueing, CBQ packet scheduler, Rate estimator

    From file iproute2/tc/q_cbq.c
    Usage: ... cbq bandwidth BPS avpkt BYTES [ mpu BYTES ]
                   [ cell BYTES ] [ ewma LOG ]
    

    From file linux/net/sched/sch_cbq.c
    /*	Class-Based Queueing (CBQ) algorithm.
    	=======================================
    
    	Sources: [1] Sally Floyd and Van Jacobson, "Link-sharing and Resource
    	         Management Models for Packet Networks",
    		 IEEE/ACM Transactions on Networking, Vol.3, No.4, 1995
    
    	         [2] Sally Floyd, "Notes on CBQ and Guaranted Service", 1995
    
    	         [3] Sally Floyd, "Notes on Class-Based Queueing: Setting
    		 Parameters", 1996
    
    		 [4] Sally Floyd and Michael Speer, "Experimental Results
    		 for Class-Based Queueing", 1998, not published.
    
    	-----------------------------------------------------------------------
    
    	Algorithm skeleton was taken from NS simulator cbq.cc.
    	If someone wants to check this code against the LBL version,
    	he should take into account that ONLY the skeleton was borrowed,
    	the implementation is different. Particularly:
    
    	--- The WRR algorithm is different. Our version looks more
            reasonable (I hope) and works when quanta are allowed to be
            less than MTU, which is always the case when real time classes
            have small rates. Note, that the statement of [3] is
            incomplete, delay may actually be estimated even if class
            per-round allotment is less than MTU. Namely, if per-round
            allotment is W*r_i, and r_1+...+r_k = r < 1
    
    	delay_i <= ([MTU/(W*r_i)]*W*r + W*r + k*MTU)/B
    
    	In the worst case we have IntServ estimate with D = W*r+k*MTU
    	and C = MTU*r. The proof (if correct at all) is trivial.
    
    
    	--- It seems that cbq-2.0 is not very accurate. At least, I cannot
    	interpret some places, which look like wrong translations
    	from NS. Anyone is advised to find these differences
    	and explain to me, why I am wrong 8).
    
    	--- Linux has no EOI event, so that we cannot estimate true class
    	idle time. Workaround is to consider the next dequeue event
    	as sign that previous packet is finished. This is wrong because of
    	internal device queueing, but on a permanently loaded link it is true.
    	Moreover, combined with clock integrator, this scheme looks
    	very close to an ideal solution.  */
    
    
    tristate 'CBQ packet scheduler' CONFIG_NET_SCH_CBQ


    qdisc red

    Kernel Config: QoS support, QoS and/or fair queueing, RED queue

    From file iproute2/tc/q_red.c
    Usage: ... red limit BYTES min BYTES max BYTES avpkt BYTES burst PACKETS
                   probability PROBABILITY bandwidth KBPS
    

    From file linux/net/sched/sch_red.c
    /*	Random Early Detection (RED) algorithm.
    	=======================================
    
    	Source: Sally Floyd and Van Jacobson, "Random Early Detection Gateways
    	for Congestion Avoidance", 1993, IEEE/ACM Transactions on Networking.
    
    	This file codes a "divisionless" version of RED algorithm
    	as written down in Fig.17 of the paper.
    
    Short description.
    ------------------
    
    	When a new packet arrives we calculate the average queue length:
    
    	avg = (1-W)*avg + W*current_queue_len,
    
    	W is the filter time constant (choosen as 2^(-Wlog)), it controls
    	the inertia of the algorithm. To allow larger bursts, W should be
    	decreased.
    
    	if (avg > th_max) -> packet marked (dropped).
    	if (avg < th_min) -> packet passes.
    	if (th_min < avg < th_max) we calculate probability:
    
    	Pb = max_P * (avg - th_min)/(th_max-th_min)
    
    	and mark (drop) packet with this probability.
    	Pb changes from 0 (at avg==th_min) to max_P (avg==th_max).
    	max_P should be small (not 1), usually 0.01..0.02 is good value.
    
    	max_P is chosen as a number, so that max_P/(th_max-th_min)
    	is a negative power of two in order arithmetics to contain
    	only shifts.
    
    
    	Parameters, settable by user:
    	-----------------------------
    
    	limit		- bytes (must be > qth_max + burst)
    
    	Hard limit on queue length, should be chosen >qth_max
    	to allow packet bursts. This parameter does not
    	affect the algorithms behaviour and can be chosen
    	arbitrarily high (well, less than ram size)
    	Really, this limit will never be reached
    	if RED works correctly.
    
    	qth_min		- bytes (should be < qth_max/2)
    	qth_max		- bytes (should be at least 2*qth_min and less limit)
    	Wlog	       	- bits (<32) log(1/W).
    	Plog	       	- bits (<32)
    
    	Plog is related to max_P by formula:
    
    	max_P = (qth_max-qth_min)/2^Plog;
    
    	F.e. if qth_max=128K and qth_min=32K, then Plog=22
    	corresponds to max_P=0.02
    
    	Scell_log
    	Stab
    
    	Lookup table for log((1-W)^(t/t_ave).
    
    
    NOTES:
    
    Upper bound on W.
    -----------------
    
    	If you want to allow bursts of L packets of size S,
    	you should choose W:
    
    	L + 1 - th_min/S < (1-(1-W)^L)/W
    
    	th_min/S = 32         th_min/S = 4
    			                       
    	log(W)	L
    	-1	33
    	-2	35
    	-3	39
    	-4	46
    	-5	57
    	-6	75
    	-7	101
    	-8	135
    	-9	190
    	etc.
     */
    
    


    qdisc sfq

    From file iproute2/tc/q_sfq.c
    Usage: ... sfq [ perturb SECS ] [ quantum BYTES ]
    

    Kernel Config: QoS support, QoS and/or fair queueing, SFQ queue

    From file linux/net/sched/
    /*	Stochastic Fairness Queuing algorithm.
    	=======================================
    
    	Source:
    	Paul E. McKenney "Stochastic Fairness Queuing",
    	IEEE INFOCOMM'90 Proceedings, San Francisco, 1990.
    
    	Paul E. McKenney "Stochastic Fairness Queuing",
    	"Interworking: Research and Experience", v.2, 1991, p.113-131.
    
    
    	See also:
    	M. Shreedhar and George Varghese "Efficient Fair
    	Queuing using Deficit Round Robin", Proc. SIGCOMM 95.
    
    
    	This is not the thing that is usually called (W)FQ nowadays. 
    	It does not use any timestamp mechanism, but instead
    	processes queues in round-robin order.
    
    	ADVANTAGE:
    
    	- It is very cheap. Both CPU and memory requirements are minimal.
    
    	DRAWBACKS:
    
    	- "Stochastic" -> It is not 100% fair. 
    	When hash collisions occur, several flows are considered as one.
    
    	- "Round-robin" -> It introduces larger delays than virtual clock
    	based schemes, and should not be used for isolating interactive
    	traffic	from non-interactive. It means, that this scheduler
    	should be used as leaf of CBQ or P3, which put interactive traffic
    	to higher priority band.
    
    	We still need true WFQ for top level CSZ, but using WFQ
    	for the best effort traffic is absolutely pointless:
    	SFQ is superior for this purpose.
    
    	IMPLEMENTATION:
    	This implementation limits maximal queue length to 128;
    	maximal mtu to 2^15-1; number of hash buckets to 1024.
    	The only goal of this restrictions was that all data
    	fit into one 4K page :-). Struct sfq_sched_data is
    	organized in anti-cache manner: all the data for a bucket
    	are scattered over different locations. This is not good,
    	but it allowed me to put it into 4K.
    
    	It is easy to increase these values, but not in flight.  */
    
    


    qdisc prio

    Kernel Config: QoS support, QoS and/or fair queueing, The simplest PRIO pseudo scheduler

    From file iproute2/tc/q_prio.c
    Usage: ... prio bands NUMBER priomap P1 P2...
    

    From file linux/net/sched/sch_prio.c
     * net/sched/sch_prio.c	Simple 3-band priority "scheduler".
    


    qdisc csz

    Kernel Config: QoS support, QoS and/or fair queueing, CSZ packet scheduler

    From file iproute2/tc/q_csz.c
    NOT IMPLEMENTED
    

    From file linux/net/sched/sch_csz.c
    /*	Clark-Shenker-Zhang algorithm.
    	=======================================
    
    	SOURCE.
    
    	David D. Clark, Scott Shenker and Lixia Zhang
    	"Supporting Real-Time Applications in an Integrated Services Packet
    	Network: Architecture and Mechanism".
    
    	CBQ presents a flexible universal algorithm for packet scheduling,
    	but it has pretty poor delay characteristics.
    	Round-robin scheduling and link-sharing goals
    	apparently contradict minimization of network delay and jitter.
    	Moreover, correct handling of predictive flows seems to be
    	impossible in CBQ.
    
    	CSZ presents a more precise but less flexible and less efficient
    	approach. As I understand it, the main idea is to create
    	WFQ flows for each guaranteed service and to allocate
    	the rest of bandwith to dummy flow-0. Flow-0 comprises
    	the predictive services and the best effort traffic;
    	it is handled by a priority scheduler with the highest
    	priority band allocated	for predictive services, and the rest ---
    	to the best effort packets.
    
    	Note that in CSZ flows are NOT limited to their bandwidth.  It
    	is supposed that the flow passed admission control at the edge
    	of the QoS network and it doesn't need further shaping. Any
    	attempt to improve the flow or to shape it to a token bucket
    	at intermediate hops will introduce undesired delays and raise
    	jitter.
    
    	At the moment CSZ is the only scheduler that provides
    	true guaranteed service. Another schemes (including CBQ)
    	do not provide guaranteed delay and randomize jitter.
    	There is a proof (Sally Floyd), that delay
    	can be estimated by a IntServ compliant formula.
    	This result is true formally, but it is wrong in principle.
    	It takes into account only round-robin delays,
    	ignoring delays introduced by link sharing i.e. overlimiting.
    	Note that temporary overlimits are inevitable because
    	real links are not ideal, and the real algorithm must take this
    	into account.
    
            ALGORITHM.
    
    	--- Notations.
    
    	$B$ is link bandwidth (bits/sec).
    
    	$I$ is set of all flows, including flow $0$.
    	Every flow $a \in I$ has associated bandwidth slice $r_a < 1$ and
    	$\sum_{a \in I} r_a = 1$.
    
    	--- Flow model.
    
    	Let $m_a$ is the number of backlogged bits in flow $a$.
    	The flow is {\em active}, if $m_a > 0$.
    	This number is a discontinuous function of time;
    	when a packet $i$ arrives:
    	\[
    	m_a(t_i+0) - m_a(t_i-0) = L^i,
    	\]
    	where $L^i$ is the length of the arrived packet.
    	The flow queue is drained continuously until $m_a == 0$:
    	\[
    	{d m_a \over dt} = - { B r_a \over \sum_{b \in A} r_b}.
    	\]
    	I.e. flow rates are their allocated rates proportionally
    	scaled to take all available link bandwidth. Apparently,
    	it is not the only possible policy. F.e. CBQ classes
    	without borrowing would be modelled by:
    	\[
    	{d m_a \over dt} = - B r_a .
    	\]
    	More complicated hierarchical bandwidth allocation
    	policies are possible, but unfortunately, the basic
    	flow equations have a simple solution only for proportional
    	scaling.
    
    	--- Departure times.
    
    	We calculate the time until the last bit of packet is sent:
    	\[
    	E_a^i(t) = { m_a(t_i) - \delta_a(t) \over r_a },
    	\]
    	where $\delta_a(t)$ is number of bits drained since $t_i$.
    	We have to evaluate $E_a^i$ for all queued packets,
    	then find the packet with minimal $E_a^i$ and send it.
    
    	This sounds good, but direct implementation of the algorithm
    	is absolutely infeasible. Luckily, if flow rates
    	are scaled proportionally, the equations have a simple solution.
    	
    	The differential equation for $E_a^i$ is
    	\[
    	{d E_a^i (t) \over dt } = - { d \delta_a(t) \over dt} { 1 \over r_a} =
    	{ B \over \sum_{b \in A} r_b}
    	\]
    	with initial condition
    	\[
    	E_a^i (t_i) = { m_a(t_i) \over r_a } .
    	\]
    
    	Let's introduce an auxiliary function $R(t)$:
    
    	--- Round number.
    
    	Consider the following model: we rotate over active flows,
    	sending $r_a B$ bits from every flow, so that we send
    	$B \sum_{a \in A} r_a$ bits per round, that takes
    	$\sum_{a \in A} r_a$ seconds.
    	
    	Hence, $R(t)$ (round number) is a monotonically increasing
    	linear function	of time when $A$ is not changed
    	\[
    	{ d R(t) \over dt } = { 1 \over \sum_{a \in A} r_a }
    	\]
    	and it is continuous when $A$ changes.
    
    	The central observation is that the quantity
    	$F_a^i = R(t) + E_a^i(t)/B$ does not depend on time at all!
    	$R(t)$ does not depend on flow, so that $F_a^i$ can be
    	calculated only once on packet arrival, and we need not
    	recalculate $E$ numbers and resorting queues.
    	The number $F_a^i$ is called finish number of the packet.
    	It is just the value of $R(t)$ when the last bit of packet
    	is sent out.
    
    	Maximal finish number on flow is called finish number of flow
    	and minimal one is "start number of flow".
    	Apparently, flow is active if and only if $F_a \leq R$.
    
    	When a packet of length $L_i$ bit arrives to flow $a$ at time $t_i$,
    	we calculate $F_a^i$ as:
    
    	If flow was inactive ($F_a < R$):
    	$F_a^i = R(t) + {L_i \over B r_a}$
    	otherwise
    	$F_a^i = F_a + {L_i \over B r_a}$
    
    	These equations complete the algorithm specification.
    
    	It looks pretty hairy, but there is a simple
    	procedure for solving these equations.
    	See procedure csz_update(), that is a generalization of
    	the algorithm from S. Keshav's thesis Chapter 3
    	"Efficient Implementation of Fair Queeing".
    
    	NOTES.
    
    	* We implement only the simplest variant of CSZ,
    	when flow-0 is a explicit 4band priority fifo.
    	This is bad, but we need a "peek" operation in addition
    	to "dequeue" to implement complete CSZ.
    	I do not want to do that, unless it is absolutely
    	necessary.
    	
    	* A primitive support for token bucket filtering
    	presents itself too. It directly contradicts CSZ, but
    	even though the Internet is on the globe ... :-)
    	"the edges of the network" really exist.
    	
    	BUGS.
    
    	* Fixed point arithmetic is overcomplicated, suboptimal and even
    	wrong. Check it later.  */
    
    
    /* This number is arbitrary */
    
    


    class

    Kernel Config: QoS support, QoS and/or fair queueing, Packet classifier API

    From file iproute2/tc/tc_class.c
    Usage: tc class [ add | del | change | get ] dev STRING
           [ classid CLASSID ] [ root | parent CLASSID ]
           [ [ QDISC_KIND ] [ help | OPTIONS ] ]
    
           tc class show [ dev STRING ] [ root | parent CLASSID ]
    Where:
    QDISC_KIND := { prio | cbq | etc. }
    OPTIONS := ... try tc class add <desired QDISC_KIND> help
    
    


    class cbq

    From file iproute2/tc/q_cbq.c
    Usage: ... cbq bandwidth BPS rate BPS maxburst PKTS [ avpkt BYTES ]
                   [ minburst PKTS ] [ bounded ] [ isolated ]
                   [ allot BYTES ] [ mpu BYTES ] [ weight RATE ]
                   [ prio NUMBER ] [ cell BYTES ] [ ewma LOG ]
                   [ estimator INTERVAL TIME_CONSTANT ]
                   [ split CLASSID ] [ defmap MASK/CHANGE ]
    


    estimator

    Kernel Config: QoS support, QoS and/or fair queueing, Rate estimator

    From file iproute2/tc/m_estimator.c
    Usage: ... estimator INTERVAL TIME-CONST
      INTERVAL is interval between measurements
      TIME-CONST is averaging time constant
    Example: ... est 1sec 8sec\
    

    From file linux/net/sched/estimator.c
       This code is NOT intended to be used for statistics collection,
       its purpose is to provide a base for statistical multiplexing
       for controlled load service.
       If you need only statistics, run a user level daemon which
       periodically reads byte counters.
    
       Unfortunately, rate estimation is not a very easy task.
       F.e. I did not find a simple way to estimate the current peak rate
       and even failed to formulate the problem 8)8)
    
       So I preferred not to built an estimator into the scheduler,
       but run this task separately.
       Ideally, it should be kernel thread(s), but for now it runs
       from timers, which puts apparent top bounds on the number of rated
       flows, has minimal overhead on small, but is enough
       to handle controlled load service, sets of aggregates.
    
       We measure rate over A=(1<<interval) seconds and evaluate EWMA:
    
       avrate = avrate*(1-W) + rate*W
    
       where W is chosen as negative power of 2: W = 2^(-ewma_log)
    
       The resulting time constant is:
    
       T = A/(-ln(1-W))
    
    
       NOTES.
    
       * The stored value for avbps is scaled by 2^5, so that maximal
         rate is ~1Gbit, avpps is scaled by 2^10.
    
       * Minimal interval is HZ/4=250msec (it is the greatest common divisor
         for HZ=100 and HZ=1024 8)), maximal interval
         is (HZ/4)*2^EST_MAX_INTERVAL = 8sec. Shorter intervals
         are too expensive, longer ones can be implemented
         at user level painlessly.
    


    filter

    Kernel Config: QoS support, QoS and/or fair queueing, Packet classifier API

    From file iproute2/tc/tc_filter.c
    Usage: tc filter [ add | del | change | get ] dev STRING
           [ pref PRIO ] [ protocol PROTO ]
           [ estimator INTERVAL TIME_CONSTANT ]
           [ root | classid CLASSID ] [ handle FILTERID ]
           [ [ FILTER_TYPE ] [ help | OPTIONS ] ]
    
           tc filter show [ dev STRING ] [ root | parent CLASSID ]
    Where:
    FILTER_TYPE := { rsvp | u32 | fw | route | etc. }
    FILTERID := ... format depends on classifier, see there
    OPTIONS := ... try tc filter add <desired FILTER_KIND> help
    


    filter rsvp

    From file iproute2/tc/f_rsvp.c
    Usage: ... rsvp ipproto PROTOCOL session DST[/PORT | GPI ]
                    [ sender SRC[/PORT | GPI ]
                    [ classid CLASSID ] [ police POLICE_SPEC ]
                    [ tunnelid ID ] [ tunnel ID skip NUMBER ]
    Where: GPI := { flowlabel NUMBER | spi/ah SPI | spi/esp SPI |
                    u{8|16|32} NUMBER mask MASK at OFFSET}
           POLICE_SPEC := ... look at TBF
           FILTERID := X:Y
    

    From file linux/net/sched/cls_rsvp.h
    /*
       Comparing to general packet classification problem,
       RSVP needs only sevaral relatively simple rules:
    
       * (dst, protocol) are always specified,
         so that we are able to hash them.
       * src may be exact, or may be wildcard, so that
         we can keep a hash table plus one wildcard entry.
       * source port (or flow label) is important only if src is given.
    
       IMPLEMENTATION.
    
       We use a two level hash table: The top level is keyed by
       destination address and protocol ID, every bucket contains a list
       of "rsvp sessions", identified by destination address, protocol and
       DPI(="Destination Port ID"): triple (key, mask, offset).
    
       Every bucket has a smaller hash table keyed by source address
       (cf. RSVP flowspec) and one wildcard entry for wildcard reservations.
       Every bucket is again a list of "RSVP flows", selected by
       source address and SPI(="Source Port ID" here rather than
       "security parameter index"): triple (key, mask, offset).
    
    
       NOTE 1. All the packets with IPv6 extension headers (but AH and ESP)
       and all fragmented packets go to the best-effort traffic class.
    
    
       NOTE 2. Two "port id"'s seems to be redundant, rfc2207 requires
       only one "Generalized Port Identifier". So that for classic
       ah, esp (and udp,tcp) both *pi should coincide or one of them
       should be wildcard.
    
       At first sight, this redundancy is just a waste of CPU
       resources. But DPI and SPI add the possibility to assign different
       priorities to GPIs. Look also at note 4 about tunnels below.
    
    
       NOTE 3. One complication is the case of tunneled packets.
       We implement it as following: if the first lookup
       matches a special session with "tunnelhdr" value not zero,
       flowid doesn't contain the true flow ID, but the tunnel ID (1...255).
       In this case, we pull tunnelhdr bytes and restart lookup
       with tunnel ID added to the list of keys. Simple and stupid 8)8)
       It's enough for PIMREG and IPIP.
    
    
       NOTE 4. Two GPIs make it possible to parse even GRE packets.
       F.e. DPI can select ETH_P_IP (and necessary flags to make
       tunnelhdr correct) in GRE protocol field and SPI matches
       GRE key. Is it not nice? 8)8)
    
    
       Well, as result, despite its simplicity, we get a pretty
       powerful classification engine.  */
    
    


    filter u32

    From file iproute2/tc/f_u32.c
    Usage: ... u32 [ match SELECTOR ... ] [ link HTID ] [ classid CLASSID ]
                   [ police POLICE_SPEC ] [ offset OFFSET_SPEC ]
                   [ ht HTID ] [ hashkey HASHKEY_SPEC ]
                   [ sample SAMPLE ]
    or         u32 divisor DIVISOR
    
    Where: SELECTOR := SAMPLE SAMPLE ...
           SAMPLE := { ip | ip6 | udp | tcp | icmp | u{32|16|8} } SAMPLE_ARGS
           FILTERID := X:Y:Z
    

    From file linux/net/sched/cls_u32.c
     *	The filters are packed to hash tables of key nodes
     *	with a set of 32bit key/mask pairs at every node.
     *	Nodes reference next level hash tables etc.
     *
     *	This scheme is the best universal classifier I managed to
     *	invent; it is not super-fast, but it is not slow (provided you
     *	program it correctly), and general enough.  And its relative
     *	speed grows as the number of rules becomes larger.
     *
     *	It seems that it represents the best middle point between
     *	speed and manageability both by human and by machine.
     *
     *	It is especially useful for link sharing combined with QoS;
     *	pure RSVP doesn't need such a general approach and can use
     *	much simpler (and faster) schemes, sort of cls_rsvp.c.
    

    Faster than ipchains, according to Werner Almesburger (as reported in Rusty's Diary, May 22 1999).


    filter fw

    From file iproute2/tc/f_fw.c
    Usage: ... fw [ classid CLASSID ] [ police POLICE_SPEC ]
           POLICE_SPEC := ... look at TBF
           CLASSID := X:Y
    

    From file linux/net/sched/cls_fw.c
     * net/sched/cls_fw.c	Classifier mapping ipchains' fwmark to traffic class.
    


    filter route

    From file iproute2/tc/f_route.c
    Usage: ... route [ from REALM | fromif TAG ] [ to REALM ]
                    [ flowid CLASSID ] [ police POLICE_SPEC ]
           POLICE_SPEC := ... look at TBF
           CLASSID := X:Y\n
    

    From file linux/net/sched/cls_route.c
       1. For now we assume that route tags < 256.
          It allows to use direct table lookups, instead of hash tables.
       2. For now we assume that "from TAG" and "fromdev DEV" statements
          are mutually  exclusive.
       3. "to TAG from ANY" has higher priority, than "to ANY from XXX"
    


    police

    From file iproute2/tc/m_police.c
    Usage: ... police rate BPS burst BYTES[/BYTES] [ mtu BYTES[/BYTES] ]
                    [ peakrate BPS ] [ avrate BPS ]
                    [ ACTION ]
    Where: ACTION := reclassify | drop | continue 
    


    Copyright 1999 Mark Lamb. Redistribution via any media permitted as long as this notice remains intact. http://snafu.freedom.org/