Intel® Ethernet Controller 700 Series: Hash and Flow Director Filters

What's Inside the Box?

Before we look at the hash function, also known as Receive Side Scaling (RSS), and flow director filters, let's locate them within the Intel® Ethernet Controller 700-Series based Network Adapters:

Figure 1. Intel® Ethernet Controller 700-Series flexible pipeline.

Figure 1 depicts a simplified representation of the Intel Ethernet Controller 700-Series flexible pipeline, showing the main packet processing blocks.

Hash and flow director filters are located at the end of the pipeline, close to the host interface, and define destination queues for packets.

There is a lot of pre-processing that happens before a packet enters the hash/flow director filters:

  1. The Parser block parses all headers of a packet, one by one, extracting different fields of the packet to its "field vector" and detecting the packet's type (PTYPE). The packet type is then reported in the corresponding field of the packet's RX descriptor.
  2. The field vector is attached to the packet as a metadata buffer and follows the packet on its voyage from one functional block to another.
  3. The Switch block uses some fields of the field vector, for example, the MAC DA, VLAN, VXLAN VNI, or MPLS label, to decide which physical function (PF) or virtual function (VF) the packet must be delivered to. If mirroring is required, the packet can be duplicated and delivered to multiple destinations.
  4. Finally, the packet arrives at the Filters block, first at the Flow Director and then at the Hash/RSS filter, which set a destination queue within the PF/VF that the Switch block assigned the packet to in the previous step. The filters use fields from the field vector to select the packet's destination queue. If both filters are disabled, the packet goes to queue 0.
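
The order in which the filters are consulted in step 4 can be sketched in C as follows. This is only an illustration of the decision order described above; the structure and all names in it are hypothetical and do not correspond to any hardware register or driver structure.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-packet filter results, for illustration only. */
struct filter_result {
    bool     fd_enabled;    /* Flow Director is enabled for this packet */
    bool     fd_match;      /* a Flow Director rule matched */
    bool     fd_sets_queue; /* the matched rule assigns a queue itself */
    uint16_t fd_queue;      /* queue chosen by the Flow Director rule */
    bool     rss_enabled;   /* hash/RSS filter is enabled for this packet */
    uint16_t rss_queue;     /* queue chosen through the RSS indirection table */
};

/* Flow Director is consulted first, then RSS; if neither decides, use queue 0.
 * The returned index is relative to the PF/VF selected by the Switch block. */
uint16_t select_rx_queue(const struct filter_result *r)
{
    if (r->fd_enabled && r->fd_match && r->fd_sets_queue)
        return r->fd_queue;
    if (r->rss_enabled)
        return r->rss_queue;
    return 0;
}
```

A matched Flow Director rule that does not assign a queue falls through to the hash filter in this sketch, and a packet that neither filter handles lands in queue 0.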

What Kind of Packet Type is it?

There is another important characteristic of a packet defined at the packet parsing stage - Packet Classifier type or PCTYPE.

What is the difference between PCTYPE and PTYPE? PCTYPE defines the packet's generic nature, for example, IPv4 TCP and IPv6 TCP are different PCTYPEs.

Each PCTYPE has its own configuration of filters. For example, IPv4 TCP and IPv6 TCP can be distributed to different queue regions by hash filter if needed.

Intel Ethernet Controller 700-Series supports up to 64 different PCTYPEs with only a few defined by default:

PCTYPE  Description
26      GENEVE OAM
27      VXLAN-GPE OAM
31      Non-Fragmented IPv4, UDP
33      Non-Fragmented IPv4, TCP
34      Non-Fragmented IPv4, SCTP
35      Non-Fragmented IPv4, Other IP protocols
36      Fragmented IPv4
41      Non-Fragmented IPv6, UDP
43      Non-Fragmented IPv6, TCP
44      Non-Fragmented IPv6, SCTP
45      Non-Fragmented IPv6, Other IP protocols
46      Fragmented IPv6
63      Unknown Ethertype, L2 packet
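
For software that needs to refer to these defaults, the table can be captured as a small lookup. The sketch below is a minimal illustration; the enum and function names are made up for this example and are not the identifiers used by any driver.

```c
#include <stddef.h>
#include <stdint.h>

/* Default PCTYPE numbers from the table above. Names are illustrative only. */
enum pctype {
    PCTYPE_GENEVE_OAM      = 26,
    PCTYPE_VXLAN_GPE_OAM   = 27,
    PCTYPE_NONF_IPV4_UDP   = 31,
    PCTYPE_NONF_IPV4_TCP   = 33,
    PCTYPE_NONF_IPV4_SCTP  = 34,
    PCTYPE_NONF_IPV4_OTHER = 35,
    PCTYPE_FRAG_IPV4       = 36,
    PCTYPE_NONF_IPV6_UDP   = 41,
    PCTYPE_NONF_IPV6_TCP   = 43,
    PCTYPE_NONF_IPV6_SCTP  = 44,
    PCTYPE_NONF_IPV6_OTHER = 45,
    PCTYPE_FRAG_IPV6       = 46,
    PCTYPE_L2_PAYLOAD      = 63
};

/* Return the description of a default PCTYPE, or NULL if it is not defined by default. */
const char *pctype_description(uint8_t pctype)
{
    switch (pctype) {
    case PCTYPE_GENEVE_OAM:      return "GENEVE OAM";
    case PCTYPE_VXLAN_GPE_OAM:   return "VXLAN-GPE OAM";
    case PCTYPE_NONF_IPV4_UDP:   return "Non-Fragmented IPv4, UDP";
    case PCTYPE_NONF_IPV4_TCP:   return "Non-Fragmented IPv4, TCP";
    case PCTYPE_NONF_IPV4_SCTP:  return "Non-Fragmented IPv4, SCTP";
    case PCTYPE_NONF_IPV4_OTHER: return "Non-Fragmented IPv4, Other IP protocols";
    case PCTYPE_FRAG_IPV4:       return "Fragmented IPv4";
    case PCTYPE_NONF_IPV6_UDP:   return "Non-Fragmented IPv6, UDP";
    case PCTYPE_NONF_IPV6_TCP:   return "Non-Fragmented IPv6, TCP";
    case PCTYPE_NONF_IPV6_SCTP:  return "Non-Fragmented IPv6, SCTP";
    case PCTYPE_NONF_IPV6_OTHER: return "Non-Fragmented IPv6, Other IP protocols";
    case PCTYPE_FRAG_IPV6:       return "Fragmented IPv6";
    case PCTYPE_L2_PAYLOAD:      return "Unknown Ethertype, L2 packet";
    default:                     return NULL;
    }
}
```

A NULL result only means the PCTYPE is not one of the defaults listed here; it may still be defined on a device whose parser has been re-programmed.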

PTYPE defines the sequence of headers, detected by the parser, for example, IPv4 TCP packet over IPv6 VXLAN.

The Intel Ethernet Controller 700-Series supports up to 192 different PTYPEs. Packet type is reported at the corresponding 8-bit field of the RX descriptor, so the software can use this field to start processing without parsing all of the packet's headers.

Note: Some protocols can be reported by the PTYPE even if there is no matching PCTYPE. For example, the packet {MAC, IPV4, ICMP, PAY4} is reported as PTYPE 28 even though no separate ICMP packet classifier type is defined. Similarly, ARP packets are reported as PTYPE 11: {MAC, ARP}.

New PCTYPEs and PTYPEs can be added by re-programming the parser at runtime using Dynamic Device Personalization (DDP), available for XL710, XXV710 and X710 Intel Ethernet Controllers (formerly known as Fortville).

Greenfield

What size is the "field" and how many "fields" does the field vector consist of?

Each field is 16 bits in size and there are 64 of these fields in the field vector. These 16-bit units are also known as "words". The parser and filters always work on word boundaries.

When a new packet arrives at the parser, the packet's field vector is empty (filled with zeros). The parser then starts analyzing the packet's headers and extracting specific fields from the headers to the field vector.

Let's take a simple UDP packet with one VLAN tag:

MAC Destination Address  00:01:02:03:04:05
MAC Source Address       10:20:30:40:50:60
VLAN tag                 TPID 0x8100, PCP 0, DEI 0, VID 3276 (0xCCC)
Source IPv4              1.1.1.1
Destination IPv4         2.2.2.2
IPv4 DSCP                0
IPv4 ECN                 0
IPv4 TTL                 64
IP Protocol              17 (UDP)
UDP Source Port          43690 (0xAAAA)
UDP Destination Port     48059 (0xBBBB)
Payload type             incremental
Payload initial value    1

Or in hexadecimal view:

packet size 64
b    0| 00 01 02 03 04 05 10 20 | 30 40 50 60 81 00 0C CC | ....... 0@P`....
b   16| 08 00 45 00 00 2A 00 00 | 40 00 40 11 34 BE 01 01 | ..E..*..@.@.4...
b   32| 01 01 02 02 02 02 AA AA | BB BB 00 16 62 1E 01 02 | ............b...
b   48| 03 04 05 06 07 08 09 0A | 0B 0C 0D 0E EA 15 AC 07 | ................

After the parser processes this packet and extracts all fields, the field vector will look like this:

Field Vector for UDP packet:
w    0| 00 01 02 03 04 05 10 20 | 30 40 50 60 10 02 00 00 | ....... 0@P`....
w    8| 0C CC 45 00 00 2A 00 00 | 40 00 40 11 34 BE 01 01 | ..E..*..@.@.4...
w   16| 01 01 00 00 00 00 00 00 | 00 00 00 00 00 00 00 00 | ................
w   24| 00 00 00 00 00 00 02 02 | 02 02 AA AA BB BB 00 16 | ................
w   32| 62 1E 00 00 00 00 00 00 | 00 00 00 00 00 00 00 00 | b...............
w   40| 00 00 00 00 00 00 00 00 | 00 00 00 00 00 00 00 00 | ................
w   48| 00 00 00 00 01 02 03 04 | 05 06 07 08 09 0A 0B 0C | ................
w   56| 0D 0E EA 15 00 00 00 00 | 00 00 00 00 00 00 00 00 | ................

The first 3 fields (words 0-2) hold the Ethernet DA, followed by 3 words (3-5) of the Ethernet SA.

The VLAN tag is extracted to word 8.

The IPv4 header occupies words 9-14, the IPv4 Source Address words 15-16, and the IPv4 Destination Address words 27-28.

The UDP Source Port is in word 29 and the Destination Port in word 30.
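
To make the word indexing concrete, the following sketch reads fields out of a field vector held as 128 raw bytes in the same order as the dump above (high byte of each word first). The function names are hypothetical and the byte-array representation is an assumption made for illustration; software does not normally see the field vector in this form.

```c
#include <stdint.h>

#define FV_WORDS 64 /* the field vector has 64 fields of 16 bits each */

/* Read field (word) idx from a field vector stored in dump order.
 * Returns 0 for an out-of-range index. */
uint16_t fv_word(const uint8_t fv[FV_WORDS * 2], unsigned idx)
{
    if (idx >= FV_WORDS)
        return 0;
    return (uint16_t)((fv[idx * 2] << 8) | fv[idx * 2 + 1]);
}

/* Combine two consecutive words into a 32-bit value, for example an IPv4 address. */
uint32_t fv_dword(const uint8_t fv[FV_WORDS * 2], unsigned idx)
{
    return ((uint32_t)fv_word(fv, idx) << 16) | fv_word(fv, idx + 1);
}
```

Applied to the dump above, word 8 yields 0x0CCC (VID 3276), words 15-16 yield 0x01010101 (1.1.1.1), words 27-28 yield 0x02020202 (2.2.2.2), and words 29 and 30 yield the UDP ports 0xAAAA and 0xBBBB.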

The eight words from 50 to 57 are special: they can be used to store up to 16 bytes of the payload of the last recognized layer. For example, when the parser hits an unknown Ethertype and does not know what to do with it, the L2 payload (the bytes right after the Ethertype) can be extracted here. If the parser stops at L3, for example because it hits an unknown IP protocol, the IP payload (the bytes right after the IP header) can be extracted here. For known IP protocols (UDP/TCP/SCTP), the bytes after the corresponding L4 header can be extracted. The configuration of the bytes to be extracted is flexible: these 8 words can be extracted from any 3 locations in the first 240 words of the payload. But there is a catch: the same 8 fields are also used for tunneled packets, for example, VXLAN or GTP, to store the outer IP destination address. So, if flexible payload extraction is enabled, it overwrites the outer IP destination address for tunnels.

By default, the Intel Ethernet Controller 700-Series out-of-box configuration has payload extraction disabled, so outer destination addresses for tunnels can be used. Data Plane Development Kit (DPDK) versions 2.2 to 17.11 force extraction of the first 8 words (16 bytes) for all layers, overwriting the outer destination address. This behavior changed in DPDK 18.02, where payload extraction is turned on only if needed.

For more details on the fields for different protocols see "Field Vector" chapter in the Intel® Ethernet Controller XL710 datasheet.

Direction of the Vector

Now let's look at how the field vector is used by the hash and flow director filters to decide which queue a packet should be directed to.

As we mentioned above, each PCTYPE has its own filter configuration space. The combination of fields that a filter uses is called its input set. For the default PCTYPEs, the input sets are defined as follows:

PCTYPE  Description                              Flow Director Input Set                              Hash Input Set
26      GENEVE OAM                               Source Outer UDP Port, VNI                           Source Outer UDP Port, VNI
27      VXLAN-GPE OAM                            Source Outer UDP Port, VNI                           Source Outer UDP Port, VNI
31      Non-Fragmented IPv4, UDP                 IP4-S, IP4-D, UDP-S, UDP-D                           IP4-S, IP4-D, UDP-S, UDP-D
33      Non-Fragmented IPv4, TCP                 IP4-S, IP4-D, TCP-S, TCP-D                           IP4-S, IP4-D, TCP-S, TCP-D
34      Non-Fragmented IPv4, SCTP                IP4-S, IP4-D, SCTP-S, SCTP-D, SCTP Verification Tag  IP4-S, IP4-D, SCTP-S, SCTP-D, SCTP Verification Tag
35      Non-Fragmented IPv4, Other IP protocols  IP4-S, IP4-D                                         IP4-S, IP4-D
36      Fragmented IPv4                          IP4-S, IP4-D                                         IP4-S, IP4-D
41      Non-Fragmented IPv6, UDP                 IP6-S, IP6-D, UDP-S, UDP-D                           IP6-S, IP6-D, UDP-S, UDP-D
43      Non-Fragmented IPv6, TCP                 IP6-S, IP6-D, TCP-S, TCP-D                           IP6-S, IP6-D, TCP-S, TCP-D
44      Non-Fragmented IPv6, SCTP                IP6-S, IP6-D, SCTP-S, SCTP-D, SCTP Verification Tag  IP6-S, IP6-D, SCTP-S, SCTP-D, SCTP Verification Tag
45      Non-Fragmented IPv6, Other IP protocols  IP6-S, IP6-D                                         IP6-S, IP6-D
46      Fragmented IPv6                          IP6-S, IP6-D                                         IP6-S, IP6-D
63      Unknown Ethertype, L2 packet             L2 Ethertype                                         L2 Ethertype

The internal input set is just a 64-bit register defining which fields of the field vector should be used by the filter.

In addition to input set registers, each PCTYPE has two registers which can be used to mask out some bits of the field.

For example, the IPv4 header has a Type of Service (TOS) field, which is 8 bits wide and located in the lower 8 bits of the first word of the IPv4 header. This word is extracted to field 9 of the field vector. If the application wants to use TOS for a filter (flow director or hash), it can include word 9 in the corresponding input set and then mask out the high 8 bits, so only the TOS field will be used by the filters.
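
The effect of such a mask on word 9 can be sketched as follows. The structure below is hypothetical: it assumes that a set bit in the mask means "zero this bit before the word reaches the filter", which may not match the polarity or layout of the actual mask registers.

```c
#include <stdint.h>

/* Hypothetical mask description: bits set in clear_bits are zeroed
 * in the selected field before it is passed to the filter. */
struct field_mask {
    uint8_t  field_idx;  /* index of the field vector word to mask */
    uint16_t clear_bits; /* bits to remove from that word */
};

/* Apply the mask to the value of the word it refers to. */
uint16_t apply_field_mask(uint16_t word, const struct field_mask *m)
{
    return (uint16_t)(word & (uint16_t)~m->clear_bits);
}
```

With field_idx 9 and clear_bits 0xFF00, a first IPv4 header word of 0x45B8 is reduced to 0x00B8, leaving only the TOS byte as filter input.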

So, the input set register is 64 bits wide and the field vector has 64 words (fields), so for a PCTYPE all 64 bits in the input set register can be set and used as the input set for filters, right? Wrong.

First, the last 6 words of the field vector are reserved and must not be used, limiting the number of valid fields to 58.

Second, the flow director and hash filters use the Toeplitz hash function to calculate the hash signature, and the input of this function is limited to 48 bytes, or 24 words.

When constructing the input set, the application should take care not to select more than 24 words for the set. But, as usual, there is an exception. For the hash filter, if the Simple XOR hashing function is selected, the input set can include any number of fields.
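
These two limits can be checked in software before an input set is programmed. The sketch below is a minimal illustration with hypothetical helper names. It assumes that bit i of the 64-bit value selects word i of the field vector; the real register may order its bits differently, so use the driver's own helpers to set or test bits in an actual input set.

```c
#include <stdbool.h>
#include <stdint.h>

#define FV_WORDS           64
#define FV_RESERVED_WORDS  6  /* the last 6 words must not be used */
#define TOEPLITZ_MAX_WORDS 24 /* Toeplitz input is limited to 48 bytes */

/* Select a word for the input set. Assumption: bit i selects word i.
 * Returns false if the word is one of the reserved words or out of range. */
bool inset_add_word(uint64_t *inset, unsigned word)
{
    if (word >= FV_WORDS - FV_RESERVED_WORDS)
        return false;
    *inset |= UINT64_C(1) << word;
    return true;
}

/* Count how many words are selected. */
unsigned inset_word_count(uint64_t inset)
{
    unsigned n = 0;
    while (inset) {
        inset &= inset - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Check the limits described above. simple_xor applies to the hash filter only. */
bool inset_is_valid(uint64_t inset, bool simple_xor)
{
    const uint64_t reserved = ~UINT64_C(0) << (FV_WORDS - FV_RESERVED_WORDS);

    if (inset & reserved)
        return false;
    return simple_xor || inset_word_count(inset) <= TOEPLITZ_MAX_WORDS;
}
```

For instance, selecting words 15, 16, 27, 28, 29, and 30 (the IPv4 addresses and UDP ports from the earlier example) gives a 6-word set that passes both checks.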

DPDK 18.02 introduces a few low-level functions in the rte_pmd_i40e.h file, which can be used to control input sets and masks per PCTYPE:

int rte_pmd_i40e_inset_get(uint16_t port, uint8_t pctype,
  struct rte_pmd_i40e_inset *inset, enum rte_pmd_i40e_inset_type inset_type);
int rte_pmd_i40e_inset_set(uint16_t port, uint8_t pctype,
  struct rte_pmd_i40e_inset *inset, enum rte_pmd_i40e_inset_type inset_type);
int rte_pmd_i40e_inset_field_get(uint64_t inset, uint8_t field_idx);
int rte_pmd_i40e_inset_field_set(uint64_t *inset, uint8_t field_idx);
int rte_pmd_i40e_inset_field_clear(uint64_t *inset, uint8_t field_idx);

Flow Director Basics

The Flow Director sends different flows to different queues--simple.

Almost...

Assigning a destination queue to a packet is just one of the Flow Director actions, others include:

  • Flow Director can flag on RX descriptor that the flow director rule was matched by the packet
  • Flow Director can post on RX descriptor 4 or 8 sequential bytes from the flexible payload part of the packet's field vector
  • Flow Director can do all the above and then just happily pass the packet to the hash filter to assign a queue
  • Flow Director can drop a packet

In total, the Intel Ethernet Controller 700-Series supports up to 8K flow director exact match rules. When the application creates a rule, it can supply a cookie (tag) with the rule, a 32-bit value, which will be reported on the packet's RX descriptor.

Hash Filter Basics

Hash filter, also called RSS, does what it says - calculates the hash signature of the input set.

The calculated hash signature is then used to select the destination queue and also posted to the RX descriptor to be used by software, if needed.
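
For readers unfamiliar with the Toeplitz function mentioned earlier, here is a generic software version as it is commonly described for RSS. It is a sketch only: the key, the order in which the controller feeds input set words into the function, and any hardware-specific details are not covered here, so its results should not be expected to match the hash reported by the device without checking those details against the datasheet.

```c
#include <stddef.h>
#include <stdint.h>

/* Generic Toeplitz hash: for every input bit that is set (most significant
 * bit first), XOR in the 32-bit window of the key that starts at that bit
 * position. Key bits past the end of the key are treated as zero.
 * Returns 0 if the key is shorter than 4 bytes. */
uint32_t toeplitz_hash(const uint8_t *key, size_t key_len,
                       const uint8_t *data, size_t data_len)
{
    uint32_t result = 0;
    uint32_t window;
    size_t next_key_bit = 32; /* next key bit to shift into the window */

    if (key_len < 4)
        return 0;

    window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
             ((uint32_t)key[2] << 8) | key[3];

    for (size_t i = 0; i < data_len; i++) {
        for (int b = 7; b >= 0; b--) {
            uint32_t in = 0;

            if ((data[i] >> b) & 1u)
                result ^= window;

            /* Slide the key window left by one bit. */
            if (next_key_bit < key_len * 8)
                in = (key[next_key_bit / 8] >> (7 - next_key_bit % 8)) & 1u;
            window = (window << 1) | in;
            next_key_bit++;
        }
    }
    return result;
}
```

Because the function is built from XORs of key windows, an all-zero input always hashes to 0, and identical input bytes always produce the same 32-bit value for a given key.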

The diagram below shows how the hash filter works:

Figure 2. Hash filter diagram.

The fields selected as the hash input set are used to calculate a 32-bit hash value. Then, depending on the indirection table size, the 'n' least significant bits of the hash value are used as an index into the indirection table. For example, a PF's table can have 512 entries, so the 9 LSBs will be used as an index. Each entry of the table contains a destination queue index. By default, the driver initializes the indirection table evenly. For example, if 4 queues are used for RSS, the table entries will be initialized as 0,1,2,3,0,1,2,3...0,1,2,3. But the application can change this distribution at any time. In Figure 2 the last three entries point to the same queue 3.

As long as the hash input set values are the same, hash value will be the same, so all packets from one flow will always go to the same destination queue.
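
The table lookup described above can be sketched in a few lines of C. The function names are hypothetical, and the round-robin fill only mirrors the default distribution described in the text; it is not taken from any driver.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill the indirection table round-robin over the queues
 * first_queue .. first_queue + num_queues - 1. num_queues must be non-zero. */
void reta_init(uint16_t *reta, size_t reta_size,
               uint16_t first_queue, uint16_t num_queues)
{
    for (size_t i = 0; i < reta_size; i++)
        reta[i] = (uint16_t)(first_queue + i % num_queues);
}

/* Select a queue for a hash value. reta_size must be a power of two,
 * so the mask keeps the log2(reta_size) least significant bits. */
uint16_t reta_lookup(const uint16_t *reta, size_t reta_size, uint32_t hash)
{
    return reta[hash & (reta_size - 1)];
}
```

With a 512-entry table filled over 4 queues starting at queue 0, a hash of 0x12345679 selects entry 121 (its 9 LSBs) and therefore queue 1. Passing a non-zero first_queue gives a table with no entries pointing to queue 0.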

Hash Filter Tips and Tricks

The hash filter can be configured per PCTYPE, so some tricks can be played by applying different configurations to different PCTYPEs.

Example 1

Just calculate the hash value.

Initialize a port with a single RX queue or configure the indirection table to point to queue 0. The hash filter will calculate the hash value and post it on the RX descriptor.

Example 2

Separate IPv4 TCP packets from any other IP protocols.

Disable the hash filter for all PCTYPEs except IPv4 TCP and set up the indirection table so it will not have any entries pointing to queue 0.

All non-TCP packets will be directed to queue 0, and TCP packets will be distributed to multiple queues, as the hash will be calculated using the usual 4-tuple: IP SA, IP DA, TCP SP, TCP DP.

Queue     Hash Type    Flags   Packet Description
    0 00000000   24          | IPV4 UDP PAY4
    1 93B01B72   26 RSS_HASH | IPV4 TCP PAY4
    2 7CF903B1   26 RSS_HASH | IPV4 TCP PAY4
    3 A75D4238   26 RSS_HASH | IPV4 TCP PAY4
    4 44256387   26 RSS_HASH | IPV4 TCP PAY4
    5 1B5ACD1E   26 RSS_HASH | IPV4 TCP PAY4
    6 B28CC8AD   26 RSS_HASH | IPV4 TCP PAY4
    7 5B9BCFD4   26 RSS_HASH | IPV4 TCP PAY4

Example 3

Separate different PCTYPEs to different queue regions.

Port receive queues can be grouped into different regions. Up to 8 regions can be defined. Each region can contain 1, 2, 4, 8, or 16 queues and start at any queue index, so regions can overlap. By default, all PCTYPEs are assigned to region 0.

Queue regions can be configured by:

int rte_pmd_i40e_rss_queue_region_conf(uint16_t port_id,
   enum rte_pmd_i40e_queue_region_op op_type, void *arg);

Figure 3 shows how queue regions can be used to separate different PCTYPEs:

Figure 3. Queue regions screen.

Here, Onboard Controller Port 2 receives SCTP, GTP-C and GTP-U packets and distributes them to different queue regions:

  • SCTP to queues 2 and 3
  • GTP-U to queues 4 and 5
  • GTP-C to queues 6 and 7

Regions can be changed dynamically, for example, GTP-U region can be set to 4 queues 8-11 initially and then increased to 8 queues 8-15 if needed.

Just watch for flows that could jump from one queue to another when the region size changes.

Add-on Controller Port 1 receives three types of QUIC packets - QUIC with long header, QUIC with short header and QUIC with short header and no CID.

As all these packets are defined as separate PCTYPEs, they can be directed to different queue regions as well.

Find more details on Dynamic Device Personalization and RSS Queue Regions.

Example 4

Mix hash and other filters.

Initialize a port with multiple RX queues, let's say, 8. Use Flow Director, Ethertype filter or Switch filter to direct different control plane flows/Ethertypes to queues 4-7 and configure the hash indirection table to use queues 0-3 for data plane traffic.

Queue     Hash Type    Flags   Packet Description
    0 44256387   26 RSS_HASH | IPV4 TCP PAY4
    1 1B5ACD1E   26 RSS_HASH | IPV4 TCP PAY4
    2 A45E70DA   24 RSS_HASH | IPV4 UDP PAY4
    3 5B9BCFD4   26 RSS_HASH | IPV4 TCP PAY4
    4 00000000               | 
    5 00000000   26          | IPV4 TCP PAY4
    6 00000000   26          | IPV4 TCP PAY4
    7 00000000   24          | IPV4 UDP PAY4

Here the first 4 queues are used for data plane traffic, with flows distributed across them using RSS.

Control plane packets are directed to queues 5 to 7 using the Intel Ethernet Controller 700-Series programmable switch filters. Queue 5 receives all BGP packets with TCP destination port 179, queue 6 receives all BGP packets with TCP source port 179 and queue 7 receives all DHCP packets.

Note: The packets in queues 5 to 7 do not have hash signatures, as the switch filter bypasses RSS.

Example 5

Map QoS Class Identifier (QCI) to Differentiated Services Code Point (DSCP).

Let's take, for example, the following mapping of QCI to DSCP field from IPv4 header:

QCI  DSCP
1    56
2    48
3    40
4    32
5    24
6    16
7    8
8    1
9    0

We want to direct a QCI to corresponding RX queue: QCI1 to queue 1, QCI2 to queue 2 and so on.

The first 8 words of the IPv4 header are extracted to words 9 to 16 according to the “Field Vector” chapter in the Intel Ethernet Controller XL710 datasheet.

DSCP uses the 6 upper bits of the second byte of the IPv4 header, so we need to use the first word of the header as the hash filter input set and mask out all unused bits to get the DSCP value.

With only DSCP bits present, word 9 of the field vector will have the following values:

QCI  DSCP  Word 9
1    56    224
2    48    192
3    40    160
4    32    128
5    24    96
6    16    64
7    8     32
8    1     4
9    0     0

Now we just use the Simple XOR hashing function and configure the RSS indirection table (reta in DPDK terms) entries 0, 4, 32, 64, 96, 128, 160, 192, and 224 to point to queues 9, 8, 7, 6, 5, 4, 3, 2, and 1, and we have our QCI to DSCP to RX queue mapping:

Queue Hash     VLAN Flag Packet Description
    0 00000000 0000 0000
    1 00E00000 0003 01C3 VLAN | RSS_HASH | IPV4 UDP PAY4
    2 00C00000 0002 01C3 VLAN | RSS_HASH | IPV4 UDP PAY4
    3 00A00000 0001 01C3 VLAN | RSS_HASH | IPV4 UDP PAY4
    4 00800000 0003 01C3 VLAN | RSS_HASH | IPV4 UDP PAY4
    5 00600000 0002 01C3 VLAN | RSS_HASH | IPV4 UDP PAY4
    6 00400000 0001 01C3 VLAN | RSS_HASH | IPV4 UDP PAY4
    7 00200000 0003 01C3 VLAN | RSS_HASH | IPV4 UDP PAY4
    8 00040000 0002 01C3 VLAN | RSS_HASH | IPV4 UDP PAY4
    9 00000000 0001 01C3 VLAN | RSS_HASH | IPV4 UDP PAY4
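
The arithmetic behind this example can be reproduced in a few lines of C. The sketch assumes, as the example implies, that with Simple XOR over the single masked word the indirection table index equals the masked value of word 9. The table size of 512 and all names are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define RETA_SIZE 512 /* assumed indirection table size */

/* QCI to DSCP mapping from the example above. */
struct qci_map {
    uint8_t qci;
    uint8_t dscp;
};

static const struct qci_map qci_to_dscp[] = {
    {1, 56}, {2, 48}, {3, 40}, {4, 32}, {5, 24},
    {6, 16}, {7, 8},  {8, 1},  {9, 0},
};

/* DSCP occupies the upper 6 bits of the low byte of word 9,
 * so the masked word value is the DSCP shifted left by 2. */
uint16_t dscp_to_word9(uint8_t dscp)
{
    return (uint16_t)((dscp & 0x3F) << 2);
}

/* Point the table entry for each DSCP value at the queue with the QCI number.
 * Other entries are left untouched. */
void reta_fill_qci(uint16_t reta[RETA_SIZE])
{
    for (size_t i = 0; i < sizeof(qci_to_dscp) / sizeof(qci_to_dscp[0]); i++)
        reta[dscp_to_word9(qci_to_dscp[i].dscp)] = qci_to_dscp[i].qci;
}
```

Starting from a zeroed table, this sets entries 224, 192, 160, 128, 96, 64, 32, 4, and 0 to queues 1 through 9, matching the Word 9 column of the table above.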

About the Authors

Andrey Chilikin is a software architect working on the development and adoption of new networking technologies and solutions for telecom and enterprise communication industries.

Brian Johnson is a solutions architect focused on defining networking solutions and best practices in data center networking, virtualization, and cloud technologies.