Open-vSwitch netdev-dpdk with vhost-user support

** Update 28/08/2015 **

There is an update for this post.

Last week, open-vswitch netdev-dpdk got the long-awaited vhost-user support. This feature uses dpdk-2.0.0 to offload the servicing of a Virtual Machine’s (VM’s) virtio-net devices to a DPDK-based application in place of the kernel’s vhost-net module. The latest patch also uses the vhost-user library to pass traffic through user domain sockets, instead of cuse signaling.

My intention is to build, install and configure the accelerated data path between a physical interface and a VM, and between two VMs, using the open-vswitch main code branch.

Preface

When working with open-vswitch in user space, there are still two options for passing traffic between the host and the virtual machine (the so-called back-end):

  • IVSHMEM – host memory is shared between the virtual machines, so this is only relevant for trusted virtual machines. In terms of performance this is the fastest method (zero copy). Another downside is that you must use DPDK in the guest virtual machine.
  • VHOST user – a packet is first DMA’d to host memory, and is then copied to a private memory pool belonging to a specific virtual machine. This preserves vm-to-vm isolation, meaning a VM cannot access another VM’s packets.

VHOST user space 

The vhost library implements the following features:

  • Management of virtio-net device creation/destruction events.
  • Mapping of the VM’s physical memory into the DPDK vhost-net’s address space.
  • Triggering/receiving notifications to/from VMs via eventfds.
  • A virtio-net back-end implementation providing a subset of virtio-net features.

Below is a diagram describing the relationship between a QEMU-based virtual machine and open-vswitch running the netdev datapath with DPDK. The DPDK vhost virtual queues are shared between the OVS datapath and the virtual machine. ioeventfd and irqfd file descriptors are passed to QEMU and are used to signal events between the virtual machine and OVS.

vhost-user

There are two vhost implementations in the vhost library: vhost cuse and vhost user. In vhost cuse, a character device driver is implemented to receive and process vhost requests through ioctl messages. In vhost user, a socket server is created to receive vhost requests through socket messages. Most of the messages share the same handler routine. I will use the vhost user option, since cuse is going to be deprecated in future versions of dpdk.
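The flavor OVS ends up with is decided at configure time: the patch below greps DPDK’s generated rte_config.h for RTE_LIBRTE_VHOST_USER (see the acinclude.m4 hunk). The same check can be done by hand; a small sketch, where the helper name is mine:

```shell
# hypothetical helper: report which vhost flavor a DPDK build provides,
# using the same grep that OVS's configure script runs on rte_config.h
vhost_type() {
  if grep -q 'define RTE_LIBRTE_VHOST_USER 1' "$1"; then
    echo vhost-user
  else
    echo vhost-cuse
  fi
}
# e.g.: vhost_type "$RTE_SDK/include/rte_config.h"
```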

Building open-vswitch netdev-dpdk with vhost user

Prerequisites

I’m doing this on CentOS. First, install the required packages:

sudo yum update
sudo yum install git
sudo yum install openssl-devel
sudo yum install rpm-build
sudo yum install redhat-rpm-config
sudo yum install fuse fuse-devel

Download DPDK

cd ${HOME}
wget http://dpdk.org/browse/dpdk/snapshot/dpdk-2.0.0.tar.gz
tar xvzpf dpdk-2.0.0.tar.gz
cd dpdk-2.0.0

Configure DPDK

Edit ‘config/common_linuxapp’ and set:

CONFIG_RTE_BUILD_COMBINE_LIBS=y
(this makes DPDK compile into a single library)

CONFIG_RTE_LIBRTE_VHOST=y
(this compiles DPDK with vhost support)
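If you prefer to script the edit rather than open the file by hand, the two settings can be flipped with sed; a sketch, assuming both options default to ‘=n’ (the helper name is mine):

```shell
# flip a CONFIG_* option from =n to =y in a DPDK config file
enable_opt() {
  sed -i "s/^$2=n\$/$2=y/" "$1"
}
# inside the DPDK tree:
# enable_opt config/common_linuxapp CONFIG_RTE_BUILD_COMBINE_LIBS
# enable_opt config/common_linuxapp CONFIG_RTE_LIBRTE_VHOST
```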

Build DPDK

make config T=x86_64-native-linuxapp-gcc
make install T=x86_64-native-linuxapp-gcc

Build and install the eventfd_link driver

cd dpdk-2.0.0/lib/librte_vhost/eventfd_link/
make
sudo insmod eventfd_link.ko

Download open-vswitch

cd ${HOME}
git clone https://github.com/openvswitch/ovs.git
cd ovs
git checkout 1c38055de17b4ef00e12e0573fd433989309dc96

patch open-vswitch

cat << 'EOF' > dpdk-vhost.patch
diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
old mode 100644
new mode 100755
index 462ba0e..cdef6cf
--- a/INSTALL.DPDK.md
+++ b/INSTALL.DPDK.md
@@ -16,7 +16,9 @@ OVS needs a system with 1GB hugepages support.
 Building and Installing:
 ------------------------
 
-Required DPDK 2.0, `fuse`, `fuse-devel` (`libfuse-dev` on Debian/Ubuntu)
+Required: DPDK 2.0
+Optional (if building with vhost-cuse): `fuse`, `fuse-devel` (`libfuse-dev`
+on Debian/Ubuntu)
 
 1. Configure build & install DPDK:
   1. Set `$DPDK_DIR`
@@ -32,12 +34,9 @@ Required DPDK 2.0, `fuse`, `fuse-devel` (`libfuse-dev` on Debian/Ubuntu)
      `CONFIG_RTE_BUILD_COMBINE_LIBS=y`
 
      Update `config/common_linuxapp` so that DPDK is built with vhost
-     libraries; currently, OVS only supports vhost-cuse, so DPDK vhost-user
-     libraries should be explicitly turned off (they are enabled by default
-     in DPDK 2.0).
+     libraries.
 
      `CONFIG_RTE_LIBRTE_VHOST=y`
-     `CONFIG_RTE_LIBRTE_VHOST_USER=n`
 
      Then run `make install` to build and install the library.
      For default install without IVSHMEM:
@@ -316,40 +315,164 @@ the vswitchd.
 DPDK vhost:
 -----------
 
-vhost-cuse is only supported at present i.e. not using the standard QEMU
-vhost-user interface. It is intended that vhost-user support will be added
-in future releases when supported in DPDK and that vhost-cuse will eventually
-be deprecated. See [DPDK Docs] for more info on vhost.
+DPDK 2.0 supports two types of vhost:
 
-Prerequisites:
-1.  Insert the Cuse module:
+1. vhost-user
+2. vhost-cuse
 
-      `modprobe cuse`
+Whatever type of vhost is enabled in the DPDK build specified, is the type
+that will be enabled in OVS. By default, vhost-user is enabled in DPDK.
+Therefore, unless vhost-cuse has been enabled in DPDK, vhost-user ports
+will be enabled in OVS.
+Please note that support for vhost-cuse is intended to be deprecated in OVS
+in a future release.
 
-2.  Build and insert the `eventfd_link` module:
+DPDK vhost-user:
+----------------
 
-     `cd $DPDK_DIR/lib/librte_vhost/eventfd_link/`
-     `make`
-     `insmod $DPDK_DIR/lib/librte_vhost/eventfd_link.ko`
+The following sections describe the use of vhost-user 'dpdkvhostuser' ports
+with OVS.
 
-Following the steps above to create a bridge, you can now add DPDK vhost
-as a port to the vswitch.
+DPDK vhost-user Prerequisites:
+-------------------------
 
-`ovs-vsctl add-port br0 dpdkvhost0 -- set Interface dpdkvhost0 type=dpdkvhost`
+1. DPDK 2.0 with vhost support enabled as documented in the "Building and
+   Installing section"
 
-Unlike DPDK ring ports, DPDK vhost ports can have arbitrary names:
+2. QEMU version v2.1.0+
 
-`ovs-vsctl add-port br0 port123ABC -- set Interface port123ABC type=dpdkvhost`
+   QEMU v2.1.0 will suffice, but it is recommended to use v2.2.0 if providing
+   your VM with memory greater than 1GB due to potential issues with memory
+   mapping larger areas.
 
-However, please note that when attaching userspace devices to QEMU, the
-name provided during the add-port operation must match the ifname parameter
-on the QEMU command line.
+Adding DPDK vhost-user ports to the Switch:
+--------------------------------------
 
+Following the steps above to create a bridge, you can now add DPDK vhost-user
+as a port to the vswitch. Unlike DPDK ring ports, DPDK vhost-user ports can
+have arbitrary names.
 
-DPDK vhost VM configuration:
-----------------------------
+  -  For vhost-user, the name of the port type is `dpdkvhostuser`
 
-   vhost ports use a Linux* character device to communicate with QEMU.
+     ```
+     ovs-ofctl add-port br0 vhost-user-1 -- set Interface vhost-user-1
+     type=dpdkvhostuser
+     ```
+
+     This action creates a socket located at
+     `/usr/local/var/run/openvswitch/vhost-user-1`, which you must provide
+     to your VM on the QEMU command line. More instructions on this can be
+     found in the next section "DPDK vhost-user VM configuration"
+     Note: If you wish for the vhost-user sockets to be created in a
+     directory other than `/usr/local/var/run/openvswitch`, you may specify
+     another location on the ovs-vswitchd command line like so:
+
+      `./vswitchd/ovs-vswitchd --dpdk -vhost_sock_dir /my-dir -c 0x1 ...`
+
+DPDK vhost-user VM configuration:
+---------------------------------
+Follow the steps below to attach vhost-user port(s) to a VM.
+
+1. Configure sockets.
+   Pass the following parameters to QEMU to attach a vhost-user device:
+
+   ```
+   -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1
+   -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce
+   -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1
+   ```
+
+   ...where vhost-user-1 is the name of the vhost-user port added
+   to the switch.
+   Repeat the above parameters for multiple devices, changing the
+   chardev path and id as necessary. Note that a separate and different
+   chardev path needs to be specified for each vhost-user device. For
+   example you have a second vhost-user port named 'vhost-user-2', you
+   append your QEMU command line with an additional set of parameters:
+
+   ```
+   -chardev socket,id=char2,path=/usr/local/var/run/openvswitch/vhost-user-2
+   -netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce
+   -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2
+   ```
+
+2. Configure huge pages.
+   QEMU must allocate the VM's memory on hugetlbfs. vhost-user ports access
+   a virtio-net device's virtual rings and packet buffers mapping the VM's
+   physical memory on hugetlbfs. To enable vhost-user ports to map the VM's
+   memory into their process address space, pass the following paramters
+   to QEMU:
+
+   ```
+   -object memory-backend-file,id=mem,size=4096M,mem-path=/dev/hugepages,
+   share=on
+   -numa node,memdev=mem -mem-prealloc
+   ```
+
+DPDK vhost-cuse:
+----------------
+
+The following sections describe the use of vhost-cuse 'dpdkvhostcuse' ports
+with OVS.
+
+DPDK vhost-cuse Prerequisites:
+-------------------------
+
+1. DPDK 2.0 with vhost support enabled as documented in the "Building and
+   Installing section"
+   As an additional step, you must enable vhost-cuse in DPDK by setting the
+   following additional flag in `config/common_linuxapp`:
+
+   `CONFIG_RTE_LIBRTE_VHOST_USER=n`
+
+   Following this, rebuild DPDK as per the instructions in the "Building and
+   Installing" section. Finally, rebuild OVS as per step 3 in the "Building
+   and Installing" section - OVS will detect that DPDK has vhost-cuse libraries
+   compiled and in turn will enable support for it in the switch and disable
+   vhost-user support.
+
+2. Insert the Cuse module:
+
+     `modprobe cuse`
+
+3. Build and insert the `eventfd_link` module:
+
+     ```
+     cd $DPDK_DIR/lib/librte_vhost/eventfd_link/
+     make
+     insmod $DPDK_DIR/lib/librte_vhost/eventfd_link.ko
+     ```
+
+4. QEMU version v2.1.0+
+
+   vhost-cuse will work with QEMU v2.1.0 and above, however it is recommended to
+   use v2.2.0 if providing your VM with memory greater than 1GB due to potential
+   issues with memory mapping larger areas.
+   Note: QEMU v1.6.2 will also work, with slightly different command line parameters,
+   which are specified later in this document.
+
+Adding DPDK vhost-cuse ports to the Switch:
+--------------------------------------
+
+Following the steps above to create a bridge, you can now add DPDK vhost-cuse
+as a port to the vswitch. Unlike DPDK ring ports, DPDK vhost-cuse ports can have
+arbitrary names.
+
+  -  For vhost-cuse, the name of the port type is `dpdkvhostcuse`
+
+     ```
+     ovs-ofctl add-port br0 vhost-cuse-1 -- set Interface vhost-cuse-1
+     type=dpdkvhostcuse
+     ```
+
+     When attaching vhost-cuse ports to QEMU, the name provided during the
+     add-port operation must match the ifname parameter on the QEMU command
+     line. More instructions on this can be found in the next section.
+
+DPDK vhost-cuse VM configuration:
+---------------------------------
+
+   vhost-cuse ports use a Linux* character device to communicate with QEMU.
    By default it is set to `/dev/vhost-net`. It is possible to reuse this
    standard device for DPDK vhost, which makes setup a little simpler but it
    is better practice to specify an alternative character device in order to
@@ -415,16 +538,19 @@ DPDK vhost VM configuration:
    QEMU must allocate the VM's memory on hugetlbfs. Vhost ports access a
    virtio-net device's virtual rings and packet buffers mapping the VM's
    physical memory on hugetlbfs. To enable vhost-ports to map the VM's
-   memory into their process address space, pass the following paramters
+   memory into their process address space, pass the following parameters
    to QEMU:
 
      `-object memory-backend-file,id=mem,size=4096M,mem-path=/dev/hugepages,
       share=on -numa node,memdev=mem -mem-prealloc`
 
+   Note: For use with an earlier QEMU version such as v1.6.2, use the
+   following to configure hugepages instead:
 
-DPDK vhost VM configuration with QEMU wrapper:
-----------------------------------------------
+     `-mem-path /dev/hugepages -mem-prealloc`
 
+DPDK vhost-cuse VM configuration with QEMU wrapper:
+---------------------------------------------------
 The QEMU wrapper script automatically detects and calls QEMU with the
 necessary parameters. It performs the following actions:
 
@@ -450,8 +576,8 @@ qemu-wrap.py -cpu host -boot c -hda <disk image> -m 4096 -smp 4
   netdev=net1,mac=00:00:00:00:00:01
 ```
 
-DPDK vhost VM configuration with libvirt:
------------------------------------------
+DPDK vhost-cuse VM configuration with libvirt:
+----------------------------------------------
 
 If you are using libvirt, you must enable libvirt to access the character
 device by adding it to controllers cgroup for libvirtd using the following
@@ -525,7 +651,7 @@ Now you may launch your VM using virt-manager, or like so:
 
     `virsh create my_vhost_vm.xml`
 
-DPDK vhost VM configuration with libvirt and QEMU wrapper:
+DPDK vhost-cuse VM configuration with libvirt and QEMU wrapper:
 ----------------------------------------------------------
 
 To use the qemu-wrapper script in conjuntion with libvirt, follow the
@@ -553,7 +679,7 @@ steps in the previous section before proceeding with the following steps:
   the correct emulator location and set any additional options. If you are
   using a alternative character device name, please set "us_vhost_path" to the
   location of that device. The script will automatically detect and insert
-  the correct "vhostfd" value in the QEMU command line arguements.
+  the correct "vhostfd" value in the QEMU command line arguments.
 
   5. Use virt-manager to launch the VM
 
diff --git a/acinclude.m4 b/acinclude.m4
index d09a73f..20391ec 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -220,6 +220,9 @@ AC_DEFUN([OVS_CHECK_DPDK], [
     DPDK_vswitchd_LDFLAGS=-Wl,--whole-archive,$DPDK_LIB,--no-whole-archive
     AC_SUBST([DPDK_vswitchd_LDFLAGS])
     AC_DEFINE([DPDK_NETDEV], [1], [System uses the DPDK module.])
+
+    OVS_GREP_IFELSE([$RTE_SDK/include/rte_config.h], [define RTE_LIBRTE_VHOST_USER 1],
+                    [], [AC_DEFINE([VHOST_CUSE], [1], [DPDK vhost-cuse support enabled, vhost-user disabled.])])
   else
     RTE_SDK=
   fi
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 63243d8..af89e39 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -26,8 +26,10 @@
 #include <sched.h>
 #include <stdlib.h>
 #include <unistd.h>
+#include <sys/stat.h>
 #include <stdio.h>
 
+#include "dirs.h"
 #include "dp-packet.h"
 #include "dpif-netdev.h"
 #include "list.h"
@@ -90,8 +92,8 @@ BUILD_ASSERT_DECL((MAX_NB_MBUF / ROUND_DOWN_POW2(MAX_NB_MBUF/MIN_NB_MBUF))
 #define NIC_PORT_RX_Q_SIZE 2048  /* Size of Physical NIC RX Queue, Max (n+32<=4096)*/
 #define NIC_PORT_TX_Q_SIZE 2048  /* Size of Physical NIC TX Queue, Max (n+32<=4096)*/
 
-/* Character device cuse_dev_name. */
-static char *cuse_dev_name = NULL;
+char *cuse_dev_name = NULL;    /* Character device cuse_dev_name. */
+char *vhost_sock_dir = NULL;   /* Location of vhost-user sockets */
 
 /*
  * Maximum amount of time in micro seconds to try and enqueue to vhost.
@@ -126,7 +128,7 @@ enum { DRAIN_TSC = 200000ULL };
 
 enum dpdk_dev_type {
     DPDK_DEV_ETH = 0,
-    DPDK_DEV_VHOST = 1
+    DPDK_DEV_VHOST = 1,
 };
 
 static int rte_eal_init_ret = ENODEV;
@@ -221,6 +223,9 @@ struct netdev_dpdk {
     /* virtio-net structure for vhost device */
     OVSRCU_TYPE(struct virtio_net *) virtio_dev;
 
+    /* Identifier used to distinguish vhost devices from each other */
+    char vhost_ifname[PATH_MAX];
+
     /* In dpdk_list. */
     struct ovs_list list_node OVS_GUARDED_BY(dpdk_mutex);
 };
@@ -594,7 +599,7 @@ dpdk_dev_parse_name(const char dev_name[], const char prefix[],
 }
 
 static int
-netdev_dpdk_vhost_construct(struct netdev *netdev_)
+vhost_construct_helper(struct netdev *netdev_)
 {
     struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
     int err;
@@ -613,6 +618,37 @@ netdev_dpdk_vhost_construct(struct netdev *netdev_)
 }
 
 static int
+netdev_dpdk_vhost_cuse_construct(struct netdev *netdev_)
+{
+    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
+
+    strncpy(netdev->vhost_ifname, netdev->up.name, sizeof(netdev->vhost_ifname));
+
+    return vhost_construct_helper(netdev_);
+}
+
+static int
+netdev_dpdk_vhost_user_construct(struct netdev *netdev_)
+{
+    int err;
+    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
+
+    /* Take the name of the vhost-user port and append it to the location where
+     * the socket is to be created, then register the socket.
+     */
+    snprintf(netdev->vhost_ifname, sizeof(netdev->vhost_ifname), "%s/%s",
+            vhost_sock_dir, netdev_->name);
+    err = rte_vhost_driver_register(netdev->vhost_ifname);
+    if (err) {
+        VLOG_ERR("vhost-user socket device setup failure for socket %s\n",
+                 netdev->vhost_ifname);
+    }
+    VLOG_INFO("Socket %s created for vhost-user port %s\n", netdev->vhost_ifname, netdev_->name);
+
+    return vhost_construct_helper(netdev_);
+}
+
+static int
 netdev_dpdk_construct(struct netdev *netdev)
 {
     unsigned int port_no;
@@ -1607,7 +1643,7 @@ new_device(struct virtio_net *dev)
     ovs_mutex_lock(&dpdk_mutex);
     /* Add device to the vhost port with the same name as that passed down. */
     LIST_FOR_EACH(netdev, list_node, &dpdk_list) {
-        if (strncmp(dev->ifname, netdev->up.name, IFNAMSIZ) == 0) {
+        if (strncmp(dev->ifname, netdev->vhost_ifname, IF_NAME_SZ) == 0) {
             ovs_mutex_lock(&netdev->mutex);
             ovsrcu_set(&netdev->virtio_dev, dev);
             ovs_mutex_unlock(&netdev->mutex);
@@ -1687,7 +1723,7 @@ static const struct virtio_net_device_ops virtio_net_device_ops =
 };
 
 static void *
-start_cuse_session_loop(void *dummy OVS_UNUSED)
+start_vhost_loop(void *dummy OVS_UNUSED)
 {
      pthread_detach(pthread_self());
      /* Put the cuse thread into quiescent state. */
@@ -1697,7 +1733,7 @@ start_cuse_session_loop(void *dummy OVS_UNUSED)
 }
 
 static int
-dpdk_vhost_class_init(void)
+dpdk_vhost_cuse_class_init(void)
 {
     int err = -1;
 
@@ -1714,7 +1750,16 @@ dpdk_vhost_class_init(void)
         return -1;
     }
 
-    ovs_thread_create("cuse_thread", start_cuse_session_loop, NULL);
+    ovs_thread_create("vhost_thread", start_vhost_loop, NULL);
+    return 0;
+}
+
+static int
+dpdk_vhost_user_class_init(void)
+{
+    rte_vhost_driver_callback_register(&virtio_net_device_ops);
+
+    ovs_thread_create("vhost_thread", start_vhost_loop, NULL);
     return 0;
 }
 
@@ -1923,6 +1968,40 @@ unlock_dpdk:
     NULL,                       /* rxq_drain */               \
 }
 
+static int
+process_vhost_flags(char* flag, char* default_val, int size, char** argv, char** new_val)
+{
+    int changed = 0;
+    struct stat s;
+
+    /* Depending on which version of vhost is in use, process the vhost-specific
+     * flag if it is provided on the vswitchd command line, otherwise resort to
+     * a default value.
+     *
+     * For vhost-user: Process "-cuse_dev_name" to set the custom location of
+     * the vhost-user socket(s).
+     * For vhost-cuse: Process "-vhost_sock_dir" to set the custom name of the
+     * vhost-cuse character device.
+     */
+    if (!strcmp(argv[1], flag) && (strlen(argv[2]) <= size)) {
+        changed = 1;
+        /* For vhost-user, check if the speficied directory exists */
+        if (!strcmp(argv[1], "-vhost_sock_dir") && stat((argv[2]), &s)) {
+            VLOG_INFO("Invalid %s provided - defaulting to %s", flag,
+                    default_val);
+            goto set_default_dir;
+        }
+        *new_val = strdup(argv[2]);
+        VLOG_INFO("User-provided %s in use: %s", flag, *new_val);
+    } else {
+        VLOG_INFO("No %s provided - defaulting to %s", flag, default_val);
+set_default_dir:
+        *new_val = default_val;
+    }
+
+    return changed;
+}
+
 int
 dpdk_init(int argc, char **argv)
 {
@@ -1937,27 +2016,20 @@ dpdk_init(int argc, char **argv)
     argc--;
     argv++;
 
-    /* If the cuse_dev_name parameter has been provided, set 'cuse_dev_name' to
-     * this string if it meets the correct criteria. Otherwise, set it to the
-     * default (vhost-net).
-     */
-    if (!strcmp(argv[1], "--cuse_dev_name") &&
-        (strlen(argv[2]) <= NAME_MAX)) {
-
-        cuse_dev_name = strdup(argv[2]);
-
-        /* Remove the cuse_dev_name configuration parameters from the argument
+#ifdef VHOST_CUSE
+    if (process_vhost_flags("-cuse_dev_name", strdup("vhost-net"),
+            PATH_MAX, argv, &cuse_dev_name)) {
+#else
+    if (process_vhost_flags("-vhost_sock_dir", strdup(ovs_rundir()),
+            NAME_MAX, argv, &vhost_sock_dir)) {
+#endif
+        /* Remove the vhost flag configuration parameters from the argument
          * list, so that the correct elements are passed to the DPDK
          * initialization function
          */
         argc -= 2;
-        argv += 2;    /* Increment by two to bypass the cuse_dev_name arguments */
+        argv += 2;    /* Increment by two to bypass the vhost flag arguments */
         base = 2;
-
-        VLOG_ERR("User-provided cuse_dev_name in use: /dev/%s", cuse_dev_name);
-    } else {
-        cuse_dev_name = "vhost-net";
-        VLOG_INFO("No cuse_dev_name provided - defaulting to /dev/vhost-net");
     }
 
     /* Keep the program name argument as this is needed for call to
@@ -2012,11 +2084,25 @@ static const struct netdev_class dpdk_ring_class =
         netdev_dpdk_get_status,
         netdev_dpdk_rxq_recv);
 
-static const struct netdev_class dpdk_vhost_class =
+static const struct netdev_class dpdk_vhost_cuse_class =
     NETDEV_DPDK_CLASS(
-        "dpdkvhost",
-        dpdk_vhost_class_init,
-        netdev_dpdk_vhost_construct,
+        "dpdkvhostcuse",
+        dpdk_vhost_cuse_class_init,
+        netdev_dpdk_vhost_cuse_construct,
+        netdev_dpdk_vhost_destruct,
+        netdev_dpdk_vhost_set_multiq,
+        netdev_dpdk_vhost_send,
+        netdev_dpdk_vhost_get_carrier,
+        netdev_dpdk_vhost_get_stats,
+        NULL,
+        NULL,
+        netdev_dpdk_vhost_rxq_recv);
+
+const struct netdev_class dpdk_vhost_user_class =
+    NETDEV_DPDK_CLASS(
+        "dpdkvhostuser",
+        dpdk_vhost_user_class_init,
+        netdev_dpdk_vhost_user_construct,
         netdev_dpdk_vhost_destruct,
         netdev_dpdk_vhost_set_multiq,
         netdev_dpdk_vhost_send,
@@ -2039,7 +2125,11 @@ netdev_dpdk_register(void)
         dpdk_common_init();
         netdev_register_provider(&dpdk_class);
         netdev_register_provider(&dpdk_ring_class);
-        netdev_register_provider(&dpdk_vhost_class);
+#ifdef VHOST_CUSE
+        netdev_register_provider(&dpdk_vhost_cuse_class);
+#else
+        netdev_register_provider(&dpdk_vhost_user_class);
+#endif
         ovsthread_once_done(&once);
     }
 }
diff --git a/lib/netdev.c b/lib/netdev.c
index 03a7549..186c1e2 100644
--- a/lib/netdev.c
+++ b/lib/netdev.c
@@ -111,7 +111,8 @@ netdev_is_pmd(const struct netdev *netdev)
 {
     return (!strcmp(netdev->netdev_class->type, "dpdk") ||
             !strcmp(netdev->netdev_class->type, "dpdkr") ||
-            !strcmp(netdev->netdev_class->type, "dpdkvhost"));
+            !strcmp(netdev->netdev_class->type, "dpdkvhostcuse") ||
+            !strcmp(netdev->netdev_class->type, "dpdkvhostuser"));
 }
 
 static void
diff --git a/vswitchd/ovs-vswitchd.c b/vswitchd/ovs-vswitchd.c
index a1b33da..082fb12 100644
--- a/vswitchd/ovs-vswitchd.c
+++ b/vswitchd/ovs-vswitchd.c
@@ -253,8 +253,13 @@ usage(void)
     vlog_usage();
     printf("\nDPDK options:\n"
            "  --dpdk options            Initialize DPDK datapath.\n"
-           "  --cuse_dev_name BASENAME  override default character device name\n"
+#ifdef VHOST_CUSE
+           "  -cuse_dev_name BASENAME  override default character device name\n"
            "                            for use with userspace vHost.\n");
+#else
+           "  -vhost_sock_dir DIR      override default directory where\n"
+           "                            vhost-user sockets are created.\n");
+#endif
     printf("\nOther options:\n"
            "  --unixctl=SOCKET          override default control socket name\n"
            "  -h, --help                display this help message\n"
EOF

Original post of the patch is here.

Then apply it:

patch -p1 < dpdk-vhost.patch

Build open-vswitch with DPDK support

./boot.sh
./configure --with-dpdk=../dpdk-2.0.0/x86_64-native-linuxapp-gcc
make CFLAGS=-O3

Setup huge pages, and cpu isolation

Add the following options to the kernel boot line: edit /etc/default/grub and append the following to GRUB_CMDLINE_LINUX=

iommu=pt intel_iommu=on default_hugepagesz=1G hugepagesz=1G hugepages=8 isolcpus=2-7

Regenerate the grub configuration:

grub2-mkconfig --output=/boot/grub2/grub.cfg

Reboot, and ensure VT-d is enabled in the BIOS. Now mount the hugepages:

mount -t hugetlbfs -o pagesize=1G none /dev/hugepages
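After rebooting, you can confirm that the hugepages and the isolated cores took effect; a quick check using standard Linux procfs/sysfs paths (the values depend on your machine):

```shell
# hugepage pool as seen by the kernel (expect HugePages_Total: 8 here)
grep -E 'HugePages_Total|Hugepagesize' /proc/meminfo
# cores removed from the scheduler by isolcpus (file absent on old kernels)
cat /sys/devices/system/cpu/isolated 2>/dev/null || true
```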

Setup DPDK devices

RTE_SDK="${HOME}/dpdk-2.0.0"
RTE_TARGET="x86_64-native-linuxapp-gcc"
sudo modprobe uio
sudo insmod ${RTE_SDK}/${RTE_TARGET}/kmod/igb_uio.ko
sudo ${RTE_SDK}/tools/dpdk_nic_bind.py --bind=igb_uio eth0

Initialize open-vswitch database with new schema

cd ${HOME}/ovs/utilities
./ovsdb-tool create /etc/openvswitch/conf.db ${HOME}/ovs/vswitchd/vswitch.ovsschema
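One step that is easy to miss: ovs-vswitchd talks to a running ovsdb-server, so start one against the database just created before launching the switch daemon. A typical invocation (the socket path is chosen to match the db.sock path passed to ovs-vswitchd):

```shell
ovsdb-server /etc/openvswitch/conf.db \
    --remote=punix:/var/run/openvswitch/db.sock \
    --remote=db:Open_vSwitch,Open_vSwitch,manager_options \
    --pidfile --detach
ovs-vsctl --db=unix:/var/run/openvswitch/db.sock --no-wait init
```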

Start the open-vswitch daemon

cd ${HOME}/ovs
ovs-vswitchd --dpdk -c 0x1 -n 4 --socket-mem 1024,0 -- \
unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info \
--mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log \
--pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach --monitor

Configure bridge

I will set up two bridges, each connecting a physical port to the virtual machine:

ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
ovs-vsctl add-port br0 dpdkvhost0 -- set Interface dpdkvhost0 type=dpdkvhostuser
ovs-vsctl add-br br1 -- set bridge br1 datapath_type=netdev
ovs-vsctl add-port br1 dpdk1 -- set Interface dpdk1 type=dpdk
ovs-vsctl add-port br1 dpdkvhost1 -- set Interface dpdkvhost1 type=dpdkvhostuser
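The six commands above follow one pattern per bridge index; if you script this for more port pairs, a sketch that just prints the command list (the function name is mine):

```shell
# print the ovs-vsctl commands for N physical/vhost bridge pairs
gen_bridge_cmds() {
  i=0
  while [ "$i" -lt "$1" ]; do
    echo "ovs-vsctl add-br br$i -- set bridge br$i datapath_type=netdev"
    echo "ovs-vsctl add-port br$i dpdk$i -- set Interface dpdk$i type=dpdk"
    echo "ovs-vsctl add-port br$i dpdkvhost$i -- set Interface dpdkvhost$i type=dpdkvhostuser"
    i=$((i + 1))
  done
}
gen_bridge_cmds 2
```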

Setup the guest

We need to update the virtual machine’s libvirt .xml. First we will add the qemu wrapper to libvirt:

<emulator>${HOME}/ovs/utilities/qemu-wrap.py</emulator>

I’m using qemu version 2.1.2, so sharing of hugepages is supported. The next thing is to add the qemu arguments:

 <qemu:commandline>
 <qemu:arg value='-object'/>
 <qemu:arg value='memory-backend-file,id=mem,size=4096M,mem-path=/dev/hugepages,share=on'/>
 <qemu:arg value='-numa'/>
 <qemu:arg value='node,memdev=mem'/>
 <qemu:arg value='-mem-prealloc'/>
 </qemu:commandline>

The last thing to do is to modify the network interfaces (virtio-net-pci) to use the socket character device (instead of cuse), and to configure the netdev with type vhost-user:

 <qemu:arg value='-chardev'/>
 <qemu:arg value='socket,id=char1,path=/usr/local/var/run/openvswitch/dpdkvhost0'/>
 <qemu:arg value='-netdev'/>
 <qemu:arg value='type=vhost-user,id=net1,chardev=char1,vhostforce'/>
 <qemu:arg value='-device'/>
 <qemu:arg value='virtio-net-pci,netdev=net1,mac=00:00:00:00:00:01'/>

Do the same for all the virtio-net-pci interfaces.
Now we can start the virtual machine:

virsh create vm-vhost.xml

Set affinity

In order to set the affinity correctly, we need to identify the vswitch daemon process, locate the PMD thread, and pin it to an isolated core:

ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x40

This will set the polling thread to work on core 6.
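pmd-cpu-mask is a plain bitmask with one bit per core, so the value for any core can be computed instead of remembered:

```shell
# mask for a single core: bit <core> set (core 6 -> 0x40)
core=6
mask=$(printf '0x%x' $((1 << core)))
echo "$mask"   # prints 0x40
```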

Displaying bridge ports statistics

open-vswitch can dump the port statistics of a bridge:

ovs-ofctl dump-ports br0
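If you want just the packet counters out of that dump, e.g. for a monitoring script, a small filter works on the dump-ports line format (the sample line mimics real output; the helper name is mine):

```shell
# extract rx/tx packet counters from `ovs-ofctl dump-ports` style output
parse_pkts() {
  grep -oE '(rx|tx) pkts=[0-9?]+'
}
echo '  port  1: rx pkts=100, bytes=6400, drop=0, errs=0, frame=0, over=0, crc=0' \
  | parse_pkts   # prints: rx pkts=100
```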

Final words

Performance can now be measured (in my setup it was pretty good). This will be a feature of the next open-vswitch version; since 2.4.0 is not yet available, I used the master branch to build and verify the setup.

40 thoughts on “Open-vSwitch netdev-dpdk with vhost-user support”

  1. Awesome.
    Do you have similar steps for the latest dpdk and ovs 2.4 (which has the dpdk support)? It would be very helpful.

    • ovs 2.4 was branched just after my post. With version 2.4 you won’t need the patch, and the same steps should be followed exactly. Anyway, you can read the INSTALL.DPDK manual to overcome any problems. In a month or so, ovs 2.4 should be finalized; I will then modify the steps accordingly (if needed).

  2. Also, it would be useful if you could add the entire xml file for the vm,
    and any changes to the qemu-wrapper.py file.

  3. Hi,

    I was trying to run it on RHEL 7.1, but I am getting some package errors.

    Please tell us which distribution you tried and what else we need to download [apart from the ovs and dpdk code].

    Thanks in advance

    • I did that on CentOS 6.5.
      You can check ovs/rhel for scripts that build RPM packaging of OVS for Red Hat. You might find there the additional packages and versions needed.

  4. Hello to everyone,
    I am currently working with openvswitch dpdk.
    I managed with your tutorial and this one: https://software.intel.com/en-us/blogs/2015/06/09/building-vhost-user-for-ovs-today-using-dpdk-200
    The problem is that the other one has a set of patches that I could not understand.
    These are the patches applied by the script:
    cd ${home}/src/${ovs_path}
    git checkout 7762f7c39a8f5f115427b598d9e768f9336af466
    patch -p1 <../../dpdk-vhost-user-2.patch
    patch -p1 <../../ovs-ctl-add-dpdk.patch

    Thank you for your help

  5. How did you verify the pmd task and the core on which it is running? I see a lot of packet drops and would like to isolate the core to see if it improves.

    • Check phy to phy switching.
      If that’s ok, check the affinity of the virtual machine (qemu processes):
      move them to dedicated cores, which are isolated via the boot param isolcpus.

  6. Hi,

    It’s a nice tutorial, thanks a lot for sharing this.

    I followed your post and the configuration seems ready. However, I don’t know how to test the VM to VM performance after starting the VMs. What kind of VM configuration should I do? Do I need to configure the network interface to make them work?

    Could you please give me some hints or some other links to follow? Thanks again.

    • Hi Samuel,

      First, you need to prepare a VNF with user-space packet processing. The best way to start is to run the DPDK example application. Once you get that running, you should achieve a good small-packet rate, as the whole datapath is processed in user space (both host and VM).
      Second, after doing a basic VM loopback traffic test, you can add an additional VNF (it can be the same DPDK-based VNF, as explained) and run a traffic loop: phy->vm1->vm2->phy, to also check inter-VM traffic. Pay attention to the affinity; you want to place each VM on dedicated, isolated cores to achieve the best performance.

      Regards,
      Ran.

      • Hi Ran,

        Thanks for quick reply.

        Do you mean to run DPDK example application in the VMs? I was able to run the testpmd on the host, but I don’t know how to make it run in this vhostuser configuration on the VMs.

        On the other hand, I couldn’t ping from one VM to the other or to the bridge. Did I miss something?

        Regards,
        samuelq

        • Yes, just get the binary into the VM and run testpmd. I've attached the VM xml file; there you can find how to configure the virtio interfaces to use the vhost backend.

          • I didn't use libvirt but rather the qemu command line with the same configuration as yours. I will try to catch up with libvirt then.

            Just one more question: what’s the network interface configuration for VMs and the bridges? Should they be ping’able from each other with your configuration (without assigning IP addresses)?

            Thanks again. 🙂
            samuelq

          • Since I'm passing arguments directly from libvirt to qemu, you can just copy the interface arguments to your qemu command line.
            For the bridges: you just need to create ports of type dpdkvhostuser.
            This is an L2 bridge, so the ports don't have any IP address. What you are probably referring to is the internal connection of the bridge to the operating system (run ovs-vsctl show and note that each bridge has an internal port). This is usually used for protocol termination such as ARP and ICMP.
            For the VMs: you can just run an operating system, and it should initialize the virtio interfaces. You should see them via ethtool
            as regular networking interfaces. Assign them an IP (run ip addr), and they will be ping'able.
            If you are running testpmd inside the VM, you should unbind the interfaces from the operating system, so no IP is bound to them, and thus there is no IP address to ping.
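            As a sketch, the host-side port creation and a guest-side IP check (bridge and port names match this post; the addresses are illustrative):

            ```shell
            # Host: add two vhost-user ports to the userspace bridge.
            ovs-vsctl add-port br0 dpdkvhost0 -- set Interface dpdkvhost0 type=dpdkvhostuser
            ovs-vsctl add-port br0 dpdkvhost1 -- set Interface dpdkvhost1 type=dpdkvhostuser

            # Inside each VM: the virtio device appears as a regular netdev.
            ip addr add 192.168.100.1/24 dev eth0   # use .2 in the second VM
            ip link set eth0 up
            ping 192.168.100.2
            ```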

  7. Hi Ran,

    I followed your steps but without any success. Sorry if I’m asking stupid questions, I’m quite new to qemu and libvirt.

    The bridges and dpdk ports are successfully set up.

    1. with libvirt, I was able to "virsh define" your xml, but when starting the virtual machine, I got the following error message. Do you know how to fix it? (I already configured qemu-wrap.py based on its instructions.)

    error: Failed to start domain vm_ovs_vhost_user-1
    error: internal error: process exited while connecting to monitor: libvirt: error : cannot execute binary /usr/bin/qemu-wrap.py: Permission denied

    2. I also converted your xml config to qemu parameters without any difficulties, but the openvswitchd.log shows that the vhostuser ports are created and immediately removed when the VM is on:

    2015-08-28T14:51:36.291Z|00001|dpdk(vhost_thread2)|INFO|vHost Device '/usr/local/var/run/openvswitch/dpdkvhost0' (0) has been added
    2015-08-28T14:51:36.294Z|00002|dpdk(vhost_thread2)|INFO|vHost Device '/usr/local/var/run/openvswitch/dpdkvhost1' (1) has been added
    2015-08-28T14:51:40.017Z|00003|dpdk(vhost_thread2)|INFO|vHost Device '/usr/local/var/run/openvswitch/dpdkvhost0' (0) has been removed
    2015-08-28T14:51:40.018Z|00004|dpdk(vhost_thread2)|INFO|vHost Device '/usr/local/var/run/openvswitch/dpdkvhost1' (1) has been removed

    and of course, the network between VMs, VMs and Host do not work, I assign them IP addresses of the same subnet, but they are not ping’able.

    3. libvirt has integrated gateway and dhcp server. But if using qemu command line, will it work the same way as using libvirt?

    I think going with libvirt is the most straightforward way, but I’m still trying to fix the first error.

    Thanks a lot for your help, I really appreciate it.

    samuelq

    • 1. Check that qemu-wrap.py belongs to the same user you are running libvirt with and has execute permissions. Anyway, in the vhost-user case you don't need the wrapper (see my updated post); just remove that line from your xml.
      2. What is your exact qemu command? Are you able to log in to the VM? Is it loaded and running? Again, verify the permission/user issue (for the file descriptors).
      3. Look at libvirt as a wrapper around the hypervisor. If you would like to be hypervisor (xen/kvm etc.) agnostic, use libvirt.

      • 1. I’m pretty sure that qemu-wrap.py has the same ownership and in the same user group as I run libvirt with, which is simply the root.

        If I get rid of the qemu-wrap.py, I got following error message:

        error: Failed to start domain vm_ovs_vhost_user-1
        error: internal error: process exited while connecting to monitor: 2015-08-31T11:03:57.417721Z qemu-system-x86_64: -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/dpdkvhost0: Failed to connect socket: Permission denied

        but the permission of the vhostuser0 seems correct:

        srwxr-xr-x 1 root root 0 Aug 31 13:03 /usr/local/var/run/openvswitch/dpdkvhost0=

        2. the qemu command line that I was using looks like this:

        kvm ~/workspace/vm-images/ubuntu14-1.qcow2 -m 1024 -boot c -vga vmware -cpu Nehalem -smp 4 -name "VM1" \
        -object memory-backend-file,id=mem,size=1024M,mem-path=/dev/hugepages,share=on -numa node,memdev=mem -mem-prealloc \
        -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/dpdkvhost0 \
        -netdev type=vhost-user,id=net1,chardev=char1,vhostforce \
        -device virtio-net-pci,netdev=net1,mac=00:11:22:33:44:00 &

        and I was able to login to the VM. And I believe you are right, the permission of the file descriptor is the key problem. But I don’t know how to fix it, as the permission looks correct to me. Any suggestions?

        3. I think it’s very helpful to try with both of them (qemu cmd and libvirt) to make everything work correctly. 🙂

        Thanks again for your help.

  8. We bring 2 VMs up with vhost-user type. We can see eth0 interfaces created in 2 VMs with proper mac address we assign. After IP address assignment, 2 VMs could not PING to each other when they are in the same network segment.

    However, when we check the link state via the ovs-ofctl utility, it shows LINK_DOWN as below. Does anyone have experience with this and can offer some help? Thanks!!!

    $ sudo ./utilities/ovs-ofctl show ovsbr0
    OFPT_FEATURES_REPLY (xid=0x2): dpid:00000670da615e4a
    n_tables:254, n_buffers:256
    capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP
    actions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst
    1(vhost-user1): addr:00:00:00:00:00:00
    config: PORT_DOWN
    state: LINK_DOWN
    speed: 0 Mbps now, 0 Mbps max
    2(vhost-user2): addr:00:00:00:00:00:00
    config: PORT_DOWN
    state: LINK_DOWN
    speed: 0 Mbps now, 0 Mbps max
    LOCAL(ovsbr0): addr:06:70:da:61:5e:4a
    config: PORT_DOWN
    state: LINK_DOWN
    current: 10MB-FD COPPER
    speed: 10 Mbps now, 0 Mbps max
    OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0

    • Yes, I'm getting the same. Maybe it's related to netdev-dpdk (instead of using the kernel netdev to get the link status, it should poll the dpdk PMD driver).
      I would post this question to the openvswitch.org mailing lists.
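      As a possible workaround (an editorial suggestion, not from the original thread, so verify on your setup): the administrative PORT_DOWN flag can be cleared via ovs-ofctl, and on netdev-dpdk ports traffic may flow regardless of the reported link state:

      ```shell
      # Force the OpenFlow port state up on the vhost-user ports.
      ovs-ofctl mod-port ovsbr0 vhost-user1 up
      ovs-ofctl mod-port ovsbr0 vhost-user2 up

      # Re-check the reported config/state.
      ovs-ofctl show ovsbr0
      ```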

  9. Hi Ran,
    I am using OVS 2.4. I was facing an issue while creating the dpdkvhostuser interface. I observe these messages in my logs:
    1] could not create netdev dpdkvhost0 of unknown type dpdkvhostuser
    2] could not open network device dpdkvhost0 (Address family not supported by protocol)
    Do you know what I could be doing wrong?

    Thanks and best regards,
    Yash.

    • Verify that you configured OVS with dpdk support: ./configure --with-dpdk=…
      Check config.log to see how it was configured previously.
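      A minimal sketch of the build sequence, assuming a dpdk-2.0.0 tree built for the x86_64-native-linuxapp-gcc target (the paths are illustrative):

      ```shell
      # Build OVS against the DPDK static libraries.
      cd openvswitch
      ./configure --with-dpdk=$RTE_SDK/x86_64-native-linuxapp-gcc
      make && make install

      # Confirm that the build actually picked up DPDK.
      grep -i dpdk config.log | head
      ```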

    • I got the same error, even though I followed the steps exactly.
      I am able to add dpdkvhostcuse, but I get the following error when I insert dpdkvhostuser:
      "could not open network device dpdkvhost0 (Address family not supported by protocol)"
      even though I have configured ovs with dpdk and enabled vhost-user in 'config/common_linuxapp'.

    • Dear Ran,

      We faced the same issue;
      we were not even able to create an interface of type dpdkvhostcuse.
      We checked with the ovs-vsctl list Open_vSwitch command just to verify that OVS was compiled with the interface types dpdk & dpdkvhostuser enabled. Both types were absent.

      It was an issue of executing ovs-vsctl/ovs-vswitchd from a relative path. After executing them from the exact path, our issue got resolved.

      Thanks,
      Abhijeet

  10. I am able to add dpdkvhostcuse, but I get the following error when I insert dpdkvhostuser:
    "could not open network device dpdkvhost0 (Address family not supported by protocol)"
    even though I have configured ovs with dpdk and enabled vhost-user in 'config/common_linuxapp'.

    • Hi Haris,

      Look carefully at INSTALL.DPDK; you configured DPDK to build dpdkvhostcuse instead of dpdkvhostuser.
      They can't reside together. Revisit INSTALL.DPDK with a clean common_linuxapp file, and pay attention that you are following the vhost-user sections and not the vhost-cuse ones.

      Regards,
      Ran.

      • Hi Ran,
        Thank You for reply.
        Well, I thought you might think that, but I've confirmed many times that I have set the following option to 'y':
        CONFIG_RTE_LIBRTE_VHOST=y
        I have compiled and recompiled many times with this configuration, but it still does not build vhost-user. I am confused as to why.

  11. I've read INSTALL.DPDK many times now and am pretty sure I'm using the right configuration in 'config/common_linuxapp' for vhost-user. But surprisingly, it gives an error for vhost-user and no error for vhost-cuse.

  12. I did all of the above according to your steps, then got this result:
    root@ubuntu:~/src/dpdk-2.0.0# ovs-vsctl show
    29eb6df1-46e1-4dfa-8124-bc94c5daa8bd
    Bridge "br0"
    Port "br0"
    Interface "br0"
    type: internal
    Port "dpdk1"
    Interface "dpdk1"
    type: dpdk
    Port "dpdk0"
    Interface "dpdk0"
    type: dpdk
    but then I add a new port, e.g.:
    root@ubuntu:~/src/dpdk-2.0.0# ovs-vsctl add-port br0 dpdk2 -- set Interface dpdk2 type=dpdk
    ovs-vsctl: Error detected while setting up 'dpdk2'. See ovs-vswitchd log for details.
    root@ubuntu:~/src/dpdk-2.0.0# ovs-vsctl show
    29eb6df1-46e1-4dfa-8124-bc94c5daa8bd
    Bridge "br0"
    Port "br0"
    Interface "br0"
    type: internal
    Port "dpdk1"
    Interface "dpdk1"
    type: dpdk
    Port "dpdk0"
    Interface "dpdk0"
    type: dpdk
    Port "dpdk2"
    Interface "dpdk2"
    type: dpdk
    error: "could not open network device dpdk2 (No such device)"
    I get this error; what is wrong here?

    • You can only add as many physical dpdk ports as you have ports bound to the igb_uio driver.
      In your case, it seems that you have bound only two ports, so adding another throws an error.
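      A sketch of checking and extending the binding (dpdk-2.0.0 tool path; the PCI address is illustrative, take it from the --status output):

      ```shell
      # List which NICs are bound to the DPDK-compatible driver.
      ${RTE_SDK}/tools/dpdk_nic_bind.py --status

      # Bind a third NIC before adding dpdk2 to the bridge.
      ${RTE_SDK}/tools/dpdk_nic_bind.py --bind=igb_uio 0000:02:00.0
      ```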
