Porting to VIMAGE
August 17 2009 Julian Elischer
VIMAGE: what is it?
VIMAGE is a framework in the BSD kernel which allows a co-operating module to operate on multiple independent instances of its state so that it can participate in a virtual machine / virtual environment scenario. It refers to a part of the Jail infrastructure in FreeBSD. For historical reasons "Virtual network stack enabled jails" (1) are also known as "VIMAGE enabled jails" (2) or "vnet enabled jails" (3). The currently correct term is the last of these, which is a contraction of the first. In the future other parts of the system may be virtualized using the same technology, and the term to cover all such components would be VIMAGE enhanced modules.
The implementation approach taken by the VIMAGE framework is a redefinition of selected global state variables to evaluate to constructs that allow for the virtualized state to be stored and resolved in appropriate instances of 'jail' specific container storage regions. The code operating on virtualized state has to conform to a set of rules described further below, among other things in order to allow for all the changes to be conditionally compilable, i.e. permitting the virtualized code to fall back to operation on global state.
The rest of this document will discuss NETWORK virtualization though the concepts may be true in the future for other parts of the system.
The most visible change throughout the existing code is typically replacement of direct references to global variables with macros; foo_bar thus becomes V_foo_bar. V_foo_bar macros will resolve back to the foo_bar global in default kernel builds, and alternatively to the logical equivalent of some_base_pointer->_foo_bar for "options VIMAGE" kernel configs.
Prepending of "V_" prefixes to variable references helps in visual discrimination between global and virtualized state. It is also possible to use an alternative syntax, of VNET(foo_bar) to achieve the same thing. The developers felt that V_foo_bar was less visually distracting while still providing enough clues to the reader that the variable is virtualized. In fact the V_foo_bar macro is locally defined near the definition of foo_bar to be an alias for VNET(foo_bar) so the two are not only equivalent, they are the same.
The framework also extends the sysctl infrastructure to support access to virtualized state through introduction of the SYSCTL_VNET family of macros; those also automatically fall back to their standard SYSCTL counterparts in default kernel builds.
Transparent libkvm(3) lookups are provided to virtualized variables which permits userland binaries such as netstat to operate unmodified on "options VIMAGE" kernels, though this may have some security implications.
Vnets are associated with jails. In 8.0, every process is associated with a jail, usually the default (null) jail, and jails currently hang off of a process's ucred. This relationship defines a process's administrative affinity to a vnet and thus indirectly to all of its state. All network interfaces and sockets hold pointers back to their associated vnets. This relationship is obviously entirely independent from proc->ucred->jail bindings. Hence, when a process opens a socket, the socket will get bound to a vnet instance hanging off of proc->ucred->jail->vnet, but once such a socket->vnet binding gets established, it cannot be changed for the entire socket lifetime.
The mapping from a thread to a vnet should always be done via the TD_TO_VNET macro, as the path may change in the future as we get more experience with using the system.
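The two bindings described above can be illustrated with a small userland model. This is only a sketch: the structures are toy stand-ins reduced to the fields involved, the TD_TO_VNET definition simply follows the thread -> ucred -> jail -> vnet path as described here (the real macro may differ), and toy_socreate() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; only the fields used here. */
struct vnet   { int id; };
struct prison { struct vnet *pr_vnet; };
struct ucred  { struct prison *cr_prison; };
struct thread { struct ucred *td_ucred; };
struct socket { struct vnet *so_vnet; };

/* Assumed shape: follow thread -> ucred -> jail -> vnet. */
#define TD_TO_VNET(td)	((td)->td_ucred->cr_prison->pr_vnet)

/* Hypothetical helper: the socket records its vnet once, at creation. */
static void
toy_socreate(struct socket *so, struct thread *td)
{
	so->so_vnet = TD_TO_VNET(td);
	assert(so->so_vnet != NULL);
}

/* Returns 1 if the socket keeps its vnet after the thread changes jail. */
int
socket_binding_survives_move(void)
{
	struct vnet v0 = { 0 }, v1 = { 1 };
	struct prison host = { &v0 }, jail = { &v1 };
	struct ucred cr = { &host };
	struct thread td = { &cr };
	struct socket so;

	toy_socreate(&so, &td);
	cr.cr_prison = &jail;	/* the thread is now attached to another jail */

	return (so.so_vnet == &v0 && TD_TO_VNET(&td) == &v1);
}
```

The point of the model is that code holding a socket should use so->so_vnet, while code that only has a thread should go through TD_TO_VNET().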
Certain classes of network interfaces (Ethernet in particular) can be reassigned from one vnet to another at any time. By definition all vnets are independent and can communicate only if they are explicitly provided with communication paths. Currently mainly netgraph is used to establish inter-vnet datapaths, though other paths are being explored such as the 'epair' back-to-back virtual interface pair, in which the different sides may exist in different jails.
In network traffic processing the vnet affinity is defined either by the inbound interface or by the socket / pcb -> vnet binding. However, there are many functions in the network stack that cannot implicitly fetch the vnet context from their standard arguments. Instead of explicitly extending argument lists of such functions with a struct vnet *, the concept of a "current vnet", a per-thread variable, was introduced, which can be fetched efficiently via the curvnet macro. The correct network context has to be set on entry to the network stack (socket operations, packet reception, or timer-driven functions) and cleared on exit. This must be done via the provided CURVNET_SET() / CURVNET_RESTORE() family of macros, which allow for "stacking" of curvnet context setting and provide additional debugging info in INVARIANTS kernel configs. In most cases, however, a developer writing virtualized code will not have to set / restore the curvnet context unless the code includes timer-driven events, given that those are inherently vnet-contextless on entry.
The current rule is that when not in networking code, the 'curvnet' macro will evaluate to NULL, and evaluating a V_xxx (or VNET(xxx)) macro will result in a kernel page-fault error. While this is not strictly necessary, it aids in debugging and assurance of program correctness. Note this does NOT mean that TD_TO_VNET(curthread) is invalid. A thread is always associated with a vnet; it is just that the efficient "curvnet" access method is disabled, along with the ability to resolve virtualized symbols.
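The "stacking" behaviour of the set / restore pair can be sketched in userland as follows. The macro bodies here are simplified assumptions (a saved copy in a local variable), not the kernel's definitions, and the INVARIANTS debugging support is left out.

```c
#include <assert.h>
#include <stddef.h>

struct vnet { int id; };

/* Per-thread "current vnet"; NULL means "not in networking code". */
static _Thread_local struct vnet *curvnet = NULL;

/*
 * Simplified stand-ins: SET remembers the previous value in a local
 * variable and RESTORE puts it back, so nested pairs "stack".
 */
#define CURVNET_SET(arg)				\
	struct vnet *saved_vnet = curvnet;		\
	curvnet = (arg)
#define CURVNET_RESTORE()	(curvnet = saved_vnet)

/* A nested entry point that switches to another vnet for a while. */
static int
inner(struct vnet *vnet)
{
	int seen;

	CURVNET_SET(vnet);
	seen = curvnet->id;
	CURVNET_RESTORE();
	return (seen);
}

/* Returns 1 if nesting and restoring behave as described. */
int
curvnet_stacking_ok(void)
{
	struct vnet a = { 1 }, b = { 2 };
	int ok;

	assert(curvnet == NULL);	/* outside the network stack */
	CURVNET_SET(&a);
	ok = (inner(&b) == 2);		/* the nested context saw b */
	ok = ok && (curvnet == &a);	/* the outer context is back */
	CURVNET_RESTORE();
	return (ok && curvnet == NULL);
}
```

Because the stand-in CURVNET_SET() declares a local variable, each function (or block) can contain only one such pair, and the pair must be balanced on every exit path.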
Converting / virtualizing existing code
There are several steps needed in virtualisation.
1. Decide whether the module needs to be virtualised.
- If the module is a driver for specific hardware, it makes sense that there be only one instance of the driver as there is only one piece of physical hardware. There are changes in the networking code to allow physical (or virtual) interfaces to be moved between vnets. This generally requires NO changes to the network drivers of the classes covered (e.g. ethernet). Currently if your module does not have any networking facet, the answer is "no" by default.
2. If the module is to be virtualised, decide which attributes of the module should be virtualised.
- For example, it may make sense that there be a single central pool of "struct foo" and a single uma zone for them to come from, with a single lock guarding it. It might also make sense if the "foo_debug" sysctl controls all the instances at once, while on the other hand, the "foo_mode" sysctl might make better sense if it were controllable on a virtual system by virtual system basis.
3. Work out what global variables and structures are to be virtualised to achieve the behaviour required for part #2.
4. Work out for all the code paths through the module, how the thread entering the module can divine which virtual environment it is on.
- Some examples:
- Since interfaces are all assigned to one vnet or another, an incoming packet has a pointer to the receive interface, which in turn has a pointer back to the vnet. Often "curvnet" will already have been set by the time your code is called anyhow.
- Similarly, on any request from outside the kernel (direct or indirect), the current thread has a way to get to the current virtual environment instance via TD_TO_VNET(curthread). For existing sockets the vnet context must be used via so->so_vnet since the thread's vnet might change after socket creation.
- Timer-initiated actions usually have a (void *) argument which points to some private structure for the module. It should be possible to add a pointer to the appropriate module instance into whatever structure that points to.
- Sometimes an action (timer-triggered, or triggered by module load or unload) simply has to check all the vimage or module instances. There are macro pairs for this which will iterate through all the VNET instances (see sample code below).
5. Decide which parts of the initialization and teardown are per jail and which parts are global, and separate out the code accordingly.
- Global initialization is done using the SYSINIT facility. Per jail initialization is done using VNET_SYSINIT(). Per jail teardown is done using VNET_SYSUNINIT(). Global teardown is done using SYSUNINIT(). In addition, the modevent handler is called with various event types before any of these are called. The modevent handler may veto load or teardown. On shutdown, only the modevent handler is called, so it may have to simulate the calling of the other handlers if clean shutdown is a requirement of your module (see sample code below). Don't forget to unregister event handlers, and destroy locks and condition variables.
6. Add the code described below to the files that make up the module.
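The timer case from step 4 can be sketched in userland as follows. Everything here is a toy model: struct mymod_timer_arg and its field are hypothetical names, the vnet structure is reduced to a single counter, and the CURVNET macros are simplified stand-ins rather than the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

struct vnet { int id; int ticks; };	/* toy vnet with one counter */

static struct vnet *curvnet = NULL;	/* per-thread in the kernel */
#define CURVNET_SET(arg)	struct vnet *saved_vnet = curvnet; curvnet = (arg)
#define CURVNET_RESTORE()	(curvnet = saved_vnet)
#define V_ticks			(curvnet->ticks)

/* Hypothetical private structure handed to the timer callback. */
struct mymod_timer_arg {
	struct vnet *vnet;	/* the instance this timer belongs to */
};

/* Timer handler: it arrives with no vnet context, so it sets one itself. */
void
mymod_timeout(void *arg)
{
	struct mymod_timer_arg *ta = arg;

	assert(curvnet == NULL);
	CURVNET_SET(ta->vnet);
	V_ticks++;
	CURVNET_RESTORE();
}

/* Returns 1 if only the vnet named in the argument was touched. */
int
timer_updates_only_its_vnet(void)
{
	struct vnet a = { 1, 0 }, b = { 2, 0 };
	struct mymod_timer_arg ta = { &b };

	mymod_timeout(&ta);
	mymod_timeout(&ta);
	return (a.ticks == 0 && b.ticks == 2);
}
```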
Details: (VNET implementation details)
Firstly the file <net/vnet.h> must be included. Depending on what code you use you may find you also need one or more of: <sys/proc.h>, <sys/ucred.h> and <sys/jail.h>. These requirements may change slightly as the ABI settles.
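Having decided which variables need to be virtualized, the definitions of those variables need to be modified to use the VNET_DEFINE() macro, which takes the type and the name as separate arguments. For example, a definition such as "static int foo = 3;" (foo being a placeholder name) would become "static VNET_DEFINE(int, foo) = 3;".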
Normal rules regarding 'static/extern' apply. The initial values that you give in this way will be stored and used as the initial values for EACH NEW INSTANCE of these variables as new jails/vnets are created.
As mentioned above, accesses to virtualized symbols are achieved via macros, which generally are of the same name as the original symbol but with a "V_" prepended; thus the head of the interface list, called 'ifnet', is replaced wherever used with "V_ifnet". We do this by adding a line of the form "#define V_ifnet VNET(ifnet)" after each of the definitions above.
In SCTP, because the code is shared with other OS's they are replaced with a macro MODULE_GLOBAL(modulename, symbol). (this may simplify in light of recent changes).
In addition, should any of your values need to be changed or viewed via sysctl, their SYSCTL definitions need to use the SYSCTL_VNET family of macros in place of the standard SYSCTL ones.
In the current version of vimage, when VIMAGE is not compiled into the kernel, the macros evaluate to a direct reference to the one and only symbol/variable, so that there is no speed penalty for those not using vnets.
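A converted definition and its accessor macro might look like the following sketch. So that the fragment builds outside the kernel, it supplies its own stand-ins modelled on the non-VIMAGE behaviour described above; real code would get VNET_DEFINE() and VNET() from <net/vnet.h>. The names foo, thebar, struct bar and mymod_sum() are placeholders.

```c
#include <assert.h>

/*
 * Stand-ins modelled on the non-VIMAGE fallback, for illustration only.
 * In kernel code these come from <net/vnet.h>.
 */
#define VNET_NAME(n)		n
#define VNET_DEFINE(t, n)	t VNET_NAME(n)
#define VNET(n)			(VNET_NAME(n))

struct bar { int a, b, c; };	/* hypothetical */

/* Before conversion: static int foo = 3; struct bar thebar = { 1, 2, 3 }; */
static VNET_DEFINE(int, foo) = 3;
VNET_DEFINE(struct bar, thebar) = { 1, 2, 3 };

/* Accessor macros, placed right after the definitions. */
#define V_foo		VNET(foo)
#define V_thebar	VNET(thebar)

int
mymod_sum(void)
{
	/* In this build V_foo is the one and only global itself. */
	assert(&V_foo == &foo);
	return (V_foo + V_thebar.c);	/* 3 + 3 */
}
```

The code that uses V_foo and V_thebar does not change between the two kinds of kernel; only the macro definitions behind them do.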
When VIMAGE is compiled in, the macro will evaluate to an access to an offset into a data structure that is accessed on a per-vnet basis. The vnet used for this is always curvnet. For this reason an attempt to access such a variable while curvnet is not valid will result in an exception.
To ensure that curvnet has a valid value when needed, one needs to bracket all entry code paths into the networking code with a CURVNET_SET(vnet) / CURVNET_RESTORE() pair.
The value passed to CURVNET_SET() is usually something like TD_TO_VNET(curthread), which in turn is a macro that derives the vnet affinity from the current thread. It could also be (m->m_ifp->if_vnet) if we were receiving an mbuf, or so->so_vnet if we had a socket involved.
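The "base plus offset" idea, together with the copying of initial values into each new instance, can be modelled in userland like this. It is only a sketch under stated assumptions: struct vnet_data, vnet_template, VNET_AT() and toy_vnet_alloc() are invented for the illustration, and the kernel's actual storage layout is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy layout of all virtualized variables of a module. */
struct vnet_data {
	int foo;
	int bar;
};

/* Template holding the initial values given at definition time. */
static const struct vnet_data vnet_template = { .foo = 3, .bar = 7 };

struct vnet { char *data_base; };	/* start of this vnet's copy */
static struct vnet *curvnet = NULL;	/* per-thread in the kernel */

/* Resolve a variable as "base of the current vnet's copy + offset". */
#define VNET_AT(type, n)					\
	(*(type *)(curvnet->data_base + offsetof(struct vnet_data, n)))
#define V_foo	VNET_AT(int, foo)

/* A new instance starts from a copy of the template. */
static void
toy_vnet_alloc(struct vnet *vnet)
{
	vnet->data_base = malloc(sizeof(vnet_template));
	assert(vnet->data_base != NULL);
	memcpy(vnet->data_base, &vnet_template, sizeof(vnet_template));
}

/* Returns 1 if a write in one vnet leaves the other one untouched. */
int
instances_are_independent(void)
{
	struct vnet a, b;
	int ok;

	toy_vnet_alloc(&a);
	toy_vnet_alloc(&b);

	curvnet = &a;
	V_foo = 42;			/* touches a's copy only */
	curvnet = &b;
	ok = (V_foo == 3);		/* b still has the initial value */
	curvnet = &a;
	ok = ok && (V_foo == 42);
	curvnet = NULL;

	free(a.data_base);
	free(b.data_base);
	return (ok);
}
```

With curvnet left at NULL, V_foo in this model would dereference a NULL pointer, which mirrors the exception mentioned above.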
Usually, when a packet enters the system it is carried through the processing path via a single thread, and that thread will set its virtual environment reference to that indicated by the packet on picking up that new packet. This means that in the normal inbound processing path as well as the outgoing process path the current thread can be used to indicate the current virtual environment and curvnet will always be valid once most user supplied code is reached. In timer events, it is sometimes necessary to add an "outer loop" to iterate through all the possible vnets if there is just one timer for all instances.
When a new loadable module is virtualised the module definitions and initializers need to be examined. The following example illustrates what is needed in the case that you are not loading a new protocol, or domain (for that see later).
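The "outer loop" for a single shared timer can be sketched in userland as follows. The iterator and lock macros carry the names used in this document, but their bodies here are simplified assumptions (a plain linked list and no-op locking); mymod_expire_one() and the expired counter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct vnet { struct vnet *next; int expired; };

static struct vnet *vnet_head = NULL;	/* toy list of all vnets */
static struct vnet *curvnet = NULL;	/* per-thread in the kernel */

/* Simplified stand-ins; the list lock is a no-op in this model. */
#define VNET_ITERATOR_DECL(v)	struct vnet *v
#define VNET_LIST_RLOCK()	do { } while (0)
#define VNET_LIST_RUNLOCK()	do { } while (0)
#define VNET_FOREACH(v)		for ((v) = vnet_head; (v) != NULL; (v) = (v)->next)
#define CURVNET_SET(arg)	struct vnet *saved_vnet = curvnet; curvnet = (arg)
#define CURVNET_RESTORE()	(curvnet = saved_vnet)
#define V_expired		(curvnet->expired)

/* Per-instance work; relies on the caller having set curvnet. */
static void
mymod_expire_one(void)
{
	assert(curvnet != NULL);
	V_expired++;
}

/* The single global timer: visit every vnet in turn. */
void
mymod_global_timeout(void *unused)
{
	VNET_ITERATOR_DECL(vnet_iter);

	(void)unused;
	VNET_LIST_RLOCK();
	VNET_FOREACH(vnet_iter) {
		CURVNET_SET(vnet_iter);
		mymod_expire_one();
		CURVNET_RESTORE();
	}
	VNET_LIST_RUNLOCK();
}

/* Returns 1 if each vnet was visited once and no context leaked. */
int
all_vnets_visited(void)
{
	struct vnet b = { NULL, 0 }, a = { &b, 0 };

	vnet_head = &a;
	mymod_global_timeout(NULL);
	vnet_head = NULL;
	return (a.expired == 1 && b.expired == 1 && curvnet == NULL);
}
```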
/* ============= sample skeleton code ========== */

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/module.h>
#include <net/vnet.h>

/* init on boot or module load */
static int
mymod_init(void)
{
	int error = 0;

	return (error);
}

/****************
 * Stuff that must be initialized for every instance
 * (including the first of course).
 */
static int
mymod_vnet_init(const void *unused)
{
	return (0);
}

/**********************
 * Called for the removal of the last instance only on module unload.
 */
static void
mymod_uninit(void)
{
}

/***********************
 * Called for the removal of each instance.
 */
static int
mymod_vnet_uninit(const void *unused)
{
	return (0);
}

static int
mymod_modevent(module_t mod, int type, void *unused)
{
	int err = 0;

	switch (type) {
	case MOD_LOAD:
		/* check that loading is ok */
		break;

	case MOD_UNLOAD:
		/* check that unloading is ok */
		break;

	case MOD_QUIESCE:
		/* warning: try stop processing */
		/* maybe sleep 1 mSec or something to let threads get out */
		break;

	case MOD_SHUTDOWN:
		/*
		 * this is called once but you may want to shut down
		 * things in each jail, or something global.
		 * In that case it's up to us to simulate the SYSUNINIT()
		 * or the VNET_SYSUNINIT()
		 */
		{
			VNET_ITERATOR_DECL(vnet_iter);
			VNET_LIST_RLOCK();
			VNET_FOREACH(vnet_iter) {
				CURVNET_SET(vnet_iter);
				mymod_vnet_uninit(NULL);
				CURVNET_RESTORE();
			}
			VNET_LIST_RUNLOCK();
		}
		/* you may need to shutdown something global. */
		mymod_uninit();
		break;

	default:
		err = EOPNOTSUPP;
		break;
	}
	return (err);
}

static moduledata_t mymodmod = {
	"mymod",
	mymod_modevent,
	0
};

/* define execution order using constants from /sys/sys/kernel.h */
#define MYMOD_MAJOR_ORDER	SI_SUB_PROTO_BEGIN		/* for example */
#define MYMOD_MODULE_ORDER	(SI_ORDER_ANY + 64)		/* not fussy */
#define MYMOD_SYSINIT_ORDER	(MYMOD_MODULE_ORDER + 1)	/* a bit later */
#define MYMOD_VNET_ORDER	(MYMOD_MODULE_ORDER + 2)	/* later still */

DECLARE_MODULE(mymod, mymodmod, MYMOD_MAJOR_ORDER, MYMOD_MODULE_ORDER);
MODULE_DEPEND(mymod, ipfw, 2, 2, 2); /* depend on ipfw version (exactly) 2 */
MODULE_VERSION(mymod, 1);

SYSINIT(mymod_init, MYMOD_MAJOR_ORDER, MYMOD_SYSINIT_ORDER,
    mymod_init, NULL);
SYSUNINIT(mymod_uninit, MYMOD_MAJOR_ORDER, MYMOD_SYSINIT_ORDER,
    mymod_uninit, NULL);

VNET_SYSINIT(mymod_vnet_init, MYMOD_MAJOR_ORDER, MYMOD_VNET_ORDER,
    mymod_vnet_init, NULL);
VNET_SYSUNINIT(mymod_vnet_uninit, MYMOD_MAJOR_ORDER, MYMOD_VNET_ORDER,
    mymod_vnet_uninit, NULL);

/* ========== end sample code ======= */
On BOOT, the order of evaluation will be:
- In a NON-VIMAGE kernel where the module is compiled:
- MODEVENT, SYSINIT and VNET_SYSINIT all run with order defined by their order declarations. {good foot shooting material if you get it wrong!}
- In a VIMAGE kernel where the module is compiled:
- MODEVENT, SYSINIT and VNET_SYSINIT all run with order defined by their order declarations. AND in addition, the VNET_SYSINIT is repeated once for every existing or new jail/vnet.
On loading a vnet enabled kernel module after boot:
- MODEVENT("event = load"), then SYSINIT(), then VNET_SYSINIT() for every existing jail
- AND in addition, VNET_SYSINIT is called for each new jail created.
On unloading of module:
- MODEVENT("event = MOD_QUIESCE"), then MODEVENT("event = MOD_UNLOAD"), then VNET_SYSUNINIT called for every jail/vnet, then SYSUNINIT
On system shutdown:
- MODEVENT(shutdown)
NOTICE that while the order of the SYSINIT and VNET_SYSINIT is reversed from that of SYSUNINIT and VNET_SYSUNINIT, MODEVENTS do not follow this rule and thus it is dangerous to initialise and uninitialise things which are order dependent using MODEVENTs.
Or, put another way: since MODEVENT is called first during module load, it would be easy to assume, on the premise that everything is reversed, that MODEVENT is called AFTER the SYSUNINITs during unload. This is in fact not the case (and I have the scars to prove it).
It might make some sense if the "QUIESCE" were called before the SYSINIT/SYSUNINIT and the UNLOAD called after, with a millisecond sleep between them, but this is not the case either.
Since initial values are copied into the virtualized variables on each new instantiation, it is quite possible to have modules for which some of the above methods are not needed, and they may be left out (but not the modevent).
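The hazard can be made concrete with a toy replay of the unload order listed above. This is a userland model only: shared_ready stands for some hypothetical global resource that the per-vnet teardown still needs, and the function simply walks through the events in the documented order.

```c
#include <assert.h>

static int shared_ready;	/* 1 while a hypothetical shared resource exists */
static int use_after_release;	/* per-vnet teardowns that found it gone */

/* Per-vnet teardown that still needs the shared resource. */
static void
vnet_uninit(void)
{
	if (!shared_ready)
		use_after_release++;
}

/*
 * Replay the unload order for nvnets vnets. If in_modevent is non-zero
 * the shared resource is released in the MOD_UNLOAD handler, otherwise
 * in the SYSUNINIT function. Returns the number of bad teardowns.
 */
int
unload_violations(int nvnets, int in_modevent)
{
	shared_ready = 1;
	use_after_release = 0;

	/* MOD_QUIESCE: nothing to do in this model */
	if (in_modevent)
		shared_ready = 0;	/* MOD_UNLOAD */
	for (int i = 0; i < nvnets; i++)
		vnet_uninit();		/* VNET_SYSUNINIT, once per vnet */
	if (!in_modevent)
		shared_ready = 0;	/* SYSUNINIT */
	return (use_after_release);
}
```

In this model, releasing the resource from the unload modevent makes every per-vnet teardown run without it, whereas releasing it from SYSUNINIT does not.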
Sometimes there is a need to iterate through the vnets. See the modevent shutdown handler (above) for an example of how to do this. Don't forget the locks.
In the case where you are loading a new protocol, or domain (protocol family), there are some "shortcuts" that are in place to allow you to maintain a bit more source compatibility with older revisions of FreeBSD. It must be added that the sample code above works just fine for protocols; however, protocols also have an additional initialization vector, which is via the protocol structure, which has a pr_init() entry. When a protocol is registered using pf_proto_register(), the pr_init() for the protocol is called once for every existing vnet. In addition, it will be called for each new vnet. The pr_destroy() method will be called as well on vnet teardown. The pf_proto_register() function can be called either from a modevent handler or from the SYSINIT() if you have one, and the pf_proto_unregister() called from the SYSUNINIT or the unload modevent handler.
If you are adding a whole new protocol domain (protocol family), then you should add the VNET_DOMAIN_SET(domainname) (e.g. inet, inet6) macro. This uses VNET_SYSINIT internally to indirectly call the dom_init() and pr_init() functions for each vnet (and the equivalent for teardown). In this case one needs to be absolutely sure that both your domain and protocol initializers can be called multiple times, once for each vnet. One can still add SYSINITs for once-only initialization, or use the modevent handler. I prefer to do as much as possible explicitly in the SYSINITs and VNET_SYSINITs, as then you have no surprises.
Finally, the command to make a new jail with a new vnet:
jail -c host.hostname=test path=/ vnet command=/bin/tcsh
jail -c host.hostname=test path=/ children.max=4 vnet command=/bin/tcsh
(children.max allows hierarchical jail creation). Note that the command must come last.