Instruction caching for bhyve
Student: Mihai Carabas (<mihai AT SPAMFREE freebsd DOT org>)
Mentor: Neel Natu (<neel AT SPAMFREE freebsd DOT org>)
Project description
Hardware designers have not yet managed to virtualize every submodule of a CPU in hardware. One example is the guest interrupt controller: the Virtual Machine Monitor (VMM) must virtualize the guest's interrupts and interrupt controller. Because there is no hardware feature for this, the VMM must emulate all guest accesses to APIC control registers, which requires VM exits and VM entries: each access to these registers raises a VM exit that the VMM must handle. To do so, the VMM first has to decode the instruction that caused the exit and then emulate it.
The instruction fetch step is very expensive because you have to manually walk the guest page tables to find the guest physical address (function gla2gpa [1]), and then find the host physical address by manually walking the EPT (on Intel) page tables (function vm_gpa_hold, called from vmm_fetch_instruction). You cannot access the GPA directly because the EPT page tables are not compatible with the normal ones. Once you obtain the host physical address you can map it and decode the instruction. Instruction decoding is another very time-consuming operation because of the variable instruction length. So not only are we wasting time on the VM exits themselves, the time needed to handle each VM exit is also significant.
Approach to solving the problem
Given these desiderata, Neel Natu came up with the idea of caching instructions based on the address of the instruction (RIP). Caching decoded instructions does not seem to be done in KVM, and it would be a good feature for FreeBSD VMM/bhyve (you can find his thoughts here [2]).
We have established that instruction fetching and decoding bring significant overhead. The idea behind this project is simple: keep each decoded instruction in a dictionary (basically storing struct vie, which contains everything you need to know about the instruction [3]). At the next VM exit for instruction emulation, you check whether that instruction has already been decoded, keyed on the instruction pointer (RIP) of the virtual machine. But before reusing a cached instruction, you must be sure that nobody has modified the in-memory instruction. This can be accomplished by marking the page where the instruction is stored as read-only, so that writing to it causes a VM exit (because the EPT is in use). At that point one can check whether the RIP is in our dictionary (a hash table) and invalidate the cached instruction (eventually removing it from the hash table; this is the garbage collection Neel mentioned in his proposal). Handling the read/write attributes of the page tables is the main challenge here, because many corner cases can appear. These are the specific actions that need to be taken to accomplish our objective.
Deliverables/Milestones
Week 1-3
- build the bhyve test infrastructure
- find some suitable tests that generate a lot of interrupts and do not rely on any particular device driver (so as not to influence the results)
- start implementing the custom hash table whose key is the instruction pointer (RIP), the address of the struct vm, and the CR3 value (this identifies the address space for which the caching was done)
Week 4-6
- write unit tests for the custom hash table, especially for corner cases, so as not to introduce critical bugs into the kernel
- cache instructions by adding them to the hash table. This would be done in the vm_handle_inst_emul function [17]
- start looking at where to write-protect the page containing the instruction and where to handle the resulting fault
- run some tests (possibly performance tests) to see whether there is any improvement. Of course, there is a good chance of crashing the guest
MIDTERM (week 6)
- functional code with some dummy instruction caching tests
Week 7-9
- add protection to detect when the guest page tables are modified, by write-protecting all the page-table pages walked while translating the instruction pointer (RIP) to the guest physical address
- run some stress tests to see whether there is any corner case we've missed
Week 10-12
- test the solution on other guest OSes
- collect the results
- clean up the code
FINAL (week 12)
- a set of patches implementing instruction caching for FreeBSD/VMM
- a final report presenting the results, whether improved or not.
Test Plan
The real final goal of this project is the set of results obtained after implementing instruction caching. Before starting to test, we should identify where the overhead lies in the worst-case scenario: searching for the instruction in the hash table, missing, and then still having to fetch and decode it (this would be worse than the current behavior). However, as Neel mentioned, in the FreeBSD case this scenario will mostly occur at boot time.
The tests to be run will have to generate lots of interrupts (causing VM exits that need instruction decoding) in order to see the specific difference with and without instruction caching. Here I guess I could steal two IBM blades to install FreeBSD on, in order to ease the testing (I don't know whether a runtime configuration would work in this sensitive case).
Other test scenarios will use different guest OSes to see the gain in each.
Final Results and thoughts
First of all, you can find all the steps I've taken throughout the project on the soc-status list, in the thread "[GSOC] bhyve instruction caching".
By now I have managed to finish all the coding related to instruction caching. As you saw in my soc-status e-mails, we obtained a speed-up of 35%-40% in the microbenchmarking tests (accessing the LAPIC many times from a kernel module). Next we wanted to see how this extrapolates to real-world workloads.
I ran two kinds of benchmarks: a CPU-intensive process and a make buildworld -j2 command. For each one I measured the execution time.
The CPU intensive application
The CPU intensive application is a bash script:
a=0
MAX=10000000
for i in $(seq 1 $MAX); do
    a=$((a+1))
done
For a VM with 2 vCPUs:
- Cache_instr=1
real    3m45.067s   3m42.628s   3m38.371s   3m36.301s   3m39.929s
user    3m10.454s   3m8.785s    3m7.516s    3m8.204s    3m8.822s
sys     0m19.085s   0m16.135s   0m13.696s   0m13.016s   0m16.105s
- Cache_instr=0
real    3m50.550s   3m41.517s   3m34.783s
user    3m5.350s    3m7.571s    3m1.415s
sys     0m25.268s   0m19.200s   0m16.200s
Each column is a separate run. As you can see, the results aren't stable, although they fall in the same range. To reduce the variation, I repeated the tests with 1 vCPU (to eliminate context switches):
* Cache_instr=1
real    2m58.968s   2m57.009s   3m0.451s    2m55.902s   2m56.422s
user    2m46.909s   2m45.241s   2m45.670s   2m45.788s   2m45.503s
sys     0m4.890s    0m4.134s    0m3.942s    0m3.764s    0m3.984s
* Cache_instr=0
real    2m56.845s   2m57.051s   3m1.794s    2m57.340s
user    2m45.232s   2m44.873s   2m45.482s   2m46.538s
sys     0m4.644s    0m4.141s    0m3.906s    0m3.875s
As you can see, the variation is now much smaller, and the two configurations are almost identical.
make buildworld -j2
The results for a make buildworld -j2 with 1 vCPU:
* Cache_instr=1
13900.60 real 12051.54 user 1800.42 sys
* Cache_instr=0
13938.07 real 12122.14 user 1743.61 sys
As you can see, the difference between them is not significant; the times are about the same.
Unfortunately, for these two different kinds of workloads there is no speed-up.
I've tried other, more specific workloads:
- dd if=/dev/zero of=/dev/zero bs=256 count=10000K (from memory to memory, so as not to be influenced by the storage system)
- a simple getuid program that executes the getuid syscall in a loop:
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
	int i;

	if (argc == 2)
		i = atoi(argv[1]);
	else
		i = 100;
	while (i > 0) {
		getuid();
		i--;
	}
	return (0);
}
The results were the same with and without instruction caching.
It seems that we can't get a real-world benefit from this instruction caching.
The Code