Tuesday, October 29, 2013

How does clock_gettime work

clock_gettime() is a function that, as its name suggests, gives the time. On x86 architectures it has a vDSO implementation. The vDSO (virtual dynamic shared object) is a small shared library that the kernel maps into the address space of every user process. It allows the kernel to export functions to userland so that userspace processes can call them without the overhead of a system call.
clock_gettime() takes two arguments: the id of the wanted clock, and a pointer to a struct timespec variable in which the result is stored. struct timespec is simply a structure that contains two fields, tv_sec for seconds and tv_nsec for nanoseconds:

struct timespec {
    __kernel_time_t tv_sec;     /* seconds */
    long    tv_nsec;    /* nanoseconds */
};
Note: This blog post focuses on the clock ids CLOCK_MONOTONIC and CLOCK_REALTIME, as these are the clocks that the LTTng tracer uses to timestamp recorded events in userspace tracing.
The time returned by clock_gettime() is relative to a certain time reference, i.e. some specific event in the past. The main difference on Linux between CLOCK_MONOTONIC and CLOCK_REALTIME is this reference. CLOCK_REALTIME gives the "real time", as in the wall clock time, or the time on your watch. Its time reference is the Epoch, which is defined as 00:00:00 UTC on January 1, 1970. If I call:
clock_gettime(CLOCK_REALTIME, &ts);
at the time I am writing this post, the returned values are the following:
ts.tv_sec = 1383065479, ts.tv_nsec = 750367192.
If we take the number of seconds and convert it to years (dividing it by 3600, then 24, then 365.25), we get about 43.83. This means that 43.83 years have elapsed since the Epoch up until the moment I called clock_gettime(CLOCK_REALTIME, &ts). This also means that if I manually change the clock (or the date) of my system, the change will be reflected in the value returned by clock_gettime(CLOCK_REALTIME, &ts). Note that this is also true for time changes made by NTP. Thus, the time given by the CLOCK_REALTIME clock is not monotonic: it is not guaranteed to keep increasing, and can jump backwards as well as forwards.

This helps us introduce the other clock id, CLOCK_MONOTONIC. This clock is, as you could have guessed, updated in a monotonic fashion. In other words, consecutive reads of this clock never give descending values; this clock cannot go back in time, even if the clock of my system is changed. (NTP can still adjust its rate slightly, but it never makes it jump.) Its time reference is the boot time of the system. Note that this is specific to Linux, and not to all POSIX systems, where the starting point is unspecified. The time returned by clock_gettime(CLOCK_MONOTONIC, &ts) is therefore the elapsed time since the system boot, not counting any time the system spent suspended. If I call:
clock_gettime(CLOCK_MONOTONIC, &ts);
I get the following values:
ts.tv_sec = 103941, ts.tv_nsec = 959414826
Meaning that my (Linux) system booted 103941/3600 = about 28.9 hours ago. We can clearly see why this time reference guarantees monotonicity. The elapsed time since boot is independent of the wall clock time. If I change the clock of my system, the value given by the CLOCK_MONOTONIC clock is still relative to the boot time, which hasn't changed.
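To try this on your own machine, here is a minimal sketch that reads both clocks and applies the same conversions as above. The helper names (seconds_to_years, seconds_to_hours, print_clocks) are mine and only serve the illustration; the printed values will of course differ from the ones in this post.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

/* Convert a number of seconds to years of 365.25 days, as done in the text. */
double seconds_to_years(time_t sec)
{
    return (double)sec / 3600.0 / 24.0 / 365.25;
}

/* Convert a number of seconds to hours. */
double seconds_to_hours(time_t sec)
{
    return (double)sec / 3600.0;
}

/* Read both clocks and print them. Returns 0 on success, -1 on error. */
int print_clocks(void)
{
    struct timespec rt, mono;

    if (clock_gettime(CLOCK_REALTIME, &rt) != 0 ||
        clock_gettime(CLOCK_MONOTONIC, &mono) != 0)
        return -1;

    printf("CLOCK_REALTIME:  tv_sec = %lld, tv_nsec = %ld (%.2f years since the Epoch)\n",
           (long long)rt.tv_sec, rt.tv_nsec, seconds_to_years(rt.tv_sec));
    printf("CLOCK_MONOTONIC: tv_sec = %lld, tv_nsec = %ld (%.1f hours since its reference)\n",
           (long long)mono.tv_sec, mono.tv_nsec, seconds_to_hours(mono.tv_sec));
    return 0;
}
```

Calling print_clocks() from a main() function is enough to see both values. On older glibc versions you may need to link with -lrt for clock_gettime().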

As you can see, CLOCK_MONOTONIC is better for ordering events during the lifetime of a session, whereas CLOCK_REALTIME is better when an absolute time is needed. LTTng uses the monotonic clock to assign a timestamp to the recorded events in a trace. However, since it is more useful to have an actual wall clock time, LTTng stores the difference between CLOCK_REALTIME and CLOCK_MONOTONIC at the beginning of the tracing in a metadata file. When LTTng is done tracing, the timestamps can be converted from time since boot to absolute time by adding that value to all recorded timestamps.
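The offset idea can be sketched in a few lines of C. This is not LTTng's actual code, only an illustration of the principle under my own assumptions: the function names are hypothetical, and the offset is sampled once by reading the two clocks back to back.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000LL

/* Flatten a struct timespec into a single nanosecond count. */
int64_t timespec_to_ns(const struct timespec *ts)
{
    return (int64_t)ts->tv_sec * NSEC_PER_SEC + ts->tv_nsec;
}

/*
 * Sample the difference between the wall clock and the monotonic clock.
 * A tracer would do this once, when tracing starts, and store the result.
 * The two reads are not atomic, so the offset is off by the (small) time
 * that elapses between them.
 */
int64_t realtime_offset_ns(void)
{
    struct timespec rt, mono;

    clock_gettime(CLOCK_MONOTONIC, &mono);
    clock_gettime(CLOCK_REALTIME, &rt);
    return timespec_to_ns(&rt) - timespec_to_ns(&mono);
}

/* Convert a recorded monotonic timestamp to wall clock time. */
int64_t monotonic_to_realtime_ns(int64_t mono_ns, int64_t offset_ns)
{
    return mono_ns + offset_ns;
}
```

Since the offset is captured only once, events keep their monotonic ordering even if the wall clock is changed while tracing is in progress.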

Now let's take a look at the source code of the VDSO implementation of clock_gettime(), in file arch/x86/vdso/vclock_gettime.c from the kernel source tree:
notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
{
    int ret = VCLOCK_NONE;

    switch (clock) {
    case CLOCK_REALTIME:
        ret = do_realtime(ts);
        break;
    case CLOCK_MONOTONIC:
        ret = do_monotonic(ts);
        break;
    case CLOCK_REALTIME_COARSE:
        return do_realtime_coarse(ts);
    case CLOCK_MONOTONIC_COARSE:
        return do_monotonic_coarse(ts);
    }

    if (ret == VCLOCK_NONE)
        return vdso_fallback_gettime(clock, ts);
    return 0;
}
This code snippet simply calls the time function corresponding to the requested clock id. Assuming we asked for CLOCK_MONOTONIC, let's take a look at the do_monotonic() function, from the same file:
notrace static int do_monotonic(struct timespec *ts)
{
    unsigned long seq;
    u64 ns;
    int mode;

    ts->tv_nsec = 0;
    do {
        seq = read_seqcount_begin(&gtod->seq);
        mode = gtod->clock.vclock_mode;
        ts->tv_sec = gtod->monotonic_time_sec;
        ns = gtod->monotonic_time_snsec;
        ns += vgetsns(&mode);
        ns >>= gtod->clock.shift;
    } while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
    timespec_add_ns(ts, ns);
  
    return mode;
}

As you can see, all this function does is "fill" the ts structure that was given as a parameter with the current values of tv_sec and tv_nsec. The do-while loop is simply a synchronization scheme (the read is retried if the kernel updated the data in the meantime) and can be ignored for now.
ts->tv_sec is set to gtod->monotonic_time_sec, while the nanoseconds are computed from gtod->monotonic_time_snsec plus the value returned by vgetsns(), for finer granularity. Both are expressed in "shifted" nanoseconds, hence the right shift by gtod->clock.shift before timespec_add_ns() adds the result to ts. gtod is simply a structure that holds a copy of the actual values kept in the kernel, which userspace processes can't access directly. Therefore, the values in gtod have to be updated regularly. This update happens in update_vsyscall(struct timekeeper *tk), from file arch/x86/kernel/vsyscall_64.c:
void update_vsyscall(struct timekeeper *tk)
{
    struct vsyscall_gtod_data *vdata = &vsyscall_gtod_data;

    write_seqcount_begin(&vdata->seq);

    /* copy vsyscall data */
    [...]
  
    vdata->monotonic_time_sec = tk->xtime_sec      // (1)
          + tk->wall_to_monotonic.tv_sec;
    vdata->monotonic_time_snsec = tk->xtime_nsec   // (2)
          + (tk->wall_to_monotonic.tv_nsec
            << tk->shift);
    while (vdata->monotonic_time_snsec >=
          (((u64)NSEC_PER_SEC) << tk->shift)) {
        vdata->monotonic_time_snsec -=
          ((u64)NSEC_PER_SEC) << tk->shift;
        vdata->monotonic_time_sec++;
    }

    [...]

    write_seqcount_end(&vdata->seq);
}

In (1), monotonic_time_sec is set, and in (2), monotonic_time_snsec is set. These are the values that are "exported" to userland via the vsyscall_gtod_data structure. By digging a little deeper into the kernel source, we can get an idea of how and when this structure is updated.
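The pairing between write_seqcount_begin()/write_seqcount_end() on the kernel side and the do-while loop on the reader side can be imitated in userspace. The following is a simplified sketch using C11 atomics, not the kernel's implementation: struct time_page and both functions are hypothetical names, and a single writer is assumed.

```c
#include <stdatomic.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical stand-in for vsyscall_gtod_data. */
struct time_page {
    atomic_uint seq;          /* even: data is stable, odd: update in progress */
    _Atomic uint64_t sec;     /* seconds */
    _Atomic uint64_t snsec;   /* "shifted" nanoseconds: nanoseconds << shift */
    _Atomic uint32_t shift;
};

/* Writer side, in the spirit of update_vsyscall(). Single writer assumed. */
void time_page_write(struct time_page *p, uint64_t sec, uint64_t snsec,
                     uint32_t shift)
{
    unsigned s = atomic_load_explicit(&p->seq, memory_order_relaxed);

    /* Carry whole seconds out of the shifted nanoseconds field. */
    while (snsec >= (NSEC_PER_SEC << shift)) {
        snsec -= NSEC_PER_SEC << shift;
        sec++;
    }

    /* Make the sequence odd: readers now know an update is in progress. */
    atomic_store_explicit(&p->seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);

    atomic_store_explicit(&p->sec, sec, memory_order_relaxed);
    atomic_store_explicit(&p->snsec, snsec, memory_order_relaxed);
    atomic_store_explicit(&p->shift, shift, memory_order_relaxed);

    /* Make the sequence even again: the new data is published. */
    atomic_store_explicit(&p->seq, s + 2, memory_order_release);
}

/* Reader side, in the spirit of the loop in do_monotonic(). */
void time_page_read(struct time_page *p, uint64_t *sec, uint64_t *nsec)
{
    unsigned s0, s1;
    uint64_t snsec;
    uint32_t shift;

    do {
        s0 = atomic_load_explicit(&p->seq, memory_order_acquire);
        *sec = atomic_load_explicit(&p->sec, memory_order_relaxed);
        snsec = atomic_load_explicit(&p->snsec, memory_order_relaxed);
        shift = atomic_load_explicit(&p->shift, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        s1 = atomic_load_explicit(&p->seq, memory_order_relaxed);
        /* Retry if an update was in progress or completed during the read. */
    } while ((s0 & 1) || s0 != s1);

    *nsec = snsec >> shift;
}
```

The point of this scheme is that readers never take a lock and never block the writer: they simply redo the read in the rare case where it overlapped with an update.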

On kernels with a periodic tick (the frequency of the "ticks" is set by CONFIG_HZ):
Hardware timer interrupt (generated, for example, by the Programmable Interval Timer - PIT)
-> tick_periodic();
  -> do_timer(1);
    -> update_wall_time();
      -> timekeeping_update(tk, false);
        -> update_vsyscall(tk);

Or, on tickless kernels (see CONFIG_NO_HZ):
smp_apic_timer_interrupt()
  -> irq_enter()
    -> tick_check_idle()
      -> tick_check_nohz()
        -> tick_nohz_update_jiffies()
          -> tick_do_update_jiffies64()
            -> do_timer(ticks) // ex: ticks = 1344
              -> update_wall_time();
                -> timekeeping_update(tk, false);
                  -> update_vsyscall(tk);

So, to sum things up: clock_gettime() returns values that are updated regularly, plus an interpolation that gives better precision for the nanoseconds value. How regularly are these values updated? Simply upon timer interrupts.
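One way to observe this from userspace is to compare CLOCK_MONOTONIC with CLOCK_MONOTONIC_COARSE, one of the coarse clock ids that appear in the dispatch code above. My understanding is that the coarse variant returns the stored value without the vgetsns() interpolation, so the difference between the two should stay below roughly one tick on a ticking kernel; treat that as an assumption to verify on your system. The helper names below are mine.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Read a clock as a single nanosecond count, or -1 on error. */
int64_t clock_ns(clockid_t id)
{
    struct timespec ts;

    if (clock_gettime(id, &ts) != 0)
        return -1;
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * How far the interpolated clock is ahead of the value stored at the
 * last update. The coarse clock is read first so the result is not
 * negative.
 */
int64_t coarse_lag_ns(void)
{
    int64_t coarse = clock_ns(CLOCK_MONOTONIC_COARSE);
    int64_t fine = clock_ns(CLOCK_MONOTONIC);

    return fine - coarse;
}
```

Printing coarse_lag_ns() in a loop should show a value that grows until the next update and then drops back close to zero.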

Wednesday, October 2, 2013

Debug the Linux kernel using Qemu

As I am trying to understand in detail how virtualization works, I wanted to debug the Linux kernel in a virtual machine using Qemu, to get an idea of the real control flow during the execution of Linux in Qemu/KVM. So I looked on the internet for how to get this done, but most of the posts I found didn't work for me. The main reason is that they used Qemu to directly boot a kernel image from the host, which for some reason I wasn't able to do. In this post, I will use another way to debug the Linux kernel using Qemu.

The workaround is to give gdb (on the host) the same kernel as the one used by the virtual machine you want to debug. If you have compiled your own kernel on the host, this can easily be done by downloading the sources of the same kernel version in the VM and using the same configuration file you used on your host system. Note that you don't need to be running the same version on your host; you only need to provide gdb with the same vmlinux file as the one used in your VM. You could also do it the other way around: get the configuration file used to compile the kernel in the VM and use it to compile the same kernel version on the host.

These are the detailed steps from scratch, using Linux kernel version 3.11.3:
On the host, download the tarball from kernel.org and extract it (you can also use git or any other way, the only important thing is to have the same version both on the host and on the guest):
host $> wget https://www.kernel.org/pub/linux/kernel/v3.x/linux-3.11.3.tar.xz
host $> tar xvf linux-3.11.3.tar.xz
host $> cd linux-3.11.3
host $> make menuconfig
host $> make -jNUMBER_OF_THREADS
Now, it is very important that you add debugging information to your kernel. Make sure you have done this by inspecting the .config file created, and by making sure that CONFIG_DEBUG_INFO is set to "y":
CONFIG_DEBUG_INFO=y 

At this point, you should have a file called vmlinux in the current directory. Now you have two options. You can copy the entire directory linux-3.11.3 into your VM, cd into it and run:
guest $> sudo make modules_install
guest $> sudo make install

Or download the same tarball in your VM, copy only the file linux-3.11.3/.config from your host into linux-3.11.3 in your VM, and run:
guest $> make
guest $> sudo make modules_install
guest $> sudo make install

Both of these methods will get you to the same result: installing the same vmlinux file you have on your host in your VM. You can then reboot your VM, and try to boot using the kernel you just compiled. If you can't achieve this, you have a problem and you should review the steps from the beginning.

Assuming this worked, turn off the VM. Then use Qemu to turn it on again, while adding the options -s -S to qemu. If you're using virt-manager (libvirt), you should do this by modifying the xml file used by libvirt to configure your VM. Otherwise, you can simply use the following command:
host $> qemu-system-x86_64 -smp 4 -m 4096 /var/lib/libvirt/images/name-of-vm.img -s -S

Note: -smp 4 means create 4 virtual CPUs and -m 4096 means allocate 4GB of memory to the VM.

At this point, Qemu will have started and opened a gdbserver instance on port 1234 (-s option) but your VM will be stopped (because of the -S option). Now, from your host, launch gdb by giving it the vmlinux file we compiled earlier:
host $> gdb vmlinux

You will see a message on the output, ending with:
Reading symbols from some_path/linux-3.11.3/vmlinux...done.
If you see instead another message like:
Reading symbols from some_path/linux-3.9.2/vmlinux...(no debugging symbols found)...done.
Go back and make sure you have enabled the debugging information before compiling the kernel.

Then connect to the gdbserver launched by Qemu:
(gdb) target remote :1234
then:
(gdb) c

Now your VM will continue execution and start the boot process. Make sure you boot using the kernel we have just compiled.
From this point, you can use gdb as you would with any regular process. In the gdb shell, you can press Ctrl-C to pause the VM, then set a breakpoint on a specific function:
(gdb) b do_timer

Then continue execution:
(gdb) c

gdb will then break the next time the function do_timer() is called.

Tip: after sending the Ctrl-C signal to pause your VM, you can use Ctrl-X, Ctrl-A to show the source code of your kernel and the location at which the VM is paused.


Registering a probe to a kernel module using Systemtap

I was trying to register a probe on a function of a kernel module of mine using Systemtap. The .stp file was fairly simple: $> cat mymod...