Wednesday, December 10, 2014

FreeBSD CAM Layer

The CAM Layer

To reduce the complexity of the individual disk drivers, much of the complexity of handling a modern controller has been abstracted out to a separate CAM layer that sits below the GEOM layer and above the device-driver level. The CAM layer handles the device-independent tasks of resource allocation and command routing. These tasks include the tracking of requests and notifications between the controller and its clients. They also include the routing of requests across the many I/O busses to get the request to the correct controller.

The CAM layer leaves to the device driver the device-specific operations, such as the setup and teardown of the DMA (Direct Memory Access) maps needed to do the I/O. Some device drivers can still get complex. For example, the Fibre Channel device driver has a lot of code to handle operations such as asynchronous topology changes as drives are removed and attached. A driver responds to a CAM request by converting the virtual address that will hold the data to the appropriate physical address. It then marshals the device-independent parameters, such as the I/O request, the physical address for the data, and the transfer length, into a firmware-specific format and executes the command. When the I/O completes, the driver returns the results back to the CAM layer.
In addition to disks, the CAM layer manages any other storage device that might be connected to the system, such as tape drives and flash-memory dongles. Other character devices, such as keyboards and mice, do not involve CAM; a USB keyboard is attached directly to the keyboard driver.

SCSI Subsystem

The CAM SCSI subsystem provides a uniform and modular system for the implementation of drivers to control various SCSI devices and to use different SCSI host adapters through host-adapter drivers. The CAM system is made up of three layers:
  1. The CAM peripheral layer that provides open, close, strategy, attach, and detach operations for the supported SCSI devices. CAM-supported devices include direct access (da)—disk drives, cdrom (cd)—CD-ROM drives, sequential access (sa)—tape drives, and changer (ch)—jukeboxes.
  2. The CAM transport layer that builds, executes, and interprets the results of SCSI commands. CAM starts by building a generic I/O command using a CCB (CAM Control Block). The CCB contains a CDB (Command Descriptor Block) composed of a 6- to 16-byte SCSI command (a simplified sketch of such a control block follows this list). For example, the command "READ_10, block_offset, count" gets back a status of success or various error codes. If there is an error, the drive may also include sense data to give more information about the cause of the error.
  3. The CAM SIM (Software Interface Module) or HBA (Host Bus Adapter) interface layer provides bus routing to devices. Its job is to allocate a path to the requested device, send a CCB action request to the device, and then collect notification of the I/O completion from the device.
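To make the division of labor concrete, the sketch below shows, in C, the kind of information a control block carries between these layers. The structure and field names are purely illustrative; the real FreeBSD definitions live in the CAM headers (sys/cam) and are considerably richer.

#include <stdint.h>

/*
 * Hypothetical, simplified control block for illustration only; it is
 * not the FreeBSD union ccb. It shows the kind of information the
 * transport layer assembles before handing a request to a SIM driver.
 */
struct example_cdb {
    uint8_t bytes[16];     /* 6- to 16-byte SCSI command descriptor block */
    uint8_t len;           /* number of valid CDB bytes */
};

struct example_ccb {
    uint32_t           func_code;  /* request type, e.g., a SCSI I/O request */
    uint32_t           target_id;  /* target addressed on the bus */
    uint32_t           lun;        /* logical unit on that target */
    struct example_cdb cdb;        /* the SCSI command itself */
    void              *data;       /* virtual address of the data buffer */
    uint32_t           data_len;   /* transfer length in bytes */
    int                status;     /* filled in when the request completes */
    uint8_t            sense[64];  /* sense data returned on error */
};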
The operation of the CAM layer is most easily understood by tracing an I/O request through it.

The Path of an I/O Request Through the CAM Subsystem

The path of a request through the CAM I/O subsystem is shown in Figure. In the FreeBSD framework, the filesystem sees a single contiguous disk. I/O requests are based on block numbers within this idealized disk. In Figure, the filesystem determines a set of blocks on which it wants to do I/O, and it passes this request down to the GEOM layer by calling the strategy() routine.

Figure The path of an I/O request through the CAM subsystem.




The GEOM layer takes the request and determines the disk to which the request should be sent. In this example the request is on a da SCSI disk. Sometimes a request may span several disks. When this happens, the GEOM layer breaks up the original request into a set of separate I/O requests, one for each of the disks on which the original request resides. Each of these new requests is passed down to the CAM layer by calling the appropriate strategy() routine for the associated disk (the dastrategy() routine in Figure).
The CAM dastrategy() routine gets the request and calls bioq_disksort(), which puts the request on the disk queue of the specified SCSI disk. The dastrategy() routine finishes by calling the xpt_schedule() function.
The xpt_schedule() function allocates and constructs a CCB (CAM Control Block) to describe the operation that needs to be done. If the disk supports tagged queueing, an unused tag is allocated if one is available. If tagged queueing is not supported or a tag is not available, the request is left on the queue of requests pending for the disk. If the disk is ready to accept a new command, the xpt_schedule() routine calls the drive start routine set up for it (dastart() in this example).
The dastart() routine takes the first request off the disk's queue and begins to service it using the CCB that was constructed by dastrategy(). Because the command is destined for a SCSI disk, dastart() needs to build a SCSI READ_10 command based on the information in the CCB. The resulting SCSI command, which includes a READ_10 header, a pointer to the virtual address that references the data to be transferred, and a transfer length, is placed in the CCB and given the type XPT_SCSI_IO. The dastart() routine then calls the xpt_action() routine to determine the bus and controller (adapter) to which the command should be sent.
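The READ_10 format itself comes from the SCSI standard: a 10-byte CDB with the opcode in byte 0, a big-endian logical block address in bytes 2 through 5, and a big-endian transfer length in bytes 7 and 8. The helper below is a hypothetical sketch of how such a CDB could be filled in; it is not the actual dastart() code.

#include <stdint.h>
#include <string.h>

#define SCSI_READ_10 0x28            /* standard opcode for READ(10) */

/*
 * Hypothetical helper: fill a 10-byte READ(10) CDB for a request that
 * starts at logical block 'lba' and spans 'nblocks' blocks. Multi-byte
 * fields in SCSI CDBs are big-endian.
 */
static void build_read10_cdb(uint8_t cdb[10], uint32_t lba, uint16_t nblocks)
{
    memset(cdb, 0, 10);
    cdb[0] = SCSI_READ_10;
    cdb[2] = (uint8_t)(lba >> 24);    /* logical block address, MSB first */
    cdb[3] = (uint8_t)(lba >> 16);
    cdb[4] = (uint8_t)(lba >> 8);
    cdb[5] = (uint8_t)lba;
    cdb[7] = (uint8_t)(nblocks >> 8); /* transfer length in blocks */
    cdb[8] = (uint8_t)nblocks;
}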
The xpt_action() routine returns a pointer to a cam_path structure that describes the controller to be used and has a pointer to the controller's action routine. In this example we are using the Adaptec SCSI controller, whose action routine is ahc_action(). The xpt_action() routine queues the CCB with its cam_path and schedules it to be processed.
The request is processed by calling the controller-specific action routine, ahc_action(). The ahc_action() routine gets the CCB and converts its generic SCSI command into a hardware-specific SCB (SCSI Control Block) to handle the command. The SCB is filled out from information in the CCB. It is also filled out with any hardware-specific information, and a DMA request descriptor is set up. The SCB is then passed to the driver firmware to be executed. Having completed its task, the CAM layer returns back to the caller of dastrategy().
The controller fulfills the request and DMAs the data to or from the location given in the SCB. When done, a completion interrupt arrives from the controller. The interrupt causes the ahc_done() routine to be run. The ahc_done() routine updates the CCB associated with the completed SCB from information in the SCB (command completion status, or sense information if there was an error). It then frees the previously allocated DMA resources and the completed SCB and passes the completed CCB back to the CAM layer by calling xpt_done().
The xpt_done() routine inserts the associated CCB into the completion notification queue and posts a software interrupt request for camisr(), the CAM interrupt service routine. When camisr() runs, it removes the CCB from the completion notification queue and calls the specified completion function, which maps to dadone() in this example.
The dadone() routine calls the biodone() routine, which notifies the GEOM layer that one of its I/O requests has finished. The GEOM layer aggregates all the separate disk I/O requests together. When the last I/O operation finishes, it updates the original I/O request passed to it by the filesystem to reflect the result (either successful completion or details on any errors that occurred). The filesystem is then notified of the final result by calling the biodone() routine.

Monday, February 24, 2014

When to use which data structure!

A big question is which data structure to use and in which cases.

First, take a look at what you will be doing with the data items and ask yourself some questions:

do you need random access?
do you perform a lot of insertions? how about deletions?
do you allow duplicates?
are you searching for elements frequently?
does your data need to be ordered?
would you need to traverse the elements?
how big is your data?

these and other questions need to be considered before choosing which data structure to use. it is a good idea to start from the simplest structure to see if it satisfies your criteria. it would be a waste of time to slave over implementing a complicated structure when an array can fulfil all your needs.

//THE PROCESS


start with arrays...
arrays are a suitable structure to use when you know the size you'll need and the number of elements is reasonably small. if fast insertion is needed and you do not need to traverse the elements in a specified order, use an unordered array. however, if you need search to be fast, use an ordered array with binary search. this, however, will make insertion slower, so if you need both fast insertion and fast search, choose another structure. deletion is always slow in any kind of array, regardless of whether it is sorted, so if you are doing a lot of deletions an array is probably not the best structure for you. additionally, if you overestimate or underestimate the size of the array, you will either have to expand it (make a new, bigger array and copy all elements from the original into it - a costly operation) or you will have wasted memory. the biggest drawback of arrays is that the size must be known beforehand; failing to estimate it well results in slow operations or memory waste.
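as a quick illustration, here is a plain binary search over a sorted int array in c; the function and the sample array are made up just for this example.

#include <stdio.h>

/* binary search on a sorted array; returns the index of key, or -1 if absent */
int binary_search(const int *a, int n, int key)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (a[mid] == key)
            return mid;
        else if (a[mid] < key)
            lo = mid + 1;      /* key must be in the upper half */
        else
            hi = mid - 1;      /* key must be in the lower half */
    }
    return -1;                 /* not found */
}

int main(void)
{
    int a[] = {2, 5, 8, 12, 23, 38, 56, 72, 91};
    printf("%d\n", binary_search(a, 9, 23));   /* prints 4 */
    return 0;
}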

when arrays are not good enough, move on to linked lists
if you need a more flexible structure that does not require you to know the size ahead of time, a linked list is a good starting point. unordered linked lists offer constant time insertion (at the beginning or end of the list) since only references are changed and no items need to be shifted. deletion runs in O(N) time since the element being deleted needs to be found first, but the removal itself is cheap because, as with insertion, no items are shifted. searching is slow in a linked list because it can only be linear. remember that binary search cannot be used with an ordered list, since we cannot jump to elements in the middle of the list. also, if you need random access, use arrays or hash tables; linked lists are not the structure to use since they are built on links between nodes and every element can only be reached by walking from the first node.
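to make these trade-offs concrete, here is a minimal sketch in c of head insertion and linear search on a singly linked list; the names are made up for this example.

#include <stdlib.h>

/* minimal singly linked list node */
struct node {
    int value;
    struct node *next;
};

/* O(1) insertion at the head: only references change, nothing is shifted */
struct node *push_front(struct node *head, int value)
{
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return head;            /* out of memory; leave the list unchanged */
    n->value = value;
    n->next = head;
    return n;                   /* new head of the list */
}

/* O(N) linear search: the only kind of search a linked list supports */
struct node *find(struct node *head, int value)
{
    for (struct node *p = head; p != NULL; p = p->next)
        if (p->value == value)
            return p;
    return NULL;
}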

linked lists still not good enough, move on to binary trees
if you have looked at arrays and linked lists and decided that they are not satisfactory, a binary search tree might be what you need. it provides fast O(logN) time for all operations: search, insertion, and deletion. you can easily find the minimum and maximum values in the data, and in-order traversal is possible with the use of recursion. however, a tree degrades to O(N) speed when it becomes unbalanced. if you are sure that data will be inserted in random order, a regular binary search tree might be a sufficient solution. otherwise, a red-black tree or 2-3-4 tree that retains balance would be your best choice.
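for reference, here is a bare-bones c sketch of insertion and search in a plain (unbalanced) binary search tree; as noted above, without balancing it can degrade to O(N).

#include <stdlib.h>

struct tnode {
    int key;
    struct tnode *left, *right;
};

/* recursive insertion: O(logN) on average, O(N) if the tree degenerates */
struct tnode *bst_insert(struct tnode *root, int key)
{
    if (root == NULL) {
        struct tnode *n = malloc(sizeof *n);
        if (n != NULL) {
            n->key = key;
            n->left = n->right = NULL;
        }
        return n;
    }
    if (key < root->key)
        root->left = bst_insert(root->left, key);
    else if (key > root->key)
        root->right = bst_insert(root->right, key);
    /* duplicate keys are ignored in this sketch */
    return root;
}

/* search follows a single path from the root, so it is O(logN) when balanced */
struct tnode *bst_search(struct tnode *root, int key)
{
    while (root != NULL && root->key != key)
        root = (key < root->key) ? root->left : root->right;
    return root;
}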

the end of the line: hash tables
as we've seen in the last post, hash tables offer close to O(1) search and insertion. deletion also runs in O(1) time, assuming the deleted item is simply replaced with a special flag object that the search and insertion algorithms treat as an empty cell. hash tables are very fast if the load factor is kept suitable: around .5 or .66 for open addressing, and around 1 for separate chaining. beware, though, that any sort of ordered traversal of the elements inside the hash table is not possible. we can only search, insert, and delete (in the special way described earlier). hash tables are much faster than trees, but can degrade catastrophically when the load factor gets too big. since hash tables are based on arrays, it is important to have a rough idea of how many elements you will be storing. if you cannot accurately predict the number of elements beforehand, separate chaining is a better choice than open addressing when implementing your hash table.
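here is a toy c sketch of open addressing with linear probing, where a DELETED flag plays the role of the 'special flag object' mentioned above; the table size, hash function, and names are made up for illustration.

#include <stdio.h>

#define TABLE_SIZE 11
#define EMPTY     (-1)
#define DELETED   (-2)   /* the special flag left behind by a deletion */

/* toy open-addressing hash table for non-negative int keys */
static int table[TABLE_SIZE];

static void ht_init(void)
{
    for (int i = 0; i < TABLE_SIZE; i++)
        table[i] = EMPTY;
}

static int ht_insert(int key)
{
    for (int i = 0; i < TABLE_SIZE; i++) {
        int slot = (key + i) % TABLE_SIZE;          /* linear probe */
        if (table[slot] == EMPTY || table[slot] == DELETED) {
            table[slot] = key;                      /* reuse deleted cells */
            return slot;
        }
    }
    return -1;                                      /* table is full */
}

static int ht_search(int key)
{
    for (int i = 0; i < TABLE_SIZE; i++) {
        int slot = (key + i) % TABLE_SIZE;
        if (table[slot] == EMPTY)
            return -1;              /* a truly empty cell ends the probe */
        if (table[slot] == key)
            return slot;
        /* DELETED cells are skipped, so probing continues past them */
    }
    return -1;
}

static void ht_delete(int key)
{
    int slot = ht_search(key);
    if (slot >= 0)
        table[slot] = DELETED;      /* flag the cell, don't truly empty it */
}

int main(void)
{
    ht_init();
    ht_insert(7);
    ht_insert(18);     /* 18 also hashes to slot 7, so it probes to slot 8 */
    ht_delete(7);
    printf("%d\n", ht_search(18));  /* still found at slot 8 thanks to DELETED */
    return 0;
}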

//EXTERNAL STORAGE
i mentioned external storage in the post on b-trees. recall that accessing data on in external storage is much slower than access in main memory, so to increase efficiency while working with external storage we need to minimize the number of disk accesses. this happens if we increase the number of data per node, which can be done with a multi-way tree. this way, we can read in a whole block into main memory, and work from there to search for our key (supposing we are doing insertion). if the block contains 1000 data items, by fitting all these items into a single block we have reduced the number of disk accesses from 1000 to 1. this is the direction of thinking you need to be aware of while working with external storage and deciding which data structure to use.

//ABSTRACT DATA TYPES

stack:                       O(1) insertion, O(1) deletion
queue:                     O(1) insertion, O(1) deletion
priority queue:        O(N) insertion, O(1) deletion

to review, there are three ADTs covered here: the stack, the queue, and the priority queue. these are interfaces and can be implemented with either an array or a linked list (in the case of a priority queue, a heap can also be used). the stack is a last-in-first-out structure, and offers constant time insertion and deletion. the queue has the same efficiency, except that it is a first-in-first-out structure. a priority queue is a queue kept ordered by priority (from greatest to lowest key). with an array or linked-list implementation, insertion runs in O(N) time while removal is still constant time; a heap implementation brings both operations down to O(logN).
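a small c sketch of an array-backed stack shows why push and pop are both O(1): each touches only the top of the array.

#include <stdio.h>

#define STACK_MAX 100

/* fixed-size array-backed stack */
struct stack {
    int items[STACK_MAX];
    int top;                 /* index of the next free slot */
};

static int push(struct stack *s, int v)
{
    if (s->top == STACK_MAX)
        return -1;           /* overflow */
    s->items[s->top++] = v;
    return 0;
}

static int pop(struct stack *s, int *out)
{
    if (s->top == 0)
        return -1;           /* underflow */
    *out = s->items[--s->top];
    return 0;
}

int main(void)
{
    struct stack s = { .top = 0 };
    int v;
    push(&s, 1);
    push(&s, 2);
    pop(&s, &v);
    printf("%d\n", v);       /* prints 2: last in, first out */
    return 0;
}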

//A WORD ON SORTS

                     average               worst
bubble:         O(N^2)                same
selection:      O(N^2)                same
insertion:      O(N^2)                same
shellsort:       O(N^3/2)             same
mergesort:    O(NlogN)            same                [note: requires extra memory]
quicksort:     O(NlogN)           O(N^2)

if you need to sort your data, start with the simple sorts. insertion sort is the best of the O(N^2) sorts, so if you have a relatively small amount of data it will work fine for your needs, and it is also easy to implement. if you have roughly 1,000-5,000 items (figures are rough estimates), insertion sort is probably not good enough and shellsort can be examined next. if you have a large data set, you can turn to the more complex sorting algorithms: mergesort and quicksort, which run in O(NlogN) time. mergesort requires an extra workspace array the same size as the original (twice the total space), so if you are limited on memory it would not be the best choice and quicksort can be used instead. however, beware of quicksort's catastrophic degradation to O(N^2) time if the items are not in random order and a naive pivot is chosen. the table above summarizes these points.
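for completeness, here is a straightforward c implementation of insertion sort; simple to write, and fast enough for small arrays.

#include <stdio.h>

/* insertion sort: O(N^2) on average and in the worst case */
static void insertion_sort(int *a, int n)
{
    for (int i = 1; i < n; i++) {
        int key = a[i];
        int j = i - 1;
        while (j >= 0 && a[j] > key) {   /* shift larger elements right */
            a[j + 1] = a[j];
            j--;
        }
        a[j + 1] = key;                  /* drop the key into its place */
    }
}

int main(void)
{
    int a[] = {5, 2, 9, 1, 7};
    insertion_sort(a, 5);
    for (int i = 0; i < 5; i++)
        printf("%d ", a[i]);             /* prints 1 2 5 7 9 */
    printf("\n");
    return 0;
}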

Compilation stages for a C program

There are four main stages through which source code passes in order to finally become an executable.

The four stages for a C program to become an executable are the following:
  1. Pre-processing
  2. Compilation
  3. Assembly
  4. Linking
In Part-I of this article series, we will discuss the steps that the gcc compiler goes through when a C program source code is compiled into an executable.
Before going any further, let's take a quick look at how to compile and run a C program using gcc, with a simple hello world example.

$ vi print.c
#include <stdio.h>
#define STRING "Hello World"
int main(void)
{
    /* Using a macro to print 'Hello World' */
    printf(STRING);
    return 0;
}
 
Now, let's run the gcc compiler over this source code to create the executable.
$ gcc -Wall print.c -o print
In the above command:
  • gcc – Invokes the GNU C compiler
  • -Wall – gcc flag that enables a broad set of commonly useful warnings (despite the name, it does not enable every possible warning)
  • print.c – Input C program
  • -o print – Instructs the compiler to create the executable named print. If you don't specify -o, by default the compiler creates the executable with the name a.out
Finally, run ./print, which executes the C program and displays hello world.
$ ./print
Hello World
Note: When you are working on a big project that contains several C programs, use the make utility to manage your C program compilation, as we discussed earlier.
Now that we have a basic idea about how gcc is used to convert source code into a binary, we'll review the 4 stages a C program has to go through to become an executable.

1. PRE-PROCESSING

This is the very first stage through which source code passes. In this stage the following tasks are done:
  1. Macro substitution
  2. Comments are stripped off
  3. Expansion of the included files
To understand preprocessing better, you can compile the above ‘print.c’ program using the -E flag, which will print the preprocessed output to stdout.
$ gcc -Wall -E print.c
Even better, you can use the ‘-save-temps’ flag as shown below. The ‘-save-temps’ flag instructs the compiler to keep the temporary intermediate files generated by gcc in the current directory.
$ gcc -Wall -save-temps print.c -o print
So when we compile the program print.c with the -save-temps flag, we get the following intermediate files in the current directory (along with the print executable):
$ ls
print.i
print.s
print.o
The preprocessed output is stored in the temporary file that has the extension .i (i.e., ‘print.i’ in this example).
Now, let's open the print.i file and view the content.
$ vi print.i
......
......
......
......
# 846 "/usr/include/stdio.h" 3 4
extern FILE *popen (__const char *__command, __const char *__modes) ;
extern int pclose (FILE *__stream);
extern char *ctermid (char *__s) __attribute__ ((__nothrow__));

# 886 "/usr/include/stdio.h" 3 4
extern void flockfile (FILE *__stream) __attribute__ ((__nothrow__));
extern int ftrylockfile (FILE *__stream) __attribute__ ((__nothrow__)) ;
extern void funlockfile (FILE *__stream) __attribute__ ((__nothrow__));

# 916 "/usr/include/stdio.h" 3 4
# 2 "print.c" 2

int main(void)
{
printf("Hello World");
return 0;
}
In the above output, you can see that the source file is now filled with lots and lots of information, but at the end of it we can still see the lines of code written by us. Let's analyze these lines of code first.
  1. The first observation is that the argument to printf() now directly contains the string “Hello World” rather than the macro. In fact, the macro definition and usage have completely disappeared. This confirms the first task: all the macros are expanded in the preprocessing stage.
  2. The second observation is that the comment we wrote in our original code is not there. This confirms that all the comments are stripped off.
  3. The third observation is that the ‘#include’ line is missing, and in its place we see a whole lot of code. So it's safe to conclude that stdio.h has been expanded and literally included in our source file. Hence we understand how the compiler is able to see the declaration of the printf() function.
Searching the print.i file, we find that the function printf is declared as:
extern int printf (__const char *__restrict __format, ...);
The keyword ‘extern’ tells us that the function printf() is not defined here; it is external to this file. We will later see how gcc gets to the definition of printf().
Now that we have a decent understanding of what happens during the preprocessing stage, let us move on to the next stage.

2. COMPILING

After the compiler is done with the pre-processing stage, the next step is to take print.i as input, compile it, and produce an intermediate compiled output. The output file for this stage is ‘print.s’, and it contains assembly-level instructions.
Open the print.s file in an editor and view the content.
$ vi print.s
.file "print.c"
.section .rodata
.LC0:
.string "Hello World"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
movl $.LC0, %eax
movq %rax, %rdi
movl $0, %eax
call printf
movl $0, %eax
leave
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 4.4.3-4ubuntu5) 4.4.3"
.section .note.GNU-stack,"",@progbits
Though I am not much into assembly-level programming, a quick look shows that this output is a set of instructions that the assembler can understand and convert into machine-level language.

3. ASSEMBLY

At this stage the print.s file is taken as an input and an intermediate file print.o is produced. This file is also known as the object file.
This file is produced by the assembler, which understands the ‘.s’ file with its assembly instructions and converts it into a ‘.o’ object file containing machine-level instructions. At this stage only the existing code is converted into machine language; function calls like printf() are not yet resolved.
Since the output of this stage is a machine-level file (print.o), we cannot view its content as text. If you still try to open print.o and view it, you'll see something that is mostly unreadable.
$ vi print.o
^?ELF^B^A^A^@^@^@^@^@^@^@^@^@^A^@>^@^A^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@0^
^@UH<89>å¸^@^@^@^@H<89>ǸHello World^@^@GCC: (Ubuntu 4.4.3-4ubuntu5) 4.4.3^@^
T^@^@^@^@^@^@^@^AzR^@^Ax^P^A^[^L^G^H<90>^A^@^@^\^@^@]^@^@^@^@A^N^PC<86>^B^M^F
^@^@^@^@^@^@^@^@.symtab^@.strtab^@.shstrtab^@.rela.text^@.data^@.bss^@.rodata
^@.comment^@.note.GNU-stack^@.rela.eh_frame^@^@^@^@^@^@^@^@^@^@^@^
...
...
…
The only thing we can make out by looking at the print.o file is the string ELF.
ELF stands for Executable and Linkable Format.
This is a relatively new format for machine-level object files and executables produced by gcc. Prior to this, a format known as a.out was used. ELF is said to be a more sophisticated format than a.out (we might dig deeper into the ELF format in some future article).
Note: If you compile your code without specifying the name of the output file, the output file produced is still named ‘a.out’, but its format is ELF. It is just the default executable file name that remains the same.

4. LINKING

This is the final stage, at which all the linking of function calls with their definitions is done. As discussed earlier, up to this stage gcc doesn't know about the definition of functions like printf(). Until the compiler knows exactly where all of these functions are implemented, it simply uses a placeholder for each function call. It is at this stage that the definition of printf() is resolved and the actual address of the function is plugged in.
The linker comes into action at this stage and does this task.
The linker also does some extra work: it adds to our program some extra code that is required when the program starts and when the program ends. For example, there is standard code for setting up the running environment, such as passing command-line arguments and environment variables to the program. Similarly, there is standard code required to return the program's return value to the system.
These tasks of the linker can be verified with a small experiment. We already know that the linker converts the object file (print.o) into an executable file (print).
So if we compare the file sizes of print.o and print, we'll see the difference.
$ size print.o
   text    data     bss     dec     hex filename
     97       0       0      97      61 print.o 

$ size print
   text    data     bss     dec     hex filename
   1181     520      16    1717     6b5 print
Through the size command we get a rough idea of how the size of the output file grows from an object file to an executable file. This is all because of the extra standard code that the linker combines with our program.












C Program Compilation Steps

You compile a C program and get an executable. Have you ever wondered what happens during the compilation process and how a C program gets converted into an executable?
In this module we will learn what stages are involved in C program compilation using gcc on Linux.
Normally, the C program build process involves four stages to produce an executable:
  1. Preprocessing 
  2. Compilation
  3. Assembly 
  4. Linking  
The following figure shows the steps involved in building a C program, starting from preprocessing until the loading of the executable image into memory to run the program.
Figure: C program compilation steps

Compilation with gcc with different options

-E            Preprocess only; do not compile, assemble or link
-S            Compile only; do not assemble or link
-c            Compile and assemble, but do not link
-o  <file>  Place the output into <file>

We will use the hello.c program below to explain all four phases.
#include<stdio.h>        //Line 1
#define MAX_AGE  21   //Line 2
int main()
{
 printf( "Maximum age : %d ",MAX_AGE); //Line 5
}

1. Preprocessing

This is the very first stage through which source code passes. In this stage the following tasks are done:
  1. Macro substitution
  2. Comments are stripped off
  3. Expansion of the included files
To understand preprocessing better, you can compile the above ‘hello.c’ program using the -E flag with gcc. This will generate the preprocessed file hello.i.
Example:
>gcc  -E hello.c  -o hello.i
//hello.i file content
# 1 "hello.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "hello.c"
# 1 "/usr/include/stdio.h" 1 3 4
# 28 "/usr/include/stdio.h" 3 4
…………
…………
Truncated some text…
………
………
extern void funlockfile (FILE *__stream) __attribute__ ((__nothrow__));
# 918 "/usr/include/stdio.h" 3 4
 
# 2 "hello.c" 2
 
int main()
{
 printf( "Maximum age : %d ",21);
}
In the above code (hello.i) you can see that macros are substituted with their values (MAX_AGE with 21 in the printf statement), comments are stripped off (//Line 1, //Line 2 and //Line 5), and included header files are expanded (<stdio.h>).

2. Compilation

Compilation is the second stage. It takes the output of the preprocessor (hello.i) and generates assembler source code (hello.s).
> gcc -S hello.i  -o hello.s
//hello.s file content
.file   "hello.c"
        .section        .rodata
.LC0:
        .string "Maximum age : %d "
        .text
.globl main
        .type   main, @function
main:
.LFB0:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        movq    %rsp, %rbp
        .cfi_offset 6, -16
        .cfi_def_cfa_register 6
        movl    $.LC0, %eax
        movl    $21, %esi
        movq    %rax, %rdi
        movl    $0, %eax
        call    printf
        leave
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc
.LFE0:
        .size   main, .-main
        .ident  "GCC: (GNU) 4.4.2 20091027 (Red Hat 4.4.2-7)"
        .section        .note.GNU-stack,"",@progbits
The above code is assembly code, which the assembler can understand and convert into machine code.

3. Assembly

Assembly is the third stage of compilation. It takes the assembly source code (hello.s) and assembles it; the assembler output is machine code stored in an object file (hello.o).
>gcc -c hello.s -o hello.o
Since the output of this stage is a machine-level file (hello.o), we cannot view its content as text. If you still try to open hello.o and view it, you'll see something that is mostly unreadable.
//hello.o file content
^?ELF^B^A^A^@^@^@^@^@^@^@^@^@^A^@>^@^A^@^@^@^@^@^@^@^@^@^@^@^@^
@^@^@^@^@^@^@@^A^@^@^@^@^@^@^@^@^@^@@^@^@^@^@^@@^@^M^@^@UH<89>å¸
^@^@^@^@¾^U^@^@^@H<89>ç¸^@^@^@^@è^@^@^@^@éã^@^@^@Maximum age :%d
 ^@^@GCC:GNU)4.4.220091027(RedHat4.4.2-7)^@^@^T^@^@^@^@^@^@^@^AzR^
@^Ax^P^A^[^L^G^H<90>^A^@^@^\^@^@^@^\^@^@^@^@^@^@^@^]^@^@^@^@A^N^PC
<86>^B^M^FX^L^G^H^@^@^@^@.symtab^@.strtab^@.shstrtab^@.rela.text^@
.data^@.bss^@.rodata^@.comment^@.note.GNU-stack^@.rela.eh_frame^@
^@^@^@^@^@^@^@^@^@^@^
By looking at the above content, the only thing we can make out is the string ELF (Executable and Linkable Format). This is a relatively new format for machine-level object files and executables produced by gcc.

4. Linking  

Linking is the final stage of compilation. It takes one or more object files or libraries as input and combines them to produce a single executable file (hello). In doing so, it resolves references to external symbols, assigns final addresses to procedures/functions and variables, and revises code and data to reflect new addresses (a process called relocation).
> gcc hello.o -o hello
> ./hello
Maximum age : 21

Now you know the C program compilation steps (Preprocessing, Compilation, Assembly, and Linking). There are a lot more things to explain in the linking phase.

Wednesday, January 29, 2014

ZFS Source Tour





This page is designed to take users through a brief overview of the source code associated with the ZFS filesystem. It is not intended as an introduction to ZFS – it is assumed that you already have some familiarity with common terms and definitions, as well as a general sense of filesystem architecture.
Traditionally, we describe ZFS as being made up of three components: the ZPL (ZFS POSIX Layer), the DMU (Data Management Unit), and the SPA (Storage Pool Allocator). While this serves as a useful example for slideware, the real story is a little more complex. The following image gives a more detailed overview; each of its areas is described in more detail below, with links to the source code.
In this picture, you can still see the three basic layers, though there are quite a few more elements in each. In addition, we show zvol consumers, as well as the management path, namely zfs(1M) and zpool(1M). You'll find a brief description of all these subsystems below. This is not intended to be an exhaustive overview of exactly how everything works. In the end, the source code is the final documentation. We hope that it is easy to follow and well commented. If not, feel free to post to the ZFS discussion forum.

Filesystem Consumers

These are your basic applications that interact with ZFS solely through the POSIX filesystem APIs. Virtually every application falls into this category. The system calls are passed through the generic OpenSolaris VFS layer to the ZPL.

Device Consumers

ZFS provides a means to create 'emulated volumes'. These volumes are backed by storage from a storage pool but appear as a normal device under /dev. This is not a typical use case, but there are a small set of cases where this capability is useful. A small number of applications interact directly with these devices, but the most common consumer is a kernel filesystem or target driver layered on top of the device.

Management GUI

Solaris will ship with a web-based ZFS GUI in build 28. While not part of OpenSolaris (yet), it is an example of a Java-based GUI layered on top of the JNI.

Management Consumers

These applications are those which manipulate ZFS filesystems or storage pools, including examining properties and the dataset hierarchy. While there are some scattered exceptions (zoneadm, zoneadmd, fstyp), the two main applications are zpool(1M) and zfs(1M).

zpool(1M)

This command is responsible for creating and managing ZFS storage pools. Its primary purpose is to parse command line input and translate them into libzfs calls, handling any errors along the way. The source for this command can be found in usr/src/cmd/zpool. It covers the following files:
zpool_main.c – The bulk of the command, responsible for processing all arguments and subcommands
zpool_vdev.c – Code responsible for converting a series of vdevs into an nvlist representation for libzfs
zpool_iter.c – Code to iterate over some or all the pools in the system easily
zpool_util.c – Miscellaneous utility functions

zfs(1M)

This command is responsible for creating and managing ZFS filesystems. Similar to zpool(1M), its purpose is really just to parse command line arguments and pass the handling off to libzfs. The source for this command can be found in usr/src/cmd/zfs. It covers the following files:
zfs_main.c – The bulk of the command, responsible for processing all arguments and subcommands
zfs_iter.c – Code to iterate over some or all of the datasets in the system

JNI

This library provides a Java interface to libzfs. It is currently a private interface, and is tailored specifically for the GUI. As such, it is geared primarily toward observability, as the GUI performs most actions through the CLI. The source for this library can be found in usr/src/lib/libzfs_jni

libzfs

This is the primary interface for management apps to interact with the ZFS kernel module. The library presents a unified, object-based mechanism for accessing and manipulating both storage pools and filesystems. The underlying mechanism used to communicate with the kernel is ioctl(2) calls through /dev/zfs. The source for this library can be found in usr/src/lib/libzfs. It covers the following files:
libzfs_dataset.c – Primary interfaces for manipulating datasets
libzfs_pool.c – Primary interfaces for manipulating pools
libzfs_changelist.c – Utility routines for propagating property changes across children
libzfs_config.c – Read and manipulate pool configuration information
libzfs_graph.c – Construct dependent lists for datasets
libzfs_import.c – Discover and import pools
libzfs_mount.c – Mount, unmount, and share datasets
libzfs_status.c – Link to FMA knowledge articles based on pool state
libzfs_util.c – Miscellaneous routines

ZPL (ZFS POSIX Layer)

The ZPL is the primary interface for interacting with ZFS as a filesystem. It is a (relatively) thin layer that sits atop the DMU and presents a filesystem abstraction of files and directories. It is responsible for bridging the gap between the OpenSolaris VFS interfaces and the underlying DMU interfaces. It is also responsible for enforcing ACL (Access Control List) rules as well as synchronous (O_DSYNC) semantics.
zfs_vfsops.c – OpenSolaris VFS interfaces
zfs_vnops.c – OpenSolaris vnode interfaces
zfs_znode.c – Each vnode corresponds to an underlying znode
zfs_dir.c – Directory operations
zfs_acl.c – ACL implementation
zfs_ctldir.c – Pseudo filesystem implementation for .zfs/snapshot
zfs_log.c – ZPL interface to record intent log entries
zfs_replay.c – ZPL interface to replay intent log entries
zfs_byteswap.c – Byteswap routines for ZPL data structures

ZVOL (ZFS Emulated Volume)

ZFS includes the ability to present raw devices backed by space from a storage pool. These are known as 'zvols' within the source code, and the feature is implemented by a single file in the ZFS source.
zvol.c – Volume emulation interface

/dev/zfs

This device is the primary point of control for libzfs. While consumers could consume the ioctl(2) interface directly, it is closely entwined with libzfs, and not a public interface (not that libzfs is, either). It consists of a single file, which does some validation on the ioctl() parameters and then vectors the request to the appropriate place within ZFS.
zfs_ioctl.c – /dev/zfs ioctl() interface
zfs.h – Interfaces shared between kernel and userland

DMU (Data Management Unit)

The DMU is responsible for presenting a transactional object model, built atop the flat address space presented by the SPA. Consumers interact with the DMU via objsets, objects, and transactions. An objset is a collection of objects, where each object is an arbitrary piece of storage from the SPA. Each transaction is a series of operations that must be committed to disk as a group; it is central to the on-disk consistency for ZFS.
dmu.c – Primary external DMU interfaces
dmu_objset.c – Objset open/close/manipulate external interfaces
dmu_object.c – Object allocate/free interfaces
txg.c – Transaction model control threads
dmu_tx.c – Transaction creation/manipulation interfaces
dnode.c – Open context object-level manipulation functions
dnode_sync.c – Syncing context object-level manipulation functions
dbuf.c – Buffer management functions
dmu_zfetch.c – Data stream prefetch logic
refcount.c – Generic reference counting interfaces
dmu_send.c – Send and receive functions for snapshots

DSL (Dataset and Snapshot Layer)

The DSL aggregates DMU objsets into a hierarchical namespace, with inherited properties, as well as quota and reservation enforcement. It is also responsible for managing snapshots and clones of objsets.
For more information on how snapshots are implemented, see Matt's blog entry.
dsl_dir.c – Namespace management functions
dsl_dataset.c – Snapshot/rollback/clone support interfaces
dsl_pool.c – Pool-level support interfaces
dsl_prop.c – Property manipulation functions
unique.c – Support functions for unique objset IDs

ZAP (ZFS Attribute Processor)

The ZAP is built atop the DMU, and uses scalable hash algorithms to create arbitrary (name, object) associations within an objset. It is most commonly used to implement directories within the ZPL, but is also used extensively throughout the DSL, as well as a method of storing pool-wide properties. There are two very different ZAP algorithms, designed for different types of directories. The "micro zap" is used when the number of entries is relatively small and each entry is reasonably short. The "fat zap" is used for larger directories, or those with extremely long names.
zap.c – Fat ZAP interfaces
zap_leaf.c – Low-level support functions
zap_micro.c – Micro ZAP interfaces

ZIL (ZFS Intent Log)

While ZFS provides always-consistent data on disk, it follows traditional filesystem semantics where the majority of data is not written to disk immediately; otherwise performance would be pathologically slow. But there are applications that require more stringent semantics, where the data is guaranteed to be on disk by the time the read(2) or write(2) call returns. For those applications requiring this behavior (specified with O_DSYNC), the ZIL provides the necessary semantics using an efficient per-dataset transaction log that can be replayed in the event of a crash.
For a more detailed look at the ZIL implementation, see Neil's blog entry.
zil.c – Intent Log

Traversal

Traversal provides a safe, efficient, restartable method of walking all data within a live pool. It forms the basis of resilvering and scrubbing. It walks all metadata looking for blocks modified within a certain period of time. Thanks to the copy-on-write nature of ZFS, this has the benefit of quickly excluding large subtrees that have not been touched during an outage period. It is fundamentally a SPA facility, but has intimate knowledge of some DMU structures in order to handle snapshots, clones, and certain other characteristics of the on-disk format.
dmu_traverse.c – Traversal code

ARC (Adaptive Replacement Cache)

ZFS uses a modified version of an Adaptive Replacement Cache to provide its primary caching needs. This cache is layered between the DMU and the SPA, so it acts at the virtual block level. This allows filesystems to share their cached data with their snapshots and clones.
arc.c – Adaptive Replacement Cache implementation

Pool Configuration (SPA)

While the entire pool layer is often referred to as the SPA (Storage Pool Allocator), the configuration portion is really the public interface. It is responsible for gluing together the ZIO and vdev layers into a consistent pool object. It includes routines to create and destroy pools from their configuration information, as well as sync the data out to the vdevs on regular intervals.
spa.c – Routines for opening, importing, exporting, and destroying pools
spa_misc.c – Miscellaneous SPA routines, including locking
spa_config.c – Parse and update pool configuration data
spa_errlog.c – Log of persistent pool-wide data errors
spa_history.c – Persistent ring buffer of successful commands that modified the pool's state
zfs_fm.c – Routines to post ereports for FMA consumption

ZIO (ZFS I/O Pipeline)

The ZIO pipeline is where all data must pass when going to or from the disk. It is responsible for translating DVAs (Device Virtual Addresses) into logical locations on a vdev, as well as checksumming and compressing data as necessary. It is implemented as a multi-stage pipeline, with a bit mask to control which stage gets executed for each I/O. The pipeline itself is quite complex, but can be summed up by the following diagram:
I/O types   Stage                      Category
RWFCI       open                       ZIO state
RWFCI       wait for children ready    ZIO state
-W---       write compress             Compression
-W---       checksum generate          Checksum
-WFC-       gang pipeline              Gang blocks
-WFC-       get gang header            Gang blocks
-W---       rewrite gang header        Gang blocks
--F--       free gang members          Gang blocks
---C-       claim gang members         Gang blocks
-W---       DVA allocate               DVA management
--F--       DVA free                   DVA management
---C-       DVA claim                  DVA management
-W---       gang checksum generate     Checksum
RWFCI       ready                      ZIO state
RW--I       I/O start                  Vdev I/O
RW--I       I/O done                   Vdev I/O
RW--I       I/O assess                 Vdev I/O
RWFCI       wait for children done     ZIO state
R----       checksum verify            Checksum
R----       read gang members          Gang blocks
R----       read decompress            Compression
RWFCI       done                       ZIO state
I/O types
Each phase of the I/O pipeline applies to a certain type of I/O. The letters stand for (R)ead, (W)rite, (F)ree, (C)laim, and (I)octl.
ZIO state
These are internal states used to synchronize I/Os. For example, an I/O with child I/Os must wait for all children to be ready before allocating a BP, and must wait for all children to be done before returning.
Compression
This phase applies any compression algorithms, if applicable.
Checksum
This phase applies any checksum algorithms, if applicable.
Gang blocks
When there is not enough contiguous space to write a complete block, the ZIO pipeline will break the I/O up into smaller 'gang blocks' which can later be assembled transparently to appear as complete blocks.
DVA management
Each I/O must be allocated a DVA (Device Virtual Address) to correspond to a particular portion of a vdev in the pool. These phases perform the necessary allocation of these addresses, using metaslabs and the space map.
Vdev I/O
These phases are the ones which actually issue I/O to the vdevs contained within the pool.
zio.c – The main ZIO stages, including gang block translation
zio_checksum.c – Generic checksum interface
fletcher.c – Fletcher2 and Fletcher4 checksum algorithms
sha256.c – SHA256 checksum algorithm
zio_compress.c – Generic compression interface
lzjb.c – LZJB compression algorithm
uberblock.c – Basic uberblock (root block) routines
bplist.c – Keeps track of lists of block pointers
metaslab.c – The bulk of DVA translation
space_map.c – Keeps track of free space, used during DVA translation
zio_inject.c – Framework for injecting persistent errors for data and devices

VDEV (Virtual Devices)

The virtual device subsystem provides a unified method of arranging and accessing devices. Virtual devices form a tree, with a single root vdev and multiple interior (mirror and RAID-Z) and leaf (disk and file) vdevs. Each vdev is responsible for representing the available space, as well as laying out blocks on the physical disk.
vdev.c – Generic vdev routines
vdev_disk.c – Disk virtual device
vdev_file.c – File virtual device
vdev_mirror.c – N-way mirroring
vdev_raidz.c – RAID-Z grouping (see Jeff's blog)
vdev_root.c – Top-level pseudo vdev
vdev_missing.c – Special device for import
vdev_label.c – Routines for reading and writing identifying labels
vdev_cache.c – Simple device-level caching for reads
vdev_queue.c – I/O scheduling algorithm for vdevs

LDI (Layered Driver Interface)

At the bottom of the stack, ZFS interacts with the underlying physical devices through LDI, the Layered Driver Interface, as well as the VFS interfaces (when dealing with files).