A big question is which data structure to use and in which cases.
First, take a look at what you will be doing with the data items and ask yourself some questions:
do you need random access?
do you perform a lot of insertions? how about deletions?
do you allow duplicates?
are you searching for elements frequently?
does your data need to be ordered?
would you need to traverse the elements?
how big is your data?
these and other questions need to be considered before choosing which
data structure to use. it is a good idea to start from the simplest
structure to see if it satisfies your criteria. it would be a waste of
time to slave over implementing a complicated structure when an array
can fulfil all your needs.
//THE PROCESS
start with arrays...
arrays are a suitable structure to use when you know the size you'll
need and the number of elements is reasonably small. if fast insertion
is needed and you do not need to traverse the elements in a specified
order, use an unordered array. however, if you need search to be fast
use binary search and an ordered array. this however, will make
insertion slower, so if you need fast insertion and fast search, choose
another structure. deletion is always slow in any kind of an array, so
if you are doing a lot of deletions, array is probably not the best
structure for you to use. additionally, if you overestimate or
underestimate the size of the array, you will either have to expand the
array (make a new bigger array and copy all elements from original array
into the new one - costly operation), or you will have wasted memory.
the biggest detriment of arrays is that size must be known beforehand,
as failure to do so results in slow operations or memory waste. also,
deletions are always slow, regardless of whether the array is sorted or
not.
when arrays are not good enough, move on to linked lists
if you need a more flexible structure that does not require you to know
the size ahead of time, a linked list is a good starting point.
unordered linked lists offer constant time insertion (at the end or
beginning of the list) since only references are being changed and no
items need to be shifted. deletion runs in O(N) time since the element
we're deleting needs to be found first. this is still faster than the
array because, as with insertion, no items are shifted. searching is
slow in the linked list because it can only be linear. remember that
binary search is not possible to use with an ordered list since we
cannot access elements from the middle of the list. also, if you need
random access, use arrays or hash tables; linked lists are not the
structure to use since they are built on relationships and every element
can only be accessed from the first node.
linked lists still not good enough, move on to binary trees
if you have looked at arrays and linked lists and decided that they are
not satisfactory, a binary search tree might be what you need. it
provides fast O(logN) time for all operations: search, insertion, and
deletion. you can easily find the minimum and maximum value of the data,
and traversal in order is possible with the use of recursion. however,
trees degrade to O(N) speed when they become unbalanced. if you are sure
that data will be inserted in random order, a regular binary tree might
be a sufficient solution. otherwise, a red-black tree or 2-3-4 tree
that retains balance would be your best choice.
the end of the line: hash tables
as we've seen in the last post, hash tables offer close to O(1) search
and insertion. deletion also runs in O(1) time assuming the deleted item
is simply replaced with a special flag object that the search and
insertion algorithms treat as an empty cell. hash tables are very fast
if the load factor is suitable: .5 or .66 for open addressing, and
around 1 for separate chaining. beware though, that any sort of
traversal of the elements inside the hash table is not possible. we are
only able to search, insert, and delete (in the special way described
earlier). hash tables are much faster than trees, but can degrade
catastrophically when the load factor gets too big. since hash tables
are based on arrays, it is important to have a rough idea of how many
elements you would be expecting. if you cannot accurately predict the
size of your elements beforehand, using the separate chaining method
would be a better choice over open addressing in implementing your hash
table.
//EXTERNAL STORAGE
i mentioned external storage in the post on b-trees. recall that
accessing data on in external storage is much slower than access in main
memory, so to increase efficiency while working with external storage
we need to minimize the number of disk accesses. this happens if we
increase the number of data per node, which can be done with a multi-way
tree. this way, we can read in a whole block into main memory, and work
from there to search for our key (supposing we are doing insertion). if
the block contains 1000 data items, by fitting all these items into a
single block we have reduced the number of disk accesses from 1000 to 1.
this is the direction of thinking you need to be aware of while working
with external storage and deciding which data structure to use.
//ABSTRACT DATA TYPES
stack: O(1) insertion, O(1) deletion
queue: O(1) insertion, O(1) deletion
priority queue: O(N) insertion, O(1) deletion
to review, there are three types of ADTs: the stack, the queue, and the
priority queue. these are interfaces and can be implemented with either
an array or linked list (in the case of a priority queue, a heap can be
used). the stack is a last-in-first-out data structure, and offers
constant time insertion and deletion. the queue has the same efficiency,
except that it is a first-in-first-out structure. priority queue is a
sorted queue by priority (from greatest to lowest key) - meaning it is
sorted. insertion in a priority queue runs in O(N) time, while removal
is still in constant time.
//A WORD ON SORTS
average worst
bubble: O(N^2) same
selection: O(N^2) same
insertion: O(N^2) same
shellsort: O(N^3/2) same
mergesort: O(NlogN) same [note: requires extra memory]
quicksort: O(NlogN) O(N^2)
if you need to sort your data, first start with the simple sorts.
insertion is the best of the O(N^2) sorts, so if you have a relatively
small amount of data this sort will work fine for your needs, and is
also easy to implement. if you have roughly 1,000-5,000 items (figures
are estimated), insertion sort is probably not good enough and shellsort
can be examined next. if you have a large data set, you can finally
turn to the more complex sorting algorithms: mergesort and quicksort,
which run in fastest O(NlogN) time. mergesort requires twice the amount
of space as the original array, so if you are limited on memory this
would not be the best choice. quicksort can then be used. however,
beware of quicksort's catastrophic degradation to O(N^2) time if the
items are not random. the table above summarizes these points.
Monday, February 24, 2014
Compilation stages for a C program
There are four main stages through which a source code passes in order to finally become an executable.
The four stages for a C program to become an executable are the following:
Before going any further, lets take a quick look on how to compile and run a ‘C’ code using gcc, using a simple hello world example.
Now that we have a basic idea about how gcc is used to convert a source code into binary, we’ll review the 4 stages a C program has to go through to become an executable.
Now, lets open print.i file and view the content.
You can use gdb to debug your c programs. Now that we have a decent understanding on what happens during the preprocessing stage. let us move on to the next stage.
Open the print.s file in an editor and view the content.
This file is produced by the assembler that understands and converts a ‘.s’ file with assembly instructions into a ‘.o’ object file which contains machine level instructions. At this stage only the existing code is converted into machine language, the function calls like printf() are not resolved.
Since the output of this stage is a machine level file (print.o). So we cannot view the content of it. If you still try to open the print.o and view it, you’ll see something that is totally not readable.
ELF stands for executable and linkable format.
This is a relatively new format for machine level object files and executable that are produced by gcc. Prior to this, a format known as a.out was used. ELF is said to be more sophisticated format than a.out (We might dig deeper into the ELF format in some other future article).
Note: If you compile your code without specifying the name of the output file, the output file produced has name ‘a.out’ but the format now have changed to ELF. It is just that the default executable file name remains the same.
The linker comes into action at this stage and does this task.
The linker also does some extra work; it combines some extra code to our program that is required when the program starts and when the program ends. For example, there is code which is standard for setting up the running environment like passing command line arguments, passing environment variables to every program. Similarly some standard code that is required to return the return value of the program to the system.
The above tasks of the compiler can be verified by a small experiment. Since now we already know that the linker converts .o file (print.o) to an executable file (print).
So if we compare the file sizes of both the print.o and print file, we’ll see the difference.
In this module we will learn what are the stages involved in c program compilation using gcc on Linux.
Normally C program building process involves four stages to get executable (.exe)
-S Compile only; do not assemble or link
-c Compile and assemble, but do not link
-o <file> Place the output into <file>
We will use below hello.c program to expain all the 4 phases
Example:
>gcc -E hello.c -o hello.i
//hello.i file content
In above code (hello.i) you can see macros are substituted with its
value (MA_AGE with 21 in printf statement), comments are stripped off
(//Line 1, //Line 2 and //Line 5)and libraries are
expanded(<stdio.h>)
> gcc -S hello.i -o hello.s
//hello.s file content
Above code is assembly code which assembler can understand and generate machine code.
>gcc -c hello.s -o hello.o
Since the output of this stage is a machine level file (hello.o). So we cannot view the content of it. If you still try to open the hello.o and view it, you’ll see something that is totally not readable
//hello.o file content
By looking at above code only thing we can explain is ELF (executable
and linkable format). This is a relatively new format for machine level
object files and executable that are produced by gcc.
> gcc hello.o -o hello
./hello
Maximum age : 21
Now you know c program compilation steps (Preprocessing, Compiling, Assembly, and Linking). There is lot more things to explain in liking phase.
The four stages for a C program to become an executable are the following:
- Pre-processing
- Compilation
- Assembly
- Linking
Before going any further, lets take a quick look on how to compile and run a ‘C’ code using gcc, using a simple hello world example.
$ vi print.c #include <stdio.h> #define STRING "Hello World" int main(void) { /* Using a macro to print 'Hello World'*/ printf(STRING); return 0; }
Now, lets run gcc compiler over this source code to create the executable.
$ gcc -Wall print.c -o printIn the above command:
- gcc – Invokes the GNU C compiler
- -Wall – gcc flag that enables all warnings. -W stands for warning, and we are passing “all” to -W.
- print.c – Input C program
- -o print – Instruct C compiler to create the C executable as print. If you don’t specify -o, by default C compiler will create the executable with name a.out
$ ./print Hello WorldNote: When you are working on a big project that contains several C program, use make utility to manage your C program compilation as we discussed earlier.
Now that we have a basic idea about how gcc is used to convert a source code into binary, we’ll review the 4 stages a C program has to go through to become an executable.
1. PRE-PROCESSING
This is the very first stage through which a source code passes. In this stage the following tasks are done:- Macro substitution
- Comments are stripped off
- Expansion of the included files
$ gcc -Wall -E print.cEven better, you can use flag ‘-save-temps’ as shown below. ‘-save-temps’ flag instructs compiler to store the temporary intermediate files used by the gcc compiler in the current directory.
$ gcc -Wall -save-temps print.c -o printSo when we compile the program print.c with -save-temps flag we get the following intermediate files in the current directory (along with the print executable)
$ ls print.i print.s print.oThe preprocessed output is stored in the temporary file that has the extension .i (i.e ‘print.i’ in this example)
Now, lets open print.i file and view the content.
$ vi print.i ...... ...... ...... ...... # 846 "/usr/include/stdio.h" 3 4 extern FILE *popen (__const char *__command, __const char *__modes) ; extern int pclose (FILE *__stream); extern char *ctermid (char *__s) __attribute__ ((__nothrow__)); # 886 "/usr/include/stdio.h" 3 4 extern void flockfile (FILE *__stream) __attribute__ ((__nothrow__)); extern int ftrylockfile (FILE *__stream) __attribute__ ((__nothrow__)) ; extern void funlockfile (FILE *__stream) __attribute__ ((__nothrow__)); # 916 "/usr/include/stdio.h" 3 4 # 2 "print.c" 2 int main(void) { printf("Hello World"); return 0; }In the above output, you can see that the source file is now filled with lots and lots of information, but still at the end of it we can see the lines of code written by us. Lets analyze on these lines of code first.
- The first observation is that the argument to printf() now contains directly the string “Hello World” rather than the macro. In fact the macro definition and usage has completely disappeared. This proves the first task that all the macros are expanded in the preprocessing stage.
- The second observation is that the comment that we wrote in our original code is not there. This proves that all the comments are stripped off.
- The third observation is that beside the line ‘#include’ is missing and instead of that we see whole lot of code in its place. So its safe to conclude that stdio.h has been expanded and literally included in our source file. Hence we understand how the compiler is able to see the declaration of printf() function.
extern int printf (__const char *__restrict __format, ...);The keyword ‘extern’ tells that the function printf() is not defined here. It is external to this file. We will later see how gcc gets to the definition of printf().
You can use gdb to debug your c programs. Now that we have a decent understanding on what happens during the preprocessing stage. let us move on to the next stage.
2. COMPILING
After the compiler is done with the pre-processor stage. The next step is to take print.i as input, compile it and produce an intermediate compiled output. The output file for this stage is ‘print.s’. The output present in print.s is assembly level instructions.Open the print.s file in an editor and view the content.
$ vi print.s .file "print.c" .section .rodata .LC0: .string "Hello World" .text .globl main .type main, @function main: .LFB0: .cfi_startproc pushq %rbp .cfi_def_cfa_offset 16 movq %rsp, %rbp .cfi_offset 6, -16 .cfi_def_cfa_register 6 movl $.LC0, %eax movq %rax, %rdi movl $0, %eax call printf movl $0, %eax leave ret .cfi_endproc .LFE0: .size main, .-main .ident "GCC: (Ubuntu 4.4.3-4ubuntu5) 4.4.3" .section .note.GNU-stack,"",@progbitsThough I am not much into assembly level programming but a quick look concludes that this assembly level output is in some form of instructions which the assembler can understand and convert it into machine level language.
3. ASSEMBLY
At this stage the print.s file is taken as an input and an intermediate file print.o is produced. This file is also known as the object file.This file is produced by the assembler that understands and converts a ‘.s’ file with assembly instructions into a ‘.o’ object file which contains machine level instructions. At this stage only the existing code is converted into machine language, the function calls like printf() are not resolved.
Since the output of this stage is a machine level file (print.o). So we cannot view the content of it. If you still try to open the print.o and view it, you’ll see something that is totally not readable.
$ vi print.o ^?ELF^B^A^A^@^@^@^@^@^@^@^@^@^A^@>^@^A^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@0^ ^@UH<89>å¸^@^@^@^@H<89>ǸHello World^@^@GCC: (Ubuntu 4.4.3-4ubuntu5) 4.4.3^@^ T^@^@^@^@^@^@^@^AzR^@^Ax^P^A^[^L^G^H<90>^A^@^@^\^@^@]^@^@^@^@A^N^PC<86>^B^M^F ^@^@^@^@^@^@^@^@.symtab^@.strtab^@.shstrtab^@.rela.text^@.data^@.bss^@.rodata ^@.comment^@.note.GNU-stack^@.rela.eh_frame^@^@^@^@^@^@^@^@^@^@^@^ ... ... …The only thing we can explain by looking at the print.o file is about the string ELF.
ELF stands for executable and linkable format.
This is a relatively new format for machine level object files and executable that are produced by gcc. Prior to this, a format known as a.out was used. ELF is said to be more sophisticated format than a.out (We might dig deeper into the ELF format in some other future article).
Note: If you compile your code without specifying the name of the output file, the output file produced has name ‘a.out’ but the format now have changed to ELF. It is just that the default executable file name remains the same.
4. LINKING
This is the final stage at which all the linking of function calls with their definitions are done. As discussed earlier, till this stage gcc doesn’t know about the definition of functions like printf(). Until the compiler knows exactly where all of these functions are implemented, it simply uses a place-holder for the function call. It is at this stage, the definition of printf() is resolved and the actual address of the function printf() is plugged in.The linker comes into action at this stage and does this task.
The linker also does some extra work; it combines some extra code to our program that is required when the program starts and when the program ends. For example, there is code which is standard for setting up the running environment like passing command line arguments, passing environment variables to every program. Similarly some standard code that is required to return the return value of the program to the system.
The above tasks of the compiler can be verified by a small experiment. Since now we already know that the linker converts .o file (print.o) to an executable file (print).
So if we compare the file sizes of both the print.o and print file, we’ll see the difference.
$ size print.o text data bss dec hex filename 97 0 0 97 61 print.o $ size print text data bss dec hex filename 1181 520 16 1717 6b5 printThrough the size command we get a rough idea about how the size of the output file increases from an object file to an executable file. This is all because of that extra standard code that linker combines with our program.
C Program Compilation Steps
You compile c program and get executables. Have you ever wondered what happens during compilation process and how c program gets converted to executable?In this module we will learn what are the stages involved in c program compilation using gcc on Linux.
Normally C program building process involves four stages to get executable (.exe)
- Preprocessing
- Compilation
- Assembly
- Linking
Compilation with gcc with different options
-E Preprocess only; do not compile, assemble or link-S Compile only; do not assemble or link
-c Compile and assemble, but do not link
-o <file> Place the output into <file>
We will use below hello.c program to expain all the 4 phases
#include<stdio.h> //Line 1 #define MAX_AGE 21 //Line 2 int main() { printf ( "Maximum age : %d " ,MAX_AGE); //Line 5 } |
1. Preprocessing
This is the very first stage through which a source code passes. In this stage the following tasks are done:- Macro substitution
- Comments are stripped off
- Expansion of the included files
Example:
>gcc -E hello.c -o hello.i
//hello.i file content
# 1 "hello.c" # 1 "<built-in>" # 1 "<command-line>" # 1 "hello.c" # 1 "/usr/include/stdio.h" 1 3 4 # 28 "/usr/include/stdio.h" 3 4 ………… ………… Truncated some text… ……… ……… extern void funlockfile ( FILE *__stream) __attribute__ ((__nothrow__)); # 918 "/usr/include/stdio.h" 3 4 # 2 "hello.c" 2 int main() { printf ( "Maximum age : %d " ,21); } |
2. Compilation
Compilation is the second pass. It takes the output of the preprocessor (hello.i) and generates assembler source code (hello.s)> gcc -S hello.i -o hello.s
//hello.s file content
.file "hello.c" .section .rodata .LC0: .string "Maximum age : %d " .text .globl main .type main, @function main: .LFB0: .cfi_startproc pushq %rbp .cfi_def_cfa_offset 16 movq %rsp, %rbp .cfi_offset 6, -16 .cfi_def_cfa_register 6 movl $.LC0, %eax movl $21, %esi movq %rax, %rdi movl $0, %eax call printf leave .cfi_def_cfa 7, 8 ret .cfi_endproc .LFE0: .size main, .-main .ident "GCC: (GNU) 4.4.2 20091027 (Red Hat 4.4.2-7)" .section .note.GNU-stack, "" ,@progbits |
3. Assembly
Assembly is the third stage of compilation. It takes the assembly source code (hello.s) and produces an assembly listing with offsets. The assembler output is stored in an object file (hello.o)>gcc -c hello.s -o hello.o
Since the output of this stage is a machine level file (hello.o). So we cannot view the content of it. If you still try to open the hello.o and view it, you’ll see something that is totally not readable
//hello.o file content
^?ELF^B^A^A^@^@^@^@^@^@^@^@^@^A^@>^@^A^@^@^@^@^@^@^@^@^@^@^@^@^ @^@^@^@^@^@^@@^A^@^@^@^@^@^@^@^@^@^@@^@^@^@^@^@@^@^M^@^@UH<89>å¸ ^@^@^@^@¾^U^@^@^@H<89>ç¸^@^@^@^@è^@^@^@^@éã^@^@^@Maximum age :%d ^@^@GCC:GNU)4.4.220091027(RedHat4.4.2-7)^@^@^T^@^@^@^@^@^@^@^AzR^ @^Ax^P^A^[^L^G^H<90>^A^@^@^\^@^@^@^\^@^@^@^@^@^@^@^]^@^@^@^@A^N^PC <86>^B^M^FX^L^G^H^@^@^@^@.symtab^@.strtab^@.shstrtab^@.rela.text^@ .data^@.bss^@.rodata^@.comment^@.note.GNU-stack^@.rela.eh_frame^@ ^@^@^@^@^@^@^@^@^@^@^ |
4. Linking
Linking is the final stage of compilation. It takes one or more object files or libraries as input and combines them to produce a single executable file (hello.exe). In doing so, it resolves references to external symbols, assigns final addresses to procedures/functions and variables, and revises code and data to reflect new addresses (a process called relocation).> gcc hello.o -o hello
./hello
Maximum age : 21
Now you know c program compilation steps (Preprocessing, Compiling, Assembly, and Linking). There is lot more things to explain in liking phase.
Subscribe to:
Posts (Atom)