More Information on C Language

This document provides more in-depth information on the C language, deliberately ignoring library functions, which are not actually part of the language. It is an appendix to A-C-X-en.html and is meant to be useful for people who already program in some other language.

Pointers and Arrays

We must always keep in mind that C has been designed to be close to the processor and its structures. It shouldn't thus be surprising that the language authors chose to represent an array with the address of its first element. The concepts of array and pointer are thus largely interchangeable, and you can apply an index to a pointer, using square brackets as if it were an array, without any error or warning. The name of an array is however different from a pointer in that you can't assign a new value to it: an array name represents a constant address, which is fixed for the whole life of the array.

You can add and subtract pointers and integers: the result of adding an integer n to a pointer is a pointer to the element n positions further in the array. In other words, the number is not the number of bytes to add to the original address, but the number of elements. The scale factor is applied by the compiler according to the type of the pointer you do arithmetic on. So you can increment or decrement a pointer, as well as take the difference (but not the sum) of two pointers of the same type; the result is an integer number. If you do arithmetic on generic pointers (pointers to void), the gcc compiler uses 1 byte as scale factor, but the operation is not allowed according to the standard.

The "null pointer" is zero and is not a valid pointer; usually functions that return a pointer use the null pointer as an error marker. The macro NULL is usually "0", or sometimes "(void *)0", and zero can be assigned to any type of pointer.
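
The use of the null pointer as an error marker can be sketched as follows. The function name find_int and its arguments are hypothetical, chosen only for this illustration.

```c
#include <stddef.h> /* NULL */

/* Hypothetical helper: return a pointer to the first element of v
 * equal to key, or the null pointer if none of the n elements match. */
int *find_int(int *v, int n, int key)
{
    int *p;

    for (p = v; p < v + n; p++)
        if (*p == key)
            return p;
    return NULL; /* not found: the caller must check for this */
}
```

The caller is expected to compare the returned value with NULL before using it, as dereferencing the null pointer is an error.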

The most important operators to use pointers are "*" (read as "pointed by") and "&" ("the address of").

Examples:

int i, v[10], *p; /* an integer, an array, a pointer:
                    "each element of v, which is 10 items long, is an integer"
                    "what is pointed by p is an integer" */
p = v;         /* p takes the value of v, i.e. the address of its first item */
p = &v[0];     /* same as above */
p = &v[4];     /* p takes the address of the fifth element of v */
p = v+4;       /* same as above */
p++;           /* p is incremented */
i = p-v;       /* i takes 5, as a result of the previous two lines */

In the following loop, the integer i takes the sum of all elements of the array v up to the first element that is 0. Note that there must be a zero-valued array element, or the loop will run past the end of the array. The language doesn't offer any check on pointers and array indices.

i=0;
for (p = v; *p; p++)
    i += *p;

Assignment between pointers of different types is an error, and so is any arithmetic between pointers of different types. You can nonetheless convert a pointer of one type into a pointer of another type, as well as pointers into integers and back. These conversions usually don't generate any machine code, as the values are all integers within the CPU; but they are sometimes needed to ensure type consistency of the source code.
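
As a small sketch of pointer conversion, the hypothetical function below (byte_distance is not a standard name) converts two int pointers to char pointers, so that their difference is counted in bytes rather than in elements.

```c
#include <stddef.h> /* ptrdiff_t */

/* Hypothetical helper: distance in bytes between two int pointers
 * into the same array, obtained by converting both to char pointers. */
ptrdiff_t byte_distance(int *from, int *to)
{
    return (char *)to - (char *)from;
}
```

With the array v declared earlier, (v + 4) - v is 4, while byte_distance(v, v + 4) is 4 * sizeof(int).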

A void pointer can always be assigned to any type of pointer, and any type of pointer can be assigned to a void pointer. This is allowed because the void pointer is used to handle generic memory addresses, a very common task in operating systems and system libraries.
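
A possible use of void pointers for generic memory handling is sketched below. The function swap_bytes is a hypothetical helper written for this example: it receives two generic addresses and a size, and exchanges the two memory areas one byte at a time.

```c
#include <stddef.h> /* size_t */

/* Hypothetical helper: swap two objects of any type, size bytes each.
 * The void pointers are assigned to char pointers to access bytes. */
void swap_bytes(void *a, void *b, size_t size)
{
    unsigned char *pa = a; /* void pointer assigned without a cast */
    unsigned char *pb = b;
    unsigned char tmp;

    while (size--) {
        tmp = *pa;
        *pa++ = *pb;
        *pb++ = tmp;
    }
}
```

The same function can be called as swap_bytes(&x, &y, sizeof(x)) whether x and y are integers, floating point numbers or structures, as long as both have the same type.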

The sizeof operator, applied to a type or a variable name or an expression, returns the size in bytes of the named object. This operation is performed at compile time, according to the type being passed to sizeof. If a pointer p is incremented, its numeric value (memory address, in bytes) is incremented by sizeof(*p).

Example:

int i, v[10], *p;         /* the same variables used earlier */
i = sizeof(int);          /* usually 4, but may be 8 or 2 */
i = sizeof(i);            /* same as above */
i = sizeof(v[0]);         /* same as above */
i = sizeof(*p);           /* same as above */
i = sizeof(p);            /* usually 4 (pointer size), or 8 */
i = sizeof(v);            /* 40, or 80, or 20 */
i = sizeof(v)/sizeof(*v); /* 10: the number of elements in the array */
#define ARRAY_SIZE(x) (sizeof(x)/sizeof(*(x))) /* a useful macro */
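
One caveat about the macro above is sketched here, with hypothetical names (table, table_items, sum_items). The macro only works on a real array: a function argument declared as an array is just a pointer, so the number of elements must be passed separately.

```c
#include <stddef.h> /* size_t */

#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))

int table[10];

/* Here "table" is a real array: the result is 10. */
size_t table_items(void)
{
    return ARRAY_SIZE(table);
}

/* Here v is only a pointer, whatever the caller passed: sizeof(v) is
 * the pointer size, so ARRAY_SIZE(v) would be wrong. Pass the length. */
int sum_items(int *v, size_t n)
{
    int sum = 0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += v[i];
    return sum;
}
```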

Strings

C strings are arrays of characters, each with a zero byte as terminator. The double-quote representation is just a simplified notation to represent such an array. Each time a string in double quotes appears in the program text, the compiler saves the string in the data segment of the program and represents it with the address of the first character. A character within single quotes is an integer number, i.e. the ASCII code of the character itself.

Examples of declarations of strings and pointers:

char s[] = "test";  /* an array of 5 bytes, including the terminator */
char s[] = {'t', 'e', 's', 't', 0}; /* the same, more baroquely */
char c, *t; /* a character and a pointer to character */
c = *s;     /* c takes 't' */
t = s+2;    /* t represents the string "st" */
s[0] = 'p'; /* the string s is now "pest" */

char *name = "arthur"; /* a pointer to an initialized area, 7 bytes long */
char surname[] = "smith"; /* a 6 byte area, whose address is "surname" */
name++;                /*  now name is "rthur" */
surname++;             /*  error: surname is a constant address */

The following is a possible implementation of strlen, the function that returns the length of a string:

int strlen(char *s)
{
    char *t = s;
    for (; *t; t++)
        ;
    return t - s;
}
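
In the same style, copying a string can be sketched with two moving pointers. The function copy_string is a hypothetical name for this example; it is similar in spirit to the library function strcpy, and it assumes that the destination area is large enough.

```c
/* Hypothetical helper: copy the string src, terminator included,
 * into dst. The caller must provide a dst area that is large enough. */
char *copy_string(char *dst, const char *src)
{
    char *d = dst;

    /* the assignment is also the loop condition: stop after the 0 */
    while ((*d++ = *src++) != 0)
        ;
    return dst;
}
```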

Function pointers

Like an array, a function is represented by an address, the address of its code. Every time you use a function name in a program, you are actually using a pointer to the function. The most common use of function pointers is applying the parentheses operator, even if you don't usually think about this as using a function pointer. A function pointer can also be assigned to other pointers, for example within data structures that define the methods that act on objects, or can be passed as an argument to other functions, like the library function qsort, which implements the "quick sort" algorithm on an array. The compiler checks at compile time that pointer types in assignments are compatible, i.e. that the functions receive the same arguments and return the same type. Example:

#include <string.h> /* get strcmp declaration */
#include <stdlib.h> /* get qsort declaration */

char *strings[100]; /* array of 100 char pointers */

/* qsort passes pointers to the array items, so strcmp needs a wrapper */
int compare(const void *a, const void *b)
{
    return strcmp(*(char * const *)a, *(char * const *)b);
}

strcmp(strings[0], strings[1]); /* comparison of two strings */

/* let's call qsort telling it to use compare as compare function */
qsort(strings, 100, sizeof(char *), compare);

strncmp(strings[0], strings[1], 5); /* only compare the first 5 chars */

/* the following is an error, as strncmp receives three arguments */
qsort(strings, 100, sizeof(char *), strncmp);
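
The other use mentioned above, function pointers stored in a data structure as methods, might look like the following sketch. The structure shape and all the function names are hypothetical.

```c
/* A hypothetical "shape" object whose method is a function pointer */
struct shape {
    int width, height;
    int (*area)(const struct shape *s); /* pointer to the method */
};

int rect_area(const struct shape *s)
{
    return s->width * s->height;
}

int triangle_area(const struct shape *s)
{
    return s->width * s->height / 2;
}

/* The caller doesn't know which function is being run */
int shape_area(const struct shape *s)
{
    return s->area(s);
}
```

A structure initialized as {4, 3, rect_area} and one initialized as {4, 3, triangle_area} can then be passed to the same shape_area function, which returns 12 and 6 respectively.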

Memory allocation

The language doesn't offer primitives for memory management (like new, constructors or destructors), and has no garbage collection.

Programs use three types of memory: static, dynamic, or automatic memory. A variable or data structure is static when it is declared at compilation time; the linker assigns an immutable address to it. A dynamic structure is allocated at program run-time, for example by calling malloc, and access to data happens through a pointer. A so-called "automatic" variable is allocated on the stack and disappears when the program leaves the code block that declared it.

A static variable is initialized to zero, unless the program declares a constant value to load into the variable. Initialized variables are saved to disk and live in the "data segment" of the program and of the executable ELF file. Uninitialized variables live in the "bss segment" of the program, which is a memory area that is allocated and zeroed before the program starts running. The executable file on disk doesn't include a copy of the BSS area but only a declaration.

A dynamic variable is stored in memory that is requested from the system at run-time. After allocation, you can't make any assumption about the content of such memory space: it may be zero-filled but it may also contain information from data that was previously allocated and then freed. In any case, each malloc requires a corresponding free; without it the program will leak memory, and the size of the process will slowly and continuously increase during execution. While all the memory allocated by a user-space program is released at program termination, kernel memory that is allocated and not freed is lost, and such areas can only be recovered by rebooting the system.

An automatic variable is a local variable in a function or in a code block. It lives on the stack and is not initialized, unless the programmer does it explicitly. If it is initialized, the compiler will output the code that is needed to do the initialization. Memory associated with automatic variables, being on the stack, can't be used after the program leaves the block where it is defined.

Examples:

int i;              /* initialized to zero, lives in bss segment */
int v[4] = {2,1,};  /* initialized to {2,1,0,0}, data segment */
int j = f(3, i);    /* error: the value is not known at compile time */

int *f(int x, int y)
{
    int z;                            /* automatic, unknown value */
    int a=0, b=1, c=2;                /* run-time initialization */
    int *p = malloc(4 * sizeof(int)); /* another run-time operation */
    int *q, *r = &z;              /* two pointers: one points to z */

    *q = y;          /* error: q points to undefined place */
    *r = y;          /* correct: r points to z so this assigns z */
    if (x) return p; /* correct: p has been allocated and remains valid */
    else return &z;  /* error: z can't be used out of the function */
}
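
The rule that each malloc requires a corresponding free can be sketched with a pair of functions. The names ints_create and ints_destroy are hypothetical, and the sketch assumes that n is greater than zero.

```c
#include <stdlib.h> /* malloc, free */

/* Hypothetical creator: return a new array of n integers, all set
 * to value, or the null pointer if allocation fails. */
int *ints_create(int n, int value)
{
    int *v = malloc(n * sizeof(*v));
    int i;

    if (!v)
        return NULL; /* malloc can fail: pass the error back */
    for (i = 0; i < n; i++)
        v[i] = value; /* malloc'ed memory has unknown content */
    return v;
}

/* Hypothetical destructor: every create needs a matching destroy */
void ints_destroy(int *v)
{
    free(v);
}
```

Keeping allocation and release in two functions with matching names makes it easier to check that every allocated area is eventually freed.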

All the operators

A table of operators is found in operator.tbl, which lists priority and associativity. The file can be printed in an A4 or A5 page.

The operands of every operator are always other expressions, with two exceptions. This section explains how to use each operator, in the same order as operator.tbl.

The switch control construct

The control construct switch is used to choose between several different behaviours according to an integer expression, keeping in mind that a character within single quotes is an integer number. The syntax is different from that of other constructs, as the braces are mandatory. Moreover, it uses as many as three keywords: switch, case and default.

The complete syntax is as follows:

switch ( integer-expression ) {
    case constant-expression :
        [ instruction ... ]
        [ break ; ]
    case constant-expression :
        [ instruction ... ]
        [ break ; ]
    [ default: ]
        [ instruction ... ]
        [ break ; ]
}

The expressions in each case must be constant expressions, i.e. they must be integer and their value must be known at compile time. After each case, instructions are optional, to allow grouping the same code under several cases.

Putting break at the end of each case is optional, to allow instructions associated with a case to continue with the instructions of the next case; when you deliberately omit break you should always add a comment about it, or it will look like an error to people reading your code.

The default branch is optional; if it exists, it is used when no case expression matches the integer expression. Default is usually the last branch, but can appear in any position.

Example: extremely inefficient conversion from hex to decimal, one char at a time. Note how c is being modified after being used to select the correct case; this shouldn't surprise you as the expression used to select the case is evaluated once only, at the beginning.

#include <stdio.h> /* get printf declaration */

int value;
int nextchar(int c)
{
    switch(c) {
        case 'a': case 'b': case 'c': case 'd': case 'e': case 'f':
            c = c - 'a' + 10 + '0';
            /* fall through */
        case '0': case '1': case '2': case '3': case '4':
        case '5': case '6': case '7': case '8': case '9':
            value = value * 16 + c - '0';
            break;
        case 'p':
            printf("%i\n", value); value = 0;
            break;
        default:
            return -1; /* error */
    }
    return 0;
}

Usually switch is used to select between different commands, for example in the implementation of the ioctl() system call, or in the parsing of command line arguments. Using two or more case clauses for the same code block is uncommon, and equally uncommon is the need to fall through into a case clause after running the previous one.
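
The more typical use, selecting between commands, can be sketched as follows. The command numbers and the function run_command are hypothetical and are not taken from any real ioctl implementation.

```c
/* Hypothetical command numbers */
#define CMD_RESET 1
#define CMD_GET   2
#define CMD_ADD   3

int counter; /* global: zero at program start */

/* Run a command; return the counter, or -1 for an unknown command */
int run_command(int cmd, int arg)
{
    switch (cmd) {
        case CMD_RESET:
            counter = 0;
            break;
        case CMD_ADD:
            counter += arg;
            break;
        case CMD_GET:
            break; /* nothing to do, just return the value */
        default:
            return -1; /* error */
    }
    return counter;
}
```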

Data structures

A data structure (struct) can include other structures or pointers to other structures. While pointers can make structures refer to each other cyclically, structure inclusion can't be recursive as the included structure is contained in the including one in its entirety.

If a structure includes a pointer to another structure, such other structure must have been declared in advance (even without defining the field list), because the compiler reads source code only once. After declaration, without a definition, you can't instantiate a data structure because the compiler doesn't know its size, but you can instantiate a pointer to it, as all pointers to structures have the same size.

Example:

struct father;

struct child {
    struct father *father;
    /* ... */
};

struct father {
    struct child *child;
    /* ... */
};

A structure declaration without a field list also allows a library to have opaque structures. The technique is used for data which is private to the library. If a structure includes a pointer to another structure of the same type you don't need to declare it in advance, because when the compiler reads the field list it has already seen the structure name.

struct dpriv;
struct datum {
    struct dpriv *priv; /* users of "datum" ignore contents of "dpriv" */
    /* ... */
    struct datum *next; /* pointer to another struct, to build a list */
};
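
How the next pointer is used to build and walk a list might be sketched like this. The value field and the two functions are hypothetical additions for the example; struct dpriv remains undefined, as users of datum never look inside it.

```c
#include <stddef.h> /* NULL */

struct dpriv; /* declared but never defined here: opaque */
struct datum {
    struct dpriv *priv;
    int value;          /* hypothetical payload field */
    struct datum *next; /* null pointer at the end of the list */
};

/* Put item in front of the list and return the new head */
struct datum *datum_push(struct datum *head, struct datum *item)
{
    item->next = head;
    return item;
}

/* Count the items by following the next pointers */
int datum_count(const struct datum *head)
{
    int n = 0;

    for (; head; head = head->next)
        n++;
    return n;
}
```

An empty list is represented by a null head pointer, so datum_count returns 0 for it without any special case.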

Data and code scope

The language has a single flat name space for variables and functions. A variable can't have the same name as a function, because in the linker a name is associated to a single address, be it code or data.

Unlike global variables, local ("automatic") variables are only visible in the block where they are declared. Such a block can be a function or a composite instruction enclosed in braces, either the body of a control statement (if, for, and so on) or a standalone composite instruction. Variables which are local to a block are allocated on the stack; on the other hand, you can't define local functions with a limited scope. If a variable defined within a block has the same name as another one, global or local (to an outer block), within the inner block the name refers to the inner variable. As said, function arguments can be used like variables that are local to the function itself.
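
The rule about inner names hiding outer ones can be sketched with three variables that share the same name. The names level and shadow_demo are hypothetical.

```c
int level = 1; /* global */

int shadow_demo(void)
{
    int seen_outer, seen_inner;
    int level = 2; /* hides the global within the function */

    {
        int level = 3; /* hides both within this inner block */
        seen_inner = level;
    }
    seen_outer = level; /* the inner variable doesn't exist any more */
    return seen_outer * 10 + seen_inner; /* 23 */
}
```

The global variable is never touched by the function and still holds 1 after the call. Reusing names in this way is legal but makes the code harder to read.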

The static keyword is a qualifier for code and data: it is used to change the default scope rules. If a global symbol (function or variable) is declared static, it isn't visible outside of the source file where it is defined, because its name is not exported to the linker. A local variable, if defined static, is allocated in the global data space, but without exporting its name; in this way you can have a persistent data space within the block where it is defined. Example:

int i; /* global */
static int j; /* global, but only visible in this file */
static int invert(int i) /* the function can only be called in this file */
{
      int j; /* allocated on the stack */
      j = -i; /* two local variables, where i is the function argument */
      return j;
}
int count(void) /* count is globally defined in the program */
{
      static int i;  /* local but persistent across calls, initially 0 */
      return ++i; /* increment the counter and return its value */
}

Important gcc options

The gcc compiler, like every implementation of cc, is passed command line options. Input files are processed according to their name: if they end in .c they are compiled, if they end in .S they are passed to the assembler and if they end in .o they are just passed to the linker. Its most important options are the following ones. Below, file refers to a generic filename, and not the same file in all examples:

Example:
gcc -DDEBUG jpegdemo.c -I/usr/local/include -L/usr/local/lib -ljpeg -o jpegdemo

Programming style

Please be consistent in your program layout: always indent blocks in the same way, whatever your preferred indentation style is. The most common style is the Kernighan and Ritchie one (open brace at end of line, closed brace alone in a line). Your personal preference is not very important, but consistency in your files is.

The TAB character is 8 spaces, whatever your indenting level is (2, 4, 8 spaces). Please check your editor's configuration, whose default may be wrong.

Functions should be short and understandable. If a function gets too complex you should split blocks that are conceptually separate into separate functions.

Use data structures as much as possible, for better readability and maintainability. Define creators and destructors for your objects, using dynamic allocation instead of global variables.

Always check errors: every function you call may fail, the calling code should check return values and behave in a reasonable way -- which often means passing the error back to the caller.

Don't call exit from within a function if an error happens, leaving that decision to the main program.
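
A minimal sketch of passing an error back to the caller follows. Both function names are hypothetical, and the example ignores integer overflow for brevity.

```c
/* Hypothetical low-level step: returns the digit value, or -1 if
 * the character is not a decimal digit. */
int parse_digit(int c)
{
    if (c < '0' || c > '9')
        return -1;
    return c - '0';
}

/* Returns the value of a string of digits, or -1 on error. The error
 * is passed back to the caller, which decides what to do about it. */
int parse_number(const char *s)
{
    int digit, value = 0;

    if (!*s)
        return -1; /* empty string */
    for (; *s; s++) {
        digit = parse_digit(*s);
        if (digit < 0)
            return digit; /* pass the error back, don't exit */
        value = value * 10 + digit;
    }
    return value;
}
```

Only the main program, which knows the context, should decide whether a failed parse_number is a reason to terminate.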

Add good comments to your code; avoid exceedingly "smart" constructs, but if you do that please explain why you made the specific implementation choice.

Always make clear your license terms in the source file; without any such terms the "all rights reserved" applies by default. Even when this is your intention, you should make that clear to avoid possible doubts about it.

Avoid user interaction if not really needed. If needed, please read stdin with fgets and then sscanf, never use scanf directly as it may bite you; write to stdout by complete lines, with a trailing '\n'. Avoid unneeded output ("silence is golden") and unneeded empty lines.
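
The fgets-then-sscanf approach suggested above might be sketched as follows. The function names and the 80-byte line buffer are assumptions made for this example.

```c
#include <stdio.h> /* fgets, sscanf */

/* Hypothetical helper: extract one integer from a line of text.
 * Returns 0 on success, -1 if the line doesn't start with a number. */
int parse_int_line(const char *line, int *result)
{
    if (sscanf(line, "%d", result) != 1)
        return -1;
    return 0;
}

/* Read one line from f (for example stdin) and parse it.
 * Returns 0 on success, -1 on end of file or invalid input. */
int read_int(FILE *f, int *result)
{
    char line[80];

    if (!fgets(line, sizeof(line), f))
        return -1; /* end of file or read error */
    return parse_int_line(line, result);
}
```

Since a whole line is consumed before parsing, invalid input doesn't remain in the input stream to confuse the next read, which is one of the ways scanf can bite you.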

What's missing

Constructs that have not been covered in these two documents, as they are rarely used, are: