Files, Pipes, I/O plumbing, Filters and Permissions

The file system and pathnames

Within a Unix system, files are organised into a structure conventionally visualised as an inverted tree, like the pattern of roots which may exist underground beneath a single trunk. This structure consists of a number of directory files, usually referred to as directories or folders, and ordinary files sometimes referred to as "plain files".

Every file and directory within a Unix filesystem has a pathname which may be used to refer to that object when carrying out various operations such as displaying the contents of a file, setting the current or working directory or copying files. At the top of the filesystem is the root directory whose pathname is the forward slash character / . Directories may contain other directories and plain files.

                     /  (root directory)
                     |
     ------------------------------------------------------
     |           |           |           |                |
    dir1       file1       file2        dir2            home
     |                                   |                |
  ----------------        -------------------        ----------
  |              |        |                 |        |        |
 file3          dir3      file4             file5   fred     bill
                 |                                   |
        -------------------------------          ----------
        |              |              |          |        |
      file6          file7          file8   cupboard   wardrobe
                                                           |
                                         ----------------------
                                         |          |         |
                                       socks      vests    shirts

In the above example the pathname of file2 would be /file2, the pathname of file6 would be /dir1/dir3/file6 and the pathname of dir3 would be /dir1/dir3 . In Unix (and in the pathname parts of web URLs) forward slashes (/) are used to delimit the components of pathnames. In Windows, backslashes (\) are used.

The Unix filesystem supports working or current directories and relative pathnames; a relative pathname is relative to the working directory and does not start with / . For example to change working directory from dir1 to dir3 you could either enter cd dir3 (pathname relative to dir1 ) or cd /dir1/dir3 (absolute pathname from root).

Here are some more examples of pathnames and (comments):

pathname comment
wardrobe this is the name of a directory within /home/fred (the current working directory)
Wardrobe this name is different from wardrobe
fred/wardrobe/socks file or directory called socks within a directory called wardrobe which is within a directory called fred within /home, the current working directory
/home/fred/wardrobe/vests This pathname to the object vests could be used from any directory on the system, including the current working directory.
. a period on its own used as a pathname indicates the current working directory
.. two periods together indicate the parent of the current working directory. e.g. the parent of dir3 is dir1
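
The relative and absolute forms described above can be tried out safely in a scratch area. The following is only a sketch: it rebuilds a small part of the example tree under a temporary directory created with mktemp (an assumption made so that nothing is written under the real / ), and the variable names are purely illustrative.

```shell
# Build part of the example tree in a scratch area (a stand-in for / ).
root=$(mktemp -d)
mkdir -p "$root/dir1/dir3"
touch "$root/dir1/file3" "$root/dir1/dir3/file6"

cd "$root/dir1"              # absolute pathname
cd dir3                      # relative pathname, interpreted from dir1
rel=$(pwd -P)

cd "$root/dir1/dir3"         # the same directory by absolute pathname
abs=$(pwd -P)

cd ..                        # .. is the parent, so we are back in dir1
parent=$(basename "$(pwd -P)")
cd .                         # . is the current directory, so nothing changes
same=$(basename "$(pwd -P)")

echo "$rel"
echo "$abs"
echo "$parent $same"

cd / && rm -rf "$root"       # tidy up the scratch area
```

Both cd routes end in the same directory, so the first two lines printed are identical.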

Unix does not support the concept of drive letters, nor does it use physical device names in pathnames in the way that Windows does. Unix file trees are made up of one or more physical or network devices mounted at various points within the same tree. So if the floppy disk drive is mounted at the directory /mnt/floppy on Unix, then the Windows pathname a:\report.txt will become /mnt/floppy/report.txt when the floppy disk is used on Unix.

Links

UNIX also supports multiple paths to the same file, by the use of links. This enables a file or directory to be logically present wherever it may be needed, without having multiple copies which waste disk space and are not kept up to date. Hard links are simply additional names for the same object, while soft links contain the pathname of the linked object and can be used in some situations where hard links cannot, e.g. when the link and the linked object are in filesystems residing on different disk partitions. E.G.

ln /sales/personnel/simon /bonuses/recipients/simon

gives the file /sales/personnel/simon the additional name /bonuses/recipients/simon using a hard link.

ln -s /usr/wizard/wands/magic/software/binaries /wiz

creates a soft link called /wiz so the relevant binaries can be executed and accessed more simply.
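
The difference between the two kinds of link shows up most clearly when the original name is removed. The sketch below works in a temporary directory, with relative versions of the pathnames used above and a soft link name (simon_link) invented for the illustration.

```shell
demo=$(mktemp -d)
cd "$demo"
mkdir -p sales/personnel bonuses/recipients
echo "simon's record" > sales/personnel/simon

# Hard link: a second name for the same file.
ln sales/personnel/simon bonuses/recipients/simon
echo "bonus approved" >> bonuses/recipients/simon    # write via the new name
via_original=$(tail -n 1 sales/personnel/simon)      # visible via the old name

# Soft link: a small file which holds the pathname of the target.
ln -s sales/personnel/simon simon_link
via_soft=$(head -n 1 simon_link)

# Remove the original name. The data survives under the hard link,
# but the soft link now points at a name which no longer exists.
rm sales/personnel/simon
survivor=$(head -n 1 bonuses/recipients/simon)
if [ -L simon_link ] && [ ! -e simon_link ]; then
    dangling=yes
else
    dangling=no
fi

echo "$via_original / $via_soft / $survivor / dangling=$dangling"
cd / && rm -rf "$demo"
```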

Wildcards

Wildcards, or constructs using them, are referred to in these notes as "regular expressions", although shell documentation usually calls them filename patterns or "globs" and reserves the term regular expression for the richer syntax used by commands such as grep and sed. Some UNIX commands use their own regular expressions, which usually work in a similar manner to those interpreted by the shell. Note that when you use shell commands with wildcards, e.g. to apply a command to a number of files, the expression is expanded by the shell. The program called by the shell to perform the command is passed the matched filenames as separate arguments. Therefore this program does not know that you used a wildcard and does not need to know how to interpret it. This makes the shell both a powerful tool and dangerous in the wrong hands. Putting a superfluous space between a prefix and a * wildcard used with the rm (remove or delete) command has meant the difference between clearing out 2 unwanted files with the same prefix and accidentally destroying a main project directory. Enough of the theory, let's look at the practice.
wildcard comment
* matches any filename or part of filename except those starting with a period (.)
? matches any single character
[123] matches the single character 1, 2 or 3
[1-5] matches any single character in the range 1-5

Technically the string abc is an expression which matches any occurrence of itself. So expressions can be combined to give greater flexibility. E.G.

fred* matches anything prefixed by fred
fred*.? matches fred.1 or frederick.Z but not freddy.bak
fred[A-Z]* matches fredB or fredXor but not fred or freddy
fred * will try to match fred (if it exists) and then match EVERYTHING matched by * . Note that the space will cause the shell to treat fred and * as separate arguments.
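
A safe way to check what a wildcard will match is to hand it to echo before trusting it to rm. The sketch below does this in a temporary directory populated with the filenames used in the examples above, plus two extra files (.fredrc and other) added for the illustration. Setting LC_ALL=C is an assumption made here so that [A-Z] and the sort order behave the same on every system.

```shell
LC_ALL=C; export LC_ALL        # predictable ranges and ordering
demo=$(mktemp -d)
cd "$demo"
touch fred fred.1 frederick.Z freddy.bak fredB fredXor .fredrc other

count() { echo $#; }           # helper: how many arguments did the shell pass?

m1=$(echo fred*)               # everything prefixed by fred
m2=$(echo fred*.?)             # fred.1 frederick.Z
m3=$(echo fred[A-Z]*)          # fredB fredXor
n=$(count fred *)              # fred, plus every file which is not hidden

echo "$m1"
echo "$m2"
echo "$m3"
echo "fred * expands to $n arguments"

cd / && rm -rf "$demo"
```

The last line shows the danger of the stray space: rm fred * would have been handed eight names, including the unrelated file called other, whereas rm fred* would have received six.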

Note that . at the beginning of a filename must be matched explicitly, and / must always be matched explicitly. Special characters such as *,?$^()[]\/<>|"'! and spaces should not normally be included in actual filenames (for reasons which should by now be becoming increasingly obvious). However, not everyone who creates files for use on a UNIX system is aware of this, and a user may sometimes have reasons for exceptions which outweigh the awkwardness involved, so you may sometimes have to find ways of telling the shell that a special character is actually part of a filename. In this event the backslash special character \ will usually escape the special character following it, so that rm \$fred will delete the file called $fred .
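
Escaping can be sketched as follows, again in a temporary directory. The awkward filenames are invented for the illustration, and single quotes are shown as an alternative to the backslash for names containing spaces.

```shell
demo=$(mktemp -d)
cd "$demo"
fred="something else"          # a shell variable which happens to be called fred
touch \$fred 'two words' 'star*name' plain

unescaped=$(echo $fred)        # the shell substitutes the variable's value
escaped=$(echo \$fred)         # the backslash keeps the $ literal

rm \$fred                      # removes the file literally named $fred
rm 'two words'                 # quotes stop the space splitting the name in two
rm star\*name                  # removes only the file with a * in its name
remaining=$(ls)

echo "$unescaped | $escaped | $remaining"
cd / && rm -rf "$demo"
```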

Other characters worth mentioning now include period (.) which on its own means the current working directory and tilde (~) which is shorthand for your home directory. Hyphen (-), period (.) and underscore (_) characters can be used safely within filenames, but a period (.) at the start of a filename will hide the file from ordinary directory listings and from wildcard matches. For this reason, the names of files created by and for the use of system software often start with a period (.) and these files should not be deleted unless you know what effect this will have.

Pipes and redirection

By default the standard input comes from the keyboard unless it is redirected, which means being told to come from something else; it uses file descriptor 0 and is referred to as stdin. The standard output of a command or program is what would normally be output to the controlling console, window or terminal unless redirected elsewhere; it uses file descriptor 1 and is often called stdout. Standard error is where programs and commands send their error messages; by convention file descriptor 2 is used and it is called stderr. This is also usually the controlling terminal, but you can redirect it so it goes to a different file from the standard output.
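
The three streams can be told apart with a small experiment. In this sketch report is a hypothetical stand-in for a real command, written as a shell function which sends one line to stdout and one to stderr; the file names are also invented.

```shell
demo=$(mktemp -d)
cd "$demo"

# Stand-in command: one line to each output stream.
report() {
    echo "result line"           # file descriptor 1, stdout
    echo "error line" >&2        # file descriptor 2, stderr
}

report > out.txt 2> err.txt      # send the two streams to different files
out=$(cat out.txt)
err=$(cat err.txt)

# stdin (file descriptor 0) taken from a file instead of the keyboard
read first_word rest < out.txt

echo "stdout held: $out"
echo "stderr held: $err"
echo "read from stdin: $first_word"
cd / && rm -rf "$demo"
```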

Redirection enables the use of programs to be automated without you having to rewrite them; for example a program which normally accepts command line input can be run from a shell script which redirects standard input to come from a file containing commands prepared automatically, while its standard output is sent to another program or file.

Unix also allows you to create pipelines, or sequences of commands which connect the stdout of commands or programs directly to the stdin of the next in the pipeline. This allows many operations to be carried out without creating temporary disk files, which saves time and space, can help reduce confusion and enables you to create shorter shell scripts which are easier to understand and maintain. Pipelines can also be used to transfer data between 2 commands running on different machines across a network.

Here is the syntax used:

syntax:  command > filename
purpose: redirects the stdout from command to filename
example: echo hello there > hello_file

syntax:  command >> filename
purpose: appends the stdout from command to filename. If filename already exists the stdout is added at the end of the existing contents, otherwise filename is created.
example: echo new record >> log_file

syntax:  command < filename
purpose: redirects the stdin to command from filename
example: mail fred < letter_file

syntax:  command 2> error.report
purpose: enables you to capture any error messages in a file
example: job_with_irrelevant_errors 2> /dev/null (here the messages are discarded by sending them to /dev/null)

syntax:  command1 | command2
purpose: pipes the stdout from command1 to the stdin of command2
example: ls | wc -l (counts the number of files)
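
Each form in the table can be exercised in a scratch directory. This is a sketch rather than a transcript: tr is used in place of mail so that nothing is actually sent, and complain is a hypothetical function standing in for a job which writes error messages.

```shell
demo=$(mktemp -d)
cd "$demo"

echo hello there > hello_file        # creates hello_file
echo hello again > hello_file        # > replaces the previous contents
hello=$(cat hello_file)

echo new record >> log_file          # >> creates log_file if it is missing
echo another record >> log_file      # and otherwise adds to the end
log_lines=$(( $(wc -l < log_file) ))

upper=$(tr 'a-z' 'A-Z' < hello_file) # stdin comes from hello_file

complain() { echo "something went wrong" >&2; }
complain 2> error.report             # error message captured in a file
captured=$(cat error.report)

files=$(( $(ls | wc -l) ))           # pipe: count the files created so far

echo "$hello / $log_lines / $upper / $captured / $files"
cd / && rm -rf "$demo"
```

The $(( ... )) wrapper around wc -l is there only because some versions of wc pad their output with leading spaces.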

These terms can of course be combined with more than one redirection on the same command line. So long as what you ask the shell to do makes sense, you will be allowed to do it. e.g.

grep xxx < f1 | awk '{ print $2 }' 2> f2 | tee f3 | wc -l > f4

This line extracts all lines containing xxx from file f1, selects the second column from these lines, outputs any error messages from the awk column selection program to file f2, writes the selected columns into file f3 and counts the number of lines, this count being output to file f4. A few simple commands can be combined in this manner to simply and quickly carry out complex tasks and enquiries which you would otherwise not have time for.
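
To see what each of the files ends up holding, the same pipeline can be run against a small sample. The three lines placed in f1 below are invented for the illustration.

```shell
demo=$(mktemp -d)
cd "$demo"

cat > f1 <<'EOF'
alpha one xxx
beta two yyy
gamma three xxx
EOF

grep xxx < f1 | awk '{ print $2 }' 2> f2 | tee f3 | wc -l > f4

selected=$(cat f3)             # the second column of the matching lines
count=$(( $(cat f4) ))         # how many lines matched
if [ -s f2 ]; then errors=yes; else errors=no; fi   # did awk complain?

echo "$selected"
echo "count=$count errors=$errors"
cd / && rm -rf "$demo"
```

With this input f3 holds the words one and three on separate lines, f4 holds the count 2 and f2 is empty because awk had nothing to complain about.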

The tee command is useful in pipelines as it enables filing of intermediate data within a multi-stage pipeline. cat and echo are useful for providing input to a pipeline; cat sends the contents of one or more files to its stdout while echo can send characters, words or the values of variables to stdout. wc is useful for counting characters (wc -c), words (wc -w) or lines (wc -l) output from pipelines. more and less are both useful for viewing the data output from a pipe, paged a screenful at a time.
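
As a quick illustration of echo feeding wc, the sketch below counts the same text three ways. The sample text is arbitrary.

```shell
text="the quick brown fox"

words=$(( $(echo "$text" | wc -w) ))
chars=$(( $(echo "$text" | wc -c) ))   # includes the newline added by echo
lines=$(( $(echo "$text" | wc -l) ))

echo "$words words, $chars characters, $lines line"
```

The character count is one more than the 19 visible characters because echo ends its output with a newline.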

Some useful filters

A filter is a program which reads from stdin and writes to stdout. Unix comes with many useful, flexible and powerful programs which do this as standard. Most of these programs have been ported to Windows, but you will need to install them yourself. These facilities can be used to help us cope with the otherwise overwhelming volume of information available, by extracting exactly what we want when we want it so we can use it more selectively and waste less time searching through data we don't need. We can also filter raw input information to provide the correct input to existing programs without having to rewrite these to enable them to process data derived from a new source.

Below are a few of the most commonly used filters. I will only give brief comments about likely uses and simple examples in these notes. More information about them is contained in the manual pages, but even this is not always complete.

awk

This can be used to conditionally process input data. While awk is a programming language in its own right, for complex purposes it has been superseded by the Perl and Python languages. For simple column extraction and reordering awk is still useful in shell commands and scripts where the awk program is a one-liner. One of its easiest uses is to act as a simple column filter.

Example 1:

cat file1 | awk '{ print $5, $2 }'

will output fields 5 and 2 from file1 in that order (assuming fields are separated or delimited by spaces or tabs). Note that omitting the comma after the $5 will result in fields 5 and 2 being joined together with no space between them.
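
The effect of the comma is easiest to see on a single line of input. In this sketch the six single-letter fields are arbitrary sample data.

```shell
line="a b c d e f"

with_comma=$(echo "$line" | awk '{ print $5, $2 }')   # fields separated by a space
without=$(echo "$line" | awk '{ print $5 $2 }')       # fields run together

echo "$with_comma"
echo "$without"
```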

Example 2: If you want to use an awk program of any complexity, it's a good idea to put it in a separate file, using the -f flag to precede the program filename on the command line. In this example the awkprog file contains the following 3 lines:

  BEGIN { total=0 }
  {if ( $3 == "usr19999" ) total=total+$4 }
  END { print total/1024 }
awk itself is invoked using the following command line:
  ls -l /scratch | awk -f awkprog

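This outputs the number of 1K blocks used by files belonging to usr19999 in the /scratch directory, obtaining the information by totalling the size field from the ls -l directory listing. The program assumes a listing in which the owner is field 3 and the size is field 4; on systems where ls -l also prints the group, the size is field 5 and the program needs changing to match. This example was used to investigate whether a program was crashing due to insufficient space for scratch files.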
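
The program can be tested without a /scratch directory by feeding it a canned listing. In the sketch below the listing is invented sample data laid out in the older format with no group column, so that the size really is field 4, and the second owner (usr20000) is hypothetical.

```shell
demo=$(mktemp -d)
cd "$demo"

# The three-line awk program from the example above.
cat > awkprog <<'EOF'
BEGIN { total=0 }
{ if ( $3 == "usr19999" ) total=total+$4 }
END { print total/1024 }
EOF

# Canned stand-in for the output of: ls -l /scratch
cat > listing <<'EOF'
total 5
-rw-r--r--  1 usr19999     2048 Jan  5 10:00 a.tmp
-rw-r--r--  1 usr20000     4096 Jan  5 10:01 b.tmp
-rw-r--r--  1 usr19999     1024 Jan  5 10:02 c.tmp
EOF

blocks=$(awk -f awkprog listing)     # (2048 + 1024) / 1024
echo "$blocks"
cd / && rm -rf "$demo"
```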

grep

This is used to extract lines containing a particular pattern from a file. You can use regular expressions in the pattern which you ask it to match; these are similar but not identical to shell regular expressions. Other versions are fgrep and egrep; having the 3 versions gives greater flexibility. fgrep does not interpret regular expressions, allowing for faster searches and fewer problems if you are looking for special characters (metacharacters) while egrep will interpret a greater variety of regular expressions; grep is a compromise solution.

Examples:
grep xxxx outputs (or matches) all lines containing xxxx
grep -i 'xxxx' outputs lines containing xxxx , XXXX, Xxxx, xxXX etc.
grep -c 'xxxx' outputs the number of matching lines
grep -v 'xxxx' matches lines not containing xxxx
grep 'x.z' matches lines containing xyz, xHz etc. but not xz or xyyz
grep '^xyz' matches lines starting with xyz
grep 'xyz$' matches lines ending with xyz
grep '^[A-Z]' matches lines starting with a capital letter
grep '[aeiou]a' matches lines containing aa, ea, ia, oa or ua
grep 'ab*c' matches lines containing ac, abc, abbc, abbbc, etc.
fgrep '[abc]^.*$' matches lines containing [abc]^.*$
egrep '(red|blue|green|yellow)' matches lines containing red or blue or green or yellow
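
Several of these patterns can be checked against a small file by counting matches with -c. The sample file below is invented for the illustration. The sketch uses grep -E and grep -F, which are the POSIX spellings of egrep and fgrep, and sets LC_ALL=C (an assumption) so that [A-Z] means capital letters only.

```shell
LC_ALL=C; export LC_ALL
demo=$(mktemp -d)
cd "$demo"

cat > sample <<'EOF'
Red sky at night
the blue door
xyz at the start
ends with xyz
ac abc abbc
plain text
[abc]^.*$ is literal
EOF

any_xyz=$(grep -c 'xyz' sample)                   # anywhere in the line
start_xyz=$(grep -c '^xyz' sample)                # at the start only
end_xyz=$(grep -c 'xyz$' sample)                  # at the end only
no_xyz=$(grep -vc 'xyz' sample)                   # lines without xyz
any_case=$(grep -ci 'red' sample)                 # matches Red
capital=$(grep -c '^[A-Z]' sample)                # starts with a capital
colours=$(grep -Ec '(red|blue|green|yellow)' sample)
literal=$(grep -Fc '[abc]^.*$' sample)            # no pattern interpretation

echo "$any_xyz $start_xyz $end_xyz $no_xyz $any_case $capital $colours $literal"
cd / && rm -rf "$demo"
```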

sed

If you want automated, as opposed to interactive editing, you can do this either by writing a special program, or by scripting commands to a very simple editor. Those who used computers more than a decade or so ago are more likely to have experienced very simple editors (these were line editors, because you typically edited a single line at a time).

sed stands for stream editor. It is useful for performing the kind of simple editing operations which you might use a line editor for, but in automated processing within a pipeline. It can easily be used to extract a single line from a file based on the line number, and can also be used to remove unwanted characters in a file, e.g. ( ) brackets and : colons, or to substitute other characters, e.g. spaces, for them. sed will also interpret and match regular expressions. sed can be used to selectively remove lines from a file based on their contents, e.g. blank lines or comment lines.

Examples:

sed -n '5p' f1 just outputs line 5 from file f1 (the -n flag suppresses the default output)
sed '5,7d' outputs entire stdin file except for lines 5,6 and 7
sed -n '3,6p' outputs lines 3 to 6 only from stdin.
sed '/hello/d' outputs all stdin lines not containing hello
cat f1 | sed '1,$sxoldxnewxg' > f2
copies all lines from f1 to f2, changing all instances of the word "old" to "new".

The last example needs some explanation. Note that 1,$ applies "s" (the substitution command) to every line in the file because "$" refers to the last line. "x" (as the character following "s") is the substitution delimiter, while the "g" character after the final "x" denotes that all occurrences in the line are to be changed. Without the "g" only the first "old" on each line would be changed to "new". Here is another example:

 cat f1 | sed '1,$s/(//g' | sed '1,$s/)//g' > f2 
copies all lines from f1 to f2, removing all ( and ) brackets. Note that sed is used twice in the same pipeline and that "/" has been chosen as the delimiter.
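
These sed commands can be tried on a file whose contents make the line numbers obvious. The file of number words and the other sample strings below are invented for this sketch; paste is used only to put the results on one line for display.

```shell
demo=$(mktemp -d)
cd "$demo"
printf '%s\n' one two three four five six seven eight > f1

line5=$(sed -n '5p' f1)                              # just line 5
kept=$(sed '5,7d' f1 | paste -s -d ' ' -)            # everything except 5 to 7
middle=$(sed -n '3,6p' f1 | paste -s -d ' ' -)       # lines 3 to 6 only

first_only=$(echo "old old old" | sed 's/old/new/')  # no g: first match only
every=$(echo "old old old" | sed 's/old/new/g')      # g: all matches on the line
no_brackets=$(echo "f(x): (y)" | sed 's/(//g' | sed 's/)//g')

echo "$line5"
echo "$kept"
echo "$middle"
echo "$first_only / $every / $no_brackets"
cd / && rm -rf "$demo"
```

The substitutions here omit the 1,$ address. With no address sed applies a command to every line, so the result is the same.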

sort

This filter is used to provide sorted output of input data. You can choose which fields to sort on, and there are also options which specify whether to sort in ascending or descending order, whether to sort numerically or on the ASCII values etc. See sort(1) for further details.

Example:

 ypcat passwd | sort -t: +0n -1 | more

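This gives a paged output of the NIS (Network Information Service) passwd database, sorted on its first colon-separated field, the userid (use cat /etc/passwd instead of ypcat passwd if the passwd database is local). The +0n -1 notation is the historical way of writing this sort key; many current versions of sort expect the -k form instead, e.g. sort -t: -k 1,1n for the same key.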
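
Field selection with sort can be sketched on a small file in passwd format. The three entries below are invented sample data, and the -k notation is used because it is the form accepted by current versions of sort. Sorting on field 1 orders the entries by name, while sorting field 3 numerically orders them by UID.

```shell
LC_ALL=C; export LC_ALL
demo=$(mktemp -d)
cd "$demo"

cat > passwd.sample <<'EOF'
fred:x:1002:100:Fred:/home/fred:/bin/sh
root:x:0:0:root:/root:/bin/sh
bill:x:250:100:Bill:/home/bill:/bin/sh
EOF

# -t: sets the field separator, -k chooses the field, n asks for numeric order
by_name=$(sort -t: -k 1,1 passwd.sample | cut -d: -f1 | paste -s -d ' ' -)
by_uid=$(sort -t: -k 3,3n passwd.sample | cut -d: -f1 | paste -s -d ' ' -)

echo "by name: $by_name"
echo "by UID:  $by_uid"
cd / && rm -rf "$demo"
```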

head and tail

These copy the first or last part of a file, or of stdin, to stdout; see head(1) and tail(1). For example, head -5 index displays the first 5 lines of index (head -n 5 index is the equivalent POSIX form).
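
A short sketch of both commands used as filters on ten numbered lines (sample data generated with printf):

```shell
numbers=$(printf '%s\n' 1 2 3 4 5 6 7 8 9 10)

top=$(echo "$numbers" | head -n 5 | paste -s -d ' ' -)      # first 5 lines
bottom=$(echo "$numbers" | tail -n 3 | paste -s -d ' ' -)   # last 3 lines

echo "$top"
echo "$bottom"
```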

File security and ownership

Up to 2^16 (which is 65,536) userids can be created for a Unix system. Each user has a unique UID number in the range 0 - 65,535 and a unique name or userid in the passwd(5) database file, which is used to control login security. The same one to one correspondence exists between group names and GIDs, see group(5). Files and processes also belong to a specific userid and group; this determines who is allowed to do what to a file (or directory which is a type of file), and which files a particular process is allowed to read, write or execute.

The system uses the UID and GID as 16 bit numbers internally, and resolves references to the userid and group names using the /etc/passwd and the /etc/group files (or network wide passwd and group maps which have the same format). These names are used when displaying this information to the user, for example when you use the ls -l or the ls -lg commands. Normally UID 0 is reserved for root, who has system administrator privileges.

When directories are displayed using the ls -l command, the first 10 character field describes the permissions attached to the file. The first character describes what kind of file it is, a - (hyphen) indicates a plain or ordinary file, d indicates a directory and l indicates a soft link. The remaining 9 characters are split into 3 components of 3 characters each, to describe the user's, group's and others' privileges respectively. Within each 3 character triplet, the first indicates presence or denial of read permission, the second is for write permission and the third is for execute permission. If the r, w or x character is present, the permission is allowed, if the position is occupied by a - (hyphen) it is denied.

On some directory listings you may see s, S or t characters instead of the x, indicating use of the setuid, setgid or sticky bit. Setuid and setgid apply to programs such as passwd or login, which run with the privileges of the program's owner or group owner rather than those of the user running the program, as would otherwise be the case. The sticky bit is used on directories such as /tmp where all users have write access, but not to each others' files.

Examples:

-rwxr-xr-- indicates user read, write and execute rights, group read and execute rights, and read rights only for others.
drwxr-xr-- indicates a directory which can be displayed and searched by user and group. User can also delete, create and rename files in the directory. Others may list the directory but not search through it or access files in it.

Warning: it is dangerous to allow anyone else write access to any of your directories, and you should never give write access to your home directory to anyone, because this gives them the ability to do anything (accidentally or deliberately) to any of the contents of the directory. If you want to share your files, allocate read and execute permission to the directory and read permission to your files. Others can then read them or copy them into their own directories. If others want to let you use copies of their files they can do likewise. If you want to protect confidential information from prying eyes, you should not regard something intended as an open system in an academic environment as being very secure, but some protection can be given by putting files into a directory to which only you have access. This does not of course stop anyone with access to the root userid from looking at them.

File permissions are assigned using the chmod command. This can be used in one of two ways. Some users prefer to give the octal 3 digit mode, others prefer to use the character equivalents.

Examples using octal modes:

chmod 421 f1 assigns permissions r---w---x to f1
chmod 750 f1 f2 assigns permissions rwxr-x--- to f1 and f2

Each octal digit adds 4 for read, 2 for write and 1 for execute permission. The first digit is for user, second for group and the third for others.
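
The arithmetic and its effect can be checked in a scratch directory. In this sketch the permission string is taken from the first 10 characters of the ls -l output, and f1 and f2 are empty files created for the purpose.

```shell
demo=$(mktemp -d)
cd "$demo"
touch f1 f2

# Build the mode 750 digit by digit: 4 = read, 2 = write, 1 = execute.
user=$((4 + 2 + 1))            # rwx
group=$((4 + 1))               # r-x
others=0                       # ---
mode="$user$group$others"

chmod "$mode" f1 f2
mode_750=$(ls -l f1 | cut -c1-10)

chmod 421 f1
mode_421=$(ls -l f1 | cut -c1-10)

echo "$mode gives $mode_750"
echo "421 gives $mode_421"
cd / && rm -rf "$demo"
```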

Examples using character equivalents:

chmod u=rwx f1
chmod g=rx f1
chmod o= f1
these three commands assign permissions rwxr-x--- to f1
chmod -R go-w dir1
removes write permission from group and others from directory dir1 including all objects contained in it and all its subdirectories etc.
chmod u+x *.exe
adds user execute rights to all files in the current directory whose names end in .exe and do not start with a period (.)
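
The character forms can be verified in the same way. This sketch uses a scratch directory and invented file names (prog.exe, .hidden.exe and the contents of dir1), and starts each file from a known mode so that the effect of each chmod is visible.

```shell
demo=$(mktemp -d)
cd "$demo"

touch f1
chmod u=rwx f1
chmod g=rx f1
chmod o= f1
f1_mode=$(ls -l f1 | cut -c1-10)

mkdir -p dir1/sub
touch dir1/sub/file
chmod -R 777 dir1                    # start with everything allowed
chmod -R go-w dir1                   # then remove group and others write
sub_mode=$(ls -ld dir1/sub | cut -c1-10)
file_mode=$(ls -l dir1/sub/file | cut -c1-10)

touch prog.exe .hidden.exe
chmod 600 prog.exe .hidden.exe
chmod u+x *.exe                      # the wildcard skips .hidden.exe
prog_mode=$(ls -l prog.exe | cut -c1-10)
hidden_mode=$(ls -l .hidden.exe | cut -c1-10)

echo "$f1_mode $sub_mode $file_mode $prog_mode $hidden_mode"
cd / && rm -rf "$demo"
```

The three separate commands for f1 can also be written as one, chmod u=rwx,g=rx,o= f1 .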

Security is maintained by processes normally inheriting the UID and GID ownership fields from their parents, so every process you create while logged in normally has your user and group privileges. The login(1) program is an obvious exception to this rule. Some programs also use the setuid(2) or setgid(2) privileges so that they run with different real and effective UIDs or GIDs. This allows system administrators to create privileged programs to enable other users to access files etc. in the manner controlled by the privileged program.