This repository contains the starter code and tests for the "C system calls" lab. The primary goal of this lab is to gain some experience with (Unix) file system calls in C, and compare these to using the shell.
For this lab, you will need to have (or develop) some familiarity with several C programming concepts, including:
- Using command line arguments
- Using 2-Dimensional arrays
- Basics of file-system calls, including standard input (
stdin
) and standard output (stdout
) - Using
stat()
- Using
struct
For comparison, we will also use shell commands to traverse directories.
To accomplish these goals, you will work on solving two distinct problems, both of which illustrate how we interact with the file system from (C) programs. The first is revisiting an old friend from a previous C Lab, but this time we disemvowel entire files instead of single lines. The second is three different solutions to the problem of summarizing files; two solutions will be in C and one uses shell commands.
There's a lot here, and only some of it is described below. You will definitely need to do some reading on at least the following using your preferred resource(s):
- C file I/O, including stdin, stdout,
fopen()
,fwrite()
,fclose()
- Command line arguments in C
- C structs
- C tools for processing directories, including
stat()
,readdir()
,chdir()
- Function pointers in C (that's a head trip)
- The C
ftw()
(file tree walk) system call
There are two directories in the repository, one for each of the two major parts of this lab. We would recommend doing them in the order listed below; there's no overwhelming reason that you need to do them in any particular order, however, and it would be far better to move on to the next one rather than get buried in one and not make any progress.
This is an extension of the disemvoweling exercise from a previous C Lab. Here you should write a program that disemvowels an entire file, reading from standard input or a file specified on the command line, and writing to standard output or to a file specified on the command line.
It's common for command line utilities to be very flexible in how they
handle both input and output. This allows them to be easily used as both
"stand alone" utilities, and as combined with other programs using
pipes. To illustrate what that looks like from a programmer's standpoint,
you'll write a program file_disemvowel
that can be called in several
different ways, each of which is listed below.
The difference in all of these is where the input comes from and the output
goes to. If the user doesn't provide any command line arguments, for example,
the the code should read from the "file" stdin
and write to the "file"
stdout
, where both stdin
and stdout
are pre-defined constants provided
in <stdio.h>
. Your code should probably set up your file I/O using stdin
and stdio
as the defaults, and then replace them with files specified by
the user if/when the user provides them. You'll use fopen()
to open files
specified by the user, and you need to remember to use fclose()
to close
the files when you finish. (Failing to close can lead to some
of the last data you wrote out not actually making it into the file, which
can cause tests to fail.)
$ ./file_disemvowel
Some text
consisting of 2 lines.
^D
Sm txt
cnsstng f 2 lns.
Note that we can also use I/O redirection and pipes to modify standard input and standard output:
$ cat small_input
Vulcan once stood proud here,
until he spewed his parts from his dirty mouth,
shattering his body and the land alike.
-- Thomas McPhee, 20 Jul 2010
$ ./file_disemvowel < small_input > /tmp/output
$ cat /tmp/output
Vlcn nc std prd hr,
ntl h spwd hs prts frm hs drty mth,
shttrng hs bdy nd th lnd lk.
-- Thms McPh, 20 Jl 2010
Your program should be able to take a single command line argument that is interpreted as the name of the file to be disemvoweled. The output will then go to standard output.
$ cat small_input
Vulcan once stood proud here,
until he spewed his parts from his dirty mouth,
shattering his body and the land alike.
-- Thomas McPhee, 20 Jul 2010
$ ./file_disemvowel small_input
Vlcn nc std prd hr,
ntl h spwd hs prts frm hs drty mth,
shttrng hs bdy nd th lnd lk.
-- Thms McPh, 20 Jl 2010
If your program is provided with two command line arguments, it should interpret the first as the input file and the second as the file where the output should be written.
$ cat small_input
Vulcan once stood proud here,
until he spewed his parts from his dirty mouth,
shattering his body and the land alike.
-- Thomas McPhee, 20 Jul 2010
$ ./file_disemvowel small_input /tmp/output
$ cat /tmp/output
Vlcn nc std prd hr,
ntl h spwd hs prts frm hs drty mth,
shttrng hs bdy nd th lnd lk.
-- Thms McPh, 20 Jl 2010
The directory file_disemvowel/tests
has a bats
test
script (file_disemvowel_test.bats
) and some test data. You want to start by trying to get
those tests to pass at a minimum but, as always, we make no guarantees
that those tests are in any way complete enough to ensure correctness. To
run the tests enter
bats tests/file_disemvowel_test.bats
in the file_disemvowel
directory.
The basic structure of our solution is shown below. You should certainly feel free to copy code from your earlier lab to help as appropriate, but you may need to tweak things some to fit the new circumstances.
#include <stdio.h>
#include <stdbool.h>
#define BUF_SIZE 1024
bool is_vowel(char c) {
/*
* Returns true if c is a vowel (upper or lower case), and
* false otherwise.
*/
}
int copy_non_vowels(int num_chars, char* in_buf, char* out_buf) {
/*
* Copy all the non-vowels from in_buf to out_buf.
* num_chars indicates how many characters are in in_buf,
* and this function should return the number of non-vowels that
* that were copied over.
*/
}
void disemvowel(FILE* inputFile, FILE* outputFile) {
/*
* Copy all the non-vowels from inputFile to outputFile.
* Create input and output buffers, and use fread() to repeatedly read
* in a buffer of data, copy the non-vowels to the output buffer, and
* use fwrite to write that out.
*/
}
int main(int argc, char *argv[]) {
// This sets these to `stdin` and `stdout` by default.
// You then need to set them to user specified files when the user
// provides files names as command line arguments.
FILE *inputFile = stdin;
FILE *outputFile = stdout;
// Code that processes the command line arguments
// and sets up inputFile and outputFile.
disemvowel(inputFile, outputFile);
return 0;
}
A few comments:
- We've used
#define
to create a named constant which we used for the size of the two buffers in the functiondisemvowel()
. The choice of 1024 for the buffer size is largely arbitrary. You often see these in multiples of a kilobyte. If you have a lot of RAM (which modern machines typically do) you might slightly improve performance by increasing this buffer size, but you'd be unlikely to be able to tell the difference without careful measurements. - Rather than using the very low-level I/O tools
open()
,read()
, andwrite()
, we recommend using the slightly higher level toolsfopen()
,fread()
, andfwrite()
, as they're more like what you'd actually use in a text-processing program like this. You'll want to look at the man pages forfopen()
,fread()
, andfwrite()
for the details, and definitely ask questions if you're not sure what's going on there. - This uses command line arguments in C. You probably want to start by just reading standard input and writing to standard input, as that allows you to initially ignore the command line arguments and focus on the file operations; once you get that right, you should pass the first of the three tests. When that's done that you can start worrying about the command line arguments. There are lots of on-line resources for handling command line arguments in C, e.g., in the wikibook A Little C Primer.
skip
to prevent them
from hanging when you first start the project. You need to make sure
to remove (or comment out) the skip
lines when think you have command
line argument processing working. Otherwise your handling of command
line arguments won't actually be tested.
Here we'll generate several different solutions to the same problem, two in C (using different system tools), and one using shell commands. The problem we're solving in each case is to write a program that takes a single command line argument and looks at every file and directory from there down (in sub-directories, sub-sub-directories, etc.) and summarizes how many directories it finds and how many regular files (non-directories) it finds. The output should look like:
Processed all the files from <test_data/loads_o_files/>.
There were 1112 directories.
There were 10001 regular files.
There is a test script in
summarize_tree_test.bats
that will
only pass when all three programs exist and work using these names:
summarize_tree
summarize_tree_ftw
summarize_tree.sh
The two files test_data/small_dir_sizes
and test_data/large_dir_sizes
in the repo are necessary for the tests to pass; you shouldn't remove
or change them.
Remember to also use valgrind
to look for memory leaks in both your
C versions.
test_data.tgz
tar
archive. This contains large collections of test directories and
files (over 10,000 files) that we keep compressed in the repository so
we don't burden Github (and you) with pushing and pulling these large
collections of files. So execute
cd test_data
tar -zxf test_data.tgz
That should give you two directories: loads_o_files
and fewer_files
.
The former is a large directory structure with thousands of files and
directories. The latter is a smaller subset. The tests assume those
directories exist and won't pass unless they do.
We'll start with a "traditional" C solution using the stat()
and
readdir()
library functions. We've given you a substantial stub for
this in summarize_tree/summarize_tree.c
which
is also included below:
#include <stdio.h>
#include <sys/stat.h>
#include <stdbool.h>
#include <stdlib.h>
#include <dirent.h>
#include <unistd.h>
#include <string.h>
static int num_dirs, num_regular;
bool is_dir(const char* path) {
/*
* Use the stat() function (try "man 2 stat") to determine if the file
* referenced by path is a directory or not. Call stat, and then use
* S_ISDIR to see if the file is a directory. Make sure you check the
* return value from stat() in case there is a problem, e.g., maybe the
* the file doesn't actually exist.
*
* You'll need to allocate memory for a buffer you pass to stat(); make
* sure you free up that space before you return from this function or
* you'll have a substantial memory leak (think of how many times this
* will be called!).
*/
}
/*
* This is necessary this because the mutual recursion means there's no way to
* order them so that all the functions are defined before they're called.
*/
void process_path(const char*);
void process_directory(const char* path) {
/*
* Update the number of directories seen, use opendir() to open the
* directory, and then use readdir() to loop through the entries
* and process them. You have to be careful not to process the
* "." and ".." directory entries, or you'll end up spinning in
* (infinite) loops. Also make sure you closedir() when you're done.
*
* You'll also want to use chdir() to move into this new directory,
* with a matching call to chdir() to move back out of it when you're
* done.
*/
}
void process_file(const char* path) {
/*
* Update the number of regular files.
* This is as simple as it seems. :-)
*/
}
void process_path(const char* path) {
if (is_dir(path)) {
process_directory(path);
} else {
process_file(path);
}
}
int main (int argc, char *argv[]) {
// Ensure an argument was provided.
if (argc != 2) {
printf ("Usage: %s <path>\n", argv[0]);
printf (" where <path> is the file or root of the tree you want to summarize.\n");
return 1;
}
num_dirs = 0;
num_regular = 0;
process_path(argv[1]);
printf("There were %d directories.\n", num_dirs);
printf("There were %d regular files.\n", num_regular);
return 0;
}
The static int
declaration on line 9 gives us two semi-global
variables that are visible to all the functions in this file, but
invisible outside of this file. This gives us something sort of like
fields in a class in a language like Java,
except that you can't make "instances" of this file
so there are never more than these two integer variables. (This means if
you tried to use this approach for something like a stack, then you
couldn't ever have more than one stack.)
In is_dir()
, you'll want to use the stat()
system call
(try man 2 stat
) to get the "status" of the file identified by path
.
The signature for stat()
is
int stat(const char *path, struct stat *buf);
This is typical of many library
functions where the "return value" is actually being written into a
pointer argument (buf
in this case), and what's returned is an error
code. If the error code is 0, then stat
worked fine, and its results
are stored in
buf
; if the error code is non-zero then there was a problem
(e.g., the file didn't actually exist) and you can't count on buf
containing any meaningful information.
When it succeeds, what you get in buf
is a whole
bunch of information including the inode number, access and change
times, etc. (see the man
page for more). What we care about is the
st_mode
field; to access that field you need to dereference the
pointer (i.e., *buf
) and then use the dot notation to access the
field: (*buf).st_mode
. (The parens
are necessary because dot .
binds more tightly than star *
.)
This syntax is sufficiently awkward that the
designers of C provided an alternative syntax for this common case
that allows us to use x->y
instead of (*x).y
. In our case that
means we can use buf->st_mode
instead of (*buf).st_mode
.
This st_mode
field contains all the file permission information,
which includes whether it's a directory or not. We could try to extract
that information from st_mode
directly, but they've given us a
S_ISDIR()
macro that does that work for us. Then the key expression is
something like
S_ISDIR(buf->st_mode)
An important point about stat()
is that the function doesn't allocate space for
buf
; you're supposed to have done that before you call stat()
. You
can do this with malloc()
, or you can
actually declare a struct stat
locally (struct stat buf
), and pass
it's address to stat()
with &buf
. If you do that, buf
isn't a
pointer, so you don't need the arrow notation above and can just use
S_ISDIR(buf.st_mode)
In process_directory
you'll need to have a loop that reads
all the entries in the directory named by the given path, making
recursive calls to process those sub-paths. The basic approach is to use
opendir()
to open the directory, repeatedly call readdir()
to read
the next entry from the directory, and then call closedir()
at the
end. There are a few bits you need to be careful of:
- You need use
chdir()
at the start ofprocess_directory()
to actually move into the directory you're processing. You'll then to usechdir("..")
to move back up to the parent directory at the end ofprocess_directory()
. - You need to check whether
readdir()
gave you either "." or ".." as thed_name
. It will eventually do that since every directory contains both those entires. You don't want to make recursive calls on those, because you'll end up in an infinite loop.
Since C tends to just crash horribly when something goes wrong, you want
to get in the habit of checking for errors so you can try to give your
user some useful feedback. I've provided an example in main()
where I
check that the program is called with the right number of command line
arguments. You also want to check the result code returned by stat()
in is_dir()
. You should also probably check for errors from things
like opendir()
, readdir()
, closedir()
, and chdir()
, although
errors there are less likely.
There is a nice C function perror()
that will print out a somewhat
human readable error when something goes wrong. You might find that useful.
Having gotten the version with stat()
to work, we're now going to do
this again, using a nice library tool called ftw()
(for "file tree
walk"). Here you just need to call ftw()
(which you can do straight
from main()
after checking the command line arguments), and it does
all the recursive traversal of the file system for you. What you need to
do is define a function
static int callback(const char *fpath, const struct stat *sb, int typeflag)
and pass it in as the second
argument of ftw()
, something like this:
static int callback(const char *fpath, const struct stat *sb, int typeflag) {
// Define stuff here
}
#define MAX_FTW_DEPTH 16
int main(int argc, char** argv) {
// Check arguments and set things up
ftw(argv[1], callback, MAX_FTW_DEPTH);
// Print out the results
}
ftw()
will call your callback()
function once for every file it
finds, giving you the opportunity to do whatever you wish with that
file. In our case we just determine whether it's a directory or not (the
typeflag
parameter is very useful here) and update the appropriate
counter. In more general applications, however, you could check for file
types (image files, for example), copy things, compute their sizes, or
whatever.
An alternative to writing this in C is to use shell commands.
Here we'd recommend using find
to do the traversal for you (its
-type
flag is probably useful), and let wc
do the counting.
If you find that wc
gives you white space you don't want, you
can use xargs
to strip that off.
The canvas rubric provides detailed information on how you will be graded. The main topics revolve around
- Create the four executable files:
-
disemvowel
-
summarize_tree
-
summarize_tree_ftw
-
summarize_tree.sh
-
- Ensure that the programs pass their tests
- Ensure that the C programs pass valgrind tests
- Code and commits should be understandable and useful