Bash practices - Part 1: Input validation and local variables

Posted by Matthias Noback

Forgive me for my bad pun. As I mentioned in my previous Bash post, I'm going to show you some ways in which you can improve the design of your Bash scripts. Again, it's a weird language, and a lot of what's below probably won't feel natural to you. Anyway, here we go.

I started out with a piece of code that looked like this:

BUILD_DIR="build"

function clean_up() {
    rm -r "$BUILD_DIR"
}

clean_up

Function arguments

Inside a function you can use all global and environment variables, which easily leads to smelly code like the snippet above: clean_up behaves differently depending on what's in the global variable BUILD_DIR. This makes the function quite unpredictable, but also error-prone, as at some point BUILD_DIR may not contain the name of a directory, or may even be an empty string. Usually we'd fix this by passing the path of the directory we'd like to remove as an argument of the function call, like this:

function clean_up() {
    rm -r "$1"
}

clean_up "$BUILD_DIR"

You may recognize this $1 syntax from previous Bash encounters: the positional parameters $1, $2, … hold the arguments that you provide when you run a script at the command line. Likewise, when calling a Bash function, $1, $2, … represent the arguments that the caller provided (by the way, I really like this symmetry between calling functions and running programs).
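This symmetry can be shown with a tiny sketch (the greet function is made up for illustration):

```shell
# Inside a function, $1 is the first argument the caller passed,
# and $# counts how many arguments were provided -- exactly the same
# syntax a script uses for its command-line arguments.
function greet() {
    echo "Hello, $1 (got $# arguments)"
}

greet "world" "ignored"   # prints: Hello, world (got 2 arguments)
```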

Passing the directory as an argument (although not a named, nor a typed argument) is good practice. It makes the function reusable. And equally important: predictable. Its behavior won't be influenced by changes in global variables.

Input validation

The only problem so far is that the clean_up function doesn't perform any input validation at all. You can call this function without any argument, and you won't receive so much as a warning...

In order to function correctly, the following pre-conditions need to be met:

  • Argument 1 needs to be provided.
  • It should represent the path to an existing directory.

We can easily accomplish this by adding a -d test. However, we can't really throw an exception if the directory doesn't exist. The best thing we can do is follow Unix conventions:

  1. print an error message to stderr.
  2. exit with a non-zero exit code.

That way, the process running our script knows that it encountered a problem. We print the error message to stderr to prevent other processes from accidentally processing it as regular output, in case the script is part of a longer chain of commands (e.g. command-a | command-b > output.txt).
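Before applying this to clean_up, the convention itself can be sketched in isolation (the produce function is made up for illustration):

```shell
# Normal output goes to stdout, diagnostics go to stderr.
function produce() {
    echo "data"                    # file descriptor 1 (stdout)
    echo "warning: low disk" 1>&2  # file descriptor 2 (stderr)
}

# Only stdout travels through the pipe; the warning bypasses it
# and shows up on the terminal instead.
count=$(produce 2>/dev/null | wc -l)
echo "lines piped: $count"   # counts only the single stdout line
```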

function clean_up() {
    if [[ ! -d "$1" ]]; then
        echo "Argument 1 should be the path of an existing directory" 1>&2
        exit 1
    fi

    rm -r "$1"
}

Note that we echo and exit where we would normally like to throw an exception. echo prints to stdout (file descriptor 1), but we'd like to print to stderr (file descriptor 2). We accomplish this by redirecting file descriptor 1 to file descriptor 2: 1>&2

This starts to look like a reasonable function. However, $1 is still a bad variable name. It doesn't explain what it represents. We'd rather call it directory. We can easily do so of course:

directory="$1"

if [[ ! -d "$directory" ]]; then
    #...
fi

rm -r "$directory"

Local, named variables

That's much better already! However, Bash variables are global by default: once we set a variable inside a function, it will also be available outside that function:

function clean_up() {
    directory="$1"

    # ...
}

clean_up "build"

echo "$directory"

Bash has a way to mark variables as "local to the current scope": add the local keyword in front of the variable name, like this: local directory="$1". However, I recommend using declare, which also makes a variable local when used inside a function, and has many more options, even allowing some rudimentary typing. Let's use the -r option in this case to mark the variable as read-only (PHP could also benefit from such an option by the way).

function clean_up() {
    declare -r directory="$1"

    #...
}

clean_up "build"

# This will show an empty string:
echo "$directory"
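The "rudimentary typing" mentioned above comes from declare's other options. A small sketch (demo is a made-up function; the -A option requires Bash 4+):

```shell
function demo() {
    declare -i counter=0            # -i: integer, arithmetic on assignment
    counter+=5
    echo "$counter"                 # 5

    declare -a list=("a" "b")       # -a: indexed array
    echo "${#list[@]}"              # 2

    declare -A map=([key]="value")  # -A: associative array (Bash 4+)
    echo "${map[key]}"              # value
}

demo
```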

A nice debugging trick is to use declare -p to print all variables (including environment variables) that have been declared at that point in the script.
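Given a variable name, declare -p prints that variable's declaration, attributes included (demo is a made-up function for illustration):

```shell
function demo() {
    declare -r directory="build"
    # Prints the declaration of this one variable; the output looks
    # something like: declare -r directory="build"
    declare -p directory
}

demo
```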

For completeness' sake, this is the full code of the final solution:

#!/usr/bin/env bash

function clean_up() {
    declare -r directory="$1"

    if [[ ! -d "$directory" ]]; then
        echo "Argument 1 should be the path of an existing directory" 1>&2
        exit 1
    fi

    rm -r "$directory"
}

clean_up "build"

Conclusion

In this article we've improved the design of the clean_up function. This is what you'd call a "command" function: it does something, it has side effects, and it may either succeed or fail, without providing a particular return value. In another article I'll show you a query function that needs fixing.


Adventures with Bash

Posted by Matthias Noback

A bit of reminiscing

When I was a kid, MS Windows was still a program called WIN.COM which you needed to start from the MS-DOS command prompt. You could also add it to AUTOEXEC.BAT which was a so-called batch file. You could write these .BAT files yourself. They were basically just command-line scripts. You could make them execute commands, print things, collect input, and make simple decisions. It wasn't much, and I remember that you often needed some helper .COM or .EXE programs to accomplish anything useful. The most advanced thing I ever wrote was a nice little ASCII-art menu program, spread across multiple .BAT files (with GOTOs and all), which allowed me to easily start my favorite games, like Super Tetris, known to me as SUPERTET.EXE, or Prince of Persia.

From .BAT to .php to .sh

Several years later I learned a bit of PHP and immediately felt at home. PHP was a scripting language back then. Of course it still is, but it doesn't feel like one anymore. It shared (and still shares) some basic characteristics with scripting languages like the MS-DOS batch programming "language".

Several more years later, working on a Mac, I encountered something called "shell scripts": files with an .sh extension that you can run, provided you have the right permissions. These scripts often start with #!/usr/bin/env bash, known as a shebang, which tells the shell which interpreter should be used to run the script. By the way, if you run the env program without an argument, you get a list of all the environment variables that are available to you. This can be quite useful.
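A minimal script wearing that shebang might look like this sketch:

```shell
#!/usr/bin/env bash

# The shebang above lets `env` locate bash on the PATH, instead of
# hard-coding an interpreter path such as /bin/bash.
echo "interpreted by bash ${BASH_VERSION}"

# Run without arguments, `env` lists the environment, one NAME=value
# pair per line; here we show only the first few, sorted:
env | sort | head -n 3
```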

Learning Bash

Until recently I hadn't felt the need to learn more about the Bash programming language. When I started researching Docker, though, I encountered a lot of examples written in Bash. Most of those examples I didn't understand completely. Bash has a lot of crazy syntax, and people don't often put very informative comments in their scripts. I hate it when I don't understand what's going on in a piece of code that I use in a project, so one day I decided to dive into Bash and learn enough about it to get away with the things I don't know yet. I started reading the Bash Academy guide, but unfortunately it's an unfinished project. Next up was "Pro Bash Programming: Scripting the GNU/Linux Shell, Second Edition" by Jayant Varma and Chris F. A. Johnson. A very interesting book, which I never finished, but keep open as a reference. In terms of reference material, I sometimes find Google a useful source (which often leads me to Stack Overflow). This demonstrates to me that I don't really know what I'm doing yet, as I simply try out several of the answers I find. A better reference book is "Bash Pocket Reference, 2nd Edition" by Arnold Robbins.

About Bash

Bash is everywhere. It's pre-installed on Linux and Mac OS X, and since last year it's even possible to use Bash on Windows. Without a need to compile your code, this means that you can run your Bash script on many machines already. Two potential problems though:

  • There are differences between Bash versions (this isn't any different from code written in any other programming language by the way).
  • The power of a Bash script usually lies in the programs it runs. Not every runtime environment comes with the same programs installed (like git, mktemp, read, etc.).

Both of these potentially problematic situations could make your script fail, or—maybe worse—behave in subtly different ways. With simple scripts it's certainly possible to navigate safely around these problems, but in most cases I recommend running a Bash script inside a known-stable environment, with pinned dependency versions, for example inside a carefully prepared Docker container.
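When you can't control the environment, a defensive version check at the top of a script is one way to fail loudly instead of subtly. A sketch (the feature cut-off is an example; pick whatever your script actually relies on):

```shell
# BASH_VERSINFO is an array holding the version of the running bash;
# index 0 is the major version. Associative arrays, for instance,
# require Bash 4 or newer.
if (( BASH_VERSINFO[0] < 4 )); then
    echo "This script requires Bash 4 or newer" 1>&2
    exit 1
fi

echo "Bash ${BASH_VERSINFO[0]}.${BASH_VERSINFO[1]} is recent enough"
```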

Bash script characteristics

As a programming language, Bash isn't strictly typed. Where the types of values are concerned, this most often results in pretty sloppy programming. Apparently that's how it's supposed to be, but it might make you feel a bit insecure from time to time.

Another reason to feel insecure is the fact that functions have no predefined parameters. In fact, running a program, running a built-in command, or calling a function all have the same syntax, allowing a variable number of arguments (and options, if applicable). It's up to the function or program to verify that any required argument has been provided.

Just like programs produce output and exit codes, functions can have a numeric return value and optionally print something to stdout or stderr. This is a very different approach to functions than most programmers might expect, but it actually makes a lot of sense in the environment in which these scripts run.
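A sketch of such a "query" function (word_count is made up for illustration): the result travels over stdout and is captured with command substitution, while success or failure lives in the exit status, mirroring how external programs communicate.

```shell
function word_count() {
    # The "return value" is whatever we print to stdout...
    echo "$#"
    # ...while `return` only sets the numeric exit status (0 = success).
    return 0
}

result="$(word_count one two three)"
echo "$result"   # 3
```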

Many Bash functions will have side-effects, like changing the current working directory, creating directories, copying files, exit-ing the process, etc. When designing these functions, I often feel like I'm doing something unnatural, sometimes even dirty.

A big reason for "feeling dirty" is that besides function arguments, functions in-the-wild often use environment variables to base their decisions on. "Environment variables" is another word for "global variables", which I've vowed to never use again in my programs. Still, I find myself writing Bash code like this:

export GIT_CLONE_URL="$(git remote get-url origin)"
export COMMIT_HASH="$(git rev-parse --short --verify HEAD)"

#...

PROJECT_DIR=$(pwd)

function fresh_checkout() {
    cd "$PROJECT_DIR"
    mkdir -p "$PROJECT_DIR/build"
    BUILD_DIR=$(mktemp -d "$PROJECT_DIR/build/$COMMIT_HASH-XXXXXXX")
    git clone "$GIT_CLONE_URL" "$BUILD_DIR"
    cd "$BUILD_DIR"
    git checkout "$COMMIT_HASH"
}

function clean_up() {
    rm -rf "$BUILD_DIR" || true
}

fresh_checkout

This does the trick, but it isn't particularly well-designed code.

Determined to improve this awful situation, I set out to refactor the clean_up() function. Along the way I learned quite a lot of interesting things about Bash programming, which I'll explain to you in my next post.


Duck-typing in PHP

Posted by Matthias Noback

For quite some time now the PHP community has been becoming more and more professional. "More professional" in part means that we use more types in our PHP code. Though it took years to introduce more or less decent types into the programming language itself, it took some more time to really appreciate the fact that by adding parameter and return types to our code, we can verify its correctness in better ways than we could before. And although all the type checks still happen at runtime, it feels as if they already happen at compile time, because our editor validates most of our code before actually running it.

To make it perfectly clear: this is all very awesome. In fact, I hope that PHP will change to become more of a static language than a dynamic one. I can very well remember the times when we actually relied on PHP doing the type juggling for us, but I'm happy we've left that phase behind. I think that nowadays many PHP developers agree that silent type conversions are neither very useful nor safe.

But sometimes it's good to remember what's possible with PHP, due to it being a dynamic scripting language. I recently encountered a situation where I wanted to build a generic repository, which would be able to keep track of entities, allowing the user to store and retrieve them by their ID.

class SomeEntity
{
    public function id()
    {
        return $this->id;
    }
}

class GenericRepository
{
    public function store($object)
    {
        $id = $object->id();
        ...
    }

    public function getById($id)
    {
        return ...;
    }
}

So, what are the types we should introduce in this scenario? $id might be a simple string, although these days identifier strings often get wrapped in their own dedicated value objects. Maybe we could enforce an interface for Id-type objects? But then people wouldn't be able to use a simple string anymore. Do I want to force that upon them? The same goes for the objects that our repository is going to store. $object might be typed as an Entity interface (since an object with identity is basically what we call an "entity"), which has a method id() that returns an identifier:

interface Id
{
    public function __toString() : string;
}

interface Entity
{
    public function id() : Id;
}

Do we want to force the term Entity onto the user's code? Do we want to force users to implement the Id interface? What if there is no user we can force? What if the "entity" we want to store in our repository is defined in a third-party library?

It doesn't have to be that way. Hey, it's PHP! We only want the user to provide an object which we can use in the following way:

public function store($object) { 
    $id = $object->id();

    /*
     * $id should be a string, or usable as a string (i.e. it has a __toString() method)
     *
     * In fact, we might as well just cast it to a string to be sure:
     */

    $id = (string) $id;

    ...
}

The funny thing is, whatever value the user provides, we can already do this. As long as the method id() exists on the object and PHP can successfully cast its return value to a string, we're fine. As long as we don't define any type at all for the $object parameter, PHP will do no type checking and will just try to do whatever you ask it to do, and throw warnings/errors/exceptions whenever it fails.

The only problem, one that many of us, including myself, will find a very big problem: our IDE isn't able to help us anymore. It won't be able to verify that methods exist or that passed function argument types are correct, it won't let us click through to class definitions, etc. In other words, we lose our ability to do a little bit of the type checking before runtime.

How to fix this? By helping your IDE to figure it out. PhpStorm for example allows you to define @var or @param annotations to make intended types explicit.

public function store($object) {
    /** @var Entity $object */
    ...
}

// or (this might show some IDE warnings in the user's code):

/**
 * @param Entity $object
 */
public function store($object) {
    ...
}

So, even when $object doesn't actually implement Entity, the IDE will still treat it as if it does.

This, by the way, is known as duck typing. Type checks happen at runtime:

With normal typing, suitability is assumed to be determined by an object's type only. In duck typing, an object's suitability is determined by the presence of certain methods and properties (with appropriate meaning), rather than the actual type of the object.

Introducing the php-duck-typing library

The only problem with adding a type hint like this is that PHP will simply crash at some point if the passed value doesn't meet our expectations. When we call store() with an object that doesn't really match the Entity interface, we'd like to give the user some insight into what's wrong about the object passed to store(), e.g.:

  • The object doesn't implement the Entity interface.
  • It does offer the method id().
  • id() doesn't return an object with a __toString() method though.

In other words: we need some proper validation!

Let me introduce you to my new, highly experimental open source library: php-duck-typing. It allows you to run checks like this:

public function store($object) {
    // this will throw an exception if the object is not usable as Entity:
    Object($object)->shouldBeUsableAs(Entity::class);

    ...
}

Just wanted to let you know that this exists. I had some fun exploring the options. Some open issues:

  • Could an object with a __toString() method be used as an actual string value?
  • What about defining other types which we can use as pseudo-types, e.g. arrays as traversables, arrays as maps, etc.?

I'd be interested to hear your thoughts about this.

For now, this library at least supports the use case I described in this article. I'm not sure if it has a real future, to be honest. Consider it an experiment.
