Build Sanity in Bioinformatics Projects

package_name
    README.md
    Makefile AND/OR configure AND/OR install.sh
    |___bin
    |___lib
    |___include
    |___src
    |___doc
    |___test

I've been wrangling a new-to-me C++ project over the past few weeks, and I've realized that it is by no means the first bioinformatics package to use a non-canonical build format. I don't mean to pick on the field. Science will always come first, but there are good reasons for following solid design practices even if it's just a small piece of code for in-house use. You never know if a piece of code will become important enough to publish at a later date.

Let me back up and first explain what a canonical build format is. Most software packages share the same basic layout, with a folder hierarchy to separate code, libraries, binaries, etc. This usually looks like the diagram this post opens with. bin contains any binaries once they are built; lib contains third-party libraries as well as any generated during the build; include contains all header files in the package; and src contains source code. That's it. The doc folder, while nice, is not strictly necessary. You may also want a test folder and additional top-level files if you're using a test harness. While this isn't a flat structure, it's about as simple as a hierarchical system can be. A good README contains instructions for building as well as a list of external dependencies to be installed, so that there is no guesswork in building your code. Hiltmon has a great post on the topic of project structure that I've drawn from extensively in creating this post.
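As a rough illustration, the layout in the diagram can be scaffolded with a few lines of shell. This is only a sketch: package_name is the placeholder from the diagram, and the README contents are made-up filler, not a recommended template.

```shell
#!/bin/sh
# Sketch: scaffold the canonical layout for a hypothetical package.
set -eu

pkg="package_name"   # placeholder name taken from the diagram

# One directory per role: binaries, libraries, headers, sources, docs, tests.
for dir in bin lib include src doc test; do
    mkdir -p "$pkg/$dir"
done

# Top-level files a user looks for first; only created if they are missing.
[ -f "$pkg/README.md" ] || printf '# %s\n\nBuild: make\nDependencies: (list them here)\n' "$pkg" > "$pkg/README.md"
[ -f "$pkg/Makefile" ] || : > "$pkg/Makefile"

# Show what was created.
ls -1 "$pkg"
```

The Makefile is left empty on purpose; the point is only that everything has an obvious home before any code is written.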

This is the organization of choice for most C++ projects and almost all modern Java projects. In fact, many Java build systems, such as Maven, strongly enforce such a structure. In Maven you have to modify the build configuration to use any other layout. For emphasis: you can't change your directory structure without also changing the build configuration, which is usually a more difficult task than simply obliging Maven's demand for order. While that's a pretty intense requirement, after many years of dealing with both styles I'm beginning to understand why.

The vast majority of users will run the same ./configure; make; make install sequence the moment they get your code. If they see a configure script, they'll know to run that first. I'm guilty of regularly typing ./configure at the command line before realizing there isn't even a configure script inside the folder. In Java, there's a similar gravitation towards mvn package and other basic commands. Since users are accustomed to this, the setup is easy on them. It behooves developers who want people to use their code to employ such a user-friendly structure. More users means more citations and (hopefully) good press.

Additionally, the directory structure above mimics other packages. It even mirrors Unix, with its /usr/local/* hierarchy for system-wide header files and libraries. This format works, and it provides a sort of common interface between packages. It makes linking against packages easy, and if you add a make install target you can even install to the system-level directories with just a handful of copy commands in your makefile. Plus, your package becomes nearly plug-and-play compatible with build systems like Gradle and Maven if it's in Java. For C/C++, you may have to toy with a NAR plugin, but it should still be possible to move to an advanced build system if your package is relatively simple and organized like so.
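To make the "handful of copy commands" concrete, here is a minimal sketch of what a make install recipe might run, written as plain shell. The install_tree function and the throwaway hello package are hypothetical names invented for the demo, and the demo installs into a temporary staging directory rather than /usr/local so it can run without root.

```shell
#!/bin/sh
# Sketch of the copy commands behind a `make install` target.
# install_tree and the demo package below are hypothetical.
set -eu

# install_tree PKG_DIR PREFIX: copy bin/, lib/ and include/ under PREFIX.
install_tree() {
    pkg_dir=$1
    prefix=$2
    for sub in bin lib include; do
        mkdir -p "$prefix/$sub"
        # Copy only when the source directory exists and is not empty.
        if [ -d "$pkg_dir/$sub" ] && [ -n "$(ls -A "$pkg_dir/$sub")" ]; then
            cp -R "$pkg_dir/$sub/." "$prefix/$sub/"
        fi
    done
}

# Demo: build a tiny fake package, then install it into a staging prefix.
work=$(mktemp -d)
mkdir -p "$work/pkg/bin" "$work/pkg/lib" "$work/pkg/include"
printf '#!/bin/sh\necho hello\n' > "$work/pkg/bin/hello"
chmod +x "$work/pkg/bin/hello"
: > "$work/pkg/include/hello.h"

install_tree "$work/pkg" "$work/stage"
"$work/stage/bin/hello"
```

Because the package directories have the same names as the ones under the prefix, the install step is a loop over three names. Pointing the prefix at /usr/local instead of a staging directory would give the system-wide install described above.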

There are, of course, some downsides to this structure. Perhaps it's conceptually hard to grasp for certain packages. There's also the pain of migrating from an old organization scheme to a new one. Another major argument against this structure is that it isn't as easy to include header files in your code, but this is easily remedied using a mixture of environment variables (LD_LIBRARY_PATH, CPATH, C_INCLUDE_PATH, CPLUS_INCLUDE_PATH, and LIBRARY_PATH), good old GNU Make, and the -L, -I, and -l flags to GCC/ICC. You can even set these in a script that is sourced before the build kicks off, or export them from the Makefile itself, to make sure they're set. In all, I think the benefits of canon outweigh any negatives. Stick to a simple design like this, and your users (and fellow developers) will thank you.
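For the header and library paths, a small sketch of how the -I, -L, and -l flags fall out of the layout may help. The compile_cmd function, the main.cpp source file, and the mylib library name are all hypothetical; the script only prints the command so it can run on a machine without a compiler.

```shell
#!/bin/sh
# Sketch: derive compiler and linker flags from the canonical layout.
# compile_cmd, main.cpp and mylib are hypothetical names.
set -eu

# compile_cmd PKG_ROOT LIBNAME: print a g++ command line for the layout.
compile_cmd() {
    root=$1
    lib=$2
    cppflags="-I$root/include"   # header search path
    ldflags="-L$root/lib"        # library search path at link time
    ldlibs="-l$lib"              # links lib/lib<name>.a or lib/lib<name>.so
    echo "g++ $cppflags $root/src/main.cpp $ldflags $ldlibs -o $root/bin/main"
}

# Print the command for a package rooted at the current directory.
compile_cmd "$PWD" mylib
```

With GCC, setting CPATH to the include directory and LIBRARY_PATH to the lib directory has roughly the same effect as the -I and -L flags, while LD_LIBRARY_PATH matters later, when the dynamic loader looks for shared libraries at run time.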