| Author: | Martin Blais |
|---|---|
| Contact: | blais@furius.ca |
| LastUpdate: | $Date$ |
Abstract
Description of a very simple but efficient system for source files and website organization for working with many projects at the same time, with a workflow that includes multiple sites, files under version control and data files.
This document describes the system I have been using to work on many projects at the same time: managing all the source files and publishing easily to a web server, without going crazy. The system is composed of very simple ideas, but the decisions behind them were many and carefully considered (unfortunately, not all are documented here; there are too many, and over time I have forgotten some of the original reasons). It has proved over time to make my varied work manageable, and it has helped maintain my sanity in keeping all files and details under control when working at multiple sites.
In extreme programming, it has been said that simple things are often more difficult to accomplish than complicated tasks. This is an attempt at finding the simplest design for efficiently working on lots of projects at the same time by someone who does.
The scope and needs of the system here considered are the following: the need to...
After much experimentation [1], we have decided that all project directories will lie in a single directory, typically $HOME/p, which will contain only directories of the types mentioned below. That location is the heart of the system. We use the following environment variable to specify it:
PROJECTS=$HOME/p
export PROJECTS
The scripts are written so that this variable can be specified; the user can thus maintain several such roots (typically one for the source code at each site, and one for the web server).
This implies that the name of each project directory must be unique amongst the set of all source code, data directory and generated data projects (see below).
| [1] | We used to have a separate root directory for source-files directories and data-files directories (i.e. projects-src would be in $HOME/p and projects-data would be in $HOME/data). This makes it more complicated because if we need to refer to the data from the source code we need to use the basename of the source directory-- not a big deal, but it's that much nicer without having to do that. |
Each project directory consists of one of a few types of files:
SOURCE FILES, written by the user/developer. These files need to be under version control.
We detect that a directory contains source files by the presence of a <project>/CVS subdirectory [2];
DATA FILES, used by the source code or web publishing system. These files should not be under version control, but it should be easy to share them between sites and to make sure that new additions are backed up automatically on a centralized server. The reason for this is that data files seldom change and are often too big to be managed efficiently by version control systems.
We detect that a directory contains data files to be backed up to a remote server by the presence of a DATADIR file in the root of the project directory. The contents of the DATADIR file determine the backup policy: the first line of the file is interpreted as the location for the remote rsync'ed directory to back it up to, e.g. for project <pro>, using DATADIR contents:
user@host.domain.com:/u/blais/dataroot

It will be mirrored at:

user@host.domain.com:/u/blais/dataroot/<pro>/
Note: a DATADIR file is not allowed to be empty.
GENERATED FILES, files that are generated either with the source code or through some other process. Those files can be lost or deleted without much worry. (However, since the configuration of the web server is unified with the projects source, it has been decided that the generated files lie next to the source code for simplicity's sake.)
Generated files are detected by the absence of both a CVS subdirectory and a DATADIR file in the root of the project directory.
| [2] | Right now we use CVS, but nothing prevents us from eventually moving to another revision control system such as tla or subversion or perforce. |
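The detection rules above can be sketched in a few lines of Python. This is only an illustration and not the author's script: the function names `classify_project` and `mirror_target` are hypothetical, and the precedence given to CVS when both markers are present is an assumption, since the text does not cover that case.

```python
from pathlib import Path


def classify_project(project_dir):
    """Return 'source', 'data' or 'generated' for a project directory."""
    root = Path(project_dir)
    # Assumption: if both markers exist, the CVS subdirectory wins.
    if (root / "CVS").is_dir():
        return "source"
    if (root / "DATADIR").is_file():
        return "data"
    # Neither marker is present: treat the directory as generated files.
    return "generated"


def mirror_target(project_dir):
    """Return the remote rsync location for a data project.

    The first line of DATADIR is the remote root; the project is
    mirrored in a subdirectory named after the project itself.
    """
    root = Path(project_dir)
    lines = (root / "DATADIR").read_text().splitlines()
    if not lines or not lines[0].strip():
        # A DATADIR file is not allowed to be empty.
        raise ValueError("DATADIR must not be empty: %s" % root)
    remote_root = lines[0].strip().rstrip("/")
    return "%s/%s/" % (remote_root, root.name)
```

For a project directory named `jukebox-data` whose DATADIR contains the remote root from the example above, `mirror_target` would return that root followed by `/jukebox-data/`.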
The separation of source and data files is important, because stray files lying in the source directories that are not meant to be under version control make it a royal pain in the neck to manage code. If a file is necessary but does not require revision control, we don't hack: we acknowledge it and place it in an appropriately named directory (e.g. <project>-data). It makes it a lot easier to know what's going on when you get lost. The basic message is: always place files immediately where they should be, as if you were about to leave for a 3-month holiday. Don't leave stuff lying around.
Each project name must be short, and unique amongst the set of all source code, data directory and generated data projects. For consistency (and ease of typing), names do not contain underscores; words are separated by dashes (whatever you choose, be consistent; it makes it easier on the fingers).
When a data directory relates to a project, we name it <project>-data.
When a source code project's purpose is generating a "generated project", we name it <project>-src and the generated project simply <project>. This is so that when we put the generated project under a web server (typical use), it can simply be referred to as URL/.../<project>. This is very convenient. So we have the following arrangement:
$PROJECTS/<project1>-src : source code directory
$PROJECTS/<project1>-data : data directory associated to project1
$PROJECTS/<project1> : generated directory
$PROJECTS/<project2> : (could be a) source code directory
$PROJECTS/<project3> : (could be a) source code directory
$PROJECTS/<project3>-data : data directory associated to project3
$PROJECTS/<project4> : (could be a) pure data directory not
associated to any project
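As a rough illustration of the naming convention, the following Python sketch groups directory names by the project they belong to, using only the -src and -data suffixes. The helper names and the "main" label for an unsuffixed directory are hypothetical; the name alone cannot tell whether an unsuffixed directory holds source, data or generated files.

```python
def split_role(dirname):
    """Split a directory name into (project, role) using the suffix rules."""
    for suffix in ("-src", "-data"):
        if dirname.endswith(suffix) and len(dirname) > len(suffix):
            return dirname[: -len(suffix)], suffix[1:]
    # No suffix: source, generated or pure data; the name does not say which.
    return dirname, "main"


def group_by_project(dirnames):
    """Map each project name to the directories that belong to it."""
    groups = {}
    for dirname in sorted(dirnames):
        project, role = split_role(dirname)
        groups.setdefault(project, {})[role] = dirname
    return groups
```

Applied to names such as `santeriadb-src` and `santeriadb-data`, both end up under the single key `santeriadb`.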
For example, at the time of writing this, my projects directory looked like this:
elbow:~/p$ ls -l
total 136
drwxr-xr-x 10 blais users 4096 Dec 29 02:50 adventures
drwxr-xr-x  4 blais users 4096 Dec  9 14:07 backup-index
drwxr-xr-x 33 blais users 4096 Jan  1 19:08 camera
drwxr-xr-x  9 blais users 4096 Dec 29 02:51 camera-tools
drwxr-xr-x  7 blais users 4096 Dec 26 21:17 company
drwxr-xr-x  6 blais users 4096 Jan  2 17:02 conf
drwxr-xr-x  7 blais users 4096 Dec 29 02:50 contract-tools
drwxr-xr-x  8 blais users 4096 Jan  1 17:24 curator
drwxr-xr-x  6 blais users 4096 Jan  1 15:26 drumming-notes
drwxr-xr-x 10 blais users 4096 Dec 24 17:43 extra
drwxr-xr-x  8 blais users 4096 Dec 24 17:43 hiertemp
drwxr-xr-x  3 blais users 4096 Jan  3 14:14 impdoc
drwxr-xr-x 11 blais users 4096 Dec 29 02:56 jukebox
drwxr-xr-x  4 blais users 4096 Dec 26 21:22 jukebox-data
drwxr-xr-x  5 blais users 4096 Jan  1 15:29 languages
drwxr-xr-x  9 blais users 4096 Sep 29 22:07 latindance
drwxr-xr-x  5 blais users 4096 Dec 29 02:51 mailsoup
drwxr-xr-x  5 blais users 4096 Dec 26 23:08 memcard
drwxr-xr-x  3 blais users 4096 Dec 24 17:43 optparse-completion
drwxr-xr-x  6 blais users 4096 Dec 29 02:50 pydeps
drwxr-xr-x  5 blais users 4096 Dec 24 17:43 reviews
drwxr-xr-x  4 blais users 4096 Jan  1 17:24 rhythm-latin
drwxr-xr-x  6 blais users 4096 Dec 29 02:51 salsa
drwxr-xr-x  4 blais users 4096 Dec 26 23:27 santeriadb-data
drwxr-xr-x  7 blais users 4096 Jan  1 18:26 santeriadb-src
drwxr-xr-x  3 blais users 4096 Oct 28 00:18 solus-config
drwxr-xr-x  4 blais users 4096 Jan  1 18:23 techdoc
drwxr-xr-x  9 blais users 4096 Jan  1 15:40 tengis
drwxr-xr-x  4 blais users 4096 Jan  2 17:21 web-diro
drwxr-xr-x  5 blais users 4096 Jan  3 14:23 web-furius
drwxr-xr-x  4 blais users 4096 Dec 29 02:51 x2vnc
drwxr-xr-x 11 blais users 4096 Dec 31 19:02 xxdiff
We have a remote computer to which all work changes are saved at the end of every work period (typically every day or half-day). The source code projects are managed using CVS. The data directories are duplicated using rsync.
The location of the repositories is specified in the following configuration options (in a .projectsrc configuration file, read by a custom script):
[options]
source_root = "user@cvs.server.com:/path/to/cvsroot"
data_root = "rsync.server.com:/path/to/dataroot"
There are a few more options supported by my script, but they are not relevant to this document.
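A configuration file of this shape could be read with Python's standard configparser module, as in the sketch below. This is an assumption about how such a file might be parsed, not a description of the custom script; the function name `read_projects_config` is hypothetical, and stripping the surrounding double quotes is a choice made here because configparser keeps them as part of the value.

```python
import configparser


def read_projects_config(text):
    """Parse the [options] section and return its values without quotes."""
    # Interpolation is disabled so that a literal '%' in a path is harmless.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    options = parser["options"]
    # configparser keeps the double quotes; remove them here.
    return {key: value.strip('"') for key, value in options.items()}
```

In practice the text would come from reading the .projectsrc file; passing a string keeps the sketch easy to try out.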
I wrote a script to manage the update, change, commit process over this structure. It finds the projects and their types and can apply commits and updates globally, and more.
The following example command output lists the checked-out projects. For each one it shows the project name, the type (source: source files under CVS; data: data files managed with rsync; other: an unmanaged directory, e.g. HTML generated by a project), whether it is backed up to the central repository, and the full path of the directory:
elbow:~$ projects lsco
adventures           [ source | backup ]: /home/blais/p/adventures
backup-index         [ other  |        ]: /home/blais/p/backup-index
books                [ other  |        ]: /home/blais/p/books
camera               [ other  |        ]: /home/blais/p/camera
camera-tools         [ source | backup ]: /home/blais/p/camera-tools
company              [ data   | backup ]: /home/blais/p/company
conf                 [ source | backup ]: /home/blais/p/conf
contract-tools       [ source | backup ]: /home/blais/p/contract-tools
curator              [ source | backup ]: /home/blais/p/curator
drumming-notes       [ source | backup ]: /home/blais/p/drumming-notes
extra                [ source | backup ]: /home/blais/p/extra
hiertemp             [ source | backup ]: /home/blais/p/hiertemp
impdoc               [ source | backup ]: /home/blais/p/impdoc
jukebox              [ source | backup ]: /home/blais/p/jukebox
jukebox-data         [ data   | backup ]: /home/blais/p/jukebox-data
languages            [ source | backup ]: /home/blais/p/languages
latindance           [ source | backup ]: /home/blais/p/latindance
mailsoup             [ source | backup ]: /home/blais/p/mailsoup
memcard              [ source | backup ]: /home/blais/p/memcard
optparse-completion  [ source | backup ]: /home/blais/p/optparse-completion
projects             [ source | backup ]: /home/blais/p/projects
pydeps               [ source | backup ]: /home/blais/p/pydeps
reviews              [ source | backup ]: /home/blais/p/reviews
rhythm-latin         [ source | backup ]: /home/blais/p/rhythm-latin
salsa                [ source | backup ]: /home/blais/p/salsa
santeriadb-data      [ data   | backup ]: /home/blais/p/santeriadb-data
santeriadb-src       [ source | backup ]: /home/blais/p/santeriadb-src
techdoc              [ source | backup ]: /home/blais/p/techdoc
tengis               [ source | backup ]: /home/blais/p/tengis
web-diro             [ source | backup ]: /home/blais/p/web-diro
web-furius           [ source | backup ]: /home/blais/p/web-furius
x2vnc-modifs         [ source | backup ]: /home/blais/p/x2vnc-modifs
xxdiff               [ source | backup ]: /home/blais/p/xxdiff
Permissions are not set automatically for CVS. Since we add files regularly, it becomes annoying to have to set permissions by hand on the server. In fact, this often only becomes a visible problem at the moment of publishing the files on a web server. We would like the process to be automated. To that effect, we wrote a script that sets the permissions from within the cvsroot on the server. The script is called cvsroot-perms and is invoked from this configuration in CVSROOT/loginfo:
^CVSROOT (echo ""; id; echo %s; date; cat) >> $CVSROOT/CVSROOT/commitlog
DEFAULT (echo "================================================="; date; cat) \
>> $CVSROOT/CVSROOT/commitlog 2>&1 ; \
( /u/blais/p/conf-local/plat/i686-pc-linux/bin/python \
/u/blais/p/conf/common/bin/cvsroot-perms --logbox \
$CVSROOT/%{} 2>&1 ) | tee -a $CVSROOT/CVSROOT/commitlog
One important consideration for this system's design was to be able to share specific projects over a web server as easily as possible, with as little maintenance as possible.
We simply create a directory to contain all of the web server's contents and use that as a projects root. The web server is configured so that requests for simple documents in the root (i.e. without slashes) are redirected to /home/<doc>, so that a request for /index.html will be redirected to /home/index.html.
Also, requests to /home/... are aliased to one of the projects (typically one of the web-<something> projects). This allows me to have meaningful names for the various web projects without having to have many "home" projects.
All links on the websites and projects then refer to /<project>/<path>/<to>/<file>. It's very simple. Project names are short and do not change often, i.e. they are not a long public "title" that represents the project, but rather a short identifier/directory name that I use internally to identify that project (and in the URL). I can decide to check out (or copy) any subset of projects on any web server that I manage.
Also, the script I wrote to manage the projects in my personal work space has an option to specify the root of the projects directory, which makes it trivial to update websites.
For example, here is my configuration for my company's website:
#
# Apache config for Furius website layout.
#
# projects directory. everything lies here, it's simpler to manage.
DocumentRoot /home/httpd/public
# serve "ROOT/web-furius" as "DOMAIN/home".
Alias /home /home/httpd/public/web-furius
# make sure that ROOT/web-furius cannot be accessed as "DOMAIN/web-furius".
Redirect /web-furius http://www.furius.ca/home
# make "DOMAIN/error" appear as available from root.
Alias /error /home/httpd/public/web-furius/error
# setup specific error URLs.
ErrorDocument 404 /error/error404.html
# redirect requests for domain ("/") to "DOMAIN/home".
RedirectMatch ^/$ http://www.furius.ca/home
# redirect requests for "DOMAIN/file.{css,html,png}" to "DOMAIN/home" files.
# .txt is important for robots.txt redirection.
RedirectMatch ^/([^/]*)\.(css|html|png|txt)$ http://www.furius.ca/home/$1.$2
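To see what the two RedirectMatch rules do to a request path, the same regular expressions can be tried out in Python. The sketch below only mimics those two directives (not Alias, Redirect or ErrorDocument), and the function name `redirect_target` is hypothetical.

```python
import re

HOME_URL = "http://www.furius.ca/home"

# Same pattern as the last RedirectMatch directive in the configuration.
ROOT_FILE = re.compile(r"^/([^/]*)\.(css|html|png|txt)$")


def redirect_target(path):
    """Return the redirect URL for a request path, or None if not redirected."""
    if re.match(r"^/$", path):
        # A request for the bare domain goes to the home project.
        return HOME_URL
    match = ROOT_FILE.match(path)
    if match:
        # A slash-free document in the root is served from /home instead.
        return "%s/%s.%s" % (HOME_URL, match.group(1), match.group(2))
    # Anything else (e.g. /<project>/<path>) is served as-is.
    return None
```

A request for /robots.txt is redirected under /home, while a path that contains a further slash, such as /xxdiff/index.html, is left alone because `[^/]*` cannot match across it.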
With this setup, my configuration files are simply one of the projects that I check out in my projects directory root. An advantage of a regular organization of projects is that I can easily distribute the initialization of the environment (shell variables) and shell declarations (functions, aliases, etc.) into each of the projects. My configuration files automatically run each project's .../etc/env and .../etc/bashrc if the project is checked out to the projects directory root. This significantly reduces pollution of my shell's environment (something that quickly goes out of control).
While this section is out of the scope of the UNIX configuration issue, I document it here because with the longer fingers that the Python language provides me, many of the projects that I fiddle with daily are simply outgrowths of the scripts that used to be in my conf/share/bin configuration scripts directory.
For simplicity's sake, I want to have a certain consistency in the directory organization of my projects. In the following we're assuming that the web page for the project is also the project itself (although for some projects the web page is a separate CVS module/project by itself). These are checked out under my $PROJECTS directory:
- .../
- Root of the project.
- .../README
- General description file for the project, sometimes used as the source for generating index.html file where it makes sense (see .../.docutils file below).
- .../CHANGES
- Change log, contains change history of a project, latest at the top.
- .../TODO
- List of tasks to be done, and ideas for future development. The important or urgent section of this file ends at the .. end marker.
- .../VERSION
Version number of package, if there is one. It is useful to have a single location for the version number because it makes releasing software often much easier. Also, to have a well-defined location of the version file allows us to write a release script that works with all projects.
(Note: I only use version numbers/tags for software that is publicly released, e.g. via sourceforge or freshmeat, or that has a downloadable "package" cut for it, otherwise having to choose/update version numbers just gets in the way.)
- .../PKGINFO
A list of one-line resources for the package, where appropriate. All software projects, or at least all those that are published, should have a PKGINFO file, so that we can identify which external dependencies are being used by the project.
This is a simple file in RFC822 format with the following kinds of fields:
Official-Homepage:
Mirror-Homepage:
Announcement:
Mailing-List:
Source-Code-Repository:
Bugs-Reporting:
Download-Releases:
Download-Snapshots:
ChangeLog:
License:
Author:
Author-Email:
Short-Description:

Each of the fields can be repeated where relevant.
- .../etc/docutilsrc
- I use the docutils text format almost everywhere now; it is very useful for writing simple HTML documents with a text editor. Since it is pervasive in my system, I wrote a script (similar to docutils-html) that uses my own settings and that looks for a style.css file up the path of the output file. The .docutils file in a project's root indicates the needed .txt to .html conversions. (I check the generated HTML files into the repository, but sometimes I might need to regenerate all the docutils output for some reason.)
- .../COPYING
- GNU GPL license file, if the project is to be distributed under that license.
- .../index.html (.../index.txt)
- The Home Page, for web publication, generated in HTML with docutils. Sometimes it is generated automatically from the README file, sometimes by a different index.txt file.
- .../Makefile
- A makefile, if there are oft-used and convenient operations to be run for that project (e.g. building a web-site).
- .../doc
- User and Design Documentation for the project. Sometimes the web page from the root links to documents in this space.
- .../bin
- Scripts (and executables) to run.
- .../lib
Libraries (source and compiled) to run.
- .../lib/python/<libname>/...
- When I have Python libraries they go under a python subdir.
- .../lib/emacs/...
- When I have emacs-lisp files they go under an emacs subdir.
- .../etc (.../etc/env, .../etc/bashrc)
- Configuration files and sample configuration for the project. The main configuration files (in my conf project) look at all project directories that are checked out and, if those files exist, run them automatically (for each project). Thus my environment and bash setup vary slightly depending on which of the projects I have set up. This also allows me to put project-related environment settings close to the project files, rather than in a large single .bashrc file.
- .../src
- Source code for code to be compiled.
- .../adm
- Development scripts to be run for distribution and maintenance.
- .../share
- Data that pertains to the project. This can contain any data. This is where DTD and XSL files should go as well.
- .../test
- Whatever test code and data.
- .../tools
- Related tools, scripts that are of interest to users but that should not be installed as a default.
- .../misc
- Miscellaneous other stuff.
- .../old
- Old and obsolete files.
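Because the PKGINFO file described in the layout above is in RFC822 format, it could be read with the header parser from Python's standard email package. The following is a minimal sketch under that assumption; the function name `read_pkginfo` is hypothetical, and every field is returned as a list because fields may be repeated.

```python
from email import message_from_string


def read_pkginfo(text):
    """Return a dict mapping each field name to the list of its values."""
    message = message_from_string(text)
    fields = {}
    for name in message.keys():
        # get_all() collects every occurrence of a repeated field, in order.
        fields.setdefault(name, message.get_all(name))
    return fields
```

A file with two Author: lines would thus yield a two-element list under the "Author" key.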
Typically, the .../etc/env file will contain something akin to this:
USERPATH=$USERPATH:$PROJDIR/bin
PYTHONPATH=$PYTHONPATH:$PROJDIR/lib/python
This provides the project's scripts to the PATH of the user and makes its Python modules available. Other environment variables get set here as well; this is just an example.
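The discovery step behind this per-project initialization (finding which checked-out projects provide etc/env or etc/bashrc) might look as follows in Python. The real setup does this from shell configuration files; this sketch only illustrates the lookup, the function name `project_env_files` is hypothetical, and the alphabetical ordering is an assumption.

```python
from pathlib import Path


def project_env_files(projects_root):
    """List the etc/env and etc/bashrc files of all checked-out projects."""
    found = []
    # Assumption: projects are visited in alphabetical order.
    for project in sorted(Path(projects_root).iterdir()):
        if not project.is_dir():
            continue
        for name in ("env", "bashrc"):
            candidate = project / "etc" / name
            if candidate.is_file():
                found.append(candidate)
    return found
```

Projects without an etc directory simply contribute nothing, which matches the idea that the environment depends on which projects are checked out.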
Our code projects contain multiple libraries, e.g. Python or Perl modules. We need to strike a balance between
The gist of this is that while we really want to minimize dependencies between our projects, we would also like to have a single master source copy for any particular library (to avoid having to merge), but want to allow fallback copies of the master so that a project still works in those cases where the master is not present.
Consequently, we have chosen to take the following approach for lightweight libraries/modules (this solves requirement (1) above):
We will thus separate the library path and directories between a master part, and a fallback part. For the programs that require this, the configuration will use the variables <libpath> and <libpath>_FALLBACK to set the final <libpath> for the program and we adopt the convention that fallback libraries that would go under .../lib/<program> in the master copy will go under .../lib/<program>-fallback in the fallback location.
For example, for Python we would have:
PYTHONPATH=$PYTHONPATH:$PROJDIR/lib/python
PYTHONPATH_FALLBACK=$PYTHONPATH_FALLBACK:$PROJDIR/lib/python-fallback
...and later on, at the end of the environment code, the final value of the PYTHONPATH is set, to ensure proper ordering (i.e. to make sure that the master copy is caught first in the path):
PYTHONPATH=${PYTHONPATH}:${PYTHONPATH_FALLBACK}
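The ordering rule (every master entry before every fallback entry) can be expressed as a small Python function. This is a sketch rather than part of the actual environment code: the name `final_search_path` is hypothetical, and unlike the shell assignments above it drops the empty entries that appear when a variable starts out unset.

```python
def final_search_path(master, fallback, sep=":"):
    """Join master entries before fallback entries, dropping empty ones."""
    entries = [entry for entry in master.split(sep) + fallback.split(sep) if entry]
    return sep.join(entries)
```

Since the search stops at the first match, a module present in both a master and a fallback directory is always taken from the master copy.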