Some thoughts on coding practices and an outline of a work-flow that i find useful when starting any project. Future posts will focus on concrete examples.
The problem
Everyone has had the pleasure of receiving code that is broken: not just code that doesn't translate to a specific need sans some edits here and there, but code that is fundamentally flawed and lazy in design. Sometimes the giver will remark, "i got lazy and just hard-coded some stuff in; it should be easy to change." That can be true if they put all their magic numbers at the beginning as easily modifiable constants and follow a clear logical plan throughout. However, there is a tendency to build page-long for-loops; a dislike for small, self-contained functions; and a desire to build complex data-structures into low-level functions. For example, low-level functions are no place for storing analyzed matrices in complex, multi-level structures; send the results up to the calling function and have the data-abstraction take place there. This allows the code to be re-used later on and makes re-factoring easier when the function's I/O isn't overly complex.
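To make that concrete, here is a minimal sketch (the function and variable names are made up for illustration): the low-level routine just returns a plain matrix, and any multi-level organization of the results happens in the caller.

    % low-level function: matrix in, matrix out; it knows nothing about the experiment
    function corrMatrix = computePairwiseCorrelation(signalMatrix)
        % signalMatrix is [nSignals x nTimepoints]; output is [nSignals x nSignals]
        corrMatrix = corrcoef(signalMatrix');
    end

    % in the calling function (a separate file), the data-abstraction happens up here
    analysis.day01.corr = computePairwiseCorrelation(day01Signals);
    analysis.day02.corr = computePairwiseCorrelation(day02Signals);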
Refactoring mercilessly is a style of programming that has become very useful as the complexity and amount of data i need to handle grow. In lab we routinely produce terabytes of imaging data that needs to be backed up, compressed, processed, and analyzed. Initially i had a hodge-podge of code that spanned ImageJ macros/plugins, Matlab GUIs, batch programs, and R scripts. The overarching challenge was integrating all these disparate languages and design philosophies into a single, unified pipeline. Some of the ImageJ plugins written by previous lab members relied on for-loops with hundreds of lines of inter-dependent, function-less code that inefficiently switched between automatic analysis and manual intervention. For example, if you needed to select a region of an image to analyze, this was done at the start of each loop instead of batching all the selections at the beginning and having the script automatically retrieve the entered target regions later on. Thus, i set out to improve the work-flow.
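The batching idea looks roughly like the sketch below. It assumes the Image Processing Toolbox's roipoly for interactive region selection; movieList would be a cell array of movie file names, and readFirstFrame/analyzeMovie are hypothetical helpers standing in for the real pipeline steps.

    % first pass: collect all the manual input up front
    nMovies = length(movieList);
    regionMasks = cell(nMovies,1);
    for movieNo = 1:nMovies
        firstFrame = readFirstFrame(movieList{movieNo});  % hypothetical helper
        regionMasks{movieNo} = roipoly(firstFrame);       % user draws each region once
    end
    % second pass: fully automatic, so it can run unattended
    for movieNo = 1:nMovies
        analyzeMovie(movieList{movieNo}, regionMasks{movieNo});  % hypothetical helper
    end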
The fix
Initially, i tried to refactor much of the code given to me so that it would be more easily maintainable. This worked, insofar as i was able to slightly abstract bits and pieces to allow batch processing in a more streamlined fashion. This turned out to be the case for both the compression of the movies and the pre-processing before analysis. However, the disparate design philosophies and monstrously unmaintainable code base led me to finally abandon the ImageJ plugins and batch scripts altogether and write the entire process up in Matlab.
This was partially inspired by the fact that one constantly needed to write temporary text files to transfer information between the different scripts, and by the need to use TIFF files because ImageJ's support for HDF5 files is flaky. Our compressed movies are often tens of GB for a single imaging session, so the inherent 4GB limit of TIFF files, which ImageJ gets around through some sketchy modification of how it writes headers to the files, was slowly becoming a crippling bottleneck. It became apparent, when i tried to show other labmates the new workflow, how disastrous it would be to maintain and improve upon this quicksand of a pipeline in the months and years to come. Making this type of realization early, and acting on it quickly, can save oodles of time down the road.
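For reference, Matlab's built-in h5read will pull an arbitrary block of frames out of an HDF5 movie without ever loading the whole file into memory; the file and dataset names below are placeholders.

    % read the first 500 frames of a [height x width x nFrames] HDF5 movie
    inputFile = 'concatenated_session01.h5';  % placeholder file name
    inputDataset = '/1';                      % placeholder dataset name
    hinfo = h5info(inputFile, inputDataset);
    movieDims = hinfo.Dataspace.Size;
    movieChunk = h5read(inputFile, inputDataset, [1 1 1], [movieDims(1) movieDims(2) 500]);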
There is always an initial resistance to porting a pipeline over to a different language, especially one that requires proprietary/paid software to run. This is why i tend to do most analyses in R: it is both powerful and free. However, when it comes to dealing with matrices, i.e. what every movie and image is, Matlab is without a doubt the first choice due to its combination of speed, portability, and ease-of-use. Thus, i did away with the temporary text files, designed a workflow that allowed easy integration/removal of components, and started coding.
The workflow
Below is the basic workflow that i follow when writing code:
- Write down what the intended overall function of the set of scripts will be. This often involves diagramming out what the general flow of data should look like, how it gets modified at each step, and the intended output(s). It might also be helpful to talk with others working on similar data to see what their process is like; this can point toward the best ways to design the work-flow such that it is amenable to multiple types of data organization or flexible enough to be easily converted. Writing code assuming others will use it is one of the best ways to avoid specificity traps and bad coding practices.
- Set up a root directory with several sub-directories where functions will reside, e.g. /io, /model, /views, /pre_processing (don't use spaces in folder or file-names...ever). This helps organize the logic of the project and makes it easy for others to find functions of interest.
- Check and see if someone has already written functions that are generally similar to what will be needed for various steps. If such functions exist, use them directly or, if they are function-less messes, abstract them.
- Often, when i say 'abstract them', i mean: find the underlying point of the function and refactor (or completely change) it so that it will accept a general input, perform some transformation on it, and provide a few outputs.
- For example, if you are analyzing transients in a neuronal calcium signal (as i do on a daily basis in lab), realize that the function is basically a signal-processing function (not a calcium-transient function) that is attempting to classify peaks in an analog waveform. Thus, write the function as such and avoid coding in details specific to the experiment (e.g. looping over days, animals, etc.); a sketch of this kind of generic peak-finder follows this list.
- The same applies to naming functions: e.g. readHDF5Subset is much better than loadAnimalDaysData; the former is specific to the type of data being read and the operation performed on it, while the latter could mean multiple things. Normally the underlying type of data, or the manipulation of said data, guides my naming schemes rather than the specific scientific use or interpretation of said data. However, laziness or lack of creativity sometimes wins in this case.
- Start writing wrapper functions where the specifics of the implementation reside. By doing this, you resist the temptation to implement highly-specific routines in low-level functions and reduce the possibility of overloading functions that should have one, and only one, task. Lastly, it provides an easy place to sketch out a workflow and see which functions still need to be written (see the wrapper sketch after this list).
- Test (or unit test, depending on the environment) each new function's capabilities by running it through real or fake data that meets the specs for that function. Liberally add comments to each function that outline the acceptable inputs and a list of changes to the function (since not everyone uses git). Further, if there are any options or 'magic' variables, place them at the top. For example, i consistently use the below code to deal with an arbitrary number of input arguments via varargin in Matlab. getOptions is a small function that processes inputs from varargin and adds them to the portable options structure without the use of a crazy-long switch statement. This is clear and allows setting of defaults that are easily changed and obvious to the user at the beginning of each function.
    % biafra ahanonu
    % example of dealing with options in matlab
    % updated 2013.12.10
    function exampleFcn(inputDir,varargin) % function name and inputDir argument are illustrative
        % default options, all defined in one obvious place at the top
        options.fileFilterRegexp = 'concatenated_.*.h5';
        % get options
        options = getOptions(options,varargin);
        % unpack options into current workspace
        fn = fieldnames(options);
        for i = 1:length(fn)
            eval([fn{i} ' = options.' fn{i} ';']);
        end
        % ...rest of the function uses fileFilterRegexp, etc...
    end

    function options = getOptions(options,inputArgs)
        % biafra ahanonu
        % updated: 2013.12.10
        % gets default options for a function, restrict by pre-defined names in options structure
        validOptions = fieldnames(options);
        for i = 1:2:length(inputArgs)
            % use ability to reference structure fields by their string name
            if ischar(inputArgs{i})&&any(strcmp(inputArgs{i},validOptions))
                options.(inputArgs{i}) = inputArgs{i+1};
            else
                continue;
            end
        end
    end
- Briefly skim over the code and check whether the same type of routine is being repeated, e.g. concatenating together strings from the same underlying variables, and find a way to abstract the process into a small, discrete function (a tiny example follows this list).
- If there is a giant for-loop, look for breaks in the logic and turn those into sub-functions. This both makes the main function easier to read and, in the case of Matlab, reduces memory usage as unused variables are freed when the sub-routine finishes (though, for large variables, this can actually temporarily increase memory usage).
- And, as always, look for ways to make short functions that take on a single, concrete task. While this may lead to having a plethora of functions, it is easier to maintain than several hundred (or thousand) lines of highly inter-dependent, obtuse code in a single script (you should not have to use Find to determine where a variable was defined). Further, it is likely that these small functions will come in handy later; it's like building several small Lego sets that you can call upon when the time comes.
- Once it appears the code is stable and suitably functionalized, see if any of the loops can be parallelized (e.g. parfor in Matlab or parLapply in R). If the code was designed correctly, this should be easier than if there are sprawling dependencies: small functions have obvious entry/exit points that are easier to debug, remove, or modify if they turn out to conflict with parallelizing the overarching pipeline (a short parfor sketch follows this list).
- And for Matlab, can the loop be vectorized, e.g. using bsxfun? (See the bsxfun example after this list.)
- Also, avoid global variables at all costs; they destroy the ability to follow the logic of the code or to tell at a glance what a function's I/O is.
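To flesh out the signal-processing point from the list above, here is a minimal sketch of a generic peak classifier. It assumes the Signal Processing Toolbox's findpeaks, re-uses the getOptions pattern from earlier, and the function and option names are illustrative rather than taken from an existing lab function.

    % generic peak detection: knows nothing about calcium, animals, or days
    function [peakIdx, peakValues] = identifySignalPeaks(inputSignal, varargin)
        % default options, overridable via name-value pairs
        options.minPeakHeight = 2*std(inputSignal);
        options.minPeakDistance = 5;
        options = getOptions(options, varargin);
        [peakValues, peakIdx] = findpeaks(inputSignal, ...
            'MinPeakHeight', options.minPeakHeight, ...
            'MinPeakDistance', options.minPeakDistance);
    end

Looping over days, animals, or sessions then lives in whatever wrapper calls this function, not inside it.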
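Likewise, a wrapper function might look something like the following sketch. Aside from Matlab built-ins (dir, regexp, fullfile), every function it calls is hypothetical here (including the call signature shown for readHDF5Subset); the point is only to show where the experiment-specific details live.

    % wrapper: experiment-specific details (paths, file filters, looping) live here,
    % while the functions it calls stay generic and re-usable
    function batchAnalyzeMovies(dataDir, varargin)
        options.fileFilterRegexp = 'concatenated_.*.h5';
        options = getOptions(options, varargin);
        fileList = dir(dataDir);
        for fileNo = 1:length(fileList)
            if isempty(regexp(fileList(fileNo).name, options.fileFilterRegexp, 'once'))
                continue;
            end
            filePath = fullfile(dataDir, fileList(fileNo).name);
            movieChunk = readHDF5Subset(filePath);     % hypothetical call signature
            cellSignals = extractSignals(movieChunk);  % hypothetical, generic
            identifySignalPeaks(cellSignals(1,:));     % generic peak classifier from above
        end
    end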
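As a trivial example of pulling repeated string-building into one place (all names made up):

    % build a standardized output path in one spot instead of inside every loop
    function outPath = getOutputPath(outputDir, animalID, dayNo)
        outPath = fullfile(outputDir, sprintf('%s_day%03d_analysis.csv', animalID, dayNo));
    end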
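A sketch of the parallelization step: once each iteration only touches its own files, switching the outer loop to parfor (Parallel Computing Toolbox) is nearly a one-word change. movieFiles and analyzeOneMovie are stand-ins for whatever the pipeline actually uses.

    % each iteration reads and writes only its own files, so the loop parallelizes cleanly
    resultList = cell(length(movieFiles), 1);
    parfor fileNo = 1:length(movieFiles)
        resultList{fileNo} = analyzeOneMovie(movieFiles{fileNo});  % hypothetical helper
    end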
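And a small bsxfun example in the same spirit, here computing a dF/F-style normalization of a movie matrix without an explicit loop over frames (variable names are illustrative):

    % movieMatrix is [height x width x nFrames]
    meanImage = mean(movieMatrix, 3);                    % [height x width] mean frame
    dfofMovie = bsxfun(@minus, movieMatrix, meanImage);  % subtract the mean image from each frame
    dfofMovie = bsxfun(@rdivide, dfofMovie, meanImage);  % divide by it to get (F - F0)/F0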
Having a pre-defined method of building up even simple scripts has greatly improved my ability to maintain code and build on it later (and share it). For our Pavlovian conditioning paradigm, we get data out of a Med Associates box (via MED-PC) that is nightmarishly organized. However, during my rotation in the Schnitzer lab, i wrote a small R function that re-organizes the underlying data into data-frames and saves it as a csv file. Originally it was a mess; however, after trying to parallelize the script to allow faster analysis of the myriad of experiments we were doing, i re-factored the underlying code into several smaller sub-functions and made a wrapper that allows parallelization via parLapply in the parallel package. This little bit of abstracting has saved a great deal of time as the experiments changed but the underlying type of data stayed constant. Further, as mentioned before, discrete units of logic are quite amenable to parallelization.
In future coding practices posts i will go over specific examples of before-after code to make concrete the work-flow i outlined above and the problems i try to avoid.