Like many other people who wrangle data for a living, I've been in on-and-off
pursuit of a make-like approach to processing and transforming data.
Specifically, one that allows me to idempotently bootstrap and process data
without starting from zero each time. Also, one that makes it easy enough to
script experiments up front rather than putting replicability off.
Common Idioms
We use a number of sh / bash idioms that are helpful to recognize as such.
Use of Exit Codes
When a process exits, it does so with a numeric code that indicates success or
failure (in the latter case, often with a value that can be used to determine
why). 0 is considered success; anything else is a failure of some sort.
These codes aren't directly visible, but make uses them to determine whether
a target was successful (and whether execution should continue). $? can be
used to surface the code in a shell context:
$ echo hi; echo $?
hi
0
There are a few cases where we want to shield make from failures (to prevent
execution from stopping). For example:
carto -l $< > $@ || (rm -f $@; false)
This runs carto; if it fails, it deletes the output ($@). However, rm -f
exits successfully whether or not there was any output to delete, so the
task's failure would otherwise be masked. To work around that, we execute rm
in a subshell (using parentheses) and explicitly return false (false) so that
make sees a failure and ceases execution.
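The same behavior can be reproduced in a plain shell, with false standing in for the failing carto invocation:

```shell
echo partial > out.txt           # pretend carto wrote a partial file
false || (rm -f out.txt; false)  # "false" stands in for a failing carto
echo $?                          # 1: make would stop here
```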
|| and &&
|| and && are used in conjunction with exit codes to conditionally execute
subsequent parts of a composite command. || is used when you want to execute
a command only if the first returns an error (non-zero exit code); && when
you want to execute a command only if the first was successful.
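In shell terms:

```shell
true  && echo "runs: the previous command succeeded"
false || echo "runs: the previous command failed"
false && echo "never runs"      # skipped: && needs success
true  || echo "never runs"      # skipped: || needs failure
```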
> /dev/null 2>&1
POSIX processes are provided 3 file descriptors (handles to files or file-like
things) by default. stdin (content from a file or other source like
a keyboard) is file descriptor 0, stdout is 1, and stderr is 2. Thus,
echo hi > /dev/null 2>&1 says to redirect stdout (the default output) to
/dev/null (into the abyss) and then stderr (2) to the same place as
stdout (as a reference: &1). In this context, we use this to squelch
output, since we generally just care about the exit code (either directly, or
after grepping through it or counting lines (wc -l)).
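A quick sketch (using ls as a stand-in for any noisy command):

```shell
ls no-such-file                   # complains on stderr, exits non-zero
ls no-such-file > /dev/null 2>&1  # both streams silenced
echo $?                           # the exit code still survives (non-zero)
```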
psql Checks
At various points, we want to short-circuit tasks if a resource already exists.
In a traditional Makefile, these resources would be files and that
short-circuiting is built-in. However, since we're working with database
tables (etc.) and there are no file equivalents, we need to implement
equivalent functionality.
We use psql -c for practically all existence checks, as it allows us to
construct a SQL command and use it to query PostgreSQL. Unfortunately,
different checks result in varying output and exit codes.
Looking for Relations
\d <relation> will check for the existence of a "relation" (which could be
a table, view, etc.). It displays the name, type, and owner of the relation and
exits with 0 only if it matched something.
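Put together, a guarded target built on this check might look like the following sketch (the table name osm_roads and the SQL file path are illustrative, not taken from the actual Makefile):

```make
# skip the SQL entirely if the relation already exists
db/osm_roads: db
	@psql -c "\d osm_roads" > /dev/null 2>&1 || \
		psql -v ON_ERROR_STOP=1 -qX1 -f sql/osm_roads.sql
```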
Looking for Extensions
\dx <extension> will check for extensions matching the provided name. If
present, it will display various information about it. Unfortunately, it exits
with 0 regardless of whether anything has been found or not, so we need to
grep the output for the presence of the name.
Looking for Functions
\df <name> will check for functions matching the provided name. When
a function has multiple signatures, it will be displayed multiple times. Like
\dx, it exits with 0 no matter what, so we need to grep
(case-insensitively) for the function we're looking for.
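Since \dx and \df always exit 0, the guard has to test the output with grep instead; a sketch (the function name highroad_zoom and the file names are illustrative):

```make
# \dx exits 0 either way, so grep the output instead
db/postgis: db
	@psql -c "\dx postgis" | grep -q postgis || \
		psql -c "CREATE EXTENSION postgis"

# likewise for functions, grepping case-insensitively
db/functions: db
	@psql -c "\df highroad_zoom" | grep -qi highroad_zoom || \
		psql -v ON_ERROR_STOP=1 -qX1 -f functions.sql
```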
Implicit make Variables
make includes many implicit
variables,
most of which are intended for use in an environment where source files are
being compiled and linked into binaries. However, there are still a few we make
use of:
$@ - the name of the target (the thing before the :).
$< - the name of the first prerequisite. Convenient.
$^ - all prerequisites, space-delimited. Can be combined with
$(word <n>,$^) to select the _n_th one.
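For example, in a (hypothetical) rule with three prerequisites:

```make
merged.txt: header.txt body.txt footer.txt
	@echo $@              # merged.txt
	@echo $<              # header.txt
	@echo $(word 2,$^)    # body.txt
	cat $^ > $@           # concatenate all three prerequisites
```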
make Functions
Since many of the targets we're working with are synthetic, we need to extract
relevant components of their names. Database-related functionality is grouped
under the db/ "path", so we primarily use
subst
to remove irrelevant components. We also use word to refer to components
within space-delimited values.
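For example (the values are illustrative):

```make
# given a target named db/land:
#   $(subst db/,,$@)  →  land
# given a colon-delimited pair like those used for OSM extracts below:
#   $(word 2,$(subst :, ,BC:data/extract/british-columbia.osm.pbf))
#     →  data/extract/british-columbia.osm.pbf
```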
make Patterns
There are many resources that follow patterns when we work with data. Natural
Earth's filenames and source URLs are a good example of this. In keeping with
the DRY principle ("don't repeat yourself"), we fold these into a smaller
number of targets using make patterns. These are strings that include %
anywhere text may vary.
The convenient bit is that prerequisites can also use the % syntax, and the
value matched by the pattern in the target name will be substituted. Thus,
running make toner against %: %.mml expands it to toner: toner.mml and
gives us the behavior we're looking for.
Annotated Makefile
Without further ado, here's an annotated snapshot of the Makefile-driven
approach we've been using. Suggestions are most welcome, especially if they
simplify or clarify.
SHELL := /bin/bash
Use bash for sub-shells, allowing use of bash-specific functionality.
Define a macro that expands (splits on =) and exports (makes available to
sub-shells) key-value arguments, e.g. DATABASE_URL=postgres:///db.
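The macro's definition isn't shown in this snapshot; a minimal sketch of what EXPAND_EXPORTS might look like:

```make
# split "KEY=value" on "=" and export KEY to sub-shells
define EXPAND_EXPORTS
export $(word 1,$(subst =, ,$(1))) := $(word 2,$(subst =, ,$(1)))
endef
```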
# wrap Makefile body with a check for pgexplode
ifeq ($(shell test -f node_modules/.bin/pgexplode; echo $$?), 0)
Check whether pgexplode has been installed (it's effectively a prerequisite
for the entire Makefile); test doesn't output anything, so the return
code ($?) needs to be echo'd for comparison with 0 (and escaped with an
extra $ to be passed through to the shell call).
Read .env (squelching error messages if one doesn't exist) and pass each
environment pair to EXPAND_EXPORTS to make it available to commands in
targets.
# expand PG* environment vars
$(foreach a,$(shell set -a && source .env 2> /dev/null; node_modules/.bin/pgexplode),$(eval $(call EXPAND_EXPORTS,$(a))))
Use pgexplode to expand DATABASE_URL into libpq-compatible environment
variables. This will read from the environment ($DATABASE_URL) if one isn't
present in .env (or .env doesn't exist).
Macro definitions that will strip db/ from targets' $@ (target name) and
use it as the name of a SQL file or PostgreSQL extension ($(subst db/,,$@)).
The first command (psql -c "\d $(subst db/,,$@)" > /dev/null 2>&1) checks
for the presence of a relation in the database specified by DATABASE_URL
and only evaluates SQL commands if it fails. \d <name> checks for the
presence of a relation of any kind, \dx for loaded extensions, etc.
ON_ERROR_STOP=1 is used to abort as soon as an error occurs, -q is
"quiet", -X ignores any psqlrc files, -1 runs the command in a single
transaction, -f provides a file containing commands, and -c tells it to
use the provided command.
Commands are prefixed with @ to prevent make from printing them (since
they're unnecessarily complicated due to the need to check for the existence
of non-file resources).
Macro definition for importing OSM extracts. When called, it generates
targets like db/<place> (which depends on both the import and db/shared
(see below)) and db/osm-<place>, which does the actual import. osm_roads
is assumed to be created by imposm3 in this case. $${DATABASE_URL} is
escaped because this is a macro (so it's evaluated at runtime) and uses
braces to reference the environment variable.
Now begins what looks like a more conventional Makefile.
# default target
default: toner
# symlink into TileMill's project folder (if TM is installed)
link:
@test -e ${HOME}/Documents/MapBox/project && \
test -e ${HOME}/Documents/MapBox/project/toner || \
ln -sf "`pwd`" ${HOME}/Documents/MapBox/project/toner
# clean up derivative files
clean:
@rm -f *.mml *.xml
# create a default .env file with a sensible(?) default
.env:
@echo DATABASE_URL=postgres:///toner > $@
%: %.mml
@cp $< project.mml
A pattern, which obviates the need to declare multiple redundant targets. If
make toner is run, it will depend on toner.mml (see below) and will
silently copy the output ($< is the expanded name of the first dependency)
to project.mml for TileMill to read.
Builds *.mml by interpolating environment variables into
a Mustache-templated YAML file
(interp) and converting to JSON
(js-yaml). tmp.mml is used so that
mv can atomically move the file into place (without doing this, TileMill
periodically chokes when reading partially-written files).
This depends on <style>.yml, map.mss, labels.mss, and <style>.mss so
that it will be considered stale when any of those are modified. interp and
js-yaml are explicitly called out as dependencies so that they can be
installed if necessary (see below).
This is marked as .PRECIOUS so that artifacts won't be deleted when called
as an intermediate target (i.e. from an XML target).
Builds *.xml from <style>.mml (declared above). || (rm -f $@; false) is
included because carto may leave behind invalid XML files when it fails (and
because we want to pass the failure through and terminate the current make
invocation).
Artificial (.PHONY) targets for required commands along with file-based
dependencies and checks for Node. PKG is defined as a make variable in
preparation for future refactoring that turns this boilerplate into a macro.
package.json declares dependencies on these commands, but it also includes
everything else to run a rendering node, so this is a lower-impact way of
ensuring that they're installed.
Target to ensure that a database exists. If psql fails, createdb will be
run (using libpq environment variables extracted from DATABASE_URL) to
create one.
.PHONY: db/postgis
db/postgis: db
$(call create_extension)
.PHONY: db/hstore
db/hstore: db
$(call create_extension)
Targets that create extensions (and require that a database exists) using the
macros defined above.
Define a list of places along with the local reference to their corresponding
extract (data/extract/% and data/metro/% are patterns defined below) and
generate import targets (e.g. db/BC for British Columbia).
OSM extract patterns; i.e. anything under data/extract/ will be downloaded
from Geofabrik, anything under data/metro/ from
Mapzen's Metro
Extracts).
These are marked as .SECONDARY to prevent them from being deleted (as
they're "expensive" to create). Note: this isn't quite right; my
intention is to keep them around but also to delete them if the target
failed.
mkdir -p is used to ensure that the target directory exists. $$(dirname $@) is used to pass the literal $(dirname data/metro/<whatever>) to
mkdir.
$(@:data/metro/%=%) substitutes <whatever> for data/metro/<whatever> in
the target name.
curl's -f option is provided so that it will return with a non-zero exit
code on failure and cause the target to fail.
.SECONDARY: data/osmdata/land_polygons.zip
# so the zip matches the shapefile name
data/osmdata/land_polygons.zip:
@mkdir -p $$(dirname $@)
curl -Lf https://data.openstreetmapdata.com/land-polygons-complete-3857.zip -o $@
shp/osmdata/%.shp \
shp/osmdata/%.dbf \
shp/osmdata/%.prj \
shp/osmdata/%.shx: data/osmdata/%.zip
@mkdir -p $$(dirname $@)
unzip -ju $< -d $$(dirname $@)
shp/osmdata/land_polygons.index: shp/osmdata/land_polygons.shp
shapeindex $<
.SECONDARY: data/osmdata/land-polygons-complete-3857.zip
shp/osmdata/land-polygons-complete-3857.zip: shp/osmdata/land_polygons.shp \
shp/osmdata/land_polygons.dbf \
shp/osmdata/land_polygons.prj \
shp/osmdata/land_polygons.shx \
shp/osmdata/land_polygons.index
zip -j $@ $^
Similar to the OSM extracts, but for a specific file with a non-matching
source name. Some of this remains as a relic of when we were using different
versions of the land polygons.
Macro definition that creates db/ne_<whatever> and shp/natural_earth/*
targets for Natural Earth sources. It assumes that it's called with arguments
in the form <name>:<source file>:[shapefile]. (If someone can help simplify
the repeated $(strip $(word 1, $(subst :, ,$(1)))) declarations (used to
extract the first component), I'd appreciate it!)
The db/<whatever> target runs ogr2ogr with a set of options that we've
found to work well with the Natural Earth Shapefiles over the years.
shp/natural\_earth/*-merc.{shp,dbf,prj,shx} states that there will be
4 artifacts for each invocation of ogr2ogr (used to reproject here). Again,
we use options gathered over the years along with /vsizip to avoid needing
to unzip. If a single "file" (base name, really) exists in the zip, just the
name of the zip file is necessary; otherwise a path to the Shapefile within
the zip is necessary (the list of layers below takes advantage of that by
allowing [shapefile] to be optional). (This is necessary due to errors in
the packaging of some of the Natural Earth layers.)
Aggressive escaping is necessary in order for literal $s to be passed
through, e.g. $$$$(dirname $$@).
The shp/natural\_earth/*-merc.zip target uses $^ for the list of files to
compress, which contains the expanded names of all of the dependencies.
zip's -j option is used to "junk paths" and put everything in the root of
the zip file.
Define a list of Natural Earth layers along with their local file references
and (optional) Shapefile names and call natural_earth to generate targets.
As with the OSM extracts, patterns are used to match the conventions used by
the Natural Earth site.
Provide a default target (truly default in that it will execute for any
requested target) explaining what needs to be done for things to work
correctly.