PostgreSQL, R, US Census geography and encoding

We use PostgreSQL/PostGIS to manage a lot of our tabular and geographic data from the US Census. In terms of workflow, we either download a shapefile manually from ftp://ftp.census.gov/geo/tiger/ or, if we’re dealing with more than one file (block groups or blocks, for example), we do the downloading from within R (using the download.file() and unzip() functions in a loop; see the sketch below). To import the data into Windows-based PostgreSQL we generally rely on GDAL’s ogr2ogr program, which works well (ogr2ogr can convert between a huge number of formats). For example, we might use the command:

ogr2ogr -overwrite -lco GEOMETRY_NAME=geom -a_srs "EPSG:4269" -f "PostgreSQL" PG:"host=localhost user=postgres dbname=census password=xxx" -nlt MULTIPOLYGON existing_shapefile.shp -nln target_table

to write a shapefile into a PostgreSQL/PostGIS database, and we typically use R’s system() function to execute the command from within a script.
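
Here is a minimal sketch of that loop, assuming 2010 block-group files and a handful of state FIPS codes; the FTP path, file-name pattern, output directory, and table names below are illustrative placeholders and may need adjusting for other vintages or geographies:

# State FIPS codes to fetch (illustrative)
state_fips <- c("09", "23", "25")

for (fips in state_fips) {
  # File-name pattern and FTP path for 2010 block groups (adjust for other vintages/geographies)
  zipname <- paste0("tl_2010_", fips, "_bg10.zip")
  url <- paste0("ftp://ftp.census.gov/geo/tiger/TIGER2010/BG/2010/", zipname)

  download.file(url, destfile = zipname, mode = "wb")
  unzip(zipname, exdir = "shp")

  shp <- file.path("shp", sub("\\.zip$", ".shp", zipname))

  # Build the ogr2ogr call and hand it to the shell; each state gets its own table here
  cmd <- paste0(
    'ogr2ogr -overwrite -lco GEOMETRY_NAME=geom -a_srs "EPSG:4269" ',
    '-f "PostgreSQL" PG:"host=localhost user=postgres dbname=census password=xxx" ',
    '-nlt MULTIPOLYGON ', shp, ' -nln bg_', fips
  )
  system(cmd)
}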

But we have found that PostgreSQL’s default character encoding (UTF-8) chokes on the occasional accented names and other non-English words that appear in the county and Core Based Statistical Area files. Running ogr2ogr on the county shapefiles, for example, produces the error:

ERROR 1: Terminating translation prematurely after failed translation

To address this we used R’s cat() function to create a *.bat file that first sets the client encoding to LATIN1 (via the PGCLIENTENCODING environment variable) and then runs the same ogr2ogr command. The bat file looks like this:

SET PGCLIENTENCODING=LATIN1

ogr2ogr -overwrite -lco GEOMETRY_NAME=geom -a_srs "EPSG:4269" -f "PostgreSQL" PG:"host=localhost user=postgres dbname=census password=xxx" -nlt MULTIPOLYGON existing_shapefile.shp -nln target_table

You can then run the bat file with the system() function in R and the shapefile will import without errors.
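
Here is a minimal sketch of that approach, reusing the placeholder connection details and file names from above (the batch-file name is arbitrary):

# Write the two lines of the batch file: set the client encoding, then call ogr2ogr
bat_lines <- c(
  "SET PGCLIENTENCODING=LATIN1",
  paste0(
    'ogr2ogr -overwrite -lco GEOMETRY_NAME=geom -a_srs "EPSG:4269" ',
    '-f "PostgreSQL" PG:"host=localhost user=postgres dbname=census password=xxx" ',
    '-nlt MULTIPOLYGON existing_shapefile.shp -nln target_table'
  )
)

cat(bat_lines, file = "import_latin1.bat", sep = "\n")

# Execute the batch file (on some setups shell() may be needed instead of system())
system("import_latin1.bat")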
