landsummary.com

One of the things I like least about my real job, and about much of the contract work I do, is that I'm usually the only programmer working on each task. So it's been very fun to work on a project with Josh Livni (his writeup). We got together one afternoon, and by the time we left we had a reasonable start on what we call landsummary; since then we've put in a fair bit of work sprinkled here and there. Josh set up an AWS server--it's nice for me to have fewer sys-admin duties too.
What's actually on display is fairly modest: it takes a user-drawn square, circle, or arbitrary polygon and uses it to summarize the NLCD dataset along with some census data. The things that make it more than lame are that it's very fast, it can easily be extended to summarize any raster dataset, and we have a sorta cool API (not documented) which allows us, or anyone, to query the data with a WKT polygon and request a particular service -- currently nlcd, population, weather.precip, and a couple of services with environmental engineering applications.

All of them use the same libraries: PostGIS for the census-related stuff, and GDAL and gdal_array for the raster (currently just NLCD) queries. Josh handled all the census data; I know very little about that, except that whenever I've tried previously it's been a pain to work with, and now Josh has a nice setup for it. I took the lead on the raster summaries. For that, I wrote a little library that wraps gdal_array, so you can take a GDAL datasource:

>>> g = AGoodle("something.tif")
>>> a = g.read_array_bbox([xmin, ymin, xmax, ymax])
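
A minimal sketch of what read_array_bbox might do under the hood -- assuming a north-up raster with no rotation; the actual AGoodle code may differ:

from osgeo import gdal

def read_array_bbox(path, bbox):
    # bbox is (xmin, ymin, xmax, ymax) in the raster's map coordinates
    xmin, ymin, xmax, ymax = bbox
    ds = gdal.Open(path)
    # geotransform: (origin_x, pixel_width, 0, origin_y, 0, pixel_height);
    # pixel_height is negative for north-up rasters
    gt = ds.GetGeoTransform()
    xoff = int((xmin - gt[0]) / gt[1])
    yoff = int((ymax - gt[3]) / gt[5])
    xsize = int((xmax - xmin) / gt[1])
    ysize = int((ymin - ymax) / gt[5])
    # read only that window of the first band, as a numpy array
    return ds.GetRasterBand(1).ReadAsArray(xoff, yoff, xsize, ysize)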

And then 'a' is a numpy array, with all the niceties that entails. So if we want just the food cells, which have values of 81 and 82 (pasture/hay and cultivated crops) in the NLCD dataset, it's just:

>>> a[(a == 81) | (a == 82)]
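
From there, summary stats are numpy one-liners; for example, the percentage of the box that's agricultural (continuing the snippet above):

>>> ag = a[(a == 81) | (a == 82)]
>>> 100.0 * ag.size / a.size  # percent of cells that are pasture/hay or crops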

For arbitrary polygons, we use a surprisingly fast function from matplotlib to mask anything that's not inside the list of vertices (the polygon), then run the summary stats on the masked array.
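
That fast function was probably matplotlib's points_inside_poly (since removed); here's a sketch of the idea using matplotlib.path.Path.contains_points, its modern replacement, with a made-up helper name:

import numpy as np
from matplotlib.path import Path

def mask_outside(arr, verts):
    # verts: polygon vertices in (col, row) pixel coordinates
    ny, nx = arr.shape
    # center coordinates of every cell, as an (ny*nx, 2) array of points
    cols, rows = np.meshgrid(np.arange(nx) + 0.5, np.arange(ny) + 0.5)
    points = np.column_stack((cols.ravel(), rows.ravel()))
    inside = Path(verts).contains_points(points).reshape(ny, nx)
    # mask everything outside the polygon; stats then ignore those cells
    return np.ma.masked_array(arr, mask=~inside)

Stats like masked.mean(), or value counts on masked.compressed(), then see only the in-polygon cells.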

Thanks to Josh, we have a fairly nice django project structure, with a separate app for each little analysis we've added. In my previous django projects I've dumped everything into a single app and hacked away; the structure we have now makes it easier to keep what's needed in my brain. Also, when hacking with someone, I'm less likely to put in total crap code. Josh has already had a good laugh at some code where I found the 25 closest weather stations using:

SELECT * FROM stations ORDER BY ABS(lat - ?) + ABS(lon - ?) LIMIT 25

then sorted those 25 using geopy.distance to get the real distances. In my defence, I really wanted to use vanilla sqlite and so didn't have PostGIS at my disposal--also, it was quite fast for only 6,000 stations. We've since dumped it all into PostGIS. There are probably a couple of other gems like that in there.
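
Put together, that two-step hack looks roughly like this (a sketch; the stations column layout is an assumption):

import sqlite3
from geopy.distance import distance  # geodesic distance between (lat, lon) points

def closest_stations(conn, lat, lon, n=25):
    # cheap pre-filter: Manhattan distance in degrees, which sqlite can sort by
    rows = conn.execute(
        "SELECT id, lat, lon FROM stations "
        "ORDER BY ABS(lat - ?) + ABS(lon - ?) LIMIT ?",
        (lat, lon, n)).fetchall()
    # re-rank the few candidates by true geodesic distance
    return sorted(rows, key=lambda r: distance((lat, lon), (r[1], r[2])).km)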

So, back to the modest functionality part... Actually, it turns out this is a fairly difficult thing to do in McClick software--time-consuming in both user time and processor time. So having a way to click a point and see land-use stats and population data appear in about a second is pretty cool--and it's on the web. We've already found a couple of folks with interesting applications, and we're interested in finding more. The original motivation was 'foodmiles', from this post, and there are a couple of things we'll probably add from that; people I happened to overhear in a cafe today were talking about foodmiles and seemed interested in incorporating the carbon footprint of exporting/importing food vs. growing it locally. My friend Megan also has lots of other ideas for things that firms commonly do McClick-style with the NLCD data.
There's more info on the about page, but suffice it to say we make full use of the usual open-source GIS and science tools.

Comments

Megan D said…
you did good work i can attest
