Category Archives: Tutorials

Fixing Pillow segmentation fault

I am currently working a lot with large GeoTIFFs in Python and use Pillow to read them in, then convert them to NumPy arrays for processing. Every now and then, Pillow throws the following error, that I’ve seen on several computers running OS X now:

python TIFFReadDirectory: Warning, Unknown field with tag 42113 (0xa481) encountered. Segmentation fault: 11

Since it always takes me a while to figure out how to fix this, here’s a short note to self, maybe also useful to someone else out there:

  1. Uninstall Pillow:
  2. $ pip uninstall pillow
  3. Install dependencies for building Pillow from source:
  4. $ brew install libtiff libjpeg webp little-cms2
  5. Download Pillow source from PyPI
  6. Unpack and change into the folder with the source code, then build an install via:
    python setup.py install

This has always fixed the problem for me so far. I don’t know whether building from source rather than simply running

pip install pillow

will also fix this problem on other operating systems, but it’s worth a shot if you hit that error.

PostgreSQL DISTINCT ON

Getting the highest (or lowest) value from a database column is a bit tricky if you cannot use GROUP BY, because it requires you to aggregate across all columns that you want in the result. Say you have a table with employees, having the columns name, salary, and department, and you want to know the highest-paid employee per department. Then GROUP BY is not an option because you would need to also aggregate by name to have the name in the output, which doesn’t make sense.

I used to tackle this kind of query with unwieldy subqueries, but I just came across PostgreSQL’s DISTINCT ON clause, which makes them a lot more compact and readable. Using that, you can simply do

SELECT DISTINCT ON (department)
       name, salary, department
FROM   salaries_table
ORDER BY department, salary DESC;

So we’ll only get one entry per department, and ORDER BY salary DESC makes sure it is the one with the highest salary. The only bummer is that it is a PostgreSQL-specific function, so it won’t work on other DBMSs.

Manipulating and mapping US Census data in R →

censusR

The US Census provides an incredible wealth of data but it’s not always easy to work with it. In the past, working with the tabular and spatial census data generally meant downloading a table from FactFinder and a shapefile from the boundary files site and joining the two, perhaps in a GIS system. These files could also be handled in R but getting the data, reading it into R and, in particular, merging tabular and spatial data can be a chore. Working with slots and all the different classes of data and functions can be challenging.

recent interesting post on stackoverflow by Claire Salloum prompted me to revisit this issue in R and I’ve definitely found some valuable new packages for capturing and manipulating Census data in R.

Great post explaining how to wrangle and map census data in R.

Strabon on Mac OS X

Strabon is a triple store that has been developed with a focus on spatio-temporal query functions. I’ve meant to play around with Strabon for a while, and while at ISWC earlier this week, I decided to finally give it a shot. There are no instructions for setting up Strabon on Mac OS in the user guide, so here’s what I did to get it running:

If you don’t have Homebrew installed yet, open a terminal window and paste the following command to install it:

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Then use Homebrew to get Maven, Mercurial, PostGres, and PostGIS:

brew install maven mercurial postgres postgis

Don’t worry if you have any of those already installed, Homebrew is smart enough to figure that out and will just skip over those.

Next, we’ll initialize a new PostGIS database folder and start PostGIS:

initdb strabon
postgres -D strabon

The terminal window will now be “occupied” by PostGIS (press Control+C to shut down PostGIS when you are done), so open a new terminal window (⌘N) to continue there. Next, we’ll create the actual database and enable the PostGIS extensions on it:

createdb -E UTF8 -T template0 template_postgis
psql -d template_postgis -c "CREATE EXTENSION postgis;"

Next, we’ll adjust the permissions on that database so that Strabon can make changes later, do some housekeeping, and create another database for the endpoint:

psql -d template_postgis -c "GRANT ALL ON geometry_columns TO PUBLIC;"
psql -d template_postgis -c "GRANT ALL ON geography_columns TO PUBLIC;"
psql -d template_postgis -c "GRANT ALL ON spatial_ref_sys TO PUBLIC;"
psql -d template_postgis -c "VACUUM FULL;"
psql -d template_postgis -c "VACUUM FREEZE;"
psql -d postgres -c "UPDATE pg_database SET datistemplate='true' WHERE datname='template_postgis';"
psql -d postgres -c "UPDATE pg_database SET datallowconn='false' WHERE datname='template_postgis';"
createdb endpoint -T template_postgis

That’s all for a PostGres database with spatial extension. Next, we’ll install the temporal extension (you can skip over this part if you are only interested in the spatial functions). We’ll download, build, and install the temporal extension like so:

git clone https://github.com/jeff-davis/PostgreSQL-Temporal.git
cd PostgreSQL-Temporal
make
make install
make installcheck
psql -d endpoint -c "CREATE EXTENSION temporal;"

Next, we’ll install Tomcat, a servlet container that we’ll use to host Strabon, and modify the Tomcat users. Note that if you already have Tomcat and other servlets are running in it, you should just add a user to the Tomcat configuration and skip this step:

cd ~
curl http://mirror.symnds.com/software/Apache/tomcat/tomcat-8/v8.0.28/bin/apache-tomcat-8.0.28.zip > apache-tomcat-8.0.28.zip
unzip apache-tomcat-8.0.28.zip
rm apache-tomcat-8.0.28.zip
cd apache-tomcat-8.0.28/conf
mv tomcat-users.xml tomcat-users-backup.xml
curl https://gist.githubusercontent.com/crstn/2daac7483501c8b8beda/raw/26d66cbfaec03cb8b36bedd895c6c4e4955ffe7e/tomcat-users.xml > tomcat-users.xml

And start Tomcat:

cd ../bin
sh catalina.sh start

Now we’re done setting up the PostGIS backend and Tomcat, let’s move on to download and compile Strabon:

cd ~
hg clone http://hg.strabon.di.uoa.gr/Strabon/
cd Strabon
hg update temporals
mvn clean package

Don’t worry if you get an error message here for Strabon: Executable endpoint. This part tries to automatically deploy the compiled code in Tomcat, we’ll do that manually:

cp endpoint/target/strabon-endpoint-*.war ~/apache-tomcat-8.0.28/webapps/strabonendpoint.war

That’s all. At this point, you should be able to visit http://localhost:8080/strabonendpoint/ in your browser. Before you can load data into Strabon and write stSPARQL queries, there is one last bit of configuration to do. On the web interface, go to Explore/Modify Operations > Configuration and enter the following information (assuming you used the information exactly as provided above):

  • Database Name: endpoint
  • Username: test
  • Password: test
  • Port: 5432
  • Hostname: localhost
  • Database Backend: postgis

Since we didn’t set up the database with a password, you can put in any username and password you want, but the fields can’t be empty (this seems to be a JDBC issue). If Strabon can connect to PostGIS with the configuration parameters provided, it will take you to the query page after you click connect. If there is something wrong with your parameters, nothing happens – you will just stay on the configuration page, but there is no error message (it took me a while to figure this out…). If that happens, take a look at the terminal window running PostGIS and at the Tomcat logs in ~/apache-tomcat-8.0.28/logs/, which should give you an idea of what’s going wrong.

SublimeText and LaTeXTools on El Capitan

I find myself using Atom for most of my text editing these days, but for LaTeX editing I still rely on SublimeText. This is mostly because of the excellent LaTeXTools plugin that makes LaTeX editing really convenient in Sublime. Unfortunately, updating OS X to El Capitan broke this setup. It took me a while to figure out how to fix this, so here’s the short story:

  1. Download and install MacTex 2015.
  2. In Sublime, go to Tools > Command Palette, look for LateXTools: Reconfigure and migrate settings, and hit enter to execute the command.

That’s it. This will update LaTeXTools with the new path of the LaTeX executables, which had to be moved around due to new restrictions on the /usr directory introduced in El Capitan. Just be aware that this will overwrite any customizations that you may have added to your LaTeXTools setup.

Developing and testing SPARQL queries with cURL

There are tons of online SPARQL editors out there, but they often lack some specific functionality. The most common one is that they cannot query an endpoint you may have running on your own machine during development. The SPARQL forms that come with most triple stores, however, are very bare bones, to say the least. Plus I don’t really like going back and forth in a web browser when I’m working on a piece of code.

What I do instead is writing the query in a text editor with syntax highlighting and then shoot them over to the endpoint via cURL on the command line:

curl -i -H "Accept: text/csv" --data-urlencode query@query.sparql http://example.com/sparql

This will take the file query.sparql, send its contents to http://example.com/sparql (with the query parameter name being query), and show the results as comma-separated values. Obviously, this is no magic, I just keep forgetting the exact parameters so I thought I might as well document this here.

If you are using Sublime Text as your text editor, there is also the Sublime SPARQL Runner package. Does exactly the same thing and opens the results in a new text file right in sublime. I’ve only tested the package briefly, but it seems to do what it says on the tin.

Building your own watch-it-later video podcast with Dropbox

Update January 2017: This approach will stop working soon, since Dropbox will be turning off public folders in March (and acts like that’s a good thing, because they have “built even better ways for you to share securely and work together with your team” … yeah, right).

iPhonePodcast
Playing a video from the podcast we’ll create.

My feed reader and social media streams swamp me with videos every day, and some of them actually seem to be interesting. The thing is: During the day, I don’t have time to watch them. And the one time I can actually watch them – on my way home on the subway – I have no (or very slow) Internet access. Which basically means no Youtube, no Vimeo – simply no video streaming in general. So I thought it would be great if I could just save these videos to watch them later, and have them transferred automagically to my phone for offline viewing. Turns out this is actually not so hard if we use Dropbox and a fairly simple PHP script. Yes, I know this could also be done in [insert your favorite programming language here]. PHP is preinstalled on Mac OS, though, and I know PHP well enough to quickly implement this, so that’s what I did.

I’ll describe the process on Mac OS, but it should be easily transferrable to any other operating system. As the title suggests, you’ll need a Dropbox account to make this work. What we’ll do is use your Dropbox’s Public folder to publish your own podcast, generated from your saved videos, and then subscribe to this podcast on the phone.

1. Put the PHP script in place.

Download this GitHub repository, unzip it, and copy the folder Watchlater-master to your Dropbox’s Public folder. If you’d rather have it in a different subfolder, you can change the path in the configuration at the top of the feed.php file (just make sure it’s somewhere inside the Public folder):

$path = "Watchlater-master"; // change this if you place the script in a different subfolder.

2. Configure the script

The one thing you do need to change in the script is your Dropbox ID:

$dropboxID = "123";

The easiest way to find out your Dropbox ID is by going to your Public folder, right-clicking on any file (such as the feed.php file we have just pasted there), and select Copy Public Link. Paste that into a text editor, and it should look something like this:

https://dl.dropboxusercontent.com/u/123/path/to/my/file.txt

The number after /u/ (123 in this example) is your Dropbox ID.

The other two things you can (but don’t have to) tweak are setting your local time zone:

date_default_timezone_set('America/New_York');

and the number of videos you want to keep in the folder:

$keepVids = 10;

The script does some housekeeping for you, so if you have more than, say, 10 (which is the default) videos in your folder, the script will automatically remove the oldest X videos until only the 10 newest videos are left. This keeps your Dropbox from overflowing. Note that the script has no way of knowing which videos you have already watched, so it may also remove videos you have not watched yet.

3. Add some videos

Before we run the script, we’ll need some videos to feed it. Download some videos into the Watchlater-master folder (or wherever you put the feed.php script). Since most services like YouTube don’t really offer an easy way to download their stuff, I’m using youtube-dl for that (which also works great for a wide range of other streaming sites). On a Mac, you can install it via Homebrew (if you don’t have Homebrew, go and install it right now – it’s awesome!). While we’re at it, we will also install ffmpeg and use that later to automatically convert videos to web-friendly formats. In a terminal window, type

$ brew install youtube-dl ffmpeg

Once Homebrew is done installing youtube-dl, change into our feed directory and download some videos:

$ cd ~/Dropbox/Public/Watchlater-master
$ youtube-dl https://www.youtube.com/watch?v=dQw4w9WgXcQ

3. Run the script

To run the script, type

$ php feed.php

into the Terminal. If everything goes right, you should see some messages, and a new file called later.xml in the folder.

4. Subscribe to your podcast

This later.xml is the actual podcast that we want to subscribe to, and because it is in your Dropbox’s Public folder, it actually has a stable URL and is accessible online even when your computer is not running (if your Dropbox is completely synced, that is; if not, wait until it’s done and the spinning arrow on the Dropbox icon in your menu bar has disappeared). Right-click on the later.xml, and select Copy Public Link again. Since we need this URL on the phone to subscribe to the podcast, I simply emailed it to myself (there should really be an iCloud-based clipboard, I know).

On your phone, open your Podcast client and add that URL as a new podcast. I use Pocket Casts 4, but it should work with any other app the same way:

casts

Voilà – we now get videos saved to watch later delivered to our phone! Make sure you pick a podcast client that automatically downloads new episodes.

5. Automatically update your podcast

The one piece that is missing is some magic that automatically updates the later.xml on a regular basis to check whether new videos have been added to the folder. We’ll use a Cron job for that. In a terminal, type:

env EDITOR=nano crontab -e

This will open the nano editor inside your terminal window, showing an empty file (unless you have already set up any other Cron jobs on your system). In nano, paste this line:

59 * * * * cd ~/Dropbox/Public/Watchlater-master/; for f in *.mkv;do ffmpeg -i "$f" -c:v copy -c:a aac -b:a 256k -strict -2 "${f%mkv}mp4"; rm "$f"; done; php feed.php

on the first line. This sets up a Cron job to automatically convert any .mkv files in the folder to .mp4 format, remove the .mkv files, and then run the feed.php script once every hour to update the podcast feed. Press CTRL-X to close nano, confirm to save the file by pressing ‘Y’, then enter.

That’s it. Now every time you place a new video in that folder, your podcast will be updated hourly, and the videos will end up in your podcast app. I’ve been using this script for a few months now and find it really useful – it’s a bit tedious to set up, but once you have done that, gone are the times of awkwardly staring at ads on the subway because there’s nothing else to do.

6. (Optional) Make an alias for youtube-dl

If you don’t want to change into the Watchlater directory every time before downloading the video, you also make an alias for youtube-dl that will always download videos to the folder. While we’re at it, we will also make the video titles a bit nicer and tell youtube-dl to fetch MP4 videos straight away (so the conversion from MKV we have added in the previous step is actually not really required any more.) To do that, we’ll just add an alias to our bash profile like so: Change into your home directory:

cd ~

Then open your bash profile in nano:

sudo nano .bash_profile

And add the following line to the end of the file:

alias ytdl='youtube-dl -f mp4 -o "~/Dropbox/Public/Watchlater-master/%(title)s.%(ext)s"'

Then hot control + O to save your change (hit enter to confirm) and control + X to exit nano. That’s all. Now you can use ytdl https://www.youtube.com/watch?v=dQw4w9WgXcQ from any terminal session and the video will always go into your Watchlater folder in MP4 format.

Update October 22, 2015: Added instructions for automatic conversion of .mkv videos to .mp4 format.

Update November 30, 2016: Added instructions for alias.

Virtuoso Open Source on Mac OS

I’ve played around with different triple stores over the past years, but somehow never got to try Virtuoso. So I thought I’d give it a shot and while I’m at it, document the installation steps. The commercial edition of Virtuoso comes with a simple app that does all the dirty work for you, the open source edition does not – but this is not a huge issue unless a terminal window makes you want to hide under the table and cry.

Here we go.

Installation

  1. The Virtuoso Open Source GitHub page has a tutorial for building the application from scratch, but we are going to take the easy path. If you don’t have Homebrew installed, go ahead and do that. Open a terminal window and run
    ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
  2. Homebrew is a package manager for Mac OS and provides builds for many popular (and not so popular) open source software packages. Homebrew makes installing Virtuoso as easy as
    brew install virtuoso

Starting Virtuoso

  1. After Homebrew has installed all dependencies and Virtuoso itself, we are ready to go. Virtuoso needs a virtuoso.ini file that contains all settings for the server. Our installation comes with a sample .ini file, located in /usr/local/Cellar/virtuoso/7.1.0/var/lib/virtuoso/db/ (this works for Virtuoso 7.1.0, but the location is most likely different for other versions. You can find that directory by running locate virtuoso.ini). Change into that directory and start virtuoso by running the following two commands:
    cd /usr/local/Cellar/virtuoso/7.1.0/var/lib/virtuoso/db/
    virtuoso-t -f
  2. Voilà. You should now be able to access the Virtuoso frontend at http://localhost:8890.
  3. You’ll find a link to the Conductor, the Virtuoso admin interface, at the top left of the page. The standard installation has a bunch of preset usernames and passwords, so that we can simply login with
    • User: dba
    • Password: dba

Loading data into Virtuoso

An empty triple store is not much fun, so let’s load some data into Virtuoso. There are several ways to do that, I’ll just describe one way here that works well for large files like the Geonames dump in N-Triples format I’m loading here.

  1. In order to load a local RDF file into Virtuoso, it needs access to the directory holding that file. The clean solution is to add that directory to the DirsAllowed in virtuoso.ini and restart Virtuoso. I’ll take the quick and dirty approach here and simply move the file from my Downloads folder to the folder holding our virtuoso.ini by running
    mv ~/Downloads/geonames.nt /usr/local/Cellar/virtuoso/7.1.0/var/lib/virtuoso/db/geonames.nt
  2. Next, we’ll start an SQL prompt by running
    isql

    You should see a new prompt now: SQL>

  3. On this SQL prompt, insert the following command to load the data into Virtuoso:
    DB.DBA.TTLP_MT (file_to_string_output ('./geonames.nt'), '', 'http://mytest.com');

    The second parameter is the URI of the graph we load the data into. It does not really matter what you put there, as long as it’s a valid URI.

If you’re trying to load a big file, the bulk loader functions are much faster. You should also change the settings for the NumberOfBuffers and MaxDirtyBuffers in virtuoso.ini to allow Virtuoso to use more RAM.

Querying Virtuoso

While we are waiting for Virtuoso to finish loading our dataset (which may take a while if you are loading a big dump like GeoNames), we can already run our first queries on the triples that have already been loaded. Go to http://localhost:8890/sparql and run a test query against the graph you are loading data into (http://mytest.com in my example above:

SELECT * WHERE {
 ?a ?b ?c
} LIMIT 10

This should give you ten random triples from the dataset you are loading.

Update 12/03/2014: You can also execute SPARQL commands on the isql prompt we have been using above to load data into Virtuoso. There are two ways to do that: If you are already on the isql prompt, you can just type SPARQL, followed by your query and a semicolon to finish the command (see below). This will list the results on the prompt. If you want to take the data out, e.g. to process them in other software, you can just put this command in a file, e.g. save

SPARQL SELECT * WHERE {
 ?a ?b ?c
} LIMIT 10;

in a file called query.sparql, and then on the command line (i.e., without logging into the isql prompt), run:

isql 127.0.0.1:1111 dba dba query.sparql > results.txt

Assuming you have not changed the Virtuoso standard port and username/password, this will save the results in a file called results.txt.

Shutting down Virtuoso

Shutting down Virtuoso correctly (rather than just killing the process) is important because otherwise you may have trouble starting it next time. For a clean shut down, open an SQL prompt (see above) and run

SHUTDOWN;