Recently a friend asked me to find out a certain piece of data.

Which bus route in Poznań is the longest?

As I’m a massive ferroequinologist and public transport enthusiast, this question is right up my alley. However, this information is rarely accessible directly. Public transport planning is a complex matter, and routes can change in response to road works or other events. The average passenger is only interested in the timetable for their specific itinerary, so it’s no wonder that transport authorities rarely publish this kind of statistics.

Additionally, Poznań is a city I have little expertise in. While I’ve visited it many times and rode its trams and buses, my familiarity with it is limited. I can find my way in the area between the International Fair and the Old Town and not get lost in the railway station (no small feat, if you ask a random passenger), but that’s pretty much it – I don’t have the knowledge of a native city-dweller to back me up here.

# So what data is there?

Fortunately, Poznań’s transport authority ZTM publishes its schedule data, and does it in formats that are easy to work with. Many popular transit apps show their schedules thanks to that. The files are free to use, with an attribution clause, and available on their website. Commercial use requires a contract, but for personal use, we can just download the zipfiles. Let’s see how, using that data, we can answer the question posed.

# The big reveal

However, for the remainder of this post, the only tool we’ll be using (except shell and maybe some of coreutils) will be SQLite, as in the command sqlite3. Not some-programming-language and its SQLite bindings. No. Just SQLite itself.

## What’s inside? (or: SQLite is an unzip)

The exact filename varies with date, as the schedules can be updated even multiple times per week. The examples show what I had on hand at the time of writing.

Column sz holds the uncompressed file size.

Alternatively, we can use its archive mode explicitly. SQLite has a concept of an archive, which is not the same as a zipfile. It’s just a database that contains a table named sqlar. Rows of that table are the archive entries, and can be individually extracted. However, that mode will also work with plain zipfiles.

Let’s inspect the first file.

If you don’t speak Polish, these are names of bus operator companies. Most of them are County/municipal transport of Sometown Ltd., and others are Transport Company Somename Ltd. The archive mode command equivalent would be -Ax filename, but it always creates a file, while we wanted to see it on stdout.

# These are CSV files (or: SQLite is a schema-less db)

Well, obviously. The next step would be to extract the files, create tables in some SQLite database and load them up with that data, right? For the agencies, something like this:

The .import command loads a specified file into the given table name. With --csv it will use comma as record separator, and newline as record separator. Otherwise it will assume --ascii mode, which uses ASCII US and RS respectively. We add --skip here to eat the header line.

However, we can be even more lazy: just run the .import command, but without creating the table or skipping the header. SQLite will then create the table for us, setting column names from the header. All column types are TEXT in this case, however this won’t be an issue.

# Criteria

So what makes a route longer than other ones? The length of the polyline it makes on a map. And we do have this data, in the shapes.txt file.

The substr() call is here to show only the beginning of file data. Or like any reasonable person, we can just pipe that through head1.

A shape contains arbitrarily many points, one per row, in order of the sequence column. Shapes are mapped to routes in the trips table:

These specific trips are for route 1, which is a tram line, definitely not a bus.

## Measuring the route

Each line segment in the shapes table is defined by two points, stored in two successive rows. The total length is obtained by summing distances between these pairs of points. We can pair them up with nothing but a self-join:

For convenience, we’ll package this query into a view with CREATE VIEW shape_segments SELECT... and refer to the view from now on.

## Flat Earth (or: SQLite is an advanced calculator)

Now that we have each segment in a row, we can calculate its length with the usual Euclidean distance formula. Fortunately, SQLite comes with all the math functions we need for this.

This formula is… not very good. It is useful in our case, because we only need to compare shape lengths, not determine them exactly. But the number it returns isn’t useful in any way. What is its unit? Degrees-of-longitude-and-latitude, probably. It wouldn’t be so bad if the Earth was a flat plane, with both meridians and lines of latitude arranged in a rectangular grid.

Meanwhile, we can now pack that as a subquery and find the longest route (decimals truncated again):

So is line 560 the king of long routes in Poznań? Maybe. Another conclusion here is that lines have many routes of varying lengths, and we should check some statistics.

## Locally flat earth

Let’s improve the formula we’re using. While the planet is round, we can assume that Poznań and its immediate surroundings are indeed flat. The longitude-latitude grid is however no longer square. We’ll still be using the simple distance formula, but with some improvements.

• Earth radius in km: $$R=6371.01$$
• Euclidean distance: $$d=\sqrt{\Delta x^2 + \Delta y^2}$$
• $$\lambda, \psi$$ longitude and latitude, in radians
• $$\Delta x = R \Delta \lambda \cos(\psi)$$
• $$\Delta y = R \Delta \psi$$

Now we join this with the trips table again. This tells us that line 560 travels 30.84 km in its longest variant (and 8.37 km in its shortest). This feels about right, after checking the route on a map. Also, the relative order of lines didn’t change from last time, as expected — meaning our initial flat Earth approximation wasn’t that bad.

# Different scoring

Another criteria for a long route is the number of stops it has to make. We can find this in the stop_times table. The query is vastly simpler now.

Which gives us a different ranking altogether. Now we learn that night buses 215 and 222 may be the longest lines instead.

## Ride time (or: SQLite is an adequate2 date calculator)

Yet another definition could be the time between departing the initial station and arrival at the final one. We can also find this in the stop_times table.

Note that this should be wrong for night routes: the time picked for earliest departure should be whichever is just after midnight (and correspondingly, latest arrival is just before midnight). But it’s actually correct! This is because times in this table use a special convention: time since midnight of business day. Late night buses which depart just after Saturday midnight are still counted towards Friday as business day. Bus 215 above arrives at 24:45:00, and SQLite converts it back to a readable hour for us.

Also, we can skip most conversions in this query. Taking min and max directly from the HH:MM:SS text works as well (again, because of the notation). The only time we actually need a timestamp value is in the argument to ORDER BY.

# Final bus ranking, top three

1. 560: Longest ride, longest route
2. 215: Most stops, second longest ride
3. 222: Third longest ride, second most stops

# Conclusions

SQLite is quite the powerful tool, a real open-source treasure. It’s rather under-rated in popular opinion, mostly because of comparisons to big databases like PostgreSQL, and its non-ability to do networking. The documentation lists many successful use cases, some of which won’t be a fit for big databases. Here, we used SQLite as mostly an advanced calculator, and for its unzip capabilities. Not once did we have to use external tools.

# Footnotes

1 Which is part of coreutils, and legal to use in this task.

2 Not excellent, but competent enough for the task.