Recently a friend asked me to find out a certain piece of data.

Which bus route in Poznań is the longest?

As I’m a massive ferroequinologist and public transport enthusiast, this question is right up my alley. However, this information is rarely accessible directly. Public transport planning is a complex matter, and routes can change in response to road works or other events. The average passenger is only interested in the timetable for their specific itinerary, so it’s no wonder that transport authorities rarely publish this kind of statistics.

Additionally, Poznań is a city I have little expertise in. While I’ve visited it many times and rode its trams and buses, my familiarity with it is limited. I can find my way in the area between the International Fair and the Old Town and not get lost in the railway station (no small feat, if you ask a random passenger), but that’s pretty much it – I don’t have the knowledge of a native city-dweller to back me up here.

So what data is there?

Fortunately, Poznań’s transport authority ZTM publishes its schedule data, and does it in formats that are easy to work with. Many popular transit apps show their schedules thanks to that. The files are free to use, with an attribution clause, and available on their website. Commercial use requires a contract, but for personal use, we can just download the zipfiles. Let’s see how, using that data, we can answer the question posed.

The big reveal

However, for the remainder of this post, the only tool we’ll be using (except shell and maybe some of coreutils) will be SQLite, as in the command sqlite3. Not some-programming-language and its SQLite bindings. No. Just SQLite itself.

What’s inside? (or: SQLite is an unzip)

The exact filename varies with date, as the schedules can be updated even multiple times per week. The examples show what I had on hand at the time of writing.

$ sqlite3 -zip 20210426_20210430.zip 'SELECT name,sz FROM zip'
name|sz
agency.txt|1371
calendar.txt|124
calendar_dates.txt|32
feed_info.txt|149
routes.txt|108551
shapes.txt|6543206
stops.txt|175029
stop_times.txt|13006823
trips.txt|460604

Column sz holds the uncompressed file size.

Alternatively, we can use its archive mode explicitly. SQLite has a concept of an archive, which is not the same as a zipfile. It’s just a database that contains a table named sqlar. Rows of that table are the archive entries, and can be individually extracted. However, that mode will also work with plain zipfiles.

$ sqlite3 -zip 20210426_20210430.zip -Atv
?---------       1371  2021-04-23 10:59:24  agency.txt
?---------        124  2021-04-23 10:51:14  calendar.txt
?---------         32  2021-04-23 10:51:14  calendar_dates.txt
?---------        149  2021-04-23 10:51:14  feed_info.txt
?---------     108551  2021-04-23 11:00:18  routes.txt
?---------    6543206  2021-04-23 10:51:24  shapes.txt
?---------     175029  2021-04-23 10:51:22  stops.txt
?---------   13006823  2021-04-23 10:57:00  stop_times.txt
?---------     460604  2021-04-23 10:55:48  trips.txt

Let’s inspect the first file.

$ sqlite3 -zip 20210426_20210430.zip \
  "SELECT data FROM zip WHERE name='agency.txt'"
agency_id,agency_name,agency_url,agency_timezone,agency_phone,agency_lang
2,"Miejskie Przedsiębiorstwo Komunikacyjne Sp. z o.o. w Poznaniu","http://www.mpk.poznan.pl","Europe/Warsaw","61 646 33 44","pl"
4,"Kórnickie Przedsiębiorstwo Autobusowe KOMBUS Sp. z o.o.","http://www.kombus.com.pl","Europe/Warsaw","61 898 06 66","pl"
5,"Przedsiębiorstwo Wielobranżowe TRANSKOM Sp. z o.o.","http://www.transkom.com.pl","Europe/Warsaw","61 651 47 14","pl"
6,"Przedsiębiorstwo Transportowe Translub Sp. z o.o.","http://www.translub.pl","Europe/Warsaw","61 813 01 45","pl"
7,"Przedsiębiorstwo Usług Komunalnych Komorniki sp. z o.o.","http://www.pukkomorniki.pl","Europe/Warsaw","61 810 81 71","pl"
8,"Zakład Usług Komunikacyjnych ROKBUS Sp. z o.o.","http://www.rokbus.com.pl","Europe/Warsaw","61 102 50 90","pl"
9,"Zakład Komunikacji Publicznej Suchy Las Sp. z o.o.","http://www.zkp.com.pl","Europe/Warsaw","61 811 65 57","pl"
10,"Komunikacja Gminy Tarnowo Podgórne TPBUS Sp. z o.o.","http://www.tpbus.com.pl","Europe/Warsaw","61 814 67 92","pl"
11,"Swarzędzkie Przedsiębiorstwo Komunalne Sp. z o.o.","http://www.spk.swarzedz.pl","Europe/Warsaw","61 817 34 21","pl"
12,"Zakład Komunalny w Kleszczewie Sp. z o.o.","http://www.zk.kleszczewo.pl","Europe/Warsaw","61 817 60 62","pl"
14,"Urząd Gminy Dopiewo","http://www.dopiewo.pl","Europe/Warsaw","","pl"

If you don’t speak Polish, these are names of bus operator companies. Most of them are County/municipal transport of Sometown Ltd., and others are Transport Company Somename Ltd. The archive mode command equivalent would be -Ax filename, but it always creates a file, while we wanted to see it on stdout.

These are CSV files (or: SQLite is a schema-less db)

Well, obviously. The next step would be to extract the files, create tables in some SQLite database and load them up with that data, right? For the agencies, something like this:

CREATE TABLE agency(
    id INT PRIMARY KEY, 
    name TEXT,
    url TEXT,
    timezone TEXT,
    phone TEXT,
    lang TEXT
);
.import --csv --skip 1 agency.txt agency

The .import command loads a specified file into the given table name. With --csv it will use comma as record separator, and newline as record separator. Otherwise it will assume --ascii mode, which uses ASCII US and RS respectively. We add --skip here to eat the header line.

However, we can be even more lazy: just run the .import command, but without creating the table or skipping the header. SQLite will then create the table for us, setting column names from the header. All column types are TEXT in this case, however this won’t be an issue.

Criteria

So what makes a route longer than other ones? The length of the polyline it makes on a map. And we do have this data, in the shapes.txt file.

$ sqlite3 -zip 20210426_20210430.zip \
  "SELECT substr(data, 0, 390) FROM zip WHERE name='shapes.txt'"
shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence
783121,52.368628172902,16.934143162577,0
783121,52.367843818268,16.938585189158,1
783121,52.367762677338,16.939013158496,2
783121,52.367690551942,16.939426370271,3
783121,52.367023386448,16.942879640105,4
783121,52.366951259846,16.943292851880,5
783121,52.366879133125,16.943706063655,6
783121,52.366825038007,16.943956942233,7
783121,52.366770942823,16.944207820810,8

The substr() call is here to show only the beginning of file data. Or like any reasonable person, we can just pipe that through head1.

A shape contains arbitrarily many points, one per row, in order of the sequence column. Shapes are mapped to routes in the trips table:

$ sqlite3 -zip 20210426_20210430.zip \
  "select data from zip where name='trips.txt'" | head -10
route_id,service_id,trip_id,trip_headsign,direction_id,shape_id,wheelchair_accessible
1,3,"3_7752232^N+","Franowo",0,784574,1
1,3,"3_7752234^N+","Franowo",0,784574,1
1,3,"3_7752236^N+","Franowo",0,784574,1
1,3,"3_7752238^N+","Franowo",0,784574,1
1,3,"3_7752240^N+","Franowo",0,784574,1
1,3,"3_7752242^N+","Franowo",0,784574,1
1,3,"3_7752244^N+","Franowo",0,784574,1
1,3,"3_7752251^N+","Franowo",0,784574,1
1,3,"3_7752253^N+","Franowo",0,784574,1

These specific trips are for route 1, which is a tram line, definitely not a bus.

Measuring the route

Each line segment in the shapes table is defined by two points, stored in two successive rows. The total length is obtained by summing distances between these pairs of points. We can pair them up with nothing but a self-join:

SELECT
  ss1.shape_id,
  CAST(ss1.shape_pt_sequence AS INT) num,
  ss1.shape_pt_lat lat0, ss1.shape_pt_lon lon0,
  ss2.shape_pt_lat lat1, ss2.shape_pt_lon lon1
FROM
  shapes ss1
  JOIN shapes ss2
  ON ss2.shape_id = ss1.shape_id
WHERE
  ss2.shape_pt_sequence = ss1.shape_pt_sequence + 1
GROUP BY ss1.shape_id, num

-- Decimal places truncated for readability
shape_id|num|lat0|lon0|lat1|lon1
758584|0|52.464295|16.664138|52.464250|16.664551
758584|1|52.464250|16.664551|52.464187|16.664979
758584|2|52.464187|16.664979|52.464053|16.665850
758584|3|52.464053|16.665850|52.463864|16.666396
758584|4|52.463864|16.666396|52.463567|16.667237
758584|5|52.463567|16.667237|52.463432|16.667960
758584|6|52.463432|16.667960|52.463315|16.668934
758584|7|52.463315|16.668934|52.463180|16.669672
758584|8|52.463180|16.669672|52.463018|16.670351
758584|9|52.463018|16.670351|52.462676|16.671709

For convenience, we’ll package this query into a view with CREATE VIEW shape_segments SELECT... and refer to the view from now on.

Flat Earth (or: SQLite is an advanced calculator)

Now that we have each segment in a row, we can calculate its length with the usual Euclidean distance formula. Fortunately, SQLite comes with all the math functions we need for this.

SELECT
  shape_id,
  sum(
    sqrt(pow(lat1 - lat0, 2) + pow(lon1 - lon0, 2))
  ) sumhypot
FROM shape_segments
GROUP BY shape_id

This formula is… not very good. It is useful in our case, because we only need to compare shape lengths, not determine them exactly. But the number it returns isn’t useful in any way. What is its unit? Degrees-of-longitude-and-latitude, probably. It wouldn’t be so bad if the Earth was a flat plane, with both meridians and lines of latitude arranged in a rectangular grid.

Meanwhile, we can now pack that as a subquery and find the longest route (decimals truncated again):

SELECT
  route_id, q1.shape_id, q1.sumhypot
FROM (
  -- flat earth formula, omitted for brevity
  ) q1
JOIN trips ON trips.shape_id = q1.shape_id
ORDER BY sumhypot DESC
route_id|shape_id|sumhypot
560|760669|0.277378
560|760669|0.277378
560|760672|0.276453
560|760672|0.276453
560|760672|0.276453
560|760672|0.276453
560|760668|0.274670
560|760668|0.274670
560|760668|0.274670
560|760667|0.260732

So is line 560 the king of long routes in Poznań? Maybe. Another conclusion here is that lines have many routes of varying lengths, and we should check some statistics.

SELECT
  route_id,
  min(q1.sumhypot) minlen,
  avg(q1.sumhypot) avglen,
  max(q1.sumhypot) maxlen
FROM (
  -- flat earth formula again
  ) q1
JOIN trips ON trips.shape_id = q1.shape_id
GROUP BY route_id
ORDER BY maxlen DESC

route_id|minlen|avglen|maxlen
560|0.075285|0.237787|0.277378
561|0.246936|0.246936|0.246936
502|0.191143|0.205534|0.216080
342|0.194560|0.195850|0.214357
501|0.173946|0.179029|0.197477
220|0.184467|0.184467|0.184467

Locally flat earth

Let’s improve the formula we’re using. While the planet is round, we can assume that Poznań and its immediate surroundings are indeed flat. The longitude-latitude grid is however no longer square. We’ll still be using the simple distance formula, but with some improvements.

  • Earth radius in km: \(R=6371.01\)
  • Euclidean distance: \(d=\sqrt{\Delta x^2 + \Delta y^2}\)
  • \(\lambda, \psi\) longitude and latitude, in radians
  • \(\Delta x = R \Delta \lambda \cos(\psi)\)
  • \(\Delta y = R \Delta \psi\)
SELECT
  shape_id,
  sum(
    sqrt(
      pow(6371.01 * radians(lon1 - lon0) * cos(lat0), 2) -- dx
      + pow(6371.01 * radians(lat1 - lat0), 2) -- dy
    )
  ) sumdist
FROM shape_segments
GROUP BY shape_id

Now we join this with the trips table again. This tells us that line 560 travels 30.84 km in its longest variant (and 8.37 km in its shortest). This feels about right, after checking the route on a map. Also, the relative order of lines didn’t change from last time, as expected — meaning our initial flat Earth approximation wasn’t that bad.

Different scoring

Another criteria for a long route is the number of stops it has to make. We can find this in the stop_times table. The query is vastly simpler now.

SELECT route_id,
       min(numstops) minstops,
       avg(numstops) avgstops,
       max(numstops) maxstops
FROM (
  SELECT tt.route_id, st.trip_id, count(st.stop_id) numstops
  FROM stop_times st
  JOIN trips tt ON tt.trip_id = st.trip_id
  GROUP BY st.trip_id) q1
GROUP BY route_id ORDER BY maxstops DESC;

Which gives us a different ranking altogether. Now we learn that night buses 215 and 222 may be the longest lines instead.

Ride time (or: SQLite is an adequate2 date calculator)

Yet another definition could be the time between departing the initial station and arrival at the final one. We can also find this in the stop_times table.

SELECT route_id,
        -- Convert timestamps back to HH:MM:SS
        time(q1.starts, 'unixepoch') depart,
        time(q1.ends, 'unixepoch') arrive
FROM (
  SELECT trip_id,
          -- Convert HH:MM:SS to Unix timestamps.
          -- Their date component will be 2000-01-01
          min(strftime('%s', time(departure_time))) starts,
          max(strftime('%s', time(arrival_time))) ends
  FROM stop_times
  GROUP BY trip_id) q1
JOIN trips on q1.trip_id = trips.trip_id
ORDER BY ends - starts DESC

route_id|depart|arrive
560|14:00:00|15:12:00
215|23:34:00|00:45:00
560|13:32:00|14:41:00
560|06:15:00|07:24:00
222|23:02:00|00:10:00
560|05:15:00|06:23:00
811|15:30:00|16:38:00
811|14:45:00|15:53:00
560|14:40:00|15:47:00
811|13:25:00|14:31:00

Note that this should be wrong for night routes: the time picked for earliest departure should be whichever is just after midnight (and correspondingly, latest arrival is just before midnight). But it’s actually correct! This is because times in this table use a special convention: time since midnight of business day. Late night buses which depart just after Saturday midnight are still counted towards Friday as business day. Bus 215 above arrives at 24:45:00, and SQLite converts it back to a readable hour for us.

Also, we can skip most conversions in this query. Taking min and max directly from the HH:MM:SS text works as well (again, because of the notation). The only time we actually need a timestamp value is in the argument to ORDER BY.

Final bus ranking, top three

  1. 560: Longest ride, longest route
  2. 215: Most stops, second longest ride
  3. 222: Third longest ride, second most stops

Conclusions

SQLite is quite the powerful tool, a real open-source treasure. It’s rather under-rated in popular opinion, mostly because of comparisons to big databases like PostgreSQL, and its non-ability to do networking. The documentation lists many successful use cases, some of which won’t be a fit for big databases. Here, we used SQLite as mostly an advanced calculator, and for its unzip capabilities. Not once did we have to use external tools.

Footnotes

1 Which is part of coreutils, and legal to use in this task.

2 Not excellent, but competent enough for the task.