What is this bitfield thing?
An array of true/false values. Usually a fixed-size array, due to platform constraints. Also a very compact array – taking exactly one bit to store each element. In other words: a boolean vector or set.
So does anyone use it?
Plenty of usage around, many of it inherited from history. Probably the most known bitfield is the POSIX access mode field, used by every filesystem on your UNIX-like OS:
$ LC_ALL=C ls -l /etc/passwd
-rw-r--r-- 1 root root 2473 Sep 1 2016 /etc/passwd
What?
I’m talking about this field where dashes mix with letters. It shows me that this file is readable and writable by its owner (first triplet after the initial dash: rw-
), readable by its group (second triplet: r--
) and also readable by other users (last triplet: r--
). The first blank is used for displaying any special bits. Underneath this representation lies a bitfield:
Flag | Octal value | Purpose |
---|---|---|
S_ISUID | 04000 | Set user ID on execution |
S_ISGID | 02000 | Set group ID on execution |
S_ISVTX | 01000 | Sticky bit |
S_IRUSR, S_IREAD | 00400 | Read by owner |
S_IWUSR, S_IWRITE | 00200 | Write by owner |
S_IXUSR, S_IEXEC | 00100 | Execute/search by owner |
S_IRGRP | 00040 | Read by group |
S_IWGRP | 00020 | Write by group |
S_IXGRP | 00010 | Execute/search by group |
S_IROTH | 00004 | Read by others |
S_IWOTH | 00002 | Write by others |
S_IXOTH | 00001 | Execute/search by others |
How do your strange numbers work? Why do you use only 0, 1, 2 and 4?
Let’s look at bits grouped into threes (BTW: conveniently 2³ = 8 and octo is Latin for ‘eight’ – hence octal, also the number of legs on GitHub’s octocat).
000 = 0
001 = 1
010 = 2
011 = 3
100 = 4
101 = 5
110 = 6
111 = 7
We’re using these digits from 0 to 7 to represent such groups. If we expand 644 to ones and zeros by substituting from the above table, we get 110 100 100
(does it look familiar already?). Also, notice that for 1, 2 and 4 exactly one bit is set, each time in a different location.
Adding together numbers which contain only zeros and exactly one of these three digits, at any position, will never carry (increment other columns). This is because 4 + 2 + 1 = 7, the maximum digit at any position. Now, let’s calculate the bitfield again:
field | value |
---|---|
S_IRUSR | 00400 |
S_IWUSR | 00200 |
S_IRGRP | 00040 |
S_IROTH | 00004 |
Σ | 00644 |
There, a quick primer on octal numbers.
And how does this translate to -rw-r--r--
again?
Start with the S_IRUSR
bit position. Output a dash if bit is unset, otherwise output a letter matching the bit’s intent (r
for read – an octal 4, w
for write = 2, x
for execute = 1). The leading dash is for some additional bits that we’re not going to cover here. Doing that for all eight possible combinations results in this table:
000 = 0 = ---
001 = 1 = --x
010 = 2 = -w-
011 = 3 = -wx
100 = 4 = r--
101 = 5 = r-x
110 = 6 = rw-
111 = 7 = rwx
Other popular values used here are 600, 400 and 777. See if you can figure out what they mean.
You said it’s an array
It’s a compact representation of one. Let’s convert an array to bits, holding on to our filesystem permissions example.
irb> perms = [false, false, false, # special bits
true, true, false, # owner
true, false, false, # group
true, false, false] # others
irb> perms.inject(0) { |acc, bit| acc << 1 | (bit ? 1 : 0) }
=> 420
That’s not 644
IRb gave us the answer in decimal notation. Let’s ask it for octal again.
irb> 420.to_s(8)
644
Or binary even
irb> 420.to_s(2)
=> "110100100" # Looks familiar?
Okay. And how do I decode the number?
irb> 0o644.to_s(2).rjust(12, "0").each_char.map { |b| b.to_i == 1 }
=> [false, false, false, true, true, false, true, false, false, true, false, false]
Ruby even gives us these bits (in MSB-first-order – starting from the rightmost one) when indexing an integer. Less magic involved.
irb> (0...12).reverse_each.map { |i| 0o644[i] == 1 }
=> [false, false, false, true, true, false, true, false, false, true, false, false]
Prefixing numbers with 0o
is a common notation for octals. I’m left padding the string to 12 characters with zeros here to output the unset special bits flags.
And why would I want to use that? Seems complicated
Whenever you have a large number of boolean values, and don’t want to waste storage. This was obviously more relevant in the olden days of tight memory constraints, but still is when programming microcontroller and embedded systems. Not so much on modern hardware, though. If you can use Ruby, you probably don’t have constraints forcing you to use bitfields.
Another use is when you need to compactly represent a set, where allowed values come from a small list. By assigning a single bit to each permitted value, and using its status to signal presence, you can fit up to 64 set members in a single register. (How many members does a set represented by 0644
have?)
It’s also very efficient in terms of memory access – the processor only needs to fetch one location to work on all of these flags, and compilers are very good at optimizing such access. Therefore you will see it used often in kernel code, memory management, network protocols and games.
I have worked on a large legacy application, where the godobject contains a state machine utilizing many booleans. Instead of having a TINYINT(1)
field for each of these, resulting in a table with over 50 columns, all of them are stashed into two 32-bit wide bitfields. Given a million records in the table, the overhead of storing each of them separately was negligible when it was designed, and is definitely negligible today. But it does have the advantage of being less cluttered and easier to understand.
Do other languages use these?
Certainly. Rust has a nice crate for that:
#[macro_use]
extern crate bitflags;
bitflags! {
struct Flags: u32 {
const A = 0b00000001;
const B = 0b00000010;
const C = 0b00000100;
const ABC = Self::A.bits | Self::B.bits | Self::C.bits;
}
}
fn main() {
let e1 = Flags::A | Flags::C;
let e2 = Flags::B | Flags::C;
assert_eq!((e1 | e2), Flags::ABC); // union
assert_eq!((e1 & e2), Flags::C); // intersection
assert_eq!((e1 - e2), Flags::A); // set difference
assert_eq!(!e2, Flags::A); // set complement
}
Historically, C programmers usually defined constants plus accessor macros. An excerpt from Linux code linked above, mind the octal notation again, this time represented with a leading zero:
#define S_IFMT 0170000
#define S_IFLNK 0120000
#define S_IFREG 0100000
#define S_IFDIR 0040000
#define S_IFCHR 0020000
#define S_IFIFO 0010000
#define S_ISUID 0004000
#define S_ISGID 0002000
#define S_ISVTX 0001000
#define S_ISLNK(m) (((m) & S_IFMT) == S_IFLNK)
#define S_ISREG(m) (((m) & S_IFMT) == S_IFREG)
#define S_ISDIR(m) (((m) & S_IFMT) == S_IFDIR)
The first one is a mask for extracting only the relevant bits (bits 13-16 counting from LSB), others are bitfield values.
Modern C has storage specifiers for struct fields. You’d declare a struct type for the permissions field, and for each single permission declare it with a bit length of one. Your compiler will then squish them together as efficiently as possible, preserving order. Then just access it like any other struct member.
struct posix_perm {
unsigned irusr:1;
unsigned iwusr:1;
unsigned ixusr:1;
unsigned irgrp:1;
unsigned iwgrp:1;
unsigned ixgrp:1;
unsigned iroth:1;
unsigned iwoth:1;
unsigned ixoth:1;
}
Rob Pike, designer of Go hates these C bitfields and there is no built-in support, though people have created packages for that.
Erlang and Elixir have a similar concept, and a package to work with them.
Okay, show me how to work with these in a Ruby project. I’m feeling adventurous
Let’s start with Sequel. There’s a battle-tested plugin for it.
class MyModel < Sequel::Model
plugin :bit_fields, :status_bits, [ :started, :finished, :reviewed ]
end
In migrations you’d use a Bignum
(64 bit) column by default, though it will tolerate longer and shorter columns.
Sequel.migration do
change do
alter_table :my_models do
add_column :status_bits, Bignum, :default => 0, :null => false
end
end
end
Then you just use the declared bitfield names as if they were regular boolean fields, with some extra features such as scopes.
model = MyModel.create
model.started? # => false
model.started = true
model.started? # => true
model.status_bits # => 1
model.finished? # => false
model.finished = true
model.finished? # => true
model.status_bits # => 3
# let's find all the finished instances
MyModel.finished(true).all
# let's find all unfinished instances
MyModel.finished(false).all
# let's find all the started and the finished instances
MyModel.started.finished.all
These field lookups are translated to bitwise operations. The last line may translate to a query like:
SELECT * from `my_models` WHERE (`status_bits` & 1 == 1) AND (`status_bits` & 2 == 2)
Aha, I see how it’s using these specially bit-positioned numbers. But why the &
?
That is bitwise AND
. It works by treating the right argument as a mask, and returning only bits from left argument for which 1 was set in the mask. All other bits will be unset. The only two options from x & 2
are either 0
or 2
, and by comparing we know whether it’s set or unset.
Clever. How do I set a single bit?
model.finished = true; model.save
UPDATE `my_models` SET `status_bits` = `status_bits` | 2 WHERE id = 424242
This is a bitwise OR
, which modifies its left argument by setting bits specified by the right argument; regardless if they were set or not.
And unset?
model.finished = false; model.save
UPDATE `my_models` SET `status_bits` = `status_bits` & ~2 WHERE id = 424242
Actually, this plugin would just calculate the new value and set it directly. The above does a mask again, but this time with a one’s complement of value 2. This is just the value with all bits flipped. This will keep values at all other bits unchanged, but set a zero at this one.
What is this Sequel thing, give me the ActiveRecord version
The most popular analog for ActiveRecord is bitfields
. It’s really similar:
class User < ActiveRecord::Base
include Bitfields
bitfield :my_bits, 1 => :seller, 2 => :insane, 4 => :sensible
end
And for migrations:
t.integer :my_bits, default: 0, null: false
# OR
add_column :users, :my_bits, :integer, default: 0, null: false
New attributes and scopes work almost like the Sequel version. One difference is that scope methods don’t accept a boolean param to search for that flag set or unset. Rather, the unset version is in a separate scope, prefixed with not_
.
So how many of these attributes can I fit in a single bitfield?
Typically up to 64. The Sequel example above does not explicitly assign numeric values to specific attributes (it certainly supports that), while the ActiveRecord one does. Let’s consider how the required numbers work:
bitfield :my_bits,
1 => :seller,
2 => :insane,
4 => :sensible,
# ...
128 => :manipulative
# ...
1048576 => :closed_on_wednesdays # That's the 21st flag
2097152 => :closed_on_thursdays
# ...
4294967296 => :likes_muffins # 31st
# ...
What’s the last number needed for all 64 flags? And for 500 flags? Generally this is only feasible for small, easily comprehended numbers of flags.
Is there native database support for this?
PostgreSQL has bit strings, since basically forever. These match what we are doing here very well, with some additional semantics.
Binary strings are different, designed to be more like regular strings but with arbitrary binary data. But you can use get_bit
and set_bit
functions on them to get more or less the same behavior.
MySQL has BIT as well, matching the PostgreSQL counterpart. It also has SET, which describes a set containing predefined values, but is a bitfield underneath. Presentation is done with a comma-separated string, while any math and aggregation is numeric.
But none of the two plugins above work with any of these. That’s because they target the lowest common denominator, which is bitwise operations on integers – available everywhere.