When I have an important PDF to store, like my tax returns or a contract, I prefer to store it in git rather than something like Dropbox, Google Drive or iCloud.
There’s the nice feeling I get when enacting the add / commit / push ritual. The feeling is one of tranquility, permanence and order. This document is hermetically sealed, and here to stay. Banish the thought, the sheer horror, of a server-side bug deleting this document and its remote copy!
I also get this nice opportunity to add a commit message. If a document is worth holding onto, it’s worth adding a word or two before the context slips away.
On the other side of this equation, it’s nice to have git-style snapshots of the set of all documents at any point in time. There might be important signal in groups of documents being added or removed in a single commit.
When working on behalf of a company, the same logic applies, only more so. Git allows sharing patterns we all know well.
In general, git is an expressive and deliberate way to hold onto documents, and it suits me. I realize this won’t be true for everyone, but it’s worth a try, if you haven’t tried it already. Git: it’s good for more than just code!
E2EE server-side git storage
Problem: which server to push to?
I don’t like the default idea, GitHub, since I don’t trust cloud providers and their many layers of infrastructure to keep my documents secret. We all know the hazards.
What about self-hosting? You could run your own git server, but then you need to worry about availability, backups, NAT-punching, and credentialing your team so they can get onto your LAN.
For many people, managed hosting is the right play. The key unlock for “storing important files in cloud-managed git” is end-to-end encrypted git storage. The client encrypts the commit, with its many constituent objects, and the server sees encrypted blobs. File names, branch names and commit hashes remain secret just like file contents.
Inspired by Keybase git, I wrote a new system called FOKS. FOKS git has similar encryption properties, and works well across multiple devices and teams. But FOKS has new features, like federation, PQ-security, teams of teams, YubiKey support, and more. FOKS is fully open-source, server and all. Lastly, it takes a different implementation approach to git. It doesn’t build upon a file system abstraction but instead implements git as a layer on top of an encrypted key-value store.
How does FOKS’s encrypted git work?
First, a quick recap of how git normally works. When you run git add and git commit, git turns your deltas
into git objects, and it writes these objects to a simple database under the .git directory.
A data structure encapsulates the preexisting state of the repository, the new deltas,
and the commit message and metadata. The hash of this data structure is the new “commit hash”.
Git updates a human readable pointer in .git/refs (like main or my-branch-3) to point to this
new commit hash. In other words, these operations change .git
to reflect your checkout. Operations like git checkout and git reset do the opposite.
Other git commands like git push and git fetch sync the local .git directory with a remote
server. They do this via git’s remote helper protocol:
On git fetch, the git client calls into the remote helper protocol to start at a branch name, and sync all
objects into .git so that the client can reconstruct the commit graph and the checkout for that branch. git push
does the opposite, pushing objects to the server that it doesn’t already have, so that other clients
pulling from the server can reconstruct the commit graph that the current client has.
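To make the recap concrete, here is how git derives an object’s hash for the simplest object type, a blob: it hashes a short header followed by the content. A minimal Python illustration using only the standard library:

```python
import hashlib

def git_blob_hash(data: bytes) -> str:
    # Git hashes the header "blob <size>\0" followed by the content.
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

# Matches the output of `git hash-object` on a file containing "hello\n".
print(git_blob_hash(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```

Tree and commit objects work the same way, just with different headers and contents; the “commit hash” is the hash of exactly such an object.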
A design goal of FOKS git is to maintain compatibility with existing git clients. Hence, the remote helper protocol provides the logical integration point. The FOKS flow works as follows:
Consider an example where the remote helper asks to push a branch
issue-42. This call gets routed to the FOKS agent, which walks
the commit graph starting from the root of issue-42, and stopping
when it reaches objects the server already has. Each object
is encoded as a key-value pair: the key is an HMAC
of the object hash, and the value is the encryption of the object
itself. The secret keys used in the HMAC and encryption are
shared among the devices of the users who have access to this
repository. These keys rotate whenever a user revokes a device
or a team evicts a user who had access to the repository.
Crucially, all encryption is authenticated. A malicious server cannot inject bogus commits without knowing the shared secret keys.
The agent pushes each key-value pair to the FOKS KV server, after confirming the server doesn’t already have it. Isn’t this a server round trip for every node in the commit graph, and therefore insanely slow? Yes; we will get to a key optimization in a bit.
After all the objects are in place, the agent pushes the reference issue-42
to the server as a key-value. The key is the HMAC of issue-42; the value
is the encryption of issue-42’s commit hash.
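The key-derivation half of this can be sketched in a few lines. This is a conceptual model, not FOKS’s actual wire format or key schedule, and `kv_key` is a hypothetical helper:

```python
import hmac, hashlib, secrets

# Shared secret held by every device with access to the repository;
# rotated when a device is revoked or a member is evicted.
mac_key = secrets.token_bytes(32)

def kv_key(name: bytes) -> bytes:
    # The server indexes blobs by this HMAC, never the plaintext name,
    # so it learns neither object hashes nor ref names.
    return hmac.new(mac_key, name, hashlib.sha256).digest()

# Object hashes and ref names get the same treatment.
object_key = kv_key(hashlib.sha1(b"...commit object bytes...").digest())
ref_key = kv_key(b"refs/heads/issue-42")
```

The values stored under these keys are the authenticated encryption of the object bytes (or, for a ref, of its commit hash), under a second shared key.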
git fetch does the opposite. The remote helper asks
to fetch a branch name, like issue-42. Using the user’s (or team’s)
shared key, the agent HMACs issue-42 to get the key for the
reference, and requests this key from the server, which returns
the encrypted commit hash. The agent decrypts to get the commit hash,
then requests the corresponding commit object. It HMACs the commit
hash as the key and requests the corresponding value from the
remote KV store. The agent decrypts this value to get the commit
object, and then recursively walks the tree, populating the local
.git directory with objects needed to checkout issue-42.
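Putting the two directions together, a toy round trip might look like the sketch below. Everything here is a simplified stand-in: `toy_encrypt` and `toy_decrypt` are a hash-based XOR stream for illustration only, not the authenticated encryption FOKS actually uses:

```python
import hmac, hashlib, secrets

mac_key, enc_key = secrets.token_bytes(32), secrets.token_bytes(32)

def kv_key(name: bytes) -> bytes:
    # Server-visible lookup key: an HMAC of the ref name or object hash.
    return hmac.new(mac_key, name, hashlib.sha256).digest()

def keystream(key: bytes, nonce: bytes, n: int) -> bytes:
    # Hash-derived XOR stream: illustration only, NOT real cryptography.
    out, ctr = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + nonce + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return out[:n]

def toy_encrypt(key: bytes, pt: bytes) -> bytes:
    nonce = secrets.token_bytes(16)
    return nonce + bytes(a ^ b for a, b in zip(pt, keystream(key, nonce, len(pt))))

def toy_decrypt(key: bytes, blob: bytes) -> bytes:
    nonce, ct = blob[:16], blob[16:]
    return bytes(a ^ b for a, b in zip(ct, keystream(key, nonce, len(ct))))

# Server state after a push: one ref and one commit object, all opaque blobs.
commit = b"...commit object bytes for issue-42..."
commit_hash = hashlib.sha1(commit).digest()
server = {
    kv_key(b"refs/heads/issue-42"): toy_encrypt(enc_key, commit_hash),
    kv_key(commit_hash): toy_encrypt(enc_key, commit),
}

# Fetch: HMAC the ref name, decrypt the commit hash, then fetch that object.
got_hash = toy_decrypt(enc_key, server[kv_key(b"refs/heads/issue-42")])
got_commit = toy_decrypt(enc_key, server[kv_key(got_hash)])
assert got_commit == commit  # then recurse over parents and trees from here
```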
Isn’t this insanely slow?
Yes. Modern git servers like GitHub have the benefit of fully understanding the
structure of git repositories, and can save clients all but one round trip. The
typical flow is for a remote client to request a commit hash, along with a
list of commits it already has. The server can walk the commit graph too, and
respond with a single archive (or “packfile” in git-speak) containing all the
objects the client lacks. The client writes this packfile into .git and
voilà, finito. Checkout can extract the objects from the packfile when
needed.
Our encrypted KV store is, by contrast, in the dark, and therefore unable
to guess which objects the client needs. In addition, the end-to-end flow
lives within the confines of the existing remote helper protocol, from which we cannot
deviate. Come hell or high water, the client must start with a remote branch
name, and leave the local .git directory in a state where the checkout of that
branch works.
Our optimization is as follows. We mentioned packfiles above, which can be quite large, as they contain many objects. We could naively program clients to sync down all packfiles, oblivious to what they contain. This would reduce round trips but would waste bandwidth, since those packfiles might correspond to branches that the client doesn’t care about. However, every packfile has an associated index file, which is a list of all the object hashes contained in the packfile. Index files are typically quite small, so there is little downside to eagerly slurping these down from the server.
Thus, when a client is searching for objects on the server, it: first downloads all index files; then scans the indices for the object it needs; and only downloads the packfile that contains the object it needs, under the supposition that the packfile contains the other objects in the desired commit graph. Under the hood, git tries to optimize packfiles to ensure this proximity assumption holds.
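That lookup can be sketched as follows (hypothetical helper, with the index files already parsed into in-memory sets of object hashes):

```python
def pick_packfile(indices, needed_hash):
    # indices: packfile name -> set of object hashes it contains,
    # parsed from the small .idx files we eagerly downloaded.
    for packfile, hashes in indices.items():
        if needed_hash in hashes:
            return packfile
    return None  # object is not on the server at all

indices = {
    "pack-a1b2.pack": {"1111", "2222"},
    "pack-c3d4.pack": {"3333", "4444"},
}
print(pick_packfile(indices, "3333"))  # pack-c3d4.pack
```

Only the matching packfile is then downloaded in full, betting that it also holds the rest of the commit graph we want.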
On the push side, FOKS git clients eagerly repack the local .git directory
and push new packfiles to the server. There is a small privacy
hitch here: if you push a commit on branch A, some commits on branch B might go along for the ride.
In a team context, work on a branch might become visible to other team members before
it’s explicitly pushed. FOKS users get a warning to this effect and can choose to
disable this optimization if they want.
The packfile index optimization also speeds up push operations. We said above that clients should not push objects the server already has; doing so would waste bandwidth. The indices provide a local manifest of the server’s object inventory and can be checked without a round trip.
Try it today
Getting started with FOKS is easy:
On macOS:

brew install foks
foks signup
foks git create my-repo
git clone foks://foks.app/my-username/my-repo

On Linux:

curl -fsSL https://pkgs.foks.pub/install.sh | sh
foks signup
foks git create my-repo
git clone foks://foks.app/my-username/my-repo

On Windows:

winget install foks
foks signup
foks git create my-repo
git clone foks://foks.app/my-username/my-repo
Then use your usual git client with your usual git commands. Try it for your IRS Form 1040 this tax season.
✌️️ Max 🔑
FAQ for FOKS
What about indexing?
Just use a client-side indexing scheme.
What about mobile?
That would be great; mobile support is hopefully coming soon.
What about LFS?
In current use cases, it’s desirable to have a full checkout of the repository, so we haven’t needed LFS. We’re not opposed to supporting it in the future if there’s demand.
What about signing commits?
It actually does make sense to sign commits in addition to using authenticated encryption, as this would prevent members of a team from impersonating each other. We haven’t implemented this yet, but it’s possible. Each user would get an SSH signing key that all their devices could access, and other team members could verify that signatures correspond to these advertised keys. One major snag is that this only makes sense if the repository uses SHA256 hashes rather than the horrible SHA1 default. Note, however, that FOKS’s current authenticated encryption, layered underneath the git layer, does not rely on the security of SHA1.