Dance, Computer, Dance

by Ray Grasso

Designing Data-Intensive Applications 📚

29 January, 2019

Designing Data-Intensive Applications by Martin Kleppmann

This book surveys data storage and distributed systems and is a fantastic primer for all software developers.

It starts with naive approaches to storing data, quickly builds up to how transactions work, and works up to the complexities of building distributed systems.

I particularly enjoyed the chapter on stream processing and event sourcing. It contrasts stream processing to batch processing and highlights many of the challenges of these approaches and explores options for addressing them.

Using Netlify for Hosting

18 January, 2019

I recently moved the hosting of my various blogs and websites off my own server to Netlify.

I was originally going to set up an S3 bucket and Cloudfront distribution for each of my sites but Netlify provides me the CDN and hosting features I need all bundled up already. You can upload files directly for serving or hook your site up to run a static site generator when you push to a branch of a Github repository.

In short, I’m not longer paying hosting costs and they handle all of the SSL certificate renewal from Let’s Encrypt for me.

Next up I plan to clean up the tooling I use for some of my sites and tweak things on here so I have more variety in my posts.

2019 is the year of the blog baby.

Forgetting Data in Event Sourced Systems

13 September, 2018

GDPR’s right to be forgotten means we have to be able to erase a person’s data from our systems. Event sourced systems work from an immutable log of events which makes erasure difficult. You probably want to think hard about storing data you need to delete in an immutable event log but sometimes that choice is already made and you need to make it work, so let’s dig in.

Erasing user data from current state projections

This is relatively straightforward. A RightToBeForgottenInvoked event is added to the event store for the person. All projectors that depend on personal data listen for this event and prune or scrub the appropriate data for the person from their projections.

Erasing data from the event stream itself

This case is trickier. We need to rewrite history in a way that doesn’t break things. Let’s look at an option for erasing data without rebuilding the event stream. This approach is also applicable for projections that are immutable change logs.

We can store personal data outside of events themselves in a separate storage layer. Each event instead stores a key for retrieving the data from this layer and any event consumers request the data when they need it. Given this data is personal the storage layer should probably encrypt the data at rest.

Once a RightToBeForgottenInvoked event is added to the event store all data for that person can be erased from the storage layer. All subsequent requests for data from the secure storage layer for that person’s data will return null objects rather than the actual data. This should make life easier for all consumers and avoid you null checking yourself to death all over the place.

Let’s see what this secure storage layer might look like.

Sketch of a secure storage layer

Our secure storage layer stores data that is scoped to a person and has a type (so we can return null objects). The store allows all data for a specific person to be erased.

Let’s start with two main models: a Person1 and a Data model.

      Data                 Person
  ┌──────────┐        ┌───────────────┐
  │    id    │   ┌───>│      id       │
  ├──────────┤   │    ├───────────────┤
  │person_id │───┘    │encryption_key │
  ├──────────┤        ├───────────────┤
  │   type   │        │   is_erased   │
  ├──────────┤        └───────────────┘

The interface to the secure storage layer is outlined below.

class SecureStorage
  def add(person_id, data_id, type, data)
    # Find the Person model for person_id (lazily create one if needed).
    # Encrypt the data using the person's encryption_key and store the
    # ciphertext in the data table using the client supplied data_id and type.
    # Clients will store this data_id in an event body and use it to retrieve
    # the data later.

  def erase_data_for_person(person_id)
    # Mark the corresponding record in the person table as erased
    # and delete the encryption key.

  def get(data_id)
    person = Person.find_non_erased(person_id)
    if person
      # Look up the row from the data table, decrypt ciphertext using the
      # key on the person model, and return the data.
      # Look up the row from the data table and return a null object for
      # that data type.

Where does that leave us?

After a person has invoked their right to be forgotten all current state projections will be updated to erase that person’s data. The event store will return null objects for any events that contain data for the person which means that any event processors won’t see that data as they build their projections. It will also contain the RightToBeForgottenInvoked event for the person so consumers can handle that explicitly if required.

  1. This could be expanded to be more general but we’ll stick with person for the purpose of this post. 

Replacing Google Analytics with GoAccess

3 September, 2018

I removed Google Analytics from my sites1 but still wanted access to some simple request statistics on them. Turns out GoAccess gives me most of what I need by analysing my Apache access logs.

The main challenges I ran into were, working out the correct flags for GoAccess, feeding compressed and uncompressed access logs at the same time, and ignoring junk requests from internet pests.

I pulled together a bash script that handles this, the essence of which is below.

# Analyse all log files
{ cat /var/www/mysite/logs/access.log; zcat /var/www/mysite/logs/access.log.*.gz; } | \
  # Strip out junk requests
  grep -v -E '\.php|jmx-console|\.cgi|phpmyadmin|dbadmin' | \
  # Fire up goaccess using the correct log file format for consolidated apache logs
  goaccess --log-format='%h %^[%d:%t %^] \"%r\" %s %b \"%R\" \"%u\"' --date-format='%d/%b/%Y' --time-format='%H:%M:%S' --ignore-crawlers
  1. A) It does way more than I need. B) I still don’t understand how to use it properly. C) It needlessly collects and sends your information off to the Big G.