Initial replica setup

  1. The customer team will log a ticket to Heroku requesting access to base backups and WAL files for their database. Specifics are helpful, and we’ve found this verbiage to work well:

    "Please grant me access to the S3 bucket containing base backups and current WAL files for the postgresql-curly-1234 database on the <YOUR_APP_HERE> app to use for migrating our data out of Heroku."

  2. Heroku will create config variables on the app, something like this (though also created by "color", ie HEROKU_POSTGRESQL_MAROON_*:

  3. The customer team will share those credentials to Crunchy Bridge CSE folks. The simplest way is to create a new, empty application and add Crunchy Bridge CSE as a collaborator on the app. Other options are sharing via 1Password, Age Encryption, KeyBase.

  4. The customer team will provision a Crunchy Bridge instance with the same major Postgres version with sufficient storage to hold the data from the restore. It is recommended to provision at minimum 1.4x the "Data Size" reported by heroku pg:info — this will ensure that the data size on the Crunchy instance will not exceed 75% of the available disk, avoiding warning emails and/or automatic disk resizing (cf documentation on automatic disk resizes).

  5. The Crunchy Bridge team will restore the base backup onto the new instance and begin applying WAL from Heroku. The total time to apply WAL depends on the amount of WAL being generated by the source database. Typically, the restore of the base backup takes about 90 minutes per TB.

Once the database is applying WAL it can essentially be promoted at any time. However, it is recommended, especially for production clusters, to do an initial test promotion so that amcheck can be run against all indexes. This is essential because Heroku runs an older version of glibc that can sometimes result in different sort orders within indexes. Any affected indexes will need to be recreated (for more information, see blog post on the topic - Postgres Migration Pitstop: Collations).

A new WAL file is created either when the file grows to wal_segment_size (16MB in Heroku Postgres), or when it hits the archive_timeout (1 minute in Heroku Postgres), whichever comes first. Synchronization status can be seen by comparing LSN's between the leader (on Heroku) and the follower (on Crunchy Bridge): Leader:

select pg_current_wal_lsn();

Follower:

select pg_last_wal_replay_lsn();

Alternately, you can compare now() with the timestamp of the last transaction that was been applied. However if there is very little activity in the database then the timestamp of the last transaction will not increment, giving the false impression that there is lag when there is not:

select now()- pg_last_xact_replay_timestamp();

Application Cutover

The live application cutover will normally take place on a conference call between the customer and Crunchy Bridge Customer Success Engineering since there are multiple moving parts that must be coordinated. An example migration event might look like this:

~ Before Migration Event ~

  1. Schedule migration
  2. Update status page
  3. (Crunchy) Verify that replication is still working and lag is no more than a minute or two

~ Migration Event ~