Pure Postgres Cosine Similarity Search

I recently released pg-costop, a small experiment in performing simple vector arithmetic and weighted, variably randomized cosine similarity search in pure Postgres. I originally created it for a system that could only run Postgres procedures and nothing else. It shouldn’t be used in production, but it can serve as a useful learning exercise.

pg-costop allows you to perform basic arithmetic on vectors, such as:

SELECT ct_add(
  a := ARRAY[1,2,3],
  b := ARRAY[4,5,6]
);

-- => {5.0,7.0,9.0}
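For illustration, the same element-wise addition can be sketched outside of Postgres. This is a hypothetical Python analogue, not part of pg-costop itself:

```python
def ct_add_py(a, b):
    """Python analogue of ct_add: element-wise sum of two vectors.

    Hypothetical helper for illustration only; pg-costop operates on
    Postgres double precision[] arrays instead of Python lists.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    return [float(x) + float(y) for x, y in zip(a, b)]
```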

However, its main purpose is facilitating advanced cosine proximity / similarity ranking.

Say you’re building a Spotify competitor, and you want to offer users a frontpage with recommendations based on tracks they have listened to or liked. If a user disliked a track, similar tracks should also rank lower. Finally, you want to show a slightly fresh batch of recommendations every time. That’s easy with costop!

Let’s build on this example. Each track in our platform’s database is represented by a vector generated previously by our magic AI. The vector represents the “feel” or identity of the track, and tracks with a smaller cosine distance between them have a similar feel. We want to optimize the recommendation query, so next to our vector (which has many dimensions), we also pre-calculate and store its magnitude. While costop can calculate the magnitude if it is not supplied, that’s going to be around 3x slower.
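The math behind this is straightforward. A minimal Python sketch of cosine similarity (hypothetical function names, not pg-costop’s actual code) shows why caching the magnitude helps: with cached norms, each comparison only pays for the dot product instead of also recomputing two norms.

```python
import math

def vec_magnitude(v):
    # Euclidean norm -- the value we pre-calculate and store per track
    return math.sqrt(sum(x * x for x in v))

def cosine_similarity(a, b, norm_a=None, norm_b=None):
    # With cached norms we only compute the dot product;
    # without them, every call redoes a full pass over each vector.
    if norm_a is None:
        norm_a = vec_magnitude(a)
    if norm_b is None:
        norm_b = vec_magnitude(b)
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)
```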

So let’s retrieve some sample data:

SELECT vec, magnitude
FROM track_properties;

-- vec:       [-4.43744993210000000, 22.10746002200000000, 34.09545898440000000, ...]
-- magnitude: 567.9233
-- vec:       [1.48314607140000000, 2.67624783520000000, 8.49940872190000000, ...]
-- magnitude: 48.442005

All vectors in costop are treated as double precision[]. Here’s an example of how we could rank tracks:

WITH
  liked AS (
    SELECT track_id, vec, magnitude
    FROM track_properties
    WHERE track_id = 'foo'
  ),
  dont_like AS (
    SELECT vec, magnitude
    FROM track_properties
    WHERE track_id = 'bar'
  ),
  weighted AS (
    SELECT *
    FROM ct_weigh(
      positive := ARRAY(SELECT vec FROM liked)
     ,negative := ARRAY(SELECT vec FROM dont_like)
     ,neg_scale := 0.1
     ,noise := 0.05
    )
  )
SELECT
   tp.track_id
  ,ct_similarity(
    a := (select w.vec from weighted w)
   ,b := tp.vec
   ,norm_a := (select w.norm from weighted w)
   ,norm_b := tp.magnitude
   )
FROM track_properties tp
WHERE track_id NOT IN (SELECT track_id FROM liked)
ORDER BY 2;

That’s a lot! Let’s break this down.

First, liked queries the vectors from the tracks we’d like to see more of. For this example, there’s just one track in there. Next, dont_like queries vectors for tracks the user doesn’t like. Again, just one as an example.

Without respecting likes and dislikes, and without the random component, we could just use the vector from the liked track directly in the ct_similarity function we’ll get to in a moment. But that only really works for a single source vector. If we want to recommend tracks based on, say, the last 5 tracks a user has liked, we need to use ct_weigh.

ct_weigh builds a weighted vector taking into account user preferences and a bit of chance. positive receives a list of vectors to bias the weighted vector towards, whereas negative is a list of vectors to bias against. There are also optional neg_scale and pos_scale parameters. In the example above, all positive vectors are counted as they are, but negative vectors are scaled to only have 10% of their original impact.

Finally, we add a bit of chance (5%) using the optional noise parameter.
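The exact formula ct_weigh uses isn’t shown here, but one plausible sketch in Python (my assumption, not the library’s actual implementation) is: sum the positive vectors, subtract the scaled negative vectors, then jitter each component by up to ±noise of its value.

```python
import random

def weigh(positive, negative=(), pos_scale=1.0, neg_scale=1.0, noise=0.0):
    # Illustrative guess at ct_weigh's behaviour; the parameter names
    # mirror the SQL call above, but the real formula may differ.
    out = [0.0] * len(positive[0])
    for vec in positive:
        for i, x in enumerate(vec):
            out[i] += pos_scale * x
    for vec in negative:
        for i, x in enumerate(vec):
            out[i] -= neg_scale * x
    if noise:
        # randomize each component by up to +/- noise (e.g. 5%)
        out = [x * (1.0 + random.uniform(-noise, noise)) for x in out]
    return out
```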

Now, we use the resulting vector as the origin and try to find similar tracks using ct_similarity. As mentioned previously, we make use of cached magnitude values: norm_a comes from the ct_weigh output, and norm_b from the magnitude column of our tracks table. ct_similarity returns a value that, when used with the default ascending ORDER BY (the 2 there just references the second SELECTed value), will sort the most similar tracks first.

We also exclude all input tracks with WHERE track_id NOT IN (SELECT track_id FROM liked), because the most similar tracks are of course the same tracks we put in.

You can find the complete API docs for pg-costop in the GitHub repo. Of course, there are several performance optimizations one could consider, mostly fundamentally different algorithms (locality-sensitive hash grouping, ball-graphs, …), but those are only useful for datasets whose size exceeds the practical limits of this experimental Postgres procedure.

In the meantime, Spotify has open-sourced their nearest-neighbour library Annoy, which also has bindings for all kinds of languages.