{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "In my domain, the energy world, we work a many with timeseries. Generally, these arrive in batches where one day (in localtime) always arrives at once and there can be multiple versions of the same data. A normal portfolio has many thousands of assets. Often one wants to aggregate these timeseries over a subset of assets, for some time range. This is not a trivial thing to do in a row-oriented database like Postgres. Below we will explore some methods to store this data, and benchmark different queries to get the same result.\n", "\n", "\n", "\n", "See [this post](/posts/postgres-notify/) for more tips on how to work with psql and psycopg2 from a notebook." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "import os, random, json, pytz\n", "from datetime import datetime, timedelta, time\n", "from time import time as t\n", "from functools import lru_cache\n", "from itertools import zip_longest\n", "\n", "from IPython.display import Javascript\n", "import plotly.express as px\n", "import tqdm" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting package metadata (current_repodata.json): ...working... done\n", "Solving environment: ...working... done\n", "\n", "# All requested packages already installed.\n", "\n", "psql (PostgreSQL) 12.4 (Ubuntu 12.4-1.pgdg20.04+1)\n" ] } ], "source": [ "# Install python dependencies\n", "!conda install -y --quiet psycopg2\n", "# Install postgres (if not already installed)\n", "!sudo apt-get install postgresql -yqq\n", "# It needs to run in the background\n", "!sudo service postgresql start\n", "!psql --version" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CREATE ROLE\n", "CREATE DATABASE\n", "ALTER ROLE\n" ] } ], "source": [ "# Optionally drop the user and/or database if starting anew\n", "# !sudo su - postgres -c \"psql -c \\\"DROP DATABASE timeseries_benchmark\\\"\"\n", "# !sudo su - postgres -c \"psql -c \\\"DROP USER tse\\\"\"\n", "# Create user, db, and give access\n", "!sudo su - postgres -c \"psql -c \\\"CREATE USER tse WITH PASSWORD 'abc'\\\" && psql -c \\\"CREATE DATABASE timeseries_benchmark OWNER tse\\\"\"\n", "# Needed to be able to COPY FROM a file\n", "!sudo su - postgres -c \"psql -c \\\"ALTER USER tse WITH SUPERUSER\\\"\"" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "import psycopg2\n", "conn = psycopg2.connect(user=\"tse\", password=\"abc\", database=\"timeseries_benchmark\", host=\"localhost\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create tables\n", "\n", "First we make the three tables to be benchmarked, they differ:\n", "\n", "- actuals_array_double: One array per asset/day/version, with DOUBLE PRECISION (8 bytes) datatype\n", "- actuals_array: Same but with REAL (4 bytes)\n", "- actuals_cols: Also with REAL but split into 100 physical columns instead of ARRAYs" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# Drop tables if running anew\n", "# !sudo su - postgres -c \"psql -d timeseries_benchmark -c \\\"DROP TABLE IF EXISTS actuals_array_double; DROP TABLE IF EXISTS actuals_array; DROP TABLE IF EXISTS actuals_cols;\\\"\"\n", "\n", "# FLOAT means DOUBLE PRECISION (8B)\n", "with conn as conn:\n", " with conn.cursor() as curs:\n", " curs.execute(\"\"\"\n", " CREATE TABLE actuals_array_double (\n", " id serial PRIMARY KEY,\n", " asset_id INTEGER NOT NULL,\n", " day DATE NOT NULL,\n", " created TIMESTAMP NOT NULL,\n", " values FLOAT[] NOT NULL,\n", " CONSTRAINT actuals_array_double_asset_day UNIQUE (asset_id, day, created)\n", " );\n", " \"\"\")\n", "\n", "# REAL is single precision floating point\n", "with conn as conn:\n", " with conn.cursor() as curs:\n", " curs.execute(\"\"\"\n", " CREATE TABLE actuals_array (\n", " id serial PRIMARY KEY,\n", " asset_id INTEGER NOT NULL,\n", " day DATE NOT NULL,\n", " created TIMESTAMP NOT NULL,\n", " values REAL[] NOT NULL,\n", " CONSTRAINT actuals_array_asset_day UNIQUE (asset_id, day, created)\n", " );\n", " \"\"\")\n", "\n", "with conn as conn:\n", " with conn.cursor() as curs:\n", " pte_str = \",\\n \".join(f\"pte{i:02d} REAL\" for i in range(100))\n", " curs.execute(f\"\"\"\n", " CREATE TABLE actuals_cols (\n", " id serial PRIMARY KEY,\n", " asset_id INTEGER NOT NULL,\n", " day DATE NOT NULL,\n", " created TIMESTAMP NOT NULL,\n", " {pte_str},\n", " CONSTRAINT actuals_cols_asset_day UNIQUE (asset_id, day, created)\n", " );\n", " \"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating and importing testdata\n", "\n", "I'm creating a reasonable amount of data, about 16M rows, as that is still okay to load and query on my laptop. To reach that amount of data I use 5000 assets, 3 years, and 3 versions of daily actuals." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# Use a cache to speed up the function call, the randomness is not critical\n", "@lru_cache(maxsize=10000)\n", "def randvals(d: datetime, tz = pytz.timezone(\"Europe/Amsterdam\"), values_arr=True):\n", " d1 = tz.localize(datetime.combine(d, time()))\n", " d2 = tz.localize(datetime.combine(d + timedelta(days=1), time()))\n", " periods = int((d2 - d1).total_seconds() / 3600 * 4)\n", " values = [random.random() for i in range(periods)]\n", " if values_arr:\n", " return '\"{' + \",\".join(f\"{v:.3f}\" for v in values) + '}\"'\n", " else:\n", " # Always make 100 columns, but fill with unquoted empty string\n", " # See parameters -> NULL here: https://www.postgresql.org/docs/9.2/sql-copy.html\n", " return \",\".join(f\"{v:.3f}\" if v else \"\" for i, v in zip_longest(range(100), values))\n", "\n", "def generate_fake_data(filename, n_assets=5E3, n_years=3, n_versions=3, stop_dt=datetime.now(), values_arr=True):\n", " with open(filename, \"w\") as f:\n", " i = 0\n", " for days_ago in tqdm.tqdm(range(n_years * 365)):\n", " for asset_id in range(int(n_assets)):\n", " day = (stop_dt - timedelta(days=days_ago))\n", " for i_v in range(n_versions):\n", " i += 1\n", " created = day - i_v * timedelta(days=3)\n", " values = randvals(day, values_arr=values_arr)\n", " line = f\"{i:d}, {asset_id:d}, '{day:%Y-%m-%d}', '{created.isoformat()}', {values}\"\n", " f.write(line + \"\\n\")\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 1095/1095 [01:50<00:00, 9.88it/s]\n" ] } ], "source": [ "generate_fake_data(\"actuals_array.csv\")" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 1095/1095 [01:47<00:00, 10.19it/s]\n" ] } ], "source": [ "generate_fake_data(\"actuals_cols.csv\", values_arr=False)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1, 0, '2020-09-04', '2020-09-04T20:16:44.631097', \"{0.172,0.477,0.225,0.707,0.876,0.952,0.913,0.252,0.598,0.292,0.326,0.854,0.821,0.203,0.706,0.531,0.560,0.150,0.843,0.727,0.669,0.039,0.144,0.843,0.019,0.708,0.179,0.308,0.967,0.307,0.031,0.120,0.095,0.824,0.623,0.377,0.916,0.699,0.586,0.953,0.704,0.372,0.156,0.620,0.408,0.229,0.400,0.496,0.502,0.983,0.394,0.683,0.974,0.253,0.398,0.241,0.317,0.899,0.322,0.312,0.659,0.567,0.404,0.210,0.374,0.307,0.126,0.483,0.074,0.536,0.022,0.230,0.881,0.246,0.742,0.247,0.807,0.694,0.570,0.447,0.329,0.570,0.221,0.954,0.693,0.355,0.506,0.111,0.826,0.973,0.822,0.707,0.918,0.734,0.171,0.597}\"\n", "1, 0, '2020-09-04', '2020-09-04T20:16:44.631097', 0.401,0.674,0.385,0.774,0.085,0.304,0.894,0.695,0.591,0.050,0.157,0.392,0.334,0.009,0.095,0.452,0.083,0.289,0.336,0.174,0.140,0.918,0.005,0.389,0.815,0.001,0.323,0.000,0.759,0.697,0.318,0.948,0.985,0.327,0.414,0.605,0.260,0.701,0.156,0.568,0.503,0.455,0.042,0.582,0.310,0.860,0.013,0.190,0.313,0.168,0.506,0.163,0.792,0.828,0.161,0.160,0.356,0.920,0.774,0.493,0.224,0.194,0.326,0.107,0.645,0.954,0.382,0.151,0.593,0.914,0.739,0.332,0.469,0.053,0.472,0.831,0.062,0.538,0.767,0.467,0.909,0.363,0.865,0.744,0.167,0.171,0.442,0.305,0.563,0.144,0.228,0.157,0.473,0.210,0.846,0.958,,,,\n", "16425000 actuals_array.csv\n" ] } ], "source": [ "!head -1 actuals_array.csv\n", "!head -1 actuals_cols.csv\n", "!wc -l actuals_array.csv" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "def copy_from(conn, filename, table):\n", " path = os.path.abspath(filename)\n", " print(f\"Importing '{filename}' to '{table}'\")\n", " with conn as conn:\n", " with conn.cursor() as curs:\n", " curs.execute(f\"DELETE FROM {table};\")\n", " curs.execute(f\"COPY {table} FROM '{path}' (FORMAT CSV);\")\n", " curs.execute(f\"SELECT COUNT(*) FROM {table};\")\n", " n_rows = curs.fetchone()[0]\n", " curs.execute(f\"SELECT pg_size_pretty(pg_relation_size('{table}'));\")\n", " print(f\"Table '{table}' rows and size:\", n_rows, \",\", curs.fetchone()[0])" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Importing 'actuals_array.csv' to 'actuals_array_double'\n", "Table 'actuals_array_double' rows and size: 16425000 , 14 GB\n", "CPU times: user 17.1 ms, sys: 851 µs, total: 18 ms\n", "Wall time: 7min 43s\n", "Importing 'actuals_array.csv' to 'actuals_array'\n", "Table 'actuals_array' rows and size: 16425000 , 7547 MB\n", "CPU times: user 13 ms, sys: 1 ms, total: 14 ms\n", "Wall time: 6min 53s\n", "Importing 'actuals_cols.csv' to 'actuals_cols'\n", "Table 'actuals_cols' rows and size: 16425000 , 7129 MB\n", "CPU times: user 11.4 ms, sys: 0 ns, total: 11.4 ms\n", "Wall time: 5min 40s\n" ] } ], "source": [ "%time copy_from(conn, \"actuals_array.csv\", \"actuals_array_double\")\n", "%time copy_from(conn, \"actuals_array.csv\", \"actuals_array\")\n", "%time copy_from(conn, \"actuals_cols.csv\", \"actuals_cols\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tuning Postgres\n", "\n", "I [tuned postgres](https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server) a bit to increase performance, as my machine is much larger than the default settings assume. After checking out TimescaleDB, I used their tune script which seemed to set even better values.\n", "\n", "Find the config file with `SHOW config_file;`, edit it, and then restart with `systemctl restart postgresql`.\n", "\n", "`timescaledb-tune` gave me these settings based on my system (only listed the values that were changed from their defaults):\n", "```\n", "effective_cache_size = 23976MB\n", "maintenance_work_mem = 2047MB\n", "work_mem = 10229kB\n", "\n", "max_worker_processes = 19\n", "max_parallel_workers_per_gather = 4\n", "max_parallel_workers = 8\n", "\n", "wal_buffers = 16MB\n", "min_wal_size = 512MB\n", "\n", "default_statistics_target = 500\n", "random_page_cost = 1.1\n", "checkpoint_completion_target = 0.9\n", "max_locks_per_transaction = 256\n", "autovacuum_max_workers = 10\n", "autovacuum_naptime = 10\n", "effective_io_concurrency = 200\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Use pev2 to visualize Postgres query plans\n", "\n", "I came across a very nice visualization tool called PEV2. It's a Vue.js module, so naturally I wanted to go the extra mile and make it possible to render this in a jupyter lab notebook. Some hours later I finally got it to work, the steps were:\n", "- Cloned pev2 and followed the [pev2 getting started guide](https://github.com/dalibo/pev2/blob/master/CONTRIBUTING.md)\n", "- Install vue-cli-service script using `npm i @vue/cli-service`\n", "- Compile only the Plan top-level module as UMD (universal, so will register as `window.pev2` when loaded with no specific loader) `./node_modules/.bin/vue-cli-service build --target lib src/components/Plan.vue`\n", "- Insert all required css/javascript in head/body, respectively\n", "- Create function that instantiates the pev2 Vue Component using a new Vue instance\n", "- (When rendering notebook for blog put javascript in EXTRA_HEAD_DATA in conf.py, I could not get it to reliably load when output in a cell like below)\n", "\n", "This works surprisingly well! One could definitely create a nice jupyter lab extension this way.\n", "\n", "I should also mention that I tried [pgMustard](https://www.pgmustard.com/getting-a-query-plan), which gave me some pretty good tips on index use. Unfortunately it's not embeddable." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%javascript\n", " function addScript(url) {\n", " var script = document.createElement(\"script\");\n", " script.type = \"text/javascript\";\n", " script.src = url;\n", " //document.head.appendChild(script);\n", " // For use on website, as anything added to
is not available on jupyter reload\n", " element.append(script);\n", " }\n", " function addCSS(url) {\n", " var link = document.createElement(\"link\");\n", "\n", " link.type = \"text/css\";\n", " link.rel = \"stylesheet\";\n", " link.href = url;\n", "\n", " //document.head.appendChild(link);\n", " element.append(link);\n", " }\n", " addCSS(\"https://unpkg.com/bootstrap@4.5.0/dist/css/bootstrap.min.css\");\n", " addCSS(\"https://unpkg.com/@fortawesome/fontawesome-free@5.13.0/css/all.css\");\n", " addScript(\"https://cdn.jsdelivr.net/npm/vue@2.6.12/dist/vue.min.js\");\n", " // Only works if you're running jupyter lab with this notebook as root repo and the compiled file in this folder with name pev2.umd.js\n", " addScript(\"/files/pev2.umd.js\");" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "def pev2(query, plan):\n", " \"\"\"\n", " Visualize a Postgres query plan with pev2. Try to pass a text query plan, not JSON,\n", " as text has less issues with escaping double quotes.\n", " \"\"\"\n", " s = \"\"\"\n", " var el = document.createElement('div');\n", " element.append(el);\n", "\n", " var v = new Vue({\n", " el: el,\n", " components: {\n", " 'pev2': window.pev2\n", " },\n", " data: function() {\n", " return {\n", " plan: `\"\"\" + plan + \"\"\"`,\n", " query: `\"\"\" + query + \"\"\"`,\n", " }\n", " },\n", " template: \"