Changelog

Every entry on this page is shipped, verified work — no aspirational entries, no roadmap dressed up as progress. This page is generated directly from the repository's CHANGELOG.md by site/build_changelog.py, so the public log and the real build log are the same document.

Public-format build log, week 1 onward. Every entry is shipped, verified work — no aspirational entries (see spec/05-team-charter.md non-negotiables).

Week 5

2026-06-27 — A cleaner, at-a-glance Overview. The dashboard Overview is now a one-screen digest. The Customers, Products and Geography breakdowns fold into tidy collapsible sections, each showing a quick summary right on its header — how many customers and their average lifetime value, your product count and top seller, how many countries and the leading one — so the highlights are visible without expanding. A new Products section brings the top products and the category revenue-mix onto the Overview, and a Revenue over time chart now sits directly under the headline numbers. Every summary figure is governed (never a fabricated 0), and the trend shows an honest note rather than a faked line until a store has more than one month of data. Reads well on phones too. Zero-JS, $0, dependency-free.
2026-06-27 — New brand typeface and an app-style website. Tilltrend now uses its brand typeface (DM Sans) across the dashboard and the website, and the marketing site has been redesigned to match the product — a dark, app-style look with the emerald accent that reads cleanly on desktop and phones. The font is self-hosted (no third-party font CDN — nothing is fetched from elsewhere when the page loads). No number changed, and neither did the way any number is computed.
2026-06-27 — Plum is now Tilltrend. New name, new logo, and an emerald palette across the merchant dashboard and the public site. Same product, same governed-metrics-with-receipts — only the name changed; existing stores keep all their data and history. Your dashboard now also refreshes automatically every night, so new orders appear without a manual sync.
2026-06-21 — Connect Tableau to your governed metrics. Prefer Tableau? There's now a starter for it too: five text-only .tds data sources that connect live to the gold views (through the read-only plum_bi role) — a sales star plus the pre-aggregated views — with the same 8 governed calculations as the Power BI kit, re-expressed in Tableau syntax (Average Order Value as the ratio of sums, never an average of monthly AOVs; Median LTV, the honest figure, not the mean; Active Customers as a non-additive distinct count), each carrying its caveat in the field description. Your Tableau workbooks inherit the same honest numbers as Ask, the dashboard, and Power BI. Text-only, $0, no new dependencies. (Live database connection needs Tableau Desktop — Tableau Public has no PostgreSQL connector.)
2026-06-21 — One-click Power BI starter. The Power BI starter is now a double-click project (plum-starter.pbip): open it and it loads the gold tables and all 8 governed measures, ready to drop visuals onto — no manual model setup. (The connection-only file is still there if you prefer a blank report.)
2026-06-21 — Connect Power BI to your governed metrics. A starter Power BI kit: a one-click connection to the gold views (through the read-only plum_bi role) plus a governed semantic model whose DAX measures are 1:1 with Plum's metric definitions — Total Revenue, Orders, Units, Active Customers, Average Order Value (ratio of sums, never an average of monthly AOVs), Median LTV (the honest figure, not the mean), and Repeat-revenue share — each carrying its caveat. Your Power BI reports inherit the same honest numbers as Ask and the dashboard. Text-only, $0, no new dependencies.
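To see why the Average Order Value measure is defined as a ratio of sums, here is a minimal sketch with made-up monthly totals (the figures and function names are illustrative assumptions, not taken from the product). Averaging the monthly AOVs gives every month equal weight, so a quiet month distorts the result.

```python
from decimal import Decimal

# Hypothetical monthly totals: (revenue, order_count). Not real store data.
months = [
    (Decimal("1000"), 10),  # monthly AOV 100
    (Decimal("9000"), 30),  # monthly AOV 300
]

def aov_ratio_of_sums(rows):
    """Total revenue divided by total orders (the governed form)."""
    revenue = sum(r for r, _ in rows)
    orders = sum(n for _, n in rows)
    return revenue / orders if orders else None  # None rather than a made-up 0

def aov_average_of_averages(rows):
    """Mean of the per-month AOVs: weights every month equally."""
    per_month = [r / n for r, n in rows if n]
    return sum(per_month) / len(per_month) if per_month else None

print(aov_ratio_of_sums(months))        # 250
print(aov_average_of_averages(months))  # 200
```

The two forms disagree (250 versus 200) because the second month carries three times as many orders as the first.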
2026-06-21 — Bigger Trends charts, with the edge dates fully clear. The time-series charts on the Trends page are now taller and have more breathing room at the sides, so the first and last dates — and the value tag on the latest point — sit comfortably inside the chart instead of hugging (or clipping at) the edge.
2026-06-21 — "Frequently bought together" and purchase cadence, now on screen. The market-basket and purchase-cadence metrics get on-screen surfaces: a Frequently bought together table on the Products page (the product pairs that show up in the same orders most often, with how many orders and the share), and a Purchase cadence panel on the dashboard (the typical days between orders for repeat customers — median first, with the average and the slow-10% tail). Honest as ever — co-occurrence is flagged as correlation, not causation, and cadence covers the repeat base only. Zero-JS, $0, dependency-free.
2026-06-21 — Two more answers: "frequently bought together" and how often customers reorder. Market basket ranks the product pairs that show up in the same orders most often (with how many orders, and what share of all orders) — handy for bundles and cross-sell, with the honest caveat that it's co-occurrence, not proof that one drives the other. Purchase cadence shows how often repeat customers come back — the typical days between orders (median, plus the average and the slow-90th-percentile tail) across customers with two or more orders. Both answerable in Ask, over MCP, and in Power BI / Tableau. $0, dependency-free.
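As a rough illustration of the market-basket idea, the sketch below counts how often each product pair appears in the same order and reports the share of all orders. The order data and the function name are hypothetical, and the governed metric itself runs as SQL over the warehouse, so treat this only as a picture of the counting rule.

```python
from collections import Counter
from itertools import combinations

# Hypothetical orders: order id -> set of products in that order.
orders = {
    "o1": {"mug", "tea"},
    "o2": {"mug", "tea", "spoon"},
    "o3": {"mug"},
    "o4": {"tea", "spoon"},
}

def pair_counts(orders):
    """Count co-occurring product pairs; each pair is counted once per order."""
    counts = Counter()
    for items in orders.values():
        # Sorting makes ("mug", "tea") and ("tea", "mug") the same key.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(orders)
for pair, n in counts.most_common():
    print(pair, n, f"{n / len(orders):.0%} of orders")
```

A high count says only that the two products tend to share an order; it does not show that one purchase causes the other.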
2026-06-21 — Charts for dead stock and customer age. The dead-stock view — products grouped Active / Slow-moving / Dead by how long since their last sale — now appears as a chart on the Products page beside the performance tiers, so you can see at a glance how much of the catalog (and how much past revenue) is sitting idle. Buyers by age group is now a chart on the dashboard, with an honest "Unknown" bar and the exact birthdate coverage when ages aren't on file. Zero-JS, $0, dependency-free.
2026-06-20 — Charts for customer recency and product tiers, plus tidier date labels. The customer recency (Active / At-risk / Churned) and product performance tier (High / Mid / Low) breakdowns now appear as charts in the app — recency beside customer segments on the dashboard, product tiers on the Products page — each a clean split of share-of-customers (or products) versus share-of-revenue, with the cut-offs disclosed under the chart. We also tidied the date labels on the Trends charts: they're now compact (e.g. "Dec '10", "Q4 '10", "Dec 29") and aligned so the last date no longer runs off the edge or overlaps on a phone — the full date is still in the table and on hover. Zero-JS, $0, dependency-free.
2026-06-20 — Spot dead stock, and see your customer age mix. Two more governed metrics — answerable in Ask, over MCP, and in Power BI / Tableau. Product recency flags dead stock: products grouped Active / Slow-moving / Dead by how long since their last sale (measured against your latest data), so you can see at a glance what to restock, push, or retire — and how much past revenue is tied up in it. Customer age groups break identified buyers into the usual brackets (under 20, 20–29, 30–39, 40–49, 50+). Age needs a birthdate on file, which many stores — Shopify included, by default — don't collect; when it's missing the metric says so honestly with an "unknown" group and the exact coverage, never invented demographics. $0 and dependency-free, as ever.
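The dead-stock grouping can be pictured with the sketch below. The two cut-offs are placeholder values chosen for the example (the changelog does not state the real ones), and the product data is hypothetical. The point it illustrates is that idleness is measured against the latest date in the data, not against today's date.

```python
from datetime import date

# Hypothetical cut-offs in days; the real values are not stated in the changelog.
SLOW_AFTER_DAYS = 30
DEAD_AFTER_DAYS = 90

def recency_group(last_sale, as_of):
    """Classify a product by days since its last sale, relative to `as_of`."""
    days_idle = (as_of - last_sale).days
    if days_idle > DEAD_AFTER_DAYS:
        return "Dead"
    if days_idle > SLOW_AFTER_DAYS:
        return "Slow-moving"
    return "Active"

last_sales = {
    "mug": date(2026, 6, 18),
    "tea": date(2026, 4, 30),
    "spoon": date(2026, 1, 5),
}
as_of = max(last_sales.values())  # the latest data point, not date.today()

groups = {name: recency_group(d, as_of) for name, d in last_sales.items()}
print(groups)
```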
2026-06-20 — Know your customers and products at a glance: segments, recency, and performance tiers. Three new governed metrics, each answerable in Ask, over the MCP connection, and in Power BI / Tableau — from one shared definition. Customer segments (VIP / Regular / New) split customers by how much they spend and how long they've been buying, now shown as a chart on the Overview page where the gap between "share of customers" and "share of revenue" tells you where the money really is. Customer recency flags who's Active, At-risk, or Churned by time since their last order (measured against your latest data, so it reads "now" for a live store). Product performance tiers group products into High / Mid / Low by revenue. The cut-offs — what counts as a "VIP", "at-risk", or a "high-performer" — are yours to set, and the exact values are shown on every answer's receipt, so a number always states the rule it used and you can verify it. $0 and dependency-free, as ever.
2026-06-20 — A friendlier dashboard on your phone. We went through every admin page on a phone and reworked the navigation for small screens. The old cramped strip of tiny, unlabelled icons is gone — there's now a hamburger menu (☰) that opens a floating, translucent panel of clearly labelled pages from the side; tap the ✕ or anywhere outside to close it. The Trends grain selector now wraps instead of cutting off "Year", and a few layout tweaks keep text and tables sitting cleanly within the screen. The menu uses no JavaScript, and the desktop layout is unchanged.
2026-06-20 — Smaller Trends charts, and a single home for revenue over time. The Trends charts now render at a more compact size on screen, and the separate Revenue page has been folded into Trends — its month view already shows the same monthly revenue plus a running total, a trailing moving average, and month-over-month / year-over-year change, so there is now one more-capable place for "revenue over time" instead of two overlapping ones. The governed monthly-revenue definition is unchanged: Ask still answers "monthly revenue" with the same receipted numbers and exact SQL.
2026-06-20 — Trends: sales by day, week, month, quarter, and year — with charts. A new Trends page answers the time questions a merchant actually asks — revenue by day / week / month / quarter / year, the latest period, and the most recent day — each as a line/area chart with a running total to date, a trailing moving average, and the change vs the prior period (and, for months and quarters, vs the same period last year). Pick the grain with one click. Underneath it's a proper BI date dimension (gold.dim_date) and five governed period views built with window functions and CTEs (running SUM, trailing AVG, LAG) over your sales — so a period with no orders shows an honest 0 (never a gap), the partial first/last period is flagged (not read as a trend), and a period-over-period change with no prior period reads "no prior period" rather than a fabricated number. Every grain reconciles to the same total. The same governed metrics power Ask, so "show me weekly revenue" now returns a receipted answer with the exact SQL — $0 and dependency-free, as ever.
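The period views themselves are SQL window functions; the Python sketch below only mirrors the three calculations (running SUM, trailing AVG, LAG) so the honesty rules are easy to see. The function name, the window of 3, and the handling of a zero prior period are assumptions made for this example.

```python
from itertools import accumulate

def period_rows(revenues, window=3):
    """One dict per period: revenue, running total, trailing average, change."""
    running = list(accumulate(revenues))  # running SUM
    rows = []
    for i, revenue in enumerate(revenues):
        tail = revenues[max(0, i - window + 1): i + 1]
        trailing_avg = sum(tail) / len(tail)          # trailing AVG
        prior = revenues[i - 1] if i > 0 else None    # LAG(revenue, 1)
        if prior is None:
            change = "no prior period"
        elif prior == 0:
            change = "prior period was 0"  # assumption: no percentage off a zero base
        else:
            change = f"{(revenue - prior) / prior:+.1%}"
        rows.append({
            "revenue": revenue,
            "running_total": running[i],
            "trailing_avg": round(trailing_avg, 2),
            "change": change,
        })
    return rows

# A period with no orders is passed in as an explicit 0, never left out.
rows = period_rows([100, 0, 150])
for row in rows:
    print(row)
```

The first period reports "no prior period" instead of a number, and the empty middle period still appears as a row with revenue 0.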
2026-06-20 — Merchant dashboard redesign, one-click data refresh, and automatic store currency. A first polish pass on the merchant-facing app, all $0 and dependency-free. Look & feel: a calmer, denser dashboard — teal on a charcoal board in dark mode, a clean blue-on-white light mode, a fixed left icon sidebar, and monitor/sun/moon theme icons; the app is now responsive on phones, and chart text is one consistent size. Sync now: because each store's warehouse is a built snapshot, newly added orders didn't appear until a re-sync — there is now a one-click "Sync now" on the dashboard (plus an opt-in nightly re-sync). The rebuild is atomic, so the dashboard stays up while it runs. Currency, the honest way: amounts now show the store's own currency symbol, captured automatically from Shopify at sync — never an invented exchange rate; if the currency isn't known yet, a bare number is shown rather than a wrong symbol.
2026-06-20 — "Ask" is now a chat, and it shows the exact SQL. The Ask page is reframed as a conversation in the Plum palette — your question, then the analyst's receipted answer — and every answer now displays the exact SQL the deterministic layer ran (query, parameters, row count) beside the receipt id and hashes. The model only picks a governed tool; code writes and runs that SQL, and every number traces to it — so the whole process is visible, not just the result. The duplicate query-log table was removed from the Ask page (the Audit page remains its single home).
2026-06-19 — Multi-store install proven end-to-end against a Shopify development store. Ran the full Shopify App Store install flow for real on a free Shopify development store — an engineering validation, not a customer: the store holds Shopify's synthetic test data, so any figures it shows are not real-merchant numbers. The path that had only ever been fixture-tested now works live: OAuth install → HMAC + signed-state verification → token exchange → a live GraphQL Admin API pull (version 2026-04) → a per-tenant warehouse build → trial start → the first-run onboarding page → a working dashboard, with every number receipted. Also bumped the Shopify Admin API client 2025-10 → 2026-04 (latest stable) and taught the production launcher to carry the app credentials.
2026-06-19 — First-run onboarding: "syncing" state + trial-days banner (D-030, Phase 4). Closed the two gaps a merchant hits before the dashboard, both $0 and zero-dependency. (1) First-run syncing page: a freshly OAuth-installed tenant is provisioned empty and built by a background first sync, so until that lands its database has no schemas and the dashboard used to 500. A gate in http_api (merchant session + warehouse not built) now renders a branded, honest syncing_page — zero-JS <meta refresh> so the dashboard appears on its own when the build commits, a CSS-only spinner with a reduced-motion opt-out, and a plain "didn't finish" message (no auto-refresh) on a failed sync. The gate keys on build existence, not "sync running", so a later re-sync of a built tenant never flickers (build.py is atomic). Sync state is tracked via shops.set_sync(running|done|failed, started_at) from sync.sync_tenant. (2) Trial-days banner: entitlements.trial_banner renders days-left (rounded up; "Last day" inside 24h) + a Subscribe CTA, bound into _layout via a _BANNER ContextVar (same pattern as the theme var) and shown only for a trialing merchant session — silent for active subscriptions, expired trials (the entitlement gate shows the subscribe page), and the operator. Verified: new test_onboarding.py (6), trial-banner tests in test_billing.py (+2), sync-status test in test_sync.py (+1), and the real-HTTP smoke_sessions.py extended to 21/21 — a trialing session sees the banner and a second EMPTY-provisioned tenant returns the syncing page (200, not a 500) over the production path. Operator smoke_admin green; C8 receipt-integrity suites pass on a demo warehouse (interception 7/7, fault_injection 24/24).
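The days-left rule for the trial banner (rounded up, with "Last day" inside the final 24 hours) can be sketched as follows. This is not the code of entitlements.trial_banner; the function name, the label wording other than "Last day", and the treatment of the exact 24-hour boundary are assumptions.

```python
import math
from datetime import datetime, timedelta, timezone

def trial_days_label(now, trial_ends_at):
    """Return the banner label, or None when the trial has already ended."""
    remaining = trial_ends_at - now
    if remaining <= timedelta(0):
        return None  # expired trials show no banner
    if remaining <= timedelta(hours=24):
        return "Last day"
    # Dividing two timedeltas yields a float number of days; round it up.
    days = math.ceil(remaining / timedelta(days=1))
    return f"{days} days left"

now = datetime(2026, 6, 19, 12, 0, tzinfo=timezone.utc)
print(trial_days_label(now, now + timedelta(days=13, hours=1)))  # 14 days left
print(trial_days_label(now, now + timedelta(hours=5)))           # Last day
```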
2026-06-14 — Multi-store foundation + Shopify App Store readiness (Phase 4, D-021–D-029). Turned the single-store product multi-tenant, built to the Shopify App Store's requirements — fixture-proven, no new dependencies. A five-expert panel costed the path (D-021); the build then landed the full spine: a tenant-routing seam (D-022) with db-per-tenant isolation (each store its own PostgreSQL database, zero SQL change, D-023); the Shopify OAuth authorization-code install flow plus the three mandatory GDPR/compliance webhooks, with constant-time HMAC and a stateless signed state (D-024); per-tenant warehouse sync reusing the proven build path (D-025); signed-cookie merchant sessions so each store sees only its own data (D-026); a Shopify Billing subscription with a 14-day trial and an entitlement gate (D-028); and per-tenant receipt isolation so the Ask agent is fully tenant-scoped (D-029). A four-agent adversarial security review then hardened the OAuth/session/webhook surface — tenant scoping moved into the warehouse layer, the raw SQL API made operator-only, webhook destructive actions bound to the signed body, fail-closed auth (D-027). Verified by unit tests and real-HTTP smokes; the receipt-integrity gate stayed green.
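For readers unfamiliar with constant-time HMAC checks, here is a minimal stdlib sketch of verifying a webhook body. It assumes the signature arrives as a base64-encoded HMAC-SHA256 of the raw request body, keyed with the app secret; the function name and argument names are hypothetical and this is not the project's own handler.

```python
import base64
import hashlib
import hmac

def webhook_signature_ok(secret: bytes, raw_body: bytes, header_value: str) -> bool:
    """Compare the received signature with one computed over the raw body."""
    digest = hmac.new(secret, raw_body, hashlib.sha256).digest()
    expected = base64.b64encode(digest)
    # compare_digest takes the same time wherever the first mismatch occurs,
    # so response timing does not leak how much of the signature was correct.
    return hmac.compare_digest(expected, header_value.encode("utf-8"))

secret = b"example-secret"
body = b'{"shop_id": 1}'
signature = base64.b64encode(hmac.new(secret, body, hashlib.sha256).digest()).decode()

print(webhook_signature_ok(secret, body, signature))         # True
print(webhook_signature_ok(secret, body + b" ", signature))  # False
```

Verifying against the raw bytes, before any JSON parsing, is what lets a destructive action be bound to exactly the body that was signed.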
2026-06-14 — Ask agent: facts-first answers + a dormant direct-API transport (perf, team design-debate). A live model-path question took ~43s; a measured breakdown showed ~98% was two sequential LLM round-trips (tool-selection 12s + interpretation 30.6s) while the actual warehouse/receipt/render work was 0.7s. Two changes, run through the full team cycle (design → Critic → implement → Frontend → independent verify); the Critic surfaced 5 must-fix items including a cross-request honesty hole in the Explain step (proven numerically safe by post-check scoping, but fixed anyway by deriving the question from the receipt, never the client form). (1) Facts-first staging: ask() split into ask_facts (selection+execute → renders RECEIPTED FACTS immediately) and interpret(receipt_ids) (model prose on demand, still under the full two-strike post-check). The admin Ask page shows the verified number first with an "Explain" button carrying only receipt_ids; refusals and ?nomodel=1 show no button; an honest new placeholder distinguishes "not generated yet" from the deterministic "no model ran". (2) Direct Anthropic API transport in model._complete via stdlib urllib (no new dependency), gated on ANTHROPIC_API_KEY with the subscription path as automatic fallback; non-200/parse/empty responses all map to ModelError with no key leakage. Verified on the subscription path: facts render first, Explain yields a PASS-postchecked interpretation, a bad receipt id is an inline error (not a 500), other pages + nomodel unbroken, canonical 11/11, replay MATCH, placeholder green. The API path is implemented but dormant pending Anthropic API credits (valid key, zero balance); once funded it is expected to cut a question from ~43s to single-digit seconds.
2026-06-13 — Added the active_customers governed metric (D-007), via the team design-debate. A live merchant question — *"how many customers this month?"* — hit the Ask agent's honesty guardrail: no governed definition existed, so it refused with no_definition rather than invent a number (the product working as designed). Closed the gap via the charter process: Data Architect proposed → Evidence Engineer critiqued (3 must-fix honesty issues: an orphaned guest-checkout caveat, an inaccurate LTV-distinction claim, an unlabeled live example) → Lead arbitrated (kept comparable, dropped segmentable as out-of-scope) → AI Engineer implemented → Evidence Engineer independently verified. The metric is a scalar COUNT(DISTINCT customer_key) over gold.fact_sales (date_range + country filters; not additive; guest checkouts excluded by design — a count of *identified* customers). Served by the existing run_metric scalar path with no tool-layer change (enums derive from the registry); the guest-exclusion caveat ships in every receipt (_STATIC_CAVEATS), not just the doc. Verified live: the original NL question now answers 2 (model path, claude-haiku-4-5) with a receipt that replays MATCH; deterministic path also 2/MATCH; direct warehouse cross-check 2 (5 of 9 lines are guest, excluded); canonical 11/11; placeholder check green.
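The "not additive" property of a distinct count is easy to miss, so here is a small sketch with hypothetical sales lines (the data and the function name are invented for illustration; the governed metric is the SQL COUNT(DISTINCT customer_key) described above). A guest checkout is represented by a missing customer key and is excluded.

```python
# Hypothetical sales lines: (month, customer_key); None marks a guest checkout.
lines = [
    ("2026-05", "c1"),
    ("2026-05", "c2"),
    ("2026-06", "c1"),
    ("2026-06", None),
    ("2026-06", "c3"),
]

def active_customers(lines, months=None):
    """Count distinct identified customers, optionally limited to some months."""
    return len({
        customer
        for month, customer in lines
        if customer is not None and (months is None or month in months)
    })

may = active_customers(lines, {"2026-05"})
june = active_customers(lines, {"2026-06"})
both = active_customers(lines)
print(may, june, both)  # 2 2 3
```

Adding the two monthly figures would give 4, but only 3 distinct customers bought across both months, because c1 appears in each.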
2026-06-13 — Shopify connector went LIVE against a real store (W5-2, D-006). Added the --source shopify-live build target and the live auth path: the Dev Dashboard's client_credentials grant (legacy static shpat_ custom apps are retired on new stores). ShopifyClient.from_env now mints a 24h Admin API token from SHOPIFY_CLIENT_ID + SHOPIFY_CLIENT_SECRET (static SHOPIFY_ACCESS_TOKEN still wins if set); load.load_bronze took a node_iter factory so the live path reuses the fixture path's bronze DDL/INSERT and the identical shopify_silver.sql unchanged. Ran live against pfvew0-0x.myshopify.com (RSD): full pipeline built clean, 11/11 canonical checks, receipted Ask total_revenue = 32,250 replayed MATCH (build 8a9628f2). Found & fixed a real defect live: the store's products had no SKUs, so keying products on SKU alone left every line unattributed (v_product_performance empty; NULL product_number failing the canonical PK check). The mapping now keys products on SKU when set, else the variant GID (always present); fixtures carry SKUs, so the fixture path stayed byte-identical (revenue 210,417; 11/11 — regression-verified). Honest gaps (live store was tiny — 5 products, 1 guest order, 0 customers): multi-page pagination, customer/LTV/cohort/repeat analytics, and the incremental cursor remain fixture-proven only. Live warehouse left as the repo's current build.
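The product-key fix in this entry amounts to a simple fallback, sketched here in Python. The actual mapping lives in the connector's SQL; the dictionary field names `sku` and `id` and the function name are assumptions for the example.

```python
def product_key(variant: dict) -> str:
    """Key a product on its SKU when one is set, else on the variant GID."""
    sku = variant.get("sku")
    # An empty string and None both count as "no SKU".
    return sku if sku else variant["id"]

with_sku = {"id": "gid://shopify/ProductVariant/1", "sku": "MUG-01"}
without_sku = {"id": "gid://shopify/ProductVariant/2", "sku": ""}

print(product_key(with_sku))     # MUG-01
print(product_key(without_sku))  # gid://shopify/ProductVariant/2
```

Because fixtures that carry SKUs take the first branch, their keys are unaffected by the fallback.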
2026-06-13 — Shopify connector shipped, contract-first and fixture-proven (W5-1, D-006). Introduced the connector contract (app/connectors/contract.md): every source maps to bronze.<source>_* → canonical silver entities (silver.customers / silver.products / silver.order_lines, engine-agnostic) → the EXISTING gold star schema + 6 analytics views, unchanged. Refactored the demo as connector #1: the CRM+ERP integration that lived in the gold dim views moved down into a new 02b_canonical_demo.sql with byte-identical logic, and gold was re-pointed at the canonical entities. Equivalence proven — every audited number reproduced EXACTLY: total_revenue 29,356,250; 27,659 orders; 18,484 customers; 295 current products; repeat 62.86%/37.14% with 77.02% of revenue; LTV avg 1,588.10 / median 272.00 / p90 4,825.70 / max 13,294; top customer AW00012132 (France) 13,294; 33/33 demo quality checks; C8 gate 6/6. New source-agnostic 05_canonical_checks.sql + run_canonical_checks.py (PK uniqueness, sales_amount = quantity × price, referential integrity, gold non-degeneracy) — 11/11 on the demo build. Connector #2 — Shopify (GraphQL Admin API 2025-10, incremental sync by updated_at cursor, Relay pagination): finished client.py (injectable transport; from_env raises a clear error when credentials are absent; 14/14 unit tests over parse/pagination/edge-cases), version-dated synthetic fixtures with guest-checkout, partial/full refund, deleted-product, and incremental-re-sync edge cases, load.py → bronze.shopify_* (raw JSONB), and shopify_silver.sql → canonical entities (dedup by latest updatedAt; refunds NOT netted). Build integration: build.py --source demo|shopify-fixture (default demo, unchanged + identical counts); meta.build_info now records the source. 
--source shopify-fixture observed: full pipeline to all 6 gold views; bronze 41/25/121 as-received → silver 40/25/317 after dedup; gross revenue 210,417 EUR; guest checkout → 15 NULL-customer lines, deleted product → 1 NULL-product line, both keeping revenue; 11/11 canonical checks; a receipted ask (plum_ask.py --no-model "total_revenue") returned 210,417 and replayed MATCH, proving the D-003 receipt machinery works unchanged over Shopify-shaped data. Demo source rebuilt afterward — repo default state is the demo warehouse. Liveness (verbatim, D-006): tested against API-shaped fixtures; not yet run against a live store. The live GraphQL path (SHOPIFY_SHOP + SHOPIFY_ACCESS_TOKEN) is config-only and unexercised — no live call has ever been made, and there is no --source shopify-live target. No new dependencies (stdlib + psycopg; urllib for the unexercised live path). check_no_placeholders.py passes.
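The "dedup by latest updatedAt" step in this entry can be pictured with the sketch below. The real logic is in shopify_silver.sql; the node shape and function name here are hypothetical, and the string comparison assumes every timestamp uses the same ISO-8601 UTC format.

```python
def dedup_latest(nodes):
    """Keep one node per id: the one with the greatest updatedAt."""
    latest = {}
    for node in nodes:
        kept = latest.get(node["id"])
        # Same-format ISO-8601 UTC strings sort chronologically as text.
        if kept is None or node["updatedAt"] > kept["updatedAt"]:
            latest[node["id"]] = node
    return list(latest.values())

# Hypothetical bronze rows: customer c1 was received twice after a re-sync.
bronze = [
    {"id": "c1", "updatedAt": "2026-06-01T10:00:00Z", "email": "old@example.com"},
    {"id": "c2", "updatedAt": "2026-06-02T09:00:00Z", "email": "b@example.com"},
    {"id": "c1", "updatedAt": "2026-06-03T08:00:00Z", "email": "new@example.com"},
]

silver = dedup_latest(bronze)
print(len(bronze), "->", len(silver))  # 3 -> 2
```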

Week 4

2026-06-13 — Public marketing site shipped (W4-2): five static pages, evidence-first, every claim paired with its reproduction command. New site/ — plain HTML + one handwritten CSS file (ink/paper with a ledger-green accent), zero JS, no CDN, no framework, relative links only (renders from file://). Pages: index (one-liner, three honest sections, and a demo block embedding the VERBATIM output of a real model run of plum_ask.py "What was total revenue?" made today — total_revenue 29,356,250 receipted as rcp_c60c017f43c346328383448a66088735, post-check PASS, replay verdict MATCH — labeled "real output, demo dataset"); product (Ask flow, receipts + replay verdicts table, the real governed funnel refusal quoted from a live run, the five existing admin pages, honest absences and not-yet-supported metrics, architecture section); evidence (9 claim→proof rows, each with the exact command, what to observe, and the freshly observed result: ask→replay MATCH, doctored-receipt FAIL, C8 gate 6/6 re-run today, quality checks 33/33 re-run today, fingerprint 3e02a4b72bbd66a222d0ff9dc808e7fb reproducibility, cross-engine equivalence, 10/10 independent validation, governed refusal re-run today, smoke_admin all-green re-run today; plus a "what we don't claim" section and dataset provenance: public training dataset, 2010–2014 bike sales, not real merchants); changelog (GENERATED from this file by new stdlib site/build_changelog.py — one source of truth, IA rule 1); about (solo builder + AI agent team stated plainly, the category teardown story in three sentences without naming names, contact mailto). No pricing page (pricing undecided — nav carries an early-access mailto CTA instead); zero invented traction anywhere (no merchant counts, testimonials, logos, or lift percentages — the footer on every page states pre-launch status). 
Observed: new site/check_site.py — 201/201 checks PASS (all files exist, every internal link relative and resolving, nav + footer + pre-launch statement on all five pages, no placeholder patterns in any site file, index demo block contains the receipted 29,356,250, evidence has ≥5 claim rows citing the real commands, changelog page generated with Week 1 + Week 4 entries); check_no_placeholders.py over app/ passes. No new dependencies (stdlib only).
2026-06-13 — Thin admin UI shipped (W4-1, D-005): five server-rendered pages on the existing stdlib server, zero new dependencies. New app/server/admin_ui.py (+ app/server/static/admin.css) — semantic HTML rendered in Python, no framework, no Node, no CDN tags, no client-side JS; http_api.py routes /admin/* and /static/* to the new handle_ui seam (form-encoded POST bodies on those paths), all JSON routes untouched. Pages, every number queried live at render time with the canonical SQL from app/docs/metrics.md, each KPI footnoted with its governed metric id: /admin Overview (total_revenue 29,356,250; orders 27,659; aov 1,061.36; repeat_rate split 62.86%/37.14% with 77.02% of revenue; customer_ltv_summary avg/median/p90/max; build-identity panel from meta.build_info + dataset window); /admin/revenue (v_revenue_monthly table + dependency-free server-side inline SVG bar chart, 38 bars from the same rows, partial-month caveat quoted); /admin/cohorts (v_cohort_retention as a 38×38 shaded matrix, sparse cells honestly labeled "zero, not missing", governed cohort definition quoted); /admin/ask (form POST → model.ask, blocking, with a visible "calls the model" note; renders the EXACT renderer output — RECEIPTED FACTS code-rendered with receipt ids + hashes, INTERPRETATION visibly tagged "model prose — not receipted", governed REFUSAL tagged distinctly; per-receipt model-free replay command; post-check status line; ?nomodel=1 deterministic toggle via ask_without_model; recent questions from the audit log, new admin.ask/admin.page audit actions); /admin/audit (last-N table with ?action= prefix filter, refusals/rejections highlighted red, build_id column). Honest absences: Funnel/Experiments/Insights get NO pages or mock UIs — the nav carries one greyed non-link "Funnel — requires event data (see docs/metrics.md §Not yet supported)"; every page footer: "Every number on this page traces to a governed metric. Receipts replayable via plum_replay." 
Observed: new app/scripts/smoke_admin.py — 31/31 deterministic checks PASS (all five pages 200 with real values incl. 29,356,250, 2013-07 = 1,371,595, cohort 5.23%, live fingerprint 3e02a4b72bbd66a222d0ff9dc808e7fb; nomodel Ask returned RECEIPTED FACTS + receipt id; JSON /health and /views unbroken); --with-model run: real Ask "What was total revenue?" over the HTTP form path → rcp_7c61617284374555a3f03ef7add35c75, post-check PASS (attempts 1, regenerated no, fallback no), replay verdict MATCH. C8 gate re-run after the change: all 6 suites PASS; placeholder check passes. No new dependencies (stdlib only).
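A dependency-free, server-side SVG bar chart of the kind this entry describes can be quite small. The sketch below is not admin_ui.py; the function name, dimensions, and markup are assumptions, and only the two monthly figures (2013-06 and 2013-07) come from this changelog.

```python
from xml.sax.saxutils import escape

def bar_chart_svg(rows, width=400, height=120, gap=2):
    """Render (label, value) rows as an inline SVG string, one <rect> per row."""
    head = (f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'viewBox="0 0 {width} {height}" role="img">')
    if not rows:
        return head + "</svg>"
    peak = max(value for _, value in rows) or 1  # avoid dividing by zero
    bar_w = (width - gap * (len(rows) - 1)) / len(rows)
    parts = [head]
    for i, (label, value) in enumerate(rows):
        bar_h = height * value / peak
        x = i * (bar_w + gap)
        parts.append(
            f'<rect x="{x:.1f}" y="{height - bar_h:.1f}" '
            f'width="{bar_w:.1f}" height="{bar_h:.1f}">'
            f'<title>{escape(label)}: {value:,}</title></rect>'
        )
    parts.append("</svg>")
    return "".join(parts)

svg = bar_chart_svg([("2013-06", 1642948), ("2013-07", 1371595)])
print(svg)
```

Because the bars are drawn from the same rows as the table, the chart cannot drift from the numbers shown beside it.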

Week 2

2026-06-13 — Model layer wired via Claude Agent SDK (W2-3, D-004); five real end-to-end transcripts, all numbers receipted, 6/6 gate still green. New app/agent/model.py — the ONLY module importing claude-agent-sdk (new dependency, 0.2.100, flagged per D-004; rides the local Claude Code subscription, no API key; model default haiku, override PLUM_MODEL/--model). The model does exactly two things: proposes tool calls as JSON (executed solely through tools.ToolSession.call — closed schemas, registry SQL, receipts, audit; rejections go back once, then a deterministic governed refuse) and writes INTERPRETATION from ONLY the code-rendered RECEIPTED FACTS text, enforced by postcheck.enforce (two strikes → deterministic fallback). New CLI app/scripts/plum_ask.py (--no-model, --json, --model); audit extended with model.proposal, model.proposal_invalid, model.repair_round, model.refuse_fallback, model.interpretation (actor "model"). Observed, real model runs (claude-haiku-4-5-20251001): "What was total revenue?" → total_revenue = 29,356,250 receipted (rcp_68b2cff53e4c4a92a93f1d8bff155ba2), post-check PASS, replay MATCH; "Compare June and July 2013 revenue" → 1,642,948 vs 1,371,595, derived delta −271,353 / −16.52% receipted, post-check PASS, MATCH; "Which countries drive revenue?" → country breakdown top-10 (US 9,162,327 … segment_sum 29,356,250) receipted, PASS, MATCH; "Show me the conversion funnel" → governed not_supported refusal quoting metrics.md, after an audited live repair round (first refuse proposal lacked metric_id, rejected, model corrected); adversarial "Why did revenue drop in July 2013?" → model chose compare_periods + July-by-category breakdown, prose stayed inside receipted numbers and stated "the data shows what changed but not why", PASS, both MATCH. 
Post-check caught a real strike: an earlier run's first draft quoted caveat-only numbers (19; 4,992) → answer.postcheck_violation → regenerated draft clean (answer.regenerated); that run's date-filtered total 29,348,616 was itself fully receipted and replays MATCH — no number ever shipped unreceipted. render.NO_MODEL_NOTE re-worded (model wiring is no longer "the next task"). C8 gate re-run after the change: all 6 suites PASS; placeholder check passes.
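The two-strike rule can be sketched as below. This is a heavy simplification of postcheck.enforce (the real post-check applies several rules, not just a number match); the regular expression, function signature, draft texts, and fallback text are all assumptions for illustration. The numbers echo the strike described in this entry.

```python
import re

# Matches integers and decimals, with optional thousands separators.
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")

def enforce(drafts, allowed, fallback, strikes=2):
    """Accept the first draft whose numbers are all receipted, else fall back."""
    for attempt, draft in zip(range(1, strikes + 1), drafts):
        unreceipted = set(NUMBER.findall(draft)) - allowed
        if not unreceipted:
            return draft, f"PASS on attempt {attempt}"
    return fallback, "deterministic fallback"

allowed = {"29,348,616"}  # numbers present in the receipted facts
drafts = [
    "The total was 29,348,616; see also 19 and 4,992.",  # quotes unreceipted numbers
    "The total was 29,348,616.",                         # regenerated, clean
]

print(enforce(drafts, allowed, fallback="FACTS ONLY"))
```

If the second draft also failed, the caller would receive the fallback text, so prose with an unreceipted number is never returned.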
2026-06-13 — Defect C8-1 resolved: replay grades ERROR on malformed receipts; C8 gate re-run PASS. AI Engineer remediation of the MAJOR filed by the Evidence Engineer (W2-2). app/agent/replay.py: _grade is now a never-raises wrapper around _grade_strict — structurally malformed receipts (missing keys, wrong types, non-dict top level) grade ERROR with reason "receipt malformed — refusing to grade (<exc>)" instead of crashing out of replay_file/replay_all; execution-field reads moved out of the re-execution try so a missing execution block reads as malformed, not "re-execution failed". Two tests added to app/tests/test_c8_fault_injection.py: 4 malformed shapes all grade ERROR naming malformation, and replay_all survives a corrupt store member (corrupt file ERROR, every other receipt still graded — observed MATCH). Observed: run_c8_gate.py re-run all 6 suites PASS (fault-injection 9 tests, was 7); ask_demo.py regression PASS (3x MATCH, doctored copy FAIL, refusal audited); placeholder check passes. No open BLOCKER/MAJOR remains (C8-2..C8-9 MINOR/LOW, backlog) — gate verdict after re-run: PASS (resolution + re-run log appended to app/docs/evidence.md).
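The shape of the C8-1 fix, a wrapper that never raises around a strict grader, can be sketched like this. The receipt fields, the shortened reason text, and the function names are hypothetical stand-ins for _grade and _grade_strict in replay.py.

```python
def grade_strict(receipt):
    """Grade a receipt; raises KeyError or TypeError on a malformed one."""
    stored = receipt["execution"]["result_hash"]
    replayed = receipt["replayed_hash"]  # hypothetical field for this sketch
    return {"verdict": "MATCH" if stored == replayed else "FAIL", "reason": ""}

def grade(receipt):
    """Never-raises wrapper: structural problems grade ERROR instead of crashing."""
    try:
        return grade_strict(receipt)
    except (KeyError, TypeError, AttributeError) as exc:
        return {"verdict": "ERROR", "reason": f"receipt malformed ({exc!r})"}

def grade_all(receipts):
    """One corrupt member no longer stops the rest from being graded."""
    return [grade(r) for r in receipts]

store = [
    {"execution": {"result_hash": "abc"}, "replayed_hash": "abc"},
    {"replayed_hash": "abc"},  # missing execution block
    ["not", "a", "dict"],      # wrong top-level type
]
print([g["verdict"] for g in grade_all(store)])  # ['MATCH', 'ERROR', 'ERROR']
```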
2026-06-13 — C8 verification gate built + Amendment C1a implemented (W2-2); gate verdict FAIL on one open MAJOR. Evidence Engineer deliverables: replay.py verdict filter re-based on the content fingerprint per Amendment C1a (fingerprint differs → STALE; fingerprint match + re-execution hash match → MATCH with rebuilt_since: true when build_id differs; fingerprint match + hash mismatch → FAIL; tamper/internal/derived failures FAIL on any build). New gate runner app/scripts/run_c8_gate.py + four suites in app/tests/ (scratch receipt store, live warehouse): test_c8_interception.py (receipt sql/rows asserted as the SAME OBJECTS run_query executed — is, not == — across the full registry matrix: 9 metrics + filtered variants, 4 compare templates + filtered, 3×4 segment templates ± date_range), test_c8_fault_injection.py (five doctoring layers, each with its exact predicted failure signature: rows-only → internal_result_hash+receipt_hash; rows+result_hash → receipt_hash+reexecution_hash; fully coherent → reexecution_hash only; doctored derived under a coherent hash → derived only; deleted caveat → receipt_hash only; plus C9 collision-is-error), test_c8_renderer.py (stored = shown under in-memory mutation; per-turn binding: ToolSession.receipt_ids is the renderer's and post-check's only source — a true number from another turn is a violation), test_c1a_verdicts.py (all three C1a branches incl. a REAL build.py rebuild mid-suite: build_id flipped, fingerprint identical 3e02a4b72bbd66a222d0ff9dc808e7fb, pre-rebuild receipt → MATCH + rebuilt_since, NOT STALE; coherent lie after rebuild → FAIL, no longer hidden by STALE). Observed: all 6 suites PASS — c8_interception 17.1s, c8_fault_injection 3.2s, c8_renderer 2.2s, c1a_verdicts 6.4s, quality_checks 33/33 1.1s, no_placeholders 0.2s; 22 tests green on first full run. 
Gate verdict nonetheless FAIL: adversarial pass beyond the prescribed tests filed 9 defects (C8-1…C8-9 in app/docs/evidence.md), and C8-1 is MAJOR — replay crashes (KeyError) instead of grading ERROR on structurally malformed receipts, so one corrupt file kills plum_replay --all. Also filed: canonical num_str crash >28-digit ints / silent rounding of >28-digit decimals / negative-zero asymmetry (C8-2..4, MINOR), coherent-edit residual risk — replay does not cross-check the audit-mirrored receipt_hash (C8-5, MINOR), missing filename↔receipt_id binding (C8-6, MINOR), lowercase "1.5m" magnitude bypass in the post-check (C8-7, MINOR), identical/overlapping compare periods accepted (C8-8, LOW), renderer binding is call-site discipline (C8-9, LOW). Defects filed, not fixed (C8 discipline: remediation is the AI Engineer's); gate re-runs after C8-1 lands. No new dependencies (stdlib unittest + psycopg).
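The three Amendment C1a branches listed in this entry reduce to a small decision function. The sketch below is illustrative only: the dictionary keys and the build labels are assumptions, and the tamper, internal and derived checks that the entry says FAIL on any build are left out.

```python
def c1a_verdict(receipt, current_build, reexecution_hash):
    """Grade a stored receipt against the current warehouse build (C1a branches only)."""
    # Branch 1: the content fingerprint differs, so the data itself changed.
    if receipt["fingerprint"] != current_build["fingerprint"]:
        return {"verdict": "STALE"}
    # Branch 2: same content, but re-running the query gives a different hash.
    if reexecution_hash != receipt["result_hash"]:
        return {"verdict": "FAIL", "failed": ["reexecution_hash"]}
    # Branch 3: same content, same result; note whether a rebuild happened in between.
    rebuilt = receipt["build_id"] != current_build["build_id"]
    return {"verdict": "MATCH", "rebuilt_since": rebuilt}


# Hypothetical receipt written under build "build-A".
receipt = {"fingerprint": "fp-1", "build_id": "build-A", "result_hash": "h-1"}

# A rebuild that leaves the content untouched keeps the receipt valid.
after_rebuild = {"fingerprint": "fp-1", "build_id": "build-B"}
print(c1a_verdict(receipt, after_rebuild, "h-1"))
```

Keying STALE on the fingerprint rather than on build_id is what lets a coherent lie surface as FAIL after a rebuild instead of being hidden behind STALE.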
2026-06-13 — Ask-agent tool layer shipped (W2-1, D-003 C1–C7 + C9). app/agent/: deterministic machinery only — model wiring is the next task. canonical.py + registry.py (inherited from the interrupted run) verified against the live warehouse and kept unchanged: canonical-JSON idempotence across dump/load, ROUND_HALF_UP pinning, double-execution hash stability on customer_ltv top-10 confirmed (C2). New: receipts.py (receipts built from the same objects passed to/returned by run_query; build identity from meta.build_info — C1; full-receipt receipt_hash + per-turn binding token — C7; uuid receipt ids persisted to app/data/receipts/, collision = error — C9), tools.py (6 tools + explicit refuse; closed stdlib schema validation, no $ref, top_n ≤ 50 everywhere; EVERY call and EVERY rejection audited with params/result_hash/definition_version/build_id — C6), postcheck.py (critique rules R1–R7: magnitude-marker + number-word bans, presence-scoped allowed set incl. top_n and date components, 0/1/2-dec ROUND_HALF_UP roundings, abs-matching, NULL-safe, two-strike enforce hook — C3/C4), render.py (structural RECEIPTED FACTS / INTERPRETATION contract, facts code-built from the stored receipts, every interpretation line prefixed — C5), replay.py + CLI scripts/plum_replay.py (model-free re-execution; MATCH/STALE/FAIL/ERROR per C1, verdicts audited). 
Observed (scripts/ask_demo.py, live Postgres): total_revenue 29,356,250; compare 2013-06 → 2013-07: 1,642,948 → 1,371,595, derived delta −271,353 / pct_change −16.52; top-5 country breakdown (US 9,162,327 … France 2,643,751, segment_sum 27,151,692); replay of all three receipts 3× MATCH; doctored copy (rows + both hashes coherently recomputed) → FAIL on reexecution_hash; refuse(session_funnel) quoted the metrics.md "Not yet supported" text verbatim and audit-logged tool.refuse with build_id + turn_id; 6 rejection classes (schema, registry filter, unknown value 'Narnia', out-of-window, top_n 51, missing export id) each audited as tool.rejected. Post-check: faithful prose passed; "1.37M" / "twenty" / foreign number flagged (R1, R2/R3). Warehouse rebuilt (build f7572a6d → b2aa969c, fingerprint unchanged 3e02a4b72bbd66a222d0ff9dc808e7fb) → all 6 stored receipts re-replay STALE with "fingerprint and numbers unchanged" detail. No new dependencies (stdlib + psycopg); placeholder check passes. C8 (named Evidence-harness tests) is the milestone gate owned by the Evidence Engineer — fault-injection shape demonstrated in the demo, interception + renderer tests pending.
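The derived compare figures in this entry (delta −271,353, pct_change −16.52) can be reproduced with the standard decimal module and ROUND_HALF_UP. This is a sketch of the arithmetic only: the function name and the zero-baseline handling are assumptions, not the contents of canonical.py.

```python
from decimal import Decimal, ROUND_HALF_UP


def compare(before, after, places="0.01"):
    """Return (delta, pct_change) with pct_change rounded half-up to two decimals."""
    before, after = Decimal(before), Decimal(after)
    delta = after - before
    if before == 0:
        return delta, None  # assumption: no percentage is reported against a zero baseline
    # ROUND_HALF_UP sends ties away from zero, unlike the float-based round().
    pct = (delta / before * 100).quantize(Decimal(places), rounding=ROUND_HALF_UP)
    return delta, pct


# 2013-06 revenue against 2013-07 revenue, as reported in the entry above.
delta, pct_change = compare(1_642_948, 1_371_595)
print(delta, pct_change)
```

Pinning the rounding mode matters because a receipt hash is only stable if the same input always rounds to the same digits.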

Week 1

2026-06-12 — Service layer ported DuckDB → PostgreSQL (D-002 complete); build identity on /health (D-003 C1). Backend port of the last D-002 remainder: server/warehouse.py now runs on psycopg 3.3.4 — duckdb import gone from the server modules. Read-only is double-enforced: every service connection opens with default_transaction_read_only=on + statement_timeout=10000 as libpq startup options (fresh autocommit connection per call), and the unchanged SQL gate (single statement, comment stripping, SELECT/WITH-only, forbidden keywords incl. SET/RESET, 1,000-row cap) keeps it unflippable from SQL. Server-side statement_timeout replaces the DuckDB interrupt watchdog — observed killing the 60,398² cross-join count at 1.08 s on a 1 s budget and 10.12 s on the default 10 s budget (QueryCanceled → QueryTimeout). Params now bind via psycopg %s placeholders through the unchanged run_query(sql, params) signature (D-003 C1); list_views()/get_view_sample() resolve against information_schema whitelisted to the gold schema (9 views); NEW get_build_info() returns the latest meta.build_info row, exposed on GET /health (build_id f7572a6d-…, fingerprint 3e02a4b72bbd66a222d0ff9dc808e7fb). audit.py untouched; http_api.py only gained the health build block. smoke_server.py ported + extended: 21/21 checks PASS — all 13 previous (v_repeat_rate still one_time 11,619 / 62.86% / 22.98% of revenue vs repeat 6,865 / 37.14% / 77.02%; DROP, SELECT 1; SELECT 2, comment-wrapped DELETE all 400 + audited) plus: health returns the real build_id matching meta.build_info; %s param round-trip; SET default_transaction_read_only=off and an INSERT both refused 400 by the gate and logged as query.refused with the exact SQL; belt-and-braces direct-driver INSERT on the service connection refused by Postgres itself ("cannot execute INSERT in a read-only transaction"); statement-timeout kill verified in-suite. No new dependencies (stdlib + psycopg already in place per D-002); placeholder check passes. 
Docs: service-layer section added to docs/warehouse.md.
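The SQL gate half of the double enforcement can be sketched in plain Python. This is an illustrative approximation, not server/warehouse.py: the forbidden-keyword list, the exception name and the helpers are assumptions. The sketch is also deliberately naive about string literals, so a semicolon or keyword inside a literal is refused rather than parsed. The connection-level half (default_transaction_read_only=on plus statement_timeout) is not shown.

```python
import re
from itertools import islice

# Assumed keyword list; the entry only confirms that SET and RESET are on it.
FORBIDDEN = {"INSERT", "UPDATE", "DELETE", "DROP", "ALTER", "CREATE",
             "TRUNCATE", "GRANT", "COPY", "SET", "RESET"}
ROW_CAP = 1000


class QueryRefused(ValueError):
    """Raised when a statement fails the gate."""


def strip_comments(sql):
    sql = re.sub(r"/\*.*?\*/", " ", sql, flags=re.S)  # block comments
    return re.sub(r"--[^\n]*", " ", sql)               # line comments


def gate(sql):
    """Return the cleaned statement, or raise QueryRefused."""
    cleaned = strip_comments(sql).strip()
    if cleaned.endswith(";"):            # a single trailing semicolon is tolerated
        cleaned = cleaned[:-1].rstrip()
    if ";" in cleaned:
        raise QueryRefused("multiple statements")
    words = re.findall(r"[A-Za-z_]+", cleaned.upper())
    if not words or words[0] not in ("SELECT", "WITH"):
        raise QueryRefused("only SELECT/WITH statements are allowed")
    bad = FORBIDDEN.intersection(words)
    if bad:
        raise QueryRefused(f"forbidden keyword: {sorted(bad)[0]}")
    return cleaned


def cap_rows(rows):
    """Apply the row cap to any iterable of result rows."""
    return list(islice(rows, ROW_CAP))
```

Because identifiers are matched as whole words, a column such as reset_count passes while a bare RESET does not.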
2026-06-12 — Warehouse ported DuckDB → PostgreSQL (D-002), cross-engine equivalence proven. Data Architect port to local PostgreSQL 18.4 (database/role goolify, DSN via PLUM_PG_DSN, driver psycopg 3.3.4). SQL files ported in place: 01_bronze.sql is now schemas + DDL only (CSV loading moved to build.py as client-side COPY ... FROM STDIN via psycopg cursor.copy() — no server-visible paths needed); 02_silver.sql (try_strptime → new meta.try_yyyymmdd() NULL-on-invalid plpgsql function, INTERVAL '100 years', explicit ROUND(numeric)::int price derivation); 03_gold.sql (date_diff('month',…) → calendar-month boundary arithmetic, MEDIAN/quantile_cont → percentile_cont(…) WITHIN GROUP, ::numeric division for AOV where Postgres INT/INT truncates); 04_quality_checks.sql (one derived-table alias). build.py rewritten: single-transaction rebuild (DROP SCHEMA … CASCADE → SQL files → COPY loads), per-layer counts + timing, and NEW meta.build_info build identity (build_id uuid, built_at UTC, per-CSV source row counts, per-relation row counts, total revenue, deterministic md5 content fingerprint — the T8 receipts identity, replacing file mtime). run_quality_checks.py ported (read-only psycopg session). Observed equivalence vs the audited DuckDB baseline — every number exact: bronze 18,494 / 397 / 60,398 / 18,484 / 18,484 / 37; silver identical incl. 18,484 deduped customers and 37 birthdate NULLs (16 future + 21 age>100, DQ-1 rule ported); gold dim_customers 18,484, dim_products 295, fact_sales 60,398 with 0 NULL dimension keys; views 38 / 430 / 18,482 / 130 / 2 rows (repeat rate 62.86% / 37.14%, repeat revenue share 77.02%); total revenue 29,356,250 over 27,659 distinct orders; LTV avg 1,588.10 / median 272.00 / p90 4,825.70 / max 13,294; date-filtered revenue 29,351,258 (delta 4,992 = the 19 NULL-date lines); spot checks (2013-07 = 1,371,595 / 1,875 / AOV 731.52; cohort 2013-01: 325 / 17 / 5.23%; Mountain-200 Black- 46 = 1,373,454) all identical. 
Quality suite on Postgres: 33/33 PASS. Two consecutive builds verified re-runnable with identical content fingerprint 3e02a4b72bbd66a222d0ff9dc808e7fb (distinct build_ids). goolify.duckdb retained on disk untouched as the frozen audited baseline. Docs updated: docs/warehouse.md (engine, rebuild, build_info), docs/metrics.md (engine ground rule; 0 canonical SQL blocks changed — 3 formula descriptions updated to the Postgres function equivalents). Still on DuckDB pending their own port (D-002 scope, not this task): server/warehouse.py read-only service and scripts/smoke_server.py (they keep working against the frozen baseline file).
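The "calendar-month boundary arithmetic" that replaces date_diff('month', …) counts how many month boundaries lie between two dates, regardless of the day of month. The ported SQL is not reproduced here; the Python below is a hedged illustration of the same idea, with a hypothetical function name.

```python
from datetime import date


def months_between(start, end):
    """Number of calendar-month boundaries crossed going from start to end."""
    return (end.year - start.year) * 12 + (end.month - start.month)


# Jan 31 to Feb 1 crosses one boundary even though only one day has passed.
print(months_between(date(2013, 1, 31), date(2013, 2, 1)))
# Two dates inside the same month cross none.
print(months_between(date(2013, 1, 1), date(2013, 1, 31)))
```

This boundary-counting behaviour is what keeps cohort month offsets identical across the two engines.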
2026-06-12 — Defect DQ-1 resolved: implausible birthdates cleansed. Data Architect fix for the defect filed by the Evidence Engineer (T4/T6). 02_silver.sql now nulls erp_cust_az12 birthdates implying age > 100 at load (bdate < current_date - INTERVAL 100 YEAR), symmetric with the existing future-date rule. Observed effect after rebuild: 21 bronze rows nulled (the 15 flagged pre-1924 births, 1916–1923, plus 6 born 1924-01-01 to 1926-06-11 caught by the age boundary); silver min birthdate now 1926-09-07; total silver birthdate NULLs 37 (16 future + 21 old, 0 in bronze). Quality suite re-run: 33/33 PASS (was 32/33). No gold metric affected — birthdate feeds no analytics view; all row counts unchanged (silver.erp_cust_az12 still 18,484). Documented in app/docs/warehouse.md and resolution note appended to DQ-1 in app/docs/evidence.md.
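The two symmetric birthdate rules (future dates and implied age over 100) can be expressed as one small function. This is a sketch in Python rather than the SQL in 02_silver.sql; the function names are hypothetical, and the example assumes the rebuild ran on the entry date, 2026-06-12.

```python
from datetime import date


def years_ago(today, years):
    """Same calendar day `years` earlier, clamping Feb 29 to Feb 28 when needed."""
    try:
        return today.replace(year=today.year - years)
    except ValueError:  # today is Feb 29 and the target year is not a leap year
        return today.replace(year=today.year - years, day=28)


def cleanse_birthdate(bdate, today, max_age_years=100):
    """Return the birthdate unchanged, or None when it is implausible."""
    if bdate is None:
        return None
    if bdate > today:                             # existing rule: future birthdate
        return None
    if bdate < years_ago(today, max_age_years):   # DQ-1 rule: implied age over 100
        return None
    return bdate


build_day = date(2026, 6, 12)  # assumed rebuild date
print(cleanse_birthdate(date(1926, 6, 11), build_day))  # one day past the cutoff
print(cleanse_birthdate(date(1926, 9, 7), build_day))   # the new silver minimum
```

Under that assumed date the cutoff is 1926-06-12, which is consistent with the latest nulled birthdate reported above (1926-06-11) and the new minimum of 1926-09-07.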
2026-06-12 — Quality checks + independent metric validation (T4 + T6). Evidence Engineer pass over the warehouse. T4: 33-check quality suite (app/warehouse/sql/04_quality_checks.sql, runner app/scripts/run_quality_checks.py) covering silver PK duplicates/NULLs, untrimmed strings, standardized-value audits, date validity/order, sales = quantity × price, gold surrogate-key uniqueness, and fact→dim referential integrity. Result: 32 PASS, 1 FAIL — the failure is defect DQ-1: 15 pre-1924 birthdates (1916–1923) pass through silver.erp_cust_az12 uncleansed and undocumented; filed to the Data Architect, check left red on purpose. The two documented source gaps (19 NULL order dates, 7 CO_PE products) are allow-listed at their documented magnitude. T6: every claimed gold metric independently recomputed from silver tables with separate SQL — total revenue 29,356,250; 27,659 distinct orders; dates 2010-12-29 → 2014-01-28; dim_customers 18,484; dim_products 295; fact_sales 60,398 (no join fan-out); repeat 37.14% of customers / 77.02% of revenue; LTV avg 1,588.10 / median 272 / max 13,294 over 18,482 customers (2 excluded, ids 27039 + 16322, all-NULL order dates) — 10/10 claims CONFIRMED, including full-table diffs of v_revenue_monthly (38/38 months) and v_cohort_retention (430/430 cells) plus raw-row traces (June 2013 revenue 1,642,948; cohort 2013-01 month 3 = 19 of 325; customer AW00012301 LTV 13,294 across 13 lines; Mountain-200 Black- 46 revenue 1,373,454 traced to bronze). Full audit trail with verbatim queries: app/docs/evidence.md.
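One way to read "allow-listed at their documented magnitude" is that a check passes only when its violation count equals the documented figure, so any drift in either direction turns it red. The sketch below encodes that reading; it is an assumption about the runner's behaviour, and the check names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str
    violations: int
    status: str


# Documented source gaps from the entry above, keyed by hypothetical check names.
ALLOW_LIST = {"null_order_dates": 19, "co_pe_products": 7}


def evaluate(name, violations, allow_list=ALLOW_LIST):
    """PASS when the violation count equals the allowed magnitude (0 if unlisted)."""
    allowed = allow_list.get(name, 0)
    status = "PASS" if violations == allowed else "FAIL"
    return CheckResult(name, violations, status)


results = [
    evaluate("null_order_dates", 19),          # documented gap, exact magnitude
    evaluate("co_pe_products", 7),             # documented gap, exact magnitude
    evaluate("implausible_birthdates", 15),    # DQ-1: not allow-listed, stays red
]
for r in results:
    print(r.name, r.violations, r.status)
```

Leaving the DQ-1 check off the allow-list is what keeps it red until the underlying data is actually fixed.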
2026-06-12 — Service layer skeleton + audit log shipped (T9). app/server/: warehouse.py (read-only query service — single-SELECT gate with comment stripping, forbidden-keyword + multi-statement rejection, 10 s interrupt watchdog verified killing a 60k³ cross join at 1.02 s, 1,000-row cap, list_views() / get_view_sample() resolved against the catalog so no raw identifier interpolation), audit.py (append-only JSONL at app/data/audit/audit.jsonl — ts/actor/action/detail + optional sql/result_summary + open kwargs so agent tool-calls and diffs slot in without migration), http_api.py (stdlib-only localhost JSON API: /health, /views, /views/{name}/sample, /audit, POST /query; all routing in a framework-agnostic handle_request() — the stdlib shell is the throwaway half pending the T8 framework decision). app/scripts/smoke_server.py: 13/13 checks green on a free port — real query returned both v_repeat_rate segments (one_time 11,619 / repeat 6,865 customers, 22.98% / 77.02% of revenue), DROP TABLE x, SELECT 1; SELECT 2, and a comment-wrapped DELETE all refused with 400 and logged as query.refused with the exact rejected SQL. No new dependencies (stdlib + existing duckdb 1.5.3); placeholder check passes.
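An append-only JSONL audit log with fixed core fields and open keyword arguments can be sketched with the standard library alone. The field names ts, actor, action, detail, sql and result_summary come from this entry; the function names, the timestamp format and the rule that core fields win over extras are assumptions.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_audit(path, actor, action, detail, sql=None, result_summary=None, **extra):
    """Append one JSON record per line; existing lines are never rewritten."""
    record = dict(extra)  # open kwargs first, so core fields cannot be overwritten
    record.update({
        "ts": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": action,
        "detail": detail,
    })
    if sql is not None:
        record["sql"] = sql
    if result_summary is not None:
        record["result_summary"] = result_summary
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:  # "a" mode: append only
        fh.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def read_audit(path):
    """Load every record back, in the order it was written."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Because extra keys are stored as-is, later record types such as tool calls carrying a build_id can be written without changing the file format.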
2026-06-12 — Governed metric definitions shipped (T7). app/docs/metrics.md — the semantic layer the Ask agent cites in every answer. 8 metric sections (total_revenue, orders, aov, revenue_monthly, cohort_retention, customer_ltv + customer_ltv_summary, product_performance, repeat_rate), each with canonical id, plain-language definition, exact formula, grain, source lineage (gold view → silver tables), tested canonical SQL, real caveats, and a query-verified example value (all examples computed against the built warehouse on 2026-06-12; e.g. revenue for 2013-07 = 1,371,595). Documents the two reconcilable revenue totals (29,356,250 all lines vs 29,351,258 date-filtered; delta = the 19 NULL-date lines' 4,992) and verified order-month additivity (every order has exactly one order date) and that the 7 NULL-category products have zero sales. Includes an honest "Not yet supported" section (session_funnel, promo_performance — connector-dependent) and a versioning rule: definitions change only via dated entries in spec/decisions.md.
2026-06-12 — Warehouse shipped (T3 bronze/silver + T5 gold). DuckDB medallion warehouse at app/warehouse/goolify.duckdb, built by python app/warehouse/build.py from app/warehouse/sql/01_bronze.sql, 02_silver.sql, 03_gold.sql (~2 s rebuild, verified re-runnable across two consecutive builds). Bronze: 6 raw tables (18,494 customers / 397 products / 60,398 sales lines / 18,484 + 18,484 ERP rows / 37 categories). Silver: same 6 cleansed (customers deduped to 18,484; 19 invalid order dates nulled; 23 sales amounts and 12 prices repaired; 200 broken product end dates recomputed via LEAD). Gold: star schema views — dim_customers 18,484, dim_products 295 (current versions), fact_sales 60,398 with zero NULL dimension keys — plus 6 analytics views (v_revenue_monthly 38 months, v_cohort_retention 430 cohort cells, v_customer_ltv 18,482 + v_customer_ltv_summary, v_product_performance 130, v_repeat_rate: 62.86% one-time / 37.14% repeat carrying 77.02% of revenue). Total observed revenue 29,356,250 across 27,659 orders, Dec 2010 – Jan 2014. Session funnel and promo views NOT built — dataset has no event/promo data (documented as connector-dependent in app/docs/warehouse.md).
2026-06-12 — Project scaffolded. app/ structure, Claude Code hook guardrails (.env read guard, action logging), no-placeholder CI check, and the demo store dataset seeded (6 CRM/ERP CSVs, 5.2 MB). Decision D-001 locked: DuckDB local-first, bootcamp dataset, name "Plum" (spec/decisions.md).