
fsds/prism-data-cleanup-and-journal-backfill-plan.md

Data Cleanup and Journal Backfill Plan

Overview

Fix duplicate EPC/basket and OBLPN/stack records, then backfill journal tables for pre-existing data.


Part 1: Duplicate Basket Cleanup (2,566 duplicate EPCs)

Findings

  • 2,566 EPCs are each assigned to 2 different Bakery Basket items
  • The "newer" baskets (created Dec 2) have MORE activity because the BBU hardening was deployed on Dec 2
  • Movement counts per basket are reasonable (max ~19, not 822 as aggregate suggested)
  • Strategy: Keep older (original) basket, migrate all relations, delete newer

Tables to Update (in order)

  1. spatial.movements - Update item_id from newer basket to older basket
  2. bbu.tag_read_processing - Update basket_id from newer to older
  3. bbu.camera_read_baskets - Update basket_item_id from newer to older
  4. bbu.movement_deduplication - Update basket_id from newer to older
  5. prism.identifiers - Delete identifier records for newer baskets (EPC_HEX)
  6. catalog.items - Delete the newer basket items
  7. prism.passports - Delete passports for newer baskets (cascade should handle most)

SQL Script Approach

-- Step 1: Create mapping table of keep (older) -> delete (newer) basket pairs
CREATE TEMP TABLE basket_duplicate_map AS
WITH ranked AS (
    SELECT
        i.value as epc,
        i.associated_object_id as basket_id,
        ROW_NUMBER() OVER (PARTITION BY i.value ORDER BY
            to_timestamp((('x' || substring(i.associated_object_id::text, 1, 8) ||
              substring(i.associated_object_id::text, 10, 4))::bit(48)::bigint) / 1000.0)
        ) as rn
    FROM prism.identifiers i
    WHERE i.identifier_type_id = 273  -- EPC_HEX
    AND i.value IN (SELECT value FROM prism.identifiers WHERE identifier_type_id = 273
                    GROUP BY value HAVING COUNT(DISTINCT associated_object_id) > 1)
)
SELECT
    r1.epc,
    r1.basket_id as keep_basket_id,
    r2.basket_id as delete_basket_id
FROM ranked r1
JOIN ranked r2 ON r1.epc = r2.epc AND r1.rn = 1 AND r2.rn = 2;

-- Steps 2-5: Migrate relations from the newer basket to the older one
-- (same UPDATE pattern applies to bbu.camera_read_baskets.basket_item_id
--  and bbu.movement_deduplication.basket_id)
UPDATE spatial.movements m
SET    item_id = d.keep_basket_id
FROM   basket_duplicate_map d
WHERE  m.item_id = d.delete_basket_id;

UPDATE bbu.tag_read_processing t
SET    basket_id = d.keep_basket_id
FROM   basket_duplicate_map d
WHERE  t.basket_id = d.delete_basket_id;

-- Steps 6-7: Delete identifiers, items, and passports for the newer baskets
-- (primary-key column names assumed here; verify against the schema first)
DELETE FROM prism.identifiers i
USING  basket_duplicate_map d
WHERE  i.associated_object_id = d.delete_basket_id;

DELETE FROM catalog.items c
USING  basket_duplicate_map d
WHERE  c.id = d.delete_basket_id;

DELETE FROM prism.passports p
USING  basket_duplicate_map d
WHERE  p.global_id = d.delete_basket_id;
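The keep/delete selection can be sanity-checked offline with the same rule in plain Python (pure logic, no DB access; assumes basket ids are UUIDv7 strings, as the SQL's timestamp extraction does):

```python
from collections import defaultdict

def uuid7_millis(u: str) -> int:
    # The first 48 bits of a UUIDv7 (12 hex chars once dashes are removed)
    # are milliseconds since the Unix epoch -- the same quantity the
    # substring/bit(48) expression in the SQL above computes.
    return int(u.replace("-", "")[:12], 16)

def basket_duplicate_map(identifier_rows):
    """identifier_rows: iterable of (epc, basket_id) pairs, where
    basket_id is a UUIDv7 string. Returns {epc: (keep_id, delete_ids)}
    for EPCs tied to more than one basket, keeping the oldest basket."""
    by_epc = defaultdict(set)
    for epc, basket_id in identifier_rows:
        by_epc[epc].add(basket_id)
    mapping = {}
    for epc, baskets in by_epc.items():
        if len(baskets) > 1:
            ordered = sorted(baskets, key=uuid7_millis)
            mapping[epc] = (ordered[0], ordered[1:])
    return mapping
```

Running this over an export of prism.identifiers rows should reproduce the basket_duplicate_map temp table row-for-row before any UPDATE is executed.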

Part 2: Duplicate Stack Cleanup (18,547 duplicate OBLPNs)

Findings

  • Both older and newer stacks have allocations to the SAME shipments
  • Allocations are truly duplicated (same OBLPN allocated twice to same shipment)
  • Strategy: Keep older (original) stack, delete newer stack AND its allocations

Verification Confirmed

  • Older and newer stacks point to same shipment_id for each OBLPN pair
  • Safe to delete newer stacks without data loss
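The verification above amounts to checking, over (oblpn, stack_id, shipment_id) rows, which OBLPNs have two different stacks allocated to the same shipment. A small offline sketch (column names assumed for illustration):

```python
from collections import defaultdict

def duplicated_allocations(allocs):
    """allocs: iterable of (oblpn, stack_id, shipment_id) rows.
    Returns the set of OBLPNs where more than one stack is allocated
    to the same shipment -- the duplicate pattern described above."""
    stacks_per_key = defaultdict(set)
    dups = set()
    for oblpn, stack_id, shipment_id in allocs:
        key = (oblpn, shipment_id)
        if stacks_per_key[key] and stack_id not in stacks_per_key[key]:
            dups.add(oblpn)
        stacks_per_key[key].add(stack_id)
    return dups
```

If every duplicated OBLPN's old and new stacks share a shipment_id, deleting the newer stack and its allocations loses no shipment linkage.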

Tables to Update (in order)

  1. bbu.camera_read_processing - Update matched_stack_id from newer to older (11 records)
  2. transport.shipment_item_allocations - Delete allocations for newer stacks
  3. prism.identifiers - Delete all identifiers for newer stacks (OBLPN, PRIMARY_OBLPN, etc.)
  4. catalog.items - Delete the newer stack items
  5. prism.passports - Delete passports for newer stacks

Part 3: Code-Level Uniqueness Enforcement

Decision

  • Do NOT add database constraints yet (tenant scoping needs consideration)
  • Add application-level checks in BBU component

Already Protected (Confirmed)

  • Basket creation: BasketProcessingService.cs checks EPC via PrismQueryExtensions.GetEntityIdsByIdentifierAsync() before creating (added Dec 2, 2025)
  • Stack creation: OracleProcessor.cs uses same pattern for OBLPN lookup before creating

No Additional Code Changes Needed

Both paths already have proper upsert logic with the Dec 2 hardening deployment.
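For reference, the guard both services rely on reduces to a lookup-before-create pattern. A minimal Python sketch of that shape (the real implementation is C# in BasketProcessingService.cs / OracleProcessor.cs; the callables here are hypothetical stand-ins for the identifier lookup and entity creation):

```python
def get_or_create(identifier, lookup, create):
    """Look up existing entity ids by identifier before creating,
    mirroring the GetEntityIdsByIdentifierAsync-then-create guard.
    Returns (entity_id, created_flag)."""
    existing = lookup(identifier)  # e.g. query prism.identifiers by EPC/OBLPN
    if existing:
        return existing[0], False  # reuse the already-registered entity
    return create(identifier), True  # only create when no match exists
```

Calling it twice with the same identifier creates once and reuses thereafter, which is exactly the behavior that prevents the duplicates cleaned up in Parts 1 and 2.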


Part 4: Journal Backfill

Current State

  • Journal triggers added Dec 9, 2025
  • ~99.3% of passports (2,178,938) and identifiers (4,095,442) lack journal entries
  • Attribute tables have 0 journal entries

Backfill Strategy

Timestamp Sources (in priority order)

  1. IoT zb_tag_reads.event_timestamp - For baskets (first EPC read time)
  2. background_service_runs.started_at - For Oracle-imported entities
  3. UUIDv7 extraction - Fallback: extract timestamp from entity ID itself
  4. Default - Use earliest known system timestamp if no source available
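The fallback chain can be sketched in Python to make the priority order explicit (an illustration, not production code; argument names are hypothetical):

```python
from datetime import datetime, timezone

def uuid7_timestamp(u: str) -> datetime:
    # First 48 bits of a UUIDv7 (12 hex chars, dashes removed) hold
    # milliseconds since the Unix epoch -- same math as the SQL
    # substring/bit(48) expression used in the backfill scripts.
    ms = int(u.replace("-", "")[:12], 16)
    return datetime.fromtimestamp(ms / 1000.0, tz=timezone.utc)

def resolve_valid_from(iot_read_ts, import_run_ts, entity_uuid, default_ts):
    """Apply the priority order above: IoT read time, then import run
    time, then the UUIDv7-embedded timestamp, then the default."""
    return iot_read_ts or import_run_ts or uuid7_timestamp(entity_uuid) or default_ts
```

This mirrors the COALESCE in the backfill SQL below: the first non-null source wins, and the UUIDv7 extraction guarantees a plausible timestamp even when no source row exists.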

Journal Table Schema (per JournalTable base class)

- seq_id: 1 (first version)
- valid_from: <calculated timestamp>
- valid_to: 'infinity'::timestamptz
- initiated_by_user_id: <matched processor user UUID>
- invalidated_by_user_id: NULL

System User Mapping for initiated_by_user_id

Entity Type                       User                 UUID
Baskets (Bakery Basket items)     rfid-processor       019a4865-3c89-7069-a21c-510374e7f460
Stacks (from Oracle)              oblpn-processor      019a4865-3b47-72b3-8a86-e2f3b89a2ee1
Shipments/Deliveries              shipment-processor   019a4865-3ca1-7624-8625-13cd1d35ca50
Camera reads                      camera-processor     019a4865-3c98-702f-9fae-23b0be51de11
Locations/Other                   acsis                019a4865-1616-7080-8fe3-8be053ae98d4

Backfill Script Approach

-- Must set session variable for trigger compatibility
SET app.user_id = '<system-user-uuid>';

-- Insert into journal tables directly (bypass triggers for initial backfill)
INSERT INTO prism.passports_journal (global_id, platform_type_id, seq_id, valid_from, valid_to, initiated_by_user_id)
SELECT
    p.global_id,
    p.platform_type_id,
    1 as seq_id,
    COALESCE(
        -- Try to find creation time from source data
        <source_timestamp_subquery>,
        -- Fallback to UUIDv7 extraction
        to_timestamp((('x' || substring(p.global_id::text, 1, 8) ||
          substring(p.global_id::text, 10, 4))::bit(48)::bigint) / 1000.0)
    ) as valid_from,
    'infinity'::timestamptz as valid_to,
    '<system-user-uuid>'::uuid as initiated_by_user_id
FROM prism.passports p
WHERE NOT EXISTS (SELECT 1 FROM prism.passports_journal pj WHERE pj.global_id = p.global_id);

Tables to Backfill

  1. passports_journal - 2,178,938 records
  2. identifiers_journal - 4,095,442 records
  3. attribute_definitions_journal - 2 records
  4. attribute_assignments_journal - 2 records
  5. attribute_values_journal - 1,146 records

Execution Order

  1. Backup database (user already planning this)
  2. Part 1: Clean up duplicate baskets
  3. Part 2: Clean up duplicate stacks
  4. Part 3: Verify code protections (already done - no changes needed)
  5. Part 4: Run journal backfill scripts
  6. Verify: Run data integrity checks post-cleanup

Key Files Referenced

  • /engines/bbu/src/Acsis.Dynaplex.Engines.Bbu/Services/BasketProcessingService.cs - Basket upsert logic
  • /engines/bbu/src/Acsis.Dynaplex.Engines.Bbu/Services/OracleProcessor.cs - Stack upsert logic
  • /engines/prism/src/Acsis.Dynaplex.Engines.Prism.Database/Routines/journal_writer.sql - Journal trigger
  • /strata/core/src/Acsis.Dynaplex/JournalTable.cs - Journal base class

Risks and Mitigations

Risk                            Mitigation
FK violations during delete     Delete in correct order; use temp mapping tables
Missing data after cleanup      Verify all relations migrated before deleting
Journal backfill performance    Batch inserts; run during low-activity period
Wrong timestamps in journal     Use multiple fallback sources; UUIDv7 extraction as last resort
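One way to realize the "batch inserts" mitigation is keyset pagination over the UUID primary key, so each INSERT ... SELECT touches a bounded slice of the ~2.1M passports instead of the whole table. A minimal sketch of the bound computation (helper name hypothetical):

```python
from itertools import islice

def keyset_batches(sorted_keys, batch_size):
    """Split an ordered key sequence into (first, last) bounds for a
    `WHERE global_id BETWEEN first AND last` backfill loop, committing
    after each slice instead of holding one huge transaction."""
    it = iter(sorted_keys)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk[0], chunk[-1]
```

The driving script would fetch the sorted keys once (or step through them with `ORDER BY global_id LIMIT n` queries) and run one journal INSERT per bound pair during the low-activity window.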