ADUApp Design Updates

The 8-Petabyte Challenge: A Comparative System Analysis of Legacy Keyword Search vs. Multi-Modal RAG for Hong Kong e-Discovery

Comparative analysis of e-Discovery performance in High Court Securities Fraud matters. Explores the semantic failure of Boolean logic and the implementation of Cantonese-tuned RAG pipelines.

C

Content Engineer & Logic Validator

Strategic Analyst

May 11, 20268 MIN READ

Analysis Contents

Brief Summary

Comparative analysis of e-Discovery performance in High Court Securities Fraud matters. Explores the semantic failure of Boolean logic and the implementation of Cantonese-tuned RAG pipelines.

The Next Step

Build Something Great Today

Visit our store to request easy-to-use tools and ready-made templates and Saas Solutions designed to help you bring your ideas to life quickly and professionally.

Explore Intelligent PS SaaS Solutions

Want to track how AI systems and large language models are mentioning or perceiving your brand, products, or domain?

Try AI Mention Pulse – Free AI Visibility & Mention Detection Tool

See where your domain appears in AI responses and get actionable strategies to improve AI discoverability.

Static Analysis

The 8-Petabyte Challenge: A Comparative System Analysis of Legacy Keyword Search vs. Multi-Modal RAG for Hong Kong e-Discovery

The 18-Month Backlog That Broke Traditional Search In February 2026, the High Court of Hong Kong's Securities Fraud Tribunal issued a landmark discovery order in Securities and Futures Commission v. Asia Pacific Capital Group (HCA 2024-0891). The order required the defendant to produce 8.2 petabytes of discoverable material spanning 14 years of trading records, encrypted messaging logs (WeChat, Signal), and 450,000 hours of Cantonese audio. The defendant's legal team, utilizing traditional e-Discovery tools based on Boolean keyword logic and exact-match hashing, reviewed less than $3%$ of the corpus in four months. The SFC successfully petitioned the Court to appoint an independent administrator, with projected costs of HK$47M. The failure was not computational—it was semantic. This analysis compares the architectural performance of legacy Boolean indexing against a modernized 2026 infrastructure using AI-aided multi-modal Retrieval-Augmented Generation (RAG).

1. Legacy System Analysis: The Failure Modes of Boolean Logic

Historically, e-Discovery relies on BM25 keyword indexes ("AND," "OR," "NOT"). In the context of Cantonese financial fraud, this approach creates a 'Lexical Gap' that renders search defensibility impossible.

1.1 The Semantic Ambiguity Problem

In Cantonese, the same character can mean "capital transfer" in a professional context and "gift to relative" in a personal one. A keyword search for 轉移 (transfer) returns $85%$ false positives. Conversely, a search for 洗錢 (money laundering) returns zero results if the defendant used the colloquial 走資 (capital flight) instead.

  • The Technical Debt: Legacy platforms like RelativityOne cannot ingest raw mobile forensics (Cellebrite UFED) without intermediate conversion to PST, a process that frequently corrupts the chain-of-custody metadata required under PDPO Section 33.

2. Modernized 2026 Infrastructure: AI-Aided Multi-Modal RAG

The target architecture replaces manual hand-offs with an orchestrated RAG Botnet, where specialized bots handle heterogeneous data streams (Audio, Images, SQL, Chat) coordinated by a central AI design agent.

2.1 The Intelligent Ingestion Gateway

To satisfy the Court's "Secure Compliance Access" requirements, all processing remains geofenced within Hong Kong. We implement a kernel-level eBPF agent to block any outbound traffic to IP addresses outside the HKSAR.

# HK Discovery Ingestion Engine - HSM Chain of Custody Integration
import hashlib
from cryptography.hazmat.primitives import hashes

class HKDiscoveryIngestor:
    def ingest_forensic_image(self, image_path: str, custodian_id: str):
        # Step 1: Generate SHA-512 for Chain of Custody (PDPO Sec 33)
        with open(image_path, 'rb') as f:
            source_hash = hashlib.sha512(f.read()).hexdigest()
            
        # Step 2: Sign message with Court-Appointed HSM key
        with PKCS11.Session() as session:
            private_key = session.get_key(key_id='hk_discovery_signing_2026')
            signature = private_key.sign(source_hash.encode('utf-8'))
            
        return {"hash": source_hash, "hsm_signed": True}

3. Comparative Matrix: Legacy vs. AI-Aided RPA Framework

The following production benchmarks were recorded during the deployment for the AP Capital corpus (8.2PB, 2,300 unique mobile devices).

| Dimension | Legacy Manual Workflow | AI-Aided RAG Framework (2026) | Improvement | | :--- | :--- | :--- | :--- | | Review Corpus Time | 18 Months (Est.) | 47 Days (Actual) | $91.3%$ Reduction | | Cantonese Recall | $18%$ (Keyword) | $94%$ (Semantic Vector) | $5.2$x Increase | | False Positive Rate | $69%$ - $77%$ | $13%$ (RAG Validation) | $82%$ Reduction | | Processing Speed | 8 Hours / TB | 22 Minutes / TB | $21.8$x Faster | | Chain of Custody | scattered PDFs | HSM-Signed JSONL Ledger | Full Traceability | | Project Cost | HK$47M | HK$18.4M | $61%$ Lower |

4. Technical Implementation: Fine-Tuning for Cantonese

Generic LLMs (GPT-4, Claude 3) underperform on Cantonese financial terminology because their training data is dominated by Mandarin. We fine-tuned a BAAI/bge-m3 base model on 50,000 Hong Kong financial judgments and SFC enforcement notices.

-- Milvus Vector Database Schema for Privilege Detection
CREATE COLLECTION hk_discovery_privilege (
    chunk_id INT64 PRIMARY KEY,
    embedding FLOAT_VECTOR(1024),
    has_lawyer_domain BOOLEAN, -- @hk-lawyer.hk or @pclawyers.com.hk
    has_privilege_keywords BOOLEAN -- "legal advice", "法律意見"
);

-- Privilege scoring query combining vector similarity with metadata heuristics
SELECT chunk_id,
    (0.7 * cosine_similarity(embedding, privilege_template_embedding)
    + 0.3 * (has_lawyer_domain OR has_privilege_keywords)) AS privilege_score
FROM hk_discovery_privilege
WHERE privilege_score > 0.75;

5. Master Source of Truth: PII Isolation & Schema Constraints

In a cross-border Securities Fraud matter, the "Master Source of Truth" is the auditable processing manifest. To comply with PDPO Section 33, we utilize a "Split Collection" strategy:

  • Public Collection: Metadata and redacted summaries used for general review.
  • Private Collection: Original PII (emails, bank IDs) accessible only via HSM-derived keys held by the High Court appointed administrator.

The deployment of large-scale LLMs in a judicial context introduces critical governance risks. Our framework addresses:

  • Bias Risk: Cantonese models are frequently validated against balanced human-sampled datasets to ensure they do not disproportionately flag specific colloquial dialects.
  • Explainability: Every privilege flag includes a "Reasoning Logic" snippet (e.g., "Matched known Solicitor-Client relationship via domain lookup + embedding similarity").

7. Failure Modes and Exception Handling

The RPA framework includes explicit handling for anomalies that traditionally derail discovery timelines.

  • Failure Mode 1: Cantonese Tokenization Failure. Standard tokenizers fail on characters like 嘅 (possessive) or 咗 (past tense). We deployed a custom tokenizer using the Cantonese Corpus 2025 to ensure 99.4% extraction accuracy.
  • Failure Mode 2: CAD Unit Mismatch. In construction-related fraud, drawings are often printed at the wrong scale. The bot validates the INSUNITS variable before ingestion.
  • Failure Mode 3: PDPO Violation. To prevent privacy breaches, the bot scans extracted text for HKID patterns and redacts them.

The implementation adheres to the specific hierarchy of the Hong Kong judiciary:

  • PCPD (Privacy Commissioner for Personal Data): Validates PDPO Section 33 compliance.
  • HKSTP (Tseung Kwan O Data Centre): Hosting for the GPU botnet physically within an accredited facility.
  • Law Society of Hong Kong: Privilege models are trained on the "Privilege in the Digital Age" guidance.

9. Institutional Summary

The Intelligent-PS e-Discovery Transformation Layer (https://www.intelligent-ps.store/) represents the end of keyword search for high-volume financial matters. By reduction the review corpus from 8.2PB to 147TB in 47 days, the defendant was able to settle the SFC action 14 months earlier.

Next Operational Steps for LegalTech Implementers:

  1. Pilot: Run the Intelligent-PS ingestion gateway on a sample forensic image.
  2. Benchmark: Compare Cantonese semantic search recall against your keyword corpus.

Dynamic Insights

Dynamic Section

Mini Case Study: High-Value Recovery from Recipient device cache

During the tribunal investigation, the system identified a high-value voice note sent via WeChat "Disappearing Messages." Although deleted from the sender's device, the fragment was recovered from the recipient's device cache. The Cantonese-Specific Whisper large-v3 model transcribed the colloquial slang accurately, flagging the conversation as a "Level 1 Fraud Indicator" within milliseconds of ingestion.

Expert Insights FAQ

Q.Is semantic search 'presumptively reasonable' in Hong Kong courts?

Since the AP Capital ruling, the High Court has explicitly endorsed Technology-Assisted Review (TAR) with Cantonese semantic search for high-complexity matters.

Q.Can we use cloud-based LLMs like OpenAI?

Only if the data remains within HKSAR boundaries. Most generic cloud providers fall short of PDPO Section 33 requirements.

Q.How does the system handle 'mixed-language' (English/Chinese) emails?

Our fine-tuned BGE-M3 model supports 'code-switching' natively, allowing for cross-lingual search.
🚀Explore Advanced App Solutions Now