VITAL Scoring Framework — Version 1.0

Established: 2026-03-25
Status: Active (Production)
Scope: All Vitals research outputs: monographs, protocol cards, worker packets, and recommendation documents.

1. Philosophy — Why Flat Scores Fail

A single scalar score (e.g., “8/10”) cannot serve both a $10 OTC vitamin and a $500/month injectable peptide. These interventions exist on completely different planes of evidence, cost, risk, accessibility, and measurability, and treating them with the same fixed rubric produces misleading, dangerous, or useless results.

Flat scores fail because they:

Conflate quality with popularity. A well-studied intervention scores high on validation regardless of whether the user can access it affordably.
Ignore mechanism-to-outcome gaps. A supplement may have plausible biology (high V) but be undetectable by consumer wearables (low T), making the score meaningless for our pipeline.
Hide tradeoff structure. A “7/10” peptide with severe accessibility issues and zero wearable signal looks identical to a “7/10” clean OTC intervention. They require completely different guidance.
Punish honesty. Accurate uncertainty ranges and honest acknowledgment of data limits are penalized compared to fabricated precision.

The VITAL framework solves this by decomposing recommendation fitness into five orthogonal axes, computing a geometric mean, enforcing a hard kill-switch, and reporting a confidence band. The result is a score that is honest about what it knows, transparent about what it doesn’t, and actionable within its confidence bounds.

2. The Five Axes

All axes are scored on a continuous 0.01–1.0 scale. Integers are not used. A score represents a probability-weighted estimate of the axis’s true value given available evidence — not a subjective grade.

2.1 V — Validation (Scientific Rigor)

Measures the quality and quantity of human evidence supporting the intervention’s claimed mechanism and outcomes. This is the most evidence-dependent axis.

Tier	Evidence Profile	Score Range
Tier 1	Peer-reviewed RCTs with pre-registration, replication, and meta-analysis (or Cochrane-level synthesis) in human populations	0.85–1.0
Tier 2	Peer-reviewed RCTs without meta-analysis; large observational cohorts (n > 5,000, adjusted confounders)	0.65–0.84
Tier 3	Peer-reviewed RCTs with methodological concerns (small n, industry-funded, no pre-registration); mechanistic human studies	0.40–0.64
Tier 4	Animal models, in-silico, extrapolated from related compounds; case series	0.20–0.39
Tier 5	In-vitro only, theoretical, or based on traditional use without modern validation	0.01–0.19

Modifiers (applied as sub-score adjustments within the tier range):

Publication bias penalty: If industry funding or selective reporting is detected, subtract up to 0.10 from the Tier 1–3 estimate.
Population mismatch penalty: If the evidence base uses a significantly different population (e.g., critically ill ICU patients for a wellness supplement), subtract up to 0.15. Document the mismatch explicitly.
Dose/route mismatch penalty: If human evidence uses doses or routes substantially different from the proposed protocol, subtract up to 0.10.
Replication bonus: +0.05 if independent labs have replicated the core finding.
Negative evidence bonus: +0.05 if well-designed null/negative RCTs are acknowledged and discussed rather than ignored.

Sub-scoring example — Tier 3 to 4 boundary:

Resveratrol has several human RCTs, but most are small (n < 100), industry-funded, and use doses far exceeding what can be achieved orally. Strong mechanistic data in animals. Score: 0.38 (Tier 4, modifier-applied).

2.2 I — Impact (Effect Size and Breadth)

Measures the magnitude and scope of the intervention’s benefit, contextualized to the individual user’s baseline. This axis is NOT a fixed property of the compound — it must be re-scored per-user based on their starting biometrics.

Core principle: A 5-point HRV increase in a user starting at 30ms (very low) is a larger relative impact than the same 5-point increase in a user starting at 80ms (healthy). Impact is scored relative to the user’s context.

Profile	Sub-score
Large, consistent, multi-system benefit (e.g., sleep quality, HRV, RHR, inflammation all move meaningfully)	0.75–1.0
Moderate benefit in 2–3 domains; clinically meaningful where it occurs	0.55–0.74
Meaningful benefit in 1–2 domains; moderate in others; some null or negative outcomes	0.35–0.54
Small or inconsistent benefit; many null results; benefit primarily theoretical	0.15–0.34
Minimal detectable benefit; benefit mostly extrapolated from biomarkers, not clinical outcomes	0.01–0.14

Contextual scoring rules:

The scorer MUST specify the assumed user baseline. “Assumed baseline: HRV 45ms, RHR 72bpm, VO2max 35ml/kg/min, sleep efficiency 82%.”
If the intervention primarily affects a single metric (e.g., inflammation marker hs-CRP), that metric must be identified and its expected change quantified relative to baseline.
Effect sizes must be reported as relative change (%) or absolute change with baseline context — not as a bare statistical significance (p-value).
For longevity interventions: impact may include slowing a decline rate (e.g., HRV declining at 2ms/year → 1ms/year), which counts as meaningful impact even without acute gains.

Sub-scoring example:

For a user with HRV 35ms (chronically dysregulated), a supplement raising HRV by 12ms (+34%) and improving sleep efficiency by 6pp would score at the high end of the moderate range: 0.70.
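The baseline-relative reporting rule can be illustrated with a few lines of Python. This is a minimal sketch rather than part of the framework; the function name `relative_change_pct` is hypothetical, and the inputs are the HRV figures quoted in this section.

```python
def relative_change_pct(baseline, absolute_change):
    """Return the change as a percentage of the user's baseline value."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * absolute_change / baseline

# The same +5 ms HRV gain at the two baselines from the core principle.
low_baseline = relative_change_pct(30, 5)      # user starting at 30 ms
healthy_baseline = relative_change_pct(80, 5)  # user starting at 80 ms

# The sub-scoring example: +12 ms on a 35 ms baseline.
example = relative_change_pct(35, 12)

print(f"low baseline: {low_baseline:.1f}%")
print(f"healthy baseline: {healthy_baseline:.2f}%")
print(f"sub-scoring example: {example:.0f}%")
```

The same absolute gain is worth roughly 2.7 times more, in relative terms, to the low-baseline user, which is why the I axis has to be re-scored per user.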

2.3 T — Traceability (Apple Watch Signal Detectability)

Measures whether the Apple Watch (or comparable consumer wearable) can reliably detect the intervention’s effect within a practical timeframe. This is the most pipeline-specific axis — it bridges evidence and biometric feedback.

Core principle: If we can’t measure it with the Watch, we can’t coach on it. An intervention that works but produces no wearable-detectable signal cannot be managed adaptively by our platform.

Signal Profile	Score Range
Strong, consistent multi-metric signal (HRV + RHR + sleep + activity synergy) within 2–4 weeks	0.80–1.0
Clear signal in 2 metrics within 4–6 weeks; or strong signal in 1 metric with supporting secondary signals	0.60–0.79
Detectable signal in 1–2 metrics within 6–8 weeks; signal-to-noise ratio acceptable	0.40–0.59
Weak, inconsistent, or slow-appearing signal; requires long baseline periods or aggregate analysis to detect	0.20–0.39
Watch-invisible: no plausible wearable-accessible pathway; effect only detectable via lab or clinical measures	0.01–0.19

Traceability decision tree:

Identify the primary biomarker(s) the intervention affects (e.g., HRV, RHR, sleep architecture, VO2max, blood oxygen, skin temperature).
Map to Watch-accessible signals: HRV (Apple HealthKit SDNN), RHR, sleep stages (watchOS 9 or later), VO2max (estimated from heart rate and motion during outdoor walks, runs, or hikes), respiratory rate.
Estimate time-to-signal: How many weeks of consistent Wearable Biometric Worker monitoring before a signal emerges above the user’s baseline noise?
Assess signal quality: Is the signal directionally consistent (always improves), or does it fluctuate with dosing/cycling?

Modifiers:

Dosing-cycle alignment bonus: If the signal can be detected within a single on/off cycle (e.g., within 1 week of taking the supplement), +0.10.
Multi-metric synergy bonus: If 3+ Watch metrics move together in a biologically coherent pattern, +0.10.
Long lag penalty: If signal requires > 8 weeks to emerge, −0.10. Justify with the biological mechanism (e.g., collagen turnover takes 12+ weeks).
High noise penalty: If the user’s baseline variability is very high (HRV SD > 15ms week-to-week), −0.10 and widen the confidence band.
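The four modifiers above could be applied as in the following sketch. The modifier values come from the list; the flag names, the function `apply_t_modifiers`, and the clamping to the 0.01–1.0 scale are illustrative assumptions, not a prescribed implementation.

```python
# Modifier values are taken from the list above; the flag names are illustrative.
T_MODIFIERS = {
    "dosing_cycle_alignment": +0.10,
    "multi_metric_synergy": +0.10,
    "long_lag": -0.10,
    "high_noise": -0.10,
}

def apply_t_modifiers(base_score, flags):
    """Return (adjusted T score, widen_band) for the given modifier flags."""
    adjusted = base_score + sum(T_MODIFIERS[flag] for flag in flags)
    clamped = min(1.0, max(0.01, adjusted))  # assumption: stay on the 0.01-1.0 scale
    widen_band = "high_noise" in flags       # the high noise penalty also widens the band
    return round(clamped, 2), widen_band

# Hypothetical case: a slow signal in a user with very variable HRV.
score, widen = apply_t_modifiers(0.45, ["long_lag", "high_noise"])
print(score, widen)
```

Returning the band-widening flag alongside the score keeps the high noise penalty's second effect from being lost when the score is passed on.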

2.4 A — Accessibility (Inverse Friction)

Measures the practical barriers between the user and consistent, correct use of the intervention. Higher accessibility = higher score. This axis is intentionally the inverse of “friction.”

Barrier Profile	Score Range
OTC, cheap (< $20/month), no professional oversight required, stable formulation, long shelf life	0.80–1.0
OTC or prescription with mild barriers (cost $20–60/month, requires one-time bloodwork, or has minor formulation sensitivity)	0.55–0.79
Prescription required OR cost $60–150/month OR requires periodic monitoring OR moderate storage requirements	0.30–0.54
Specialty pharmacy, cost $150–500/month, requires MD prescription + lab monitoring every 1–3 months, or has meaningful administration burden	0.10–0.29
Compounded, requires clinic administration, cost > $500/month, REMS program, or legally restricted — high professional oversight burden	0.01–0.14

Explicit non-scoring note: Accessibility does NOT include efficacy or safety — those are captured in V and I. Two interventions with identical accessibility scores may have wildly different benefit profiles. This axis exists to set user expectations and flag when coaching must include significant adherence support.

Sub-scoring example:

Rapamycin prescribed off-label, requires finding a physician willing to prescribe, costs $50–100/month, requires biannual bloodwork. Score: 0.28.

2.5 L — Longevity Signal (Temporal Benefit Profile)

Measures whether the intervention’s benefit is sustained and compounding over time, supported by long-term safety data — versus acute, short-lived, or theoretically beneficial without long-term outcome evidence.

Temporal Profile	Score Range
Decades of human use data, benefit compounds or stabilizes, long-term safety well-established (> 10 year follow-up in relevant population)	0.80–1.0
Multi-year human data, benefit sustained at 1–3 years, no emerging safety signals at extended durations	0.60–0.79
1–2 year human data; some concerns about long-term mechanism (e.g., chronic immunosuppression, hormonal axis disruption)	0.35–0.59
< 1 year human data; acute benefit established but long-term benefit extrapolated; mechanism-based concern about adaptation or reversal	0.15–0.34
Theoretical longevity benefit; no long-term human data; mechanism plausible but unvalidated; concerns about long-term safety unknown	0.01–0.14

Specific guidance by intervention type:

Vitamins/minerals (e.g., D3, magnesium, zinc): Typically score 0.75–0.90 if deficiency-corrected. If repletion is the mechanism, the longevity signal is well-established (decades of data).
Peptides (e.g., BPC-157, TB-500, GHK-Cu): Typically score 0.15–0.40. Human long-term data is sparse. Duration of benefit and long-term tissue effects are not well-characterized.
Senolytics and rapamycin: Score 0.30–0.55. Strong mechanistic longevity rationale but human outcome data at relevant doses is limited to 1–2 years.
Lifestyle interventions (sleep, exercise, nutrition): Score 0.75–1.0. Decades of longitudinal data; benefit compounds; mechanism is multi-system and well-characterized.

3. Aggregation — The Geometric Mean

3.1 Core Formula

The five axis scores are aggregated using the geometric mean, not the arithmetic mean:

VITAL Score = (V × I × T × A × L)^(1/5)

Why geometric mean:

Geometric mean penalizes a low score on any single axis more severely than arithmetic mean. A 0.20 on one axis drags the aggregate down proportionally.
This reflects real-world fitness: a compound that scores 0.90 on four axes but 0.10 on the fifth (e.g., inaccessible) is NOT a good recommendation — it should score lower than 0.80.
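Arithmetic mean would score 0.90/0.90/0.90/0.90/0.10 at 0.74, close to the 0.82 it gives 0.90/0.90/0.90/0.90/0.50, which understates the failure on the fifth axis. The geometric mean scores the same two profiles at roughly 0.58 and 0.80.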
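The difference between the two means is easy to check numerically. The sketch below computes both for the two axis profiles discussed above; the function names are illustrative, and the geometric mean is computed in log space purely as an implementation choice.

```python
import math

def geometric_mean(scores):
    """n-th root of the product of the scores, computed in log space."""
    if any(s <= 0 for s in scores):
        raise ValueError("axis scores must be positive (the scale starts at 0.01)")
    return math.exp(sum(math.log(s) for s in scores) / len(scores))

def arithmetic_mean(scores):
    return sum(scores) / len(scores)

one_failed_axis = [0.90, 0.90, 0.90, 0.90, 0.10]
one_middling_axis = [0.90, 0.90, 0.90, 0.90, 0.50]

for profile in (one_failed_axis, one_middling_axis):
    print(profile,
          "geometric:", round(geometric_mean(profile), 2),
          "arithmetic:", round(arithmetic_mean(profile), 2))
```

The arithmetic mean moves by 0.08 between the two profiles, while the geometric mean moves by about 0.22, so the failed axis dominates the aggregate as intended.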

3.2 Kill-Switch Floor

Rule: If ANY individual axis score is below 0.15, the final VITAL score is hard-capped at 0.30.

This is a non-negotiable safety and honesty mechanism. Any intervention with a severe deficiency on any axis — no matter how good the other four axes are — cannot receive a score above 0.30 from our pipeline. This prevents dangerous false positives.

Rationale: A 0.10 accessibility score means the intervention is essentially inaccessible to the user. Recommending it as a 0.60+ compound is fraudulent. The kill-switch ensures that a single axis failure correctly dominates the output.

Reporting: The kill-switch cap of 0.30 is reported with a red warning flag and a note: “Final score hard-capped at 0.30 due to [axis] deficiency. Intervention is not currently recommended.”
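A minimal sketch of the kill-switch follows. It assumes that "hard-capped" means a ceiling, so a geometric mean already below 0.30 is left unchanged; that reading, the function name `vital_score`, and the return shape are assumptions made for illustration.

```python
import math

KILL_THRESHOLD = 0.15  # any axis below this trips the kill-switch
KILL_CAP = 0.30        # ceiling applied to the final score when tripped

def vital_score(axes):
    """axes maps axis letter to score. Returns (score, tripped_axes, note)."""
    gm = math.exp(sum(math.log(v) for v in axes.values()) / len(axes))
    tripped = [name for name, value in axes.items() if value < KILL_THRESHOLD]
    if not tripped:
        return round(gm, 2), tripped, None
    note = (f"Final score hard-capped at {KILL_CAP:.2f} due to "
            f"{', '.join(tripped)} deficiency. "
            "Intervention is not currently recommended.")
    # Assumption: the cap is a ceiling, so a lower geometric mean is kept as-is.
    return round(min(gm, KILL_CAP), 2), tripped, note

# Axis values from Example 1 (Vitamin D3 + K2) in this document.
d3k2 = {"V": 0.78, "I": 0.68, "T": 0.14, "A": 0.92, "L": 0.88}
score, tripped, note = vital_score(d3k2)
print(score, tripped)
print(note)
```

With these inputs the uncapped geometric mean is about 0.57, so the cap is what brings the reported score down to 0.30.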

3.3 Confidence Band Methodology

Every VITAL score must be reported as a three-value output:

VITAL Score: 0.58 [0.42 – 0.67]
            ↑ Central  ↑ Low      ↑ High
            estimate   bound      bound

How to compute the confidence band:

Assign axis-level bounds. Each axis score requires a low and high plausible bound reflecting the evidence uncertainty. When in doubt, widen.
Aggregate low bound: Compute the geometric mean of the five axis low bounds. This is the band low.
Aggregate high bound: Compute the geometric mean of the five axis high bounds. This is the band high.
Minimum band width: The band must be at least ± 0.05 wide from the central estimate. If the computed band is narrower, widen it to ± 0.05 minimum.
Maximum band width: The band should not exceed ± 0.25 from the central estimate without explicit flagging. If it does, the score must include a bold-face note: “Uncertainty is very high. This score should not be used for protocol decisions without additional primary research.”
Kill-switch override: If the kill-switch activates, the band is computed from the kill-switched aggregate and still reported, but the cap and warning are prominently noted.
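The band computation can be sketched as follows. The axis values are hypothetical, chosen only to exercise the steps; the clamp to the 0.01–1.0 scale is an assumption, and the kill-switch override step is deliberately left out of this sketch.

```python
import math

MIN_HALF_WIDTH = 0.05   # minimum distance of each bound from the central estimate
FLAG_HALF_WIDTH = 0.25  # beyond this, the "uncertainty is very high" note is required

def gm(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))

def confidence_band(central_axes, low_axes, high_axes):
    """Return (central, low, high, needs_flag) from per-axis estimates and bounds."""
    central = gm(central_axes)
    low = min(gm(low_axes), central - MIN_HALF_WIDTH)    # enforce minimum width
    high = max(gm(high_axes), central + MIN_HALF_WIDTH)
    low, high = max(low, 0.01), min(high, 1.0)           # assumption: clamp to the scale
    needs_flag = (central - low) > FLAG_HALF_WIDTH or (high - central) > FLAG_HALF_WIDTH
    return central, low, high, needs_flag

# Hypothetical axis estimates in V, I, T, A, L order.
central_axes = [0.60, 0.70, 0.50, 0.80, 0.60]
low_axes = [0.50, 0.60, 0.40, 0.70, 0.50]
high_axes = [0.70, 0.80, 0.60, 0.90, 0.70]

central, low, high, needs_flag = confidence_band(central_axes, low_axes, high_axes)
print(f"VITAL Score: {central:.2f} [{low:.2f} – {high:.2f}] flag={needs_flag}")
```

Because the bounds are aggregated separately from the central estimate, the band is generally not symmetric around it.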

Band width interpretation:

Band Width	Interpretation
< ± 0.08	High confidence — robust evidence base, consistent across studies, clear mechanism, well-characterized population
± 0.08 – ± 0.15	Moderate confidence — good evidence with some heterogeneity or minor gaps
± 0.15 – ± 0.20	Low confidence — sparse evidence, some conflicting data, extrapolation required
> ± 0.20	Very low confidence — sparse, indirect, or extrapolated data. Proceed with extreme caution.

4. Grade Mapping

Score Range	Grade	Label	Protocol Status
0.80 – 1.00	S	Core Protocol	Include in core coaching stack. High confidence, strong evidence, accessible, Watch-detectable.
0.65 – 0.79	A	Targeted Intervention	Recommend for specific user profiles. Strong evidence but with one moderate gap (accessibility, traceability, or duration).
0.50 – 0.64	B	Solid with Gaps	Recommendable with caveats. One significant axis weakness or moderate gaps across multiple axes.
0.35 – 0.49	C	Speculative / Trade-offs	Experimental or context-dependent. Use requires explicit user acknowledgment of uncertainty. Not default-recommended.
0.20 – 0.34	D	High Friction or Low Evidence	Not recommended as a default protocol. May be appropriate for specific users under clinician guidance.
0.01 – 0.19	F	Not Ready for Recommendation	Insufficient evidence or accessibility. Do not include in protocol recommendations.

Important: The grade is based on the central estimate only, not the confidence band. A score of 0.58 [0.40–0.72] is a B, not a C, even though the low bound is in C territory.
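The grade table reduces to a threshold lookup on the central estimate. In this sketch the score is rounded to two decimals before the lookup, which is an assumption made so that values between table rows (for example 0.796) land in a defined grade; the function name is illustrative.

```python
# (lower bound, grade, label) rows from the grade mapping table, highest first.
GRADE_FLOORS = [
    (0.80, "S", "Core Protocol"),
    (0.65, "A", "Targeted Intervention"),
    (0.50, "B", "Solid with Gaps"),
    (0.35, "C", "Speculative / Trade-offs"),
    (0.20, "D", "High Friction or Low Evidence"),
    (0.01, "F", "Not Ready for Recommendation"),
]

def grade(central_estimate):
    """Map a central estimate to (grade, label); the band is ignored on purpose."""
    score = round(central_estimate, 2)
    for floor, letter, label in GRADE_FLOORS:
        if score >= floor:
            return letter, label
    raise ValueError("score is below the 0.01 scale minimum")

# The case from the note above: 0.58 [0.40–0.72] grades on 0.58 alone.
print(grade(0.58))
```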

5. Worked Examples

Example 1: Vitamin D3 + K2 (MK-7)

Intervention: 5,000 IU/day cholecalciferol + 100–200 mcg/day menaquinone-7 (K2 MK-7). OTC, ~$15/month. Self-administered. No prescription.

V — Validation:

Large body of RCTs and meta-analyses on vitamin D supplementation. Much evidence is in deficient populations; general population RCTs in healthy adults show inconsistent results for non-skeletal outcomes.
Bone health meta-analyses are Tier 1 (0.88–0.95 range). General wellness outcomes are Tier 2–3.
K2 (MK-7): Strong evidence for bone and vascular calcification outcomes in European and Japanese populations (Tier 1–2). Less US data.
Population/dose mismatch penalty applies: most high-quality D3 studies use high doses in deficient populations; our target may include replete adults.
Score: 0.78 (Tier 2, with population and dose context modifier applied)

I — Impact:

In deficient users: large impact on bone density markers, vascular health, PTH regulation, immune function. Multi-system.
In replete users: modest impact; ceiling effect. Marginal improvements in biomarkers Apple Watch cannot directly measure (bone density, serum calcium).
Watch-accessible impact: D3/K2 does not directly produce acute Watch signals. Impact is chronic and subclinical. HRV and RHR may improve indirectly via reduced inflammation over months.
Contextualized for deficient user with HRV 40ms: 0.68

T — Traceability:

Direct Watch signal: essentially none in the short term. D3/K2 does not acutely affect HRV, RHR, or sleep.
Indirect signal via chronic inflammation reduction: plausible at 8–12 weeks, but signal is weak and variable.
No known acute biomarker measurable by Apple Watch.
Score: 0.14 (Watch-invisible acutely; indirect theoretical signal at 8–12 weeks)
Kill-switch activated — T = 0.14 < 0.15.

A — Accessibility:

OTC, about $15/month, stable formulation, no prescription, no monitoring required.
Score: 0.92

L — Longevity Signal:

Vitamin D: well-established decades of use, excellent long-term safety at physiologic doses. Benefit is chronic and compounding for bone and potentially immune aging.
K2 MK-7: well-characterized long-term safety in Japanese population (natto consumption is centuries-old). Human data at supplement doses for 5+ years exists.
Score: 0.88

Aggregation:

V = 0.78, I = 0.68, T = 0.14, A = 0.92, L = 0.88
T < 0.15 → Kill-switch activated → Hard cap at 0.30

VITAL Score: 0.30 [0.22 – 0.35] ⚠️ KILL-SWITCH ACTIVE (Traceability)

Grade: D (kill-switch capped)

Interpretation: Despite being an excellent, safe, accessible intervention, D3+K2 is Watch-invisible in the short term and cannot be adaptively managed by our coaching platform. It may be appropriate for user self-management guided by periodic bloodwork, but it should not be scored as a protocol-level biometric coaching intervention.

Recommendation: Recommend independently of VITAL scoring for deficient users. Flag for biometric integration only when inflammation-mediated indirect signals are better characterized.

Example 2: Rapamycin (5 mg/week)

Intervention: Rapamycin (sirolimus) 5 mg/week, oral. Prescribed off-label for longevity. Cost $50–100/month (compounded or via mail pharmacy). Requires physician willing to prescribe, baseline and periodic CBC, CMP, and lipid panel.

V — Validation:

Large human RCT database for transplant rejection and autoimmune conditions (Tier 1 for those indications).
Emerging longevity RCT data: the PEARL trial and related studies in older adults show promising results for immune senescence, physical function, and inflammatory markers.
No large-scale, long-duration RCTs specifically for longevity in healthy older adults (as of 2025).
Mechanistic evidence (mTOR inhibition, senescent cell reduction) is strong in animals and early humans.
Score: 0.62 (Tier 3 — small human RCTs with strong mechanistic support and emerging outcome data; dose and population gaps apply)

I — Impact:

Expected impacts: reduced inflammatory markers (IL-6, CRP), improved immune senescence markers, improved physical function (gait speed, grip strength), possible HRV improvement via inflammation reduction.
Effect sizes in PEARL-like studies: CRP reduction 20–35%, IL-6 reduction 15–25%, 6-minute walk improvement 10–15%.
Breadth: multi-system (immune, inflammation, physical function, potentially sleep quality).
User baseline-dependent: large impact for users with elevated inflammation (hs-CRP > 2.0), modest for already-optimized users.
Score: 0.72 (high end of moderate range — multi-system, clinically meaningful effect sizes, contextualized to elevated-inflammation user)

T — Traceability:

Direct Apple Watch signals: mTOR inhibition is not directly measurable. However:
- Chronic inflammation reduction → possible RHR reduction (2–5 bpm) at 4–8 weeks
- Improved sleep architecture → Apple Watch sleep stage tracking may show increased deep sleep percentage
- Improved recovery → HRV trends may improve during off-doses
- VO2max: indirect potential improvement via physical function gains
Signal requires 6–10 weeks to emerge above baseline noise.
Score: 0.40 (detectable but weak, slow-appearing, single to two-metric signal with moderate noise)
No kill-switch — T = 0.40 > 0.15.

A — Accessibility:

Requires physician prescription (non-trivial to find a rapamycin-prescribing physician)
Cost $50–100/month (significant but not prohibitive)
Requires baseline labs + quarterly CBC/CMP monitoring
Off-label; not FDA-approved for longevity
Compounding pharmacy required in most jurisdictions
Score: 0.26 (specialty, prescription, monitoring required, moderate cost)

L — Longevity Signal:

mTOR inhibition is one of the most evolutionarily conserved longevity mechanisms (yeast, C. elegans, mice, humans).
Human data: 1–2 year studies in older adults show sustained immune and inflammatory benefits. No major safety signals at 5mg/week (below immunosuppressant doses).
Unknowns: 10+ year outcomes in healthy adults unavailable. Chronic low-dose mTOR inhibition may have unexpected metabolic effects.
Score: 0.48 (promising mechanistic and short-to-medium-term human data; long-term outcomes in healthy populations unknown)

Aggregation:

V = 0.62, I = 0.72, T = 0.40, A = 0.26, L = 0.48

Product = 0.62 × 0.72 × 0.40 × 0.26 × 0.48 = 0.0223
5th root of 0.0223 ≈ 0.47

No axis < 0.15 → No kill-switch

Confidence band:
  Axis low bounds:  V=0.50, I=0.58, T=0.24, A=0.14, L=0.32
  GM low = (0.50×0.58×0.24×0.14×0.32)^(1/5) = (0.00312)^(1/5) ≈ 0.32

  Axis high bounds:  V=0.74, I=0.84, T=0.56, A=0.40, L=0.62
  GM high = (0.74×0.84×0.56×0.40×0.62)^(1/5) = (0.0863)^(1/5) ≈ 0.61

VITAL Score: 0.47 [0.32 – 0.61]

Grade: C (central estimate 0.47; speculative, with significant gaps in Accessibility and Traceability)

Interpretation: Rapamycin 5mg/week shows meaningful impact and has a defensible evidence base for off-label longevity use in motivated users with physician access. However, the accessibility barriers (prescription, monitoring, cost) and the weak, laggy Apple Watch signal are genuine limitations, and the geometric mean lets those two low axes pull the aggregate down to 0.47. This is a C, not a B, because the coaching platform cannot easily adapt the protocol to biometric feedback and the user is largely on a fixed dosing schedule. It should be considered only for users with established physician relationships who understand the off-label status and explicitly acknowledge the uncertainty.

Example 3: GHK-Cu (Systemic, Subcutaneous)

Intervention: GHK-Cu (glycyl-histidyl-lysine copper complex), 1–2 mg subcutaneous injection, 2–3x/week. Cost ~$80–150/month (research-grade peptide). Requires self-injection training, sharps disposal, and sourcing from a reputable peptide clinic or pharmacy.

V — Validation:

GHK-Cu has long history of use in wound healing and skin research (Tier 2–3 for dermatological outcomes).
In-vitro and animal data for copper-peptide complexes is extensive: collagen stimulation, angiogenesis, anti-inflammatory, neural repair.
Human RCT data for systemic anti-aging: essentially nonexistent. Most human studies are topical (skin) or wound healing-specific.
The leap from topical skin application or local injection to systemic anti-aging is significant and not well-validated.
Score: 0.28 (Tier 4–5 — strong in-vitro/animal, minimal human systemic RCT data; significant population and route mismatch)

I — Impact:

Wound healing: moderate to large impact (Tier 2–3 evidence).
Systemic anti-aging: theoretical and extrapolated. Expected benefits include improved skin elasticity, hair regrowth, reduced inflammation, possible cognitive benefits — all with small effect sizes and high uncertainty.
Watch-accessible impact: none established. Plausible via chronic inflammation reduction (RHR, HRV) over 8–12 weeks, but entirely unvalidated.
Score: 0.30 (small, inconsistent, mostly theoretical; plausible mechanism but no human systemic outcome data)

T — Traceability:

No established Watch-accessible biomarker pathway for GHK-Cu mechanism.
Plausible indirect signals: improved recovery (HRV), reduced inflammation (RHR), possibly improved sleep if skin/healing pathway is active.
These signals are entirely theoretical and have not been documented in the literature or reported in a structured way.
Time to signal: if signals exist, likely 8–12+ weeks minimum.
Score: 0.10 (Watch-invisible; no established pathway; speculative at best)
Kill-switch activated — T = 0.10 < 0.15.

A — Accessibility:

Peptide requires self-injection (moderate to high administration burden)
Cost $80–150/month (moderate to high)
Sourcing: research-grade peptide from clinic or pharmacy; requires some research to find reputable source
No prescription in most US jurisdictions (grey market)
Sharps disposal, injection site rotation required
Score: 0.20 (self-injection, moderate cost, sourcing complexity)

L — Longevity Signal:

GHK-Cu has an interesting mechanism: upregulation of wound healing genes, collagen, and decorin. These are mechanistically relevant to aging tissue decline.
However: long-term human data for systemic use is absent. Duration of benefit, long-term safety (chronic copper accumulation?), and dose-response in aging are all unknown.
Score: 0.18 (theoretical longevity mechanism; no human long-term data; copper accumulation concern unknown)

Aggregation:

V = 0.28, I = 0.30, T = 0.10, A = 0.20, L = 0.18
T < 0.15 → Kill-switch activated → Hard cap at 0.30

VITAL Score: 0.30 [0.22 – 0.32] ⚠️ KILL-SWITCH ACTIVE (Traceability)

Grade: D (kill-switch capped; the uncapped geometric mean is ≈ 0.20)

Interpretation: GHK-Cu has an interesting mechanistic story and is widely used in peptide circles, but it fails the VITAL framework on nearly every axis for systemic anti-aging use. The kill-switch fires on Traceability. This is a compound that should be tracked as the evidence base develops, not recommended as a core or even targeted intervention at this time.

Recommendation: If recommended at all, only in a structured research context with explicit user acknowledgment that the evidence base is Tier 4–5 and the Watch cannot provide adaptive feedback on this compound.

6. Anti-Hallucination Requirements

This section is mandatory for all Vitals worker agents. Non-compliance with these requirements constitutes a critical QA failure.

6.1 Core Principles

Never fabricate an axis score. If you do not have enough information to score an axis, you MUST say so explicitly and use the “insufficient data” floor score for that axis (see below).
Never narrow a confidence band to reflect more certainty than the evidence supports. Wider bands are not penalized. Narrow bands based on speculation are.
Never report a score as more precise than the methodology allows. All scores are continuous estimates with implicit ± uncertainty. Do not report three decimal places (e.g., 0.583) when the evidence only supports two (e.g., 0.58).
Distinguish between “no evidence found” and “evidence shows no effect.” These are different. Report them differently.
Negative evidence must be included in axis scoring. If there are well-designed null studies, negative RCTs, or safety signals, they MUST reduce the relevant axis score and widen the confidence band. Ignoring negative evidence is a critical failure.

6.2 Insufficient Data Floor

When an agent cannot find sufficient evidence to score an axis, the agent MUST apply the insufficient data floor rather than estimating:

Axis	Insufficient Data Floor	Reasoning
V	0.15	Minimum score reflecting that the absence of evidence is not evidence of absence, but does limit confidence
I	0.10	Impact cannot be assumed positive without evidence
T	0.10	Watch-invisible by default when mechanism-to-Watch-signal pathway is unknown
A	Set by most restrictive plausible barrier	Accessibility must be conservatively estimated
L	0.10	Cannot assume compounding benefit without long-term data

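The insufficient data floor applies the maximum confidence band width (± 0.20 or wider) to that axis. Because the I, T, and L floors (0.10) sit below the 0.15 kill-switch threshold, applying any of them also triggers the kill-switch cap of 0.30. A score driven by insufficient data floors on multiple axes will produce a very wide confidence band, which is correct and honest.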
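One way an agent might apply the floors is sketched below. Representing an unscoreable axis as `None`, the function name `apply_floors`, and the `accessibility_floor` parameter are assumptions for illustration; the floor values come from the table above.

```python
# Floors from the table above. A has no fixed floor: the caller must supply
# the score of the most restrictive plausible barrier.
INSUFFICIENT_DATA_FLOORS = {"V": 0.15, "I": 0.10, "T": 0.10, "L": 0.10}

def apply_floors(axes, accessibility_floor=None):
    """Replace None scores with their floor; return (filled_axes, floored_names)."""
    filled, floored = {}, []
    for name in "VITAL":  # iterates V, I, T, A, L
        value = axes.get(name)
        if value is None:
            if name == "A":
                if accessibility_floor is None:
                    raise ValueError("A has no fixed floor; supply a conservative estimate")
                value = accessibility_floor
            else:
                value = INSUFFICIENT_DATA_FLOORS[name]
            floored.append(name)
        filled[name] = value
    return filled, floored

# Hypothetical intervention with no usable Impact or Traceability evidence.
filled, floored = apply_floors({"V": 0.60, "I": None, "T": None, "A": 0.90, "L": 0.50})
print(filled, floored)
```

Returning the list of floored axes lets the report state explicitly which scores are floors rather than estimates, as the core principles require.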

6.3 Fabricated Precision — Explicitly Prohibited

The following practices are explicitly prohibited and constitute a critical QA failure:

Reporting a central estimate with more than 2 significant figures when the evidence only supports one
Assigning a confidence band narrower than the methodology allows because “the evidence seems consistent”
Assuming that the absence of negative studies means positive evidence
Citing a study’s existence without reading the effect size, confidence interval, and sample size
Scoring an axis at 0.60+ when the primary evidence consists of a single underpowered RCT
Inferring a Watch-accessible signal pathway from mechanism alone without any empirical human wearable data

6.4 Handling Conflicting Evidence

When studies conflict (some show benefit, some show null):

Weight by study quality (RCT > observational > case series)
Weight by sample size and statistical power
Weight by pre-registration status (pre-registered RCTs trump post-hoc analyses)
Report the conflict explicitly in the axis narrative
Assign the score closer to the weighted average of the high-quality study outcomes, not the most optimistic result
Widen the confidence band by at least ± 0.05 to reflect the heterogeneity
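The weighting steps above could be combined as in the following sketch. The framework specifies only the ordering of study designs, so every numeric weight here (`DESIGN_WEIGHT`, `PREREGISTRATION_FACTOR`, the square-root sample-size term) is a hypothetical choice, as are the `Study` fields and the three example studies.

```python
import math
from dataclasses import dataclass

# Hypothetical weights: the text gives only the ordering
# RCT > observational > case series, not numeric values.
DESIGN_WEIGHT = {"rct": 3.0, "observational": 2.0, "case_series": 1.0}
PREREGISTRATION_FACTOR = 1.5  # hypothetical boost for pre-registered studies

@dataclass
class Study:
    design: str
    n: int
    preregistered: bool
    implied_score: float  # axis score this study alone would support

def study_weight(study):
    prereg = PREREGISTRATION_FACTOR if study.preregistered else 1.0
    # Square root of n is a stand-in for statistical power (assumption).
    return DESIGN_WEIGHT[study.design] * math.sqrt(study.n) * prereg

def weighted_axis_score(studies):
    total = sum(study_weight(s) for s in studies)
    return sum(study_weight(s) * s.implied_score for s in studies) / total

# Hypothetical conflicting evidence base.
studies = [
    Study("rct", 400, True, 0.40),
    Study("observational", 5000, False, 0.70),
    Study("case_series", 20, False, 0.90),
]
score = weighted_axis_score(studies)
print(round(score, 2))
```

Whatever weights are chosen, the result stays between the least and most optimistic study, and the optimistic case series contributes very little.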

7. Wearable Biometric Worker — Traceability Scoring Rubric

The Wearable Biometric Worker (WBW) has specific responsibilities for the T axis of every intervention. This section provides the canonical rubric for WBW scoring decisions.

7.1 WBW Responsibilities

The WBW does NOT score all five axes. It is responsible for producing a T-axis sub-report that feeds into the full VITAL score. The sub-report must include:

Primary Watch-accessible biomarkers affected by the intervention (identified from literature and mechanism)
Secondary biomarkers that may be indirectly affected
Estimated time to signal emergence (in weeks of consistent monitoring)
Signal-to-noise ratio estimate (based on user’s baseline variability)
A T-axis sub-score with confidence bounds following the T-axis rubric
Recommended monitoring protocol: which metrics to track, at what frequency, and what the success criteria are
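The sub-report contents listed above map naturally onto a small record type. The sketch below is one possible shape, not a defined schema: the class name, field names, and units are assumptions. The T score and bounds in the example are the rapamycin values from Example 2; the time-to-signal and signal-to-noise values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TAxisSubReport:
    primary_biomarkers: list      # Watch-accessible metrics directly affected
    secondary_biomarkers: list    # metrics that may move indirectly
    weeks_to_signal: float        # estimated weeks of consistent monitoring
    signal_to_noise: float        # estimate relative to the user's baseline variability
    t_score: float                # T-axis central estimate
    t_low: float                  # T-axis low bound
    t_high: float                 # T-axis high bound
    monitoring_protocol: str      # metrics, frequency, and success criteria

    def __post_init__(self):
        # Bounds must bracket the central estimate and stay on the 0.01-1.0 scale.
        if not (0.01 <= self.t_low <= self.t_score <= self.t_high <= 1.0):
            raise ValueError("expected 0.01 <= t_low <= t_score <= t_high <= 1.0")

report = TAxisSubReport(
    primary_biomarkers=["resting heart rate"],
    secondary_biomarkers=["HRV (SDNN)", "deep sleep percentage"],
    weeks_to_signal=8,
    signal_to_noise=1.2,
    t_score=0.40,
    t_low=0.24,
    t_high=0.56,
    monitoring_protocol="Daily RHR and HRV; weekly averages; review at 8 weeks",
)
print(report.t_score, report.t_low, report.t_high)
```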

7.2 T-Axis Sub-Scoring Rubric (WBW-Specific)

WBW Signal Profile	T Sub-Score	Rationale
3+ Watch metrics show coherent, consistent directional change within 2–4 weeks	0.85–1.0	Strong, rapid, multi-metric signal — coaching is highly adaptive
2 Watch metrics show consistent signal within 4–6 weeks; 1 secondary metric shows plausible trend	0.65–0.79	Good signal; coaching can adapt on the primary metrics
1 Watch metric shows reliable signal within 6–8 weeks; secondary metrics show possible but inconsistent trends	0.45–0.64	Moderate signal; coaching is limited but possible with patience
1 Watch metric shows weak or inconsistent signal at 8–12+ weeks; high user variability	0.25–0.44	Weak signal; coaching is difficult; user should expect slow feedback
No plausible Watch-accessible pathway; effect is lab/clinical only	0.01–0.19	Watch-invisible; biometric coaching is not possible

7.3 WBW Monitoring Protocol Requirements

For each T-axis sub-report, the WBW MUST specify:

Baseline establishment period: Minimum 2 weeks of continuous Watch wear before the intervention starts. This establishes the user’s personal noise floor.

Monitoring metrics: Which specific Apple Watch / HealthKit fields to track:

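HKQuantityTypeIdentifierHeartRateVariabilitySDNN (HRV)
HKQuantityTypeIdentifierRestingHeartRate (RHR)
HKCategoryTypeIdentifierSleepAnalysis (sleep stages, efficiency)
HKQuantityTypeIdentifierVO2Max (VO2max estimate)
HKQuantityTypeIdentifierRespiratoryRate (respiratory rate trends)
Activity metrics: HKQuantityTypeIdentifierActiveEnergyBurned, HKQuantityTypeIdentifierStepCount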

Success criteria: What constitutes a “detected signal” for this intervention:

Example: “HRV increases ≥ 8ms above baseline SD within 4-week post-dose window, sustained across 3+ weekly averages”
Example: “RHR decreases ≥ 3bpm below 7-day rolling average, maintained for 2+ weeks”

Null/negative criteria: What would constitute a failed signal (intervention not producing Watch-detectable effect):

Example: “No HRV change > 5ms above baseline SD after 12 weeks of consistent use”

Confidence band input: The WBW provides the T-axis low and high bounds (not just a point estimate) to the aggregator agent. The band must reflect:

Inter-user variability (some users respond, some don’t)
Measurement noise in the specific Watch metric
Plausibility of the mechanism-to-signal pathway
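Success and null criteria of the kind shown above can be checked mechanically once weekly averages are available. The sketch below tests for a sustained shift from a fixed baseline value. Using a single baseline number as the reference, the function name `sustained_shift`, and the sample readings are all assumptions made for illustration.

```python
def sustained_shift(baseline, weekly_averages, min_delta, min_weeks, direction="down"):
    """True if min_weeks consecutive weekly averages differ from the baseline
    by at least min_delta in the given direction ("down" or "up")."""
    sign = -1.0 if direction == "down" else 1.0
    run = 0
    for value in weekly_averages:
        if sign * (value - baseline) >= min_delta:
            run += 1
            if run >= min_weeks:
                return True
        else:
            run = 0  # the shift must be maintained without interruption
    return False

# Hypothetical RHR weekly averages (bpm) against a 62 bpm baseline:
# criterion "RHR decreases >= 3 bpm, maintained for 2+ weeks".
detected = sustained_shift(62.0, [61.0, 59.0, 58.5, 58.0], min_delta=3.0, min_weeks=2)
interrupted = sustained_shift(62.0, [61.0, 59.0, 61.0, 58.0], min_delta=3.0, min_weeks=2)
print(detected, interrupted)
```

The same function covers the null criterion: if it still returns `False` at the end of the monitoring window, the signal counts as not detected.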

7.4 Interpreting T-Axis Failures

When T < 0.15 (kill-switch activation):

The WBW MUST document whether the failure is due to:
- Biological invisibility: The mechanism has no Watch-accessible downstream pathway (most pharmaceutical interventions targeting intracellular pathways)
- Temporal mismatch: The effect exists but takes > 12 weeks to emerge, beyond practical coaching feedback windows
- Signal-to-noise ratio: The user’s baseline variability is too high to detect the effect reliably
- Metric limitation: Apple Watch does not yet measure the relevant biomarker (e.g., blood cortisol, blood glucose, specific inflammatory cytokines)
- Combination failure: Multiple of the above
The WBW MUST include in the sub-report a recommendation for re-evaluation: under what conditions would future evidence allow re-scoring? (e.g., “If wearable blood glucose monitoring becomes Apple Watch-native, re-evaluate T-axis for GHK-Cu’s metabolic pathway”)

Changelog

Version 1.0 — 2026-03-25

Established by: Vitals Ontology Governor (subagent)
Status: Initial production release of the VITAL Scoring Framework

Changes in v1.0:

Formalized the 5-axis VITAL framework with continuous 0.01–1.0 scoring
Defined geometric mean aggregation and kill-switch floor (cap at 0.30 if any axis < 0.15)
Established confidence band methodology with minimum ± 0.05 width and explicit band width interpretation table
Defined grade mapping (S through F)
Provided three complete worked examples: Vitamin D3+K2, Rapamycin 5mg/week, GHK-Cu systemic
Established anti-hallucination requirements: insufficient data floors, prohibited practices, conflicting evidence handling
Canonized Traceability (T) axis rubric for Wearable Biometric Worker with monitoring protocol requirements and T-axis failure classification taxonomy

This document is the canonical scoring specification for all Vitals research outputs. Any deviations from this framework must be explicitly noted and justified in the relevant research output. Questions of interpretation should be escalated to the Ontology Governor.
