Building Accurate Polygenic Risk Scores from Consumer DNA Data
TL;DR: We scored 3,550 disease risk models from a consumer DNA file. We spent weeks trying to make the scores “better” — Bayesian weight recomputation, Ridge regression corrections, GPU-accelerated validation. Every improvement made things worse. The scores were already good. We just needed to be honest about which ones to trust.
Start Simple, See What Breaks
The PGS Catalog publishes 3,550+ peer-reviewed polygenic score models. Each one is a list of genetic variants with effect weights. The math is straightforward: look up a user’s genotype at each variant, multiply by the weight, sum it up, compare against a reference population.
Our first implementation scored every model and produced percentiles using 1000 Genomes Phase 3 as the reference (2,504 samples, 5 ancestry populations). It ran in under 2 minutes.
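The per-model arithmetic fits in a few lines. Here is a minimal sketch with made-up weights, dosages, and reference scores (none of these numbers come from a real model):

```python
import numpy as np

# Hypothetical effect weights from one PGS Catalog model, and a user's
# allele dosages (0, 1, or 2 copies of the effect allele) at those variants.
weights = np.array([0.12, -0.05, 0.30, 0.08])
dosages = np.array([2, 1, 0, 2])

# Raw score: sum of dosage * weight across the model's variants.
raw_score = float(np.dot(dosages, weights))

# Percentile against a reference population's score distribution
# (a made-up sample standing in for the 2,504 1000 Genomes scores).
reference = np.array([0.10, 0.25, 0.31, 0.42, 0.55, 0.60, 0.71])
percentile = 100.0 * np.mean(reference < raw_score)
```

The only subtlety at this stage is choosing the reference distribution; everything else is a dot product.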
Three Compounding Errors
- No ancestry matching. We compared every user against a single EUR reference distribution, regardless of ancestry.
- Allele alignment errors. Some models report weights for the alternate allele, others for the reference allele; we scored some variants backwards.
- Strand-ambiguous inflation. A/T and C/G SNP pairs were complement-flipped instead of matched directly, inflating scores. Our height PRS reached a z-score above 14.
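The second and third errors reduce to one classification decision per variant. A hypothetical helper (`align` is an illustrative name, not our pipeline's API) sketches the decision table, assuming biallelic SNPs:

```python
# Given the model's effect/other alleles and the genotype file's ref/alt
# alleles, decide whether the weight applies directly, must be sign-flipped,
# or is strand-ambiguous (A/T, C/G) and cannot be strand-resolved at all.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def align(effect, other, ref, alt):
    if {effect, other} in ({"A", "T"}, {"C", "G"}):
        return "ambiguous"          # complementing swaps the alleles, so
                                    # strand cannot be resolved from alleles alone
    if (effect, other) == (alt, ref):
        return "match"              # effect allele is the counted alt allele
    if (effect, other) == (ref, alt):
        return "flip_sign"          # effect allele is ref: negate the weight
    comp = (COMPLEMENT[effect], COMPLEMENT[other])
    if comp == (alt, ref):
        return "match"              # same variant reported on the other strand
    if comp == (ref, alt):
        return "flip_sign"
    return "mismatch"               # alleles don't correspond: skip the variant
```

The bug class we hit was applying the complement branch to ambiguous pairs, where it silently swaps effect and other allele and flips the sign of the contribution.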
174GB Database, GPU-Accelerated Scoring
We built a 174GB SQLite database containing all 2.375 billion variant weights across 3,550 models. We constructed a 6.2GB sparse weight matrix for GPU-accelerated batch scoring.
On a Vast.ai build server (256 vCPUs, 503GB RAM, RTX 3060 Ti), we scored all 4,257 QC’d OpenSNP genomes in 7.4 minutes using chunked sparse CSR tensor multiplication via PyTorch.
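The batch-scoring layout can be sketched on CPU with `scipy.sparse`; the production run used the same rows-by-columns structure as a chunked PyTorch sparse CSR tensor on GPU. Shapes and values here are toys:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Rows = models, columns = variants; a cell holds that model's effect weight
# for that variant (zero where the model doesn't include the variant).
rng = np.random.default_rng(0)
n_models, n_variants, n_samples = 4, 10, 3

dense = rng.normal(size=(n_models, n_variants))
dense[rng.random((n_models, n_variants)) < 0.7] = 0.0  # most weights absent
W = csr_matrix(dense)                                   # sparse weight matrix

# Dosage matrix: variants x samples, entries 0/1/2. In production this is
# chunked over sample columns so each chunk fits in GPU memory.
G = rng.integers(0, 3, size=(n_variants, n_samples))

# One sparse-dense multiply produces every model's score for every sample.
scores = W @ G
```

The appeal is that a single multiply replaces 3,550 × N per-model loops, and CSR storage keeps the 2.375 billion weights tractable.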
Validation Results
| Trait | Metric | Ours | UK Biobank benchmark |
|---|---|---|---|
| Height | Pearson r | 0.107 | 0.45–0.50 |
| Red hair | AUC | 0.67 | — |
| Black hair | AUC | 0.63 | — |
| Eye colour | AUC | 0.54 | ~0.95 |
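Both metrics in the table need no special tooling. A sketch with invented phenotype data; the `auc` helper implements the rank identity AUC = P(case score > control score), with ties counted as one half:

```python
import numpy as np

# Hypothetical validation data: computed PRS vs. measured phenotypes.
prs = np.array([0.1, 0.4, 0.35, 0.8, 0.6])

# Pearson r for a quantitative trait (e.g. height in cm):
height = np.array([165.0, 172.0, 170.0, 185.0, 178.0])
r = np.corrcoef(prs, height)[0, 1]

def auc(scores, labels):
    """AUC as the probability a random case outscores a random control."""
    cases = scores[labels == 1]
    controls = scores[labels == 0]
    wins = (cases[:, None] > controls[None, :]).sum()
    ties = (cases[:, None] == controls[None, :]).sum()
    return (wins + 0.5 * ties) / (len(cases) * len(controls))

# Binary trait (e.g. red hair): 1 = case, 0 = control.
labels = np.array([0, 0, 1, 1, 0])
trait_auc = auc(prs, labels)
```

Self-reported OpenSNP phenotypes add noise on top of the model's own error, which is one reason our numbers sit below the UK Biobank benchmarks.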
Bayesian Weight Recomputation
We ran PRS-CSx on 33 quantitative traits using UK Biobank GWAS summary statistics. After 2–3 weeks of continuous GPU compute, the posterior weights did not improve our validation metrics; in some cases they made things worse.
943 Genomes, 95% Failure Rate
Of the 569 PGP genomes we tried to impute, 538 failed: heterogeneous chip formats, Beagle memory exhaustion, bcftools timeouts. Only 84 survived.
Ridge Regression Pulled Everything to the Mean
We trained Ridge regression correction models. Every condition landed in the 40th–60th percentile. Nothing stood out. Nothing was actionable.
Extreme scores are not bugs. They’re the whole point. A 99th percentile Type 1 Diabetes PRS from a model validated at AUC > 0.80 is genuinely meaningful. The problem was never that scores were “too extreme.” It was that we presented low-confidence and high-confidence scores identically.
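Why ridge pulls everything to the mean is visible in the closed form. A toy demonstration (not our correction model): as alpha grows, `w = (X^T X + alpha*I)^-1 X^T y` shrinks toward zero, predictions collapse onto the intercept (the training mean), and every percentile drifts toward the 50th:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(size=100)

def ridge_predict(X, y, alpha):
    """Closed-form ridge fit and in-sample predictions."""
    Xc, yc = X - X.mean(0), y - y.mean()
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return Xc @ w + y.mean()

# Spread of predictions under light vs. extreme regularisation.
spread_small = np.std(ridge_predict(X, y, alpha=0.1))
spread_huge = np.std(ridge_predict(X, y, alpha=1e6))
# As alpha -> infinity the spread -> 0: every prediction is the mean,
# which is exactly a report where every condition reads 40th-60th percentile.
```

With a small validation cohort, cross-validation picks a large alpha, and a large alpha is indistinguishable from reporting the population average for everyone.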
What It Cost
| Item | Detail | Cost |
|---|---|---|
| GPU compute | Vast.ai instances | $200–400 |
| Claude API | 6 parallel domain agents | $100–300 |
| VPS hosting | Production server | $20/mo |
| Domain + SSL | helixsequencing.com | $15/yr |
| Total | | $400–800 |
Where We Are Now
- 3,550 PGS Catalog models with proper allele alignment and ancestry-matched distributions
- Beagle 5.5 imputation expanding ~700K chip SNPs to ~28M variants
- Ancestry detection with per-population model selection
- Raw percentiles preserved — extreme scores kept when model is well-powered
- Zero data retention — all user data deleted after report generation
Lessons Learned
More data doesn’t automatically mean better scores. Adding PRS-CSx weights, Ridge regression corrections, and ensemble methods actively degraded accuracy when our validation cohort was small.
Extreme percentiles are features, not bugs. The mistake is presenting all models with equal confidence.
The validation bottleneck is the real constraint. We have 3,550 models and 2.375 billion variant weights. What we lack is ground truth.
Go slowly. Validate each improvement individually. If it doesn’t measurably improve predictions, it doesn’t ship.