What this is
UN Human Rights Database is a paragraph-level search interface for UN human
rights interpretation. Its primary corpus is Treaty Body General
Comments. CEDAW/CRPD jurisprudence and Special Procedures are separate
expansion streams, clearly marked as previews while those collections
are still under development. The General Comments corpus remains the
priority; preview material is loaded lazily on demand.
The current published build contains 203,302 paragraphs across
4,900 documents, all searchable, filterable and citable down to
the paragraph number.
This is an independent academic project, not affiliated
with or endorsed by the OHCHR or any UN entity. Citations of individual
paragraphs should reference the original UN document signature (e.g.
CRC/C/GC/25 ¶12), not this database.
Dataset snapshot: last synchronised with
OHCHR Treaty Body Database
on 20 May 2026.
4,900 documents · 203,302 paragraphs total —
187 General Comments (all 10 treaty bodies),
4,334 jurisprudence cases (8 bodies: CCPR, CAT, CEDAW, CESCR, CRC, CRPD, CERD, CED),
379 Special Procedures reports (8 mandates).
The most recent General Comment is
CEDAW/C/GC/30/Add.1 (Women, Peace and Security addendum, 18 February 2026);
the most recent Special Procedures report is
A/HRC/61/42 (SR Torture, Charter of Rights of Victims and Survivors of Torture, May 2026).
Daily link checks run on
GitHub Actions.
See change log
for previous snapshots.
Treaty Body General Comments
This is the completed and most authoritative part of the app:
187 General Comments issued by the ten UN human rights
treaty bodies (CEDAW, CCPR, CERD, CESCR, CRC, CRPD, CMW, CAT, CAT-OP, CED).
The collection is exhaustive as of build date, sourced primarily from
tbinternet.ohchr.org,
and annotated for concerned groups such as children, women, migrants
and persons with disabilities.
Jurisprudence
4,334 individual-communication cases ·
152,794 paragraphs from all 8 OP / individual-complaint
mechanisms — CCPR (2,862), CAT (835), CEDAW (172), CESCR (155),
CRC (150), CRPD (95), CERD (58), CED (7). Published as separate
shards under docs/jur/ so the main General
Comments interface stays lightweight; the shard for a case is
fetched lazily when the user opens it. CCPR and CAT include
OCR-recovered scanned legacy cases (some from the early 1990s with
no embedded text layer); the reader pane displays a provenance
banner on every OCR-recovered document with a one-click
report-issue link to the GitHub tracker.
Sources & cross-checks. Primary
source: the OHCHR bulk jurisprudence dump (catalog + raw English
DOCX/PDF). For partial cross-checking and methodological
inspiration in selected areas — particularly metadata
normalisation and substantive-article extraction for CCPR cases —
we also consulted external curated databases such as the
CCPR Centre's
decisions digest and the
University of Minnesota Human Rights Library.
Coverage vs OHCHR JURIS — 13 May 2026
We match
3,950 of the 4,015 unique case symbols
(98.4 %) listed in the
OHCHR JURIS authority catalogue.
The 65 catalogue entries we do not yet cover fall into a
few well-understood buckets. When we audited the missing
list, 43 had no PDF available in any language
on OHCHR's own download surface (the catalogue entry carries an
empty downloads dict), so they are not recoverable from
OHCHR. The rest split across:
- 1990s scanned PDFs without text layer — some
recoverable via further OCR work, some are image-only
legacy files where no fallback source returns text.
- Non-English-only sources (~21 cases) where
the case is publicly available but only in French,
Spanish or Russian — flagged for translation/cross-check
rather than ingest, to keep the corpus monolingual.
- Joint communications the catalogue lists as
separate symbols; we merge them under a single document,
so the headline miss-count slightly overstates unique cases.
Per-treaty coverage:
CRPD 100 % ·
CED 100 % ·
CRC 99.3 % ·
CAT 98.8 % ·
CCPR 98.4 % ·
CESCR 98.4 % ·
CEDAW 98.0 % ·
CERD 75.4 %.
The full list of 65 missing decisions —
missing-cases.csv on GitHub
— has the OHCHR JURIS catalogue URL for every entry plus
the source-recovery notes
(methodology & sources tried).
For research help on a specific case, contact
l.szoszkiewicz@amu.edu.pl.
Special Procedures
379 Special Procedures reports from eight mandates are
included as a curated preview: SR Freedom of Religion or Belief (88),
SR Torture (69), SR Health (50), SR Freedom of Expression (45),
SR Indigenous Peoples (44), SR Education (43), SR Disability (22)
and SR Privacy (18). The collection is under active development;
further mandates are added through a
generic ingestion pipeline
and these results should be read as exploratory rather than exhaustive.
Methodology — pipeline & quality
Each document is downloaded as PDF, converted to Markdown, then to
structured JSON paragraph-by-paragraph. Paragraphs are annotated with
concerned-group labels from a fixed 19-category taxonomy
(children, women/girls, persons with disabilities, persons deprived of
their liberty, indigenous peoples, LGBTI+, and so on). The corpus is
rebuilt deterministically by
build_corpus.py;
search runs entirely in the browser via FlexSearch with stemming, and
the index is cached in IndexedDB after first load.
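The Markdown-to-JSON step can be pictured with a minimal sketch. It assumes that a numbered paragraph starts with an Arabic numeral followed by a full stop at the beginning of a line; the function name, the record fields and that numbering rule are illustrative assumptions and are not taken from build_corpus.py.

```python
import json
import re

# Assumption: a numbered paragraph opens with "12. " at the start of a line.
PARA_START = re.compile(r"^(\d+)\.\s+(.*)$")


def split_paragraphs(markdown: str, signature: str) -> list[dict]:
    """Split Markdown into numbered paragraph records (hypothetical schema)."""
    records: list[dict] = []
    for raw_line in markdown.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and Markdown headings
        match = PARA_START.match(line)
        if match:
            records.append({
                "signature": signature,
                "number": int(match.group(1)),
                "text": match.group(2),
            })
        elif records:
            # Continuation line: append to the paragraph opened last.
            records[-1]["text"] += " " + line
    return records


sample = "# I. Introduction\n1. First paragraph\ncontinues here.\n\n2. Second paragraph."
paragraphs = split_paragraphs(sample, "CRC/C/GC/25")
print(json.dumps(paragraphs, indent=2, ensure_ascii=False))
```

Because the function has no hidden state and preserves input order, running it twice on the same Markdown yields the same records, which is the property a deterministic rebuild relies on.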
Ask tab pipeline (beta)
extractive RAG · hybrid retrieval + LLM-judge · verbatim only, no hallucination
The Ask tab is an experimental retrieval layer over
the General Comments corpus. It returns verbatim paragraphs
from the Committees — no paraphrase, no AI-generated answer text. The
LLM only operates on the query side (rewriting the question into
doctrinal language) and as a final-stage ranker (deciding which
of 50 retrieved paragraphs most directly answer the question). This is
an extractive RAG — fundamentally different from the chatbot
style that synthesises answers and may hallucinate citations.
Pipeline
- Gemini Flash-Lite HyDE rewrite of the question
- BM25 (SQLite FTS5) + Voyage law-2 dense embeddings, retrieved in parallel
- Reciprocal Rank Fusion (k = 60)
- Voyage rerank-2 cross-encoder against the original question
- Gemini Flash-Lite second-stage judge over top-10 candidates, augmented with the actual treaty article text (all 9 core treaties + 9 optional protocols)
- Final ordered top-10
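The fusion step in the list above can be sketched as follows. This is a generic Reciprocal Rank Fusion with k = 60, not the project's own code; the paragraph IDs are made up, and the BM25 and dense retrievers are represented only by their ranked output lists.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists; each hit contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical paragraph IDs returned by the two retrievers.
bm25_hits = ["p3", "p1", "p7"]
dense_hits = ["p1", "p9", "p3"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

A paragraph ranked well by both retrievers (p1, p3) rises above one that only a single retriever found, which is why rank fusion suits a hybrid of lexical and dense search whose raw scores are not comparable.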
Quality gates
30-question commentary-anchored benchmark
(Saul on ICESCR, Joseph & Castan on ICCPR;
16/30 carry paragraph-level expected IDs).
- docHit@5: 100% — every expected document in the first 5 results
- paraHit@5: 81.2% — canonical paragraph in the first 5 (3 misses are off-by-one in the correct GC)
- paraHit@10: 87.5%
- answerScore@1: 0.906 ± 0.010 — Gemini-as-judge faithfulness (yes/partial/no → 1/0.5/0, 3 runs × 30 q)
- Latency: median ~3.9 s, p95 ~6.1 s end-to-end
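The hit-rate and answer-score figures above can be reproduced in spirit with a short sketch. Two assumptions are made here and are not confirmed by the source: a question counts as a hit when at least one expected ID appears in the first k results, and the ± value is the standard deviation of the per-run means. The toy data is constructed so that 13 of 16 questions hit at k = 5 and 14 of 16 at k = 10, which is consistent with the reported 81.2 % and 87.5 %.

```python
from statistics import mean, stdev


def hit_at_k(ranked_ids: list[str], expected_ids: set[str], k: int) -> bool:
    """True if any expected ID appears among the first k results (assumed rule)."""
    return any(result in expected_ids for result in ranked_ids[:k])


def hit_rate(results: list[tuple[list[str], set[str]]], k: int) -> float:
    """Share of questions with a hit in the top k."""
    return mean(1.0 if hit_at_k(ranked, expected, k) else 0.0
                for ranked, expected in results)


VERDICT_SCORE = {"yes": 1.0, "partial": 0.5, "no": 0.0}


def answer_score(runs: list[list[str]]) -> tuple[float, float]:
    """Mean and standard deviation of per-run mean verdict scores."""
    per_run = [mean(VERDICT_SCORE[verdict] for verdict in run) for run in runs]
    spread = stdev(per_run) if len(per_run) > 1 else 0.0
    return mean(per_run), spread


# Toy benchmark: 13 hits at rank 1, one hit at rank 6, two misses.
toy = ([(["a", "b"], {"a"})] * 13
       + [(["x"] * 5 + ["a"], {"a"})]
       + [(["x"] * 10, {"a"})] * 2)
print(hit_rate(toy, 5), hit_rate(toy, 10))
score, spread = answer_score([["yes", "partial"], ["yes", "yes"], ["yes", "no"]])
print(score, spread)
```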
Known limitations
Cross-treaty references inside GCs (a CERD/CRC/CEDAW GC mentioning
"the Covenant" or "article 4(2)" meaning ICCPR) are handled by a
defensive fallback: "article N of that Protocol" is left
as plain text rather than wrong-linked. Full disambiguation needs
the corpus pre-tagged with explicit cited_articles
metadata, which is deferred to a later pipeline iteration. Some lay
queries surface adjacent-but-not-canonical paragraphs from the right
document: usable, but sometimes confusing. A per-IP rate limit
(15 queries per minute) caps token costs from automated traffic;
researchers iterating heavily may briefly hit it.
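A per-IP limit of this kind is commonly implemented as a sliding window. The sketch below is one such implementation under stated assumptions: the class name, the in-memory storage and the injectable clock are illustrative, and the source does not say which algorithm the Ask tab backend actually uses.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per window_seconds for each client IP."""

    def __init__(self, max_requests: int = 15, window_seconds: float = 60.0,
                 clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the behaviour can be tested
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, ip: str) -> bool:
        now = self.clock()
        hits = self._hits[ip]
        # Drop timestamps that have left the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


# Demo with a fake clock: 16 requests from one documentation-range IP.
fake_now = [0.0]
limiter = SlidingWindowLimiter(clock=lambda: fake_now[0])
decisions = [limiter.allow("203.0.113.7") for _ in range(16)]
print(decisions.count(True), decisions.count(False))
```

The sixteenth request inside the same minute is refused, while other IP addresses are unaffected and the first client is admitted again once its oldest requests age out of the window.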
Grounding
Treaty-text augmentation, the LLM-judge stage, and the rerank
cross-encoder all operate on actual paragraph text, not on
summaries. Every visible citation is traceable:
signature + GC number + paragraph number →
click Open dossier for full context. The Ask tab
augments — it does not replace —
paragraph-level Search,
which remains the source of truth for exact wording or Boolean
queries.
Jurisprudence OCR
scanned-PDF recovery · 5,359 ¶ OCR'd · 1,388 cleanup fixes · please report residual errors
170 older CCPR cases (mostly 1980s–90s scanned PDFs) come into
the corpus through Tesseract via a quality-first sidecar pipeline.
5,359 paragraphs across these cases are
OCR-recovered; the remaining 106,688 jurisprudence paragraphs
come from native text-layer PDFs or DOCX. Published paragraphs
keep official numbering where possible; individual opinions are
namespaced as OP1-,
OP2-, and real case appendices as
A1-. Generated IDs and OCR provenance
stay exposed in the export rather than silently merging into
official text.
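A consumer of the export can separate official numbering from generated IDs by inspecting the prefix. The sketch below assumes that whatever follows the OP1- or A1- prefix is the local paragraph identifier; the example IDs, the function name and the returned field names are hypothetical.

```python
import re

# Prefixes described above: OP<n>- for individual opinions, A<n>- for appendices.
NAMESPACE = re.compile(r"^(OP|A)(\d+)-(.+)$")


def classify_paragraph_id(paragraph_id: str) -> dict:
    """Tell official paragraph numbers apart from namespaced, generated IDs."""
    match = NAMESPACE.match(paragraph_id)
    if match is None:
        return {"kind": "official", "index": None, "local_id": paragraph_id}
    kind = "individual_opinion" if match.group(1) == "OP" else "appendix"
    return {"kind": kind, "index": int(match.group(2)), "local_id": match.group(3)}


for example in ["7.2", "OP2-4", "A1-1"]:  # hypothetical IDs
    print(example, classify_paragraph_id(example))
```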
Cleanup pipeline (May 2026): a
five-stage post-OCR sweep brought 1,388 fixes across
~422 paragraphs. Stages:
- (1) Mechanical footnote-marker and section-header repairs (436).
- (2) Tesseract TSV word-row leak strip: 575 cases where coordinate
metadata had bled into paragraph text (e.g.
August 19825 1 6 1 4 5 1049 1995 130 42 49.939957 from
→ August 1982 from).
- (3) Hand-vetted mangled-word audit on 229 prefilter-flagged ¶:
Spanish-name accent recovery (Sanjuán, Gaitán, Velásquez Rodríguez),
sentence-start I mis-OCR'd as 1, and well-known
case-name fixes (47).
- (4) About 80 letter-substitution rules covering systematic
Tesseract shape confusions
(I↔l, u↔n, c↔e, h↔b,
m↔in, g↔s): 289 fixes.
- (5) End-to-end spot-read of 10 heavily-fixed paragraphs, which
caught one regression (beae. iciary wrongly normalised to
judiciary; the correct reading was beneficiary).
Every rule is verified in context; nothing was applied where the
source word could be a real English/Spanish/French token.
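The TSV word-row leak in stage (2) has a recognisable shape: a word-level row in Tesseract's TSV output carries eleven numeric fields (level, page, block, paragraph, line, word, left, top, width, height, confidence) before the text. The sketch below strips that shape with a single regular expression. It is an illustration of the idea, not the project's rule set; in particular, treating a digit 5 glued to the preceding word as the start of the leak is an assumption that a real pipeline would have to verify in context.

```python
import re

# Word-level TSV row: level 5, nine integer fields, then a decimal confidence.
TSV_WORD_ROW = re.compile(r"5(?: \d+){9} -?\d+\.\d+")


def strip_tsv_leak(text: str) -> str:
    """Remove leaked Tesseract TSV coordinate rows from paragraph text."""
    cleaned = TSV_WORD_ROW.sub("", text)
    # Collapse any double spaces left behind by the removal.
    return re.sub(r" {2,}", " ", cleaned).strip()


leaked = "August 19825 1 6 1 4 5 1049 1995 130 42 49.939957 from"
print(strip_tsv_leak(leaked))
```

Requiring a decimal confidence value at the end keeps the pattern conservative: ordinary prose containing a handful of integers does not match.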
Residual errors: non-systematic
single-character glitches remain (e.g. cn
for on, i:to for into).
The reader pane carries an OCR provenance banner
on every scanned-PDF case with a one-click
report-issue link.
If you spot an OCR error in your research, please file it; the cleanup
pipeline is open to community contributions and audit.
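The guard described for the letter-substitution rules (never touch a word that could be a real token) can be sketched as a lexicon check. Everything concrete here is an assumption: the tiny lexicon, the subset of single-character confusions, and the choice to skip ambiguous repairs. The sketch works on lowercase tokens only.

```python
# Hypothetical mini-lexicon; a real run would load full EN/ES/FR word lists.
LEXICON = {"the", "committee", "claim", "chain", "bear", "hear", "near"}

# A few single-character shape confusions, as (misread, intended) pairs.
CONFUSIONS = [("n", "u"), ("u", "n"), ("e", "c"), ("c", "e"), ("b", "h"), ("h", "b")]


def repair_word(word: str) -> str:
    """Apply one substitution only if it turns a non-word into a known word."""
    lower = word.lower()
    if lower in LEXICON:
        return word  # guard: a real token is never rewritten
    candidates = set()
    for misread, intended in CONFUSIONS:
        for index, char in enumerate(lower):
            if char == misread:
                candidate = lower[:index] + intended + lower[index + 1:]
                if candidate in LEXICON:
                    candidates.add(candidate)
    # Only an unambiguous repair is accepted; otherwise the word is left alone.
    return candidates.pop() if len(candidates) == 1 else word


for token in ["tbe", "bear", "elaim", "zzz"]:
    print(token, "->", repair_word(token))
```

Here "bear" is left untouched even though "hear" is one substitution away, because it is already a valid token; residual glitches that no rule covers simply pass through for human review.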
Two annotation layers
manual labels first, automation only fills gaps · 74 % GC, 37 % SP coverage
Labels applied manually before this pass are preserved verbatim;
the automated pipeline only fills paragraphs left unlabelled.
Pipeline labels come from human-readable regular expressions
matched against paragraph text — every match is deterministic
and reproducible. After six iterations, the corpus reaches
74 % coverage on General Comments
(5,283 of 7,105 paragraphs labelled) and
37 % on Special Procedures
(about 6,990 of 18,751 paragraphs labelled). The 26 % of GC
paragraphs that remain unlabelled are mostly procedural framework
text or discuss groups outside the taxonomy (older persons,
persons of African descent, religious minorities).
Conservatism over coverage
patterns we removed · false-positive cleanups · 81 Poverty fixes
Patterns target unambiguous markers — full terms and fixed phrases
such as indigent,
free, prior and informed consent,
corporal punishment — rather than generic
synonyms. During quality review we removed several patterns that
produced false positives: social security
and social protection (wrongly mapped
to Poverty), forced displacement (moved
from Refugees to Armed conflict), marginalized
(removed from Poverty in SP because it overwhelmingly described
religious minorities in freedom-of-religion reports), and
alien (too ambiguous for Non-citizens).
81 false-positive Poverty labels were cleaned out in this pass;
the largest precision-and-recall gains are in Persons deprived
of their liberty (+271) and Persons with disabilities (+187).
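The two-layer rule (manual labels preserved, regular expressions filling only unlabelled paragraphs) can be sketched in a few lines. The three phrases are the ones quoted above, but their mapping to labels, the record fields and the function names are illustrative assumptions rather than the project's actual pattern file.

```python
import re

# Hypothetical phrase-to-label mapping built from the fixed phrases quoted above.
PATTERNS = {
    "Children": [re.compile(r"\bcorporal punishment\b", re.IGNORECASE)],
    "Indigenous peoples": [re.compile(r"\bfree, prior and informed consent\b", re.IGNORECASE)],
    "Poverty": [re.compile(r"\bindigent\b", re.IGNORECASE)],
}


def annotate(paragraphs: list[dict]) -> list[dict]:
    """Fill labels by regex only where no manual label exists."""
    for para in paragraphs:
        if para.get("labels"):
            para.setdefault("label_source", "manual")
            continue  # manual labels are preserved verbatim
        found = sorted(label for label, patterns in PATTERNS.items()
                       if any(p.search(para["text"]) for p in patterns))
        para["labels"] = found
        para["label_source"] = "regex" if found else None
    return paragraphs


def coverage(paragraphs: list[dict]) -> float:
    """Share of paragraphs carrying at least one label."""
    return sum(1 for p in paragraphs if p["labels"]) / len(paragraphs)


corpus = annotate([
    {"text": "States should prohibit corporal punishment in all settings.", "labels": []},
    {"text": "Legal aid must be available to indigent defendants.",
     "labels": ["Persons deprived of their liberty"]},
    {"text": "The Committee adopted its rules of procedure.", "labels": []},
])
print([p["labels"] for p in corpus], round(coverage(corpus), 2))
```

The second paragraph keeps its manual label and does not pick up a regex label, and the procedural third paragraph stays unlabelled, mirroring why coverage stops short of 100 %.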
CESCR jurisprudence coverage
147 of 247 catalog rows have full text · the 100 missing are bundled mass-discontinuances OHCHR never published as standalone documents
The OHCHR Treaty Body Database lists 247 unique
E/C.12/<sess>/D/<n>/<yr>
symbols for CESCR Optional Protocol decisions
(sessions 55–79, 2014–2025). 147 are ingested with full
text via the docx-first pipeline (native DOCX from
documents.un.org; pre-2015 OLE
.doc converted via LibreOffice
so the <w:footnoteReference>
marker placement is preserved). The remaining
100 are catalog-only — every UN endpoint
returns 404, and OHCHR's own download surface answers
"Sorry there is no files available".
The pattern is structural rather than a pipeline gap:
88 of the 100 are bundled discontinuance
decisions where the Committee adopts a single
transmittal letter closing six to fourteen withdrawn
communications at once. Only the symbol appears in the
OHCHR catalog; no standalone document is published. The
remaining 12 are catalog rows with truncated titles
including one verifiable typo (E/C.12/77/D/82/018,
year missing the leading 2). Coverage is therefore
59.5 % full text (147 of 247), 40.5 % catalog-only.
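A symbol of the form E/C.12/<sess>/D/<n>/<yr> can be checked mechanically, which is one way a typo such as a three-digit year surfaces. The sketch below is an assumption about how such a check could look; the function name, the returned fields and the plausibility rule (four digits starting with 20) are not taken from the project.

```python
import re

SYMBOL = re.compile(r"^E/C\.12/(?P<session>\d+)/D/(?P<number>\d+)/(?P<year>\d+)$")


def parse_cescr_symbol(symbol: str):
    """Parse a CESCR decision symbol; return None if the shape does not match."""
    match = SYMBOL.match(symbol)
    if match is None:
        return None
    year = match.group("year")
    return {
        "session": int(match.group("session")),
        "number": int(match.group("number")),
        "year": year,  # kept as a string so a truncated year stays visible
        # Assumed rule: registration years are four digits in the 2000s.
        "year_is_plausible": len(year) == 4 and year.startswith("20"),
    }


print(parse_cescr_symbol("E/C.12/77/D/82/018"))
```

The catalogue entry quoted above parses structurally but is flagged because its year field has only three digits.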
Role of generative AI
no LLM-generated labels · regex audit, byte-identical reproducibility
Claude Sonnet 4.6 (Anthropic) was used as a coding assistant —
proposing candidate patterns, reading samples of newly-applied
labels to flag probable false positives, and writing the audit
scripts. No paragraph label in the published corpus was
generated by an LLM as free-form output. Every label is
the result of a regular expression that a human can read, audit
and modify; re-running the pipeline on the same input produces
byte-identical output.
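Byte-identical reproducibility can be verified by hashing a canonical serialisation of two builds. The sketch below shows one generic way to do that with SHA-256; the record layout and the ID format are hypothetical, and the source does not state how the project itself performs this check.

```python
import hashlib
import json


def canonical_bytes(records: list[dict]) -> bytes:
    """Serialise records with sorted keys and fixed separators."""
    text = json.dumps(records, sort_keys=True, ensure_ascii=False,
                      separators=(",", ":"))
    return text.encode("utf-8")


def digest(records: list[dict]) -> str:
    """SHA-256 of the canonical serialisation."""
    return hashlib.sha256(canonical_bytes(records)).hexdigest()


# Two hypothetical builds with the same content but different key order.
build_a = [{"id": "CRC/C/GC/25:12", "labels": ["Children"]}]
build_b = [{"labels": ["Children"], "id": "CRC/C/GC/25:12"}]
print(digest(build_a) == digest(build_b))
```

Any change to a label changes the digest, so comparing two digests is enough to confirm or refute that a re-run reproduced the earlier output.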
How to cite
Suggested citation: Szoszkiewicz, Ł., & Kowalska, Z. (2026). The
UN Human Rights Database — A paragraph-level search interface for UN Treaty
Body General Comments.
Citations of individual paragraphs should reference the original UN
document by its signature (e.g. CRC/C/GC/25 ¶12),
not this database.
Open data — Hugging Face
The General Comments corpus is published as a paragraph-level
dataset on Hugging Face:
lszoszk/treaty-bodies-general-comments.
Every row is one paragraph from a Treaty Body General Comment or
General Recommendation, carrying its section path, footnotes (with
cross-reference resolutions), preamble flag, concerned-group labels,
and — new in the v3 release — cited_articles,
per-paragraph treaty-article references resolved to a specific
convention. 187 documents · 7,216 paragraphs · parquet, loadable
directly with the datasets library:
from datasets import load_dataset

# Load the train split of the General Comments corpus from the Hugging Face Hub
ds = load_dataset(
    "lszoszk/treaty-bodies-general-comments",
    split="train",
)
The dataset (curation, segmentation, annotation) is released under
CC BY-NC-SA 4.0; the jurisprudence and Special Procedures
collections are not yet part of the Hugging Face package — see the
coverage notes
for their status.
Licence & tech
Software is released under the
GNU Affero General Public License v3.0
— anyone running a modified copy as a hosted service must release
their modifications under the same licence. The
curated dataset (paragraph-level corpus, document
metadata, section annotations, footnote cross-reference resolutions,
concerned-group labels) is licensed under
CC BY-NC-SA 4.0
— academic and non-commercial reuse is welcome with attribution;
commercial use requires permission. The underlying UN documents
remain under the United Nations'
content terms.
Built as a static site on GitHub Pages with vanilla JavaScript,
FlexSearch and SheetJS. Set in
Spectral, Source Serif 4 and JetBrains Mono.
Contact
Created by Łukasz Szoszkiewicz and Zuzanna Kowalska. Contact:
l.szoszkiewicz@amu.edu.pl ·
source on GitHub.
File issues, suggestions, missing documents and corrections through
the GitHub repository.