Pick any document from the rail on the left to read its full text. Bookmarks, notes and citations stay attached to specific paragraphs.
Pick a document on the left to read it in full. Notes and bookmarks appear on the right.
Bookmarks, notes, pinned comparisons and saved searches live in your browser's localStorage. Nothing leaves your device. Clear browser data and they're gone.
UN Human Rights Database is a paragraph-level search interface for UN human rights interpretation. Its primary corpus is Treaty Body General Comments. CEDAW/CRPD jurisprudence and Special Procedures are separate expansion streams, clearly marked as previews while those collections are still under development. The General Comments corpus remains the priority; preview material is loaded lazily on demand. In the current published build, 178,659 paragraphs in 4,687 documents are searchable, filterable and citable down to the paragraph number.
This is an independent academic project, not affiliated
with or endorsed by the OHCHR or any UN entity. Citations of individual
paragraphs should reference the original UN document signature (e.g.
CRC/C/GC/25 ¶12), not this database.
Dataset snapshot: last synchronised with OHCHR Treaty Body Database on 12 May 2026. 4,687 documents · 178,659 paragraphs total — 187 General Comments (all 10 treaty bodies), 4,327 jurisprudence cases (8 bodies: CCPR, CAT, CEDAW, CESCR, CRC, CRPD, CERD, CED), 173 Special Procedures reports (4 mandates). The most recent General Comment is CEDAW/C/GC/30/Add.1 (Women, Peace and Security addendum, 18 February 2026); the most recent Special Procedures report is A/80/170 (SR Disability, 25 September 2025). Daily link checks run on GitHub Actions. See change log for previous snapshots.
This is the completed and most authoritative part of the app: 187 General Comments issued by the ten UN human rights treaty bodies (CEDAW, CCPR, CERD, CESCR, CRC, CRPD, CMW, CAT, CAT-OP, CED). The collection is exhaustive as of the build date, sourced primarily from tbinternet.ohchr.org, and annotated for concerned groups such as children, women, migrants and persons with disabilities.
4,327 individual-communication cases ·
152,734 paragraphs from all 8 OP / individual-complaint
mechanisms — CCPR (2,862), CAT (832), CEDAW (172), CESCR (154),
CRC (147), CRPD (95), CERD (58), CED (7). Published as separate
shards under docs/jur/ so the main General
Comments interface stays lightweight; the shard for a case is
fetched lazily when the user opens it. CCPR and CAT include
OCR-recovered scanned legacy cases (some from the early 1990s with
no embedded text layer); the reader pane displays a provenance
banner on every OCR-recovered document with a one-click
report-issue link to the GitHub tracker.
Sources & cross-checks. Primary source: the OHCHR bulk jurisprudence dump (catalog + raw English DOCX/PDF). For partial cross-checking and methodological inspiration in selected areas — particularly metadata normalisation and substantive-article extraction for CCPR cases — we also consulted external curated databases such as the CCPR Centre's decisions digest and the University of Minnesota Human Rights Library.
We match
3,950 of the 4,015 unique case symbols
(98.4 %) listed in the
OHCHR JURIS authority catalogue.
The 65 catalogue entries we don't yet cover fall into a
few well-understood buckets. When we audited the missing
list, 43 had no PDF available in any language
on OHCHR's own download surface (an empty downloads
dict on the catalogue entry), so they are not recoverable from
OHCHR; the remaining 22 split across the other buckets.
Per-treaty coverage: CRPD 100 % · CED 100 % · CRC 99.3 % · CAT 98.8 % · CCPR 98.4 % · CESCR 98.4 % · CEDAW 98.0 % · CERD 75.4 %.
The full list of 65 missing decisions — missing-cases.csv on GitHub — has the OHCHR JURIS catalogue URL for every entry plus the source-recovery notes (methodology & sources tried). For research help on a specific case, contact l.szoszkiewicz@amu.edu.pl.
173 Special Procedures reports from four mandates are included as a curated preview: SR Freedom of Religion or Belief (88), SR Freedom of Expression (46), SR Disability (22) and SR Privacy (19). The collection is under active development; further mandates are added through a generic ingestion pipeline and these results should be read as exploratory rather than exhaustive.
Each document is downloaded as PDF, converted to Markdown, then to structured JSON paragraph-by-paragraph. Paragraphs are annotated with concerned-group labels from a fixed 19-category taxonomy (children, women/girls, persons with disabilities, persons deprived of their liberty, indigenous peoples, LGBTI+, and so on). The corpus is rebuilt deterministically by build_corpus.py; search runs entirely in the browser via FlexSearch with stemming, and the index is cached in IndexedDB after first load.
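The Markdown-to-JSON step above can be sketched roughly as follows. This is an illustration only, not the code in build_corpus.py: the function name `split_paragraphs`, the record fields, and the assumption that every numbered paragraph starts a line with "N." are all hypothetical.

```python
import json
import re

# Assumption: a numbered paragraph starts a line with "<number>. ".
PARA_START = re.compile(r"^(\d+)\.\s+(.*)$")

def split_paragraphs(signature: str, markdown: str) -> list:
    """Turn Markdown text into one record per numbered paragraph."""
    records = []
    for raw in markdown.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = PARA_START.match(line)
        if match:
            records.append({
                "signature": signature,
                "number": int(match.group(1)),
                "text": match.group(2),
            })
        elif records:
            # Continuation line: append to the paragraph currently open.
            records[-1]["text"] += " " + line
        # Lines before the first numbered paragraph (titles) are skipped.
    return records

def to_json(records: list) -> str:
    """Serialise with sorted keys so repeated builds give identical bytes."""
    return json.dumps(records, ensure_ascii=False, sort_keys=True, indent=2)
```

Sorting keys and fixing the indentation is one simple way to keep the output stable between rebuilds of the same input.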
The Ask tab is an experimental retrieval layer over the General Comments corpus. It returns verbatim paragraphs from the Committees — no paraphrase, no AI-generated answer text. The LLM only operates on the query side (rewriting the question into doctrinal language) and as a final-stage ranker (deciding which of 50 retrieved paragraphs most directly answer the question). This is an extractive RAG — fundamentally different from the chatbot style that synthesises answers and may hallucinate citations.
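A minimal sketch of the extractive flow described above, in Python for illustration. The names `ask`, `rewrite`, `retrieve` and `judge`, the record shape and the `top_n` cut-off are assumptions rather than the project's actual code; only the pool size of 50 comes from the text. The sketch shows the model supplying a rewritten query and a ranked list of paragraph IDs, while every returned text is looked up in the retrieved pool.

```python
from typing import Callable, Dict, List

Paragraph = Dict[str, str]  # assumed shape: {"id": ..., "text": ...}

def ask(
    question: str,
    rewrite: Callable[[str], str],
    retrieve: Callable[[str], List[Paragraph]],
    judge: Callable[[str, List[Paragraph]], List[str]],
    pool_size: int = 50,
    top_n: int = 5,  # hypothetical cut-off, not stated in the text
) -> List[Paragraph]:
    """Return verbatim corpus paragraphs; the model never writes answer text."""
    query = rewrite(question)            # LLM, query side only
    pool = retrieve(query)[:pool_size]   # candidate paragraphs from search
    by_id = {p["id"]: p for p in pool}
    answers: List[Paragraph] = []
    seen = set()
    for pid in judge(question, pool):    # LLM returns paragraph IDs only
        # IDs outside the pool are dropped, so no citation can be invented.
        if pid in by_id and pid not in seen:
            seen.add(pid)
            answers.append(by_id[pid])
    return answers[:top_n]
```

Because the ranker returns identifiers rather than text, a hallucinated ID simply fails the lookup and disappears from the result.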
30-question commentary-anchored benchmark (Saul on ICESCR, Joseph & Castan on ICCPR; 16/30 carry paragraph-level expected IDs).
Cross-treaty references inside GCs (a CERD/CRC/CEDAW GC mentioning
"the Covenant" or "article 4(2)" meaning ICCPR) are handled by a
defensive fallback: "article N of that Protocol" is left
as plain text rather than mis-linked. Full disambiguation needs
the corpus pre-tagged with explicit cited_articles
metadata, which is deferred to a later pipeline. Some lay queries surface
adjacent but non-canonical paragraphs from the right document;
these are usable but sometimes confusing. A per-IP rate limit
(15 queries per minute) caps token costs from automated traffic;
researchers running heavy iteration may briefly hit it.
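One common way to enforce a per-IP limit like the one mentioned above is a sliding window over recent request times. The sketch below is an assumption about how such a limiter could work, not a description of the deployed service; the class name and the explicit `now` parameter are hypothetical.

```python
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_s` seconds for each key."""

    def __init__(self, limit: int = 15, window_s: float = 60.0):
        self.limit = limit
        self.window_s = window_s
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Forget requests that have left the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Passing the clock value in explicitly keeps the limiter easy to test; a server would call it with `time.monotonic()`.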
Treaty-text augmentation, the LLM-judge stage, and the rerank cross-encoder all operate on actual paragraph text, not on summaries. Every visible citation is traceable: signature + GC number + paragraph number → click Open dossier for full context. The Ask tab augments — it does not replace — paragraph-level Search, which remains the source of truth for exact wording or Boolean queries.
170 older CCPR cases (mostly 1980s–90s scanned PDFs) come into
the corpus through Tesseract via a quality-first sidecar pipeline.
5,359 paragraphs across these cases are
OCR-recovered; the remaining 106,688 jurisprudence paragraphs
come from native text-layer PDFs or DOCX. Published paragraphs
keep official numbering where possible; individual opinions are
namespaced as OP1-,
OP2-, and real case appendices as
A1-. Generated IDs and OCR provenance
stay exposed in the export rather than silently merging into
official text.
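The namespacing rule above can be illustrated with a small record type. Only the prefixes OP1-, OP2- and A1- come from the text; the class name, the field names and the idea that the prefix is followed directly by the paragraph number are assumptions made for this sketch.

```python
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class ParagraphRef:
    number: str              # official or generated number, e.g. "4.2"
    section: str = "main"    # "main", "opinion" or "appendix"
    index: int = 0           # 1-based index of the opinion or appendix
    generated_id: bool = False  # True when no official number existed
    ocr: bool = False           # True when the text was OCR-recovered

    @property
    def pid(self) -> str:
        """Published ID: individual opinions and appendices get a prefix."""
        if self.section == "opinion":
            return f"OP{self.index}-{self.number}"
        if self.section == "appendix":
            return f"A{self.index}-{self.number}"
        return self.number

def export_row(ref: ParagraphRef, text: str) -> dict:
    """Keep provenance flags next to the text instead of dropping them."""
    row = asdict(ref)
    row.update({"pid": ref.pid, "text": text})
    return row
```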
Cleanup pipeline (May 2026): a
five-stage post-OCR sweep applied 1,388 fixes across
~422 paragraphs. Stages: (1) mechanical footnote-marker and
section-header repairs (436); (2) Tesseract tsv
word-row leak strip — 575 cases where coordinate metadata had
bled into paragraph text (e.g.
August 19825 1 6 1 4 5 1049 1995 130 42 49.939957 from
→ August 1982 from); (3) hand-vetted
mangled-word audit on 229 prefilter-flagged paragraphs: Spanish-name
accent recovery (Sanjuán, Gaitán, Velásquez Rodríguez),
sentence-start I mis-OCR'd as 1, and well-known
case-name fixes (47); (4) ~80 letter-substitution rules covering
systematic Tesseract shape confusions
(I↔l, u↔n, c↔e, h↔b,
m↔in, g↔s) — 289 fixes; (5) end-to-end spot-read
of 10 heavily-fixed paragraphs catching one regression
(beae. iciary wrongly normalised to
judiciary; correct reading was beneficiary).
Every rule is verified in context; nothing was applied where the
source word could be a real English/Spanish/French token.
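Two of the stages above lend themselves to a short sketch: stripping a leaked Tesseract TSV word row, and applying a shape-confusion rule only when the source word is not itself a known token. Everything here is an assumption for illustration: the regular expression is inferred from the single example in the text (a level value of 5, nine integer columns, then a decimal confidence), the rule list is a small sample, and the vocabulary check stands in for the human in-context verification the pipeline actually relies on.

```python
import re

# Assumed shape of a leaked TSV word row: level 5, nine integer columns
# (page, block, paragraph, line, word, left, top, width, height) and a
# decimal confidence. The leading 5 may be glued to the preceding word.
TSV_LEAK = re.compile(r"5(?: \d+){9} \d+\.\d+")

def strip_tsv_leak(text: str) -> str:
    """Remove coordinate metadata that bled into paragraph text."""
    return TSV_LEAK.sub("", text)

# Sample (wrong, right) pairs for the confusions named in the text.
SHAPE_RULES = [("l", "I"), ("I", "l"), ("n", "u"), ("u", "n"),
               ("e", "c"), ("c", "e"), ("b", "h"), ("h", "b"),
               ("in", "m"), ("m", "in"), ("s", "g"), ("g", "s")]

def fix_word(word: str, vocab: set) -> str:
    """Apply one substitution only if it yields a single known word."""
    if word.lower() in vocab:
        return word  # could be a real token: leave it untouched
    candidates = set()
    for wrong, right in SHAPE_RULES:
        start = word.find(wrong)
        while start != -1:
            cand = word[:start] + right + word[start + len(wrong):]
            if cand.lower() in vocab:
                candidates.add(cand)
            start = word.find(wrong, start + 1)
    # Ambiguous or unknown repairs are left for manual review.
    return candidates.pop() if len(candidates) == 1 else word
```

Refusing to touch words that are already in the vocabulary mirrors the stated policy of never rewriting a plausible real token, at the cost of missing errors that happen to spell another word.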
Residual errors: non-systematic
single-character glitches remain (e.g. cn
for on, i:to for into).
The reader pane carries an OCR provenance banner
on every scanned-PDF case with a one-click
report-issue link.
If you spot such a glitch in your research, please report it; the cleanup
pipeline is open to community contributions and audit.
Labels applied manually before this pass are preserved verbatim; the automated pipeline only fills paragraphs left unlabelled. Pipeline labels come from human-readable regular expressions matched against paragraph text — every match is deterministic and reproducible. After six iterations, the corpus reaches 74 % coverage on General Comments (5,283 of 7,105 paragraphs labelled) and 37 % on Special Procedures (about 6,990 of 18,751 paragraphs labelled). The 26 % of GC paragraphs that remain unlabelled are mostly procedural framework text or discuss groups outside the taxonomy (older persons, persons of African descent, religious minorities).
Patterns target unambiguous markers — full terms and fixed phrases
such as indigent,
free, prior and informed consent,
corporal punishment — rather than generic
synonyms. During quality review we removed several patterns that
produced false positives: social security
and social protection (wrongly mapped
to Poverty), forced displacement (moved
from Refugees to Armed conflict), marginalized
(removed from Poverty in SP because it overwhelmingly described
religious minorities in freedom-of-religion reports), and
alien (too ambiguous for Non-citizens).
81 false-positive Poverty labels were cleaned out in this pass;
the largest precision-and-recall gains are in Persons deprived
of their liberty (+271) and Persons with disabilities (+187).
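The labelling approach described above can be sketched as follows. The three phrases are taken from the text, but the category each one maps to, the function names and the data shapes are assumptions made for this example and do not reproduce the project's pattern files.

```python
import re

# Hypothetical phrase-to-label mapping; the real taxonomy has 19 categories.
PATTERNS = {
    "Poverty": [r"\bindigent\b"],
    "Indigenous peoples": [r"\bfree, prior and informed consent\b"],
    "Children": [r"\bcorporal punishment\b"],
}
COMPILED = {
    label: [re.compile(p, re.IGNORECASE) for p in pats]
    for label, pats in PATTERNS.items()
}

def label_paragraph(text: str, manual=None) -> list:
    """Manual labels win; otherwise return every label whose pattern matches."""
    if manual:
        return list(manual)  # preserved verbatim, never overwritten
    return sorted(
        label for label, pats in COMPILED.items()
        if any(p.search(text) for p in pats)
    )

def coverage(label_lists: list) -> float:
    """Share of paragraphs carrying at least one label."""
    if not label_lists:
        return 0.0
    return sum(1 for labels in label_lists if labels) / len(label_lists)
```

Sorting the matched labels keeps the output independent of dictionary order, which is one way to make repeated runs comparable byte for byte.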
The OHCHR Treaty Body Database lists 247 unique
E/C.12/<sess>/D/<n>/<yr>
symbols for CESCR Optional Protocol decisions
(sessions 55–79, 2014–2025). 147 are ingested with full
text via the docx-first pipeline (native DOCX from
documents.un.org; pre-2015 OLE
.doc converted via LibreOffice
so the <w:footnoteReference>
marker placement is preserved). The remaining
100 are catalog-only — every UN endpoint
returns 404, and OHCHR's own download surface answers
"Sorry there is no files available".
The pattern is structural rather than a pipeline gap:
88 of the 100 are bundled discontinuance
decisions where the Committee adopts a single
transmittal letter closing six to fourteen withdrawn
communications at once. Only the symbol appears in the
OHCHR catalog; no standalone document is published. The
remaining 12 are catalog rows with truncated titles
including one verifiable typo (E/C.12/77/D/82/018,
year missing the leading 2). Coverage is therefore
59.5 % full text, 40.5 % catalog-only.
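A check like the one that surfaces the truncated-year typo can be sketched with a regular expression over the symbol format quoted above. The function and type names are hypothetical, the four-digit-year rule is an assumption, and the well-formed symbol in the usage lines is illustrative only.

```python
import re
from typing import NamedTuple, Optional

# E/C.12/<sess>/D/<n>/<yr>, as quoted in the text.
SYMBOL = re.compile(r"^E/C\.12/(\d+)/D/(\d+)/(\d+)$")

class CescrSymbol(NamedTuple):
    session: int
    number: int
    year: str      # kept as text so a malformed year stays visible
    year_ok: bool  # assumption: a valid year has exactly four digits

def parse_symbol(symbol: str) -> Optional[CescrSymbol]:
    """Split a CESCR decision symbol; return None if the shape is wrong."""
    match = SYMBOL.match(symbol.strip())
    if match is None:
        return None
    session, number, year = match.groups()
    return CescrSymbol(int(session), int(number), year, len(year) == 4)

def share(part: int, total: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(part / total * 100, 1)

# Usage: flag catalogue rows whose year field looks truncated.
rows = ["E/C.12/77/D/82/018", "E/C.12/66/D/37/2018"]
suspect = [s for s in rows if (p := parse_symbol(s)) and not p.year_ok]
```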
Claude Sonnet 4.6 (Anthropic) was used as a coding assistant — proposing candidate patterns, reading samples of newly-applied labels to flag probable false positives, and writing the audit scripts. No paragraph label in the published corpus was generated by an LLM as free-form output. Every label is the result of a regular expression that a human can read, audit and modify; re-running the pipeline on the same input produces byte-identical output.
Suggested citation: Szoszkiewicz, Ł., & Kowalska, Z. (2026). The
UN Human Rights Database — A paragraph-level search interface for UN Treaty
Body General Comments.
Citations of individual paragraphs should reference the original UN
document by its signature (e.g. CRC/C/GC/25 ¶12),
not this database.
Software is released under the GNU Affero General Public License v3.0 — anyone running a modified copy as a hosted service must release their modifications under the same licence. The curated dataset (paragraph-level corpus, document metadata, section annotations, footnote cross-reference resolutions, concerned-group labels) is licensed under CC BY-NC-SA 4.0 — academic and non-commercial reuse is welcome with attribution; commercial use requires permission. The underlying UN documents remain under the United Nations' content terms. Built as a static site on GitHub Pages with vanilla JavaScript, FlexSearch and SheetJS. Set in Spectral, Source Serif 4 and JetBrains Mono.
Created by Łukasz Szoszkiewicz and Zuzanna Kowalska. Contact: l.szoszkiewicz@amu.edu.pl · source on GitHub. File issues, suggestions, missing documents and corrections through the GitHub repository.