What this is

UN Human Rights Database is a paragraph-level search interface for UN human rights interpretation. Its authoritative core is the Treaty Body General Comments. It also indexes the complete set of thematic Special Procedures annual reports (all 46 mandates) and a near-complete set of individual-complaint jurisprudence from eight treaty bodies (about 98% of the OHCHR JURIS catalogue). Special Procedures reports are soft law (independent expert opinion, not binding treaty-body interpretation). The jurisprudence collection is not yet exhaustive: a small remainder of non-English and image-only scanned cases is pending. Secondary collections are loaded lazily on demand. In the current published build, 320,018 paragraphs in 6,081 documents are searchable, filterable and citable down to the paragraph number.

This is an independent academic project, not affiliated with or endorsed by the OHCHR or any UN entity. Citations of individual paragraphs should reference the original UN document signature (e.g. CRC/C/GC/25 ¶12), not this database.

Dataset snapshot: last synchronised with OHCHR Treaty Body Database on 18 June 2026. 6,081 documents · 320,018 paragraphs total — 187 General Comments (all 10 treaty bodies), 1,560 Special Procedures reports (all 46 thematic mandates), 4,334 jurisprudence cases (8 bodies: CCPR, CAT, CEDAW, CESCR, CRC, CRPD, CERD, CED). The most recent General Comment is CEDAW/C/GC/30/Add.1 (Women, Peace and Security addendum, 18 February 2026); the most recent Special Procedures report is A/HRC/61/42 (SR Torture, Charter of Rights of Victims and Survivors of Torture, May 2026). Weekly link checks run on GitHub Actions. See change log for previous snapshots.

Treaty Body General Comments

This is the completed and most authoritative part of the app: 187 General Comments issued by the ten UN human rights treaty bodies (CEDAW, CCPR, CERD, CESCR, CRC, CRPD, CMW, CAT, CAT-OP, CED). The collection is exhaustive as of build date, sourced primarily from tbinternet.ohchr.org, and annotated for concerned groups such as children, women, migrants and persons with disabilities.

Jurisprudence

4,334 individual-communication cases · 152,794 paragraphs from all 8 OP / individual-complaint mechanisms: CCPR (2,862), CAT (835), CEDAW (172), CESCR (155), CRC (150), CRPD (95), CERD (58), CED (7). The cases are published as separate shards under docs/jur/ so the main General Comments interface stays lightweight; the shard for a case is fetched lazily when the user opens it. CCPR and CAT include OCR-recovered scanned legacy cases (some from the early 1990s with no embedded text layer); the reader pane displays a provenance banner on every OCR-recovered document with a one-click report-issue link to the GitHub tracker.

Sources & cross-checks. Primary source: the OHCHR bulk jurisprudence dump (catalog + raw English DOCX/PDF). For partial cross-checking and methodological inspiration in selected areas — particularly metadata normalisation and substantive-article extraction for CCPR cases — we also consulted external curated databases such as the CCPR Centre's decisions digest and the University of Minnesota Human Rights Library.

Coverage vs OHCHR JURIS — 13 May 2026

We match 3,950 of the 4,015 unique case symbols (98.4 %) listed in the OHCHR JURIS authority catalogue. The 65 catalogue entries we don't yet cover fall into a few well-understood buckets — when we audited the missing list, 43 had no PDF available in any language on OHCHR's own download surface (empty downloads dict on the catalogue entry), so they're not recoverable from OHCHR; the rest split across:

1990s scanned PDFs without text layer — some recoverable via further OCR work, some are image-only legacy files where no fallback source returns text.
Non-English-only sources (~21 cases) where the case is publicly available but only in French, Spanish or Russian — flagged for translation/cross-check rather than ingest, to keep the corpus monolingual.
Joint communications the catalogue lists as separate symbols; we merge them under a single document, so the headline miss-count slightly overstates unique cases.

Per-treaty coverage: CRPD 100 % · CED 100 % · CRC 99.3 % · CAT 98.8 % · CCPR 98.4 % · CESCR 98.4 % · CEDAW 98.0 % · CERD 75.4 %.
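The headline figures can be re-derived from the counts quoted in this section. The following is a minimal sketch of that arithmetic; the variable names are illustrative and are not taken from the project's code.

```python
# Counts quoted in the coverage note (13 May 2026 audit).
catalogue_symbols = 4015   # unique case symbols in the OHCHR JURIS catalogue
matched_symbols = 3950     # symbols matched by this database
no_pdf_anywhere = 43       # missing entries with no downloadable file at OHCHR

missing = catalogue_symbols - matched_symbols
coverage_pct = round(100 * matched_symbols / catalogue_symbols, 1)
# Remainder shared by the other buckets: scans, non-English sources, merged joint cases.
other_buckets = missing - no_pdf_anywhere

print(missing, coverage_pct, other_buckets)  # 65 98.4 22
```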

The full list of 65 missing decisions — missing-cases.csv on GitHub — has the OHCHR JURIS catalogue URL for every entry plus the source-recovery notes (methodology & sources tried). For research help on a specific case, contact l.szoszkiewicz@amu.edu.pl.

Special Procedures

1,560 reports covering all 46 thematic mandates of the UN Special Procedures — 33 Special Rapporteurs, 6 Independent Experts and 7 Working Groups. Each mandate's full run of thematic annual reports to the Human Rights Council, the General Assembly and the former Commission on Human Rights is split paragraph by paragraph and labelled for concerned groups, the same way as General Comments. Special Procedures reports are soft law: the independent opinion of a mandate-holder, persuasive but not the binding interpretation a treaty body issues in a General Comment.

What is in and out of scope. The corpus is the thematic output only. By design it excludes country-visit reports, communications (joint letters and urgent appeals) and addenda / corrigenda — these report on situations rather than develop doctrine. A small number of older reports are not yet included: a handful are unfetchable through the UN document API (an internal-redirect glitch, retried periodically) and a few are image-only scans with no text layer (pending OCR). Reports are added and maintained through a deterministic ingestion pipeline.

Methodology — pipeline & quality

Each document is downloaded as PDF, converted to Markdown, then to structured JSON paragraph-by-paragraph. Paragraphs are annotated with concerned-group labels from a fixed 19-category taxonomy (children, women/girls, persons with disabilities, persons deprived of their liberty, indigenous peoples, LGBTI+, and so on). The corpus is rebuilt deterministically by build_corpus.py; local fallback search runs in the browser via FlexSearch and verifies exact word forms; a trailing wildcard such as child* explicitly includes variants. The index is cached in IndexedDB after first load. One caveat on references: structured cited-article metadata is currently extracted for jurisprudence only (about 72 % of cases). In General Comments and Special Procedures, treaty-article citations stay fully searchable inside the footnote and body text but are not yet pulled into a separate field.
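The difference between exact word forms and a trailing wildcard can be illustrated with a small sketch. This is not the FlexSearch-based implementation used in the browser; it is an assumed regular-expression equivalent, and the function names compile_query and matches are hypothetical.

```python
import re

def compile_query(term: str) -> re.Pattern:
    """Exact word form by default; a trailing * widens to any word with that prefix."""
    if term.endswith("*"):
        body = re.escape(term[:-1]) + r"\w*"
    else:
        body = re.escape(term)
    # Word boundaries on both sides keep "child" from matching inside "children".
    return re.compile(rf"\b{body}\b", re.IGNORECASE)

def matches(term: str, paragraph: str) -> bool:
    """Return True when the query term occurs in the paragraph text."""
    return compile_query(term).search(paragraph) is not None

print(matches("child", "Children have the right to be heard."))   # False
print(matches("child*", "Children have the right to be heard."))  # True
```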

Read more: Jurisprudence OCR · Two annotation layers · Conservatism over coverage · CESCR jurisprudence coverage · Role of generative AI · Ask tab pipeline (BETA)

Ask tab pipeline (BETA)

extractive RAG · hybrid retrieval + LLM-judge · verbatim only, no hallucination

The Ask tab is an experimental retrieval layer over the General Comments corpus. It returns verbatim paragraphs from the Committees: no paraphrase, no AI-generated answer text. The LLM operates only on the query side (rewriting the question into doctrinal language) and as a final-stage ranker (deciding which of 50 retrieved paragraphs most directly answer the question). This is extractive RAG, which is fundamentally different from chatbot-style systems that synthesise answers and may hallucinate citations.

Pipeline

Gemini Flash-Lite HyDE rewrite of the question
BM25 (SQLite FTS5) + Voyage law-2 dense embeddings, retrieved in parallel
Reciprocal Rank Fusion (k = 60)
Voyage rerank-2 cross-encoder against the original question
Gemini Flash-Lite second-stage judge over top-10 candidates, augmented with the actual treaty article text (all 9 core treaties + 9 optional protocols)
Final ordered top-10
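The fusion step in the list above can be sketched as follows. This is a generic Reciprocal Rank Fusion in Python with k = 60, not the project's own code; the paragraph IDs and the function name are made up for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists: each list adds 1 / (k + rank) to an ID's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25_hits = ["p3", "p1", "p2"]    # hypothetical lexical ranking
dense_hits = ["p1", "p4", "p3"]   # hypothetical embedding ranking
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)  # ['p1', 'p3', 'p4', 'p2']
```

A paragraph ranked well by both retrievers (p1, p3) rises above one that only a single retriever found, which is the reason for fusing before the more expensive rerank and judge stages.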

Quality gates

30-question commentary-anchored benchmark (Saul on ICESCR, Joseph & Castan on ICCPR; 16/30 carry paragraph-level expected IDs).

docHit@5: 100% — every expected document in the first 5 results
paraHit@5: 81.2% — canonical paragraph in the first 5 (3 misses are off-by-one in the correct GC)
paraHit@10: 87.5%
answerScore@1: 0.906 ± 0.010 — Gemini-as-judge faithfulness (yes/partial/no → 1/0.5/0, 3 runs × 30 q)
Latency: median ~3.9 s, p95 ~6.1 s end-to-end
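For readers unfamiliar with hit@k metrics, the sketch below shows one plausible way to compute them. It assumes a question counts as a hit when any expected ID appears in the first k results; the benchmark's exact scoring rule is not specified here, and the function names are hypothetical. With 16 questions carrying paragraph-level IDs, 13 hits gives 81.25 % (reported as 81.2 %) and 14 hits gives 87.5 %.

```python
def hit_at_k(results, expected, k):
    """True if any expected ID appears among the first k results."""
    return any(r in expected for r in results[:k])

def hit_rate(runs, k):
    """runs is a list of (ranked_results, expected_ids) pairs, one per question."""
    hits = sum(hit_at_k(results, expected, k) for results, expected in runs)
    return hits / len(runs)

# Toy data: two questions with made-up paragraph IDs.
toy_runs = [
    (["gc14-p12", "gc14-p13", "gc3-p10"], {"gc14-p13"}),
    (["gc25-p5", "gc25-p6", "gc25-p7"], {"gc25-p9"}),
]
print(hit_rate(toy_runs, k=3))  # 0.5
```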

Known limitations

Cross-treaty references inside GCs (a CERD/CRC/CEDAW GC mentioning "the Covenant" or "article 4(2)" meaning ICCPR) are handled by a defensive fallback: "article N of that Protocol" is left as plain text rather than wrong-linked. Full disambiguation needs the corpus pre-tagged with explicit cited_articles metadata — deferred to a later pipeline. Some lay queries surface adjacent-but-not-canonical paragraphs from the right document — usable, sometimes confusing. Per-IP rate limit (15 q/min) caps token costs from automated traffic; researchers running heavy iteration may briefly hit it.
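A per-IP limit of 15 queries per minute can be enforced in several ways. The sketch below shows one common approach, a sliding-window counter; it is an assumption about how such a limit might work, not a description of the deployed service, and the class name is hypothetical.

```python
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within any `window_seconds` span."""

    def __init__(self, limit=15, window_seconds=60.0):
        self.limit = limit
        self.window = window_seconds
        self.hits = defaultdict(deque)  # client IP -> timestamps of accepted requests

    def allow(self, ip, now):
        """Record and accept the request at time `now` (seconds), or reject it."""
        timestamps = self.hits[ip]
        # Forget requests that have left the window.
        while timestamps and now - timestamps[0] >= self.window:
            timestamps.popleft()
        if len(timestamps) >= self.limit:
            return False
        timestamps.append(now)
        return True

limiter = SlidingWindowLimiter()
accepted = [limiter.allow("192.0.2.1", float(t)) for t in range(16)]
print(accepted.count(True))  # 15
```

Passing the clock in as an argument keeps the sketch deterministic and easy to test; a real service would read the current time itself.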

Grounding

Treaty-text augmentation, the LLM-judge stage, and the rerank cross-encoder all operate on actual paragraph text, not on summaries. Every visible citation is traceable: signature + GC number + paragraph number → click Open dossier for full context. The Ask tab augments — it does not replace — paragraph-level Search, which remains the source of truth for exact wording or Boolean queries.

Jurisprudence OCR

scanned-PDF recovery · 5,359 ¶ OCR'd · 1,388 cleanup fixes · please report residual errors

170 older CCPR cases (mostly 1980s–90s scanned PDFs) come into the corpus through Tesseract via a quality-first sidecar pipeline. 5,359 paragraphs across these cases are OCR-recovered; the remaining 106,688 jurisprudence paragraphs come from native text-layer PDFs or DOCX. Published paragraphs keep official numbering where possible; individual opinions are namespaced as OP1-, OP2-, and real case appendices as A1-. Generated IDs and OCR provenance stay exposed in the export rather than silently merging into official text.

Cleanup pipeline (May 2026): a five-stage post-OCR sweep brought 1,388 fixes across ~422 paragraphs. Stages: (1) mechanical fn-marker and section-header repairs (436); (2) Tesseract tsv word-row leak strip — 575 cases where coordinate metadata had bled into paragraph text (e.g. August 19825 1 6 1 4 5 1049 1995 130 42 49.939957 from → August 1982 from); (3) hand-vetted mangled-word audit on 229 prefilter-flagged ¶ — Spanish-name accent recovery (Sanjuán, Gaitán, Velásquez Rodríguez), sentence-start I mis-OCR'd as 1, and well-known case-name fixes (47); (4) ~80 letter-substitution rules covering systematic Tesseract shape confusions (I↔l, u↔n, c↔e, h↔b, m↔in, g↔s) — 289 fixes; (5) end-to-end spot-read of 10 heavily-fixed paragraphs catching one regression (beae. iciary wrongly normalised to judiciary; correct reading was beneficiary). Every rule is verified in context; nothing was applied where the source word could be a real English/Spanish/French token.
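The guard described in the last sentence (never rewrite a token that could already be a real word) can be illustrated with a short sketch. The vocabulary, the three sample rules and the function name below are hypothetical stand-ins; the actual rule set and word lists are not reproduced here.

```python
# Hypothetical (wrong, right) pairs drawn from the shape confusions listed above.
RULES = [("b", "h"), ("n", "u"), ("c", "e")]
# Hypothetical vocabulary; a real run would need English, Spanish and French word lists.
VOCAB = {"the", "court", "state", "bat"}

def fix_token(token, vocab=VOCAB, rules=RULES):
    """Apply a substitution only if the token is not a word and the result is one."""
    if token.lower() in vocab:
        return token  # already a real word: leave it alone
    for wrong, right in rules:
        if wrong in token:
            # Only the first occurrence is tried, which keeps the sketch simple.
            candidate = token.replace(wrong, right, 1)
            if candidate.lower() in vocab:
                return candidate
    return token  # no safe correction found

print([fix_token(t) for t in ["tbe", "conrt", "bat", "qzx"]])
# ['the', 'court', 'bat', 'qzx']
```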

Residual errors: non-systematic single-character glitches remain (e.g. cn for on, i:to for into). The reader pane carries an OCR provenance banner on every scanned-PDF case with a one-click report-issue link. If you spot one in your research, please file it — the cleanup pipeline is open to community contributions and audit.

Two annotation layers

manual labels first, automation only fills gaps · 74 % GC, 37 % SP coverage

Labels applied manually before this pass are preserved verbatim; the automated pipeline only fills paragraphs left unlabelled. Pipeline labels come from human-readable regular expressions matched against paragraph text — every match is deterministic and reproducible. After six iterations, the corpus reaches 74 % coverage on General Comments (5,283 of 7,105 paragraphs labelled) and 37 % on Special Procedures (about 6,990 of 18,751 paragraphs labelled). The 26 % of GC paragraphs that remain unlabelled are mostly procedural framework text or discuss groups outside the taxonomy (older persons, persons of African descent, religious minorities).
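The two-layer rule can be sketched in a few lines of Python. The patterns, their mapping to categories and the paragraph fields (text, labels, label_source) are assumptions made for illustration, not the project's actual schema or rule set.

```python
import re

# Hypothetical pattern-to-category mapping, for illustration only.
PATTERNS = {
    "Children": re.compile(r"\bcorporal punishment\b", re.IGNORECASE),
    "Indigenous peoples": re.compile(r"\bfree, prior and informed consent\b", re.IGNORECASE),
}

def label_paragraph(paragraph):
    """Keep manual labels untouched; otherwise fill in deterministic regex labels."""
    if paragraph.get("labels"):
        return {**paragraph, "label_source": "manual"}
    auto = sorted(name for name, pattern in PATTERNS.items()
                  if pattern.search(paragraph["text"]))
    return {**paragraph, "labels": auto, "label_source": "regex" if auto else None}

manual = {"text": "Corporal punishment of girls in schools.", "labels": ["Women/girls"]}
unlabelled = {"text": "States must prohibit corporal punishment.", "labels": []}
print(label_paragraph(manual)["labels"])      # ['Women/girls']
print(label_paragraph(unlabelled)["labels"])  # ['Children']
```

Sorting the matched category names keeps the output order stable across runs, which matters for a pipeline that is meant to be reproducible.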

Conservatism over coverage

patterns we removed · false-positive cleanups · 81 Poverty fixes

Patterns target unambiguous markers — full terms and fixed phrases such as indigent, free, prior and informed consent, corporal punishment — rather than generic synonyms. During quality review we removed several patterns that produced false positives: social security and social protection (wrongly mapped to Poverty), forced displacement (moved from Refugees to Armed conflict), marginalized (removed from Poverty in SP because it overwhelmingly described religious minorities in freedom-of-religion reports), and alien (too ambiguous for Non-citizens). 81 false-positive Poverty labels were cleaned out in this pass; the largest precision-and-recall gains are in Persons deprived of their liberty (+271) and Persons with disabilities (+187).

CESCR jurisprudence coverage

147 of 247 catalog rows have full text · most of the 100 missing are bundled mass-discontinuances OHCHR never published as standalone documents

The OHCHR Treaty Body Database lists 247 unique E/C.12/<sess>/D/<n>/<yr> symbols for CESCR Optional Protocol decisions (sessions 55–79, 2014–2025). 147 are ingested with full text via the docx-first pipeline (native DOCX from documents.un.org; pre-2015 OLE .doc converted via LibreOffice so the <w:footnoteReference> marker placement is preserved). The remaining 100 are catalog-only — every UN endpoint returns 404, and OHCHR's own download surface answers "Sorry there is no files available".

The pattern is structural rather than a pipeline gap: 88 of the 100 are bundled discontinuance decisions where the Committee adopts a single transmittal letter closing six to fourteen withdrawn communications at once. Only the symbol appears in the OHCHR catalog; no standalone document is published. The remaining 12 are catalog rows with truncated titles, including one verifiable typo (E/C.12/77/D/82/018, where the year is missing its leading 2). Coverage is therefore 59.5 % full text (147 of 247) and 40.5 % catalog-only.
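A symbol typo of this kind can be caught mechanically. The sketch below parses the E/C.12/<sess>/D/<n>/<yr> pattern and flags a year that is not four digits; the regular expression and the function name are assumptions for illustration, not the project's validation code.

```python
import re

# Session / communication number / year, following the symbol pattern quoted above.
SYMBOL_RE = re.compile(r"^E/C\.12/(?P<session>\d+)/D/(?P<number>\d+)/(?P<year>\d+)$")

def check_symbol(symbol):
    """Classify a CESCR decision symbol as 'ok', 'suspect year' or 'malformed'."""
    match = SYMBOL_RE.match(symbol)
    if match is None:
        return "malformed"
    if len(match["year"]) != 4:
        return "suspect year"
    return "ok"

print(check_symbol("E/C.12/77/D/82/018"))   # suspect year
print(check_symbol("E/C.12/77/D/82/2018"))  # ok
```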

Role of generative AI

no LLM-generated labels · regex audit, byte-identical reproducibility

Claude Sonnet 4.6 (Anthropic) was used as a coding assistant — proposing candidate patterns, reading samples of newly-applied labels to flag probable false positives, and writing the audit scripts. No paragraph label in the published corpus was generated by an LLM as free-form output. Every label is the result of a regular expression that a human can read, audit and modify; re-running the pipeline on the same input produces byte-identical output.
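One way to check a byte-identical claim is to serialise the output canonically and compare digests between runs. The sketch below assumes the labelled corpus can be represented as a list of JSON-serialisable dictionaries; the function name and field names are hypothetical.

```python
import hashlib
import json

def corpus_digest(paragraphs):
    """SHA-256 over a canonical JSON serialisation (sorted keys, fixed separators)."""
    payload = json.dumps(
        paragraphs, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

run_a = [{"id": "p1", "labels": ["Children"]}]
run_b = [{"labels": ["Children"], "id": "p1"}]  # same content, different key order
print(corpus_digest(run_a) == corpus_digest(run_b))  # True
```

Sorting keys and fixing the separators removes incidental differences in dictionary order or whitespace, so a digest mismatch points to a real change in the labels.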

How to cite

Suggested citation: Szoszkiewicz, Ł., & Kowalska, Z. (2026). UNHRD — UN Human Rights Database. Zenodo. https://doi.org/10.5281/zenodo.10495719 Citations of individual paragraphs should reference the original UN document by its signature (e.g. CRC/C/GC/25 ¶12), not this database.

Open data — Hugging Face

The General Comments corpus is published as a paragraph-level dataset on Hugging Face: lszoszk/treaty-bodies-general-comments. Every row is one paragraph from a Treaty Body General Comment or General Recommendation, carrying its section path, footnotes (with cross-reference resolutions), preamble flag, concerned-group labels, and — new in the v3 release — cited_articles, per-paragraph treaty-article references resolved to a specific convention. 187 documents · 7,216 paragraphs · parquet, loadable directly with the datasets library:

from datasets import load_dataset
ds = load_dataset(
    "lszoszk/treaty-bodies-general-comments",
    split="train")

The dataset (curation, segmentation, annotation) is released under CC BY-NC-SA 4.0; the jurisprudence and Special Procedures collections are not yet part of the Hugging Face package — see the coverage notes for their status.

MCP server — query it from an AI assistant

The corpus is also exposed as a Model Context Protocol server, so an LLM assistant (Claude, or any MCP-capable client) can search and cite treaty-body paragraphs directly — returning the verbatim text with its document signature and paragraph number, instead of paraphrasing from training data. It exposes search_paragraphs (full-text search across the corpus, filterable by treaty body) and lookup_by_citation (resolve a reference like CRC/C/GC/25 ¶12 to its exact paragraph). Source and setup: github.com/lszoszk/mcp-unhrdb.
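A reference such as CRC/C/GC/25 ¶12 splits naturally into a document signature and a paragraph number. The sketch below shows one way to parse it on the client side; it is an illustration only and makes no claim about how lookup_by_citation is implemented in the server. The function name parse_citation is hypothetical.

```python
import re

# Signature (no spaces), then the pilcrow, then the paragraph number.
CITATION_RE = re.compile(r"^(?P<signature>\S+)\s*¶\s*(?P<paragraph>\d+)$")

def parse_citation(text):
    """Split 'SIGNATURE ¶N' into (signature, paragraph_number)."""
    match = CITATION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a paragraph citation: {text!r}")
    return match["signature"], int(match["paragraph"])

print(parse_citation("CRC/C/GC/25 ¶12"))  # ('CRC/C/GC/25', 12)
```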

Licence & tech

Software is released under the GNU Affero General Public License v3.0 — anyone running a modified copy as a hosted service must release their modifications under the same licence. The curated dataset (paragraph-level corpus, document metadata, section annotations, footnote cross-reference resolutions, concerned-group labels) is licensed under CC BY-NC-SA 4.0 — academic and non-commercial reuse is welcome with attribution; commercial use requires permission. The underlying UN documents remain under the United Nations' content terms. Built as a static site on GitHub Pages with vanilla JavaScript, FlexSearch and SheetJS. Set in Spectral, Source Serif 4 and JetBrains Mono.

Contact

Created by Łukasz Szoszkiewicz and Zuzanna Kowalska. Contact: l.szoszkiewicz@amu.edu.pl · source on GitHub. File issues, suggestions, missing documents and corrections through the GitHub repository.
