New partnership

Licensed Arabic Academic Text Corpus

Rights-cleared Modern Standard Arabic (MSA) academic texts for AI and LLM training, delivered via Dialect Data.

Overview

This corpus is licensed in partnership with Al-Maktab Al-Arabi Lil-Maaref (Cairo). It is designed for teams building Arabic LLMs that need high-quality text with clear commercial usage rights. Access is licensed through Dialect Data.

Pilot highlights

  • ~70 academic titles
  • ~4.3M words
  • Domains: psychology, social sciences, law, economics, technology
  • Commercial AI/LLM usage rights included
  • Non-exclusive license

Use cases

  • Pretraining and continued pretraining
  • Fine-tuning and instruction tuning
  • Arabic evaluation / benchmarking datasets
  • Domain adaptation for academic and formal Arabic

Delivery format

Delivered as one file per book (TXT, UTF-8), organized by section, plus a structured metadata sheet. Title-level details are shared under NDA.

The exact file naming convention and delivery structure are standardized for use in ingestion pipelines.
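
As a rough illustration of how a delivery in this format might be ingested, the sketch below reads a folder of UTF-8 TXT files (one per book) and reports title and word counts. The flat folder layout, the ".txt" extension, and the function names load_corpus, word_count, and corpus_stats are assumptions made for this example only; the actual naming convention and metadata sheet layout are shared with licensees and are not reproduced here.

```python
from pathlib import Path


def load_corpus(root: Path) -> dict[str, str]:
    """Read every .txt file directly under root as UTF-8, keyed by file stem.

    Assumes a flat folder with one TXT file per book (hypothetical layout).
    """
    texts = {}
    for path in sorted(root.glob("*.txt")):
        # errors="strict" raises on any file that is not valid UTF-8,
        # so encoding problems surface at load time rather than in training.
        texts[path.stem] = path.read_text(encoding="utf-8", errors="strict")
    return texts


def word_count(text: str) -> int:
    """Count whitespace-delimited tokens (a rough word measure for Arabic)."""
    return len(text.split())


def corpus_stats(root: Path) -> dict[str, int]:
    """Summarize a delivery folder: number of titles and total words."""
    texts = load_corpus(root)
    return {
        "titles": len(texts),
        "words": sum(word_count(t) for t in texts.values()),
    }
```

Calling corpus_stats(Path("delivery")) on a delivery folder gives a quick sanity check against the approximate title and word counts listed under Pilot highlights. Whitespace splitting is only an approximation; a tokenizer-based count will differ.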

Rights and compliance

  • Rights-cleared for commercial AI use
  • No stand-alone redistribution of texts
  • Access provided via controlled delivery and approvals

Request access

Tell us about your use case (training, evaluation, domains, size). We can then share access and title-level details under NDA.

Contact us