New partnership
Licensed Arabic Academic Text Corpus
Rights-cleared Modern Standard Arabic (MSA) academic texts for AI and LLM training, delivered via Dialect Data.
Overview
This corpus is licensed in partnership with Al-Maktab Al-Arabi Lil-Maaref (Cairo). It is designed for teams building Arabic LLMs that need high-quality text with clear commercial usage rights. Access is licensed through Dialect Data.
Pilot highlights
- ~70 academic titles
- ~4.3M words
- Domains: psychology, social sciences, law, economics, technology
- Commercial AI/LLM usage rights included
- Non-exclusive license
Use cases
- Pretraining and continued pretraining
- Fine-tuning and instruction tuning
- Arabic evaluation / benchmarking datasets
- Domain adaptation for academic and formal Arabic
Delivery format
Delivered as one file per book (TXT, UTF-8), organized by section, plus a structured metadata sheet. Title-level details are shared under NDA.
The file naming convention and delivery structure are standardized so that files can be fed directly into ingestion pipelines.
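As a rough illustration of what ingestion might look like, the sketch below reads a folder of per-book TXT files as UTF-8 and reports an approximate word count per book. The flat folder layout, the `.txt` extension, and the function names `load_corpus`, `word_count`, and `corpus_summary` are assumptions made for this example only. The actual naming convention and structure are shared with the delivery.

```python
from pathlib import Path


def load_corpus(root: Path) -> dict[str, str]:
    """Read every .txt file directly under root as UTF-8, keyed by file stem.

    Assumes a flat folder with one file per book (hypothetical layout).
    """
    return {
        path.stem: path.read_text(encoding="utf-8")
        for path in sorted(root.glob("*.txt"))
    }


def word_count(text: str) -> int:
    """Count whitespace-delimited tokens, a rough proxy for words."""
    return len(text.split())


def corpus_summary(root: Path) -> dict[str, int]:
    """Map each book's file stem to its approximate word count."""
    return {name: word_count(text) for name, text in load_corpus(root).items()}
```

Decoding is strict here, so a file that is not valid UTF-8 raises `UnicodeDecodeError` instead of being silently altered. That makes it usable as a quick sanity check on a delivery before training.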
Rights and compliance
- Rights-cleared for commercial AI use
- No stand-alone redistribution of texts
- Access provided via controlled delivery and approvals
Request access
Tell us about your use case (training, evaluation, target domains, corpus size). We can share access and title-level details under NDA.
Contact us