Dataset

Arabic Dialect Dataset – Authentic Speech & Text for AI

Licensed and rights-cleared Arabic speech (with video) and licensed text data, delivered with dialect labels, metadata, and licensing documentation for production AI.

What's Available Now

  • 40+ hours of conversational Lebanese Arabic video captured across Lebanon (licensed and rights-cleared sourcing).
  • Licensed Egyptian academic text corpus in Modern Standard Arabic (MSA), ~4 million words, sourced through Egyptian university partnerships.
  • Modalities currently in-hand: audio, video, and text.
  • Current custom programs focus on Levantine, Iraqi, and Yemeni dialects; Algerian and Moroccan dialects are available on request.

Coverage Snapshot (as of March 2026)

  • Audio hours: 40
  • Video hours: 40
  • Recordings / clips: 1,800+
  • Conversations: 150+
  • Speakers: ~100
  • Video coverage: 90% of relevant records include video.
  • Transcript coverage (Dialectal Arabic): 50%
  • Transcript coverage (MSA): 35%
  • Transcript coverage (English): 20%
  • Text size: ~4,000,000 words (licensed sources only).

How the dataset is curated

Local teams record conversations, dialect experts annotate them, and native reviewers verify consent and metadata before delivery.

Quality assurance

Native-speaker experts run random spot checks on the data for dialect accuracy and transcript quality.

Request a sample pack or tell us your dialect/volume requirements.

We'll respond with sample options, licensing guidance, and next steps within two business days.

Representative Samples

Real clips that showcase dialect diversity, varied recording environments, and licensed, rights-cleared capture.

Each clip here is consent-verified and annotated with its dialect, number of speakers, and recording environment.

Phone call sample (116s)

A phone call excerpt from the dataset, presented as a short, consent-verified audio clip.

⏱️ Duration: 116s; 🎧 Format: M4A (AAC); ✅ Consent: Verified

Video conversation sample (12s)

A short video excerpt from the dataset, presented as a consent-verified clip.

⏱️ Duration: 12s; 🎥 Format: MP4 (H.264); ✅ Consent: Verified

Want to explore more?

Request a sample pack

Technical summary

  • Delivery formats: WAV audio, MP4 video, and UTF-8 text/transcripts with structured manifests.
  • Delivery package: secure cloud download, checksums, README, schema reference, and collection guidelines.
  • Data splits are provided for training, validation, and testing.
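
Because each delivery package ships with checksums, a download can be verified before any training run. The sketch below is one way to do that in Python. It is not the vendor's tooling: the file name `checksums.sha256`, the use of SHA-256, and the `<digest>  <relative path>` line layout are all assumptions, so check the README in your package for the actual format.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_package(root: Path, checksum_file: str = "checksums.sha256") -> list[str]:
    """Return relative paths that are missing or whose digest does not match.

    Assumption: each non-empty line is '<hex digest>  <relative path>',
    the layout written by the `sha256sum` command-line tool.
    """
    failures = []
    listing = (root / checksum_file).read_text(encoding="utf-8")
    for line in listing.splitlines():
        line = line.strip()
        if not line:
            continue
        expected, rel_path = line.split(maxsplit=1)
        rel_path = rel_path.lstrip("*")  # sha256sum marks binary mode with '*'
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(rel_path)
    return failures
```

Calling `verify_package(Path("delivery"))` on an unpacked delivery folder (the folder name is illustrative) returns an empty list when every listed file is present and intact.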

Need field-level specs and detailed QA rates? Request our downloadable spec sheet for full manifest definitions, split policy, and delivery templates.

Licensing Overview and Sample Pack

  • Evaluation sample pack: evaluation-only usage.
  • Production usage: requires an executed production license agreement.
  • Production license options: commercial non-exclusive, research-only, custom.
  • Sample pack includes: 3-10 representative clips, sample JSON/CSV manifest, schema docs, and README with licensing summary/evaluation terms.
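
To give a feel for working with the sample manifest, here is a minimal Python sketch that reads a CSV manifest, counts clips per split, and filters by recording environment. The column names (`clip_id`, `dialect`, `num_speakers`, `environment`, `split`) and the four rows are invented for illustration; the real field definitions are in the schema docs shipped with the sample pack.

```python
import csv
import io
from collections import Counter

# Hypothetical manifest content; real column names come from the schema docs.
SAMPLE_MANIFEST = """clip_id,dialect,num_speakers,environment,split
clip_0001,lebanese,2,phone_call,train
clip_0002,lebanese,3,cafe,train
clip_0003,lebanese,2,home,validation
clip_0004,lebanese,2,phone_call,test
"""


def load_manifest(text: str) -> list[dict]:
    """Parse CSV manifest text into row dicts, casting the speaker count to int."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["num_speakers"] = int(row["num_speakers"])
    return rows


def clips_per_split(rows: list[dict]) -> Counter:
    """Count how many clips fall into each data split."""
    return Counter(row["split"] for row in rows)


rows = load_manifest(SAMPLE_MANIFEST)
split_counts = clips_per_split(rows)
phone_calls = [row["clip_id"] for row in rows if row["environment"] == "phone_call"]
print(split_counts)
print(phone_calls)
```

The same pattern carries over to a JSON manifest by swapping `csv.DictReader` for `json.loads`, assuming the JSON file holds a list of records with equivalent keys.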

Request access via the contact form or by email at team@dialectdata.com.