About 懂中文 Dong Chinese

Developed by Peter Olson.

Blog: 东东's notes

Dong Chinese open roadmap

Contact: feedback@dong-chinese.com

Privacy policy | Terms of service


Where do the sentences come from?

Dong Chinese uses a database of 705,493 sentences. The sentences come from several different sources:

  • Tatoeba (17,355 sentences)

    Available under the Creative Commons Attribution 2.0 license (CC-BY 2.0)
  • UM-Corpus (29,446 sentences)

    Liang Tian, Derek F. Wong, Lidia S. Chao, Paulo Quaresma, Francisco Oliveira, Shuo Li, Yiming Wang, Yi Lu, "UM-Corpus: A Large English-Chinese Parallel Corpus for Statistical Machine Translation." Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland, 2014.
    • Education (13,080 sentences)

    • Microblog (156 sentences)

    • News (6,055 sentences)

    • Science (1,341 sentences)

    • Spoken (8,054 sentences)

    • Subtitles (760 sentences)

  • AI Challenger caption dataset (210,000 images with 565,231 captions)

    Wu, Jiahong, et al. "AI Challenger: A large-scale dataset for going deeper in image understanding." arXiv preprint arXiv:1711.06475 (2017).
  • AI Challenger translation dataset (91,220 sentences)

  • Programmatically generated small-vocabulary sentences (2,241 sentences)

How is the percentage of movies and books I understand estimated?

Dong Chinese uses the following data:

What technologies are used?

Dong Chinese was built with the help of the following libraries, frameworks, and services:

The following open-source libraries were created while developing Dong Chinese:

Miscellaneous attributions