Using modern machine translation techniques, we collected a dataset and trained an MT model for Southern Uzbek, a language with around 5 million speakers.
I created UzLiB, a comprehensive linguistic benchmark to evaluate LLMs on Uzbek. The results revealed that even top models fail to surpass 70% accuracy, highlighting a key gap in their capabilities.
We translated the FLORES+ dev set into Karakalpak. We then collected 100,000 pairs involving Karakalpak to fine-tune a machine translation model and improve upon existing baselines.
We developed a hybrid GEC model for Uzbek by integrating a rule-based morphological analyzer with a neural network, significantly improving its ability to handle complex agglutinative grammar.
We built and released UzBooks and UzCrawl by processing over 35,000 books and web data. At 36 GB, the resulting corpus is now the largest publicly available, high-quality text corpus for the Uzbek language.
Miscellanea
Sports
I am an avid football (the real one) and table tennis player.
Blogging
I enjoy translating popular Machine Learning blog posts into Uzbek. You can find my translations on my Substack.
Tutoring
I have experience as a Math tutor for 4th-grade students, where I prepared them for entrance exams for a prestigious school in Uzbekistan.