Open Source Text Processing Project: langid

Stand-alone language identification system

Project Website: None

Github Link:

Description: langid is a standalone Language Identification (LangID) tool.
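As a rough illustration of what using such a tool looks like, the sketch below assumes the package is installed and importable as `langid`, and that it exposes a module-level `classify` function returning a (language code, score) pair. The `identify` wrapper is a hypothetical helper introduced here, not part of the project; it returns None when the package is not available.

```python
# Optional import: degrade gracefully when the package is not installed.
try:
    import langid
except ImportError:
    langid = None


def identify(text):
    """Hypothetical helper: return (language_code, score), or None if langid is unavailable."""
    if langid is None:
        return None
    # classify() is assumed to return a (language code, score) tuple.
    lang, score = langid.classify(text)
    return lang, score


print(identify("This is a short English sentence."))
```

The score is best treated as a relative ranking value rather than a calibrated probability unless the tool's documentation says otherwise.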

The design principles are as follows:

Pre-trained over a large number of languages (currently 97)
Not sensitive to domain-specific features (e.g. HTML/XML markup)
Single .py file with minimal dependencies
Deployable as a web service

Running langid requires only Python 2.7 or later and numpy. The main script langid/ is cross-compatible with both Python 2 and Python 3, but the accompanying training tools are still Python 2-only. The tool is WSGI-compliant: it uses fapws3 as the web server if available and falls back to wsgiref.simple_server otherwise. It comes pre-trained on 97 languages (ISO 639-1 codes given):

af, am, an, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, dz, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, gu, he, hi, hr, ht, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, nb, ne, nl, nn, no, oc, or, pa, pl, ps, pt, qu, ro, ru, rw, se, si, sk, sl, sq, sr, sv, sw, ta, te, th, tl, tr, ug, uk, ur, vi, vo, wa, xh, zh, zu
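To make the WSGI point concrete, here is a minimal sketch (Python 3) of how a classifier could be wrapped as a WSGI application and served with the standard-library `wsgiref.simple_server`. It is not the project's own server code: `make_app`, `serve`, the `q` query parameter, the JSON response shape, and the port are all assumptions chosen for illustration, and the fapws3 path is not shown.

```python
import json
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server


def make_app(classify):
    """Build a WSGI app around a classify(text) -> (lang, score) callable (hypothetical design)."""
    def app(environ, start_response):
        # Read the text to classify from the ?q=... query parameter (an assumed convention).
        query = parse_qs(environ.get("QUERY_STRING", ""))
        text = query.get("q", [""])[0]
        lang, score = classify(text)
        body = json.dumps({"language": lang, "score": float(score)}).encode("utf-8")
        start_response("200 OK", [
            ("Content-Type", "application/json"),
            ("Content-Length", str(len(body))),
        ])
        return [body]
    return app


def serve(app, host="127.0.0.1", port=8000):
    """Serve the app with the standard-library fallback server; blocks until interrupted."""
    with make_server(host, port, app) as httpd:
        httpd.serve_forever()
```

Assuming the package is installed, something like `serve(make_app(langid.classify))` would expose the classifier over HTTP; any callable with the same return shape can be substituted for testing.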
The training data was drawn from 5 different sources, including:

ClueWeb 09
Reuters RCV2
Debian i18n
