langid.py: Stand-alone language identification system
Project Website: None
Github Link: https://github.com/saffsd/langid.py
langid.py is a standalone Language Identification (LangID) tool.
The design principles are as follows:
Pre-trained over a large number of languages (currently 97)
Not sensitive to domain-specific features (e.g. HTML/XML markup)
Single .py file with minimal dependencies
Deployable as a web service
langid.py requires only Python 2.7 or later and numpy. The main script, langid/langid.py, runs under both Python 2 and Python 3, but the accompanying training tools are still Python 2-only.
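A minimal usage sketch, assuming langid.py is installed as the `langid` package (for example via `pip install langid`). `langid.classify` is the library's module-level entry point; the `identify` wrapper and its fallback to `None` are illustrative additions, not part of langid.py.

```python
try:
    import langid  # needs numpy, as noted in the requirements above
except ImportError:  # keep the sketch importable when langid.py is absent
    langid = None


def identify(text):
    """Return (language_code, score) for text, or None if langid.py is unavailable.

    `identify` is a hypothetical helper written for this example.
    """
    if langid is None:
        return None
    # classify() returns a pair: an ISO 639-1 code and a confidence score.
    # By default the score is not normalized to a probability.
    return langid.classify(text)
```

Calling `identify("This is a short English sentence.")` would return a two-element tuple when langid.py is installed, and `None` otherwise.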
langid.py is WSGI-compliant. It uses fapws3 as its web server if available and falls back to wsgiref.simple_server otherwise.
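To illustrate what WSGI compliance means in practice, the Python 3 sketch below wraps a classifier in a bare WSGI application and serves it with `wsgiref.simple_server` from the standard library. It is not langid.py's own server code: the `make_app` and `serve` helpers, the `q` query parameter, the JSON response shape, and the port are all assumptions chosen for the example.

```python
import json
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server


def make_app(classify):
    """Build a WSGI app around classify(text) -> (language, score)."""

    def app(environ, start_response):
        # Read the text to classify from the (hypothetical) ?q= parameter.
        query = parse_qs(environ.get("QUERY_STRING", ""))
        text = query.get("q", [""])[0]
        language, score = classify(text)
        body = json.dumps(
            {"language": language, "score": float(score)}
        ).encode("utf-8")
        start_response("200 OK", [
            ("Content-Type", "application/json"),
            ("Content-Length", str(len(body))),
        ])
        return [body]

    return app


def serve(classify, host="localhost", port=8000):
    """Serve the app with the standard-library reference server (blocks)."""
    with make_server(host, port, make_app(classify)) as httpd:
        httpd.serve_forever()
```

With langid.py installed, `serve(langid.classify)` would expose the classifier over HTTP. Because `make_app` returns a plain WSGI callable, any other WSGI server could host it instead.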
langid.py comes pre-trained on 97 languages (ISO 639-1 codes given):
af, am, an, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, dz, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, gu, he, hi, hr, ht, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, nb, ne, nl, nn, no, oc, or, pa, pl, ps, pt, qu, ro, ru, rw, se, si, sk, sl, sq, sr, sv, sw, ta, te, th, tl, tr, ug, uk, ur, vi, vo, wa, xh, zh, zu
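When the candidate languages are known in advance, it can help to check the requested codes against the list above before restricting the classifier. The sketch below performs only that check, using the standard library; the `validate_codes` helper and the `SUPPORTED` name are hypothetical. langid.py provides `langid.set_languages` for restricting the language set, which appears here only in a comment.

```python
# The 97 ISO 639-1 codes listed above, copied verbatim.
SUPPORTED = set("""
af am an ar as az be bg bn br bs ca cs cy da de dz el en eo es et eu fa fi
fo fr ga gl gu he hi hr ht hu hy id is it ja jv ka kk km kn ko ku ky la lb
lo lt lv mg mk ml mn mr ms mt nb ne nl nn no oc or pa pl ps pt qu ro ru rw
se si sk sl sq sr sv sw ta te th tl tr ug uk ur vi vo wa xh zh zu
""".split())


def validate_codes(codes):
    """Return codes as a list if every code is supported, else raise ValueError."""
    codes = list(codes)
    unknown = sorted(set(codes) - SUPPORTED)
    if unknown:
        raise ValueError("unsupported language codes: " + ", ".join(unknown))
    return codes


# With langid.py installed, the validated list could then be passed on:
#   langid.set_languages(validate_codes(["de", "fr", "it"]))
```

Validating first turns a typo such as `"ge"` for German into an immediate, readable error instead of a failure inside the classifier.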
The training data was drawn from five different sources: