From mediawiki.org
TextCat is a language detection library based on n-gram text categorization. It is used by the CirrusSearch extension for MediaWiki, which provides on-wiki search.

On Wikipedias where TextCat is deployed, a search that gets very few results will be sent to TextCat for language identification. If a likely language for the query can be determined, then the relevant Wikipedia is also searched, and any results are also displayed.

It is currently deployed to the Dutch, English, French, German, Italian, Japanese, Portuguese, Russian, and Spanish Wikipedias.

As an example, searching English Wikipedia for málvísindi (‘linguistics’ in Icelandic) gets no results, so results from the Icelandic Wikipedia are shown.

How TextCat works

TextCat attempts to determine the language of a search string by comparing it to a model for each language. Each model is essentially a ranked list of the n-grams (1- to 5-character sequences found in text) that are most common in a particular language.

For example, in the language model for French that TextCat uses, the character "é" ranks high (currently line 41), whereas in the model for English the same character appears much lower in the list (currently line 3781).
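The ranked n-gram comparison can be sketched as follows. This is a minimal illustration of the rank-order ("out-of-place") approach from the Cavnar and Trenkle paper that TextCat is based on, not the code of the PHP or Perl library. The function names, the `top_k` cutoff, the underscore padding, and the two tiny training sentences are all assumptions made for the example; real models are built from far larger samples.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top_k=300):
    """Return the top_k most frequent 1- to max_n-character n-grams, most common first."""
    counts = Counter()
    for word in text.split():
        padded = f"_{word}_"  # underscores mark word boundaries (an assumption of this sketch)
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(query_profile, model_profile):
    """Sum of rank differences; n-grams missing from the model get the maximum penalty."""
    model_rank = {gram: rank for rank, gram in enumerate(model_profile)}
    max_penalty = len(model_profile)
    return sum(
        abs(rank - model_rank[gram]) if gram in model_rank else max_penalty
        for rank, gram in enumerate(query_profile)
    )


def rank_languages(query, models):
    """Return (language, distance) pairs, smallest distance (best match) first."""
    query_profile = ngram_profile(query)
    scored = ((lang, out_of_place(query_profile, model)) for lang, model in models.items())
    return sorted(scored, key=lambda pair: pair[1])


# Toy models built from one sentence each (hypothetical training data).
models = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "fr": ngram_profile("le renard brun saute par-dessus le chien paresseux et le chat"),
}
print(rank_languages("the dog", models))
```

A lower distance means the query's most common n-grams sit at similar ranks in the language model, so the first entry of the returned list is the best guess.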

Language identification is generally harder on shorter strings of text, and search queries are often fewer than a dozen characters long. Some words are inherently ambiguous (for example the words "liaison" or "magistrate" in both English and French). Special characters and diacritics can be fairly distinctive (like German "ß" or Vietnamese "ế"), but some visitors don't include special characters when searching for a particular phrase or word, while others do.

Because of the differences in the language people use in queries, where possible, our language models are built on samples of actual queries. For other languages, we use articles from the relevant Wikipedia as samples of general text in the language.

Limitations

There are several limits we have to place on TextCat for reasons of efficiency, complexity, or accuracy. And there are some things that it has a bit of trouble with.

We limit the number of languages considered by TextCat on a given wiki for a couple of different reasons, and so the exact list differs by wiki (see wgCirrusSearchTextcatLanguages for the current list per wiki).

  • It’s much faster to consider fewer languages, so we only consider languages that are relatively common in queries on a particular wiki. So, Japanese Wikipedia does not consider Spanish when doing language detection because, historically, there haven’t been that many queries in Spanish there.
  • Some languages are more likely to cause confusion, but also don’t occur that frequently, so we omit them from consideration. For example, Dutch and Afrikaans are closely related, and it may be the case that queries in those languages are frequently confused. If there are twenty times as many Dutch queries as Afrikaans queries, but half the Dutch queries are incorrectly identified as Afrikaans, we would drop Afrikaans, because missing ten Afrikaans queries is better than getting 100 Dutch queries wrong.
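The trade-off in the Dutch and Afrikaans example can be worked through with the numbers given above. The sketch below is only arithmetic on those hypothetical figures; it additionally assumes that Afrikaans queries are all identified correctly when Afrikaans is enabled, and that Dutch queries are identified correctly once Afrikaans is dropped.

```python
afrikaans_queries = 10
dutch_queries = 20 * afrikaans_queries      # twenty times as many Dutch queries
dutch_misidentified = dutch_queries // 2    # half of them are labelled Afrikaans

# With Afrikaans enabled: the misidentified Dutch queries are the errors.
errors_with_afrikaans = dutch_misidentified

# With Afrikaans dropped: every Afrikaans query is missed instead.
errors_without_afrikaans = afrikaans_queries

print(errors_with_afrikaans, errors_without_afrikaans)
```

Under these assumptions, dropping Afrikaans trades 100 wrong answers for 10 missed ones.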

We consider at most one “best” answer returned by TextCat.

  • Searching additional wikis is computationally expensive, especially if we search many of them. Also, for lower-quality language matches, the results are less likely to be meaningful. For example, there may be only one instance of a word on a given wiki, found in the title of a reference work.
  • From a user interface perspective, showing results from multiple wikis is complicated. The search results page can already be crowded, and we don’t have any UI/UX experts on the search team.
  • Generally, when the results from TextCat are ambiguous—meaning that two or more languages scored very similarly—the “best” answer is much more likely to be wrong. Since we are trying to provide useful additional results to the searcher, omitting these results improves accuracy.
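The ambiguity rule in the last point can be sketched as a simple cutoff on how close the runner-up is to the best score. The function name, the use of distances where lower is better, and the 5% margin are assumptions for illustration; they are not the thresholds used in production.

```python
def best_unambiguous(distances, min_ratio=1.05):
    """Return the best language, or None if the runner-up scores too similarly.

    distances maps language code -> distance (lower is better).
    min_ratio is a hypothetical margin: the runner-up must be at least
    this many times worse than the best score for the answer to be kept.
    """
    ranked = sorted(distances.items(), key=lambda pair: pair[1])
    if not ranked:
        return None
    if len(ranked) == 1:
        return ranked[0][0]
    (best_lang, best), (_, runner_up) = ranked[0], ranked[1]
    if runner_up < best * min_ratio:
        return None  # too close to call, so no extra wiki is searched
    return best_lang


print(best_unambiguous({"de": 1000, "en": 1020}))
print(best_unambiguous({"fr": 1000, "en": 1400}))
```

In the first call the two scores are within the margin, so no language is returned; in the second the gap is large enough for the best match to be used.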

Similar queries may get different results from TextCat. You have to draw the line somewhere, and there will always be related pairs of words that land on opposite sides of the line.

  • TextCat has a bias in favor of the language of the wiki you are on, and in favor of English. Because queries on a given wiki are most likely to be in the language of that wiki, we give that language a boost. For example, adorable is a word in English, French, and Spanish. On the Dutch Wikipedia, it might be identified as Spanish (over French and English), while on the English Wikipedia, it would be too ambiguous, because English gets a boost. (Note that this is just a hypothetical example; there are too many results for adorable on Dutch Wikipedia for TextCat to be invoked.)
    • It turns out to be good to not only give the language of the wiki a boost, but also the second most common language seen in queries. For every other wiki, this happens to be English. (On English Wikipedia, the second most common is Chinese.)
  • Because capitalization matters a little—Folklore is ever so slightly more likely to be German than English if it is capitalized in the middle of a sentence, for example—there will always be cases where the difference between different capitalizations happens to cross the threshold for ambiguous. So Folklore might just barely be recognized as German, while folklore is too ambiguous between German and English and ignored.
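The effect of the boost on the hypothetical adorable example can be sketched as below. The distances, the 10% boost, and the 5% ambiguity margin are made-up values, and for simplicity only the wiki's own language is boosted (the additional boost for the second most common query language is left out).

```python
def apply_boost(distances, boosted_langs, boost=0.1):
    """Improve (lower) the distance of each boosted language by the given fraction."""
    return {
        lang: dist * (1 - boost) if lang in boosted_langs else dist
        for lang, dist in distances.items()
    }


def pick(distances, min_ratio=1.05):
    """Return the best language, or None if the runner-up is within the margin."""
    ranked = sorted(distances.items(), key=lambda pair: pair[1])
    if len(ranked) > 1 and ranked[1][1] < ranked[0][1] * min_ratio:
        return None
    return ranked[0][0] if ranked else None


# Hypothetical raw distances for the query "adorable" (lower is better).
raw = {"es": 1000, "fr": 1080, "en": 1100}

on_dutch_wiki = pick(apply_boost(raw, {"nl"}))    # Dutch is not a candidate here
on_english_wiki = pick(apply_boost(raw, {"en"}))  # English gets the boost

print(on_dutch_wiki, on_english_wiki)
```

With these numbers Spanish wins clearly on the Dutch wiki, while on the English wiki the boosted English score lands so close to Spanish that the result is treated as ambiguous.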

Chinese and Korean are harder for TextCat than you might expect. If a query is nothing but Chinese characters, it’s probably Chinese, right? Similarly if it’s all Korean characters. However, TextCat’s n-gram model means that for writing systems with a very large number of individual Unicode characters (Korean has eleven thousand, Chinese has tens of thousands), not all of them are in the model. And since people often search Wikipedia for “interesting” things, rarer characters are not at all unlikely to occur in queries from time to time. For large pieces of text, the more common characters are almost certain to occur, but in short queries, they may not. Throw in a few Latin characters in a query, and the result may suddenly become too ambiguous.

Development and technical details

Rationale

People sometimes search using words that are not in the language of the wiki they are searching. Sometimes it works (e.g., Луковичная глава, Unión de Radioaficionados Españoles, or 汉语 on English Wikipedia), and sometimes it doesn't (e.g., force électromotrice on English Wikipedia), though it would if we could redirect the query to the wiki in the same language as the query (e.g., force électromotrice on French Wikipedia). In order to do that, we need to be able to detect the language of the query.

Origins and development

The original version of TextCat is a Perl library developed by Gertjan van Noord, based on a 1994 paper by Cavnar and Trenkle. The original TextCat is relatively lightweight and reasonably accurate, compared to other language identification libraries available when the search team first looked into using language identification.

The Wikimedia Foundation maintains a PHP port of this library available as a Composer package. It is used by the CirrusSearch extension for MediaWiki.

The PHP port has several new features that take advantage of a couple of decades of improvements to computer hardware to use much larger and more accurate language models, use Unicode, and to otherwise improve the ability of TextCat’s n-gram models to distinguish between languages while remaining relatively lightweight.

There is also an updated Perl version, maintained by Trey Jones, which has all of the new features of the updated PHP version.

Training data

To understand what makes text look (or not look) like a particular language, training data was developed based upon historical query strings. These query strings were run against TextCat and used to build up the model for a given language. These corpora of text, cleaned of bot traffic and errant searches, helped to 'teach' TextCat which n-grams commonly appear in a language. Using query data for training, rather than general text like Wikipedia article text, also gives better results in testing and improves the accuracy of language detection for queries.

The PHP port of TextCat includes models built on query data (for use with queries), and models built on general Wikipedia article text, which may be more useful for generic language detection.

Maintenance

You can find tasks related to TextCat in Phabricator.

Updating the library

To update the deployed library once a change has been merged into the library repository:

  1. Tag the library with the new version and push the tag.
  2. Check on wikimedia/textcat that the tag is updated.
  3. Update composer.json in extension/CirrusSearch.
  4. On a non-production install, run composer update --no-dev and test that everything runs smoothly.
  5. Check out the mediawiki/vendor repo.
  6. Edit composer.json and put the new version of wikimedia/textcat there.
  7. Run composer update --no-dev.
  8. Create a patch with the changes and submit it for review on Gerrit.
