
TextCat

From mediawiki.org

TextCat is a language detection library based on n-gram text categorization used by the CirrusSearch extension for MediaWiki, which provides on-wiki search.

On Wikipedias where TextCat is deployed, a search that gets very few results will be sent to TextCat for language identification. If a likely language for the query can be determined, then the relevant Wikipedia is also searched, and any results are also displayed.

It is currently deployed to the Dutch, English, French, German, Italian, Japanese, Portuguese, Russian, and Spanish Wikipedias.

As an example, searching English Wikipedia for málvísindi (‘linguistics’ in Icelandic) gets no results, so results from the Icelandic Wikipedia are shown.

How TextCat works


TextCat attempts to determine the language of a search string by comparing it to a model for each language. Each model is essentially a ranked list of the n-grams (1- to 5-letter sequences found in the text) that are most common in that language.
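As a rough illustration of what such a ranked list looks like, the sketch below builds one from a sample string. It is a minimal example, not TextCat's actual code: the function name `ngram_profile`, the `_` word-boundary padding, and the tie-breaking rule are all assumptions made for this illustration.

```python
from collections import Counter

def ngram_profile(text, max_n=5, top_k=None):
    """Return the n-grams of length 1..max_n, most frequent first."""
    counts = Counter()
    for word in text.split():
        # Mark word boundaries with "_" (an assumption of this sketch).
        padded = f"_{word}_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically for determinism.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return ranked if top_k is None else ranked[:top_k]

profile = ngram_profile("le café et le thé")
print(profile[:5])
```

A real model would be built from a much larger sample and truncated to a fixed number of entries, which is what the hypothetical `top_k` parameter stands in for.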

For example, in the language model for French that TextCat uses, the character "é" ranks much higher (currently line 41) than it does in the model for English, where it appears far lower in the list (currently line 3781).
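One way to turn such rank differences into a score is the "out-of-place" measure described by Cavnar and Trenkle: for each n-gram in the query, add up how far its rank is from its rank in the language model, with a fixed penalty for n-grams the model does not contain. The sketch below assumes this measure and uses tiny hand-made models; the lists `french` and `english` are hypothetical and far smaller than real models.

```python
def out_of_place(query_profile, model_profile):
    """Sum of rank differences between a query profile and a model (lower is closer)."""
    model_rank = {gram: rank for rank, gram in enumerate(model_profile)}
    missing_penalty = len(model_profile)  # charged for n-grams absent from the model
    total = 0
    for rank, gram in enumerate(query_profile):
        if gram in model_rank:
            total += abs(rank - model_rank[gram])
        else:
            total += missing_penalty
    return total

# Hypothetical toy models, ranked from most to least common.
french = ["e", "s", "a", "é", "t"]
english = ["e", "t", "a", "o", "n"]

query = ["é", "e"]  # ranked n-grams of a very short query
scores = {"fr": out_of_place(query, french), "en": out_of_place(query, english)}
best = min(scores, key=scores.get)
print(scores, best)
```

The language with the lowest total is the closest match; here the toy French model wins because it contains "é" at all.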

Language identification is generally harder on shorter strings of text, and search queries are often fewer than a dozen characters long. Some words are inherently ambiguous (for example, "liaison" and "magistrate" are words in both English and French). Special characters and diacritics can be fairly distinctive (like German "ß" or Vietnamese "ơ"), but some visitors don't include special characters when searching for a particular phrase or word, while others do.

Because of the differences in the language people use in queries, where possible, our language models are built on samples of actual queries. For other languages, we use articles from the relevant Wikipedia as samples of general text in the language.

Limitations

There are several limits we have to place on TextCat for reasons of efficiency, complexity, or accuracy. And there are some things that it has a bit of trouble with.

We limit the number of languages considered by TextCat on a given wiki for a couple of different reasons, and so the exact list differs by wiki (see wgCirrusSearchTextcatLanguages for the current list per wiki).

  • It’s much faster to consider fewer languages, so we only consider languages that are relatively common in queries on a particular wiki. So, Japanese Wikipedia does not consider Spanish when doing language detection because, historically, there haven’t been that many queries in Spanish there.
  • Some languages are more likely to cause confusion, but also don’t occur that frequently, so we omit them from consideration. For example, Dutch and Afrikaans are closely related, and it may be the case that queries in those languages are frequently confused. If there are twenty times as many Dutch queries as Afrikaans queries, but half the Dutch queries are incorrectly identified as Afrikaans, we would drop Afrikaans, because missing ten Afrikaans queries is better than getting 100 Dutch queries wrong.
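The Dutch and Afrikaans trade-off can be checked with a few lines of arithmetic. The query counts below are the illustrative figures implied by the example (ten Afrikaans queries, twenty times as many Dutch ones), not measured data.

```python
afrikaans_queries = 10
dutch_queries = 20 * afrikaans_queries      # twenty times as many Dutch queries
dutch_misidentified = dutch_queries // 2    # half are wrongly labeled Afrikaans

# Keeping Afrikaans as a candidate: the mislabeled Dutch queries are errors.
errors_if_kept = dutch_misidentified
# Dropping Afrikaans: every Afrikaans query goes undetected.
errors_if_dropped = afrikaans_queries

print(errors_if_kept, errors_if_dropped)
```

Dropping Afrikaans costs 10 missed queries but avoids 100 wrong ones, so the candidate list is better off without it in this scenario.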

We consider at most the one “best” answer returned from TextCat.

  • Searching additional wikis is computationally expensive, especially if we search many of them. Also, for lower-quality language matches, the results are less likely to be meaningful. For example, there may be only one instance of a word on a given wiki, and it may be in the title of a reference work.
  • From a user interface perspective, showing results from multiple wikis is complicated. The search results page can already be crowded, and we don’t have any UI/UX experts on the search team.
  • Generally, when the results from TextCat are ambiguous—meaning that two or more languages scored very similarly—the “best” answer is much more likely to be wrong. Since we are trying to provide useful additional results to the searcher, omitting these results improves accuracy.
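The "one best answer, unless it is ambiguous" rule can be sketched as follows. This assumes that each language gets a distance score where lower is better, and that two scores count as "very similar" when the runner-up is within a fixed ratio of the best. The function name `best_unambiguous` and the `max_ratio` value are hypothetical, chosen only for illustration.

```python
def best_unambiguous(distances, max_ratio=1.05):
    """Return the closest language, or None if the runner-up is too close."""
    ranked = sorted(distances.items(), key=lambda item: item[1])
    if not ranked:
        return None
    if len(ranked) == 1:
        return ranked[0][0]
    (best_lang, best), (_, runner_up) = ranked[0], ranked[1]
    # Treat the result as ambiguous when the second-best score is nearly as good.
    if runner_up <= best * max_ratio:
        return None
    return best_lang

print(best_unambiguous({"fr": 100, "en": 104}))  # scores too close to call
print(best_unambiguous({"fr": 100, "en": 150}))  # clear winner
```

Returning nothing for close calls trades a little coverage for fewer wrong guesses, which matches the goal of only showing additional results that are likely to be useful.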

Similar queries may get different results from TextCat. You have to draw the line somewhere, and there will always be related pairs of words that land on opposite sides of the line.

  • TextCat has a bias in favor of the language of the wiki you are on, and in favor of English. Because queries on a given wiki are most likely to be in the language of that wiki, we give that language a boost. For example, adorable is a word in English, French, and Spanish. On the Dutch Wikipedia, it might be identified as Spanish (over French and English), while on the English Wikipedia, it would be too ambiguous, because English gets a boost. (Note that this is just a hypothetical example; there are too many results for adorable on Dutch Wikipedia for TextCat to be invoked.)
    • It turns out to be good to not only give the language of the wiki a boost, but also the second most common language seen in queries. For every other wiki, this happens to be English. (On English Wikipedia, the second most common is Chinese.)
  • Because capitalization matters a little—Folklore is ever so slightly more likely to be German than English if it is capitalized in the middle of a sentence, for example—there will always be cases where the difference between different capitalizations happens to cross the threshold for ambiguous. So Folklore might just barely be recognized as German, while folklore is too ambiguous between German and English and ignored.
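To see how a boost can push a result across the ambiguity line, the sketch below shrinks the distance score of boosted languages by a fixed fraction. Everything here is an assumption made for illustration: the function `apply_boost`, the 10% boost, and the made-up distances for "adorable" are not taken from TextCat's configuration.

```python
def apply_boost(distances, boosted_langs, boost=0.10):
    """Shrink the distance (lower is better) of each boosted language.

    `boost` is a hypothetical fraction used only for this illustration.
    """
    return {
        lang: dist * (1 - boost) if lang in boosted_langs else dist
        for lang, dist in distances.items()
    }

# Made-up distances for the query "adorable"; lower means a closer match.
raw = {"es": 100.0, "fr": 110.0, "en": 112.0}

# A wiki where none of the candidate languages is boosted.
no_boost = apply_boost(raw, boosted_langs=set())
# English Wikipedia: English and Chinese are boosted.
english_wiki = apply_boost(raw, boosted_langs={"en", "zh"})

print(min(no_boost, key=no_boost.get))
print(english_wiki)
```

Without a boost, Spanish is clearly the closest match. With English boosted, its score lands almost level with the Spanish one, so the result would be treated as too ambiguous to use.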

Chinese and Korean are harder for TextCat than you might expect. If a query is nothing but Chinese characters, it’s probably Chinese, right? Similarly if it’s all Korean characters. However, TextCat’s n-gram model means that for writing systems with a very large number of individual Unicode characters (Korean has eleven thousand, Chinese has tens of thousands), not all of them are in the model. And since people often search Wikipedia for “interesting” things, rarer characters are not at all unlikely to occur in queries from time to time. For large pieces of text, the more common characters are almost certain to occur, but in short queries, they may not. Throw in a few Latin characters in a query, and the result may suddenly become too ambiguous.
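The scale of the problem for Korean is easy to verify: the Unicode Hangul Syllables block runs from U+AC00 to U+D7A3. The sketch below counts those code points and then measures how much of a query falls outside a model's known characters. The helper `unknown_fraction` and the tiny `model_chars` set are hypothetical stand-ins for a real model.

```python
# Every precomposed Hangul syllable in Unicode (U+AC00 through U+D7A3).
hangul_syllables = [chr(cp) for cp in range(0xAC00, 0xD7A3 + 1)]
print(len(hangul_syllables))

def unknown_fraction(query, model_chars):
    """Fraction of non-space characters in the query that the model has never seen."""
    chars = [c for c in query if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c not in model_chars for c in chars) / len(chars)

# A toy "model" that only knows the characters of one common phrase.
model_chars = set("한국어 위키백과")

print(unknown_fraction("한국", model_chars))  # common characters, all known
print(unknown_fraction("뷁", model_chars))    # rare syllable, unknown to the model
```

A short query made entirely of characters missing from the model gives the model nothing to match on, which is why rare characters hurt far more in queries than in long texts.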

Development and technical details


Rationale


People sometimes search using words that are not in the language of the wiki they are searching. Sometimes it works (e.g., Луковичная глава, Unión de Radioaficionados Españoles, or 汉语 on English Wikipedia), and sometimes it doesn't (e.g., force électromotrice on English Wikipedia), though it would if we could redirect the query to the wiki in the same language as the query (e.g., force électromotrice on French Wikipedia). In order to do that, we need to be able to detect the language of the query.

Origins and development


The original version of TextCat is a Perl library developed by Gertjan van Noord, based on a 1994 paper by Cavnar and Trenkle. The original TextCat is relatively lightweight and reasonably accurate, compared to other language identification libraries available when the search team first looked into using language identification.

The Wikimedia Foundation maintains a PHP port of this library available as a Composer package. It is used by the CirrusSearch extension for MediaWiki.

The PHP port has several new features that take advantage of a couple of decades of improvements to computer hardware to use much larger and more accurate language models, use Unicode, and to otherwise improve the ability of TextCat’s n-gram models to distinguish between languages while remaining relatively lightweight.

There is also an updated Perl version, maintained by Trey Jones, which has all of the new features of the updated PHP version.

Training data


To understand what makes a query look (or not look) like a particular language, training data was developed from historical query strings. These query strings were run against TextCat and used to build up the model for a given language. These corpora of text, cleaned of bot traffic and errant searches, helped to 'teach' TextCat which n-grams commonly appear in a language. Using query data for training, rather than general text like Wikipedia article text, also gives better results in testing and improves the accuracy of language detection for queries.

The PHP port of TextCat includes models built on query data (for use with queries), and models built on general Wikipedia article text, which may be more useful for generic language detection.

Maintenance


You can find tasks related to TextCat in Phabricator.

Updating the library

In order to update the deployment library once a change has been merged into the library repository:

  1. Tag the library with the new version and push the tag.
  2. Check on wikimedia/textcat that the tag is updated.
  3. Update composer.json in extension/CirrusSearch.
  4. On a non-production install, run composer update --no-dev and test that everything runs smoothly.
  5. Check out the mediawiki/vendor repo.
  6. Edit composer.json and put the new version of wikimedia/textcat there.
  7. Run composer update --no-dev.
  8. Make a patch of the changes and put it up for review on Gerrit.
