Chinese Analyzers for Elasticsearch - Uzabase for Engineers

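Hello! I'm Cheng (成) from the SaaS Product Team.

I'm from Shanghai, China. Chinese class was never my strong subject at school and I failed plenty of tests, but I do understand Chinese. I recently looked into full-text search over Chinese text for product development, so here I'd like to introduce the Chinese analyzers for Elasticsearch. If you ever need to support Chinese full-text search, I'd be very glad if this article is of some help, even if you can't read Chinese.

To get good search accuracy for Chinese in Elasticsearch, an analyzer that can correctly perform morphological analysis (word segmentation) on Chinese text is essential. After searching around, including articles from within China via Baidu, I narrowed things down to the following two analyzers.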

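smartcn
- The Chinese analyzer provided as an official Elasticsearch plugin (analysis-smartcn). It is based on Lucene's Smart Chinese analysis.
ik-analysis
- The most popular analyzer in China. It is maintained at a reasonable pace, and the owner apparently works at Elastic.

I installed both analyzers on a local Elasticsearch, tried them out, and summarized the results in the matrix below.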

Conclusion
Setup
Verification
Closing

Conclusion

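Criterion	smartcn	ik-analysis	Notes
Morphological analysis	△	◯	Explained later with an example.
Extensibility	◯	◯	On top of the provided tokenizers, you can add character filters and token filters to customize the analysis in many ways. With ik-analysis, stop words and similar settings can also be customized inside the plugin instead of through a token filter.
Dictionary customization	✕	◯	Does not appear to be possible with smartcn.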
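
As a rough illustration of the "Extensibility" row, the sketch below builds index settings that wrap each plugin's tokenizer in a custom analyzer with a character filter and a stop token filter. The names `zh_stop`, `ik_smart_custom`, `smartcn_custom` and the two-word stop list are made up for this example; `ik_smart` and `smartcn_tokenizer` are the tokenizer names the two plugins register, and `html_strip` and `stop` are built-in Elasticsearch filters.

```python
import json

# Hypothetical names: "zh_stop", "ik_smart_custom" and "smartcn_custom" are
# invented for this sketch; the stop word list is a tiny illustrative sample.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                # Built-in "stop" token filter with an inline word list
                "zh_stop": {"type": "stop", "stopwords": ["了", "的"]}
            },
            "analyzer": {
                "ik_smart_custom": {
                    "type": "custom",
                    "char_filter": ["html_strip"],  # built-in character filter
                    "tokenizer": "ik_smart",        # tokenizer from ik-analysis
                    "filter": ["zh_stop"],
                },
                "smartcn_custom": {
                    "type": "custom",
                    "tokenizer": "smartcn_tokenizer",  # tokenizer from analysis-smartcn
                    "filter": ["zh_stop"],
                },
            },
        }
    }
}

# Prints a body that could be sent when creating an index (PUT /<index-name>).
print(json.dumps(index_settings, ensure_ascii=False, indent=2))
```

The printed JSON can be pasted into Kibana's Dev Tools as the body of an index-creation request; the index name is left to the reader.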

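I have not verified this against a large volume of Chinese text, but my impression is that ik-analysis is better suited to Chinese search than smartcn. ik-analysis resembles the Japanese analyzer (kuromoji): its dictionary can be customized, and it provides more than one tokenizer.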

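ik_smart
- When something can be identified as a word, it is not split any further.
  Example: 中华人民共和国国歌 -> 中华人民共和国 (the official name of China), 国歌 (national anthem)
- Comparable to the normal mode of kuromoji_tokenizer
ik_max_word
- Splits words further than ik_smart does.
  Example: 中华人民共和国国歌 -> 中华人民共和国 , 中华人民 , 中华 , 华人 , 人民共和国 , 人民 , 人 , 民 , 共和国 , 共和 , 和 , 国国 , 国歌
- Comparable to the search mode of kuromoji_tokenizer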

Combining these two tokenizers with queries such as phrase queries should let you tune recall and precision flexibly.
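
One possible way to combine the two, sketched under assumptions: index a field with `ik_max_word` for broad matching, keep an `ik_smart` sub-field for stricter phrase matching, and let a `bool` query use the first for recall and the second as a precision boost. The field name `content`, the sub-field name `smart`, and the helper `build_query` are hypothetical and not part of the original setup.

```python
import json

# Hypothetical mapping: "content" and its sub-field "smart" are made-up names.
mapping = {
    "mappings": {
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "ik_max_word",  # fine-grained tokens, favors recall
                "fields": {
                    # coarser tokens, used here for phrase matching
                    "smart": {"type": "text", "analyzer": "ik_smart"}
                },
            }
        }
    }
}


def build_query(text: str) -> dict:
    """Match broadly on `content`, and score phrase hits on `content.smart` higher."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"content": text}}],
                "should": [{"match_phrase": {"content.smart": text}}],
            }
        }
    }


print(json.dumps(mapping, ensure_ascii=False, indent=2))
print(json.dumps(build_query("中华人民共和国国歌"), ensure_ascii=False, indent=2))
```

Dropping the `should` clause leans toward recall; moving the `match_phrase` clause into `must` leans toward precision. Which balance works best depends on the data, so treat this as a starting point.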

I gave the conclusion first; here is a brief look at the verification behind it.

Setup

Use Docker to start Kibana and an Elasticsearch instance with smartcn and ik-analysis installed

Dockerfile for Elasticsearch

FROM elasticsearch:7.10.1

RUN elasticsearch-plugin install -b https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.10.1/elasticsearch-analysis-ik-7.10.1.zip
RUN elasticsearch-plugin install analysis-smartcn

Building the image from the Dockerfile

docker build --rm -t es_chinese_analyzers:7.10.1 .

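Starting Elasticsearch and Kibana

# First time only: create a network shared by both containers
docker network create es_kibana_nw

# Remove containers left over from a previous run
docker rm -vf kibana
docker rm -vf elasticsearch

# Start Elasticsearch as a single node, then Kibana on the same network
docker run -d --name elasticsearch --net es_kibana_nw -p 9200:9200 -e "discovery.type=single-node" es_chinese_analyzers:7.10.1
docker run -d --name kibana --net es_kibana_nw -p 5601:5601 kibana:7.10.1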

Open http://localhost:5601 to reach Kibana, where you can send all kinds of requests from "Dev Tools" (it also autocompletes nicely, which is handy!).
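
If you would rather script the checks than use Dev Tools, the same `_analyze` requests can be sent to Elasticsearch directly. The following is a minimal sketch using only the Python standard library; it assumes Elasticsearch is reachable at http://localhost:9200, as published by the `docker run` command above, and the helper names `build_analyze_request` and `analyze` are made up for this example.

```python
import json
import urllib.request


def build_analyze_request(text, analyzer, base_url="http://localhost:9200"):
    """Build (but do not send) a POST request for the _analyze API."""
    body = json.dumps({"analyzer": analyzer, "text": [text]}, ensure_ascii=False)
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text, analyzer, base_url="http://localhost:9200"):
    """Send the request and return only the token strings from the response."""
    request = build_analyze_request(text, analyzer, base_url)
    with urllib.request.urlopen(request) as response:
        return [t["token"] for t in json.load(response)["tokens"]]


# Usage, with the containers running:
#   analyze("中华人民共和国国歌", "ik_smart")
```

Nothing is sent until `analyze` is called, so the block is safe to import even when the containers are stopped.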

Verification

Run the analyze API against smartcn and ik-analysis in Kibana

Translation: 上天不仅给了她美貌，还给了她智慧 -> "Heaven gave her not only beauty but also wisdom"

Morphological analysis with ik_smart

GET /_analyze
{
  "analyzer": "ik_smart",
  "text": ["上天不仅给了她美貌，还给了她智慧。"]
}

Result:

{
  "tokens" : [
    {
      "token" : "上天", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0
    },
    {
      "token" : "不仅", "start_offset" : 2, "end_offset" : 4, "type" : "CN_WORD", "position" : 1
    },
    {
      "token" : "给", "start_offset" : 4, "end_offset" : 5, "type" : "CN_CHAR", "position" : 2
    },
    {
      "token" : "了", "start_offset" : 5, "end_offset" : 6, "type" : "CN_CHAR", "position" : 3
    },
    {
      "token" : "她", "start_offset" : 6, "end_offset" : 7, "type" : "CN_CHAR", "position" : 4
    },
    {
      "token" : "美貌", "start_offset" : 7, "end_offset" : 9, "type" : "CN_WORD", "position" : 5
    },
    {
      "token" : "还给", "start_offset" : 10, "end_offset" : 12, "type" : "CN_WORD", "position" : 6
    },
    {
      "token" : "了", "start_offset" : 12, "end_offset" : 13, "type" : "CN_CHAR", "position" : 7
    },
    {
      "token" : "她", "start_offset" : 13, "end_offset" : 14, "type" : "CN_CHAR", "position" : 8
    },
    {
      "token" : "智慧", "start_offset" : 14, "end_offset" : 16, "type" : "CN_WORD", "position" : 9
    }
  ]
}

Morphological analysis with smartcn

GET /_analyze
{
  "analyzer": "smartcn",
  "text": ["上天不仅给了她美貌，还给了她智慧。"]
}

Result:

{
  "tokens" : [
    {
      "token" : "上天", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 0
    },
    {
      "token" : "不仅", "start_offset" : 2, "end_offset" : 4, "type" : "word", "position" : 1
    },
    {
      "token" : "给", "start_offset" : 4, "end_offset" : 5, "type" : "word", "position" : 2
    },
    {
      "token" : "了", "start_offset" : 5, "end_offset" : 6, "type" : "word", "position" : 3
    },
    {
      "token" : "她", "start_offset" : 6, "end_offset" : 7, "type" : "word", "position" : 4
    },
    {
      "token" : "美", "start_offset" : 7, "end_offset" : 8, "type" : "word", "position" : 5
    },
    {
      "token" : "貌", "start_offset" : 8, "end_offset" : 9, "type" : "word", "position" : 6
    },
    {
      "token" : "还", "start_offset" : 10, "end_offset" : 11, "type" : "word", "position" : 8
    },
    {
      "token" : "给", "start_offset" : 11, "end_offset" : 12, "type" : "word", "position" : 9
    },
    {
      "token" : "了", "start_offset" : 12, "end_offset" : 13, "type" : "word", "position" : 10
    },
    {
      "token" : "她", "start_offset" : 13, "end_offset" : 14, "type" : "word", "position" : 11
    },
    {
      "token" : "智慧", "start_offset" : 14, "end_offset" : 16, "type" : "word", "position" : 12
    }
  ]
}

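Comparing the two, both dropped the punctuation nicely, but smartcn failed to recognize 美貌 (beauty), which should be identified as a single word, and split it into one token per character. I tried a variety of other sentences as well, and smartcn's word recognition rate was generally somewhat lower.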
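
To see the differences at a glance, the two token lists can be diffed by token text and offsets. This is only a sketch: the tuples are copied by hand from the two results above, and `only_in` is a made-up helper.

```python
# (token, start_offset, end_offset) copied from the two _analyze results above
ik_smart = [
    ("上天", 0, 2), ("不仅", 2, 4), ("给", 4, 5), ("了", 5, 6), ("她", 6, 7),
    ("美貌", 7, 9), ("还给", 10, 12), ("了", 12, 13), ("她", 13, 14), ("智慧", 14, 16),
]
smartcn = [
    ("上天", 0, 2), ("不仅", 2, 4), ("给", 4, 5), ("了", 5, 6), ("她", 6, 7),
    ("美", 7, 8), ("貌", 8, 9), ("还", 10, 11), ("给", 11, 12), ("了", 12, 13),
    ("她", 13, 14), ("智慧", 14, 16),
]


def only_in(a, b):
    """Return the tokens of `a`, in order, whose (token, start, end) is absent from `b`."""
    seen = set(b)
    return [t for t in a if t not in seen]


print(only_in(ik_smart, smartcn))
# [('美貌', 7, 9), ('还给', 10, 12)]
print(only_in(smartcn, ik_smart))
# [('美', 7, 8), ('貌', 8, 9), ('还', 10, 11), ('给', 11, 12)]
```

Besides 美貌, the diff shows that ik_smart emits 还给 as one token where smartcn emits 还 and 给 separately.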

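However, neither of them seems to handle stop words by default. Stop words are words or characters that carry little meaning on their own but are used very frequently. In Japanese these would be particles such as は, が, and です; in the example above, 了 is one.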

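Elasticsearch computes search scores with BM25 by default, an algorithm in the tf-idf family. Because stop words appear frequently almost everywhere, their idf is low, so a hit on them adds little to the score, yet they still remain as noise. Stop words should therefore be removed, both to improve search accuracy and to save index size.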
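
To make the idf argument concrete, here is a minimal sketch of the idf term in the form used by Lucene's BM25 similarity (the default similarity in Elasticsearch 7.x): ln(1 + (N - n + 0.5) / (n + 0.5)), where N is the number of documents and n is the number of documents containing the term. The document counts below are invented purely for illustration.

```python
import math


def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """idf term of BM25: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


N = 10_000                   # made-up corpus size
common = bm25_idf(N, 9_500)  # a stop-word-like term found in 95% of documents
rare = bm25_idf(N, 50)       # a term found in 0.5% of documents

print(f"common term idf: {common:.3f}")
print(f"rare term idf:   {rare:.3f}")
```

The term that appears in nearly every document gets an idf close to zero, so it barely separates relevant documents from irrelevant ones while still taking up space in the index.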

https://github.com/goto456/stopwords collects several kinds of Chinese stop word lists; I will use cn_stopwords.txt from it.

Applying a stop word filter to ik-analysis

Add an IKAnalyzer.cfg.xml that applies cn_stopwords.dic as the stop word list to the Docker image

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer 扩展配置</comment>
    <!-- custom dictionary -->
    <entry key="ext_dict"></entry>
    <!-- custom stop words -->
    <entry key="ext_stopwords">cn_stopwords.dic</entry>
    <!-- custom dictionary (specified by URL) -->
    <!-- <entry key="remote_ext_dict">words_location</entry> -->
    <!-- custom stop words (specified by URL) -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

Rename cn_stopwords.txt to cn_stopwords.dic and add it to the Docker image

FROM elasticsearch:7.10.1

RUN elasticsearch-plugin install -b https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.10.1/elasticsearch-analysis-ik-7.10.1.zip
# Copy the stop word list and the IK config into the plugin's config directory
COPY cn_stopwords.dic /usr/share/elasticsearch/config/analysis-ik/
COPY IKAnalyzer.cfg.xml /usr/share/elasticsearch/config/analysis-ik/

RUN elasticsearch-plugin install analysis-smartcn
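
The rename from cn_stopwords.txt to cn_stopwords.dic can be done by hand. As an optional sketch, the script below also tidies the list while copying it: it strips a BOM if present, drops blank lines and duplicates, and writes UTF-8, which is the encoding the ik-analysis README asks for in extension dictionaries. The function name `convert_stopwords` is made up for this example, and the cleanup steps are my own precaution rather than something the plugin documents as required.

```python
from pathlib import Path


def convert_stopwords(src: Path, dst: Path) -> int:
    """Write the unique, non-empty lines of `src` to `dst` as UTF-8; return the word count."""
    seen = set()
    words = []
    # "utf-8-sig" transparently strips a leading BOM if the file has one
    for line in src.read_text(encoding="utf-8-sig").splitlines():
        word = line.strip()
        if word and word not in seen:
            seen.add(word)
            words.append(word)
    dst.write_text("\n".join(words) + "\n", encoding="utf-8")
    return len(words)


# Usage, run next to the Dockerfile:
#   convert_stopwords(Path("cn_stopwords.txt"), Path("cn_stopwords.dic"))
```

The first occurrence of each word keeps its original position, so the output stays easy to compare against the source list.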

With that in place, rebuild the Docker image, restart Elasticsearch and Kibana, and try the ik_smart morphological analysis again.

GET /_analyze
{
  "analyzer": "ik_smart",
  "text": ["上天不仅给了她美貌，还给了她智慧。"]
}

Result:

{
  "tokens" : [
    {
      "token" : "上天", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0
    },
    {
      "token" : "美貌", "start_offset" : 7, "end_offset" : 9, "type" : "CN_WORD", "position" : 1
    },
    {
      "token" : "还给", "start_offset" : 10, "end_offset" : 12, "type" : "CN_WORD", "position" : 2
    },
    {
      "token" : "智慧", "start_offset" : 14, "end_offset" : 16, "type" : "CN_WORD", "position" : 3
    }
  ]
}