I published a package on PyPI that removes stopwords from a list of documents that have already been morphologically analyzed, where each document is represented as a list of words.

The words to delete for each part of speech (30 to 50 per part of speech) were selected with reference to Wikipedia, the ミエルカAI article 日本語ストップワードの考察【品詞別】 (a part-of-speech-by-part-of-speech look at Japanese stopwords), and slothlib.

Installation

It can be installed with pip. The only library it depends on is scikit-learn.

pip install ja_stopword_remover

After that, just import as usual.

from ja_stopword_remover.remover import StopwordRemover

 

 

Sample Code

The key point is this: if you pass in a list of sentences, each already morphologically analyzed into a list of words, the list is returned with the stopwords removed.

Create an instance of the StopwordRemover class and call its remove() method with the list of word lists as the argument; the resulting list is returned.


from ja_stopword_remover.remover import StopwordRemover
import pprint

# Poems by 多田なの (@ohta_nano).
text_list = [[ "僕", "たち", "は", "プラネタリウム", "に", "立て籠もり", "夜明け", "の", "シーン", "だけ", "繰り返す",],
    [ "桜", "って", "「", "さくら", "」", "って", "読む", "って", "あなた", "から", "教えて", "もらう", "人", "に", "なりたい",],]

stopwordRemover = StopwordRemover()

text_list_result = stopwordRemover.remove(text_list)
pprint.pprint(text_list_result)
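
To make the input and output shapes concrete, here is a minimal pure-Python sketch of the same idea: each document is a list of tokens, and tokens found in a stopword set are dropped. The function name remove_stopwords and the tiny EXAMPLE_STOPWORDS set are hypothetical and only for illustration; they are not the package's implementation or its actual word list.

```python
from typing import Iterable, List, Set

# Illustrative stopwords only; NOT the package's actual word list.
EXAMPLE_STOPWORDS: Set[str] = {"は", "に", "の", "だけ", "って", "から", "「", "」"}


def remove_stopwords(docs: Iterable[List[str]], stopwords: Set[str]) -> List[List[str]]:
    """Return a new list of documents with every stopword token dropped."""
    return [[token for token in doc if token not in stopwords] for doc in docs]


docs = [["僕", "たち", "は", "プラネタリウム", "に", "立て籠もり"]]
filtered = remove_stopwords(docs, EXAMPLE_STOPWORDS)
print(filtered)  # prints [['僕', 'たち', 'プラネタリウム', '立て籠もり']]
```

The sketch builds new lists rather than editing the input in place, so the original tokenized documents stay available for other processing.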

Specify the part of speech to be deleted

If you want to choose which parts of speech are removed, specify them in the arguments of choose_parts().

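argument                  part of speech
demonstrative             demonstratives (指示語)
pronoun                   ko-so-a-do words (こそあど言葉)
symbol                    symbols (記号)
verb                      verbs (動詞)
one_character             single characters (一字)
postpositional_particle   postpositional particles (助詞)
adjective                 adjectives (形容詞)
auxiliary_verb            auxiliary verbs (助動詞)
slothlib                  words included in slothlib (slothlib収録語)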

    stopwordRemover.choose_parts(
        demonstrative=False,
        symbol=False,
        verb=False,
        one_character=False,
        postpositional_particle=False,
        slothlib=True,
        auxiliary_verb=False,
        adjective=False
    )

If choose_parts() is not called, words of every part of speech are removed by default.
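
As a rough picture of how per-category flags like these can work, the sketch below unions only the word lists whose flag is True. Everything in it is an assumption for illustration: build_stopword_set and EXAMPLE_WORDS are hypothetical names, the word lists are tiny placeholders, and the package's internals may be organized differently.

```python
from typing import Dict, Set

# Placeholder word lists keyed by a few of the argument names above.
# The real package uses 30 to 50 words per category; these are NOT its lists.
EXAMPLE_WORDS: Dict[str, Set[str]] = {
    "symbol": {"「", "」", "、", "。"},
    "postpositional_particle": {"は", "に", "の", "から"},
    "auxiliary_verb": {"です", "ます"},
}


def build_stopword_set(**flags: bool) -> Set[str]:
    """Union the word lists whose flag is True; unspecified flags count as True."""
    unknown = set(flags) - set(EXAMPLE_WORDS)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    selected: Set[str] = set()
    for category, words in EXAMPLE_WORDS.items():
        if flags.get(category, True):
            selected |= words
    return selected


# Keep symbols and auxiliary verbs; remove only particles.
particles_only = build_stopword_set(symbol=False, auxiliary_verb=False)
tokens = ["桜", "って", "「", "さくら", "」", "から"]
print([t for t in tokens if t not in particles_only])
```

With these placeholder lists, only "から" is dropped from the example tokens, because the brackets belong to the disabled symbol category and "って" is in none of the lists.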

If you want to use it in a scikit-learn pipeline, use the SKStopwordRemover class.

This class is also simple to use: just register an instance as a pipeline step.


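    from sklearn.pipeline import Pipeline
    # Assumption: SKStopwordRemover is importable from the same module as StopwordRemover.
    from ja_stopword_remover.remover import SKStopwordRemover

    skStopwordRemover = SKStopwordRemover()

    # Register the remover as the only step of the pipeline.
    step = [("StopwordRemover", skStopwordRemover)]

    pipe = Pipeline(steps=step)

    pipe.fit(text_list)

    text_list_result = pipe.transform(text_list)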

If you want to specify parts of speech, pass False for each part of speech you do not want removed when creating the instance, for example SKStopwordRemover(one_character=False).

Every argument here defaults to True except slothlib, which defaults to False because slothlib contains words from all parts of speech (sorry for the complication).
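
For readers wondering why a stopword remover fits into a pipeline at all, the following sketch shows the fit/transform shape that scikit-learn transformers follow. ExampleStopwordTransformer is a hypothetical stand-in, not the package's SKStopwordRemover, and it deliberately avoids importing scikit-learn; a real transformer would typically subclass sklearn.base.BaseEstimator and TransformerMixin.

```python
from typing import List, Optional, Set


class ExampleStopwordTransformer:
    """Hypothetical stand-in showing the fit/transform protocol."""

    def __init__(self, stopwords: Optional[Set[str]] = None) -> None:
        # Placeholder default list; NOT the package's actual stopwords.
        self.stopwords = set(stopwords) if stopwords is not None else {"は", "に", "の"}

    def fit(self, X: List[List[str]], y=None) -> "ExampleStopwordTransformer":
        # Nothing is learned from the data, so fit only returns self.
        return self

    def transform(self, X: List[List[str]]) -> List[List[str]]:
        # Drop every token that appears in the stopword set.
        return [[token for token in doc if token not in self.stopwords] for doc in X]


transformer = ExampleStopwordTransformer()
docs = [["夜明け", "の", "シーン"]]
result = transformer.fit(docs).transform(docs)
print(result)  # prints [['夜明け', 'シーン']]
```

Because stopword removal is stateless, fit has nothing to compute; the method exists so the object can be called the same way as any other pipeline step.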

GitHub: https://github.com/Pickerdot/ja_stopword_remover

PyPI: https://pypi.org/project/ja-stopword-remover/