I have published a package on PyPI that removes stopwords from a list of documents, where each document is a list of words produced by morphological analysis.

The stopwords for each part of speech were selected with reference to Wikipedia, the article "ミエルカAI-日本語ストップワードの考察【品詞別】" ("A Study of Japanese Stopwords, by Part of Speech"), and SlothLib (roughly 30 to 50 words per part of speech).

# Installation

It can be installed with pip. The only library it depends on is scikit-learn.

pip install ja_stopword_remover


After that, just import as usual.

from ja_stopword_remover.remover import StopwordRemover


# Sample Code

The key point is this: if you pass in a list of sentences, each represented as a list of morphologically analyzed words, the stopwords are removed and the resulting list is returned.

Create an instance of the StopwordRemover class and call its remove() method with the list of word lists as the argument; the filtered list is returned.


from ja_stopword_remover.remover import StopwordRemover
import pprint

# Poems by 多田なの (@ohta_nano).
text_list = [
    ["僕", "たち", "は", "プラネタリウム", "に", "立て籠もり", "夜明け", "の", "シーン", "だけ", "繰り返す"],
    ["桜", "って", "「", "さくら", "」", "って", "読む", "って", "あなた", "から", "教えて", "もらう", "人", "に", "なりたい"],
]

stopwordRemover = StopwordRemover()

text_list_result = stopwordRemover.remove(text_list)
pprint.pprint(text_list_result)

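Conceptually, this kind of call amounts to filtering every tokenized sentence against a set of stopwords. The following is a minimal sketch of that idea, not the package's actual implementation; the function name `remove_stopwords` and the tiny `STOPWORDS` set are assumptions made purely for illustration.

```python
from typing import AbstractSet, Iterable, List

# Hypothetical, tiny stopword set for illustration only;
# it is NOT the list that ja_stopword_remover ships.
STOPWORDS: AbstractSet[str] = frozenset(
    {"は", "に", "の", "だけ", "って", "から", "「", "」"}
)


def remove_stopwords(
    text_list: Iterable[Iterable[str]],
    stopwords: AbstractSet[str] = STOPWORDS,
) -> List[List[str]]:
    """Return a new list of token lists with every stopword dropped."""
    return [
        [word for word in words if word not in stopwords]
        for words in text_list
    ]


if __name__ == "__main__":
    sample = [["僕", "たち", "は", "プラネタリウム", "に", "立て籠もり"]]
    print(remove_stopwords(sample))
```

The input is left untouched and a new nested list is returned, so the original tokenization stays available for comparison.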

# Specify the part of speech to be deleted

To choose which parts of speech are removed, pass the corresponding arguments to choose_parts().

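| Argument | Part of speech |
| --- | --- |
| demonstrative | demonstratives (指示語) |
| pronoun | ko-so-a-do words (こそあど言葉) |
| symbol | symbols (記号) |
| verb | verbs (動詞) |
| one_character | single characters (一字) |
| postpositional_particle | postpositional particles (助詞) |
| auxiliary_verb | auxiliary verbs (助動詞) |
| slothlib | words included in SlothLib (slothlib収録語) |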

stopwordRemover.choose_parts(
    demonstrative=False,
    symbol=False,
    verb=False,
    one_character=False,
    postpositional_particle=False,
    slothlib=True,
    auxiliary_verb=False,
)


If choose_parts() is not called, words of all parts of speech are removed by default.
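One way to picture these per-category switches is a set of words per category plus an on/off flag for each. The sketch below is only an illustration under that assumption, not the package's code: the class `ToggleRemover`, its word sets, and its internals are hypothetical, and only the keyword names are borrowed from the arguments listed above.

```python
from typing import Dict, FrozenSet, List

# Hypothetical word sets per category, for illustration only.
CATEGORY_WORDS: Dict[str, FrozenSet[str]] = {
    "symbol": frozenset({"「", "」", "、", "。"}),
    "postpositional_particle": frozenset({"は", "に", "の", "から"}),
}


class ToggleRemover:
    """Hypothetical remover whose categories can be switched on and off."""

    def __init__(self) -> None:
        # Every category is enabled until choose_parts() says otherwise.
        self.enabled: Dict[str, bool] = {name: True for name in CATEGORY_WORDS}

    def choose_parts(self, **flags: bool) -> None:
        """Enable or disable categories by keyword, e.g. symbol=False."""
        for name, flag in flags.items():
            if name not in self.enabled:
                raise ValueError(f"unknown category: {name}")
            self.enabled[name] = flag

    def remove(self, text_list: List[List[str]]) -> List[List[str]]:
        """Drop every word that belongs to a currently enabled category."""
        active = frozenset().union(
            *(words for name, words in CATEGORY_WORDS.items() if self.enabled[name])
        )
        return [[w for w in words if w not in active] for words in text_list]
```

Rejecting unknown keywords, as this sketch does, is a design choice that surfaces a misspelled category name immediately instead of silently ignoring it.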

If you want to use it in a scikit-learn pipeline, use the SKStopwordRemover class.

This class is also simple to use: just register an instance as a pipeline step.


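from sklearn.pipeline import Pipeline
# Assumption: SKStopwordRemover is importable from the same module as StopwordRemover.
from ja_stopword_remover.remover import SKStopwordRemover

skStopwordRemover = SKStopwordRemover()

# Register the remover as the only step of the pipeline.
step = [("StopwordRemover", skStopwordRemover)]
pipe = Pipeline(steps=step)

pipe.fit(text_list)
text_list_result = pipe.transform(text_list)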


To choose the parts of speech here, pass False for each part of speech you do not want removed when creating the instance, for example SKStopwordRemover(one_character=False).

Every argument defaults to True except slothlib, which defaults to False because SlothLib contains words from all parts of speech (sorry for the complication).
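To make the pipeline integration less opaque: intermediate steps of a scikit-learn pipeline are expected to implement `fit` and `transform`, and they are configured through constructor arguments. The sketch below shows that shape without importing scikit-learn. `SketchStopwordTransformer` and its word sets are hypothetical and are not the package's `SKStopwordRemover`; a real transformer would typically also subclass `sklearn.base.BaseEstimator` and `sklearn.base.TransformerMixin`.

```python
from typing import FrozenSet, List, Optional, Set

# Hypothetical word sets, for illustration only.
SYMBOLS: FrozenSet[str] = frozenset({"「", "」"})
PARTICLES: FrozenSet[str] = frozenset({"は", "に", "の"})


class SketchStopwordTransformer:
    """Hypothetical transformer exposing the fit/transform protocol."""

    def __init__(self, symbol: bool = True, postpositional_particle: bool = True):
        # Constructor arguments are stored unchanged under the same names.
        self.symbol = symbol
        self.postpositional_particle = postpositional_particle

    def fit(self, X: List[List[str]], y: Optional[list] = None):
        # Nothing is learned from the data, so fit simply returns self.
        return self

    def transform(self, X: List[List[str]]) -> List[List[str]]:
        # Build the active stopword set from the enabled categories.
        stopwords: Set[str] = set()
        if self.symbol:
            stopwords |= SYMBOLS
        if self.postpositional_particle:
            stopwords |= PARTICLES
        return [[w for w in words if w not in stopwords] for words in X]


if __name__ == "__main__":
    docs = [["桜", "は", "「", "さくら", "」"]]
    print(SketchStopwordTransformer(symbol=False).fit(docs).transform(docs))
```

Because removal here is stateless, `fit` has nothing to compute; it exists only so the object can take part in a pipeline alongside steps that do learn from data.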