I have registered a package on PyPI that I would like to share with you: it removes stopwords from documents that have been morphologically analyzed into lists of words.
The words to be removed for each part of speech (roughly 30 to 50 per part of speech) were selected by referring to Wikipedia, ミエルカAI-日本語ストップワードの考察【品詞別】, and slothlib.
Installation
It can be installed with pip. The only library it depends on is scikit-learn.
pip install ja_stopword_remover
After that, just import as usual.
from ja_stopword_remover.remover import StopwordRemover
Sample Code
The point to note is that if you pass a list of sentences, each represented as a list of morphologically analyzed words, it removes the stopwords and returns the result.
Create an instance of the StopwordRemover class and call its remove() method with the list of word lists as the argument, and the filtered list is returned.
from ja_stopword_remover.remover import StopwordRemover
import pprint
# A poem by 多田なの (@ohta_nano).
text_list = [
    ["僕", "たち", "は", "プラネタリウム", "に", "立て籠もり", "夜明け", "の", "シーン", "だけ", "繰り返す"],
    ["桜", "って", "「", "さくら", "」", "って", "読む", "って", "あなた", "から", "教えて", "もらう", "人", "に", "なりたい"],
]
stopwordRemover = StopwordRemover()
text_list_result = stopwordRemover.remove(text_list)
pprint.pprint(text_list_result)
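As a quick, hedged check (assuming only what the sample above shows, namely that remove() returns one filtered word list per input sentence), you can verify the shape of the result:
# The output is assumed to mirror the input structure: one word list per sentence,
# with stopwords dropped, so each filtered list is no longer than the original.
assert len(text_list_result) == len(text_list)
for original, filtered in zip(text_list, text_list_result):
    assert len(filtered) <= len(original)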
Specifying the parts of speech to be removed
If you want to choose which parts of speech are removed, pass them as arguments to choose_parts().
| argument | part of speech |
|---|---|
| demonstrative | demonstrative words (指示語) |
| pronoun | ko-so-a-do words (こそあど言葉) |
| symbol | symbols (記号) |
| verb | verbs (動詞) |
| one_character | single characters (一字) |
| postpositional_particle | particles (助詞) |
| adjective | adjectives (形容詞) |
| auxiliary_verb | auxiliary verbs (助動詞) |
| slothlib | words included in slothlib (slothlib収録語) |
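# Keep only the slothlib-based word list; turn every other part-of-speech filter off.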
stopwordRemover.choose_parts(
demonstrative=False,
symbol=False,
verb=False,
one_character=False,
postpositional_particle=False,
slothlib=True,
auxiliary_verb=False,
adjective=False
)
If choose_parts() is not called, words from all parts of speech are removed by default.
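Putting the pieces together, a minimal sketch that reuses text_list and the choose_parts() configuration shown above (assuming remove() can be called on the same instance after choose_parts()):
stopwordRemover = StopwordRemover()
# Remove only the words in the slothlib list, as configured above.
stopwordRemover.choose_parts(
    demonstrative=False,
    symbol=False,
    verb=False,
    one_character=False,
    postpositional_particle=False,
    slothlib=True,
    auxiliary_verb=False,
    adjective=False,
)
text_list_result = stopwordRemover.remove(text_list)
pprint.pprint(text_list_result)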
If you want to use it in a scikit-learn Pipeline, use the SKStopwordRemover class.
Using this class is also simple: just register an instance in the steps list.
from sklearn.pipeline import Pipeline
# SKStopwordRemover is assumed here to be importable from the same module as StopwordRemover.
from ja_stopword_remover.remover import SKStopwordRemover

skStopwordRemover = SKStopwordRemover()
step = [("StopwordRemover", skStopwordRemover)]
pipe = Pipeline(steps=step)
pipe.fit(text_list)
text_list_result = pipe.transform(text_list)
If you want to specify parts of speech here, set each part of speech you do not want removed to False as a keyword argument when creating the instance, e.g. SKStopwordRemover(one_character=False).
These arguments default to True, except for slothlib, which defaults to False because the slothlib list spans all parts of speech (sorry for the inconsistency).
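For illustration, a hedged sketch that combines both points (the keyword names come from the table above; whether the constructor accepts every one of them is an assumption on my part):
# Keep single characters and symbols; remove the other default categories
# (illustrative choice; keyword names follow the table above).
skStopwordRemover = SKStopwordRemover(one_character=False, symbol=False)
pipe = Pipeline(steps=[("StopwordRemover", skStopwordRemover)])
pipe.fit(text_list)
text_list_result = pipe.transform(text_list)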
GitHub: https://github.com/Pickerdot/ja_stopword_remover
PyPI: https://pypi.org/project/ja-stopword-remover/