I have published a package on PyPI that removes stopwords from morphologically analyzed documents, where each document is represented as a list of words.
The stopwords for each part of speech were selected with reference to Wikipedia, the ミエルカAI article "日本語ストップワードの考察【品詞別】" ("A Study of Japanese Stopwords, by Part of Speech"), and SlothLib (about 30 to 50 words per part of speech).
It can be installed with pip. The only library it depends on is scikit-learn.
pip install ja_stopword_remover
After that, just import as usual.
from ja_stopword_remover.remover import StopwordRemover
The key point is the input format: pass a list of sentences, each of which is itself a list of morphologically analyzed words, and you get back the same structure with the stopwords removed.
Create an instance of the StopwordRemover class and call its remove() method with that list of word lists as the argument; the resulting list is returned.
from ja_stopword_remover.remover import StopwordRemover
import pprint

# Poems by 多田なの (@ohta_nano).
text_list = [
    ["僕", "たち", "は", "プラネタリウム", "に", "立て籠もり", "夜明け", "の", "シーン", "だけ", "繰り返す"],
    ["桜", "って", "「", "さくら", "」", "って", "読む", "って", "あなた", "から", "教えて", "もらう", "人", "に", "なりたい"],
]

stopwordRemover = StopwordRemover()
text_list_result = stopwordRemover.remove(text_list)
pprint.pprint(text_list_result)
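To make the input and output shape concrete without installing anything, here is a minimal pure-Python sketch of the same idea. The `SAMPLE_STOPWORDS` set and the `remove_stopwords` function are hypothetical stand-ins for illustration; they are not the package's word lists or its implementation.

```python
# Hypothetical stand-in for the package's word lists; NOT the real lists.
SAMPLE_STOPWORDS = {"は", "に", "の", "だけ", "って", "から", "「", "」"}


def remove_stopwords(documents, stopwords=SAMPLE_STOPWORDS):
    """Drop stopwords from each tokenized document, keeping document boundaries."""
    return [[token for token in doc if token not in stopwords] for doc in documents]


docs = [
    ["僕", "たち", "は", "プラネタリウム", "に", "立て籠もり"],
    ["桜", "って", "読む", "って", "あなた", "から"],
]
print(remove_stopwords(docs))
```

The outer list (one entry per document) is preserved; only the words inside each inner list are filtered.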
Specify the part of speech to be deleted
If you want to choose which parts of speech are removed, pass the corresponding arguments to choose_parts().
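|argument|part of speech|
|---|---|
|demonstrative|demonstratives|
|symbol|symbols|
|verb|verbs|
|one_character|one-character words|
|postpositional_particle|postpositional particles|
|slothlib|words from the SlothLib list (all parts of speech)|
|auxiliary_verb|auxiliary verbs|
|adjective|adjectives|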
stopwordRemover.choose_parts(
    demonstrative=False,
    symbol=False,
    verb=False,
    one_character=False,
    postpositional_particle=False,
    slothlib=True,
    auxiliary_verb=False,
    adjective=False,
)
If choose_parts() is not called, stopwords of every part of speech are removed by default.
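The toggling behavior can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: `SketchRemover`, the tiny `SAMPLE_LISTS`, and the choice that flags not mentioned in a call keep their previous value. The real class may behave differently.

```python
# Hypothetical, tiny per-category word lists (the real package ships its own).
SAMPLE_LISTS = {
    "postpositional_particle": {"は", "に", "の"},
    "symbol": {"「", "」"},
    "demonstrative": {"これ", "それ"},
}


class SketchRemover:
    def __init__(self):
        # Every category starts enabled, so all listed words are removed by default.
        self.enabled = {name: True for name in SAMPLE_LISTS}

    def choose_parts(self, **flags):
        # Assumption: categories not named in the call keep their current setting.
        for name, flag in flags.items():
            if name not in self.enabled:
                raise ValueError(f"unknown category: {name}")
            self.enabled[name] = bool(flag)

    def remove(self, documents):
        # Merge the word lists of all enabled categories into one lookup set.
        active = set().union(*(SAMPLE_LISTS[n] for n, on in self.enabled.items() if on))
        return [[t for t in doc if t not in active] for doc in documents]


remover = SketchRemover()
docs = [["これ", "は", "「", "桜", "」"]]
default_result = remover.remove(docs)

remover.choose_parts(symbol=False)  # keep symbols, still drop the rest
symbols_kept = remover.remove(docs)
print(default_result, symbols_kept)
```

Merging the enabled lists into a single set keeps each lookup cheap, regardless of how many categories are switched on.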
If you want to use it in a scikit-learn pipeline, use the SKStopwordRemover class.
This class is also simple to use: just register an instance as one of the pipeline's steps.
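from sklearn.pipeline import Pipeline
# Assumed import path: SKStopwordRemover is taken to live alongside StopwordRemover.
from ja_stopword_remover.remover import SKStopwordRemover

skStopwordRemover = SKStopwordRemover()
step = [("StopwordRemover", skStopwordRemover)]
pipe = Pipeline(steps=step)
pipe.fit(text_list)
text_list_result = pipe.transform(text_list)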
If you want to choose parts of speech, pass False for each part of speech you do not want removed when creating the instance, for example SKStopwordRemover(one_character=False).
Every argument defaults to True here except slothlib, which defaults to False because the SlothLib list contains words of every part of speech (sorry for the complication).
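For readers unfamiliar with what a pipeline step has to provide, the following is a rough, dependency-free sketch of the fit/transform protocol with a constructor flag. `SketchSKRemover` and `run_steps` are hypothetical: `run_steps` only imitates how a pipeline chains steps, and treating `one_character` as "drop every single-character token" is a simplification that may not match the package. A real scikit-learn transformer would conventionally also inherit from `BaseEstimator` and `TransformerMixin`.

```python
class SketchSKRemover:
    """Duck-typed transformer: fit() learns nothing, transform() filters tokens."""

    def __init__(self, one_character=True):
        # Assumed semantics: True means single-character tokens are removed.
        self.one_character = one_character

    def fit(self, X, y=None):
        # Stopword removal is stateless, so fitting just returns the instance.
        return self

    def transform(self, X):
        if not self.one_character:
            return [list(doc) for doc in X]
        return [[token for token in doc if len(token) > 1] for doc in X]


def run_steps(steps, X):
    """Hypothetical stand-in for a pipeline: fit and apply each step in order."""
    for _name, step in steps:
        X = step.fit(X).transform(X)
    return X


docs = [["桜", "を", "読む"]]
filtered = run_steps([("StopwordRemover", SketchSKRemover())], docs)
untouched = run_steps([("StopwordRemover", SketchSKRemover(one_character=False))], docs)
print(filtered, untouched)
```

Configuring the flags in the constructor rather than through a separate method call is what lets the object be dropped into a list of steps as-is.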