How to encode Japanese text for MeCab in Python 3

Asked 2 months ago, Updated 2 months ago, 3 views

Environment: Python 3.5.2, macOS Sierra

Below is code that reads Excel data, builds a string, and uses MeCab to list the nouns and how many times each one appears.

import collections
import pandas as pd
import MeCab
import sys

df = pd.read_excel("filename.xls", sheetname=0)
df = df.dropna()

m = MeCab.Tagger()

# Collect the surface form of every token tagged as a noun.
noun_list = []
for i in df:
    for l in m.parse(i).splitlines():
        # '名詞' (noun) is the part-of-speech tag MeCab emits with a Japanese dictionary.
        if l != 'EOS' and l.split('\t')[1].split(',')[0] == '名詞':
            noun_list.append(l.split('\t')[0])

noun_cnt = collections.Counter(noun_list)

noun = pd.DataFrame(list(noun_cnt.items()), columns=['noun', 'number of appearances'])
noun = noun.sort_values('number of appearances', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
noun = noun[noun['number of appearances'] > 10]  # Only those that appear more than 10 times

print(noun.tail())

Output

           noun  number of appearances
51        Worry                     18
199         Yes                     23
171    Security                     31
156     Anxiety                     40
154  Convenient                     81

The code worked on my own file, but when I ran it on different Japanese data I got the following error, so I suspect an encoding problem.

NotImplementedError: Wrong number or type of arguments for overloaded function 'Tagger_parse'.

I think I need to add UTF-8 encode and decode calls somewhere. Could someone tell me how?

python mecab

2022-09-30 14:07

1 Answer

This issue has nothing to do with pandas. It appears to be caused by the value of i passed to m.parse(i) in MeCab having a different format in the case that works and in the case that fails. It may be a character encoding issue, but it may not be.

To find out what i is, inspect it on the line just before the call with print(i), print(repr(i)), or print(type(i)). Once you have confirmed what it is, try to isolate the problem by checking whether you can reproduce it without pandas.

As a guess, in the failing case i may be an array or some other data type rather than a string.
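As a rough illustration of that check, the sketch below prints the type and repr of each value in a loop. The helper name describe and the sample values are made up for this example; in the real script the same line would go just before m.parse(i).

```python
def describe(value):
    """Return a one-line summary of a value's type and repr for debugging."""
    return "{}: {!r}".format(type(value).__name__, value)


# Hypothetical sample values of the kind a spreadsheet cell can hold.
samples = ["不安です", 42, float("nan"), ["a", "b"]]

for i in samples:
    # Anything that is not reported as 'str' here is a candidate for the error.
    print(describe(i))
```

If every value is reported as str, the data type is probably not the cause and the encoding is worth a closer look.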
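If that guess turns out to be right, one possible workaround is to pass only str values to the parser and set the rest aside for inspection. This is a minimal sketch under that assumption, not a confirmed fix: parse_strings_only and fake_parse are hypothetical names, and fake_parse only stands in for m.parse so the example runs without MeCab installed.

```python
def parse_strings_only(values, parse):
    """Call parse(value) for each str in values; collect everything else as skipped."""
    results, skipped = [], []
    for value in values:
        if isinstance(value, str):
            results.append(parse(value))
        else:
            skipped.append(value)
    return results, skipped


def fake_parse(text):
    """Stand-in for MeCab's m.parse (assumption: the real one accepts only str)."""
    return text.upper()


results, skipped = parse_strings_only(["abc", 3.5, None, "de"], fake_parse)
print(results)  # ['ABC', 'DE']
print(skipped)  # [3.5, None]
```

With the real tagger you would pass m.parse instead of fake_parse. Note also that iterating over a DataFrame directly yields its column labels, so values would normally be the cells of one column rather than df itself.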


2022-09-30 14:07


© 2022 OneMinuteCode. All rights reserved.