uhfx/20210205-mtg.md

## 20210205-mtg.md

      
Reference: https://lunarwoffie.com/ja/lda-topic-model/
Reference: http://dslab.work/2019/10/30/post-237/

  
## run-test2.txt
(bachelor) user@MacBook-Pro bachelor % python3 test2.py VPN
コサイン類似度: 0.5854334894924799
アブストラクト: 'OCUNET3 VPN に接続できない。'
(bachelor) user@MacBook-Pro bachelor % python3 test2.py Windows10
コサイン類似度: 0.5504350833525765
アブストラクト: 'Windows10 Homeを使用しているが，リモートデスクトップを導入ためにはどうしたらよいか'
(bachelor) user@MacBook-Pro bachelor % python3 test2.py Windows
コサイン類似度: 0.6819498420709244
アブストラクト: 'Windows 10にアップデートしたい'
(bachelor) user@MacBook-Pro bachelor % python3 test2.py Office
コサイン類似度: 0.565107264115037
アブストラクト: 'Microsoft Office をインストール手順を知りたい'
(bachelor) user@MacBook-Pro bachelor % python3 test2.py 仮想ネットワーク
NotFound
(bachelor) user@MacBook-Pro bachelor % python3 test2.py ネットワーク
コサイン類似度: 0.7071067811865476
アブストラクト: 'ネットワーク プリンターを追加したい'
(bachelor) user@MacBook-Pro bachelor % python3 test2.py 仮想
NotFound
(bachelor) user@MacBook-Pro bachelor % python3 test2.py 印刷
NotFound

## run-test3.txt
(bachelor) user@MacBook-Pro bachelor % python3 test3.py Office
File already exists.
コサイン類似度: 0.8358414625441092, 配列番号: 243
類似度最大単語: 'office インストール office office 2013 office アンインストール 必要 ある'
配列個数:295
類似度最大質問文章: 'Officeをインストール際，Office 2010やOffice 2013 ，Office2016などはアンインストール必要はあるか'
(bachelor) user@MacBook-Pro bachelor % python3 test3.py メール
File already exists.
コサイン類似度: 0.7193007484841745, 配列番号: 213
類似度最大単語: 'メール thunderbird メール 設定 する'
配列個数:295
類似度最大質問文章: 'OCU メールをThunderbirdにメール設定したい'
(bachelor) user@MacBook-Pro bachelor % python3 test3.py リモートデスクトップ
File already exists.
コサイン類似度: 0.6981527750221076, 配列番号: 198
類似度最大単語: 'リモートデスクトップ 接続 できる'
配列個数:295
類似度最大質問文章: 'リモートデスクトップに接続できない'
(bachelor) user@MacBook-Pro bachelor % python3 test3.py 仮想
File already exists.
コサイン類似度: 0.6099451551510185, 配列番号: 220
類似度最大単語: '仮想 ネットワーク ログイン する'
配列個数:295
類似度最大質問文章: '別の仮想ネットワークにログインしたい'
(bachelor) user@MacBook-Pro bachelor % python3 test3.py VPN
File already exists.
コサイン類似度: 0.7156917108344479, 配列番号: 52
類似度最大単語: 'vpn 接続 できる'
配列個数:295
類似度最大質問文章: 'OCUNET3 VPN に接続できない。'
(bachelor) user@MacBook-Pro bachelor % python3 test3.py 仮想ネットワーク
File already exists.
MAX NotFound
(bachelor) user@MacBook-Pro bachelor % python3 test3.py リモートデスクトップに接続出来ない # Entering an ordinary sentence as the query does not work yet
File already exists.
MAX NotFound
(bachelor) user@MacBook-Pro bachelor % MS Office # Words separated by a space are treated as separate arguments, which breaks the command
zsh: command not found: MS
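
The last command fails for two reasons: the `python3 test3.py` prefix is missing, so zsh tries to run `MS` as a command, and even with the prefix the shell would split `MS Office` into two arguments. Quoting the query (`python3 test3.py "MS Office"`) passes it as a single argument. As an alternative, the sketch below shows how the parser could accept several words and join them. The helper names `build_parser` and `parse_query` are hypothetical and do not appear in the scripts.

```python
import argparse


def build_parser():
    # Hypothetical variant of the scripts' parser: accept one or more words.
    parser = argparse.ArgumentParser()
    parser.add_argument("query", type=str, nargs="+")
    return parser


def parse_query(argv):
    # Join the words back into a single space-separated query string.
    args = build_parser().parse_args(argv)
    return " ".join(args.query)


if __name__ == "__main__":
    print(parse_query(["MS", "Office"]))
```

With `nargs="+"` the unquoted form `python3 test3.py MS Office` would be read as one query, at the cost of no longer being able to add further positional arguments after it.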

## run-test4.txt
(bachelor) user@MacBook-Pro bachelor % python3 test4.py VPN
ストップワードの読み込み完了
配列個数:295, コサイン類似度: 0.7156917108344479, 配列番号: 52, 類似度最大単語: 'vpn 接続 できる'
類似度最大質問文章: 'OCUNET3 VPN に接続できない。'
類似度最大回答文章: '「OCUNET3 利用者ガイド」のとおりに設定したら問題なく接続できた。'
(bachelor) user@MacBook-Pro bachelor % python3 test4.py プリンター
ストップワードの読み込み完了
配列個数:295, コサイン類似度: 0.6601325769215738, 配列番号: 267, 類似度最大単語: 'プリンター 印刷 できる'
類似度最大質問文章: 'プリンターで印刷できない'
類似度最大回答文章: 'モバイルルータを取り外し⇒プリンタ接続可になりました'
(bachelor) user@MacBook-Pro bachelor % python3 test4.py ネットワーク
ストップワードの読み込み完了
配列個数:295, コサイン類似度: 0.5232035465173208, 配列番号: 51, 類似度最大単語: 'ネットワーク 繋がる'
類似度最大質問文章: 'ネットワークに繋がらない。'
類似度最大回答文章: '室内にコミュニティWi-Fiのルータがあったので、ocunet3に接続すると解決した。'
(bachelor) user@MacBook-Pro bachelor % python3 test4.py Wi-Fi
ストップワードの読み込み完了
配列個数:295, コサイン類似度: 0.7865715545290431, 配列番号: 135, 類似度最大単語: 'コミュニティ wi-fi 自室 wi-fi 設置 する'
類似度最大質問文章: 'コミュニティWi-Fiではなく自室にWi-Fiを設置したい'
類似度最大回答文章: '無線 LAN ルーターを設定いただくに以下の点にご注意下さい。順守項目は次の 3 点になります。
1. 動作モード　→　ブリッジモードにする。ルーターモードにはしない。
2. 暗号化方式　→　WPA2（AES）
3. WAN（インターネット）側のポートは使用しない
'
(bachelor) user@MacBook-Pro bachelor % python3 test4.py Office
ストップワードの読み込み完了
配列個数:295, コサイン類似度: 0.8358414625441092, 配列番号: 243, 類似度最大単語: 'office インストール office office 2013 office アンインストール 必要 ある'
類似度最大質問文章: 'Officeをインストール際，Office 2010やOffice 2013 ，Office2016などはアンインストール必要はあるか'
類似度最大回答文章: '上記ソフトとOffice365は併用する事ができません。アンインストールを実行後、Office365のインストールを行ってください。'
(bachelor) user@MacBook-Pro bachelor % python3 test4.py office
ストップワードの読み込み完了
配列個数:295, コサイン類似度: 0.8358414625441092, 配列番号: 243, 類似度最大単語: 'office インストール office office 2013 office アンインストール 必要 ある'
類似度最大質問文章: 'Officeをインストール際，Office 2010やOffice 2013 ，Office2016などはアンインストール必要はあるか'
類似度最大回答文章: '上記ソフトとOffice365は併用する事ができません。アンインストールを実行後、Office365のインストールを行ってください。'
(bachelor) user@MacBook-Pro bachelor % python3 test4.py Windows
ストップワードの読み込み完了
配列個数:295, コサイン類似度: 0.5412622697296391, 配列番号: 1, 類似度最大単語: 'windows 10 アップデート する'
類似度最大質問文章: 'Windows 10にアップデートしたい'
類似度最大回答文章: 'Windows10は以下URLよりダウンロード可能です。 https://osaka-cu.onthehub.com/ [プロダクトキー＆ソフトウェア入手マニュアル] https://intra.cii.osaka-cu.ac.jp/wp-content/uploads/2019/12/student-win1020191227.pdf'
(bachelor) user@MacBook-Pro bachelor % python3 test4.py zoom
ストップワードの読み込み完了
配列個数:295, コサイン類似度: 0.6902546627574335, 配列番号: 27, 類似度最大単語: 'zoom ログイン できる'
類似度最大質問文章: 'Zoomにログインできない'
類似度最大回答文章: 'まず確認として以下URLへアクセスいただけますでしょうか。 https://intra.cii.osaka-cu.ac.jp/zoom/ 上記URLの「ライセンス付与対象者」の項目にご自身の「Office365サインイン情報」が表示されていますでしょうか。されていない場合ライセンス付与対象者ではございません。表示されている場合、今まで市立大学でご使用されたことがなければ、 http://ocu.jp/zoom にアクセスいただきZoomのアカウントの申請を行ってください。アカウントの申請方法、その後のサインイン、初期設定、アプリのインストールについては下記マニュアルをご確認ください。
◆教職員向けマニュアル
https://intra.cii.osaka-cu.ac.jp/wp-content/uploads/2020/08/Zoom%E3%83%9E%E3%83%8B%E3%83%A5%E3%82%A2%E3%83%AB_%E6%95%99%E8%81%B7%E5%93%A1%E5%90%91%E3%81%91_%E5%B8%82%E5%A4%A7%E7%89%88_20200827%E4%BF%AE%E6%AD%A3.pdf'
(bachelor) user@MacBook-Pro bachelor % python3 test4.py Unipa
ストップワードの読み込み完了
MAX NotFound

## run-test6.txt
(bachelor) user@MacBook-Pro bachelor % python3 test6.py office
ストップワードの読み込み完了
該当する質問番号: [243  42  61 176 103 205 279 113 101 217]
質問データ #243, コサイン類似度: 0.8358414625441092, 'Officeをインストール際，Office 2010やOffice 2013 ，Office2016などはアンインストール必要はあるか'
質問データ #42, コサイン類似度: 0.5990930956243289, '5台以上のPCでOfficeを利用したい'
質問データ #61, コサイン類似度: 0.559547767172075, '非常勤講師でもOfficeなどは利用か'
質問データ #176, コサイン類似度: 0.5479888871275133, 'OfficeのURLを教えてほしい'
質問データ #103, コサイン類似度: 0.5156493759091675, 'Microsoft Officeのインストール方法を教えて欲しい'
質問データ #205, コサイン類似度: 0.4497323829856985, '自宅のMacにOfficeをインストールしたい'
質問データ #279, コサイン類似度: 0.44224426868245115, 'MS Office をiPad で利用か'
質問データ #113, コサイン類似度: 0.4216200519991936, '共有PCでOfficeを利用したいが可能か。'
質問データ #101, コサイン類似度: 0.40648066344768174, 'Microsoft Office をインストール手順を知りたい'
質問データ #217, コサイン類似度: 0.29494231319153624, 'iPhone/iPad など iOS/iPad OS へのOfficeインストール方法'
(bachelor) user@MacBook-Pro bachelor % python3 test6.py vpn
ストップワードの読み込み完了
該当する質問番号: [ 52 158 223 116 238 154 212  93  94  95]
質問データ #52, コサイン類似度: 0.7156917108344479, 'OCUNET3 VPN に接続できない。'
質問データ #158, コサイン類似度: 0.5697996360894004, '名誉教授だが、VPNは使用か'
質問データ #223, コサイン類似度: 0.4437022790553058, '共同研究者として、市大ネットワークにVPN接続ようにしたい'
質問データ #116, コサイン類似度: 0.43108316010363573, 'VPN接続時に証明書エラーが表示されます'
質問データ #238, コサイン類似度: 0.40510762612028006, 'iOS（iPad，iPhone）でOCUNET3 VPNの使用方法'
質問データ #154, コサイン類似度: 0.39265724219543996, 'VPN 接続時にサーバーに ssh 接続できない'
質問データ #212, コサイン類似度: 0.36764801829392085, 'VPNの同時アクセス数の上限はいくつか'
(bachelor) user@MacBook-Pro bachelor % python3 test6.py リモートデスクトップ
ストップワードの読み込み完了
該当する質問番号: [198 114 100 202 252 241  81   3  37 101]
質問データ #198, コサイン類似度: 0.6981527750221076, 'リモートデスクトップに接続できない'
質問データ #114, コサイン類似度: 0.6446371131569532, 'リモートデスクトップにログイン出来ない'
質問データ #100, コサイン類似度: 0.5539404486946976, 'リモートデスクトップにログイン出来ない場合がある'
質問データ #202, コサイン類似度: 0.5082976647685898, 'Windows10 home はリモートデスクトップはか'
質問データ #252, コサイン類似度: 0.4683672836098455, 'リモートデスクトップのユーザー名とパスワードを知りたい'
質問データ #241, コサイン類似度: 0.4401706982155947, 'リモートデスクトップで大学のPCに繋がらなくなった。'
質問データ #81, コサイン類似度: 0.423391264722909, 'リモートデスクトップ接続で大学のパソコンにログインできない'
質問データ #3, コサイン類似度: 0.37922653677022006, 'MacでWindowsPCのおリモートデスクトップを操作方法を知りたい．'
質問データ #37, コサイン類似度: 0.3425054571123299, 'Windows10 Homeを使用しているが，リモートデスクトップを導入ためにはどうしたらよいか'
(bachelor) user@MacBook-Pro bachelor % python3 test6.py 仮想ネットワーク
ストップワードの読み込み完了
NotFound
(bachelor) user@MacBook-Pro bachelor % python3 test6.py 仮想
ストップワードの読み込み完了
該当する質問番号: [220 227   9  15  80  14  70 124  60 138]
質問データ #220, コサイン類似度: 0.6099451551510185, '別の仮想ネットワークにログインしたい'
質問データ #227, コサイン類似度: 0.583048209142973, '仮想ネットワークのCとYとZの違いが分からない'
質問データ #9, コサイン類似度: 0.567496752347493, '仮想ネットワークに登録したい。'
質問データ #15, コサイン類似度: 0.5263715386866682, '仮想ネットワークに登録方法を教えてほしい'
質問データ #80, コサイン類似度: 0.49054686610260423, '仮想ネットワークへ学生を追加方法'
質問データ #14, コサイン類似度: 0.47257770525071885, '学生が仮想ネットワークの申請はできないのか？'
質問データ #70, コサイン類似度: 0.42392572033961273, '仮想ネットワークの引っ越し方法のご質問'
質問データ #124, コサイン類似度: 0.38042647769464893, '仮想ネットワークにメンバーを追加にはどうしたらよいか'
質問データ #60, コサイン類似度: 0.3740576676643657, '仮想ネットワークに留学生を登録したいが，出てこない'
質問データ #138, コサイン類似度: 0.36707876303673787, 'OCUNET3の仮想ネットワークに入るためのレルムが切り換えれない'
(bachelor) user@MacBook-Pro bachelor % python3 test6.py OCUNET3
ストップワードの読み込み完了
NotFound
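
A likely reason why `仮想ネットワーク` returns NotFound while `仮想` succeeds: the questions are split into tokens by MeCab before TF-IDF is fitted, but the query is passed to `tfidf.transform` unsplit. The default `TfidfVectorizer` tokenizer then sees `仮想ネットワーク` as one token that is not in the vocabulary, so the query vector is all zeros. The sketch below reproduces this with scikit-learn alone, using token strings copied from the run-test3 output; `best_score` is a hypothetical helper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Token strings copied from the run-test3 output (already space-separated).
docs = [
    "仮想 ネットワーク ログイン する",
    "vpn 接続 できる",
    "リモートデスクトップ 接続 できる",
]


def best_score(query, docs):
    # Same steps as get_cs: fit on the documents, transform the query, compare.
    tfidf = TfidfVectorizer()
    doc_vectors = tfidf.fit_transform(docs)
    query_vector = tfidf.transform([query])
    return float(cosine_similarity(doc_vectors, query_vector).max())


unsplit = best_score("仮想ネットワーク", docs)  # one token, not in the vocabulary
split = best_score("仮想 ネットワーク", docs)   # two tokens, both in the vocabulary

if __name__ == "__main__":
    print(unsplit, split)
```

Running the query through the same `tokenizer_func` as the questions before calling `get_cs` would presumably address this case and the full-sentence case noted in run-test3. That is a suggestion, not something the scripts currently do.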

## test2.py
# Version before morphological analysis; prints only the single top-ranked result
import argparse
import numpy as np
import glob
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# import IPython
from IPython.display import display

# lunar1.py header
import pandas as pd
# import glob

# lunar2.py header
import MeCab
tagger = MeCab.Tagger("-Ochasen")
import mojimoji
import os
import urllib


text_paths = glob.glob('data/ocu2/*.txt')
texts = []

def main(args):
    data = get_data(args.datafile)
    cs = get_cs(args.query, data)

    max_index = np.argmax(cs)
    max_cs = cs[max_index][0]
    max_data = data[max_index]

    if max_cs > 1e-10:
        print(f"コサイン類似度: {max_cs}")
        print(f"アブストラクト: '{max_data}'")
    else:
        print("NotFound")

def get_data(datafile):
    # Load the question text from every file matching the glob pattern `datafile`.
    data = []
    for text_path in glob.glob(datafile):
        with open(text_path, 'r') as f:
            text = f.read()
        text = text.split(',')  # naive CSV split: a comma inside a cell shifts the indices
        text = ' '.join(text[8:9])  # question cell; empty string if the file has fewer cells
        text = text.replace('\n', '')
        text = text.strip('"')
        text = text.replace('する', '')
        text = text.replace('できる', '')

        data.append(text)

    return data

def get_cs(query, data):
    tfidf = TfidfVectorizer()
    abstract_vector = tfidf.fit_transform(data).toarray()
    query_vector = tfidf.transform([query]).toarray()
    cs = cosine_similarity(abstract_vector, query_vector)

    return cs

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("query", type=str)
    parser.add_argument("--datafile", type=str, default="data/ocu2/*.txt")
    args = parser.parse_args()
    main(args)
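
`get_data` splits each file on every comma, so a question that itself contains a comma would be cut in two and shift the later cell indices. A minimal sketch of the same lookup with the standard `csv` module follows. The sample record is made up, and `read_cells` and `cell_at` are hypothetical helpers. Flattening rows this way can also give different cell indices than `split(',')` when a file has several rows, so the indices 8 and 16 used in the scripts would need to be rechecked against the real data.

```python
import csv
import io


def read_cells(stream):
    # Parse CSV text with the csv module so quoted commas and quoted
    # line breaks stay inside their cell; flatten all rows into one list.
    return [cell for row in csv.reader(stream) for cell in row]


def cell_at(cells, index, default=""):
    # Mirror ' '.join(text[8:9]): return a default instead of raising
    # IndexError when the file has fewer cells than expected.
    return cells[index] if index < len(cells) else default


# Made-up sample record: the second cell contains a comma inside quotes.
sample = 'id,"Office 2010, 2013",answer\n'
cells = read_cells(io.StringIO(sample))
naive = sample.strip().split(",")

if __name__ == "__main__":
    print(cells)
    print(naive)
```

The `csv` reader also removes the surrounding quotes, which would make the `strip('"')` step unnecessary.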

## test3.py
import argparse
import numpy as np
import glob
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# import IPython
from IPython.display import display

# lunar1.py header
import pandas as pd
# import glob

# lunar2.py header
import MeCab
tagger = MeCab.Tagger("-Ochasen")
import mojimoji
import os
import urllib.request  # urlretrieve lives in the urllib.request submodule


text_paths = glob.glob('data/ocu2/*.txt')
texts = []

def get_data():
    # Read the question cell from every file in text_paths.
    data = []
    for text_path in text_paths:
        with open(text_path, 'r') as f:
            text = f.read()
        text = text.split(',')
        text = ' '.join(text[8:9])
        text = text.replace('\n', '')
        text = text.strip('"')
        text = text.replace('する', '')
        text = text.replace('できる', '')
        data.append(text)

    return data

texts = get_data()  # raw question texts, in the same order as text_paths

def load_jp_stopwords(path="data/jp_stop_words.txt"):
    url = 'http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt'
    if os.path.exists(path):
        print('File already exists.')
    else:
        print('Downloading...')
        urllib.request.urlretrieve(url, path)
    return pd.read_csv(path, header=None)[0].tolist()

def preprocess_jp(series):
    stop_words = load_jp_stopwords()
    def tokenizer_func(text):
        tokens = []
        node = tagger.parseToNode(str(text))
        while node:
            features = node.feature.split(',')
            surface = features[6]
            if (surface == '*') or (len(surface) < 2) or (surface in stop_words):
                node = node.next
                continue
            noun_flag = (features[0] == '名詞')
            proper_noun_flag = (features[0] == '名詞') & (features[1] == '固有名詞')
            verb_flag = (features[0] == '動詞') & (features[1] == '自立')
            adjective_flag = (features[0] == '形容詞') & (features[1] == '自立')
            if proper_noun_flag:
                tokens.append(surface)
            elif noun_flag:
                tokens.append(surface)
            elif verb_flag:
                tokens.append(surface)
            elif adjective_flag:
                tokens.append(surface)
            node = node.next
        return " ".join(tokens)

    series = series.map(tokenizer_func)

    #---------------Normalization-----------#
    series = series.map(lambda x: x.lower())
    # series = series.map(mojimoji.zen_to_han, kana=False)

    return series

# def get_cs(query, data):
def get_cs(query, series):
    tfidf = TfidfVectorizer()
    abstract_vector = tfidf.fit_transform(series).toarray()
    query_vector = tfidf.transform([query]).toarray()
    cs = cosine_similarity(abstract_vector, query_vector)

    return cs

def main(args):
    news_ss = pd.Series(texts)
    processed_news_ss = preprocess_jp(news_ss)
    data_mod = processed_news_ss
    str_data = map(str,data_mod)
    cs = get_cs(args.query, str_data)

    max_index = np.argmax(cs)
    max_cs = cs[max_index][0]
    max_data = processed_news_ss[max_index]

    if max_cs > 1e-10:
        print(f"コサイン類似度: {max_cs}, 配列番号: {max_index}")
        print(f"類似度最大単語: '{max_data}'")
        print(f"配列個数:{len(cs)}")
        print(f"類似度最大質問文章: '{texts[max_index]}'")
        # print(f"原文: '{max_raw_data}'")
    else:
        print("MAX NotFound")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("query", type=str)
    # parser.add_argument("--datafile", type=str, default="data/ocu2/*.txt")
    args = parser.parse_args()
    main(args)
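
The filter inside `tokenizer_func` can be looked at without MeCab by applying it to feature strings directly. The sketch below assumes the IPA-dictionary feature layout (part of speech at index 0, its subtype at index 1, base form at index 6). The sample feature strings are hand-written illustrations, the stop-word set is a small made-up subset, and `keep_token` is a hypothetical helper.

```python
def keep_token(feature, stop_words):
    # feature: comma-separated MeCab feature string (IPA-dictionary layout
    # assumed: POS at index 0, POS subtype at index 1, base form at index 6).
    fields = feature.split(",")
    if len(fields) < 7:
        return None  # fewer fields than the assumed layout
    pos, subtype, base = fields[0], fields[1], fields[6]
    if base == "*" or len(base) < 2 or base in stop_words:
        return None  # unknown word, single character, or stop word
    if pos == "名詞":
        return base  # every noun is kept, proper nouns included
    if pos in ("動詞", "形容詞") and subtype == "自立":
        return base  # independent verbs and adjectives
    return None


STOP_WORDS = {"こと", "もの"}  # illustrative subset, not the SlothLib list

samples = [
    "名詞,サ変接続,*,*,*,*,接続,セツゾク,セツゾク",
    "動詞,自立,*,*,一段,未然形,できる,デキ,デキ",
    "助詞,格助詞,一般,*,*,*,に,ニ,ニ",
    "名詞,非自立,一般,*,*,*,こと,コト,コト",
    "名詞,一般,*,*,*,*,*",
]
kept = [t for t in (keep_token(s, STOP_WORDS) for s in samples) if t]

if __name__ == "__main__":
    print(kept)
```

Keeping the stop words in a `set` makes the membership test constant-time, whereas the scripts search a list for every token. The last sample shows why a word missing from the dictionary (base form `*`) never reaches the TF-IDF vocabulary.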

## test4.py
import argparse
import numpy as np
import glob
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# import IPython
from IPython.display import display

# lunar1.py header
import pandas as pd
# import glob

# lunar2.py header
import MeCab
tagger = MeCab.Tagger("-Ochasen")
import mojimoji
import os
import urllib.request  # urlretrieve lives in the urllib.request submodule


text_paths = glob.glob('data/ocu2/*.txt')
# texts = []
# a_texts = []

def q_get(): # Load the question text from each file
    texts = []
    for text_path in text_paths:
        with open(text_path, 'r') as f:
            text = f.read()
        text = text.split(',') # Split the CSV file into cells
        text = ' '.join(text[8:9]) # Question cell
        text = text.replace( '\n' , '' ) # Remove line breaks from the question
        text = text.strip('"') # Strip the CSV cell quotes
        text = text.replace('する', '') # Remove する and できる (unnecessary?)
        text = text.replace('できる', '')
        texts.append(text) # Append to the texts list
    return texts

def a_get(): # Load the answer text from each file
    a_texts = []
    for text_path in text_paths:
        with open(text_path, 'r') as f:
            a_text = f.read()
        a_text = a_text.split(',') # Split the CSV file into cells
        a_text = ' '.join(a_text[16:17]) # Answer cell
        # a_text = a_text.replace( '\n' , '' ) # Removing line breaks made the answers hard to read, so this stays disabled
        a_text = a_text.strip('"') # Strip the CSV cell quotes
        a_texts.append(a_text) # Append to the a_texts list
    return a_texts

def load_jp_stopwords(path="data/jp_stop_words.txt"): # Load the stop-word list from an external file
    url = 'http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt'
    if os.path.exists(path):
        # print('File already exists.')
        print('ストップワードの読み込み完了') # "Stop words loaded"
    else:
        # print('Downloading...')
        print('ストップワードのダウンロード中') # "Downloading stop words"
        urllib.request.urlretrieve(url, path)
    return pd.read_csv(path, header=None)[0].tolist()

def preprocess_jp(series): # Preprocessing
    stop_words = load_jp_stopwords() # Stop words to remove
    def tokenizer_func(text): # Use MeCab to keep only nouns, verbs, and adjectives
        tokens = []
        node = tagger.parseToNode(str(text))
        while node:
            features = node.feature.split(',') # MeCab features are comma-separated, so split on commas
            surface = features[6] # Index 6 of the MeCab features holds the base (dictionary) form
            if (surface == '*') or (len(surface) < 2) or (surface in stop_words): # Skip unknown words, single characters, and stop words
                node = node.next
                continue
            is_noun = (features[0] == '名詞') # Covers proper nouns (固有名詞) as well
            is_verb = (features[0] == '動詞') and (features[1] == '自立')
            is_adjective = (features[0] == '形容詞') and (features[1] == '自立')
            if is_noun or is_verb or is_adjective:
                tokens.append(surface)
            node = node.next
        return " ".join(tokens)

    series = series.map(tokenizer_func)

    #---------------Normalization-----------#
    series = series.map(lambda x: x.lower()) # Lowercase everything
    # series = series.map(mojimoji.zen_to_han, kana=False) # Convert full-width to half-width (except katakana). Does not work for some reason, and is not needed.

    return series

# def get_cs(query, data):
def get_cs(query, series): # Cosine similarity between the query and the MeCab-processed questions
    tfidf = TfidfVectorizer() # TF-IDF vectorizer
    abstract_vector = tfidf.fit_transform(series).toarray() # Vectorize the questions with TF-IDF
    query_vector = tfidf.transform([query]).toarray() # Vectorize the input query with the same vocabulary
    cs = cosine_similarity(abstract_vector, query_vector) # Compute the cosine similarities

    return cs # Shape (n_questions, 1): one similarity per question

def main(args):
    texts = q_get() # Load the questions
    a_texts = a_get() # Load the answers
    news_ss = pd.Series(texts) # Store the questions in a pandas Series
    processed_news_ss = preprocess_jp(news_ss) # Preprocess the questions and run MeCab morphological analysis
    data_mod = processed_news_ss # Alias for the analyzed data
    str_data = map(str, data_mod) # Convert each element of data_mod to str
    cs = get_cs(args.query, str_data) # Compute the cosine similarities

    max_index = np.argmax(cs) # Index of the highest cosine similarity
    max_cs = cs[max_index][0] # Highest cosine similarity
    max_data = processed_news_ss[max_index] # Best-matching question (after processing)
    max_raw_data = texts[max_index] # Best-matching question (before processing)
    max_ans_data = a_texts[max_index] # Answer to the best-matching question

    if max_cs > 1e-10:
        # print(f"配列個数:{len(cs)}")
        print(f"配列個数:{len(cs)}, コサイン類似度: {max_cs}, 配列番号: {max_index}, 類似度最大単語: '{max_data}'")
        # print(f"類似度最大単語: '{max_data}'")
        print(f"類似度最大質問文章: '{max_raw_data}'")
        print(f"類似度最大回答文章: '{max_ans_data}'")
        # print(f"原文: '{max_raw_data}'")
    else:
        print("MAX NotFound")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("query", type=str)
    # parser.add_argument("--datafile", type=str, default="data/ocu2/*.txt")
    args = parser.parse_args()
    main(args)
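
For reference, the quantity `get_cs` obtains from `cosine_similarity` is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand with NumPy on toy vectors that are not taken from the data; `cosine` is a hypothetical helper, and returning 0.0 for an all-zero vector is a convention chosen here.

```python
import numpy as np


def cosine(u, v):
    # cos(u, v) = (u . v) / (|u| |v|); defined here as 0.0 when either
    # vector is all zeros, so an unmatched query scores 0 instead of NaN.
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / norm) if norm > 0 else 0.0


# Toy vectors (not from the data): a document with two equally weighted
# terms, a query containing only the first term, and an unrelated query.
doc = np.array([1.0, 1.0, 0.0])
query = np.array([1.0, 0.0, 0.0])
unrelated = np.array([0.0, 0.0, 1.0])

score = cosine(doc, query)

if __name__ == "__main__":
    print(score, cosine(doc, unrelated))
```

A one-term query against a document with two equally weighted terms gives 1/√2 ≈ 0.7071, the same value run-test2 reports for `ネットワーク`. Whether that is what happened in that run has not been verified. The `max_cs > 1e-10` check in `main` treats a score of effectively zero as "not found".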

## test5.py
import argparse
import numpy as np
import glob
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# import IPython
from IPython.display import display

# lunar1.py header
import pandas as pd
# import glob

# lunar2.py header
import MeCab
tagger = MeCab.Tagger("-Ochasen")
import mojimoji
import os
import urllib.request  # urlretrieve lives in the urllib.request submodule


text_paths = glob.glob('data/ocu2/*.txt')

def q_get(): # Load the question text from each file
    texts = []
    for text_path in text_paths:
        with open(text_path, 'r') as f:
            text = f.read()
        text = text.split(',') # Split the CSV file into cells
        text = ' '.join(text[8:9]) # Question cell
        text = text.replace( '\n' , '' ) # Remove line breaks from the question
        text = text.strip('"') # Strip the CSV cell quotes
        text = text.replace('する', '') # Remove する and できる (unnecessary?)
        text = text.replace('できる', '')
        texts.append(text) # Append to the texts list
    return texts

def a_get(): # Load the answer text from each file
    a_texts = []
    for text_path in text_paths:
        with open(text_path, 'r') as f:
            a_text = f.read()
        a_text = a_text.split(',') # Split the CSV file into cells
        a_text = ' '.join(a_text[16:17]) # Answer cell
        # a_text = a_text.replace( '\n' , '' ) # Removing line breaks made the answers hard to read, so this stays disabled
        a_text = a_text.strip('"') # Strip the CSV cell quotes
        a_texts.append(a_text) # Append to the a_texts list
    return a_texts

def load_stopwords(path="data/jp_stop_words.txt"): # Load the stop-word list from an external file
    url = 'http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt'
    if os.path.exists(path):
        print('ストップワードの読み込み完了') # "Stop words loaded"
    else:
        print('ストップワードのダウンロード中') # "Downloading stop words"
        urllib.request.urlretrieve(url, path)
    return pd.read_csv(path, header=None)[0].tolist()

def preprocess(series): # Preprocessing
    stop_words = load_stopwords() # Stop words to remove
    def tokenizer_func(text): # Use MeCab to keep only nouns, verbs, and adjectives
        tokens = []
        node = tagger.parseToNode(str(text))
        while node:
            features = node.feature.split(',') # MeCab features are comma-separated, so split on commas
            surface = features[6] # Index 6 of the MeCab features holds the base (dictionary) form
            if (surface == '*') or (len(surface) < 2) or (surface in stop_words): # Skip unknown words, single characters, and stop words
                node = node.next
                continue
            is_noun = (features[0] == '名詞') # Covers proper nouns (固有名詞) as well
            is_verb = (features[0] == '動詞') and (features[1] == '自立')
            is_adjective = (features[0] == '形容詞') and (features[1] == '自立')
            if is_noun or is_verb or is_adjective:
                tokens.append(surface)
            node = node.next
        return " ".join(tokens)

    series = series.map(tokenizer_func)

    #---------------Normalization-----------#
    series = series.map(lambda x: x.lower()) # Lowercase everything
    # series = series.map(mojimoji.zen_to_han, kana=False) # Convert full-width to half-width (except katakana). Does not work for some reason, and is not needed.

    return series

def get_cs(query, series): # Cosine similarity between the query and the MeCab-processed questions
    tfidf = TfidfVectorizer() # TF-IDF vectorizer
    question_vector = tfidf.fit_transform(series).toarray() # Vectorize the questions with TF-IDF
    query_vector = tfidf.transform([query]).toarray() # Vectorize the input query with the same vocabulary
    cs = cosine_similarity(question_vector, query_vector) # Compute the cosine similarities

    return cs # Shape (n_questions, 1): one similarity per question

def main(args):
    texts = q_get() # Load the questions
    a_texts = a_get() # Load the answers
    q_series = pd.Series(texts) # Store the questions in a pandas Series
    processed_q_series = preprocess(q_series) # Preprocess the questions and run MeCab morphological analysis
    # data_mod = processed_q_series # Alias for the analyzed data
    # str_data = map(str, data_mod) # Convert each element of data_mod to str
    str_data = map(str, processed_q_series) # Convert each element of the processed Series to str
    cs = get_cs(args.query, str_data) # Compute the cosine similarities

    max_index = np.argmax(cs) # Index of the highest cosine similarity
    max_cs = cs[max_index][0] # Highest cosine similarity
    max_data = processed_q_series[max_index] # Best-matching question (after processing)
    max_raw_data = texts[max_index] # Best-matching question (before processing)
    max_ans_data = a_texts[max_index] # Answer to the best-matching question

    if max_cs > 1e-10:
        print(f"配列個数:{len(cs)}, コサイン類似度: {max_cs}, 配列番号: {max_index}, 類似度最大単語: '{max_data}'")
        print(f"類似度最大質問文章: '{max_raw_data}'")
        print(f"類似度最大回答文章: '{max_ans_data}'")
    else:
        print("MAX NotFound")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("query", type=str)
    args = parser.parse_args()
    main(args)

## test6-2.py
import argparse
import numpy as np
import glob
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# import IPython
from IPython.display import display

# lunar1.py header
import pandas as pd
# import glob

# lunar2.py header
import MeCab
tagger = MeCab.Tagger("-Ochasen")
import mojimoji
import os
import urllib


text_paths = glob.glob('data/ocu2/*.txt')

def q_get(): # Load the question text from each file
    texts = []
    for text_path in text_paths:
        with open(text_path, 'r') as f:
            text = f.read()
        text = text.split(',') # Split the CSV file into cells
        text = ' '.join(text[8:9]) # Question cell
        text = text.replace( '\n' , '' ) # Remove line breaks from the question
        text = text.strip('"') # Strip the CSV cell quotes
        text = text.replace('する', '') # Remove する and できる (unnecessary?)
        text = text.replace('できる', '')
        texts.append(text) # Append to the texts list
    return texts

def a_get(): # Load the answer text from each file
    a_texts = []
    for text_path in text_paths:
        with open(text_path, 'r') as f:
            a_text = f.read()
        a_text = a_text.split(',') # Split the CSV file into cells
        a_text = ' '.join(a_text[16:17]) # Answer cell
        # a_text = a_text.replace( '\n' , '' ) # Removing line breaks made the answers hard to read, so this stays disabled
        a_text = a_text.strip('"') # Strip the CSV cell quotes
        a_texts.append(a_text) # Append to the a_texts list
    return a_texts

def load_stopwords(path="data/jp_stop_words.txt"): # Load the stop-word list from an external file
    url = 'http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt'
    # if os.path.exists(path):
    #     print('ストップワードの読み込み完了')
    # else:
    #     print('ストップワードのダウンロード中')
    #     urllib.request.urlretrieve(url, path)
    return pd.read_csv(path, header=None)[0].tolist()

def preprocess(series): # Preprocessing
    stop_words = load_stopwords() # Stop words to remove
    def tokenizer_func(text): # Use MeCab to keep only nouns, verbs, and adjectives
        tokens = []
        node = tagger.parseToNode(str(text))
        while node:
            features = node.feature.split(',') # MeCab features are comma-separated, so split on commas
            surface = features[6] # Index 6 of the MeCab features holds the base (dictionary) form
            if (surface == '*') or (len(surface) < 2) or (surface in stop_words): # Skip unknown words, single characters, and stop words
                node = node.next
                continue
            is_noun = (features[0] == '名詞') # Covers proper nouns (固有名詞) as well
            is_verb = (features[0] == '動詞') and (features[1] == '自立')
            is_adjective = (features[0] == '形容詞') and (features[1] == '自立')
            if is_noun or is_verb or is_adjective:
                tokens.append(surface)
            node = node.next
        return " ".join(tokens)

    series = series.map(tokenizer_func)

    #---------------Normalization-----------#
    series = series.map(lambda x: x.lower()) # Lowercase everything
    # series = series.map(mojimoji.zen_to_han, kana=False) # Convert full-width to half-width (except katakana). Does not work for some reason, and is not needed.

    return series

def get_cs(query, series): # Cosine similarity between the query and the MeCab-processed questions
    tfidf = TfidfVectorizer() # TF-IDF vectorizer
    question_vector = tfidf.fit_transform(series).toarray() # Vectorize the questions with TF-IDF
    query_vector = tfidf.transform([query]).toarray() # Vectorize the input query with the same vocabulary
    cs = cosine_similarity(question_vector, query_vector) # Compute the cosine similarities

    return cs # Shape (n_questions, 1): one similarity per question

def find_top_n(n, cs): # Get the indices of the n highest cosine similarities, best first
    # arr_top_n_indices = np.argsort(cs)[::-1][:n] # https://www.pytry3g.com/entry/cosine_similarity
    arr_top_n_indices = np.argsort(cs, axis = None)[-n:] # axis=None flattens the (n_questions, 1) array before sorting
    # https://stackoverflow.com/questions/16993707/getting-the-top-k-relevant-document-from-a-similarity-numpy-ndarray
    top_n_indices = arr_top_n_indices[::-1] # Reverse into descending order
    return top_n_indices # top_n_indices is an array of n integer indices

def get_n_cs(cs, top_n_index, top_n_indices): # Cosine similarity of the question at index top_n_index
    # top_n_indices is unused; the parameter is kept so the existing call in main still works
    return cs[top_n_index][0]

def main(args):
    texts = q_get() # Load the questions
    a_texts = a_get() # Load the answers
    q_series = pd.Series(texts) # Store the questions in a pandas Series
    processed_q_series = preprocess(q_series) # Preprocess the questions and run MeCab morphological analysis
    # str_data = map(str, processed_q_series) # Convert each element to str. Not needed.
    str_data = processed_q_series # Use either this line or the one above
    cs = get_cs(args.query, str_data) # Compute the cosine similarities
    n = 10 # Number of top results to show
    top_n_indices = find_top_n(n, cs) # Indices of the top n results, best first, as an np.array

    max_index = np.argmax(cs) # Index of the highest cosine similarity
    max_cs = cs[max_index][0] # Highest cosine similarity

    if max_cs > 1e-10:
        print(f"該当する質問番号: {top_n_indices}")
        # print(f"{lst_top_n}") # Show the np.array converted to a plain list
        # print(f"配列個数:{len(cs)}, コサイン類似度: {max_cs}, 配列番号: {max_index}, 類似度最大単語: '{max_data}'")
        for top_n_index in top_n_indices: # Print the results
            n_cs = get_n_cs(cs, top_n_index, top_n_indices) # Cosine similarity of this result
            if n_cs > 1e-10:
                print(f"質問データ #{top_n_index}, コサイン類似度: {n_cs}, '{texts[top_n_index]}'")
    else:
        print("NotFound")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("query", type=str)
    args = parser.parse_args()
    main(args)
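
`find_top_n` and the per-result threshold check in `main` can be folded into one step that returns index and score pairs. The following is a minimal sketch on toy scores that are not from the data; `top_n_hits` is a hypothetical name, and the `(n_docs, 1)` input shape follows what `cosine_similarity` returns for a single query.

```python
import numpy as np


def top_n_hits(cs, n, threshold=1e-10):
    # cs has shape (n_docs, 1), as returned by cosine_similarity for one query.
    scores = np.asarray(cs).ravel()
    order = np.argsort(scores)[::-1][:n]  # indices of the n largest scores, best first
    # Drop entries whose score is effectively zero, as main does with n_cs > 1e-10.
    return [(int(i), float(scores[i])) for i in order if scores[i] > threshold]


# Toy scores (not from the data): two documents do not match at all.
cs = np.array([[0.2], [0.0], [0.9], [0.0], [0.5]])
hits = top_n_hits(cs, n=3)

if __name__ == "__main__":
    for index, score in hits:
        print(index, score)
```

Because zero-score entries are filtered out, asking for more results than there are matches simply yields a shorter list, which is why the `vpn` query in run-test6 prints seven rows although ten indices are listed.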

## test6.py
import argparse
import numpy as np
import glob
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# import IPython
from IPython.display import display

# lunar1.py header
import pandas as pd
# import glob

# lunar2.py header
import MeCab
tagger = MeCab.Tagger("-Ochasen")
import mojimoji
import os
import urllib.request  # urlretrieve lives in the urllib.request submodule


text_paths = glob.glob('data/ocu2/*.txt')

def q_get(): # Load the question text from each file
    texts = []
    for text_path in text_paths:
        with open(text_path, 'r') as f:
            text = f.read()
        text = text.split(',') # Split the CSV file into cells
        text = ' '.join(text[8:9]) # Question cell
        text = text.replace( '\n' , '' ) # Remove line breaks from the question
        text = text.strip('"') # Strip the CSV cell quotes
        text = text.replace('する', '') # Remove する and できる (unnecessary?)
        text = text.replace('できる', '')
        texts.append(text) # Append to the texts list
    return texts

def a_get(): # Load the answer text from each file
    a_texts = []
    for text_path in text_paths:
        with open(text_path, 'r') as f:
            a_text = f.read()
        a_text = a_text.split(',') # Split the CSV file into cells
        a_text = ' '.join(a_text[16:17]) # Answer cell
        # a_text = a_text.replace( '\n' , '' ) # Removing line breaks made the answers hard to read, so this stays disabled
        a_text = a_text.strip('"') # Strip the CSV cell quotes
        a_texts.append(a_text) # Append to the a_texts list
    return a_texts

def load_stopwords(path="data/jp_stop_words.txt"): # Load the stop-word list from an external file
    url = 'http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt'
    if os.path.exists(path):
        print('ストップワードの読み込み完了') # "Stop words loaded"
    else:
        print('ストップワードのダウンロード中') # "Downloading stop words"
        urllib.request.urlretrieve(url, path)
    return pd.read_csv(path, header=None)[0].tolist()

def preprocess(series): # Preprocessing
    stop_words = load_stopwords() # Stop words to remove
    def tokenizer_func(text): # Use MeCab to keep only nouns, verbs, and adjectives
        tokens = []
        node = tagger.parseToNode(str(text))
        while node:
            features = node.feature.split(',') # MeCab features are comma-separated, so split on commas
            surface = features[6] # Index 6 of the MeCab features holds the base (dictionary) form
            if (surface == '*') or (len(surface) < 2) or (surface in stop_words): # Skip unknown words, single characters, and stop words
                node = node.next
                continue
            is_noun = (features[0] == '名詞') # Covers proper nouns (固有名詞) as well
            is_verb = (features[0] == '動詞') and (features[1] == '自立')
            is_adjective = (features[0] == '形容詞') and (features[1] == '自立')
            if is_noun or is_verb or is_adjective:
                tokens.append(surface)
            node = node.next
        return " ".join(tokens)

    series = series.map(tokenizer_func)

    #---------------Normalization-----------#
    series = series.map(lambda x: x.lower()) # Lowercase everything
    # series = series.map(mojimoji.zen_to_han, kana=False) # Convert full-width to half-width (except katakana). Does not work for some reason, and is not needed.

    return series

def get_cs(query, series): # Cosine similarity between the MeCab-processed questions and the query
    tfidf = TfidfVectorizer() # TF-IDF vectorizer
    question_vector = tfidf.fit_transform(series).toarray() # fit the vocabulary on the questions and vectorize them
    query_vector = tfidf.transform([query]).toarray() # vectorize the query with the same vocabulary
    cs = cosine_similarity(question_vector, query_vector) # shape (n_questions, 1)

    return cs # one similarity per question

def find_top_n(n, cs): # Indices of the n highest similarities, best first
    # arr_top_n_indices = np.argsort(cs)[::-1][:n] # https://www.pytry3g.com/entry/cosine_similarity
    arr_top_n_indices = np.argsort(cs, axis = None)[-n:] # axis=None flattens cs; the last n are the largest, in ascending order
    # https://stackoverflow.com/questions/16993707/getting-the-top-k-relevant-document-from-a-similarity-numpy-ndarray
    top_n_indices = arr_top_n_indices[::-1] # reverse into descending order
    return top_n_indices # top_n_indices is an array of n indices


# def top_n(n, cs, texts, a_texts, top_n_indices): # fetch the top n search results
# def top_n(n, cs, text, a_text): # fetch the top n search results: copy the elements at the indices found by find_top_n into new lists
#     top_n_indices = find_top_n(n, cs)
#     n_texts = []
#     n_a_texts = []
#     for n_text in top_n_indices:
#         n_texts.append(text)
#     for n_a_text in top_n_indices:
#         n_a_texts.append(a_text)
#     return n_texts, n_a_texts

def main(args):
    texts = q_get() # load the question texts
    a_texts = a_get() # load the answer texts
    q_series = pd.Series(texts) # wrap the questions in a pandas Series
    processed_q_series = preprocess(q_series) # preprocess and tokenize the questions with MeCab
    str_data = map(str, processed_q_series) # make sure every element is a str
    cs = get_cs(args.query, str_data) # cosine similarities, shape (n_questions, 1)
    n = 10 # show the top n results
    top_n_indices = find_top_n(n, cs) # np.array of the n best question indices, best first
    # questions, answers = top_n(n, cs, texts, a_texts, top_n_indices)
    # n_texts, n_a_texts = top_n(n, cs, texts, a_texts)
    # lst_top_n = top_n_indices.tolist() # convert the np.array to a plain list

    max_index = np.argmax(cs) # index of the highest cosine similarity
    max_cs = cs[max_index][0] # highest cosine similarity
    # max_data = processed_q_series[max_index] # best-matching question (after preprocessing)
    # max_raw_data = texts[max_index] # best-matching question (raw)
    # max_ans_data = a_texts[max_index] # answer for the best-matching question

    if max_cs > 1e-10:
        print(f"該当する質問番号: {top_n_indices}")
        # print(f"{lst_top_n}") # show the indices as a plain list
        # print(f"配列個数:{len(cs)}, コサイン類似度: {max_cs}, 配列番号: {max_index}, 類似度最大単語: '{max_data}'")
        for top_n_index in top_n_indices: # print each result
            n_cs = cs[top_n_index][0] # similarity of this question
            if n_cs > 1e-10: # hide questions with no overlap with the query
                print(f"質問データ #{top_n_index}, コサイン類似度: {n_cs}, '{texts[top_n_index]}'")
    else:
        print("NotFound")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("query", type=str)
    args = parser.parse_args()
    main(args)
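
The top-n selection in `find_top_n` can be tried on its own without the question data. The sketch below is illustrative only: `top_n_with_scores` and the toy similarity values are assumptions, not part of test6.py. It flattens the `(n_docs, 1)` similarity column, sorts it in descending order, and returns each index together with its score.

```python
import numpy as np

def top_n_with_scores(cs, n):
    """Return (index, score) pairs for the n largest similarities, best first."""
    scores = np.asarray(cs).ravel()        # (n_docs, 1) -> (n_docs,)
    order = np.argsort(scores)[::-1][:n]   # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]

# Toy column shaped like the output of cosine_similarity(docs, query)
cs = np.array([[0.1], [0.8], [0.0], [0.5]])
result = top_n_with_scores(cs, 2)
print(result)
```

Returning the score with the index avoids a second lookup into `cs` when printing the results.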

## test7.py
# Preprocess the input query as well, then show the cosine similarities
import argparse
import numpy as np
import glob
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# import IPython
from IPython.display import display

# lunar1.py header
import pandas as pd
# import glob

# lunar2.py header
import MeCab
tagger = MeCab.Tagger("-Ochasen")
import mojimoji
import os
import urllib


text_paths = glob.glob('data/ocu2/*.txt')

def q_get(): # Collect the question text from each file
    texts = []
    for text_path in text_paths:
        with open(text_path, 'r') as f: # close the file handle promptly
            text = f.read()
        text = text.split(',') # split the CSV record into cells
        text = ' '.join(text[8:9]) # cell 8 holds the question text
        text = text.replace( '\n' , '' ) # drop line breaks inside the question
        text = text.strip('"') # drop the CSV cell quotes
        text = text.replace('する', '') # strip する / できる (possibly unnecessary)
        text = text.replace('できる', '')
        texts.append(text)
    return texts

def a_get(): # Collect the answer text from each file
    a_texts = []
    for text_path in text_paths:
        with open(text_path, 'r') as f: # close the file handle promptly
            a_text = f.read()
        a_text = a_text.split(',') # split the CSV record into cells
        a_text = ' '.join(a_text[16:17]) # cell 16 holds the answer text
        # a_text = a_text.replace( '\n' , '' ) # dropping line breaks made answers hard to read, so keep them
        a_text = a_text.strip('"') # drop the CSV cell quotes
        a_texts.append(a_text)
    return a_texts

def load_stopwords(path="data/jp_stop_words.txt"): # Load the stop-word list from a local file (download step disabled)
    url = 'http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt'
    # if os.path.exists(path):
    #     print('ストップワードの読み込み完了')
    # else:
    #     print('ストップワードのダウンロード中')
    #     urllib.request.urlretrieve(url, path)
    return pd.read_csv(path, header=None)[0].tolist()

def preprocess(series): # Preprocessing
    stop_words = set(load_stopwords()) # stop words to drop; a set gives fast lookups
    def tokenizer_func(text): # Keep only nouns, independent verbs, and independent adjectives via MeCab
        tokens = []
        node = tagger.parseToNode(str(text))
        while node:
            features = node.feature.split(',') # MeCab features are comma-separated
            surface = features[6] # index 6 is the base (dictionary) form, not the surface form
            if (surface == '*') or (len(surface) < 2) or (surface in stop_words): # skip unknown words, single characters, and stop words
                node = node.next
                continue
            noun_flag = (features[0] == '名詞') # also covers proper nouns
            verb_flag = (features[0] == '動詞') and (features[1] == '自立')
            adjective_flag = (features[0] == '形容詞') and (features[1] == '自立')
            if noun_flag or verb_flag or adjective_flag:
                tokens.append(surface)
            node = node.next
        return " ".join(tokens)

    series = series.map(tokenizer_func)

    #---------------Normalization-----------#
    series = series.map(lambda x: x.lower()) # lowercase everything
    # series = series.map(mojimoji.zen_to_han, kana=False) # convert to half-width (except katakana); did not work and is not needed

    return series

# query_preprocess is not needed: preprocess() can be reused for the query.
# def query_preprocess(query_series): # 前処理
#     stop_words = load_stopwords() # ストップワードの削除
#     def tokenizer_func(text): # MeCab で名詞，動詞，形容動詞のみを残す処理する部分
#         tokens = []
#         node = tagger.parseToNode(str(text))
#         while node:
#             features = node.feature.split(',') # MeCab 辞書はコンマ区切りなので，コンマで分割
#             surface = features[6] # MeCab 辞書の6番目の言葉の原型を抽出
#             if (surface == '*') or (len(surface) < 2) or (surface in stop_words): # 知らない言葉は無視
#                 node = node.next
#                 continue
#             noun_flag = (features[0] == '名詞')
#             proper_noun_flag = (features[0] == '名詞') & (features[1] == '固有名詞')
#             verb_flag = (features[0] == '動詞') & (features[1] == '自立')
#             adjective_flag = (features[0] == '形容詞') & (features[1] == '自立')
#             if proper_noun_flag:
#                 tokens.append(surface)
#             elif noun_flag:
#                 tokens.append(surface)
#             elif verb_flag:
#                 tokens.append(surface)
#             elif adjective_flag:
#                 tokens.append(surface)
#             node = node.next
#         return " ".join(tokens)
#
#     query_series = query_series.map(tokenizer_func)
#     # query_series = tokenizer_func(query_series)
#
#     #---------------Normalization-----------#
#     query_series = query_series.map(lambda x: x.lower()) # 小文字に統一
#     # series = series.map(mojimoji.zen_to_han, kana=False) # 半角に（カタカナ除く）統一．なんか動かないし不要．
#     return query_series

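###################################################################
# The multi-dimensional array handling here was the sticking point. #
###################################################################
def get_cs(query_series, series): # Cosine similarity between the MeCab-processed questions and the processed query
    tfidf = TfidfVectorizer() # TF-IDF vectorizer
    question_vector = tfidf.fit_transform(series).toarray() # fit the vocabulary on the questions and vectorize them
    # Use transform (not fit_transform) for the query so it is projected onto the same vocabulary.
    # Refitting on the query changes the number of columns, and cosine_similarity then fails.
    query_vector = tfidf.transform(query_series).toarray() # shape (1, n_features)
    cs = cosine_similarity(question_vector, query_vector) # shape (n_questions, 1)

    return cs # one similarity per question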

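def find_top_n(n, cs): # Indices of the n highest similarities, best first
    arr_top_n_indices = np.argsort(cs, axis = None)[-n:] # axis=None flattens cs; the last n are the largest, in ascending order
    top_n_indices = arr_top_n_indices[::-1] # reverse into descending order
    return top_n_indices # top_n_indices is an array of n indices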

def get_n_cs(cs, top_n_index, top_n_indices): # Cosine similarity of the question at index top_n_index
    # top_n_indices is unused; it stays in the signature so existing callers keep working
    return cs[top_n_index][0]

def listing_query(query): # Wrap the query string in a one-element list
    list_query = []
    list_query.append(query)

    return list_query

def main(args):
    texts = q_get() # load the question texts
    a_texts = a_get() # load the answer texts
    query_texts = listing_query(args.query)
    q_series = pd.Series(texts) # wrap the questions in a pandas Series
    query_series = pd.Series(query_texts) # wrap the query in a pandas Series
    processed_q_series = preprocess(q_series) # preprocess and tokenize the questions with MeCab
    # str_data = map(str, processed_q_series) # cast each element to str; not needed
    str_data = processed_q_series # use either this line or the one above
    processed_query_series = preprocess(query_series) # preprocess the query the same way
    # print(type(processed_query_series))
    str_query = map(str, processed_query_series)
    cs = get_cs(str_query, str_data) # cosine similarities, shape (n_questions, 1)
    n = 10 # show the top n results
    top_n_indices = find_top_n(n, cs) # np.array of the n best question indices, best first

    max_index = np.argmax(cs) # index of the highest cosine similarity
    max_cs = cs[max_index][0] # highest cosine similarity

    if max_cs > 1e-10:
        print(f"該当する質問番号: {top_n_indices}")
        # print(f"{lst_top_n}") # show the indices as a plain list
        # print(f"配列個数:{len(cs)}, コサイン類似度: {max_cs}, 配列番号: {max_index}, 類似度最大単語: '{max_data}'")
        for top_n_index in top_n_indices: # print each result
            n_cs = get_n_cs(cs, top_n_index, top_n_indices) # similarity of this question
            if n_cs > 1e-10: # hide questions with no overlap with the query
                print(f"質問データ #{top_n_index}, コサイン類似度: {n_cs}, '{texts[top_n_index]}'")
    else:
        print("NotFound")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("query", type=str)
    args = parser.parse_args()
    main(args)
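
The array-shape problem noted above `get_cs` can be reproduced on a toy corpus. This is a minimal sketch rather than the script's actual data: the three pre-tokenized documents and the query are made-up examples. It shows that a vectorizer fitted on the corpus and then applied to the query with `transform` yields vectors of matching width, while refitting on the query alone does not.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical space-separated tokens, in the form preprocess() produces
docs = ["vpn 接続 できる", "office インストール", "仮想 ネットワーク ログイン"]
query = ["仮想 ネットワーク"]

tfidf = TfidfVectorizer()
doc_vectors = tfidf.fit_transform(docs)    # learns the vocabulary from the corpus
query_vector = tfidf.transform(query)      # reuses that vocabulary for the query

# Refitting on the query builds a different, smaller vocabulary
refit_vector = TfidfVectorizer().fit_transform(query)
print(doc_vectors.shape[1], query_vector.shape[1], refit_vector.shape[1])

cs = cosine_similarity(doc_vectors, query_vector)  # shape (n_docs, 1)
best = int(cs.argmax())
print(cs.shape, best)
```

Because the query vector shares the corpus vocabulary, `cosine_similarity` returns one score per document, and the index of the largest score points at the best-matching question.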