Predicting Titanic Survival with ChatGPT (LangChain Edition) « LANCARD.LAB

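Titanic survival prediction is a well-known machine learning exercise on Kaggle.
The task is to predict which passengers survived the Titanic.
In this post, I have ChatGPT do the whole thing.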

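I use LangChain as the AI agent.
LangChain is a framework for building applications that use LLMs.
With a LangChain Agent, the Python program written by the LLM is run in a Python REPL, the result is appended to the prompt, and the prompt is sent to the LLM again. This repeats automatically until the LLM arrives at a Final Answer.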
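
The loop described above can be sketched in a few lines. This is only an illustration of the idea, not LangChain's implementation: `run_python`, `agent_loop`, and `scripted_llm` are hypothetical names, and the "LLM" is a scripted stand-in so that the example runs without an API key.

```python
import contextlib
import io
import re


def run_python(code, namespace):
    """Execute code and return what it printed (a stand-in for a Python REPL tool)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)
    except Exception as exc:
        # Feed the error text back so the model gets a chance to debug.
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue()


def agent_loop(llm, question, max_steps=5):
    """Call llm repeatedly, running the code it writes, until it gives a Final Answer."""
    prompt = f"Question: {question}\nThought:"
    namespace = {}
    for _ in range(max_steps):
        reply = llm(prompt)
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        match = re.search(r"Action Input:\s*(.*)", reply, re.DOTALL)
        if match is None:
            raise ValueError(f"Could not parse LLM output: {reply!r}")
        observation = run_python(match.group(1), namespace)
        # Append the reply and its result, then ask the model to continue.
        prompt += f"{reply}\nObservation: {observation}\nThought:"
    raise RuntimeError("No Final Answer within max_steps")


def scripted_llm(prompt):
    """Hypothetical stand-in for the model: first runs code, then answers."""
    if "Observation:" not in prompt:
        return " I should compute it.\nAction: Python REPL\nAction Input:\nprint(6 * 7)"
    observation = prompt.rsplit("Observation:", 1)[1].split("\n")[0].strip()
    return f" I now know the final answer\nFinal Answer: {observation}"


print(agent_loop(scripted_llm, "What is 6 times 7?"))
```

Note that the prompt grows on every step, since each reply and each observation is appended to it.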

I did not want a Python REPL running automatically on my local machine, so I tried this in Google Colaboratory.

Implementation

First, install the required packages.

!pip install openai
!pip install langchain

Set the OpenAI API key in an environment variable.

import os
os.environ["OPENAI_API_KEY"] = "**********************"

Create the agent.
The official site creates the LLM with OpenAI(), but that gives GPT-3 (text-davinci-003), so I use OpenAIChat() here to get ChatGPT (gpt-3.5-turbo).

from langchain.agents.agent_toolkits import create_python_agent
from langchain.tools.python.tool import PythonREPLTool
from langchain.llms import OpenAIChat

agent_executor = create_python_agent(
    llm=OpenAIChat(temperature=0),
    tool=PythonREPLTool(),
    verbose=True
)

Next, agent_executor.run(<question>) sends the question through the OpenAI API.
This causes the following prompt to be sent to the OpenAI API.

You are an agent designed to write and execute python code to answer questions.
You have access to a python REPL, which you can use to execute python code.
If you get an error, debug your code and try again.
Only use the output of your code to answer the question. 
You might know the answer without running any code, but you should still run the code to get the answer.
If it does not seem like you can write code to answer the question, just return "I don't know" as the answer.


Python REPL: A Python shell. Use this to execute python commands. Input should be a valid python command. If you want to see the output of a value, you should print it out with `print(...)`.

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [Python REPL]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: <the question you entered>
Thought:

In short, the prompt tells the model that it is an agent that writes and runs Python code through a Python REPL. It should debug and retry on errors, answer only from the output of its code (running code even when it thinks it already knows the answer), and return "I don't know" if code cannot answer the question.
Each turn must follow the Question / Thought / Action / Action Input / Observation format, which can repeat N times, and the run ends with "Thought: I now know the final answer" followed by "Final Answer:".

From here, prompts are sent to ChatGPT repeatedly until ChatGPT produces a Final Answer.

First, place train.csv and test.csv under the data directory,
then send the following question.

question="""
It is your task to predict if a passenger survived the crashing plane or not.

The data has been split into two groups:
    training set (path: './data/train.csv')
    test set (path: './data/test.csv')
The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger.
The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes.

You should write a csv file with exactly 418 entries plus a header row.
The file should have exactly 2 columns:
- PassengerId (sorted in any order)
- Survived (contains your binary predictions: 1 for survived, 0 for deceased)
"""

agent_executor.run(question)

In short, the question asks the agent to predict whether each passenger survived a plane crash. It should build a machine learning model from the training set (./data/train.csv), which includes the outcome for each passenger, and predict the outcomes for the test set (./data/test.csv), which does not.
The result must be a CSV file with exactly 418 entries plus a header row and exactly 2 columns: PassengerId and Survived (1 for survived, 0 for deceased).

Titanic survival prediction is famous and ChatGPT probably knows it, in which case it might cheat. So I left out the word "Titanic" and wrote "crashing plane" instead.

Output

Below is the output of the run.
(The head() output is partly omitted.)

> Entering new AgentExecutor chain...
This is a machine learning task. 
I need to build a model using the training set and use it to predict the outcomes for the test set. I should start by loading the data.

Action: Python REPL
Action Input:
import pandas as pd
train_data = pd.read_csv('./data/train.csv')
test_data = pd.read_csv('./data/test.csv')
print(train_data.head())
print(test_data.head())

Observation:    PassengerId  Survived  Pclass  Name     Sex   Age  SibSp Parch            Ticket     Fare Cabin Embarked 
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   <omitted>


Thought:I need to explore the data to see what features are available and what they look like.

Action: Python REPL
Action Input:
print(train_data.info())
print(test_data.info())

Observation: <class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
None

Thought:I see that there are missing values in the Age, Cabin, and Embarked columns in the training set, and in the Age, Fare, and Cabin columns in the test set. I need to handle these missing values before building my model.

Action: Python REPL
Action Input:
print(train_data.isnull().sum())
print(test_data.isnull().sum())

Observation: PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

Thought:I need to handle the missing values in the Age, Cabin, and Embarked columns in the training set, and in the Age, Fare, and Cabin columns in the test set. I can either drop the rows with missing values or fill in the missing values with some strategy.

Action: Python REPL
Action Input:
# Fill in missing values in Age column with median age
train_data['Age'].fillna(train_data['Age'].median(), inplace=True)
test_data['Age'].fillna(test_data['Age'].median(), inplace=True)

# Fill in missing values in Embarked column with most common value
train_data['Embarked'].fillna(train_data['Embarked'].mode()[0], inplace=True)

# Fill in missing values in Fare column with median fare
test_data['Fare'].fillna(test_data['Fare'].median(), inplace=True)

# Drop Cabin column since it has too many missing values
train_data.drop('Cabin', axis=1, inplace=True)
test_data.drop('Cabin', axis=1, inplace=True)

print(train_data.isnull().sum())
print(test_data.isnull().sum())

Observation: PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64
PassengerId    0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64

Thought:Now that I have handled the missing values, I need to prepare the data for modeling. I should start by encoding categorical variables as numerical variables.

Action: Python REPL
Action Input:
# Encode Sex column as binary variable
train_data['Sex'] = train_data['Sex'].map({'male': 0, 'female': 1})
test_data['Sex'] = test_data['Sex'].map({'male': 0, 'female': 1})

# Encode Embarked column as dummy variables
train_data = pd.get_dummies(train_data, columns=['Embarked'])
test_data = pd.get_dummies(test_data, columns=['Embarked'])

print(train_data.head())
print(test_data.head())

Observation:    PassengerId  Survived  Pclass  Name     Sex   Age  SibSp Parch 
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   <omitted>

Thought:I see that the Name and Ticket columns are not useful for modeling, so I should drop them.

Action: Python REPL
Action Input:
train_data.drop(['Name', 'Ticket'], axis=1, inplace=True)
test_data.drop(['Name', 'Ticket'], axis=1, inplace=True)

print(train_data.head())
print(test_data.head())

Observation:    PassengerId  Survived  Pclass  Sex   Age  SibSp  Parch     Fare
0            1         0       3    0  22.0      1      0   7.2500   
1            2         1       1    1  38.0      1      0  71.2833   
2            3         1       3    1  26.0      0      0   7.9250   
3            4         1       1    1  35.0      1      0  53.1000   
4            5         0       3    0  35.0      0      0   8.0500   <omitted>

Thought:I need to split the training set into a training set and a validation set to evaluate my model.

Action: Python REPL
Action Input:
from sklearn.model_selection import train_test_split

X = train_data.drop('Survived', axis=1)
y = train_data['Survived']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape)
print(X_val.shape)
print(y_train.shape)
print(y_val.shape)

Observation: (712, 10)
(179, 10)
(712,)
(179,)

Thought:Now I can build and train my model. I will use a logistic regression model.

Action: Python REPL
Action Input:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

print(model.score(X_val, y_val))

Observation: 0.7821229050279329

Thought:/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
My model has an accuracy of 78.2% on the validation set. Now I can use it to make predictions on the test set.

Action: Python REPL
Action Input:
# Make predictions on test set
predictions = model.predict(test_data)

# Create submission file
submission = pd.DataFrame({'PassengerId': test_data['PassengerId'], 'Survived': predictions})
submission.to_csv('submission.csv', index=False)

print(submission.head())

Observation:    PassengerId  Survived
0          892         0
1          893         1
2          894         0
3          895         0
4          896         1

Thought:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-93-1b4108d5832a> in <cell line: 1>()
----> 1 agent_executor.run(prompt)

6 frames
/usr/local/lib/python3.10/dist-packages/langchain/agents/mrkl/output_parser.py in parse(self, text)
     18         match = re.search(regex, text, re.DOTALL)
     19         if not match:
---> 20             raise ValueError(f"Could not parse LLM output: `{text}`")
     21         action = match.group(1).strip()
     22         action_input = match.group(2)

ValueError: Could not parse LLM output: `I have successfully made predictions on the test set and created a submission file. My final answer is that I have predicted the survival outcomes for the test set passengers and saved the predictions in a csv file.`

The output file submission.csv was generated, but the run ended with an error.
The last thing ChatGPT returned was the following.

Thought:
I have successfully made predictions on the test set and created a submission file. My final answer is that I have predicted the survival outcomes for the test set passengers and saved the predictions in a csv file.

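A Thought: must always be followed by an Action: or a Final Answer: so that the next step is specified, but here neither is present.
It seems this happened because the answer in this case is submission.csv itself, which made a Final Answer: look unnecessary.
Fixing this may require customizing the prompt template.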
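
Another possible workaround, sketched here as an assumption rather than as LangChain's behavior, is a more lenient parser that treats a reply with neither marker as the final answer instead of raising. `parse_step` and `ACTION_PATTERN` are hypothetical names, and the regex is my own guess at the format, not the one inside LangChain's output_parser.py.

```python
import re

# Assumed shape of a tool call in the reply: "Action: <tool>" then "Action Input: <input>".
ACTION_PATTERN = re.compile(r"Action:\s*(.*?)\s*Action Input:\s*(.*)", re.DOTALL)


def parse_step(text):
    """Return ("finish", answer) or ("act", tool, tool_input)."""
    if "Final Answer:" in text:
        return ("finish", text.split("Final Answer:", 1)[1].strip())
    match = ACTION_PATTERN.search(text)
    if match:
        return ("act", match.group(1).strip(), match.group(2).strip())
    # Assumption: a reply with neither marker is the model wrapping up in prose.
    return ("finish", text.strip())


reply = (
    "I have successfully made predictions on the test set and created a "
    "submission file. My final answer is that I have predicted the survival "
    "outcomes for the test set passengers and saved the predictions in a csv file."
)
print(parse_step(reply)[0])
print(parse_step("Action: Python REPL\nAction Input:\nprint(1)"))
```

The trade-off is that a genuinely malformed reply would also be accepted as an answer, so this fallback hides some errors that the strict parser would surface.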

I submitted submission.csv to Kaggle, and the score was 0.75.
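
Since the agent stopped with an error, it may be worth checking the file against the format requested in the question before submitting. The following is a minimal sketch using only the standard library. `check_submission` is a hypothetical helper, and the in-memory sample stands in for the real file.

```python
import csv
import io


def check_submission(csv_file, expected_rows=418):
    """Return a list of problems found in a submission file (empty if none)."""
    reader = csv.reader(csv_file)
    header = next(reader, None)
    problems = []
    if header != ["PassengerId", "Survived"]:
        problems.append(f"unexpected header: {header}")
    rows = list(reader)
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    bad = [row for row in rows if len(row) != 2 or row[1] not in ("0", "1")]
    if bad:
        problems.append(f"{len(bad)} rows are not of the form 'id,0' or 'id,1'")
    return problems


# Toy in-memory file standing in for submission.csv (first rows of the output above).
sample = io.StringIO("PassengerId,Survived\n892,0\n893,1\n894,0\n")
print(check_submission(sample, expected_rows=3))
```

For the real file, the same function can be called on `open("submission.csv", newline="")` with the default of 418 rows.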

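Flow of the output

The agent carried out the following steps.

Load and inspect the data
Fill in missing values
Convert categorical variables to numeric data
Drop columns it judged unnecessary
- However, it does not say on what grounds it made that judgment
Split the training data into a training set and a validation set
Build the model
Check accuracy on the validation set
Run inference on the test data
Write the output file

Normally one would also visualize the relationships between features, run a correlation analysis, and so on, but none of that seems to have been done.
It is also unclear on what basis the model was chosen.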
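
As an illustration of the kind of check that was skipped, the sketch below computes the survival rate per value of a feature. It uses only the five training rows visible in the head() output above, which is far too few to conclude anything. `survival_rate_by` is a hypothetical helper written in plain Python so that the example is self-contained.

```python
from collections import defaultdict


def survival_rate_by(rows, key):
    """Mean of 'Survived' for each distinct value of rows[key]."""
    totals = defaultdict(lambda: [0, 0])  # value -> [survivors, passengers]
    for row in rows:
        totals[row[key]][0] += row["Survived"]
        totals[row[key]][1] += 1
    return {value: s / n for value, (s, n) in totals.items()}


# The five rows shown in the output above (Sex: 0 = male, 1 = female).
rows = [
    {"Survived": 0, "Pclass": 3, "Sex": 0},
    {"Survived": 1, "Pclass": 1, "Sex": 1},
    {"Survived": 1, "Pclass": 3, "Sex": 1},
    {"Survived": 1, "Pclass": 1, "Sex": 1},
    {"Survived": 0, "Pclass": 3, "Sex": 0},
]
print(survival_rate_by(rows, "Sex"))
print(survival_rate_by(rows, "Pclass"))
```

With pandas, the equivalent on the full data would be `train_data.groupby("Sex")["Survived"].mean()`.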

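Discussion

I never said a word about a submission, yet the output file was named submission.csv. ChatGPT may have figured out where the task came from.
The model above is LogisticRegression, but on other runs it was a different model such as RandomForestClassifier.
I have written this as if it were easy, but in reality I tried many things many times, and what is shown here is a run that happened to work. The most frequent error is "This model's maximum context length is 4097 tokens,…". Because the conversation history keeps being appended to the prompt, the token count inevitably exceeds the maximum once the exchange gets long.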
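
One way to slow this growth, offered only as a sketch and not something tried in this post, is to shorten long tool outputs (such as head() or info() dumps) before they are appended to the prompt. `truncate_observation` is a hypothetical helper. Character count is only a rough proxy for tokens, and the limit of 1000 is an arbitrary assumption.

```python
def truncate_observation(text, max_chars=1000):
    """Keep the head and tail of a long tool output and mark the cut."""
    if len(text) <= max_chars:
        return text
    half = max_chars // 2
    omitted = len(text) - 2 * half
    head = text[:half]
    tail = text[len(text) - half:]
    return f"{head}\n... [{omitted} characters omitted] ...\n{tail}"


long_output = "x" * 5000
short = truncate_observation(long_output)
print(len(long_output), len(short))
```

This only delays the problem. A long enough run will still exceed the limit, because every step adds to the prompt.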
Because the prompt says "Action: the action to take, should be one of [Python REPL]", the model sometimes returned Action: [Python REPL] instead of Action: Python REPL, which caused an error.
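
A small normalization step before the tool lookup would tolerate the bracketed form. This is a sketch under the assumption that the tool name has already been extracted from the "Action:" line. `normalize_tool_name` is a hypothetical helper, not part of LangChain.

```python
def normalize_tool_name(raw, known_tools=("Python REPL",)):
    """Map 'Action:' text such as '[Python REPL]' onto a known tool name."""
    candidate = raw.strip().strip("[]").strip()
    for tool in known_tools:
        if candidate.lower() == tool.lower():
            return tool
    raise ValueError(f"Unknown tool: {raw!r}")


print(normalize_tool_name("[Python REPL]"))
print(normalize_tool_name(" Python REPL "))
```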