This project analyzes the Titanic dataset to predict passenger survival using machine learning models. The project includes data preprocessing, feature engineering, model training, and evaluation.
- Introduction
- Installation
- Usage
- Features
- Dependencies
- Configuration
- Documentation
- Examples
- Troubleshooting
- Contributors
- License
- Clone the repository:
git clone https://github.com/Ndrake337/Kagle_Titanic.git
cd Kagle_Titanic
- Install the required packages:
pip install -r requirements.txt
To run the analysis and prediction:
- Ensure you have the Titanic dataset (train.csv and test.csv) in the project directory.
- Open and run the Jupyter notebook:
jupyter notebook Titanic.ipynb
- Data loading and preprocessing
- Feature engineering
- Model training using RandomForestClassifier and LogisticRegression
- Model evaluation using cross-validation
- Error analysis
- pandas
- numpy
- matplotlib
- scikit-learn
Set the following in the notebook:
- File paths for train.csv and test.csv.
- Random state and other parameters for reproducibility.
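One way to keep these settings in a single place is a small configuration cell at the top of the notebook. The sketch below is only an illustration: the constant names (DATA_DIR, TRAIN_PATH, TEST_PATH, CV_RANDOM_STATE, MODEL_RANDOM_STATE) are hypothetical and do not come from the notebook, and the data directory is assumed to be the project directory. The two seed values mirror the ones used in the cross-validation example in this README.

```python
from pathlib import Path

# Hypothetical configuration names; adjust to match your notebook.
DATA_DIR = Path(".")                 # assumed: CSV files sit in the project directory
TRAIN_PATH = DATA_DIR / "train.csv"
TEST_PATH = DATA_DIR / "test.csv"

CV_RANDOM_STATE = 10     # seed passed to RepeatedKFold in the example
MODEL_RANDOM_STATE = 0   # seed passed to RandomForestClassifier in the example

# Report missing dataset files early instead of failing later in read_csv.
missing = [str(p) for p in (TRAIN_PATH, TEST_PATH) if not p.exists()]
if missing:
    print("Missing dataset files:", ", ".join(missing))
```

Using the same seeds on every run keeps the fold splits and the forest identical, so score changes can be attributed to feature changes rather than to randomness.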
import pandas as pd

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
Converting categorical 'Sex' feature to binary:
def binarySex(valor):
    # Encode 'female' as 1 and any other value as 0
    if valor == 'female':
        return 1
    else:
        return 0

train['Sex_binario'] = train['Sex'].map(binarySex)
test['Sex_binario'] = test['Sex'].map(binarySex)
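The same encoding can also be written without a helper function. The sketch below is an alternative, not the notebook's method: it compares the column against 'female' and casts the boolean result to integers. The small `demo` frame is made up for illustration and stands in for train or test.

```python
import pandas as pd

# Tiny illustrative frame (made-up rows, not real passenger data).
demo = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# The comparison yields booleans; astype(int) maps True/False to 1/0.
demo["Sex_binario"] = (demo["Sex"] == "female").astype(int)

print(demo["Sex_binario"].tolist())  # [0, 1, 1, 0]
```

Both versions give 1 for 'female' and 0 for everything else, including missing values.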
Using Repeated K-Fold Cross-Validation with RandomForestClassifier:
import numpy as np
from sklearn.model_selection import RepeatedKFold
from sklearn.ensemble import RandomForestClassifier

# X (feature DataFrame) and y (target Series) are assumed to be built from train beforehand
resultados = []
kf = RepeatedKFold(n_splits=2, n_repeats=10, random_state=10)
for linhas_treino, linhas_valid in kf.split(X):
    X_treino, X_valid = X.iloc[linhas_treino], X.iloc[linhas_valid]
    y_treino, y_valid = y.iloc[linhas_treino], y.iloc[linhas_valid]
    modelo = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
    modelo.fit(X_treino, y_treino)
    p = modelo.predict(X_valid)
    acc = np.mean(y_valid == p)  # accuracy on this validation fold
    resultados.append(acc)
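The feature list also mentions LogisticRegression, so a side-by-side comparison may be useful. The sketch below swaps the manual loop for scikit-learn's cross_val_score with the same RepeatedKFold settings and reports the mean and standard deviation of the fold accuracies. It runs on synthetic data so it is self-contained: the `X_demo` and `y_demo` names, the Age and Pclass columns, and the 20% label noise are all assumptions for illustration, not values taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in for the Titanic features (not real passenger data).
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    "Sex_binario": rng.integers(0, 2, size=n),
    "Age": rng.uniform(1, 80, size=n),
    "Pclass": rng.integers(1, 4, size=n),
})
# Toy target: equals Sex_binario, flipped for roughly 20% of the rows.
noise = rng.random(n) < 0.2
y_demo = pd.Series(
    np.where(noise, 1 - X_demo["Sex_binario"], X_demo["Sex_binario"]),
    name="Survived",
)

# Same splitter settings as the loop above: 2 folds, repeated 10 times.
cv = RepeatedKFold(n_splits=2, n_repeats=10, random_state=10)
models = {
    "RandomForestClassifier": RandomForestClassifier(n_estimators=100, random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

scores = {}
for name, model in models.items():
    # One accuracy value per fold: 2 splits x 10 repeats = 20 values.
    acc = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    scores[name] = acc
    print(f"{name}: mean accuracy {acc.mean():.3f} (std {acc.std():.3f})")
```

Reporting the spread alongside the mean helps judge whether a difference between two models is larger than the fold-to-fold variation.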
Analyzing errors in predictions:
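# linhas_valid and p hold the indices and predictions of the last fold from the loop above
X_valid_check = train.iloc[linhas_valid].copy()
X_valid_check['p'] = p
# Keep only the rows where the prediction differs from the true label
erros = X_valid_check[X_valid_check['Survived'] != X_valid_check['p']]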
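Once the misclassified rows are isolated, a natural next step is to check whether errors concentrate in a particular group. The sketch below computes an error rate per value of Sex. The rows are made up for illustration, and the `erro` column and `taxa_erro` name are hypothetical additions, not part of the notebook.

```python
import pandas as pd

# Made-up validation rows with true labels and predictions.
X_valid_check = pd.DataFrame({
    "Sex":      ["male", "female", "male", "female", "male", "female"],
    "Survived": [0, 1, 1, 0, 0, 0],
    "p":        [0, 1, 0, 1, 0, 1],
})

# Misclassified rows, as in the snippet above.
erros = X_valid_check[X_valid_check["Survived"] != X_valid_check["p"]]

# Boolean error flag per row; its mean per group is the group's error rate.
X_valid_check["erro"] = X_valid_check["Survived"] != X_valid_check["p"]
taxa_erro = X_valid_check.groupby("Sex")["erro"].mean()
print(taxa_erro)
```

A group with a clearly higher error rate is a hint about which features might be worth engineering next.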
- Ensure all dependencies are installed.
- Verify the paths to the dataset files are correct.
- Check for missing values and handle them appropriately before training.
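For the missing-value check, a minimal sketch follows. It counts nulls per column, then fills a numeric column with its median and a categorical column with its most frequent value. The imputation strategy is an assumption, not necessarily what the notebook does, and the `demo` frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing value per column.
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

# Count missing values per column before deciding how to handle them.
missing_counts = demo.isnull().sum()
print(missing_counts)

# Assumed strategy: median for numeric, most frequent value for categorical.
demo["Age"] = demo["Age"].fillna(demo["Age"].median())
demo["Embarked"] = demo["Embarked"].fillna(demo["Embarked"].mode()[0])
```

When imputing the real data, compute the fill values on train and reuse them for test, so that no information from the test set leaks into preprocessing.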
- @Ndrake337
This project is licensed under the MIT License - see the LICENSE file for details.