5_classification_supervisee.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5 - Supervised classification"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import seaborn as sns # cf. https://stackoverflow.com/questions/41499857/seaborn-why-import-as-sns#44484758"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.set(rc={\"figure.figsize\": (32, 16)})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "df5 = pd.read_pickle('data/df5.pkl')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Int64Index: 74963 entries, 0 to 95159\n",
      "Data columns (total 28 columns):\n",
      " #   Column              Non-Null Count  Dtype  \n",
      "---  ------              --------------  -----  \n",
      " 0   essencefrancais     74963 non-null  object \n",
      " 1   circonference_cm    74963 non-null  float64\n",
      " 2   hauteurtotale_m     74963 non-null  int64  \n",
      " 3   hauteurfut_m        74963 non-null  float64\n",
      " 4   diametrecouronne_m  74963 non-null  int64  \n",
      " 5   rayoncouronne_m     74900 non-null  float64\n",
      " 6   dateplantation      50216 non-null  object \n",
      " 7   genre               74963 non-null  object \n",
      " 8   espece              74963 non-null  object \n",
      " 9   variete             74963 non-null  object \n",
      " 10  essence             74963 non-null  object \n",
      " 11  architecture        74963 non-null  object \n",
      " 12  localisation        74963 non-null  object \n",
      " 13  naturerevetement    74963 non-null  object \n",
      " 14  mobilierurbain      74963 non-null  object \n",
      " 15  anneeplantation     50218 non-null  float64\n",
      " 16  commune             74963 non-null  object \n",
      " 17  codeinsee           74963 non-null  int64  \n",
      " 18  nomvoie             74963 non-null  object \n",
      " 19  codefuv             74808 non-null  float64\n",
      " 20  identifiant         74963 non-null  int64  \n",
      " 21  numero              74963 non-null  int64  \n",
      " 22  codegenre           74963 non-null  int64  \n",
      " 23  gid                 74963 non-null  int64  \n",
      " 24  surfacecadre_m2     49993 non-null  float64\n",
      " 25  lat                 74963 non-null  float64\n",
      " 26  lon                 74963 non-null  float64\n",
      " 27  circonference_m     74963 non-null  float64\n",
      "dtypes: float64(9), int64(7), object(12)\n",
      "memory usage: 16.6+ MB\n"
     ]
    }
   ],
   "source": [
    "df5.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Objective\n",
    "\n",
    "Determine the genus of a tree from its measurable properties: total height, trunk height, circumference, crown diameter, latitude, longitude. This is a **supervised classification** problem, which we will solve with the `scikit-learn` library, https://scikit-learn.org/."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a reminder:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of distinct genera = 86\n"
     ]
    }
   ],
   "source": [
    "print(\"Number of distinct genera =\", df5['genre'].nunique())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is convenient to store the numerical properties (*features*) we want to use in the following variable, since we will need them below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_features = ['circonference_m', 'diametrecouronne_m', 'hauteurfut_m', 'hauteurtotale_m', 'lat', 'lon']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From `df5`, we can create a DataFrame that contains only these *features*:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>circonference_m</th>\n",
       "      <th>diametrecouronne_m</th>\n",
       "      <th>hauteurfut_m</th>\n",
       "      <th>hauteurtotale_m</th>\n",
       "      <th>lat</th>\n",
       "      <th>lon</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.30</td>\n",
       "      <td>5</td>\n",
       "      <td>2.0</td>\n",
       "      <td>7</td>\n",
       "      <td>45.804503</td>\n",
       "      <td>4.772993</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.45</td>\n",
       "      <td>4</td>\n",
       "      <td>2.0</td>\n",
       "      <td>6</td>\n",
       "      <td>45.803322</td>\n",
       "      <td>4.775080</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.50</td>\n",
       "      <td>5</td>\n",
       "      <td>2.0</td>\n",
       "      <td>7</td>\n",
       "      <td>45.803241</td>\n",
       "      <td>4.775227</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.40</td>\n",
       "      <td>5</td>\n",
       "      <td>2.0</td>\n",
       "      <td>7</td>\n",
       "      <td>45.804540</td>\n",
       "      <td>4.772921</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.30</td>\n",
       "      <td>5</td>\n",
       "      <td>2.0</td>\n",
       "      <td>7</td>\n",
       "      <td>45.804468</td>\n",
       "      <td>4.773058</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   circonference_m  diametrecouronne_m  hauteurfut_m  hauteurtotale_m  \\\n",
       "0             0.30                   5           2.0                7   \n",
       "1             0.45                   4           2.0                6   \n",
       "2             0.50                   5           2.0                7   \n",
       "3             0.40                   5           2.0                7   \n",
       "4             0.30                   5           2.0                7   \n",
       "\n",
       "         lat       lon  \n",
       "0  45.804503  4.772993  \n",
       "1  45.803322  4.775080  \n",
       "2  45.803241  4.775227  \n",
       "3  45.804540  4.772921  \n",
       "4  45.804468  4.773058  "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X = df5[ num_features ].copy()\n",
    "X.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "min_lat = X.lat.min()\n",
    "max_lat = X.lat.max()\n",
    "min_lon = X.lon.min()\n",
    "max_lon = X.lon.max()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Min-max scaling: map latitude and longitude to the [0, 1] range\n",
    "X['nlat'] = (X.lat - min_lat) / (max_lat - min_lat)\n",
    "X['nlon'] = (X.lon - min_lon) / (max_lon - min_lon)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>circonference_m</th>\n",
       "      <th>diametrecouronne_m</th>\n",
       "      <th>hauteurfut_m</th>\n",
       "      <th>hauteurtotale_m</th>\n",
       "      <th>lat</th>\n",
       "      <th>lon</th>\n",
       "      <th>nlat</th>\n",
       "      <th>nlon</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.30</td>\n",
       "      <td>5</td>\n",
       "      <td>2.0</td>\n",
       "      <td>7</td>\n",
       "      <td>45.804503</td>\n",
       "      <td>4.772993</td>\n",
       "      <td>0.638981</td>\n",
       "      <td>0.209793</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.45</td>\n",
       "      <td>4</td>\n",
       "      <td>2.0</td>\n",
       "      <td>6</td>\n",
       "      <td>45.803322</td>\n",
       "      <td>4.775080</td>\n",
       "      <td>0.635795</td>\n",
       "      <td>0.215563</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.50</td>\n",
       "      <td>5</td>\n",
       "      <td>2.0</td>\n",
       "      <td>7</td>\n",
       "      <td>45.803241</td>\n",
       "      <td>4.775227</td>\n",
       "      <td>0.635576</td>\n",
       "      <td>0.215970</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.40</td>\n",
       "      <td>5</td>\n",
       "      <td>2.0</td>\n",
       "      <td>7</td>\n",
       "      <td>45.804540</td>\n",
       "      <td>4.772921</td>\n",
       "      <td>0.639080</td>\n",
       "      <td>0.209593</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.30</td>\n",
       "      <td>5</td>\n",
       "      <td>2.0</td>\n",
       "      <td>7</td>\n",
       "      <td>45.804468</td>\n",
       "      <td>4.773058</td>\n",
       "      <td>0.638886</td>\n",
       "      <td>0.209974</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   circonference_m  diametrecouronne_m  hauteurfut_m  hauteurtotale_m  \\\n",
       "0             0.30                   5           2.0                7   \n",
       "1             0.45                   4           2.0                6   \n",
       "2             0.50                   5           2.0                7   \n",
       "3             0.40                   5           2.0                7   \n",
       "4             0.30                   5           2.0                7   \n",
       "\n",
       "         lat       lon      nlat      nlon  \n",
       "0  45.804503  4.772993  0.638981  0.209793  \n",
       "1  45.803322  4.775080  0.635795  0.215563  \n",
       "2  45.803241  4.775227  0.635576  0.215970  \n",
       "3  45.804540  4.772921  0.639080  0.209593  \n",
       "4  45.804468  4.773058  0.638886  0.209974  "
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = X.drop(['lat', 'lon'], axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "... and another DataFrame that contains only the column we want to predict:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = df5[ ['genre'] ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>genre</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Acer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Acer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Acer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Acer</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Acer</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  genre\n",
       "0  Acer\n",
       "1  Acer\n",
       "2  Acer\n",
       "3  Acer\n",
       "4  Acer"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([['Acer'],\n",
       "       ['Acer'],\n",
       "       ['Acer'],\n",
       "       ...,\n",
       "       ['Quercus'],\n",
       "       ['Fraxinus'],\n",
       "       ['Acer']], dtype=object)"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y.values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = y.values.ravel() # flatten y into the 1-D array expected by the library used below"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['Acer', 'Acer', 'Acer', ..., 'Quercus', 'Fraxinus', 'Acer'],\n",
       "      dtype=object)"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Splitting the dataset into two parts: *training set* and *test set*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `sklearn` library provides the function we need, see https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, shuffle=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.7499966650214106\n",
      "0.25000333497858945\n"
     ]
    }
   ],
   "source": [
    "print( len(X_train)/len(X) )\n",
    "print( len(X_test)/len(X) )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `scikit-learn` library includes several supervised classification algorithms, see https://scikit-learn.org/stable/supervised_learning.html#supervised-learning. Here we will only try a few of them. To compare the algorithms with one another, we will store each algorithm's accuracy on the test set in the `accuracy_report` dictionary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "accuracy_report = dict()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Logistic Regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/acerioni/Documents/ClubDevAnonymes/20190703_Python/venv/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy of the Logistic Regression classifier on the training set: 0.36\n",
      "Accuracy of the Logistic Regression classifier on the test set: 0.35\n"
     ]
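    }
   ],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "# NB: with the default settings the lbfgs solver stops before converging (see the ConvergenceWarning in the output);\n",
    "# raising max_iter or scaling the features, as the warning suggests, may help\n",
    "logreg = LogisticRegression()\n",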
    "\n",
    "logreg.fit( X_train, y_train )\n",
    "\n",
    "print('Accuracy of the Logistic Regression classifier on the training set: {:.2f}'\n",
    "     .format( logreg.score(X_train, y_train)) )\n",
    "\n",
    "print('Accuracy of the Logistic Regression classifier on the test set: {:.2f}'\n",
    "     .format( logreg.score(X_test, y_test)) )\n",
    "\n",
    "accuracy_report[ 'logreg' ] = logreg.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'logreg': 0.3537164505629369}"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "accuracy_report"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### K-Nearest Neighbors Classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy of the K-NN Classifier on the training set: 0.51\n",
      "Accuracy of the K-NN classifier on the test set: 0.49\n"
     ]
    }
   ],
   "source": [
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "\n",
    "knn = KNeighborsClassifier(n_neighbors = 50) # <- this algorithm should be run with several values of this parameter in order to select the best configuration...\n",
    "\n",
    "knn.fit(X_train, y_train)\n",
    "\n",
    "print('Accuracy of the K-NN Classifier on the training set: {:.2f}'\n",
    "     .format(knn.score(X_train, y_train)))\n",
    "\n",
    "print('Accuracy of the K-NN classifier on the test set: {:.2f}'\n",
    "     .format(knn.score(X_test, y_test)))\n",
    "\n",
    "accuracy_report[ 'knn' ] = knn.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Decision Tree Classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy of the Decision Tree classifier on the training set: 1.00\n",
      "Accuracy of the Decision Tree classifier on the test set: 0.77\n"
     ]
    }
   ],
   "source": [
    "from sklearn.tree import DecisionTreeClassifier\n",
    "\n",
    "dt = DecisionTreeClassifier().fit(X_train, y_train)\n",
    "\n",
    "print('Accuracy of the Decision Tree classifier on the training set: {:.2f}'\n",
    "     .format(dt.score(X_train, y_train)))\n",
    "\n",
    "print('Accuracy of the Decision Tree classifier on the test set: {:.2f}'\n",
    "     .format(dt.score(X_test, y_test)))\n",
    "\n",
    "accuracy_report[ 'decision_tree' ] = dt.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The results of this algorithm are quite respectable (0.77 accuracy on the test set), although the perfect score on the training set suggests overfitting. This deserves a closer look:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/acerioni/Documents/ClubDevAnonymes/20190703_Python/venv/lib/python3.7/site-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n",
      "  _warn_prf(average, modifier, msg_start, len(result))\n",
      "/home/acerioni/Documents/ClubDevAnonymes/20190703_Python/venv/lib/python3.7/site-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.\n",
      "  _warn_prf(average, modifier, msg_start, len(result))\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                 precision    recall  f1-score   support\n",
      "\n",
      "          Abies       1.00      0.17      0.29         6\n",
      "         Acacia       1.00      1.00      1.00         2\n",
      "           Acer       0.70      0.71      0.70      2099\n",
      "       Aesculus       0.75      0.71      0.73       276\n",
      "      Ailanthus       1.00      0.55      0.71        11\n",
      "       Albizzia       0.68      0.72      0.70        47\n",
      "          Alnus       0.59      0.59      0.59       303\n",
      "    Amelanchier       0.58      0.65      0.61        23\n",
      "         Betula       0.66      0.61      0.63        69\n",
      "   Broussonetia       0.22      0.29      0.25         7\n",
      "          Buxus       0.00      0.00      0.00         1\n",
      "     Calocedrus       0.67      0.47      0.55        17\n",
      "       Carpinus       0.67      0.68      0.68       151\n",
      "       Castanea       0.67      0.67      0.67         3\n",
      "        Catalpa       0.46      0.38      0.42        29\n",
      "        Cedrela       0.60      0.75      0.67         8\n",
      "         Cedrus       0.71      0.76      0.73       132\n",
      "         Celtis       0.81      0.82      0.82      1451\n",
      " Cercidiphyllum       0.00      0.00      0.00         0\n",
      "         Cercis       0.60      0.55      0.57        55\n",
      "     Cladrastis       1.00      1.00      1.00         4\n",
      "         Cornus       0.43      0.43      0.43         7\n",
      "        Corylus       0.73      0.74      0.73       389\n",
      "      Crataegus       1.00      0.73      0.85        15\n",
      "Cupressocyparis       0.00      0.00      0.00         3\n",
      "      Cupressus       0.50      0.83      0.62         6\n",
      "        Davidia       1.00      1.00      1.00         1\n",
      "      Elaeagnus       0.00      0.00      0.00         0\n",
      "     Eucalyptus       0.00      0.00      0.00         1\n",
      "         Evodia       0.65      0.58      0.61        38\n",
      "          Fagus       0.33      0.53      0.41        15\n",
      "          Ficus       1.00      0.50      0.67         2\n",
      "       Fraxinus       0.73      0.72      0.72      1420\n",
      "         Ginkgo       0.59      0.68      0.63        60\n",
      "      Gleditsia       0.74      0.78      0.76       469\n",
      "    Gymnocladus       0.75      0.60      0.67         5\n",
      "        Halesia       1.00      1.00      1.00         1\n",
      "       Hibiscus       0.00      0.00      0.00         1\n",
      "        Juglans       0.35      0.41      0.38        29\n",
      "   Koelreuteria       0.62      0.68      0.65       141\n",
      "  Lagerstroemia       0.93      0.82      0.87        49\n",
      "          Larix       0.00      0.00      0.00         0\n",
      "      Ligustrum       0.00      0.00      0.00         2\n",
      "    Liquidambar       0.64      0.62      0.63       154\n",
      "   Liriodendron       0.58      0.60      0.59        80\n",
      "       Magnolia       0.70      0.68      0.69       101\n",
      "          Malus       0.72      0.65      0.68       195\n",
      "          Melia       0.89      0.76      0.82        21\n",
      "       Mespilus       0.00      0.00      0.00         0\n",
      "    Metasequoia       0.69      0.59      0.63        41\n",
      "          Morus       0.61      0.66      0.64        53\n",
      "          Nyssa       0.00      0.00      0.00         3\n",
      "           Olea       0.00      0.00      0.00         0\n",
      "         Ostrya       0.67      0.73      0.70       102\n",
      "       Parrotia       0.62      0.64      0.63        25\n",
      "      Paulownia       0.58      0.56      0.57        75\n",
      "  Phellodendron       1.00      1.00      1.00         1\n",
      "          Picea       0.14      0.33      0.20         3\n",
      "          Pinus       0.72      0.61      0.66       168\n",
      "          Pirus       0.72      0.75      0.74       782\n",
      "       Platanus       0.93      0.93      0.93      4524\n",
      "     Platycarya       0.00      0.00      0.00         1\n",
      "        Populus       0.69      0.73      0.71        98\n",
      "         Prunus       0.67      0.66      0.67       729\n",
      "    Pseudotsuga       0.00      0.00      0.00         4\n",
      "     Pterocarya       0.56      0.53      0.55        34\n",
      "        Quercus       0.70      0.70      0.70      1180\n",
      "           Rhus       0.00      0.00      0.00         1\n",
      "        Robinia       0.60      0.58      0.59       139\n",
      "          Salix       0.75      0.50      0.60        72\n",
      "        Sequoia       0.00      0.00      0.00         3\n",
      "        Sophora       0.79      0.77      0.78       754\n",
      "         Sorbus       0.45      0.50      0.48        10\n",
      "       Taxodium       0.00      0.00      0.00         0\n",
      "          Taxus       0.00      0.00      0.00         3\n",
      "          Thuya       0.00      0.00      0.00         0\n",
      "          Tilia       0.76      0.77      0.76      1440\n",
      "          Ulmus       0.74      0.73      0.73       328\n",
      "        Zelkova       0.62      0.62      0.62       269\n",
      "\n",
      "       accuracy                           0.77     18741\n",
      "      macro avg       0.53      0.51      0.51     18741\n",
      "   weighted avg       0.77      0.77      0.77     18741\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from sklearn.metrics import classification_report\n",
    "\n",
    "y_pred = dt.predict(X_test)\n",
    "\n",
    "print( classification_report(y_test, y_pred) )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The algorithm can also tell us which *features* matter most for the classification:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>circonference_m</th>\n",
       "      <th>diametrecouronne_m</th>\n",
       "      <th>hauteurfut_m</th>\n",
       "      <th>hauteurtotale_m</th>\n",
       "      <th>nlat</th>\n",
       "      <th>nlon</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.30</td>\n",
       "      <td>5</td>\n",
       "      <td>2.0</td>\n",
       "      <td>7</td>\n",
       "      <td>0.638981</td>\n",
       "      <td>0.209793</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.45</td>\n",
       "      <td>4</td>\n",
       "      <td>2.0</td>\n",
       "      <td>6</td>\n",
       "      <td>0.635795</td>\n",
       "      <td>0.215563</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.50</td>\n",
       "      <td>5</td>\n",
       "      <td>2.0</td>\n",
       "      <td>7</td>\n",
       "      <td>0.635576</td>\n",
       "      <td>0.215970</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.40</td>\n",
       "      <td>5</td>\n",
       "      <td>2.0</td>\n",
       "      <td>7</td>\n",
       "      <td>0.639080</td>\n",
       "      <td>0.209593</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.30</td>\n",
       "      <td>5</td>\n",
       "      <td>2.0</td>\n",
       "      <td>7</td>\n",
       "      <td>0.638886</td>\n",
       "      <td>0.209974</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   circonference_m  diametrecouronne_m  hauteurfut_m  hauteurtotale_m  \\\n",
       "0             0.30                   5           2.0                7   \n",
       "1             0.45                   4           2.0                6   \n",
       "2             0.50                   5           2.0                7   \n",
       "3             0.40                   5           2.0                7   \n",
       "4             0.30                   5           2.0                7   \n",
       "\n",
       "       nlat      nlon  \n",
       "0  0.638981  0.209793  \n",
       "1  0.635795  0.215563  \n",
       "2  0.635576  0.215970  \n",
       "3  0.639080  0.209593  \n",
       "4  0.638886  0.209974  "
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X.head()"
   ]
  },
  {