ag update readme and added documentation

2024-02-22 14:39:44 +01:00
parent ea097a7b71
commit 7eb456384e
18 changed files with 14754 additions and 28 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,29 @@
+# PID
+
+Library for training an XGBoost classification model on `india` values.
+The complete workflow is in `main.py`; the sections below describe some of its features.
+
+## retrive_data
+
+Downloads the data from the db, flattens the nested fields (type of vehicle used) and, for each column, computes the percentage of each categorical value. Since some samples are fairly unique, these values can be aggregated to avoid overfitting. By default, all classes with less than 0.5% representation are merged into the class `other`. Returns two datasets, one with the merged classes (called small) and one without (to make sure they represent the same information), together with the dictionary of merged classes (`to_remove`).
+
+## split
+Splits the dataset (small or regular) into train, validation and test. By specifying `SKI_AREA_TEST` it is possible to remove a ski area from the dataset entirely, to simulate safe index on a new area. The pair `SEASON_TEST_SKIAREA`-`SEASON_TEST_YEAR` instead removes some seasons of a specific ski area from the dataset: this simulates how the model behaves on new seasons of an area for which it has already seen past data. Once the test data has been removed, the remaining dataset is split into train and validation (66%-33%), stratified on `india` (so that the ratio between the classes stays roughly constant). There are two ways to help the model with the imbalanced dataset (very few `india` 3 and 4): the first is to oversample the small classes (which I don't like); alternatively, the error made on the small classes is weighted differently. I implemented two weighting schemes: one takes the square root of the total number of cases and divides it by the number of elements in the class, which is a bit less strict than dividing the total by the number of elements. Returns two datasets (a regular one and a test one); they are helper classes that make the training phase easier.
+## train
+This is the core of the program: I set up a grid of hyperparameters with commonly used ranges. An XGBoost model is trained to maximize MCC (not accuracy, which is unsuitable for imbalanced classes). You set the number of trials (I suggest at least 1000) and a timeout (in case of limited resources). `num_boost_round` is the maximum number of steps; an overfitting detection mechanism can stop the training earlier.
+
+## gain_accuracy_train
+Optional, but the idea is that, once the set of parameters has been found, I can experiment with the input variables. They are ranked by their score (`best_model.get_fscore()`).
+![Feature importance](img/FI.png)
+
+At this point, starting from the most important one, the model is trained with N features and the values are compared.
+![Results by number of features](img/FS.png)
+
+This can also be used to say: to reach x% accuracy I need at least these variables. The mandatory fields of the form to fill in come to mind, for example...
+
+## Notebooks
+
+There are a few notebooks. TRAIN contains more or less what `main.py` does, or rather an earlier, uncleaned version of it with some checks etc.; I kept it just in case. `Variable_explanation` contains the inference part on a new dataset using `prepare_new_data` (there is also a comparison between the distributions, but without labels I wouldn't know what else to add). There is also an explainability part. It is very hard to interpret with categorical variables, but it does tell you in some way why a given sample was classified the way it was. The images below show the sample under consideration, whose original class is 2 and which is correctly classified (look at the SHAP values, or at the predictions, which put it in class 2 with 86% probability). The two plots below show which features increase or decrease the probability value (not strictly a probability, but we can call it reliability if you like). All the red arrows pushing to the right read as follows (look at the second row): the diagnosis, the location and the destination are what most make the model think the sample belongs to the second class. Indeed, helicopter, hospital_emergency_room and dislocation can suggest that this is no minor matter. It does not always work this well; I picked a clear example to explain it, then you can decide whether and how to use it.
+![Sample](img/sample.png)
+
+![Interpretability](img/Interpretability.png)
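
The aggregation of under-represented categories described in the `retrive_data` section of the README can be sketched as follows. This is a minimal, dependency-free illustration rather than the repository's implementation: the helper names `rare_values` and `collapse` and the toy column are hypothetical, and it assumes the threshold is expressed in percent, as the default of 0.5 suggests.

```python
from collections import Counter


def rare_values(values, threshold_pct=0.5):
    """Return the set of values whose share of `values` is below threshold_pct percent."""
    total = len(values)
    if total == 0:
        return set()
    counts = Counter(values)
    return {v for v, n in counts.items() if 100.0 * n / total < threshold_pct}


def collapse(values, to_merge, other="other"):
    """Replace every value listed in `to_merge` with the `other` label."""
    return [other if v in to_merge else v for v in values]


# Made-up categorical column: 99.0% ski, 0.7% snowboard, 0.3% sled
column = ["ski"] * 990 + ["snowboard"] * 7 + ["sled"] * 3
to_merge = rare_values(column)            # only "sled" falls below 0.5%
column_small = collapse(column, to_merge)
```

Keeping the set of merged values around (here `to_merge`) is what allows the same mapping to be replayed later on unseen data.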
--- a/init.py
+++ b/init.py
--- a/img/FI.png
+++ b/img/FI.png
--- a/img/FS.png
+++ b/img/FS.png
--- a/img/Interpretability.png
+++ b/img/Interpretability.png
--- a/img/sample.png
+++ b/img/sample.png
--- a/notebooks/Variable_explanation.ipynb
+++ b/notebooks/Variable_explanation.ipynb
--- a/notebooks/old_notebooks/.ipynb_checkpoints/test_binary-checkpoint.ipynb
+++ b/notebooks/old_notebooks/.ipynb_checkpoints/test_binary-checkpoint.ipynb
--- a/notebooks/old_notebooks/.ipynb_checkpoints/test_clean-checkpoint.ipynb
+++ b/notebooks/old_notebooks/.ipynb_checkpoints/test_clean-checkpoint.ipynb
--- a/notebooks/old_notebooks/.ipynb_checkpoints/test_multi-checkpoint.ipynb
+++ b/notebooks/old_notebooks/.ipynb_checkpoints/test_multi-checkpoint.ipynb
--- a/notebooks/old_notebooks/.ipynb_checkpoints/test_multi_CV-checkpoint.ipynb
+++ b/notebooks/old_notebooks/.ipynb_checkpoints/test_multi_CV-checkpoint.ipynb
--- a/notebooks/old_notebooks/test_binary.ipynb
+++ b/notebooks/old_notebooks/test_binary.ipynb
--- a/notebooks/old_notebooks/test_clean.ipynb
+++ b/notebooks/old_notebooks/test_clean.ipynb
--- a/notebooks/old_notebooks/test_multi.ipynb
+++ b/notebooks/old_notebooks/test_multi.ipynb
--- a/notebooks/old_notebooks/test_multi_CV.ipynb
+++ b/notebooks/old_notebooks/test_multi_CV.ipynb
--- a/src/main.py
+++ b/src/main.py
@@ -10,8 +10,10 @@ def main(args):
    

    
-    labeled,labeled_small = retrive_data(reload_data=args.reload_data,threshold_under_represented=0.5,path='/home/agobbi/Projects/PID/datanalytics/PID/src')
-
+    labeled,labeled_small,to_remove = retrive_data(reload_data=args.reload_data,threshold_under_represented=0.5,path='/home/agobbi/Projects/PID/datanalytics/PID/src')
+    with open('to_remove.pkl','wb') as f:
+        pickle.dump(to_remove,f)
+    
    dataset,dataset_test = split(labeled_small if args.use_small  else labeled  ,
                                SKI_AREA_TEST= 'Klausberg',
                                SEASON_TEST_SKIAREA = 'Kronplatz',
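
The hunk above persists `to_remove` so that the same class merging can be replayed at inference time. A minimal round-trip sketch, assuming `to_remove` maps each column name to the values merged into `other`; the example mapping and the temporary path are made up for illustration.

```python
import os
import pickle
import tempfile

# Hypothetical content: column name -> values that were merged into `other`
to_remove = {"diagnosis": ["rare_a", "rare_b"], "destination": ["rare_c"]}

# Temporary location used only for this sketch
path = os.path.join(tempfile.mkdtemp(), "to_remove.pkl")

# Training side: save the mapping
with open(path, "wb") as f:
    pickle.dump(to_remove, f)

# Inference side: reload it before preparing new data
with open(path, "rb") as f:
    to_remove_loaded = pickle.load(f)
```

The hunk does not show `import pickle` being added to `main.py`, so it is worth checking that the module is already imported there.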
--- a/src/model.py
+++ b/src/model.py
@@ -7,8 +7,17 @@ import pandas  as pd



-def objective(trial,dataset:Dataset,num_boost_round:int):
-    
+def objective(trial,dataset:Dataset,num_boost_round:int)->float:
+    """Objective function to maximize during the tuning phase.
+
+    Args:
+        trial (optuna.trial.Trial): Optuna trial used to sample the hyperparameters
+        dataset (Dataset): dataset to use (containing train and validation)
+        num_boost_round (int): number of boosting iterations of xgboost
+
+    Returns:
+        float: validation MCC
+    """
    #These are the parameters usually used
    params = dict(
                learning_rate = trial.suggest_float("learning_rate", 0.01, 0.2),
@@ -44,8 +53,20 @@ def objective(trial,dataset:Dataset,num_boost_round:int):
    return mcc


-def train(dataset,n_trials=1000,timeout=600,num_boost_round=600):
-    
+def train(dataset:Dataset,n_trials:int=1000,timeout:int=600,num_boost_round:int=600)->(xgb.Booster, dict):
+    """Optuna search procedure.
+
+    Args:
+        dataset (Dataset): dataset to use (containing train and validation)
+        n_trials (int, optional): number of parameter combinations to try. Defaults to 1000.
+        timeout (int, optional): maximum time in seconds before stopping the search. Defaults to 600.
+        num_boost_round (int, optional): number of iterations of a single boosted model. Defaults to 600.
+
+    Returns:
+        trained xgboost model and a dictionary containing the best parameters
+    """
+
+
    study = optuna.create_study(direction="maximize")
    study.optimize(lambda trial: objective(trial,dataset,num_boost_round), n_trials=n_trials, timeout=timeout)

@@ -67,7 +88,18 @@ def train(dataset,n_trials=1000,timeout=600,num_boost_round=600):
    return bst,params_final


-def gain_accuracy_train(dataset:Dataset,feat_imp:pd.DataFrame,num_boost_round:int,params:dict):
+def gain_accuracy_train(dataset:Dataset,feat_imp:pd.DataFrame,num_boost_round:int=600,params:dict={})->(pd.DataFrame,xgb.Booster,int):
+    """Starting from the most important feature, add one feature at a time, train the model and get MCC and accuracy on the validation set.
+
+    Args:
+        dataset (Dataset): dataset to use (containing train and validation)
+        feat_imp (pd.DataFrame): feature importance computed using feat_imp = pd.Series(best_model.get_fscore()).sort_values(ascending=False)
+        num_boost_round (int): number of iterations of a single boosted model. Defaults to 600.
+        params (dict): dictionary of best parameters returned from the function `train`
+
+    Returns:
+        dataframe with the number of variables N, ACC and MCC for each N
+    """

    tot = []
    for i in range(1,dataset.X_train.shape[1]):
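
Both `objective` and `gain_accuracy_train` report MCC on the validation set. The sketch below is a plain-Python multiclass MCC, included only to illustrate why it is preferred to accuracy on imbalanced classes. The function name `multiclass_mcc` and the toy labels are assumptions; the metric call actually used in the repository is not visible in this diff.

```python
from collections import Counter
from math import sqrt


def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient; returns 0.0 when undefined."""
    s = len(y_true)                                   # total number of samples
    c = sum(t == p for t, p in zip(y_true, y_pred))   # correctly predicted samples
    t_k = Counter(y_true)                             # true occurrences per class
    p_k = Counter(y_pred)                             # predicted occurrences per class
    num = c * s - sum(p_k[k] * t_k[k] for k in t_k)
    den = sqrt((s * s - sum(n * n for n in p_k.values()))
               * (s * s - sum(n * n for n in t_k.values())))
    return num / den if den else 0.0


# Made-up imbalanced labels: nine samples of class 1, one of class 3
y_true = [1] * 9 + [3]
always_majority = [1] * 10            # a model that ignores the rare class

accuracy = sum(t == p for t, p in zip(y_true, always_majority)) / len(y_true)
mcc = multiclass_mcc(y_true, always_majority)
```

On this toy example the majority-only predictor looks good on accuracy but carries no information according to MCC, which is the behaviour the tuning objective relies on.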
--- a/src/utils.py
+++ b/src/utils.py
@@ -22,10 +22,62 @@ class Dataset_test:
    y_test_area:Union[pd.Series,None]
    X_test_season:Union[pd.DataFrame,None]
    y_test_season:Union[pd.Series,None]
-   
+
+def prepare_new_data(dataset:pd.DataFrame,to_remove:dict)->(pd.DataFrame,pd.DataFrame):
+    """Prepare new data for prediction. MUST MIRROR the preprocessing in retrive_data. Maybe it could be called directly inside it...
+
+    Args:
+        dataset (pd.DataFrame): dataset to use for inference
+        to_remove (dict): for each column, the values to merge into the class `other`
+
+    Returns:
+        two pandas dataframes: the original one and one with condensed classes.
+
+    """
+    dataset_p = dataset.copy()
+    dataset_p.drop(columns=['dateandtime','skiarea_id','day_of_year','minute_of_day','year'], inplace=True)
+    
+    ## evacuation_vehicles is a nested field: expand it into one boolean column per vehicle
+    ev = set()
+    for i,row in dataset_p.iterrows():
+        ev = ev.union(set(row.evacuation_vehicles))
+    for c in ev:
+        dataset_p[c] = False
+    for i,row in dataset_p.iterrows():
+        for c in row.evacuation_vehicles:
+            dataset_p.loc[i,c] = True
+    dataset_p.drop(columns=['town','province','evacuation_vehicles'],inplace=True)
+    
+    
+    dataset_p['age'] =  dataset_p['age'].astype(np.float32).fillna(np.nan)
+    
+    dataset_p_small = dataset_p.copy()
+
+    for c in to_remove.keys():
+        for k in to_remove[c]:
+            dataset_p_small.loc[dataset_p[c]==k,c] = 'other'
+    for c in dataset_p.columns:
+        if c not in ['age','season','skiarea_name']:
+            dataset_p_small[c] =  dataset_p_small[c].fillna('None').astype('category')  
+            dataset_p[c] =  dataset_p[c].fillna('None').astype('category')  
+    dataset_p.dropna(inplace=True)
+    dataset_p_small.dropna(inplace=True)
+
+    return dataset_p,dataset_p_small


-def retrive_data(reload_data:bool,threshold_under_represented:float,path:str):
+
+def retrive_data(reload_data:bool,threshold_under_represented:float,path:str)->(pd.DataFrame,pd.DataFrame,dict):
+    """Get data
+
+    Args:
+        reload_data (bool): if true, the procedure will download the data from the db
+        threshold_under_represented (float): classes whose share is below this percentage are condensed into the class `other`
+        path (str): path in which to save the data
+
+    Returns:
+        two pandas dataframes (the original one and one with condensed classes) and a dictionary of the condensed classes
+    """
    if reload_data:
        engine = pg.connect("dbname='safeidx' user='fbk_mpba' host='172.104.247.67' port='5432' password='fbk2024$'")
        df = pd.read_sql('select * from fbk_export_20240212', con=engine) 
@@ -85,7 +137,7 @@ def retrive_data(reload_data:bool,threshold_under_represented:float,path:str):
    labeled.india = labeled.india.apply(lambda x: x.replace('i','')).astype(int)
    labeled_small.india = labeled_small.india.apply(lambda x: x.replace('i','')).astype(int)
    
-    return labeled,labeled_small
+    return labeled,labeled_small,to_remove
    


@@ -94,8 +146,24 @@ def split(labeled:pd.DataFrame,
          SEASON_TEST_SKIAREA:str = 'Kronplatz',
          SEASON_TEST_YEAR:int = 2023,
          use_smote:bool = False,
-          weight_type:str = 'sqrt' ):
-    
+          weight_type:str = 'sqrt' )->(Dataset, Dataset_test):
+    """Split the dataset into train, validation and test. From the initial dataset we remove a single skiarea (SKI_AREA_TEST),
+    generating the first test set. Then we select a skiarea and a starting season (SEASON_TEST_SKIAREA, SEASON_TEST_YEAR)
+    and generate the second test set. The rest of the data is split 66-33, stratified on the target column (india).
+    It is possible to specify the weight of each sample. Two strategies are implemented: using the sum or the square root
+    of the sum. This is used to mitigate the class imbalance. An alternative is to use an oversampling procedure (use_smote).
+
+    Args:
+        labeled (pd.DataFrame): dataset
+        SKI_AREA_TEST (str, optional): skiarea to remove from the train and use in test. Defaults to 'Klausberg'.
+        SEASON_TEST_SKIAREA (str, optional): skiarea to remove from the dataset if the season is greater than SEASON_TEST_YEAR. Defaults to 'Kronplatz'.
+        SEASON_TEST_YEAR (int, optional): see SEASON_TEST_SKIAREA. Defaults to 2023.
+        use_smote (bool, optional): use oversampling for class imbalance. Defaults to False.
+        weight_type (str, optional): routine for weighting the error on the samples. Defaults to 'sqrt'.
+
+    Returns:
+        training-validation dataset and test dataset
+    """

        
    test_area = labeled[labeled.skiarea_name==SKI_AREA_TEST]
@@ -116,7 +184,6 @@ def split(labeled:pd.DataFrame,
        from imblearn.over_sampling import RandomOverSampler
    
        sm = RandomOverSampler()
-        X_train_smote,y_train_smote = sm.fit_resample(X_train,y_train)
        X_train,y_train = sm.fit_resample(X_train,y_train)

    ##computed the weights for unbalanced dataset
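
The weighting code itself falls outside this hunk, so the following is only one possible reading of the two strategies described in the README and in the `split` docstring: the total number of cases, or its square root, divided by the size of each class. The function name `class_weights` and the `'sum'` option name are hypothetical; only `'sqrt'` appears in the diff.

```python
from collections import Counter
from math import sqrt


def class_weights(labels, weight_type="sqrt"):
    """Per-class weights inversely proportional to the class size."""
    counts = Counter(labels)
    total = len(labels)
    if weight_type == "sqrt":
        numerator = sqrt(total)       # square root of the total number of cases
    elif weight_type == "sum":        # hypothetical name for the plain-total variant
        numerator = float(total)
    else:
        raise ValueError(f"unknown weight_type: {weight_type!r}")
    return {k: numerator / n for k, n in counts.items()}


# Made-up imbalanced target: 90 / 9 / 1 samples for classes 1 / 2 / 3
y_train = [1] * 90 + [2] * 9 + [3] * 1
w_sqrt = class_weights(y_train, "sqrt")
w_sum = class_weights(y_train, "sum")

# One weight per training sample, in the shape a `weight` argument would expect
sample_weight = [w_sqrt[y] for y in y_train]
```

Under this reading both variants are proportional to 1/n_k and differ only in overall scale; if the actual implementation takes the square root of the ratio instead, the relative weights of the classes would differ as well.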