§ ¢TfB<ãóº—dZddlZddlmcmZddlZddlZddl m Z ej d¦«Gd„de ¦«ZGd„de ¦«Zd „Zdd„ZdS) z5 Created on Sat May 18 19:24:45 2024 @author: atdou éN)Úpyplotécó*—eZdZdZd„Zd„Zd„Zd„ZdS)Úcreate_regress_dataz³ Description ----------- Creates an example dataset of arbitrary dimensions, roughly following a regression function, f, and also populated with outliers có6—tj¦«|_dS)aw Parameters ---------- dim: int the dimension of the X data Attributes ---------- dim: int data: pandas Datafarme this is the dataframe we're trying to create. It will ultimately be of the form: ["x_0", "x_1", ..., "x_n", "y", "f", "Residuals", "Outlying?"] N©ÚpandasÚ DataFrameÚdata©Úselfs úbC:\Users\atdou\OneDrive\Desktop\Files\Coding\Python\Programs\XtraMLTools\test_auxiliary_classes.pyÚ__init__zcreate_regress_data.__init__ó€õÔ$Ñ&Ô&ˆŒ ˆ ˆ ócó—||_||_||_||_||_t j |j|z |jf¦«|j|jz z|jz|_t j ||jf¦«|j|jz z|jz|_tj tj|j|jf¦«d„t|j¦«D¦«¬¦«|_ dS)áb Description ----------- generating N random points along x axis, n_out of which are (possible) outliers. Parameters ---------- x_min: float x_max: float N: int number of total points, including outliers n_out: int the number of outliers dim: int the dimension of X/number of columns of X Attributes ---------- N: int n_out: int dim: int x_min: float x_max: float X_norm: numpy array of X coordinates of going to be ostensibly normal y points X_out: numpy array of X coordinates of going to be ostensibly outlier y points X: pandas DataFrame X_norm and X_out concatenated and turned into a dataframe có2—g|]}dt|¦«z‘ŒS©Úx_©Ústr©Ú.0Úns rú z-create_regress_data.set_X..Ks"€Ð,RÐ,RÐ,R¸Q¨Tµ#°a±&´&©[Ð,RÐ,RÐ,Rr©ÚcolumnsN)ÚNÚn_outÚdimÚx_minÚx_maxÚnumpyÚrandomÚX_normÚX_outr r ÚconcatenateÚrangeÚX©r r"r#rr r!s rÚset_Xzcreate_regress_data.set_X$sæ€ð>ˆŒØˆŒ ØˆŒØˆŒ ØˆŒ Ý”l×)Ò)¨4¬6°%©<¸¼Ð*AÑBÔBÀDÄJÈtÌzÑDYÑZÐ[_Ô[eÑeˆŒÝ”\×(Ò(¨%°´Ð)9Ñ:Ô:¸D¼JÀtÄzÑ.ms#ø€Ð@Ð@Ð@¨q˜dŸiši¨™lœlÐ@Ð@Ð@rcó:•—g|]}‰ |¦«‘ŒSr/r0r2s €rrz-create_regress_data.set_y..ns#ø€Ð>Ð>Ð>¨a˜TŸYšY q™\œ\Ð>Ð>Ð>r)ÚsizeN)r1r$Úarrayr&r'r%Únormalrr Úuniformr ÚSeriesr(ÚfÚy) r r1ÚdevÚmin_yÚmax_yÚf_normÚf_outÚy_normÚy_outs ` rÚset_yzcreate_regress_data.set_yMsêø€ð>ˆŒ Ý”Ð@Ð@Ð@Ð@°D´KÐ@Ñ@Ô@ÑAÔAˆÝ”Ð>Ð>Ð>Ð>°4´:Ð>Ñ>Ô>Ñ?Ô?ˆØ˜3uœ|×2Ò2¸¼¸t¼zÑ8IÐ2ÑJÔJÑJÑJˆØœ×,Ò,¨U°EÀÄ Ð,ÑKÔKÑKˆÝ”uÔ0°&¸%°ÑAÔAÑBÔBˆŒÝ”uÔ0°&¸%°ÑAÔAÑBÔBˆŒˆˆrcóÄ—tj|jdg¬¦«}tj|jdg¬¦«}tj|j||gd¬¦«|_|jd|jdz |jd<||_||_tj |jdgd¢¦«}|d|dz }|jd ¦«}|jd ¦«}|jd kr'|d|j|zz |d|j|zz} } n%|jd kr||j|zz ||j|zz} } g}tt|j¦«¦«D]]}|jj|df| ks|jj|df| kr| d¦«ŒH| d¦«Œ^||jd <dS)a@ Description ----------- Now we concatenate the X, and y, and f points. And we classify points as outlying or not. Parameters ---------- metric: string this is "IQR", or "std" factor: float this is the parameter that multiplies either IQR = Q[3] - Q[1], or std, when classifying outlies, i.e., (Q[1] - factor*IQR, Q[3] + factor*IQR), or (mean - factor*std, mean + factor*std) Attributes ---------- metric: string factor: float r:rr;é)ÚaxisÚ Residuals)rgÐ?gà?gè?rEéÚIQRÚstdTFú Outlying?N)r r r:r;Úconcatr*rÚmetricÚfactorr$ÚquantileÚmeanrJr)ÚlenÚlocÚappend) r rMrNr:r;ÚQrIrPrJÚ Res_low_boundÚRes_high_boundÚrow_outlier_setÚis rÚclassify_outliersz%create_regress_data.classify_outlierstsÚ€õ& Ô˜TœV°¨uÐ5Ñ5Ô5ˆÝÔ˜TœV°¨uÐ5Ñ5Ô5ˆÝ”M 4¤6¨1¨a .°qÐ9Ñ9Ô9ˆŒ Ø!%¤¨3¤°$´)¸C´.Ñ!@ˆŒ +ÑØˆŒØˆŒÝŒN˜4œ9 [Ô1Ð2FÐ2FÐ2FÑGÔGˆØŒd1Q”4‰iˆØŒy˜Ô%×*Ò*Ñ,Ô,ˆØŒi˜Ô$×(Ò(Ñ*Ô*ˆØŒ;˜%ÒÐØ,-¨a¬D°4´;¸s±?Ñ,BÀAÀaÄDÈ4Ì;ÐWZÉ?ÑDZ˜>ˆMˆMØ Œ[˜EÒ !Ð !Ø,0°´¸S±Ñ,@À$ÀtÄ{ÐSVÁÑBV˜>ˆMØˆÝ•s˜4œ9‘~”~Ñ&Ô&ð .ð .ˆAØ” ” ˜a ˜mÔ,¨}Ò<Ð<À$Ä)Ä-ÐPQÐR]ÐP]ÔB^ÐaoÒBoÐBoØ×&Ò& tÑ,Ô,Ð,Ð,à×&Ò& uÑ-Ô-Ð-Ð-Ø!0ˆŒ +ÑÐÐrN)Ú__name__Ú __module__Ú__qualname__Ú__doc__rr,rCrYr/rrrr sb€€€€€ððð'ð'ð'ð 'Tð'Tð'TðR%Cð%Cð%CðN'1ð'1ð'1ð'1ð'1rrcó$—eZdZdZd„Zd„Zd„ZdS)Úcreate_classy_dataz· Description ----------- Creates an example dataset of arbitrary dimensions, roughly following a classification function, f, and also populated with outliers có6—tj¦«|_dS)a\ Parameters ---------- dim: int the dimension of the X data Attributes ---------- dim: int data: pandas Datafarme this is the dataframe we're trying to create. It will ultimately be of the form: ["x_0", "x_1", ..., "x_n", "Class"] Nrrs rrzcreate_classy_data.__init__¥rrcó*—||_||_||_||_||_t j |j|z |jf¦«|j|jz z|jz|_t j ||jf¦«|j|jz z|jz|_tj |j|jf¦«|_ tj|j d„t|j¦«D¦«¬¦«|_dS)rcó2—g|]}dt|¦«z‘ŒSrrrs rrz,create_classy_data.set_X..Üs"€Ð:`Ð:`Ð:`È1¸4ÅÀAÁÄ¹;Ð:`Ð:`Ð:`rrN)rr r!r"r#r$r%r&r'r(ÚX_arrayr r r)r*r+s rr,zcreate_classy_data.set_Xµsë€ð>ˆŒØˆŒ ØˆŒØˆŒ ØˆŒ Ý”l×)Ò)¨4¬6°%©<¸¼Ð*AÑBÔBÀDÄJÈtÌzÑDYÑZÐ[_Ô[eÑeˆŒÝ”\×(Ò(¨%°´Ð)9Ñ:Ô:¸D¼JÀtÄzÑ 0, etc. Can obviously be more complicated than this. f_values: list of the possible values f can output. These would be the numbers associated with the different classes p_norm: float probability the normal points assume value f says they should. p_out: float probability the outlier points assume value f says they should. Attributes ---------- func: function this is f f_values: list this is the list of possible values f can output. This should be integers starting from 0. f: pandas DataFrame this is f evaluated at all X points y: pandas DataFrame y-coordinates of normal and outlier points có—g|]}d‘ŒS)Fr/©rÚjs rrz,create_classy_data.set_y..s€Ð*UÐ*UÐ*U°Q¨5Ð*UÐ*UÐ*Urcó—g|]}d‘ŒS)Tr/rfs rrz,create_classy_data.set_y..s€Ð)JÐ)JÐ)J°1¨$Ð)JÐ)JÐ)Jrcó:•—g|]}‰ |¦«‘ŒSr/r0r2s €rrz,create_classy_data.set_y..s#ø€ÐBÐBÐB°˜tŸyšy¨™|œ|ÐBÐBÐBrcóÄ•‡—|dkr‰n‰Šˆˆfd„t‰¦«D¦«}‰||<tjtj d|¦«¦«}|S)NFcó&•—g|] }d‰z ‰dz z‘ŒS)rEr/)rrgÚ num_valuesÚps €€rrzDcreate_classy_data.set_y..class_assigner.. s&ø€ÐEÐEÐE¨aa˜‘c˜J q™LÑ)ÐEÐEÐErrE)r)r$Úargmaxr%Úmultinomial)Úsr:ÚprobsÚcrmrlÚp_normÚp_outs @€€€rÚclass_assignerz0create_classy_data.set_y..class_assignershøø€Ø˜Uš(˜(¨ˆAØEÐEÐEÐEÐEµ5¸Ñ3DÔ3DÐEÑEÔEˆEØˆE!‰HÝ”Uœ\×5Ò5°a¸Ñ?Ô?Ñ@Ô@ˆAØˆHrérEÚintrKr:r;N)r1Úf_valuesrQr$r6r)rr r(rcÚ frompyfuncr r9Úastyper;r*Úcopyr)r r1rxrsrtÚoutlier_status_normÚoutlier_status_outÚoutlier_status_arrayÚf_arrayruÚnum_class_assignerrls` `` @rrCzcreate_classy_data.set_yÞs_øøøø€ðDˆŒ Ø ˆŒ Ý˜œÑ'Ô'ˆ Ý#œkÐ*UÐ*U½%ÀÄÈÌÑ@SÑ:TÔ:TÐ*UÑ*UÔ*UÑVÔVÐÝ"œ[Ð)JÐ)J½¸d¼jÑ8IÔ8IÐ)JÑ)JÔ)JÑKÔKÐÝ$Ô0Ð2EÐGYÐ1ZÑ[Ô[ÐÝ”+ÐBÐBÐBÐB°T´\ÐBÑBÔBÑCÔCˆð ð ð ð ð ð ð õ#Ô-¨n¸aÀÑCÔCÐÝ”Ð1Ð1Ð2FÈÑPÔPÑQÔQ×XÒXÐY^Ñ_Ô_ˆŒØ”F—K’K‘M”MˆŒ Ø!5ˆŒ +ÑØ ˆŒ #‰ØœˆŒ #‰ˆˆrN)rZr[r\r]rr,rCr/rrr_r_žsO€€€€€ððð'ð'ð'ð 'bð'bð'bðR4 ð4 ð4 ð4 ð4 rr_có¶—i}|D]Ó}||}|ddk|ddkz ¦«}|ddk|ddkz ¦«}|ddk|ddkz ¦«}|ddk|ddkz ¦«}tj||g||gg¦«||<ŒÔ|S)aÅ Description ----------- outputs a dictionary of classification matrices comparing ROR outlier predictions on data_exp vs their actual status. Should create a data object and then fit or transform a ROR on it. Then feed the data object into data_exp argument above, and then feed in the dictionary of dataframes desired from ROR. Parameters ---------- data_exp: pandas dataframe this is a dataframe object as created above in that data class, i.e., data.data data_dict: dictionary of pandas dataframes should fit and or transform ROR to data.data. Then can create a dictionary of fit or transform df's with columns ["y_pred", "Residuals", "Outlying_Predictions"]. These are found in ROR.train_data, ROR.train_data_ave, and ROR.test_data, ROR.test_data_ave. So can test any of these guys. Returns ------- dictionary: of classification matrices rKTÚOutlying_PredictionF)Úsumr$r6) Údata_expÚ data_dictÚCM_dictÚnamerÚTPÚFPÚFNÚTNs rÚCM_ROR_predictionsrŒsô€ð,€GØð7ð7ˆØ˜ŒˆØ˜Ô$ dÒ*¨tÐ4IÔ/JÈDÒ/PÑQ× VÒ VÑ XÔ XˆØ˜Ô$ eÒ+°Ð5JÔ0KÈTÒ0QÑR× WÒ WÑ YÔ YˆØ˜Ô$ dÒ*¨tÐ4IÔ/JÈEÒ/QÑR× WÒ WÑ YÔ YˆØ˜Ô$ eÒ+°Ð5JÔ0KÈUÒ0RÑS× XÒ XÑ ZÔ ZˆÝœ b¨ W¨b°¨WÐ$5Ñ6Ô6ˆ‰ ˆ Ø€NrÚx_0c ó|—d}d}|| ¦«|| ¦«z }|d ¦«|d ¦«z }|| ¦«d|zz }|| ¦«d|zz}|d ¦«d|zz } |d ¦«d|zz} |d ||dœ¦«|d<tj¦«tj¦«}| |¬¦«}| |||d d ¬¦«| |||d|dd¬ ¦«| ttj||d¦«¦«¦«| ttj| | d¦«¦«¦«| ||¦«| d¦«| | | ¦«| d¦«| d¦«tj¦«|dkr¯gd¢} d}|D]¨}||}| |dz}|j}| ||||d|d||¬¦«| |||d|d|d|ddd¬¦«| ¦«|dz }Œ§dSdS)aœ Description ----------- in data_exp.data, it plots user specified col_X vs. y, coloring the normal points grey, and the outlier points red. and it does the same for each dataframe in the data_dict. Further, it circles the outlier predictions of the models'' data in the data_dict, so can compare actual outliers (red) with predicted (green circles). x_min/max, y_min/max set the edges of the graph. Parameters ---------- data_exp : pandas dataframe, specifically of format create_data.data this is the experimental data that has the true outliers classed. data_dict : dictionary of pandas dataframes the dataframes could be from a ROR object, in ROR.train_data[m][n], and ROR.test_data[1], and ROR.train_data_ave, and ROR.test_data_ave, etc. col_X : string the column in X that we want to plot vs. y. As X can have multiple columns. ÚgreyÚredr;gš™™™™™¹?rK)TFÚColor)Úbyr:Úblack)Úcoloré)r”rpér3z!Feature Space and Actual OutliersN)ÚdarkblueÚ orangeredÚ limegreenÚpurplerÚy_predrv)Ú linewidthr”Úlabelr‚ÚnoneÚoéd)Ú edgecolorsÚ facecolorsÚmarkerrprE)ÚmaxÚminÚmaprÚfigureÚaxesÚsort_valuesÚplotÚscatterÚ set_xticksÚlistr$ÚlinspaceÚ set_yticksÚset_xlimÚ set_xlabelÚset_ylimÚ set_ylabelÚ set_titleÚgridÚindexÚlegend)r„r…Úcol_XÚ color_normÚ color_outÚx_rangeÚy_ranger"r#Úy_minÚy_maxÚaxÚdata_exp_sortedÚcolorsrgr‡rÚhueÚindicess rÚplot_ROR_predictionsrÄ6s€ð&€JØ€IØuŒo×!Ò!Ñ#Ô# h¨u¤o×&9Ò&9Ñ&;Ô&;Ñ;€GØsŒm×ÒÑ!Ô! H¨S¤M×$5Ò$5Ñ$7Ô$7Ñ7€GØUŒO×ÒÑ!Ô! C¨¡KÑ/€EØUŒO×ÒÑ!Ô! C¨¡KÑ/€EØSŒM×ÒÑÔ # g¡+Ñ-€EØSŒM×ÒÑÔ # g¡+Ñ-€EØ Ô-×1Ò1¸È:Ð2VÐ2VÑWÔW€HˆWÑÝ „MO„O€OÝ Œ‰Œ€BØ×*Ò*¨eÐ*Ñ4Ô4€OØ‡G‚GˆO˜EÔ" O°CÔ$8À'€GÑJÔJÐJØ‡J‚Jˆx˜Œ ¨¤ °xÀÔ7HÈA€JÑNÔNÐNØ‡M‚M•$•u”~ e¨U°AÑ6Ô6Ñ7Ô7Ñ8Ô8Ð8Ø‡M‚M•$•u”~ e¨U°AÑ6Ô6Ñ7Ô7Ñ8Ô8Ð8Ø‡K‚KuÑÔÐØ‡M‚M#ÑÔÐØ‡K‚KuÑÔÐØ‡M‚M#ÑÔÐØ‡L‚LÐ4Ñ5Ô5Ð5Ý „KM„M€MØDÒÐØAÐAÐAˆØ ˆØð ð ˆDØ˜T”?ˆDØ˜˜1™”+ˆCØ%Ô+ˆGØGŠGH˜U”O GÔ,¨d°8¬n¸WÔ.EÐQRÐ\_ÐimˆGÑnÔnÐnØJŠJx ” tÐ,AÔ'BÔCÀXÈcÄ]ÐSWÐXmÔSnÔEoØ#&°VÀcÈsð ñ Tô Tð TàIŠI‰KŒKˆKØ ˆq‰DˆAˆAðÐð ð r)r)r]ÚbuiltinsÚ@py_builtinsÚ_pytest.assertion.rewriteÚ assertionÚrewriteÚ @pytest_arr r$Ú matplotlibrr%ÚseedÚobjectrr_rŒrÄr/rrúrÎsðððð€€€€€€€€€€€€ÐÐÐÐÐÐÐÐØÐÐÐÐÐØ„×Ò"ÑÔÐðN1ðN1ðN1ðN1ðN1˜&ñN1ôN1ðN1ðbt ðt ðt ðt ðt ˜ñt ôt ðt ðnðððB4ð4ð4ð4ð4ð4r