pyexplainer package¶
Submodules¶
pyexplainer.pyexplainer_pyexplainer module¶
-
pyexplainer.pyexplainer_pyexplainer.
AutoSpearman
(X_train, correlation_threshold=0.7, correlation_method='spearman', VIF_threshold=5)[source]¶ An automated feature selection approach that addresses collinearity and multicollinearity. For more information, refer to the paper.
- Parameters:
X_train (pd.core.frame.DataFrame) – The X_train data to be processed.
correlation_threshold (float) – Threshold value of correlation.
correlation_method (str) – Method for computing the correlation between the features.
VIF_threshold (int) – Threshold value of VIF score.
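The correlation step can be pictured with a small, standard-library sketch. It is not the library's implementation: the helper names (`average_ranks`, `spearman`, `correlated_pairs`) and the toy data are hypothetical, the `>=` comparison against the threshold is an assumption, and the VIF step is omitted.

```python
from itertools import combinations
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over a run of tied values.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no rank information
    return cov / (var_x * var_y) ** 0.5


def correlated_pairs(columns, correlation_threshold=0.7):
    """Return (feature, feature, rho) for pairs with |rho| >= threshold."""
    pairs = []
    for left, right in combinations(sorted(columns), 2):
        rho = spearman(columns[left], columns[right])
        if abs(rho) >= correlation_threshold:
            pairs.append((left, right, rho))
    return pairs


# Hypothetical toy data: "loc" and "churn" rise together, "age" does not.
toy = {
    "loc": [10, 20, 30, 40, 50],
    "churn": [1, 3, 4, 8, 9],
    "age": [5, 1, 4, 2, 3],
}
flagged = correlated_pairs(toy)
```

A feature selection pass would then drop one feature from each flagged pair before checking the remaining features for multicollinearity.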
-
class
pyexplainer.pyexplainer_pyexplainer.
PyExplainer
(X_train, y_train, indep, dep, blackbox_model, class_label=['Clean', 'Defect'], top_k_rules=3, full_ft_names=[])[source]¶ Bases:
object
A PyExplainer object loads training data and an ML model to generate human-centric explanations and visualisations.
- Parameters:
X_train (pandas.core.frame.DataFrame) – Training data X (features).
y_train (pandas.core.series.Series) – Training data y (label).
indep (pandas.core.indexes.base.Index) – Independent variables (column names).
dep (str) – Dependent variable (column name).
blackbox_model (sklearn.ensemble.RandomForestClassifier) – A global random forest model trained with sklearn.
class_label (list) – Classification labels, default = [‘Clean’, ‘Defect’].
top_k_rules (int) – Number of top positive and negative rules to be retrieved.
full_ft_names (list) – A list containing full feature names inside X_train.
-
auto_spearman
(apply_to_X_train=True, correlation_threshold=0.7, correlation_method='spearman', VIF_threshold=5)[source]¶ An automated feature selection approach that addresses collinearity and multicollinearity. For more information, refer to the paper.
- Parameters:
apply_to_X_train (bool) – Whether to apply the selected columns to the X_train data inside the PyExplainer object; default is True.
correlation_threshold (float) – Threshold value of correlation.
correlation_method (str) – Method for computing the correlation between the features.
VIF_threshold (int) – Threshold value of VIF score.
-
explain
(X_explain, y_explain, top_k=3, max_rules=2000, max_iter=10000, cv=5, search_function='CrossoverInterpolation', random_state=None, reuse_local_model=False)[source]¶ Generate Rule Object Manually by passing X_explain and y_explain
- Parameters:
X_explain (pandas.core.frame.DataFrame) – Features to be explained by the local RuleFit model; can be seen as X_test.
y_explain (pandas.core.series.Series) – Label to be explained by the local RuleFit model; can be seen as y_test.
top_k (int, default is 3) – Number of top rules to be retrieved.
max_rules (int, default is 2000) – Maximum number of rules to be generated.
max_iter (int, default is 10000) – Maximum number of iterations passed to the local RuleFit model.
cv (int, default is 5) – Cross-validation setting passed to the local RuleFit model.
search_function (str, default is ‘CrossoverInterpolation’) – Name of the search function used to generate the instances passed to RuleFit.fit().
random_state (int, default is None) – Random seed for reproducing the same result.
reuse_local_model (bool, default is False) – Reproduce the same explanation for the same data.
- Returns:
A dict rule object including all of the data related to the local RuleFit model with the following keys, ‘synthetic_data’, ‘synthetic_predictions’, ‘X_explain’, ‘y_explain’, ‘indep’, ‘dep’, ‘top_k_positive_rules’, ‘top_k_negative_rules’.
- Return type:
dict
Examples
>>> from pyexplainer.pyexplainer_pyexplainer import PyExplainer
>>> import pandas as pd
>>> from sklearn.ensemble import RandomForestClassifier
>>> data = pd.read_csv('../tests/pyexplainer_test_data/activemq-5.0.0.csv', index_col='File')
>>> dep = data.columns[-4]
>>> indep = data.columns[0:(len(data.columns) - 4)]
>>> X_train = data.loc[:, indep]
>>> y_train = data.loc[:, dep]
>>> blackbox_model = RandomForestClassifier(max_depth=3, random_state=0)
>>> blackbox_model.fit(X_train, y_train)
>>> class_label = ['Clean', 'Defect']
>>> # blackbox_model precedes class_label in the constructor signature
>>> py_explainer = PyExplainer(X_train, y_train, indep, dep, blackbox_model, class_label)
>>> sample_test_data = pd.read_csv('../tests/pyexplainer_test_data/activemq-5.0.0.csv', index_col='File')
>>> X_test = sample_test_data.loc[:, indep]
>>> y_test = sample_test_data.loc[:, dep]
>>> sample_explain_index = 0
>>> X_explain = X_test.iloc[[sample_explain_index]]
>>> y_explain = y_test.iloc[[sample_explain_index]]
>>> py_explainer.explain(X_explain, y_explain, search_function='CrossoverInterpolation', top_k=3, max_rules=30, max_iter=5, cv=5)
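Before passing a rule object on to other code, it can help to confirm that it carries the keys listed under Returns. The following is a minimal sketch: `missing_rule_keys` is a hypothetical helper, not part of the package, and the partially filled dict stands in for a real rule object.

```python
# Keys documented for the dict returned by explain().
EXPECTED_KEYS = (
    "synthetic_data",
    "synthetic_predictions",
    "X_explain",
    "y_explain",
    "indep",
    "dep",
    "top_k_positive_rules",
    "top_k_negative_rules",
)


def missing_rule_keys(rule_obj):
    """Return the documented keys that are absent from rule_obj, in order."""
    return [key for key in EXPECTED_KEYS if key not in rule_obj]


# Hypothetical, partially filled rule object used only for illustration.
partial_rule_obj = {"X_explain": None, "y_explain": None, "dep": "label"}
missing = missing_rule_keys(partial_rule_obj)
```

An empty result means every documented key is present.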
-
generate_bullet_data
(parsed_rule_object)[source]¶ Generate bullet chart data (a list of dict) to be implemented with d3.js chart.
- Parameters:
parsed_rule_object (dict) – Top rules parsed from the rule object.
- Returns:
A list of dict that contains the data needed to generate a bullet chart.
- Return type:
list
-
generate_html
()[source]¶ Generate d3 bullet chart html and return it as a String.
- Returns:
html String
- Return type:
str
-
generate_instance_crossover_interpolation
(X_explain, y_explain, random_state=None, debug=False)[source]¶ An approach to generate instance using Crossover and Interpolation
- Parameters:
X_explain (pandas.core.frame.DataFrame) – X_explain (testing features).
y_explain (pandas.core.series.Series) – y_explain (testing label).
random_state (int) – Random seed.
debug (bool) – True for debugging mode, False otherwise.
- Returns:
A dict with two keys ‘synthetic_data’ and ‘sampled_class_frequency’ generated via Crossover and Interpolation.
- Return type:
dict
-
generate_instance_random_perturbation
(X_explain, debug=False)[source]¶ The random perturbation approach to generate synthetic instances which is also used by LIME.
- Parameters:
X_explain (pandas.core.frame.DataFrame) – X_explain (testing features).
debug (bool) – True for debugging mode, False otherwise.
- Returns:
A dict with two keys ‘synthetic_data’ and ‘sampled_class_frequency’ generated via Random Perturbation.
- Return type:
dict
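The idea behind random perturbation can be sketched without the library: synthetic neighbours are produced by adding noise to the instance being explained. This is an illustrative sketch only. The function name `perturb_instance`, the Gaussian noise scaled by a per-feature standard deviation, and the sample data are all assumptions, not the method's actual sampling scheme.

```python
import random


def perturb_instance(instance, feature_std, n_samples=5, random_state=None):
    """Draw synthetic neighbours of `instance` by adding Gaussian noise.

    instance:    dict mapping feature name -> value of the explained instance
    feature_std: dict mapping feature name -> noise scale (assumed here to be
                 the feature's standard deviation in the training data)
    """
    rng = random.Random(random_state)  # seeded for reproducibility
    samples = []
    for _ in range(n_samples):
        samples.append(
            {
                name: value + rng.gauss(0.0, feature_std[name])
                for name, value in instance.items()
            }
        )
    return samples


# Hypothetical instance with two features.
synthetic = perturb_instance(
    {"loc": 100.0, "age": 3.0},
    {"loc": 10.0, "age": 0.5},
    n_samples=4,
    random_state=0,
)
```

Fixing `random_state` makes the generated neighbours repeatable, which matters when the same explanation has to be reproduced.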
-
generate_risk_data
(X_explain)[source]¶ Generate risk prediction and risk score to be visualised
- Parameters:
X_explain (pandas.core.frame.DataFrame) – Explained DataFrame generated from the RuleFit model.
- Returns:
A list of dict that contains the data of risk prediction and risk score.
- Return type:
list
-
generate_sliders
()[source]¶ Generate one or more slider widgets and return as a list. Slider would be either IntSlider or FloatSlider depending on the value in the data
- Returns:
A list of slider widgets.
- Return type:
list
-
get_full_ft_names
()[source]¶ getter of self.full_ft_names
- Returns:
A list of full feature names in X_train following the same order as X_train
- Return type:
list
-
get_risk_pred
()[source]¶ Retrieve the risk prediction from risk_data
- Returns:
A string of risk prediction
- Return type:
str
-
get_risk_score
()[source]¶ Retrieve the risk score from risk_data
- Returns:
A float of risk score
- Return type:
float
-
get_top_k_rules
()[source]¶ Getter of top_k_rules
- Returns:
Number of top positive and negative rules to be retrieved
- Return type:
int
-
on_value_change
(change, debug=False)[source]¶ The callback function for the interactive slider
The callback fires when the user interacts with the slider. In non-continuous update mode it fires only when the mouse button is released. In continuous update mode (not recommended here) it fires repeatedly while the user is moving the slider.
The callback first clears the output of the Risk Score Progress Bar and the Bullet Chart, then calls the functions that compute the new values, and visualises the new values once the computation is done.
- Parameters:
change (dict) – A dict that contains the slider data before and after the change.
-
parse_top_rules
(top_k_positive_rules, top_k_negative_rules)[source]¶ Parse top k positive rules and top k negative rules given positive and negative rules as DataFrame
- Parameters:
top_k_positive_rules (pandas.core.frame.DataFrame) – Top positive rules DataFrame.
top_k_negative_rules (pandas.core.frame.DataFrame) – Top negative rules DataFrame.
- Returns:
A dict containing two keys, ‘top_tofollow_rules’ and ‘top_toavoid_rules’
- Return type:
dict
-
retrieve_X_explain_min_max_values
()[source]¶ Retrieve the minimum and maximum value from X_train
- Returns:
A dict containing two keys, ‘min_values’ and ‘max_values’
- Return type:
dict
-
set_X_train
(X_train)[source]¶ Setter of X_train
- Parameters:
X_train (pandas.core.frame.DataFrame) – X_train data.
-
set_full_ft_names
(full_ft_names)[source]¶ Setter of full_ft_names
- Parameters:
full_ft_names (list) – A list of full feature names in X_train following the same order as X_train.
-
set_top_k_rules
(top_k_rules)[source]¶ Setter of top_k_rules
- Parameters:
top_k_rules (int) – Number of top positive and negative rules to be retrieved.
-
show_visualisation
(title)[source]¶ Display items as follows, (1) Risk Score Progress Bar (made from ipywidgets) (2) Interactive Slider (made from ipywidgets) (3) Bullet Chart (Generated By D3.js)
-
update_right_text
(right_text)[source]¶ Update the text on the rightward side of the Risk Score Progress Bar
- Parameters:
right_text (widgets.Label) – Text on the right-hand side of the Risk Score Progress Bar.
-
update_risk_score
(risk_score)[source]¶ Update the risk score value inside the risk_data
- Parameters:
risk_score (int) – Value of risk score.
-
visualisation_data_setup
(rule_obj)[source]¶ Set up the data before visualising them
- Parameters:
rule_obj (dict) – A rule dict generated either by loading the .pyobject file or by calling the .explain(…) function.
-
visualise
(rule_obj, title=None)[source]¶ Given the rule object, show the following visualisations: (1) Risk Score Progress Bar (made from ipywidgets) (2) Interactive Slider (made from ipywidgets) (3) Bullet Chart (Generated By D3.js)
- Parameters:
rule_obj (dict) – A rule dict generated either by loading the .pyobject file or by calling the .explain(…) function.
Examples
>>> from pyexplainer.pyexplainer_pyexplainer import PyExplainer
>>> import pandas as pd
>>> from sklearn.ensemble import RandomForestClassifier
>>> data = pd.read_csv('../tests/pyexplainer_test_data/activemq-5.0.0.csv', index_col='File')
>>> dep = data.columns[-4]
>>> indep = data.columns[0:(len(data.columns) - 4)]
>>> X_train = data.loc[:, indep]
>>> y_train = data.loc[:, dep]
>>> blackbox_model = RandomForestClassifier(max_depth=3, random_state=0)
>>> blackbox_model.fit(X_train, y_train)
>>> class_label = ['Clean', 'Defect']
>>> # blackbox_model precedes class_label in the constructor signature
>>> pyExp = PyExplainer(X_train, y_train, indep, dep, blackbox_model, class_label)
>>> sample_test_data = pd.read_csv('../tests/pyexplainer_test_data/activemq-5.0.0.csv', index_col='File')
>>> X_test = sample_test_data.loc[:, indep]
>>> y_test = sample_test_data.loc[:, dep]
>>> sample_explain_index = 0
>>> X_explain = X_test.iloc[[sample_explain_index]]
>>> y_explain = y_test.iloc[[sample_explain_index]]
>>> # explain() has no debug parameter in its signature, so none is passed
>>> rule_obj = pyExp.explain(X_explain, y_explain, search_function='CrossoverInterpolation', top_k=3, max_rules=30, max_iter=5, cv=5)
>>> pyExp.visualise(rule_obj)
-
pyexplainer.pyexplainer_pyexplainer.
data_validation
(data)[source]¶ Validate that the given data is a list of dictionaries.
- Parameters:
data (Any) – Data to be validated.
- Returns:
True if the data is a list of dictionaries, False otherwise.
- Return type:
bool
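The check described here amounts to two `isinstance` tests. A minimal sketch follows, with a hypothetical function name (`is_list_of_dicts`); treating an empty list as valid is an assumption, since the behaviour for that case is not stated above.

```python
def is_list_of_dicts(data):
    """Return True when data is a list whose items are all dicts."""
    # all() over an empty list is True, so [] is accepted (an assumption).
    return isinstance(data, list) and all(isinstance(item, dict) for item in data)


# Hypothetical bullet-chart style payloads.
valid = is_list_of_dicts([{"title": "loc"}, {"title": "age"}])
invalid = is_list_of_dicts([{"title": "loc"}, "age"])
```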
-
pyexplainer.pyexplainer_pyexplainer.
filter_rules
(rules, X_explain)[source]¶ Get rules that are actually applied to the commit
- Parameters:
rules (pandas.core.frame.DataFrame) – Rules data under the column called ‘rule’ inside the Rules DataFrame generated by RuleFit.
X_explain (pandas.core.frame.DataFrame) – Features to be explained by the local RuleFit model; can be seen as X_test.
- Returns:
A DataFrame that contains filtered rules
- Return type:
pandas.core.frame.DataFrame
-
pyexplainer.pyexplainer_pyexplainer.
get_base_prefix_compat
()[source]¶ Get base/real prefix, or sys.prefix if there is none.
-
pyexplainer.pyexplainer_pyexplainer.
get_dflt
()[source]¶ Obtain the default data and model
- Returns:
A dictionary wrapping all default data and model
- Return type:
dict
-
pyexplainer.pyexplainer_pyexplainer.
id_generator
(size=15, random_state=RandomState(MT19937) at 0x7F8FC367F840)[source]¶ Generate unique ids for div tag which will contain the visualisation stuff from d3.
- Parameters:
size (int) – An integer that specifies the length of the returned id, default = 15. Size should be in the range 1 - 30 (both included).
random_state (np.random.RandomState, default is None.) – A RandomState instance.
- Returns:
A random identifier.
- Return type:
str
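A seeded identifier generator of this kind can be sketched with the standard library alone. The sketch below is not the package's code: `make_div_id` and its `seed` parameter are hypothetical, it uses `random.Random` rather than a NumPy `RandomState`, and the choice of alphabet is an assumption.

```python
import random
import string


def make_div_id(size=15, seed=None):
    """Build a random identifier of `size` characters (1 to 30 inclusive)."""
    if not 1 <= size <= 30:
        raise ValueError("size must be in the range 1 to 30 (both included)")
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits
    # Start with a letter so the id is also usable in a CSS selector.
    first = rng.choice(string.ascii_letters)
    rest = "".join(rng.choice(alphabet) for _ in range(size - 1))
    return first + rest


div_id = make_div_id(size=15, seed=42)
```

Passing the same seed returns the same identifier, which is useful in tests; leaving it as None gives a fresh identifier on each call.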
pyexplainer.rulefit module¶
We use the RuleFit implementation as provided by the following url: https://raw.githubusercontent.com/christophM/rulefit/master/rulefit/rulefit.py
Linear model of tree-based decision rules
This module implements the RuleFit algorithm.
The module structure is the following:
RuleCondition implements a binary feature transformation
Rule implements a Rule composed of RuleConditions
RuleEnsemble implements an ensemble of Rules
RuleFit implements the RuleFit algorithm
-
class
pyexplainer.rulefit.
FriedScale
(winsorizer=None)[source]¶ Bases:
object
Performs scaling of linear variables according to Friedman et al. 2005 Sec 5
Each variable is first Winsorized l->l*, then standardised as 0.4 x l* / std(l*).
Warning: this class should not be used directly.
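The scaling formula can be worked through on a single variable. This is a sketch under stated assumptions, not the class itself: the function names are hypothetical, the Winsorization limits are passed in directly rather than derived from a trim quantile, and the population standard deviation is assumed.

```python
from statistics import pstdev


def winsorize(values, lower, upper):
    """Clip every value into the interval [lower, upper]."""
    return [min(max(value, lower), upper) for value in values]


def fried_scale(values, lower, upper):
    """Winsorize l -> l*, then return 0.4 * l* / std(l*)."""
    clipped = winsorize(values, lower, upper)
    spread = pstdev(clipped)  # population standard deviation (an assumption)
    if spread == 0:
        raise ValueError("cannot scale a variable with zero spread")
    return [0.4 * value / spread for value in clipped]


# Hypothetical variable with one extreme value that gets clipped to 4.
scaled = fried_scale([0, 1, 2, 3, 100], lower=0, upper=4)
```

After scaling, the variable has a standard deviation of 0.4 regardless of its original units.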
-
class
pyexplainer.rulefit.
Rule
(rule_conditions, prediction_value)[source]¶ Bases:
object
Class for binary Rules from list of conditions
Warning: this class should not be used directly.
-
class
pyexplainer.rulefit.
RuleCondition
(feature_index, threshold, operator, support, feature_name=None)[source]¶ Bases:
object
Class for binary rule condition
Warning: this class should not be used directly.
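A rule condition is a binary test on one feature, and a rule fires only when all of its conditions hold. The sketch below illustrates that relationship with plain functions. The names are hypothetical, and the operator set (`<=` and `>`, as produced by decision-tree splits) is an assumption because the accepted operators are not listed here.

```python
import operator

# Assumed operator set: the two outcomes of a decision-tree split.
OPERATORS = {"<=": operator.le, ">": operator.gt}


def condition_applies(row, feature_index, threshold, op):
    """Return 1 if row[feature_index] satisfies the condition, else 0."""
    return 1 if OPERATORS[op](row[feature_index], threshold) else 0


def rule_applies(row, conditions):
    """Return 1 only when every (feature_index, threshold, op) condition holds."""
    return int(all(condition_applies(row, *condition) for condition in conditions))


# Hypothetical rule: feature 0 > 10 AND feature 1 <= 0.5.
conditions = [(0, 10, ">"), (1, 0.5, "<=")]
rows = [[5, 0.2], [12, 0.9], [20, 0.1]]
fired = [rule_applies(row, conditions) for row in rows]
```

Applied to a whole dataset, each rule yields one binary column, which is the form of input the linear model consumes.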
-
class
pyexplainer.rulefit.
RuleEnsemble
(tree_list, feature_names=None)[source]¶ Bases:
object
Ensemble of binary decision rules
This class implements an ensemble of decision rules that extracts rules from an ensemble of decision trees.
- Parameters:
tree_list (List or array of DecisionTreeClassifier or DecisionTreeRegressor) – Trees from which the rules are created
feature_names (List of strings, optional (default=None)) – Names of the features
-
rules
¶ The ensemble of rules extracted from the trees
- Type:
List of Rule
-
transform
(X, coefs=None)[source]¶ Transform dataset.
- Parameters:
X (array-like matrix, shape=(n_samples, n_features)) –
coefs (optional) – If supplied, this makes the prediction slightly more efficient by setting rules with zero coefficients to zero without calling Rule.transform().
- Returns:
X_transformed – Transformed dataset. Each column represents one rule.
- Return type:
array-like matrix, shape=(n_samples, n_out)
-
class
pyexplainer.rulefit.
RuleFit
(tree_size=4, sample_fract='default', max_rules=2000, memory_par=0.01, tree_generator=None, rfmode='regress', lin_trim_quantile=0.025, lin_standardise=True, exp_rand_tree_size=True, model_type='rl', Cs=None, cv=3, tol=0.0001, max_iter=None, n_jobs=None, random_state=None)[source]¶ Bases:
sklearn.base.BaseEstimator
,sklearn.base.TransformerMixin
Rulefit class
- Parameters:
tree_size – Number of terminal nodes in generated trees. If exp_rand_tree_size=True, this will be the mean number of terminal nodes.
sample_fract – Fraction of randomly chosen training observations used to produce each tree. FP 2004 (Sec. 2)
max_rules – Approximate total number of rules generated for fitting. Note that the actual number of rules will usually be lower than this due to duplicates.
memory_par – Scale multiplier (shrinkage factor) applied to each new tree when sequentially induced. FP 2004 (Sec. 2)
rfmode – 'regress' for regression or 'classify' for binary classification.
lin_standardise – If True, the linear terms will be standardised as per Friedman Sec 3.2 by multiplying the winsorised variable by 0.4/stdev.
lin_trim_quantile – If lin_standardise is True, this quantile will be used to trim linear terms before standardisation.
exp_rand_tree_size – If True, each boosted tree will have a different maximum number of terminal nodes based on an exponential distribution about tree_size. (Friedman Sec 3.3)
model_type – 'r': rules only; 'l': linear terms only; 'rl': both rules and linear terms.
random_state – Integer to initialise random objects and provide repeatability.
tree_generator – Optional: this object will be used as provided to generate the rules. This will override almost all the other properties above. Must be GradientBoostingRegressor or GradientBoostingClassifier, optional (default=None)
tol – The tolerance for the optimization for LassoCV or LogisticRegressionCV: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.
max_iter – The maximum number of iterations for LassoCV or LogisticRegressionCV.
n_jobs – Number of CPUs to use during the cross validation in LassoCV or LogisticRegressionCV. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
-
rule_ensemble
¶ The rule ensemble
- Type:
RuleEnsemble
-
feature_names
¶ The names of the features (columns)
- Type:
list of strings, optional (default=None)
-
get_feature_importance
(exclude_zero_coef=False, subregion=None, scaled=False)[source]¶ Returns feature importance for input features to RuleFit model.
- Parameters:
exclude_zero_coef – If True, returns only the rules with an estimated coefficient not equal to zero.
subregion – If None (default), returns global importances (FP 2004 eq. 28/29); otherwise returns importance over a subregion of inputs (FP 2004 eq. 30/31/32).
scaled – If True, will scale the importances to have a max of 100.
- Returns:
return_df (pandas DataFrame) – DataFrame of feature names and feature importances (FP 2004 eq. 35)
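Scaling importances to a maximum of 100 is a simple rescaling by the largest value. A minimal sketch with a hypothetical helper and made-up importances, shown on a plain dict rather than a DataFrame:

```python
def scale_importances(importances):
    """Rescale importances so that the largest value becomes 100."""
    largest = max(importances.values(), default=0.0)
    if largest == 0:
        return dict(importances)  # nothing to scale
    return {name: 100.0 * value / largest for name, value in importances.items()}


# Hypothetical raw importances for three features.
raw = {"loc": 0.5, "churn": 0.25, "age": 0.0}
scaled_importances = scale_importances(raw)
```

Scaled values are easier to compare across models because the most important feature always sits at 100.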
-
get_rules
(exclude_zero_coef=False, subregion=None)[source]¶ Return the estimated rules
- Parameters:
exclude_zero_coef – If True, returns only the rules with an estimated coefficient not equal to zero (default is False).
subregion – If None (default), returns global importances (FP 2004 eq. 28/29); otherwise returns importance over a subregion of inputs (FP 2004 eq. 30/31/32).
- Returns:
rules – DataFrame with the rules. Column ‘rule’ describes the rule, ‘coef’ holds the coefficients and ‘support’ the support of the rule in the training data set (X)
- Return type:
pandas.DataFrame