pyexplainer package

Submodules

pyexplainer.pyexplainer_pyexplainer module

pyexplainer.pyexplainer_pyexplainer.AutoSpearman(X_train, correlation_threshold=0.7, correlation_method='spearman', VIF_threshold=5)[source]

An automated feature selection approach that addresses collinearity and multicollinearity. For more information, please refer to the original AutoSpearman paper.

Parameters:
  • X_train (pd.core.frame.DataFrame) – The X_train data to be processed

  • correlation_threshold (float) – Threshold value of correlation.

  • correlation_method (str) – Method used to compute the correlation between features.

  • VIF_threshold (int) – Threshold value of VIF score.

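A minimal usage sketch, following the data layout of the examples below (the CSV path is illustrative; AutoSpearman is assumed here to return the DataFrame of selected features):

>>> import pandas as pd
>>> from pyexplainer.pyexplainer_pyexplainer import AutoSpearman
>>> data = pd.read_csv('../tests/pyexplainer_test_data/activemq-5.0.0.csv', index_col='File')
>>> X_train = data.loc[:, data.columns[0:(len(data.columns) - 4)]]
>>> X_train_selected = AutoSpearman(X_train, correlation_threshold=0.7, VIF_threshold=5)
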
class pyexplainer.pyexplainer_pyexplainer.PyExplainer(X_train, y_train, indep, dep, blackbox_model, class_label=['Clean', 'Defect'], top_k_rules=3, full_ft_names=[])[source]

Bases: object

A PyExplainer object is able to load training data and an ML model to generate human-centric explanations and visualisations

Parameters:
  • X_train (pandas.core.frame.DataFrame) – Training data X (Features)

  • y_train (pandas.core.series.Series) – Training data y (Label)

  • indep (pandas.core.indexes.base.Index) – independent variables (column names)

  • dep (str) – dependent variable (column name)

  • blackbox_model (sklearn.ensemble.RandomForestClassifier) – A global random forest model trained from sklearn

  • class_label (list) – Classification labels, default = [‘Clean’, ‘Defect’]

  • top_k_rules (int) – Number of top positive and negative rules to be retrieved

  • full_ft_names (list) – A list containing full feature names inside X_train

auto_spearman(apply_to_X_train=True, correlation_threshold=0.7, correlation_method='spearman', VIF_threshold=5)[source]

An automated feature selection approach that addresses collinearity and multicollinearity. For more information, please refer to the original AutoSpearman paper.

Parameters:
  • apply_to_X_train (bool) – Whether to apply the selected columns to the X_train data inside PyExplainer Obj., default is True

  • correlation_threshold (float) – Threshold value of correlation.

  • correlation_method (str) – Method used to compute the correlation between features.

  • VIF_threshold (int) – Threshold value of VIF score.

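A short sketch of the method form, assuming py_explainer is a constructed PyExplainer object as in the explain() example below; with apply_to_X_train=True the selected columns are applied to the X_train held inside the object:

>>> py_explainer.auto_spearman(apply_to_X_train=True, correlation_threshold=0.7, VIF_threshold=5)
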
explain(X_explain, y_explain, top_k=3, max_rules=2000, max_iter=10000, cv=5, search_function='CrossoverInterpolation', random_state=None, reuse_local_model=False)[source]

Generate a rule object manually by passing X_explain and y_explain

Parameters:
  • X_explain (pandas.core.frame.DataFrame) – Features to be explained by the local RuleFit model, can be seen as X_test

  • y_explain (pandas.core.series.Series) – Label to be explained by the local RuleFit model, can be seen as y_test

  • top_k (int, default is 3) – Number of top rules to be retrieved

  • max_rules (int, default is 2000) – Maximum number of rules to be generated

  • max_iter (int, default is 10000) – Maximum number of iterations used when fitting the local RuleFit model

  • cv (int, default is 5) – Number of cross-validation folds used when fitting the local RuleFit model

  • search_function (str, default is ‘CrossoverInterpolation’) – Name of the search function used to generate the synthetic instances passed to RuleFit.fit()

  • random_state (int, default is None) – Random seed for reproducing the same result

  • reuse_local_model (bool, default is False) – Reproduce the same explanation for the same data

Returns:

A dict rule object including all of the data related to the local RuleFit model, with the following keys: ‘synthetic_data’, ‘synthetic_predictions’, ‘X_explain’, ‘y_explain’, ‘indep’, ‘dep’, ‘top_k_positive_rules’, ‘top_k_negative_rules’.

Return type:

dict

Examples

>>> from pyexplainer.pyexplainer_pyexplainer import PyExplainer
>>> import pandas as pd
>>> from sklearn.ensemble import RandomForestClassifier
>>> data = pd.read_csv('../tests/pyexplainer_test_data/activemq-5.0.0.csv', index_col = 'File')
>>> dep = data.columns[-4]
>>> indep = data.columns[0:(len(data.columns) - 4)]
>>> X_train = data.loc[:, indep]
>>> y_train = data.loc[:, dep]
>>> blackbox_model = RandomForestClassifier(max_depth=3, random_state=0)
>>> blackbox_model.fit(X_train, y_train)
>>> class_label = ['Clean', 'Defect']
>>> py_explainer = PyExplainer(X_train, y_train, indep, dep, blackbox_model, class_label)
>>> sample_test_data = pd.read_csv('../tests/pyexplainer_test_data/activemq-5.0.0.csv', index_col='File')
>>> X_test = sample_test_data.loc[:, indep]
>>> y_test = sample_test_data.loc[:, dep]
>>> sample_explain_index = 0
>>> X_explain = X_test.iloc[[sample_explain_index]]
>>> y_explain = y_test.iloc[[sample_explain_index]]
>>> rule_obj = py_explainer.explain(X_explain, y_explain, search_function='CrossoverInterpolation', top_k=3, max_rules=30, max_iter=5, cv=5)
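
The returned rule object is a plain dict, so the keys documented above can be inspected directly:

>>> sorted(rule_obj.keys())
>>> rule_obj['top_k_positive_rules']
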
generate_bullet_data(parsed_rule_object)[source]

Generate bullet chart data (a list of dict) to be rendered by the d3.js chart.

Parameters:

parsed_rule_object (dict) – Top rules parsed from Rule object.

Returns:

A list of dict that contains the data needed to generate a bullet chart.

Return type:

list

generate_html()[source]

Generate d3 bullet chart html and return it as a String.

Returns:

html String

Return type:

str

generate_instance_crossover_interpolation(X_explain, y_explain, random_state=None, debug=False)[source]

An approach to generate instance using Crossover and Interpolation

Parameters:
  • X_explain (pandas.core.frame.DataFrame) – X_explain (Testing Features)

  • y_explain (pandas.core.series.Series) – y_explain (Testing Label)

  • random_state (int) – Random Seed

  • debug (bool) – True for debugging mode, False otherwise.

Returns:

A dict with two keys ‘synthetic_data’ and ‘sampled_class_frequency’ generated via Crossover and Interpolation.

Return type:

dict

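A hedged sketch of calling this generator directly, reusing py_explainer, X_explain, and y_explain from the explain() example above (assuming the ‘synthetic_data’ value is a DataFrame):

>>> synthetic = py_explainer.generate_instance_crossover_interpolation(X_explain, y_explain, random_state=0)
>>> synthetic['synthetic_data'].head()
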
generate_instance_random_perturbation(X_explain, debug=False)[source]

The random perturbation approach to generating synthetic instances, which is also used by LIME.

Parameters:
  • X_explain (pandas.core.frame.DataFrame) – X_explain (Testing Features)

  • debug (bool) – True for debugging mode, False otherwise.

Returns:

A dict with two keys ‘synthetic_data’ and ‘sampled_class_frequency’ generated via Random Perturbation.

Return type:

dict

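The same pattern applies to this LIME-style generator (a sketch, with names as in the example above):

>>> synthetic = py_explainer.generate_instance_random_perturbation(X_explain)
>>> synthetic['synthetic_data'].head()
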
generate_progress_bar_items()[source]

Generate items to be set into hbox (horizontal box)

generate_risk_data(X_explain)[source]

Generate risk prediction and risk score to be visualised

Parameters:

X_explain (pandas.core.frame.DataFrame) – Explained Dataframe generated from RuleFit model.

Returns:

A list of dict that contains the data of risk prediction and risk score.

Return type:

list

generate_sliders()[source]

Generate one or more slider widgets and return them as a list. Each slider will be either an IntSlider or a FloatSlider depending on the value type in the data

Returns:

A list of slider widgets.

Return type:

list

get_full_ft_names()[source]

Getter of self.full_ft_names

Returns:

A list of full feature names in X_train following the same order as X_train

Return type:

list

get_risk_pred()[source]

Retrieve the risk prediction from risk_data

Returns:

A string of risk prediction

Return type:

str

get_risk_score()[source]

Retrieve the risk score from risk_data

Returns:

A float of risk score

Return type:

float

get_top_k_rules()[source]

Getter of top_k_rules

Returns:

Number of top positive and negative rules to be retrieved

Return type:

int

on_value_change(change, debug=False)[source]

The callback function for the interactive slider

If the slider is in non-continuous update mode, this callback is triggered only when the mouse click is released. If the slider is in continuous update mode (not recommended here), it is triggered continuously while the user moves the slider.

This callback first clears the output of the Risk Score Progress Bar and the Bullet Chart. It then calls functions to compute the new values to be visualised and, once the computation is done, visualises them.

Parameters:

change (dict) – A dict that contains the former (before the change) and later (after the change) values of the slider

parse_top_rules(top_k_positive_rules, top_k_negative_rules)[source]

Parse top k positive rules and top k negative rules given positive and negative rules as DataFrame

Parameters:
  • top_k_positive_rules (pandas.core.frame.DataFrame) – Top positive rules DataFrame

  • top_k_negative_rules (pandas.core.frame.DataFrame) – Top negative rules DataFrame

Returns:

A dict containing two keys, ‘top_tofollow_rules’ and ‘top_toavoid_rules’

Return type:

dict

retrieve_X_explain_min_max_values()[source]

Retrieve the minimum and maximum values from X_train

Returns:

A dict containing two keys, ‘min_values’ and ‘max_values’

Return type:

dict

run_bar_animation()[source]

Run the animation of Risk Score Progress Bar

set_X_train(X_train)[source]

Setter of X_train

Parameters:

X_train (pandas.core.frame.DataFrame) – X_train data

set_full_ft_names(full_ft_names)[source]

Setter of full_ft_names

Parameters:

full_ft_names (list) – A list of full feature names in X_train following the same order as X_train

set_top_k_rules(top_k_rules)[source]

Setter of top_k_rules

Parameters:

top_k_rules (int) – Number of top positive and negative rules to be retrieved

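A short sketch pairing this setter with its getter (assuming a constructed PyExplainer object named py_explainer):

>>> py_explainer.set_top_k_rules(5)
>>> py_explainer.get_top_k_rules()
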
show_visualisation(title)[source]

Display the following items: (1) Risk Score Progress Bar (made from ipywidgets), (2) Interactive Slider (made from ipywidgets), (3) Bullet Chart (generated by D3.js)

update_right_text(right_text)[source]

Update the text on the rightward side of the Risk Score Progress Bar

Parameters:

right_text (widgets.Label) – Text on the rightward side of the Risk Score Progress Bar

update_risk_score(risk_score)[source]

Update the risk score value inside the risk_data

Parameters:

risk_score (int) – Value of risk score

visualisation_data_setup(rule_obj)[source]

Set up the data before visualising them

Parameters:

rule_obj (dict) – A rule dict generated either through loading the .pyobject file or the .explain(…) function

visualise(rule_obj, title=None)[source]

Given the rule object, show all of the visualisations: (1) Risk Score Progress Bar (made from ipywidgets), (2) Interactive Slider (made from ipywidgets), (3) Bullet Chart (generated by D3.js)

Parameters:

rule_obj (dict) – A rule dict generated either through loading the .pyobject file or the .explain(…) function

Examples

>>> from pyexplainer.pyexplainer_pyexplainer import PyExplainer
>>> import pandas as pd
>>> from sklearn.ensemble import RandomForestClassifier
>>> data = pd.read_csv('../tests/pyexplainer_test_data/activemq-5.0.0.csv', index_col = 'File')
>>> dep = data.columns[-4]
>>> indep = data.columns[0:(len(data.columns) - 4)]
>>> X_train = data.loc[:, indep]
>>> y_train = data.loc[:, dep]
>>> blackbox_model = RandomForestClassifier(max_depth=3, random_state=0)
>>> blackbox_model.fit(X_train, y_train)
>>> class_label = ['Clean', 'Defect']
>>> pyExp = PyExplainer(X_train, y_train, indep, dep, blackbox_model, class_label)
>>> sample_test_data = pd.read_csv('../tests/pyexplainer_test_data/activemq-5.0.0.csv', index_col = 'File')
>>> X_test = sample_test_data.loc[:, indep]
>>> y_test = sample_test_data.loc[:, dep]
>>> sample_explain_index = 0
>>> X_explain = X_test.iloc[[sample_explain_index]]
>>> y_explain = y_test.iloc[[sample_explain_index]]
>>> rule_obj = pyExp.explain(X_explain, y_explain, search_function='CrossoverInterpolation', top_k=3, max_rules=30, max_iter=5, cv=5)
>>> pyExp.visualise(rule_obj)
pyexplainer.pyexplainer_pyexplainer.data_validation(data)[source]

Validate that the given data is a list of dictionaries.

Parameters:

data (Any) – Data to be validated.

Returns:

True: The data is a list of dictionaries.

False: The data is not a list of dictionaries.

Return type:

bool

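For example, following the documented behaviour:

>>> from pyexplainer.pyexplainer_pyexplainer import data_validation
>>> data_validation([{'a': 1}, {'b': 2}])
True
>>> data_validation(['a', 'b'])
False
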
pyexplainer.pyexplainer_pyexplainer.filter_rules(rules, X_explain)[source]

Get rules that are actually applied to the commit

Parameters:
  • rules (pandas.core.frame.DataFrame) – Rules data under the column called ‘rule’ inside Rules DF generated by RuleFit

  • X_explain (pandas.core.frame.DataFrame) – Features to be explained by the local RuleFit model, can be seen as X_test

Returns:

A DataFrame that contains filtered rules

Return type:

pandas.core.frame.DataFrame

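A sketch of typical use, assuming rules_df is the DataFrame returned by a fitted RuleFit model’s get_rules() and X_explain is the single-row DataFrame being explained:

>>> from pyexplainer.pyexplainer_pyexplainer import filter_rules
>>> applied_rules = filter_rules(rules_df, X_explain)
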
pyexplainer.pyexplainer_pyexplainer.get_base_prefix_compat()[source]

Get base/real prefix, or sys.prefix if there is none.

pyexplainer.pyexplainer_pyexplainer.get_dflt()[source]

Obtain the default data and model

Returns:

A dictionary wrapping all default data and model

Return type:

dict

pyexplainer.pyexplainer_pyexplainer.id_generator(size=15, random_state=None)[source]

Generate unique ids for div tag which will contain the visualisation stuff from d3.

Parameters:
  • size (int) – An integer that specifies the length of the returned id, default = 15. Size should be in the range 1 - 30 (both inclusive).

  • random_state (np.random.RandomState, default is None.) – A RandomState instance.

Returns:

A random identifier.

Return type:

str

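For example (output omitted since the identifier is random):

>>> import numpy as np
>>> from pyexplainer.pyexplainer_pyexplainer import id_generator
>>> div_id = id_generator(size=10, random_state=np.random.RandomState(0))
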
pyexplainer.pyexplainer_pyexplainer.in_virtualenv()[source]
pyexplainer.pyexplainer_pyexplainer.load_sample_data()[source]
pyexplainer.pyexplainer_pyexplainer.to_js_data(list_of_dict)[source]

Transform a python list into a str to be used inside the html <script></script> tag

Parameters:

list_of_dict (list) – Data to be transformed.

Returns:

A str representing a list of dict, ending with ‘;’

Return type:

str

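A minimal sketch (the dict contents here are illustrative; per the docstring the result is a str ending with ‘;’):

>>> from pyexplainer.pyexplainer_pyexplainer import to_js_data
>>> js_data = to_js_data([{'title': 'LOC', 'value': 30}])
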
pyexplainer.rulefit module

We use the RuleFit implementation as provided by the following url: https://raw.githubusercontent.com/christophM/rulefit/master/rulefit/rulefit.py

Linear model of tree-based decision rules

This module implements the RuleFit algorithm

The module structure is the following:

  • RuleCondition implements a binary feature transformation

  • Rule implements a Rule composed of RuleConditions

  • RuleEnsemble implements an ensemble of Rules

  • RuleFit implements the RuleFit algorithm

class pyexplainer.rulefit.FriedScale(winsorizer=None)[source]

Bases: object

Performs scaling of linear variables according to Friedman et al. 2005 Sec 5

Each variable is first Winsorized l -> l*, then standardised as 0.4 × l* / std(l*). Warning: this class should not be used directly.

scale(X)[source]
train(X)[source]
class pyexplainer.rulefit.Rule(rule_conditions, prediction_value)[source]

Bases: object

Class for binary rules built from a list of conditions

Warning: this class should not be used directly.

transform(X)[source]

Transform dataset.

Parameters:

X (array-like matrix) –

Returns:

X_transformed

Return type:

array-like matrix, shape=(n_samples, 1)

class pyexplainer.rulefit.RuleCondition(feature_index, threshold, operator, support, feature_name=None)[source]

Bases: object

Class for binary rule condition

Warning: this class should not be used directly.

transform(X)[source]

Transform dataset.

Parameters:

X (array-like matrix, shape=(n_samples, n_features)) –

Returns:

X_transformed

Return type:

array-like matrix, shape=(n_samples, 1)

class pyexplainer.rulefit.RuleEnsemble(tree_list, feature_names=None)[source]

Bases: object

Ensemble of binary decision rules

This class implements an ensemble of decision rules that extracts rules from an ensemble of decision trees.

Parameters:
  • tree_list (List or array of DecisionTreeClassifier or DecisionTreeRegressor) – Trees from which the rules are created

  • feature_names (List of strings, optional (default=None)) – Names of the features

rules

The ensemble of rules extracted from the trees

Type:

List of Rule

filter_rules(func)[source]
filter_short_rules(k)[source]
transform(X, coefs=None)[source]

Transform dataset.

Parameters:
  • X (array-like matrix, shape=(n_samples, n_features)) –

  • coefs (array-like, optional) – If supplied, this makes the prediction slightly more efficient by setting rules with zero coefficients to zero without calling Rule.transform().

Returns:

X_transformed – Transformed dataset. Each column represents one rule.

Return type:

array-like matrix, shape=(n_samples, n_out)

class pyexplainer.rulefit.RuleFit(tree_size=4, sample_fract='default', max_rules=2000, memory_par=0.01, tree_generator=None, rfmode='regress', lin_trim_quantile=0.025, lin_standardise=True, exp_rand_tree_size=True, model_type='rl', Cs=None, cv=3, tol=0.0001, max_iter=None, n_jobs=None, random_state=None)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Rulefit class

Parameters:
  • tree_size (int) – Number of terminal nodes in generated trees. If exp_rand_tree_size=True, this will be the mean number of terminal nodes.

  • sample_fract (float or 'default') – Fraction of randomly chosen training observations used to produce each tree. FP 2004 (Sec. 2)

  • max_rules (int) – Approximate total number of rules generated for fitting. Note that the actual number of rules will usually be lower than this due to duplicates.

  • memory_par (float) – Scale multiplier (shrinkage factor) applied to each new tree when sequentially induced. FP 2004 (Sec. 2)

  • rfmode (str) – 'regress' for regression or 'classify' for binary classification.

  • lin_standardise (bool) – If True, the linear terms will be standardised as per Friedman Sec 3.2 by multiplying the winsorised variable by 0.4/stdev.

  • lin_trim_quantile (float) – If lin_standardise is True, this quantile will be used to trim linear terms before standardisation.

  • exp_rand_tree_size (bool) – If True, each boosted tree will have a different maximum number of terminal nodes based on an exponential distribution about tree_size. (Friedman Sec 3.3)

  • model_type (str) – 'r': rules only; 'l': linear terms only; 'rl': both rules and linear terms.

  • random_state (int) – Integer to initialise random objects and provide repeatability.

  • tree_generator (GradientBoostingRegressor or GradientBoostingClassifier, optional (default=None)) – This object, if provided, will be used as-is to generate the rules and will override almost all the other properties above.

  • tol (float) – The tolerance for the optimization for LassoCV or LogisticRegressionCV: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.

  • max_iter (int) – The maximum number of iterations for LassoCV or LogisticRegressionCV.

  • n_jobs (int) – Number of CPUs to use during the cross validation in LassoCV or LogisticRegressionCV. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

rule_ensemble

The rule ensemble

Type:

RuleEnsemble

feature_names

The names of the features (columns)

Type:

list of strings, optional (default=None)

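A hedged end-to-end sketch on synthetic regression data (the data and argument values here are illustrative, not from the package):

>>> import numpy as np
>>> import pandas as pd
>>> from pyexplainer.rulefit import RuleFit
>>> rng = np.random.RandomState(42)
>>> X = pd.DataFrame(rng.rand(200, 4), columns=['f0', 'f1', 'f2', 'f3'])
>>> y = 3 * X['f0'].values + rng.rand(200)
>>> rf = RuleFit(rfmode='regress', max_rules=200, max_iter=1000, random_state=42)
>>> rf.fit(X.values, y, feature_names=list(X.columns))
>>> rules = rf.get_rules(exclude_zero_coef=True)
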
fit(X, y=None, feature_names=None)[source]

Fit and estimate linear combination of rule ensemble

get_feature_importance(exclude_zero_coef=False, subregion=None, scaled=False)[source]

Returns feature importances for the input features of the RuleFit model.

Parameters:
  • exclude_zero_coef (bool) – If True, returns only the rules with an estimated coefficient not equal to zero.

  • subregion (array-like or None) – If None (default), returns global importances (FP 2004 eq. 28/29); else returns importances over the given subregion of inputs (FP 2004 eq. 30/31/32).

  • scaled (bool) – If True, the importances will be scaled to have a max of 100.

Returns:

return_df – DataFrame of feature names and feature importances (FP 2004 eq. 35)

Return type:

pandas.core.frame.DataFrame

get_rules(exclude_zero_coef=False, subregion=None)[source]

Return the estimated rules

Parameters:
  • exclude_zero_coef (bool, default is False) – If True, returns only the rules with an estimated coefficient not equal to zero.

  • subregion (array-like or None) – If None (default), returns global importances (FP 2004 eq. 28/29); else returns importances over the given subregion of inputs (FP 2004 eq. 30/31/32).

Returns:

rules – A DataFrame with the rules. Column ‘rule’ describes the rule, ‘coef’ holds the coefficients, and ‘support’ holds the support of the rule in the training data set (X).

Return type:

pandas.core.frame.DataFrame

predict(X)[source]

Predict outcome for X

predict_proba(X)[source]

Predict outcome probability for X, if model type supports probability prediction method

transform(X=None, y=None)[source]

Transform dataset.

Parameters:

X (array-like matrix, shape=(n_samples, n_features)) – Input data to be transformed. Use dtype=np.float32 for maximum efficiency.

Returns:

X_transformed – Transformed data set

Return type:

matrix, shape=(n_samples, n_out)

class pyexplainer.rulefit.Winsorizer(trim_quantile=0.0)[source]

Bases: object

Performs Winsorization l -> l*

Warning: this class should not be used directly.

train(X)[source]
trim(X)[source]
pyexplainer.rulefit.extract_rules_from_tree(tree, feature_names=None)[source]

Helper to turn a tree into a set of rules

Module contents