mousestyles.classification package

Submodules

mousestyles.classification.classification module

mousestyles.classification.classification.fit_gradient_boosting(train_y, train_x, test_x, n_estimators=None, learning_rate=None)[source]

Returns a DataFrame of Gradient Boosting results containing the predicted strain labels, and prints the best model. The model’s parameters are tuned by cross-validation, and user-defined parameter values are accepted.

Parameters:
  • train_y (pandas.Series) – training labels, i.e. the strains to be predicted
  • train_x (pandas.DataFrame) – features used to predict strains in the training set
  • test_x (pandas.DataFrame) – features used to predict strains in the testing set
  • n_estimators (optional) – tuning parameter of GradientBoosting, which is the number of boosting stages to perform
  • learning_rate (list, optional) – tuning parameter of GradientBoosting, which shrinks the contribution of each tree by learning_rate
Returns:GradientBoosting results – Predicted strain labels
Return type:pandas.DataFrame
mousestyles.classification.classification.fit_random_forest(train_y, train_x, test_x, n_estimators=None, max_feature=None, importance_level=None)[source]

Returns RandomForest results containing the predicted strain labels, and prints the best model. The model’s parameters are tuned by cross-validation, and user-defined parameter values are accepted.

Parameters:
  • train_y (pandas.Series) – training labels, i.e. the strains to be predicted
  • train_x (pandas.DataFrame) – features used to predict strains in the training set
  • test_x (pandas.DataFrame) – features used to predict strains in the testing set
  • n_estimators (optional) – tuning parameter of RandomForest, which is the number of trees in the forest
  • max_feature (list, optional) – tuning parameter of RandomForest, which is the number of features to consider when looking for the best split
  • importance_level (int, optional) – the minimum importance of features
Returns:

RandomForest results – The first element is the DataFrame of predicted strain labels. The second element is a list of (score, feature) tuples for the features whose importance exceeds the importance level.

Return type:

list
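The tuning-and-filtering behaviour described above can be sketched with scikit-learn. This is only an illustration under stated assumptions, not the package's actual implementation: the synthetic data, the column name predict_strain, the grid values, the 3-fold cross-validation and the 0.1 threshold are all hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the mouse data: 120 rows, 6 features, 3 "strains".
X, y = make_classification(n_samples=120, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)
features = pd.DataFrame(X, columns=["f{}".format(i) for i in range(6)])
strain = pd.Series(y, name="strain")
train_x, test_x = features.iloc[:90], features.iloc[90:]
train_y = strain.iloc[:90]

# Candidate values playing the role of the n_estimators and max_feature lists.
param_grid = {"n_estimators": [10, 20], "max_features": [2, 3]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(train_x, train_y)
best_model = search.best_estimator_

# First element: predicted labels for the test set (column name is hypothetical).
predictions = pd.DataFrame({"predict_strain": best_model.predict(test_x)})

# Second element: (score, feature) pairs whose importance exceeds the threshold.
importance_level = 0.1  # illustrative value only
important = sorted(
    ((score, name)
     for score, name in zip(best_model.feature_importances_, train_x.columns)
     if score > importance_level),
    reverse=True,
)
result = [predictions, important]
```

Here search.best_params_ holds the combination selected by cross-validation, which corresponds to the "best model" that the documented function prints.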

mousestyles.classification.classification.fit_svm(train_y, train_x, test_x, c=None, gamma=None)[source]

Returns a DataFrame of SVM results containing the predicted strain labels, and prints the best model. The model’s parameters are tuned by cross-validation, and user-defined parameter values are accepted.

Parameters:
  • train_y (pandas.Series) – training labels, i.e. the strains to be predicted
  • train_x (pandas.DataFrame) – features used to predict strains in the training set
  • test_x (pandas.DataFrame) – features used to predict strains in the testing set
  • c (list, optional) – tuning parameter of SVM, which is the penalty parameter of the error term
  • gamma (list, optional) – tuning parameter of SVM, which is the kernel coefficient
Returns:svm results – Predicted strain labels
Return type:pandas.DataFrame
mousestyles.classification.classification.get_summary(predict_labels, true_labels)[source]

Returns a DataFrame summarizing the classification results: precision, recall and F1 measure for each strain.

Parameters:
  • predict_labels (pandas.DataFrame) – predicted strain labels
  • true_labels – true strain labels, used to measure the prediction accuracy
Returns:classification result summary – 16 rows, one for each strain 0-15. Column 0: precision, Column 1: recall, Column 2: F1 measure
Return type:pandas.DataFrame, shape (16, 3)
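The three summary columns follow the standard per-class definitions of precision, recall and F1. A minimal pure-Python sketch of those definitions is given here; the helper name summarize and the convention of reporting 0.0 when a denominator is zero are assumptions, not taken from the package.

```python
def summarize(predict_labels, true_labels, strains=range(16)):
    """Return one (precision, recall, f1) tuple per strain."""
    pairs = list(zip(predict_labels, true_labels))
    rows = []
    for s in strains:
        tp = sum(1 for p, t in pairs if p == s and t == s)  # true positives
        fp = sum(1 for p, t in pairs if p == s and t != s)  # false positives
        fn = sum(1 for p, t in pairs if p != s and t == s)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        rows.append((precision, recall, f1))
    return rows


# Toy example with three strains instead of sixteen.
summary = summarize([0, 0, 1, 1, 2], [0, 1, 1, 1, 2], strains=[0, 1, 2])
```

In the toy example, strain 0 is predicted twice but is correct only once, so its precision is 0.5 while its recall is 1.0.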
mousestyles.classification.classification.prep_data(strain, features, rseed=222)[source]
Returns a list of 4: [train_y, train_x, test_y, test_x]
  • train_y: pandas.Series of strain labels in the training set
  • train_x: pandas.DataFrame of features in the training set
  • test_y: pandas.Series of strain labels in the test set
  • test_x: pandas.DataFrame of features in the test set
Parameters:
  • strain (pandas.Series) – classification labels
  • features (pandas.DataFrame) – classification features
  • rseed (int, optional) – random seed for shuffling the data set to separate train and test
Returns:

split data – A list of 4 as explained above

Return type:

list
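A seeded shuffle followed by a split, as described above, might look like the following pure-Python sketch. The helper name split_train_test and the 25% test fraction are assumptions; the source does not state the proportion the package uses.

```python
import random


def split_train_test(strain, features, rseed=222, test_fraction=0.25):
    """Shuffle row indices with a fixed seed, then split labels and features."""
    indices = list(range(len(strain)))
    random.Random(rseed).shuffle(indices)  # reproducible for a given rseed
    n_test = int(round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train_y = [strain[i] for i in train_idx]
    train_x = [features[i] for i in train_idx]
    test_y = [strain[i] for i in test_idx]
    test_x = [features[i] for i in test_idx]
    return [train_y, train_x, test_y, test_x]


# Toy data: the first feature is the row number, the label is its parity.
features = [[i, 2 * i] for i in range(8)]
strain = [i % 2 for i in range(8)]
train_y, train_x, test_y, test_x = split_train_test(strain, features)
```

Using the same index list for labels and features keeps each row aligned with its strain after shuffling.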

mousestyles.classification.clustering module

mousestyles.classification.clustering.cluster_in_strain(labels_first, labels_second)[source]

Returns a dictionary giving the count of each cluster within each strain (when cluster labels are passed first) or the count of each strain within each cluster (when strain labels are passed first).

Parameters:
  • labels_first (numpy array or list) – A numpy array or list of integers representing which cluster each mouse is in, or which strain each mouse belongs to.
  • labels_second (numpy array or list) – A numpy array or list of integers (0-15) representing which strain each mouse belongs to, or which cluster each mouse is in.
Returns:

count_data – A dictionary whose keys are strain numbers and whose values are lists giving the distribution of clusters, or whose keys are cluster numbers and whose values are lists giving the distribution of strains.

Return type:

dictionary

Examples

>>> from mousestyles.classification.clustering import cluster_in_strain
>>> count_1 = cluster_in_strain([1,2,1,0,0],[0,1,1,2,1])
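
The counting described above can be sketched in pure Python. The helper name count_first_within_second is hypothetical, and the exact layout of the lists returned by cluster_in_strain may differ; here each list has one slot per distinct value of labels_first, in sorted order.

```python
def count_first_within_second(labels_first, labels_second):
    """Count how often each labels_first value occurs per labels_second value."""
    categories = sorted(set(labels_first))
    position = {c: i for i, c in enumerate(categories)}
    counts = {}
    for first, second in zip(labels_first, labels_second):
        # One list per key, with one counter per distinct labels_first value.
        row = counts.setdefault(second, [0] * len(categories))
        row[position[first]] += 1
    return counts


# Same inputs as the example above: cluster labels first, strain labels second.
counts = count_first_within_second([1, 2, 1, 0, 0], [0, 1, 1, 2, 1])
# counts == {0: [0, 1, 0], 1: [1, 1, 1], 2: [1, 0, 0]}
```

Under this layout, strain 1 has one mouse in each of clusters 0, 1 and 2.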
mousestyles.classification.clustering.fit_hc(mouse_day_X, method, dist, num_clusters=[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16])[source]
Returns a list of 2: [silhouettes, cluster_labels]

silhouettes: list of float; cluster_labels: list of list, where each sublist holds the labels corresponding to the silhouette at the same position
Parameters:
  • mouse_day_X (a 170 * M numpy array) – all columns corresponding to feature avg/std of a mouse over 16 days
  • method (str) – method of calculating distance between clusters
  • dist (str) – distance metric
  • num_clusters (range) – range of number of clusters
Returns:

[silhouettes, cluster_labels] – A list of 2, as explained above

Return type:

list
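
For readers unfamiliar with the silhouette values returned here, the following pure-Python sketch computes the mean silhouette coefficient of one labelling. It assumes Euclidean distance, at least two clusters, and a score of 0 for single-point clusters; the helper name mean_silhouette is hypothetical and the package may compute its scores differently.

```python
import math


def mean_silhouette(points, labels):
    """Mean silhouette coefficient of a labelling (Euclidean distance)."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-point clusters
            continue
        # a: mean distance to the other points of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the points of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != label)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = mean_silhouette(points, [0, 0, 1, 1])  # tight, well-separated clusters
bad = mean_silhouette(points, [0, 1, 0, 1])   # clusters mix distant points
```

Values close to 1 indicate compact, well-separated clusters, so the first labelling scores higher than the second.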

mousestyles.classification.clustering.get_optimal_fit_kmeans(mouse_X, num_clusters, raw=False)[source]
Returns a list of 2: [silhouettes, cluster_labels]

silhouettes: list of float; cluster_labels: list of list, where each sublist holds the labels corresponding to the silhouette at the same position
Parameters:
  • mouse_X (a 170 * M numpy array or 21131 * M numpy array) – either the feature avg/std columns of each mouse over 16 days, or the raw data without averaging over days
  • num_clusters (range, list or numpy array) – range of number of clusters
  • raw (bool, default False) – False if using the 170 * M array
Returns:

[silhouettes, cluster_labels] – A list of 2, as explained above

Return type:

list
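
Because the two returned lists are aligned with num_clusters, a caller could pick the clustering with the highest silhouette as sketched here. The helper name pick_best_clustering and all numbers in the example are made up for illustration; the source does not say that the package performs this selection itself.

```python
def pick_best_clustering(num_clusters, silhouettes, cluster_labels):
    """Return (k, silhouette, labels) for the highest silhouette score."""
    best = max(range(len(silhouettes)), key=silhouettes.__getitem__)
    return num_clusters[best], silhouettes[best], cluster_labels[best]


# Hypothetical output of a clustering run over k = 2, 3, 4 on five mice.
num_clusters = range(2, 5)
silhouettes = [0.41, 0.57, 0.33]
cluster_labels = [[0, 0, 1, 1, 1], [0, 1, 1, 2, 2], [0, 1, 2, 3, 3]]

best_k, best_score, best_labels = pick_best_clustering(
    num_clusters, silhouettes, cluster_labels)
```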

mousestyles.classification.clustering.get_optimal_hc_params(mouse_day)[source]
Returns a list of 2: [method, dist]
method: one of {‘ward’, ‘average’, ‘complete’}; dist: one of {‘cityblock’, ‘euclidean’, ‘chebychev’}
Parameters:mouse_day (a 170 * M numpy array) – column 0: strain, column 1: mouse, other columns corresponding to feature avg/std of a mouse over 16 days
Returns:method_distance – [method, dist]
Return type:list
mousestyles.classification.clustering.prep_data(mouse_data, melted=False, std=True, rescale=True)[source]
Returns an ndarray to be used in clustering algorithms:

column 0: strain, column 1: mouse, other columns corresponding to feature avg/std of a mouse over 16 days

The columns may or may not be rescaled to the same unit, as specified by rescale.
Parameters:
  • mouse_data – either
    (i) a 21131 * (4 + ) pandas DataFrame, column 0: strain, column 1: mouse, column 2: day, column 3: hour, other columns corresponding to features

    or (ii) a 1921 * (3 + ) pandas DataFrame, column 0: strain, column 1: mouse, column 2: day, other columns corresponding to features
  • melted (bool) – False if the input mouse_data is of type (i)
  • std (bool) – whether the standard deviation of each feature is returned
  • rescale (bool) – whether each column is rescaled or not (each column is rescaled by its maximum)
Returns:The ndarray as specified
Return type:ndarray
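
The aggregation and rescaling described above can be sketched in pure Python for input shaped like type (ii). Several details are assumptions rather than facts about the package: the helper name aggregate_and_rescale, the use of the population standard deviation, the placement of all averages before all standard deviations, and rescaling by dividing each column by its maximum.

```python
from statistics import mean, pstdev


def aggregate_and_rescale(rows, std=True, rescale=True):
    """rows: (strain, mouse, day, feature_1, ..., feature_k) tuples."""
    groups = {}
    for row in rows:
        # Collect the feature values of each (strain, mouse) pair across days.
        groups.setdefault((row[0], row[1]), []).append(row[3:])
    out = []
    for (strain, mouse), days in sorted(groups.items()):
        columns = list(zip(*days))
        stats = [mean(c) for c in columns]
        if std:
            stats += [pstdev(c) for c in columns]
        out.append([strain, mouse] + stats)
    if rescale:
        for j in range(2, len(out[0])):
            col_max = max(r[j] for r in out)
            if col_max != 0:  # leave all-zero columns untouched
                for r in out:
                    r[j] /= col_max
    return out


rows = [
    (0, 0, 0, 2.0, 10.0), (0, 0, 1, 4.0, 30.0),  # mouse 0, two days
    (0, 1, 0, 6.0, 20.0), (0, 1, 1, 6.0, 20.0),  # mouse 1, two days
]
table = aggregate_and_rescale(rows)
```

The strain and mouse columns are left unscaled, so each output row still identifies the mouse it describes.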

Module contents