hNMF

hNMF implements Rank-2 NMF for Hierarchical Clustering as described in this paper and repository . hNMF is a fork of hierarchical-nmf-python with several modifications:

  • Interface to hNMF is provided with a scikit-learn compatible BaseEstimator

  • Improved performance timings

  • Convenience methods for interpreting results

Performance

The paper mentions that the hierarchical NMF process takes advantage of a fast 2-rank matrix decomposition. The Python implementation. While this may be true in MATLAB, the original Python implementation was bottlenecked when running the 2-rank decomposition.

Reference

hNMF

hNMF.model
class hnmf.model. HierarchicalNMF ( k: int , unbalanced: float = 0.1 , init: NMFInitMethod = None , solver: NMFSolver = 'cd' , beta_loss: NMFBetaLoss = 0 , alpha_W: float = 0.0 , alpha_H: AlphaH = 'same' , random_state: int = 42 , trial_allowance: int = 100 , tol: float = 1e-06 , maxiter: int = 10000 , dtype: DType = <class 'numpy.float64'> )

Methods

cluster_assignments ([leaves_only, ...])

Returns a mapping of features and their assigned cluster(s)

cluster_features ([leaves_only, id2feature, ...])

Returns the features assigned as a cluster to nodes

enrich_tree (n, id2feature)

Appends decoded top features to self.graph_ Useful if it is desired to export for visualization

fit (X)

Fit HierarchicalNMF to data

get_params ([deep])

Get parameters for this estimator.

json_graph ([fp])

Export the graph to json

set_params (**params)

Set the parameters of this estimator.

top_discriminative_samples_in_node (node[, ...])

Computes most discriminative samples (node vs rest)

top_features_in_node (node[, n, id2feature])

For a given node, return the top n features and values

top_features_in_nodes ([n, id2feature, ...])

Return the top n features and values from all nodes or nodes present in idx if idx is not None

top_nodes_in_feature (feature[, n, ...])

Returns the top nodes for a specified feature

top_nodes_in_features (features[, n, ...])

Returns the top nodes for a specified feature

top_nodes_in_samples ([n, leaves_only, id2sample])

Returns the top nodes for each sample.

top_samples_in_nodes ([n, leaves_only, id2sample])

Returns the top samples for each node

cluster_assignments ( leaves_only : bool = True , id2feature : Vectorizer = None , include_outliers : bool = True ) dict [ Union [ int , str ] , Union [ set [ int ] , set [ None ] ] ]

Returns a mapping of features and their assigned cluster(s)

Parameters
leaves_only

Whether to return only leaf nodes

id2feature

Decodes features

include_outliers

If True, include feature keys that are not assigned a cluster.

cluster_features ( leaves_only : bool = True , id2feature : Vectorizer = None , include_outliers : bool = True ) dict [ int , list [ Union [ str , int ] ] ]

Returns the features assigned as a cluster to nodes

Parameters
leaves_only

Whether to return only leaf nodes

id2feature

Decodes features

include_outliers

If True, features without a node assignment are returned under the key -1

enrich_tree ( n : int , id2feature : Vectorizer )

Appends decoded top features to self.graph_ Useful if it is desired to export for visualization

Parameters
n

The number of top items to add to each node

id2feature

Decodes features

fit ( X : AnyArray ) HierarchicalNMF

Fit HierarchicalNMF to data

Parameters
X csr_matrix or ndarray
Returns
None
json_graph ( fp : Optional [ str , PathLike ] = None )

Export the graph to json

Parameters
fp: str, PathLike

Optional, if provided graph is written to this filepath

top_discriminative_samples_in_node ( node : int , n : int = 10 , sign : Sign = 'abs' , id2sample : Vectorizer = None ) list [ DiscrimSample ]

Computes most discriminative samples (node vs rest)

Parameters
node
n

The number of features to return

sign

One of [‘positive’, ‘negative’, ‘abs’] .

id2sample

Decodes index of sample to something meaningful.

Returns
list of dict with form::

sample: Any node: int node_value: float others_value: float

top_features_in_node ( node : int , n : int = 10 , id2feature : Vectorizer = None ) list [ Tuple ]

For a given node, return the top n features and values

Parameters
node

Index of node to return top items

n

Number of items to return

id2feature

Optional, if provided returns decoded features

top_features_in_nodes ( n : int = 10 , id2feature : Vectorizer = None , nodes : Optional [ list [ int ] ] = None , leaves_only : bool = False ) list [ dict [ str , list [ tuple ] ] ]

Return the top n features and values from all nodes or nodes present in idx if idx is not None

Parameters
n

Number of items to return

id2feature: Vectorizer

Optional, if provided returns decoded items

nodes: list[int]

Optional, if provided, returns top items only for nodes listed

leaves_only: bool

If True and idx is None return top items for all leaf nodes

top_nodes_in_feature ( feature : Union [ int , str ] , n : int = 10 , leaves_only : bool = True , id2feature : Vectorizer = None )

Returns the top nodes for a specified feature

Parameters
feature: int, str
The feature to return the top nodes for. If string, id2feature must not be None or

HierarchicalNMF.id2feature_ must not be None

n

The number of nodes to return

leaves_only

Whether to filter nodes returned to leaf nodes

id2feature

Optional, if provided encodes feature if feature is passed as a string

top_nodes_in_features ( features : Union [ list [ int ] , list [ str ] , NDArray ] , n : int = 10 , leaves_only : bool = True , id2feature : Vectorizer = None ) dict [ Union [ str , int ] , list [ tuple ] ]

Returns the top nodes for a specified feature

Parameters
features
The features to return the top nodes for. If features are strings, id2feature must not be None or

HierarchicalNMF.id2feature_ must not be None

n

The number of nodes to return

leaves_only

Whether to filter nodes returned to leaf nodes

id2feature

Optional, if provided encodes feature if feature is passed as a string

top_nodes_in_samples ( n : int = 10 , leaves_only : bool = True , id2sample : Vectorizer = None )

Returns the top nodes for each sample.

Parameters
n

Number of items to return

leaves_only

Whether to filter top items to nodes identified as leaves

id2sample

Optional, if provided returns decoded samples. Should be of form idx : decoded_value

top_samples_in_nodes ( n : int = 10 , leaves_only : bool = True , id2sample : Vectorizer = None )

Returns the top samples for each node

Parameters
n

The number of samples to include for each node

leaves_only

Only include leaves

id2sample

Decodes samples

hNMF.helpers
hnmf.helpers. hier8_neat ( X , k , random_state=0 , trial_allowance: int = 3 , unbalanced: float = 0.1 , vec_norm: Union[float , int] = 2.0 , normW: bool = True , anls_alg: callable = <function anls_entry_rank2_precompute> , tol: float = 0.0001 , maxiter: int = 10000 )
Parameters
X

Array

k

int, The number of desired leaf nodes

random_state int
trial_allowance int

Number of trials allowed for removing outliers and splitting a node again. See parameter T in Algorithm 3 in the reference paper.

unbalanced float

A threshold to determine if one of the two clusters is an outlier set. A smaller value means more tolerance for unbalance between two clusters. See parameter beta in Algorithm 3 in the reference paper.

vec_norm

Indicates which norm to use for the normalization of W or H, e.g. vec_norm=2 means Euclidean norm; vec_norm=0 means no normalization.

normW

true if normalizing columns of W; false if normalizing rows of H.

anls_alg

The function handle to NNLS algorithm whose signature is the same as @anls_entry_rank2_precompute

tol

Tolerance parameter for stopping criterion in each run of NMF.

maxiter

Maximum number of iteration times in each run of NMF

Returns
From the output parameters, you can reconstruct the tree and “replay” the k-1 steps that generated it.
For a binary tree with k leaf nodes, the total number of nodes (including leaf and non-leaf nodes)

is 2*(k-1) plus the root node, because k-1 splits are performed and each split generates two new nodes.

We only keep track of the 2*(k-1) non-root node in the output.
tree

A 2-by-(k-1) matrix that encodes the tree structure. The two entries in the i-th column are the numberings of the two children of the node with numbering i. The root node has numbering 0, with its two children always having numbering 1 and numbering 2. Thus the root node is NOT included in the ‘tree’ variable.

splits

An array of length k-1. It keeps track of the numberings of the nodes being split from the 1st split to the (k-1)-th split. (The first entry is always 0.)

is_leaf

An array of length 2*(k-1). A “1” at index i means that the node with numbering i is a leaf node in the final tree generated, and “0” indicates non-leaf nodes in the final tree.

clusters

A cell array of length 2*(k-1). The i-th element contains the subset of items at the node with numbering i.

Ws

array of length 2*(k-1). Its i-th element is the topic vector of the cluster at the node with numbering i.

priorities

An array of length 2*(k-1). Its i-th element is the modified NDCG scores at the node with numbering i (see the reference paper).

Notes

Adapted from [rank-2]

hNMF

Indices and tables

Citations

rank-2

Da Kuang, Haesun Park, Fast rank-2 nonnegative matrix factorization for hierarchical document clustering, The 19th ACM SIGKDD International Conference on Knowledge, Discovery, and Data Mining (KDD ‘13), pp. 739-747, 2013.