hNMF 

hNMF implements Rank-2 NMF for Hierarchical Clustering as described in this paper and repository . hNMF is a fork of hierarchical-nmf-python with several modifications:

Interface to hNMF is provided with a scikit-learn compatible BaseEstimator
Improved performance timings
Convenience methods for interpreting results

Performance 

The paper mentions that the hierarchical NMF process takes advantage of a fast 2-rank matrix decomposition. The Python implementation. While this may be true in MATLAB, the original Python implementation was bottlenecked when running the 2-rank decomposition.

Reference 

hNMF 

hNMF.model 

class hnmf.model. HierarchicalNMF ( k: int , unbalanced: float = 0.1 , init: NMFInitMethod = None , solver: NMFSolver = 'cd' , beta_loss: NMFBetaLoss = 0 , alpha_W: float = 0.0 , alpha_H: AlphaH = 'same' , random_state: int = 42 , trial_allowance: int = 100 , tol: float = 1e-06 , maxiter: int = 10000 , dtype: DType = <class 'numpy.float64'> ) 

Methods

`cluster_assignments` ([leaves_only, ...])	Returns a mapping of features and their assigned cluster(s)
`cluster_features` ([leaves_only, id2feature, ...])	Returns the features assigned as a cluster to nodes
`enrich_tree` (n, id2feature)	Appends decoded top features to `self.graph_` Useful if it is desired to export for visualization
`fit` (X)	Fit HierarchicalNMF to data
`get_params` ([deep])	Get parameters for this estimator.
`json_graph` ([fp])	Export the graph to json
`set_params` (**params)	Set the parameters of this estimator.
`top_discriminative_samples_in_node` (node[, ...])	Computes most discriminative samples (node vs rest)
`top_features_in_node` (node[, n, id2feature])	For a given node, return the top n features and values
`top_features_in_nodes` ([n, id2feature, ...])	Return the top n features and values from all nodes or nodes present in idx if idx is not None
`top_nodes_in_feature` (feature[, n, ...])	Returns the top nodes for a specified feature
`top_nodes_in_features` (features[, n, ...])	Returns the top nodes for a specified feature
`top_nodes_in_samples` ([n, leaves_only, id2sample])	Returns the top nodes for each sample.
`top_samples_in_nodes` ([n, leaves_only, id2sample])	Returns the top samples for each node

cluster_assignments ( leaves_only : bool = True , id2feature : Vectorizer = None , include_outliers : bool = True ) → dict [ Union [ int , str ] , Union [ set [ int ] , set [ None ] ] ] 

Returns a mapping of features and their assigned cluster(s)

Parameters

leaves_only: Whether to return only leaf nodes
id2feature: Decodes features
include_outliers: If True, include feature keys that are not assigned a cluster.

cluster_features ( leaves_only : bool = True , id2feature : Vectorizer = None , include_outliers : bool = True ) → dict [ int , list [ Union [ str , int ] ] ] 

Returns the features assigned as a cluster to nodes

Parameters

leaves_only: Whether to return only leaf nodes
id2feature: Decodes features
include_outliers: If True, features without a node assignment are returned under the key -1

enrich_tree ( n : int , id2feature : Vectorizer ) 

Appends decoded top features to self.graph_ Useful if it is desired to export for visualization

Parameters

n: The number of top items to add to each node
id2feature: Decodes features

fit ( X : AnyArray ) → HierarchicalNMF 

Fit HierarchicalNMF to data

Parameters

X csr_matrix or ndarray

Returns

None

json_graph ( fp : Optional [ str , PathLike ] = None ) 

Export the graph to json

Parameters

fp: str, PathLike: Optional, if provided graph is written to this filepath

top_discriminative_samples_in_node ( node : int , n : int = 10 , sign : Sign = 'abs' , id2sample : Vectorizer = None ) → list [ DiscrimSample ] 

Computes most discriminative samples (node vs rest)

Parameters

node
n: The number of features to return
sign: One of [‘positive’, ‘negative’, ‘abs’] .
id2sample: Decodes index of sample to something meaningful.

Returns

list of dict with form::: sample: Any node: int node_value: float others_value: float

top_features_in_node ( node : int , n : int = 10 , id2feature : Vectorizer = None ) → list [ Tuple ] 

For a given node, return the top n features and values

Parameters

node: Index of node to return top items
n: Number of items to return
id2feature: Optional, if provided returns decoded features

top_features_in_nodes ( n : int = 10 , id2feature : Vectorizer = None , nodes : Optional [ list [ int ] ] = None , leaves_only : bool = False ) → list [ dict [ str , list [ tuple ] ] ] 

Return the top n features and values from all nodes or nodes present in idx if idx is not None

Parameters

n: Number of items to return
id2feature: Vectorizer: Optional, if provided returns decoded items
nodes: list[int]: Optional, if provided, returns top items only for nodes listed
leaves_only: bool: If True and idx is None return top items for all leaf nodes

top_nodes_in_feature ( feature : Union [ int , str ] , n : int = 10 , leaves_only : bool = True , id2feature : Vectorizer = None ) 

Returns the top nodes for a specified feature

Parameters

feature: int, str

The feature to return the top nodes for. If string, id2feature must not be None or: HierarchicalNMF.id2feature_ must not be None

n

The number of nodes to return

leaves_only

Whether to filter nodes returned to leaf nodes

id2feature

Optional, if provided encodes feature if feature is passed as a string

top_nodes_in_features ( features : Union [ list [ int ] , list [ str ] , NDArray ] , n : int = 10 , leaves_only : bool = True , id2feature : Vectorizer = None ) → dict [ Union [ str , int ] , list [ tuple ] ] 

Returns the top nodes for a specified feature

Parameters

features

The features to return the top nodes for. If features are strings, id2feature must not be None or: HierarchicalNMF.id2feature_ must not be None

n

The number of nodes to return

leaves_only

Whether to filter nodes returned to leaf nodes

id2feature

Optional, if provided encodes feature if feature is passed as a string

top_nodes_in_samples ( n : int = 10 , leaves_only : bool = True , id2sample : Vectorizer = None ) 

Returns the top nodes for each sample.

Parameters

n: Number of items to return
leaves_only: Whether to filter top items to nodes identified as leaves
id2sample: Optional, if provided returns decoded samples. Should be of form idx : decoded_value

top_samples_in_nodes ( n : int = 10 , leaves_only : bool = True , id2sample : Vectorizer = None ) 

Returns the top samples for each node

Parameters

n: The number of samples to include for each node
leaves_only: Only include leaves
id2sample: Decodes samples

hNMF.helpers 

hnmf.helpers. hier8_neat ( X , k , random_state=0 , trial_allowance: int = 3 , unbalanced: float = 0.1 , vec_norm: Union[float , int] = 2.0 , normW: bool = True , anls_alg: callable = <function anls_entry_rank2_precompute> , tol: float = 0.0001 , maxiter: int = 10000 ) 

Parameters

X: Array
k: int, The number of desired leaf nodes
random_state int
trial_allowance int: Number of trials allowed for removing outliers and splitting a node again. See parameter T in Algorithm 3 in the reference paper.
unbalanced float: A threshold to determine if one of the two clusters is an outlier set. A smaller value means more tolerance for unbalance between two clusters. See parameter beta in Algorithm 3 in the reference paper.
vec_norm: Indicates which norm to use for the normalization of W or H, e.g. vec_norm=2 means Euclidean norm; vec_norm=0 means no normalization.
normW: true if normalizing columns of W; false if normalizing rows of H.
anls_alg: The function handle to NNLS algorithm whose signature is the same as @anls_entry_rank2_precompute
tol: Tolerance parameter for stopping criterion in each run of NMF.
maxiter: Maximum number of iteration times in each run of NMF

Returns

From the output parameters, you can reconstruct the tree and “replay” the k-1 steps that generated it.
For a binary tree with k leaf nodes, the total number of nodes (including leaf and non-leaf nodes): is 2*(k-1) plus the root node, because k-1 splits are performed and each split generates two new nodes.
We only keep track of the 2*(k-1) non-root node in the output.
tree: A 2-by-(k-1) matrix that encodes the tree structure. The two entries in the i-th column are the numberings of the two children of the node with numbering i. The root node has numbering 0, with its two children always having numbering 1 and numbering 2. Thus the root node is NOT included in the ‘tree’ variable.
splits: An array of length k-1. It keeps track of the numberings of the nodes being split from the 1st split to the (k-1)-th split. (The first entry is always 0.)
is_leaf: An array of length 2*(k-1). A “1” at index i means that the node with numbering i is a leaf node in the final tree generated, and “0” indicates non-leaf nodes in the final tree.
clusters: A cell array of length 2*(k-1). The i-th element contains the subset of items at the node with numbering i.
Ws: array of length 2*(k-1). Its i-th element is the topic vector of the cluster at the node with numbering i.
priorities: An array of length 2*(k-1). Its i-th element is the modified NDCG scores at the node with numbering i (see the reference paper).

Notes

Adapted from [rank-2]

hNMF

Indices and tables 

Citations 

rank-2: Da Kuang, Haesun Park, Fast rank-2 nonnegative matrix factorization for hierarchical document clustering, The 19th ACM SIGKDD International Conference on Knowledge, Discovery, and Data Mining (KDD ‘13), pp. 739-747, 2013.