hNMF
hNMF implements Rank-2 NMF for Hierarchical Clustering as described in this paper and repository . hNMF is a fork of hierarchical-nmf-python with several modifications:
-
Interface to hNMF is provided with a scikit-learn compatible BaseEstimator
-
Improved performance timings
-
Convenience methods for interpreting results
Performance
The paper mentions that the hierarchical NMF process takes advantage of a fast 2-rank matrix decomposition. The Python implementation. While this may be true in MATLAB, the original Python implementation was bottlenecked when running the 2-rank decomposition.
Reference
hNMF
hNMF.model
- class hnmf.model. HierarchicalNMF ( k: int , unbalanced: float = 0.1 , init: NMFInitMethod = None , solver: NMFSolver = 'cd' , beta_loss: NMFBetaLoss = 0 , alpha_W: float = 0.0 , alpha_H: AlphaH = 'same' , random_state: int = 42 , trial_allowance: int = 100 , tol: float = 1e-06 , maxiter: int = 10000 , dtype: DType = <class 'numpy.float64'> )
-
Methods
cluster_assignments
([leaves_only, ...])Returns a mapping of features and their assigned cluster(s)
cluster_features
([leaves_only, id2feature, ...])Returns the features assigned as a cluster to nodes
enrich_tree
(n, id2feature)Appends decoded top features to
self.graph_
Useful if it is desired to export for visualizationfit
(X)Fit HierarchicalNMF to data
get_params
([deep])Get parameters for this estimator.
json_graph
([fp])Export the graph to json
set_params
(**params)Set the parameters of this estimator.
top_discriminative_samples_in_node
(node[, ...])Computes most discriminative samples (node vs rest)
top_features_in_node
(node[, n, id2feature])For a given node, return the top n features and values
top_features_in_nodes
([n, id2feature, ...])Return the top n features and values from all nodes or nodes present in idx if idx is not None
top_nodes_in_feature
(feature[, n, ...])Returns the top nodes for a specified feature
top_nodes_in_features
(features[, n, ...])Returns the top nodes for a specified feature
top_nodes_in_samples
([n, leaves_only, id2sample])Returns the top nodes for each sample.
top_samples_in_nodes
([n, leaves_only, id2sample])Returns the top samples for each node
- cluster_assignments ( leaves_only : bool = True , id2feature : Vectorizer = None , include_outliers : bool = True ) dict [ Union [ int , str ] , Union [ set [ int ] , set [ None ] ] ]
-
Returns a mapping of features and their assigned cluster(s)
- Parameters
-
- leaves_only
-
Whether to return only leaf nodes
- id2feature
-
Decodes features
- include_outliers
-
If True, include feature keys that are not assigned a cluster.
- cluster_features ( leaves_only : bool = True , id2feature : Vectorizer = None , include_outliers : bool = True ) dict [ int , list [ Union [ str , int ] ] ]
-
Returns the features assigned as a cluster to nodes
- Parameters
-
- leaves_only
-
Whether to return only leaf nodes
- id2feature
-
Decodes features
- include_outliers
-
If True, features without a node assignment are returned under the key -1
- enrich_tree ( n : int , id2feature : Vectorizer )
-
Appends decoded top features to
self.graph_
Useful if it is desired to export for visualization- Parameters
-
- n
-
The number of top items to add to each node
- id2feature
-
Decodes features
- fit ( X : AnyArray ) HierarchicalNMF
-
Fit HierarchicalNMF to data
- Parameters
-
- X csr_matrix or ndarray
- Returns
-
- None
- json_graph ( fp : Optional [ str , PathLike ] = None )
-
Export the graph to json
- Parameters
-
- fp: str, PathLike
-
Optional, if provided graph is written to this filepath
- top_discriminative_samples_in_node ( node : int , n : int = 10 , sign : Sign = 'abs' , id2sample : Vectorizer = None ) list [ DiscrimSample ]
-
Computes most discriminative samples (node vs rest)
- Parameters
-
- node
- n
-
The number of features to return
- sign
-
One of [‘positive’, ‘negative’, ‘abs’] .
- id2sample
-
Decodes index of sample to something meaningful.
- Returns
-
- list of dict with form::
-
sample: Any node: int node_value: float others_value: float
- top_features_in_node ( node : int , n : int = 10 , id2feature : Vectorizer = None ) list [ Tuple ]
-
For a given node, return the top n features and values
- Parameters
-
- node
-
Index of node to return top items
- n
-
Number of items to return
- id2feature
-
Optional, if provided returns decoded features
- top_features_in_nodes ( n : int = 10 , id2feature : Vectorizer = None , nodes : Optional [ list [ int ] ] = None , leaves_only : bool = False ) list [ dict [ str , list [ tuple ] ] ]
-
Return the top n features and values from all nodes or nodes present in idx if idx is not None
- Parameters
-
- n
-
Number of items to return
- id2feature: Vectorizer
-
Optional, if provided returns decoded items
- nodes: list[int]
-
Optional, if provided, returns top items only for nodes listed
- leaves_only: bool
-
If True and idx is None return top items for all leaf nodes
- top_nodes_in_feature ( feature : Union [ int , str ] , n : int = 10 , leaves_only : bool = True , id2feature : Vectorizer = None )
-
Returns the top nodes for a specified feature
- Parameters
-
- feature: int, str
-
-
The feature to return the top nodes for. If string,
id2feature
must not beNone
or -
HierarchicalNMF.id2feature_
must not beNone
-
The feature to return the top nodes for. If string,
- n
-
The number of nodes to return
- leaves_only
-
Whether to filter nodes returned to leaf nodes
- id2feature
-
Optional, if provided encodes
feature
iffeature
is passed as a string
- top_nodes_in_features ( features : Union [ list [ int ] , list [ str ] , NDArray ] , n : int = 10 , leaves_only : bool = True , id2feature : Vectorizer = None ) dict [ Union [ str , int ] , list [ tuple ] ]
-
Returns the top nodes for a specified feature
- Parameters
-
- features
-
-
The features to return the top nodes for. If features are strings,
id2feature
must not beNone
or -
HierarchicalNMF.id2feature_
must not beNone
-
The features to return the top nodes for. If features are strings,
- n
-
The number of nodes to return
- leaves_only
-
Whether to filter nodes returned to leaf nodes
- id2feature
-
Optional, if provided encodes
feature
iffeature
is passed as a string
- top_nodes_in_samples ( n : int = 10 , leaves_only : bool = True , id2sample : Vectorizer = None )
-
Returns the top nodes for each sample.
- Parameters
-
- n
-
Number of items to return
- leaves_only
-
Whether to filter top items to nodes identified as leaves
- id2sample
-
Optional, if provided returns decoded samples. Should be of form idx : decoded_value
- top_samples_in_nodes ( n : int = 10 , leaves_only : bool = True , id2sample : Vectorizer = None )
-
Returns the top samples for each node
- Parameters
-
- n
-
The number of samples to include for each node
- leaves_only
-
Only include leaves
- id2sample
-
Decodes samples
hNMF.helpers
- hnmf.helpers. hier8_neat ( X , k , random_state=0 , trial_allowance: int = 3 , unbalanced: float = 0.1 , vec_norm: Union[float , int] = 2.0 , normW: bool = True , anls_alg: callable = <function anls_entry_rank2_precompute> , tol: float = 0.0001 , maxiter: int = 10000 )
-
- Parameters
-
- X
-
Array
- k
-
int, The number of desired leaf nodes
- random_state int
- trial_allowance int
-
Number of trials allowed for removing outliers and splitting a node again. See parameter T in Algorithm 3 in the reference paper.
- unbalanced float
-
A threshold to determine if one of the two clusters is an outlier set. A smaller value means more tolerance for unbalance between two clusters. See parameter beta in Algorithm 3 in the reference paper.
- vec_norm
-
Indicates which norm to use for the normalization of W or H, e.g. vec_norm=2 means Euclidean norm; vec_norm=0 means no normalization.
- normW
-
true if normalizing columns of W; false if normalizing rows of H.
- anls_alg
-
The function handle to NNLS algorithm whose signature is the same as @anls_entry_rank2_precompute
- tol
-
Tolerance parameter for stopping criterion in each run of NMF.
- maxiter
-
Maximum number of iteration times in each run of NMF
- Returns
-
- From the output parameters, you can reconstruct the tree and “replay” the k-1 steps that generated it.
- For a binary tree with k leaf nodes, the total number of nodes (including leaf and non-leaf nodes)
-
is 2*(k-1) plus the root node, because k-1 splits are performed and each split generates two new nodes.
- We only keep track of the 2*(k-1) non-root node in the output.
- tree
-
A 2-by-(k-1) matrix that encodes the tree structure. The two entries in the i-th column are the numberings of the two children of the node with numbering i. The root node has numbering 0, with its two children always having numbering 1 and numbering 2. Thus the root node is NOT included in the ‘tree’ variable.
- splits
-
An array of length k-1. It keeps track of the numberings of the nodes being split from the 1st split to the (k-1)-th split. (The first entry is always 0.)
- is_leaf
-
An array of length 2*(k-1). A “1” at index i means that the node with numbering i is a leaf node in the final tree generated, and “0” indicates non-leaf nodes in the final tree.
- clusters
-
A cell array of length 2*(k-1). The i-th element contains the subset of items at the node with numbering i.
- Ws
-
array of length 2*(k-1). Its i-th element is the topic vector of the cluster at the node with numbering i.
- priorities
-
An array of length 2*(k-1). Its i-th element is the modified NDCG scores at the node with numbering i (see the reference paper).
Notes
Adapted from [rank-2]
Indices and tables
Citations
- rank-2
-
Da Kuang, Haesun Park, Fast rank-2 nonnegative matrix factorization for hierarchical document clustering, The 19th ACM SIGKDD International Conference on Knowledge, Discovery, and Data Mining (KDD ‘13), pp. 739-747, 2013.