Open Intent Discovery

This package provides a toolkit for open intent discovery, implemented with PyTorch (for semi-supervised clustering methods) and TensorFlow (for unsupervised clustering methods).

Introduction

Open intent discovery aims to leverage limited labeled data of known intents to help discover open intent clusters. We regard it as a clustering problem and classify the related methods into two categories: semi-supervised clustering (with some labeled known intent data as prior knowledge) and unsupervised clustering (without any prior knowledge).

We collect benchmark intent datasets and reproduce related methods to the best of our ability. For the convenience of users, we provide flexible and extensible interfaces for adding new methods. You are welcome to contact us (zhang-hl20@mails.tsinghua.edu.cn) to add your methods!

Basic Information

Benchmark Datasets

Dataset Name	Source
BANKING	Paper
CLINC150	Paper
StackOverflow	Paper

Integrated Models

Setting	Model Name	Source	Published
Unsupervised	KM	Paper	BSMSP 1967
Unsupervised	AG	Paper	PR 1978
Unsupervised	SAE-KM	Paper	JMLR 2010
Unsupervised	DEC	Paper Code	ICML 2016
Unsupervised	DCN	Paper Code	ICML 2017
Unsupervised	CC	Paper Code	AAAI 2021
Unsupervised	SCCL	Paper Code	NAACL 2021
Unsupervised	USNID	Paper Code	arXiv 2023
Semi-supervised	KCL*	Paper Code	ICLR 2018
Semi-supervised	MCL*	Paper Code	ICLR 2019
Semi-supervised	DTC*	Paper Code	ICCV 2019
Semi-supervised	CDAC+	Paper Code	AAAI 2020
Semi-supervised	DeepAligned	Paper Code	AAAI 2021
Semi-supervised	GCD	Paper Code	CVPR 2022
Semi-supervised	MTP-CLNN	Paper Code	ACL 2022
Semi-supervised	USNID	Paper Code	arXiv 2023

Results

The detailed results can be seen in results.md.

Overall Performance

KIR means "Known Intent Ratio".

KIR	Methods	BANKING NMI	BANKING ARI	BANKING ACC	CLINC NMI	CLINC ARI	CLINC ACC	StackOverflow NMI	StackOverflow ARI	StackOverflow ACC
0.0	KM	49.30	13.04	28.62	71.05	27.72	45.76	19.87	5.23	23.72
0.0	AG	53.28	14.65	31.62	72.21	27.05	44.12	25.54	7.12	28.50
0.0	SAE-KM	59.80	23.59	37.07	73.77	31.58	47.15	44.96	28.23	49.11
0.0	DEC	62.66	25.32	38.60	74.83	31.71	48.77	58.76	36.23	59.49
0.0	DCN	62.72	25.36	38.59	74.77	31.68	48.69	58.75	36.23	59.48
0.0	CC	44.89	9.75	21.51	65.79	18.00	32.69	19.06	8.79	21.01
0.0	SCCL	63.89	26.98	40.54	79.35	38.14	50.44	69.11	34.81	68.15
0.0	USNID	75.30	43.33	54.82	91.00	68.54	75.87	72.00	52.25	69.28

0.25	KCL	52.70	18.58	26.03	67.98	24.30	29.40	30.42	17.66	30.69
0.25	MCL	47.88	14.43	23.29	62.76	18.21	28.52	26.68	17.54	31.46
0.25	DTC	55.59	19.09	31.75	79.35	41.92	56.90	29.96	17.51	29.54
0.25	GCD	60.89	27.30	39.91	83.69	52.13	64.69	31.72	16.81	36.76
0.25	CDACPlus	66.39	33.74	48.00	84.68	50.02	66.24	46.16	30.99	51.61
0.25	DeepAligned	70.50	37.62	49.08	88.97	64.63	74.07	50.86	37.96	54.50
0.25	MTP-CLNN	80.04	52.91	65.06	93.17	76.20	83.26	73.35	54.80	74.70
0.25	USNID	81.94	56.53	65.85	94.17	77.95	83.12	74.91	65.45	75.76

0.5	KCL	63.50	30.36	40.04	74.74	35.28	45.69	53.39	41.74	56.80
0.5	MCL	62.71	29.91	41.94	76.94	39.74	49.44	45.17	36.28	52.53
0.5	DTC	69.46	37.05	49.85	83.01	50.45	64.39	49.80	37.38	52.92
0.5	GCD	67.29	35.52	48.37	87.12	59.75	70.93	49.57	31.15	53.77
0.5	CDACPlus	67.30	34.97	48.55	86.00	54.87	68.01	46.21	30.88	51.79
0.5	DeepAligned	76.67	47.95	59.38	91.59	72.56	80.70	68.28	57.62	74.52
0.5	MTP-CLNN	83.42	60.17	70.97	94.30	80.17	86.18	76.66	62.24	80.36
0.5	USNID	85.05	63.77	73.27	95.48	82.99	87.28	78.77	71.63	82.06

0.75	KCL	72.75	45.21	59.12	86.01	58.62	68.89	63.98	54.28	68.69
0.75	MCL	74.42	48.06	61.56	87.26	61.21	70.27	63.44	56.11	71.71
0.75	DTC	74.44	44.68	57.16	89.19	67.15	77.65	63.05	53.83	71.04
0.75	GCD	72.21	42.86	56.94	89.38	66.03	76.82	60.14	42.05	65.20
0.75	CDACPlus	69.54	37.78	51.07	85.96	55.17	67.77	58.23	40.95	64.57
0.75	DeepAligned	79.39	53.09	64.63	93.92	79.94	86.79	73.28	60.09	77.97
0.75	MTP-CLNN	86.19	66.98	77.22	95.45	84.30	89.46	77.12	69.36	82.90
0.75	USNID	87.41	69.54	78.36	96.42	86.77	90.36	80.13	74.90	85.66

We welcome any issues and requests for model implementations and bug fixes.

Data Settings

Each dataset is split into training, development, and testing sets. We select a subset of the intents as known intents. Notably, we uniformly select 10% of the known intent data as labeled (the labeled ratio can be changed). We use all training data (both labeled and unlabeled) to train the model. During testing, we evaluate the clustering performance on all intent classes. More detailed information can be found in the paper.
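As a rough illustration of this data setting, the sketch below picks a subset of intents as known and marks a fraction of their training examples as labeled. The function name, the per-intent sampling, and the rounding are assumptions made for illustration; they are not taken from the toolkit's code.

```python
import random
from collections import defaultdict


def split_known_and_labeled(examples, known_cls_ratio, labeled_ratio=0.1, seed=0):
    """Split training examples into labeled and unlabeled subsets.

    examples: list of (text, intent_label) pairs.
    Returns (known_intents, labeled, unlabeled).
    """
    rng = random.Random(seed)
    all_intents = sorted({label for _, label in examples})

    # Choose a subset of the intents to treat as known.
    n_known = round(len(all_intents) * known_cls_ratio)
    known_intents = set(rng.sample(all_intents, n_known))

    # Group the indices of known-intent examples by intent.
    by_intent = defaultdict(list)
    for idx, (_, label) in enumerate(examples):
        if label in known_intents:
            by_intent[label].append(idx)

    # Assumption: the labeled fraction is sampled separately for each known intent.
    labeled_idx = set()
    for label in sorted(by_intent):
        indices = by_intent[label]
        n_labeled = max(1, round(len(indices) * labeled_ratio))
        labeled_idx.update(rng.sample(indices, n_labeled))

    labeled = [examples[i] for i in sorted(labeled_idx)]
    # Everything else is used without its label during training.
    unlabeled = [examples[i] for i in range(len(examples)) if i not in labeled_idx]
    return known_intents, labeled, unlabeled
```

With a known intent ratio of 0.25 and the default labeled ratio, only a small part of the training set keeps its labels; the remaining known-intent examples and all open-intent examples are treated as unlabeled.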

Parameter Configurations

The basic parameters include parsing parameters for the selected dataset, method, setting, etc. More details can be seen in run.py. For the specific parameters of each method, we support adding configuration files with different hyper-parameters in the configs directory.

An example can be seen in DeepAligned.py. Note that the config file name corresponds to the --config_file_name parsing parameter.

Normally, the input commands are as follows:

python run.py --setting xxx --dataset xxx --known_cls_ratio xxx --labeled_ratio xxx --cluster_num_factor xxx --config_file_name xxx

Note that if you want to train the model, save the model, or save the testing results, you need to add the related parameters (--train, --save_model, --save_results).
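For orientation, a parser accepting the flags listed above might look like the following sketch. Only the flag names come from this README; the types, defaults, and the build_parser helper are assumptions, and the real definitions live in run.py.

```python
import argparse


def build_parser():
    """Illustrative parser; types and defaults are assumptions, not copied from run.py."""
    parser = argparse.ArgumentParser(description="Open intent discovery (illustrative)")
    parser.add_argument("--setting", type=str, default=None)
    parser.add_argument("--dataset", type=str, default=None)
    parser.add_argument("--known_cls_ratio", type=float, default=0.75)
    parser.add_argument("--labeled_ratio", type=float, default=0.1)
    parser.add_argument("--cluster_num_factor", type=float, default=1.0)
    parser.add_argument("--config_file_name", type=str, default=None)
    # Boolean switches: absent means False, present means True.
    parser.add_argument("--train", action="store_true")
    parser.add_argument("--save_model", action="store_true")
    parser.add_argument("--save_results", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(vars(args))
```

Modeling --train, --save_model, and --save_results as store_true switches matches how they appear in the note above, where they are added without a value.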

Tutorials

a. How to add a new dataset?

Prepare Data
Create a new directory in the data directory to store your dataset. You should provide train.tsv, dev.tsv, and test.tsv in the same format as the provided datasets.
Dataloader Setting
Calculate the maximum sentence length (in tokens) and list the labels of the dataset. Add them to the file as follows:

max_seq_lengths = {
    'new_dataset': max_length
}
benchmark_labels = {
    'new_dataset': label_list
}
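One way to obtain the two values is sketched below. It assumes a TSV file with a header row and columns named text and label, and it counts whitespace-separated tokens; both are assumptions, so adjust the column names to your files and substitute the tokenizer of your backbone when measuring length.

```python
import csv
import io


def dataset_stats(tsv_file):
    """Return (max_length, label_list) for an open TSV file with a header row.

    Assumes columns named "text" and "label" (hypothetical names).
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    max_length = 0
    labels = set()
    for row in reader:
        # Whitespace tokens are a stand-in for the backbone's tokenizer.
        max_length = max(max_length, len(row["text"].split()))
        labels.add(row["label"])
    return max_length, sorted(labels)


# Small in-memory example; for a real dataset, pass a file opened with
# open(path, newline="", encoding="utf-8") instead.
sample = io.StringIO(
    "text\tlabel\n"
    "how do i reset my pin\tchange_pin\n"
    "i lost my card\tcard_lost\n"
)
max_length, label_list = dataset_stats(sample)
```

The resulting max_length and label_list are the values to register under the new dataset's key.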

b. How to add a new backbone?

Add a new backbone in the backbones directory. For example, we provide BERT-based, GloVe-based, and SAE-based backbones.
Add the new backbone mapping in the file as follows:

from .bert import new_backbone_class
backbones_map = {
    'new_backbone': new_backbone_class
}
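A name-to-class mapping like this is typically resolved at run time by looking up the configured backbone name. The sketch below shows that pattern with a hypothetical NewBackbone class and a hypothetical build_backbone helper; neither is part of the toolkit.

```python
class NewBackbone:
    """Hypothetical stand-in for a backbone class."""

    def __init__(self, hidden_size=768):
        self.hidden_size = hidden_size


# Same shape as the mapping above: backbone name -> backbone class.
backbones_map = {
    "new_backbone": NewBackbone,
}


def build_backbone(name, **kwargs):
    """Instantiate the backbone registered under `name`."""
    try:
        backbone_cls = backbones_map[name]
    except KeyError:
        known = ", ".join(sorted(backbones_map))
        raise ValueError(
            f"Unknown backbone '{name}'. Registered backbones: {known}"
        ) from None
    return backbone_cls(**kwargs)
```

Failing with the list of registered names makes a typo in the backbone name easy to spot.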

Adding a new loss in the losses directory is almost the same as adding a new backbone.

c. How to add a new method?

Configuration Setting
Create a new file, named "method_name.py" in the configs directory, and set the hyper-parameters for the method (an example can be seen in DeepAligned.py).
Dataloader Setting
Add the dataloader mapping if you use a new backbone for the method. For example, the BERT-based model corresponds to the BERT dataloader as follows:

from .bert_loader import BERT_Loader
backbone_loader_map = {
    'bert': BERT_Loader,
    'bert_xxx': BERT_Loader
}

The unsupervised clustering methods use the unified dataloader as follows:

from .unsup_loader import UNSUP_Loader
backbone_loader_map = {
    'glove': UNSUP_Loader,
    'sae': UNSUP_Loader
}

Add Methods (Take DeepAligned as an example)

Classify the method into the corresponding category in the methods directory. For example, DeepAligned belongs in the semi_supervised directory, so create a subdirectory under it named "DeepAligned".
Add the manager file for DeepAligned. The file should include the method manager class (e.g., DeepAlignedManager), which includes training, evaluation, and testing modules for the method. An example can be seen in manager.py.
Add the related method dependency in __init__.py as below:

from .semi_supervised.DeepAligned.manager import DeepAlignedManager
method_map = {
    'DeepAligned': DeepAlignedManager
}

(The key corresponds to the input parameter "method")
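To make the manager contract concrete, here is a minimal skeleton with training, evaluation, and testing methods, plus a dispatcher that looks up the manager by method name. ExampleManager, the run helper, the method bodies, and the returned dictionaries are placeholders assumed for illustration; see manager.py for an actual implementation.

```python
from types import SimpleNamespace


class ExampleManager:
    """Hypothetical method manager with training, evaluation, and testing modules."""

    def __init__(self, args):
        self.args = args
        self.is_trained = False

    def train(self):
        # A real manager would run its optimization loop here.
        self.is_trained = True

    def evaluate(self):
        # A real manager would score the development set here.
        return {}

    def test(self):
        # A real manager would cluster the test set and compute metrics here.
        return {"trained": self.is_trained}


# The key corresponds to the "method" input parameter.
method_map = {"Example": ExampleManager}


def run(args):
    """Look up the manager for args.method, optionally train it, then test it."""
    manager = method_map[args.method](args)
    if args.train:
        manager.train()
    return manager.test()


outputs = run(SimpleNamespace(method="Example", train=True))
```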

Run Examples
Add a script in the examples directory, and configure the parsing parameters in run.py. You can also run the programs serially by setting combinations of different parameters. A running example is shown in run_DeepAligned.sh.
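The provided example is a shell script; as an alternative sketch in Python, the snippet below builds one run.py command per combination of dataset and known intent ratio. The build_commands helper and the setting and dataset values passed to it are hypothetical; only the flag names come from this README.

```python
import itertools
import shlex


def build_commands(datasets, known_cls_ratios, config_file_name, setting):
    """Build one run.py command for every (dataset, known_cls_ratio) pair."""
    commands = []
    for dataset, ratio in itertools.product(datasets, known_cls_ratios):
        commands.append([
            "python", "run.py",
            "--setting", setting,
            "--dataset", dataset,
            "--known_cls_ratio", str(ratio),
            "--config_file_name", config_file_name,
            "--train",
            "--save_results",
        ])
    return commands


if __name__ == "__main__":
    # The values below are placeholders; use the names your setup expects.
    for cmd in build_commands(["banking", "clinc"], [0.25, 0.5, 0.75],
                              "DeepAligned", "semi_supervised"):
        print(shlex.join(cmd))
```

Printing the commands first allows a quick check before running anything; passing each list to subprocess.run(cmd, check=True) would then execute the runs one after another.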

Citations

If you are interested in this work and want to use the code in this repo, please star/fork this repo and cite the following works:

TEXTOIR: An Integrated and Visualized Platform for Text Open Intent Recognition
A Clustering Framework for Unsupervised and Semi-supervised New Intent Discovery

@inproceedings{zhang-etal-2021-textoir,
    title = "{TEXTOIR}: An Integrated and Visualized Platform for Text Open Intent Recognition",
    author = "Zhang, Hanlei  and Li, Xiaoteng  and Xu, Hua  and Zhang, Panpan and Zhao, Kang  and Gao, Kai",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations",
    pages = "167--174",
    year = "2021",
    url = "https://aclanthology.org/2021.acl-demo.20",
    doi = "10.18653/v1/2021.acl-demo.20",
}

@ARTICLE{10349963,
  author={Zhang, Hanlei and Xu, Hua and Wang, Xin and Long, Fei and Gao, Kai},
  journal={IEEE Transactions on Knowledge and Data Engineering}, 
  title={A Clustering Framework for Unsupervised and Semi-supervised New Intent Discovery}, 
  year={2023},
  doi={10.1109/TKDE.2023.3340732}
}
