The dataset is available as a gzipped tar archive: bert-concept-net_v1.tgz (44M).
After downloading the package, the tar utility can be used to extract the data into the current directory:
tar xvzf bert-concept-net_v1.tgz
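Where the tar utility is not available, the same extraction can be done from Python's standard library. The following is a minimal sketch, not part of the release: the helper name `extract_archive` is hypothetical, and it assumes the archive has already been downloaded under the file name shown above.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="."):
    """Extract a gzip-compressed tar archive into dest_dir.

    Returns the list of member names found in the archive.
    (Hypothetical helper; not shipped with the dataset.)
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:gz" opens the archive for reading with gzip decompression,
    # mirroring the `z` and `x` flags of `tar xvzf`.
    with tarfile.open(archive_path, mode="r:gz") as archive:
        names = archive.getnames()
        archive.extractall(path=dest)
    return names
```

Calling `extract_archive("bert-concept-net_v1.tgz")` extracts into the current directory and returns the member names, which roughly corresponds to the listing printed by the `v` flag. As with any archive, it is worth extracting only files obtained from a trusted source.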
The accompanying README.md describes the format of the dataset in detail, along with the associated scripts for filtering it.
The raw annotations are also available as an archive: annotations-release.zip
The code used to process the data and obtain the original clusters before annotation will be made available soon.
Former | NBA | player | Hickson | facing | armed | robbery | charge | in | August | . |
N/A | SEM:named_entity, SEM:entertainment:sport:basketball, POS:proper-noun | N/A | N/A | N/A | N/A | N/A | N/A | N/A | SEM:time:months_of_year | N/A |
The | mood | was | more | muted | at | Emirates | Stadium | at | the | start | of | Arsene | Wenger | 's | long | goodbye | to | Arsenal | . |
N/A | N/A | N/A | N/A | N/A | N/A | SEM:entertainment:sport, SEM:origin:europe:uk | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | SEM:entertainment:sport, SEM:origin:europe:uk | N/A |
He | had | publicly | tried | to | distance | himself | from | it | ... |
N/A | N/A | LEX:suffix:ly, POS:adverb | N/A | N/A | N/A | N/A | N/A | N/A | LEX:dots |
The | US | ten-year | Treasury | bond | return | has | jumped |
N/A | N/A | SEM:time:timeframe, POS:adjective, LEX:hyphenated | N/A | N/A | N/A | N/A | N/A |
Alam | often | critiques | the | government |
POS:proper-noun, SEM:named_entity:person, SEM:demography:muslim_name, LEX:case:title_case | N/A | N/A | N/A | N/A |
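The examples above pair a row of tokens with a row of labels, one cell per token, where N/A marks a token without annotation and commas separate multiple labels. The following is a minimal sketch of how such a pair could be aligned in Python. It assumes the pipe-delimited layout exactly as displayed on this page; the released files may use a different format (see README.md), and the function names `parse_row` and `align_labels` are hypothetical.

```python
def parse_row(row):
    """Split a pipe-delimited row into stripped cells.

    A trailing pipe leaves an empty final cell, which is dropped.
    """
    cells = [cell.strip() for cell in row.split("|")]
    if cells and cells[-1] == "":
        cells.pop()
    return cells


def align_labels(token_row, label_row):
    """Pair each token with its list of labels ('N/A' becomes an empty list)."""
    tokens = parse_row(token_row)
    label_cells = parse_row(label_row)
    if len(tokens) != len(label_cells):
        raise ValueError(
            f"{len(tokens)} tokens but {len(label_cells)} label cells"
        )
    pairs = []
    for token, cell in zip(tokens, label_cells):
        if cell == "N/A":
            labels = []
        else:
            labels = [label.strip() for label in cell.split(",")]
        pairs.append((token, labels))
    return pairs


# Rows copied from the last example above.
token_row = "Alam | often | critiques | the | government |"
label_row = (
    "POS:proper-noun, SEM:named_entity:person, "
    "SEM:demography:muslim_name, LEX:case:title_case | N/A | N/A | N/A | N/A |"
)

annotated = align_labels(token_row, label_row)
for token, token_labels in annotated:
    if token_labels:
        print(token, token_labels)
```

Each label is a colon-separated path whose first component (POS, SEM, or LEX in these examples) could be split off with `label.split(":")` if grouping by category is needed.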
@inproceedings{dalvi2020discovering,
title = {Discovering Latent Concepts Learned in {BERT}},
author = {Dalvi, Fahim and
Khan, Abdul and
Alam, Firoj and
Durrani, Nadir and
Xu, Jia and
Sajjad, Hassan},
booktitle = {International Conference on Learning Representations},
year = {2022},
url = {https://openreview.net/forum?id=POTMtpYI1xH}
}