Issue Identification of Overseas Construction Markets from News Articles Based on BERTopic

Joonwoo Baik; Sehwan Chung; Seokho Chi

doi:10.55785/JCAR.2.2.21


Journal of Construction Automation and Robotics. 30 June 2023. 21-26
https://doi.org/10.55785/JCAR.2.2.21

Joonwoo Baik¹

Sehwan Chung²

Seokho Chi³*

¹Undergraduate Student, Department of Civil and Environmental Engineering, Seoul National University

²Member, Ph.D. Student, Department of Civil and Environmental Engineering, Seoul National University

³Corresponding Author, Member, Professor, Department of Civil and Environmental Engineering, Seoul National University

*Corresponding Author

ABSTRACT

Understanding the issues in overseas construction markets is crucial for the successful delivery of construction projects. News articles cover a wide range of local issues and can therefore be used to analyze the issues of a construction market. Topic modeling is a method that extracts major topics from text data by grouping the text automatically, and it can be used to extract major issues from news articles. This study applied BERTopic, a representative topic modeling method, to extract issues in an overseas construction market. A total of 6,273 BBC news articles were collected to test and validate the proposed BERTopic-based approach. The results show that BERTopic can effectively extract major issues from news text and represent them in an easy-to-understand manner. BERTopic modeling is expected to help identify and mitigate the risks of overseas construction projects in advance.

Keywords

Overseas Construction

News Article

Topic Modeling

BERTopic


MAIN

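1. Introduction
2. Literature Review
3. Research Method
3.1 News Article Collection
3.2 Topic Modeling
4. Topic Modeling Results and Discussion
4.1 Topic Modeling Results
4.2 Interpretation and Discussion
5. Conclusion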

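1. Introduction

When carrying out an overseas construction project, identifying the major issues of the market plays a key role in reducing project uncertainty and risk and in raising the likelihood of project success (Javernick-Will and Scott, 2010). News articles from the host country describe a wide range of local events, so analyzing them can reveal the state of that market (Goldszmidt et al., 2011). However, the volume of news published each day is very large, and collecting and analyzing it manually takes considerable time and cost, so an automated analysis method offers a clear advantage.

Topic modeling is a technique for identifying the major themes in a large collection of text documents (Wallach, 2006). In the construction automation field, studies have applied topic modeling to text data such as safety accident news (Lee, 2018), papers on Building Information Modeling (BIM) (Choo et al., 2019), site inspection reports (Lin et al., 2020), and construction litigation cases (Jallan et al., 2019) to derive major issues. Among topic modeling techniques, BERTopic has recently shown strong performance in natural language processing. This study applies BERTopic to overseas news article text to examine the feasibility of deriving major issues from that text.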

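2. Literature Review

Topic modeling requires neither document labels nor predefined categories; it analyzes the raw text data and extracts hidden themes automatically. It is therefore used in many application areas, including finding patterns in diverse data types such as genetic data, images, and social networks (Blei, 2012).

Various studies in the construction automation field have also used topic modeling. In Korea, papers containing BIM as a keyword were analyzed to identify BIM research trends (Choo et al., 2019), and construction projects were classified by their BIM use (Jung and Lee, 2019). News data on construction safety accidents were analyzed to identify accident causes and derive accident-related issues that may arise in the future (Lee, 2018). Similar to the present study, The World Bank news articles were analyzed to identify major issues arising in overseas construction projects (Moon et al., 2018), and site inspection data were used to assess uncertainty (Lin et al., 2020). In China, posts on the website Weibo were analyzed to examine the tendency to avoid the construction industry (Hou et al., 2022). Other studies identified the causes of quality problems from defect-related litigation cases in construction projects (Jallan et al., 2019) and derived the main academic concerns about the Three Gorges Dam construction project in China (Jiang et al., 2016).

Latent Dirichlet Allocation (LDA) has been the most commonly applied topic modeling technique, including in the studies above. However, the recently proposed BERTopic has drawn considerable attention after being reported to substantially outperform other techniques (Abuzayed and Al-Khalifa, 2021; Egger and Yu, 2022; Hendry et al., 2021; Grootendorst, 2022).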

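3. Research Method

To examine the feasibility of using BERTopic to identify the major issues of overseas construction markets, this study collected real overseas news article data and applied BERTopic to derive major issues from the articles. The research procedure consists of four steps, as shown in Fig. 1.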

https://cdn.apub.kr/journalsite/sites/ksarc/2023-002-02/N0410020204/images/ksarc_02_02_04_F1.jpg

Figure 1.

Research process

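3.1 News Article Collection

Using a Python-based web scraping program that gathers information from websites, this study collected the title, body text, and date of 6,273 BBC News articles published in October 2022. Fig. 2 shows an example of a collected article. After the content of the collected articles was reviewed, each article was manually assigned a suitable topic (Table 1).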

https://cdn.apub.kr/journalsite/sites/ksarc/2023-002-02/N0410020204/images/ksarc_02_02_04_F2.jpg

Figure 2.

An example of the collected BBC news articles

Table 1.

Categorization of BBC news articles

Category	Sub-category	No. of News
Politics	Defense, Congress/Parties, Bill, Election, Royal household, Diplomacy, Global politics	210
Economy	Finance, Company, Real estate, Debt, Living economy, Tax, Energy policy	395
Society	Education, Traffic accident, Rescue, Labor, Homeless, Road/Traffic, Drugs, Immigration, Water supply, Crime, Poverty, Incident, Accident, Death case, LGBT, Protest, Missing case, Kid/Teenager, Remains, Medical treatment, Human right, Racism, Charity, Disability, Trial, Gender, Religion, Hate crime, Region, Development, Opening, Closing, Birth/Abortion, Terror, Strike, Fire, Environment	3,418
Industry	IT/Science, Game, Tourism, Agriculture, Service, Energy/Resource, Space, Automobile, Shipbuilding, Railway, Fashion, Airline, Marine	256
Art / Culture	Concert, Competition, Museum, Broadcast, Movie, Music, Piece of work, Exhibition, Sculpture, Festival, Celebrity	377
Sports	Marathon, Sports	100
Weather	Weather	14
Natural disaster	Natural Disaster, Flood	37
Science	Science, Fauna and flora	112
Specific Issues	Bird flu, Burkina Faso coup, COP27, Cough syrup, COVID-19, Creeslough explosion, Diwali 2022, Donald Trump, Elon Musk, Eurovision, Food bank, Kanye West, Liz Truss, Lucy Letby, Nicola Sturgeon, Northern Ireland Protocol, Pelosi, Rishi Sunak, Suella Braverman, Nobel prize, London Marathon, Ethiopia War, Ukraine War, Protest in Iran, Disaster in Itaewon, Qatar World Cup, Halloween	922
World	China	31
Information	Health, Auction, History, Stamp, People, Book, Memorial, Profile	248
Miscellaneous	Graffiti, Advertisements, Lottery, Photo, Survey, Awards, Eclipse, Quiz, Miscellaneous	153

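3.2 Topic Modeling

BERTopic derives topics through document embedding, dimensionality reduction, clustering, and keyword extraction (Grootendorst, 2020). First, each news article is represented as a vector (embedding) using Bidirectional Encoder Representations from Transformers (BERT), a pretrained language model. Next, Uniform Manifold Approximation and Projection (UMAP), a dimensionality reduction technique, reduces the dimension of the embedded article vectors so that they can be analyzed quickly and accurately. To group the article vectors and find meaningful topics that represent each group, document clusters are then created with Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN). HDBSCAN is a density-based clustering algorithm that groups documents with similar themes. News articles that do not belong to any cluster are treated as outliers. When HDBSCAN is applied, the number of topics can either be fixed by adjusting parameters or be determined automatically by the algorithm. To find the number of topics that gives the best performance with BERTopic, this study let the algorithm determine the number of topics automatically and then compared the results.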
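The outlier behavior described above can be illustrated with a simplified density-based clustering routine. The sketch below implements plain DBSCAN rather than HDBSCAN; this is a deliberate simplification for illustration, not the algorithm used in this study. The 2-D points and the `eps` and `min_samples` values are hypothetical stand-ins for dimension-reduced article embeddings.

```python
from math import dist


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id; -1 marks noise (outliers)."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels


# Hypothetical data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_samples=3)
print(labels)  # → [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point receives the label -1, which corresponds to the outlier topic reported in the results. HDBSCAN differs in that it builds a hierarchy over varying density levels instead of using a single fixed `eps`.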

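Class-based Term Frequency-Inverse Document Frequency (c-TF-IDF) is then applied to the document cluster representing each topic to extract topic keywords. The c-TF-IDF used in BERTopic treats each cluster as a single document and computes the TF-IDF value of the words in each cluster, yielding the main keywords that describe each topic (Fig. 3).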

https://cdn.apub.kr/journalsite/sites/ksarc/2023-002-02/N0410020204/images/ksarc_02_02_04_F3.jpg

Figure 3.

Process of BERTopic
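As a rough illustration of the keyword extraction step, the sketch below computes c-TF-IDF weights following the formulation in Grootendorst (2022), W(t, c) = tf(t, c) · log(1 + A / f(t)), where A is the average number of words per cluster and f(t) is the frequency of term t across all clusters. The toy documents, the whitespace tokenization, and the function names are assumptions made for this example; the BERTopic library itself uses its own vectorizer and normalization.

```python
import math
from collections import Counter


def c_tf_idf(docs_by_topic):
    """Return {topic: {term: weight}} with weight = tf * log(1 + A / f)."""
    # Treat every cluster as one long document (simple whitespace tokens).
    tf = {topic: Counter(" ".join(docs).lower().split())
          for topic, docs in docs_by_topic.items()}
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(tf)  # A: average words per cluster
    return {topic: {term: n * math.log(1 + avg_words / total[term])
                    for term, n in counts.items()}
            for topic, counts in tf.items()}


def top_keywords(weights, k=3):
    """Highest-weighted terms; ties are broken alphabetically."""
    return sorted(weights, key=lambda term: (-weights[term], term))[:k]


# Hypothetical clusters, loosely modeled on the 'Fire' and 'Road/Traffic' topics.
docs_by_topic = {
    "fire": ["firefighters tackled blaze", "rescue service said blaze spread"],
    "road": ["bridge road closed", "traffic road said highways delayed"],
}
weights = c_tf_idf(docs_by_topic)
print(top_keywords(weights["fire"], 1))  # → ['blaze']
print(top_keywords(weights["road"], 1))  # → ['road']
```

In this toy input the word "said" occurs in both clusters and therefore receives the lowest weight in each, which mirrors why generic news vocabulary does not surface as a keyword for well-formed topics.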

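4. Topic Modeling Results and Discussion

4.1 Topic Modeling Results

Applying BERTopic produced a total of 87 topics. Table 2 below shows, for the 10 largest topics by number of articles, the main keywords and the topic label assigned on the basis of those keywords. Topic number -1 denotes outliers, and its main keywords are words that appear frequently in news articles in general, such as “said,” “people,” “mr,” and “bbc.”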

Table 2.

Results of BERTopic

Topic No.	No. of News	Keywords	Topic Label
-1	1,785	Said, people, mr, bbc	Outlier
0	759	Police, man, police said, officers	Crime
1	207	Truss, tax, ms truss, prime	Liz Truss
2	203	Rescue, service, blaze, firefighters	Fire
3	197	Russian, Ukraine, Russia, Ukrainian	Ukraine War
4	131	Council, homes, plans, sites	Development
5	126	Radio, tv, content, media	Broadcast
6	121	Patients, hospital, nhs, care	Medical treatment
7	119	Northern, Ireland, northern Ireland, protocol	Northern Ireland Protocol
8	106	Road, bridge, traffic, highways	Road/Traffic
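A table of this kind can be produced by counting how many articles were assigned to each topic number. The following is a minimal sketch, assuming the model output is available as a list of integer topic ids with -1 for outliers; the `assignments` list is hypothetical toy data, not the study's output.

```python
from collections import Counter


def topic_sizes(assignments):
    """Return (topic_id, count) pairs, largest topics first."""
    counts = Counter(assignments)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def outlier_share(assignments):
    """Fraction of articles that were left unassigned (topic id -1)."""
    return assignments.count(-1) / len(assignments)


# Hypothetical topic ids for ten articles.
assignments = [-1, 0, 0, 1, -1, 0, 2, -1, -1, 0]
print(topic_sizes(assignments))    # → [(-1, 4), (0, 4), (1, 1), (2, 1)]
print(outlier_share(assignments))  # → 0.4
```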

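One advantage of BERTopic was that the keywords for each topic were distinct, which made assigning topic names easy. For example, the keywords generated for topic 1, ‘truss’, ‘tax’, ‘ms truss’, and ‘prime’, are all related to Liz Truss, a former UK prime minister. The keywords of the other topics were similarly distinct, confirming that topics could be labeled intuitively.

4.2 Interpretation and Discussion

The performance of BERTopic was verified quantitatively by comparing, for each news article, the topic assigned by the BERTopic model with the manually assigned topic (Table 3).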

Table 3.

Performance metrics of BERTopic results

Num of Topics	Accuracy score	Precision score	Recall score	F1 score
87	0.394	0.242	0.288	0.244
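The source does not state how the scores in Table 3 were averaged. The reported F1 score (0.244) is not the harmonic mean of the reported precision and recall, which would be consistent with macro-averaging of per-class scores, so the sketch below assumes macro-averaging over the manually assigned classes. The label lists are hypothetical, and outliers are given a label that matches no manual class so that they always count as incorrect.

```python
from collections import Counter


def macro_scores(true_labels, pred_labels):
    """Accuracy plus macro-averaged precision, recall, F1 over manual classes."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true_labels, pred_labels):
        if t == p:
            tp[t] += 1
        else:
            fn[t] += 1  # the manual class missed this article
            fp[p] += 1  # the predicted label claimed it wrongly
    classes = sorted(set(true_labels))
    precision, recall, f1 = [], [], []
    for c in classes:
        p = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        r = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        precision.append(p)
        recall.append(r)
        f1.append(2 * p * r / (p + r) if p + r else 0.0)
    n = len(classes)
    return {
        "accuracy": sum(tp.values()) / len(true_labels),
        "precision": sum(precision) / n,
        "recall": sum(recall) / n,
        "f1": sum(f1) / n,
    }


# Hypothetical example: both 'Tax' articles land in 'Liz Truss', one 'Crime'
# article is an outlier.
true_labels = ["Fire", "Fire", "Tax", "Tax", "Crime", "Crime"]
pred_labels = ["Fire", "Fire", "Liz Truss", "Liz Truss", "Crime", "Outlier"]
scores = macro_scores(true_labels, pred_labels)
print(scores)
```

In this toy case 'Tax' has an F1 score of 0 because none of its articles was assigned to it, and the single outlier halves the recall of 'Crime', so a handful of misrouted classes pulls the macro-averaged F1 well below the per-class scores of the topics that were recovered.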

Table 4.

Example of news articles with high F1 score

Topic	Title	Text
Cough syrup	Indonesia bans all syrup medicines after death of 99 children	The deaths of nearly 100 children in Indonesia have prompted the country to suspend sales…
Diwali 2022	Diwali celebrations planned at Leicester waterways	Activities to celebrate Diwali are being organised around Leicester's waterways…
Ukraine War	Ukraine war: Tortured for refusing to teach in Russian	Ukrainian forces say they have taken back 6,000 sq km of territory, liberating communities...
Creeslough explosion	Creeslough spirit shines amid darkness of tragedy	The names of those killed in the Creeslough petrol station explosion were made known…

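The evaluation yielded a relatively low F1 score of 0.244. This is presumed to result from counting every outlier as incorrect: 1,785 of the 6,273 articles (28.5%) were treated as wrong answers, which likely had a large effect on the result. In addition, the gap between the number of topics derived automatically by BERTopic (87) and the number of manually assigned topics (128) also contributed to the low F1 score.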
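The size of the outlier effect can be gauged with a short calculation. Assuming the accuracy in Table 3 was computed over all 6,273 articles and that every outlier was counted as incorrect (as described above), the accuracy restricted to the articles that did receive a topic would be roughly:

```latex
\[
\frac{1{,}785}{6{,}273} \approx 0.285,
\qquad
\text{accuracy}_{\text{assigned}}
\approx \frac{0.394 \times 6{,}273}{6{,}273 - 1{,}785}
= \frac{2{,}471.6}{4{,}488}
\approx 0.551
\]
```

This is an estimate derived from the reported figures under the stated assumption, not a result reported by the study.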

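For a more detailed discussion, the topics with the highest and lowest F1 scores were examined. The topic with the highest F1 score, 0.857, was ‘Cough syrup’, which covers the deaths of nearly 100 children in Indonesia after taking cough medicine. Other high-scoring topics include ‘Diwali 2022’, the Hindu festival (F1 score 0.842), ‘Ukraine War’ (0.833), and ‘Creeslough explosion’ (0.826), which covers the petrol station explosion in Creeslough, Ireland. All of these topics belong to the ‘Specific Issues’ category, which indicates that the BERTopic model classifies news articles about specific issues with comparatively high accuracy. Table 4 gives the title and a brief excerpt of the body text for an article from each of these topics.

The topics with an F1 score of 0 include ‘Congress/Parties’ and ‘Tax’. Most ‘Congress/Parties’ articles were assigned to two topics, ‘Liz Truss’ and ‘Rishi Sunak’. Specifically, of the 23 articles, 5 were assigned to ‘Liz Truss’ (former UK prime minister) and 8 to ‘Rishi Sunak’ (her successor as UK prime minister). The remaining articles were assigned to topics such as ‘Donald Trump’ and ‘Northern Ireland Protocol’. Although these articles do not actually discuss either politician, they share the vocabulary and context of general political articles, so the BERTopic model most likely assigned them to the ‘Liz Truss’ and ‘Rishi Sunak’ topics for that reason. Similarly, 17 of the 22 ‘Tax’ articles were misassigned to ‘Liz Truss’ (Table 5).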

Table 5.

Example of news articles with low F1 score

Topic	Title	Text
Congress/Parties	Young Tory chairman apologises for 'Birmingham is a dump' tweet	The chairman of a group of young Conservative Party members has apologised …
Congress/Parties	Graham Brady: The man who sees off Tory prime ministers	Born in Salford in 1967, Sir Graham first became active in the Conservative Party …
Tax	Mini-budget: Keeping 45p tax rate in Wales 'would raise £45m'	Keeping the 45p rate of income tax in Wales would raise about £45m, according to Wales' finance minister…
Tax	Community helpdesk launched to support Jersey taxpayers	A community helpdesk is being launched to support taxpayers in Jersey. Revenue Jersey…

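These examples show that even articles in topics with low F1 scores were assigned to topics that are contextually close to their manually assigned topics.

5. Conclusion

This study examined the applicability of BERTopic for deriving major topics from overseas news articles. The results confirmed that BERTopic presents distinct keywords for each topic, which allows topics to be interpreted intuitively.

A limitation of this study is that the researcher must set the topic names manually after inspecting the keywords produced by topic modeling. Future research could therefore apply an algorithm that assigns topic names automatically from the topic keywords, so that topics are labeled according to consistent criteria.

In addition, to verify the effectiveness of BERTopic from multiple angles, future research should carry out further experiments and validation by (1) comparing its performance with existing topic modeling techniques (e.g., LDA) and (2) using news data directly related to construction (e.g., news containing the word “construction”).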

Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science and ICT) (No. RS-2023-20241758).

References

Abuzayed, A., and Al-Khalifa, H. (2021). BERT for Arabic topic modeling: An experimental study on BERTopic technique. Procedia Computer Science, 189, pp. 191-194. 10.1016/j.procs.2021.05.096

Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), pp. 77-84. 10.1145/2133806.2133826

Choo, S., Park, H., Kim, T., and Seo, J. (2019). Analysis of trends in Korean BIM research and technologies using text mining. Applied Sciences, 9(20), 4424. 10.3390/app9204424

Egger, R., and Yu, J. (2022). A topic modeling comparison between LDA, NMF, Top2Vec, and BERTopic to demystify twitter posts. Frontiers in Sociology, 7. 10.3389/fsoc.2022.886498 (PMID: 35602001; PMCID: PMC9120935)

Goldszmidt, R. G. B., Brito, L. A. L., and de Vasconcelos, F. C. (2011). Country effect on firm performance: A multilevel approach. Journal of Business Research, 64(3), pp. 273-279. 10.1016/j.jbusres.2009.11.012

Grootendorst, M. (2020). BERTopic: Leveraging BERT and c-TF-IDF to create easily interpretable topics. Zenodo. 10.5281/zenodo.4381785 (Accessed March 15, 2023).

Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv:2203.05794, https://arxiv.org/pdf/2203.05794.pdf (Accessed March 15, 2023).

Hendry, D., Darari, F., Nurfadillah, R., Khanna, G., Sun, M., Condylis, P. C., and Taufik, N. (2021). Topic modeling for customer service chats. 2021 International Conference on Advanced Computer Science and Information Systems, IEEE, Depok, Indonesia, pp. 1-6. 10.1109/ICACSIS53237.2021.9631322

Hou, S., Zhang, X., Yi, B., and Tang, Y. (2022). Public attitudes on open source communities in China: A text mining analysis. Technology in Society, 71, 102112. 10.1016/j.techsoc.2022.102112

Jallan, Y., Brogan, E., Ashuri, B., and Clevenger, C. M. (2019). Application of natural language processing and text mining to identify patterns in construction-defect litigation cases. Journal of Legal Affairs and Dispute Resolution in Engineering and Construction, 11(4), 04519024. 10.1061/(ASCE)LA.1943-4170.0000308

Javernick-Will, A. N., and Scott, W. R. (2010). Who needs to know what? Institutional knowledge and global projects. Journal of Construction Engineering and Management, 136(5), pp. 546-557. 10.1061/(ASCE)CO.1943-7862.0000035

Jiang, H. C., Qiang, M. S., and Lin, P. (2016). Finding academic concerns of the Three Gorges Project based on a topic modeling approach. Ecological Indicators, 60, pp. 693-701. 10.1016/j.ecolind.2015.08.007

Jung, N., and Lee, G. (2019). Automated classification of building information modeling (BIM) case studies by BIM use based on natural language processing (NLP) and unsupervised learning. Advanced Engineering Informatics, 41, 100917. 10.1016/j.aei.2019.04.007

Lee, S.-G. (2018). A study on the trends of construction safety accident in unstructured text using topic modeling. Journal of the Korea Academia-Industrial Cooperation Society, 19(10), pp. 176-182.

Lin, J. R., Hu, Z. Z., Li, J. L., and Chen, L. M. (2020). Understanding on-site inspection of construction projects based on keyword extraction and topic modeling. IEEE Access, 8, pp. 198503-198517. 10.1109/ACCESS.2020.3035214

Moon, Chung, S., and Chi, S. (2018). Topic Modeling of News Article about International Construction Market Using Latent Dirichlet Allocation. KSCE Journal of Civil and Environmental Engineering Research, 38(4), pp. 595-599.

Wallach, H. M. (2006). Topic modeling: Beyond bag-of-words. Proc. of the 23rd International Conference on Machine Learning, Association for Computing Machinery, New York, USA, pp. 977-984. 10.1145/1143844.1143967

Journal of Construction Automation and Robotics ISSN:2800-0552(Print) 2951-116X(Online) 건설자동화·로보틱스 논문집
