Appendix: Data Cleaning
This appendix describes how duplicates were removed and titles were cleaned to make the data suitable for further analysis. Data cleaning is a fundamental part of data analysis and is particularly important for real-world datasets, which often contain missing, duplicate, or inconsistent records. We give each step in detail, together with the rationale behind it.
The data-cleaning process is broken down into two major steps:
- Duplicate Removal: Removing duplicate entries based on multiple criteria.
- Title Cleaning: Removing conference proceeding information from the titles.
The process relied on the following tools:
- Python 3
- Pandas library for data manipulation
- Regular expressions (regex) for string pattern matching
First, rows that were identical in every column were removed:
import pandas as pd
# Load the merged article list
df = pd.read_csv('Merged_Articles.csv')
# Drop rows that match another row in every column
df = df.drop_duplicates()
Duplicate articles might differ in their abstracts or other fields, in which case exact-row matching misses them. Duplicates were therefore further removed based on the Title field alone:
df = df.drop_duplicates(subset='Title')
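Finally, duplicates were removed based on the Abstract field. Rows were first sorted so that missing abstracts came last. Because pandas treats missing values as equal to one another when checking for duplicates, only the first of the rows with a missing abstract was kept: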
df = df.sort_values(by='Abstract', na_position='last') \
       .drop_duplicates(subset='Abstract', keep='first')
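The three deduplication stages can be traced on a small example. The sketch below uses a hypothetical toy table (not the study's data) and adds `kind="stable"` to the sort so that the surviving row among ties is predictable; that argument is an assumption of this sketch and does not appear in the code above. The last lines show one possible variant that keeps every row with a missing abstract, purely for comparison.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame(
    {
        "Title": ["A", "A", "A", "B", "C", "D", "E"],
        "Abstract": ["x", "x", "z", "y", "y", None, None],
    }
)

# Stage 1: drop rows identical in every column (the second "A"/"x" row).
step1 = toy.drop_duplicates()

# Stage 2: drop rows that repeat a title (the "A"/"z" row).
step2 = step1.drop_duplicates(subset="Title")

# Stage 3: sort missing abstracts last, then drop repeated abstracts.
# Missing values count as equal, so "E" is dropped along with "C".
# kind="stable" is an addition made here to keep tie order deterministic.
step3 = (
    step2.sort_values(by="Abstract", na_position="last", kind="stable")
    .drop_duplicates(subset="Abstract", keep="first")
)

# Variant (not the procedure described above): deduplicate only rows
# that have an abstract and keep all rows whose abstract is missing.
has_abstract = step2["Abstract"].notna()
variant = pd.concat(
    [
        step2[has_abstract].drop_duplicates(subset="Abstract"),
        step2[~has_abstract],
    ]
)

print(len(toy), len(step1), len(step2), len(step3), len(variant))
```

On this toy table the procedure keeps titles A, B, and D, while the variant also keeps E. Which behavior is preferable depends on whether rows without an abstract are expected to be distinct articles.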
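Titles were cleaned to remove conference proceeding details. A regular expression (regex) pattern was used to identify and remove a prefix of the form ACRONYM 'YY: at the start of a title, that is, an uppercase acronym, an apostrophe with a two-digit year, and a colon: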
# Regex pattern matching a leading conference prefix: uppercase acronym,
# space, apostrophe, two-digit year, colon, space
conference_pattern = r"^[A-Z]+\s\'\d{2}:\s"
# Replace matched prefixes in the 'Title' column with an empty string
df['Cleaned_Title'] = df['Title'] \
    .str.replace(conference_pattern, '', regex=True)
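To see what the pattern does and does not match, it can be tried on a few strings with Python's standard `re` module. The titles below are hypothetical examples written for this sketch, not entries from the dataset.

```python
import re

# Same pattern as above: uppercase acronym, space, apostrophe,
# two-digit year, colon, space, anchored at the start of the title.
conference_pattern = r"^[A-Z]+\s\'\d{2}:\s"

# Hypothetical titles for illustration only.
samples = [
    "CHI '21: Designing for Trust",   # matches: prefix is removed
    "Designing for Trust",            # no prefix: unchanged
    "UbiComp '19: Sensing at Scale",  # mixed-case acronym: unchanged
    "CSCW \u201920: Remote Work",     # typographic apostrophe: unchanged
]

cleaned = [re.sub(conference_pattern, "", title) for title in samples]

for before, after in zip(samples, cleaned):
    print(f"{before!r} -> {after!r}")
```

Two limits of the pattern show up here: acronyms containing lowercase letters and prefixes written with a typographic apostrophe are left in place. The first two titles also become identical once cleaned, so a title-based duplicate check run after cleaning may flag rows that a check run before cleaning would not.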
The initial dataset had 286 entries. After duplicate removal, 185 unique entries remained, and conference proceeding details were removed from their titles. This process can be applied to similar datasets requiring deduplication and title cleaning.
Of these 185 scientific articles, papers aligned with the study's thematic focus were selected through manual analysis. This selection refined the pool to 26 papers central to our research.