A Survey of Machine Learning’s Integration into Traditional Software Risk Management

Appendix: Data Cleaning

This appendix describes how duplicates were removed and titles were cleaned to make the data suitable for further analysis. Data cleaning is a fundamental aspect of data analysis and is particularly important when working with real-world datasets, which often contain missing, duplicate, or inconsistent records. Below, we provide the detailed steps and the rationale behind each one.

The data-cleaning process is broken down into two major steps:

  1. Duplicate Removal: Removing duplicate entries based on multiple criteria.
  2. Title Cleaning: Removing conference proceeding information from the titles.

Software and Tools

  • Python 3
  • Pandas library for data manipulation
  • Regular Expressions (regex) for string pattern matching

Step 1: Duplicate Removal

Sub-step 1.1: Remove Identical Rows

To remove completely identical rows, the following code snippet can be executed:

import pandas as pd

# Load the merged article records
df = pd.read_csv('Merged_Articles.csv')

# Drop rows that are identical across every column
df = df.drop_duplicates()
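
To see how many fully identical rows this step removes, the row count can be compared before and after deduplication. This is a minimal sketch assuming the same Merged_Articles.csv file as above:

# Compare row counts before and after removing identical rows
raw = pd.read_csv('Merged_Articles.csv')
print('Before:', len(raw), 'After:', len(raw.drop_duplicates()))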

Sub-step 1.2: Remove Duplicates Based on Titles

Duplicate articles might have different abstracts or other fields. Thus, duplicates were further removed based on the Title field alone:

# Keep only the first row for each distinct title
df = df.drop_duplicates(subset='Title')

Sub-step 1.3: Remove Duplicates Based on Abstracts

Finally, duplicates were removed based on the Abstract field. Rows were sorted so that entries with a missing abstract appear last, and only the first occurrence of each abstract was kept:

# Sort so rows with missing abstracts come last, then keep the first
# occurrence of each abstract value
df = df.sort_values(by='Abstract', na_position='last') \
       .drop_duplicates(subset='Abstract', keep='first')
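
Note that drop_duplicates treats missing values as equal to one another, so rows without an abstract are also collapsed into a single entry. A quick check of how many entries lack an abstract (a minimal sketch, reusing the Abstract column from the dataset above) makes the effect of this step easier to verify:

# Report the remaining row count and how many entries have no abstract
print('Rows remaining:', len(df))
print('Rows with a missing abstract:', df['Abstract'].isna().sum())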

Step 2: Title Cleaning

Titles were cleaned to remove the conference proceeding details. A Regular Expression (regex) pattern was used to identify and remove such information:

import re

# Regex pattern matching a leading conference prefix,
# e.g. an upper-case acronym followed by a two-digit year and a colon
conference_pattern = r"^[A-Z]+\s\'\d{2}:\s"

# Replace matched patterns in the 'Title' column with an empty string
df['Cleaned_Title'] = df['Title'] \
    .str.replace(conference_pattern, '', regex=True)
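
As an illustration of what the pattern removes, it can be applied to a single prefixed title; the title below is invented for demonstration and is not taken from the dataset:

# Example: the conference prefix is stripped, leaving only the article title
example = "ICSE '21: An Example Article Title"
print(re.sub(conference_pattern, '', example))
# -> "An Example Article Title"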

Results

The initial dataset contained 286 entries. After the duplicate removal process, 185 unique entries remained, and the titles were successfully cleaned of conference proceeding prefixes. This process can be applied to similar datasets requiring deduplication and title cleaning.
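
If the cleaned dataset is to be reused in the subsequent selection step, it can be written back to disk. The output filename below is an assumption for illustration, not part of the original workflow:

# Persist the deduplicated, cleaned dataset for later analysis
df.to_csv('Cleaned_Articles.csv', index=False)  # hypothetical output filename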

Out of the 185 unique articles identified, a manual analysis was conducted to select papers aligned with the study's thematic focus. This selection process refined the pool to 26 scientific papers central to our research.
