A Survey of Machine Learning’s Integration into Traditional Software Risk Management

Appendix: Data Cleaning

This appendix describes how duplicates were removed and titles were cleaned to make the data suitable for further analysis. Data cleaning is a fundamental aspect of data analysis and is particularly important when working with real-world datasets, which often contain missing, duplicate, or inconsistent records. Below, we provide the detailed steps and the rationale behind each one.

The data-cleaning process is broken down into two major steps:

  1. Duplicate Removal: Removing duplicate entries based on multiple criteria.
  2. Title Cleaning: Removing conference proceeding information from the titles.

Software and Tools

  • Python 3
  • Pandas library for data manipulation
  • Regular Expressions (regex) for string pattern matching

Step 1: Duplicate Removal

Sub-step 1.1: Remove Identical Rows

To remove completely identical rows, the following code snippet can be executed:

import pandas as pd

# Load the merged article records
df = pd.read_csv('Merged_Articles.csv')

# Drop rows that are identical across every column
df = df.drop_duplicates()
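
To see how many fully identical rows this step removes, the row count can be compared before and after deduplication. This is a minimal sketch assuming the same Merged_Articles.csv file as above:

# Compare row counts before and after removing identical rows
raw = pd.read_csv('Merged_Articles.csv')
print('Before:', len(raw), 'After:', len(raw.drop_duplicates()))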

Sub-step 1.2: Remove Duplicates Based on Titles

Duplicate articles might have different abstracts or other fields. Thus, duplicates were further removed based on the Title field alone:

# Keep only the first row for each distinct title
df = df.drop_duplicates(subset='Title')

Sub-step 1.3: Remove Duplicates Based on Abstracts

Finally, duplicates were removed based on the Abstract field. Rows were sorted so that entries with a missing abstract appear last, and only the first occurrence of each abstract was kept:

# Sort so rows with missing abstracts come last, then keep the first
# occurrence of each abstract value
df = df.sort_values(by='Abstract', na_position='last') \
       .drop_duplicates(subset='Abstract', keep='first')
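
Note that drop_duplicates treats missing values as equal to one another, so rows without an abstract are also collapsed into a single entry. A quick check of how many entries lack an abstract (a minimal sketch, reusing the Abstract column from the dataset above) makes the effect of this step easier to verify:

# Report the remaining row count and how many entries have no abstract
print('Rows remaining:', len(df))
print('Rows with a missing abstract:', df['Abstract'].isna().sum())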

Step 2: Title Cleaning

Titles were cleaned to remove the conference proceeding details. A Regular Expression (regex) pattern was used to identify and remove such information:

import re

# Regex pattern matching a leading conference prefix,
# e.g. an upper-case acronym followed by a two-digit year and a colon
conference_pattern = r"^[A-Z]+\s\'\d{2}:\s"

# Replace matched patterns in the 'Title' column with an empty string
df['Cleaned_Title'] = df['Title'] \
    .str.replace(conference_pattern, '', regex=True)
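
As an illustration of what the pattern removes, it can be applied to a single prefixed title; the title below is invented for demonstration and is not taken from the dataset:

# Example: the conference prefix is stripped, leaving only the article title
example = "ICSE '21: An Example Article Title"
print(re.sub(conference_pattern, '', example))
# -> "An Example Article Title"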

Results

The initial dataset contained 286 entries. After the duplicate removal process, 185 unique entries remained, and the titles were successfully cleaned of conference proceeding prefixes. This process can be applied to similar datasets requiring deduplication and title cleaning.
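
If the cleaned dataset is to be reused in the subsequent selection step, it can be written back to disk. The output filename below is an assumption for illustration, not part of the original workflow:

# Persist the deduplicated, cleaned dataset for later analysis
df.to_csv('Cleaned_Articles.csv', index=False)  # hypothetical output filename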

Out of the 185 unique articles identified, a manual analysis was conducted to select papers aligned with the study's thematic focus. This selection process refined the pool to 26 scientific papers central to our research.
