Pusat Penelitian, Pengabdian kepada Masyarakat dan Publikasi Internasional

Data Preprocessing: The Prep That Makes Data Science Work

Posted on June 18, 2025 (updated June 29, 2025) by Fachrur Rozi

Before you can build a smart model, make accurate predictions, or impress your boss with a dashboard, there’s one crucial thing you need to do: prep the data. This step is called data preprocessing, and while it may not sound flashy, it’s what separates good data science from garbage results.

Think of it like preparing ingredients before cooking. You don’t toss raw potatoes into a cake and hope for the best, right?


💡 What is Data Preprocessing?

Data preprocessing is the process of transforming raw data into a clean, usable format for analysis or machine learning. It includes everything from fixing errors to normalizing numbers and turning words into numbers a model can understand.

In short, it’s about getting your data ready to be useful.


🔍 Why It Matters

Messy, inconsistent, or incorrectly formatted data can completely break your model — or worse, make it confidently wrong.

✅ Good preprocessing means:

  • Your model trains faster and generalizes better
  • Your insights are accurate and trustworthy
  • You spend less time debugging later

🧰 Common Steps in Data Preprocessing

Let’s walk through some of the most common things you’ll do when preparing your data:

1. Cleaning the Data

  • Handle missing values
  • Remove duplicates
  • Fix inconsistent formatting or typos
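
As a rough illustration of the three cleaning tasks above, here is a minimal pandas sketch. The table, its column names, and the choice of median imputation are assumptions made for this example, not a recommendation for every dataset.

```python
import pandas as pd

# Hypothetical toy data: a missing age, a duplicated row, and inconsistent city spelling.
raw = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Cara"],
    "age": [34.0, None, None, 29.0],
    "city": [" medan", "Jakarta", "Jakarta", "MEDAN "],
})

# Remove exact duplicate rows (the first occurrence is kept).
clean = raw.drop_duplicates().reset_index(drop=True)

# Handle missing values: fill missing ages with the median of the observed ages.
clean["age"] = clean["age"].fillna(clean["age"].median())

# Fix inconsistent formatting: strip stray whitespace and unify capitalization.
clean["city"] = clean["city"].str.strip().str.title()

print(clean)
```

Whether to fill, drop, or flag missing values depends on how much data is missing and why, so treat the median fill here as one option among several.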

2. Encoding Categorical Variables

Convert text (like “Male”, “Female”) into numbers:

  • One-hot encoding
  • Label encoding
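
The two encodings can be compared side by side on a small example. This sketch assumes a hypothetical `gender` column and uses `pandas.get_dummies` for one-hot encoding and scikit-learn's `LabelEncoder` for label encoding.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column.
df = pd.DataFrame({"gender": ["Male", "Female", "Female", "Male"]})

# One-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(df["gender"], prefix="gender", dtype=int)

# Label encoding: each category is mapped to an integer
# (classes are sorted, so "Female" becomes 0 and "Male" becomes 1).
encoder = LabelEncoder()
labels = encoder.fit_transform(df["gender"])

print(one_hot)
print(labels)
```

Label encoding implies an order between categories, which some models may misread as meaningful. For input features, one-hot encoding is usually the safer choice unless the categories really are ordered.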

3. Feature Scaling

Bring all features onto a similar scale. This matters most for algorithms that rely on distances or gradient descent, such as SVM, KNN, and neural networks:

  • Standardization: rescale each feature to mean = 0, std = 1
  • Normalization (min-max scaling): rescale each feature to the range 0 to 1
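
Both scalers are available in scikit-learn. The sketch below applies them to a single made-up feature so the effect is easy to check by eye. The input values are an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature with a wide range of values.
X = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

# Standardization: subtract the mean, then divide by the standard deviation.
standardized = StandardScaler().fit_transform(X)

# Normalization (min-max): map the smallest value to 0 and the largest to 1.
normalized = MinMaxScaler().fit_transform(X)

print(standardized.ravel())
print(normalized.ravel())
```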

4. Feature Selection/Extraction

Remove irrelevant or redundant features, or combine features to create new ones that provide more value.
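
Here is one small way to do both halves of this step: derive a new feature from two existing ones, then drop a feature that carries no information. The column names are hypothetical, and `VarianceThreshold` is only one of several selection tools in scikit-learn.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical features; "country" is constant, so it cannot help a model.
features = pd.DataFrame({
    "total_charges": [100.0, 240.0, 90.0, 400.0],
    "months_active": [2, 6, 3, 8],
    "country": [1, 1, 1, 1],
})

# Feature extraction: combine two columns into a new, more informative one.
features["avg_monthly_charge"] = features["total_charges"] / features["months_active"]

# Feature selection: drop columns whose variance is zero.
selector = VarianceThreshold(threshold=0.0)
selector.fit(features)
kept = features.columns[selector.get_support()].tolist()
selected = features[kept]

print(kept)
```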

5. Train-Test Split

Separate your data so the model can be trained on one part and tested on another:

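from sklearn.model_selection import train_test_split
# X holds the features and y the target; 20% of the rows are held out for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)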

6. Outlier Detection and Treatment

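Use visualization (such as box plots) or statistical methods (such as z-scores or the interquartile range) to spot extreme values that could skew your analysis, then decide whether to remove, cap, or keep them.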
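
One common statistical method is the interquartile range (IQR) rule, sketched below on made-up monthly charges. The 1.5 multiplier is a convention rather than a law, and capping is only one possible treatment.

```python
import pandas as pd

# Hypothetical monthly charges with one extreme value.
charges = pd.Series(
    [20.0, 22.0, 25.0, 27.0, 30.0, 31.0, 33.0, 250.0], name="monthly_charges"
)

# IQR rule: flag values more than 1.5 * IQR outside the middle 50% of the data.
q1 = charges.quantile(0.25)
q3 = charges.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (charges < lower) | (charges > upper)
outliers = charges[is_outlier]

# One possible treatment: cap extreme values at the fences instead of dropping rows.
capped = charges.clip(lower=lower, upper=upper)

print(outliers)
```

Before removing or capping anything, it is worth checking whether the extreme value is an error or a genuine observation.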


🧪 Real-World Example

Let’s say you’re working on predicting customer churn for a telecom company. Your raw data includes:

  • Customer ID (not useful)
  • Gender (text)
  • Monthly charges (numeric, but on a different scale from the other features)
  • Missing values in contract type
  • Thousands of entries

Preprocessing might include:

  • Dropping the ID column
  • Encoding gender
  • Filling in missing contract types with the mode (the most frequent value)
  • Scaling monthly charges
  • Splitting into training and testing sets
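
The churn steps listed above could be wired together roughly as follows. This is a sketch under assumptions: the ten rows, the column names, and the `churn` target are invented for illustration, and scikit-learn's `Pipeline` and `ColumnTransformer` are one reasonable way to organize the work, not the only one.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical churn data; real data would have thousands of rows.
data = pd.DataFrame({
    "customer_id": [f"C{i:03d}" for i in range(1, 11)],
    "gender": ["Male", "Female"] * 5,
    "contract_type": ["Monthly", "Yearly", "Monthly", np.nan, "Monthly",
                      "Two-year", "Monthly", np.nan, "Yearly", "Monthly"],
    "monthly_charges": [29.9, 56.2, 70.7, 42.3, 99.6, 20.1, 89.1, 104.8, 55.0, 35.5],
    "churn": [0, 0, 1, 0, 1, 0, 1, 1, 0, 0],
})

# Drop the ID column and separate the target.
X = data.drop(columns=["customer_id", "churn"])
y = data["churn"]

# Split first so that every statistic below is learned from training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Categorical columns: fill missing values with the mode, then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("cat", categorical, ["gender", "contract_type"]),
    ("num", StandardScaler(), ["monthly_charges"]),  # scale monthly charges
])

X_train_ready = preprocess.fit_transform(X_train)  # learn and apply on training data
X_test_ready = preprocess.transform(X_test)        # reuse the learned parameters

print(X_train_ready.shape, X_test_ready.shape)
```

Keeping the steps in a single object means the same transformations can be replayed on new data without rewriting them by hand.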

⚠️ Common Mistakes to Avoid

  • Forgetting to scale data for models that need it
  • Using test data during training (data leakage!), including fitting scalers or encoders on the full dataset before splitting
  • Dropping too much data when handling missing values
  • Not documenting what transformations were done
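
The leakage mistake is easy to make with scaling, so a short contrast may help. The sketch below uses randomly generated numbers (an assumption for illustration) and shows a scaler fitted on all rows next to one fitted on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical feature: 100 random values around 50.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 1))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Leaky: the scaler sees the test rows when it computes the mean and std.
leaky_scaler = StandardScaler().fit(X)

# Safe: statistics come from the training rows only and are reused on the test rows.
safe_scaler = StandardScaler().fit(X_train)
X_train_scaled = safe_scaler.transform(X_train)
X_test_scaled = safe_scaler.transform(X_test)

print(leaky_scaler.mean_, safe_scaler.mean_)
```

The two means are usually close on a toy example like this, but the principle matters: anything learned from the test rows makes the test score less trustworthy.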

📦 Tools to Help

  • Pandas: data manipulation and cleaning
  • Scikit-learn: functions for scaling, encoding, and splitting
  • OpenRefine: cleaning and exploring messy data
  • Pipelines (for example, scikit-learn's Pipeline): automate preprocessing in a repeatable way

🧾 Final Thoughts

Data preprocessing isn’t optional — it’s critical. It’s what turns raw, chaotic data into something your models can actually learn from. Think of it as the “pre-game warm-up” before the big match. The better the warm-up, the better the performance.

So while it might not get the spotlight, preprocessing is where great data science truly begins.

