
12 Scikit-Learn Pipeline Techniques for Data Scientists
As data scientists, we often find ourselves working with complex datasets and performing multiple tasks such as feature selection, scaling, transformation, and model fitting. In such scenarios, Scikit-Learn’s pipeline feature is a lifesaver. Pipelines allow us to chain together various steps (transformers and estimators) in a linear fashion, making our code more readable and maintainable.
In this article, we’ll explore 12 different Scikit-Learn pipeline techniques that you can use in your next data science project.
1. Simple Pipeline for Predictive Modeling
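The bread-and-butter pattern: chain a scaler and a classifier so that scaling parameters are learned from the training data only and applied consistently to anything you later predict on.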
```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a pipeline with StandardScaler and LogisticRegression
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)
```
2. Pipeline with Univariate Feature Selection
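SelectKBest keeps the k highest-scoring features according to a univariate statistical test (an ANOVA F-test by default). Since iris has only four features, k must be 4 or less.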
```python
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a pipeline with SelectKBest and LogisticRegression
# (iris has only 4 features, so k cannot exceed 4)
pipeline = Pipeline([
    ('selector', SelectKBest(k=2)),
    ('model', LogisticRegression())
])

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)
```
3. Using GridSearchCV for Hyperparameter Tuning
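Pipeline steps expose their hyperparameters to GridSearchCV through the <step name>__<parameter> naming convention, so one grid can tune any step. Note that the l1 penalty requires a solver that supports it, such as liblinear.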
```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a pipeline with LogisticRegression
# (the liblinear solver supports both the l1 and l2 penalties)
pipeline = Pipeline([
    ('model', LogisticRegression(solver='liblinear'))
])

# Define hyperparameters to tune, addressed as <step name>__<parameter>
param_grid = {
    'model__C': [0.1, 10],
    'model__penalty': ['l1', 'l2']
}

# Perform grid search for hyperparameter tuning
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
```
4. Using Pipeline with Cross-Validation
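Passing the whole pipeline to cross_val_score means every preprocessing step (here, a StandardScaler) is refit inside each fold, so no information leaks from the validation fold into training.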
```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Create a pipeline with StandardScaler and LogisticRegression
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

# Perform cross-validation; the scaler is refit within each fold
scores = cross_val_score(pipeline, X, y, cv=5)
print("Accuracy:", scores.mean())
```
5. Using Pipeline with Text Feature Extraction
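TfidfVectorizer turns raw text into numeric features inside the pipeline, so the vocabulary is learned only from the training documents. The loader for this dataset is fetch_20newsgroups (it downloads the data on first use); restricting it to two categories keeps the example quick.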
```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_20newsgroups

# Load the dataset (the loader is fetch_20newsgroups, not load_20newsgroups)
data = fetch_20newsgroups(subset='train', categories=['sci.space', 'rec.autos'])
X, y = data.data, data.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a pipeline with TfidfVectorizer and LogisticRegression
pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('model', LogisticRegression(max_iter=1000))
])

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)
```
6. Using Pipeline with Image Preprocessing
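Scikit-learn has no built-in image resampler, so this is a sketch of a common pattern instead: rescale pixel intensities and compress them with PCA before classification, using the built-in digits dataset.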
```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

# Load the dataset (8x8 grayscale digit images, flattened to 64 features)
digits = load_digits()
X, y = digits.data, digits.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a single pipeline: rescale pixel intensities, compress, then classify
pipeline = Pipeline([
    ('scaler', MinMaxScaler()),     # map pixel values into [0, 1]
    ('pca', PCA(n_components=32)),  # compress the 64 pixel features
    ('model', LogisticRegression(max_iter=1000))
])

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)
```
7. Using Pipeline with Time Series Data
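Scikit-learn ships neither an Airbnb dataset nor a SeasonalityExtractor, so the sketch below runs on stated assumptions: synthetic monthly data, a hypothetical add_seasonal_features helper wrapped in FunctionTransformer, a Ridge regressor (the target is continuous), and TimeSeriesSplit so that each validation fold lies strictly after its training data.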
```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Generate synthetic monthly data: trend + yearly seasonality + noise
rng = np.random.default_rng(42)
months = np.arange(240).reshape(-1, 1)
y = (10 + 0.05 * months.ravel()
     + 3 * np.sin(2 * np.pi * months.ravel() / 12)
     + rng.normal(0, 0.5, 240))

# A hypothetical "seasonalizer": encode the yearly cycle as sine/cosine features
def add_seasonal_features(X):
    t = X[:, 0]
    return np.column_stack([X, np.sin(2 * np.pi * t / 12), np.cos(2 * np.pi * t / 12)])

# Create a pipeline: seasonal feature extraction, then a regressor
pipeline = Pipeline([
    ('seasonalizer', FunctionTransformer(add_seasonal_features)),
    ('model', Ridge())
])

# Evaluate with time-ordered splits so the model never trains on future data
scores = cross_val_score(pipeline, months, y, cv=TimeSeriesSplit(n_splits=5))
print("R^2 per split:", scores)
```
8. Using Pipeline with Graph-Based Data
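Scikit-learn likewise has no graph dataset or FeatureExtractor, so this is a minimal sketch under assumptions: a random undirected graph, hypothetical degree-based labels, and a FunctionTransformer that derives simple node features (degree and two-hop reach) from rows of the adjacency matrix.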
```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Generate a random undirected graph; each sample is one row of the adjacency matrix
rng = np.random.default_rng(42)
n_nodes = 200
adjacency = (rng.random((n_nodes, n_nodes)) < 0.05).astype(float)
adjacency = np.maximum(adjacency, adjacency.T)

# Hypothetical node labels: is the node better connected than average?
y = (adjacency.sum(axis=1) > adjacency.sum() / n_nodes).astype(int)

# Turn adjacency rows into simple node features: degree and two-hop reach
def extract_node_features(A_rows):
    degree = A_rows.sum(axis=1, keepdims=True)
    two_hop = ((A_rows @ adjacency) > 0).sum(axis=1, keepdims=True)  # uses the full graph
    return np.hstack([degree, two_hop])

# Split the nodes into training and test sets
X_train, X_test, y_train, y_test = train_test_split(adjacency, y, test_size=0.2, random_state=42)

# Create a pipeline: graph feature extraction, scaling, then a classifier
pipeline = Pipeline([
    ('extractor', FunctionTransformer(extract_node_features)),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)
```
9. Using Pipeline with Anomaly Detection
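There is no load_anomalies in scikit-learn, so this sketch injects synthetic outliers into a single blob from make_blobs. LocalOutlierFactor must be constructed with novelty=True to support the usual fit-then-predict workflow inside a pipeline.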
```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import LocalOutlierFactor
from sklearn.datasets import make_blobs

# Generate a single cluster and inject a handful of uniform outliers
X, _ = make_blobs(n_samples=300, centers=1, random_state=42)
outliers = np.random.default_rng(42).uniform(-10, 10, size=(10, 2))
X = np.vstack([X, outliers])

# Create a pipeline with StandardScaler and LocalOutlierFactor
# (novelty=True is required to fit on training data and predict on new points)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LocalOutlierFactor(novelty=True))
])

# Fit the pipeline to the training data
pipeline.fit(X)

# Predict on new points: +1 means inlier, -1 means outlier
print(pipeline.predict([[0, 0], [9, 9]]))
```
10. Using Pipeline with Clustering
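Clustering is unsupervised, so there is no target to split on; make_blobs stands in for the nonexistent load_cluster. Scaling before KMeans keeps all features on an equal footing in the distance computations.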
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate synthetic clustered data
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Create a pipeline with StandardScaler and KMeans
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', KMeans(n_clusters=3, n_init=10, random_state=42))
])

# Fit the pipeline and inspect the first few cluster assignments
pipeline.fit(X)
print(pipeline.named_steps['model'].labels_[:10])
```
11. Using Pipeline with Dimensionality Reduction
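A pipeline can end in a transformer rather than a predictor, in which case fit_transform returns the transformed data. The digits dataset stands in for the nonexistent load_reduce.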
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits

# Load the dataset
digits = load_digits()
X, y = digits.data, digits.target

# Create a pipeline with StandardScaler and PCA
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2))
])

# Fit the pipeline and project the data into two dimensions
X_2d = pipeline.fit_transform(X)
print(X_2d.shape)  # (1797, 2)
```
12. Using Pipeline with Model-Based Feature Selection
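SelectFromModel must be given an estimator whose fitted coefficients or feature importances drive the selection; here a RandomForestClassifier (one reasonable choice, any importance-exposing model works) selects features for a downstream LogisticRegression, with the breast cancer dataset standing in for the nonexistent load_features.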
```python
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

# Load the dataset (30 numeric features to select from)
data = load_breast_cancer()
X, y = data.data, data.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a pipeline: model-based feature selection, scaling, then a classifier
# (SelectFromModel requires an estimator that exposes feature importances)
pipeline = Pipeline([
    ('selector', SelectFromModel(RandomForestClassifier(random_state=42))),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000))
])

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)
```
In conclusion, pipelines are a powerful tool in scikit-learn for creating complex machine learning workflows. By combining multiple steps into a single pipeline, you can simplify your code and make it easier to maintain and modify. The examples above demonstrate how to use pipelines with various types of data and models, including classification, regression, clustering, dimensionality reduction, feature selection, anomaly detection, graph-based data, time series data, and more.