
12 Scikit-Learn Pipeline Techniques for Data Scientists
As data scientists, we often find ourselves working with complex datasets and performing multiple tasks such as feature selection, scaling, transformation, and model fitting. In such scenarios, Scikit-Learn’s pipeline feature is a lifesaver. Pipelines allow us to chain together various steps (transformers and estimators) in a linear fashion, making our code more readable and maintainable.
In this article, we’ll explore 12 different Scikit-Learn pipeline techniques that you can use in your next data science project.
1. Simple Pipeline for Predictive Modeling
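The bread-and-butter pattern: chain a scaler and a classifier so that scaling parameters are learned from the training data only and applied consistently to anything you later predict on.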
```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a pipeline with StandardScaler and LogisticRegression
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)
```
2. Pipeline with Univariate Feature Selection
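SelectKBest keeps the k highest-scoring features according to a univariate statistical test (an ANOVA F-test by default). Since iris has only four features, k must be 4 or less.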
```python
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a pipeline with SelectKBest and LogisticRegression
# (iris has only 4 features, so k cannot exceed 4)
pipeline = Pipeline([
    ('selector', SelectKBest(k=2)),
    ('model', LogisticRegression())
])

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)
```
3. Using GridSearchCV for Hyperparameter Tuning
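Pipeline steps expose their hyperparameters to GridSearchCV through the <step name>__<parameter> naming convention, so one grid can tune any step. Note that the l1 penalty requires a solver that supports it, such as liblinear.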
```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a pipeline with LogisticRegression
# (the liblinear solver supports both the l1 and l2 penalties)
pipeline = Pipeline([
    ('model', LogisticRegression(solver='liblinear'))
])

# Define hyperparameters to tune, addressed as <step name>__<parameter>
param_grid = {
    'model__C': [0.1, 10],
    'model__penalty': ['l1', 'l2']
}

# Perform grid search for hyperparameter tuning
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
```
4. Using Pipeline with Cross-Validation
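Passing the whole pipeline to cross_val_score means every preprocessing step (here, a StandardScaler) is refit inside each fold, so no information leaks from the validation fold into training.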
```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Create a pipeline with StandardScaler and LogisticRegression
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

# Perform cross-validation; the scaler is refit within each fold
scores = cross_val_score(pipeline, X, y, cv=5)
print("Accuracy:", scores.mean())
```
5. Using Pipeline with Text Feature Extraction
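TfidfVectorizer turns raw text into numeric features inside the pipeline, so the vocabulary is learned only from the training documents. The loader for this dataset is fetch_20newsgroups (it downloads the data on first use); restricting it to two categories keeps the example quick.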
```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_20newsgroups

# Load the dataset (the loader is fetch_20newsgroups, not load_20newsgroups)
data = fetch_20newsgroups(subset='train', categories=['sci.space', 'rec.autos'])
X, y = data.data, data.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a pipeline with TfidfVectorizer and LogisticRegression
pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('model', LogisticRegression(max_iter=1000))
])

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)
```
6. Using Pipeline with Image Preprocessing
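Scikit-learn has no built-in image resampler, so this is a sketch of a common pattern instead: rescale pixel intensities and compress them with PCA before classification, using the built-in digits dataset.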
```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

# Load the dataset (8x8 grayscale digit images, flattened to 64 features)
digits = load_digits()
X, y = digits.data, digits.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a single pipeline: rescale pixel intensities, compress, then classify
pipeline = Pipeline([
    ('scaler', MinMaxScaler()),     # map pixel values into [0, 1]
    ('pca', PCA(n_components=32)),  # compress the 64 pixel features
    ('model', LogisticRegression(max_iter=1000))
])

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)
```
7. Using Pipeline with Time Series Data
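Scikit-learn ships neither an Airbnb dataset nor a SeasonalityExtractor, so the sketch below runs on stated assumptions: synthetic monthly data, a hypothetical add_seasonal_features helper wrapped in FunctionTransformer, a Ridge regressor (the target is continuous), and TimeSeriesSplit so that each validation fold lies strictly after its training data.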
```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Generate synthetic monthly data: trend + yearly seasonality + noise
rng = np.random.default_rng(42)
months = np.arange(240).reshape(-1, 1)
y = (10 + 0.05 * months.ravel()
     + 3 * np.sin(2 * np.pi * months.ravel() / 12)
     + rng.normal(0, 0.5, 240))

# A hypothetical "seasonalizer": encode the yearly cycle as sine/cosine features
def add_seasonal_features(X):
    t = X[:, 0]
    return np.column_stack([X, np.sin(2 * np.pi * t / 12), np.cos(2 * np.pi * t / 12)])

# Create a pipeline: seasonal feature extraction, then a regressor
pipeline = Pipeline([
    ('seasonalizer', FunctionTransformer(add_seasonal_features)),
    ('model', Ridge())
])

# Evaluate with time-ordered splits so the model never trains on future data
scores = cross_val_score(pipeline, months, y, cv=TimeSeriesSplit(n_splits=5))
print("R^2 per split:", scores)
```
8. Using Pipeline with Graph-Based Data
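Scikit-learn likewise has no graph dataset or FeatureExtractor, so this is a minimal sketch under assumptions: a random undirected graph, hypothetical degree-based labels, and a FunctionTransformer that derives simple node features (degree and two-hop reach) from rows of the adjacency matrix.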
```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Generate a random undirected graph; each sample is one row of the adjacency matrix
rng = np.random.default_rng(42)
n_nodes = 200
adjacency = (rng.random((n_nodes, n_nodes)) < 0.05).astype(float)
adjacency = np.maximum(adjacency, adjacency.T)

# Hypothetical node labels: is the node better connected than average?
y = (adjacency.sum(axis=1) > adjacency.sum() / n_nodes).astype(int)

# Turn adjacency rows into simple node features: degree and two-hop reach
def extract_node_features(A_rows):
    degree = A_rows.sum(axis=1, keepdims=True)
    two_hop = ((A_rows @ adjacency) > 0).sum(axis=1, keepdims=True)  # uses the full graph
    return np.hstack([degree, two_hop])

# Split the nodes into training and test sets
X_train, X_test, y_train, y_test = train_test_split(adjacency, y, test_size=0.2, random_state=42)

# Create a pipeline: graph feature extraction, scaling, then a classifier
pipeline = Pipeline([
    ('extractor', FunctionTransformer(extract_node_features)),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)
```
9. Using Pipeline with Anomaly Detection
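There is no load_anomalies in scikit-learn, so this sketch injects synthetic outliers into a single blob from make_blobs. LocalOutlierFactor must be constructed with novelty=True to support the usual fit-then-predict workflow inside a pipeline.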
```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import LocalOutlierFactor
from sklearn.datasets import make_blobs

# Generate a single cluster and inject a handful of uniform outliers
X, _ = make_blobs(n_samples=300, centers=1, random_state=42)
outliers = np.random.default_rng(42).uniform(-10, 10, size=(10, 2))
X = np.vstack([X, outliers])

# Create a pipeline with StandardScaler and LocalOutlierFactor
# (novelty=True is required to fit on training data and predict on new points)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LocalOutlierFactor(novelty=True))
])

# Fit the pipeline to the training data
pipeline.fit(X)

# Predict on new points: +1 means inlier, -1 means outlier
print(pipeline.predict([[0, 0], [9, 9]]))
```
10. Using Pipeline with Clustering
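Clustering is unsupervised, so there is no target to split on; make_blobs stands in for the nonexistent load_cluster. Scaling before KMeans keeps all features on an equal footing in the distance computations.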
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate synthetic clustered data
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Create a pipeline with StandardScaler and KMeans
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', KMeans(n_clusters=3, n_init=10, random_state=42))
])

# Fit the pipeline and inspect the first few cluster assignments
pipeline.fit(X)
print(pipeline.named_steps['model'].labels_[:10])
```
11. Using Pipeline with Dimensionality Reduction
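A pipeline can end in a transformer rather than a predictor, in which case fit_transform returns the transformed data. The digits dataset stands in for the nonexistent load_reduce.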
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits

# Load the dataset
digits = load_digits()
X, y = digits.data, digits.target

# Create a pipeline with StandardScaler and PCA
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2))
])

# Fit the pipeline and project the data into two dimensions
X_2d = pipeline.fit_transform(X)
print(X_2d.shape)  # (1797, 2)
```
12. Using Pipeline with Model-Based Feature Selection
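SelectFromModel must be given an estimator whose fitted coefficients or feature importances drive the selection; here a RandomForestClassifier (one reasonable choice, any importance-exposing model works) selects features for a downstream LogisticRegression, with the breast cancer dataset standing in for the nonexistent load_features.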
```python
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

# Load the dataset (30 numeric features to select from)
data = load_breast_cancer()
X, y = data.data, data.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a pipeline: model-based feature selection, scaling, then a classifier
# (SelectFromModel requires an estimator that exposes feature importances)
pipeline = Pipeline([
    ('selector', SelectFromModel(RandomForestClassifier(random_state=42))),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000))
])

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)
```
In conclusion, pipelines are a powerful tool in scikit-learn for creating complex machine learning workflows. By combining multiple steps into a single pipeline, you can simplify your code and make it easier to maintain and modify. The examples above demonstrate how to use pipelines with various types of data and models, including classification, regression, clustering, dimensionality reduction, feature selection, anomaly detection, graph-based data, time series data, and more.