Scikit-learn (sklearn) is one of the most useful and robust machine learning packages in Python. It provides a set of efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction, through a consistent Python interface. The package is largely written in Python and is built on NumPy, SciPy, and Matplotlib. In this article you will learn more about sklearn's linear regression.

## What is SKlearn linear regression?

Scikit-learn is a Python package that facilitates the application of various machine learning (ML) algorithms for predictive data analysis, such as linear regression.

Linear regression is the process of determining the straight line that best fits a set of scattered data points.

That line can then be used to predict new data points. Because of its simplicity and interpretability, linear regression is a foundational machine learning method.
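For example, here is a minimal, self-contained sketch of that idea using scikit-learn itself. The toy data and variable names below are our own, not from the tutorial's dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y is roughly 2x + 1 with a little noise
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([1.1, 2.9, 5.2, 7.0, 9.1])

# Fit the straight line that best matches the points
model = LinearRegression()
model.fit(X, y)

# The fitted line can now predict new data points
print(model.predict([[5.0]]))  # ≈ 11.09
```

Once fitted, `model.coef_` and `model.intercept_` hold the slope and intercept of the best-fit line.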

## Sklearn’s linear regression concepts

When working with scikit-learn's linear regression approach, you will come across the following fundamental concepts:

- Best fit – the straight line that minimizes the overall discrepancy between itself and the scattered data points
- Coefficient – also known as a parameter, the factor by which a variable is multiplied. In linear regression, a coefficient represents the change in the response variable per unit change in the predictor
- Coefficient of determination – also written R², the square of the correlation coefficient. In regression, it measures the precision or degree of fit
- Correlation – the measurable strength and direction of the association between two variables, often called the ‘degree of correlation’. Values range from -1.0 to 1.0
- Dependent feature – the variable represented as y in the slope equation y = ax + b. Also called the output or response
- Estimated regression line – the straight line that best fits a set of scattered data points
- Independent feature – the variable represented as x in the slope equation y = ax + b. Also called the input or predictor
- Intercept – the point at which the regression line crosses the y-axis, denoted b in the slope equation y = ax + b
- Least squares – a method for finding the best fit to the data by minimizing the sum of the squared discrepancies between the observed and estimated values
- Mean – the average value of a group of numbers; in linear regression, the mean of the response is modeled as a linear function of the predictor
- OLS (ordinary least squares) – the classic fitting method, sometimes referred to simply as linear regression
- Residual – the vertical distance between a data point and the regression line
- Regression – an estimate of the change in one variable in relation to changes in other variables
- Regression model – the fitted formula used to approximate the regression
- Response variables – this covers both the predicted response (the value predicted by the regression) and the actual response (the actual value of the data point)
- Slope – the steepness of the regression line. The linear relationship between two variables can be written using the slope and intercept: y = ax + b
- Simple linear regression – linear regression with one independent variable
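Several of these concepts – least squares, residuals, and the coefficient of determination – can be computed by hand with NumPy. A minimal sketch on made-up data (all names and values below are our own):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 6.2, 7.9, 10.1])

# Closed-form least squares: slope a and intercept b of y = ax + b
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()

# Residuals: the vertical distances between the points and the fitted line
residuals = y - (a * x + b)

# Coefficient of determination R^2
# (for simple linear regression, the square of the correlation coefficient)
r2 = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

print(a, b, r2)  # slope ≈ 2.0, intercept ≈ 0.06, R² close to 1
```

For this nearly linear toy data, the residuals are small and R² is close to 1, which is exactly the "degree of fit" reading of the coefficient of determination above.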

## How to create a linear regression model with Sklearn

### Step 1: Import all necessary libraries

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing, svm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
```

### Step 2: Read the dataset

```python
# Change the working directory to the location of the dataset,
# e.g. cd C:\Users\Dev\Desktop\Kaggle\Salinity
df = pd.read_csv('bottle.csv')
df_binary = df[['Salnty', 'T_degC']]  # Keep only the two selected attributes
df_binary.columns = ['Sal', 'Temp']   # Rename the columns to make the code easier to write
df_binary.head()                      # Show the first rows along with the column names
```

### Step 3: Explore the data scatter

```python
sns.lmplot(x="Sal", y="Temp", data=df_binary, order=2, ci=None)  # Plot the scattered data
```

### Step 4: Clean the data

```python
# Eliminate NaN or missing input values
df_binary.fillna(method='ffill', inplace=True)
```
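As an aside, forward fill simply propagates the last valid value downward through each column. A toy illustration (the frame below is made up; note that recent pandas versions prefer `df.ffill()` over `fillna(method='ffill')`, which is deprecated):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Sal": [33.4, np.nan, 33.6],
                   "Temp": [10.5, 10.2, np.nan]})

# Forward fill: each NaN takes the last valid value above it
filled = df.ffill()
print(filled)
```

After the fill, the NaN in `Sal` becomes 33.4 and the NaN in `Temp` becomes 10.2. A NaN in the very first row has nothing above it and would remain, which is why the next step still calls `dropna`.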

### Step 5: Train our model

```python
df_binary.dropna(inplace=True)  # Drop any remaining rows with NaN values

# Split into independent (X) and dependent (y) variables, converting each
# single-column DataFrame to a NumPy array
X = np.array(df_binary['Sal']).reshape(-1, 1)
y = np.array(df_binary['Temp']).reshape(-1, 1)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

regr = LinearRegression()
regr.fit(X_train, y_train)
print(regr.score(X_test, y_test))
```
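One detail worth knowing: `train_test_split` shuffles the rows randomly, so each run produces a different split and a slightly different score. Passing `random_state` makes the split reproducible. A small sketch on synthetic data (the names and values are our own):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20, dtype=float).reshape(-1, 1)
y = 3 * X.ravel() + 2

# random_state fixes the shuffle, so the same split is produced every run;
# test_size=0.25 reserves a quarter of the rows (here 5 of 20) for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

print(X_train.shape, X_test.shape)  # (15, 1) (5, 1)
```

Reproducible splits make it much easier to compare models fairly, since every candidate is scored on the same held-out rows.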

### Step 6: Examine the results

```python
y_pred = regr.predict(X_test)
plt.scatter(X_test, y_test, color='b')  # Scatter the actual test data
plt.plot(X_test, y_pred, color='k')     # Draw the line of predicted values
plt.show()
```

The low score shows that our regression model does not fit the current data very well, which means the data do not satisfy the assumptions of linear regression. However, a linear regressor may still fit well if only part of the data set is considered. Let's explore that option.

### Step 7: Work with a smaller data set

```python
df_binary500 = df_binary[:500]  # Select the first 500 rows of data
sns.lmplot(x="Sal", y="Temp", data=df_binary500, order=2, ci=None)
```

We can see that the first 500 rows follow a linear pattern. We then proceed exactly as before.

```python
df_binary500.fillna(method='ffill', inplace=True)
df_binary500.dropna(inplace=True)

X = np.array(df_binary500['Sal']).reshape(-1, 1)
y = np.array(df_binary500['Temp']).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

regr = LinearRegression()
regr.fit(X_train, y_train)
print(regr.score(X_test, y_test))

y_pred = regr.predict(X_test)
plt.scatter(X_test, y_test, color='b')
plt.plot(X_test, y_pred, color='k')
plt.show()
```

Other scikit-learn topics worth exploring:

- Sklearn regression models
- Sklearn clustering
- Sklearn SVM
- Sklearn decision trees
- Stochastic gradient descent in Sklearn

## Conclusion

Enroll in Simplilearn's PGP Data Science program to learn more about applying Python and to grow as a Python and data professional. This postgraduate Data Science program, ranked No. 1 in the world by The Economic Times, covers more than a dozen tools, skills, and concepts, and includes seminars by Purdue academics and IBM professionals, as well as private hackathons and IBM Ask Me Anything sessions.

https://www.simplilearn.com/tutorials/scikit-learn-tutorial/sklearn-linear-regression-with-examples