As the name suggests, a sparse matrix is one in which most elements are zero. Sparse matrices come up often in machine learning and its applications: in the data itself, in data preparation, and in several machine learning subfields. Working with such matrices as if they were dense wastes both time and memory.
This article explains what sparse matrices are and how they differ from dense matrices. You will find out where sparse matrices arise, along with their advantages and disadvantages. We will discuss what sparsity is, and the time and space complexity of working with these matrices. We’ll also go through Python implementations of several formats used for storing a sparse matrix.
A sparse matrix is a matrix consisting mostly of zero values; in other words, most of its elements are zero. An example of such a matrix is given below.
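As a small illustration, the sketch below builds a matrix in which most entries are zero. The values and the name `sparse_example` are made up for this example.

```python
import numpy as np

# A hypothetical 4x5 matrix: only 3 of its 20 entries are non-zero.
sparse_example = np.array([
    [0, 0, 3, 0, 0],
    [0, 0, 0, 0, 1],
    [2, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
])

# Count the zeros by subtracting the non-zero count from the total size.
zero_count = sparse_example.size - np.count_nonzero(sparse_example)
print(f"{zero_count} of {sparse_example.size} elements are zero")
```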
A dense matrix
Unlike the matrix above, a dense matrix consists mostly of non-zero elements. Below is an example of one.
When dealing with a sparse matrix, it helps to quantify how sparse it is. This measure is called the sparsity of the matrix.
Sparsity = (number of zero elements) / (total number of elements in the matrix)
We’ll look at the Python commands to calculate this later in the tutorial.
Problems with high sparsity
In practical scenarios, many large matrices consist mostly of zeros. If we store such a matrix as if it were dense, even though it has far fewer non-zero elements than zeros, it requires a lot of memory and wastes resources.
An example of a large matrix is one that records people’s purchases from a huge product catalog such as Amazon’s. Stored as a dense matrix, it would require far more space than its few non-zero entries justify.
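To get a rough feel for the space cost, the sketch below compares the memory used by a dense array with that of a CSR representation of the same data (CSR is one of the formats described later in this article). The matrix dimensions, the number of purchases, and all variable names are assumptions chosen for illustration, not real figures.

```python
import numpy as np
from scipy import sparse

# Hypothetical purchase matrix: 1,000 users x 2,000 products (assumed sizes).
rng = np.random.default_rng(0)
n_users, n_products = 1_000, 2_000
dense = np.zeros((n_users, n_products), dtype=np.float64)

# Mark up to 5,000 random (user, product) pairs as purchases.
rows = rng.integers(0, n_users, size=5_000)
cols = rng.integers(0, n_products, size=5_000)
dense[rows, cols] = 1.0

csr = sparse.csr_matrix(dense)

# The dense array stores every element; CSR stores only the non-zero
# values plus two index arrays.
dense_bytes = dense.nbytes
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
print(f"dense: {dense_bytes} bytes, CSR: {sparse_bytes} bytes")
```

The exact CSR size depends on the index type SciPy picks, but it grows with the number of non-zero entries rather than with the full matrix size.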
Now assume that we have a large sparse matrix and want to perform a computation on it, such as matrix multiplication. Most of the operations would be additions or multiplications involving zeros.
The running time of even the most basic algorithms would be dominated by operations on zeros, which again wastes time and resources.
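The wasted work can be made visible by counting operations in a naive matrix-vector product. This is only a sketch: the matrix, the vector, and the counter names are hypothetical, and real libraries do not multiply element by element in a Python loop.

```python
import numpy as np

# Hypothetical 3x4 sparse matrix and a vector to multiply it with.
matrix = np.array([
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [5, 0, 0, 0],
])
vector = np.array([1, 2, 3, 4])

result = np.zeros(matrix.shape[0], dtype=matrix.dtype)
total_mults = 0   # every scalar multiplication performed
wasted_mults = 0  # multiplications where the matrix entry is zero

for i in range(matrix.shape[0]):
    for j in range(matrix.shape[1]):
        total_mults += 1
        if matrix[i, j] == 0:
            wasted_mults += 1
        result[i] += matrix[i, j] * vector[j]

print(result, total_mults, wasted_mults)
```

Here 10 of the 12 multiplications contribute nothing to the result, which is the overhead a sparse format is meant to avoid.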
Sparse matrices in machine learning
Sparse matrices are encountered in a number of machine learning scenarios.
Sparse matrices turn up in many kinds of data. Examples include:
- Whether an article contains words from the complete dictionary.
- Whether a user has viewed products on Amazon.
- Whether a user has watched a movie from the Netflix movie catalog.
Several encoding schemes used in data preparation also produce sparse matrices. A few of the most frequently used are:
- TF-IDF encoding (term frequency-inverse document frequency)
  - Example: Representing frequency scores of the words in a vocabulary.
- Count encoding
  - Example: Representing the frequency of air travel in a year.
- One-hot encoding
  - Example: Transforming categorical data into sparse binary vectors.
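As a rough sketch of the one-hot case, the code below turns a small categorical column into a sparse binary matrix with one row per sample and one column per category. The `colors` data and the helper names are invented for this example; the `csr_matrix((data, (row, col)), shape=...)` constructor is part of SciPy.

```python
import numpy as np
from scipy import sparse

# Hypothetical categorical column.
colors = ["red", "green", "blue", "green", "red"]

# Assign each distinct category a column index (alphabetical order here).
categories = sorted(set(colors))  # ['blue', 'green', 'red']
column_of = {name: j for j, name in enumerate(categories)}

# Each sample contributes exactly one non-zero entry: a 1 in its category column.
row_idx = np.arange(len(colors))
col_idx = np.array([column_of[c] for c in colors])
values = np.ones(len(colors), dtype=np.int8)

one_hot = sparse.csr_matrix(
    (values, (row_idx, col_idx)),
    shape=(len(colors), len(categories)),
)
print(one_hot.toarray())
```

With many categories, each row still holds a single non-zero value, which is why one-hot encoded data is typically very sparse.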
Fields of study
In cases where the input data is almost always sparse, we need to create specialized models to handle it.
A few examples of these areas are:
- Computer vision, when working with photos that have a lot of black pixels.
- Natural language processing, when working with text documents.
- Recommendation systems, where the total number of items is very large but a typical user interacts with only a small subset of them.
Working with sparse matrices
To work efficiently with a sparse matrix, we need to use alternative data structures to represent the non-zero values.
Sparse matrices can be stored in several formats, each with different trade-offs between construction, element access, and arithmetic. Let’s go through five common ones.
Dictionary of Keys (DOK)
This format stores the matrix as a map whose keys are the (row, column) coordinates of the non-zero elements and whose values are the elements themselves.
Element access takes O(1) time on average when a hash map is used as the underlying data structure. The downside is that the format is slow for arithmetic operations, which need to loop through the elements.
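A minimal sketch of the idea, using a plain Python `dict` as the hash map. The matrix values and the helper `dok_get` are hypothetical and only meant to show the layout.

```python
# Hypothetical dense matrix to convert.
dense = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [5, 0, 0, 7],
]

# Keys are (row, column) coordinates; values are the non-zero elements.
dok = {
    (i, j): value
    for i, row in enumerate(dense)
    for j, value in enumerate(row)
    if value != 0
}


def dok_get(matrix, i, j):
    """Return the element at (i, j); missing keys are implicit zeros."""
    return matrix.get((i, j), 0)


print(dok)
print(dok_get(dok, 2, 3), dok_get(dok, 1, 1))
```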
Coordinate Format (COO)
The coordinate format stores non-zero elements as triplets. The row index, column index, and data value of each element are stored in three parallel arrays, so that element (row[i], column[i]) = value[i].
Appending non-zero entries to the end of the arrays is fast. The problem arises with random reads, where it takes O(n) time to find the value of an element. Sorting the entries of a COO representation can improve access time, but it still won’t be as efficient as the other formats.
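The three parallel arrays can be sketched with plain Python lists. The matrix values and the helper `coo_get` are hypothetical; the linear scan in `coo_get` is what makes random reads O(n).

```python
# Hypothetical dense matrix to convert.
dense = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [5, 0, 0, 7],
]

rows, cols, values = [], [], []
for i, row in enumerate(dense):
    for j, value in enumerate(row):
        if value != 0:
            # Appending a new triplet is a cheap O(1) operation.
            rows.append(i)
            cols.append(j)
            values.append(value)


def coo_get(i, j):
    """Look up element (i, j) by scanning every stored triplet."""
    for r, c, v in zip(rows, cols, values):
        if r == i and c == j:
            return v
    return 0


print(rows, cols, values)
print(coo_get(2, 3), coo_get(1, 1))
```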
Compressed sparse row (CSR) format
This format is similar to COO above, except that the row index array is compressed.
The row index array stores the cumulative number of non-zero elements up to each row, so row[i] contains the index, in both the column and data arrays, of the first non-zero element of row i.
The storage requirement is reduced and random access is faster. Changing a zero element to a non-zero value is relatively slow, because the new entry must be inserted into the arrays.
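A rough sketch of how the compressed row array is built and used. The names `values`, `col_index`, `row_ptr`, and `csr_get` are chosen for this example and are not taken from any library.

```python
# Hypothetical dense matrix to convert.
dense = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [5, 0, 0, 7],
]

values, col_index, row_ptr = [], [], [0]
for row in dense:
    for j, value in enumerate(row):
        if value != 0:
            values.append(value)
            col_index.append(j)
    # After each row, record how many non-zero elements have been seen so far.
    row_ptr.append(len(values))


def csr_get(i, j):
    """Scan only the stored entries of row i, not the whole matrix."""
    for k in range(row_ptr[i], row_ptr[i + 1]):
        if col_index[k] == j:
            return values[k]
    return 0


print(values, col_index, row_ptr)
print(csr_get(2, 3), csr_get(1, 1))
```

Row `i` occupies positions `row_ptr[i]` up to (but not including) `row_ptr[i + 1]`, so an empty row shows up as two equal consecutive entries.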
Compressed Sparse Column (CSC) format
This is identical to CSR except that the column index array is compressed rather than the row index array. Under the CSC format, values are stored in column-major order, so it can be seen as the natural transpose of CSR.
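The transpose relationship can be checked with SciPy: the CSC arrays of a matrix should match the CSR arrays of its transpose. The matrix values are hypothetical; `csc_matrix`, `csr_matrix`, and the `data`, `indices`, and `indptr` attributes are part of `scipy.sparse`.

```python
import numpy as np
from scipy import sparse

# Hypothetical dense matrix.
dense = np.array([
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [5, 0, 0, 7],
])

csc = sparse.csc_matrix(dense)
csr_of_transpose = sparse.csr_matrix(dense.T)

# Compare the three underlying arrays of both representations.
same_data = np.array_equal(csc.data, csr_of_transpose.data)
same_indices = np.array_equal(csc.indices, csr_of_transpose.indices)
same_indptr = np.array_equal(csc.indptr, csr_of_transpose.indptr)
print(same_data, same_indices, same_indptr)
```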
Diagonal Format (DIA)
The diagonal format is used specifically for diagonal matrices. A diagonal matrix is a square matrix (one with an equal number of rows and columns) whose non-zero elements lie only on its main diagonal, which runs from the top left to the bottom right. The diagonal format stores only these diagonal elements.
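As a sketch of this format in SciPy, the code below stores a 4x4 diagonal matrix by keeping only its four diagonal values. The values are made up; `dia_matrix((data, offsets), shape=...)` is a SciPy constructor, where offset 0 refers to the main diagonal.

```python
import numpy as np
from scipy import sparse

# One row of data per stored diagonal; here only the main diagonal.
data = np.array([[4, 9, 2, 7]])
offsets = np.array([0])  # 0 = main diagonal

dia = sparse.dia_matrix((data, offsets), shape=(4, 4))

# Only 4 values are stored, although the full matrix has 16 elements.
print(dia.data)
print(dia.toarray())
```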
Sparse matrices in Python
SciPy, short for Scientific Python, is an open-source Python library for scientific computing. Its scipy.sparse module provides data structures for creating and manipulating sparse matrices.
Start by declaring a NumPy array. We will call this the original matrix; since most of its elements are zero, it is a sparse matrix.
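The article's original array is not reproduced here, so the values below are an assumption: any array with mostly zero entries would serve the same purpose.

```python
import numpy as np

# Hypothetical "original" matrix: 4 non-zero values out of 24 elements.
original = np.array([
    [1, 0, 0, 0, 0, 2],
    [0, 0, 3, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 4, 0, 0, 0, 0],
])
print(original)
```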
We can perform operations such as calculating the sparsity of a matrix. SciPy has no method that calculates this directly, but it is usually computed as 1 - (number_of_non-zero_elements / size_of_matrix).
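One way to write that formula uses NumPy's `count_nonzero` function and the array's `size` attribute. The matrix values are assumed for illustration.

```python
import numpy as np

# Hypothetical matrix: 4 non-zero values out of 24 elements.
original = np.array([
    [1, 0, 0, 0, 0, 2],
    [0, 0, 3, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 4, 0, 0, 0, 0],
])

# Sparsity = 1 - (non-zero elements / total elements).
sparsity = 1.0 - np.count_nonzero(original) / original.size
print(f"sparsity: {sparsity:.3f}")
```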
SciPy is built on top of NumPy and offers a more complete set of linear algebra routines than NumPy does. Let’s look at some sparse matrix formats in Python.
The csr_matrix class converts the matrix to CSR format. The todense() method converts a sparse matrix back to the original dense matrix.
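A short sketch of the round trip, with an assumed example matrix. `csr_matrix`, `todense()`, and `toarray()` are all part of `scipy.sparse`; `todense()` returns a `numpy.matrix`, while `toarray()` returns a regular NumPy array.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical matrix: 4 non-zero values out of 24 elements.
original = np.array([
    [1, 0, 0, 0, 0, 2],
    [0, 0, 3, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 4, 0, 0, 0, 0],
])

# Convert the dense array to CSR format.
csr = csr_matrix(original)
print(csr.data, csr.indices, csr.indptr)

# Convert back to a dense matrix.
restored = csr.todense()
print(restored)
```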
The csc_matrix class converts the matrix to CSC format in the same way.
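The CSC conversion looks much the same; only the class name changes. The example matrix is again an assumption.

```python
import numpy as np
from scipy.sparse import csc_matrix

# Hypothetical matrix: 4 non-zero values out of 24 elements.
original = np.array([
    [1, 0, 0, 0, 0, 2],
    [0, 0, 3, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 4, 0, 0, 0, 0],
])

# Convert the dense array to CSC format; values are stored column by column.
csc = csc_matrix(original)
print(csc.data, csc.indices, csc.indptr)

# Converting back gives the original array.
print(csc.toarray())
```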
You can learn more about the topic by:
- Checking the official SciPy documentation.
- Investigating the implementation of other sparse matrix formats in Python.
This tutorial explored sparse matrices in Python and how to work with them using the SciPy package.
What we learned:
- The difference between sparse and dense matrices.
- The problems one faces while working with sparse matrices.
- The various areas where you are likely to encounter a sparse matrix.
- Ways to format sparse data and work with it efficiently.
What is a Sparse Matrix? How is it Used in Machine Learning?