Frequency Distribution Analysis – Dataconomy

Frequency Distribution Analysis using Python Data Stack – Part 2

Ernest Bonat, Ph.D. — Wed, 07 Jun 2017 09:00:53 +0000

This is the continuation of the Frequency Distribution Analysis using Python Data Stack – Part 1 article. Here we’ll be analyzing real production business surveys for your review.

Application Configuration File

The configuration (config) file config.py is shown in Code Listing 3. It holds the general settings for the Priority network server activities, the TV Network selection and the Hotel Ratings survey. A config file is an excellent place to store local and global application settings without hardcoding them in the application code. A quick Google search turns up plenty of Python programs that hardcode such values, which is a bad practice.

Many Data Science programs need default values for their algorithm parameters. Python developers often hardcode these parameters, even in the constructors (__init__()) of their classes. This is a very bad practice for Continuous Integration and deployment.
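As a minimal sketch of the alternative, the class below reads its defaults from a settings object instead of literals in __init__(). The settings namespace only stands in for an imported config module, and BarChartOptions and bar_width are hypothetical names used for illustration; plot_style is the one setting borrowed from Code Listing 3.

```python
from types import SimpleNamespace

# stand-in for "import config"; in a real project these values live in config.py
# (bar_width is a hypothetical setting added for this sketch)
settings = SimpleNamespace(plot_style="ggplot", bar_width=0.7)


class BarChartOptions:
    """hypothetical options holder whose defaults come from the settings object"""

    def __init__(self, plot_style=None, bar_width=None):
        # fall back to the configured value only when the caller passes nothing
        self.plot_style = settings.plot_style if plot_style is None else plot_style
        self.bar_width = settings.bar_width if bar_width is None else bar_width


default_options = BarChartOptions()
custom_options = BarChartOptions(bar_width=0.5)
print(default_options.plot_style, default_options.bar_width)
print(custom_options.plot_style, custom_options.bar_width)
```

Changing a default then means editing one line of the settings, while the constructor itself stays untouched.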

[codesyntax lang=”python” lines=”normal”]

# network activities log file settings
network_activities_file = "network_activities.csv"
# network activities log file settings
data_file_network_activities = "network_activities.csv"
column_name_priority = "Priority"
column_name_message = "Message"
network_activities_image1 = "network_activities1.png"
network_activities_image2 = "network_activities2.png"
network_activities_image3 = "network_activities3.png"
info_class = "Info"
error_class = "Error"
warning_class = "Warning"
alert_class = "Alert"
plot_style = "ggplot"

# priority column settings
plot_x_label_priority = "Priority"
plot_y_label_priority = "Percent Frequency"
plot_title_priority = "Priority Percent Frequency Distribution"
plot_legend_priority = "Priority Amount of Messages"

# error column settings
plot_x_label_error = "Percent Frequency"
plot_y_label_error = "Error Message"
plot_title_error = "Error Percent Frequency Distribution"
plot_legend_error = "Error Amount of Messages"

# warning column settings
plot_x_label_warning = "Percent Frequency"
plot_y_label_warning = "Warning Message"
plot_title_warning = "Warning Percent Frequency Distribution"
plot_legend_warning = "Warning Amount of Messages"

# tv network settings
data_file_tv_networks = "tv_networks.csv"
image_file_tv_networks = "tv_networks_image.png"
column_name_tv_networks = "Network"
plot_x_label_tv_networks = "TV Networks"
plot_y_label_tv_networks = "Percent Frequency"
plot_title_tv_networks = "TV Network Percent Frequency Distribution"
plot_legend_tv_networks = "Amount of TV Networks"

# hotel ratings settings
data_file_hotel_ratings = "hotel_ratings.csv"
image_file_hotel_ratings = "hotel_ratings_image.png"
column_name_hotel_ratings = "Rating"
plot_x_label_hotel_ratings = "Hotel Rating"
plot_y_label_hotel_ratings = "Percent Frequency"
plot_title_hotel_ratings = "Hotel Rating Percent Frequency Distribution"
plot_legend_hotel_ratings = "Amount of Hotel Rating"

[/codesyntax]

Code Listing 3. Program configuration file (config.py).

Frequency Distribution Subclass

The frequency_distribution_subclass.py module contains the FrequencyDistribution(FrequencyDistributionLibrary) subclass, which inherits from FrequencyDistributionLibrary(object). The code is provided in Code Listing 4. As you can see, the constructor initializes the frequency distribution superclass library, gets the project directory path and sets the data file path. The frequency_distribution_target() function prints the frequency summary table, then builds and saves a vertical bar chart image for any target column. The frequency_distribution_message() function does the same but builds and saves a horizontal bar chart image.

[codesyntax lang=”python” lines=”normal”]

import os
from frequency_distribution_superclass import FrequencyDistributionLibrary

class FrequencyDistribution(FrequencyDistributionLibrary):
    """
    frequency distribution subclass inherited from frequency distribution superclass library
    """
    def __init__(self, data_file):
        """
        frequency distribution class constructor
        :param data_file: data file name
        """
        # initialize frequency distribution superclass library
        FrequencyDistributionLibrary.__init__(self)
        # get project directory path
        self._project_directory_path = self.get_project_directory_path()
        # set data file path
        self._data_file_path = os.path.join(self._project_directory_path, data_file)

    def frequency_distribution_target(self, image_file, column_target, plot_x_label, plot_y_label, plot_title, plot_legend):
        """
        print frequency summary table, build and save vertical bar chart image
        :param image_file: image file name
        :param column_target: column target name
        :param plot_x_label: plot x-axis label name
        :param plot_y_label: plot y-axis label name
        :param plot_title: plot title
        :param plot_legend: plot legend
        :return none
        """
        # set image file path
        image_file_path = os.path.join(self._project_directory_path, image_file)
        # get x and y axes list data
        x_axis, y_axis = self.load_x_y_axis_data(self._data_file_path, column_target)
        # print frequency summary table (print_summary_table() is the name defined in the superclass)
        self.print_summary_table(plot_x_label, plot_y_label, x_axis, y_axis)
        # build and save vertical bar chart image
        self.build_bar_chart_vertical(x_axis, y_axis, image_file_path, plot_x_label, plot_y_label, plot_title, plot_legend)

    def frequency_distribution_message(self, image_file, column_message, column_target, column_target_class, plot_x_label, plot_y_label, plot_title, plot_legend):
        """
        print frequency summary table, build and save horizontal bar chart image
        :param image_file: image file name
        :param column_message: column message name
        :param column_target: column target name
        :param column_target_class: column target class name
        :param plot_x_label: plot x-axis label name
        :param plot_y_label: plot y-axis label name
        :param plot_title: plot title
        :param plot_legend: plot legend
        :return none
        """
        # set image file path
        image_file_path = os.path.join(self._project_directory_path, image_file)
        # get x and y axes list message data for the selected target class
        x_axis, y_axis = self.load_x_y_axis_data(self._data_file_path, column_message, column_target, column_target_class)
        # print message summary table (print_summary_table() is the name defined in the superclass)
        self.print_summary_table(plot_x_label, plot_y_label, x_axis, y_axis)
        # build and save message horizontal bar chart image
        self.build_bar_chart_horizontal(x_axis, y_axis, image_file_path, plot_x_label, plot_y_label, plot_title, plot_legend)

[/codesyntax]

Code Listing 4. Frequency distribution subclass FrequencyDistribution(FrequencyDistributionLibrary).

Now that we have defined and organized the superclass and subclass objects, we are ready to create the client program to do the Frequency Distribution Analysis for the network server activities log file. The code shown in Code Listing 5 provides a complete analysis including:

Print priority frequency summary table, build and save vertical bar chart image for priority column (Table 2 and Figure 1)
Print error frequency summary table, build and save horizontal bar chart image for error column (Table 3 and Figure 2)
Print warning frequency summary table, build and save horizontal bar chart image for warning column (Table 4 and Figure 3)

You can see how clean this code is. All the work is done by the subclass FrequencyDistribution(FrequencyDistributionLibrary), and of course nothing has been hardcoded. Any necessary change made to the config file parameters will not affect the program production code. Moving default algorithm parameter values to the config file is necessary for any Data Analytics program deployment and maintenance.

[codesyntax lang=”python” lines=”normal”]

import time
from frequency_distribution_subclass import FrequencyDistribution
import config

def main():
    # create frequency distribution class object
    frequency_distribution = FrequencyDistribution(config.data_file_network_activities)
    # print frequency summary table, build and save vertical bar chart image for priority column
    frequency_distribution.frequency_distribution_target(
        config.network_activities_image1, config.column_name_priority,
        config.plot_x_label_priority, config.plot_y_label_priority,
        config.plot_title_priority, config.plot_legend_priority)
    # print frequency summary table, build and save horizontal bar chart image for error column
    frequency_distribution.frequency_distribution_message(
        config.network_activities_image2, config.column_name_message,
        config.column_name_priority, config.error_class,
        config.plot_x_label_error, config.plot_y_label_error,
        config.plot_title_error, config.plot_legend_error)
    # print frequency summary table, build and save horizontal bar chart image for warning column
    frequency_distribution.frequency_distribution_message(
        config.network_activities_image3, config.column_name_message,
        config.column_name_priority, config.warning_class,
        config.plot_x_label_warning, config.plot_y_label_warning,
        config.plot_title_warning, config.plot_legend_warning)

if __name__ == '__main__':
    start_time = time.time()
    main()
    end_time = time.time()
    print("Program Runtime: " + str(round(end_time - start_time, 1)) + " seconds" + "\n")

[/codesyntax]

Code Listing 5. Program code for Network Server Activities Frequency Distribution Analysis.

Table 2 shows the Priority frequency summary table. The Priority column contains four classes: Info, Error, Warning and Alert.

Priority	Percent Frequency
Info	33.90%
Error	30.70%
Warning	26.80%
Alert	8.70%

Table 2. Priority frequency summary table.

The bar chart of the Priority Percent Frequency Distribution is shown below in Figure 1. It shows that Error messages (red) account for 30.7% and Warning messages (yellow) for 26.8% of all messages. The system admin team needs to know about these messages for network server maintenance and optimization.

Figure 1. Priority Percent Frequency Distribution.

Table 3 shows the Error Message frequency summary table. Nine classes have been defined.

Error Message	Percent Frequency
Interface X0 Link Is Down	30.80%
Interface X8 Link Is Down	20.50%
Interface X5 Link Is Down	17.90%
Interface X4 Link Is Down	15.40%
Interface X9 Link Is Down	5.10%
Interface X2 Link Is Down	2.60%
Interface X3 Link Is Down	2.60%
Interface X6 Link Is Down	2.60%
Interface X7 Link Is Down	2.60%

Table 3. Error Message frequency summary table.
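The per-class table above comes from restricting the log to one Priority class and then computing percent frequencies of the Message column within that class. The sketch below shows the idea with the standard library only, using the fifteen sample rows from Table 1 of Part 1 rather than the full log, so its percentages differ from Table 3. The function name message_percent_frequency is an assumption made for illustration and is not part of the library.

```python
from collections import Counter

# (Priority, Message) pairs copied from the fifteen sample rows in Table 1 of Part 1
rows = [
    ("Info", "SonicWALL initializing"),
    ("Error", "Interface X0 Link Is Down"),
    ("Warning", "Interface X1 Link Is Up"),
    ("Error", "Interface X2 Link Is Down"),
    ("Info", "Administrator login allowed"),
    ("Error", "Interface X4 Link Is Down"),
    ("Alert", "Possible port scan detected"),
    ("Error", "Interface X6 Link Is Down"),
    ("Info", "GUI administration session ended"),
    ("Error", "Interface X8 Link Is Down"),
    ("Error", "Interface X9 Link Is Down"),
    ("Alert", "SonicWALL activated"),
    ("Warning", "Interface X8 Link Is Up"),
    ("Warning", "Interface X9 Link Is Up"),
    ("Error", "Interface X8 Link Is Down"),
]


def message_percent_frequency(rows, priority_class):
    """return (message, percent) pairs for one Priority class, most frequent first"""
    messages = [message for priority, message in rows if priority == priority_class]
    counts = Counter(messages)
    return [
        (message, round(count / len(messages) * 100, 1))
        for message, count in counts.most_common()
    ]


error_table = message_percent_frequency(rows, "Error")
for message, percent in error_table:
    print("{}\t{}%".format(message, percent))
```

The percentages are normalized within the selected class, which is why each per-class table sums to roughly 100% on its own.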

The horizontal bar chart of the Error Percent Frequency Distribution is shown in Figure 2. The most frequent Error messages are: Interface X0 Link Is Down (30.8%), Interface X8 Link Is Down (20.5%), Interface X5 Link Is Down (17.9%) and Interface X4 Link Is Down (15.4%). This information is very important for the network admin team to start looking for the right solutions to fix these server errors.

Figure 2. Error Percent Frequency Distribution.

Table 4 shows the Warning Message frequency summary table. Seven classes have been defined.

Warning Message	Percent Frequency
Interface X0 Link Is Up	32.40%
Interface X5 Link Is Up	20.60%
Interface X8 Link Is Up	20.60%
Interface X4 Link Is Up	17.60%
Interface X1 Link Is Up	2.90%
Interface X9 Link Is Up	2.90%
Wan IP Changed	2.90%

Table 4. Warning Message frequency summary table.

The horizontal bar chart of the Warning Message Frequency Distribution is shown in Figure 3. As the figure shows, the most frequent Warning messages are: Interface X0 Link Is Up (32.4%), Interface X8 Link Is Up (20.6%), Interface X5 Link Is Up (20.6%) and Interface X4 Link Is Up (17.6%). The message Interface X0 Link Is Up is critical and requires some attention from the network admin team.

Figure 3. Warning Percent Frequency Distribution.

The Network Server Activities Frequency Distribution Analysis gives the admin team the following conclusions:

Error and Warning messages are the second and third most frequent Priority classes, with 30.7% and 26.8% respectively.
The most frequent Error messages, in descending order, are:

Interface X0 Link Is Down (30.8%)
Interface X8 Link Is Down (20.5%)
Interface X5 Link Is Down (17.9%)
Interface X4 Link Is Down (15.4%)

The most frequent Warning messages, in descending order, are:

Interface X0 Link Is Up (32.4%)
Interface X8 Link Is Up (20.6%)
Interface X5 Link Is Up (20.6%)
Interface X4 Link Is Up (17.6%)

The most frequent Error messages involve the same interfaces (X0, X8, X5 and X4), in the same order, as the most frequent Warning messages.
The Warning messages Interface X0 Link Is Up (32.4%), Interface X8 Link Is Up (20.6%) and Interface X5 Link Is Up (20.6%) are the ones most likely to occur.

Let’s apply this logic to real business examples so you can see how quickly you can develop and analyze any frequency business data.

Example 1. TV Networks Frequency Distribution Analysis

Table 5 shows ten rows of the TV Network data file (tv_networks.csv). This file contains the most-watched television networks for a period of time.

Network
CBS
ABC
CBS
FOX
NBC
CBS
NBC
FOX
NBC

Table 5. Ten rows of the TV Network data file.

The ‘tv network settings’ section of the config file (Code Listing 6) includes the seven required parameter settings.

[codesyntax lang=”python” lines=”normal”]

# tv network settings
data_file_tv_networks = "tv_networks.csv"
image_file_tv_networks = "tv_networks_image.png"
column_name_tv_networks = "Network"
plot_x_label_tv_networks = "TV Networks"
plot_y_label_tv_networks = "Percent Frequency"
plot_title_tv_networks = "TV Network Percent Frequency Distribution"
plot_legend_tv_networks = "Amount of TV Networks"

[/codesyntax]

Code Listing 6. Config file section for ‘tv network settings’.
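Because every survey follows the same naming pattern (a setting key followed by the survey name), the seven values can also be gathered programmatically. The sketch below is one possible way to do that, not part of the article's library: collect_settings is a hypothetical helper, and the SimpleNamespace object stands in for the imported config module with the values from Code Listing 6.

```python
from types import SimpleNamespace

# stand-in for the config module, holding the values from Code Listing 6
config = SimpleNamespace(
    data_file_tv_networks="tv_networks.csv",
    image_file_tv_networks="tv_networks_image.png",
    column_name_tv_networks="Network",
    plot_x_label_tv_networks="TV Networks",
    plot_y_label_tv_networks="Percent Frequency",
    plot_title_tv_networks="TV Network Percent Frequency Distribution",
    plot_legend_tv_networks="Amount of TV Networks",
)


def collect_settings(settings, survey_name):
    """hypothetical helper: return {setting key: value} for one survey suffix"""
    suffix = "_" + survey_name
    return {
        name[: -len(suffix)]: value
        for name, value in vars(settings).items()
        if name.endswith(suffix)
    }


tv_settings = collect_settings(config, "tv_networks")
print(len(tv_settings))
print(tv_settings["column_name"])
```

A new survey then only needs a new block of settings with its own suffix; the lookup code does not change.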

The main program is shown in Code Listing 7 and provides the complete Frequency Distribution Analysis for the TV Network data file. This code includes:

Create frequency distribution class object
Print frequency summary table, build and save vertical bar chart image for TV Networks column

[codesyntax lang=”python” lines=”normal”]

import time
from frequency_distribution_subclass import FrequencyDistribution
import config

def main():
    # create frequency distribution class object
    frequency_distribution = FrequencyDistribution(config.data_file_tv_networks)
    # print frequency summary table, build and save vertical bar chart image for tv networks column
    frequency_distribution.frequency_distribution_target(
        config.image_file_tv_networks, config.column_name_tv_networks,
        config.plot_x_label_tv_networks, config.plot_y_label_tv_networks,
        config.plot_title_tv_networks, config.plot_legend_tv_networks)

if __name__ == '__main__':
    start_time = time.time()
    main()
    end_time = time.time()
    print("Program Runtime: " + str(round(end_time - start_time, 1)) + " seconds" + "\n")

[/codesyntax]

Code Listing 7. Main program for TV Network Frequency Distribution Analysis.

The program generates the TV Network frequency summary table (Table 6) and the percent frequency distribution vertical bar chart (Figure 4).

TV Network	Percent Frequency
CBS	30.00%
NBC	27.00%
FOX	24.00%
ABC	19.00%

Table 6. TV Network frequency summary table.

As you can see, based on the TV Network data file, the most-watched network is CBS with 30%, followed by NBC with 27%.

Figure 4. TV Network Percent Frequency Distribution.

Example 2. Hotel Ratings Frequency Distribution Analysis

This example shows ten records of a Hotel Rating survey from the data file hotel_ratings.csv (Table 7).

Rating
Poor
Very Good
Excellent
Poor
Excellent
Average
Very Good
Average
Very Good

Table 7. Ten rows of the Hotel Rating survey data file.

The ‘hotel ratings settings’ section of the config file (Code Listing 8) includes the seven required parameter settings.

[codesyntax lang=”python” lines=”normal”]

# hotel ratings settings
data_file_hotel_ratings = "hotel_ratings.csv"
image_file_hotel_ratings = "hotel_ratings_image.png"
column_name_hotel_ratings = "Rating"
plot_x_label_hotel_ratings = "Hotel Rating"
plot_y_label_hotel_ratings = "Percent Frequency"
plot_title_hotel_ratings = "Hotel Rating Percent Frequency Distribution"
plot_legend_hotel_ratings = "Amount of Hotel Rating"

[/codesyntax]

Code Listing 8. Config file section for ‘hotel ratings settings’.

The main program shown in Code Listing 9 provides the complete Frequency Distribution Analysis for the Hotel Rating survey data file. This code includes:

Create frequency distribution class object
Print frequency summary table, build and save vertical bar chart image for Hotel Rating column

[codesyntax lang=”python” lines=”normal”]

import time
from frequency_distribution_subclass import FrequencyDistribution
import config

def main():
    # create frequency distribution class object
    frequency_distribution = FrequencyDistribution(config.data_file_hotel_ratings)
    # print frequency summary table, build and save vertical bar chart image for hotel ratings column
    frequency_distribution.frequency_distribution_target(
        config.image_file_hotel_ratings, config.column_name_hotel_ratings,
        config.plot_x_label_hotel_ratings, config.plot_y_label_hotel_ratings,
        config.plot_title_hotel_ratings, config.plot_legend_hotel_ratings)

if __name__ == '__main__':
    start_time = time.time()
    main()
    end_time = time.time()
    print("Program Runtime: " + str(round(end_time - start_time, 1)) + " seconds" + "\n")

[/codesyntax]

Code Listing 9. Main program for Hotel Rating Frequency Distribution Analysis.

The program generates the Hotel Rating frequency summary table (Table 8) and the percent frequency distribution vertical bar chart (Figure 5).

Hotel Rating	Percent Frequency
Very Good	38.80%
Excellent	28.80%
Average	16.50%
Poor	9.60%
Terrible	6.30%

Table 8. Hotel Rating frequency summary table.

As you can see, based on the Hotel Rating survey data file, 38.8% of the guests rated the hotel ‘Very Good’ and 28.8% rated it ‘Excellent’. The guest comments behind the ‘Poor’ and ‘Terrible’ ratings were not provided.

Figure 5. Hotel Rating Percent Frequency Distribution.
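A percent frequency table also makes it easy to combine classes, for example to report the overall share of favorable and unfavorable ratings. The short sketch below does this with the values copied from Table 8. The combined_share helper and the grouping of classes into favorable and unfavorable are assumptions made for illustration only.

```python
# percent frequencies copied from Table 8
hotel_ratings = {
    "Very Good": 38.8,
    "Excellent": 28.8,
    "Average": 16.5,
    "Poor": 9.6,
    "Terrible": 6.3,
}


def combined_share(table, class_names):
    """hypothetical helper: add up the percent frequencies of several classes"""
    return round(sum(table[name] for name in class_names), 1)


favorable = combined_share(hotel_ratings, ["Very Good", "Excellent"])
unfavorable = combined_share(hotel_ratings, ["Poor", "Terrible"])
print("Favorable: {}%".format(favorable))
print("Unfavorable: {}%".format(unfavorable))
```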

Recommendations

I would like to recommend the following reading on OOP applied to Data Analytics projects:

An Introduction to Object Oriented Data Science in Python. By Sev Leonard, Data Scientist – The Data Scout
Raschka, Sebastian. Python Machine Learning. Birmingham, UK: Packt Publishing, 2015

Conclusions

Based on the information provided in this paper, we can draw the following two conclusions:

Applying Object-Oriented Programming to your Data Analytics projects makes your code more organized, reusable and robust. It also makes it easier to implement the required Continuous Integration and Unit Test standards and methodologies.
Storing local and global Data Analytics program settings in a configuration file is a good design and development practice. Required default values of the algorithm parameters should be stored in the configuration file. Hardcoding these values in the program code is a bad programming and deployment practice.

Feel free to send Ernest any questions about this paper.


Frequency Distribution Analysis using Python Data Stack – Part 1

Ernest Bonat, Ph.D. — Wed, 31 May 2017 09:00:11 +0000

During my years as a Consultant Data Scientist I have received many requests from my clients to provide frequency distribution reports for their specific business data needs. These reports have been very useful for company management to make proper business decisions quickly. In this paper I would like to show how to design and develop a generic frequency distribution library that will reduce your development time and provide a good summary table and image report for your clients. One important topic covered in this paper is the conversion of top-down Python code into a generic, reusable superclass library for future Object-Oriented Programming (OOP) development applied to data analytics and visualization.

I’ll be using the following three main Python Data Stack libraries:

1. NumPy – the fundamental package for scientific computing.
2. pandas – an open source library providing high-performance, easy-to-use data structures and data analysis tools.
3. Matplotlib – a Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms.

Frequency Statistical Definitions

The frequency of a particular data value is the number of times the data value occurs. A frequency distribution is a tabular summary (frequency table) of data showing the frequency number of observations (outcomes) in each of several non-overlapping categories named classes. The objective is to provide a simple interpretation about the data that cannot be quickly obtained by looking only at the original raw data.

The Frequency Distribution Analysis can be used for Categorical (qualitative) and Numerical (quantitative) data types. I have seen it used most for Categorical data, especially during the data cleansing process with the pandas library. In general, there are two types of frequency tables: Univariate (used with a single variable) and Bivariate (used with two variables). Univariate tables will be used in this paper. Bivariate frequency tables are presented as (two-way) Contingency Tables, which are used in the Chi-squared Test of Independence; the Chi-squared Goodness-Of-Fit Test, by contrast, works on a univariate frequency table. We’ll be covering these topics in future papers.
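To make the difference between the two table types concrete, here is a small sketch that uses only the standard library. The (Priority, Category) pairs are copied from the first eight rows of the log file shown in Table 1 below; the variable names are illustrative.

```python
from collections import Counter

# (Priority, Category) pairs copied from the first eight rows of Table 1
rows = [
    ("Info", "Firewall Event"),
    ("Error", "Firewall Event"),
    ("Warning", "Firewall Event"),
    ("Error", "Firewall Event"),
    ("Info", "Authenticated Access"),
    ("Error", "Firewall Event"),
    ("Alert", "Intrusion Prevention"),
    ("Error", "Firewall Event"),
]

# univariate table: one count per Priority class
priority_counts = Counter(priority for priority, _ in rows)

# bivariate (two-way) table: one count per (Priority, Category) cell
cell_counts = Counter(rows)

print(priority_counts["Error"])
print(cell_counts[("Info", "Firewall Event")])
print(cell_counts[("Info", "Authenticated Access")])
```

The univariate table collapses everything onto one variable, while the two-way table keeps a separate count for each combination of the two variables.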

Network Server Activities Frequency Distribution Analysis

A sample of the Windows network server activities log file (network_activities.csv) is provided in Table 1.

Time	Priority	Category	Message
10:47.2	Info	Firewall Event	SonicWALL initializing
10:55.2	Error	Firewall Event	Interface X0 Link Is Down
10:55.2	Warning	Firewall Event	Interface X1 Link Is Up
10:55.2	Error	Firewall Event	Interface X2 Link Is Down
10:55.2	Info	Authenticated Access	Administrator login allowed
10:55.2	Error	Firewall Event	Interface X4 Link Is Down
10:55.2	Alert	Intrusion Prevention	Possible port scan detected
10:55.2	Error	Firewall Event	Interface X6 Link Is Down
10:55.2	Info	Authenticated Access	GUI administration session ended
10:55.2	Error	Firewall Event	Interface X8 Link Is Down
10:55.2	Error	Firewall Event	Interface X9 Link Is Down
11:02.2	Alert	Firewall Event	SonicWALL activated
33:20.4	Warning	Firewall Event	Interface X8 Link Is Up
33:23.4	Warning	Firewall Event	Interface X9 Link Is Up
33:56.0	Error	Firewall Event	Interface X8 Link Is Down

Table 1. Fifteen rows of network activities log file.

As you can see from Table 1, the log data file contains four columns: Time, Priority, Category and Message. In a real production environment this log file may have hundreds of thousands of rows.
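Before moving to pandas, the frequency and percent frequency definitions can be checked by hand on the Priority column of the fifteen rows in Table 1. The sketch below uses only the standard library. Because it covers just these fifteen sample rows, its percentages are not those of the full log file.

```python
from collections import Counter

# Priority column of the fifteen rows shown in Table 1
priorities = [
    "Info", "Error", "Warning", "Error", "Info",
    "Error", "Alert", "Error", "Info", "Error",
    "Error", "Alert", "Warning", "Warning", "Error",
]

# frequency: number of times each class occurs
counts = Counter(priorities)
total = len(priorities)

# percent frequency = class count / number of observations * 100
percent_frequency = {
    name: round(count / total * 100, 1)
    for name, count in counts.most_common()
}

print("Priority\tPercent Frequency")
for name, percent in percent_frequency.items():
    print("{}\t{}%".format(name, percent))
```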

Network Server Activities Analysis

The server administrator team has requested a statistical analysis and report of the networking activities to be created for maintenance and management review. In general, this frequency statistical report includes two components:

Frequency Summary Table
Percent Frequency Distribution Chart

Code Listing 1 shows a simple top-down Python program for Frequency Distribution Analysis.

[codesyntax lang=”python” lines=”normal”]

import sys
import os
import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def main():

    # Frequency Distribution 1 (Vertical Bar Chart)
    # ----------------------------------------------------------------------------------------------------

    # set file path name
    file_path_name = r"C:\Users\Ernest\git\test-code\test-code\src\percent_frequency_distribution\network_activities.csv"

    # set image path name
    image_path_name1 = r"C:\Users\Ernest\git\test-code\test-code\src\percent_frequency_distribution\network_activities.png"

    # get network activity data frame
    df_network_activity1 = pd.read_csv(filepath_or_buffer = file_path_name, sep = ",")

    # get relative frequencies in a pandas series
    ds_network_activity1 = df_network_activity1["Priority"].value_counts(normalize = True)
    print(ds_network_activity1)

    # define the x and y axes
    # (Series.items() replaces iteritems(), which was removed in pandas 2.0)
    x_axis = []
    y_axis = []
    for x, y in ds_network_activity1.items():
        x_axis.append(x)
        y_axis.append(y * 100)

    # build and plot the network activity vertical bar chart
    colors = []
    for x_value in x_axis:
        if x_value == "Error":
            colors.append('r')
        elif x_value == "Warning":
            colors.append('y')
        else:
            colors.append('g')
    plt.style.use("ggplot")
    x_pos = np.arange(len(x_axis))
    rects = plt.bar(x_pos, y_axis, width = 0.7, color = colors, align = "center", alpha = 0.7, label = "Amount of Messages")
    for rect in rects:
        rec_x = rect.get_x()
        rec_width = rect.get_width()
        rec_height = rect.get_height()
        height_format = float("{0:.1f}".format(rec_height))
        plt.text(rec_x + rec_width / 2, rec_height, str(height_format) + "%", horizontalalignment = "center", verticalalignment = 'bottom')
    plt.xticks(x_pos, x_axis)
    plt.xlabel("Priority")
    plt.ylabel("Percent Frequency")
    plt.title("Priority Message Percent Frequency Distribution")
    plt.legend(loc = 1)
    plt.tight_layout()
    plt.savefig(image_path_name1, dpi = 100)
    plt.show()

    # Frequency Distribution 2 (Horizontal Bar Chart)
    # ----------------------------------------------------------------------------------------------------

    # set image file path name
    image_path_name2 = r"C:\Users\Ernest\git\test-code\test-code\src\percent_frequency_distribution\network_activities2.png"

    # get network activity data frame for priority and message columns
    df_network_activity2 = pd.read_csv(filepath_or_buffer = file_path_name, sep = ",")

    # group by priority column
    df_column_group = df_network_activity2.groupby("Priority")

    # get relative frequencies by message column (normalized within each priority group)
    ds_network_activity2 = df_column_group["Message"].value_counts(normalize = True)

    # define the x and y axes; each index entry is a (priority, message) pair
    x_axis = []
    y_axis = []
    for x, y in ds_network_activity2.items():
        if x[0] == "Error":
            x_axis.append(x[1])
            y_axis.append(y * 100)

    # build and plot the network activity horizontal bar chart
    plt.style.use("ggplot")
    x_pos = np.arange(len(x_axis))
    colors = ["r"]
    rects = plt.barh(x_pos, y_axis, color = colors, align = "center", alpha = 0.8, label = "Amount of Messages")
    for rect in rects:
        rec_y = rect.get_y()
        rec_width = int(rect.get_width())
        rec_height = rect.get_height()
        plt.text(rec_width - 0.6, rec_y + rec_height / 2, str(rec_width) + "%", horizontalalignment = "center", verticalalignment = 'bottom')
    plt.yticks(x_pos, x_axis)
    plt.xlabel("Percent Frequency")
    plt.ylabel("Error Message")
    plt.title("Error Server Percent Frequency Distribution")
    plt.legend(loc = 1)
    plt.tight_layout()
    plt.savefig(image_path_name2, dpi = 100)
    plt.show()

if __name__ == '__main__':
    start_time = time.time()
    main()
    end_time = time.time()
    print("Program Runtime: " + str(round(end_time - start_time, 1)) + " seconds" + "\n")

[/codesyntax]

Code Listing 1. Top-down code for Frequency Distribution Analysis.

As you can see from Code Listing 1, most of the input data has been hardcoded in the program, so the only way to reuse it is to copy and paste it into another module file and then change the input values. That is a lot of work and a very bad programming practice for sure! The hardcoded input data includes the data file and image paths, the data column names, many plot parameters, etc.
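As a small illustration of the first step away from hardcoding, the color selection in Code Listing 1 can be pulled out into a function that receives the class-to-color mapping as a parameter. This is only a sketch: bar_colors and priority_colors are hypothetical names, and in practice the mapping would come from a config file rather than from a literal in the code.

```python
def bar_colors(class_names, class_colors, default_color="g"):
    """hypothetical helper: pick one matplotlib color code per class name"""
    return [class_colors.get(name, default_color) for name in class_names]


# the mapping would normally be read from a config file rather than written here
priority_colors = {"Error": "r", "Warning": "y"}

colors = bar_colors(["Info", "Error", "Warning", "Alert"], priority_colors)
print(colors)  # ['g', 'r', 'y', 'g']
```

The same function then works for any data column, since nothing about the Priority classes is fixed inside it.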

I have seen many Python programmers doing this type of Data Analytics implementation using Python Jupyter Notebook or any modern text editor today. It’s like they don’t understand/know the importance of Object-Oriented Programming design and implementation, Continuous Integration deployment practices, Unit and System Tests, etc.

Frequency Distribution Main Library

We need to create a reusable and extensible library to considerably reduce the Data Analytics development time and necessary code. I have developed a frequency_distribution_superclass.py module that contains the frequency distribution class library FrequencyDistributionLibrary(object) shown in Code Listing 2.

[codesyntax lang=”python” lines=”normal”]

import os
import sys
import traceback
import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import config

class FrequencyDistributionLibrary(object):
   """
   generic frequency distribution superclass library
   """
   def __init__(self):
       pass
        
   def print_exception_message(self, message_orientation = "horizontal"):

       """

       print full exception message

       :param message_orientation: horizontal or vertical

       :return none
       """

       try:

           exc_type, exc_value, exc_tb = sys.exc_info()
           file_name, line_number, procedure_name, line_code = traceback.extract_tb(exc_tb)[-1]            
           time_stamp = " [Time Stamp]: " + str(time.strftime("%Y-%m-%d %I:%M:%S %p"))
           file_name = " [File Name]: " + str(file_name)
           procedure_name = " [Procedure Name]: " + str(procedure_name)
           error_message = " [Error Message]: " + str(exc_value)        
           error_type = " [Error Type]: " + str(exc_type)                    
           line_number = " [Line Number]: " + str(line_number)                
           line_code = " [Line Code]: " + str(line_code)
           if (message_orientation == "horizontal"):

               print( "An error occurred:{};{};{};{};{};{};{}".format(time_stamp, file_name, procedure_name, error_message, error_type, line_number, line_code))
           elif (message_orientation == "vertical"):
               print( "An error occurred:\n{}\n{}\n{}\n{}\n{}\n{}\n{}".format(time_stamp, file_name, procedure_name, error_message, error_type, line_number, line_code))
           else:
               pass                    
       except Exception:
           pass

       
   def get_project_directory_path(self):

       """

       get project directory path from the calling file
       """
       project_directory_path = None
       try:  
           project_directory_path = os.path.dirname(sys.argv[0])            
       except Exception:
           self.print_exception_message()                    
       return project_directory_path


   def format_float_number(self, decimal_point, real_value):

       """
       format float numbers with digits
       :param decimal_point:
       :param real_value:
       :return formatted float number
       """
       format_value = 0.0
       try:
           if decimal_point == 1:
               format_value = float("{0:.1f}".format(real_value))
           elif decimal_point == 2:
               format_value = float("{0:.2f}".format(real_value))
           elif decimal_point == 3:
               format_value = float("{0:.3f}".format(real_value))
           elif decimal_point == 4:
               format_value = float("{0:.4f}".format(real_value))
           elif decimal_point == 5:
               format_value = float("{0:.5f}".format(real_value))
           else:
               format_value = float("{0:.3f}".format(real_value))
       except Exception:                                                          
           self.print_exception_message()
       return format_value
 

   def load_x_y_axis_data(self, data_file_name, column_name, group_by_colum = None, column_name_class = None):
       """
       define x and y axis data
       :param data_file_name: data file path and name
       :param column_name: column used to compute the relative frequencies
       :param group_by_colum: optional column to group by
       :param column_name_class: class of the group-by column to keep
       :return x and y axis data
       """
       x_axis = []
       y_axis = []
       try:
           data_frame = pd.read_csv(filepath_or_buffer = data_file_name, sep = ",")
           if (group_by_colum is not None):
               data_frame = data_frame.groupby(group_by_colum)
           data_serie = data_frame[column_name].value_counts(normalize = True)
           # Series.items() replaces iteritems(), which was removed in pandas 2.0
           if (group_by_colum is not None):
               for x, y in data_serie.items():
                   if x[0] == column_name_class:
                       x_axis.append(x[1])
                       y_axis.append(self.format_float_number(1, y * 100))
           else:
               for x, y in data_serie.items():
                   x_axis.append(x)
                   y_axis.append(self.format_float_number(1, y * 100))
       except Exception:
           self.print_exception_message()
       return x_axis, y_axis


   def print_summary_table(self, first_column_name, second_column_name, x_axis, y_axis):
       """
       print tabular summary table
       :param first_column_name: class column
       :param second_column_name: frequency numerical column
       :param x_axis: x axis data
       :param y_axis: y axis data
       :return none
       """
       try:  
           print("{}\t{}".format(first_column_name, second_column_name))
           for x, y in zip(x_axis, y_axis):
               print("{}\t\t{}".format(x, str(y) + "%"))
       except Exception:
           self.print_exception_message()
        

   def build_bar_chart_vertical(self, x_axis, y_axis, image_file_name, plot_xlabel, plot_ylabel, plot_title, plot_legend):

       """
       build vertical bar chart
       :param x_axis: x axis data
       :param y_axis: y axis data
       :param image_file_name: image file path and name
       :param plot_xlabel: x axis label
       :param plot_ylabel: y axis label
       :param plot_title: plot title
       :param plot_legend: legend label
       :return none
       """
       try:
           colors = []
           for x_value in x_axis:
               if x_value == config.error_class:
                   colors.append('r')
               elif x_value == config.warning_class:
                   colors.append('y')
               else:
                   colors.append('g')          
           plt.style.use(config.plot_style)       
           x_pos = np.arange(len(x_axis))         
           rects = plt.bar(x_pos, y_axis, width = 0.7, color = colors, align = "center", alpha = 0.7, label = plot_legend)
           for rect in rects:
               rec_x = rect.get_x()
               rec_width = rect.get_width()        
               rec_height = rect.get_height()  
               height_format = self.format_float_number(1, rec_height)      
               plt.text(rec_x + rec_width / 2, rec_height , str(height_format) + "%", horizontalalignment = "center", verticalalignment = 'bottom')
           plt.xticks(x_pos, x_axis)
           plt.xlabel(plot_xlabel)
           plt.ylabel(plot_ylabel)      
           plt.title(plot_title)    
           plt.legend(loc = 1)    
           plt.tight_layout()
           plt.savefig(image_file_name, dpi = 100)
           plt.show()       
       except Exception:                                                          
           self.print_exception_message()
          

   def build_bar_chart_horizontal(self, x_axis, y_axis, image_file_name, plot_xlabel, plot_ylabel, plot_title, plot_legend):

       """
       build horizontal bar chart
       :param x_axis: x axis data
       :param y_axis: y axis data
       :param image_file_name: image file path and name
       :param plot_xlabel: x axis label
       :param plot_ylabel: y axis label
       :param plot_title: plot title
       :param plot_legend: legend label
       :return none
       """
       try:  
           plt.style.use(config.plot_style)  
           x_pos = np.arange(len(x_axis))                     
           colors = ["r"]    
           rects = plt.barh(x_pos, y_axis, color = colors, align = "center", alpha = 0.8, label = plot_legend)    
           for rect in rects:    
               rec_y = rect.get_y()
               # keep the fractional part so the label matches the plotted value
               rec_width = rect.get_width()
               width_format = self.format_float_number(1, rec_width)   
               rec_height = rect.get_height()        
               plt.text(rec_width - 0.8,  rec_y + rec_height / 2, str(width_format) + "%", horizontalalignment = "center", verticalalignment = 'bottom')           
           plt.yticks(x_pos, x_axis)   
           plt.xlabel(plot_xlabel)
           plt.ylabel(plot_ylabel)      
           plt.title(plot_title)    
           plt.legend(loc = 1)    
           plt.tight_layout()
           plt.savefig(image_file_name, dpi = 100)
           plt.show()   
       except Exception:
           self.print_exception_message()

[/codesyntax]

Code Listing 2. Frequency distribution superclass FrequencyDistributionLibrary(object).

This library contains the six main methods used throughout the article for a complete Frequency Distribution Analysis:

print_exception_message(self, message_orientation = "horizontal")
format_float_number(self, decimal_point, real_value)
load_x_y_axis_data(self, data_file_name, column_name, group_by_colum = None, column_name_class = None)
print_summary_table(self, first_column_name, second_column_name, x_axis, y_axis)
build_bar_chart_vertical(self, x_axis, y_axis, image_file_name, plot_xlabel, plot_ylabel, plot_title, plot_legend)
build_bar_chart_horizontal(self, x_axis, y_axis, image_file_name, plot_xlabel, plot_ylabel, plot_title, plot_legend)
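
To make the core calculation in load_x_y_axis_data() easier to follow, here is a minimal sketch of the two cases it handles: the percent frequency of a single column, and the percent frequency of one column inside a selected group of another. The column names follow the config file, but the sample rows are invented for illustration and are not taken from the real log file. The sketch uses round() in place of format_float_number() so that it runs without the library class.

```python
import io

import pandas as pd

# Hypothetical log rows, invented for this sketch only.
csv_text = """Priority,Message
Error,Disk full
Error,Disk full
Error,Timeout
Info,Started
Info,Started
Info,Started
Info,Stopped
Warning,High load
"""

data_frame = pd.read_csv(io.StringIO(csv_text), sep=",")

# Case 1: no grouping. Share of each Priority value across all rows.
priority_share = data_frame["Priority"].value_counts(normalize=True)
priority_percent = {
    label: round(share * 100, 1) for label, share in priority_share.items()
}

# Case 2: grouped by Priority. Share of each Message within its group.
# The index of the result holds (Priority, Message) tuples, which is why
# the library method compares x[0] with the class and appends x[1].
message_share = data_frame.groupby("Priority")["Message"].value_counts(normalize=True)
error_percent = {
    key[1]: round(share * 100, 1)
    for key, share in message_share.items()
    if key[0] == "Error"
}

print(priority_percent)
print(error_percent)
```

In the grouped case the shares add up to 100% inside each Priority group, not across the whole file, so the Error messages here are reported as a share of all Error rows only.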

In Part 2 we’ll be covering how to inherit from this library to create a subclass module. Real business examples of Frequency Distribution Analysis will be provided.