
How to Analyze Big Data with Hadoop


Learn how to perform data analysis using Hadoop and Python packages in this tutorial by Sridhar Alla, a big data expert and the author of Big Data Analytics with Hadoop 3.

Data analytics is the process of applying qualitative and quantitative techniques when examining data, with the goal of providing valuable insights.

In this article, you will analyze big data using Hadoop and Python packages. For this, you will need to perform a basic Python installation, Anaconda installation, and you will need to set up Jupyter Notebook with the Python interpreter to perform data analysis. Additionally, you can find the necessary files for this article here.

First, download OnlineRetail.csv from the link. Now, you can load the file using Pandas.

The following is a simple way of reading a local file using Pandas:

import pandas as pd

path = '/Users/sridharalla/Documents/OnlineRetail.csv'
df = pd.read_csv(path)

However, since you are analyzing data in a Hadoop cluster, you should read from HDFS rather than the local filesystem. The following is an example of how an HDFS file can be loaded into a pandas DataFrame:

import pandas as pd
from hdfs import InsecureClient

client_hdfs = InsecureClient('http://localhost:9870')

with client_hdfs.read('/user/normal/OnlineRetail.csv', encoding='utf-8') as reader:
    df = pd.read_csv(reader, index_col=0)

Enter this line of code:

df.head(3)

You will get the following result:

Basically, it displays the top three entries in the DataFrame.

You can now experiment with the data. Enter the following:

len(df)

That should output the following result:

65499

This just indicates the length or size of the DataFrame. It’s telling you that there are 65,499 entries in the entire file.

Now enter this code:

df2 = df.loc[df.UnitPrice > 3.0]

df2.head(3)

You have defined a new DataFrame called df2 and set it to all the entries in the original DataFrame with unit prices greater than 3.0.

Then, you can tell it to display the top three entries, as seen in the following screenshot:

The following lines of code select the indices with data that has a unit price above 3.0 and set their description to Miscellaneous. It then displays the first three items:

df.loc[df.UnitPrice > 3.0, ['Description']] = 'Miscellaneous'

df.head(3)

This is the result:

As you can see, entry number 2 (with the index of 1) has its description changed to Miscellaneous because its unit price is $3.39 (which is over 3 as specified earlier).
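The same assign-on-condition pattern can be sketched on a tiny, hypothetical DataFrame (the rows and values here are invented for illustration, not taken from OnlineRetail.csv):

```python
import pandas as pd

# A toy stand-in for the retail data; the values are invented.
toy = pd.DataFrame({
    'Description': ['MUG', 'LAMP', 'PEN'],
    'UnitPrice': [2.50, 4.10, 3.25],
})

# Relabel every row whose UnitPrice exceeds 3.0 in one .loc assignment.
toy.loc[toy.UnitPrice > 3.0, ['Description']] = 'Miscellaneous'
print(toy)
```

Only the rows matching the boolean mask are touched; the MUG row, priced at 2.50, keeps its original description.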

The following line of code outputs the row with index 2:

df.loc[2]

The output is as follows:

And finally, you can create a plot of the Quantity column as shown in the following code:

df['Quantity'].plot()

There are plenty more functions to explore.

Here’s an example of the usage of the .append() function.

You can define a new df object, df3, and set it equal to the first 10 rows of df combined with the rows 200–209 of df. In other words, you can append rows 200–209 to rows 0–9 of df:

df3 = df[0:10].append(df[200:210])

df3

This is the resulting output:
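Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on recent versions the same combination is written with pd.concat. A minimal sketch on a small invented frame standing in for df:

```python
import pandas as pd

# A small invented frame standing in for df.
demo = pd.DataFrame({'Quantity': range(300)})

# Equivalent to demo[0:10].append(demo[200:210]) on older pandas versions.
demo3 = pd.concat([demo[0:10], demo[200:210]])
print(len(demo3))
```

The result is the same 20-row frame, with the original index values preserved.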

Now, imagine that you're only concerned with a few columns, namely StockCode, Quantity, InvoiceDate, and UnitPrice. You can define a new DataFrame object to contain only those columns in the data:

df4 = pd.DataFrame(df, columns=['StockCode', 'Quantity', 'InvoiceDate', 'UnitPrice'])

df4.head(3)

This is the result:

Pandas offers different ways to combine data. More specifically, you can merge, concatenate, join, and append. You have already covered append, so now you can take a look at concatenating data.

Take a look at this code block:

d1 = df[0:10]

d2 = df[10:20]

d3 = pd.concat([d1, d2])

d3

Basically, set d1 to be a DataFrame containing the first 10 rows of df. Then, set d2 to be the next ten rows. Finally, set d3 to be the concatenation of d1 and d2. This is the result once they are concatenated:

You can do more with this. You can specify keys that will make it easier to distinguish between d1 and d2. Take a look at the following line of code:

d3 = pd.concat([d1, d2], keys=['d1', 'd2'])

As you can see, it is much easier to differentiate between the two datasets. You can call the keys anything you want; even simple keys like x and y will work. If you had three datasets d1, d2, and a third d3, you could use the keys (x, y, z) to distinguish between all three.
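Under the hood, the keys become the outer level of a MultiIndex, so each source frame can be recovered by its label. A quick sketch with invented values:

```python
import pandas as pd

a = pd.DataFrame({'Quantity': [1, 2]})
b = pd.DataFrame({'Quantity': [3, 4]})

combined = pd.concat([a, b], keys=['d1', 'd2'])

# The outer index level lets you pull back each original frame.
print(combined.loc['d1'])
```

Selecting combined.loc['d1'] returns exactly the rows that came from the first frame.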

Now, you can move on to concatenation with different columns. By default, the concat() function uses outer join. This means that it combines all the columns. Think of two sets, A and B, where set A contains all the column names belonging to d1 and set B contains all the column names belonging to d2. If you concatenate d1 and d2 using the line of code used earlier, the columns you will see are represented by the union of A and B.

You can also specify that you want to use inner join, which is represented by the intersection of A and B. Take a look at the following lines of code:

d4 = pd.DataFrame(df, columns=['InvoiceNo', 'StockCode', 'Description'])[0:10]
d5 = pd.DataFrame(df, columns=['StockCode', 'Description', 'Quantity'])[0:10]

pd.concat([d4, d5])

As you can see, it used all the column labels.
Remember, by default, concat() uses outer join. So, saying pd.concat([d4, d5]) is the same as saying:

pd.concat([d4, d5], join='outer')

Now, you can use inner join. Keep everything else the same, but change the call to the concat() function. Look at the following line of code:

pd.concat([d4, d5], join='inner')

That should now output the following:

As you can see, this time you only have the column labels that both d4 and d5 had in common. And once again, you can add keys to make it easier to distinguish between the two datasets in the table.
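The union-versus-intersection behaviour described above can be checked directly on two tiny invented frames:

```python
import pandas as pd

a = pd.DataFrame({'InvoiceNo': [1], 'StockCode': ['X'], 'Description': ['mug']})
b = pd.DataFrame({'StockCode': ['Y'], 'Description': ['pen'], 'Quantity': [5]})

outer = pd.concat([a, b])                # union of the column sets
inner = pd.concat([a, b], join='inner')  # intersection of the column sets

print(sorted(outer.columns))
print(sorted(inner.columns))
```

The outer result carries all four column labels, while the inner result keeps only the two labels the frames share.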
Merge is slightly more complicated. This time, you can choose between outer join, inner join, left join, and right join, and you can also choose the column to merge on.
Now modify your original definitions of d4 and d5:

d4 = pd.DataFrame(df, columns=['InvoiceNo', 'StockCode', 'Description'])[0:11]
d5 = pd.DataFrame(df, columns=['StockCode', 'Description', 'Quantity'])[10:20]

The brackets at the end of the d4 definition mean that you're taking the first 11 rows of that particular DataFrame. The brackets at the end of the d5 definition mean that you're taking rows 10 through 19 (the end of a slice is exclusive) rather than the whole thing.
Notice that the two slices share one row (index 10); this overlap will come into play soon.
Start with the merge function and do a left join merge of d4 and d5:

pd.merge(d4, d5, how='left')

This used all the rows of the left DataFrame, d4, and added the columns of d5 to them. As you can see, since you defined d5 to contain rows 10 through 19, there are no Quantity values for indices 0 through 9. However, since row 10 (the eleventh row) is in both d4 and d5, it has a value under Quantity.
Similarly, you can do the same thing for right join:

pd.merge(d4, d5, how='right')

Now, it uses the column labels of d5 as well as the data of d5 (which spans rows 10 through 19). As you can see, the first row of the result is complete: that row (index 10 in df) is the one shared with d4, so its d4 columns are filled in as well.
Now do the same thing for inner join:

pd.merge(d4, d5, how='inner')

Inner join means that it only includes elements that both DataFrames have in common. In this case, the element shown there is element number 11, with index 10 in df. Because it exists in both d4 and in d5, it has data for both InvoiceNo and for Quantity (since the data for InvoiceNo exists in d4 and the data for Quantity exists in d5).
Now, you can do outer join:

pd.merge(d4, d5, how='outer')

As you can see, outer join means that it includes all columns (the union of the columns in d4 and in d5).
Any data values that don’t exist are labeled NaN. For example, there is no column labeled InvoiceNo in d5, so all the data values there are shown as NaN.
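Those NaN gaps can be counted programmatically with isna(); a sketch on two invented frames that share only the StockCode column:

```python
import pandas as pd

a = pd.DataFrame({'StockCode': ['X', 'Y'], 'InvoiceNo': [1, 2]})
b = pd.DataFrame({'StockCode': ['Y', 'Z'], 'Quantity': [5, 6]})

merged = pd.merge(a, b, how='outer')

# Rows present on only one side get NaN in the other side's columns.
print(merged.isna().sum())
```

Here the Z row came only from b, so its InvoiceNo is NaN, and the X row came only from a, so its Quantity is NaN.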
Now, you can venture into joining on a column. You can introduce a new parameter, on=, in your function call. Here is an example of merging on the StockCode column:

pd.merge(d4, d5, on='StockCode', how='left')

The result is similar to the table generated when you merge d4 and d5 using left join. The exception is that since Description is a column shared by both d4 and d5 but is not part of the join key, both copies are kept, distinguished by the suffixes _x and _y respectively.
As you can see in the last entry, it is shared by both d4 and d5, so both Description_x and Description_y are the same.
Remember, you can only enter column names that both DataFrames share in common. So, you can do either StockCode or Description to merge.
This is what it looks like if you merge on Description instead:

pd.merge(d4, d5, on='Description', how='left')

Again, it distinguishes between the columns that they both share by adding _x and _y respectively to represent d4 and d5.
You can actually pass in a list of column names instead of a single column name. So, now you have:

pd.merge(d4, d5, on=['StockCode', 'Description'], how='left')

However, in this case, you can see that this produces the same table as:

pd.merge(d4, d5, how='left')

In this particular case, the list you passed contained all the column names both of them shared. This would not be the case if they shared three columns and you only passed in two.
To illustrate this, suppose the following:

d4 = pd.DataFrame(df, columns=['InvoiceNo', 'StockCode', 'Description', 'UnitPrice'])[0:11]
d5 = pd.DataFrame(df, columns=['StockCode', 'Description', 'Quantity', 'UnitPrice'])[10:20]

Now, try this again:

pd.merge(d4, d5, on=['StockCode', 'Description'], how='left')

So, now your table will look as follows:

You can also specify that you want all columns to be present, even the shared ones.
Consider this:

pd.merge(d4, d5, left_index=True, right_index=True, how='outer')

You can specify any type of joining that you want and it will still display all the columns. However, in this example, it will use outer join:

Now you can move on to the join() function. One thing to note is that it will not allow you to join two DataFrames if they share a column name. So, the following is not allowed:

d4 = pd.DataFrame(df, columns=['StockCode', 'Description', 'UnitPrice'])[0:11]
d5 = pd.DataFrame(df, columns=['Description', 'Quantity', 'InvoiceNo'])[10:20]
d4.join(d5)

Running it would result in an error.
Now, look at the following lines of code:

d4 = pd.DataFrame(df, columns=['StockCode', 'UnitPrice'])[0:11]
d5 = pd.DataFrame(df, columns=['Description', 'Quantity'])[10:20]
d4.join(d5)

That would result in this table:

So, it takes the d4 table and adds the columns and corresponding data from d5. Since d5 has no data for description or quantity from indices 0 through 9, they are all displayed as NaN. Since d5 and d4 both share data for index 10, that element has all its data displayed in the corresponding columns.
You can also join them the other way around:

d4 = pd.DataFrame(df, columns=['StockCode', 'UnitPrice'])[0:11]
d5 = pd.DataFrame(df, columns=['Description', 'Quantity'])[10:20]
d5.join(d4)

 

It’s the same logic, except the columns of d4 and corresponding data are added onto the table for d5.
Next, you can combine the data using combine_first().
Look at the following code:

d6 = pd.DataFrame.copy(df)[0:5]
d7 = pd.DataFrame.copy(df)[2:8]

d6.loc[3, ['Quantity']] = 110
d6.loc[4, ['Quantity']] = 110

d7.loc[3, ['Quantity']] = 210
d7.loc[4, ['Quantity']] = 210

pd.concat([d6, d7], keys=['d6', 'd7'])

The .copy added after pd.DataFrame ensures that you make a copy of the original df as opposed to editing the original df itself. That way, d6 changing the quantity to 110 for indices 3 and 4 shouldn’t affect d7 and vice versa. Keep in mind that this won’t work if you pass in a list of columns to select, so you can’t have something as follows:

pd.DataFrame(df, columns=['Quantity', 'UnitPrice'])

After running the preceding code, this is the resulting table:

Notice that both d6 and d7 have elements in common, namely the elements with indices 2 through 4.
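The effect of .copy can be verified on a small invented frame: edits to the copy leave the original untouched.

```python
import pandas as pd

orig = pd.DataFrame({'Quantity': [10, 20, 30]})
dup = pd.DataFrame.copy(orig)

# Modifying the copy does not propagate back to the original.
dup.loc[1, ['Quantity']] = 110
print(int(orig.loc[1, 'Quantity']))
```

The original frame still reports 20 at index 1, while the copy holds the new value 110.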
Now, take a look at this code:

d6.combine_first(d7)

 

This combined the data of d7 with that of d6, but with preference given to d6. Remember that you set the quantity of indices 3 and 4 to 110 in d6. As you can see, the data of d6 was kept where both datasets had indices in common. Now look at this line of code:

d7.combine_first(d6)

 

Now you’ll see that where both elements had indices in common (at indices 3 and 4), the data of d7 was kept.
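The precedence rule is easy to see on two tiny invented frames: the caller's non-missing values win, and the argument only fills the gaps.

```python
import pandas as pd

x = pd.DataFrame({'Quantity': [110, 110]}, index=[3, 4])
y = pd.DataFrame({'Quantity': [210, 210, 210]}, index=[3, 4, 5])

# x's values are kept where the indices overlap; y supplies index 5.
result = x.combine_first(y)
print(result)
```

Indices 3 and 4 keep x's value of 110, and index 5, which exists only in y, is filled with 210.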
You can also get the occurrence counts of every value in a category of choice using value_counts(). Take a look at this code:

pd.value_counts(df['Country'])
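The same counting works as a method on any Series, which is the form newer pandas releases favour; a sketch on invented country values:

```python
import pandas as pd

countries = pd.Series(['UK', 'UK', 'France', 'UK', 'Germany'])

# Occurrence counts, most frequent first.
counts = countries.value_counts()
print(counts)
```

The most frequent value comes first, so UK (with three occurrences) tops the result.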

 

One thing to consider during your merges is the fact that you might come across duplicate data values. To resolve these, use .drop_duplicates().
Consider this:

d1 = pd.DataFrame(df, columns=['InvoiceNo', 'StockCode', 'Description'])[0:100]
d2 = pd.DataFrame(df, columns=['Description', 'InvoiceDate', 'Quantity'])[0:100]

pd.merge(d1, d2)

 

Now, scroll all the way to the bottom:

As you can see, there are many duplicate data entries. To remove them all, you can use drop_duplicates(). In addition, you can specify which columns to compare when deciding which entries count as duplicates. For example, you can use StockCode to remove all duplicate entries, assuming that each item has a unique stock code. You could also assume a unique description for each item and remove duplicates that way. Now look at this code:

d1 = pd.DataFrame(df, columns=['InvoiceNo', 'StockCode', 'Description'])[0:100]
d2 = pd.DataFrame(df, columns=['Description', 'InvoiceDate', 'Quantity'])[0:100]

pd.merge(d1, d2).drop_duplicates(['StockCode'])

 

Now scroll to the bottom:

You will see that many duplicate entries are removed. You could also have passed in Description, or both StockCode and Description, and it would have yielded the same results.
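The subset behaviour of drop_duplicates can be sketched on an invented frame: only the listed columns are compared when deciding what counts as a duplicate.

```python
import pandas as pd

items = pd.DataFrame({
    'StockCode': ['A1', 'A1', 'B2'],
    'Description': ['mug', 'mug', 'pen'],
    'Quantity': [1, 2, 3],
})

# Rows 0 and 1 share a StockCode, so only the first of them is kept,
# even though their Quantity values differ.
deduped = items.drop_duplicates(['StockCode'])
print(deduped)
```

By default the first occurrence survives; drop_duplicates also accepts keep='last' if you prefer the final occurrence.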
You’ll notice then that the indices are all over the place. You can use reset_index() to fix it. Look at the following code:

d1 = pd.DataFrame(df, columns=['InvoiceNo', 'StockCode', 'Description'])[0:100]
d2 = pd.DataFrame(df, columns=['Description', 'InvoiceDate', 'Quantity'])[0:100]

d3 = pd.merge(d1, d2).drop_duplicates(['StockCode'])
d3.reset_index()

This is what it will look like:

Now clearly, that is not what you probably had in mind. It reset the index, yes, but it added the old index as a column. There is a simple fix and that is to introduce a new parameter. Now, look at this code:

d3.reset_index(drop=True)

 

Much better! By default, drop=False, so if you don’t want the old index to be added to the data as a new column, then remember to set drop=True.
You may remember the .plot() function from earlier. You can use this to help visualize DataFrames, especially if they are large.
Here is one such example involving a single column:

d8 = pd.DataFrame(df, columns=['Quantity'])[0:100]
d8.plot()

Here, only the first 100 elements are selected to make the graph less crowded and illustrate the example better.
Now, you’ll have:

Now, suppose that you want multiple columns to show up. Look at the following:

d8 = pd.DataFrame(df, columns=['Quantity', 'UnitPrice'])[0:100]
d8.plot()

 

Just remember that it will not plot qualitative data columns such as Description but only things that can be graphed such as Quantity and UnitPrice.
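If you want to be explicit rather than rely on plot() skipping text columns, the numeric columns can be selected up front with select_dtypes; a sketch on invented data:

```python
import pandas as pd

mixed = pd.DataFrame({
    'Description': ['mug', 'pen'],
    'Quantity': [3, 7],
    'UnitPrice': [2.5, 1.2],
})

# Keep only columns that can sensibly be graphed.
numeric = mixed.select_dtypes(include='number')
print(list(numeric.columns))
```

The Description column is dropped, leaving just the two numeric columns to pass to plot().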

You have successfully used Python to perform data analysis with Jupyter Notebook. If you found this article interesting, you can check out Sridhar Alla’s Big Data Analytics with Hadoop 3. Apache Hadoop is one of the most popular platforms for big data processing, and can be combined with a host of other big data tools to build powerful analytics solutions. This book shows you how to do just that, by providing insights into the software as well as its benefits with the help of practical examples.

 
