Nikola Jupyter Notebook Publishing

This week I spent some time converting my Wordpress blog over to a static site generator; Nikola. If you were led here to figure out how to do this, then I suggest you go here:

https://mglerner.github.io/posts/switching-to-nikloa-for-jupyter-notebooks-and-a-static-site.html

In :
%matplotlib inline


Sample code to generate a sine wave

In :
import numpy as np
import matplotlib.pyplot as plt

time = np.arange(0, 10, 0.1);
amplitude   = np.sin(time)
plt.plot(time, amplitude)
plt.title('Sine wave')
plt.xlabel('Time')
plt.ylabel('Amplitude = sin(time)')
plt.grid(True, which='both')
plt.axhline(y=0, color='k')

plot.show() In [ ]:



Interpreting Correlations

This is a blog post that illustrates how to interpret different correlations. I have needed to explain this to many people and thought that it would be useful for others. Here is an embeded IPython notebook with explanation.

Edit Distance Hierarchical Clustering

This is a quick blog post to illustrate how you can use Scipy, NLTK and Numpy to perform hierarchical clustering based on edit distance. Performing hierarchical clustering based on edit distance allows you to see what words or set of text is similar based on edit distance. Edit distance is the number of changes required to make text x turn into text y. Essentially, you could quickly look at a number of labels to check for typos or duplications.

Here is a code snippet illustrating how you can cluster some labels and create a dendrogram of the relationships.

import numpy as np
import maplotlib.pyplot as plt
from nltk.metrics.distance import edit_distance
from scipy.cluster.hierarchy import dendrogram

labels = [
'Tacos',
'Tacoes',
'Tomatos',
'Tomatoes',
'Beef',
'Beer',
'Steak',
'Pig',
'Twig'
]

def d(coord):
i, j = coord
return edit_distance(labels[i], labels[j])

coords = np.triu_indices(len(labels), 1)
result = np.apply_along_axis(d, 0, coords)

dend = dendrogram(clust, labels=labels, leaf_font_size=8)
plt.show()


The dendrogram from this set of data looks like this: The x axis shows the labels of text and the y axis shows the edit distance. The words "Beer" and "Beef" are clustered together as only one change must be made to make the words similar. The words "Pig" and "Twig" require two changes.

I thought that this was very useful in a scenario of data cleansing where there were many duplications of a given label. Automating the actual corrections of the text may depend on your use case, but visualizing these discrepencies up front can help you make a better decision.