Given a directed social graph, we have to predict missing links to recommend friends/connections/followers (link prediction in a graph).
Dataset from Facebook's recruiting challenge on Kaggle: https://www.kaggle.com/c/FacebookRecruiting
The data contains two columns, source and destination, which together give the edge pairs of the directed graph.
- Data columns (total 2 columns):
- source_node int64
- destination_node int64
- Map this to a binary classification task with 0 implying an absence of an edge and 1 implying the presence of the edge.
Now, we need to featurize a pair of vertices (u_i,u_j) such that these features can help us predict the presence/absence of an edge.
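The mapping to binary classification can be sketched as follows. The dataset only lists edges that exist, so the label-0 rows have to be generated somehow; uniform random sampling of absent ordered pairs is an assumption made here for illustration, and `build_labelled_pairs` is a hypothetical helper, not part of the project files.

```python
import random


def build_labelled_pairs(edges, num_negative, seed=0):
    """Return (pairs, labels): existing edges get 1, sampled missing edges get 0."""
    # Self-loops are dropped so that every pair has two distinct nodes.
    edge_set = {(u, v) for u, v in edges if u != v}
    nodes = sorted({n for edge in edge_set for n in edge})
    max_missing = len(nodes) * (len(nodes) - 1) - len(edge_set)
    if num_negative > max_missing:
        raise ValueError("not enough missing edges to sample from")

    rng = random.Random(seed)
    negatives = set()
    # Assumption: negatives are drawn uniformly from ordered pairs with no edge.
    while len(negatives) < num_negative:
        u, v = rng.choice(nodes), rng.choice(nodes)
        if u != v and (u, v) not in edge_set:
            negatives.add((u, v))

    positives = sorted(edge_set)
    pairs = positives + sorted(negatives)
    labels = [1] * len(positives) + [0] * len(negatives)
    return pairs, labels


edges = [(1, 2), (2, 3), (3, 1), (1, 4)]
pairs, labels = build_labelled_pairs(edges, num_negative=4)
```

Sampling as many negatives as positives keeps the two classes balanced; other sampling schemes are equally possible.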
- Both precision and recall are important, hence the F1 score is a good choice
- Confusion matrix
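As a small illustration of these two metrics, the sketch below counts the confusion-matrix cells and derives the F1 score from them in plain Python. The function names and the toy labels are made up for the example; a library implementation could be used instead.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels 0/1."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    tn, fp, fn, tp = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


y_true = [1, 1, 1, 0, 0, 0]  # toy ground truth: edge present / absent
y_pred = [1, 1, 0, 1, 0, 0]  # toy predictions
counts = confusion_counts(y_true, y_pred)
score = f1_score(y_true, y_pred)
```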
- Python3
- numpy
- pandas
- matplotlib
- seaborn
- networkx
NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks. We use it to build and query our graphs.
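A minimal sketch of how the edge pairs might be loaded into a NetworkX directed graph is shown below. The four edges are toy data, not rows from the Kaggle file.

```python
import networkx as nx

# Toy (source_node, destination_node) pairs standing in for the real data.
edges = [(1, 2), (2, 3), (3, 1), (1, 4)]

graph = nx.DiGraph()
graph.add_edges_from(edges)

# In a "follows" graph, predecessors are followers and successors are followees.
followers_of_1 = set(graph.predecessors(1))
followees_of_1 = set(graph.successors(1))
```

When the data is already in a pandas DataFrame, `nx.from_pandas_edgelist` with `create_using=nx.DiGraph()` is one way to build the same graph directly from the two columns.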
- Run FB_EDA, which contains the exploratory data analysis and the train/test split.
- Run FB_Featurization, which computes all the features.
- Run FB_Models, which trains a Random Forest classifier.
We will create these features for both train and test data points:
- jaccard_followers
- jaccard_followees
- cosine_followers
- cosine_followees
- num_followers_s
- num_followees_s
- num_followers_d
- num_followees_d
- inter_followers
- inter_followees
- Adamic/Adar index
- is following back
- belongs to the same weakly connected component
- shortest path between source and destination
- Weight Features
- weight of incoming edges
- weight of outgoing edges
- weight of incoming edges + weight of outgoing edges
- weight of incoming edges * weight of outgoing edges
- 2*weight of incoming edges + weight of outgoing edges
- weight of incoming edges + 2*weight of outgoing edges
- PageRank of source
- PageRank of dest
- katz of source
- katz of dest
- hubs of source
- hubs of dest
- authorities of source
- authorities of dest
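The neighbourhood-based features in this list can be sketched with plain Python sets. The helper names are hypothetical, and the Adamic/Adar variant used here (summing over common followees, weighted by the inverse log of their follower count) is one assumed adaptation to a directed graph; the notebooks may define it differently.

```python
import math
from collections import defaultdict


def build_neighbour_sets(edges):
    """Index each node's followers (in-neighbours) and followees (out-neighbours)."""
    followers = defaultdict(set)
    followees = defaultdict(set)
    for source, dest in edges:
        followees[source].add(dest)
        followers[dest].add(source)
    return followers, followees


def jaccard(a, b):
    """|a & b| / |a | b|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(a, b):
    """Otsuka-Ochiai coefficient: |a & b| / sqrt(|a| * |b|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def pair_features(source, dest, followers, followees):
    empty = set()
    fol_s, fol_d = followers.get(source, empty), followers.get(dest, empty)
    fee_s, fee_d = followees.get(source, empty), followees.get(dest, empty)
    # Assumed Adamic/Adar variant: common followees, weighted by follower count.
    adar = sum(
        1.0 / math.log(len(followers.get(z, empty)))
        for z in fee_s & fee_d
        if len(followers.get(z, empty)) > 1  # log(1) == 0 would divide by zero
    )
    return {
        "jaccard_followers": jaccard(fol_s, fol_d),
        "jaccard_followees": jaccard(fee_s, fee_d),
        "cosine_followers": cosine(fol_s, fol_d),
        "cosine_followees": cosine(fee_s, fee_d),
        "num_followers_s": len(fol_s),
        "num_followees_s": len(fee_s),
        "num_followers_d": len(fol_d),
        "num_followees_d": len(fee_d),
        "inter_followers": len(fol_s & fol_d),
        "inter_followees": len(fee_s & fee_d),
        "adar_index": adar,
        "follows_back": int(source in fee_d),  # 1 if dest also follows source
    }


edges = [(1, 2), (1, 3), (4, 2), (4, 3), (4, 5), (2, 1)]
followers, followees = build_neighbour_sets(edges)
features = pair_features(1, 4, followers, followees)
```

Building the follower and followee sets once and reusing them for every pair avoids repeated graph traversals when featurizing many rows.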
- Jaccard Distance: http://www.statisticshowto.com/jaccard-index/
- Cosine Similarity (Otsuka-Ochiai coefficient): https://en.wikipedia.org/wiki/Cosine_similarity
- PageRank: https://networkx.github.io/documentation/networkx1.10/reference/generated/networkx.algorithms.link_analysis.pagerank_alg.pagerank.html
- Shortest Path: https://stackoverflow.com/questions/9430027/networkx-shortest-path-length
- Weakly Connected Components: https://www.quora.com/What-are-strongly-and-weakly-connected-components
- Adamic/Adar Index: https://en.wikipedia.org/wiki/Adamic/Adar_index
- Katz Centrality: https://www.geeksforgeeks.org/katz-centrality-centrality-measure/
- HITS Score (Hubs and Authorities): https://en.wikipedia.org/wiki/HITS_algorithm
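Several of the concepts referenced above map onto NetworkX calls. The sketch below covers the shortest path, weakly connected component, and Katz centrality features on a toy graph. Two choices are assumptions made for illustration: a direct source-to-destination edge is temporarily removed before measuring the path (otherwise the length is trivially 1 for every existing edge), and -1 marks pairs with no path. The `alpha` value is also just an example.

```python
import networkx as nx


def shortest_path_feature(graph, source, dest):
    """Shortest path length ignoring a direct source->dest edge; -1 if none."""
    had_edge = graph.has_edge(source, dest)
    attrs = dict(graph[source][dest]) if had_edge else {}
    if had_edge:
        graph.remove_edge(source, dest)
    try:
        return nx.shortest_path_length(graph, source=source, target=dest)
    except (nx.NetworkXNoPath, nx.NodeNotFound):
        return -1
    finally:
        if had_edge:
            graph.add_edge(source, dest, **attrs)  # restore the original edge


def same_weak_component(graph, source, dest):
    """1 if both nodes lie in the same weakly connected component, else 0."""
    for component in nx.weakly_connected_components(graph):
        if source in component:
            return int(dest in component)
    return 0


graph = nx.DiGraph([(1, 2), (2, 3), (3, 1), (1, 3), (1, 4), (5, 6)])

path_1_3 = shortest_path_feature(graph, 1, 3)  # via 1 -> 2 -> 3
path_1_5 = shortest_path_feature(graph, 1, 5)  # no path between components
same_1_4 = same_weak_component(graph, 1, 4)
same_1_5 = same_weak_component(graph, 1, 5)

# Per-node Katz scores; look up source and dest to get the two features.
katz = nx.katz_centrality(graph, alpha=0.005, beta=1)
```

`nx.pagerank(graph)` and `nx.hits(graph)` follow the same pattern of returning per-node dictionaries (`nx.hits` returns a hubs dictionary and an authorities dictionary); depending on the NetworkX version they may need SciPy installed. For a large graph, computing the component membership once and caching it would be cheaper than scanning the components for every pair.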