Skip to content

Given a directed social graph, we have to predict missing edges to recommend friends/connnections/followers in the graph.

License

Notifications You must be signed in to change notification settings

VinitSR7/Instagram-Followee-Prediction

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Social-network-Graph-Link-Prediction---Facebook-Challenge

Problem statement:

Given a directed social graph, we have to predict missing links to recommend friends/connnections/followers (Link Prediction in graph)

Data Overview

Dataset from facebook's recruting challenge on kaggle https://www.kaggle.com/c/FacebookRecruiting
data contains two columns: source and destination edge pairs in the directed graph.
- Data columns (total 2 columns):
- source_node int64
- destination_node int64

Mapping the problem to supervised learning problem:

  • Map this to a binary classification task with 0 implying an absence of an edge and 1 implying the presence of the edge.
    Now, we need to featurize a pair of vertices (u_i,u_j) such that these features can help us predict the presence/absence of an edge.

Performance metric for supervised learning:

  • Both precision and recall are important, hence F1 score is good choice
  • Confusion matrix

Requinments:

  • Python3
  • numpy
  • pandas
  • matplotlib
  • seaborn
  • networkx
    NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks. We are using it to implement our Graphs.

How To Use?

  1. Run FB_EDA file, which has all the Exploratory Data analysis, and train test split process.
  2. Run FB_Featurization which has all featurization done.
  3. Run FB_Models, training part is done by Random Forest Classifier

Featurization:

we will create these features for both train and test data points

  1. jaccard_followers
  2. jaccard_followees
  3. cosine_followers
  4. cosine_followees
  5. num_followers_s
  6. num_followees_s
  7. num_followers_d
  8. num_followees_d
  9. inter_followers
  10. inter_followees
  11. adar index
  12. is following back
  13. belongs to same weakly connect components
  14. shortest path between source and destination
  15. Weight Features
    • weight of incoming edges
    • weight of outgoing edges
    • weight of incoming edges + weight of outgoing edges
    • weight of incoming edges * weight of outgoing edges
    • 2*weight of incoming edges + weight of outgoing edges
    • weight of incoming edges + 2*weight of outgoing edges
  16. Page Ranking of source
  17. Page Ranking of dest
  18. katz of source
  19. katz of dest
  20. hubs of source
  21. hubs of dest
  22. authorities_s of source
  23. authorities_s of dest

Reference to Features:

  1. Jaccard Distance: http://www.statisticshowto.com/jaccard-index/
  2. Cosine Similarity(Otsuka-Ochiai coefficient): https://en.wikipedia.org/wiki/Cosine_similarity
  3. Page Rank
  4. https://networkx.github.io/documentation/networkx1.10/reference/generated/networkx.algorithms.link_analysis.pagerank_alg.pagerank.html
  5. Shortest Path: https://stackoverflow.com/questions/9430027/networkx-shortest-path-length
  6. Weakly Connected Components: https://www.quora.com/What-are-strongly-and-weakly-connected-components
  7. Adamic/Adar Index: https://en.wikipedia.org/wiki/Adamic/Adar_index
  8. Katz Centrality:
  9. https://www.geeksforgeeks.org/katz-centrality-centrality-measure/
  10. HITS Score(Hubs and Authority): https://en.wikipedia.org/wiki/HITS_algorithm