danniesim/pf-interview

Pathfinder Interview Exercise

Interviewee: Daniel Sim

Overview

A web app that lets users find university courses offered in the UK by selecting a UK SIC (Standard Industrial Classification) category and a geographical location.

Both data sets are public domain:

Requires

  • Ubuntu 16.04 (and the following services installed with apt-get)
    • NodeJS 4.2+
    • ElasticSearch 6+
  • Python 3.6+ (and the following modules installed with pip)
    • Pandas 0.22+
    • TQDM
    • googlemaps
    • nltk
    • gensim
  • ReactiveSearch (https://reactjs.org/tutorial/tutorial.html)

Run Instructions

  • Get data (see README.md in ./data)
  • Run NodeJS and Elasticsearch services
    • Add the following lines to /etc/elasticsearch/elasticsearch.yml
      http.cors.enabled: true
      http.cors.allow-origin: "*"
      http.cors.allow-headers: "X-Requested-With, Content-Type, Content-Length, Authorization"
  • Wrangle and import data to Elasticsearch
    • Run, in order:
      • ./import_trees.py
      • The index PUT command found in elastic_search_commands.txt
      • ./import_data.py
  • Run the NodeJS server
    • cd ./pf-react
    • npm start
    • Browser opens with web app

Approach used for this exercise

Day 1 - Barely MVP

  • Explored Django as backend
    • Thinking of doing most data wrangling and queries in Pandas
  • Elasticsearch
    • Use current text search capabilities
  • Explore preserving Unistats XML data relationships
    • Convert to JSON for upload to Elasticsearch
    • Went with loading CSV instead of XML in the end for simplicity
  • Find frontend widgets for Elasticsearch
    • Went with ReactiveSearch
  • Use the googlemaps API to get geography from latitude/longitude
  • UKPRN lookups skipped
  • How to infer course industry?
    • Use Verbs, Nouns and Adjectives?
    • Look up course webpage and infer from text
      • A number of links are outdated
  • Clean courses without location record
  • Clean duplicate courses
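The two cleaning steps above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the project's actual code: the field names (`provider`, `title`, `lat`, `lon`), the provider IDs, and the rule that a duplicate is "same provider and same normalized title" are all assumptions made for the example.

```python
def clean_courses(rows):
    """Drop rows without a location, then drop duplicate courses.

    Assumption: a course is a duplicate when the same provider offers
    the same title (compared case-insensitively, ignoring outer spaces).
    """
    seen = set()
    cleaned = []
    for row in rows:
        # Skip courses that have no location record.
        if row.get("lat") is None or row.get("lon") is None:
            continue
        key = (row["provider"], row["title"].strip().lower())
        # Skip courses already seen for this provider.
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Hypothetical sample records, invented for illustration.
sample = [
    {"provider": "10000001", "title": "Brewing and Distilling", "lat": 55.9, "lon": -3.3},
    {"provider": "10000001", "title": "brewing and distilling ", "lat": 55.9, "lon": -3.3},
    {"provider": "10000002", "title": "Equine Management", "lat": None, "lon": None},
    {"provider": "10000003", "title": "Resource Geology", "lat": 50.7, "lon": -3.5},
]
cleaned = clean_courses(sample)
```

With the data in a Pandas DataFrame, the same two steps would likely map onto `dropna(subset=...)` and `drop_duplicates(subset=...)`.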

Day 2 - "Human-level?" Semantic Matching

Day 3 - Successful integration of Word2Vec

  • Due to time pressure, basic string matching of courses to industries was added first
  • Integrated Google's pre-trained Word2Vec model: http://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/
  • The current semantic matching works reasonably well (demo: http://damsim.ddns.net:3000/). It uses GenSim Word2Vec with Google's pre-trained neural net, which was trained on Google News feeds and has a vocabulary of over 3 million words. The cosine distance between the averaged word vectors is used to judge relevance between a Course Title and an Industry Classification. Because Word2Vec only judges similarity between two words, and Course Titles and Industry Classification sentences each have their own syntax, bespoke tokenization methods were used to convert sentences into words that can be fed into Word2Vec.
  • Some human-level capabilities:
    • Selecting "Malt" brings up courses on Brewing Beer
    • "Racing" brings up Equine Management
    • "Barite" brings up Exploration and Resource Geology
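The averaged-vector comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the 3-dimensional vectors in `TOY_VECTORS` are invented, whereas the real project looks words up in Google's pre-trained model through GenSim, and the function names are hypothetical.

```python
import math


def average_vector(tokens, vectors):
    """Average the vectors of the tokens found in the vocabulary."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def relevance(course_tokens, industry_tokens, vectors):
    """Score a course title against an industry description."""
    a = average_vector(course_tokens, vectors)
    b = average_vector(industry_tokens, vectors)
    if a is None or b is None:
        return 0.0  # nothing in the vocabulary to compare
    return cosine_similarity(a, b)


# Toy vectors, invented for illustration only.
TOY_VECTORS = {
    "brewing": [0.9, 0.1, 0.0],
    "beer": [0.8, 0.2, 0.1],
    "malt": [0.85, 0.15, 0.05],
    "geology": [0.0, 0.1, 0.9],
}

brewing_score = relevance(["brewing", "beer"], ["malt"], TOY_VECTORS)
geology_score = relevance(["geology"], ["malt"], TOY_VECTORS)
```

Here `brewing_score` comes out higher than `geology_score`, which is the behaviour the "Malt" example relies on. Cosine distance is simply one minus this similarity.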

Day 4 - Improved Tokenization and Separate Industry Division Filters

  • To improve the semantic matching further, I tried two more things today: 1) Document Similarity With Word Mover's Distance (http://jxieeducation.com/2016-06-13/Document-Similarity-With-Word-Movers-Distance/) and 2) a more generalized tokenization method.
  • Word Mover's Distance did not seem to improve performance; its tokenization routines appear unsuited to the task. It also increased the time to process the data for upload to Elasticsearch from 2 hours to over 30 hours (estimated).
  • The more generalized tokenization method looks promising: although it increases false positives, it also matches more relevant industries to courses. Data upload time increased from 2 hours to around 10 hours. The method:
    • Removes the words 'a', 'for', 'the', 'and', 'or', 'of', 'nec', 's', 'other'
    • Ignores non-words (e.g. numbers and punctuation)
    • Truncates the sentence after these phrases: 'except ', 'exc. ', 'other than ', 'not ', 'without ', 'no '
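A rough sketch of the three tokenization rules above, using only the standard library. The function name `tokenize` is hypothetical, and two details are assumptions rather than the project's confirmed behaviour: the sentence is cut at the start of the earliest exclusion phrase (so the phrase itself is dropped too), and a phrase only counts when it starts a word, so "casino " is not mistaken for "no ".

```python
import re

STOP_WORDS = {"a", "for", "the", "and", "or", "of", "nec", "s", "other"}
EXCLUSION_PHRASES = ["except ", "exc. ", "other than ", "not ", "without ", "no "]


def tokenize(sentence):
    """Turn a course title or industry description into a list of words."""
    text = sentence.lower()

    # Rule 3: cut the sentence at the earliest exclusion phrase.
    # The lookbehind requires the phrase to start at a word boundary.
    cut = len(text)
    for phrase in EXCLUSION_PHRASES:
        match = re.search(r"(?<![a-z])" + re.escape(phrase), text)
        if match:
            cut = min(cut, match.start())
    text = text[:cut]

    # Rule 2: keep alphabetic runs only, dropping numbers and punctuation.
    words = re.findall(r"[a-z]+", text)

    # Rule 1: remove the stop words.
    return [w for w in words if w not in STOP_WORDS]


industry_tokens = tokenize("Manufacture of beer and malt, except cider 11.05")
```

For this input `industry_tokens` is `['manufacture', 'beer', 'malt']`: everything from "except" onward is discarded, and "of" and "and" are filtered out.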

Processing Optimizations

  • GoogleMaps API lookups are cached in a pickled dictionary for reuse
  • Industry categories tree structure is generated and pickled for reuse
  • Similarity scores for Industry vs Course are also cached and pickled
    • This increases processing speed by up to 10 times
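The pickled-dictionary caching described above might look something like this. It is a sketch under assumptions, not the repository's code: the class name `PickledCache`, the file name `geo_cache.pkl`, and the stand-in `fake_lookup` function (used in place of a real GoogleMaps call) are all hypothetical.

```python
import os
import pickle
import tempfile


class PickledCache:
    """A dictionary cache that is loaded from and saved to a pickle file."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        # Reuse results from a previous run if the file exists.
        if os.path.exists(path):
            with open(path, "rb") as fh:
                self.data = pickle.load(fh)

    def get_or_compute(self, key, compute):
        """Return the cached value, computing and storing it on a miss."""
        if key not in self.data:
            self.data[key] = compute(key)
        return self.data[key]

    def save(self):
        with open(self.path, "wb") as fh:
            pickle.dump(self.data, fh)


calls = []


def fake_lookup(lat_lon):
    """Stand-in for an expensive lookup; records how often it is called."""
    calls.append(lat_lon)
    return f"region-for-{lat_lon[0]},{lat_lon[1]}"


cache_path = os.path.join(tempfile.mkdtemp(), "geo_cache.pkl")
cache = PickledCache(cache_path)
first = cache.get_or_compute((51.5, -0.12), fake_lookup)
second = cache.get_or_compute((51.5, -0.12), fake_lookup)  # served from memory
cache.save()

reloaded = PickledCache(cache_path)
third = reloaded.get_or_compute((51.5, -0.12), fake_lookup)  # served from disk
```

The expensive function runs once, and later runs read the result back from the pickle file. The same pattern would fit the similarity scores, keyed by an (industry, course) pair.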
