stackoverflow-code-identifier

Untagged code identifier for StackOverflow posts.

Some StackOverflow Q&A posts contain code snippets that are not wrapped in <code> </code> tags. This project is a partial solution to that problem, using regular expressions with C/FLEX and scripting with Python3.

The focus is on C/Java-like code, but some Python, Bash prompt and PHP code is also recognised. An effort was made to parse markup languages as well, but handling them with regular expressions proved unpredictable, even with escape characters, because of how the text is extracted from the post body.
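To illustrate the idea of spotting untagged code with regular expressions, here is a minimal Python sketch. The patterns and the names `CODE_LINE`, `looks_like_code` and `tag_untagged_code` are assumptions made for illustration only; the project itself relies on FLEX rules, which are not reproduced here.

```python
import re

# Hypothetical heuristics for C/Java-like lines; not the project's actual rules.
CODE_LINE = re.compile(
    r"""
    ^\s*(
        \#include\s*<[^>]+>                               # C preprocessor include
      | (?:public|private|protected|static)\b.*[{;]\s*$   # Java-style declaration
      | (?:if|for|while)\s*\(.*\)\s*\{?\s*$               # control-flow header
      | .*;\s*$                                           # statement ending in a semicolon
      | [{}]\s*$                                          # lone brace
    )
    """,
    re.VERBOSE,
)


def looks_like_code(line: str) -> bool:
    """Return True when a single line matches one of the heuristics."""
    return bool(CODE_LINE.match(line))


def tag_untagged_code(text: str) -> str:
    """Wrap each run of consecutive code-looking lines in <code> </code> tags."""
    out, block = [], []
    for line in text.splitlines():
        if looks_like_code(line):
            block.append(line)
            continue
        if block:
            out.append("<code>" + "\n".join(block) + "</code>")
            block = []
        out.append(line)
    if block:
        out.append("<code>" + "\n".join(block) + "</code>")
    return "\n".join(out)


if __name__ == "__main__":
    sample = "My loop fails:\nfor (int i = 0; i < n; i++) {\n    total += i;\n}\nAny idea why?"
    print(tag_untagged_code(sample))
```

A sketch like this will also flag ordinary sentences that happen to end in a semicolon, which is one reason the approach can only be a partial solution.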

Prerequisites

To run this program you will need FLEX, GCC and Python3. On a Mac, installation is done with Homebrew, which is installed with:
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

A similar package manager for Linux machines, Linuxbrew, can be found at http://linuxbrew.sh/. Its installation is left to those interested; the apt-get examples shown below work on most Debian-based Linux distributions.

For FLEX installation on a Linux machine you need to run the following commands:

sudo apt-get update 
sudo apt-get upgrade 
sudo apt-get install flex bison

Or, on a Mac, using Homebrew:

brew install flex

For GCC, on a Linux machine:

sudo apt-get update
sudo apt-get upgrade
sudo apt-get install build-essential

For Mac, use the brew command:

brew install gcc

To install Python3 on Linux, run:

sudo apt-get update
sudo apt-get install python3.6

Or, again, on a Mac:

brew install python3
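Once everything is installed, a quick check can confirm that the three tools are reachable. The following is a minimal sketch; the helper `missing_tools` is hypothetical, and the executable names `flex`, `gcc` and `python3` are assumed to be the ones your system uses.

```python
import shutil

# Tools named in the prerequisites; the executable names are assumptions.
REQUIRED_TOOLS = ("flex", "gcc", "python3")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All prerequisites found.")
```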

Usage

To use this package, download the StackExchange forum of your choice from https://archive.org/download/stackexchange. The downloaded archive contains all the data and metadata of that forum. The file of interest is "Posts.xml", since that is where the post bodies are stored.
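For orientation, here is a minimal sketch of how the post bodies can be read from such a file. It assumes the usual dump layout, in which each post is a `row` element carrying `Id` and `Body` attributes; the function name `iter_post_bodies` and the sample data are made up for illustration and are not taken from run.py.

```python
import io
import xml.etree.ElementTree as ET


def iter_post_bodies(source):
    """Yield (post id, body HTML) for each <row> element, one at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            yield elem.get("Id"), elem.get("Body", "")
            elem.clear()  # drop the row's attributes to limit memory use


# Tiny stand-in for a real Posts.xml file (sample content is made up).
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Body="&lt;p&gt;int x = 0;&lt;/p&gt;" />
  <row Id="2" PostTypeId="2" Body="&lt;p&gt;Thanks!&lt;/p&gt;" />
</posts>"""

if __name__ == "__main__":
    for post_id, body in iter_post_bodies(io.BytesIO(SAMPLE)):
        print(post_id, body)
```

Streaming with `iterparse` avoids loading the whole file at once, which matters for dumps that run to hundreds of MB. A file path can be passed in place of the in-memory sample.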

Then change to the directory that contains run.py and run the following command:

python3 run.py

You will then be asked which file to analyze, and you may type

Posts.xml

but note that you can analyze any XML file that follows the StackOverflow Posts.xml format.

Results

Some StackExchange forum dumps are hundreds of MB in size, so the run may take a while to finish. Progress is reported in status lines of the form 4 de 11 ( 36.36363636363637 % ) posts identificados com mais codigo - Last body: 122764 (Portuguese for "4 of 11 posts identified with more code"). When the run finishes, the result can be checked in the Posts.xml file named earlier, which has the same content but with the added tags.
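As a rough illustration of how a status line of this kind can be produced, here is a short sketch. The function `status_line`, its parameter names and the English wording are assumptions; in particular, what the "Last body" number stands for is not stated, so it is simply passed through as given.

```python
def status_line(tagged: int, total: int, last_body: int) -> str:
    """Build a progress message; guards against division by zero."""
    percent = 100 * tagged / total if total else 0.0
    return (
        f"{tagged} of {total} ({percent:.2f} %) posts identified "
        f"with more code - Last body: {last_body}"
    )


if __name__ == "__main__":
    print(status_line(4, 11, 122764))
```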
