class FeatureStorage():
    def __init__(self, test_db=False):
        """Set up database engine"""
        self.TEST_MODE = test_db
I think that testing logic should be separate from our classes. If we create a test db that mirrors the real one, and set `PGDATABASE=testfpsd` in our test fixtures, that should obviate the need for this.
Ok so after discussion, the TODOs in
@conorsch Do you think you'd be able to make a PR for the Ansible tasks in another branch against the
I made a quick change to how the unified features view is being created. The query that generates the view now joins on a single-column table that contains the exampleids that existed at the start of feature generation. Otherwise, if the feature generation code is not re-run after any new traces are collected, the features view would have contained examples for which all the features are Null. That's A Bad Thing.

Here is a wee diagram that shows a high-level view of what the feature generation code is doing in three steps: the input is the raw schema, and the final output used by the machine learning code is the features view. (Also, I'm not rebasing yet to preserve the comments on the diff, but I can do so whenever needed.)

Also, just a note for any Ansible warriors out there: the database creation and table creation is now done by Ansible as of PR #27 being merged.
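For reference, a rough sketch of the join structure (not the exact query; the view name and the feature table/column below are hypothetical placeholders, and only `features.undefended_frontpage_examples` is taken from later in this thread):

```python
# Sketch only: join each feature table against the single-column exampleid
# table, so the view contains no examples added after feature generation began.
create_view_query = """
CREATE VIEW features.frontpage_features AS (
    SELECT examples.exampleid,
           f.some_feature_column
      FROM features.undefended_frontpage_examples examples
      LEFT OUTER JOIN features.some_feature_table f
        ON examples.exampleid = f.exampleid
);
"""
```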
Note this is just review part one (just features.py) 😵.
I think it would be best if you first locally committed the needed changes as per my review, rebasing if you feel it's necessary, and then pushed to this remote. So, in other words, wait to push the modified and rebased version of this branch you have locally, until you've addressed the review comments.
Much of the important logic being in SQL made it impossible (without spending time I don't have to learn SQL well) for me to carefully review the queries. On this note, I'm definitely going to have to push back my implementation of Wa-kNN quite a bit, because it's going to be a while until I can actually spend the time it takes to learn PostgreSQL and learn how to use the relevant Go libraries.
@@ -0,0 +1,765 @@
#!/usr/bin/env python3.5
import pandas as pd
My OCD would like it if you put these imports in order.
ok, ordered alphabetically now
    # For some features we need to know the ordering of bursts,
    # so let's save this information as well
    ranks = list(range(1, len(bursts) + 1))
I don't understand why there's a need to create the `ranks` list.
So this is needed because other feature queries need to know what position a given burst was at: e.g. in order to know that the 5th burst had length 11, you need to keep track of which burst was 5th - this is what the ranks list does
Still just picking up SQL. I know now that databases don't preserve order of rows, so this is necessary.
    for row in df.itertuples():
        if row.ingoing == False:
            current_burst_length += 1
        elif row.ingoing == True and current_burst_length > 0:
Bursts are defined as "a sequence of packets in the same direction" in Table 3.1 of Wang's thesis. In http://www.cs.kau.se/pulls/hot/go-kNN/ Pulls points out:
- Next are calculations for bursts, later turned into features. There are at least three bugs here in both versions of fextractor.py:
- The burst calculation is only for incoming bursts (inconsistent with the KNN paper, that mentions bursts as outgoing packets only). The cited paper they take this from (Peek-a-Boo…) defines bursts in both directions.
In https://github.com/pylls/go-knn/blob/master/features/knn.fixed/fextractor.go, Pulls has included sequences in both directions.
So, IMO we should do the same as Pulls's `fextractor.go` on this one, and in general prefer a strict interpretation of Table 3.1 over trying to follow "the KNN paper" (whatever Pulls is referring to) or whatever `fextractor.py` does (in this case it only counts incoming packets, the opposite of your code).
I agree - I've rewritten `compute_bursts()` to compute both. I've also modified the corresponding tests to reflect the new functionality (e.g. see `test.test_features.RawFeatureGenerationTest.test_burst_table_creation`).
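For reference, a minimal sketch of a both-direction burst computation; the real `compute_bursts()` may differ in its details (for example, it may also record each burst's direction):

```python
def compute_bursts(df):
    """Return the lengths of maximal same-direction runs of cells and the
    position (rank) of each burst. Sketch only; assumes df has a boolean
    `ingoing` column ordered by capture time."""
    bursts = []
    current_direction = None
    current_burst_length = 0
    for row in df.itertuples():
        if row.ingoing == current_direction:
            current_burst_length += 1
        else:
            # Direction flipped: close out the previous burst, if any
            if current_burst_length > 0:
                bursts.append(current_burst_length)
            current_direction = row.ingoing
            current_burst_length = 1
    if current_burst_length > 0:
        bursts.append(current_burst_length)
    # Databases don't preserve row order, so record each burst's position
    ranks = list(range(1, len(bursts) + 1))
    return bursts, ranks
```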
    # For some features we need to know the ordering of bursts,
    # so let's save this information as well
    ranks = list(range(1, len(bursts) + 1))
What I would do is separate out the test fixture block in `fpsd/tests/test_database_methods.py` that sets the `PGDATABASE` environment variable to `testfpsd` into its own module, or a function in a "`test_utils.py`" module. Then I'd make sure to call this function in the `setUp()` of the relevant test cases, and remove all the `TEST_MODE` logic in this class.
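A minimal sketch of such a helper (the module and function names are assumptions; only `PGDATABASE` and `testfpsd` come from this thread):

```python
# fpsd/tests/test_utils.py (hypothetical)
import os


def use_test_database():
    """Point all database handlers at the test database."""
    os.environ["PGDATABASE"] = "testfpsd"
```

Each relevant `TestCase` would then call `use_test_database()` from its `setUp()` instead of relying on `TEST_MODE` logic inside `FeatureStorage`.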
Ok removed this
    return bursts, ranks


class FeatureStorage():
There should be a base class that provides some fundamental methods for various database handlers. However, in the interest of time, it's okay to defer writing this.
I agree, once `ml-classifiers` also gets merged in we can refactor the database handling code.
        self.execute_query_from_string(query)
        return "features.cell_ordering_differences"

    def generate_table_binned_counts(self, num_features=100, size_window=30):
Okay, maybe `num_features` is good actually instead of `num_cells`. I just think it should be standardized. I'm not going to go back, find and change that other comment. I'm just reviewing sequentially.
Using `num_features` here instead of `num_cells` because each feature does not correspond to an individual cell, whereas in the `num_cells` case they do. I've tried to make sure that when `num_cells` is used it refers to a feature corresponding to an individual cell.
Okay.
        self.drop_stale_feature_table(feature_table_name)

        if num_features > 1:
            feature_columns = ["num_outgoing_packets_in_window_{}_of_size_{}".format(x,
Again, why do tables like these need to have dynamic instead of static names?
Ah, so I'm using dynamic names in cases like this so that if one wants to generate multiple feature tables with, e.g., different numbers of windows or window sizes, and use both of them, it's possible to do so.
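For illustration, with a hypothetical name pattern (not necessarily the one used in `features.py`):

```python
def binned_counts_table_name(num_features, size_window):
    # Hypothetical naming scheme: differently parametrized runs get
    # distinct, non-colliding table names.
    return "features.binned_counts_{}_windows_of_size_{}".format(num_features,
                                                                  size_window)

binned_counts_table_name(100, 30)  # 'features.binned_counts_100_windows_of_size_30'
binned_counts_table_name(50, 60)   # 'features.binned_counts_50_windows_of_size_60'
```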
Got ya!
        table public.current_bursts with all bursts in
        the following format:

        burstid | burst | exampleid | rank
Shouldn't `exampleid` always be the leftmost column? Might not matter in practice, but for consistency.
This is intentional, as in this case the primary key is `burstid` (for every `exampleid` there are many `burstid`s). By convention, the primary key is usually the leftmost column.
👍
        Args:
            lengths [list of int]: number of lengths to create bins
                between [0, length]
Don't you mean `[length, ∞)`?
Ah good catch, fixed
👍
    feature_tables.append(db.generate_table_outgoing_cell_ordering())
    feature_tables.append(db.generate_table_outgoing_cell_ordering_differences())
    feature_tables.append(db.generate_table_binned_counts())
    feature_tables = feature_tables + db.generate_burst_tables()
+=
fixed
👍
For this second part of the review, I actually looked at the SQL now that I've learned enough to actually understand it. Here's what I looked at:
- Re-reviewed the sections you changed during the first review round.
- A read through of each query for correctness. I also looked for reasonable edge cases in which queries might fail, and ways you might simplify queries. (In a couple instances I skipped this step—my comments explain why.)
- Testing each query on our remote database to make sure the output looked correct (when possible).
I didn't want to go over the tests yet. Will wait for @conor to make the Ansible PR against this branch first.
Besides the in-line comments I have a few general things to add:

- `create_table` should be used in place of `generate_table` to match the SQL syntax.
- I found a lot of your queries hard to read because of formatting. I took some cues from http://www.sqlstyle.guide/, though didn't spend the time to follow the spec to a T (which would require some renaming in the database anyway). In particular, I like when subqueries are indented from their parent queries, each clause in a query is on its own line (indenting the 2nd, 3rd, ... lines for longer clauses), and when the ends of each clause keyword in a (sub)query are aligned. Also, wrapping query strings in `"""` instead of quoting each line not only makes them more readable, but makes it easier to test the queries w/ psql, pgadmin3, etc. We don't need to make or follow some strict specification, but some standardization—and a cleanup of what you have now—would be nice. I've provided a couple examples of what I consider to be more readable query `str`s; see also the sketch after this list.
- All use of `packets` should be replaced by `cells`.
- Each time another method calls `_create_temp_packet_positions`, it drops and re-creates the temporary table, effectively eliminating the computational effort temporary tables save us. You should fix this—I believe the way you handle this in the `generate_burst_tables` method in order to not re-create the `public.current_bursts` table is the way to go. Further, for both these tables you can immediately drop the (temporary) tables they create as soon as they are no longer needed—emphasizing this may add a little clarity.
- The `create_bursts` function is similar to `_create_temp_packet_positions` in that it is neither called directly nor present in the unified view. For these reasons I believe it should also be private and create only a temporary table. It's also nice to have the full table that is created in the method name, so this would be `_create_temp_current_bursts`.
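To make the formatting point concrete, here is a hypothetical example of the style described above (`features.example_counts` is a placeholder table, not one from this PR):

```python
# Illustrative only: triple-quoted string, one clause per line, indented subquery.
query = """
CREATE TABLE features.example_counts AS (
    SELECT exampleid,
           COUNT(*) AS num_outgoing_cells
      FROM (SELECT exampleid
              FROM raw.frontpage_traces
             WHERE ingoing = FALSE) outgoing
  GROUP BY exampleid
);
"""
```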
self.drop_table("features.undefended_frontpage_examples") | ||
|
||
query = ("CREATE TABLE features.undefended_frontpage_examples AS ( " |
CREATE TABLE features.undefended_frontpage_examples AS (
SELECT exampleid FROM raw.frontpage_examples
);
Produces the same result in my tests. Is there a reason why we could not shorten this query?
Ya, this is intentional: for whatever reason the crawler occasionally (but rarely) puts traces with no cells in the database (probably when it crashes). Those need to get filtered out, but I really don't think it's worth messing about with the crawler for this, so I just made this single-column table to join against that selects only those traces that have one or more cells.
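Something along these lines (a guess at the intent, not necessarily the exact query in the PR):

```python
# Sketch: keep only exampleids that have at least one cell in the traces table.
query = """
CREATE TABLE features.undefended_frontpage_examples AS (
    SELECT examples.exampleid
      FROM raw.frontpage_examples examples
     WHERE EXISTS (SELECT 1
                     FROM raw.frontpage_traces traces
                    WHERE traces.exampleid = examples.exampleid)
);
"""
```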
self.drop_table("packet_positions") | ||
|
||
if outgoing_only: | ||
where_only_outgoing = "WHERE ingoing <= false" |
I think this should just be `=`, not `<=`, which doesn't make sense for a Boolean column.
truuu. changed
" ROW_NUMBER() OVER " | ||
" (PARTITION BY exampleid ORDER BY t_trace) " | ||
" AS position, " | ||
" t.* " |
How do you feel about being more specific with `t.exampleid, t.t_trace, t.ingoing` instead of `t.*`?
sure, changed
" AS position, " | ||
" t.* " | ||
" FROM raw.frontpage_traces t) " | ||
"x {} );").format(where_only_outgoing) |
You can write this as
CREATE TEMP TABLE packet_positions AS
    (SELECT exampleid,
            ROW_NUMBER() OVER (
                PARTITION BY exampleid ORDER by t_trace) as position,
            ingoing
       FROM raw.frontpage_traces
     {});
to simplify, eliminating the subquery.
The subquery is necessary here - this won't work when we are selecting only the rows where `ingoing = false`.
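For reference, a simplified sketch of the subquery form: `WHERE` is evaluated before window functions, so computing `ROW_NUMBER()` in an inner query first keeps each cell's position within the full trace, and only then filters down to the outgoing cells.

```python
query = """
CREATE TEMP TABLE packet_positions AS (
    SELECT x.exampleid, x.position, x.ingoing
      FROM (SELECT t.exampleid,
                   ROW_NUMBER() OVER
                       (PARTITION BY exampleid ORDER BY t_trace) AS position,
                   t.ingoing
              FROM raw.frontpage_traces t) x
     WHERE x.ingoing = FALSE
);
"""
```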
" AS position, " | ||
" t.* " | ||
" FROM raw.frontpage_traces t) x " | ||
" WHERE x.position <= 10') " |
Should be `WHERE x.position <= {num_cells}`, and unpack locals (`**locals()`) in the `format` method after the position arg.
added `num_cells`, good catch
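A tiny illustration of the suggestion (assuming `num_cells` is a local variable in the enclosing method):

```python
num_cells = 10
query = ("SELECT * FROM packet_positions x "
         "WHERE x.position <= {num_cells}").format(**locals())
```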
"features.burst_binned_lengths", | ||
"features.burst_lengths"] | ||
|
||
def _get_column_tables_of_table(self, schema_name, table_name): |
How about calling this `_list_columns`? A one-line docstring would be good.
sure, changed
join_query = ("LEFT OUTER JOIN {name} t{num} " | ||
"ON foo.exampleid = t{num}.exampleid ").format(name=table_name, | ||
num=table_num) | ||
full_join_query = full_join_query + join_query |
+=
changed
full_join_query = ("FROM ( (SELECT exampleid FROM " | ||
"features.undefended_frontpage_examples) foo ") | ||
|
||
for table_num, table_name in enumerate(list(master_features.keys())): |
`enumerate(list(master_features))` is equivalent to `enumerate(list(master_features.keys()))`.
word! fixed
"features.undefended_frontpage_examples) foo ") | ||
|
||
for table_num, table_name in enumerate(list(master_features.keys())): | ||
prefix = 't{}.'.format(table_num) |
Why not prefix the table name instead of `t{0,1,...}`?
ok changed
Actually I rolled this back - it makes an already very long query extremellllly long (really annoying for troubleshooting, which I just did), and also it doesn't work just using `table_name` because it includes `.` characters, which we would need to strip off along with the name of the schema.
        for table_num, table_name in enumerate(list(master_features.keys())):
            prefix = 't{}.'.format(table_num)
            prefixed_columns = [prefix + s for s in master_features[table_name]]
            columns_to_select = columns_to_select + ', ' + ', '.join(prefixed_columns)
I think in this case, because we can't read the query altogether anyway, it's better if you just don't create the `columns_to_select` var and instead just use `SELECT *` in the uppermost parent query of the `create_new_view` var. See my comments there as well.
The issue here is that we don't want to select all columns, because we don't want e.g. `exampleid` appearing many times in the features table.
OK, addressed these, including those in the 4 bullets in the main comment (a small number are probably not worth the time investment to refactor at this stage IMHO, but I made most of the suggested changes).
Changes overall look good! The one thing I thought was important that didn't happen was the rewrite of
One more note is that it would be helpful if you made your commits a little less granular (e.g., renaming was split between numerous commits when it could have just been one). I find it a little easier to review changes this way, and it makes the log nicer.
Not saying you should rebase or anything, just for future reference.
Here's a weird idea that might just make actually fully testing this PR nice. @conorsch can make a PR that implements the database changes described herein against master, and we can merge that after #54 gets merged. Then @redshiftzero can rebase this PR on top of master. The one problem is that … Maybe the three of us can sync on this this afternoon and figure out exactly what @conorsch needs to write to support this next pipeline step.
Tests refactored, and tested this on the new Ansible test setup in PR #72 and everything works well. 10-4 on the granularity of git messages 😸
Can you rebase this again? Should be the last time. Also, add your test module to
Sure, done and done!
How's this one looking @fowlslegs?
My inline comments represent the last changes this needs before merging. What follows are notes relevant to a future refactor, and aren't relevant to the merging of this PR:

- The not-yet-existent database base class should not read `PGDATABASE` from the environment, but be passed the relevant info (see the sketch after this list). In fact, it should not be reading any environment variables; everything should be explicitly passed to its constructor. The tests that use the database, as well as the production modules, should by default pass in the respective test/prod database initialization object from some "common" module, instead of reproducing all the values each time (or perhaps in config.ini).
- Instead of dropping the whole features schema, maybe drop all tables in the schema.
- This file would be a great first place to test out hypothesis.
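A rough sketch of that direction (the class name, constructor signature, and the assumption that SQLAlchemy remains the underlying engine are all mine, not from this PR):

```python
from sqlalchemy import create_engine


class DatabaseHandler:
    """Hypothetical base class: every connection setting is passed in
    explicitly rather than read from environment variables."""

    def __init__(self, database, user, password, host="localhost", port=5432):
        self.engine = create_engine(
            "postgresql://{user}:{password}@{host}:{port}/{database}".format(
                user=user, password=password, host=host, port=port,
                database=database))
```

A test/prod "common" module could then construct the appropriate `DatabaseHandler` (or a `FeatureStorage` built on it) once and hand it to both the tests and the production code.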
    def tearDown(self):
        cleanup(self.db.engine)
        self.db.drop_table("public.current_bursts")
Add a

    if __name__ == '__main__':
        unittest.main()

here.
    def test_incoming_burst(self):
        df = pd.DataFrame({'ingoing': [True, True, True]})
        bursts = compute_bursts(df)
        burst_positions = range(1, len(bursts) + 1)
Since you're not doing anything with these values, why create them?
query = """CREATE TABLE features.burst_lengths AS | ||
(SELECT * FROM crosstab( | ||
'SELECT exampleid, rank, burst | ||
FROM public.current_bursts') |
You need to add an `ORDER BY rank` directive to this inner, crosstab query, and update the last method of the last `TestCase` accordingly.
Great, thanks for the comments @fowlslegs! Fixed/changed those things. Also, I support future refactoring of
@@ -230,6 +227,8 @@ def test_burst_table_creation(self):
            actual_output['exampleid'].append(row[1])
            actual_output['burst_length'].append(row[2])
            actual_output['burst_rank'].append(row[3])

            pdb.set_trace()
p d b
lol sorry
Just get rid of the debug statement and consider this merged! Thanks for dealing with the long review process, and I was glad to learn SQL in the process 😸
#!/usr/bin/env python3.5
import os
import pandas as pd
import pdb
Sorry, this import too.
from decimal import Decimal
import os
import pandas as pd
import pdb
And this one.
Another reason why we need to get rid of environment variable usage in our code. See #30 (review).
The PR adds feature generation code and tests. It takes data from the `raw` schema, generates all features relevant to Tor traffic from Wang et al. 2014, and stores the results in the `features` schema. One can generate (or regenerate) all features by running `features.py`.
Most of the feature generation code (with the exception of some preprocessing for the "burst" features) has been rewritten in SQL for speed. The burst preprocessing (in `compute_bursts()`) is done in Python and is thus slow (Issue #29 now exists to address this). Otherwise, each feature generation method is a Python wrapper for a SQL query. Where possible, the feature generation code is parametrized so that variants of the features used in Wang et al. 2014 can be easily generated. For example, instead of binning our traces into windows of size 30 and computing counts of outgoing cells in the first 100 bins, we may wish to vary both the window size and the number of bins. As such, both of these parameters are passed to the feature generation method for this feature: `generate_table_binned_counts(num_features=100, size_window=30)`.
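For example, a hedged sketch of how the parametrized generation can be driven (the exact setup differs from the actual `main()`):

```python
# Sketch only: generate a default set of feature tables plus an extra
# parametrization of the binned-count features.
from features import FeatureStorage

db = FeatureStorage()
feature_tables = []
feature_tables.append(db.generate_table_outgoing_cell_ordering())
feature_tables.append(db.generate_table_binned_counts(num_features=100,
                                                      size_window=30))
feature_tables.append(db.generate_table_binned_counts(num_features=50,
                                                      size_window=60))
feature_tables += db.generate_burst_tables()
```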
The calling of these various functions is done in the `main()` function, but I would not futz too much with that, as all these tasks will be integrated with a data science pipeline tool to automate their execution upon some condition, e.g. new rows entering the `raw` schema.

This PR also adds tests for each feature generation function (both Python and SQL, using a test database) in `test/test_features.py`. This is implemented by using a testing database that has the same structure as the production database but is empty. Each test inserts some rows into the test database during `setUp()`, runs the test, and then removes those rows and any tables that have been created by the feature generation function during `tearDown()`. Run all the tests in `test/test_features.py`.

A couple of packages are added to the requirements files (in `ml-requirements`):

- `pandas`, a general data analysis library
- `tqdm`, a nice progress bar used to display progress on the burst generation computation

Finally, details of how the test database is set up are written out in `test_db_setup.md`, which we can add to be configured in the Vagrant setup after branch `sorter-db-integration` is merged.