Skip to content

szarski/stat_fu

Repository files navigation

StatFu

Stat_fu is a plugin for managing statistics in Rails 2.xx as background jobs.

It is run in a production environment and does well, but as it’s pretty specific, you might want to think if you like its logic before applying it. The logic should allow you to create any kind of stats though.

If your stats are fairly simple and do not require much computation or db queries, you might think of calculating them on the fly instead of using the plugin. Otherwise this might be a solution for you.

As the code changes often, there’s no guarantee the current version works, please use tags instead.

Comments and feedback appreciated.

The Idea

The idea is based on a few assumptions:

  • Each statistic is separate
  • Each statistic has one db table and one model which contains all the code, rake tasks etc.
  • A statistic is a function that takes a set of arguments and returns a set of results
  • One record is associated with just one set of arguments and one set of results
  • We calculate one record at a time (the way the logic works)
  • We optimize the amount of queries and computations by caching partial results, not by sharing the computations between records
  • Each statistic has:
    • a definition of input parameters
    • a method for calculating a record
    • a method for checking if a record is up to date
    • a method for checking if a record is coherent (correct)
    • rake tasks definitions
    • output definitions (after read processing)
  • Both input parameters and output vaues are stored as normal activerecord fields and are of standard types (Date, Integer, String, etc..)
  • Checking for errors relies on redundant computations that sould show the same results

Install

script/plugin install git://github.com/szarski/stat_fu.git

Example

Let’s say we’ve got an app with models:

  • User
  • Project

There’s a named_scope Project.with_category(category_name* )*

User has_many *Project*s. Projects have a category_name field. We want to be able to retreive the amount of projects for a certain user for a certain date. Also, we want to be able to filter them by category name. Also, we want to be able to determine which user was the most productive and how many projects there were on a certain date.

We generate a model and a migration:

script/generate statistic_model users_projects_daily date:date user_id:integer category:string projects_count:integer

Open app/models/statistic/users_projects_daily.rb. We get a prepopulated model. We want to:

  • set the default parameters
  • fill in the count, check, up_to_date? methods (check will be a redundant form of count)
  • define rake tasks
  • define our output

After filling everything in we get a model that looks as follows:

class Statistic::UsersProjectsDaily < Statistic::Base
  set_table_name "statistic_users_projects_dailies"

  parameters :date, :user_id, :category, :optional => [:category]

  def count
    user = User.find self.user_id
    if self.category
      self.projects_count = user.projects.with_category(self.category).count
    else
      self.projects_count = user.projects.count
    end
  end

  def check
    if self.category
      return (self.projects_count == Project.find(:all, :conditions => {:user_id => self.user_id, :category_name => self.category}).count)
    else
      return (self.projects_count == Project.find(:all, :conditions => {:user_id => self.user_id}).count)
    end
  end

  def up_to_date?
    return (self.updated_at.to_date - self.date > 1)
  end

  UPDATE_HASH = {:user_id => lambda {User.all.map(&:id)}, :date => lambda {(3.months.ago.to_date..Time.now.to_date).to_a}}

  rake_tasks do |r|
    r.namespace :users_projects_dailies do |n|
      n.namespace :update do |u|
        u.default_update :all, "update all stats for users_projects_dailies", UPDATE_HASH
        u.namespace :all do |a|
          a.default_update :force, "update all stats for users_projects_dailies, force up_to_date ones", UPDATE_HASH.merge({:force => true})
        end
      end
      n.namespace :clear do |c|
        c.default_clear :all, "clear all users_projects_dailies stats"
      end
      n.custom_task "Some Custom Task" do
        # code here
      end
    end
  end

  describe_output :projects_count => :sum, :most_busy_user_id => lambda{|records| records.sort{|a,b| a.projects_count <=> b.projects_count}.last.user_id}
end

To prevent from doing too much db queries, say, when checking result, we can cache them as follows:

def check
  projects = self.cache_in_batch "projects for user #{self.user_id}" do |records|
    return Project.find(:all, :conditions => {:user_id => self.user_id})
  end
  if self.category
    return (self.projects_count == projects.select {|p| p.category_name == self.category}.count)
  else
    return (self.projects_count == projects.count)
  end
end

Off course we can cache many more things this way, we could, for instance cache projects for the whole stats collection (reffered here as records).

Now for the fun part. To run your stats, just add to your crontab:

rake stats:users_projects_dailies:update:all

Or to remove:

rake stats:users_projects_dailies:remove:all

To see all the rake tasks related to stats run:

rake stats:help

Now, to run in the console:

# Retrieve all available stats and check if they're up_to_date
batch = Statistic::UsersProjectsDaily.batch(:user_id => [1,2,3], :date => Time.now.to_date)
# Update if needed
unless batch.unsatisfied_combinations.empty?
  batch.fill_up
end
batch.most_busy_user_id # gives us the most productive user
batch.projects_count # gives us the total amount of projects

Logics

The way we use the stats is we:

  • retrieve a batch
  • check if all the stats are up to date
  • update the old ones
  • grap the output (with methods defined with describe_output)

To update all the records in batch, including these that are already up_to_date?, we just pass:

:force => true

This might be helpfull if, for instance, we fixed a bug and some of the stats may be corrupted.

As we process stats in background, we don’t care that much about the performance of the count and check functions.

What we do care about though is the up_to_date? function as it will be run each time we retrieve a batch. Since it’s run while the user is waiting for stats to display, we want it to be as light as possible.

Also, the output functions should just perform simple transormations of the results we store in out db table.

Methods

To get a better idea of what methods are available, you can always take a look at the specs.

Copyright © 2010 Jacek Szarski (jacek@applicake.com), released under the MIT license

About

A Rails statistics framework

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages