I’ve teamed up with my wife to ship HackerBooks.com, a search engine for books quoted on StackOverflow and HackerNews.
The app is currently fairly simple and more is planned – yet I hope you will like the site just as much as the HackerNews crowd!
A few usage tips
There’s a built-in advanced search, not really advertised yet, that you may want to use:
- search for has:kindle to find all kindle ebooks
- search for site:stackoverflow to find all stackoverflow books
- or site:hackernews to restrict to hackernews books
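To illustrate how such operators can be separated from the free-text part of a search, here is a minimal sketch. It is not the site's actual code: `parse_search_query` is a hypothetical helper, and the hash keys (`:keywords`, `:quoted_on`, `:kindle`) are simply borrowed from the Sunspot snippet in the technical notes. The values stored in `:quoted_on` are an assumption.

```ruby
# Hypothetical parser: splits a raw search string into free-text keywords
# and the has:/site: operators listed above.
def parse_search_query(raw)
  query = { :keywords => nil, :quoted_on => nil, :kindle => false }
  keywords = []

  raw.to_s.split.each do |token|
    case token.downcase
    when 'has:kindle'
      query[:kindle] = true
    when /\Asite:(stackoverflow|hackernews)\z/
      query[:quoted_on] = $1   # assumed to match the indexed site name
    else
      keywords << token        # anything else is treated as a plain keyword
    end
  end

  query[:keywords] = keywords.join(' ') unless keywords.empty?
  query
end

p parse_search_query('ruby has:kindle site:hackernews')
```

Unknown operators fall through to the keyword list, so a typo such as `site:reddit` degrades to a plain full-text search instead of raising an error.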
Technical notes
It deserves more blogging, definitely, but I’ll give at least some details today.
The app is largely an ETL which munges multiple data sources (including the StackExchange data dump, a HackerNews data dump and a HackerNews crawler I wrote, as well as calls to the Amazon API) to extract, conform and load books and quotes from these sites’ users into a MongoDB back-end.
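As a rough idea of what the "extract" step can look like, the sketch below pulls Amazon product identifiers (ASINs, which for books are usually the ISBN-10) out of the text of a post. This is an assumption about the approach, not the real pipeline: the `AMAZON_LINK` pattern and `extract_asins` are hypothetical names, and the pattern only covers a few common Amazon URL shapes.

```ruby
# Hypothetical pattern covering /dp/, /gp/product/ and /exec/obidos/ASIN/ links,
# optionally preceded by a title slug.
AMAZON_LINK = %r{amazon\.[a-z.]+/(?:[^\s/]+/)?(?:dp|gp/product|exec/obidos/ASIN)/([A-Z0-9]{10})(?![A-Z0-9])}i

# Returns the distinct ASINs found in a post or comment body, in order of appearance.
def extract_asins(text)
  text.to_s.scan(AMAZON_LINK).flatten.map { |asin| asin.upcase }.uniq
end

comment = 'Read http://www.amazon.com/Pragmatic-Programmer-Journeyman-Master/dp/020161622X ' \
          'and http://amazon.com/dp/0596516177?tag=x'
p extract_asins(comment)
```

Each identifier could then be used to look the book up through the Amazon API and be stored together with the quote that mentioned it.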
On top of that, Sunspot and Solr are used to index the books and allow a full-text search with a flexible weighting (see below):
Sunspot.search(Book) do
  unless query[:keywords].blank?
    # full-text search on title and description, boosted by karma
    keywords query[:keywords], :fields => [:title, :description] do
      boost_fields :title => 1.5, :description => 0.4
      boost(function { product(:karma, 3) })
    end
  else
    # no keywords: list the most quoted books first
    order_by :karma, :desc
  end
  with(:quoted_on, query[:quoted_on]) unless query[:quoted_on].blank?
  with(:kindle_edition, true) if query[:kindle]
end
Then we have a Ruby on Rails 3 app which runs on top of RVM/Ruby 1.9.2.
To make a snappy app, we’ve used Nginx and Passenger, then Redis for the caching, coupled with a “warmup” script which fills the cache.
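The caching plus warmup idea can be sketched as follows. This is only an illustration under stated assumptions: `MemoryStore`, `cached`, `warmup` and `expensive_search` are hypothetical names, and the in-memory store stands in for Redis so the example runs without a server (the redis gem's client also responds to `get` and `set`, but stores strings, so real results would need serializing).

```ruby
# In-memory stand-in exposing the two calls used below.
class MemoryStore
  def initialize
    @data = {}
  end

  def get(key)
    @data[key]
  end

  def set(key, value)
    @data[key] = value
  end
end

# Cache-aside: return the cached value, or compute it once and store it.
def cached(store, key)
  value = store.get(key)
  return value unless value.nil?
  value = yield
  store.set(key, value)
  value
end

# Placeholder for the real (slow) search round-trip.
def expensive_search(query)
  "results for #{query}"
end

# Hypothetical warmup: pre-compute the searches most likely to be requested.
def warmup(store, queries)
  queries.each do |q|
    cached(store, "search:#{q}") { expensive_search(q) }
  end
end

store = MemoryStore.new
warmup(store, ['ruby', 'has:kindle'])
puts cached(store, 'search:ruby') { expensive_search('ruby') }
```

After the warmup has run, the first real visitor hits the cache instead of paying for the full search.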
Keeping things RAM light
Initially the crawler was relying on resque (my favourite background processing tool) and resque-scheduler.
I really like both tools for my work but here we were severely restricted in RAM. Resque spawns a child process and resque-scheduler needs a dedicated process too, so it wasn’t a good fit here.
So another welcome addition to my toolset in this specific case is rufus-scheduler by John Mettraux, which allowed me to quickly (2 hours) convert the existing crawler to a more lightweight, single-process one.
require 'rubygems'
require 'bundler'

# restrict required gems with a bundler group to save RAM
ENV['BUNDLE_GEMFILE'] = File.expand_path('../Gemfile', File.dirname(__FILE__))
Bundler.require(:crawler)

require 'rufus/scheduler'

scheduler = Rufus::Scheduler.start_new

# ...

scheduler.cron '*/15 * * * *' do
  puts "Scheduling FetchHackerNewsNewestJob"
  FetchHackerNewsNewestJob.perform
end

scheduler.cron '*/2 * * * *' do
  puts "Scheduling ExtractHackerNewsPageJob"
  ExtractHackerNewsPageJob.perform
end

def scheduler.handle_exception(job, exception)
  puts "job #{job.job_id} caught exception #{exception}"
  puts "notifying hoptoad"
  HoptoadNotifier.notify(:error_class => exception.class,
                         :error_message => exception.to_s)
end

scheduler.join
Automating the sysadmin
For this specific project, the (currently single) server is managed from start to finish with chef-solo – and I must thank vagrant too which helped me create/tweak the recipes at home.
Here’s what I have automated on this project:
- initial rvm bootstrapping
- ssh configuration (port change, no root/password allowed etc)
- firewall with csf
- cloudkick agent deployment
- god setup
- mongodb compilation, configuration and launching
- same for redis
- same for jetty, solr and sunspot
- nginx with passenger module
- 2 apps deployment (the front-end and the crawler)
Of course, all the god watches as well as all the config files (jetty configuration, mongodb etc, nginx config) are managed via erb and chef-solo.
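To give an idea of what an erb-managed god watch looks like, here is a minimal sketch using Ruby's standard ERB library directly. In a real cookbook the rendering would be done by Chef's template resource; the template text, the `render_god_watch` helper, and the app name and paths are all hypothetical.

```ruby
require 'erb'

# Hypothetical template for one god watch; the values come from the caller.
GOD_WATCH_TEMPLATE = <<-TEMPLATE
God.watch do |w|
  w.name  = "<%= name %>"
  w.dir   = "<%= dir %>"
  w.start = "<%= start_command %>"
  w.keepalive
end
TEMPLATE

# Renders the template with the method's local variables.
def render_god_watch(name, dir, start_command)
  ERB.new(GOD_WATCH_TEMPLATE).result(binding)
end

puts render_god_watch('crawler', '/var/apps/crawler', 'bundle exec ruby crawler.rb')
```

Keeping one template per service means a path or command change is made in a single place and then pushed to the server on the next chef-solo run.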
This is the first time I have not used Capistrano for app deployment.
I cannot stress enough the usefulness of using both vagrant and chef-solo, especially for single-server deployment; I’m thinking about writing an e-book on that topic, as the learning curve has been fairly steep.
Using both lets me, for instance, go from a new host (e.g. Linode) to a fully configured stack and app, with the necessary data restored from backup, in less than 15 minutes.
Voilà – stay tuned for more description of the underlying architecture of HackerBooks.com!

