Naive Bayesian Classifier and Django

Tags: django NewsPet project

I am currently taking an introductory course in artificial intelligence. I find it really interesting. I have learned more about search algorithms than I could have ever imagined. Recently we were instructed to create an artificial intelligence agent to do any task, as long as it uses some element of our class. Thus NewsPet was created.

NewsPet is a trainable (hence the name) news fetcher that will hopefully be like Pandora for news. The scope of this project is pretty limited (it is mostly a proof of concept at this point), but I will offer it up as food for thought. People often subscribe to lots of RSS feeds but do not want to read every item that comes from them, or they want to categorize the items in some way. NewsPet categorizes or trashes all the articles that come through your reader.

We will be using Naive Bayesian Classification, which I will discuss briefly below for anyone who is unfamiliar with it. Our project has three important parts: Classification, Initial Training, and Feedback (or Subsequent Learning).

Bayesian Network and Classification

Bayesian networks are directed acyclic graphs that use probability, specifically Bayes' Theorem, to help calculate the likelihood of events based on partial evidence. An edge from node B to node A means that A depends on B, so node A carries P(A|B), the probability distribution of A given B. Let's look at an example. You have a nuclear plant with a core. We have the temperature of the core (T), the reading from the temperature gauge, or perceived temperature (G), whether the gauge is faulty (F_g), whether the alarm goes off (A), and whether the alarm is faulty (F_a). The alarm is more likely to go off if the perceived temperature is high, and the gauge is more likely to be faulty if the temperature is high. So let's look at the graph.

[Figure: Bayesian network for the nuclear plant example (/media/img/BayesNetwork.jpg)]
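To make a single edge of such a network concrete, here is a minimal sketch of Bayes' Theorem for one binary cause and one piece of evidence. The function name and all of the probabilities are made up for illustration; they are not taken from the project.

```python
def posterior(prior, likelihood, false_positive_rate):
    """Bayes' Theorem for a binary hypothesis H and evidence E.

    prior               = P(H)
    likelihood          = P(E | H)
    false_positive_rate = P(E | not H)
    Returns P(H | E).
    """
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: the core is overheated 1% of the time, the gauge
# reads "high" 95% of the time when it is and 10% of the time when it is not.
p_hot_given_high = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.10)
print(round(p_hot_given_high, 3))  # 0.088
```

Even with a fairly accurate gauge, a high reading alone leaves the probability of a hot core below 10% here, because overheating is rare to begin with. A full network chains this kind of update across all of its edges.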

Naive Bayesian Classifiers (NBCs) use a simple Bayesian network to classify documents, and they are often used for spam filtering. An NBC is trained by giving it a batch of documents and telling it, for each one, whether the document should be accepted. The NBC then builds a Bayesian network. When you give a document to an NBC, it uses that document as evidence and gives you a confidence value for whether the document should be accepted.
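The train-then-score cycle described above could look roughly like the following. This is a minimal sketch of a two-class word-count naive Bayes with Laplace smoothing, not the classifier the project actually uses; the class name, the tokenizer, and the sample documents are all assumptions.

```python
import math
import re
from collections import Counter


class NaiveBayes:
    """Two-class (accept / reject) naive Bayes over word counts."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, accepted):
        """Record one training document and whether it was accepted."""
        self.doc_counts[accepted] += 1
        self.word_counts[accepted].update(self.tokenize(text))

    def _log_score(self, words, label, vocab_size):
        total_docs = sum(self.doc_counts.values())
        # Smoothed prior, so a class with no documents never yields log(0).
        score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
        total_words = sum(self.word_counts[label].values())
        for word in words:
            count = self.word_counts[label][word]
            # Laplace smoothing keeps unseen words from zeroing the product.
            score += math.log((count + 1) / (total_words + vocab_size))
        return score

    def confidence(self, text):
        """Return P(accepted | text) as a value between 0 and 1."""
        words = self.tokenize(text)
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        vocab_size = max(len(vocab), 1)
        yes = self._log_score(words, True, vocab_size)
        no = self._log_score(words, False, vocab_size)
        # Work with the log-odds; the clamp avoids overflow in exp().
        return 1.0 / (1.0 + math.exp(min(no - yes, 700.0)))


nbc = NaiveBayes()
nbc.train("django release adds new orm features", accepted=True)
nbc.train("python web framework tutorial", accepted=True)
nbc.train("celebrity gossip and fashion news", accepted=False)
print(nbc.confidence("new django tutorial") > 0.5)  # True
```

The "naive" part is the assumption that words are independent given the class, which is why the per-word probabilities can simply be multiplied (added, in log space).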

Classification

Our approach to classification is nothing special. There is a list of feeds and a list of categories. Users create the categories, and each category has an NBC associated with it. How the NBCs are trained is covered in Initial Training and Feedback. Each item from the feeds is tested against each of these categories. If the confidence is greater than some constant, the item is added to that category. The default constant will be decided after some experimentation, and it will also be customizable. If an item is not added to any category, it goes into the special category, Trash.
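The routing rule above can be sketched in a few lines. The function name, the threshold of 0.8, and the keyword-based stand-in scorers are all hypothetical; in the real project each scorer would be a category's NBC and the default threshold is still to be decided.

```python
TRASH = "Trash"


def categorize(item_text, classifiers, threshold=0.8):
    """Return every category whose confidence exceeds the threshold.

    `classifiers` maps a category name to a function that takes the item
    text and returns a confidence between 0 and 1. Items that match no
    category fall through to the special Trash category.
    """
    matches = [
        name
        for name, confidence in classifiers.items()
        if confidence(item_text) > threshold
    ]
    return matches or [TRASH]


# Stand-in scorers for illustration only (a real one would be an NBC).
classifiers = {
    "Programming": lambda text: 0.9 if "python" in text.lower() else 0.1,
    "Business": lambda text: 0.9 if "market" in text.lower() else 0.1,
}

print(categorize("Python 3 released", classifiers))  # ['Programming']
print(categorize("Cat pictures", classifiers))       # ['Trash']
```

Note that an item can land in more than one category, since each category is tested independently.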

Initial Training

Because each category has its own Bayesian classifier, each one needs to be trained initially. We have several ideas for this right now: choose from some pre-trained categories (e.g. Business, Programming), submit a batch of files to train the classifier, or use a single word or phrase. The first two are fairly straightforward, but they are not extremely helpful. That being said, it is impossible to train a Bayesian classifier with just one word. Because Google is a reliable source for finding documents based on single words, we are going to use Google search to retrieve a number of documents to train the Bayesian classifier. We will experiment to decide which approach to use; the final solution will probably let you choose which kind of training you would like.

Feedback

One important part of this feed reader is that it's trainable. For every item you will have the choice to give it a thumbs up, give it a thumbs down, or say that it belongs to another category. Behind the scenes, thumbs up and thumbs down will become more training cases for that category's NBC. When you move an item to a new category, it becomes a training case for both categories' NBCs. This means that the NBCs will always be changing and improving.
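As a rough sketch of how the three feedback actions could turn into training cases: the function names and the in-memory store below are hypothetical, and treating a move as a negative case for the old category and a positive case for the new one is my reading of "a training case for both", not something we have settled.

```python
from collections import defaultdict

# category name -> list of (item_text, accepted) training cases.
# A real system would feed these to the category's NBC instead.
training_cases = defaultdict(list)


def thumbs_up(category, item_text):
    """The item belongs in this category: a positive training case."""
    training_cases[category].append((item_text, True))


def thumbs_down(category, item_text):
    """The item does not belong in this category: a negative case."""
    training_cases[category].append((item_text, False))


def move(item_text, old_category, new_category):
    """Moving an item trains both categories (assumed: one down, one up)."""
    thumbs_down(old_category, item_text)
    thumbs_up(new_category, item_text)


move("Quarterly earnings for Python shops", "Programming", "Business")
print(training_cases["Programming"])
print(training_cases["Business"])
```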

Django

Many of you may be skipping right to this section to see what Django has to do with any of this. I have convinced my group members that a Django-powered website would be the best choice for the user interface. In the background, Java will be doing the Bayesian classification. While this may be disappointing, there were a couple of reasons for it: my project members do not know Python, and we couldn't find a good NBC library for Python. If anyone knows of one, I would be interested in hearing about it, so leave a comment.

Django Schedule

Tags: django documentation project

Django-schedule has just been released with support for recurring events. Doing this required a paradigm switch. In this post I will describe the paradigm switch and explain some features.

Events and Occurrences

The new idea is to think of an Event as a thing that a person would like to track, and an Occurrence as an instance of an event with a specific time and date. It works best if we think about it with an example. You have a 'Weekly Staff Meeting'; this is an Event. It's a meeting that happens every week. Now 'Tuesday's Staff Meeting' is an Occurrence. It is a specific instance of the Event 'Weekly Staff Meeting'. So now let's look at how this works in code.

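>>> import datetime
>>> from django.contrib.auth.models import User
>>> from schedule.models import Event, Rule
>>> user = User.objects.get(username='thauber')
>>> start = datetime.datetime(2008, 1, 1, 14, 0)
>>> end = datetime.datetime(2008, 1, 1, 15, 0)
>>> rule = Rule.objects.get(name="Weekly")
>>> event = Event(title='Staff Meeting',
...               start=start,
...               end=end,
...               rule=rule,
...               description="description")
>>> event.save()  # the event needs to be saved before it can be related
>>> event.create_relation(user)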

What we just created here is an event called "Staff Meeting." Don't worry about the create_relation line; we will deal with that in Relations. Now we can worry about getting the Occurrences. Let's say that you want all occurrences of that event from today to a week from today.

>>> start = datetime.datetime.now()
>>> end = start + datetime.timedelta(days=7)
>>> event.get_occurrences(start, end)

This would return all of the occurrences of this event between start and end.
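To show what "all occurrences between start and end" means for a weekly event, here is a standalone sketch in plain Python. It is not django-schedule's implementation (which builds on dateutil's rrule); the function name and the choice to include any occurrence that overlaps the window are assumptions.

```python
import datetime


def weekly_occurrences(event_start, event_end, window_start, window_end):
    """Return (start, end) pairs of a weekly event that overlap the window."""
    duration = event_end - event_start
    week = datetime.timedelta(days=7)
    occurrences = []
    start = event_start
    while start < window_end:
        end = start + duration
        if end > window_start:  # keep occurrences that overlap the window
            occurrences.append((start, end))
        start += week
    return occurrences


first_start = datetime.datetime(2008, 1, 1, 14, 0)
first_end = datetime.datetime(2008, 1, 1, 15, 0)
window = (datetime.datetime(2008, 1, 7), datetime.datetime(2008, 1, 14))
for occ_start, occ_end in weekly_occurrences(first_start, first_end, *window):
    print(occ_start, "to", occ_end)  # 2008-01-08 14:00:00 to 2008-01-08 15:00:00
```

The Event stores only the first start and end plus the rule; each Occurrence is computed from them on demand.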

Periods

So now you have a list of events, and you would like all of the occurrences for that list. You can do this with the Period class.

>>> events = Event.objects.get_for_object(user)
>>> period = Period(events, start, end)
>>> period.get_occurrences()

If you are wondering why there is a class for this, there are several reasons.

1) It is useful to know which events start in this period, end in this period, or just continue through this period. To deal with this there is a function, get_occurrence_partials, which returns what I like to call Occurrence Partials, meaning Occurrences relevant to a discrete period of time. Each element in the returned list is a dictionary, {'event': event, 'class': 0}, where the classes are as follows:

  • 0: The event begins in this period (and ends after it)
  • 1: The event begins and ends in this period
  • 2: The event doesn't begin or end in this period, but it exists in this period (AKA it continues during this period)
  • 3: The event ends during this period (having begun before it)

2) It can be subclassed so that special functionality can be added to special periods. Some subclasses that are included out of the box are Month, Week, and Day. These subclasses have some specific functionality that you may find helpful. For example, Month has get_weeks, which returns the Week periods for that specific Month period. Month, Week, and Day are all initialized with a date or a datetime object.

>>> date = datetime.datetime(2008,5,20)
>>> month = Month(date)
>>> month.start
datetime.datetime(2008,5,1,0,0)
>>> month.end
datetime.datetime(2008,6,1,0,0)

Notice that the end of a period is not inclusive in the period.
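The four partial classes and the exclusive period end fit together as a half-open interval test. The following is a standalone sketch of that logic, not the code from the Period class; the function name and the handling of occurrences that only touch a boundary are assumptions.

```python
import datetime


def partial_class(occ_start, occ_end, period_start, period_end):
    """Classify an occurrence against the period [period_start, period_end).

    Returns 0 (begins here), 1 (begins and ends here), 2 (continues
    through), 3 (ends here), or None if the occurrence does not overlap.
    """
    if occ_end <= period_start or occ_start >= period_end:
        return None
    begins_inside = period_start <= occ_start < period_end
    ends_inside = period_start <= occ_end < period_end
    if begins_inside and ends_inside:
        return 1
    if begins_inside:
        return 0
    if ends_inside:
        return 3
    return 2


may_start = datetime.datetime(2008, 5, 1)
may_end = datetime.datetime(2008, 6, 1)  # exclusive, as with month.end

meeting = (datetime.datetime(2008, 5, 20, 14), datetime.datetime(2008, 5, 20, 15))
overnight = (datetime.datetime(2008, 5, 31, 23), datetime.datetime(2008, 6, 1, 1))
print(partial_class(*meeting, may_start, may_end))    # 1
print(partial_class(*overnight, may_start, may_end))  # 0
```

Because the end is exclusive, an occurrence starting exactly at midnight on June 1 belongs to June and not to May, so adjacent periods never claim the same start time.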

To see more information on the Period class you should view the source.

Rules

Rules are how you define the recurrence pattern of an Event. They use rrule from the dateutil module (not included with Python). For more information on rrule, see the dateutil documentation. Rule is a model, so it can be created through the admin interface. As of now the fields are:

Name
The name of the recurrence pattern (e.g. Weekly, Every other Month)
Description
A more verbose definition of the recurrence pattern.
Frequency
Defines the frequency set for the rrule. Must be one of YEARLY, MONTHLY, WEEKLY, DAILY, HOURLY, MINUTELY, or SECONDLY.
Params
This field holds the params that allow you to customize the rrule. It is a list of key-value pairs separated by semicolons (;), with each key separated from its value by a colon (:). The values must be integers or lists of integers. An example would be count:2;byweekday:0,1,2; (see source for more help).
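To illustrate the Params format, here is a minimal sketch of a parser for it. This is not the parsing code from django-schedule; the function name is made up, and treating a single value as an int rather than a one-element list is an assumption.

```python
def parse_params(params):
    """Parse 'key:value;key:v1,v2;' into a dict of ints or lists of ints."""
    parsed = {}
    for pair in params.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate the trailing semicolon
        key, _, raw_value = pair.partition(":")
        values = [int(v) for v in raw_value.split(",") if v.strip()]
        # Assumption: one value becomes an int, several become a list.
        parsed[key.strip()] = values[0] if len(values) == 1 else values
    return parsed


print(parse_params("count:2;byweekday:0,1,2;"))
# {'count': 2, 'byweekday': [0, 1, 2]}
```

The keys in the example, count and byweekday, are keyword arguments that dateutil's rrule accepts, so a dict like this could plausibly be passed along as rrule(freq, **params).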

Eventually the admin will be easier to work with for this model, and it will come with some built-in Rules, like Weekly, Monthly, Yearly, Every Weekday, etc.

Relations

There is a built-in relationship table for relating events to generic objects. This also works with calendars. You do not need to worry about the relationship table, as it all happens behind the scenes. Let's say you want to relate a calendar to a Group, which represents a group of users. This is really simple to do.

>>> group = Group.objects.get(name = "Pythonistas")
>>> cal = Calendar.objects.get(name = "Pythonistas' Calendar")
>>> cal.create_relation(group)
# Now to get that calendar
>>> Calendar.objects.get_calendars_for_object(group)

Both Calendar and Event have create_relation functions. If you know that there should only be one Calendar, you can use get_calendar_for_object. It will return one Calendar or raise Calendar.DoesNotExist. If you only want there to be one calendar but don't know whether it exists yet, you can use get_or_create_calendar.

>>> Calendar.objects.get_or_create_calendar(group, name = "Pythonistas' Calendar")

As you can see, there is an optional keyword argument, name. If the Calendar needs to be created, it will be given that name.

Conclusion

There is some work that still needs to be done. I would like upgraded forms and templatetags, and I am always looking for more features to implement. If you have any comments, you can let us know at the Django-schedule page.

A special thanks to Yann Malet for his help getting event recurrence working.

UPDATE fixed some typos, see yml's and Guenter's post below.

My New Design

Tags: Blog Django Personal

I have finally redesigned (and fixed) my blog. I understand this is a long-awaited, momentous occasion, and, without further ado, here it is, soon to be a hot spot of information with a wealth of insight. A new feature I have added, which many blogs have had before me, is the Link, where I can link to cool things I find on the internet. Some features haven't been implemented yet, such as the buttons on the right, post detail pages, tag pages, a blog roll, and a 'find me on other websites' section.