Thursday, 14 July 2011

MongoDB

MongoDB is an open source, high-performance, schema-free, document-oriented database written in the C++ programming language.
 

Introduction:

With Mongo, you do less "normalization" than you would perform designing a relational schema because there are no server-side joins. Generally, you will want one database collection for each of your top level objects.

You do not want a collection for every "class" - instead, embed objects. For example, in the diagram below, we have two collections, students and courses. The student documents embed address documents and the "score" documents, which have references to the courses.


for any query you can directly mail me at:

Send Mail



Embed vs. Reference

The key question in Mongo schema design is "does this object merit its own collection, or rather should it embed in objects in other collections?" In relational databases, each sub-item of interest typically becomes a separate table (unless denormalizing for performance). In Mongo, this is not recommended - embedding objects is much more efficient. Data is then colocated on disk; client-server turnarounds to the database are eliminated. So in general the question to ask is, "why would I not want to embed this object?"

So why are references slow? Let's consider our students example. If we have a student object and perform:

print( student.address.city );


This operation will always be fast as address is an embedded object, and is always in RAM if student is in RAM. However for:

print( student.scores[0].for_course.name );



if this is the first access to this course, the shell or your driver must execute the query:
// pseudocode for driver or framework, not user code

student.scores[0].for_course = db.courses.findOne({_id:_course_id_to_find_});




General Rules

  • "First class" objects, that are at top level, typically have their own collection.
  • Line item detail objects typically are embedded.
  • Objects which follow an object modelling "contains" relationship should generally be embedded.
  • Many to many relationships are generally by reference.
  • Collections with only a few objects may safely exist as separate collections, as the whole collection is quickly cached in application server memory.
  • Embedded objects are harder to reference than "top level" objects in collections, as you cannot have a DBRef to an embedded object (at least not yet).
  • It is more difficult to get a system-level view for embedded objects. For example, it would be easier to query the top 100 scores across all students if Scores were not embedded.
  • If the amount of data to embed is huge (many megabytes), you may reach the limit on size of a single object.
  • If performance is an issue, embed.



Use Cases

Let's consider a few use cases now.
  • Customer / Order / Order Line-Item
    • orders should be a collection. customers a collection. line-items should be an array of line-items embedded in the order object.
  • Blogging system.
    • posts should be a collection. post author might be a separate collection, or simply a field within posts if only an email address. comments should be embedded objects within a post for performance.


How Many Collections?

As Mongo collections are polymorphic, one could have a collection objects and put everything in it! This approach is taken by some object databases. For performance reasons, we do not recommend this approach. Data within a Mongo collection tends to be contiguous on disk. Thus, table scans of the collection are possible, and efficient. Collections are very important for high throughput batch processing.

schema design basics videosclick here

to download source code of MongoDB click here:source code

No comments:

Post a Comment