MongoDB

Starter

Document-oriented database: each record is a document made up of field-and-value pairs (BSON).

Official site | Docs | Chinese community | Github | Jira (bug reports)

  1. Features:

    • No fixed schema (every record can have a completely different structure)
    • Full index support (single-key/multikey, compound, text, and geospatial indexes)
    • Rich query language (the shell embeds a JavaScript engine and can run JS code directly)
    • Easy redundancy and scaling (Replica Set: data safety and high availability; Sharding: horizontal scaling of data volume)
    • Pluggable storage engines (WiredTiger, MMAPv1, In-Memory, Encrypted, 3rd-party engines)
      • WiredTiger (storage engine):
        • Default storage engine since MongoDB 3.2
        • Highly efficient caching
        • Compresses indexes both in memory and on disk (prefix compression reduces RAM usage)
        • Guarantees that operations on a single document are atomic; no write can atomically span multiple documents or collections
  2. Usage path:

    • CRUD -> creating and using indexes -> complex aggregation queries -> sharding a data set and keeping shards balanced -> backup and restore -> data migration
  3. Deployment path:

    • Standalone -> replica set with redundancy and failover -> large sharded cluster (horizontal scaling, automatic sharding, easily TB-PB scale) -> automated cluster deployment

Data Model

RelationDB vs. MongoDB:

  1. RelationDB:
    • Table -> Record -> field & value
  2. MongoDB:
    • Collection -> Document -> key & value
    • Ways to express relationships between documents:
      • Reference: normalized data model; a document links to another document via a reference value (very similar to a MySQL foreign key, except that MongoDB never enforces that the referenced document actually exists)
      • Embedded Data: denormalized data model; the related data is stored inside the same document structure
    • Note: MongoDB does not restrict the structure of documents in a collection, but in practice documents in the same collection should keep a similar structure
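    • Example (a minimal sketch; the users/orders/customers collections are hypothetical):

        // Reference: the order stores the _id of the related user and is resolved with a second query or $lookup
        var tomId = db.users.insertOne({ name: "Tom" }).insertedId;
        db.orders.insertOne({ userId: tomId, total: 99 });

        // Embedded: the related data lives inside the same document
        db.customers.insertOne({ name: "Tom", orders: [ { total: 99 }, { total: 12 } ] });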

Setting up a MongoDB test environment

Here we use Docker to stand up the MongoDB test environment.

  1. Search for and pull the mongo image
     $ docker search mongo
     $ docker pull mongo
    
  2. Start a container (micro-mongo) as the mongo server

     # create a dedicated data-volume container store-mongo (so data survives a docker machine restart)
     $ docker create --name store-mongo mongo
    
     # create and start a container as the mongo server (micro-mongo), mounting the volumes from store-mongo
     $ docker run --volumes-from store-mongo --name micro-mongo -p 27017:27017 -e MONGO_INITDB_ROOT_USERNAME=mongoadmin -e MONGO_INITDB_ROOT_PASSWORD=123456 -d mongo:latest
    
     # check
     $ docker ps
     $ docker logs micro-mongo
    
  3. Connect:
    • Option 1: exec into the mongo server container (micro-mongo) created above
        $ docker exec -it micro-mongo /bin/bash
        root@480e8bf33600:/# mongo -u mongoadmin -p 123456 --authenticationDatabase admin
        > show dbs
        admin   0.000GB
        config  0.000GB
        demo    0.000GB
        local   0.000GB
      
    • Option 2: start a throwaway container (mongo-client) as a mongo client linked to micro-mongo
        $ docker run -it --rm --link micro-mongo:mongod --name mongo-client mongo:latest mongo -host mongod -u mongoadmin -p 123456 --authenticationDatabase admin demo
        > show dbs
        admin   0.000GB
        config  0.000GB
        demo    0.000GB
        local   0.000GB
      
    • Option 3: use the GUI client MongoDB Compass
  4. Import test data
     # load(filename): Loads and runs a JavaScript file into the current shell environment
     > load("testData.js")
    

Server/Client

  • Server: start the service with the mongod command

      $ vi conf/mongod.conf
      port = 12345
      dbpath = data
      logpath = log/mongod.log
      fork = true
    
      $ mongod -f conf/mongod.conf
    
  • Client: connect to the database with the mongo command

      $ mongo --help
      $ mongo localhost:12345/testdb
      > use admin
      > db.shutdownServer()
    

Mongo Shell

The MongoDB shell has a built-in JavaScript engine and can execute JS code directly.

# List all available databases
show dbs

# Show the database currently in use
db

# Switch the current database context, i.e. the db in use
# Note: you can switch to a non-existent db; it is created automatically when db.<collection>.insert(...) is first run
use <database>

# List the collections (tables) of the current db
show collections
show tables


# Pretty-print query results
db.myCollection.find().pretty()
# Plain and JSON printing use the shell helper functions print() and printjson(), e.g.:
printjson(db.myCollection.findOne())

# Multi-line input
# When a line ends with '(', '{' or '[', the following lines start with "..." until the matching ')', '}' or ']' is typed
if(x>0){
...count++;
...print(x);
...}


# Tab completion
# e.g. <Tab> below lists the collection methods that start with 'c'
db.myCollection.c<Tab>

ACID

Transaction

Property    | MongoDB                        | MySQL (InnoDB)
Atomicity   | single-document atomicity      | multi-row atomicity
Consistency | strong or eventual consistency | strong consistency
Isolation   | read committed                 | repeatable read
Durability  | journal and replication        | journal

Atomicity

  • Supported: single-document atomicity
      db.users.update({ username:"Tom"},{$set:{salary:5000}});
    
  • Not yet supported (before 4.0): multi-document / multi-statement atomicity
      db.users.update({ salary:{$lt:5000}},{$set:{salary:5000}},{multi:true});
    
    • Starting state:
      username | salary
      James    | 3000
      Tom      | 4000
      Melody   | 4500
      Frank    | 2500
      Kelly    | 3500
      Lucy     | 7600
    • Ending state when the update fails half-way (e.g. the server crashes):
      username | salary
      James    | 5000
      Tom      | 5000
      Melody   | 5000
      Frank    | 2500   <- crash
      Kelly    | 3500
      Lucy     | 7600

Consistency

Handling multi-document consistency:

  • Avoid the need for it through data modeling
  • Two-phase commit
  • Write a log and intervene manually

Notes:

  • Traditional databases: consistency rules such as primary/foreign key checks
  • Distributed databases: consistency across nodes ("read your writes"); MongoDB offers tunable consistency

Isolation

Isolation Level  | Default Setting
Serializable     | /
Repeatable Read  | MySQL
Read Committed   | PostgreSQL, Oracle
Read Uncommitted | MongoDB

Durability

  • Mechanism: MongoDB vs. traditional databases
  • Single-node writes in MongoDB
  • Multi-node writes in MongoDB

Write concern (WriteConcern)

{ w: <value>, j: <boolean>, wtimeout: <number> }
  • Specifies the acknowledgement behaviour MongoDB must provide for a write operation
  • Can be set at the connection level or per write operation (the last parameter of insert/update/delete)
  • Supported values:
    • w: 0/1/n/majority/tag
    • j: true/false or 0/1
    • wtimeout: millis -- only applicable for w > 1
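
A write concern can be attached to a single operation or set as a connection-level default; a small sketch (host names are illustrative):

    // per-operation: last parameter of the write
    db.users.insert({ name: "Tom" }, { writeConcern: { w: "majority", j: true, wtimeout: 5000 } })

    // connection-level default, e.g. in the connection string
    // mongodb://host1:27017,host2:27017/demo?replicaSet=rs0&w=majority&journal=true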

In detail:

  1. w: write acknowledgement

    • w:0 Unacknowledged (no acknowledgement at all)

      • Default behaviour up to version 2.2
      • Vulnerable to dropped packets, crashes and invalid data (the main culprit of lost data in early versions)
    • w:1 Acknowledged (mongod acknowledges once the write is applied in memory)

      • Default behaviour since version 2.4
      • Detects network failures, invalid data and similar error conditions
      • A crash can still lose up to ~100ms of data
    • w:2/n/majority Replica Acknowledged (the acknowledgement is sent only after the data has been replicated to n nodes / a majority of nodes)

  2. j: journal (recovery log)

    • Used to restore in-memory data after a crash and guarantee durability; the journal cooperates with checkpoints:
      • Find the identifier of the last checkpoint in the data files;
      • Locate the record matching that checkpoint identifier in the journal file;
      • Replay every operation recorded after it;
    • Flush interval:
      • MMAP: 30~100ms
      • WiredTiger: 100MB of journal data per checkpoint
      • e.g. a checkpoint is taken every 60s, or once 2GB of journal data has been written
    • Flushing is asynchronous by default; j:1 forces a synchronous flush
    • j:1/true Journaled: force a synchronous journal flush (the write acknowledgement is sent only after the journal has been flushed)

Examples:

  1. Scenario: insert some invalid data (e.g. 10 documents sharing the same _id) and check how many documents are actually inserted

    • Without write concern {w:0}: no error is reported; you might assume 10 documents were written, but only 1 exists

        > db.test.count()
        0
        > for(var i=0;i<10;i++){
            var res=db.test.insert({_id:10,a:i},{writeConcern:{w:0}})
            if(!res.getWriteError())
                print("Inserted doc #"+(i+1));
            else
                print(res.getWriteError().errmsg);
        }
      
        Inserted doc #1
        Inserted doc #2
        Inserted doc #3
        Inserted doc #4
        Inserted doc #5
        Inserted doc #6
        Inserted doc #7
        Inserted doc #8
        Inserted doc #9
        Inserted doc #10
      
        > db.test.count()
        1                            # expected 10
      
    • With write concern {w:1} (the default since 2.4): write errors are acknowledged and returned to the client; 1 document exists, as expected

        > db.test.count()
        0
        > for(var i=0;i<10;i++){
            var res=db.test.insert({_id:10,a:i},{writeConcern:{w:1}})
            if(!res.getWriteError())
                print("Inserted doc #"+(i+1));
            else
                print(res.getWriteError().errmsg);
        }
      
        Inserted doc #1
        E11000 duplicate key error index: test.test.$_id_ dup key: { : 10.0 }
        E11000 duplicate key error index: test.test.$_id_ dup key: { : 10.0 }
        E11000 duplicate key error index: test.test.$_id_ dup key: { : 10.0 }
        E11000 duplicate key error index: test.test.$_id_ dup key: { : 10.0 }
        E11000 duplicate key error index: test.test.$_id_ dup key: { : 10.0 }
        E11000 duplicate key error index: test.test.$_id_ dup key: { : 10.0 }
        E11000 duplicate key error index: test.test.$_id_ dup key: { : 10.0 }
        E11000 duplicate key error index: test.test.$_id_ dup key: { : 10.0 }
        E11000 duplicate key error index: test.test.$_id_ dup key: { : 10.0 }
      
        > db.test.count()
        1                            # expected 1
      
  2. Scenario: data loss on a crash (e.g. with w:1, write data continuously at high speed, kill -9 the mongod process, restart mongod, and compare the count the client reported with the count actually stored)

     function journalDataLoss(journal){
         var count=0,start=new Date();
         try{
             var docs=[];
             for(var i=0;i<1000;i++)
                 docs.push({a:i});
             while(true){
                 var res=db.test.insert(docs,{writeConcern:{j:journal}})        // 0/1
                 count+=res.nInserted;
                 if(count%100000==0)
                     print("inserted "+count+" time used:"+(((new Date()).getTime()-start.getTime())/1000)+" seconds");    
             }
         }catch(error){
             print("Total doc inserted successfully:"+count);
         }
     }
    
    • j:0 do not flush the journal synchronously

        > journalDataLoss(0)
        Inserted 10000 times used: 3.579 seconds
        Inserted 20000 times used: 7.123 seconds
        ...
        ...                                         <- execute: kill -9 mongod
        Total doc inserted successfully:715000
      
        # restart mongod and check how many documents were actually inserted
        > db.test.count()
        713000                        # data lost (less than the 715000 reported)
      


    • j:1 flush the journal synchronously

        > journalDataLoss(1)
        Inserted 10000 times used: 4.579 seconds
        Inserted 20000 times used: 8.123 seconds
        ...
        ...                                         <- execute: kill -9 mongod
        Total doc inserted successfully:726000
      
        # restart mongod and check how many documents were actually inserted
        > db.test.count()
        726000                        # no data lost
      
  3. Scenario: data loss on a primary failover (e.g. with w:1 / w:majority and j:1, write data continuously at high speed, kill -9 the primary mongod, connect to the new primary, and compare the actually stored data with what the client reported)

    • w:1 -- writes acknowledged only by the old primary may be rolled back after the failover
    • w:majority -- the acknowledgement is sent only after the data has reached a majority of nodes, which avoids the rollback

Summary: MongoDB data safety

Read preference describes how MongoDB clients route read operations to the members of a replica set. By default, an application directs its read operations to the primary member in a replica set.
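
For example (a sketch using the shell's readPref() cursor helper; reading from a secondary here is purely illustrative):

    // route this query to a secondary member instead of the primary
    db.users.find({ status: "A" }).readPref("secondary")

    // or set a default mode in the connection string (hypothetical hosts)
    // mongodb://host1,host2,host3/demo?replicaSet=rs0&readPreference=secondaryPreferred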

CRUD

Create(insert)

Insert documents into a collection. Notes:

  • An operation on a single document is atomic
  • Every document has a unique _id field as its primary key (if none is supplied, MongoDB generates an ObjectId for _id)
  • The value of _id cannot be modified
db.myCollection.insert({...})              # returns a WriteResult object
db.myCollection.insert([{},{},..])         # returns a BulkWriteResult object

db.myCollection.insertOne({...})
db.myCollection.insertMany([{},{},..])

Read(find)

db.myCollection.find(<query filter>,<projection>)
db.myCollection.findOne(...)

// format of the <query filter> / <projection> parameters:
{
  <field1>: <value1>,
  <field2>: { <operator>: <value> },
  ...
}
  1. <query filter>

    • value:
      • a literal value
      • { <operator>: <value> }: specify conditions
    • operator:

        #  Comparison
        $eq,$gt,$gte,$lt,$lte,$ne,$in,$nin
      
        # Logical
        $or,$and,$not,$nor
      
        # Element
        $exists,$type
      
        # Evaluation
        $mod,$regex,$text,$where
      
        # Geospatial
        $geoWithin,$geoIntersects,$near,$nearSphere
      
        # Array
        $all,$elemMatch,$size
      
        # Bitwise
        $bitsAllSet,$bitsAnySet,$bitsAllClear,$bitsAnyClear
      
        # Comments
        $comment
      
  2. <projection>

    • value:
      • 1/true: include the field in the returned documents
      • 0/false: exclude the field from the returned documents
      • { <operator>: <value> }: specify conditions
    • operator:
        $,$elemMatch,$slice,$meta
      
    • Note: you cannot use an array index to project a specific array element, e.g. { "ratings.0": 1 } -- wrong
  3. Examples:

    • Queries

        db.users.find( { status:"A" } )
        db.users.find( { status:{ $in:["P","D"] } } )
        db.users.find({
            status:"A",
            $or:[ {age:{$lt:30}},{type:1} ]
        })
      
        # matching an embedded document
        db.users.find({
            favorites:{artist:"Picasso"}
        })
        db.users.find({
            "favorites.artist":"Picasso"
        })
      
        # exact array match
        db.users.find({
            badges:["blue","black"]
        })    
      
        # match arrays containing black
        db.users.find({
            badges:"black"
        })    
      
        # match arrays whose first element is black
        db.users.find({
            "badges.0":"black"
        })    
      
        # find documents whose finished array contains at least one element that is both greater than 15 and less than 20
        # $elemMatch applies compound conditions to array elements: at least one element must satisfy all conditions
        db.users.find({ 
            finished: { $elemMatch: { $gt: 15, $lt: 20 } } 
        })
      
        # find documents whose finished array contains an element greater than 15 OR an element less than 20
        db.users.find({ 
            finished: { $gt: 15, $lt: 20 } 
        })
      
    • Returning only some fields (projection)

        // return the _id, name and status fields
        db.users.find(
            {status:"A"},
            {name:1,status:1}
        )
      
        // return the name and status fields only
        // Note: apart from _id, you cannot mix inclusion and exclusion in the same projection
        // Eg: name:1,status:0 -- wrong!
        db.users.find(
            {status:"A"},
            {name:1,status:1,_id:0}
        )
      
        // return documents without the favorites and points fields
        db.users.find(
            {status:"A"},
            {favorites:0,points:0}
        )
      
        // return the _id, name, status fields and the bonus field of points
        db.users.find(
            {status:"A"},
            {name:1,status:1,"points.bonus":1}
        )
      
        // use the $slice projection operator to return the last element of the points array
        db.users.find(
            {status:"A"},
            {name:1,status:1,"points":{$slice:-1}}
        )
      
  4. Notes:

    • Querying fields that are null or missing

        { "_id" : 900, "name" : null },
        { "_id" : 901 }
      
      • db.users.find({name:null}): matches documents whose name is null AND documents without a name field (note: if a sparse index is used, only name:null documents are matched, not missing ones)
      • db.users.find({name:{$type:"null"}}): matches only documents whose name is null ($type checks the BSON type)
      • db.users.find({name:{$exists:false}}): matches only documents without a name field ($exists checks for existence)
    • Cursors

        var myCursor=db.users.find({type:"string"})
      
        while(myCursor.hasNext()){
            printjson(myCursor.next())
        }
      
        myCursor.forEach(printjson);
      
        myCursor.forEach(function(myDoc){
            print("user:"+myDoc.name);
        })
      
        var myArray=myCursor.toArray();
        myArray[3];
      
        myCursor[1]             // same as myCursor.toArray()[1]
      
      • find() returns a cursor; if it is not assigned to a variable, the shell iterates it automatically and prints 20 documents per batch (DBQuery.shellBatchSize changes that number)
      • Cursor methods: count, hint, forEach, map, limit, sort, size, skip, toArray, ... More

Update

db.myCollection.update(<query filter>,<update document>,<option>)    # by default updates only ONE document
db.myCollection.updateOne(...)
db.myCollection.updateMany(...)
db.myCollection.replaceOne(...)

// more:
db.myCollection.findOneAndReplace()
db.myCollection.findOneAndUpdate()
db.myCollection.findAndModify()
  1. <query filter>: same as above
  2. <update document>:

    • { <field1>: <value1>, ... }
    • {<update operator>: { <field1>: <value1>, ... },...}
    • update operators:

        # Field
        $inc
        $mul
        $rename
        $setOnInsert
        $set
        $unset
        $min
        $max
        $currentDate
      
        # Array
        $
        $[]
        $[<identifier>]
        $addToSet
        $pop
        $pull
        $push
        $pullAll
      
        # Modifiers
        $each
        $position
        $slice
        $sort
      
        # Bitwise
        $bit
      
  3. <option>

    • multi: false/true -- whether to update multiple documents
    • upsert: false/true -- whether to insert a new document when no match exists (see the upsert sketch after the examples below)
  4. Examples:

     db.users.update(
        { name: "xyz" },
        { name: "mee", age: 25, type: 1, status: "A", favorites: { "artist": "Matisse", food: "mango" } }
     )
    
     db.users.update(
         {status:"A"},
         {
             $set:{status:"B",type:0},
             $currentDate:{lastModified:true}
         },
         {multi:true}
     )
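
     A minimal upsert sketch (the filter and fields are illustrative): with upsert:true a new document is inserted when nothing matches.

     db.users.update(
        { name: "unknown" },
        { $set: { status: "A", type: 0 } },
        { upsert: true }            // inserts { name:"unknown", status:"A", type:0 } when no match exists
     )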
    

Delete

// delete
db.myCollection.deleteOne(<query filter>)

// delete all matched
db.myCollection.deleteMany(...)
db.myCollection.remove(<query filter>, <justOne>)

// delete everything, including indexes
db.myCollection.drop()

// delete the first document according to a specified sort order
db.myCollection.findOneAndDelete(<query filter>)

Examples:

// delete all documents
db.users.deleteMany({})
db.users.remove({})

// delete all matched
db.users.remove({status:"D"})
db.users.deleteMany({status:"D"})

// delete just one
// <justOne>: 1
db.users.remove({status:"D"},1)
db.users.deleteOne({status:"D"})

WriteConcern

For safe writes, you can specify the acknowledgement level MongoDB requires for the write operation (the last parameter of insert/update/delete)

{ w: <value>, j: <boolean>, wtimeout: <number> }
  • w: 0/1/"majority" -- write acknowledgement
  • j: true/false -- journal
  • wtimeout -- only applicable for w > 1

Indexes

  • Pros: speed up the queries that can use the index;
  • Cons: consume extra disk space and slow down writes

CRUD

  1. Create an index

     // <field>: < 1 or -1 >
     // 1 : ascending, -1 : descending
     db.collection.createIndex( <Key Index specification>, <options>)
    
  2. List indexes

     db.collection.getIndexes()
    
  3. Drop indexes

     db.collection.dropIndex({...})
     db.collection.dropIndexes()
    
  4. Rebuild indexes: drops all indexes, including the _id index, then rebuilds them (all indexes in the background, the _id index in the foreground, which takes the database's write lock)

     db.collection.reIndex()
    
  5. Validate indexes

     // internal command that scans the collection's data and indexes for correctness
     db.collection.validate()
    

Single-field / compound / multikey indexes

  1. Single Field Index

     db.users.createIndex({age:-1});
    
    • Can be built on any field (including embedded fields)
    • Default index on _id, of type ObjectId (replaces an auto-increment id and solves the problem of generating unique identifiers in a distributed MongoDB deployment); 12 bytes: Timestamp(4) + MachineIdentifier(3) + ProcessIdentifier(2) + Counter(3)
    • Direction is configurable (1: ascending, -1: descending)
  2. Compound Index

     db.users.createIndex( { username: 1, age: -1 } )
     db.users.createIndex( { username: 1, age: 1 } )
     // supports lookups by username and age
     db.users.find({username:"Tom",age:5});
    
    • Built from several different fields
    • Note: the two indexes above are completely different; their B+ trees are stored on disk in completely different orders and queries will pick different indexes, so think about the query patterns when creating indexes and avoid useless ones.
  3. Multikey Index

     //{ _id: 1, item: "ABC", ratings: [ 2, 5, 9 ] }
     db.survey.createIndex( { ratings: 1 } )
    
    • Indexing a field that holds an array creates an index entry for every array element, speeding up lookups of array elements
    • Note: there is no need to declare the multikey type explicitly; MongoDB decides automatically whether the index needs to be multikey

Text index

Text Index (also called a full-text index)

  • Supports text search over string content
  • A collection can have only one text index, but that index can cover multiple fields
  • Create a text index (name:"text"):
      db.stores.createIndex({name:"text",description:"text"})        # text index on the name and description fields
    
  • Query: use the $text operator to run text search on a collection that has a text index
      {
        $text:
          {
            $search: <string>,
            $language: <string>,
            $caseSensitive: <boolean>,
            $diacriticSensitive: <boolean>
          }
      }
    
  • Query examples:
    • $text tokenizes the search string on whitespace and punctuation (terms are ORed)
        // contains aa or bb or cc
        db.stores.find({ $text:{$search:"aa bb cc"} })
        // contains (aa or bb) and not cc
        db.stores.find({ $text:{$search:"aa bb -cc"} })
        // contains aa or the phrase "bb cc"
        db.stores.find({ $text:{$search:"aa \"bb cc\""} })
      
    • $text + $meta exposes the match relevance
        # sort by an additional field plus the text search score and return the top 2 matching documents
        # score: a relevance score describing how well the document matches the query
        # project the $meta:"textScore" field explicitly and then sort on it
        db.stores.find(
            {$text:{$search:"aa bb cc"}},
            {score:{$meta:"textScore"}}
        ).sort({ date:1, score:{$meta:"textScore"} }).limit(2)
      
    • In the aggregation pipeline, text search is used in a $match stage (restrictions: it must be the first stage of the pipeline, $text may appear only once, and $text cannot appear inside $or or $nor)
        db.articles.aggregate([
            { $match:{ $text:{$search:"aa bb"} } },
            { $sort:{ score:{$meta:"textScore"} } },
            { $project:{ title:1,_id:0 } }
        ])
        db.articles.aggregate([
            { $match:{ $text:{$search:"aa bb"} } },
            { $project:{ title:1,_id:0,score:{$meta:"textScore"} } },
            { $match:{ score:{$gt:1.0} } }
        ])
      
  • Note: for more sophisticated full-text search, Elasticsearch is recommended

Index properties

Configurable <options> at index creation time that give the index special behaviour

db.collection.createIndex( <Key Index specification>, <options>)
  1. expireAfterSeconds (TTL): automatically removes documents from the collection after a period of time

     db.users.createIndex(
         {lastModifiedDate:1},
         {expireAfterSeconds:3600}
     )
     // expire each document at the time stored in its expireAt field
     db.log_events.createIndex( 
         { "expireAt": 1 }, 
         { expireAfterSeconds: 0 } 
     )
    
    • The indexed field must hold a date or an array of dates (for an array, the document is removed once the lowest value passes the expiry threshold)
    • Expired documents are not guaranteed to disappear immediately; the background task that removes them runs every 60 seconds
    • Use cases: machine-generated event data, logs, session data -- anything that only needs to live in the database for a limited time
  2. unique

     db.members.createIndex( { "user_id": 1 }, { unique: true } )
    
  3. collation: case-insensitive indexes

     db.fruit.createIndex( 
         { type: 1},
         { collation: { locale: 'en', strength: 2 } }
     )
    
  4. sparse: documents that do not contain the indexed field are left out of the index; often combined with unique

     db.users.createIndex({name:1},{sparse:true})
    
  5. partialFilterExpression (partial indexes): an evolution of sparse indexes that indexes only the documents matching a filter expression, making the index cheaper

     db.users.createIndex(
         {name:1},
         {unique:true,partialFilterExpression:{age:{$gt:18}}}
     )
    
    • Cannot be used as a shard key
    • _id cannot have a partial index
    • The same index cannot also be sparse
    • A key cannot have several different partial indexes
    • Filter operators supported by partialFilterExpression:
        $eq,$gt,$gte,$lt,$lte,
        $exists:true,
        $type,
        $and -- at top level
      
  6. background: building the index in the background is slower than the default foreground build, but it does not lock the collection (background:true is usually preferable in production)

     db.users.createIndex({username:1},{background:true})
    

Measuring index usage

  1. mongostat -- a tool that reports the running status of mongodb
     mongostat --help
     mongostat -h localhost:12345
    
  2. The profile collection
     db.getProfilingLevel()
     db.getProfilingStatus()
     db.setProfilingLevel(2)
     show tables
     db.system.profile.find().sort({$natural:-1}).limit(10)
    
  3. Logs

     # configure logging
     vim conf/mongod.conf
     ...
     verbose = vvvvv    # v,vv,...,vvvvv
    
  4. Check space usage (note: make sure indexes fit in memory)

     // indexSizes: size of each index on disk
     db.users.stats().indexSizes
     // indexDetails: size of each index in memory (recently used indexes are kept in memory)
     db.users.stats({indexDetails:true}).indexDetails
     // totalIndexSize: total size of all indexes
     db.users.totalIndexSize()
    
  5. $indexStats -- usage statistics for each index

     // { $indexStats: { } } returns documents with name, key, host and accesses fields
     db.orders.aggregate([ 
         { $indexStats: { } } 
     ])
     db.orders.aggregate([ 
         { $indexStats: { } } ,
         {$match:{name:"_id_ type_1_item_1"}}
     ])
    
     // Eg: Return
     {
        "name" : "_id_",
        "key" : {"_id" : 1},
        "host" : "examplehost.local:27017",
        "accesses" : {
           "ops" : NumberLong(0),
           "since" : ISODate("2015-10-02T14:31:32.479Z")
        }
     }
     {
        "name" : "type_1_item_1",
        "key" : {"type" : 1,"item" : 1},
        "host" : "examplehost.local:27017",
        "accesses" : {
           "ops" : NumberLong(1),
           "since" : ISODate("2015-10-02T14:31:58.321Z")
        }
     }
    
  6. explain(): returns the query plan, i.e. detailed information about how the query executes

     // returns a document with execution statistics: the index used, number of documents scanned, milliseconds spent, etc.
     db.users.find(...).explain("executionStats")
    
  7. hint(): takes an index as its argument and forces MongoDB to use that index for the query

     db.people.find(
        { name: "John Doe", zipcode: { $gt: "63000" } }
     ).hint( { zipcode: 1 } ).explain("executionStats")
    
     // pass the $natural operator to prevent MongoDB from using any index (i.e. the query will not use an index)
     db.people.find(
        { name: "John Doe", zipcode: { $gt: "63000" } }
     ).hint( { $natural: 1 } ).explain("executionStats")
    

索引优化

索引策略:

  • 创建索引以支持查询
  • 使用索引来排序查询结果
  • 确保索引与内存相适应
  • 创建能确保选择力的查询

优化策略:

  1. 重复率越低越适合做索引(distinct/count 越接近1越适合),例如状态,性别等重复率不适合
  2. 联合索引,索引前缀由低到高,eg: db.test.createIndex({a:1,b:1,c:1}),则 a b c, a b, a
  3. Field order in a compound index: equality fields first, then sort fields, then range fields
     // query
     db.test.find({
         a:2,
         b:{$gt:2,$lte:10}
     }).sort({c:1})
     // a better index for this query is:
     db.test.createIndex({a:1,c:1,b:1})
    
  4. Use covered queries where possible
     db.test.createIndex({a:1,b:1,c:1})
     db.test.find({a:3},{b:1,c:1,_id:0})     # select b,c from test where a=3 -> uses the index, and every returned field is in the index
    
  5. Build heavy index sets on a dedicated secondary so the primary's writes are unaffected and its working set is not evicted from cache (useful for BI/reporting workloads that need many extra indexes); steps (a sketch of step 1 follows this list):
    1. Set the secondary's priority to 0 (so it can never become primary)
    2. Restart that secondary in standalone mode (comment out the replica-set settings in its config file)
    3. Build the indexes
    4. Restart the node in replica-set mode (restore the config file)
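
     A sketch of step 1 using the replica-set helpers (run on the primary; member index 1 is a hypothetical secondary):

     // lower the member's priority to 0 so it can never be elected primary
     cfg = rs.conf()
     cfg.members[1].priority = 0
     rs.reconfig(cfg)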

Aggregation

MongoDB provides three ways to run aggregations:

  • Aggregation pipeline
  • Map-Reduce
  • Single-purpose aggregation methods (group, count, distinct)

Aggregation Pipeline

  • A data-processing pipeline built from native operators, so it is efficient (supports MySQL-like GROUP BY functionality)
  • Every document flows through a pipeline of stages; each stage transforms the documents (grouping, filtering, ...) and passes the result on, producing the final output
  • Limitations:
    • Each pipeline stage is limited to 100MB of RAM and errors out beyond that (set allowDiskUse:true to let stages spill to temporary files; see the sketch after this list)
    • A result returned as a single document is limited by the 16MB BSON document size (since 2.6, db.collection.aggregate() returns a cursor, so result sets of any size can be returned)
    • Can run on a sharded collection, but cannot write its output to a sharded collection (Map-Reduce can do both)
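
A short allowDiskUse sketch (the articles collection and the $sort stage here are just illustrative):

    // let stages that exceed the 100MB in-memory limit spill to temporary files on disk
    db.articles.aggregate(
        [ { $sort: { postDate: 1 } } ],
        { allowDiskUse: true }
    )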


// option 1:
db.collection.aggregate(pipeline, options)

// option 2:
db.runCommand({
  aggregate: "<collection>" || 1,
  pipeline: [ <stage>, <...> ],
  // options (the fields below are optional)
  explain: <boolean>,
  allowDiskUse: <boolean>,
  cursor: <document>,
  maxTimeMS: <int>,
  bypassDocumentValidation: <boolean>,
  readConcern: <document>,
  collation: <document>,
  hint: <string or document>,
  comment: <string>,
  writeConcern: <document>
})
  1. The pipeline parameter:

     [<stage>,<Stage>,...]
    
     //<stage>
     $stageOperator:{<key>:<value>,<key>:<expression>...}
    
     // <expression> 
     { <expressionOperator>:<arg> }
     { <expressionOperator>:[<arg1>,<arg2>,... <argN>] }
    
    • Stage operators

        $match          // filter documents and pass them to the next stage
        $project        // projection: select, rename, or derive fields
        $group          // group documents by a key and compute aggregates
        $unwind         // split an array field into one document per array element
        $lookup         // left outer join with another collection
        $sort           // sort the input documents
        $skip           // skip a number of documents and pass on the rest
        $limit          // limit the number of documents passed on
        $redact         // prune documents based on the level of the document structure, usually combined with a conditional ($cond)
        $sample         // random sample of documents
        $geoNear        // geospatial analysis
        $indexStats     // usage statistics for each index of the collection { $indexStats: { } }
        $out            // write the final result to the given collection; must be the last stage of the pipeline
        ...
      
        // common combinations
        $match -> $project
        $match -> $group
        $match -> $group -> $sort
        $match -> $match -> $ project -> $group
        $match -> $lookup -> $match -> $sort
        $match -> $project -> $sort -> $skip -> $limit
        $match -> $project -> $unwind -> $group -> $sort ->$skip -> $limit
      
    • Expression operators

        // 1. Boolean operators
        // eg: $or: [ { $gt: [ "$qty", 250 ] }, { $lt: [ "$qty", 200 ] } ]
        $and,$or,$not
      
        // 2. Conditional operators
        // eg: $cond: { if: { $gte: [ "$qty", 250 ] }, then: 30, else: 20 }
        $cond,$ifNull,$switch
      
        // 3. Data type operators
        $type
      
        // 4. Set operators
        // eg: $setUnion: [ "$A", "$B" ] 
        $setEquals                                     // sets are exactly equal
        $setIsSubset                                   // set is fully contained in another
        $anyElementTrue                                // true if any element of the set is true
        $allElementsTrue                               // true if all elements of the set are true
        $setIntersection                               // intersection
        $setUnion                                      // union
        $setDifference                                 // difference
      
        // 5. Comparison operators
        $cmp                                         // 0/1/-1
        $eq,$gt,$lt,$gte,$lte,$ne
      
        // 6. Arithmetic operators
        // eg: $abs: { $subtract: [ "$start", "$end" ] } 
        $abs                                         // absolute value
        $add, $subtract, $multiply, $divide, $mod
        $ceil, $floor ,$trunc
        $log, $log10, $sqrt, $pow, $exp
      
        // 7. String operators
        // eg: $split: ["$city", ", "]
        $concat, $split
        $trim, $ltrim, $rtrim
        $toLower, $toUpper
        $toString
        $dateFromString, $dateToString
        $strcasecmp                                  // case-insensitive, returns 0/1/-1
        $substr                                      // Deprecated!
        $substrBytes, $substrCP
        $indexOfBytes, $indexOfCP
        $strLenBytes,$strLenCP
      
        // 8. Array operators
        // eg: $arrayElemAt: [ "$favorites", -1 ]
        $arrayElemAt
        $map
        $filter
        $slice                                      // subset of an array
        $zip                                        // merge two arrays
        $reduce                                     // combine the elements into a single value
        $isArray, $arrayToObject, $objectToArray
        $concatArrays
        $reverseArray
        $indexOfArray
        $in
        $range
        $size
      
        // 9. Date operators
        $dateFromParts, $dateToParts
        $dateFromString, $dateToString
        $dayOfMonth, $dayOfWeek, $dayOfYear
        $year,$month, $week, $hour, $minute, $second, $millisecond
        $isoDayOfWeek, $isoWeek, $isoWeekYear
        $toDate
      
        ...
      
  2. The options parameter:

     {
       explain: <boolean>,
       allowDiskUse: <boolean>,
       cursor: <document>,
       maxTimeMS: <int>,
       bypassDocumentValidation: <boolean>,
       readConcern: <document>,
       collation: <document>,
       hint: <string or document>,
       comment: <string>,
       writeConcern: <document>
     }
    

Examples:

  1. Prepare Test data

     // catalogues: name,description
     db.catalogues.insert([
     {name:"Spring",description:"spring framework"},
     {name:"ReactJS",description:"reactJS front framework"},
     {name:"NoSql",description:"not only sql databases"}
     {name:"Docker",description:"Build, Ship, and Run Any App, Anywhere"}
     ]);
    
     // articles: title,author,description,tags,catalogue,postDate,content
     db.articles.insert([
     {title:"Spring Basic",author:"Tom",description:"introduce spring basic",tags:["java","spring"],catalogueId:db.catalogues.findOne({name:"Spring"})._id,postDate:"2015-01-01",content:"spring basic:ioc,aop",click:1},
     {title:"Spring MVC",author:"Tom",description:"introduce spring mvc",tags:["java","spring","mvc"],catalogueId:db.catalogues.findOne({name:"Spring"})._id,postDate:"2015-01-11",content:"spring mvc:dispatchServlet,restful",click:5},
     {title:"Spring Security",author:"Tom",description:"introduce spring security",tags:["java","spring","security"],catalogueId:db.catalogues.findOne({name:"Spring"})._id,postDate:"2015-01-21",content:"spring security:securityFilter,authentication,accessDecide",click:20},
     {title:"ReactJS Basic",author:"Lucy",description:"introduce reactJS front framework basic",tags:["front","reateJS"],catalogueId:db.catalogues.findOne({name:"ReactJS"})._id,postDate:"2015-02-01",content:"reactJS basic:component,lifecycle,props,state",click:30},
     {title:"ReactJS Flux",author:"Lucy",description:"introduce reactJS Flux",tags:["front","reateJS"],catalogueId:db.catalogues.findOne({name:"ReactJS"})._id,postDate:"2015-02-11",content:"reactJS Flux:reflux,redux"},
     {title:"Redis",author:"Jack",description:"introduce redis key-value db",tags:["nosql","redis"],catalogueId:db.catalogues.findOne({name:"NoSql"})._id,postDate:"2015-03-11",content:"redis:install,master-slave,persist,subscribe,crud",click:0},
     {title:"MongoDB",author:"Jack",description:"introduce mongo document database",tags:["nosql","mongodb"],catalogueId:db.catalogues.findOne({name:"NoSql"})._id,postDate:"2015-03-21",content:"mongodb:mongo shell,crud,index,aggregation,replica,sharding",click:25}
     ]);
    
  2. $lookup: Join

     # $lookup
     # {
     #      $lookup: {
     #           from: <collection to join>,
     #           localField: <field from the input documents>,
     #           foreignField: <field from the documents of the "from" collection>,
     #           as: <output array field>
     #      }
     # }
     > db.catalogues.aggregate([ 
         {$project:{id:1,name:1}},
         { $lookup:{ from:"articles", localField:"_id", foreignField:"catalogueId" ,as:"articles"} },
         {$project:{"_id":0,"name":1,"articles.title":1,"articles.click":1}} 
     ])
     { "name" : "Spring", "articles" : [ { "title" : "Spring Basic", "click" : 1 }, { "title" : "Spring MVC", "click" : 5 }, { "title" : "Spring Security", "click" : 20 } ] }
     { "name" : "ReactJS", "articles" : [ { "title" : "ReactJS Basic", "click" : 30 }, { "title" : "ReactJS Flux" } ] }
     { "name" : "NoSql", "articles" : [ { "title" : "Redis", "click" : 0 }, { "title" : "MongoDB", "click" : 25 } ] }
     { "name" : "Docker", "articles" : [ ] }
    
  3. $group: group documents by the value of a given field (not a streaming operation: all input documents must be received before results can be produced)

     > db.articles.count()
     7
    
     # 1. $sum
     # select count(*) as count from articles;
     > db.articles.aggregate([
         { $group:{_id:"null",count:{$sum:1}} }
     ])
     { "_id" : "null", "count" : 7 }
    
     # select catalogueId as _id,count(*) as count as from articles group by catalogueId
     > db.articles.aggregate([
         { $group:{_id:"$catalogueId",count:{$sum:1}} }
     ])
     { "_id" : ObjectId("5b8e342212b995b45c17d5ec"), "count" : 2 }
     { "_id" : ObjectId("5b8e342212b995b45c17d5eb"), "count" : 2 }
     { "_id" : ObjectId("5b8e342212b995b45c17d5ea"), "count" : 3 }
    
     # select author as _id, sum(click) as total_click from articles group by author
     > db.articles.aggregate([
         { $group:{_id:"$author",total_click:{$sum:"$click"}} }
     ])
     { "_id" : "Lucy", "total_click" : 30 }
     { "_id" : "Jack", "total_click" : 25 }
     { "_id" : "Tom", "total_click" : 26 }
    
     # 2. $max/$min
     # select author as _id, max(click) as max_click from articles group by author
     > db.articles.aggregate([
         { $group:{_id:"$author",max_click:{$max:"$click"}} }
     ]) 
     { "_id" : "Lucy", "max_click" : 30 }
     { "_id" : "Jack", "max_click" : 25 }
     { "_id" : "Tom", "max_click" : 20 }
    
     # 3. $avg
     # select author as _id, avg(click) as avg_click from articles group by author
     > db.articles.aggregate([
         { $group:{_id:"$author",avg_click:{$avg:"$click"}} }
     ]) 
     { "_id" : "Lucy", "avg_click" : 30 }
     { "_id" : "Jack", "avg_click" : 12.5 }
     { "_id" : "Tom", "avg_click" : 8.666666666666666 }
    
     # 4. $first/$last
     > db.articles.aggregate([
         { $group:{_id:"$author",click_list:{$first:"$click"}} }
     ]) 
     { "_id" : "Lucy", "click_list" : 30 }
     { "_id" : "Jack", "click_list" : 0 }
     { "_id" : "Tom", "click_list" : 1 }
    
     # 5. $push/$addToSet
     > db.articles.aggregate([
         { $group:{_id:"$author",click_list:{$push:"$click"}} }
     ]) 
     { "_id" : "Lucy", "click_list" : [ 30 ] }
     { "_id" : "Jack", "click_list" : [ 0, 25 ] }
     { "_id" : "Tom", "click_list" : [ 1, 5, 20 ] }
    
  4. $match: filter documents (place it as early in the pipeline as possible for efficiency)

     # select _id,title,author from articles where click>0 and click<25
     db.articles.aggregate([
         { $match:{click:{$gt:0,$lt:25}} },
         { $project:{"_id":0,"title":1,"author":1,"click":1}}
     ]) 
     { "title" : "Spring Basic", "author" : "Tom", "click" : 1 }
     { "title" : "Spring MVC", "author" : "Tom", "click" : 5 }
     { "title" : "Spring Security", "author" : "Tom", "click" : 20 }
    
     # select author as _id,sum(click) as total_click from articles group by author having sum(click)>0 and sum(click)<30
     db.articles.aggregate([
         { $group:{_id:"$author",total_click:{$sum:"$click"} } },
         { $match:{total_click:{$gt:0,$lt:30}} }
     ]) 
     { "_id" : "Jack", "total_click" : 25 }
     { "_id" : "Tom", "total_click" : 26 }
    
  5. $project: select, rename, and derive fields

     # selecting fields: field:1/0 includes/excludes the field; dropping unused fields early in the pipeline reduces the memory used by the aggregation
     > db.articles.aggregate([
         { $project:{"_id":0,"title":1,"postData":1,"click":1,"tags":1} }
     ])
     { "title" : "Spring Basic", "tags" : [ "java", "spring" ], "click" : 1 }
     { "title" : "Spring MVC", "tags" : [ "java", "spring", "mvc" ], "click" : 5 }
     { "title" : "Spring Security", "tags" : [ "java", "spring", "security" ], "click" : 20 }
     { "title" : "ReactJS Basic", "tags" : [ "front", "reateJS" ], "click" : 30 }
     { "title" : "ReactJS Flux", "tags" : [ "front", "reateJS" ] }
     { "title" : "Redis", "tags" : [ "nosql", "redis" ], "click" : 0 }
     { "title" : "MongoDB", "tags" : [ "nosql", "mongodb" ], "click" : 25 }
    
     # renaming fields: the $ reference syntax "$field" refers to the value of field in the document
     > db.articles.aggregate([
         { $project:{_id:0,title:1,"preClick":"$click"} }
     ])
     { "title" : "Spring Basic", "preClick" : 1 }
     { "title" : "Spring MVC", "preClick" : 5 }
     { "title" : "Spring Security", "preClick" : 20 }
     { "title" : "ReactJS Basic", "preClick" : 30 }
     { "title" : "ReactJS Flux" }
     { "title" : "Redis", "preClick" : 0 }
     { "title" : "MongoDB", "preClick" : 25 }
    
     # derived fields: compute a new field from existing ones
     > db.articles.aggregate([
         { $project:{
             "title":1,
             "click":1,
             "result": { $or: [ { $gt: [ "$click", 20 ] }, { $lt: [ "$click", 100 ] } ] }
           } 
         }
     ])
     { "_id" : ObjectId("5be2b3347ec0ecb208b54816"), "title" : "Spring Basic", "click" : 1, "result" : true }
     { "_id" : ObjectId("5be2b3347ec0ecb208b54817"), "title" : "Spring MVC", "click" : 5, "result" : true }
     { "_id" : ObjectId("5be2b3347ec0ecb208b54818"), "title" : "Spring Security", "click" : 20, "result" : true }
     { "_id" : ObjectId("5be2b3347ec0ecb208b54819"), "title" : "ReactJS Basic", "click" : 30, "result" : true }
     { "_id" : ObjectId("5be2b3347ec0ecb208b5481a"), "title" : "ReactJS Flux", "result" : true }
     { "_id" : ObjectId("5be2b3347ec0ecb208b5481b"), "title" : "Redis", "click" : 0, "result" : true }
     { "_id" : ObjectId("5be2b3347ec0ecb208b5481c"), "title" : "MongoDB", "click" : 25, "result" : true }
    
     # derived field using $cond
     > db.articles.aggregate([
         { $project:{
             _id:0,
             title:1,
             click:1,
             level: { $cond:{if:{$gte:["$click",20]},then: "High",else: "Low"} }
          } 
         }
     ])
     { "title" : "Spring Basic", "click" : 1, "level" : "Low" }
     { "title" : "Spring MVC", "click" : 5, "level" : "Low" }
     { "title" : "Spring Security", "click" : 20, "level" : "High" }
     { "title" : "ReactJS Basic", "click" : 30, "level" : "High" }
     { "title" : "ReactJS Flux", "level" : "Low" }
     { "title" : "Redis", "click" : 0, "level" : "Low" }
     { "title" : "MongoDB", "click" : 25, "level" : "High" }
    
  6. $limit,$skip,$sort

     > db.articles.aggregate([
         {$skip:2},
         {$limit:3},
         {$sort:{"postDate":1}},
         {$project:{"_id":0,"title":1,"postDate":1}}
     ])
     { "title" : "Spring Security", "postDate" : "2015-01-21" }
     { "title" : "ReactJS Basic", "postDate" : "2015-02-01" }
     { "title" : "ReactJS Flux", "postDate" : "2015-02-11" }
    
     > db.articles.aggregate([
         {$sort:{"postDate":1}},
         {$limit:3},
         {$skip:2},
         {$project:{"_id":0,"title":1,"postDate":1}}
     ])
     { "title" : "Spring Security", "postDate" : "2015-01-21" }
    
  7. $unwind: split an array field into one document per element

     > db.articles.aggregate([
         {$project:{"_id":0,"title":1,"tags":1}},
         {$unwind:"$tags"}
     ])
     { "title" : "Spring Basic", "tags" : "java" }
     { "title" : "Spring Basic", "tags" : "spring" }
     { "title" : "Spring MVC", "tags" : "java" }
     { "title" : "Spring MVC", "tags" : "spring" }
     { "title" : "Spring MVC", "tags" : "mvc" }
     { "title" : "Spring Security", "tags" : "java" }
     { "title" : "Spring Security", "tags" : "spring" }
     { "title" : "Spring Security", "tags" : "security" }
     { "title" : "ReactJS Basic", "tags" : "front" }
     { "title" : "ReactJS Basic", "tags" : "reateJS" }
     { "title" : "ReactJS Flux", "tags" : "front" }
     { "title" : "ReactJS Flux", "tags" : "reateJS" }
     { "title" : "Redis", "tags" : "nosql" }
     { "title" : "Redis", "tags" : "redis" }
     { "title" : "MongoDB", "tags" : "nosql" }
     { "title" : "MongoDB", "tags" : "mongodb" }
    
    • If the array field is missing or empty, the document is skipped and produces no output
    • If the field is not an array, an exception is thrown
  8. $out: write the result into the given collection (its existing content is replaced)

     > db.articles.aggregate([
         { $group:{_id:"$author",total_click:{$sum:"$click"}} },
         { $out:"author_click" }
     ])
     > db.author_click.find()
     { "_id" : "Lucy", "total_click" : 30 }
     { "_id" : "Jack", "total_click" : 25 }
     { "_id" : "Tom", "total_click" : 26 }
    
     > db.articles.aggregate([
         { $group:{_id:"$author",total_click:{$sum:"$click"}} },
         { $project:{"_id":0}},
         { $limit:2},
         { $out:"author_click" }
     ])
     > db.author_click.find()
     { "_id" : ObjectId("5be3a0ec58072db74ca83569"), "total_click" : 30 }
     { "_id" : ObjectId("5be3a0ec58072db74ca8356a"), "total_click" : 25 }
    
  9. options: explain returns the execution plan of every pipeline stage

     > db.articles.aggregate([
         { $group:{_id:"$author",click_list:{$push:"$click"}} }
     ]) 
     { "_id" : "Lucy", "click_list" : [ 30 ] }
     { "_id" : "Jack", "click_list" : [ 0, 25 ] }
     { "_id" : "Tom", "click_list" : [ 1, 5, 20 ] }
    
     > db.articles.aggregate([
         { $group:{_id:"$author",click_list:{$push:"$click"}} }
     ],{explain:true})
     {
         "stages" : [
                 {
                         "$cursor" : {
                                 "query" : {
    
                                 },
                                 "fields" : {
                                         "author" : 1,
                                         "click" : 1,
                                         "_id" : 0
                                 },
                                 "queryPlanner" : {
                                         "plannerVersion" : 1,
                                         "namespace" : "demo.articles",
                                         "indexFilterSet" : false,
                                         "parsedQuery" : {
    
                                         },
                                         "winningPlan" : {
                                                 "stage" : "COLLSCAN",
                                                 "direction" : "forward"
                                         },
                                         "rejectedPlans" : [ ]
                                 }
                         }
                 },
                 {
                         "$group" : {
                                 "_id" : "$author",
                                 "click_list" : {
                                         "$push" : "$click"
                                 }
                         }
                 }
         ],
         "ok" : 1
     }
    

Map-Reduce

  • A computation model that can run in parallel across multiple servers: a large workload is split up and mapped (MAP), and the partial results are then merged into the final result (REDUCE)
  • Two phases, Map and Reduce, usually described as three steps: Map, Shuffle, Reduce
    • Map: applied to every document, producing keys and values, e.g. mapping a document emits {female,{count:1}} or {male,{count:1}}
    • Shuffle: groups by key, combining the values of the same key into an array, e.g. (female:[{count:1},{count:1},...]), (male:[{count:1},{count:1},...])
    • Reduce: reduces each value array to a single value (the aggregate), e.g. (female:{count:20}), (male:{count:15})
    • Note: Map and Reduce must be defined explicitly; Shuffle is handled by MongoDB


db.collection.mapReduce(
     <map>,                           // map function: emits key/value pairs (iterates the collection and calls emit(key, value))
     <reduce>,                        // reduce function: key + values -> key + value (reduces the values array to a single value)
     {
       out: <collection>,             // collection that stores the result (if omitted, a temporary collection is used and dropped when the client disconnects)
       query: <document>,             // filter: only matching documents are passed to the map function
       sort: <document>,              // sort the documents before they are sent to map
       limit: <number>,               // maximum number of documents sent to map
       finalize: <function>,
       scope: <document>,
       jsMode: <boolean>,
       verbose: <boolean>,
       bypassDocumentValidation: <boolean>
     }
)

Examples:

> db.articles.find({},{"_id":0,"title":1,"author":1,"click":1,"tags":1})
{ "title" : "Spring Basic", "author" : "Tom", "tags" : [ "java", "spring" ], "click" : 1 }
{ "title" : "Spring MVC", "author" : "Tom", "tags" : [ "java", "spring", "mvc" ], "click" : 5 }
{ "title" : "Spring Security", "author" : "Tom", "tags" : [ "java", "spring", "security" ], "click" : 20 }
{ "title" : "ReactJS Basic", "author" : "Lucy", "tags" : [ "front", "reateJS" ], "click" : 30 }
{ "title" : "ReactJS Flux", "author" : "Lucy", "tags" : [ "front", "reateJS" ] }
{ "title" : "Redis", "author" : "Jack", "tags" : [ "nosql", "redis" ], "click" : 0 }
{ "title" : "MongoDB", "author" : "Jack", "tags" : [ "nosql", "mongodb" ], "click" : 25 }

# 1. per author: number of articles with click > 0
> db.articles.mapReduce(
    function(){emit(this.author,1)},
    function(key,values){return Array.sum(values)},
    {query:{click:{$gt:0}},out: "author_sum"}
)
{
        "result" : "author_sum",            // name of the collection that stores the result
        "timeMillis" : 122,                 // execution time in milliseconds
        "counts" : {
                "input" : 5,                // documents matching the query and sent to map
                "emit" : 5,                 // number of times emit() was called in map
                "reduce" : 1,               // number of times the reduce function was called
                "output" : 3                // number of documents in the result collection
        },
        "ok" : 1                            // 1 means success
}
> db.author_sum.find()
{ "_id" : "Jack", "value" : 1 }
{ "_id" : "Lucy", "value" : 1 }
{ "_id" : "Tom", "value" : 3 }

# 2. per author: average clicks per article
> db.articles.mapReduce(
    function(){emit(this.author,{click:this.click||0,article:1})},
    function(key,values){ 
        reducedVal={sum_click:0,sum_article:0};
        values.forEach(function(item){
            reducedVal.sum_click+=item.click;
            reducedVal.sum_article+=item.article;
        })
        return reducedVal.sum_click/reducedVal.sum_article;
    },
    {out:"author_sum"}
)
{
        "result" : "author_sum",
        "timeMillis" : 241,
        "counts" : {
                "input" : 7,
                "emit" : 7,
                "reduce" : 3,
                "output" : 3
        },
        "ok" : 1
}
> db.author_sum.find()
{ "_id" : "Jack", "value" : 12.5 }
{ "_id" : "Lucy", "value" : 15 }
{ "_id" : "Tom", "value" : 8.666666666666666 }

Single-purpose aggregation operations

  1. count: db.collection.count(query, options)

     > db.articles.count()
     7
     > db.articles.count({click:{$gt:10}})
     3
     > db.articles.count({click:{$gt:10},tags:{$size:2}})
     2
     > db.articles.find({click:{$gt:10},tags:{$size:2}}).count()
     2
    
  2. distinct: db.collection.distinct(field, query, options)

     > db.articles.distinct("click")
     [ 1, 5, 20, 30, 0, 25 ]
    
     > db.articles.distinct("tags")
     ["java","spring","mvc","security","front","reateJS","nosql","redis","mongodb"]
    
     > db.articles.distinct("tags",{click:{$gt:20}})
     [ "front", "reateJS", "mongodb", "nosql" ]
    

Security

Security measures:

  • Physical isolation (most secure)
  • Network isolation
  • IP whitelisting (firewall rules etc.)
  • Username/password authentication

Enabling authentication:

  • Via auth
      > vim conf/mongod.conf
      ...
      auth = true
    
  • Via keyfile
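      A keyfile sketch (the file path is illustrative); the same key file must be present on every member of the replica set / cluster:

      # generate the key file and restrict its permissions
      $ openssl rand -base64 756 > /opt/mongo/mongodb.key
      $ chmod 400 /opt/mongo/mongodb.key
      # start mongod with the key file (keyFile also implies client auth)
      $ mongod -f conf/mongod.conf --keyFile /opt/mongo/mongodb.key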

Operations:

  1. List users with db.getUsers(), list roles with db.getRoles()

     > use admin
     switched to db admin
     > db.getUsers()
     [
         {
                 "_id" : "admin.mongoadmin",
                 "user" : "mongoadmin",
                 "db" : "admin",
                 "roles" : [
                         {
                                 "role" : "root",
                                 "db" : "admin"
                         }
                 ],
                 "mechanisms" : [
                         "SCRAM-SHA-1",
                         "SCRAM-SHA-256"
                 ]
         }
     ]
     > db.getRoles()
     [ ]
    
  2. Create a user with db.createUser (built-in role types: read, readWrite, dbAdmin, dbOwner, userAdmin)

     > db.createUser({
         user:"cj",
         pwd:"123",
         roles:[
             {role:"userAdmin",db:"demo"},
             {role:"read",db:"local"}
         ]
     })
     $ mongo localhost:12345 -u cj -p 123
     > use testdb
     > show tables
    
  3. Create a role with db.createRole

     # Role categories:
     # database roles (read, readWrite, dbAdmin, dbOwner, userAdmin)
     # cluster roles (clusterAdmin, clusterManager, ...)
     # backup roles (backup, restore, ...)
     # other special roles (dbAdminAnyDatabase, ...)
     > db.createRole({
         role:"appUser",
         db:"myApp",
         privileges:[
             {
                 resource:{db:"myApp",collection:""},
                 actions:["find","createCollection","dbStats","collStats"]
             },
             {
                 resource:{db:"myApp",collection:"logs"},
                 actions:["insert"]
             },
             {
                 resource:{db:"myApp",collection:"data"},
                 actions:["insert","update","remove","compact"]
             },
             {
                 resource:{db:"myApp",collection:"system.indexes"},
                 actions:["find"]
             }
         ],
         roles:[]
     })
    

Replica Set & Sharding

  1. Replica Set (vertical): a leader-based replicated state machine (key points: elections and data replication); see the bootstrap sketch after the table below
  2. Sharding (horizontal): splits the data and spreads it across servers; architecturally read/write-balanced and decentralized
  3. Sharding vs. replication
    - | Sharding | Replication
    Purpose | higher concurrency, better random access to large data sets | data redundancy, better read performance
    Architecture | horizontal | centralized
    Principle | data is partitioned and spread out | data mirroring
    Maintenance cost | relatively high | relatively low
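
A minimal replica-set bootstrap sketch (the rs0 name and host names are illustrative): start three mongod processes with the same --replSet name, then initiate the set from one of them:

    > rs.initiate({
        _id: "rs0",
        members: [
            { _id: 0, host: "mongo1:27017" },
            { _id: 1, host: "mongo2:27017" },
            { _id: 2, host: "mongo3:27017" }
        ]
    })
    > rs.status()        // check member states and the election result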

More about Sharding

  1. Sharding node types:

    • Shard nodes: store the data (a single mongod or a replica set): mongod --shardsvr --replSet
    • Config servers: store the cluster metadata used by mongos to route data to the shards: mongod --configsvr
    • Query routers: the mongos nodes; they accept client requests and forward them to the appropriate shard(s) according to the routing rules: mongos --configdb <configdb server>
  2. Concepts:

    1. Shard Key: a key chosen from the collection whose values drive how the data is partitioned, e.g. sh.shardCollection("records.people", {user_id:"hashed"})
    2. Chunk: the unit in which sharded data is stored, 64MB by default
    3. Split: a background process that keeps chunks from growing too large; when a chunk exceeds the configured chunk size it is split (after the split, the shard updates that chunk's metadata on the config servers)
    4. Balancing: a background thread that migrates chunks to keep the shards balanced; it periodically checks for imbalance and moves chunks when needed (the balancer runs on mongos; note that balancing is triggered by the number of chunks, not by chunk size)
    5. Split -> balance process
  3. Sharding strategies:

    • Hashed sharding: spreads writes evenly across the shards
    • Ranged sharding: supports range queries on the shard key well
    • Tag-aware sharding
    • A good shard key has:
      • sufficient cardinality: if too many documents share the same key value, chunks cannot be split, jumbo chunks form and the data becomes unbalanced
      • evenly distributed writes: a monotonically increasing key (e.g. _id or a timestamp) keeps appending to the last chunk and overloads a single shard
      • targeted reads: avoid scatter-gather queries where possible
  4. Adding shards (see the mongos-helper sketch after the manual-sharding example below):

    1. Connect to mongos
    2. Add the shards
    3. Enable sharding on the database
    4. Shard a collection
  5. Example: manual sharding (reduces the IO and other resource costs of automatic balancing; prerequisite: you understand the data well enough to pre-partition it)

     # disable the auto balancer
     sh.stopBalancer()    # Currently enabled: no
    
     # pre-split the chunks
     > use admin
     > db.runCommand({"enablesharding":"myapp"})
     > db.runCommand({"shardcollection":"myapp.users","key":{"email":1}})
    
     for(var x=97;x<97+26;x++){
         for(var y=97;y<97+26;y+=6){
             var prefix=String.fromCharCode(x)+String.fromCharCode(y);
             db.runCommand({split:"myapp.users",middle:{email:prefix}})
         }
     }
    
     # move the split chunks manually (balancing)
     var shServer=[
         "ShardServer 1",
         "ShardServer 2",
         "ShardServer 3",
         "ShardServer 4",
         "ShardServer 5"
     ]
     for(var x=97;x<97+26;x++){
         for(var y=97;y<97+26;y+=6){
             var prefix=String.fromCharCode(x)+String.fromCharCode(y);
             db.adminCommand({moveChunk:"myapp.users",find:{email:prefix},to:shServer[(y-97)/6]})
         }
     }
    
     # (y-97)/6 evaluates to 0,1,2,3,4 over the loop, i.e. shServer[0] -> shServer[4]
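
     The four "Adding shards" steps above, sketched with the mongos shell helpers (the shard replica-set addresses are illustrative):

     # 1. connect to mongos
     $ mongo --host mongos-host --port 27017
     # 2-4. add shards, enable sharding on the database, shard a collection
     > sh.addShard("shardRepl1/shard1a:27018,shard1b:27018")
     > sh.addShard("shardRepl2/shard2a:27018,shard2b:27018")
     > sh.enableSharding("myapp")
     > sh.shardCollection("myapp.users", { email: 1 })
     > sh.status()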
    

New features in MongoDB 4.0

  1. Multi-document transactions

    • 4.0: single-document transactions -> multi-document (cross-document) transactions
    • 4.2: replica-set transactions -> sharded-cluster transactions
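    • A minimal 4.0 transaction sketch (requires a replica set; the demo database and users collection are illustrative, and in 4.0 the collection must already exist before the transaction starts):

        var session = db.getMongo().startSession();
        session.startTransaction();
        try {
            var users = session.getDatabase("demo").users;
            users.insert({ username: "Tom", salary: 4000 });
            users.insert({ username: "Lucy", salary: 7600 });
            session.commitTransaction();      // both writes become visible atomically
        } catch (e) {
            session.abortTransaction();       // neither write is applied
        }
        session.endSession();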
  2. Aggregation type conversion: the new $convert aggregation operator simplifies ETL (extract, transform, load) pipelines

    • Can be combined with:
        # 1. type conversion
        $toBool,$toDate,$toDecimal,$toDouble,$toInt,$toLong,$toObjectId,$toString
        # 2. date conversion
        $dateToParts,$dateFromParts,$dateFromString
        # 3. trimming
        $trim,$rtrim,$ltrim
        {$trim:{input:<expression>}}
        {$trim:{input:<expression>,chars:<string>}}
      
    • prepare test data:
        > db.address.insert([
            { street: "Canal st", building: NumberDecimal(21), _id: 0},
            { street: "43rd st", building: "229", _id: 1},
            { street: "Fulton st", building: "31", _id: 2 },
            { street: "52nd st", building: "11w", _id: 3}, 
            { street: "78th st", building: null, _id: 4}, 
            { street: "78th st",  _id: 5}, 
            { street: "Rector st", building: NumberInt(10), _id: 6,last_visited: {year: 2017, month: 10}}
        ]);
      
    • $convert example:
        > db.address.aggregate( [
          {
            $addFields: {
              building: {
                $convert: {
                  input: "$building",
                  to: "int",
                  onError: 0,
                  onNull: -1
                }
              }
            }
          },
          { $sort: {building: 1}}
        ]);
        { "_id" : 4, "street" : "78th st", "building" : -1 }
        { "_id" : 5, "street" : "78th st", "building" : -1 }
        { "_id" : 3, "street" : "52nd st", "building" : 0 }
        { "_id" : 6, "street" : "Rector st", "building" : 10, "last_visited" : { "year" : 2017, "month" : 10 } }
        { "_id" : 0, "street" : "Canal st", "building" : 21 }
        { "_id" : 2, "street" : "Fulton st", "building" : 31 }
        { "_id" : 1, "street" : "43rd st", "building" : 229 }
      
    • $dateFromParts example:
        # add 15 months to the month of the last_visited date field:
        > db.address.aggregate([
          {
            $addFields: {
              next_visit: {
                  $convert:{
                    input: {
                      $dateFromParts: {
                        year: "$last_visited.year",
                        month: {$add:[15, "$last_visited.month"]},
                      }},
                    to: "date",
                    onNull: "",
                    onError: ""
                }
              }
            }
          }
        ]);
        { "_id" : 0, "street" : "Canal st", "building" : NumberDecimal("21.0000000000000"), "next_visit" : "" }
        { "_id" : 1, "street" : "43rd st", "building" : "229", "next_visit" : "" }
        { "_id" : 2, "street" : "Fulton st", "building" : "31", "next_visit" : "" }
        { "_id" : 3, "street" : "52nd st", "building" : "11w", "next_visit" : "" }
        { "_id" : 4, "street" : "78th st", "building" : null, "next_visit" : "" }
        { "_id" : 5, "street" : "78th st", "next_visit" : "" }
        { "_id" : 6, "street" : "Rector st", "building" : 10, "last_visited" : { "year" : 2017, "month" : 10 }, "next_visit" : ISODate("2019-01-01T00:00:00Z") }
      
    • $trim example:

        # conversion error:
        > db.address.aggregate( [
          {
            $addFields: {
               building: {$convert: { input: "$building", to: "int"  }}  }
          },
          {$sort: {building: 1}}
        ]);
        Error: command failed: {
            "ok" : 0,
            "errmsg" : "Failed to parse number '11w' in $convert with no onError value: Bad digit \"w\" while parsing 11w",
            "code" : 241,
            "codeName" : "ConversionFailure"
        } : aggregate failed
      
        # avoid conversion error: Using $trim expression with longer list of chars to remove:
        > db.address.aggregate( [
          {$match: { building: {$type: "string"} }},
          {
            $addFields: {
              building: {
                $convert: {
                  input: {$trim: {
                    input: "$building",
                    chars: "abcdefghijklmnopqrstuvwxyz "}},
                  to: "int"  }
                }
             }
          },
          {$sort: {building: 1}}
        ]);
        { "_id" : 3, "street" : "52nd st", "building" : 11 }
        { "_id" : 2, "street" : "Fulton st", "building" : 31 }
        { "_id" : 1, "street" : "43rd st", "building" : 229 }
      
  3. Change streams extended

    • 3.6: change streams at the collection level
    • 4.0: change streams at the database / cluster level (change events now include the clusterTime)
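    • A minimal change-stream sketch (the articles collection is illustrative; watch() requires a replica set):

        // 3.6+: watch a single collection; 4.0 adds db.watch() and whole-deployment watching
        var cs = db.articles.watch();
        while (cs.hasNext()) {
            printjson(cs.next());       // change events carry operationType, fullDocument, clusterTime, ...
        }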
  4. Secondary reads: blocking -> non-blocking (via a snapshot mechanism)

    • Blocking secondary reads (before 4.0):
      • Reads had to wait for the current batch of replicated writes to be applied
      • Writes had to wait for in-progress reads to finish, otherwise consistency would be violated
      • Together this formed a vicious circle
    • Non-blocking secondary reads (4.0): reads run against a snapshot and no longer block replication
    • This noticeably improves read performance on secondaries

Application
