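"Future Technology Competitiveness Analysis System": Project Retrospective (Web Application)

A very awkward project..

Requirements Overview

The system analyzes English-language papers published abroad. It collects statistics such as each paper's research field, citation count, publishing organization, university, and first author, and presents them as static charts, dynamic charts, and maps, so that researchers can assess factors such as a country's scientific influence and funding levels. The system is divided into four main modules. The main screen is shown below:

Of these, I was responsible for developing "Productivity and Influence Analysis", "Semantic Dictionary", and "Funding Structure Analysis". The framework is Seam 2.3 and the application server is JBoss AS 7.1.0.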

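Domain Knowledge Framework Construction:

The user uploads a paper in PDF format. The system automatically analyzes its content, extracts the table of contents and the chapter summaries, and displays them as a tree. Users can add or delete chapters and edit the summaries as needed, and finally generate a "domain knowledge framework" for research.

The PDF file looks like this:

PDF analysis page:

Visualization page: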

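Productivity and Influence Analysis:

The user uploads an Excel spreadsheet containing paper data. The system counts how often each paper was cited (broken down by year) and how many papers each university published (also by year), and displays the results as charts and maps. Several universities can be selected at once for comparison. Clicking a paper shows its basic information, with automatic keyword highlighting (provided by http://www.alchemyapi.com/).

Single-country data:

Multi-country comparison:

Bar chart:

Keyword highlighting:

Dynamic display (it is still unclear what the users want this feature for):

Map view: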

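Research Level Assessment:

The user writes a regular expression, and the system extracts the parts of the paper content that match it.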
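A minimal sketch of this kind of extraction in plain Java, assuming the user's pattern arrives as an ordinary string. RegexExtractor and extract() are made-up names for illustration and are not taken from the project:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the project's code.
class RegexExtractor {
    /** Returns every substring of text that matches the user-supplied regex. */
    static List<String> extract(String userRegex, String text) {
        Pattern pattern = Pattern.compile(userRegex);
        Matcher matcher = pattern.matcher(text);
        List<String> hits = new ArrayList<String>();
        while (matcher.find()) {
            hits.add(matcher.group()); // the whole match, not a capture group
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "Funded by NSF grant 1234567 and NSF grant 7654321.";
        // prints [NSF grant 1234567, NSF grant 7654321]
        System.out.println(extract("NSF grant \\d+", text));
    }
}
```

Pattern.compile() throws PatternSyntaxException on a malformed pattern, so a real page would probably want to catch that and report it back to the user.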

Semantic Dictionary:

Simple CRUD operations.


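Funding Structure Analysis:

The same as the productivity analysis, except that it shows the amount of funding in US dollars.

Data Tables:

The three of us discussed for half an hour and designed 8 tables. Later our teacher pointed out some problems and we revised them. The final result:

"Technologies" used (calling them that is a stretch; nothing here is very sophisticated):

Reading Excel files

After the user uploads an Excel file, its contents first have to be read and saved to the database. We used the Apache POI library for this. The core code: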

/**
	 * Reads the paper data out of an Excel sheet and stores it in the paper table.
	 * 
	 * <p>implementation note: uses a third-party library (POI) to read and iterate over the sheet.
	 * Each row is wrapped in a Paper object, which is saved to the paper table by calling
	 * EntityManager.persist(). The Paper's type is PaperType.EXCEL.
	 * 
	 * <p>caution: skip the first Row while iterating, because it holds the column titles, not data.
	 * When iterating over a Row, skip the first 2 cells, because the paper table has no matching columns.
	 * 
	 * <p>caution: do not use a for-each loop over the cells of a Row; it skips empty cells, which
	 * causes a chain of errors. (POI's cell iterator only returns physically defined cells,
	 * so this is by design rather than a bug.)
	 * 
	 * @param filePath path of the Excel file
	 * @param type which kind of sheet is being read (papers or organizations)
	 * @exception NullPointerException if filePath is null.
	 * @author wanghongfei
	 */
	public void parseExcel(String filePath, ExcelType type) {
		if(null == filePath)
			throw new NullPointerException("Excel file path must not be null!");
		
		InputStream in = null;
		
		try {
			in = new FileInputStream(filePath);
			Workbook wb = WorkbookFactory.create(in);
			Sheet sheet = wb.getSheetAt(0);
			
			// iterate over the sheet
			boolean isFirstRow = true; // whether this is the first Row
			for(Row row : sheet) {
				// skip the header Row
				if(isFirstRow) {
					isFirstRow = false;
					continue;
				}
	
				if(type == ExcelType.PAPER) {
					Paper paper = persistPaper(row);
					entityManager.persist(paper);
				} else if(type == ExcelType.ORGNIZATION) {
					Organization org = persistOrg(row);
					entityManager.persist(org);
				} else { // never happens
					log.error("Invalid ExcelType");
				}
				
				entityManager.flush();
				
			} // for-row ends
			
		} catch (IOException ex) {
			log.error("IO Failure: failed to read the Excel file");
			ex.printStackTrace();
		} catch (InvalidFormatException ex) {
			log.error("Wrong file format: not an Excel file");
			ex.printStackTrace();
		} finally {
			// always release the file handle, even when parsing fails
			if(in != null) {
				try {
					in.close();
				} catch (IOException ignored) {
					// nothing useful can be done if close() fails
				}
			}
		}
	
	} // method ends

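Computing statistics

The Excel file uploaded by the user contains a little over 700 records. Each record represents one paper, with fields such as publication date, journal, citation count, various index numbers, all authors, abstract, organization, and university. I needed to count, for each organization, each university, and each country, the number of papers published and the number of citations received in each year. To do this I defined a "mutable integer" class: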

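/**
 * A mutable integer, used to make counters more efficient.
 * It also wraps the country name and the number of papers published in each year (a HashMap).
 * @author wanghongfei
 *
 */
public class MutableInteger {
	private int value; // times cited, or number of projects
	private int found; // funding obtained, or number of papers
	private String date; // project start date
	
	public Map<Integer, MutableInteger> map; // <year, times cited in that year>
	public Map<Integer, MutableInteger> paperMap; // <year, papers published in that year>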

	public MutableInteger(int val) {
		this.value = val;
	}
	
	public MutableInteger(int val, int found) {
		this.value = val;
		this.found = found;
	}
	
	public MutableInteger(int val, int found, String date) {
		this.value = val;
		this.found = found;
		this.date = date;
	}
// getters and setters
}
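For illustration, per-year bookkeeping of the kind these two maps hold could be built along the following lines. This is a simplified sketch rather than the project's code: YearlyTally and its methods are hypothetical names, and a plain int[] pair stands in for MutableInteger.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: university -> (year -> {paper count, citation count}).
class YearlyTally {
    private final Map<String, Map<Integer, int[]>> byUniversity =
            new HashMap<String, Map<Integer, int[]>>();

    /** Records one paper published by a university in a given year. */
    void add(String university, int year, int timesCited) {
        Map<Integer, int[]> byYear = byUniversity.get(university);
        if (byYear == null) {
            byYear = new TreeMap<Integer, int[]>(); // keeps years in ascending order
            byUniversity.put(university, byYear);
        }
        int[] totals = byYear.get(year);
        if (totals == null) {
            totals = new int[2]; // [0] = papers, [1] = citations
            byYear.put(year, totals);
        }
        totals[0] += 1;
        totals[1] += timesCited;
    }

    int papers(String university, int year) {
        return lookup(university, year)[0];
    }

    int citations(String university, int year) {
        return lookup(university, year)[1];
    }

    // Returns {0, 0} when nothing was recorded for that university and year.
    private int[] lookup(String university, int year) {
        Map<Integer, int[]> byYear = byUniversity.get(university);
        int[] totals = (byYear == null) ? null : byYear.get(year);
        return (totals == null) ? new int[2] : totals;
    }
}
```

Because the int[] is mutated in place, each paper costs one lookup per level and no re-insertion, which is the same motivation as a mutable counter object.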

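Three kinds of entities need statistics. The job is always counting, but the counting rules differ from one kind to the next, so I designed a simple inheritance hierarchy: the shared rules live in the parent class, and subclasses override its methods to define the special rules they need.

The Counter class is defined as follows. It provides the basic counting and sorting methods: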

/**
 * A generic counter. You can extend this class to implement special counter.
 * @author wanghongfei
 *
 * @param <Entity> The kind of object to be counted.
 */
public abstract class Counter<Entity> {
	private List<Entity> list; // the collection of entities to be counted
	private Map<String, MutableInteger> map; // original counter
	private TreeMap<String, MutableInteger> sortedMap; // sorted map
	
	/**
	 * Entity list must be set when construct Counter object.
	 * @param list
	 */
	public Counter(List<Entity> list) {
		this.list = list;
	}
	
	public abstract String toJson(Comparator<String> comp, boolean flush);
	public String toAllJson() {
		return null;
	}
	
	/**
	 * Get the result of counting.
	 * <p>Perform count action by using entity.toString() as key and MutableInteger as value.
	 * Do it if you haven't done counting yet.
	 * @param refresh Whether should perform re-count action. If you have called setList() method,
	 * this parameter must be set to true.
	 * @return A HashMap<String, MutableInteger> which contains the count result.
	 */
	public Map<String, MutableInteger> getResult(boolean refresh) {
		if(null == getMap() || true == refresh) {
			performCount();
		}
		
		return getMap();
	}
	
	/**
	 * Get the sorted counting result.
	 * <p>Uses descending order by default. Sorts first if that has not been done yet.
	 * @param refresh Whether to re-count and re-sort. If you have called setList(),
	 * this parameter must be set to true.
	 * @param comp An object that implements the Comparator interface. If it is null,
	 * the default Comparator is used.
	 * @return A TreeMap<String, MutableInteger> which contains the sorted result.
	 */
	public TreeMap<String, MutableInteger> getSortedResult(boolean refresh, Comparator<String> comp) {
		if(null == getSortedMap() || refresh) {
			// count first: the default comparator needs a non-null map, and a
			// refresh has to pick up a list that was changed through setList()
			if(null == getMap() || refresh) {
				performCount();
			}
			
			if(null == comp) {
				performSort(new CounterComparator(getMap()));
			} else {
				performSort(comp);
			}
		}
		
		return getSortedMap();
	}
	
	/**
	 * Perform counting action.
	 * <p>Perform count action by using entity.toString() as key and MutableInteger as value.
	 */
	protected void performCount() {
		if(null == getList())
			throw new NullPointerException("entity list cannot be null!");
		
		map = new HashMap<String, MutableInteger>();
		for(Entity e : getList()) {
			MutableInteger newValue = new MutableInteger(1);
			MutableInteger oldValue = map.put(e.toString(), newValue);
			
			if(oldValue != null) {
				newValue.setValue(oldValue.getValue() + newValue.getValue());
			}
		}
	}
	
	/**
	 * Perform sorting action.
	 * Uses descending order by default.
	 * @param comp
	 */
	protected void performSort(Comparator<String> comp) {
		if(null == getMap())
			performCount();
		
		sortedMap = new TreeMap<String, MutableInteger>(comp);
		sortedMap.putAll(getMap());
	}

	/**
	 * Call this method to change the entities you want to count.
	 * @param list New collection of entity.
	 */
	public void setList(List<Entity> list) {
		this.list = list;
	}

	// User cannot call following setter and getter method
	protected Map<String, MutableInteger> getMap() {
		return map;
	}

	protected void setMap(Map<String, MutableInteger> map) {
		this.map = map;
	}

	protected TreeMap<String, MutableInteger> getSortedMap() {
		return sortedMap;
	}

	protected void setSortedMap(TreeMap<String, MutableInteger> sortedMap) {
		this.sortedMap = sortedMap;
	}

	protected List<Entity> getList() {
		return list;
	}
	
}

The counting here uses a fairly efficient approach; see my other post, "Efficient HashMap Counter" (高效HashMap计数器, http://blog.csdn.net/neosmith/article/details/17041757).
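As far as it can be read from performCount() above, the trick is that Map.put() returns the value previously stored under the key, so a single map operation per item both stores the new counter and hands back the old one. A stripped-down sketch of just that idea (IntBox and FastCounter are illustrative names, not the project's classes):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal mutable int holder, standing in for MutableInteger.
class IntBox {
    int value;
    IntBox(int value) { this.value = value; }
}

class FastCounter {
    /** Counts how many times each key occurs in the list. */
    static Map<String, IntBox> count(List<String> keys) {
        Map<String, IntBox> counts = new HashMap<String, IntBox>();
        for (String key : keys) {
            IntBox fresh = new IntBox(1);
            // put() stores the new box and returns the one it replaced (or null)
            IntBox previous = counts.put(key, fresh);
            if (previous != null) {
                fresh.value += previous.value; // carry the running total forward
            }
        }
        return counts;
    }
}
```

Counting the keys "a", "b", "a" this way gives 2 for "a" and 1 for "b". The more familiar containsKey()/get()/put() sequence produces the same totals with more map lookups per item.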

The subclasses NsfCounter, OrganizationCounter, and CountryCounter then partially override performCount() as each one needs, and implement toJson().
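The subclass code is not shown in this post, so the following is only a guess at the general shape, built on a much smaller stand-in for Counter. MiniCounter, PaperRow, and CountryTally are hypothetical names, and the hand-rolled JSON assumes names contain no quotes or backslashes:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Simplified stand-in for the Counter base class (not the project's code). */
abstract class MiniCounter<E> {
    protected final List<E> items;
    protected final Map<String, Integer> counts = new TreeMap<String, Integer>();

    MiniCounter(List<E> items) { this.items = items; }

    /** Default rule: one count per item, keyed by toString(). */
    protected void performCount() {
        counts.clear();
        for (E item : items) {
            increment(item.toString());
        }
    }

    protected void increment(String key) {
        Integer old = counts.get(key);
        counts.put(key, old == null ? 1 : old + 1);
    }

    abstract String toJson();
}

/** Hypothetical record type holding only the fields this sketch needs. */
class PaperRow {
    final String title;
    final String country;
    PaperRow(String title, String country) {
        this.title = title;
        this.country = country;
    }
}

/** Special rule: count papers per country instead of per toString(). */
class CountryTally extends MiniCounter<PaperRow> {
    CountryTally(List<PaperRow> rows) { super(rows); }

    @Override
    protected void performCount() {
        counts.clear();
        for (PaperRow row : items) {
            increment(row.country);
        }
    }

    @Override
    String toJson() {
        performCount();
        StringBuilder json = new StringBuilder("{");
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (json.length() > 1) {
                json.append(",");
            }
            json.append('"').append(entry.getKey()).append("\":").append(entry.getValue());
        }
        return json.append("}").toString();
    }
}
```

With two papers from Japan and one from Italy, toJson() returns {"Italy":1,"Japan":2}; the keys come out sorted because the stand-in uses a TreeMap.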

These classes are used from the ExcelProcessor Session Bean, which calls the to***Json() methods to produce the JSON string holding the results:

/**
	 * Returns the organization counts as JSON.
	 * This method is meant to be called from JavaScript.
	 * @return JSON string. Format: {countryName1: {amount: 1200, year:[2011: 500, 2012: 700]}, countryName2: {...} }
	 */
	public String toOrgCounterJson() {
		if(null == oCounter) {
			if(false == queryPaper()) {
				log.error("No rows with type EXCEL in the paper table");
				return "{}";
			}
			oCounter = new OrganizationCounter(papers);
		}
		
		return oCounter.toJson(null, false);
	}

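A better approach would really be to inject these Counter classes (components) into the Session Bean with @In, which would greatly reduce the coupling between components. But this small project only has a handful of small components, so doing without dependency injection does little harm.

Generating data for the dynamic charts

The Bubble and Force charts used in the project are provided by DataV.js (http://datavlab.org/datavjs/#treemap). They expect data in a format similar to the following array of rows: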

var source = 
[["year","country","survival","children","population","region"],
[1989,"Japan",0.9935,1.61,121832924,"Asia"],
[1989,"Jamaica",0.961,3.01,2352279,"America"],
[1989,"Italy",0.99,1.31,56824792,"Europe"],
[1989,"Israel",0.988,3.02,4384139,"Asia"],
[1989,"Ireland",0.9906,2.06,3530188,"Europe"],
[1989,"Iraq",0.9532,6.05,16927393,"Arab"],
[1989,"Iran",0.9324,5.15,53437770,"Arab"],
[1989,"Indonesia",0.9108,3.23,181197879,"Asia"]]

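Similarly, we generate these strings with the generate***Csv() methods and embed EL expressions in the JavaScript code, which is how the back-end data reaches the front end.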
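The generate***Csv() methods themselves are not shown, so here is only a rough sketch of how such a string might be assembled on the Java side. BubbleDataWriter and toArrayLiteral() are hypothetical names. The sketch quotes string cells so that the result is a valid JavaScript array literal, and leaves numbers bare:

```java
import java.util.List;

// Hypothetical helper, not the project's generate***Csv() code.
class BubbleDataWriter {
    /** Renders rows (header row first) as a JavaScript array-of-arrays literal. */
    static String toArrayLiteral(List<Object[]> rows) {
        StringBuilder out = new StringBuilder("[");
        for (int r = 0; r < rows.size(); r++) {
            if (r > 0) {
                out.append(",\n"); // one row per line, like the sample data
            }
            out.append("[");
            Object[] row = rows.get(r);
            for (int c = 0; c < row.length; c++) {
                if (c > 0) {
                    out.append(",");
                }
                Object cell = row[c];
                if (cell instanceof Number) {
                    out.append(cell); // numbers are written as-is
                } else {
                    // escape backslashes and quotes, then wrap in double quotes
                    String text = String.valueOf(cell)
                            .replace("\\", "\\\\")
                            .replace("\"", "\\\"");
                    out.append('"').append(text).append('"');
                }
            }
            out.append("]");
        }
        return out.append("]").toString();
    }
}
```

A page could then assign the returned string to a JavaScript variable through an EL expression. Values that may contain line breaks or a closing script tag would need further escaping, which this sketch does not attempt.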

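CRUD

All CRUD operations are implemented with Seam's EntityHome and EntityQuery components and JPA's EntityManager interface. There is not much worth mentioning here.

Front-end pages

On the productivity and influence analysis page, the user can click the "Country", "Organization", and "Paper" buttons to switch between tables. This is done purely in JavaScript: the data for all three is already in the page, and clicking a button merely changes the display property of the corresponding div to show a different table.

The line charts and bar charts use flot (http://www.flotcharts.org/), a jQuery-based plugin. When the user checks different countries or organizations, JS gathers all the matching statistics and redraws the chart, passing in several data series to get the comparison effect.

Keyword highlighting

Send an ajax request to http://www.alchemyapi.com/ and it returns the keywords of the text as JSON. The main code: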

// bind click event
// send ajax request to obtain keywords
$('#highlight-btn').click(function() {
	var newText = null;
	$.getJSON(
	// the abstract is URL-encoded so that spaces, '&' and '#' do not break the query string
	'http://access.alchemyapi.com/calls/text/TextGetRankedKeywords?apikey=your-api-key&text=' + encodeURIComponent(paper.abstract) + '&outputMode=json&jsonp=?',
	function(data, status, xhr) {
		console.log('ajax status:' + status + ',type:' + typeof(status));
		if('success' != status) {
			alert('Network unreachable, or the API call limit was exceeded!');
			return;
		}
		newText = highlight(paper.abstract, data);
		$('#modal-paper-abstract').html(newText);
	}); // end of $.getJSON
}); // end of click handler

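Once the keywords come back, wrap each matching word in the original text in a <span> tag with a style applied.

Google Map heatmap

The heatmap has two parts. First, call the Geocoding service API: send a university name, and Google returns that university's latitude and longitude. Second, plot the returned coordinates on the map as a heatmap. The map shown above plots a little over 140 valid coordinates.

The main code for displaying the map: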

function initialize() {
    var mapOptions = {
        center: new google.maps.LatLng(40.800, -96.000),
        //center: new google.maps.LatLng(37.774546, -122.433523),
        zoom: 5,
        mapTypeId: google.maps.MapTypeId.SATELLITE
    };

    map = new google.maps.Map(document.getElementById("map-canvas"),
        mapOptions);

    console.log(nsfData.length);
    var pointArray = new google.maps.MVCArray(nsfData);

    heatmap = new google.maps.visualization.HeatmapLayer({
        data: pointArray
    });

    heatmap.setMap(map);
}

The main code for requesting coordinates:

function fetchCoord(data) {
    var geocoder = new google.maps.Geocoder(); 
    var delay = 0;

    // key: index
    // val: univ name
    $.each(data, function(key, val) {
        var coord = new google.maps.LatLng(0.00, 0.00);
        var rq = {
            'address': val,
            'latLng': coord
        };
        setTimeout(function(){sendRequest(geocoder, rq)}, delay);
        //setTimeout('sendRequest(geocoder, rq)', delay);
        delay += 300;
    }); 
}

function sendRequest(geocoder, requestData) {
    geocoder.geocode(requestData, callback);
    console.log('send request for: ' + requestData.address);
}

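Off topic

I am the king of skipping class

Three of us built this project (I am a sophomore, the other two are juniors), and the main features took two weeks. With Seam it could actually have gone faster, but it happened to coincide with final exams, so we had to carve out time to study.. I feel guilty saying that, though, because I hardly went to class this semester and put basically all my time into learning web development in the lab. Another reason I skip class is that the school's courses are still stuck at the level of C and C++ syntax, moving slower than a snail.. PS: I failed probability theory..

Constantly growing requirements

This was the most painful part of all. We would agree on feature XX and feature XX, and implement them. A few days later a phone call would ask for another feature, so we would implement that too. A few days after that, the user would make a special trip just to ask for yet another feature. Maddening. Our mood kept cycling: feels almost done, nearly finished --> happy; suddenly a new task --> deflated; almost done again --> happy; another task --> deflated + confused + depressed.. There is an upside, though: it taught me the importance of refactoring, and the book I read a while ago, "Refactoring: Improving the Design of Existing Code" (《重构－－改善现有代码的质量》), came in handy. Now if the customer asks for statistics on another kind of data, I only need to write a class that extends Counter and implement toJson().

About JPA

The tables in this project were designed with the Eclipse plugin ERMaster, which also generated the table-creation SQL. The associations between the Entity Beans are likewise code generated by seam-gen. This is certainly convenient, but for a beginner it is a real trap. So I should shore up my knowledge of JPA entity associations, and get even more familiar with JDBC.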

Author: tracker_w, posted 2013-12-28 2:11:50
