从此走进深度人生 Deep net, deep life.

作者: deepoo

  • 李代:在“预言—复现”范式下重估计算社会科学

    人工智能在科学研究中的运用方兴未艾。早在20世纪90年代,社会科学就出现过“人工智能”热潮:“社会学家之前已经熟练地掌握了如此多种多样的统计工具,有人或许以为我们的量化方法论已经彻底成熟。因此,神经网络竟然可以和多元回归等地位得到公认的方法竞争,令人惊讶。尤其是对文本数据的分析,社会人工智能或许会被证明比其他方法更加优越。对于量化数据的管理和分析,人工智能也可能会扮演重要的角色。” 

    “社会科学智能”是计算社会科学的一个子集,借助机器学习等人工智能方法研究社会科学问题。近年来由机器学习方法的突破引发的人工智能浪潮,是将改变社会科学研究的格局,还是像20世纪90年代一样无法留下影响深远的遗产?本文采取“预言—复现”范式理解“社会科学智能”的内在逻辑,回应人工智能方法“不易阐释”“数据驱动”等批评,通过列举“社会科学智能”的五种应用来探讨“阐释”与“预言”的权衡,并指出学术界面临的若干现实挑战。 

    一、计算社会科学的“预言—复现”范式 

    社会科学共同体基于朴素的证伪主义理解“科学”的边界。基于这一证伪主义方法论,可以建立计算社会科学与传统量化研究者相互理解的桥梁。 

    (一)证伪主义视域下的“社会科学” 

“科学”的边界何在?波普尔的证伪主义给出过一个界定方案:可证伪的命题即科学命题。在波普尔的论述中,科学命题未必是全称命题,但因为科学命题的适用范围越广则价值越大,因此科学家更倾向于追寻适用范围广的命题。可见,波普尔提出的方案不只包含用来给命题“定性”的判定标准,也包含对思想经济性的“量化”考量。由此出发,符合波普尔定义的、关于社会的命题或可称为“社会科学”。在这种立场看来,“社会科学”是一般意义上的科学的一个子集。需要指出,这种观点未必是“科学主义”的——科学主义主张“科学”高于“非科学”,科学家应该做科学而不做非科学;本文则悬置这一价值判断,文中带引号的“社会科学”仅是“可以被证伪的、关于社会的命题”的简写。

    波普尔的证伪主义过于理想化。科学实践并未把逻辑上的可证伪性作为评价的单一标准。柯林斯的“智识网络”和布迪厄的“场域理论”把科学家看作社会行动者,科学理论的发展离不开科学家共同体的社会互动。因而,学术共同体的一个重要使命是约定实践中“证伪”的标准和规则。即使共同体不能约定种种情况下“证伪”的标准和规则,至少研究者本人也应该明确自己特定研究的被证伪条件:如果作者可接受的证伪条件极为苛刻,可能意味着其结论适用的范围也相当有限,因而价值不高,这和波普尔关于思想经济性的考量是一致的。 

    基于上述关于证伪主义的讨论,我们可以总结一种对“社会科学”边界的理解方案:“社会科学”指的是关于社会的可证伪命题。命题适用的范围越广,价值也就越高;不过,适用的范围越广,也越容易遭遇反例。在二者权衡之下,“社会科学”共同体对自身研究的价值和贡献可以给出适度的评估。在这一点上,计算社会科学家和其他社会科学研究者可以达成共识。 

    (二)证伪主义基础上的“预言—复现” 

    基于上述逻辑,可以想象科学研究实践遵循“预言—复现”范式。为了准确描述这一范式,首先需要对关键概念作出说明。 

    第一,本文将“对研究过程进行重复的行为”称为“重复”,而将“得到与原研究类似结果”称为“复现”,对某研究进行重复,不论结果是否复现,都将其称为“重复研究”。 

    社会学等“社会科学”在实践中重视重复研究的程度远逊于其方法论主张。如果在学术实践中拒绝开展、发表重复研究,就无从发现反例、检验科学命题。这样,科学研究难以持续积累,在诸多问题上或许只能浅尝辄止。 

    第二,还需要界定“预言”的含义。“社会科学”的预言有明确前提条件。若忽视这些条件,会导致对“社会科学”抱有不切实际的期盼,或者无法切实评估研究结果的复现水平。关于“预言”或“预测”,陈云松等的探讨值得参考。词源学表明,“prediction”由表示“在前”的前缀“pre-”和表示“说”的词根“-dict”构成,因此译为“预言”比“预测”更加准确。问题是,在什么之前说?在日常语境下,“预言”似指在事件发生之前说。但在“社会科学”语境下,“预言”指的是在答案揭晓之前说,预言之“预”发生在认知维度而非时间维度上。

    科学家在使用“预言”一词时有至少三种不同的用法:“(模型内)样本内预言”“(模型内)样本外预言”和“模型外预言”。样本内预言,指的是对于给定的样本数据,用一部分数据训练模型,再用另一部分来检验模型预言的表现,例如V折交叉检验,其目的往往是避免“过拟合”问题。样本外预言,指的是用旧样本数据训练出模型后预言新样本数据中输出变量的情况。样本外预言是一种典型的重复研究,也最符合“预言—复现”的应有之义。 
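上文所说的“(模型内)样本内预言”,可以用一段极简的V折交叉检验草图来示意(数据与“模型”均为虚构的最简假设——用训练折的均值去预言检验折——仅演示“划分折、训练、检验”的流程,并非文中任何具体研究的实现):

```python
def v_fold_split(n, v):
    """把 0..n-1 的样本下标轮流分成 v 份,依次以其中一份作为检验集。"""
    folds = [list(range(i, n, v)) for i in range(v)]
    for i in range(v):
        test = folds[i]
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, test

def cross_validate(y, v=5):
    """最简“模型”:用训练折的均值预言检验折,返回各折的均方误差。
    均方误差越小,说明模型在未参与训练的数据上预言表现越好。"""
    errors = []
    for train, test in v_fold_split(len(y), v):
        y_hat = sum(y[j] for j in train) / len(train)   # 用训练折拟合
        mse = sum((y[j] - y_hat) ** 2 for j in test) / len(test)  # 在检验折上评估
        errors.append(mse)
    return errors
```

这种在给定样本内部反复“留出一折做检验”的做法,正是避免“过拟合”的常用手段;而“样本外预言”则要求用全新的数据重复上述检验。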

    样本内预言和样本外预言都有一个隐含的前提,那就是用来预言的模型前后不发生改变。由此,本文将前两类预言称为“模型内预言”。此外还有模型外预言,超出经验模型的范畴。例如,“大学排名”就是这样一个例子:经验世界中并不存在一个客观的变量“大学质量”,用来排名的指标的权重、最终排名的高低注定是人为产物。在这个意义上,“大学排名”是“不可证伪”的,也不存在“科学”的排名方法。 

    区分三类预言有助于澄清社会科学智能的限度。社会科学智能利用“样本内预言”训练模型,进行“样本外预言”,“模型外预言”则不在其能力范围之内。这意味着把机器学习方法用于“社会科学”不总是能提高研究质量。诸如“大学排名”等问题在原则上就不太可能通过这类方法解答。不仅如此,还有现实条件制约。例如,多来源的行政管理数据往往缺失重要变量,或数据结构、口径不统一,难以直接应用机器学习方法。另一局限体现在“因果推断”(causal inference)。因为“反事实”不可能被观测到,也就不能被直接用于检验模型预言的准确性。在这个意义上,目前机器学习不能直接用于识别因果,只能辅助进行因果推断,例如通过随机森林方法建构“反事实”做参照组,研究者可以计算政策效应。 

    二、社会科学智能与传统量化研究的观念张力 

    社会科学中传统量化研究亦可在“预言—复现”的框架下理解与评估,这是社会科学智能与量化研究兼容的前提。 有些量化研究并不可证伪,因此量化研究与狭义的“社会科学”并不能画等号。在知识或哲学层面,证伪主义或“预言—复现”思想中的诸多成分对社会科学来说并不陌生;但在实践层面,传统量化研究对这些规范重视不足。在这个意义上,裹挟着另一套“做研究”实践和规范的计算社会科学异军突起,促使人重新审视量化研究“做研究”的惯习,彰显二者之间的观念张力。 

    在“预言—复现”范式的框架下,量化研究实践中的一些观念与社会科学智能不同。二者之间的差别与其说是关于“真伪”的认识论差别,不如说是关于“好坏”的价值观差别。本文并不拟说服任何一方哪种观念更加优越,仅试图澄清二者之间的张力。 

    (一)“不可知论”的社会科学智能研究 

    传统量化研究对机器学习方法的第一个批评是其“不易阐释”。在此首先需要辨别,计算机科学或统计学界也会谈论机器学习方法“不易阐释”,但很多时候“不易阐释”的是计算过程。例如,计算机科学家谈论卷积神经网络可以根据图片数据中每个像素及其周围像素的信息对图片内容进行分类,这个过程可能是难以“阐释”的,但这里的“阐释”属于计算机科学专业知识,与一般社会科学家感兴趣的问题相去甚远。本文谈论的“不易阐释”仅限社会科学家发出的疑问,即复杂性较高的模型结果何以加深我们对社会现实的理解。 

    阐释为什么必要?这反映了两种文化、两套观念的差异。社会科学存在两种研究文化:一种认为模型反映现实生活中的社会机制,另一种认为社会机制高度复杂、不易观察,因而“不可知”(agnostic)。这种对阐释的理解继承了对统计学中“数据模型”与“算法模型”两种文化的思考。数据模型文化假设模型反映客观世界中变量间的关系,往往采用参数模型形式;而算法模型文化不假设模型符合客观世界中的机制,更看重模型的预言表现。 

    从观念来看,传统量化研究的惯习更接近数据模型文化。不过反映“不可知论”思想的机器学习方法亦可在社会科学研究中有一席之地。下文总结五种机器学习方法的用法,据此提出判断阐释重要性的约定:对方法进行阐释的责任应与其在论证中的中心性成正比。 

    第一,机器学习方法的发现可以用于启发进一步研究。有学者主张机器学习的发现可以尝试用其他方法校验,这时前者可以被理解成“指月”之“指”,既已见月,其任务便已完成。例如,采用主题模型对文本进行分析后发现了一些研究者预想不到的主题,以此为契机重返文本可能发现推进研究的新方向,并通过其他研究方法对此进行检验。如果此时作者对方法不加辩护,也就不应主张其具备证明效力。 

    第二,机器学习方法可以为数据进行编码。例如对文本、图片、视频之类的数据进行识别、分类。此时使用者的主要关注点是编码是否准确、是否存在系统性偏误。例如访谈者将访谈录音转为文字,只要准确性差强人意即可。不过机器学习算法的预言准确率会影响基于其结果的后续分析模型中变量系数的大小和显著性,应先进行矫正。 

    上述两种情况下,研究者对机器学习算法可阐释性的要求极低。第二类仅对模型预言准确性有要求,而第一类的要求更少。它们与研究其他环节耦合程度较低、模块化较强,相应的,研究者需要承担的阐释责任也较低。机器学习方法更深入地卷入“社会科学”研究,则又包括以下两种情况。 

    第三,机器学习方法可以辅助其他量化研究方法。例如,用LASSO回归筛选与因变量相关的协变量,再用LASSO筛选与处置变量(treatment)相关的协变量,最后用这两组变量进行使用最小二乘法的线性回归分析。 
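上述“先用LASSO两次筛选协变量、再做最小二乘回归”的流程,可用如下numpy草图示意(坐标下降求解LASSO;数据、变量名与正则化参数均为虚构假设,仅说明双重筛选的步骤,并非文中所涉研究的实现):

```python
import numpy as np

def soft_threshold(z, t):
    """软阈值算子,LASSO坐标下降的基本步骤。"""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_select(X, y, lam, n_iter=200):
    """坐标下降求解LASSO,返回系数非零的列下标(即被筛选出的协变量)。"""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]        # 去掉第j列贡献后的部分残差
            rho = X[:, j] @ r / n
            beta[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j] / n)
    return np.flatnonzero(np.abs(beta) > 1e-8)

def double_selection(X, d, y, lam=0.1):
    """双重筛选:分别用LASSO挑出与结果y、与处置d相关的协变量,
    再以两组变量的并集做普通最小二乘,返回处置效应估计。"""
    keep = np.union1d(lasso_select(X, y, lam), lasso_select(X, d, lam))
    Z = np.column_stack([np.ones(len(y)), d, X[:, keep]])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return coef[1]   # d 的系数,即处置效应估计
```

两次筛选取并集,是为了避免只按结果变量筛选时漏掉与处置相关的混淆变量。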

    第四,机器学习方法可以生成变量用于后续分析。例如,结构主题模型把文本分为若干主题,而每个文档中这些主题的占比可以被用作变量,探究其与其他变量之间的关系。与第二类用法不同,这里“主题”是算法生成的变量,不能在原有数据中找到,通过已有数据来校验算法的准确性较为困难。这时“社会科学”往往通过某种外部校验来评估方法的效度。对于非监督学习方法,这个问题有一定普遍性。 

    跟前两种用法相比,研究者在这两种机器学习用途中对算法的阐释责任变得更大。在第三类用法中,研究者需要解释为什么采用某机器学习方法能帮助研究者找到效果较好的模型,但不一定需要讨论该方法与特定经验中的社会事实之间有何关系。在第四类用法中,研究者需要解释算法生成的变量与社会现实之间的关联,从而说服读者接受作者不乏主观性的外部校验。 

    第五,机器学习方法还可以被直接用于寻找变量之间的关系、回答研究问题。例如,用词嵌入(word embedding)模型探讨20世纪的英语文本中“阶级”概念的七个维度如何演变。词嵌入模型自身有清晰的“预言—复现”含义:根据大量文本中词语共同出现的关系训练出的模型可以根据输入的词语预言接下来出现什么词语。在这一过程中词嵌入模型生成了一个巨大的、描述词语有多大可能一起出现的网络。学者利用这一网络描画词语关系的结构及其变迁,在此,机器学习直接被用以回答研究问题,研究者肩负的阐释责任也最大。这意味着提高机器学习结果的可阐释性对于其应用非常重要。 

    由此看来,机器学习方法在“社会科学”中完成的任务不同,其“不可知”的程度不同、研究者肩负的阐释责任也不同,不可一概而论。因此“不易阐释”作为一个问题在不同研究中可接受的程度不同,应当专事专论。 

在认知的终极,“预言—复现”与“阐释”趋于统一。如前所述,科学的任务是生产可证伪命题,并且在不断的重复研究中确认其逼真性。如果成功的“阐释”可以在符合论的意义上揭示客观世界中变量间的关系,那么能够最好地“阐释”世界的模型也能最好地“预言”世界。由此而言,“预言”与“阐释”在认知的终极统一,因此二者之间的对立是表面的、暂时的。

    当然,现实中人类没法走到这个终极,因而总要对二者进行权衡。假想如下情况:针对一个特定社会现象存在两个模型。数据模型(例如一个线性回归模型)简单易懂,但它预言时准确率仅有20%;算法模型预言的准确率提高到80%,但社会科学家“看不懂”模型建立的输入变量与输出变量的关系。两相比较,哪一个模型更加可取? 

    在“预言—复现”范式下,算法模型更加可取,因为其较高的预言准确率暗示它的结构可能更接近现实世界中相关变量之间的关系。社会科学家能否理解它未必重要,因为此时我们已经得到适用范围较广的可证伪命题,且其逼真性得到重复研究的校验。这符合前述学者的主张,即应首先选择预言能力最好的模型,之后再尽可能搞明白为什么。我们并非鼓励研究者放弃理解、阐释模型,只是不主张将这些作为研究发表的必要前提,模型的阐释工作可以后置。 

    此外,把阐释作为锚点来批评算法模型文化,自身也有问题。如果事先把某些被认为具有“可阐释性”的形式作为认知世界的前提,有可能陷入认知陷阱。例如,线性模型就是这样一种看上去很容易阐释的形式。但是即使物理世界较为简单,物理定律的形式也千变万化;人类社会更为复杂,我们却往往采用简单的线性模型进行分析。对非线性的社会事实采取线性模型本就是“不可知论”式的用法。量化研究者用着“不可知论”的方法,若因反对“不可知论”而从原则上反对机器学习方法,在逻辑上就不自洽了。 

    社会科学智能研究者理念与算法模型文化更亲和,而传统量化研究者观念更靠近数据模型文化。双方对阐释问题的重要性估量不同,算法模型文化认为阐释可以后置,或者由学术分工中的其他环节来解决,而先发表研究结果才能给阐释创造更有利的条件;数据模型文化则认为阐释优先,如果作者不能给出令人满意的阐释就不应允许其结果进入知识世界(例如在同行评议的期刊发表)。与这些观念差异相比,双方在统计技术和对社会世界的认识上存在的差别可能只是次要的。 

    两种观念的差异与两种文化下的实践差异紧密相联。如果研究能够短周期、高频率得到公开发表与重复,则可以通过频繁的纠错和迭代来甄别有价值的研究,每个研究也就可以悬置阐释。而若研究发表的周期长、通过重复研究得到检验的频率低,每个研究一开始就需要阐释得很充分。传统量化研究中的很多规范,目的恰恰是在低重复预期下力图提升研究质量,而这不一定适用于所有的研究实践。 

    (二)“数据驱动”的“社会科学”研究 

    对机器学习的第二个批评是它由“数据驱动”,与“理论驱动”存在本质差异。“理论驱动”的研究绝对优于“数据驱动”的研究,这似乎是部分“社会科学”研究者的信念,这种信念的依据其实并不明晰。 

    批评“数据驱动”的一个角度是“过拟合”。学术界不赞同先进行数据分析再包装成理论驱动的研究,这被称作“根据结果提出假设”(Hypothesizing After the Results are Known,HARKing)。这种做法不仅在道德上不诚实,还可能导致其理论结果在新数据得到重复的可能性降低,因为从数据中归纳的理论有可能拟合了数据中的“噪音”,造成“过拟合”的问题。机器学习研究受数据驱动,自然存在过拟合的风险。此外,如果这类研究一定要符合理论驱动的口味才能发表,往往被倒逼形成“根据结果提出假设”。对此,下述两方面问题有待澄清。 

     首先,在“预言—复现”范式看来,不论是“数据驱动”还是“理论驱动”,只要能更经济地生产逼真性高、适用范围广的可证伪命题就够了。除非有证据表明“理论驱动”的研究能更经济地生产逼真性高、适用范围广的可证伪命题,否则仅就这一范式来看二者没有优劣之分。传统量化研究周期长、重复罕,采取“理论驱动”可能是经济上合理的。但当学术共同体具备了周期短、重复繁的研究条件后,就不一定要坚持“理论驱动”了。过拟合问题也是如此,即便初始研究因为“数据驱动”而拟合了“噪音”,只要在重复研究中发现这一问题并将其证伪即可,在周期短、重复繁的条件下,这不会造成长远的影响。相反,即便研究者非常谨慎地进行“理论驱动”研究,仍不能保证其结果一定可以得到复现,而不进行重复研究就不能对此作出有根据的评判。这一点跟“不易阐释”遇到的问题类似,两类观念的张力根源在于研究实践的差异。 

    其次,不同的研究文化对于命题如何上升成“理论”的理解存在差异。例如,不少传统量化研究回答的仅仅是适用于一时一地的具体问题,这在一些批评者看来已经相当缺乏“理论”贡献,所以才有认为这些研究“精致的平庸”的批判。与此同时,一些社会科学智能研究关注的现象甚至没法回应已有文献中的“理论”问题,而自己另起炉灶提出了新的问题,这样似乎更加脱离“理论脉络”。

    在“预言—复现”范式下,如果研究命题的普适性更强,固然价值更大;如果能形成一套这样的命题构成的系统性解释,固然价值更大。但是限于种种现实条件,有时研究者的选择是把更多精力投入到结果的扎实性上。到底是看上去更加普适或更加系统但逼真性可能不太高的结果更有贡献,还是看上去非常具体但逼真性更有说服力的结果更有贡献?这本身便是价值判断而没有天然正确的答案。

    总之,关于“数据驱动”还是“理论驱动”的问题,还需要更贴合研究实践的讨论。采取“数据驱动”进行研究到底会带来什么恶果?对此需要更多经验证据来揭示。 

    三、社会科学智能带来的现实挑战 

    在讨论“预言—复现”范式的基本逻辑时我们已经触及了思想的经济性问题,但对在真实世界的政治、经济、技术条件下运作的社会科学智能,目前的讨论还相当不足。实际上社会科学智能的勃兴根源在技术条件的改变,而其发展也不能忽视下述三方面问题。 

     首先,社会科学智能面临比较明显的数据不平等问题。社会科学智能肇始于新技术条件下人类行为数据的大规模沉淀。尽管智能方法也可以用于分析小规模数据,但往往不能取得比传统方法更好的效果。然而获取大量数据的成本相当高昂,因而常见的数据是政府或企业在业务中沉淀的数据。这带来两个问题,第一,数据中采集哪些信息、以什么方式采集并不由研究者主导,因而最终数据不一定特别符合学术研究的需要。这也是为什么社会科学智能研究常常呈现为“数据驱动”——数据采集本身就是为业务而非理论服务的。第二,这类数据有高度的排他性,往往不能开放任学术界分析。这时可能会出现少数学者掌握数据进而掌握话语权的问题,而由于缺乏可比的数据,同行很难通过重复研究对其进行评估与监督,甚至可能存在偏见或利益冲突。虽然数据不平等不是随着社会科学智能新出现的问题,但是因技术和隐私原因,大数据更难公开或完整地进入社会科学研究。 

    其次,社会科学智能面对一定程度的算力不平等问题。社会科学智能的方法本身不具备排他性,研究者可以自己学习使用。但是要实现算法需要的算力却并非如此。例如,当前备受关注的大语言模型参数可达上百亿,训练这样的模型所需的算力远非普通研究者自己能够满足。虽然也可对产品化的模型进行调优,但是这对技术和资源的要求已经把不少社会科学界的研究者拦在门槛之外。除非学术共同体有意提供相应的基础设施,否则算力不平等问题势必不同程度地存在,研究者将日益依赖大型技术企业的支持。 

    以上两点虽然非常值得警惕,但不能因此而否定整个领域。近年来关于企业或其他组织中数据化治理的研究表明,企业或其他组织正逐步开展大量的社会科学智能研究,在此基础上可以理解系统中利益攸关方的行为,甚至寻找操控攸关方行为的干预方案。如果学术界不能获得对等的认知能力,就难以为有效监督算法滥用行为建言献策。

    最后,目前的学术出版实践不满足发展社会科学智能的要求。如前文所说,社会科学智能研究包含大量的技术问题,如果不加以澄清则重复研究难以实现。传统量化研究也面对这一问题,因此目前国内外有一些高水平刊物提供在线技术附录以及代码。但是实际上仅靠期刊提供的代码能成功重复原有研究的比例相当低。由于社会科学智能涉及的非参数方法内在的不确定性往往更大,这一问题对社会科学智能来说更为突出。 

     从更大的制度视野来看,社会科学智能研究面临着注意力经济问题。正如批评者所言,社会科学智能研究或传统量化研究生产了大量“精致的平庸”的成果。在“预言—复现”框架下,“精致的平庸”无可厚非——逼真性更低的宏大和逼真性更高的片面各有其价值。如果在更高逻辑层次将这些碎片化的研究整合在一起,或许能焕发其真正的价值。但是实际上,研究成果的数量高速增长,相应水平的更高逻辑层次的整合却并没有发生;反而由于论文太多,学者有限的注意力更容易集中在已经得到广泛引用的文献上,新研究要想给领域带来突破变得更加困难。因此,问题并不在于那一篇篇看上去不够激动人心的研究本身,而在于学术界的注意力分配机制已经不适应目前学术界的生产力水平。要想解决这一问题,呼吁个体研究者改变行为恐怕于事无补,突破目前以学术期刊为载体的注意力分配机制才更有希望。

    总之,社会科学方法获得合法性并不仅仅依赖方法论上的合法性,还受其所处的社会现实条件的影响。对此必须从社会学的视角加以理解,否则难免陷入“不接地气”的窘境。 

    结语 

    通过观察计算社会科学和传统的社会科学量化研究,本文尝试归纳出一个基于波普尔证伪主义的“预言—复现”范式,作为理解其各自研究运作逻辑和二者间张力的方案。由此回应量化研究对社会科学智能的两条主要批评,即其“不易阐释”和受“数据驱动”而非“理论驱动”的问题。在“预言—复现”范式下这些批评都有逻辑自洽的解答方案,成为问题的与其说是方法本身,不如说是方法所内嵌其中的学术研究条件和学术共同体的实践。在数据积累、方法更新、范式转型逐渐显现的当下,就“社会科学”研究方法论达成一定共识,形成新的约定,从而与社会各界共同探索数据、算力和理论等各种要素在新研究实践中的作用方式,是亟须学术共同体重视的问题。 

    需要补充的是,基于仿真的计算社会科学研究——例如采取多主体行为仿真方法、以演绎逻辑为主的研究,是否也统摄于本文所说的“预言—复现”范式之下?这类研究可以用于生成可证伪的命题、建构待探索的理论,因而在“预言—复现”范式中扮演了“命题—理论”生成的角色,从而可以被统摄于“预言—复现”范式之下;亦有部分并不能生成可证伪命题,但或许能给读者带来启发,价值不可抹杀。多主体行为仿真的系统参数设定可能需要参考经验数据,因而也并非无本之木。在这个意义上,两类计算社会科学研究互相渗透。关键不在于具体的研究方法本身,而在于研究者如何使用这一方法。 

     总之,社会科学智能为“社会科学”带来启发,并不只是纯粹智识上的冲击。如前文所说,“预言—复现”范式的很多成分在智识层面已或多或少地为人所知,真正的冲击来自一群遵循相当不同的研究实践规范的研究者“侵入”了社会科学的领域、采取不同的方法研究人类行为问题、取得吸引人的成果。如果采用社会科学智能可以比理论驱动的量化研究更经济地生产逼真性高、适用范围广的可证伪命题,那么传统的量化研究的存在意义可能面临危机。客观来说,很多非社会科学的学科甚至非学术界的主体已经在这一方面体现出优势,而社会科学家的“死与生”甚至从未进入其考量。中国社会科学界理应积极应对方法论和方法的新问题,从而为建设中国特色哲学社会科学作出贡献。本文正欲在这一点上发力,尝试为不熟悉计算社会科学的研究者提供理解计算社会科学研究模式及意义的图示,从而打通不同研究者的惯习系统,帮助我们更具自觉地从事、评价社会科学研究。 

    本文转自《中国社会科学评价》2024年第4期

  • 卜宪群:秦汉的乡里社会与国家治理

    秦统一后,“海内为郡县,法令由一统”,建立了中国历史上第一个大一统中央集权郡县制国家。郡县制的基础是乡里,先秦以来的乡里制度至秦的统一被整齐划一,推行全国。汉兴,实行郡国并行制,但封国之内采取的并不是等级分封制而是郡县制,乡里制度也继续得到巩固发展。乡里制度的持续发展,形成了人口庞大、结构复杂的乡里社会。

    乡里社会流动与治理

    秦汉大一统国家建立后,统一的政治环境与社会环境为乡里社会流动创造了更好的条件,乡里社会的流动性特征也更加突出。一是政治性流动。秦统一后为防止“六国后”、豪强等政治势力扰乱地方和巩固边防,曾多次大规模迁徙,如秦始皇二十六年“徙天下豪富于咸阳十二万户”,三十五年“因徙三万家丽邑,五万家云阳”,三十六年“迁北河榆中三万家”,还有“以适(谪)遣戍”“徙谪”等方式的大规模移民(《史记·秦始皇本纪》)。秦虽短祚,但移民数量十分庞大。汉兴,这些迁徙政策继续得到执行,并扩大到吏二千石等高级官员范围。政治性迁徙对被迁者来说是被动的,但仍为社会流动的一部分。在迁徙过程中,他们的社会身份与地域空间都被改变,所处的社会结构也被改变。政治性迁移虽然是针对特定人群的有组织的集中迁移,但他们一般也都散落在所迁之地甚至沿途的乡里之中,并非按照原有乡里组织的整体搬迁或重新建制。《史记·货殖列传》记载秦破赵后“诸迁虏少有余财,争与吏,求近处”,只有卓氏夫妻推车到达被迁处。这说明原赵国豪富是一家一户被迁徙的,卓氏夫妻与“求近处”的迁虏也都会被安置在沿途乡里之中。东汉马援的祖先武帝时以吏二千石身份从邯郸被迁往茂陵,安置在成懽里(《后汉书·马援列传》注引《东观纪》),马氏必成为成懽里的居民之一。二是贫困性流动。秦汉民众一般都居住在固定的里中,拥有户籍,称为编户齐民,受到严格管理,非经政府允许不能擅自迁徙,但在贫困等特殊情况下,其迁徙往往突破限制。《汉书·食货志》云:“贫生于不足,不足生于不农,不农则不地著,不地著则离乡轻家。”秦汉民众因贫困、灾害和统治阶级的压迫而“离乡轻家”者人数众多,《汉书·鲍宣传》记载鲍宣列举了水旱灾害、官府重责、贪吏苛取、豪强蚕食、徭役无度、盗贼劫掠等七种致使民众流亡的情况,称之为“民有七亡”,大体反映了汉代流民实际。与政治性迁移不同,他们离开乡里是散乱无序的,也没有明确的目的地,往往还被法律定为失去户籍的“亡命”或“脱亡名数”之人。三是职业性流动。士农工商是秦汉职业划分的基本形式,乡里从事农业生产的农民是多数,但实际上他们的职业是多样化的。有人教书、游学,如陈平“伯常耕田,纵平使游学”(《汉书·陈平传》),王充“后归乡里,屏居教授”;有人出为官府小吏,如王尊“求为狱小吏”(《汉书·王尊传》),许荆“家贫为吏”(《后汉书·循吏列传》注引《谢承书》),这些“吏”不是官员而是为官府服役的穷苦人;有人打柴、放牧、庸作,如朱买臣“常艾薪樵,卖以给食”(《汉书·朱买臣传》),公孙弘“牧豕海上”(《汉书·公孙弘传》),《汉书·昭帝纪》云:“比岁不登,民匮于食,流庸未尽还。”师古曰:“流庸,谓去本乡而行为人庸作。”足见灾年为庸者人数不少;还有人从事商业、手工业及其他各种行业,如齐地“好末技,不田作”(《汉书·龚遂传》),张楷“常乘驴车至县卖药”(《后汉书·张霸传》),崔寔“以酤酿贩鬻为业”(《后汉书·崔寔传》),申屠蟠“庸为漆工”(《后汉书·申屠蟠列传》),贡禹曾描述西汉后期在繁重的剥削压迫下,“故民弃本逐末,耕者不能半。贫民虽赐之田,犹贱卖以贾”(《汉书·贡禹传》)。汉代察举又称为“乡举里选”,乡里民众或因受到“乡论”赞誉而被察举出仕,或被本地政府辟任属吏,大多离开了故土,又因致仕、罢免等各种原因回归乡里,形成闭环式流动。

    秦汉国家针对上述社会流动都制定了不同的治理措施。对政治性迁徙者,国家不仅没有剥夺他们的财产,还给予经济上的宽松政策和一定的政治待遇。如汉初迁徙齐楚大族到关中时就“与利田宅”(《汉书·高帝纪》),而对普通民众的迁徙,或赐予爵位,或备好房屋和生产工具,保障他们的基本生活条件甚至社会政治地位。对贫困性迁移者,国家采取安抚政策,如允许从狭乡迁往宽乡,或给予流民基本生活生产资料,但并不主张流民留在他乡,而是让他们返归故里。对职业性迁移者,除官员外,国家并不鼓励民众离开乡土。秦汉有“禁民二业”的政策和“重农抑商”的政治传统,但实际作用并不大。由于官员的户籍可能仍留在乡里,故回归乡里是官员仕途终结后的常态。《后汉书·苏不韦列传》云:“汉法,免罢守令,自非诏征,不得妄到京师。”至少反映法律规定部分官员不能随意留在京城。

    乡里社会阶层结构变化与治理

    战国以来的户籍制度推动了乡里编户齐民社会阶层的形成。秦及汉初,动荡结束之后,国家鼓励军人和民众回归乡里,登记户籍,按照爵位身份分配土地,成为编户齐民。“编户者,言列次名籍也”(《汉书·高帝纪》颜注),“齐等无有贵贱,故谓之齐民”(《史记·平准书》《集解》引如淳注)。编户齐民就是在乡里登记户籍的民户,国家按照什伍编制将他们组织起来,承担赋税徭役,维护基层秩序。他们身份平等,没有贵贱之分。一家一户的编户齐民是秦及汉初最为广泛的社会阶层,对稳定基层社会起到了很大作用。汉初“文景之治”及社会繁荣局面的形成,与编户齐民社会结构有很大关系。但是,这样稳定的局面并没有延续很长时间,汉兴六七十年后,一种称为“豪民”的社会阶层兴起,他们暴虐乡里、兼并土地、扰乱社会,其势力持续发展,成为影响乡里社会结构的重要力量。

    汉代的豪民来源大致有三种:一是六国贵族豪民的延续。史称“汉承战国余烈,多豪猾之民”。虽然汉初对他们采取了强硬的迁徙政策,但由于在经济上不仅没有剥夺他们的财产,甚至给予照顾,使得他们在乡里的经济势力又很快得到恢复和发展,“其并兼者则陵横邦邑,桀健者则雄张闾里”(《后汉书·酷吏列传》),就是写照。二是编户齐民自身的分化。汉代的编户齐民在法律形式上是平等的,但国家并不赋予他们经济上的平等,乡里民众的经济状况实际上千差万别。如张家山汉墓竹简《二年律令·户律》关于爵位授田的规定中,关内侯至庶人的授田标准从九十五顷至一顷不等,数额相差巨大。又据竹简《奏谳书》十六记载,里中还居住有关内侯、大庶长、右庶长、封君、五大夫等高爵之人,他们在经济上与普通编户民显然不同。此外,编户民各家人口不同、劳动力强弱不同、居住地域不同以及抗衡自然灾害能力不同等因素,也使他们的贫富差别客观上各不相同。如江陵凤凰山汉简《郑里廩簿》所载郑里的25户百姓中,家庭人口有1人到8人不等,能田作者1人到4人不等,占有土地的数量有8亩到54亩不等,贫富差距明显。三是工商业者和贵族官僚向乡里转移。汉武帝时期,工商业者受到新的官营政策打击,“以末致财,用本守之”的理念愈加普遍,其势力“大者倾郡,中者倾县,下者倾乡里者,不可胜数”(《史记·货殖列传》)。贵族官僚“身宠而载高位,家温而食厚禄,因乘富贵之资力,以与民争利于下”(《汉书·董仲舒传》)。

    “豪民”不是法律上的称呼,一般也不具备特殊的政治权益,但因为经济势力强大而改变了乡里社会结构,原本平等的编户齐民之间有了明显的高下贫富之分。如《史记·平准书》云武帝时:“网疏而民富,役财骄溢,或至兼并豪党之徒,以武断于乡曲。”《索隐》云:“谓乡曲豪富无官位,而以威势主断曲直,故曰武断也。”又《后汉书·仲长统列传》云:“汉兴以来,相与同为编户齐民,而以财力相君长者,世无数焉。”豪民因财富形成了独立于国家权力之外的权威,“宁负二千石,无负豪大家”的谚语就是写照(《汉书·酷吏传》)。西汉中后期到东汉,编户齐民的人身依附程度不断加强,乡里社会阶层结构也随之变化。

    秦汉治理豪民的政策基本是打击,如政治上迁徙、经济上限制、社会地位上贬低和使用酷吏等,也采取利用方式,改造他们“佐吏为治”或“以为爪牙”(《史记·酷吏列传》)。豪民也通过自身转变,学习经学、调整自我行为,进入官僚队伍。西汉中后期到东汉,豪民及子弟或辟为属吏或察举入仕成为常态,促进了豪民的国家认同。

    乡里宗族势力兴起与治理

    先秦宗法制度解体后,以血缘关系为纽带的宗族意识与宗族组织仍然顽强保留了下来,这些宗族不仅限于六国后,基层社会也有宗族。如《岳麓书院藏秦简(叁)》中的《识劫案》,就记载了秦王政时大夫沛娶妻后,还要召集宗人征求意见,获得许可后其妻方可入宗,说明宗族在基层社会中仍发挥作用。又,刘邦说“今萧何举宗数十人皆随我”(《史记·萧相国世家》),萧何当时的身份地位也不高,但却有自己的宗族“数十人”。经过汉初六七十年的发展,宗族势力开始抬头,故汉武帝设置刺史时,把打击“强宗豪右”作为考核官吏的第一条标准。当然,在整个西汉,宗族内部之间的联系还不十分紧密,族内公共事务管理机制也不完善,宗族活动大都限于宗族内部家庭之间的议事、赈恤、互助或政治上的提携,干预国家政治、扰乱乡里社会的现象并不十分突出。两汉之际的政治动荡和东汉政权的建立,推动了宗族势力的大发展。战争为有势力的家庭扩大在宗族中的影响力创造了条件,宗族的凝聚力加强。东汉政权的建立者,其本身或为大族,或有大族背景。东汉宗族内部的制度建设、联系机制、政治活动都明显加强。“祠堂”(《盐铁论·散不足》)在西汉中期已经产生,至东汉更普遍,“郡县豪家”往往兴造“庐舍祠堂”(《潜夫论·浮侈》),祭祀祖先。宗族内部已经有较明确的族际范围,有一定的议事制度,有族长支配全族,有相互收养赈恤义务,有法律连带责任以及宗族内部的礼仪规范等。东汉的宗族显然较之前有了更全面的发展。

    取代宗法制度的宗族及其组织在秦汉的发展,本质上是社会经济发展在社会结构与阶级结构上的反映。史书上所称的“大姓”“强宗”“豪族”“衣冠”“著姓”等,往往都是宗族中的代表性人物或家庭,背后都有强大的宗族支撑,他们既是乡里社会中的一员,也与乡里社会融为一体。而宗族的发展及其内部组织化、规范化程度的不断提高,也对乡里社会产生重大影响。一方面,宗族组织的内部互恤救助等行动,弥补了国家政权力量的不足,起到稳定乡里社会、保全民众的作用;另一方面,宗族中的代表性人物以及这个社会阶层,在政治上对国家权力产生强烈要求。东汉后期,不仅乡里基层政权多为各地宗族势力所把持,察举选官制度的核心“乡论”也被宗族势力所垄断。

    秦汉四百多年的历史,乡里不仅始终是国家最重要的经济基础、政治基础、文化基础与社会基础,也始终是国家治理的重要对象。秦汉乡里社会与国家治理积累了丰富的历史经验,也留下了深刻的教训,在中华文明发展史与国家治理史上占有十分重要的地位。

  • 高江涛:都邑遗址考古与中华文明探源——陶寺考古的历程与成就

    陶寺位于临汾盆地的核心区域、塔儿山脚下,如果将其置于更大的时空背景中考察,可以发现陶寺遗址处于我国两大农耕区的交汇地带。考古学界多用“重瓣花朵、多元一体”形容史前文化格局,其中,花蕊所在区域就是中原地区,即陶寺所在区域。可以说,陶寺遗址是探索中华五千多年文明的代表性遗址之一,是新中国成立以来中国考古学发展历程中的“亮点”,也是探索与传承中华文明丰富内涵和精神标识的典型遗址之一。全面系统梳理陶寺遗址发现与研究的历程,并在此基础上阐释与总结其成就和贡献,对于中国考古学的未来发展具有启示意义。

    初识陶寺

    1958年,山西省开展文物普查工作,在陶寺村的南沟与赵王沟之间,发现面积可能为数万平方米的史前遗址,陶寺遗址遂被发现。1959年,中国科学院考古研究所(今中国社会科学院考古研究所)组建山西队。同年,徐旭生的夏墟调查研究给当时的中国考古学带来了新的研究方向,推动了“夏文化”研究的升温,甚至成就了时至今日未曾中断的研究热点,夏文化探索也成为“考古中国”的重大项目之一。

    1959年至1963年秋冬,中国科学院考古研究所山西队在晋南地区进行了四次大规模的考古调查,从行政区域看包括临汾地区和运城地区的15个县,8000余平方公里,发现仰韶文化至北朝时期遗址306处,其间发掘垣曲县的丰村、龙王崖、口头遗址等,尤其是1963年冬,在陶寺村南、李庄东南、中梁村东北和沟西村北又发现4处遗址。以往学界对1959至1963年晋南大调查的重要性认识不足,这次调查不仅发现了众多遗址、开展了一些田野发掘,一定程度上还奠定了山西考古的早期基础,也是徐旭生夏墟调查工作的延续,调查中发现的陶寺遗址以及东下冯遗址等成为探索晋南夏文化的重点遗址。更为重要的是,这些考古调查材料揭示出河南龙山—二里头时代晋南政治中心的兴衰, 学界普遍认为与陶唐、夏墟传说紧密相连。此外,在考古学理论与方法上,此次区域调查可以说是聚落形态考古引入中国前,具有聚落考古特点的“区域系统调查”。

    值得注意的是,1973年中国社会科学院考古研究所与山西省文物工作委员会复查了陶寺遗址,敏锐地发现之前陶寺周边的几处遗址基本连成一片,面积已达到数百万平方米,陶寺从一个普通规模的遗址跃居成为超大型遗址。考古工作者初步认识到它是一处属于龙山文化时期的十分重要的遗址,于是将其确定为晋南首选发掘对象。1977年,高天麟、高炜、郑文兰与襄汾县文化馆的尹子贵、陶富海,再次复查陶寺遗址,为接下来的正式考古发掘打下基础。1978年4月初,中国社会科学院考古研究所与山西省临汾行署文化局合作,开始正式发掘陶寺遗址,拉开了陶寺考古科学发掘与研究的大幕。

    1978年至1985年,是陶寺遗址的初始发掘阶段。考古学是一门实证学科,根据人类活动遗留下来的实物研究人类社会的历史,田野发掘是解决问题的关键手段。这一阶段,陶寺遗址的发掘获得突破性进展,发现了面积达4万平方米的墓地,发掘了1309座墓葬。此外,对居住址进行了小规模发掘,发现年代属于庙底沟二期文化的遗存。随着墓地和居址的发掘,在边发掘、边整理、边研究的理念指导下,发掘者初步认识了陶寺遗址的内涵、特征、年代,并建立起了陶寺文化早、中、晚三期文化发展序列,将其文化性质认定为中原龙山文化的“陶寺类型”,为深入研究奠定了坚实基础。

    研究深入

    发掘整理与研究阐释应是考古学同等重要而不可偏废的两个基本方面,不能重“发掘”,而轻“研究”。1985年之后,陶寺考古在之前重要发掘的基础上,开始转入不断深化研究与阐释阶段。

    首先是对陶寺发现墓地和墓葬的研究,主要成果集中于对大、中、小型墓葬的细分:9座大墓分为甲、乙两种,80余座中型墓分为甲、乙、丙三种,610余座小型墓分为甲、乙两种。这些墓地和墓葬的研究表明,陶寺文化早期族群内部就已经呈现金字塔式社会组织结构,极少数贵族占有大量财富、拥有权力,90%的墓主很可能是族群一般成员甚至是奴隶,没有任何随葬品。发掘者据此认为其可能进入了阶级社会,国家的雏形已经产生。

    具有突破性意义的研究是张岱海所撰《陶寺文化与龙山时代》和高炜所撰《陶寺考古发现对探索中国古代文明的意义》,在之前中原龙山文化“陶寺类型”基础上,直接改称其为“陶寺文化”。文化命名的提出意义非凡,使得学界对史前时期这一特殊区域、特定年代、特有属性的文化体及相关人群有了具体而清晰的研究对象,大大拓展了研究的深度和广度,使探讨中华文明起源及早期国家形成等重大问题成为可能。

    需要强调的是,“陶寺文化”并不等同于“陶寺遗址”,陶寺遗址被发掘之后,考古学家发现它周边还有很多与其物质文化类似的其他遗址,经过考古调查,大概有近百处,主要分布于晋南地区的临汾盆地,即峨嵋岭以北汾河下游及其支流浍河、滏河流域。陶寺遗址是其中典型代表且发现较早,所以以“陶寺”命名这个地域的考古学文化为“陶寺文化”。自此,分布于晋南区域的这类龙山时代遗址有了统一的考古学文化称谓。即使如此,这一考古学文化却也存在观点上的分歧。1990年,一些学者曾提出陶寺文化早期属于庙底沟二期文化而非陶寺文化的看法,可谓是基本统一认识之前的小插曲,这恰恰是研究不断被引向深入的直接体现。

    正是因为陶寺遗址所在“夏墟”的史迹和探索夏文化的预设背景,早期发掘后主流观点认为其是夏文化遗存,1983年发掘简报的结论就指出其是探索“夏文化”的重要资料。同时,高炜、高天麟、张岱海撰写《关于陶寺墓地的几个问题》一文,专门探讨了陶寺遗存与夏文化,认为陶寺遗址及墓地很可能就是夏人的遗存,但同时也不排除这里是与夏人居处邻近的另一个部落。这种肯定而又留出开放讨论空间的认识为之后夏文化属性的探讨拉开了序幕。何建安、刘起釪、黄石林等学者认同陶寺文化很可能是夏文化,有的学者考虑到陶寺遗存的测年数据与传统认为的夏纪年有所抵牾,认为陶寺类型文化可能是夏代早期文化,而二里头文化可能是夏代中晚期文化。1985年,十分关注考古发现的先秦史研究专家李民首先挑战了当时的主流观点,在《尧舜时代与陶寺遗址》一文中提出了陶寺遗址与尧舜相关而与夏文化无关的看法。一石激起千层浪,陶寺一类遗存的属性姓“夏”,还是姓“尧”,乃至“尧舜”及其他,争论逐渐火热起来。1987年,王文清明确指出陶寺文化可能是陶唐氏文化遗存。紧随其后,罗新和田建文等明确指出陶寺文化属唐尧文化。大体同时,刘绪也专文否定陶寺类型属夏文化。还有学者如许宏与安也致持陶寺文化属有虞氏文化遗存的看法。值得注意的是,1994年高炜等专家改变原来夏文化的看法转而主张尧文化,而2001年王克林改变原来认为龙盘与夏人有关的认识,认为陶寺文化是以陶唐氏为首的联合有虞氏和夏后氏等氏族部落联盟所在的文化遗存。

    硕果纷呈

    世纪之交的1999年,陶寺遗址重启田野发掘工作。2000年,考古人员在陶寺遗址的北部,钻探解剖了疑似夯土墙遗迹,分别编号Q1、Q2和Q3,排除了Q1为城墙的可能,确认Q2、Q3为陶寺文化中期城墙,并确定了其年代。在遗址东部与南部,发现并确认陶寺文化中期城墙Q4、Q5与Q6。陶寺三面城墙相连而成、面积约280万平方米的大城逐步得到确认。此外,2001年还在大城以内的东北部发现有疑似墙基的夯土遗迹,即墙Q8~Q11,暗示大城内还有重要城墙类遗存。1999年至2001年,经历三年的逐步发掘确认,陶寺发现了当时黄河流域最大的“城址”,一座面积达到280多万平方米、沉睡4000多年的大城渐渐露出“庐山真面目”。陶寺遗址再现大城意义重大,成为陶寺考古史上的又一次突破,使得陶寺遗址由一个重要的聚落成为一个大型城址。

    2002年至今,陶寺考古进入了新阶段。随着中华文明探源工程的启动与推进,陶寺遗址重大发现层出不穷,主要有“观象台遗迹”、中期墓地及中期大墓M22、城北夯土建筑基址、手工业区官署基址、宫城及其门址、宫城内的大型宫殿基址等一系列重大发现。其中,2012年至2017年钻探发现并历时五年发掘确认大城之内面积近13万平方米的“宫城”,2018至2022年又连续在宫城内发掘面积达6500平方米的1号宫殿基址和面积近600平方米的2号夯土基址,使得陶寺出现了外有郭城内有宫城的“双城”结构。中华文明探源工程中的这些重要发现逐渐确立了陶寺遗址为史前一处都城的重要地位。一座“文明都邑”逐渐显现在世人面前,而且这座都邑又很可能与我们共同的文明先祖尧密切相关。

    值得关注的是,陶寺遗址正式发掘伊始,就较为广泛地采取自然科学手段如古地磁、孢粉分析、动物考古、铜器分析、地貌水文考察等,有着先进的发掘理念和研究思路。中华文明探源工程开展以来更是如此,包括系列测年、古环境、动植物考古、手工业技术、天文考古、同位素分析、古DNA、残留物分析等手段全面展开并推向深入。可以说,从单一学科研究到多学科、多层次、多角度联合科技攻关,陶寺遗址是考古技术与方法理论的“实验场”与“孵化器”。在不久的未来,陶寺考古还将在都邑的年代序列、环境、资源、生业、人群族属等科技考古方面取得丰硕成果。

    六十余年陶寺遗址的一系列不间断重大发现和丰硕研究成果表明,陶寺文化已经呈现早期国家特征,礼乐文明初步形成。距今约4300至3900年间,以陶寺为代表的中原地区,在广泛吸收各地文明要素的基础上创造发展、迅速崛起、走向一体,中华文明形成与发展进入新的重要阶段。这些考古成果体现着几代考古人的不懈求索、薪火相传,展望未来,我们仍将脚踏实地、孜孜以求,向着不断推进陶寺考古取得新的成果稳步迈进。

    本文转自《光明日报》( 2025年03月03日)

  • 林辉煌:贫困的能力结构——一个解释框架

    中国的脱贫攻坚战,到2020年已经进入尾声。但是,作为一个社会问题,贫困尤其是相对贫困依然会以各种形态存在于2020年之后的中国社会。如何巩固脱贫攻坚战的既有成果、预防返贫及新型贫困形态的产生、有效治理相对贫困,是2020年之后贫困治理工作的关键所在。为此,我们必须从既有的扶贫经验出发,进一步在理论层面上厘清贫困的属性与生产机制。

    一、收入、消费与贫困

    学界在界定贫困问题的时候,一般都是围绕收入展开的。然而被调查者倾向于隐藏自身的真实收入,导致收入的测算有可能被低估。因此一些学者提出,采用消费/支出变量来测量贫困状况更为真实可靠。以消费为变量,可以对贫困进行不同的分类:在所有时间内都保持低消费的是持久性贫困,由于消费的跨期变动而导致的贫困为暂时性贫困,因平均消费持续低迷而导致的则是慢性贫困。也有学者结合收入和消费两个变量重新理解贫困的类型,将家庭的收入和消费都低于贫困线标准的状态称为持久性贫困,将家庭的收入低于贫困线而消费高于贫困线的状态称为暂时性贫困,而将家庭收入高于贫困线、但是消费低于贫困线的状态称为选择性贫困。根据消费来测量贫困可能存在两个问题:第一,收入低于贫困线而消费高于贫困线的家庭,不一定是因为既有资产较多,也有可能是通过举债来消费,其自身的真实消费能力不一定很高;第二,收入高于贫困线而消费低于贫困线的家庭,如果消费是可以自行控制的,仅仅是因为生活习惯或宗教习惯而保持低消费水平,那么就没有理由将其视为贫困户。
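文中按家庭收入、消费与贫困线的组合划分贫困类型,这一分类可整理为如下判定草图(贫困线与收入数值均为虚构假设,仅示意分类逻辑):

```python
def poverty_type(income, consumption, line):
    """按收入—消费相对贫困线的组合划分贫困类型。"""
    if income < line and consumption < line:
        return "持久性贫困"        # 收入、消费均低于贫困线
    if income < line:
        return "暂时性贫困"        # 收入低于而消费高于贫困线
    if consumption < line:
        return "选择性贫困"        # 收入高于而消费低于贫困线
    return "非贫困"
```

这一判定也直观呈现了文中的两点疑虑:被归入“暂时性贫困”的家庭可能只是举债维持高消费,被归入“选择性贫困”的家庭则可能只是自愿低消费。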

    以收入指标为基础,我们可以进一步讨论贫困的属性。绝对贫困理论认为,贫困是一种客观的存在,而不仅仅是比较(相对)的产物或想象(主观)的产物。当家庭的可支配收入不足以维持家庭成员身体正常功能所需的“最低”或“基本”数量的生活必需品集合(主要包括食品、衣服等),这种生计资源的匮乏状态就是一种典型的绝对贫困,亦即生计贫困。生计贫困的概念始于20世纪初期,用来描述一个家庭难以生存的绝对困境。从生物学的角度来看,维持生存需要最基本的营养条件,而这些营养条件是可以精准测量并转化为基本的收入指标。到20世纪中期,考虑到贫困家庭的社会需求和人力资本积累的需要,诸如公共卫生、教育和文化设施等社会保障内容被加入绝对贫困的收入测度中,由此产生了基本需求的概念。所以,作为真实存在、触手可及的贫困,一般被描述为家庭基本需求的匮乏,人们可以利用绝对贫困线来测度贫困的广度和深度。大致而言,家庭基本需求包括食物、穿戴等基本生存需求,以及基础教育、基本医疗、基本住房等基本社会需求;贫困所描述的正是家庭可支配收入低于家庭基本需求成本的一种状态。

    根据家庭基本需求的成本,可以合理确定贫困线的水平,具体方法包括预算标准法、食物支出份额法、马丁法和食物-能量摄取法等。从现有贫困线的确定方法来看,主要依据的是食物支出,强调食物在维持家庭成员身体能量的作用是贫困线确定的基础。虽然非食物支出在贫困线的确定过程也被考虑进去,但是基本上都属于家庭基本生存需求,至于教育、医疗、住房等基本社会需求的成本则较少在贫困线的确定中得到充分反映。换言之,官方的绝对贫困线标准常常低于实际的家庭基本需求成本。

    如果说绝对贫困测量的主要是家庭收入无法满足基本需求的一种匮乏状态,那么相对贫困测量的主要是社会的不平等;相对贫困不再基于基本需求,而是基于社会比较。如果所有家庭都能够实现其基本需求,那么还存在贫困问题吗?相对贫困理论要回答的就是这个问题。根据该理论,那些在物质和生活条件上相对于他人匮乏的状态就是相对贫困。相对贫困关注的不仅仅是物质条件在客观上的差异,还有因为这种差异所可能带来的社会排斥与相对剥夺感。经济发展所带来的贫富差距的扩大,以及这一差距所带来的严重的社会和政治紧张局面,对社会凝聚力具有极大的破坏性。贫富差距剧增以及相对贫困的形成,实质上是整个社会资源分配不平等所导致的相对窘迫状态。

    相对贫困的测量,一般以相对贫困线为标准。而相对贫困线的制定方法主要有以下四种:第一种是预算标准法,即由专家所研究的贫困群体的代表根据社会认可的生活水平制定的收入贫困线;第二种是社会指标法,即通过计算群体成员的剥夺程度、依据收入和剥夺程度的关系来计算贫困线;第三种是ELES法(extended linear expenditure system),即以拓展线性支出系统为理论基础制定的贫困线;第四种是收入法,即以社会收入集中趋势的一定比例作为相对贫困线,如均值和中位数,比如世界银行认为只要是低于平均收入1/3的社会成员即可视为相对贫困人口,欧盟则将收入水平位于中位收入60%之下的人口归入相对贫困人口。
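以“收入法”中欧盟采用的“中位收入60%”标准为例,相对贫困线与相对贫困发生率的计算可示意如下(收入数据为虚构):

```python
def relative_poverty_rate(incomes, ratio=0.6):
    """以收入中位数的一定比例作为相对贫困线,
    返回 (贫困线, 贫困发生率)。ratio=0.6 对应欧盟标准。"""
    xs = sorted(incomes)
    n = len(xs)
    # 偶数个样本取中间两数的均值,奇数个取正中间的数
    median = xs[n // 2] if n % 2 else (xs[n // 2 - 1] + xs[n // 2]) / 2
    line = ratio * median
    rate = sum(1 for x in incomes if x < line) / n
    return line, rate
```

由于贫困线随收入分布整体移动,这类测量刻画的正是文中所说的“社会比较”意义上的不平等,而非基本需求的绝对匮乏。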

    前文的讨论主要涉及贫困问题的两个层面,即贫困的客观性问题和贫困的测量指标问题。关于贫困第三个层面的讨论是如何测量总体贫困,即如何对穷人进行“加总”,这是制定减贫政策的必要前提。

    对穷人的“加总”,就是把对个别穷人的描述变成某种贫困的测量。流行的做法是,先计算穷人人数,再计算穷人人数相对于社会总人数的比率。这种数人头的方法(head-count measure)实际上测度的是贫困发生率,这在阿玛蒂亚·森看来至少存在两大缺陷:第一,没有考虑穷人收入低于贫困线的程度(贫困深度),在不影响富人收入的情况下,整体穷人的收入减少并不会改变对穷人的人数度量;第二,对穷人之间的收入分配不敏感,尤其是当收入从一个穷人向富人转移时,穷人的人数度量也不会增加。以贫困发生率为基础制定出来的减贫政策,往往导致扶贫资源分配上的“劫贫济富”效应。因为这一类减贫政策的评价标准主要是降低贫困发生率(减少贫困人口数量),而实现该目标最有效的方式就是集中资源优先扶助那些收入接近贫困线的较“富裕”的贫困人口,忽视最贫困的人口。

    为避免上述问题,总体贫困的测度应当包含三个维度,即贫困广度(贫困人口数相对于总人口数的比率)、贫困深度(贫困人口收入与贫困线之间的差距)、贫困强度(收入在贫困人口间的分配)。利用森构建的公式,即为P=H{I+(1-I)G},P是总体贫困度量,H是贫困人口比率,I是收入缺口比率,G是穷人之间收入分配的基尼系数。Sen指数确立了贫困指数研究的基本框架,后续的研究者虽然提出很多其他指数,但是除了SST指数(Sen-Shorrocks-Thon)和FGT指数(Foster、Greer & Thorbecke)外,在测量性能上明显超越Sen指数的几近于无。SST指数克服了Sen指数在连续性上的不足并消除了Sen指数在转移公理上的局限性,而FGT指数对贫困深度的反映更直接、更细致,且拥有Sen指数和SST指数所没有的加性分解性(Additive decomposability axiom)。
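森的公式P=H{I+(1-I)G}可按文中定义直接计算,如下草图所示(收入与贫困线数据均为虚构;G按两两收入差的定义式计算,仅针对穷人群体):

```python
def gini(xs):
    """基尼系数:两两收入差绝对值之和除以 2*n^2*均值。"""
    n = len(xs)
    mean = sum(xs) / n
    diff = sum(abs(a - b) for a in xs for b in xs)
    return diff / (2 * n * n * mean)

def sen_index(incomes, line):
    """Sen指数 P = H * (I + (1 - I) * G)。
    H:贫困发生率(贫困广度);
    I:收入缺口比率(贫困深度);
    G:穷人之间收入分配的基尼系数(贫困强度)。"""
    poor = [x for x in incomes if x < line]
    if not poor:
        return 0.0
    H = len(poor) / len(incomes)
    I = sum((line - x) / line for x in poor) / len(poor)
    G = gini(poor)
    return H * (I + (1 - I) * G)
```

与单纯的“数人头”(只算H)不同,穷人收入进一步下降会抬高I,穷人内部分配恶化会抬高G,二者都会推高P,这正是Sen指数对前述两大缺陷的回应。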

    无论是Sen指数,还是SST指数和FGT指数,都是在一个特定时间点静态地度量家庭的贫困状况,而没有将家庭的未来福利或风险因素考虑进去。针对这个问题,近年来兴起了有关贫困脆弱性的研究,揭示了非贫困家庭陷于贫困的风险可能性。从这个意义上讲,贫困脆弱性是一种前瞻性的测量,测度的是家庭暴露于未来风险而给家庭生存发展可能带来的影响。

    实际上,贫困脆弱性的理论需要解决两个层面的问题。第一是贫困的本质问题,即回答未来的贫困是什么?在这一点上,贫困脆弱性与收入贫困并无二致,都是将贫困界定为家庭收入无法充分满足家庭基本需求的一种匮乏状态或相较于其他社会成员的相对匮乏状态。第二层面的问题,就是研究可能导致未来家庭陷于贫困的风险因素,本质上就是对致贫因素的研究。在这一点上,贫困脆弱性的研究开启了下一节有关资产和能力的研究。

    二、资产、能力与贫困

    上一节主要讨论贫困的属性问题,即个体贫困的识别指标、贫困的客观性以及总体贫困的测度。这一节将从既有的资产理论和能力理论入手,讨论贫困生产的机制。

    资产理论认为,资产的匮乏是贫困之所以发生的根源。我们应当超越以前那种将减贫政策集中在收入和消费基础上的做法,更多关注储蓄、投资和资产的积累,建立以资产为基础的福利政策,寻求社会政策与经济发展的有效整合。以资产为基础的政策设计,不仅仅是针对家庭,而且也针对社区。

    资产理论相信,建立以资产积累为核心的社会政策,比紧紧盯着收入的政策更有利于促进经济社会的发展,从长期来看,一种投资驱动的经济要远优于消费驱动的经济。拥有资产被认为能够改善经济稳定性,将人们与充满希望的未来相联系,有助于中产阶级的形成和壮大,培育能够进行财富积累、长期思维、具备积极的公民性的现代家庭。英国于2005年建立了儿童信托基金,赋予所有在英国出生的新生儿一份个人存款账户,而且对低收入家庭给予了更多的补助,这是全球第一个全民性的(所有儿童)、进步性的(穷人获得更多补助)、以资产为基础的社会政策。新加坡的中央公积金则是全世界内容最丰富的以资产为基础的社会政策。

    我们可以将收入和资产置于同一个连续统的两端,收入的关键尺度是稳定性,资产的关键尺度是限定性,收入和资产在连续统的中间几乎会合——一种稳定的权利收入在很大程度上相当于一种完全限定性资产。私人或公共来源的权利收入是最稳定的收入,比如基于残疾或孤寡的补贴。完全限定性资产由个人拥有,但是个人不能直接占有这些资产,比如退休养老金。个人退休账户,则属于部分限定性资产。对所有形式的金融证券、房地产和其他资产的投资,属于非限定性资产。(见图1)

    图1 收入与资产的连续统

    资产在形态上包括有形资产和无形资产,它们共同构成了家庭收入的来源。有形资产主要包括货币储蓄、不动产、机器、家庭耐用品等。无形资产主要包括享有信贷、人力资本、文化资本、非正式社会资本或社会网络等。

    作为影响收入的关键因素,资产的分布状况在很大程度上就决定了贫困的分布状况。一般来说,资产不平等的国家,其收入不平等的情况通常也比较严重。在发展中国家,收入不平等的一种重要关联因素是土地分配的不平等。自然资源的贫乏或开发利用不足,在很大程度上造成了区域性的贫困;低水平的人力资本,则使得贫困人口几乎被锁定在一个经济社会低度发展甚至停滞的恶性循环之中。

    由此可见,资产的多寡可以解释家庭可支配收入的来源。但是,资产理论作为贫困生产的解释机制,也存在不足之处。经验表明,对于权利和能力缺失的人群而言,即使拥有房子和土地等资产也不一定能够确保其过上富足的生活。这意味着存在一个权利结构和能力结构的问题,它们的缺失很可能会影响资产的收入转化率。所谓“能力”,看起来似乎与资产理论中的政治资本和部分人力资本、社会资本类同,然而在阿玛蒂亚·森看来,这些都属于个人资源的范畴。森的能力理论认为,所有资源都还存在一个转化的问题,而转化率受到权利和能力整体设置的影响。也就是说,资源和能力应作为两个理论范畴区分开来。按此分析,对资产与贫困关系的解释并不具有必然性,最后往往要回到能力的问题上。

    正是基于对以资源(尤其是收入)为基础的减贫政策的不满,森提出了能力贫困的概念。在他看来,贫困必须被视为是一种对基本能力的剥夺,而不仅仅是收入低下;贫困应当被视为达到某种最低可接受的目标水平的基本能力的缺失;换言之,贫困并不是个体福利少,而恰恰是缺少追求个体福利的能力;如果我们只关注收入的多少,那么剥夺的程度就可能被低估,因此有必要明确引入能力缺失的概念。如果我们将能力作为贫困的属性来理解森的能力理论,很容易陷入过度抽象化以至于难以测量贫困的困境之中;在这里,森的能力理论存在解释层次错位的问题。为避免这一问题,我们可以从贫困生产的角度来重新解读森的能力理论,即把能力的匮乏视为贫困产生的原因而非贫困的属性。这样一种解读方法不仅不会减损森的理论贡献,而且能够使其能力理论的论述层次更为清晰。

    森的能力理论包含着一对关系紧密的概念,“生活内容”和“能力”。“生活内容”既包括最基本的生活内容,如获得良好的营养供应、避免那些本可避免的死亡和早夭等;也包括更为复杂的成就,如获得自尊、能够参与到社会活动中等等。而与“生活内容”概念密切相连的是可实现生活内容的“能力”概念,它表示人们能够获得的各种生活内容(包括某种生存状态与活动)的不同组合,反映了人们能够选择过某种类型的生活的自由。这些“生活内容”,在很大程度上可以视为“家庭基本需求”;而“能力”则是家庭基本需求能否得到满足的原因。

    受到森的能力贫困理论的影响,联合国在1997年《人类发展报告》中提出一个度量贫困的新指标,即“人类贫困指数(HPI:Human Poverty Index)”。根据人类贫困指数,在发展中国家,贫困是由未存活到40岁的人的百分比、文盲率、缺乏保健服务和安全饮用水的人所占的百分比,以及5岁以下的儿童体重不足的人所占的百分比来衡量的;发达国家则是由未存活到60岁的人的百分比,功能性文盲率、收入低和长期失业来衡量。2000/2001年世界银行的《世界发展报告》也吸收了能力贫困概念,将贫困定义为福利被剥夺的状态,它不仅指收入地位和人力发展不足,还包括人对外部冲击的脆弱性,以及缺乏发言权和权利、被社会排斥在外。

    从相对贫困的角度来看,贫困的本质是一个不平等的问题,贫困的治理则是对平等的合理恢复。在很大程度上,收入和资产的平等分配都可以归结为德沃金的资源平等问题,与此相对应的则是森的能力平等,这是针锋相对的两种平等理论。两种平等理论的分歧在于:第一,资源平等关注的是个人所拥有的资源是否平等,而能力平等关注的则是资源转化能力是否平等。第二,资源平等主张排除原生运气对分配的影响,使人们在非人格资源(如土地、房屋等)上达到平等,并对人格资源(健康、才能等)处于不利地位者进行补偿;能力平等认为不仅应该关注资源的分配问题,更应注重由社会环境以及偏见等因素所造成的不平等。第三,资源平等对人际相异性的问题视而不见,而能力平等则强调人际相异性的重要。大略而言,资源平等更为关切的是程序上的平等,只要对资源进行最大限度的平等配置(包括对初始条件不平等的弥补)即可,至于资源本身的使用效果则无需予以考虑;能力平等则更强调实质平等,因此要关注资源转化(为自由)的能力是否平等,以及由于社会结构本身的问题所可能造成的不平等。

    能力理论对贫困产生的原因做出了深刻的分析,贫困的治理不仅仅是资源能否平等配置的问题,更是资源能否平等转化为“生活内容”抑或“自由”的问题。但是,森的能力理论也存在自身的困境。第一,能力的概念过于抽象,没有明确具体的内容,这在一定程度上降低了该理论对具体贫困问题的解释力以及在具体政策制定中的指导意义。第二,森的能力理论不能有效解释家庭基本需求成本,因而无法全面解释贫困生产的机制。

    三、贫困的能力结构

    对于贫困生产的讨论,能力是一个关键的概念。为了克服森的可行能力理论所存在的问题,我们需要重构能力的理论框架,将能力概念操作化,同时引入社区和国家的视角,从而尝试对家庭基本需求成本的产生和控制作出解释。我们将改造后的理论称为“贫困的能力结构”,它不否定在贫困生产过程中个体主观能动性的作用,但是更为强调结构本身的决定性作用。引入新的主体之后,能力结构理论被操作化为家庭能力、社区能力和国家能力三个层面,它们共同作用于家庭可支配收入和家庭基本需求成本,从而形塑了贫困的生产机制。之所以不把个体因素纳入能力结构体系之中,是因为个体因素在很大程度上取决于家庭能力的影响,个体是否聪明、健康、努力,最终都可以归因于家庭、社区和国家的结构性作用。

    贫困的形成,首要原因在于家庭能力的匮乏,无法获得足够的收入来满足家庭基本需求。家庭能力主要包含知识能力、健康能力和交往能力等;家庭能力水平越高,家庭可支配收入越高。知识能力可以用家庭平均受教育水平(或家庭成员受教育的最高水平)来衡量。健康能力可以用家庭平均健康水平(营养、身高、寿命、患病情况等)来衡量。交往能力可以用家庭社会网络的规模来衡量。社会网络的规模越大,家庭的社会支持度越高,可以获得的资源(经济救济、工作机会)越多。知识能力、健康能力、交往能力既可能相互强化,在家庭资源有限的约束下,三者也存在竞争关系。例如,在家庭资源匮乏的情况下,投入教育的资源增多,意味着投入健康和社会交往的资源就会减少。

    在现代国家建设中,社区能力的本质在于实现社区需求与国家资源的有效对接,从而为社区成员提供公共服务和公共品的能力。社区能够提供越多、越好的公共品,家庭的可支配收入就有可能得到提升,而家庭基本需求成本则有可能得以降低,从而减少贫困发生的可能性。社区能力可以进一步分解为三种能力,即表达能力、整合能力和执行能力。表达能力是指社区作为一个整体表达意见和需求的能力,可以通过表达人数和表达渠道来衡量表达能力的强弱。整合能力是指社区作为一个整体对不同意见、不同利益进行协商并使之达成一致的能力,可以通过协商次数和协商达成一致的次数来衡量整合能力的强弱。执行能力是指社区作为一个整体将社区公共意志落到实处的能力,可以通过治理钉子户的效果和公共品建设是否如期完成来衡量执行能力的强弱。社区的表达能力、整合能力、执行能力环环相扣,互相渗透。在社区公共意志的整合、执行过程中,实际上也离不开表达能力的基础性作用;而充分的社区表达,实际上也能起到一定的整合功能,社区执行能力的有效实现,在本质上就是对不同意见的再整合;充分的社区表达与有效的社区整合,最终将有利于推动社区公共意志的执行。

    与社区能力类似,国家能力的核心功能在于有效提供公共产品,区别在于,在现代社会,由国家提供的公共品更为广泛、更具基础性。国家能力越强,能够提供越多、越好的公共品,一方面可以提高家庭可支配收入,另一方面可以降低家庭基本需求的成本。国家能力还可以具体细分为四种能力,即渗透能力、动员能力、统筹能力和治理能力。渗透能力是指政府自上而下投入人力、财力的能力,衡量标准是人力、财力的投入量和效果。动员能力是指政府动员人力、财力的能力,衡量标准是因政府动员而新增的人力、财力的数量和效果。统筹能力是指政府对既有资源进行优化配置、公平分配的能力,衡量标准是政府统筹既有资源的数量、效果以及统筹层级与统筹需求的匹配程度。治理能力是指政府与社会对接的能力,衡量标准是政府与社会互动的频率和效果。渗透能力、动员能力、统筹能力、治理能力构成统一的国家能力体系,缺少哪一方面,国家的公共品建设都不容易实现。渗透能力、动员能力、统筹能力分别涉及政府对资源的投放、筹集和配置,而这三个方面都离不开治理能力来沟通国家与社会的关系;而国家与社会良性互动的能力,则是在政府投放、筹集和配置资源的过程中逐渐形成与强化的。

    贫困往往不是哪一种能力的匮乏单独造成的,而是在家庭能力、社区能力和国家能力的共同作用下产生的。因此有必要仔细分析这三种能力之间的相互作用。

    家庭的教育水平、健康水平越高,交往能力越强,社区作为一个整体越有可能充分表达和整合不同意见,并且将形成的合作方案落到实处,从而推动社区公共品的建设。社区能力越强,越有可能将国家资源引入社区、形成公共产品,从而为提升家庭的教育、健康和交往水平提供条件。有些政府项目虽然已经到达村口,但是因为村民无法达成一致意见或者无法有效治理钉子户,结果导致项目进不了村,农民享受不了相应的国家资源。良好的社区能力,不仅能够带来公共产品的有效落地,还有助于抑制不合理的社会交往成本,使人情不至于异化。

    家庭能力越强,越有可能与国家形成良好的互动,准确表达家庭发展的内在需求,使国家资源的投放更具针对性。换言之,现代化的国家建设,离不开现代化的家庭基础。而家庭能力的发展与积累,更离不开国家能力的支撑。国家对资源的筹集、配置与投放,是家庭享受良好教育和医疗条件的重要保障;减少医疗和教育方面的“非收入贫困”,公共部门进行有针对性的干预具有关键性的作用。从这个意义上讲,家庭能力的匮乏,本质上是国家能力不足的后果。

    国家资源的投放要最大程度发挥效用,需要准确回应社会需求,这就离不开社区能力的作用。社区能力的本质在于搜集、整合、执行分散农户的需求,只有当社区能力足够强,方能将这些分散的需求整合起来并实现与国家资源的有效对接。离开社区,让国家直接与个体家庭打交道,既无效率也不现实。社区能力的发展与积累,也离不开强有力的国家支持。社区的功能就在于实现国家资源与社会需求的有效对接,如果没有国家资源的持续性输入,社区能力往往会逐渐萎缩。

    作为能力结构的三个维度,家庭能力、社区能力、国家能力在贫困生产与治理过程中共同发挥作用。家庭能力的积累,很大程度上取决于家庭资源的配置模式。若家庭资源只够维持基本的生存需求,而没有更多的资源投入到教育、健康和社会交往上,那么家庭能力就不可能得到发展。因此,发展家庭能力,需要国家资源的有效介入,比如建立良好的教育系统、医疗系统、水利系统、社保系统等,将国家投放的教育资源、医疗资源、水利资源、社保资源等转化为家庭能力发展的资源,从而降低风险和冲击带来的影响、防止贫困的发生。然而,国家资源不可能直接渗透到家庭,这些资源需要通过社区这一中介发挥作用。换言之,家庭发展需要什么样的资源,只能借助社区的整合得以表达,从而实现需求与资源的对接;国家资源往往以公共品的形式发挥作用,而这些公共品要真正落地,也离不开有效的社区支持。

    四、贫困治理与现代国家转型

    贫困的形成,直接原因是家庭可支配收入不足以支付家庭基本需求成本。而低收入水平和高昂的家庭基本需求成本,从根本上讲是能力结构的缺陷造成的。国家能力、社区能力和家庭能力的不足,导致家庭成员一方面没有能力获得好的工作机会(从而获得稳定的收入),另一方面却要支付不合理的基本需求成本。从这个意义上讲,贫困治理应当聚焦于能力结构的进一步完善,从国家能力、社区能力、家庭能力三个维度出发,巩固既有的减贫成果,构建一套预防贫困、治理相对贫困及返贫问题的有效制度。

    完善能力结构的过程,实际上也是现代国家的转型过程。现代国家的主要特征是,第一,国家能够提供有效的公共品建设;第二,良好的社会自治水平;第三,公民较高的国家认同。这三个特征分别反映了国家、社区和家庭的能力发展水平。

    现代国家被要求承担越来越多的公共品建设职能,实现公共资源的有效配置和公平配置。配合这一职能的改革,是财税制度的集权化,越来越多的财税资源由政府(中央政府)掌控。这些资源的有效、公平配置,离不开强有力的国家能力。可以认为,国家能力是整个能力结构的核心,恰似整个经济社会建设的发动机。通过国家能力这一发动机,各项公共资源不断输入到社区和家庭,逐渐转化为社区能力和家庭能力。因此,贫困治理关键就看国家资源是否有效提升了社区能力和家庭能力。

    现代国家不应是简单的、全盘官僚化的国家,更不是警察国家,由国家完全控制和按计划分配所有资源;现代国家的核心标志应当是国家资源(意志)与社会需求的有效对接。要实现这一对接,离不开社区的中介作用。如果说现代国家建设的宗旨是更好地造福于民众,那么国家能力的意义就在于将国家资源转化为家庭可持续发展的内生能力。而实现这一转化的重要媒介就是社区,通过社区能力这一转化器,分散的家庭需求可以整合起来对国家资源提出要求,国家资源也能够通过社区来准确回应家庭的需求。社区能力的积累,一方面要借助国家的资源,回应民众需求,另一方面也需要保持自身的主体性,而不至于演变成为国家官僚层级的一部分,或者是民众需求的简单传输器。社区能力建设的关键就在于能够实现民众与国家的有效对话,通过对话使双方学会合理妥协与良性合作的技能,共同完成公共品的建设。

    现代国家,说到底就是现代家庭和现代公民。这意味着家庭应具备内生发展的能力,能够利用国家提供的各项公共品,提升家庭成员的受教育水平、健康水平和社会交往水平,并在这个过程中形成良好的现代国家认同。换言之,现代家庭不是简单地接受国家资源(等靠要),而是具备将这些资源转化为发展的能力。需要指出的是,家庭能力的积累,除了发挥主观能动性之外,更需要国家层面的政策制度设计和社区层面的有效整合机制。可以认为,贫困的生产首先源于家庭能力的不足,而家庭能力的不足则根源于社区能力和国家能力的不足。

    总言之,贫困治理不应是简单的国家资源输入(到家庭),而需要建立家庭能力的积累机制;而家庭能力的有效积累,则离不开社区能力和国家能力的支持。减贫政策,不应简单地着眼于家庭收入表面的提升,而应当直接回应贫困的生产机制,致力于解决致贫的根本原因。换言之,减贫政策只有解决了贫困的原因,即推动家庭、社区和国家三层能力的持续积累,才能真正减少贫困、预防贫困。传统的减贫政策很大程度上只是一种临时性的、事后的补偿机制,无法通过能力建设来抵御贫困的风险。从这个意义上讲,能力结构的理论框架作为一个整体,既是理解贫困生产的关键,也是制定减贫政策的理论基础。当然,三种能力的水平在很大程度上受制于国家和地区的经济社会发展状况,能力建设本身也需要大量的资源投入。因此,应当历史地看待能力结构的问题,而不应急于求成;如何科学合理地布局家庭能力、社区能力和国家能力的发展,是另外一项值得深入探讨的课题。

    本文转自《乡村治理评论》2024年第2期

  • R C Dieter:为了选票的杀戮:美国死刑存废背后的政法逻辑

    我们不应该赞同承担着公正司法任务的人为竞选活动提供资金;也不应该认可那种仅为了讨好选民,就预测尚未走完流程的案件的结果或承诺裁决方式的行为。在竞选中承诺“严厉打击犯罪”或“执行死刑”是竞选人存在偏见的证据,应该使他们失去对刑事案件的审理资格。

    美国联邦最高法院大法官

    约翰·保罗·史蒂文斯

    1996年

    引言

    死刑问题在政治选举中的渗透已达到新的极端,并扭曲了刑事司法系统。尽管利用死刑判决来获取政治优势并非新鲜事,但旨在加速处决的煽动性言论却变得更为普遍。不仅立法职位的候选人在竞选中高调谈论死刑,甚至法官和地方检察官也在竞选中提及他们送多少人上了刑场。这些负责解释和实施法律的人对死刑的政治化推广,干扰了公正听证的权利,并增加了无辜被告被处决的可能性。

    许多挥舞反犯罪旗帜的人不仅主张死刑,还极力推动扩大死刑适用范围、减少上诉、削减对死刑犯至关重要的辩护支持。于是,尽管法官经常决定被告的生死,但一旦他们面临选举、任命或确认程序,不判处死刑的裁决就会被攻击为“对犯罪行为的软弱”。同时,检察官在追求死刑方面几乎拥有无限的裁量权,这使得他们有机会通过寻求死刑判决来展示自己“对犯罪行为的强硬”。

    如果法官的裁决可能决定其在下次选举中的命运,那么即便他的裁决被认可且毫无疑问是正确的,宪法权威也必将受到严重侵蚀。

    ——美国大法官拜伦·怀特

    死刑政治化对美国公民造成了巨大损害。原本认为死刑无效的候选人不敢在竞选中发声;受人尊敬的法官在依据法律和宪法正确裁定某些死刑案件后被迫下台;死刑审判成了法官和检察官的竞选表演。而死刑犯中,虽有一些被证明是无辜的,但也有一些因被剥夺了听证或辩护的权利而丧失生命——他们如果能接受公正审判,本不会被判死刑。造成这种现象的原因,正是公正的上诉无法成为有力的竞选口号。

    一、政治、法官与死刑判决

    在美国,有38个州支持死刑,其中32个州的法官需要经过选举。(美国1996年数据)令人不安的是,遵循法律并推翻死刑判决的法官经常被认为是“对犯罪的软弱”。在被鼓动起来的公众认知中,似乎任何妨碍死刑判决的行为都是一种侵害正义的“技术性”托辞。法官若遵循法律和宪法作出判决,可能会面临不利后果。

    (一)选举法官被“一票”出局

    在田纳西州,最高法院大法官彭妮·怀特是当时该法院唯一的女性,在1994年由民主党州长内德·麦克沃特任命。她在下级法院出色地任职了两年,所审理的绝大多数刑事定罪得到了认可。但在她审理的第一个死刑案件中,她与其他法官投票推翻了对理查德·奥多姆的死刑判决,因为在她和其他三名法官看来,根据田纳西州的法律,没有足够的证据支持奥多姆因强奸和谋杀被判死刑。

    这给了田纳西保守联盟(Tennessee Conservative Union,田纳西州最大、历史最悠久的保守派组织,致力于影响税收和宪法问题)在1996年8月的司法选举中攻击她为死刑反对者的机会。怀特的对手,包括该州的共和党领导人,指责她“从未投票支持死刑定罪”(尽管这是她审理的第一个死刑案件),并声称她“想释放越来越多的罪犯,还嘲笑犯罪过程中的受害者”。田纳西州的两名共和党参议员公开宣布他们因怀特在这一案件中的死刑立场而反对她。共和党州长唐·桑德奎斯特在选举前宣布,除非他确定提名人支持死刑,否则他永远不会任命任何人担任刑事法院法官。在整个竞选期间,根据有关规定,怀特法官被禁止讨论奥多姆案以及发表个人法律观点。最终,她竞选失败,不再担任法官一职。显然,如果她违背初衷,投票支持处决理查德·奥多姆,她今天或许仍是法官。

    在密西西比州,最高法院大法官詹姆斯·罗伯逊在1992年的一场罢免选举中被免职,其对手在选举中攻击罗伯逊在死刑案件中的裁决。罗伯逊甚至因认为强奸罪不适用死刑而受到批评,尽管这一立场是美国最高法院长期以来的既定裁决。针对罗伯逊大法官的宣传声称:“投票反对罗伯逊,因为他反对死刑,还想放走罪犯。”

    在德克萨斯州,法官查尔斯·坎贝尔在1994年因推翻一起死刑谋杀案的判决而被投票赶出德克萨斯刑事上诉法院。坎贝尔法官在任12年,此前曾是一名保守的检察官。他的继任者斯蒂芬·曼斯菲尔德,一个曾因无证执业被罚款、几乎没有刑事法律经验的人,却因承诺支持更多的死刑判决,成为负责审查每一起死刑案件的法官之一,这使得德克萨斯州被处决的人数超过美国其他州。此外,法官诺曼·兰福德在1992年,因建议搁置一宗检察官办案程序违法的死刑案件被投票赶出州法院,而击败他的死刑检察官卡普里斯·科斯珀,曾在担任检察官期间,在办公室门上悬挂绞刑绳。

    在华盛顿州,最高法院的一名高级大法官于1995年选择辞职,因为他“不愿再参与到一个在死刑案件中故意剥夺生命权的司法系统”。在辞职时,大法官罗伯特·乌特警告了法官选举过程中的政治化,由此,华盛顿州失去了一位在最高法院任职23年的受人尊敬的大法官。

    在北卡罗来纳州,最高法院前首席大法官詹姆斯·埃克萨姆不得不参与一场竞选活动,以反击针对他死刑观点的抹黑行为。首席大法官埃克萨姆在确认死刑判决的裁决中明确表示,他不会让个人对死刑的看法干扰他维护宪法的义务。埃克萨姆首席大法官在选举中幸存下来,但表示“公众对死刑的呼声变得越来越尖锐”,即使偶尔推翻死刑判决,也会越来越难以生存。他宣布他不会在1998年连任,并表示他“很高兴不必再竞选”。他最终在任期结束前辞职。

    (二)任命法官也面临压力

    即使在法官任命不受选举约束的地方,政治也会将那些没有盲目效忠死刑的司法候选人排除在外。在克林顿总统任期初期,参议院共和党人就发出通知,他们将挑战其提名的司法候选人,因为这些人对死刑的投入不足。例如,佛罗里达州最高法院首席大法官罗斯玛丽·巴克特被提名为美国上诉法院法官时,她遭到了相当大的反对。尽管其在200多起案件中维持了死刑判决,但参议员奥林·哈奇仍想看看她“对死刑是否足够认真”。

    巴克特虽然最终获得任命,但死刑政治化仍在继续,那些曾经投票支持她的人被指责对犯罪软弱。公开支持死刑的参议员黛安·范斯坦、爱德华·肯尼迪、吉姆·萨瑟和查尔斯·罗伯在竞选连任时因投票支持巴克特法官而受到攻击。萨瑟败给比尔·弗里斯特,失去了席位,后者后来在将大法官彭妮·怀特从田纳西州最高法院赶下台的竞选活动中发挥了重要作用。

    候选人迈克尔·赫芬顿用一则误导性广告抨击参议员范斯坦,广告上写着“范斯坦在受害者死后让杀手活着”。他的整版广告描述了三起谋杀案的可怕细节——在这些案件中,巴克特大法官曾投票推翻死刑判决——却只字不提推翻死刑所依据的法律理由。

    曼哈顿联邦地区法院法官小哈罗德·贝尔的裁决争议是一个不祥的迹象,表明这种针对法官的政治攻击在这个选举年可能愈演愈烈。虽然不是死刑案件,但贝尔法官排除一些针对毒品被告的证据的决定激起众怒,引来参议员罗伯特·多尔的弹劾呼声,克林顿总统也暗示要求其辞职。贝尔法官最终改变了他对证据的裁决,然后完全退出了此案。

    多尔参议员还抨击克林顿总统任命的两位最高法院大法官露丝·巴德·金斯伯格和斯蒂芬·布雷耶,称他们愿意利用“技术性问题”来推翻死刑判决。尽管多尔努力将这些大法官定位为处于死刑判例的极端,但法院大多数死刑案件都是一致裁决的。在刑事案件中,金斯伯格大法官在80%的时间里站在伦奎斯特、托马斯和苏特大法官一边。

    (三)法官对自己地位的保护

    由于法官受到政治攻击,一些法官不遗余力地表明他们“对犯罪并不软弱”。阿拉巴马州最高法院的民选法官不敢让公众误解他们的观点。他们最近自行实施了措施,以加快对死刑犯的处决。他们说,即使是那些还没有完成上诉的人,也会设定处决日期。然而,根据阿拉巴马州资源中心前主任布莱恩·史蒂文森的说法,这些囚犯中没有提出进一步上诉的原因是他们没有律师。

    在处决人数第三多的弗吉尼亚州,法院还在迅速确定处决日期。定罪后的申请现在必须直接提交给弗吉尼亚州最高法院,自恢复死刑以来,该法院已经100%驳回了它在死刑案件中收到的人身保护令申请。以前,申请是向初审法院提交的,在那里可以举行证据听证会。现在,驳回上诉的决定是在提交申请书几周后发布的,没有听证会,没有口头辩论,也没有专家意见。提交的大量内容涉及复杂的法律问题,但法院在每个案件中都会迅速发出相同的驳回。弗吉尼亚资源中心现在没有联邦资金,但面临着大量等待执行的案件。由于工作人员耗尽,它必须赶紧向联邦法院提交申请,否则处决将在短时间内进行。

    州法院不对死刑案件进行复核是不负责任的。过去20年中发现的大量无辜死刑犯和联邦法院发现的错误案件比例很高,这有力地表明死刑审判中正在犯下严重错误。今年通过的立法削弱了联邦法院在审查死刑案件中的作用。这使得州法院变得更加重要。否则,这些错误将永远不会得到纠正。

    在加利福尼亚州,州最高法院颁布的死刑裁决发生了彻底的转变,而与死刑有关的法律没有发生任何变化。死刑支持者没有修改法律,而是发起了一场政治运动,以罢免最高法院首席大法官罗斯·伯德和两名助理大法官。在法院因法律缺陷推翻了一系列死刑判决后,他们被投票罢免。随着新任首席大法官的上任,加州迅速实现了全国最高的死刑案件确认率,在上诉审理的死刑案件中维持了惊人的97%。相比之下,全国约35%的死刑案件在上诉中被推翻,是加州最高法院推翻比例的10倍以上。

    然而,在为加州的死刑犯寻找律师时,这个法院的记录是全国最糟糕的法院之一。该州超过四分之一的死刑犯,即128名囚犯,甚至在他们的第一次上诉中都没有得到律师辩护。

    在北卡罗来纳州,由一位新任首席大法官和两名新任共和党大法官领导的该州最高法院对死刑案件中的申诉置若罔闻。在1995年审查的24起死刑案件中,法院维持了所有定罪,只将一起案件发回重新量刑。相比之下,在1993-1994年,同一法院下令对大约10%的死刑案件进行新的审判,对四分之一的案件下令重新判刑。

    美国上诉法院第五巡回法院审理德克萨斯州、路易斯安那州和密西西比州这三个主要死刑州的案件。近年来,该法院一直由极端保守的任命人员主导。因此,虽然全国授予联邦人身保护令的比率约为40%,但第五巡回法院在其死刑案件中授予救济的比例不到5%。

    二、民选法官推翻陪审团建议,改判死刑

    毫不奇怪,考虑到法官面临的政治压力,其判处死刑的概率远高于陪审团。这一现象长期存在,近期司法否决权的实践亦证实此点。

    ——约翰·保罗·史蒂文斯大法官

    在九个保留死刑的司法辖区中,当死刑判决的终局裁量权由法官而非陪审团行使时,法官所承受的死刑决策政治压力尤为显著。其中八个辖区的法官须通过选举程序保持职位。

    在四个由法官行使量刑权的司法辖区中,陪审团虽可先行提出量刑建议(体现最接近证据的公民群体对刑罚的判断),但该建议可被法官否决。具有显著政治倾向的民选法官往往更倾向于推翻陪审团的终身监禁建议而改判死刑,鲜有推翻陪审团死刑建议改判终身监禁之例。在佛罗里达、阿拉巴马和印第安纳三个实行法官再选制度的州,法官已在189起陪审团建议终身监禁的案件中改判死刑,而推翻陪审团死刑建议仅60例。阿拉巴马州尤为突出,民选法官推翻终身监禁建议改判死刑的比率是推翻死刑建议的十倍。唯一例外是特拉华州(该州法官不实行选举制),其七次陪审团建议否决均维持终身监禁。

    在哈里斯诉阿拉巴马州案中,联邦最高法院以多数意见维持了法官无说明义务即可否决陪审团建议的司法实践。持异议意见的史蒂文斯大法官指出,民选法官易受公众复仇情绪影响,警告司法官员为谋求连任可能屈从于要求“严厉打击犯罪”的政治压力。当法官被赋予推翻陪审团终身监禁建议改判死刑的权力时,实质上破坏了美国宪政体制中精心设计的司法权制衡机制。史蒂文斯撰文强调:“考虑到法官面临的政治压力,其判处死刑的概率远高于陪审团。这一现象长期存在,近期司法否决权的实践亦证实此点。”

    在分析民选法官偏好死刑的动因时,史蒂文斯提出“高阶权威”理论:“当代死刑案件法官可能过度响应的‘高阶权威’,实为一种迫使觊觎更高职位或仅求留任的法官不断宣誓效忠死刑制度的政治气候……在备受瞩目的死刑案件中屈从政治压力的危险,与效忠乔治三世的法官面临的危险如出一辙。”

    (一)法院树上的绞刑索

    佛罗里达州法官威廉·拉马尔·罗斯以死刑议题政治化著称。1972年联邦最高法院暂缓死刑执行期间,其通过在法院草坪树木悬挂绞刑索的具象化方式公开抗议该司法决定。当佛罗里达州恢复死刑制度后,罗斯法官迅速行使裁量权推翻陪审团一致建议——对存在饮酒后失忆症状的优等生兼运动员道格·麦凯改判死刑,该判决后被州最高法院撤销。

    另一佛州法官理查德·斯坦利在雷利·波特死刑案审理期间,当庭展示指虎与枪械等暴力象征物。被问及能否亲自执行电刑时,其宣称:“只要获准在宣判后立即拔枪射其眉心,本人完全赞同该程序。”尽管陪审团基于被告人年轻且无重大犯罪记录全票建议终身监禁,斯坦利法官径行改判死刑。需特别指出,该法官在宪法要求的量刑听证程序(即展示量刑相关证据的法定环节)前已形成预判:“当陪审团作出有罪裁决时,本人已形成内心确信并据此量刑。”斯坦利近期更直言不讳:“坦率而言,本人对此毫不在意。”

    法庭书记官杰里·贝克近期作证称,斯坦利法官在波特被定罪前即预谋变更管辖至格莱兹县,其理由为:“此地民众公正开明……将依据证据定罪那个混蛋”,继而“送他上电椅”。该证据披露导致波特死刑执行令暂缓,再审程序现处待决状态。

    阿拉巴马州法官罗伯特·李·凯在沃尔特·麦克米兰案中,推翻陪审团基于证据薄弱提出的终身监禁建议,直接判处死刑。该案关键证人六年后承认伪证,若非阿拉巴马资源中心介入证明其无罪,麦克米兰恐怕已遭误杀。此案凸显了当黑人男性在南方小镇被控谋杀白人女性时,民选法官通过死刑判决进行政治表态蕴含的重大程序风险。

    阿拉巴马州法官布拉克斯顿·基特里尔近期推翻陪审团对17岁边缘智力障碍者迈克尔·肖恩·巴恩斯的不得假释终身监禁建议,斥责其“凶残冷血,应与被害人遭受同等对待”。佛州塞米诺尔县法官罗伯特·麦格雷戈则通过“爆炸性指示”(即强制要求达成裁决的补充性陪审团指令)迫使陷入僵局的陪审团作出有罪裁决,继而推翻其终身监禁建议改判死刑。该案关键证人为求减刑的吸毒青少年,其证言经催眠引导“恢复”与被告关于抛尸地点的对话记忆。1996年因证人翻供导致定罪被撤销后,已退休的麦格雷戈竟欲重掌该案审理权。

    (二)死刑导向的司法生态

    即便在陪审团制框架下,主审法官仍可通过多重程序机制实质影响死刑结果:包括指定贫困被告辩护律师(如德克萨斯州频繁任命15案中12遭死刑判决的罗恩·莫克,以及当庭瞌睡的“闪电式”律师乔·弗兰克·坎农)、控制专家证人预算、限制审前动议等。休斯顿法官威廉·哈蒙更当庭宣称“处决被告系践行神旨”,并在法庭悬挂绞刑场照片,公然贬斥刑事上诉法院为“自由派混蛋”。部分法官以签署“笑脸”死刑令、将执行日设为书记员生日“礼物”等方式展现司法恣意。

    加州“犯罪受害者联合体”等组织通过政治行动委员会资金推动罢免“低效”法官,其领导人哈丽雅特·索拉诺强调:“法官应强制推进死刑审判进程”。司法选举中,候选人公然以“送多少杀手进死囚牢”(加州候选人约翰·奎特曼)、“刑事司法零容忍”(阿拉巴马法官鲍勃·奥斯汀)作为竞选纲领,路易斯安那最高法院大法官杰克·沃森更将死刑立场写入竞选文宣。

    密苏里州法官厄尔·布莱克威尔在审理非裔失业被告死刑案期间,通过签署新闻稿宣布转投共和党,称“民主党过度代表少数族裔、懒汉与非白人群体”,拒绝回避申请后判处被告死刑。阿拉巴马州法官迈克·麦考密克在司法选举前两周受理死刑案件,拒绝延期审理、回避申请及变更管辖请求,利用庭审曝光度赢得选举后立即作出死刑判决。

    《1996反恐与有效死刑法案》通过强化州法官裁量权、限制联邦法院合宪性审查,使民选法官更易受政治周期影响。死囚申请人现难以获得非选举制联邦法官的独立司法审查,州司法程序的宪法性保障机制遭到系统性削弱。

    三、以生命为筹码的政治博弈

    我屈服于职位带来的威望与权力。我深知州长诉求:任何死刑案件不得提出宽宥建议。

    ——路易斯安那州赦免委员会前主席霍华德·马塞勒斯

    死刑制度提供的政治机遇不仅作用于民选法官群体,更延伸至司法部长、检察官及州赦免委员会成员。此类政治竞争常导致灾难性后果。

    (一)死刑执行的政治边缘策略

    1996年1月,死囚罗伯特·比尔尚未提交联邦人身保护令申请。俄亥俄州政府却径行设定执行日期,联邦法官以“申请未正式递交”为由拒绝签发暂缓执行令。联邦第六巡回上诉法院于预定执行日前两日签发紧急暂缓令以保障程序权利。司法部长贝蒂·蒙哥马利却在执行前数小时向联邦最高法院提起紧急动议,蓄意制造“最后时刻危机”,动员全州筹备30年来首次死刑执行。

    尽管本案尚未完成常规司法审查,蒙哥马利仍召开新闻发布会谴责“死刑执行迟滞”,坦承辩护律师极可能成功获得暂缓令。其甘愿以被告生命为赌注推行制度极限测试,实质系政治姿态展演。

    司法部发言人马克·韦伯展现对正当程序与心理创伤的漠视:“该犯理当伏法,故执行判决无需顾虑。”其承认推动执行实属政治表演:“我们清楚制度现实——比尔极可能通过联邦上诉。”(前司法部长李·费舍尔1994年对约翰·伯德案采用相同“懦夫博弈”策略,两案被告至今仍在死囚监区)蒙哥马利在最高法院驳回其虚构上诉后坦言:“我们实现了传递政治信号的目标——将制度博弈空间压缩至极限。”

    蒙哥马利进一步介入死刑上诉程序(打破俄亥俄恢复死刑16年来司法部长不干预传统),直接致电检察官称可代行死刑案件答辩职责。其推动的立法草案规定:凡被法院认定存在辩护失职的律师,终身不得承接死刑案件。俄亥俄最高法院首席大法官托马斯·莫耶斥之为“无的放矢的解决方案”——该州自恢复死刑以来尚无因辩护失职推翻定罪的先例。

    前精神病患者莱昂·莫瑟虽表达伏法意愿,但其司法行为能力存疑。联邦法官签发行为能力听证暂缓令后,州司法部长成功上诉撤销暂缓令并抢在听证前执行。当法官试图通过监狱电话评估莫瑟精神状态时,州政府隐瞒死刑室座机存在,致电接通时致命药物已注入其体内。

    蒂莫西·鲍德温案凸显赦免程序的系统性失灵。赦免委员会主席马塞勒斯在闭门审议中向州长法律顾问比尔·罗伯茨痛陈:“若赦免旨在施行仁慈,此案堪称最佳范例。”却被告知“州长不愿直面此类案件”。委员会最终全票维持死刑判决。马塞勒斯事后忏悔:“我缺乏依循良知的勇气,向职位附带的权力光环屈服。我深谙任命者的政治需求:所有死刑案件必须拒绝宽宥。”

    (二)不惜一切代价执行死刑

    尽管财政紧缩常为有效竞选策略,各州在死刑议题上却挥霍无度:

     ●  得克萨斯州单案成本超200万美元,休斯顿地区检察官约翰尼·霍姆斯公开宣称“成本与时间非追诉考量”;

     ●  加州拟追加年度预算2300万美元加速处决,叠加现有每年9000万美元死刑系统维护费;

     ●  佐治亚州科布县对已获四项终身监禁的弗雷德·托克斯再启死刑程序,预估耗资百万美元。

    得克萨斯州1995年通过缩短上诉期限的立法却拒付律师费,致全年处决数从19例骤降至3例(含2例放弃上诉者)。贝克萨尔县助理检察官埃德·肖内西批评:“立法者企图构建不支付对价的死刑制度。”

    四、民选检察官在死刑案件中的关键作用

    本判决认定:警方与检察官的行为系故意为之,具有恶意且令人发指。

    ——联邦法官肯尼思·霍伊特

    检察官在死刑案件中享有广泛的自由裁量权:可决定是否寻求死刑或终身监禁、是否接受辩诉交易、是否动用全部政府资源支持特定起诉。民选检察官深知死刑审判将获得媒体高度关注,在选举临近时,死刑案件更成为获取免费宣传、塑造强硬形象的绝佳政治资本。

    此类裁量权通常不受司法审查。只要案件符合最低标准,法院不得质疑检察官将特定谋杀案定性为死刑案件的决定。且一旦公开宣布寻求死刑,即便出现强有力的无罪证据,亦难以逆转程序。

    (一)“枪上的权力标记”

    部分检察官将死刑定罪数量作为权力象征进行标榜,他们深知在竞选中几乎无人会被指责“打击犯罪过度严苛”。例如俄克拉荷马城地区检察官鲍勃·梅西在其竞选文宣中将“成功将44名谋杀犯送入死刑待决区”列为首要政绩。

    得克萨斯州哈里斯县地区检察官约翰尼·霍姆斯以死刑适用构建职业声誉。自1976年以来,其主导的死刑执行数量超过得克萨斯州以外的任何一个州。该检察官办公室设有名为“银针协会”的公示栏,详尽记录哈里斯县通过注射死刑处决的个案。

    然而联邦法院近期一项死刑判决对霍姆斯办公室检察官的恣意裁量提出严厉司法批评。在撤销休斯顿市里卡多·格拉死刑定罪的裁决中,肯尼斯·霍伊特法官指斥执法机关与公诉部门:“调查所揭示的警察与检察官行为具有主观故意,存在恶意渎职,其性质构成严重司法失范。”其特别强调该公诉滥权行为的政治工具性,称其“系为实现定罪率提升与权力符号积累而进行的制度性设计”。

    基于前乔治亚州地区检察官(现任法官)道格拉斯·普伦主导的死刑公诉策略,查塔胡奇司法区死刑待决人数居全州之首。但普伦通过程序异化实现定罪目标。其办公室近期被揭露不当干预乔治亚州哥伦布市刑事案件的法官分配机制。死刑案件被系统分配至普伦前任检察官出身的法官审理。此外,普伦任哥伦布市检察官期间,该办公室在死刑案件中83%的任意回避权针对非裔陪审员行使。当乔治亚州最高法院首席大法官提出强化死刑案件贫困被告人法律援助计划时,普伦将其斥为“对死刑制度的系统性破坏”。

    普伦就任法官后持续推行死刑扩张政策。亚特兰大奥运会前接受采访时,普伦法官宣称:“伤害我治下民众者必遭严惩。不得假释的终身监禁是司法软弱的象征,是制度性缺陷的体现。”

    担任检察官期间,普伦成功对智障非裔被告人杰罗姆·鲍登求处死刑。智力障碍者在自我辩护中常常表现出能力缺失、庭审中情绪表达失当,且对公诉机关表现出非常规配合,此类因素系统性地提升了死刑定罪概率。IQ值59的鲍登被处决引发乔治亚州司法声誉危机,促使该州通过立法禁止对智障者适用死刑。但普伦明确表示若再遇同类案件仍将坚持死刑诉求。

    普伦近期获任乔治亚州高等法院新设法官职位。查塔胡奇司法区四位高等法院法官中,穆林斯·惠森特与威廉·史密斯均通过办理重大死刑案件获得司法任命。史密斯竞选法官期间,其最大单笔政治献金(5000美元)来自其经办死刑案件中被害者家属。

    肯塔基州检察官欧内斯特·贾斯敏因对三一高中双尸案凶手成功求处死刑确立职业声望。其以“三一检察官”名义开展竞选活动,在中学报刊投放广告并频繁携被害者家属参与造势。

    内布拉斯加州总检察长唐·斯坦伯格采取非惯例操作,在最高法院案情摘要中附加个人信函,要求处决其称为“持续对受害人家属显露蔑笑的残暴凶手”哈罗德·奥特伊。在公开推动奥特伊死刑执行的同时,斯坦伯格以决策者身份参与赦免听证会,其幕僚向听证会陈述官方版犯罪事实。

    (二)不受制约的裁量权

    系统性推进死刑公诉的检察官鲜少遭遇制度性制约。马里兰州巴尔的摩县州检察官桑德拉·奥康纳与费城地区检察官林恩·亚伯拉罕均声明对符合形式要件的案件一律适用死刑。但相应州长均未对其法律滥用或死刑激进主义采取问责机制。而当纽约布朗克斯地区检察官罗伯特·约翰逊对重大袭警案死刑适用持审慎态度时,州长以其“违反死刑法强制性规定”为由启动公诉权紧急接管程序。

    约翰逊虽未明示绝对死刑废止立场,但认为此类公诉蕴含不可接受的误判风险。纽约州立法未设定死刑强制适用条款,裁量权完全赋予检察官。约翰逊在选区民众充分知晓其死刑立场情况下高票连任。州长指定狂热支持死刑的总检察长丹尼斯·瓦科接管案件并决定是否对凯文·吉莱斯皮警官遇害案适用死刑。瓦科选择死刑求刑的决定符合预期。

    该案以悲剧告终:被告安赫尔·迪亚兹在赖克斯岛拘留所疑似自杀身亡,未及进入审判程序。州长帕塔基在获悉被推定无罪且处于国家监护的个体死亡后,作出冷酷表态:“安赫尔·迪亚兹系暴力罪犯,其死亡方式与犯罪本质相符。我为凯文·吉莱斯皮之死致哀。”

    伊利诺伊州助理总检察长玛丽·肯尼因拒绝推动对无辜者执行死刑而辞职。总检察长要求其继续抗辩罗兰多·克鲁兹的上诉,尽管存在他者认罪及大量无罪证据。肯尼选择离职,而两次在压倒性无罪证据前坚持起诉克鲁兹的詹姆斯·瑞安晋升州总检察长。伊利诺伊州最高法院最终撤销克鲁兹定罪,重审宣告无罪。(其同案被告亚历杭德罗·埃尔南德斯亦经死刑判决后被撤销定罪释放)。

    五、围绕犯罪议题的政治煽动加剧死刑滥用

    犯罪议题的政治修辞遮蔽了理性辩论空间,围绕死刑的夸张表述突破了基本限度。作为《美利坚契约》中“有效死刑法案”的原始提案者,纽特·金里奇近期重返佐治亚州推动毒品走私者强制死刑立法。凡走私商业数量毒品入境者将面临死刑。金里奇设想象征性处决——单次集体处刑35人,以形成威慑效应。其在雅典市筹款晚宴宣称:“出于对儿童的充分保护,我作出决策:实施此类犯罪者必处极刑。”为实现程序简化,其同时主张废除此类案件的多数上诉权。

    新墨西哥州长加里·约翰逊近期提出将死刑适用年龄降至13岁。其同时向慎用死刑的法官发出隐性警示,称死刑裁量权虽属司法范畴,“但需由选民对法官履职表现进行政治评估”。

    部分政客将死刑作为攻击政治对手的工具,即便对方坚持强硬死刑立场。阿拉巴马州总检察长杰夫·塞申斯以支持死刑著称,但共和党参议员候选人西德·麦克唐纳仍借塞申斯认同州刑事上诉法院正确裁决之机发动攻势。该院认定初审法院适用死刑的标准超出本州死刑法定要件,一致裁决要求撤销死刑判决。麦克唐纳在竞选广告中无视法律逻辑:“谋杀即谋杀,任何法律技术细节无法改变本质。作为参议员我将捍卫被害人权利而非罪犯权利。”

    内华达州总检察长弗兰基·休·德尔帕帕指责联邦上诉法院“对死刑存有制度性偏见”,理由是本州案件审查耗时过长。但其刻意回避司法责任——总检察长办公室因未及时回应诉状导致程序迟延。迈克尔·格里芬法官指出托马斯·内维厄斯案中,“1989至1994年间总检察署完全未履行职责”。

    加州总检察长丹·伦格伦将死刑作为政治募资工具,其赴华盛顿推动立法压缩死刑案件上诉程序。使用官方信笺的募款函将上诉制度称为“刑事司法体系漏洞”。伦格伦为展示死刑立场不择手段,近期发布失实新闻稿谴责公设辩护人向死囚赠送饼干与运动鞋,实则担忧本已放弃上诉权的被告可能启动联邦司法审查程序脱离州司法控制。

    南卡罗来纳州总检察长查尔斯·康登通过死刑议题介入国家政治。其主导国会撤销死刑案件资源中心全部拨款,既剥夺法庭对手的辩护资源,又塑造反对死刑上诉的斗士形象。尽管资源锐减可能导致司法系统迟滞与政府成本激增,康登等人仍以牺牲司法秩序为代价攫取政治资本。

    政客深谙利用公众恐慌巩固支持之道。亚利桑那州众议员莱斯利·约翰逊(梅萨市共和党籍)在尤马市恶性犯罪后立即提议对儿童性侵者适用死刑。其在议会宣称这是速效方案:“通过死刑彻底清除性犯罪者。即使存在误判,我也愿意承受这一代价——毕竟儿童安全高于一切。”

    犯罪议题政治化导致政府各层级(尤其司法系统)系统性排斥死刑反对者。单一立场即可引发政治放逐,即便候选人资质卓越。当今若小马丁·路德·金、大法官布伦南、马歇尔与布莱克门在世,恐难获联邦司法任命。司法机构与民选官员体系几乎彻底清除少数派观点持有者。全国公共广播电台甚至因参议员罗伯特·多尔与执法团体施压,撤销死刑犯视角的系列节目。

    1988年威利·霍顿事件引发社会恐慌后,比尔·克林顿1992年暂停竞选活动,亲自主持阿肯色州脑损伤囚犯注射死刑,清晰表达了死刑立场。入主白宫后推动联邦死刑适用范围扩至六十项罪名(含非致死性犯罪),签署预算案撤销死刑资源中心资助,支持可能阻断死囚联邦司法救济的“反恐法案”。克林顿总统确立亲死刑政策框架后,鲍勃·多尔通过加州圣昆廷监狱(全美最大死囚区)摆拍造势,呼吁弹劾联邦法官哈罗德·贝尔并加速死刑执行。克林顿发言人即刻回应称总统同样支持大幅削减死囚联邦上诉权,避免在犯罪议题上示弱。

    死刑上诉制度整体沦为政治表演舞台。最新操作模式表现为:借反恐之名组织俄克拉荷马城爆炸案幸存者及家属,为限制死刑案件联邦审查造势。当法案中更具争议性的反恐条款遭弃后,连支持者都承认“死刑制度改革才是法案核心”。

    未明言的是:人身保护令制度修改与反恐毫无关联。且俄城爆炸案属联邦管辖,本不涉及联邦法院审查州法院裁决。经精心策划的立法推动运动,媒体选择性忽略部分爆炸案家属反对仓促行刑的立场。

    六、宽恕:州长在罪犯垂死前的姿态

    死刑程序的最后一步是由州长考虑是否给予宽恕。然而,由于近年来这一程序变得高度政治化,宽恕的授予变得极为罕见。在过去四年中,全国范围内每年仅有一次减刑。在本世纪早期,大约有20%的死刑案件会获得宽恕。但近年来,很少有州长在任期内有勇气批准哪怕一次宽恕。

    近年来,支持死刑的州长们没有选择宽恕,而是采用了一种最受欢迎的技术——人为加速签署死刑执行令。对于一个渴望更快执行死刑的选民群体来说,签署执行令的方案有几个好处。首先,它给人一种死刑程序正在加速的印象。其次,它使州长能够在“强硬程度”上与前任进行数字上的比较。第三,当死刑执行令不可避免地无法以签署的速度执行时,州长可以将责任归咎于法院或辩护律师,称其为“真正的问题”。

    这种对刑事司法系统的操纵不仅仅具有政治影响。死刑执行令会使法律系统陷入混乱。即使在提出上诉之前,也必须争取暂缓执行。这在一个已经复杂的过程中增加了更多的层次。如果执行令的数量过多,可能没有足够的律师来处理突然激增的诉讼。没有律师代表的被告很容易被忽视,并在没有法律代表的情况下被执行死刑。1989年,由首席大法官伦奎斯特任命的委员会在研究死刑上诉时强烈建议,此类审查“应不受即将执行的时间压力影响,并应在有能力的律师协助下进行……”

    佛罗里达州的鲍勃·马丁内斯擅长以执行死刑相威胁,他在四年内签署了139份死刑执行令,是其前任鲍勃·格雷厄姆的两倍,也是下一任州长劳顿·奇利斯的许多倍。马丁内斯经常一次签署五份执行令,且常常不按时间顺序。他在自己送上电椅的死囚形象前进行竞选活动。然而,在这三位州长的任期内,实际执行死刑的速度大致相同——只有签署死刑执行令的速度加快了。尽管如此,这一过程给法院和那些被指派为死囚辩护的人带来了巨大的负担。

    宾夕法尼亚州州长汤姆·里奇上任时也承诺加快死刑的执行。自1995年担任州长以来,他至少签署了41份死刑执行令。宾夕法尼亚州有两起死刑执行,但这两起案件的囚犯都放弃了上诉。同样,这些执行令给人留下了里奇强硬无情的印象,并成功地压垮了已经严重资源不足的贫困辩护系统,该系统不得不应对每一次死刑执行的威胁。

    结论

    尽管犯罪问题常常是政治演讲的主要内容,但最近对死刑的强调干扰了司法系统的基本公正性。当那些将决定被告生死的法官——他们甚至有权否决陪审团的一致裁决——通过宣称自己对罪犯的强硬态度来竞选公职时,公正性就受到了威胁。当那些将决定是否以及针对谁寻求死刑的检察官凭借他们的死刑记录竞选公职时,这便助长了滥用权力的可能性。

    政客们通过将对死刑的忠诚作为担任公职的试金石,煽动了这种寻求更多死刑判决和更快执行的螺旋式努力。这一问题正在将高度合格的候选人排除在竞选或获得公职之外。当对死刑的丝毫犹豫都会让人被贴上“对犯罪软弱”的标签时,关于死刑价值的理性辩论变得越来越困难。最终,死刑损害了司法系统本身的完整性,因为个人权利被牺牲以换取政治利益。

    翻译:汪秉均,中央民族大学法学院2022级本科生。

  • 戴鑫:纸草档案与托勒密埃及的社会经济史研究

    纸草学诞生于19世纪末20世纪初的欧洲,是一门主要研究希腊罗马时代埃及(约公元前4世纪至公元7世纪中叶)纸草及纸草文本的学科。19世纪80年代,埃及考古学之父弗林德斯·皮特里在埃及法雍地区的考古活动,以及英国牛津大学古典系伯纳德·格伦菲尔和阿瑟·洪特在奥克西林库斯的发掘,使得包含希腊罗马时代埃及行政文本在内的大量纸草在千年之后重见天日。埃及源源不断的出土文献令欧洲学术界大为震撼,德国的罗马史家特奥多尔·蒙森曾预言“20世纪将是纸草学的世纪”。他的学生乌尔里希·威尔肯投身纸草研究,于1900年创建第一个纸草学期刊《纸草档案研究》,成为纸草学诞生的标志之一。

    纸草学家将同一个人、家庭、社区保存的一系列纸草整理为档案,便于开展史学研究。纸草档案兼指官方文书和私人书信,因保存者来自社会各阶层,可能将官方通信、行政文书和私人家书混杂存放。芝诺档案是现存托勒密埃及时期(公元前305至公元前30年)数量最大的档案(总计2063份纸草文本,其中1800余件为希腊语文本),所有者为考诺斯的芝诺,他曾担任托勒密二世时期财政大臣阿波罗尼奥斯的秘书,还受命为后者管理地产,负责组织灌溉近2750公顷土地。芝诺收藏了工作与生活相关的各种官方文件和私人通信,时间跨度为公元前263年至公元前229年,在近代学者重现和探讨托勒密埃及经济制度中扮演至关重要的角色。1911年,芝诺档案中的部分文本首次出现在伦敦和斯特拉斯堡。1914年冬,大量芝诺档案文本分批次流入古物市场,先后为埃及开罗博物馆、大英博物馆以及其他欧美博物馆、科研机构或私人等收藏。

    20世纪20年代,俄裔美国纸草学家迈克尔·罗斯托夫采夫利用新整理的部分芝诺档案撰写《公元前3世纪的埃及大地产》,是为托勒密埃及社会经济史研究的发端。早期研究即以希腊语纸草档案为核心史料,重点关注托勒密家族的王室经济。1939年,比利时纸草学家克莱尔·普雷欧出版专著《拉吉德王室经济》,详尽而细致地描述了托勒密王室对经济的高度控制或“垄断”。不久,罗斯托夫采夫在《希腊化世界社会经济史》(1941年)中,进一步强调埃及的“国家垄断”和“计划经济”色彩。他指出托勒密二世实施经济和社会改革,从而在埃及确立了希腊化经济体系。某种意义上来说,罗斯托夫采夫和普雷欧依靠纸草档案,共同奠定了欧美学术界关于托勒密埃及社会经济史研究的基础。

    20世纪70年代以来,纸草学家们将工作重心转向早期研究中忽视的地方经济。除了希腊语纸草档案之外,他们还着手收集、整理不同时期个别村庄或地产的相关埃及语纸草文本,按专题重新分类汇编。门西斯档案在格伦菲尔和洪特发掘的纸草文献中最为著名,保存者为公元前2世纪法雍地区科尔克奥西里斯的书吏门西斯,他详细记录了当地农业经济和行政管理情况。由于这些纸草文书出土时混杂在鳄鱼木乃伊中,门西斯档案的重建工作颇为不易。1971年,英国剑桥大学的多萝茜·克劳福德利用该档案重点分析科尔克奥西里斯的行政、土地、人口和农业状况,揭示古埃及政府试图对该地区的人口、税收、土地和农业生产进行精准测算、记录和管控,建立了一套复杂而严密的土地登记系统。

    随着纸草档案编辑重心的偏移与研究视野的拓宽,埃及社会经济史的研究被赋予新的生命力。学者们的关注点不再仅限于地方行政与经济层面,还广泛涉及个体社会生活的方方面面。比利时纸草学家威利·克拉瑞斯搜索和整理一些分散于世界各地的皮特里纸草(由英国考古学家皮特里发掘于法雍附近的古罗布,也称为古罗布纸草),出版了其中53份遗嘱类文本,展现了托勒密埃及法雍地区封地军人的家庭关系、身份、财产以及当地的农业和地产信息。美国纸草学家纳夫塔利·路易斯以特定群体为研究对象,整理汇编了兼具官方与私人性质的个人文本,以个案研究的形式再现不同职业和社会身份的希腊移民在埃及的社会生活。1998年,荷兰纸草学家阿瑟·维胡格特也以门西斯档案为研究对象,描述了更为微观的社会生活场景,关注门西斯本人的社会身份认同、工作、生活,通过信件的格式规范推测门西斯和通信人的等级关系。

    纸草学数据库建设则引领了计量分析和利用计算机进行研究的潮流,也加快了跨学科进行社会经济史专题研究的进程,增加了对人口统计、社会结构、城市化、社会关系等领域的关注。耶鲁大学和密歇根大学最先开始对收藏的纸草进行电子编目。杜克大学于1982年开始建立杜克纸草文本数据库,收录已经出版的纸草文本。20世纪90年代中期开始,欧美一些高校和科研机构开始大规模扫描纸草文本,通过互联网建立起世界范围的纸草档案库,纸草文献得以通过数字化信息的形式在网络上被查阅和检索。比利时鲁汶大学的纸草学家重视结合数据分析,探究希腊罗马时代埃及个人与社会的关系。截至2019年,鲁汶的特里斯迈吉斯托斯数据库(Trismegistos,缩写为TM)已收录680123份纸草数据,其中370086份文本记录了496702个人的信息。

    进入新世纪,托勒密埃及的社会经济史研究迎来一个新的高峰。2006年,克拉瑞斯和剑桥大学纸草学家多萝茜·汤普森历时十五年的合作,共同出版了《计算希腊化埃及的人口》。该书重点收集、编辑了从法雍至中部埃及吕克波利斯州(诺姆)一百年间(公元前250年至公元前150年)的税收类纸草文本,按照区域和税收类型划分为54组,对上述地区的人口、家庭、婚姻、职业、族群以及财产等情况进行量化分析,展现出王朝中期社会经济发展的动态图景。基于这一研究成果,鲁汶大学研究员卡嘉·穆勒尝试用社会网络分析以及地理学理论研究托勒密埃及国内外新定居点的分布情况,认为它们构成了支撑国家权力的网络,对托勒密埃及国家经济稳定发挥了重要作用。鲁汶数据库TM也于2012年建立网络分析系统,极大地推动了托勒密埃及人物志研究,可以用于分析个人、家庭、地点、人名甚至埃及语书信的关联。目前,帕许里斯档案、法雍档案、上埃及档案等多个项目仍持续进行。

    欧美学术界关于托勒密埃及的社会经济史研究在很大程度上依托于纸草文本的收集整理、档案编辑方式、技术方法的革新等,伴随着纸草学研究的开拓而延展。数量庞大且不断新增的纸草文书使这一研究领域具有独特的优势。鲁汶大学TM收录公元前6世纪至公元7世纪的档案超过500份,共计近2万件文本。据美国纸草学家范·明宁估算,到2030年时,出版的纸草文本将不少于10万件。尽管纸草文献的编辑、整理工作漫长而艰苦,但纸草学经过一百余年的沉淀、累积和更新,逐渐克服技术困难(如文献残损、勘误、确定年代和地点等)和文献内容庞杂且零碎等缺陷,档案的整理也已取得长足进展。在跨学科合作和计算机网络技术的助推下,无论是综合研究还是微观考察,都将进一步完善研究者们对托勒密埃及的社会经济和文化图景的绘制。

    本文转自《光明日报》( 2025年02月24日 14版)

  • 陈志武,林展,彭凯翔:海洋贸易与中国南方的兴起(671-1371年)

    今天中国的经济重心显然在南方,特别是在包括广州、深圳、杭州、上海等特大城市的南方沿海省份。然而,在唐代(618-907)之前的数千年里,中国的社会、经济和政治中心一直位于北方。正是在唐、宋(960-1279)和元(1279-1368)三个朝代,南方才崭露头角。那么,是什么促成了这一转型?是谁推动了这一根本改变中国经济社会地理的转型?陈志武、林展和彭凯翔在Asia-Pacific Economic History Review 2025年最新一期的论文发现,阿拉伯-波斯商人触发并主导的海上贸易,特别是瓷器贸易,是唐宋元时期南方崛起的重要原因。

    中国大概在9千至1万年前进入定居农耕。在接下来的数千年里,许多地区出现了人类定居点,包括南方沿海地区,但代表当时先进发展水平的防御性城邑(由城墙或壕沟所包围),仅出现在长江流域沿线及其以北的地区。目前,已发掘的城邑遗址,新石器早期(公元前8000年-前5000年)有13个,新石器中期(公元前5000年-前3000年)有56个,新石器晚期(公元前3000年-前1700年)有128个。这些小型城邑,虽然按现代标准面积较小(通常小于一平方公里),但它们是中华文明的早期摇篮。

    在解释为什么中国的史前发展及早期发展未在南方发生时,陈志武、Peter Turchin和王万达(2023)指出,北方地势较为平坦,缺乏自然屏障,使得当地居民更容易受到武力攻击,因此不得不推出人工防御措施,尤其是建设防御性城墙,并引发早期城邑的诞生;因为这种较高的战争威胁迫使北方建立城墙城邑,让北方起先建立并治理高人口密度的复杂社会,开启文明化发展进程,而南方因山区多、易守难攻,故战争威胁少,就无必要建立城墙城邑,错失发展早期复杂社会的机会。因此,战争驱动型增长是北方史前和早期历史时期的特点,这可以称之为北方发展模式。(这里的北方与南方大致以长江为分界线,文章还考察了南方沿海的府,见图1)。

    图1 1820年清朝的南方沿海、南方和北方

    然而,从公元8世纪起,南方开始崛起,表现为其人口占比从742年时的24.6%增长到1393年时的58.3%(南方沿海府的比例在同一时期从5%增至18.5%),这彻底改变了中国的社会经济格局,使经济与社会重心转移至南方(见图2)。

    图2 中国本土(相当于清代的内地十八省)南方(红色)及南方沿海(蓝色)的人口占比

    注:对于公元前5000年至公元2年之间的时间点,使用每个地区的考古遗址数量作为人口的代理度量,数据来源于香港大学量化历史中心的中国考古数据库(CADB)。对于公元2年后的时间点,每个地区的人口估算来自《中国人口史》和国家统计局。

    从唐代初期(618年-907年)到明代初期(1368年-1644年)的七个世纪,历史学家称之为唐宋元(或简称唐宋)转型,因为它涵盖了唐、宋(960年-1279年)和元(1279年-1368年)三个朝代。自新石器时代早期以来,南方一直是中国的边陲地区,因此南方常被称为“南蛮”,但在转型的高潮时期,南方出现了许多繁荣城市。当马可·波罗在13世纪后期访问中国时,他对泉州——一个在唐代之前并无太多人烟的港口城市——印象深刻,称其为“世界上两个最大商贸港口之一”,并称之为“东方的亚历山大”。另一个例子是,广州的户数从713年-741年时的64,250户增加到1174年-1189年时的195,713户。

    从14世纪末开始,尽管南方在明清时期经历了绝对人口的增长(除了太平天国时期),但相对占比逐渐衰退,其人口比例从1393年时的58.3%下降到1953年时的38.1%(见图2)。在1393年至1953年间的每个分时期,北方的人口增长率始终超过南方和南方沿海地区(见图3)。因此,到明代晚期,社会经济重心再次回归北方(至少在中国本土范围内)。

    图3  中国本土(相当于清代的内地十八省)南方、南方沿海和北方的年人口增长率

    注:蓝色、红色和黄色线条分别表示南方沿海、南方和北方各时期的年人口增长率。数据来源与图2相同。

    唐宋元时期发生了什么?

    那么,到底是谁、什么事促成了唐宋元大转型?现有文献强调了(1)农业技术进步的作用,如占城稻的引进,(2)水道和河流网络的改善,(3)战争引发的从北向南大移民,以及(4)国内商业的增长。关于海上贸易是否在唐宋转型中起到了关键作用,学术界也存在争议。有学者认为,海上贸易推动了宋代南方的商业革命,也有学者认为海上贸易的影响不应被过分高估,因为关税收入仅占财政收入的一小部分。而关于阿拉伯波斯穆斯林商人发挥的作用,就更是研究甚少。

    本文指出,推动南方崛起的正是从7世纪末开始逐步进入广东及其他沿海地区的阿拉伯波斯穆斯林商人(以下统称阿拉伯商人),他们引发了海上贸易的繁荣,并使海洋贸易在南宋至元代期间达到顶峰,推动了南方中国的崛起。从这个意义上,这种由市场驱动的斯密式增长,通过对外开放和远程国际贸易,创造了南方。

    正因为南方的崛起是由贸易和商业推动的,而北方早期的社会经济发展是由战备驱动的,所以,南方文化明显偏向商业和市场(南方重商、志在经商),而北方文化则倾向于政治权力和等级制度(北方重权、志在做官);南方和北方发展的驱动力差异带来了新的城市类型——北方发展出了“城”(突出防御功能的城墙),而南方发展出来的是贸易“市”镇(market towns),虽然现代中文里把两类高人口密度的聚集地合在一起叫“城市”。南方兴起的市镇与北方的防御性城邑之鲜明对比,也在于南方市镇更加开放,专注于商业和民生,而非防御性战备的军力建设。

    证据何在?

    为了验证以上假说,文章聚焦到海上瓷器贸易。在明代之前,瓷器是中国最主要的出口商品之一。尽管并非唯一的出口商品,但却是海上贸易的代表性商品。

    本文展示了海上瓷器贸易的三组宏观数据。首先,从生产方面看,从7世纪末到14世纪末,南方生产瓷器的主要窑址数量大幅增加,尤其是在宋元时期达到顶峰;在隋唐时期,主要陶瓷窑址多半在北方,而到宋元时期,61.9%的主要窑址在南方,特别是沿海。但从15世纪开始,南方的陶瓷窑址数量和占比都急剧下降(因为朱元璋开始实施的海禁)(见图4)。

    图4  隋唐、宋元和明清时期主要窑址及位于南方的百分比

    注:左侧和右侧纵轴分别表示每个时期(隋唐、宋元、明清时期)主要窑址的总数和位于南方的主要窑址百分比。

    其次,从陶瓷消费端——出口目的地:中东、西亚、东非、北非——来看,在那里已经考古发掘出的中国陶瓷碎片总数,自公元9世纪以来逐步上升,在14世纪达到顶峰,随后显著下降(见图5)。如果将这些当年陶瓷出口目的地挖出的中国陶瓷碎片总数视为衡量中国每个世纪瓷器出口量的代理指标,那么,瓷器出口也应在14世纪达到巅峰,这跟图4反映的瓷器生产端的起伏情况高度吻合,也跟图2反映的南方人口占比的起伏高度一致:从唐朝中期开始上升,到元末、明初达到峰值,然后因朱元璋海禁而逐步下滑。这表明,海上贸易,尤其是瓷器贸易,在推动南方崛起的过程中发挥了关键作用。

    图5  西亚和非洲出土的中国瓷器碎片与南中国出口瓷器的百分比,按世纪划分

    注:左侧和右侧纵轴分别表示“在西亚及东非和南非出土的中国瓷器碎片总数”和“从南中国窑址出口到西亚和非洲的瓷器百分比”。本图使用的数据来自张(2024),涵盖了来自170个遗址的27,729件瓷器碎片。

    为了正式检验上述假说,文章以覆盖清代中国的269个府为基本分析单位,基于三个不同时期的面板数据集做具体量化验证:742年–976年(唐代)、976年–1393年(宋元)和1393年–1851年(明清),每一分期为面板分析的基本时间单位。由于数据的限制,尤其是古代窑址和府级人口的数据,这些时间段并不完全对应各个朝代的始末年份。由于历史上府的边界变化频繁,他们采用已有文献中的一贯做法,将各时期的数据调整到以清代1820年的府为基准。

    (A)隋唐时期
    (B) 宋元时期
    (C) 明清时期

    图6  各时期的海关位置(星)、主要窑址(蓝点)和年人口增长率(橘红深浅)的分布

    分析中,被解释的结果变量为每个府在一个时期的年人口增长率。在文献中,人口密度常常被视为度量经济发展水平的代理变量,因为在工业化之前的马尔萨斯经济中,繁荣的地区能够支持较高的人口密度。然而,由于文章中的面板数据涉及不同年数的分时期,作者采用各期的年化人口增长率,以确保结果变量的跨期可比性。
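    文中未给出年化增长率的计算公式;按其通常定义(此处为补充说明,非论文原文),若某府期初、期末年份分别为 t_0、t_1,对应人口为 P(t_0)、P(t_1),则该期的年化人口增长率为:

```latex
r \;=\; \left(\frac{P_{t_1}}{P_{t_0}}\right)^{\frac{1}{t_1 - t_0}} - 1
```

    例如,一个府的人口若在976年至1393年的417年间恰好翻一番,其年化增长率约为 2^(1/417) − 1 ≈ 0.17%,这一换算保证了长短不一的分期之间结果变量可比。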

    核心解释变量(原因变量)是每个府到最近海关的距离。这一指标的设计是基于这样一个理论:离海关越近的地区应当具备更低的海贸成本、更强的市场信息优势和更便利参与海上贸易的条件,因此,他们更会参与海上贸易,其地方经济和人口增长应从海上贸易受益更多。唐代之前并未设立正式的海关。大致自公元713年起,唐代在广州设立了专门管理海上贸易的官员——“市舶使”。宋代至明代的海上贸易管理机构为“市舶司”,清代改名为“海关”(其各时期的分布,见图6)。为行文方便,这里统一称为“海关”。
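    “到最近海关的距离”这一变量的构造思路,可以用一小段示意代码说明(其中的海关名单与坐标均为示意性假设,并非论文所用数据;论文中各时期海关、市舶司的实际位置见图6):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # 以 haversine 公式计算地球球面上两点间的距离(公里)
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# 假设性的示例坐标(仅为演示):三个历史海关/市舶司所在地
customs = {
    "广州": (23.13, 113.26),
    "泉州": (24.87, 118.68),
    "明州(宁波)": (29.87, 121.55),
}
prefecture = (30.25, 120.17)  # 例:杭州府治附近

# 某府的核心解释变量 = 到各海关距离中的最小值
d_min = min(haversine_km(*prefecture, *loc) for loc in customs.values())
print("到最近海关的距离(公里):", round(d_min))
```

    对269个府逐一取这一最小值,即得到回归中的距离变量。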

    实证分析表明,在研究期内,越是靠近海关的地区,其人口增长率就显著高于远离海关的地区。具体来说,如果一个府在各方面与一般地区相同,但其到海关的最短距离是后者的两倍,那么,该府的年化人口增长率会低于一般府,仅为后者增长速度的一半。因此,海上贸易对742年至1851年间地方社会经济发展的影响是显著的;尤其在唐宋元时期,海上贸易带来的经济与人口增长最为凸显,成为南方崛起的主要推动力,但从明初开始,这一效果就逐渐式微。

    为了将海上贸易的影响与其他混杂因素区分开来,文章在稳健性检验中加入了若干控制变量,如地区的地形崎岖度和水稻、小麦宜种指数(以排除农业生产条件的影响)。作者还控制了每个时期各府经历的战争数以及战争移民的影响,以排除战争和大规模迁徙的影响。此外,还控制了河流网络密度,以消除国内商业活动对检验结果的影响。在考虑了这些其它效应之后,基本结果仍然成立:海洋贸易参与度越高的地区,在唐宋元期间的人口增长速度显著越快。
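    上述回归设定可以用一段示意代码勾勒:以各府分期年化人口增长率为被解释变量,对到最近海关距离的对数及若干控制变量作最小二乘回归。以下是一个极简草图,数据为随机合成、变量与系数均系假设,并非论文的原始数据或代码,仅用于说明“距离—增长”负相关这一核心设定:

```python
import numpy as np

# 合成示意数据(假设性,仅为演示回归形式)
rng = np.random.default_rng(0)
n = 269                                   # 分析单位:269 个府
log_dist = rng.uniform(3, 8, n)           # log(到最近海关的距离)
ruggedness = rng.normal(0, 1, n)          # 地形崎岖度(控制变量)
wars = rng.poisson(2, n)                  # 分期内经历的战争数(控制变量)

# 设定一个“离海关越远、增长越慢”的真实过程,再加噪声
growth = (0.02 - 0.004 * log_dist + 0.001 * ruggedness
          - 0.0005 * wars + rng.normal(0, 0.001, n))

# OLS:年化人口增长率 ~ 常数 + log(距离) + 控制变量
X = np.column_stack([np.ones(n), log_dist, ruggedness, wars])
beta, *_ = np.linalg.lstsq(X, growth, rcond=None)
print("log(距离) 的系数:", beta[1])  # 预期为负
```

    论文实际使用的是分期面板,并纳入更多协变量(水稻、小麦宜种指数、河网密度、战争移民等);此处省略这些细节,只保留回归的基本骨架。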

    是陶瓷贸易吗?

    以上分析表明,接近海关代表着较高的海上贸易参与潜力。但潜力不等于现实。为了深入挖掘一个地区的实际出口贸易参与水平,文章使用各府主要瓷器窑址的数量作为代理变量,以衡量该地区在一时期内的出口贸易参与程度。因为在十五世纪之前,瓷器和丝绸是中国主要的出口商品,茶叶还没唱主角。根据考古发掘,南方沿海地区自唐代起就有窑址,但起初,大多数窑址位于北方,远离海岸(见图6A);但在宋元时期,由于海上贸易的蓬勃发展,沿海地区兴建瓷窑,许多主要窑址就位于更接近海岸的地区(见图6B);1371年海禁政策出台后,许多沿海窑址在明清时期(1644–1911年)被废弃(见图6C)。文章的分析显示,在742年–1393年(唐宋元时期),靠近海关的地区拥有显著更多的窑址,这些地区的人口增长率也显著更高,而这一效应在明清时期则明显减弱。这些结果在控制了多个协变量的影响后仍然成立。

    与实证结果密切相关的一个问题是:为什么阿拉伯商人能够长期主导远洋贸易?值得注意的是,南方沿海的海上贸易至少可以追溯到战国时期(公元前475年–前221年),而陶瓷窑址的历史则更早。但在公元7世纪之前,海上贸易的规模和范围是有限的。转折发生在7世纪,伊斯兰教在中东兴起,并很快在西亚、东非和北非扩散传播,带来了以下变化。

    首先,伊斯兰教的圣训禁止使用黄金和白银作为饮食器皿,因此瓷器成为穆斯林精英以及后来的中产阶级的良好替代品。早期的瓷器如陶碗、杯子、罐子和瓷壶占据了商船的大部分空间。其次,伊斯兰艺术和建筑必须避免描绘人类和动物形象,因为教义禁止偶像崇拜,这促使采用抽象的几何和植物图案以及书法来进行装饰、绘画,如清真寺、建筑、家居、装饰、坟墓等场景所见。瓷器的非具象特性使其成为同时满足宗教和装饰功能的理想媒介。正因为如此,瓷器成为了伊斯兰艺术的标志性元素,结合了功能性、宗教规则和美学。尤其是,伊斯兰艺术和装饰偏好重复使用同样的形状图案,需要重复使用大量一模一样的瓷片,这就要有大量劳动力,而唐宋元时期的中国是同时期人口最多的国家,加上中国有悠久的精良陶瓷工艺传统;于是,随着伊斯兰教的成功传播,也由于伊斯兰艺术的特殊要求,通过阿拉伯商人作为贸易中介,创造了对中国瓷器的巨大需求,让中国在宋元时期就成为“世界工厂”(当然,制造的是陶瓷),促成了本文所研究的海上贸易繁荣并造就中国的南方。

    总结来说,从唐代到元代,中国南方的崛起是由于阿拉伯和波斯商人唐初来到中国,带来对瓷器和其它商品的巨大需求,并在宋代中期以前主导了远程海洋贸易;为了配合长程贸易的陶瓷等商品需求,不仅沿海地区的经济和社会得以发展,而且也带动了离海岸线较近的南方各地的商业、手工业和农业,这种外溢辐射效应就跟1980年代的对外开放贸易不仅带动沿海,也带动了南方其它地区的发展一样。阿拉伯波斯商人催生的这一变化促使南方成为中国经济和社会的中心,并将中国的市场经济进一步融入到全球网络中。与此同时,北方的传统发展模式依旧与战争和防御相联系,形成了南北两种截然不同的经济文化模式,分别代表了北方与南方的历史发展轨迹。

    总 结

    本文的贡献涉及三方面。首先,它加深了我们对海上贸易和斯密增长影响的理解。工业革命之前,斯密增长主要是由市场扩展带来的专业化增加推动的,是社会经济发展的主要原因。本研究强调了海上贸易在唐宋元时期推动斯密增长的关键作用。此前,斯密增长被认为是宋代收入增长的重要因素,但阿拉伯商人主导的海上贸易并未被视为其主要驱动力。此外,根据本文的发现,斯密增长在中国的开始时间应追溯到唐代,而非宋代,因此比现有文献中所述的时间要早得多。这一研究也补充了Acemoglu等(2005)的工作,后者将大西洋贸易确立为现代欧洲沿海国家社会经济发展的主要驱动力,然而,本文的研究重点是中世纪时期印度洋和西太平洋的海上贸易,比西方大航海时代要早九个世纪。

    其次,本文的研究为唐宋变革的讨论做出了贡献。特别是,文章不仅描述了中国社会经济中心从北方向南方的转移,而且实证性地展示了是谁(阿拉伯-波斯商人)和是什么(海上贸易,尤其是瓷器贸易)触发了这一转变,丰富了我们对中国历史从7世纪末到14世纪末的理解,阐明了南方崛起的原因。尽管以往的研究侧重于农业技术进步、水路改善、战争引起的大迁徙以及国内商业发展在解释这一转型中的作用,但它们大多忽视了阿拉伯-波斯商人对海上贸易的影响以及海上贸易对唐宋变革的催化作用,尤其是由阿拉伯-波斯商人建立的远距离跨国信任网络,这些网络根植于他们共同的伊斯兰信仰。

    第三,本研究为应用考古学和历史数据研究中国历史做出了贡献。正如王庚武(2003)所强调的,历史记录的编纂者大多数来自北方,尤其是在唐代之前,他们并不了解南方,特别是不知悉沿海地区的情况,因此无法将许多关于南方发展的事件和进展纳入早期的历史档案。由于传统的历史学家在研究这些早期王朝时主要依赖历史档案,他们的研究深度和广度因此受到了限制。然而,近年来,中国、西亚、西南亚、东南亚、东亚和非洲的陆地及沉船遗址的考古发掘出版了大量文献,为研究海上贸易史的全貌及其社会经济影响提供了丰富的数据集。通过将考古数据与历史数据相结合,本文为揭示唐宋元变革的触发因素提供了新的视角,进一步阐明了南方崛起的过程。

    Chen, Zhiwu, Zhan Lin, and Kaixiang Peng. “Rise of the south: How Arab‐led maritime trade transformed China, 671–1371 CE.” Asia‐Pacific Economic History Review (2025).

  • 杜润生:对深化改革的一点看法

    关于农村经济政策问题的一些意见

    今年(1981年)元月一日至八日,我随紫阳同志到鄂豫鲁三省的宜昌、荆州(重灾区)、南阳、开封和菏泽(困难地区)五个专区,对农村情况进行了考察,听取了地方干部的汇报,访问了一些农户。据一路所见所闻,深感农村形势比我们所想象的还要更好一些。在生产方面、党群关系方面、干部工作作风方面,都出现了好的势头。这就进一步证明了党的三中全会以来,中央关于农村的重要决策都是完全正确的。坚持下去,必然会推动农村事业更加蓬勃地向前发展。

    一、困难地区实行包产到户稳定几年,大有好处。

    河南省的兰考县和山东省的东明县,属于长期落后、贫困的地区,是生产靠贷款、吃粮靠返销、生活靠救济的“三靠”穷县。这两个县都是实行了包产到户和大包干到户。从一九七八年开始试行至今,兰考县已占生产队数的百分之八十,东明县占百分之九十以上,经济效果显著。兰考县粮食总产量,近十几年在二亿斤上下徘徊,一九八〇年达到三亿一千万斤,全县一九七八年还净吃返销粮八百万斤,一九七九年转缺为余,一九八〇年净交售三千二百万斤。棉花、花生也大幅度增长。社员人均集体分配收入,由一九七九年的四十九元七角,增至八十元,如将超产部分的个人收入计算在内,可达一百几十元。有个最穷的生产队,社员常年在外要饭糊口,包产到户后,一年人均口粮即达五百八十六斤,最困难户收入亦达三、四百元,还出现不少千元以上的“富裕户”。一九八〇年全县累计社队陈欠国家贷款一千五百万元,当年增产增收后,农民立即偿还陈欠贷款一百八十万元。东明县一九五八至一九七八年二十年间,净吃国家返销粮四亿五千万斤,花国家救济款和累欠国家贷款达七千八百万元。现在也由缺粮县变为余粮县。到目前为止,国家已收购粮食六千万斤,棉花三百万斤,花生七百四十万斤,芝麻四百七十万斤。社员人均集体分配收入一九七九年为三十一元,一九八〇年连超产部分的收入计算在内,超过百元。全县农村的人均储蓄存款,一九七九年为三元,一九八〇年达十七元。

    开封地区的登封县和菏泽地区所属各县均实行了包产到户,与兰考、东明的变化情况大体相同。

    目前,这些地区社员的温饱问题已大体解决。农民喜气洋洋说:“过去愁着没饭吃,现在愁着粮食没处放,再不用出门要饭了。”“联产联住心,一年大翻身。红薯换蒸馍,光棍娶老婆。”农村市场上,手表、自行车、缝纫机、收音机,的确良等消费品供不应求。有百分之十的农户盖起了新砖瓦房。同时,对生产资料的需求量也大大增长,大牲畜、架子车、双犁、轧花机、小型脱粒机、高质量的手扶拖拉机等添置不少。他们说:“二十多年了,可熬到自己能当家了”。现在是“既有自由,又能使上劲。”“戏没少看,集没少赶,亲戚没少串,活没少干,粮没少收”。到处听到同样的呼声,希望能三几年不变,“一年不变有饭吃,二年不变有钱花,三年不变小康家,国家赶快盖粮仓。”

    这些长期落后,贫困的地区,在短短一两年内发生了如此显著的变化,原因是多方面的。气候好,“天帮忙”固然是一个重要因素,但是在极左路线下也有天时好的时候,并未见引来象去年的这种变化。看来起主导作用的,还是党的政策。据菏泽地委谈,三中全会以来,他们根据中央文件精神落实了十一项政策,其中主要的有三条:
    (一)尊重社队自主权,因地种植(过去沙壤地不准种花生,盐碱地不准种棉花,淤地不准种大豆)。
    (二)收购价格优惠(这些穷困地区没有征购任务,或基数很低。现在交售的粮、棉、油多按超、议购价格收进)。
    (三)生产队建立了各种生产责任制,并允许包产到户。

    包产到户激发了农民的生产积极性,这是一个不容置疑的事实。过去一个相当长的时期内,把集中劳动和平均分配当作集体经济的优越性来提倡,大呼隆加上吃大锅饭,把农民的主动性和积极性都搞掉了。社员在干部的监督下进行“集体劳动”,干多干少、干好干坏一个样,一年干到头,分到的东西还不足糊口。农民穷得活不下去,想自己谋点生路,又被当作资本主义行为来批判、斗争、限制,一点自由都不给。社员出工不出力,搞低效劳动或无效劳动。干部管得越紧,群众应付办法越多:“队长在,我就磨,队长走,我就站。”人们把这种情形概括为三个字:“摽、穷、靠”。摽在一起受穷,穷得没饭吃,就靠国家救济。干群关系越来越坏。一个支书说:“一年之内,春、夏、秋拿龙提虎,冬天当狗熊。”意思是平时想法儿整治社员,得罪了人,一到冬天搞运动时,就成了斗争对象。上级领导看到集体办不好,总认为是“资本主义作怪”,连年整顿,越整越“左”,离群众也就越远。集体经济本来是为了解放生产力,可是由于采取了上述过左做法,压抑了社员积极性,就走向反面,变成了生产力发展的桎梏。了解了这些情况,就不难理解包产到户为什么在贫困、落后地区有那么大的吸引力。对于包产到户,群众热烈欢迎,干部冒险倡导,这正表明,生产关系一定要适合生产力性质这个法则,在背后起着不可抗拒的作用。在与干部谈话中,紫阳同志说:“包产到户,堵是堵不住的,只能导,不能堵。群众要求政策三年不变,我们就按群众意愿办。在这些地方,包产到户的办法要稳定一个时期。”只有这样,符合当地实际,有利于大局。

    类似兰考、东明这样的穷困地区,全国大约有一亿五千万人口。退到包产到户,搞它三、五年,使这里的社队转变穷困面貌,使每个农民平均收入达到一百元上下(集体收入和家庭收入),并减轻国家每年返销几十亿斤粮食的负担,是完全有可能的。包产到户,特别是包干到户这种形式,虽然带有个体经营性质,但由于它是处在社会主义经济条件下,不同于历史上封建社会时期的小农经济,今后一个时期还会有相当大的生产潜力可以发挥,这是可以肯定的。以两千年搞小农经济受穷为理由,来否定包产到户有增产可能性,是缺乏根据的。当然,包产到户也有它不容否认的局限性和消极因素。在这些地方,包产到户和大包干到户带来的各种矛盾和问题,如计划种植、农机利用、水利设施的维护和使用、地块零散、军属和五保户的优抚、民办教师赤脚医生的待遇等等问题,已经遇到了,也提出来了。但据已有经验,凡是生产队组织和领导能继续下去(这点至为重要)的地方,都能找到某种解决办法。如:农机具可以包给机耕承包组或户,实行计费代耕;民办教师包了一份田,又补口粮几百斤,加上每年公助费一百八十元,收入不算太低,军烈属也有照顾办法。而且,对于包产到户,应当作为一种过渡形式来评价其作用。随着生产的发展,农民对扩大再生产的要求必然会提出来,那时就会重新走向新的联合。一些农民也很清楚:“包产到户是个穷法儿。三几年后,叫俺咋办就咋办,俺还要集体的。”听说实行包产到户较早的社队,社员之间由于各种条件不同,已出现了收入差距;一部分农民为了克服生产上的困难,又开始了小规模的合作,如简单的牲口插犋、换工、调整地块等等。有些资金较充裕的人,三、五联合起来,自负盈亏,搞打井、机耕、育种、粮米加工等专业性的技术服务业务。预计今后承包土地会逐渐向务农能手集中,副业向另一些能工巧匠集中,逐步形成专业化分工。然后在这个基础上扩大联合范围。可以看出,包产到户走向联合是必然的,但不一定再走过去的路子——一声令下,全面组织起来,而将根据经济上的需要,通过各种自愿的小型合作,走上逐步扩大的道路。这是后话。现在应当先稳定下来。在稳中求变,不要急忙图进。

    本文来源:农业集体化重要文件汇编,中共中央党校出版社1981年10月第一版

    对深化政治体制改革的几点看法

    一、当前中国要过好“市场关”与“民主关”

    在加入 WTO 以后,中国承诺了,而且国际认同了中国将按WTO的规则,即全球化贸易规则,重新修订中国的有关法律、规章。包括总结历史经验,需要在《宪法》中规定市场经济和私有经济的合法性,并接受工商联的建议,进一步确认在现阶段,和公有财产一样,应“保护私有财产不受侵犯”。

    过“市场关”,必须同时过好“民主关”,两者密不可分,不能只接受市场,不接受民主。经济上所有制多元化,反映到政治上必然出现多种经济主体参与的新格局,他们分别代表不同所有制与不同阶层的经济利益,提出不同的要求。为使这些不同声音、不同要求得以充分表达,作为执政党,必须发扬民主,尽可能地从多方面集中群众意见,避免决策的失误。这就是在过“市场关”的同时,还要过“民主关”的经济动因。那些不利于经济发展的体制性障碍,实质上是当前深化改革、稳定社会的主要桎梏,也是对执政党地位的一种潜在威胁。江泽民同志提出加强民主法制,进行具有中国特色的,而不是形式上照搬西方的深入的政治体制改革,是一项正确决策。

    二、过好“民主关”,必须确立相应的制度框架

    (一)政府主要官员经民主选举,候选人实行差额选举法,行政司法立法,相互分工,相互制衡,防止政府过度集权。

    (二)给农民以国民待遇。从制度上、体制上、法律上废除歧视农民的分割城乡的户籍制,让农民享有自由迁徙权和《宪法》给予的其他公民权利。除土地税外,免除其他附加税,经营服务业与城市居民一样缴纳所得税。

    (三)根据江泽民同志“七一讲话”精神,加强执政党的建设。建设有中国特色的社会主义,必须坚持“四项基本原则”不动摇。鉴于市场经济包含多元化的经济成分,极为分散的独立的企业,复杂的对内对外的经济联系,以及频繁的社会交往,党的一元化领导应主要依靠制定方针政策和党员模范作用来实现。不可以党代政,干涉政府、社团、企业、事业单位的具体业务。党要管党,特别是管好在不同岗位上担负领导工作的干部,要求他们以身作则,凭本人道德品质和优良业务水平,以及贯彻执行党的方针政策的坚定性,密切联系群众,从整体上推动社会进步。要发动群众实行民主监督,防止公务人员违法乱纪,贪污腐败,蜕化变质。

    (四)加强全国人大、政协的民主功能。建国前后,毛泽东、周恩来极其重视政治协商会议,拟订政协《共同纲领》,实行共产党领导下的多党合作制,通过民主讨论,集思广益,共商国是,提倡从团结的愿望出发,经过批评自我批评达到新的团结,以利于发挥各阶层、各界人民的建设积极性。关于政府组成,早在抗日战争时期,毛泽东就规定了“三三制”的权力结构。放手使用、信任非党民主人士参加政府工作。今天,党具有崇高威望和掌握政治上、军事上及组织上不可替代的实力,应当更充分地发挥人大、政协的作用。党不宜既当“运动员”,又当“裁判员”,要从直接干预经济事务中退出,以便发挥好领导作用。

    民主的实质,首先是一种办事秩序,重大的问题要经过当事人、有关者,特别是法定协商机构,表达意见,体现决策民主化与科学化。人大是最高权力机构,应充分发挥《宪法》赋予人大代表的神圣的民主权利。对于人大代表提出的问题以及批评、建议,党组织应采取热情支持、鼓励的态度。由人大、政协承担部分民意的反馈作用,对全局和长远的稳定是极为必要的,不可或缺的。

    (五)要消除民主“恐惧症”。一个民主国家发生一点小乱子不可避免,不必害怕。中国不会由于民主而出现大规模的动乱,只会由于不民主而出现暴力闹事局面。

    有13亿人口,占地960万平方公里的大国,出点小乱子有利于暴露出隐患和潜在矛盾,及时研究对策,改正错误,有利于防止小病酿成大病。因此,对个别地方群众集体反映意见,无需惊慌失措,但要有充分的思想准备和预警方案、对策。在和平建设时期,人民内部矛盾是客观存在,甚至会突出起来,解决矛盾的惟一办法是根据毛泽东同志倡导的正确处理人民内部矛盾的指导方针,发扬民主,建立民主制度。全球化,不只是经济全球化,也伴随民主政体全球化。“民主关”必须过,中国一定会在这一进程中走在前列。

    本文为2002年6月11日,杜润生谈话记录整理稿,选自《 杜润生文集》下册,山西出版集团2008年7月第1版第1283—1286页

  • 韩建业:论五帝时代

    “五帝时代”指古史传说中夏代以前的中国上古时代,其历史真实性在古代原不成问题。但自晚清民国以来,中西文化激烈碰撞下疑古之风盛行,五帝时代因之基本被否定,极端者甚至有“东周以上无史说”。虽然因晚商都邑殷墟、早商都邑郑州商城等考古学发现,此说宣告破产,但对商代以前的夏代乃至五帝时代,学术界的质疑声至今仍未断绝。五帝时代的真实情况究竟如何?只有紧密结合文献史学和现代考古学,并以适当的方法展开研究,才有希望逼近答案。

    一、文献记载中的五帝时代

    《周礼·春官·宗伯》:“外史掌书外令,掌四方之志,掌三皇五帝之书。”其中“三皇五帝”显然指人而非神,且“五帝”晚于“三皇”。《周礼》所载官制等基本符合西周或者春秋时期的实际情况,可知“三皇五帝”的提法也当出自西周或春秋,而非战国以后的发明。战国时期出现“五帝”的情况增多,《荀子》《战国策》中各3处,且多与三王、五伯并举,《吕氏春秋》中有14处之多,一般连称“三皇五帝”或“五帝三王”。和“三皇”有多种组合的情况不同,严格来说“五帝”说其实只有一种,就是出自《大戴礼记·五帝德》《帝系》当中的黄帝、颛顼、帝喾、尧、舜,在《国语》中也有同样的排列顺序,很可能是至迟在春秋时期已有的说法,后被《史记·五帝本纪》采用。其他一些曾被称为“五帝”者其实并非确指,或者属于神圣而非人王。即便真正的“五帝”就一种说法,那也应该是从众多古人中挑选的结果,同时期还存在很多其他杰出人物。在这个意义上,我们就可以使用“五帝时代”这个概念,指称以“五帝”为代表的那个时代。有关五帝时代的记述,目前只能在商周及以后的文献中见到,被认为部分可能是“口耳相传”的结果,五帝时代一般也就被划到“传说时代”的范畴,相当于西方学术界所谓“原史”时期。

    疑古学者多视“五帝”为神话人物,基本否定五帝时代的历史真实性。顾颉刚在1926年出版的《古史辨》第一册中明确提出“层累地造成古史说”,认为东周初年《诗经》里有天神禹,东周末年《论语》里出现尧、舜,战国至西汉伪造了许多尧、舜之前的古“皇帝”,结论是“东周以上只好说无史”,“自三皇以至夏商……都是伪书的结晶”。更早的时候,胡适也主张“中国东周以前的历史,是没有一个字可以信的”。但1928年开始的对殷墟的发掘,发现甲骨文、宫殿、王陵等大量证据,确凿无误地证实晚商属于信史。这不但推翻了“东周以上无史说”,而且证明“层累地造成古史说”逻辑难以自洽。又因晚商史业已被证为信史,早商、夏代甚至五帝时代的历史真实性也理应重新加以考虑。

    其实早在1917年王国维就发表《殷卜辞中所见先公先王考》,论定《史记·殷本纪》所记载的商殷世系几乎完全合于甲骨卜辞所见商人世系。王氏明确认为尧、舜、禹属于历史人物,不应疑古太过。之后蒙文通于1927年出版《古史甄微》,提出中国上古民族可以分为江汉、海岱、河洛三系。徐旭生在1943年出版的《中国古史的传说时代》一书中提出中国古代部族可以分为华夏、东夷、苗蛮三大集团。1935年傅斯年则提出“夷夏东西说”。这些研究虽与传统的中华一脉古史观有别,但却都是在承认五帝时代真实历史背景的基础上做出的综合研究。

    五帝时代的诸多人物并非出于战国西汉以后的杜撰,这在晚商、西周和春秋时期的出土文献中也有所证明。殷墟甲骨文中的“四方”“四方风”,见于《山海经》和《尚书·尧典》。殷墟甲骨文中商人将帝喾(高辛氏)作为高祖,这也和传世文献吻合。刻有“天鼋”或“天”族徽的先周和周代青铜器主要分布在陕西,或与轩辕黄帝的名号有关。西周遂公盨铭文记载禹敷土浚川,春秋秦公簋记载“鼏宅禹迹”,春秋晚期的秦公一号大墓石磬上秦人将高阳(颛顼)作为高祖。战国时期金文简牍上关于五帝时代的记载就更多了。比如齐侯因咨敦铭文记载田齐的高祖为“黄帝”,长沙子弹库楚帛书关于炎帝、祝融、帝俊、共工等的记载,清华简《五纪》关于黄帝、蚩尤等的记载,以及其他简牍上有关于尧、舜的记载。

    但需要承认的是,不管传世还是出土,目前尚不见晚商以前的相关文献。换句话说,所有关于五帝时代的记载都见于至少七八百年之后的文献中,它们的说服力因此大打折扣。但学人很早就提出新的解决途径:“要想解决古史,唯一的方法就是考古学。”即便顾颉刚也认为,地下出土的古物既可以用来破坏旧古史,也可以用来建设新古史。李学勤则从文献和考古结合的角度,提出要“走出疑古时代”。显而易见,探索古史真相不能仅依靠文献记载,还得和考古学结合。

    二、五帝时代考古学探索的方法

    利用考古学探索并一定程度上实证古史,最重要的是达成传说和考古资料这两个古史系统之间的互证互释。考古资料是传说史料最可靠的参照系,经过百余年的工作,这个参照系已经以中国史前(原史)考古学文化谱系为主要内容基本建立起来。假设五帝时代为真,那么当时不同族群集团的遗存及其时空框架也应包含在其中,只待与传说史料相印证。

    早在20世纪30年代,徐中舒就提出虞夏对应彩陶文化(仰韶文化),太昊少昊对应黑陶文化(龙山文化)。到了50年代,范文澜又推测仰韶文化可能为黄帝时代文化。七八十年代以来,关于五帝时代的考古学探索更多。既有对炎黄、三苗、东夷、有虞氏、陶唐氏、共工氏 等族群所对应的考古学文化的探索,有对“大禹治水”等个案的研究,也有从宏观上对五帝时代的把握,并主要形成两类意见。第一类意见认为,五帝时代大体可以与仰韶文化和龙山文化时期对应。如严文明、苏秉琦等认为仰韶文化后期(铜石并用时代前期)对应炎黄时期,龙山时代(铜石并用时代后期)对应尧舜禹时期,笔者等进一步提出仰韶文化前期已进入炎黄时期;许顺湛认为仰韶文化对应炎黄文化,仰韶文化末期到龙山时代早期为颛顼时代,中原龙山文化早期对应帝喾时代,中原龙山文化晚期对应尧舜时代。第二类意见认为,五帝时代和龙山时代大体对应。如童恩正认为中原龙山文化和“五帝”符合,沈长云、江林昌认为五帝时代大致对应龙山文化时期,李先登等具体提出五帝时代早期的黄帝、颛顼、帝喾时期相当于龙山时代早期,五帝时代晚期的尧舜禹时期相当于龙山时代晚期,徐义华认为龙山时代城址的大量出现可能与黄帝时代的战争背景相关。

    总体来看,上述关于五帝时代的宏观认识,时间上不出仰韶文化时期和龙山时代,空间上集中在黄河中下游,涉及长江中下游和西辽河流域。空间范围的框定基本就是根据文献传说,时间范围则是从夏商所对应的考古学文化前溯,大致符合“从已知推未知”的逻辑思路。殷墟和郑州商城遗址的发掘,确证殷墟文化和二里岗文化分别为晚商文化和早商文化,二里头遗址的发掘基本确定二里头文化为夏文化或晚期夏文化,则五帝时代只能在之前的龙山时代甚至更前,但到底“前”到何时则不好确定。有些学者在基本信任文献传说的前提下,以神农氏“教民稼穑”为依据,设想当时应为农业社会,认为应该从仰韶文化开始,但实际上中国农业在距今8000多年的前仰韶时期已有初步发展。不少学者以《史记·五帝本纪》所记轩辕黄帝征战四方、统一天下、置官监国为根据,设想其社会应该比较复杂高级,但到底高级到何种程度,是初步开始社会复杂化,还是即将进入或已经进入国家社会?这些其实都难以遽断。考古学上对农业起源发展和社会复杂化进程的认识本身就存在不同意见。还有就是这种“比附”式宏观观察方式,很依赖于文献记载细节的真实性——而这本身是需要验证的。也有不少人想当然地以为,既然关于五帝时代的记载比较模糊,那么与考古学的对应也自当比较宏观笼统才对,但问题是如果每一个细节和局部都得不到证实,又如何能保证整体和宏观的真实性?因此,对五帝时代的考古学探索,最终还需从细节和局部入手,而且必须遵循严格的论证逻辑,找到有效的研究方法。

    “由已知推未知”的思路建立在考古学文化一定程度上可以对应于族群、国族的前提之上。我们可以将族群分成三种情况:一是具有相同文化传统、文化习俗和语言的事实上的族群,一般和考古学文化有较好的对应关系;二是当时人所认同甚至包含一定程度建构成分在内的族群,最容易在民族志中找到案例;三是文献记载中的族群。这三种族群多数情况下其主体部分应该是重合的,是以第一种情况作为基础的。国族指国家层面的族群共同体,由一个族群扩展或多个族群融合而成,因国家力量整合形成血缘、文化、语言、历史等方面的共性。因为文化等共性的存在,国族也会和考古学文化有一定程度的对应关系,但情况更为复杂。族群和国族的复杂性,提醒我们考古学文化和族群不宜做简单对应,已进入早期国家阶段的五帝时代尤其如此。但从商周二代国家范围和考古学文化圈存在一定程度的对应关系来看, 考古学文化和国族的对证研究并非不可行,与一般族群的对证研究理应更有可能。

    尽管如此,古史传说中关于特定族群的记载往往存在模糊或歧异之处,加之很难对族群和国族进行区分,而考古学文化本身通常也并非毫无异议,这就使得考古学和古史的对证很容易导向诸多难以验证的推论,对五帝时代的考古学对证尤其如此。这也是很多人质疑古史和考古学能否对证研究的主要原因。但如果我们遵照严谨的逻辑,找到若干比较确定的关键点,再将这些关键点串联成面,而且和古基因、古语言谱系研究结合起来,就有可能增强古史对证的准确性和有效性。为此,笔者有针对性地提出两种研究方法,即变迁法和谱系法。

    “变迁法”就是以考古学上观察到的巨大变迁来一定程度上证实文献传说中的重要战争或迁徙事件的方法。考古学上的巨大变迁,包括考古学文化巨变和中心聚落巨变两个方面,前者指考古学文化面貌格局发生大范围的剧烈变化,后者指中心聚落、古城等突然毁弃或者出现破坏、暴力现象,两者通常互有关联。而这些在考古学上都是相对容易识别到的。巨变往往是大规模战争和迁徙事件的产物,推测也应当是古人最倾向于记载、传承下来的内容。因此,用考古学上的巨大变迁对古史加以验证,相对容易且确定性也较高。而用这种方法所获得的关键认识,又可以进一步作为其他相关研究的基点。

    “谱系法”则是将文化谱系、基因谱系、语言谱系和族属谱系相互结合的方法。族群既然和血缘、语言、文化都密切相关,那么如将它们都结合起来进行研究,推论的确定性一定会增加。如果再将四个谱系结合起来,就会形成更加确定的推论。目前中国新石器时代考古学文化谱系的基本框架和基本内容已经确立,只是需要不断完善。对古代人群基因和语言谱系的建立方兴未艾,目前已经在揭示东亚现代人基因组、中国南北方史前人群迁徙与融合过程,以及汉藏、南岛和阿尔泰语系等人群的基因和语言谱系等方面取得了初步成果。族属谱系则需要对涉及五帝时代的传世文献和出土文献进行整理分析,最终构建出上古时期族群谱系的基本框架,允许有几套可能性框架,最终以文化、基因和语言谱系来验证。当然,这里的关键是对“四谱”的互释,最佳的办法依然是结合重大历史变迁,由点及面逐渐展开。

    三、考古学视野下的五帝时代

    五帝时代有文献记载的重要战争事件,首先要数五帝时代之末的“禹征三苗”;与其大略同时的“稷放丹朱”事件,可能也有军事暴力发生;还有一个就是五帝时代之初轩辕黄帝和蚩尤之间爆发的“涿鹿之战”。考古资料显示,这些战争事件可能都真实发生过。

    (一)禹征三苗与黄河流域文化的南下

    “禹征三苗”事件在《墨子·非攻下》有详细记载:“昔者三苗大乱,天命殛之。日妖宵出,雨血三朝……五谷变化,民乃大振……禹亲把天之瑞令,以征有苗……禹既已克有三苗,焉磨为山川,别物上下,卿制大极,而神民不违,天下乃静。”古本《竹书纪年》对三苗灭亡前夕的天灾有类似记载:“三苗将亡,天雨血,夏有冰,地坼及泉,青龙生于庙,日夜出,昼日不出。”可见,“禹征三苗”应是趁后者发生天灾内乱之际发动的一场有计划的征服战争。

    从文献记载来看,禹或夏禹主要活动在黄河流域,但具体地点不好遽定。史载“禹兴于西羌”、“禹会诸侯于涂山”、“禹都阳城”或“平阳”。禹的兴起或诞生地被认为在中国西部,禹会诸侯的“涂山”有人认为在江淮地区,禹所都的阳城或平阳有晋南、豫西、豫东等不同说法。“大禹治水”“禹画九州”传说中禹的活动范围更广。禹是夏人首领,夏人主要的活动区域多被认为在晋南和豫中西地区,但也有其他观点。比较而言,三苗的居地更好确定。三苗属于徐旭生所说苗蛮集团,其活动地区虽然涉及黄河下游、长江中下游广大地区,但到和尧舜禹发生冲突的时候,基本就是在江汉两湖地区。《战国策·魏策》:“昔者三苗之居,左彭蠡之波,右洞庭之水,文山在其南,而衡山在其北。恃此险也,为政不善,而禹放逐之。”据考证,这个范围大抵东至鄱阳湖、西以洞庭湖为界、向北及于桐柏山。

    夏禹作为夏王朝的创建者,其主要活动年代当在距今4000年左右。距今约4100年之前,在豫西南、豫东南和江汉两湖地区分布着范围广大的石家河文化,但之后发生文化巨变:石家河文化特色鲜明的陶器群大范围快速消失,新出矮领瓮、细高柄豆、侧装足鼎等与王湾三期文化煤山类型接近的陶器,出现鬶、盉等龙山文化或造律台文化因素,致使豫东南、豫西南、鄂西、鄂北等地都突变为王湾三期文化,江汉平原及附近地区突变为和王湾三期文化接近的肖家屋脊文化;聚落遗址急剧减少,如大洪山南麓由石家河文化时期的63处遗址锐减到14处;从屈家岭文化延续至石家河文化的大约20个古城,此时基本都遭到毁弃,包括石家河文化的中心天门石家河古城;最保守的祭祀方式也发生突变,石家河文化大量用首尾相套的陶缸祭祀的现象消失,数以十万计的红陶小动物、小人、红陶杯等祭品祭器也基本消失或者数量剧减;在肖家屋脊文化当中出现前所未见的浅浮雕、透雕的小件玉器,此类玉器在更早的龙山前期晚段就出现在山东临朐西朱封、山西襄汾陶寺、河南禹州瓦店等遗址。如此大规模的黄河流域文化南下引起的文化和聚落巨变,只能是大规模战争的结果,和“禹征三苗”事件吻合。此前曾有人将“禹征三苗”解释为二里头文化向江汉地区的渗透,但此说在年代上似有抵牾之处,因为二里头文化已经是晚期夏文化了,和夏禹不能对应。

    (二)稷放丹朱与北方文化的南下

    古本《竹书纪年》:“后稷放帝朱于丹水。”后稷指周人的始祖弃,《诗经·大雅·生民之什》:“厥初生民,时维姜嫄,生民如何,克禋克祀,以弗无子,履帝武敏歆,攸介攸止,载震载夙,载生载育,时维后稷……即有邰家室。”《国语·鲁语上》:“周人禘喾而郊稷。”记载中他是帝喾的嫡长子,理应最有资格成为帝喾的继承人,但他勤于农事而被封为后稷,就是当时的农官,实际继承人是和他同代的尧,这或许为后来的矛盾埋下了伏笔。关于后稷的诞生地“有邰”,汉代以来流行泾渭说,近世有晋南说。尧子丹朱的居地被认为是在豫西南丹水,其实当为被流放后的结果,之前应与尧居于一地。尧的居地又有山东、河北、山西诸说,山西说本身又有“平阳”说和“晋阳”说的分歧,还有晋阳徙平阳说。虽然后稷和丹朱—尧的居地有多种说法,但他们发生交集的地方却只有晋南。文献记载尧时已在丹水流域征服苗蛮, 《吕氏春秋·召类》:“尧战于丹水之浦,以服南蛮。” 丹水附近的陶斝极似晋南者,晋南的丹砂也可能来自丹水地区,后稷逐放丹朱于丹水比较符合情理。

    按《尚书·尧典》所载,稷和禹所处时代大致相同,则“稷放丹朱”发生时间应也与“禹征三苗”接近,在距今4100年前后。从考古学上来看,当时晋南地区确实发生了一次文化和聚落巨变:大量双鋬陶鬲出现在原本有斝无鬲的临汾盆地,致使本地陶寺文化剧变为陶寺晚期文化;陶寺遗址甚至附近的临汾下靳、芮城清凉寺等地大中型墓葬,几乎都被挖毁;陶寺遗址还有宫殿废弃、暴力屠杀、摧残女性等现象。双鋬鬲是老虎山文化的典型陶器,其分布范围主要在今内蒙古中南部、陕北、晋中北和冀西北一带。在陕西神木石峁、内蒙古清水河后城嘴、山西兴县碧村遗址都发现了距今4000多年前的充满军事气氛的大型石城聚落,尤以400万平方米的石峁石城最为瞩目,显示其具有强大实力。考古学上的晋南巨变应当同老虎山文化南下密切相关,和“稷放丹朱”事件能够吻合。

    “稷放丹朱”的考古学实证,证明陶寺古城在该事件发生前至少有一段时间应当是陶唐氏尧的都邑,而老虎山文化人群中至少有一支参与了后稷对丹朱的战争放逐事件。据记载,后稷是轩辕黄帝的直系姬姓后裔,北狄也是,而石峁古城很可能为北狄故城,则以后稷名义发起的这起事变,有石峁人群参与也是有可能的。至于《竹书纪年》等有关舜囚尧和阻丹朱的记载,似乎和儒家历来所称道的尧舜禅让之说相去甚远,其实有相通之处,即尧、舜更迭必然是因某一重大变故而发生,这一变故很可能就是“稷放丹朱”事件,“稷放丹朱”或许还有舜的参与。

    (三)涿鹿之战与黄土高原文化的东进

    《逸周书·尝麦》记载:“蚩尤乃逐帝,争于涿鹿之河(或作阿),九隅无遗。赤帝大慑,乃说于黄帝,执蚩尤,杀之于中冀,以甲兵释怒。”似乎蚩尤和炎帝(此记载中误作赤帝)、蚩尤和黄帝之间的战争都发生在涿鹿,蚩尤曾一度侵凌炎帝,黄帝应炎帝所请而击杀蚩尤。但在《史记·五帝本纪》中,黄帝和蚩尤之间的才是涿鹿之战,另有炎黄之间的阪泉之战,没有提到蚩尤和炎帝之间战争的具体情况:“炎帝欲侵陵诸侯,诸侯咸归轩辕。轩辕乃修德振兵……以与炎帝战于阪泉之野。三战,然后得其志。蚩尤作乱,不用帝命。于是黄帝乃征师诸侯,与蚩尤战于涿鹿之野,遂禽杀蚩尤。而诸侯咸尊轩辕为天子,代神农氏,是为黄帝。” 《战国策》《庄子》等都有黄帝、蚩尤战于涿鹿的记载。至于炎黄间的“阪泉之战”,在《大戴礼记·五帝德》《左传》《列子》等中也都有记载。但先秦汉晋以来文献记载中两场战争就已有混淆,除上述《逸周书·尝麦》记载蚩尤逐炎帝也在涿鹿,《逸周书·史记解》、《水经注》也有类似记载,近世学者也多将二者混同,不过尚不足以否定《史记》的说法。

    上述文献所记涿鹿之战中的轩辕黄帝、炎帝和蚩尤,显然都是具体的个人,也有不少记载中的黄帝、炎帝和蚩尤只是部族首领的统称。当然无论是个人还是部族,都应有个大致的活动范围,只是炎、黄等的传说遍及大江南北,自汉代以来就众说纷纭。《国语·晋语》:“昔少典娶于有蟜氏,生黄帝、炎帝。黄帝以姬水成,炎帝以姜水成。成而异德,故黄帝为姬,炎帝为姜。”徐旭生据此并结合其他材料考证认为,黄帝部族发祥于偏北的陇东陕北地区,炎帝部族则发祥于偏南的渭河上游地区,二者都属于华夏集团。此后他们向东迁徙,在路线上同样是前者偏北而后者偏南。徐旭生还认为蚩尤属于东夷集团,是九黎的首领,九黎的活动范围从晋东南一直延伸到河北、河南、山东三省交界之处。但从《尚书》《国语》等相关记载看,蚩尤还是苗蛮集团的先祖,将之归入苗蛮集团也未尝不可,可见蚩尤部族活动范围很大。关于黄帝和蚩尤发生交集的“涿鹿”虽也有不同说法,但大致都在华北一带,尤其今冀西北涿鹿一带为涿鹿古战场的观点被更多人认可。黄帝部族从陕北东向经内蒙古中南部到达冀西北也是顺理成章的事。至于炎帝部族,按照徐旭生的说法,是偏南沿着渭河流域东向发展,应该是抵达晋、陕、豫交界地带才更合情理,与冀西北相距较远,炎黄之间的阪泉之战也就更有可能发生在晋南附近。

    轩辕黄帝早于后稷、夏禹的时代。从大约距今4100年往前追溯,直到距今4700多年,就能看到在陇东陕北至华北这一大片地方,曾经发生过一次考古学文化格局的巨变。黄土高原大部分地区在仰韶晚期向庙底沟二期转变的过程中,文化仍连续发展,而内蒙古中南部、河北大部和豫中地区则不然:内蒙古中南部老虎山文化代替仰韶文化海生不浪类型,冀西北地区老虎山文化替代雪山一期文化,冀南豫北和郑洛等地的仰韶文化大司空类型、秦王寨类型衰亡,西辽河流域的红山文化消亡,海岱地区的大汶口文化当中新增不少横篮纹。这种突变当和黄土高原文化的东进有关。与此同时,在陕北、内蒙古中南部地区突然涌现出许多军事性质突出的石城。这些变化可能是由黄土高原人群在大规模战争事件中的胜利而导致,很可能对应文献记载中的涿鹿之战。尤其是在冀西北张家口贾家营遗址明确存在老虎山文化前期遗存,文化面貌和陕北、内蒙古中南部同期遗存近似,上限有可能早到庙底沟二期。崇礼邓槽沟梁甚至还发现老虎山文化的城址。冀西北被认为有可能是古涿鹿之地,张家口的这些发现为涿鹿之战的实证增加了新的线索。

    特别值得一提的是,冀西北等地在庙底沟二期之前是雪山一期文化,其与海岱地区的大汶口文化有着密切关系。海岱地区是蚩尤或东夷部族的大本营,大汶口文化很可能是以蚩尤等为首的东夷部族的文化。大汶口文化和江汉两湖地区的屈家岭文化的形成有很多共性,屈家岭文化被认为是三苗或苗蛮的文化,而记载中蚩尤又是苗民的领袖,可见东夷和苗蛮关系非常密切。距今5000年左右的仰韶文化晚期,中期大汶口文化和早期屈家岭文化分别强烈向西向北影响,很多文化因素渗透到郑洛、晋南、关中东部各地,这或可视为蚩尤所代表的东夷和苗蛮集团大力扩张并侵凌黄河中游各部族的考古学证据。这种情况从庙底沟二期开始发生重要转变。距今4700多年恰好是中国考古学上一个重要时代——庙底沟二期的开启年代,不少人认为庙底沟二期已属于广义龙山时代的早期;传承下来的黄帝纪元元年为公元前2698年,也正在这个年代范围之内。

    (四)五帝时代的基本时空格局

    从考古学上大致实证禹征三苗、稷放丹朱、涿鹿之战事件,建立了进一步探索五帝时代的三个基点,其基本时空格局也可由此初步推定。

    禹征三苗事件的实证,进一步确定了夏禹的历史真实性和夏代的上限,证明以王湾三期文化后期为代表的中原龙山文化后期属于早期夏文化,石家河文化及其前身屈家岭文化等属于三苗文化。禹征三苗之后,黄河、长江流域文化融为一体,奠定了夏王朝版图的基础,因此,《尚书·禹贡》的“九州”很可能记载的是距今4000年左右的真实状况,基本等同于夏初疆域,而非出于战国时人的想象。

    稷放丹朱事件的考古学探索,说明尧、丹朱、后稷可能确为真实历史人物,由此可推知《尚书·尧典》等文献记载的舜等其他人物也应当基本属实,证明晋南的陶寺文化至少有一段时间和陶唐氏尧有关。

    涿鹿之战事件的考古学探索,说明轩辕黄帝、蚩尤、末代炎帝,以及文献所载同时期人物,都可能有一定的历史真实性,推测黄土高原的仰韶文化后期至龙山文化早期可能属于黄帝部族文化,以东华北平原直至黄河下游地区的仰韶文化后期、雪山一期文化、大汶口文化等,可能与蚩尤部族有关。这两大区域之间的晋南、豫西和关中东部等地区,可能就是炎帝部族的核心分布区。

    由此可见五帝时代人物的活动范围主要是黄河和长江流域,尤以黄河流域为主,时间上则从4700多年前延续至约4100年前。又可归纳为早、中、晚三期,其中轩辕黄帝、蚩尤和末代炎帝等最早,距今4700多年;帝喾、尧、舜、稷、丹朱、禹等属于晚期,距今4100年左右;颛顼在中期,年代介于二者之间。《大戴礼记·五帝德》《史记·五帝本纪》记载颛顼、帝喾分别为黄帝的孙和曾孙,之后紧接着就是尧、舜,似乎五帝时代不过五六代人,充其量也就100多年,现在看来应当存疑。如果承认颛顼为黄帝之孙,帝喾为后稷之父,则颛顼和帝喾之间就可能间隔了20多代、500多年。

    早于距今4700多年的前五帝时代的文化,在考古学上也是有线索可循的。既然距今4700多年的黄土高原地区的仰韶文化晚期有可能为黄帝部族文化,那么黄土高原或者渭河流域更早的仰韶文化理应与更早的黄帝部族有关。仰韶文化初期开始于距今7000年左右,当时分布在关中和汉中地区的零口类型诞生不久,即东向扩展至晋南豫西地区,形成与零口类型大同小异的仰韶文化枣园类型。联系《国语·晋语》黄炎同源而分道的记载,零口类型有可能是最早的黄炎共同的文化,此后的零口类型中晚期和半坡类型则可能是黄帝部族文化;而晋南豫西的枣园类型,以及后续的东庄类型、庙底沟类型,则主要为东迁后的炎帝部族文化。黄炎之外其他部族的文化也可以循此逻辑向前追溯。

    以上对五帝时代时空框架的建构主要是根据几个关键点做出的,如果能在此基础上将文化、基因、语言和族属谱系结合起来进行全面深入的研究,相信会得到更加令人信服的结论。

    四、五帝时代与中华文明的初步发展

    从现在的考古学研究来看,中华文明起源于距今8000多年,形成于距今5100年左右。因此五帝时代并非中华文明的起源和形成时期,而是已经进入初步发展时期。

    距今5100年左右中华文明形成的最重要的标志,就是良渚和南佐两个超大型聚落遗址的发现。浙江余杭良渚遗址内城面积近300万平方米,计入外城则达630万平方米,内城中部有30万平方米的人工堆筑的“台城”和宫殿建筑,有随葬600多件玉器的豪华大墓,出土了大量玉器、水稻等,外围更有高低坝、沟壕等构成的大规模水利系统。甘肃庆阳南佐遗址面积600万平方米左右,遗址核心区由两重环壕和九座大型夯土台围成,面积达30多万平方米;其中央偏北处围出数千平方米的“宫城”,主殿夯筑而成,占地700多平方米,出土了大量精美白陶、黑陶和水稻。这两个规模超大的中心聚落,宫殿建筑、壕沟水利等工程浩大,玉器、白陶、黑陶等的制作都有很高的专业化水准,说明已出现强大的公共权力或王权。两个聚落都在继承原有聚落(社会)的基础上实现了跃进式发展,超常的规模依赖于对较大范围内人力物力的统一调配,这无疑指向地缘关系对早先区域性氏族社会格局的重塑。笔者认为,王权和地缘关系的同时出现,显示两地业已迈入早期国家行列,中华文明正式形成。但两处早期国家的统治范围基本不出太湖周边或黄土高原地区,称之为“古国”或“邦国”比较合适,属于“古国文明”阶段。

    距今4700多年是中华文明初步发展的关键节点。黄土高原文化的东向强烈拓展,很可能已将内蒙古中南部、河北大部和河南中部等地区纳入一个更大的国家组织之内,甚至黄河下游的大汶口文化区可能也属于这个早期国家的统治范围。而按照《史记·五帝本纪》的记载,通过涿鹿之战和阪泉之战,轩辕黄帝已经统一天下,置官设监,监于万国。不但统治黄河流域,还“南至于江”。考古发现和文献记载大致可以吻合。距今约4500年以后,面积达三四百万平方米的襄汾陶寺都邑和神木石峁石城先后在晋南和陕北地区出现,黄土高原的文化中心地位得以延续。

    距今约4100年是中华文明早期发展的关键节点。此时至少长江中游地区已经通过“禹伐三苗”事件被纳入华夏集团版图。《尚书·禹贡》等记载的夏禹划分“九州”,很可能即真实发生在这一背景之下。据此可以说,至迟在夏朝初年夏王已经初步建立起“大一统”的天下王权。其统治特色是由夏后氏及许多其他族氏共同构成统治集团,从而建立起“血缘组织基础之上的政治组织”,而所谓“九州”即统治天下“万国”的结果。这些标志着“王国文明”阶段的到来。

    结语

    通过对文献传说和考古学的对证研究,我们现在可以说,文献传说中的五帝时代应该是真实存在过的,其年代大抵从约4700年前延续至约4100年前。前后可划分为三个时期,大体自轩辕黄帝、蚩尤和末代炎帝等起,继以颛顼和其后诸帝,最后为帝喾、尧、舜、稷、丹朱、禹等。五帝时代,中华文明已经过起源和形成的时期,进入初步发展阶段。经过长期兼并融合,跨区域的王权国家在此时萌芽,早期时已至少形成对黄河流域大部的统治,晚期时更以“禹征三苗”为契机,将长江流域也纳入国家版图,夏王朝初步“一统”的格局正是在此基础上建立的。

    五帝时代是古代中国人心目中信史的头一篇章。以五帝为代表的上古祖宗先圣,其后更成为历代敬仰效法的对象,奠定了中华民族数千年来追求文化“一体”、政治“一统”的基础,也成为延续中华文明的重要原因之一。可以说,百年来对五帝时代的质疑和否定,一定程度上就是对中华历史根脉的质疑和否定。虽然考古学为复原、重建中华上古史带来了新的途径和方法,但考古学的局限性又决定了它并不能独立解决上古时代的精神创造、制度创造、族群认同、历史记忆等重大问题,而精神创造和制度创造才是中华文明之所以区别于其他文明、之所以伟大长存的核心所在,族群认同和历史记忆更是中华民族凝聚发展的关键。因此在缺乏深入论证的情况下,不应轻易否定五帝时代,更不该轻率地把结合古史传说的研究看作考古学发展的障碍和误区。

    当然,从考古学出发探索五帝时代古史并不容易,它要求研究者必须熟谙相关文献记载和考古学知识系统,必须掌握严谨可靠的研究方法,而不是盲目比附。它更要求研究者必须认真辨析后世文献对五帝时代真假杂糅的记载;根据新的发现不断完善仍比较粗糙的考古学文化谱系;大力加强基因和语言谱系的建设工作;以及不断完善和创新古史与考古学对证的理论方法。唯有如此,我们才有机会逐渐接近五帝时代的真相。

    本文转自《中国社会科学》2024年第12期

  • 周天勇:1978年中国为什么选择改革开放?

    一个社会的变革,总是来自于生存面临的危机,需要通过改革和开放,走出发展的困境。我们应当实事求是地重新回顾1978年文化大革命结束时,我们在经济、技术、建设等方面的发展水平和境地,评价建国后三十年经济建设方面的功与过,才有可能在30年后的今天理解当时必须改革开放的真正原因。

    1949年建国以后,从经济体制上看,对资源、产品和劳动力,甚至许多消费资料,我们采取了计划分配的方式,生产资料所有制方面实行了国有和集体所有制;农村,在公社、生产大队、生产小队之间,调动资源和分配利益的层次多次上下调整,自留地的去留也多次变动。从对外经济关系、科学技术等方面看,我们采取了关门发展的方式。从经济学的角度看,在财产甚至消费资料的制度上,我们实行或者力图实行高度公有的体制;资源配置方式上,我们试图以国家大一统的方式分配生产资料和消费资料;对外经济战略上,我们走了一条进口替代和自我封闭循环的道路。这样的体制和道路使我们建国后到改革开放初的经济社会发展成功了吗?回答是否定的。

    评价一国经济社会发展如何,应当以一些国际上已经研究成熟,并且为统计和经济学界通用的一系列指标,综合地进行衡量。首先,建国后到改革开放初,由于左的思潮干扰经济建设,使我们的经济总量和人均水平在世界各国的位次上不断后移,而且与许多国家发展的差距也越来越大。不论现在学术界怎样批判发展的唯GDP论,但是,GDP总量和人均GDP水平是衡量一个国家发展的最核心的指标,它代表着一国发展的生产力水平,而且是一个国家一切社会、政治、文化、国防等等事业的物质和财富基础,没有GDP持续和有效的增长,其他方面的发展便无从谈起。从经济总量和人均GDP水平看,1952年,中国GDP总量占世界GDP的比例为5.2%,1978年下降为5.0%。人均GDP水平按当时官方高估的汇率计算,也只有224.9美元。1948年,中国人均GDP排世界各国第40位,到了1978年中国人均GDP排倒数第2位,仅是印度人均GDP的2/3。从人民生活水平看,1976年全国农村每个社员从集体分得的收入只有63.3元,农村人均口粮比1957年减少4斤;1977年全国有1.4亿人平均口粮在300斤以下,处于半饥饿状态;1978年全国居民的粮食和食油消费量比1949年分别低18斤和0.2斤;当年全国有139万个生产队(占总数的29.5%),人均收入在50元以下。

    1978年全国有2.5亿绝对贫困人口。当年,失业的城镇青年2000万人,实际城镇失业率高达19%左右,居民食品消费占其总支出的比重,即恩格尔系数,城乡分别高达56.66%和67.71%。1980年时,城乡居民家庭的耐用消费品,主要是缝纫机、自行车、手表、收音机,每百户的拥有率也只有5.5%、11.2%、15.7%、14.9%;黑白电视机的每百户拥有率也仅为1.6%;家庭电话非常少,即使按当时的公用电话计算,每百户普及率只有0.64部;而洗衣机还很少见,家庭轿车普及率几乎为零。居住方面,1978年时,城镇居民人均居住面积仅为3.6平方米,农村居民每户平均居住面积仅为8.1平方米。据世界权威的经济增长学家麦迪森研究计算,1952年到1978年中国GDP的实际平均增长率只有4.7%。整个国家和人民的发展和生活水平,大多数发展和生活指标排在世界国家和地区170位以外,处于联合国有关部门和世界银行等组织划定的贫困线之下。

    其次,发展经济学的理论认为,一个国家的发展,其现代化,核心是从农业社会到城市社会的结构转型。解放以后到改革开放初,中国人口城乡结构转型先是大起大落,后是几乎停滞。中国城乡人口的比例:1949年为10.6﹕89.4;1958—1960年大跃进,人口向城市转移过多过快,1960年时城乡人口比例为19.7﹕80.3;三年经济困难,1962年时,人口又从城市向农村逆转移,比例大幅度下降到了17.3:82.7,到文化大革命结束时的1978年,城乡人口比例为17.9﹕82.1。1952-1978年,中国工业生产增长了16.5倍,城镇人口比重仅上升了5.5个百分点,产业结构与城乡结构之间严重扭曲。1980年时,世界城市化水平为42.2%,发达国家为70.2%,发展中国家为29.2%,而中国城市化水平仅为19.4%,比发展中国家平均水平还要低近10个百分点。1950年时,韩国城市化水平为27%,1980年时,上升到48%,中国在城市化方面比韩国的差距拉大了20个百分点。从全国的人口城乡结构看,改革开放初时,82%的人口为农民,发展水平基本上还处于传统农业社会的状态。

    GDP和劳动力就业的产业结构,也是一国现代化进程的重要标志。从产业结构看,建国三十年中,农业产值占GDP的比重下降缓慢,农业剩余劳动力的产业转移更加缓慢。1950年中国GDP的三次产业结构为29﹕29﹕42,1980年时为21.6﹕57.8﹕20.6。纵向相比,农业份额下降速度较慢,第三产业比例大幅度萎缩。横向相比,1980年时,发展中国家的GDP结构平均为24﹕34﹕42,中国的工业化超前,第三产业的发展严重滞后。而从劳动力三次产业就业结构看,1950年为86﹕6﹕8,1962年为82﹕8﹕10,1980年为68﹕19﹕12;同期,韩国的劳动力就业结构从1960年的66﹕9﹕25,转型到1980年的34﹕29﹕37;发展中国家的劳动力就业结构从1960年的71﹕11﹕18转型到1980年的56﹕16﹕28。从GDP和劳动力在农业和服务业上的分布看,我国除了工业化超前外,1980年的水平低于世界发展中国家平均水平,仍然是一个落后和传统的农业国家。

    再次,建国后的30年,除了军事工业技术某些方面有一些进展外,其他各方面的自主的科学技术进步步伐缓慢,与世界发达国家,包括一些新兴的发展中国家科学技术水平的差距越来越大,落后于发达国家40年左右,落后于韩国、巴西等发展中国家20年左右。

    导致我国建国以来科学技术进步缓慢的主要原因是:1、正规的知识教育受到冲击。特别是文化大革命十年中,中等、高等教育搞革命,中高等教育的考试被废除,一般的知识课程设置被打乱,中高等基础和专业知识被大量删减和简单化,耽误了一代人的知识教育和培养,科学技术人才匮乏。2、科技人员没有应有的社会地位,并受到歧视。知识分子被排为臭老九,有专业知识的人往往被指责走白专道路;许多留洋回国的知识分子,在50年代被打成右派,在文化大革命中受到压制;特别是1966年后大规模动员城镇知识青年上山下乡,城市中的知识分子走五七道路,接受贫下中农再教育,荒废了一代人的学业,耽误了一代人的事业。3、当时的环境中很难学习国外较为先进的科学技术知识。学习国外前沿的科学知识,包括学习国外先进的科学技术,很容易被认为是搞资本主义和修正主义;而因为要通过外语才能看到国外科学技术方面的文献,在当时的环境中很容易被当成里通外国,被认为是敌特分子。实事求是地讲,建国后的30年,特别是文化大革命十年,科学技术进步的政治和社会环境是不堪回首的。

    因此,建国后三十年的科学技术进步,有这样一些特点:1、国防先行,民用落后。上世纪60年代以来,我国在原子弹、氢弹和发射卫星等方面取得了进展,这对于奠定我们当时的国际地位,起了重要的作用。但是,在民用制造业、农业等领域,新技术新工艺的进展很慢,特别是东北一些老工业基地,有些工厂使用的还是日伪时留下的技术十分落后的机器设备。2、研究立项可能不少,能产业化应用的不多。在计划经济体制下,由于对科技人员发明创造没有激励政策,院所和大学的科学研究与生产实际相脱节,一些科学技术发明创造不能应用于实际,不能大规模产业化,不能变成现实的生产力。3、虽然对外交流方面比较封闭,但还是进行了三次技术设备的引进,对我国工业体系的技术进步起了重要的作用。第一次技术设备引进是1952-1959年。我们从愿意为新中国提供帮助的原苏联和其他社会主义国家引进技术设备,集中在冶金、动力、石油化工、矿山、机械、电子、汽车、拖拉机、飞机和军工等重工业部门。

    第二次技术引进是1963—1966年。这次引进是在我国与原苏联关系非常紧张,国家经济还很困难的情况下进行的,我国开始从资本主义国家引进,主要引进补缺门的关键性生产技术,引进规模小,但影响大,引进重点开始由重工业转向解决“吃、穿、用”的工业项目上,而且引进了一些中小型项目用于企业的技术改造。第三次技术设备引进是1973—1977年,这次引进发生在文化大革命的后期,其背景是建国二十多年来,国民经济中的许多问题暴露出来,有从国外引进有关先进技术设备的必要性和迫切性,引进国仍然是资本主义国家。第三次技术设备引进的特点是:解决人民吃饭穿衣问题的项目占首位;引进规模是前几次中最大的;所引进的技术装置,具有大机组、大系统、高速、高效、自动控制、热能综合利用程度高等特点。在20世纪国外新一轮的电子信息、航空航天、化学合成、核能利用、激光、新材料、生物工程等科学技术进步中,1978年时,除少数项目外,中国在各个方面都处于空白。虽然建国后,我们也有一些重大的科学技术进步成果,但是与世界科学技术在战后的突飞猛进相比,我国科学技术水平仍然处于非常落后的状态。

    20世纪50年代到70年代,各发达国家科学技术进步对经济增长的贡献率,分别从20世纪初的10%提升到了50—70%。而根据专家们的计算,我国科学技术进步对经济增长的贡献率,1952—1957年为27.78%,1957—1965年只为8.24%,1965—1976年间更是仅为4.12%。因此,与世界科学技术进展相比,建国后到文化大革命结束,我国科学技术进步非常缓慢,对国民经济增长和社会发展的推动作用十分有限。

    第四,交通和工业体系的建设和规模,反映一国的综合实力。20世纪70年代末,虽然我国工业体系中的重工业有一定的发展,但是,轻工业、交通、城市等等的建设与世界上发展较快的发展中国家相比,还十分落后;即使重工业,在技术工艺方面,差距依然较大。交通通信体系落后于印度。1980年时,建成通车铁路里程55321公里,平均时速只有40公里左右;公路通车里程88.8万公里,其中硬化路面公路里程为66.1万公里,没有一条高速公路;人均铁路和公路里程为0.5公尺和8公尺,铁路、公路、水运和管道等运输线路密度为1229公里/万平方公里。1980年印度铁路里程为6.13万公里,公路163万公里,人均铁路和人均公路里程0.9公尺和23公尺,分别是中国的近1倍和4倍,铁路、公路、水运和管道等运输线路密度为5715公里/万平方公里,是中国的4.65倍。

    通讯方面,1980年中国每百人拥有的固定电话只有0.19部,印度则为0.43部,是中国的2倍多。

    工业体系方面,建国后纵向比较,有长足的发展。整体上看,到1980年,全国工业总产值4703亿元,比1949年增长46.3倍,工业收入在国民收入的比重由1949年的12.6%上升到1980年的45.8%;从1949年到1980年,主要工业品产量在世界的排位,钢由第26位上升到第5位,煤炭从第9位上升到第3位,发电量由第25位上升到第6位;化纤和电视机,1949年我国根本没有产量,1980年这两项在世界上的位次是第5位。但是,由于人口众多,人均工业品产量与世界各国相比水平还是很低。如1980年时,与世界一些发展中国家相比,巴西人均钢铁产量121公斤,人均发电量1880度,印度人均煤炭产量为168公斤,墨西哥人均原油产量1369公斤;而中国人均钢铁产量为36.7公斤,发电量297度,煤炭66公斤,原油105公斤,仍然低于这些发展中国家的发展水平。

    20世纪50年代,通过第一次技术设备引进,我国的机械工业在短期内,就建设起了一批重型机械、矿山机械、发电设备、化工机械、炼油、采油设备,机床、汽车、拖拉机、飞机、坦克、船舶以及轴承、风动工具、电器、电缆、绝缘材料等制造工厂;60年代,在第一次引进的基础上,填平补齐,引进了一批新的技术设备,使我国的制造水平进一步提高,制造出发展原子弹、导弹和新型飞机所需要的新材料、新仪器和新设备,经过70年代的引进建设,我国基本上建立了一个比较独立、完整的工业体系和国民经济体系。如经过几次引进,我国建立起了石油化工、无线电、汽车、拖拉机、飞机、军工、化纤、电子计算机和彩色电视机等新兴工业部门。但是,从技术层次、装备状况、产业结构、生产规模,以及所处时段看,当时我国工业发展的整体水平,与世界各发达和新兴工业化国家的进程比较,实事求是地讲,总体上也只是处在工业化的初级阶段。

    建国后,如果党的中心工作集中在经济建设上,如果没有频繁的政治运动对科学技术的冲击,如果体制适应生产力的发展,如果国民经济像东亚一些新兴发展中国家和地区那样,像改革开放后每年以9.5%的速度增长,到1978年时,按1950年不变价格,我国经济总量将会达到7367亿元人民币,比当年实际的3645亿元要多出3722亿元人民币,人均GDP将达到450美元左右,在世界各国中,中国的发展程度就会排在下中等收入国家的行列中。如果在1978年7367亿人民币的规模上,即使改革开放以来每年以7.5%的速度再增长29年,2007年我国GDP总量,就会为401267亿元,人均GDP为30369元人民币,高于实际的人均18845元人民币。东亚发展中国家的货币币值,在战后高速增长的几十年中,由于经济对外依存度上升、商品价格差别缩小,以及生产力水平提高,即使扣除亚洲金融风暴时各国的货币贬值因素,相对美元也普遍升值了100%到200%不等。我们取中值按照150%的升值率衡量,如果没有建国后左的思潮对经济发展的干扰,2007年我们的人均GDP将达到11000美元,在2000年时,已经完成第一次现代化进程,现在已经进入了世界新兴工业化国家的行列。计算到这里,我们不能不为建国后三十年中,工作中心选择方面的重大失误,感到深深的痛心和惋惜。

    总之,建国后到1978年的30年中,中国共产党人有着将中国建设成为世界现代化强国的强烈愿望,并为此进行了艰苦的努力和探索。但是,由于革命胜利后,党没有从一个工作中心为阶级斗争的革命党转变为一个工作中心为搞经济建设的执政党,对怎样搞社会主义经济建设并不熟悉,在榜样上学习了苏联模式,而且在资源配置方式上实行了计划经济,生产资料所有制上采取了一大二公的国有制、城镇集体所有制和农村人民公社社队体制,对外关系上走了自我封闭的道路,发展上倾斜于国防工业和重工业。其结果是:劳动生产效率较低,科技人员和企业没有创新和技术进步的动力来源,技术进步缓慢,投资建设浪费较大,三次产业结构和二次产业内部结构失调,二元结构转型进程停滞,与整个世界各国经济社会发展的差距越来越大。可以这样评价:建国后的三十年里,在全球经济社会发展的竞争中,我们走了弯路,延误了时机,可以说,成绩为三,问题为七。

    回首当年,如果没有三十年以来的发展道路的调整,没有对三十年来一大二公和计划经济的低效率体制的改革,如果不对外开放学习国外先进的技术和管理知识及经验,我们今天的经济和社会发展水平,毫无疑问,仍然会处在世界最贫穷国家的行列。1978年时,要不要改革开放,关系到占世界1/5人口的中华民族走向繁荣富强,还是贫困没落之大事。这就是中国共产党人和中国人民,为什么在三十年前毅然决然地选择改革开放这一决定中国命运的伟大事业,将其坚持了三十年之久,并且还要继续坚持下去的主要原因。

    本文原发《学习时报》2008 年 09 月 01 日

  • 何芊:游戏还是工具——生成式人工智能与历史模拟

    “历史模拟”并不是一个新奇的概念。在教学中鼓励学生依照历史记录,重演历史角色或主要行为体的决策与行动,培养共情与同理心,体会历史中的能动性与复杂性,已是较为常见的模拟设计。不少以历史为素材的游戏同样作为历史模拟被引入课堂。历史游戏学者亚当·查普曼区分了两类历史游戏的模拟方式。其一是以《刺客信条》和《荒野大镖客》为代表的现实主义模拟。它们以精良的视觉效果还原了历史事件的节选片段与历史场景的局部空间,通过细节的仿真与过往的重现为玩家营造身临其境的参与式体验。其二是以《文明》系列为代表的概念化模拟。这种策略类游戏通过将历史对象、概念、进程以及历史观念写入游戏规则来模拟历史,比如《文明》系列的设计逻辑就出自保罗·肯尼迪的《大国的兴衰》。这种模拟允许玩家在规则之内自由发挥,组合出架空的历史,演绎开放式的走向。

    无论是让学生扮演历史中的行为体,还是在游戏中“亲历”虚拟的历史场景,抑或是通过玩法与规则理解历史阐释的逻辑,教学中的模拟设计都无可避免地存在着简化和泛化历史的倾向。虽然游戏化的历史与历史本身之间的关系存有较大争议,但这并未妨碍游戏化的历史模拟进入到课程教学之中。游戏与模拟的边界模糊,或者说是历史模拟的游戏化,默认了事实与假设、历史与仿历史之间不可逾越的鸿沟,这恰恰是历史课堂中接纳模拟的前提。

    将模拟视为研究工具的历史学家更多集中在计量史学及其他交叉领域,这些研究方向往往拥有丰厚的理论与数据资源。20世纪60年代,伴随着计量史学的诞生,模拟方法进入到史学研究当中。第一代计量史家罗伯特·福格尔和约翰·迈耶等人奠定了反事实推演的基础方法。这一时期模拟与历史的结合还有两种形式:一是利用文献记录为模型设计变量、提供参数设定的佐证。二是通过模拟结果与真实历史的比对来验证模型。从20世纪90年代开始,新一代计量史家进一步将反事实推演与蒙特卡罗模拟相结合,通过模拟实验,发现关键的因果关系,检验既有研究结论。历史模拟在计量史学中自证了其工具价值。历史事件没有简单重复,史学研究只能从已知过去的观察中抽丝剥茧、考镜源流,研究成果往往自成一说,高下难辨。如果真能对历史学的研究对象,比如经济发展的变化趋势、重大事件的爆发过程以及复杂系统的演化发展进行多次模拟观测,应当能帮助我们更客观地理解前人结论,更精准地揭示人类历史中复杂交错的因果关系。
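    上文所说的“反事实推演+蒙特卡罗模拟”思路,可以用一段极简的 Python 代码加以示意(其中的基数、增长率、波动幅度等参数均为虚构的演示性假设,与任何实际计量史学研究无关):对同一指标在“事实”与“反事实”两种设定下各生成多条随机路径,再比较两组结果的平均水平。

```python
import random

def simulate_growth(base, mean_rate, volatility, years, rng):
    """模拟一条随机增长路径,返回期末水平(各参数均为演示性假设)。"""
    level = base
    for _ in range(years):
        level *= 1 + rng.gauss(mean_rate, volatility)
    return level

def monte_carlo(mean_rate, runs=2000, seed=0):
    """对给定年均增长率重复模拟多条路径,取期末水平的均值。"""
    rng = random.Random(seed)
    results = [simulate_growth(100.0, mean_rate, 0.02, 30, rng) for _ in range(runs)]
    return sum(results) / len(results)

# 反事实比较:假设某一历史条件使年均增长率相差 1 个百分点(纯属演示)
factual = monte_carlo(0.03)         # 事实情形
counterfactual = monte_carlo(0.02)  # 反事实情形
gap = factual - counterfactual      # 两种设定下期末水平的平均差距
```

真实研究中,参数设定需以文献记录为佐证,模拟结果也需与真实历史比对验证,这正是上文所述模拟与历史两种结合形式的用意。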

    即便集成了大量历史信息,结合了既有理论与统计学方法,传统模拟依然只能构造对现实世界的简化近似。传统模拟依赖于计算机随机过程的重复实现,以此生成特定条件下针对同一对象的多种可能结果。传统模拟的特点表现为系统内的信息交互以抽象数字为表征,模型的诸多参数由研究者结合前人成果自行决定。简言之,以数理逻辑为运行基础的模拟系统仍比较简单。而牵引历史变化发展的,不仅有数据指标所揭示的机械规律,还有弥散分布的大量非理性因素。历史情境内人的情感、好恶、偏见、道德、迷信,以及这些因素以语言为载体在群体与个体之间反复的交糅共振,都在左右着人的行动与选择。非理性因素错综晦暗,难以融入相对简化的数学模型。

    生成式人工智能为传统模拟的不足带来了新的改进工具。首先,大模型具有繁复的计算结构,庞大的参数规模与海量的训练语料,足以支撑更复杂的仿真模拟设计。其次,大模型的行为选择由预训练和微调所决定,相较于原本由研究人员对参数赋值并结合随机过程而产生的模拟结果,更贴合现实。再次,大模型的模拟系统内部,信息交流可以用自然语言代替数字表征,与人类社会的语言交互模式更为接近。此外,大模型还通过对齐技术进一步向人类价值取向靠近。大模型在完成预训练之后,通过基于人类反馈的强化学习,实现与人类偏好、道德准则和价值观念的对齐。如果说传统模拟尚且是简化后的仿真,那么当下大模型对人类的模仿已几近“乱真”。比如由大模型合成的模拟受访者复现了人类被试在行为经济学和社会心理学等领域的部分经典实验结果。大模型的类人化智能在交互环境中也得到了印证。以外交谈判为核心的策略类语言桌游《外交》,讲求多人博弈之中的意图识别、谎言洞察、信任获取以及协商合作等综合能力,经过特别训练的大模型已能在网络对战中达到优秀的人类玩家水平。

    不仅如此,大模型还可以驱动多智能体的仿真模拟系统(Multi-Agent System, MAS),这也是近来历史模拟所采用的方法。智能体仿真模拟原本是社会学家用来探索个体与系统、微观与宏观之间互动关联的路径:通过创建多个自主智能代理,在计算机的模拟环境中观察智能体之间、智能体与环境之间基于给定规则的相互作用,从而解释微观个体行动如何导致复杂系统演变的“涌现”现象。大模型的能力跃升,对人类智能的趋近,同人类价值观念的对齐,都进一步提升了智能体模拟对人类社会的仿真度。在此基础上,原本因化约而备受批评的历史模拟也展现出新的可能性。新一代的历史模拟将重大事件的主要参与方构建为多个智能体,利用真实的历史情境设定智能体的参数,制定智能体之间的行动规则,并通过大模型的运行环境来模拟多智能体之间的交互过程,从而分析历史事件爆发的因果机制。
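    作为对上述多智能体模拟框架的一个极简示意,下面的 Python 骨架用简单的概率规则代替大模型决策(其中的国家名称、行动选项、“敌意度”参数与相互影响规则均为虚构的演示性假设,并非上述任何研究的实际实现):

```python
import random

ACTIONS = ["结盟", "备战", "宣战", "和谈"]  # 行动空间为演示性设计

class NationAgent:
    """一个极简的国家智能体:按“敌意度”在冲突性与和平性行动间选择。"""

    def __init__(self, name, hostility):
        self.name = name
        self.hostility = hostility  # 取值 0~1,越高越倾向冲突性行动

    def decide(self, rng):
        # 真实研究中,此处由大模型依据自然语言描述的历史情境生成决策;
        # 这里用简单概率规则代替,仅为演示交互框架
        if rng.random() < self.hostility:
            return rng.choice(["备战", "宣战"])
        return rng.choice(["结盟", "和谈"])

def simulate(agents, rounds, seed=0):
    """逐轮收集各智能体的行动;任一方“宣战”会抬升全体敌意度(简化的涌现机制)。"""
    rng = random.Random(seed)
    history = []
    for _ in range(rounds):
        step = {a.name: a.decide(rng) for a in agents}
        if "宣战" in step.values():
            for a in agents:
                a.hostility = min(1.0, a.hostility + 0.1)
        history.append(step)
    return history

agents = [NationAgent("甲国", 0.2), NationAgent("乙国", 0.6)]
log = simulate(agents, rounds=5)
```

这个骨架只保留了“智能体—交互—涌现”的循环结构;在正文所述研究中,决策一步由大模型生成,信息交流也以自然语言而非抽象数值进行。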

    新的历史模拟在外交史和战争史领域已有初步展现。罗格斯大学与密歇根大学的联合团队以一战前夕的英、法、德、奥匈、塞、俄、美、奥斯曼等国为原型创建了多智能体系统,其中,代表各国的多智能体在结盟、备战与宣战的行为中较为准确地复现了历史中的国际关系。类似的方法还被用来模拟第一次英法百年战争期间的重要战役,以证明由智能体所演绎的将军与军士可还原战役的主要结果。从这些尝试看,历史模拟与侧重理论探索的试验性模拟不同:其一,模拟系统的有效性需比对真实历史来验证;其二,模拟对象应当采取匿名化处理,以避免大模型调用历史知识,干扰模拟系统。不过,所谓复现历史,标准尚无定论,仍由研究者自行设定。比如在战役模拟中,研究人员利用英法最终伤亡率的高低比值,与史载对照,以此判断仿真是否成功。史实与模拟之间的拟合误差,也缺乏公认的基准。在一战模拟中,国家间结盟、宣战与备战的复现,最高准确度分别为77.78%、54.6%以及92.09%。这些数值能否证明模拟成功,可能还需更多讨论。

    当然,依托于大模型的历史模拟仍然存在不少局限。首先,模拟依旧是对历史情境的抽象和简化。智能体的行动范围局限于研究者指定的有限选项,而选项设计往往紧扣论题,容易出现简化后的偏移。比如围绕战争爆发设计模拟,国家智能体的行动选项中,导向冲突的选项更多,而和平类行为不足,若是设计逻辑缺乏其他依据,那么由模拟结果得出战争不可避免的推论难以令人信服。其次,语言对模拟结果的诱导作用无法被排除。模拟的主要环节,包括智能体的参数设定,智能体之间的互动方式,以及触发行动的事件本身,都要通过自然语言的描述来实现。模拟中的智能体行为究竟是复现了决策,还是停留在语言关系推断,实难分辨。再次,通用大模型的预训练语料主要来自移动互联网时代,本就存在“近因偏见”,如果不在微调环节令模型接受历史语义训练,模拟可能难向近代以前延展。除此之外,大模型的幻觉文本、价值偏见,以及模型不定期更新导致的实验结果无法重复,这些固有疑难同样也在挑战着历史模拟作为研究方法的可靠性。

    尽管有种种不足,但新一代的历史模拟依然具有不容忽视的发展潜力。作为一种研究工具,大模型驱动的历史模拟需要更多的检视与讨论。有一部分问题可以改进:比如通过消融实验,或结合史学研究成果,能衡量或优化模拟系统中的组件设计;采用开源模型,进行本地部署,并介入微调环节,能提升大模型生成内容的稳定性,也能令模拟更贴合历史语境。即便新的模拟方法仍远不足以还原复杂历史情境,但简化的历史模拟设计已足够在教学场景中迭代传统的课堂模拟。大模型不仅可以实现原本由学生扮演的模拟,还能翻转学生的参与方式,让他们从角色扮演者变成模拟设计者。学生利用提示词,描述具体场景,拟定大模型的“人设”,并同其他同学驱动的大模型角色展开对话,完成一场基于历史的语言游戏,这无疑能激发学生主动求知的热情。总之,无论作为游戏还是工具,生成式人工智能都带来了全新的增量。

    本文转自《光明日报》( 2025年02月10日)

  • 李金操:“里斯本丸”沉船事件的本事、记忆与纪念[节]

    1942年10月1日,日本政府运送盟军战俘的船只“里斯本丸”沉没【1】。二战期间,随着侵略范围的不断扩大,日本国内众多劳动力被征召入军。为解决本土劳力资源短缺问题,日本政府派遣船只将大批盟军战俘运往日本充当苦力。因运输环境极其恶劣,不少战俘在运输过程中死亡,美国学者米切诺将此类船只表述为“地狱航船”【2】。“里斯本丸”是众多“地狱航船”中既普通又特殊的一艘:说其普通,是因为该船仅是日本陆军省征召之众多民用商船中的一艘【3】,在型制功能和任务执行方面并无特别之处;说其特殊,主要在于船只沉没之际,日方负责人曾欲屠杀全部战俘,此举可谓相当匪夷所思。沉船事件不仅将英、日、中等国卷入其间,更是引发了一场持续数十年的史实论争与记忆重塑。

    ……

    一、虚假记忆的建构

    “里斯本丸”原是日本邮船株式会社名下的民用货船,该船总长445英尺(135.6米),实测排水量7053吨,净排水量4308吨,运输规模尚称可观。抗战前后,该船主要在东亚、东南亚、南亚海域执行运输任务【8】。1942年9月,伪“港督”矶谷廉介在日本政府要求下,开始着手向日本本土运送羁押在香港战俘营内的白人战俘。“里斯本丸”号承载人员于当月26日集结完毕,共计有日本军人、乘客778名,英俘1816人,此外还有1676吨战略物资【9】。

    9月27日,“里斯本丸”正式起航,是当月离港的第二艘战俘运输船【10】。未料几日后的10月1日,船只在航行至舟山列岛附近时,因遭美国潜艇攻击而沉没。消息一出,引发舆论关注。最先报道“里斯本丸”沉船事件的是日本媒体,10月8日,日本官方喉舌——《朝日新闻》刊登两则相关报道。第一则报道中,日方强调该船是“载有1800名英俘及少量日军官兵的陆军运输船”,凸显船只“战俘运输船”身份的同时,隐瞒该船运送大量日军,即具备军用船功能的事实。日方首先披露,该船遇难并非因自身或环境原因,而是“遭美潜艇袭击而沉没”。事故发生后,日军“立刻派船前往现场救援,救起了数百名英军”。在此基础上,第二则报道意在论证“英美敌军”的不人道。论者旁征博引,结合“里斯本丸”“哈尔滨丸”“朝日丸”以及停靠在马来海岸哥打巴鲁海边的医疗船等一系列所谓非军事船只被英美军队袭击的事件,在充分“证实”英美军把“国际法如同草鞋一样丢弃”之观点的基础上,深入“印证”英美军“非法不人道”的结论【11】。

    与此同时,为日方掌控之中国沦陷区媒体也在借“里斯本丸”沉船事件大作文章。奉天(今沈阳)的《盛京时报》声称“英美现在已漏出了图穷匕现的情况,唯以其穷途末路,所以竟而不择手段,不辨识清楚,莽撞地把搭载自己方面俘虏的日本船只给击沉”的行为实在“滑天下大稽”,同时不忘强调“日船载送英俘虏兵,原为使之居于安全所在”,嘲讽盟军的同时彰显日方的“光辉”形象【12】。北平的《晨报》在讥讽美方潜艇“盲目妄为之行动,终于引起将自己联合国之俘虏葬入海底之可讥事态”之余,结合当时美国出动军队帮助英国守戍英属殖民地的背景,将装载英俘之“里斯本丸”被美潜艇袭击一事视为美军对英属殖民地“暴行”的延伸,并意味深长地表示“此美潜水艇击沉英兵俘虏事件,所予英国民之影响,极堪注目”【13】;张家口的《蒙疆新报》也有“其(美潜艇)盲目行动,遂惹起使自己联合国之俘虏葬于海底之事态”,以及“将英兵俘虏收容船击沉,因此与英国民之影响,殊惹注目云”等语【14】,离间英美同盟的图谋跃然纸上。

    显然,在沉船事件发生后,日本政府很快主导其所掌控下的舆论,刻意塑造出一段有利于日本国家形象和国际地位的历史记忆。纵观日方关于“里斯本丸”沉没事件的报道,主观性宣泄较多而对事件本身的客观性记述乏善可陈,尤其是隐瞒了该船运送战俘的主要目的,并在关键节点上语焉不详。可以想见,其宣传并不能令反法西斯阵营,特别是英国满意。侵华期间日本对占领区舆论的管控十分严格,周边报纸均无刊登对日方不利言论之条件,故日方想当然地认为可以独享该事件的叙述权和解释权。由于反法西斯阵营各国仅能基于日方提供的只言片语了解和跟进“里斯本丸”事件,故而很难明晰事件全貌。

    鉴于所知讯息有限,英国在最初围绕沉船事件与日方展开交涉时,一直秉持措辞谨慎的态度。得知“里斯本丸”沉没导致千名英俘溺亡后,英国政府迫切想了解幸存英俘的消息,于是委托中立国瑞士代为咨询。10月13日,瑞士驻东京公使致电日本外务省,代英方表达了“希望能够尽快向英国政府报告相关信息”的意愿【15】,但日方却置之不理。10月19日,英方又通过国际红十字委员会向日本政府发送电文,希望相关机构“将船上所有俘虏的姓名写封电报发还”【16】,日本仍不予理会。英国政府见状,于10月下旬再次通过瑞士向日本政府传递消息,希望瑞士驻日本公使代替英国“访问拘留在收容所中的俘虏”【17】。但日本似乎心中有鬼,不仅不敢让战俘与外界接触,甚至拖延一个多月才勉强答复,且给出了完全否定的答案——“根据情况,此次许可是难以实现的”【18】。显然,日方不愿给他国了解真相的机会。日本欲盖弥彰的行为引起当事国英国的警觉,但英国政府苦于所掌握信息有限,难以采取进一步措施。

    事情很快发生转机。三位被中国渔民营救的英俘被成功护送至安全区域后,日方极力隐藏的真相被初步揭露。12月5日,《中央日报扫荡报联合版》提到有船只运送大量英俘,在北上途中因潜艇袭击而沉没,英俘伊文斯等三人艰难“脱险”,正在中国游击队帮助下“赴渝”【19】。该报道暗示,日方相关宣传是否属实,可得以验证。12月19日,《中央日报扫荡报联合版》再次刊登一则相关通讯,指出“里斯本丸”英俘“华莱士、尹士等数人”在中国民众帮助下逃出日本封锁,正在向安全地带转移。该通讯还提到驻港日军在香港集中营“对待英俘,极为残酷”,强令“不得一饱”之英俘“均次服苦役”,并时常实施“侮辱”或“枪杀”。此外,该通讯还首次提到日方此次运送英俘是为将他们“送入工厂,罚充苦役”【20】。虽然该通讯未对日方前期报道进行针对性批驳,且未描述沉船经过,但它首次揭露了日方对“里斯本丸”所载英俘政策宣传的虚伪,不失为一有分量的质疑。

    伊文斯等战俘抵达重庆后,英方大使馆相关人员通过三人口述了解到“里斯本丸”沉船事件的经过,并通过重庆军事参战处,于12月22日将有关信息传递至英国【21】。依据三位英俘传递回来的讯息,《泰晤士报》于次日刊登一则通讯,重点强调以下信息:其一,“里斯本丸”受袭当天傍晚,日方下令封闭战俘所在船舱,导致若干战俘在船只沉没前非正常死亡;其二,日本弃船离开后并未打开封闭的船舱,战俘们自行撕开密封帆布,才为获救争取到一线生机;其三,包括伊文斯等三人在内,不少战俘在游至日方救助船只时,日方并未理会;其四,一些本可以获救的战俘在落水后被日本无故射杀;其五,有不少英俘在中国渔民的帮助下获救;其六,日方救助船虽然也救起一些人,但并未全力施救。此外,英方通讯还首次公布了获救英俘的姓名和被俘前的职务,以便向国际舆论明确通讯确实来源于当事人,且内容真实可靠【22】。自此,英国政府终于通过英方当事人,掌握到关于沉船事件的可靠讯息。

    既已明了事情经过,英国政府一改往日交涉时的谨慎态度,开始借沉船事件抨击日方的卑劣行径。1943年3月26日,英国政府再次通过瑞士向日本政府传达外交文件,强烈谴责日方在船体受损后“不顾战俘,任其自生自灭”的行为,以及封闭船舱等促使战俘处境急剧恶化的行径,要求日本政府对沉船事件展开调查,将有关结果尽快通报,对相关负责人进行处罚,并承诺此后不再发生类似事件【23】。收到抗议后,包括日本驻瑞士公使铃木、日本陆军省次官富永恭次和外务省次官松本俊一在内的一批高层官员着手研究解决方案,在此期间,他们都极尽可能为日本相关人员进行开脱。铃木声称日方已在救助问题上“尽了最大努力”,因而“不应对参加行动者有任何批评”,同时强调其本人“很难认同英国政府所提抗议理由”【24】。富永恭次、松本俊一认为英方抗议“完全就是捏造的”,其目的便是“意图诽谤我们帝国的正义之姿”;他们还强调“遇难时,护送人员,船长及下属船员都跟着俘虏行动到了最后一刻,其中还有一部分人员壮烈牺牲”,并附言“遇难时的具体细节只有当时担负任务的人知道”,英国无权质疑【25】。

    日本外务省于5月20日通过瑞士驻东京公使馆,对英国政府的抗议文书进行正式外交答复,声称“英国政府以毫无事实依据的情报为基础,对帝国当事人采取的妥当措施进行毁谤”,并强调日方全体人员已为英军战俘的人身安全“战斗到了最后一刻,甚至牺牲”,被救助的900多名俘虏就是“对英国政府抗议中捏造事实的最好回击”【26】。在外务省给予正式书面答复的同时,陆军省俘虏情报局也出台文件,对英方抗议声明所提内容逐条批驳,诡称日方是为避免战俘骚乱才不得已将英俘关押于船舱内(实际是封闭船舱)【27】。为应对英方驳斥,俘虏情报局还向外务省提出三点“建议”:其一,英方抗议“完全是捏造事实”,是为“毁谤我们帝国的正义之姿”,外务省需“以强硬的态度对此予以反驳”;其二,推断英国是通过俘虏患者与外国代表之间的邮件往来而取得相关“歪解”,建议“相关管理者有必要注意”;其三,虽然此次事件经过已在适当时间正确处理,但今后类似事件很可能成为“敌方外交战略宣传上的手段”,日方当事人“需要将足以粉碎反击的资料尽早送至相关当局”【28】。陆军省俘虏情报局等机构似已沉浸在“日方在拯救‘里斯本丸’运载英俘上展示出了正义且光辉的帝国形象,英方对日方的诋毁讯息纯属捏造”的认知中难以自拔。

    王明珂指出,对于已发生的事情,人们的记忆“常常是选择性的、扭曲的或是错误的”【29】,其主要原因是一个族群往往通过塑造或强化集体记忆的方式来与“其他群体的社会记忆相抗衡”【30】。“里斯本丸”沉船事件发生后,日本政府凭借信息垄断的优势,通过舆论媒体,对外传递“美国潜艇不顾国际公约,无故攻击日本战俘运输船”,以及事故发生后“日本方面竭尽全力拯救数百名英俘”等讯息,迅速构建起对日本形象绝对有利的社会记忆,借此对抗美英等国的反法西斯同盟。当英方通过伊文斯等三位英俘了解到事情经过后,立即着手批驳日方虚假宣传,希望通过澄清事实等方式打破日方宣传的影响,但从以驻瑞士公使铃木、陆军省次官富永恭次和外务省次官松本俊一为代表的日本政府高层官员对英方外交诉求的驳斥可窥知,日本精心构建之扭曲记忆对日本社会的影响业已根深蒂固。

    在日本有意建构扭曲记忆以对抗反法西斯同盟的情况下,单凭外交手段,很难达成重塑相关记忆的目的。暴行暴露后,日本非但没有迷途知返,反而竭力扭转事件走向。1943年,日本大阪出版社出版发行的《大东亚战争记录画报》后篇收录了三篇与“里斯本丸”事件有关的文章。第一篇题为《美国潜艇在东支那海的暴举》,内容与《朝日新闻》第一则报道类似,唯在用词、叙事上更为考究。文章将“里斯本丸”运送的陆军官兵美化为英方战俘的护送者,为该船搭载军队谋求一合理解释。此外,该文在描述日方“立刻派出救助船”的同时,强调他们“经过努力”才“成功救助数百名英国俘虏”,进一步凸显日方“英俘拯救者”的形象【31】。而题名为《揭露美军之凶恶,连友军也屠杀的背信弃义行为》的文章则与《朝日新闻》第二则报道有一定区别,主要表现为日方已不再用“英美敌军”等将英美视为牢固同盟的表述,转而单独攻讦美国。日方称“美国终于露出了其凶恶的獠牙”,贬斥“将道义挂在嘴边,时常自我宣扬为正义的拥护者”的美国是一个“连至今一起扛枪的战友英国也无情打击”的“背信弃义”国家,同时将美国保护英国海外殖民地——爱尔兰岛、格陵兰岛的举措描述成“强行派兵入侵英国领地”。该文预言美国还会“不断采取手段谋取英国的澳大利亚、印度等殖民地”,进而“夺取过去几百年来英国的世界霸主地位”。与此同时,日本将所谓日方不念旧恶、救敌性命的行为与美方之“背信弃义行径”进行对比,借此彰显所谓的“大日本帝国的正义身姿”【32】。显然,此时日本政府的主要宣传目的已由最初的凸显“英美敌军的不仁道”转变为“尽可能孤立、打击美国”。最后一篇题名为《天罚》的英日文对照文章是上篇内容的延续,有“此次事件在敌方阵营中会掀起怎样的风浪,就让他们自己解决吧”之表述。英日对照的形式显示出日方希望该文内容能影响到西方世界【33】。

    至此,日本社会对“里斯本丸”沉船事件的认知并没有因英方交涉而有丝毫改变,沉船事件依旧是不同国家各执一词的罗生门。

    二、沉船事件本事

    在日本传统文化中,存在一种根深蒂固的“对名誉的义理”理念:“即使做错了,只要别人不知道,名誉就不算受到损害。”【34】日本政府在“里斯本丸”沉船事件发生后所做的一系列虚假宣传,都可视为此理念的具体实践。日方不仅顽固坚称英方关于“里斯本丸”沉船事件的宣传属恶意捏造,此后每当反法西斯阵营抨击日方战俘政策时,日方都不忘以其在“里斯本丸”遭遇潜艇袭击后的“卓越表现”予以驳斥【35】,似乎只有日方构建的历史记忆才契合事情本源。

    直到日本无条件投降,英国政府主导的香港军事法庭对相关战犯的审判工作宣告完成以后,“里斯本丸”沉船事件的全貌才第一次较为完整地呈现在公众面前【36】。在这次审判中,多位英方当事人出庭或提供宣誓证书,船长经田茂、翻译新森源一郎等战犯为了脱罪,也向法庭提供不少书面文件或口头陈述,这些材料包含诸多被掩盖的信息。实际上,“里斯本丸”共有7个货舱。在包括负责押运之日方成员在内的绝大多数当事人看来,战俘被集中安置在第1至3号三个船舱【37】,但根据经田茂出庭时对法官疑问的回答可知,英军战俘被集中安置在前4个货舱,之所以造成这种误解,主要是由于“2号舱和3号舱之间没有隔断”,导致它们被误认为是一个船舱【38】。

    1942年10月1日凌晨2时45分,在“里斯本丸”航行至距离中国舟山列岛之东汀岛8海里海域时,天气骤变似欲降雨,海面能见度极低。为防止触礁,船长经田茂向东偏北60度方向调整航向,船只驶向离海岸线较远的深水区。5时42分,在航行27海里后,航向又向东偏北调整10度,稍稍向海岸线靠拢,以防敌袭【39】。早7时10分,身为船长的经田茂“稍稍打了个盹”,恰在此时,早已埋伏在附近海域的美军潜艇“鲈鱼号”向“里斯本丸”发射鱼雷,经田茂“错过了命令大副进行曲折航行的机会”【40】。船身被数枚鱼雷击中,其中有两枚发生爆炸,使船只失去继续航行的能力【41】。值得一提的是,日方不仅未在船身添加任何战俘运输标志,还在船首、船尾甲板上分别加装一门本不该出现在非军事船只上的火炮,加上日军频繁在甲板上活动,极易让观察者误认为该船是在执行军事命令,为“里斯本丸”被美军潜艇击沉埋下了伏笔。

    “里斯本丸”遇袭后,相关人员很快向外传递求援信号,并以船首火炮还击【42】。收到求援信号后,负责警戒舟山附近海域的上海方面根据地队(下简称上根队)第6警戒队(原第13炮舰队,下简称6警队)迅速组织救援。紧接着,第1、7、8警戒队也在上根队司令部的要求下加入救援行动【43】。最先抵达出事海域的救援军机迅速拱卫“里斯本丸”,向“鲈鱼”号可能出没海域投放深水炸弹,“鲈鱼”号难以发动进一步袭击,如此,“里斯本丸”避免被即刻击沉,得以又在海上漂浮约一天。由于受损严重,即便关闭船尾舱门,依旧不能阻止船身进水、下沉。15时20分,经田茂向最先抵达的救援船只——“栗”驱逐舰【44】发出“船尾正以每小时10英寸速度进水,6小时后水就会到达甲板”的信号;17时10分,该信号又被修正为“船尾正以每小时8英寸速度下沉,7小时后水会到达甲板”【45】。得知情况后,“栗”舰长于17时30分致电上根队6警队最高指挥官,指出紧急情况下应考虑先行转移全部日军;对于战俘,或是由于当时在场船只运载能力不足之故,“栗”舰长仅建议救助“半数”【46】。

    据经田茂事后回忆,“17点左右”他通过旗语接到一个“用‘里斯本丸’上救生筏将船上所有日军转移至‘栗’驱逐舰上”的命令【47】,这当是“栗”舰长在传达上根队指挥官对其前所提营救建议的答复。该命令未涉及战俘群体,表明上根队最高指挥官在一开始就未有救助战俘之打算。在用救生艇运送三次日军后,6警队最高指挥官矢野美年大佐乘该舰队旗舰“丰国丸”抵达,从“栗”舰长手中接过现场最高指挥权。矢野美年并未改变此前所接命令,8时左右将剩余日本部队和乘客转移到附近舰只后,日方仍未将救护战俘纳入考虑范围,而是着手用牵引绳连接“丰国丸”和“里斯本丸”,意欲将“里斯本丸”拖拽至岸边浅水区【48】。以上信息表明,无论是统筹救援行动的最高指挥官——上根队司令,还是在场最高指挥官——6警队司令官矢野美年,对战俘生命均持漠视态度。这无形中助长了留在船上的两位日本军官——杉山中尉和和田少尉的虐俘气焰【49】。

    当晚19时多,在“里斯本丸”上与经田茂、杉山、和田商议解决方案的矢野美年刚一离开,和田秀男便在大副陪同下找到经田茂,要求封闭战俘所在船舱,被经田茂劝阻【50】。第一次封舱要求被拒后,和田仍不死心,于21时纠合船上最高指挥官杉山,再次找到经田茂,以指挥警卫看管战俘是其职务,船长无权干涉为由命令封舱。因有杉山支持,经田茂命令大副执行了封舱命令——将木板在舱口铺齐,盖上防水油布,钉上楔子,并捆上绳索【51】。封舱之举可谓是丧心病狂【52】,战俘们本已超过24小时未补充食物、水及正常如厕,一旦封舱,缺少正常空气流通,战俘生命将危在旦夕【53】。当时即便是人数较少、关押战俘条件最好的1号舱,也有至少两位战俘因身体虚弱、缺少新鲜空气等原因死亡【54】,更遑论其他几个船舱【55】。是夜,战俘最高指挥官斯图尔特上校命令稍懂些日语的波特中尉不断向日本警卫和船员哀告,但日方人员毫不理会【56】。

    如果说封舱之举是泯灭人性,残忍杀害努力自救以求一线生机之战俘的行径则称得上是丧尽天良。10月2日8时10分,“里斯本丸”船体向左倾斜7度即将下沉,经田茂向“丰国丸”打出“‘里斯本丸’即将沉没,我建议船上所有人员弃船”的旗语。8时20分时,“丰国丸”回复的指令还是“船上所有人员准备弃船”。但到了8时45分,该指令被修改为“把警卫和船员转移到即将开过去的一艘船上”【57】。显然,日方负责人刻意规避了战俘群体。在船只即将沉没的危急关头,2号舱几个战俘拼尽全力破开封舱口逃到甲板【58】,准备进一步打开其他3个船舱的封舱口,解救同伴逃生,但在船桥的和田秀男向卫兵“下达了开枪的命令”【59】,绝大部分战俘被压制回船舱【60】。万幸的是,冲上甲板的几个战俘中,有“一两个人躲在了甲板上的绞车后面”【61】,他们趁机打开几个船舱的舱口(第4号舱仅打开了舱门,舱口未及打开)【62】,为战俘们逃出舱体创造了条件。随着船只倾斜愈发严重,再不作为等同送死,1号舱战俘法瑞斯怒吼道:“我们必须死得像人一样,而不是像老鼠一样!”在其精神感染下,几位勇敢的战俘冲上甲板。此时船身即将沉没,2、3号舱的许多人“正在逃生”【63】。战俘们在帮助舱内战友逃出船舱后迅速跳海,如此才使得千名左右的英俘免于随船淹没。

    “里斯本丸”上的英俘是运往日本的重要劳力资源,即便是出于该方面考虑,日方也应积极施救,更遑论国际人道主义理念的约束。但让人遗憾的是,10月1日17时左右通过“栗”舰长传达的将“所有日本部队转移至‘栗’驱逐舰”上的指令表明,上根队司令官并无救助战俘之意【64】。18时左右,矢野美年率领“丰国丸”“百福丸”等日舰抵达出事海域后,“里斯本丸”附近的船只运载量已足以救护全部英俘,但矢野美年仍未更改前令,亦证明救援行动指挥高层内部已达成“不用顾及战俘安危”的共识。正因如此,负责船上警卫工作的和田秀男才胆敢纠集杉山中尉向经田茂施压封舱,亦敢悍然在2号舱战俘第一次冲上甲板时命令警卫开枪。如果说封舱、射杀第一次出舱英俘的举措只代表和田秀男等少数在场底层日本军官的意见,那后来战俘跳海后的遭遇足以证明,让所有英俘葬身大海本就是日方最高指挥官的本意。

    “里斯本丸”即将沉没之际,预感危机将至的战俘们协力逃出船舱,跳海求生。当时有至少20艘救援船只围绕“里斯本丸”【65】,且从战俘跳海至船完全沉没期间至少有一个半小时可以“在没有危险的情况下进行救援工作”【66】,泅水战俘本应轻松获救。但据战俘回忆,日方人员不仅不主动施救,当他们好不容易靠着救援绳索爬上日舰时,日本士兵迅速将他们“踢进水里”【67】。事实上,日方采取的最普遍做法并不是“踢”而是开枪射杀,这在幸存战俘的回忆中有充分表述:伊文斯在描述沉船细节时指出,有部分战俘跳海后“被日本人射杀”【68】;迈尔斯在回忆落水后的经历时指出,日军曾用步枪对在水中挣扎的英俘实施“持续射击”【69】;豪威尔在落水后曾听到附近有持续数分钟的枪击声,并亲眼目睹一位离他约2码距离的同伴被日军射中【70】;查利斯等人在跳海后的第一反应是“向船只游去”,但他们很快便遭到日方射击【71】;希尔在落水后发现在其游往岛屿的路线上“有一些日本巡逻艇”,艇上的日军在“用机枪和步枪向水中的人射击”【72】;克拉克森跳海后周边有数艘日本船只,但上面的士兵“丝毫没有要救我们的意思”,且只要战俘们靠近日船,便会“被射杀在水中”【73】。战俘们的回忆印证了射杀举动不是个人行为,而是集体行径,日本士兵显然是在执行上级命令。

    “里斯本丸”沉没的地点位于浙江省舟山市定海县东极乡,这里的渔民有救助落水者的传统。船只沉没的动静很大,惊动了岛上居民。当渔民发现落水英俘后,果断实施救助,与正在实施射杀的日本官兵形成鲜明对比。在2021年12月17日中国中央电视台“国防军事”频道播出的纪录片《亚太战争审判》第3集《活着回家(上)》中,幸存英俘丹尼斯·莫利回忆道:“是中国渔民的出现改变了一切,当他们出现,日本人看到他们,就停止了射击”;“如果不是看见中国帆船里的中国人救了很多战俘,日本是不会改变主意来接走战俘的”,汉密尔顿在香港军事法庭上提供的证词中也有类似表述【74】。中国渔民的加入超出日方预期,使现场局面愈发复杂。

    抗战期间日本极力宣扬由其为主体的“大东亚共荣圈”,诡称其发动的是一场肩负“东亚全体民族兴废”“为要确立大东亚永远的和平”“决然而起对于中日共同敌人英美”的必胜战争。在其宣传口径中,“日本就是因为要救东亚而与敌人交战”,所以“友邦日本的敌人”就是“中国的敌人也就是全东亚民族的敌人”【75】。日本军人政客对于英国俘虏的极端仇视心理,与上述军国主义宣传不无关系。而中国渔民救助英俘的行为,不仅与日本所谓黄种民族共同抗击白种民族的宣传相悖,更无形中映射出日方的卑劣。由于中国渔民的干预,日方负责人出于控制局面等因素考虑,下令停止射击。在离“里斯本丸”约一英里远的一艘日舰发出“停止射杀英俘”的信号后,射击行为很快停止【76】。日本士兵听从官长指令停止射击一事同样从侧面证实,之前的射击是在执行上级命令。此后,日本方面停止了对英俘攀靠日本舰只的阻拦,并逐渐开始主动解救泅水战俘【77】。根据当日13时51分矢野美年发送给上根队指挥官的电报,日方最终救起644名英俘【78】。

    当时东极乡渔民没有现代化船只,只能依靠平时打渔的小木船,运载能力有限。为最大限度实施拯救,不少船只往返多次,救助行动一直持续到深夜。由于地理位置荒僻、物资匮乏,加上战争影响,渔民生活相当拮据,但他们尽最大努力照顾获救战俘,无偿为他们提供衣物、饭食、沸水和住处【79】。根据《亚太战争审判》第3集《活着回家(上)》播出的幸存英俘查尔斯·佐敦口述资料(藏于伦敦英国战争博物馆),佐敦与十几位同伴被中国渔民救起后,渔民们对他们“非常非常好”,还给了他们米饭和红薯。中国渔民的勇敢无畏和真诚无私给幸存英俘留下了极为深刻的印象,以致70余年后,对过往很多事情都已遗忘的幸存英俘贝宁菲尔德在面对纪录电影《里斯本丸沉没》制作团队采访时,还清晰记得他“一生中吃到的最美味的食物”,是被救起后中国渔民给他的“半块萝卜”。贝宁菲尔德还感叹:“他们冒着生命危险救了我们,日本人有可能因此摧毁他们的整个村庄。他们是真正的英雄!”

    根据10月3日晚21时45分矢野美年发给上根队指挥部的电报,沉船次日,日方在青浜、庙子湖等岛屿上共搜捕英俘414人【80】,连同被中国渔民隐藏且最终被成功送至大后方重庆的伊文斯等3人,以及日方在中国渔民影响下救起的644人,共有1061名英俘因中国渔民的出现而免于随船湮没。这便是“里斯本丸”沉船事件的真相。

    三、中英对沉船事件的纪念

    反法西斯战争胜利后,各国人民沉浸在劫后余生的喜悦中,暂时忘却战争带来的苦痛。受大环境影响,“里斯本丸”事件幸存者最先想到的并不是开展对逝者的缅怀,而是践行对恩人的答谢。自1946年9月王继能赴港后,伊文斯等人又先后多次邀请唐如良、翁阿川等人到上海或香港会晤,不仅设宴款待,赠送钱财、衣物,还设法帮助恩人寻找合适的工作【81】。

    香港军事法庭审判结束后,英国政府也很快将如何答谢中国渔民提上日程。1948年4月12日,英国驻华大使特意致函中国外交部次长叶公超,商议答谢事宜。英国政府感谢了中国渔民的营救及其“以最大爱心给幸存者们食物、衣物和照看”的善行,并特地为渔民筹备专款。赠款形式颇为隆重,“国王陛下的‘康姆斯’号将于5月7日带着这笔款项前往东渔父岛访问,正式授予此项赠款”,为防止国民政府多心,英国政府特意强调“康姆斯”号驱逐舰“不带任何飞机”【82】。赠款仪式的落实有利于提升中国的国际形象,对巩固中英当事群体间的友谊亦大有裨益,这一建议本应得到鼓励,但相关文件转送至国民政府国防部审核时却遭否决。

    国防部认为,国民政府正在舟山群岛筹建海军基地,英方的访问虽然名义上是为赠谢中国渔民,但暗地里很可能是为窥探海防虚实【83】。1948年国民政府深陷国共内战的泥淖,英国访问东渔父岛的行为难免会触动当局者敏感的神经,故其并不愿意节外生枝。稳妥起见,国民政府提议委派浙江省政府委员周向贤代表渔民赴上海英国舰队司令部领取赠款。对英国而言,抗战胜利后国民政府在接收沦陷区时掀起的劫收风潮“闻名当世”,贪腐形象早已深入人心,英国政府不放心将此款项交给其官员。加之如果不能当面向渔民致谢,赠款仪式的纪念意义便会大打折扣,故英国政府未再回复国民政府,英舰造访一事不了了之【84】。

    但在英国政府影响下,国民政府也于1948年10月25日下发对东极渔民的褒奖令。其实早在1946年12月,当年参与组织救助行动的本地乡民沈品生当选为东极乡长后,便曾提议将营救英俘一事“呈报政府备案”,但由于多数当事人以营救“为吾人应有之天职,罔求邀功”为由推辞,报备方案未得落实【85】。直至英国驻华大使致函叶公超,南京国民政府才开始重视此事,并立即着令浙江省政府查验事情真伪【86】。经层层落实,东极乡乡公所如数告知上级营救经过,并对当年参与救助的渔民登记造册【87】。下令调查时,国民政府已顺带告知沈品生英舰拟答谢渔民并赠款一事,故在沈品生上呈县政府的文件中列有赠款分配方案:“拟分别以两山(岛)发起救护赵筱如、吴其生等10人,及参与动员各船户暨冒险护送3英人至内地之唐品根等6人列为甲等,凡献衣供饭者列为乙等,其各帮同送衣服送饭者列为丙等,用示大公,以励将及义务来兹。”【88】

    后来,英舰造访一事不了了之,为避免尴尬,国民政府要求“希酌定政府褒奖办法”【89】。10月11日,国民政府行政院内政部根据浙江省政府所呈当受褒奖人名册,发布褒奖令198件【90】。25日,定海县政府正式拿到由行政院内政部下发、浙江省政府转领的有关褒奖本县东极乡渔民的褒奖令,并将其发放给渔民。次日,《定海民报》对此事予以报道:“英人追怀旧德,尝有派舰至东极慰问及赉致谢金之说,嗣又有改由中央转发奖金之说,且一度层饬县府查复,案悬经年。今始奉到荣誉奖令,亦可谓久矣。”【91】虽然南京国民政府腐败无能,昏招频出,将英舰“至东极慰问及赉致谢金”这一简单事情神秘化、复杂化,导致本该大力宣传的善举“案悬经年”,但最终救助者也算“奉到荣誉奖令”,扩大了东极渔民营救英俘一事在地方上的影响。

    派使者至东极乡当面赠予渔民专款的方案既不能实施,英国政府只能另想他法。1949年2月17日,英国政府在香港举行悼念“里斯本丸”英俘官兵遇难仪式,英港当局决定借机在香港皇后码头举行答谢舟山渔民典礼。答谢仪式由港督葛量洪亲自主持,英国政府的重视程度可见一斑。典礼开始后,先由港督葛量洪代表英国政府致答谢辞,简要陈述中国渔民营救英俘之经过,继而举行颁发答谢奖品仪式。奖品主要包括“海安”号机动渔轮一艘,以及为在营救过程中做出突出贡献者准备的奖金、奖状。在仪式最后,葛量洪亲自为“海安”号剪彩,并示意该渔轮搭载来宾解缆出海,绕海面环驶一周后才返回码头【92】。客观来看,这次酬谢仪式存在很多不足:未邀请渔民代表参加;向渔民转赠奖金、证书的中间人胡栋林与舟山渔民并无太多交集;所赠“海安”号是汽油船,以当时东极乡的条件,根本无力维持其正常运转【93】。即便存在诸多不足,港督葛量洪在现场千余人面前亲自宣扬中国渔民的正义形象,并通过隆重仪式表达英港当局感戴渔民救护英俘情谊的举措,依旧能在寄托幸存战俘情感、巩固幸存战俘与渔民间的情谊上发挥积极作用。

    抗战胜利后,幸存英俘及英国政府主导下的答谢中国渔民行动很快成为这一时期中英两国纪念“里斯本丸”沉船事件的主流。英国政府为此特意策划一场造访赠款仪式,只是由于国民政府处理不当不了了之。为缓解“英舰恐不来”的尴尬,在“一度层饬县府查复,案悬经年”后,南京国民政府最终也下达了对渔民的褒奖令,从国家层面对营救义举给予了肯定。因不能成行东极乡,英国政府最终选择在香港举行答谢典礼,此举扩大了对中国渔民营救英俘义举的宣传,但也因缺乏渔民代表在场而留有历史遗憾。让人颇感无奈的是,南京国民政府未认识到“里斯本丸”沉船事件在宣传中国国家形象和巩固中英友好关系层面上的积极意义,因而始终未有主动挖掘该事件纪念价值的举措。

    1949年10月新中国成立,以美国为首的西方国家奉行孤立、封锁新生人民政权的政策,新生政权不得已采取“一边倒”的外交方针,加入以苏联为首的社会主义阵营。在此后相当长一段时间内,两大阵营意识形态的对立极其尖锐,中英关系难以融洽,这也影响了两国官方、民间交流活动的开展,进而影响到“里斯本丸”沉船纪念活动的深入推进。故新中国成立后,中英两国有关“里斯本丸”沉船事件的记忆长期处于尘封状态,未被全面唤醒【94】。

    东欧剧变和苏联解体宣告世界两极格局的结束,开展“里斯本丸”纪念活动的外部条件初步具备。1991年12月,港英政府举办抵抗日本侵占香港50周年纪念活动,邀请参加过香港保卫战的250名老兵出席,成功出逃大后方的三名英俘之一的法勒斯也在受邀之列。法勒斯到达现场后“多次谈及他在浙江省定海县东霍洋遇救的经历,亟盼与舟山群岛昔日救命恩人重聚”,并在报纸上刊登“阔别香港四十载,亟寻救命恩人”的启事【95】。与此同时,浙江省舟山市部分政府工作人员也逐渐重视并开始着手挖掘“里斯本丸”事件背后蕴含的深层价值【96】。

    2004年中英建立战略伙伴关系……为“里斯本丸”沉船纪念活动的逐步开展奠定良好基调。2005年是世界反法西斯战争胜利60周年,8月15日至9月5日,中共浙江省委宣传部、省政府新闻办公室等部门通过联合举办纪念反法西斯胜利60周年大型图片展,扩大对“里斯本丸”沉船事件的宣传,确保不少舟山以外的民众了解到东极渔民的英勇事迹【98】。除此之外,在浙江舟山和中国香港等地还举行多次有当年在场人士参加的“里斯本丸”沉船纪念活动。无论是当年参与营救的东极渔民代表应香港“二战退役军人会”邀请访问香港【99】,还是幸存英俘携家人来到浙江舟山东极海岛感谢恩人【100】,均使久被尘封的“里斯本丸”沉船记忆愈加清晰。

    此后十年间,“里斯本丸”沉船事件受到越来越多的关注。在学术领域,以中国学者唐洪森、田庆华和英国学者托尼·班纳姆为代表的文史工作者开展了一系列卓有成效的研究,为学界了解沉船事件作出卓越贡献。在艺术领域,以“里斯本丸”沉船事件为主题的歌曲、影视作品和戏剧被创作出来并呈现给中英两国民众,客观上扩大了该事件在两国民间的影响力【101】。社会各界人士对沉船事件的关注推动了“里斯本丸”纪念活动的深入推进。2015年10月2日,浙江海洋学院隆重举行“里斯本丸”英军士兵遇难73周年暨中国人民抗日战争胜利70周年纪念活动,不仅中方相关人士积极参与,英国驻香港领事馆和退伍军人协会等组织机构也给予大力支持【102】,足见该事件的纪念意义及其背后蕴含的精神价值,已为两国人民高度重视。……

    如今中英两国人民围绕“里斯本丸”沉船事件开展的纪念活动仍在不断推进,舟山本地热心人士与幸存英俘及其后人间的书信往来不断,并相约让双方下一代延续这份宝贵情感,使新生一代成长为“情感维系的传承者”,以确保“这份跨越中英两国的友谊长存”【104】。……

    四、结语

    “里斯本丸”事件是日本所制造的战时悲剧,若非中国渔民及时出现,船上1800多名英俘很可能会全部葬身大海。事件发生后,日本政府曾主导构建出关于“里斯本丸”沉船事件的虚假记忆,在掩盖运送大量英俘赴日做苦力及枪杀泅水英俘等真相的同时,借虚构日军是“英俘拯救者”来鼓吹所谓的“大日本帝国的正义身姿”。后来,英国政府通过中方护送至大后方的幸存英俘了解到事情经过,开始要求日本政府调查并公布事情真相。但由于此时日方建构的记忆已成功主导日本政府各机关工作人员的思维和意识,英国政府并未达成对事件正本清源的目的。直到世界反法西斯战争取得胜利,对相关战犯的审判结果公之于众后,笼罩在日方谎言迷雾中的真相才为世人所知,英方重塑相关记忆的工作才宣告完成。稍显遗憾的是,不仅重要历史事件的发生会受政治影响改变走向,记忆的修正亦会因政治力量的介入而有所迟滞。

    中国渔民在营救英俘过程中表现出英勇无畏、无私奉献且不图回报的品质,受到获救战俘的高度肯定和赞扬。抗战结束后,幸存英俘和英国政府迅速着手对中国渔民实施答谢,两国围绕“里斯本丸”事件开展的纪念活动发轫颇早,但国民政府并未对中国渔民救助英俘的国际人道主义行为大力宣扬,错失了在国际舞台上展示中国国家形象的宝贵机会。新中国成立后,受冷战格局下东西方两大意识形态对峙的影响,相关纪念活动并未持续开展,与该事件有关的历史记忆也长期封存在当事者的脑海中,未被全面唤起。直至2004年中英全面战略伙伴关系确立后,借着两国关系步入“黄金时代”的春风,该事件背后蕴含的深刻价值才逐渐被两国政府和人民挖掘和重视,“里斯本丸”事件纪念活动才再次活跃起来。相比于官方路径,“里斯本丸”沉船事件的民间纪念路径,即被救英俘与中国渔民之间的情谊,自始至今从未中断,它在修正被政治力量遮蔽的历史真相之余,揭示出人性的温度和善意,这或许也是沉船事件至今仍为两国人民纪念的原因所在。

    本文转载自《史学月刊》2025年第2期

  • 侯卫东:中国古代理想城市规划理念探源

    城市的出现是人类文明史上一座关键里程碑,古人通过营造城市而构建了全新的社会秩序、塑造了城市生活方式。城市自诞生以来就成为人群聚居之地、资源汇集之处,在古人身份构建中发挥了关键作用。以城墙为界限的地缘关系与以血缘为纽带的宗族关系深度融合,居住形态和社会组织之间高度耦合,共同形成了中国古代社会治理和宗族生活交织在一起的人文景观。

    中国古代理想城市规划理念

    学界一般认为战国时期成文的《周礼·考工记》,是现存中国古代最早对以王城为代表的城市规划进行理想化描述的文献,其核心文本为:“匠人营国,方九里,旁三门,国中九经九纬,经涂九轨。左祖右社,面朝后市,市朝一夫。”这种理想的王城由两重城垣相套构成,大城为边长九里的城墙围合的方形城池,每面设三座城门,四面环抱位居中央的宫城。这样理想的城市规划以大一统王朝的王城为基准,诸侯国都城、卿大夫采邑的规格则按照等差进行削减。在周王朝的天下秩序中,古人是否践行过这些理想城市规划理念,是追溯其历史渊源的关键环节。

    (一)理想城市规划理念与鲁国营造实践

    根据浙江大学陈筱博士的研究,可将《周礼·考工记》理想城市规划的核心内容提炼为:①王城由内外两重城垣相套构成,外城四面环抱着中央的宫城。②外城为边长9里(约合3750米)的正方形,每面设三座城门,城门内通城市干道而构成井字形路网,城内可能还有若干次干道。③城内的功能区有王宫、祖庙、社稷、朝堂和市场等,不论它们位于宫城之内还是散布在外城中,其相对空间关系不变。④王城有明确的南北中轴线,形成了显著的几何中心点,不同功能区的规模存在整数倍的比例关系,很可能采用了模数制进行设计。

    陈筱博士认为《周礼·考工记》不是对既有城市模式的记录,而是在成书阶段并未完全实现的理想城市规划,描述的是周王朝理想王城的边界与规模,城门、干道、城市主要功能构成及布置,应视作中国古代理想城市的文本渊源。宋代以来的学者根据自己对《周礼·考工记》文本的理解,绘制有多种王城布局推测图,图中都有贯通全城的南北向中轴线,轴线南部通过穿越城门的主干道、北部指向宫城。中轴线控制着城市功能单元和道路的空间布局,轴线东西两侧的城区结构对称、功能元素彼此呼应。这种中轴线控制全城布局的推断,在周代都邑考古资料中也有与之相应的案例,比如曲阜鲁城的布局就有此类现象。

    田野考古和研究工作确认的曲阜鲁城城墙,始建于西周晚期,鲁国是西周初年周公的封地,鲁城应当有更早的城市建置基础。陈筱博士通过对鲁城路网结构和地貌的勘探复原,将南北向纵贯全城、大致居中的8号道路指认为控制全城布局的中轴线,这条道路通过城内自然高地中部,其延长线连接城南礼制建筑舞雩台。曲阜师范大学徐团辉博士认为鲁城中部偏东的南北向9号道路连接周公庙宫殿区和都城正门南墙东门,共同构成一条南北向的宫城中轴线,这条中轴线很可能在鲁城最初营建之时就已设计;春秋晚期在周公庙宫殿区增筑了一座横长方形小城,南门设于南墙正中并与9号道路相连,更加凸显了9号道路的宫城中轴线地位。

    曲阜鲁城8号道路及其延长线贯通的全城南北中轴线,控制着宫城及各类功能区划的方位、道路网络的布局、礼仪性建筑的选址、冶铸工业区的分布,将城市内外空间紧密连接起来,使整座城市秩序井然。9号道路贯通的是以鲁城宫城为核心的南北中轴线,控制着宫殿、宗庙、衙署等高规格建筑的布局,使鲁城的核心日常运转整肃有序。

    可见,《周礼·考工记》理想城市规划理念在曲阜鲁城的营造实践中有很多体现,因为鲁国的始封君周公是周王朝制礼作乐的主要负责人,曲阜鲁城应当是按照周王朝诸侯国都城规制营造的典范,其布局应是《周礼·考工记》理想城市规划文本的重要依据之一。

    (二)周王朝都邑制度的郑国营造实践

    在周王朝及诸侯国的城市营造实践中,《周礼·考工记》理想城市规划理念是否按照等差体现在不同规格的都邑建置上?我们可以通过考察来判断这种理念践行的历史纵深。

    文献上最早关于周王朝都邑营造制度的描述是《左传·隐公元年》记载的祭仲规劝郑庄公的话:“都,城过百雉,国之害也。先王之制:大都,不过叁国之一;中,五之一;小,九之一。今京不度,非制也,君将不堪。”这里的“先王之制”指周王朝早期就厘定的都邑营建制度,郑国这样的诸侯国从国都到最基层的城邑分为四个层级,可根据周代尺度转换成通行的表述方式:1.国都的城垣规制是三百雉,相当于“方五里”即每边城墙长约2079米的方城,面积约432万平方米;2.大都的城垣规制是百雉,相当于“方三分之五里”即每边城墙长约693米的方城,面积约48万平方米;3.中都的城垣规制是六十雉,相当于“方一里”即每边城墙长约415.8米的方城,面积约17.2万平方米;4.小都的城垣规制约三十三雉,相当于“方九分之五里”即每边城墙长约231米的方城,面积约5.3万平方米。
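    上述雉、里、米之间的换算,可用一段简短的 Python 代码核算(换算基准取正文给出的 1 里约合 415.8 米;由“国都三百雉”对应“方五里”即城垣周长 20 里,可得 1 雉约合 27.72 米;代码仅为核算示意):

```python
LI_M = 415.8                # 1 里约合 415.8 米(依正文换算)
ZHI_M = 4 * 5 * LI_M / 300  # “方五里”城垣周长 20 里合三百雉,故 1 雉约 27.72 米

def city(zhi):
    """由城垣总长(雉)推算方城边长(米)与面积(万平方米)。"""
    side = zhi * ZHI_M / 4
    return round(side, 1), round(side * side / 1e4, 1)

guodu = city(300)       # 国都:三百雉,边长约 2079 米,面积约 432 万平方米
dadu = city(100)        # 大都:百雉,边长约 693 米,面积约 48 万平方米
zhongdu = city(60)      # 中都:六十雉,边长约 415.8 米
xiaodu = city(100 / 3)  # 小都:约三十三雉,边长约 231 米,面积约 5.3 万平方米
```

按同一换算,也可反推考古实测城址所合的雉数,如下文边长约 721 米的南城村古城,城垣总长约合 104 雉。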

    郑国是否实施过祭仲所说的都邑营造制度,是检验这种理想的都邑制度是否为历史事实的关键。

    荥阳京襄城村一带的春秋时期古城就是祭仲所说的郑国京城遗址,该城平面呈纵长方形,南北长约1820米、东西宽约1460米,面积约266万平方米。京城平均边长约1640米,约合3.94里、237雉,其规模远超“大都不过百雉”的标准。《左传·庄公二十八年》说:“凡邑,有宗庙先君之主曰都,无曰邑,邑曰筑,都曰城。”公子段被称为“京城大叔”,可知京城最初营建时是一座“有宗庙先君之主”的郑国“大都”,其宗法和政治地位都很高,作为国都新郑西北方向国君直辖的“大都”应当符合制度,并不存在“不度”和“非制”的问题,只是后来作为公子段的“都城”才“不度”并“非制”。

    荥阳南城村南的春秋时期古城是郑国境内的古城遗址,该城平面呈横长方形,东西长约770米、南北宽约675米,面积约52万平方米,是一座约合边长为721米的方城,城垣规格约合1.73里、104雉,相当于“大都”的规制。

    新密古城寨古城内有丰富的龙山时期至汉代遗存,城墙至今在地面仍可见,春秋时期郑国境内显然也能看到这座古城的城垣。该城平面近横长方形,南城墙和北城墙均长约460米、东城墙长约345米、西城墙复原长度约370米,面积约16.5万平方米,是一座约合边长为407米的方城,城垣规格约合1里、60雉,相当于“中都”的规制。

    荥阳娘娘寨内城营建于两周之际,后来又营建了外城。内城平面近方形,边长约210米,面积约 4.41万平方米。外城南墙长约1200米且西接索河,东墙长约800米且北接索河,南城墙和东城墙呈直角曲尺形连接在一起,与索河共同形成相对封闭的围合空间,北墙和西墙未找到。娘娘寨内城城垣规格约合0.5里、30雉,接近“小都”的规制,说明春秋早期郑国境内应当存在祭仲所说的“小都”。

    上述案例表明,田野考古发现的春秋时期郑国城邑与《左传》里祭仲所讲的都邑制度有高度的对应关系,郑国境内符合“先王之制”的“大都”“中都”“小都”是存在的,周王朝这种理想的都邑制度至少在一定范围内实施过,是一定时空范围内的历史事实,并非没有实践的理想制度设计。

    中国古代理想城市规划的渊源

    《汉书·礼乐志》载:“故象天地而制礼乐,所以通神明、立人伦、正情性、节万事者也。”“王者必因前王之礼,顺时施宜,有所损益,即民之心,稍稍制作,至太平而大备。”在古人的认知中,礼制的核心是维护社会秩序,既强调对前代礼制的继承,又注重顺时施宜、因地制宜。以曲阜鲁城及郑国城邑为代表的周王朝诸侯国城市规划与营造实践,是《周礼·考工记》理想城市规划的直接实践渊源,也是对此前夏商王朝政治文化遗产的继承和发展,可以鲁城为基点向前追溯理想城市规划理念更早的历史渊源。

    周王朝在武王、周公带领下追寻“地中”“土中”“天下之中”的过程中,舍弃了前朝故都“大邑商”,选择了更早的夏都故地二里头一带。西周初年青铜器何尊的铭文记载了成王追述武王的话:“余其宅兹中或(国),自之乂民”,把营建于夏都故地二里头附近的东都成周称为“中国”,即周公“乃作大邑于土中”的中央之城,体现了“择中立都”“建中立极”的政治观念。以二里头夏都为中心的中原腹地,在西周初年已经明确成为观念上的“地中”“土中”“中国”等四方仰慕的中央神圣空间,中国古代逐渐形成“居中而治”的传统政治观。

    二里头夏都的营造以及中原腹地作为中央神圣空间的形成,有着深厚的历史文化积淀。4000多年前的龙山文化时代,出现了一次广泛筑城的浪潮。在中原腹地临近水源的高阜平坦之地,用黄土夯筑城墙;在城内居高居中之地营造贵族宫室和公共活动空间,民居、作坊和墓地有序安排;干道连接城门,地下陶水管、暗渠或明渠构成完善的给排水系统。龙山时代的筑城和宫室营造技术,为夏商王朝城市规划和营造实践提供了技术积累,成为中国古代城市营造技术的主流,也是重要的中华文明基因。

    公元前1800年前后,中原腹地形成以二里头夏都为代表的二里头文化,是对此前中华文明肇始阶段文化的凝聚和升华。纵横多条十字正交的路网结构,将二里头夏都区划成网格状多宫格“里坊式”布局,宫城位居中部偏东南。每个网格单元都是面积10万平方米左右的纵长方形,并且长宽比例接近,路网形成之后不久又分别在多个网格单元的道路内侧营造夯土围墙。二里头夏都宫城内10余座大型宫殿宗庙建筑排列有序,采用回廊庭院式布局,即是其后数千年官式建筑四合院式布局的渊源。二里头夏都的规划理念和营造实践已有建筑模数的意识,体现了王都规划“模写天下”的宇宙观。

    新郑望京楼二里头文化城邑(夏城)及二里岗文化城邑(商城)的选址和规划理念与二里头夏都最为接近,其城垣围合的面积约37万平方米、平面近菱形,商城内由道路及其延长线界隔成九宫格式布局,每个单元格的面积约4万平方米,相当于二里头夏都的缩略版。

    二里头夏都以宫城为中心的“多宫格”布局、中轴线理念、四合院式宫室制度,以青铜礼器为核心的多材质组合的器用制度,以宴享、祭祀、丧葬为代表的礼仪制度等,创造了新的空间秩序和价值秩序,体现了更加成熟的王朝礼制。

    商王朝早期以郑州商城和偃师商城两座王都的营造为引领,也出现了一次广泛筑城的浪潮。郑州商城营建在丘陵与平原过渡地带的高阜平坦之地,临近河湖等充足的水源,300万平方米左右的大城(内城)平面为纵长方形,东北角受紫荆山自然土岗的影响形成一个折角。郑州商城大城东北部发现的垣墙及其延长线,可将宫殿宗庙建筑界隔成多个“宫城单元”。也有学者结合夏商王朝都城布局特征和规划理念,提出“宫城”应在大城中部一带。根据目前的考古发现情况,郑州商城大城中南部很可能存在多个重要的功能单元。

    郑州商城与历代郑州城重叠,很难对“宫城单元”或“网格单元”进行清晰识别,也无法确认其是否存在如二里头夏都一样的网格状“里坊式”布局。但这种将都城按功能区划分成若干单元的方式,无疑继承了二里头夏都的规划理念。在郑州商城大城之外,又结合周围岗地及河湖水系,因地制宜营建了防护范围达到1000万平方米以上的外城,实现了中原王朝都城的第一次超大型建设。郑州商城作为商王朝取代夏王朝前后营建的都城,继承了二里头夏都的营造技术和规划理念,又有很多创新和突破,比如上文提及的因地制宜营建了面积达到1000万平方米以上的外城,给排水设施更加复杂完善等,其都城规划和营造实践体现了商王朝建立者们的理想和追求。

    偃师商城是在二里头夏都附近选择理想之地平地起建的,既有近在咫尺的二里头夏都作为模本,又有营造郑州商城最早一批宫殿宗庙建筑和宫城的实践经验,因而其营造可以更好地体现商王朝初年的建城理想。偃师商城首先营造了面积4万平方米左右的近方形宫城,宫城居于其中部偏南的位置,之后向外营造面积约81万平方米的纵长方形大城(早期大城,即考古报告中所说的“小城”),从而形成重城相套的结构。早期大城与宫城大体是同一条南北中轴线,且以此对称有序、布局严整地营造了多个近方形功能单元,每个功能单元约4万平方米。偃师商城西南角有一个约3.5万平方米的府库类封闭单元,西北角有一个约4万平方米的仓储类功能单元,而东南角、东北角的城墙都有与西南角、西北角相似的拐折,由此推测这两个位置也应有相对独立的功能单元。虽然目前还无法确认81万平方米的大城是否都用垣墙和道路界隔成“里坊式”的功能单元,但可以明确的是,其继承了二里头夏都网格状“里坊式”布局的规划理念,并且进一步发展了建筑模数意识。偃师商城功能单元的建筑模数明显小于二里头夏都的宫城,表明其规格低于真正的王都。偃师商城宫城内营建的东西两组建筑,每组建筑也依然遵循始于二里头夏都的南北向中轴对称原则。

    郑州商城和偃师商城的城市规划,体现了商王朝“模写天下”的宇宙观和对都城秩序的追求,影响后世数千年的都城规划和城市营造。商王朝中期营建的安阳洹北商城,总体布局更加追求方正规矩、重城相套、中轴对称、四合院式建筑等,把“模写天下”的都城规划理念推向新高度。

    这些自二里头夏都以来的营造实践积累的城市规划理念,包括城圈方正规矩、重城相套、中轴线控制全城、网格化分区规划、四合院式宫室建筑等,与后来曲阜鲁城代表的理想城市的早期实践有明显的渊源关系,当为《周礼·考工记》所载理想城市规划理念的历史渊源。

    中国古代理想城市规划理念的赓续

    中国古代理想城市规划文本形成之后不久,秦统一六国,建立了大一统王朝。《史记·秦始皇本纪》《三辅黄图》等文献记载表明,秦始皇重新营造都城咸阳的原则是“法象上天”,与理想城市规划“模写天下”的理念不是一个传统。西汉帝都长安城按功能区划营造多个宫城的方式,虽与理想城市规划理念有接近之处,但比起二里头夏都、偃师商城的布局,仍然因地制宜有余、规划严整不足。

    后世都城营造实践中,在东汉魏晋帝都洛阳城基础上重新营造的北魏帝都洛阳城,是遵循中国古代理想城市规划理念的一个关键节点。洛阳城内,铜驼街北端连接宫城、南端延伸至礼制建筑圜丘,这条线就是控制全城布局的南北向中轴线,与曲阜鲁城的南北向中轴线非常相似;宫城和衙署之外,北魏洛阳城还有纵横交错的道路网络界隔的大量里坊空间,与二里头夏都、偃师商城的规划理念遥相呼应。这些应当反映了从北方迁入中原的魏孝文帝竭力追求正统王朝理念的迫切心情。

    北魏王朝的后继者东魏北齐在北魏洛阳城规划理念和营造实践的基础上,重新规划并营造了东魏北齐帝都邺城,新邺城与北魏洛阳城的形制相仿,其布局更加方正规矩、中轴线更加突出。整座城市以宫城为中心,围绕全城中轴线对称布局,城内分布着数量众多的里坊。东魏北齐邺城的布局,体现了对中国古代理想城市规划理念的继承与创新,对后世的隋唐长安城、洛阳城的“棋盘式”里坊布局产生了直接影响。

    北宋东京城从州桥经天街到宣德门一直纵贯至大内,有一条明确的南北向城市中轴线;东西向穿城而过的汴河以象天汉,州桥也称为天汉州桥。因此,北宋东京城的布局理念“象天法地”,对中国古代理想城市规划理念既有继承、又有创新,也形成了新的开放式城市空间。起家于北方草原地带的元世祖忽必烈营造帝都元大都时,在金中都的基础上采用《周礼·考工记》的理想王城规划理念设计并营造,与魏孝文帝营造北魏洛阳城时竭力追求正统的心情非常接近。元大都受原有建筑和地形地势的影响,营造时并不能很好地实现理想城市规划理念,此后平地起建的元中都、明中都是更贴近《周礼·考工记》规划理念的城市。明清北京城继承了元大都的城市规划理念并拓展创新,中国古代理想城市规划理念融入明清北京城的营造实践中,也成为赓续至今的中华文明基因。

    本文转自《光明日报》( 2025年02月08日 10版)

  • 俞可平: “奴婢贱人,律比畜产” —— 中国古代贱民的政治学分析

    对贱民阶层的专门研究源自民国时期。瞿同祖根据历代的法律制度对中国历史上的良贱阶层做了明确的分类,陈序经、王书奴等则对疍户和娼妓等贱民群体进行了比较系统的考察。但总体而言,民国时期对贱民群体的研究非常稀少。对贱民阶层真正系统而专业的研究,是1978年改革开放以后开始的。一批历史学者,特别是经济史学者从不同的层面对贱民群体进行了分门别类的专门研究,如对奴婢、娼妓、乐户、堕民、疍户、官户、杂户、田仆的专门研究。不少学者对贱民的来龙去脉、生活方式、人际关系、社会地位和法律规定等各个方面都做了非常出色的探究,如对徽州田仆的研究。不过,迄今学界对贱民的关注,多偏于具体的专门论述,而缺少综合性的宏观分析。此外,已有的贱民研究,几乎没有政治学者的参与。而从根本上说,贱民首先是一个政治等级或政治阶层,只有深刻揭示贱民的政治意义,才能真正认识贱民的本质及其在中国传统社会中的实质性功能。本文将首先对贱民的定义、性质、特征、类别和历史演变做一简要的宏观考察,在此基础上着重从政治学的角度分析贱籍制度与中国传统专制政治的内在联系及其本质功能。

    一、“四民”之外的贱民

    “明贵贱,辨等列”(《左传·隐公五年》)是中国传统等级秩序的根本法则,“编户齐民”是贯彻这一根本法则的社会管理制度。“编户齐民”即是通过户籍制度将普通平民进行分类管理,它把广大民众分为士、农、工、商四类。春秋时期的管仲说:“士农工商四民者,国之石民也”(《管子·小匡》)。战国时期的谷梁赤也说:“古者有四民:有士民,有商民,有农民,有工民”(《春秋谷梁传·成公元年》)。《汉书·食货志》曰:“士农工商,四民有业。学以居位曰士,辟土殖谷曰农,作巧成器曰工,通财鬻货曰商”。后晋刘昫等撰的《旧唐书》进一步延续了古代的“士农工商”四民说:“凡习学文武者为士,肆力耕桑者为农,巧作器用者为工,屠沽兴贩者为商”(《旧唐书·职官志》)。直至明清,“士农工商”四民依然是对国民的基本分类,但明清两代的户籍制度则分别将居民的户籍进一步细分为“军民匠灶”和“军民商灶”四类,将从军的“军户”、从事手工业的“匠户”和从事盐业的“灶户”单列,并明文规定上述“四民为良”(《大清会典》卷十七)。

    然而,自正式确立“四民”体系以来的漫长历史进程中,无论在哪个朝代,在上述“士农工商”或“军民商灶”法定的“良籍”之外,还有一个被列入“贱籍”的特殊群体,他们的社会政治地位比普通“四民”更低,不能享受普通平民的法定权利,甚至不属于普通的“庶民”“百姓”范畴。这个被排斥于“士农工商”四民之外而处于社会最底层的特殊社会群体,就是本文所说的“贱民”,亦称“贱人”“贱口”或“贱色”。之所以称这一特殊群体为贱民,一方面,是因为无论就其从事的职业还是就其所处的社会地位而言,这一群体都处于最低劣和卑微的社会末端;另一方面,无论从国家的法律规定还是从社会的伦理评价来看,这一被打入“贱籍”的特殊群体,都与属于“良籍”的平民有着本质的区别。贱民在不同的历史时期和不同的地区,各有不同的称呼,如奴婢、部曲、客女、佃客、番户、杂户、乐户、堕民、娼优、丐户、疍户、世仆、伴当、九姓渔户等等,这些不同的称谓大体上反映了贱民群体的构成。

    在传统中国政治语境中,“贱”实质上是一个等级关系概念,即所谓的“贵贱有等”(《荀子·王制》)。一是从官民关系上说,官贵民贱;二是在平民之间,还有良贱之分。普通的黎民百姓是“良民”,可以享受基本的法定权利,而“良民”之外还有“贱民”,他们连最基本的平民权利也被无情剥夺。“贱”的第一种含义是以官为贵,以民为贱,贵贱有别,以强调名器之尊。这里的“贱”,是指普通平民,是相对意义上的“贱”。另一种平民关系上的“贱”,则是绝对意义上的“贱”,“是指在社会上处于特别低下的法律地位和社会地位、没有独立人格的个人,以及由这些人构成的等级。这个意义上的‘贱’或‘贱民’,就不仅相对贵族、缙绅,即使相对一般百姓而言,他们的地位也是卑下的”。进而言之,这个处于社会等级最末端的贱民群体,鉴于其连最普通的平民身份也被法律所剥夺,他们实质上已经不是正常意义上的人,而被贬低到其他动物和财产的地步。正如《唐律》所毫不隐晦地宣示的,“奴婢贱人,律比畜产”(《唐律疏议·名例六》)。贱民之“贱”体现在其政治地位、生产劳动、社会交往、教育科举、日常生活、荣誉奖励等各个方面,并且以国家的法律制度和社会的礼仪习俗加以规约和维系。

    贱民不得拥有正常的户籍,没有独立的身份,更无独立的人格,从而也不享有普通平民的基本法定权利。将每一户人家以及家庭的每一成员编籍入册,是中国历代王朝的强制性要求,违犯者会受到法律的惩罚。唐律规定:家长若不如实登记户籍信息,将受到刑事处罚,面临牢狱之灾:“诸脱户者,家长徒三年……脱口及增减年状以免课役者,一口徒一年”(《唐律疏议·户婚一》)。《清会典》也规定,凡民必须入籍:“凡民之著于籍,其别有四:一曰民籍,二曰军籍,三曰商籍,四曰灶籍,察其祖籍,辩其宗系,区其良贱。”“凡民”之中的“民”不包括贱民,列入贱籍的贱民根本就没有独立的户籍权,他们必须寄身或依附于主人的户籍。上引唐律同时规定,“奴婢、部曲亦同不课之口”,必须登记在户主名下,不许自主为户。不仅私奴不得拥有正常的户籍,即便官奴也同样如此。官奴必须隶属于所服役的衙门,不得在地方自立户籍。唐律对此有诸多详细的规定:“官户隶属司农,州、县元无户贯”(《唐律疏议·名例六》),“杂户者,前代犯罪没官,散配诸司驱使,亦附州县户贯……官户亦是配隶没官,唯属诸司,州县无贯”(《唐律疏议·户婚上》),“工、乐及官户、奴,并谓不属县贯。其杂户、太常音声人有县贯,仍各于本司上下”(《唐律疏议·贼盗二》)。《大明律》也以“军、民、匠、灶”四民分籍,严格限制贱民进入正常的民籍,并将所有贱民列入“丐籍”。但此“丐”并非通常意义上的“乞丐”,列入“丐籍”的贱民其地位连乞丐也不如:贱民的“丐籍表示身份,同没有职业的乞丐相比,在户籍分类上截然不同:一属贱民,一属良民,不可混淆”。

    贱民的生命安全没有基本的法律保障,其生存权和人身自由随时可能被主人或其他“良民”所剥夺。“杀人偿命”这一古典法律通则,并不适用于贱民。主人可以对奴婢施加各种人身伤害而不受惩罚,对男女奴仆的体罚、残害以及对女仆的奸污,只要不出人命,几乎都不会受到法律制裁。有学者指出,在唐律中没有发现任何条文用以约束主人对奴婢的虐待和残害行为。“除了擅杀一事,主人控制下私奴婢生命、身体的安全无法受到保障,主人对奴婢的权力几近绝对。”即使是故意虐杀奴婢,主人也不用偿命,而只需受到轻微的处罚。唐律规定:“诸奴有罪,其主不请官司而杀者,杖一百。无罪而杀者,徒一年”(《唐律疏议·斗讼二》);“诸主殴部曲至死者,徒一年。故杀者,加一等。其有愆犯决罚致死,及过失杀者,各勿论”(《唐律疏议·斗讼二》)。清律也有类似的规定:“若奴婢有罪,其家长及家长之期亲,若外祖父母,不告官司而殴杀者,杖一百;无罪而杀者,杖六十,徒一年……若违犯教令而依法决罚邂逅致死及过失杀者,各勿论。凡官员将奴婢责打身死者,罚俸二年;故杀者,降二级调用;刃杀者,革职……”(《大清会典事例》卷八一《刑部》)。对贱民生命安全的保障,有时甚至还不如对动物生命的保障。例如,清律规定,“凡私宰自己马牛者,杖一百”(《大清律例》卷二十一);而官员残杀奴婢只需“罚俸二年”或“降二级调用”。由于历朝对贱民的生命安全几无法律保障,发生在贱民身上的种种惨绝人寰的虐害行径,可谓罄竹难书。

    贱民的自由权、平等权和人格权被剥夺,不享有基本的人权。贱民虽是人类,但他们仅是生物学意义上的人,而非社会学和政治学意义上的人,在本质上,他们并不被当作正常的人类,而是当作主人的工具和财产。虽然贱民群体内部还有不同的差别,奴婢是最低下的贱民,是贱民中的贱民,但是所有贱民,无论是奴婢还是部曲、堕民、乐户、佃客,都没有独立的人格,而是附属于主人的工具,从而没有起码的人身自由权和人格平等权。贱民必须绝对听从主人的使唤和遣差,不得有违主人的意愿,否则主人可对其进行任意处罚。贱民也没有职业、迁徙、婚姻和交往的自由,没有任何隐私权和人格尊严。例如,贱民不仅自己须由主人决定其婚配,甚至其子女的婚配权也得由主人决定,否则,也将受到法律的惩罚。清律规定:“凡家仆将女子私嫁与人,不问本主者,鞭一百。无论年份远近,生子与未生子,俱离异,给予本主。”与剥夺贱民基本自由相伴随的,是历代法律明文规定贱民与主人、良民的极度不平等。以斗殴、杀人及强奸为例,主人殴伤、奸淫,甚至杀死贱民可以不承担任何法律责任,普通平民(良民)殴伤、奸淫和杀死贱民也只需承担轻微的刑事惩罚;反之,若贱民殴伤、奸淫和杀死主人或良民,则要受到法律的最严厉惩罚。唐律规定:主人杀死奴婢部曲,只要杖一百,至多徒一年;良民殴伤贱民者,其罪“减凡人一等,奴婢又减一等”(《唐律疏议·斗讼二》)。然而,若贱民殴打主人,则“伤者绞,杀者皆斩”;若贱民殴打良民,则罪“加凡人一等,奴婢又加一等”。主人强奸女性贱民,则不受惩罚;良民强奸女性贱民,也只需受到轻微惩罚:“奸他人部曲妻、杂户、官户妇女者,杖一百;强者加一等……明奸己家部曲妻及客女,各不坐”(《唐律疏议·杂律上》)。反之,若贱民奸淫主人或良民,则面临极刑的处罚:“其部曲及奴奸主及主之期亲,若期亲之妻者绞,妇女减一等,强者斩”;“诸奴奸良人者,徒二年半,强者流,折伤者绞”(《唐律疏议·杂律上》)。明清两代几乎完全继承了历朝对贱民在法律上的非人性歧视,在某些方面甚至比前朝更严厉。例如,洪武《大明律》规定:“凡奴婢骂家长者,绞。骂家长之期亲及外祖父母者,杖八十,徒三年。大功,杖八十;小功,杖七十;缌麻,杖六十。”“凡奴婢殴家长者,皆斩;杀者,皆凌迟处死;过失杀者,绞;伤者,杖一百,流三千里。若殴家长之期亲及外祖父母者,绞;伤者,皆斩;过失杀者,减殴罪二等;伤者,又减一等;故杀者,皆凌迟处死”(《大明律》)。清律规定:奴婢对主人的辱骂和殴打,均要受到极刑的处罚:“凡奴婢殴家长者(有伤;无伤。予殴之奴婢不分首从),皆凌迟处死”;“凡奴婢骂家长者,绞”(《大清律例》卷二十八、二十九)。

    人以役贱,也是历代贱民的基本特征。贱民从事的职业都是社会中最低劣的行业,故称“贱业”;反过来说,最低贱的工作非贱民莫属。除了侍候主人或官员的各类仆役,以及各种最辛苦的劳役外,凡是被当时的社会舆论视为最下贱的各种职业,均由贱民群体承担,例如唱戏、卖淫、行刑、埋尸、抬轿、剃头、阉割、丧葬等等。以宋以后浙东的“堕民”为例,男女贱民从事的各类“贱业”竟多达数十种。清律明文规定“奴仆及倡优隶卒为贱”:“凡衙门应役之人……其皂隶、马快、步快、小马、禁卒、门子、弓兵、仵作、粮差及巡捕营番役,皆为贱役。长随亦与奴仆同”[《大清会典》(光绪)卷十七]。因此,“清代的贱民首先是指奴婢和娼优。长随跟奴仆同等;开豁以前的乐户隶属‘乐籍’,与娼优是一样的。为官府服役的皂隶等所干的各种差事,被认为是侍候官老爷的‘贱役’;人以役贱,所以凡应承这种差役的人都被划进贱民的圈子里”。明清时期徽州的佃仆,是等级高于奴婢的贱民群体,其服役的范围,“主要是冠婚祭喜庆,以及属于地主生活方面的一些劳役。但也有一些是属于生产性的劳动,如看守树木、除草、修路、建筑仓库、搭桥、春渡等。还应指出,如抬轿、奏乐、丧葬杂役,等等所谓‘贱役’,也是由佃仆承担的,而且成为佃仆的一种标志”。

    作为“四民”之外的一个特殊群体,贱民被强制要求赋有某种侮辱性的身体标识和社会符号。历朝对贱民的服饰、出行等均有明确规制。违犯贵贱的规制,即要受到法律的惩罚。首先是服饰的穿戴必须有别于良民而凸显其贱民身份。如《大明会典》载明:“正德元年,禁商贩吏典、仆役、倡优、下贱,皆不许服用貂裘。僧道隶卒下贱之人,俱不许服用纻丝纱罗绵”(《大明会典》卷六十一)。清律也规定:“只许奴仆穿茧绸、毛褐、葛布、梭布、貂皮、羊皮;不准穿纺丝、绸绢、缎纱、绫罗、各种细毛、狼皮以及石青色衣。只许戴狐皮、沙狐皮、貂子皮帽;不许戴貂帽。乐户只准穿戴本色黄骚鼠皮帽。凉帽用绿绢裹,绿绢沿边。不许穿各项绫缎及狼皮衣。”据明代徐渭记载:浙江的堕民,“四民中即所常服,彼亦不得服”。其服饰的典型特征是:“帽以狗头状,裙布以横,不长衫”(《徐文长集》卷十八《风俗论》)。其次,在出行、就餐、称谓等社会生活的许多方面,历代都有关于贱民的特殊定制。贱民不能走道路的中间,不能与主人同桌共餐,与良民相逢必须主动避让。如浙东堕民,其出行“不得乘坐车马,只能步行。路遇平民,堕民必须让路。绍兴乃是水乡,出行的主要工具是船。然而,如果有堕民同行,即便是冰天雪地,北风呼啸,平民也不允许堕民入舱……堕民外出时总是低着头,迈着碎步,靠右急速而行。如果双方相向而行,堕民得给平民让路”。

    对贱民最为残酷的制度,就是贱籍的世袭性。在中国传统社会,历代的规制是,除了极其特殊的例外,贱民自己及子孙后代均不能脱贱为良。换言之,一日为贱,不仅终身为贱,而且世代为贱。尤其是贱民及其子孙永世不得参加科举考试,不能进入朝廷官僚体系,成为朝廷官员。在中国传统社会,由贱入贵的主要制度性途径,便是通过科举考试进入官僚体系。这一选拔精英的道路对于普通平民而言,是转变其身份的主要通道,而这条通道对于贱民而言则是完全关闭的。唐律对科举取士的资格要求很高,普通的工商阶层都被排除在外,更何况贱民阶层。到了明清时期,法律已明确规定贱民不得参与科举考试,不得进入仕籍。如清律明文规定:“凡出身不正,如门子、长随、番役、小马、皂隶、马快、步快、禁卒、仵作、弓兵之子孙、倡优、奴隶、乐户、丐户、胥户、吹手,凡不应应试者混入,从重治罪。认保、派保互结之五童互相觉察,容隐五人连坐,禀报黜革治罪”[《大清会典》(光绪)卷十二]。“其八旗户下人及汉人家奴、长随、倡优、隶卒子孙,概不准冒入仕籍。步军统领衙门番役缉捕勤奋者,止准该衙门酌加奖赏,毋许奏给顶戴,其子孙概不准应试出仕”[《大清会典》(光绪)卷十]。在贱民群体中地位稍高一些的佃仆子弟,即使因为特殊的机遇,其经济地位足以供养子弟上学读书,也同样因贱民身份的限制而“不准应试出仕”。

    婚姻是传统社会中人们改变身份的重要途径之一,为了阻止贱民通过婚姻变更贱籍,历代均对贱民的婚姻做了严厉的限制,禁止贱民与良民之间的通婚。唐律认为,各色人等各有自己匹配的婚姻,良贱之间尤其不能婚配。违犯良贱之间的婚配关系,就打乱了既定的等级秩序,必须受到法律的严惩。“人各有耦,色类须同。良贱既殊,何宜配合。”故此,“诸与奴娶良人女为妻者,徒一年半,女家减一等。其奴自娶者亦如之。主知情者,杖一百;因而上籍为婢者,流三千里”(《唐律疏议·户婚律》)。“工、乐、杂、官户及部曲、客女、公私奴婢,皆当色为婚。若异色相娶,律无罪名,并当违令,各改正”(《唐律疏议·诸杂户不得与良人为婚》)。明清两代不仅沿袭了唐律关于良贱禁止通婚的规定,明律还专门辟有“良贱为婚姻”的条文,良贱通婚不仅贱民本人要受罚,主人若有责同样要治罪。“凡家长与奴娶良人者,杖八十。女家减一等。不知者不坐。其奴自娶者,罪亦如之。家长知情者,减二等。因而入籍为婢者,杖一百。妄以奴婢为良人,而以良人为夫妻者,杖九十。各离异改正”(《大明律·婚姻》)。《娶乐人为妻妾》条规定:“凡官吏娶乐人为妻、妾者,杖六十,并离异;若官员子孙娶者,罪亦如之”(《大明律·婚姻》)。清律也认为,良贱通婚有辱良民,“婚姻配偶义取敌体,以贱娶良,则良者辱也”。因此,“凡家长与奴娶良人为妻者,杖八十;女家减一等。不知者不坐。其奴自娶者,罪亦如之。家长知情者,减二等。因而入籍为婢者,杖一百。若妄以奴婢为良人而与良人为夫妻者,杖九十(妄冒,由家长,坐家长;由奴婢,坐奴婢)。各离异改正”(《大清律例》卷十《户律·婚姻》)。

    历代统治者之所以对贱民有如此苛刻、侮辱和非人的法律规定,归根到底是因为不把贱民当作人看待,而视其为工具、物产和资财。唐律明言的“奴婢贱人,律比畜产”,道出了中国历史上贱民群体的共同本质。因为本质上没有把贱民当作人,而是把他们视作“会说话的工具”,因而贱民的人身自由、生命安全和人格尊严等基本人权便被残酷地剥夺。正因为实质上被当作是所有者的工具、物产和资财,所以贱民便可以被主人合法地买卖、转让、没收:“奴婢皆同资财,即合由主处分”(《唐律疏议·户婚三》)。一旦主人犯罪,其奴仆因视为财物反而不用受到连坐,可以像其他财物一样被籍没分配。“诸谋反及大逆者,皆斩……若部曲、资财、田宅并没官”(《唐律疏议·贼盗一》)。

    二、历史上的各类贱民

    贱民的历史在中国源远流长,从文字记载和考古发现来看,贱籍制度几乎与早期国家同步。这一点符合马克思主义史学的主流理论,即人类在进入文明社会之前,经历了原始社会和奴隶社会。最早的贱民脱胎于奴隶,贱民制本质上是奴隶制的残余。

    夏商周三代是中国历史上文字记载的最早王朝,也是中国的早期国家形态。分别记载夏朝和商朝政治军事制度的《甘誓》和《汤誓》中均出现了“孥戮”的概念,据训诂学家考证,这里的“孥”同“奴”,说明夏商时期已存在“奴婢”。清代学者江声注释《甘誓》曰:“‘孥’或为‘奴’,当从‘奴’,谓有罪而没为奴也。或奴,或戮,视其所犯”(《尚书集注音疏》卷三《夏书》)。另一位清代学者段玉裁也认为“孥”与“奴”在上古时代是通假的:古“奴婢”“妻孥”字,皆作“奴”。“孥”字是俗称,《尚书》原文只作“奴”。“其实‘孥子之孥’两‘孥’字,亦当正为‘奴’,古子女奴婢统称奴,其既也假‘帑’为‘奴’字,其后又制‘孥’为之”(段玉裁:《古文尚书撰异》)。孔子在论及商代的三位杰出“仁”者时,提到了其中的箕子曾经为“奴”,这也间接证明商代奴婢的存在:“微子去之,箕子为之奴,比干谏而死,殷有三仁焉”(《论语·微子》)。《周礼》关于奴婢的记载相当多:《秋官》曰,“其奴,男子入于罪隶,女子入于舂槀。凡有爵者,与七十者,与未龀者,皆不为奴”(《秋官司寇·司民/掌戮》)。《大宰》曰,“八曰臣妾,聚敛疏材”,东汉经学家郑玄说,“臣妾,男女贫贱之称者,或奴戮之余允,或背德之质子,晋惠之男女皆是”(《周礼注疏·正义序》)。《周礼》在详细分述“治官”“宫正”“宫伯”“膳夫”“庖人”等50余种职业时,包含了大量的“胥”“徒”等奴仆群体,甚至其中提及的“女酒”“女浆”“女幂”“女祝”“女工”等,据专家考证也均为女奴。

    春秋战国时代,中国政治逐渐进入绝对的君主专制时期;到了秦汉时期,这种绝对的君主专制政治得到逐渐稳固。与此相一致,中国的贱民制度大约在春秋战国至秦汉时期正式形成,并且成为国家法定的重要政治制度。《左传》论及春秋时期鲁国的社会等级时,就出现了“隶”“僚”“仆”“台”等贱民群体:“天有十日,人有十等,下所以事上,上所以共神也。故王臣公,公臣大夫,大夫臣士,士臣皂,皂臣舆,舆臣隶,隶臣僚,僚臣仆,仆臣台,马有圉,牛有牧,以待百事”(《左传·昭公七年》)。西汉王莽说,“秦为无道,置婢奴之市,与牛马同栏”(《汉书·王莽传》)。这说明,在秦王朝时,已经把奴婢视作牛马般的贱民,这一点已为后世出土的秦律等文献所证明。抄录于秦王政时期的《睡虎地秦墓竹简》中的《秦律十八种》就有关于“隶臣妾”和“人奴妾”的专门条款;而形成于秦统一后的《岳麓书院藏秦简》中所载的秦律,则不仅有“隶臣妾”的条款,而且首次在法律条文中出现了“人奴婢”的用语。到了汉代以后,奴婢作为主要的贱民群体已经大量存在,并且以法律制度的形式加以明确规定。例如张家山汉简《二年律令·告律》就明文规定,奴婢不是正常的人而属于财物的范畴:“民欲先令分田宅、奴婢、财物,乡部啬夫身听其令,皆参办券书,辄上如户籍。”奴婢向官方诉讼主人不仅不得受理,而且还要受到“弃市”的极刑:“子告父母,妇告威公,奴婢告主、主父母妻子,勿听而弃告者市。”汉以后的唐宋明清历代大体沿用了秦汉的良贱律,以国家法律的形式将贱民群体打入另类,被剥夺基本的人权。从此以后,贱民群体一直伴随着中国传统的专制政治而长期存在,但其表现形式及构成在历史上却有所不同。中国历史上出现过的贱民群体主要有奴婢、部曲、娼优、佃仆、乐户、丐户、疍户、皂隶、堕民等。

    1.奴婢。在中国的贱民演化史上,奴婢是典型的贱民,也是出现最早、数量最庞大、存续时间最长、分布范围最广的贱民群体。奴婢是“男奴女婢”的通称,又常常被称为“奴仆”“家仆”“家奴”“人臣”“人妾”“家僮”“丫鬟”“丫头”“使女”“苍头”“驱口”“驱奴”等。根据其隶属或所有关系,奴婢又可分为官私两类,为朝廷官衙所拥有的为官奴,为家庭私人所有的则是私奴,官奴和私奴在一定条件下可相互转换。“如官奴婢往往被皇家或官府当作赏赐品赐予下属官吏,从而变成了私奴婢;原是私奴婢者,也有因主人犯罪,其家属和奴婢没官,而转变成为官奴婢者。”一般认为,奴婢是奴隶制度的残余,因而在战国后期和秦汉早期奴婢就作为一个特殊群体而大量存在了。史载,战国末期秦国大臣吕不韦和嫪毐的私奴婢就数以万千计:“不韦家僮万人,嫪毐家僮数千人”(《史记·吕不韦传》)。秦汉之后,官私奴婢的数量不断增加。汉代的“官奴婢十万余人”(《汉书·贡禹传》),唐代仅宫廷的官奴婢就有10万多人,私奴婢的数量则更为庞大。唐太宗的儿子越王李贞,“家僮千人”(《旧唐书·越王贞传》),大臣冯盎更甚,拥有“奴婢万余人”(《旧唐书·冯盎传》)。地方官僚和豪富巨贾蓄奴成风。如,广州刺史胡证,“善蓄积,务华侈,厚自奉养,童奴数百”(《旧唐书·胡证传》),京师巨富王宗,“侯服玉食,僮奴万指”(《旧唐书·王处存传》)。历史上不少朝代对奴婢的数量曾经做出过各种限定,因为奴婢规模过大,在一定程度上会削弱社会生产力并减少政府的税收。例如,汉时曾规定:“诸侯王奴婢二百人,列侯公主百人,关内侯吏民三十人”(《汉书·哀帝记》)。唐代规定得更为详细:“王公之家不得过二十人;其职事官,一品不得过十二人,二品不得过十人,三品不得过八人,四品不得过六人,五品不得过四人,京文武清官,六品不得过二人,八品九品不得过一人”(《唐会要》卷八六《奴婢》)。清朝亦有蓄奴的定制:“旗下督抚家口,不得过五百名,其司、道以下等官视汉官所带家口,准加一倍”(《清圣祖实录》卷二〇八)。然而,是否拥有奴婢,以及拥有多少奴婢,是专制政治下等级特权的体现,一般的制度规定难以有效约束权贵家庭的蓄奴之风,历代关于蓄奴的限定很大程度上形同虚设。例如,直至中国历史上最后一个存在合法贱民的清朝,权贵家庭成百上千地蓄奴仍是十分普遍的现象。有清一代,“仕宦之家,僮仆成林”。乾隆宠臣和珅,“供厮役者,竟有千余名之多”(《清仁宗实录》卷三七)。不仅督抚大员奴婢成群,甚至七品州县之官也“多置僮仆以逞豪华,广引交游以通声气,亲戚往来,仆从杂沓,一署之内几至百人”。

    2.部曲。作为贱民群体的部曲,源于南北朝,主要盛行于唐代。部曲原泛指军队士兵,后来则专指私家军队。“部曲”一词在东汉末、三国、西晋时代的历史文献中已经常出现,泛指部队、军队、队伍和士兵。但在当时,“无论是官方部队还是私家士兵,都可以用部曲一词表示”。然而,随着历史的演进,部曲一词逐渐更多指私家军队,再从私兵进而蜕变成为私家仆人,成为有别于“良人”的“贱人”。到了唐代,部曲已成正式制度规定的贱民群体。清末民初的沈家本和何士骥等曾对部曲做过专门的考证。沈家本认为,从三国至周、隋三百多年间,兵祸战乱不绝,地方将吏纷纷拥私兵以自重。“第其初,部曲虽供役私家而尚未沦于卑贱,故别于奴婢,而不混为一等。洎乎朝移代易,荣悴不齐,此等人不供役公家,不系户籍,其妻儿衣食仍仰给私门,而部曲之称犹袭畴昔,于是杂户、官户之外遂有一项名目矣。”何士骥也认为,部曲源自东汉三国时期的私兵,并逐渐从私兵蜕变成为供主人役使的贱人。但何士骥和浜口重国都认为,在南北朝时部曲已经完成了从私兵向贱人的转变。部曲的女性眷属则称为“客女”,“客女,谓部曲之女”(《唐律疏议》卷二),从事“典型的奴隶劳动”,在《唐律》中亦被列入“贱人”。宋代关于部曲的文献记载已经不多,因而也有专家断定,“部曲作为一个贱民阶层,在宋代已不存在”。虽然部曲在宋代最后逐渐消亡,但至少从法律制度来看,宋初仍然存在作为贱民群体的部曲。《宋刑统》沿袭《唐律疏议》仍有不少关于部曲的条款,例如,宋初的《户婚律》也如唐律一样规定:“诸奴婢诈称良人,而与良人及部曲、客女为夫妻者所生男女并从良,及部曲、客女知情者,从贱。即部曲、客女诈称良人,而与良人为夫妻者,所生男女亦从良;知情者从部曲、客女。皆离之。其良人及部曲、客女被诈为夫妻,所生男女经一载以上不理者,后虽称不知情,各同知情法”(《宋刑统》卷十四《户婚律》)。

    3.杂户。杂户是四民之外从事“百工伎巧”等各类社会贱业的贱民群体之一,通常认为源自北朝,而特别盛行于唐代,是唐代贱民阶层的重要组成部分。虽然学界对作为贱民阶层的“杂户”何时形成尚有争议,但通常认为,“北魏时期存在一种专门服务于官府不同部门的杂户,它主要由隶户、屯户、兵户、营户、牧户、乐户及佛图户诸户构成。北魏杂户不是某一特定人口,而是一种社会群体或社会阶层的专称,且相对于当时的编户齐民,他们处于社会的底层,身份和地位近似于奴隶”。据一些专家考证,杂户之名北魏之前就出现于典籍律令之中,但通常是指“杂役之户”,从事官府的各项劳役;也指“异族”“部族”等繁多的含义,其地位低于一般庶民,但仍属于良民群体。但在北魏分裂后的西魏和北周年间,“杂户”一词的含义发生了重大变化,从良民阶层变为贱民阶层了。北魏以后,“杂户”作为贱民群体正式形成,恰如其称谓所示那样,其含义确实十分庞杂。有些专家将魏晋南北朝时期的杂户、营户、盐户、金户、乐户、僧祗户、屯户、牧户、新民、府户、城民、驿户、伎作户、百工技巧、绫罗户、丝绸户、匠户等通称为“杂户”。按魏律和唐律的规定,杂户属于官贱民的一类,非为私属,不得列为普通民籍,而由州县单列贱籍。“杂户者,前代犯罪没官,散配诸司驱使,亦附州县户贯”(《唐律疏议·户婚上》)。“杂户者,谓前代以来,配隶诸司,职掌课役,不同百姓。依令老免、进丁、受田,依百姓例,各于本司上下”(《唐律疏议·名例三》)。

    4.官户。官户是籍没的官奴婢,是官贱人的一类。与杂户不同的是,官户仅限于朝廷衙司,不属地方州县。唐律载:“官户者,亦谓前代以来,配隶相生,或有今朝配没,州县无贯,唯属本司”(《唐律疏议》)。官户主要从事各种苦力型的劳作,因其“分番输作,又称番户”。“诸律令格式有言官户者,是番户之总号,非谓别有一色”(《唐六典·刑部尚书》)。据考证,作为贱民群体的官户,最早出现于隋朝。在隋朝,官贱人中已正式确立了“官户”的类别,并在某种程度上承担了杂户的义务,而隋朝的“官户”之名又沿袭自陈朝。到了唐朝开元年间,法律已将官户与奴婢、工户、乐户、杂户和太常音声人等六类人一同列为“官贱人”。作为唐代重要的贱民群体,官户归属刑部都官曹管辖,但其劳作则主要分配到司农寺。“凡诸行宫与监、牧及诸王、公主应给者,则割司农之户以配”(《唐六典·刑部尚书》)。官户女奴主要给达官贵人家庭提供侍役,“官户奴婢有技能者配诸司,妇人入掖庭,以类相偶,行宫、监牧及赐王公、公主皆取之。凡孳生鸡彘,以户奴婢课养”(《新唐书·百官志三》)。而官户男奴则主要从事农业生产和放牧业,并配给一定数量的农田和牲口,“诸官户受田,随乡宽狭,各减百姓口分之半。其在牧官户、奴,并于牧所各给田十亩。即配戍镇者,亦于配所准在牧官户、奴例”(《天圣令·田令》)。上述律令提到的“官户、官奴都是唐代的贱民”,两者的区别在于“丁、官户是分番的,而官奴则无番”。作为重要贱民群体的官户,唐代之后基本上不复存在。到了宋代,“官户”之名仍在,但其意义却发生了颠覆性的变化,从原先的下层贱民变成了上层权贵。北宋中期的“官户”指的是“品官之家,谓品官父祖子孙及同居者”,且唯有以军功入仕或“至士大夫以上方有资格作官户”。

    5.乐户。顾名思义,乐户就是从事音乐舞蹈职业的群体,故又称“乐工”“乐人”“乐籍”。音乐舞蹈是人类生活不可或缺的内容,伴随着有文字记载的整个人类历史。商周、春秋、战国和秦汉时期,已有大量关于礼乐舞蹈的文献,但尚无将乐舞当作贱业的记载。法律条文明确将“乐户”列入贱籍始于北魏,魏律载:“有司奏立严制:诸强盗杀人者,首从皆斩,妻子同籍,配为乐户;其不杀人,及赃不满五匹,魁首斩,从者死,妻子亦为乐户”(《魏书·刑法志》)。北魏后,中国历史上的绝大多数时间中乐户便作为贱民阶层而存在,成为存续时间最长的贱民群体之一。乐户以“贱民”身份活跃在宫廷、军旅、地方官府、寺庙和民间,“从北魏时期发端,到清代雍正年间被禁除,前后经历了一千四百余载”。唐代作为贱民的乐舞职业者分为两个群体,即“乐户”和“太常音声人”,前者籍在朝廷的太常寺,后者籍属州县。“工乐及官户奴,并谓不属县贯,其杂户太常音声人有县贯”(《唐律疏议·贼盗一》)。但“乐户”和“太常音声人”两者本质相同,均属贱民:“工、乐者,工属少府,乐属太常……‘太常音声人’,谓在太常作乐者,元与工、乐不殊”(《唐律疏议·名例三》)。总之,音声人作为单独的一类,与官户、杂户是有区别的,但“其地位绝对低于良人”。有些研究者认为,乐户的地位在宋元时有明显提升,甚至在宋代已不属于贱民阶层。而在元代,出现了一个不属于贱民阶层的“庶民乐户”,即“礼乐户”。“他们不仅享受着正常人的权利,可以应试、做官,甚至还有免除赋役的特权。”不过,更多的研究表明,“乐户”在北魏以后的中国传统社会中长期属于“四民”之外的贱民阶层,特别是在明代,“乐户”的数量剧增,而其社会地位则极其低下,“没有哪个时代的乐户比明代更为低贱”。

    6.倡优。中国古代作为贱民阶层的乐户,在相当程度上与娼妓是重合的。在中国最早的古代典籍没有“娼”只有“倡”,而“倡”与“乐”相通。如“《说文》没有‘娼’字,梁顾野王《玉篇》上始有‘娼’字,并说:‘娼,婸也’。婸字作何解?《说文》说:‘婸,放也,一曰淫戏’。宋丁度《集韵》说:‘倡,乐也,或从女’。明人《正字通》说:‘倡,倡优女乐,别作娼’”。由此可见,“古代娼女起源于音乐。所以后世娼女虽以卖淫为生,而音乐歌舞,仍为她的主要技术”。从语源学上看,娼妓与乐舞这两种职业有着内在的联系,林语堂甚至认为,中国的娼妓继承着音乐的传统,没有娼妓就没有音乐。娼妓以出卖自己的肉体为职业,无疑属于中国传统社会最低贱者的行列,毫无例外地被历朝的法律制度打入贱籍。然而,中国历代的法律条文中,很少明确将娼妓单独列为贱籍。之所以这样,主要原因应该就是如上述所言,中国古代法律语境中的“乐户”很大程度上包含了“娼妓”。王书奴说,“‘女乐’这种人物,一方面牺牲色相,他方面也可谓出卖肉体,实为‘巫娼’演进之产物”。《魏书》所谓“‘乐户’,即‘女乐’的化名”,“女乐”与“娼妓”实为“一途”。另据一些专家考证,古代娼妓与专业歌舞女艺人名称上通用。“如对‘妓籍’‘伎籍’‘娼籍’‘倡籍’‘花籍’检索,发现其与‘乐籍’相通,吴梅说‘伎女’从良,则脱‘乐籍’;从四库全书检索‘妓乐’一词的数量结果占‘妓’字检索结果的22%,说明中国古代传统社会的娼妓是专业歌舞女艺人。”根据经君健的研究,在明清两代,“乐户”与“娼妓”同类。例如,明景泰八年有议:“凡良家妇女不许教坊司买作倡优,民户为乐户者皆令改正。”而在清代,朝廷废除教坊司的乐籍后,山西等地仍保留不少“乐户”户籍,这些“乐户”仍是“娼妓”,被当地视为“贱之甚者”,“不齿于齐民”。

    7.胥吏。作为贱民群体的胥吏,是官贱人的一种,主要在衙门和高官家庭从事低贱的劳役,其主体是各类衙役、差役、隶卒、皂隶、长随和家人。胥吏、隶卒是国家政权不可缺少的组成部分,因此这一阶层随国家政权而产生,具有悠久的历史。《左传》所描绘的鲁国昭公时期的胥吏阶层就已经十分复杂:“士臣皂,皂臣舆,舆臣隶,隶臣僚,僚臣仆,仆臣台”(《左传·昭公七年》)。沈家本在总结历代刑法时,对属于胥吏阶层的隶卒做过详尽的分类,从先秦的司隶、罪隶、蛮隶、奚隶、臣隶、臣妾等,到汉魏至唐宋明清的皂隶、民隶、徒隶、胥隶等,虽名称各异,但内容大体相同:“隶,贱官”也;“隶,贱臣”也;“隶,奴也,贱也,役也”。作为在中央与地方政府机关中从事衙役的这个胥吏阶层,在中国历史上的各个朝代中都处于非常低贱的地位,大体上均属于“四民”之外的贱民阶层。有学者指出,虽然这个阶层在今天看来属于“公务员”的范畴,但在历史上实际履行着“官奴婢”的职能。“官署中的低级公务员由官奴婢担任,其工作受到歧视,列为贱业,变成中国历史上的特殊传统,残留了几千年之久。这些工作统称为‘吏’的工作。吏又称‘皂吏’‘隶吏’‘青吏’,都表示其职业之卑贱及其从业者身份之低下。皂、隶直接点明其奴隶身份。”衙门中的胥吏、役差虽然地位类同贱民,但不少研究者认为在明清之前的历朝法律制度中,很少有明确的条款将其列入贱籍的。但明清之后,胥吏衙役群体被列入制度性的贱民阶层则是明确无误的。例如《清会典》明确规定,衙门中的“隶卒为贱”。“衙门应役之人,除库丁、斗级、民壮仍列为齐民外,皂隶、马快、步快、小马、禁卒、门子、弓兵、仵作、粮差,及巡捕番役,皆为贱役。”

    8.佃仆。佃仆是一种区域性的贱民,分布于明清时期的安徽、江苏、浙江、江西、湖南、湖北、福建、广东、河南等地。佃仆制源于何时,历史学家并无明确答案,但多数研究者认为,佃仆制至少在明代以前就存在了,明清时期已在许多地方流行。有些认为源自东晋南朝,有些认为源于唐宋时期。有人考证,“佃仆”的称呼在北宋时就出现了,盛行于南宋并且一直延续到元明清以后,“累世相承,遂不得自齿于齐民”。佃仆在不同区域和不同时期,有各种不同的称呼,如佃民、地仆、庄仆、庄人、住佃、火佃、庄佃、细民、伴当、世仆等。一般认为,安徽的徽州是佃仆制流行的典型地区,以致对徽州佃仆的研究成为中国历史学界,特别是中国经济史研究界的一个引人注目的领域。但也有人认为,作为明代独具特色的土地占有关系,佃仆制虽盛行于南方各省,“而江西尤为突出和盛行”。作为贱民群体的佃仆,其本质特征即是其奴仆身份,不得与四民相齐,从而不享有普通民众的基本权利。佃仆首先是主人的奴仆,同时也是主人的佃农。如清律明确规定,佃仆是“奴而兼佃户者,即退佃而名分永存”。“佃仆和地主具有主仆名分,是人身依附强固的标志,也是佃仆区别于一般佃户的重大特征。主仆名分是终身的关系,而且延及子孙,世代相承,经‘数十世不改’。”这种双重的人身依附关系常常以佃仆与主人之间的契约形式得以确立,并且由国家的法律条文加以保障,永世不得改变。作为奴仆,为主人服役是佃仆分内的工作,从服侍主人的衣食住行,到服务主人家的婚丧嫁娶;作为佃农,佃仆还要为主人家从事生产劳动,从耕种田地到经商买卖等。鉴于佃仆身份和劳役的这种双重性,有的专家认为这是由于将大量奴仆用于农业生产,从而使“佃农奴仆化”的结果。因此,佃仆是一个不同于奴婢而接近奴婢,不同于佃户和雇工人,但又不属于良人的特殊贱民阶层。

    区域性的贱民除了佃仆外还有很多,比较有代表性的有江浙的“堕民”或“丐户”、浙江的“九姓渔户”和广东沿海一带的“疍户”。堕民又称堕贫、惰民、惰贫、大贫、小姓、轿夫、丐头、丐户等,最早出现于南宋,盛于元明清的浙江和江苏部分地区。堕民的服侍对象称“主顾”或“脚埭”,两者之间形成人身依附性的主仆关系。“九姓渔户”或“九姓渔民”亦称“江山船”,自称“船浪人”,主要存在于浙江和江西的水乡,尤其是聚居于浙江的衢江、东阳江、桐江以及富春江流域,这些船户因陈、钱、林、李、袁、孙、叶、许、何九姓得名。九姓渔户以捕鱼为业,女子也常兼以卖淫为生。疍户或疍民,亦作蜑户、蛋户。“疍”,古时又作“蜑”“蛋”“蜒”,因而疍户又有别称蜑族、蛋民、蜒户等。主要分布在广东、福建、广西沿海地区,台湾和浙江也有分布。与江浙的九姓渔户非常类似,疍户也主要从事水上的捕捞业和采珠业等,不少疍户女子亦被迫卖淫为生。一方面,堕民、疍户和九姓渔户被社会排斥于“四民”之外,他们与其他贱民一样被粗暴剥夺作为普通民众的基本权利;另一方面,从制度层面上说,他们又不像其他贱民群体那样有明确的法律条文规定,因而,有些专家亦称这类区域性贱民为“习惯型贱民”。

    三、贱民制度与中国专制政治

    贱籍制度与中国传统专制政治有着内在的联系,对巩固绝对君主专制发挥着特殊的功能。作为一种特殊政治存在的贱民等级,不仅是中国君主专制不可缺少的政治基础,而且是中国专制政治体系中超稳定的结构性要素。

    贱籍制度是专制社会等级秩序的产物,是专制政治结构不可缺少的组成部分。专制政治的结构基础就是等级秩序,专制政治越发达,等级结构就越复杂。中国传统政治的本质,是绝对的君主专制,或称王权政治。王权政治也是一个社会结构体系,君主处于整个社会结构的顶端;王权是至高无上的权力,王权体系在社会结构体系中占据主导地位。“臣民在社会与历史上只能为子民、为辅、为奴、为犬马、为爪牙、为工具。”相对于皇帝而言,其他所有子民都是“臣仆”或“奴才”。中国传统社会中作为皇帝“子民”的主体,即是所谓的“士农工商”四民,这些“子民”自身也构成一个庞大复杂的等级结构体系,其中“士”居于“子民”结构体系的顶端。作为中国士大夫阶层主体的各级官僚,自身也是一个复杂的等级体系,即所谓“九品中正”制,拥有朝廷品秩的官员就多达十八个层级。士尚且如此,其他子民自无可逃遁于等级秩序体系之外。政治等级在传统社会意味着政治秩序,在子民中间划分等级,根本目的就是为了便于统治。对此,西周和先秦的文献就已有明确表述。例如,《逸周书》就认为,如果没有必要的等级秩序,不仅社会的正常生活无法维持,人们之间也必然会发生各种利益冲突,最终导致相互残杀。如果人群之间为了争夺利益而发生战乱,那么,人们就不可能安居乐业,统治者也无法驾驭民众。“凡民不忍好恶,不能分次。不次则夺,夺则战;战则何以养老幼,何以救痛疾死丧,何以胥役也”(《度训解第一》)。荀子也说得很明白,先王之所以区分贵贱富贵,就是为了防止混乱失控:“先王恶其乱也,故制礼义以分之,使有富贵贫贱之等”(《荀子·王制篇》)。《左传》所描述的王权体系,实际上就是一个复杂而完备的等级秩序体系,它建立在君王为顶端、贱民为低端的结构体系之上:“封略之内,何非君土。食土之毛,谁非君臣?故《诗》曰:‘普天之下,莫非王土。率土之滨,莫非王臣。’天有十日,人有十等,下所以事上,上所以共神也。故王臣公,公臣大夫,大夫臣士,士臣皂,皂臣舆,舆臣隶,隶臣僚,僚臣仆,仆臣台,马有圉,牛有牧,以待百事”(《左传·昭公七年》)。

    贱籍制度的存在,是中国传统特权政治的社会等级结构基础。从上面《左传》的这段引文和其他记载中可以清楚地看到,不仅普通民众之间须“明贵贱,辨等列,顺少长”(《左传·隐公五年》),而且贱民之间也还有不同的等级之分。为便于政治统治,在贱民这个最低端的社会阶层中再划分出不同的等级,贱人中间还有“高级贱人”与“低级贱人”之分,这正是从先秦至明清的贱籍制度的共同特征。如果“皂”以下为奴仆的话,那么《左传》所列的先秦奴仆便有五个等级。唐律的相关规定同样清楚地表明,不同的贱民群体之间存在着严格的等级差别:“诸部曲殴伤良人者(官户与部曲同),加凡人一等。奴婢,又加一等。若奴婢殴良人折跌支体及瞎其一目者,绞;死者,各斩”(《唐律疏议·斗讼二》);又规定官贱人升为良人须经过几个等级:“一免为番户,再免为杂户,三免为良人”(《唐六典·刑部尚书》)。直到清王朝,贱民阶层内部的等级差别依然十分明显。据经君健的研究,从法律地位、政治地位、社会地位和经济地位的综合考察来看,清代的贱民可分为四个等级:奴婢、娼优和乐户是最低级的贱民群体,是“贱民中的贱民”;堕民、丐户、疍户和九姓渔户是比奴婢地位稍高的倒数第二个贱民等级;佃仆虽没有独立的人格,却因从事生产劳动而接近佃户,因而地位比前两个贱民群体更高些;隶卒和衙役、家人、长随直接服侍官府,是官僚的爪牙,其地位在贱民中最高,属于贱民中的“统治阶级”。从政治文明的角度看,社会的进步程度直接体现为政治上的平等程度。政治上的等级差别越大,表明社会的专制程度越高,而政治文明的程度则越低。在中国传统的专制政治条件下,处于等级秩序顶端的君主不仅拥有至高无上的王权,而且以皇帝为代表的统治阶级还拥有超常的政治经济特权。从某种意义上说,皇帝为代表的统治阶级的超常特权,正是建立在剥夺大量贱民群体的基本权利这一基础之上的。换言之,统治阶级的超级特权体制,是以贱民阶层完全丧失其基本人权为代价的。

    贱民群体的产生是政治镇压的结果,贱籍制度本身就是赤裸裸的国家暴力制度。按照马克思主义的国家理论,国家本质上是一种暴力机器,是一个阶级统治另一个阶级的暴力工具。“到目前为止,一切社会形式为了保存自己都需要暴力,甚至有一部分是通过暴力建立的。这种具有组织形式的暴力叫做国家。”从国家的历史发展进程来看,这一判断无疑是极为深刻的。为了夺取和巩固国家政权,历史上的各种政治势力集团最终都会毫无例外地使用军队等暴力工具,对敌对势力进行残酷的镇压和杀戮,并运用暴力手段将被统治阶级牢牢控制在既定的政治秩序之下。中国历史上贱民群体的形成,有力地证明了马克思主义的上述论断。大量可靠的历史文献记录表明,贱民群体的来源虽然多种多样,但贱民阶层的主体来源就是国内外战争中被战败的俘虏、国内政治斗争中被镇压的敌对集团成员,以及受到统治阶级法律惩罚的形形色色罪犯。

    历代的文献记载表明,将大量的俘虏分赏给将帅大臣为奴,是王朝征服敌人的常用手段。恩格斯说:“战争提供了新的劳动力,俘虏变成了奴隶。”把战争中的俘虏当作法定的奴仆,既可以增加战胜方的初级劳动力,又可有效防止这些昔日敌对力量的反抗。因此,将战争中的俘虏当作奴仆,是世界历史上早期国家的通例,中国当然也不例外。现代汉字中的“虏”源自甲骨文,本意即是战争中的俘虏:“虏,获也”(《说文》),后引申为“奴隶”和“奴仆”。俘虏是奴婢等贱民群体的最早来源,这一点在先秦时代是十分清楚的。睡虎地秦简的法律就有明确的条文:“寇降,以为隶臣”(《睡虎地秦墓竹简》,第89页)。甲骨文、金文和竹简关于降寇的大量记载表明,战争中的俘虏是奴婢隶臣等贱民群体的主要来源。汉唐以后国家政权日益稳定,战争俘虏不像先秦时代那样众多,但仍是贱民的重要来源。班固在《汉书》中还把“奴”与“虏”并连在一起:“齐俗贱奴虏,而刁间独爱贵之。桀黠奴,人之所患,唯刁间收取,使之逐鱼盐商贾之利”(《汉书·货殖传》)。别人都怕凶狠狡黠的“奴虏”,但齐地的刁间却善于使用“奴虏”来发财致富。有的专家认为,在唐朝的对外战争中,“有关俘虏对方人口的记录虽然很多,但除了少数是用以‘献俘’,一部分予以释放外,只有在某些战役中的俘虏才被没为奴隶,而其中的绝大多数俘虏,究竟如何处理,往往并无明确交待。这说明唐代的对外战争,已经不以掠夺奴隶为其主要目的。因此说,俘虏只是唐代官属奴婢的来源之一,而不是其主要来源”。尽管如此,还是有不少的文献明确记载,即使在唐代,战争中的大部分俘虏仍是贱民的重要来源。历次对外战争中抓获的众多俘虏,有些转为奴婢成为官贱民,有些分赐给大臣成为私贱民。唐律规定“凡俘馘,酬以绢,入钞之俘,归于司农”(《新唐书·兵志》)。俘虏成为农奴,是王朝的常态;而战俘赦为良民,恰恰是少数的例外。《旧唐书》的一则记载即是明证:“初,攻陷辽东城,其中抗拒王师,应没为奴婢者一万四千人,并遣先集幽州,将分赏将士。太宗愍其父母妻子一朝分散,令有司准其直,以布帛赎之,赦为百姓。其众欢呼之声,三日不息”(《旧唐书·高丽传》)。明清两代在这一点上更是有过之而无不及。例如,明灭元,凡蒙古部落子孙流寓中国者,令所在编入户籍。其在京省,谓之乐户,在州邑,谓之丐户。又如,顺治帝将满清入关时俘获的近百万青壮年称为“血战所得人口”,作为犒赏将其中部分俘虏分赐给将帅为奴:“或有因父战殁而以所俘赏其子者;或有因兄战殁而以所俘赏其弟者”(《清实录》第3册)。

    将敌对政治集团成员贬为贱民,剥夺其基本的尊严和权利,防止敌对力量的复辟和反抗,是传统社会中政治镇压最常用的残忍手段。从传说中的“三代”原始国家政权到宋元明清的中国历代王朝,都毫无例外地将直接针对君主政权的反抗行为称为“谋反”“大逆”,列为“十恶不赦”的重罪之首。除了主犯处斩处绞之外,其余家属则籍没为奴,成为历代贱民群体的主要来源之一。《隋书》载:“其谋反、降叛、大逆以上皆斩。父子同产男,无少长,皆弃市。母妻姊妹及应从坐弃市者,妻子女妾同补奚官为奴婢”(《隋书·刑法志》)。《魏书》载:“大逆不道腰斩,女子没县官”(《魏书·刑法志》)。唐律载:“诸谋反及大逆者,皆斩;父子年十六以上皆绞,十五以下及母女、妻妾(子妻妾亦同)、祖孙、兄弟、姊妹若部曲、资财、田宅并没官,男夫年八十及笃疾、妇人年六十及废疾者并免”(《唐律疏议》卷十七)。后来的宋元明清历朝法典,基本都沿袭了上述规定,将被镇压的敌对政治集团成员或直接处死,或籍没为贱民。即使被誉为“盛世”的唐朝,也同样需要运用残酷的贱民政治来巩固和维护政权。滨口重国在详细梳理唐武德至开元年间包括“玄武门之变”“房遗爱事件”“长孙无忌事件”“越王贞事件”和“太平公主事件”等上百起“谋反”与“大逆”事件后指出,这些事件中被籍没为“官贱人”等奴仆的被镇压政治集团成员,数量最多估计有20万人左右,中位数也在10万人左右。浙江堕民的来源相传有五种不同说法,即“宋焦光赞部曲说”“蒙古后裔说”“赵宋皇室后裔和忠臣说”“反抗洪武的忠臣义士说”以及“项羽余部说”。明朝的徐渭说,“丐以户称,不知其所始,相传为宋罪俘之遗,故摈之,为堕民。丐自言则曰,宋将焦光赞部落,以叛宋投金故被斥”。鲁迅也说,小时候听说堕民是宋朝降将后代,但后来他怀疑了:“他们的祖先,倒是明初的反抗洪武和永乐皇帝的忠臣义士也说不定。”不难发现,上述五种观点中无论哪一种,都与政治斗争和政治镇压相关。

    在利用贱民政治来无情摧毁敌对政治力量方面,明朝堪称典范。大明律规定:“凡谋反及大逆,但共谋者,不分首从,皆凌迟处死。祖父、父、子、孙、兄弟及同居之人,不分异姓,及伯叔父、兄弟之子,不限籍之异同,年十六以上,不论笃疾、废疾皆斩;其十五岁之下,及母女、妻妾、姊妹,若子之妻妾,给付功臣之家奴”(《大明律·刑律》)。不仅如此,为了防止可能出现的政治反抗,《大明律》还专门增设奸党条,运用连坐与贱民制度严厉禁止臣下结党和内外官员交结。吏律规定,“若在朝官员,交结朋党紊乱朝政者,皆斩,妻女为奴,财产入官”,“内外官员相互勾结者,皆斩,妻子流二千里安置”(《大明律·吏律》)。为了削弱相权,消除可能出现的政治威胁,朱元璋制造了一系列令人发指的政治迫害事件,其中尤以“胡惟庸案”和“李善长案”为甚,创造了中国历史上连坐之最。胡惟庸案连坐人数高达3万余人,除了丞相胡惟庸本人及其成年亲属被处死外,其余均被籍没为奴。民间相传,江浙贱民“九姓渔户”最初也是朱元璋对敌对势力政治镇压的产物,“九姓渔户为明初与朱元璋争天下的陈友谅的部属,明朝建立之后,其子孙九族贬入舟居,以渔为生,改而业船”。明成祖朱棣全面继承了其父的血腥传统,在发动靖难之役夺得皇位后,对建文帝旧部进行无比残酷的政治清算。《明史》有载:“成祖起靖难之师,悉指忠臣为奸党,甚者加族诛、掘冢,妻女发浣衣局、教坊司,亲党谪戍者至隆、万年间犹勾伍不绝也。”朱棣不仅处死建文帝的所有干将,将建文帝其余旧部贬为贱民,而且对其极尽羞辱,将其妻女统统贬为倡优,或被送入教坊司、浣衣局,或被充宫廷乐户成为官贱人。

    将罪犯及其连坐的家属籍没为奴婢贱民,是中国最早的政治法律制度之一,并贯穿于整个中国传统社会。《周礼》就有罪犯为奴的条款:“其奴,男子入于罪隶,女子入于舂槀。凡有爵者,与七十者,与未龀者,皆不为奴”(《周礼·司寇》)。汉郑玄对此的注释则更加清楚:“谓坐为盗贼而为奴者,输于罪隶、舂人、槁人之官也。由是观之,今之奴婢,古之罪人也”(郑玄:《周礼注疏》卷三十六)。汉律也规定:“罪人妻子没为奴婢,黥面”(《三国志·魏志·毛玠传》)。从历代法律的成文规定来看,贱民的主要来源是朝廷的罪犯,许多专家也据此认定贱民群体主要源于各类罪犯。从表面上看,这样的判断无疑是对的。一是因为国家的法律本质上体现了统治阶级的意志,掌握政权的统治者总会尽量运用法律的手段,首先将其镇压对象的行为列为“谋反”“谋叛”“大逆”等罪行,再判以重罪,从而使其政治镇压行为具有“合法”的外衣;进而将失败的政治对手打入贱籍,使其永世不得翻身。二是因为国家的统治者要有效维护政权,除了维护政治秩序外还必须维护基本的社会公共秩序,这就需要严厉打击杀人盗窃等普遍的犯罪行为,将罪犯打入贱籍便是一种十分有效的手段。由此之故,一方面,所有被镇压的政治集团成员除被处死者外都会被作为罪犯而籍没为奴婢倡优等贱民,历代官修的史书对此都有相当详细的记录;另一方面,除了政治罪犯外,也确实有大量普通的刑事罪犯及其缘坐亲人被籍没为贱民。例如,籍没罪犯为奴贯穿于整个唐代,但由于政治斗争的原因,在初唐和唐后期有大量达官贵人的“家口”以谋反或叛逆罪而被籍没为奴婢。此外,“也有的本无‘反逆’之实,只以酷吏所陷,或因事触犯刑律,或因坐赃、逃亡等等原因,而家口被籍没为奴婢的,在唐代也大有人在”。又如,罪犯及其家口入奴的数量在清朝极大地增加,清朝在继承历代“罪奴”的基础上,又增加了“发奴”这一新贱民群体。清初,入“发遣为奴”的罪行约30多条,到了同治年间增多至103条,诸如“给付功臣之家为奴”“发黑龙江给披甲人为奴”“发新疆给官兵为奴”“发各省驻防官兵为奴”等等。与历代王朝的贱民制度一样,这些罚为奴仆的罪犯分为两类,一类是政治犯,另一类则是普通刑事犯。“给付功臣之家”之奴,多为政治犯:犯谋反、大逆、谋叛、“谋危社稷”和“不利于君”等死罪的连坐家口,包括母女、妻妾、姊妹、儿媳及15岁以下的男性家人。其他“发遣之奴”则为普通刑事罪犯及其连坐的家人。

    作为中华民族政治解放过程的重要内容,废贱为良经历了极其漫长而艰难的历程。从历史文献的记载来看,从贱民群体形成之日起,就产生了反对贱民政治的努力。早在西周,就出现了反对将罪犯家属籍没为奴的呼声。《康诰》曰:“父不慈,子不祗,兄不友,弟不共,不相及也”(《左传·僖公三十三年》),周文王则被认为是“罪人不孥”的代表性人物。孟子说:“昔者文王之治岐也,耕者九一,仕者世禄,关市讥而不征,泽梁无禁,罪人不孥”(《孟子·梁惠王下》)。东汉的毛玠甚至当着皇帝的面说:“将妻子没为官奴婢”是“使天不雨者”的行径,他为此触犯龙颜而遭受了牢狱之灾(《三国志·魏志·毛玠传》)。历史上不仅时有反对贱民制度的呼声,更有一些统治者将废贱为良付诸行动。沈家本详细列举了历代废奴为良的各种尝试,比较重要的有:汉代高祖、文帝、光武、建武均有过免贱为良的举措,如高祖五年诏曰“民以饥饿自卖为人奴婢者,皆免为庶人”,文帝四年“免官奴婢为庶人”;晋、魏、唐、宋、辽、金、元、明亦偶见免贱为良的实例,如唐显庆二年“敕放诸奴婢为良及部曲客女者听之”,宋开宝四年“诏广南有买人男女为奴婢转佣利者,并放免”,金天辅六年“诏奴婢先其主降,并释为良”,金世宗大定二十九年“诏诸饥民卖身已赎放为良,复与奴生男女,并听为良”,明洪武五年诏“诸遭乱为人奴隶者复为民”,明英宗时“谕吏部曰:教坊乐工数多,其择堪用者量留,余悉发为民。凡释教坊乐工三千八百余人”。然而,所有上述这些免贱为良的事例,均是零星而偶发的皇帝“善举”。有些是出于饥荒的原因,有些是为了收买人心,还有一些是为了增加朝廷的税收,而都不是制度性的废贱为良。

    在中华民族废贱为良的政治解放历史进程中,有过三次里程碑式的改革与突破,第一次是清朝雍正年间首次从正式制度层面推行“豁贱为良”;第二次是民国时期,从国家法律上全面废除贱民制度;第三次就是中华人民共和国的成立,不仅从法律上而且从社会经济的现实基础上彻底铲除贱民制度,终结了盛行中国数千年的贱民政治。

    清廷统治中国后,一方面沿袭了中国传统的贱民制度,将大量的战俘和罪犯变为朝廷和贵族的奴仆,另一方面也对贱民制度实行了不少重大改革。例如允许奴婢独立开户,逐步解除开户奴婢出旗为民的禁令,顺治八年废除了教坊司乐户,康熙十二年又下诏裁撤地方乐户,等等。清朝关于贱民制度的突破性改革,则是雍正年间一系列的“豁贱为良”政策。这一重大政治改革,首先从废除山西和陕西的乐户开始。雍正元年(1723)三月,监察御史年熙上奏曰:“山、陕两省乐户另编籍贯,世代子孙勒令为娼。绅衿地棍呼召即来侑酒。间又一二知耻者,必不相容。查其祖先,原是清白之臣。因明永乐起兵不从,遂将子女编入教坊,乞赐削除。”雍正十分赞同此奏,立即批转交由部议,部议结果认为:“压良为贱”,乃“前朝弊端”,“亟易革除”。雍正随即同意部议结果,下旨在全国范围内废除所有乐户的贱籍:“各省乐户皆令确查削籍,改业为良。若土豪地棍仍前逼勒凌辱及自甘污贱者,依律治罪。”同年七月,两浙巡盐御史噶尔泰上奏请豁除丐户贱籍,在部议不同意的情况下,雍正仍下旨废除丐户的贱籍。雍正五年(1727)四月,又主动下诏豁除“佃仆”“伴当”和“世仆”的贱籍。雍正皇帝说:“近闻江南徽州府则有伴当,宁国府则有世仆,本地呼为细民。其籍业下贱,几与乐户、惰民相同。又其甚者,如二姓丁户村庄相等,而此姓乃系彼姓伴当、世仆……若果有之,应予开豁为良。俾得奋兴向上,免至污贱终身,累及后裔。”雍正七年(1729)后,又相继发布上谕豁除疍户和九姓渔户等的贱籍。对雍正帝的豁贱为良政策,清史官方文献有如下记载:“雍正元年,直隶巡抚李维钧言,请将直隶丁银摊入地粮内征收,嗣是各省计人派丁者,以次照例更改,不独无业之民无累,即有业民户亦甚便之。二年,天下人丁共二千四百八十五万四千八百一十八口。时山西省有曰乐籍,浙江绍兴府有曰惰民,江南徽州府有曰伴儅,宁国府有曰世仆,苏州之常熟、昭文二县有曰丐户,广东省有曰蜑户者,该地方视为卑贱之流,不得与齐民同列甲户。上甚悯之,俱令削除其籍,与编氓同列。而江西、浙江、福建又有所谓棚民,广东有所谓寮民者,亦令照保甲之法案户编查。”

    虽然雍正的“免贱为良”也有扩大户籍人数从而增加税收的经济目的,但却是对传统贱民制度的一次全面改革,伴有某些政治因素,因而遭到保守势力的竭力反对。最初对废除丐户贱籍的“部议”就没有通过,但拥有绝对权力的皇帝仍可排除阻力强制推行。然而,即使皇帝运用其至高无上的君权推出新政,若执行过程中遇到大批官僚的抵制,新政实际上仍然无法有效运行。雍正帝“豁贱为良”的新政也遭遇了中国历代政治改革同样的困境,在其强行推出一系列废贱为良的政策后,同时在中央与地方两个层面均遭到了强烈的抵制,以致在他去世后这一新政很大程度上被实质性地否定了,其标志性事件便是乾隆三十六年(1771)重新限定贱民群体“报官改业”的资格。在官本主义的传统中国,对于普通民众来说,科举入仕是其人生价值的最高体现。同样,对于贱民群体而言,还其良民身份最实质性的体现,就是允许其与良民一样参加科举考试,进而入仕为官。然而正是在“豁贱为良”这一关键环节,雍正帝的政策遭遇了保守势力的顽固抵制。乾隆三十六年,陕西学政刘墫上奏曰:已经豁贱为良的乐户丐户,“应请以报官改业之人为始,下逮四世本族亲支皆系清白自守,方准报捐应试”。换言之,贱民正式豁免贱籍后,再要经过子孙四代及直系亲属被证明“清白自守”,不再从事“贱业”,方能应试捐官。这其实就是在最关键点上剥夺了从良贱民的权利,实质上也就是否定了雍正帝的豁贱为良新政。然而,刘墫的这一上奏不仅获得“部议”同意,而且为乾隆钦准,成为清朝的律令:“凡开豁为良之乐籍、堕民、丐户及已经改业之疍户、九姓渔户人等,耕读工商听其便。仍以报官改业之人为始,下逮四世,必其本族亲支系清白自守者,方准应试报捐。若豪棍借端攻讦,欺压讹诈,依律治罪”(《大清律例汇辑便览》卷八《户部则例》)。显而易见,乾隆三十六年条例,是一次严重的政治倒退:“如果说雍正时期贱民已因豁贱为良获得凡人等级地位,到将近半个世纪之后的乾隆中叶却又对这部分凡人的部分政治权利予以剥夺,给以新的侮辱。堕民、疍户等过去为贱民,法无所据;开豁以后不同于良民却定例在案了。”因而可以说,“乾隆三十六年条例”是中国贱民解放史上的最后一次反动,也标志着雍正“废贱”改革的最终失败。

    四、结论

    贱民是中国传统社会中一个数量庞大的特殊群体,是士农工商“四民”之外的一个特殊阶级,处于中国社会等级结构体系的最底层。以往的研究者通常把贱民视为传统社会中的一个低贱等级,严格地说,这是不确切的。按照“地主阶级”和“农民阶级”这样的类型学标准,无论是从经济地位,还是从社会地位和政治地位的标准看,贱民不是一般意义的等级或阶层,而是一个相对独立而且极其特殊的阶级,是中国传统社会阶级结构中一个不可缺少的组成部分。中国历代究竟有多少贱民人口?至今没有、实际上也不可能有确切的答案,但从历代典籍档案的相关记载中,大体可以推算出这是一个数量不小的群体。从贱民的来源看,由于贱民的世袭性,一日为贱不仅终身为贱,而且子子孙孙永世为贱,除了极个别的特赦、军功和赎身外,即使改朝换代也无法改变贱民的身份。在世传的贱民群体之外,历代都会有罪犯、俘虏等大批新的贱民产生。因此,无论中国社会发生什么样的变化,总有一个庞大的贱民群体始终存在着。

    据《隋书》载,隋炀帝时“异技淫声咸萃乐府,皆置博士弟子,递相教传,增益乐人至三万余”(《隋书·裴蕴传》)。唐时有所收敛,但宫廷乐户贱人也少则“音声人一万二十七人”(《新唐书·百官志三》),多则“总号音声人,至数万人”(《新唐书·礼乐志十三》)。皇帝和朝廷拥有的奴婢乐户等官贱民数量众多自不待言,达官贵人家庭拥有的私贱民数量则更多,传统中国从中央的政要到地方的土豪,几乎每家都会使用各色贱民。汉武帝时,“治郡国缗钱,得民财物以亿计,奴婢以千万数”(《汉书·食货志》);三国时糜竺“祖世货殖,僮客万人”(《三国志·蜀志·糜竺传》);东晋的陶侃,拥有“家僮千余”(《晋书·陶侃传》);唐代一个都督,可以“家僮数千”(《新唐书·李谨行传》);北宋时有些地方豪富,“家饶于财,僮奴数千指”(《宋史·吴延祚传》);明代仕宦之家的奴仆,“有至一二千人者”(《日知录·奴仆》);清朝乾隆年间徽州六邑总人口20多万,仅一次性开豁的佃仆就达“数万丁”(《大清会典事例》卷七五二)。即使在法律正式废除贱民制度的民国初年,仅绍兴一县的堕民竟还有“三万余人”之多。与全国的总人口相比,贱民群体当然只占一个较小的比例,但从历代的各种记录可以窥见,中国历代贱民群体的数量总规模却超乎想象地庞大。唐长孺曾整理过贞观盛世的一份详细户籍资料,该材料记载:唐西州某乡总人口为2064人,其中奴婢为116人,占总人口比例的5.6%。王天石也整理过另外两份唐贞观和永徽年间的户籍档案,贱口的比例则更高。一份材料记载,全乡总人口为1200人,奴婢人口140人左右,贱口比例为12%;另一份材料记载,全乡人口2300人,奴婢337人,贱民比例为14%。可见,唐贞观永徽年间平均贱民比例高达10%以上。唐代的这个户籍数字,也许接近于中国传统社会贱民阶级在全国总人口中的比例。

    贱籍制度将非人性和反人道的行为合法化,它本质上是一种政治奴役和社会奴役。作为处于社会等级结构最底层的特殊阶级,中国的贱民实质上是一个被全社会奴役的群体。在生物学和人类学意义上,贱民毫无疑问是人类的一部分,是中华民族的同胞,但在社会学意义上,贱民并不被视为正常的人类和同胞,而被视作动物与财产,即所谓“律比畜产”。他们同时被国家的法律和社会的礼仪剥夺了作为平民的基本人权,不仅受到享有权力与财富的统治阶级的奴役,而且也被普通的平民百姓所歧视,不仅没有独立的经济地位,而且也毫无社会政治地位。在国家制度的层面,历代王朝均将贱民群体打入“士农工商”四民之外的贱籍,被无情剥夺基本的人身自由和人格尊严,他们不能像普通平民那样开户立籍和成家立业,不能自由迁徙,不能应试入学和入仕为官,不能与其他阶层子女通婚,一旦触犯法律,他们就要受到比普通民众严厉得多的惩罚。在法律的层面,贱民群体因为被当作“畜产”和“资财”,因而可以被主人买卖,其市场价格有时甚至不如牛马;他们是主人的奴仆,不仅人身依附于主人,而且可以被主人随意处置,包括任意的人格侮辱、人身虐待、性侵害,直至被主人虐杀。在社会的层面,贱民没有正常的社会生活,他们不能从事一般的职业,而被严格限定于各类最低劣的“贱业”;奴婢、佃仆、乐户、部曲等官私贱民不仅要受到历代官僚阶级和地主阶级的奴役,而且也要受到普通民众阶层的严重歧视和欺压。他们不能与普通平民居住在一起,而常常被限定在特定的贱民居住区域;他们的穿着打扮和出行交往,都不能同于常人,而有特定的贱口标识;即使他们的祖先也曾跻身名门豪族,一旦沦为贱口便要被家族除籍。总之,贱民的“一切权利被剥夺,使之处于最卑下最受奴役的地位。倘若奴婢设法去奴籍为良,或以逃亡等方式试图摆脱所受的各种压迫和虐待时,则又要受到严酷的刑律处罚”。因此,贱民受到的不是一般贫民阶级的经济剥削与政治压迫,而是被残酷地剥夺人之所以为人的基本权利,是被中国传统的礼法体系彻底非人化和奴化的特殊群体。

    贱民制度是中国专制政治条件下政治奴役与政治压迫的集中体现,贱民的解放程度是中国政治解放的重要尺度。历代贱民的种类、称号和来源多种多样,然而,无论哪个朝代,贱民最重要的来源都与政治压迫和政治镇压直接或间接相关,各种不同种类和称呼的贱民本质上都被剥夺基本人权,并受到非人道的对待。贱民作为中国传统社会最低贱的阶级,不仅仅是由于其经济地位,更是由于其社会和政治地位。在主人眼中,贱民与可供自己随意使唤的牲口并无实质差别,为了使贱口更好地服侍自己,主人反而必须像饲养牲口那样维系贱民的生命和体力。因而,纯粹从物质生活方面看,在经济极度困难以至威胁到生死存亡的某些特殊情况下,贱民的生存条件甚至可能比普通贫民要更好。这也是为什么在一些饥荒和灾难时期,平民会自甘出卖为奴的主要原因。然而,统治者和主人之所以要为贱民提供必需的物质生活条件,仅仅是为了使其维系生命以更好地被主人役使。

    在中国传统专制政治的条件下,贱民阶级存在的真实意义,就在于供统治集团奴役;贱民以牺牲基本的人权,来满足统治阶级的特权需要。在漫长的中国专制政治历史上,在所有的社会阶级群体中,贱民是受奴役和压迫最深重的群体。他们不仅受到以君主为代表的统治阶级的奴役和欺压,而且还要受到被统治阶级中其他平民阶层的歧视和侮辱,贱民阶级的政治解放超乎想象的艰难。即使国家的政治法律制度正式废除了贱民的卑贱身份,即使经济收入和物质生活条件已经不再处于社会的最底层,社会对贱民群体根深蒂固的歧视以及贱民群体的自我鄙视也难以在短时期内消除。一位研究浙江堕民历史的学者回顾了从明初设立“禁止再呼堕民碑”开始的极其漫长的堕民解放历程,最后不无感慨地说,直到改革开放后,堕民的政治、经济和文化障碍才完全消除,而成为国家的正常公民:“中华人民共和国成立后,堕民被列入劳动人民的行列,特别是改革开放以后,堕民发家致富,平民消除了歧视堕民心理,堕民也不再有自卑心理,平民与堕民的界线得以泯灭,堕民作为一个贱民群体被彻底消融。”鉴于妇女在历史上被更多地剥夺作为人的基本权利,比起男性来受压迫更加深重,马克思和恩格斯曾引述傅立叶的话说,“妇女解放的程度是衡量普遍解放的天然标准”。据此我们可以说,在中华民族的政治进步史上,贱民解放的程度是衡量中国政治进步的重要尺度。

    贱民制度在中国持续存在数千年,是中国专制政治的结构性要素之一,给中华民族留下了沉重的政治和社会遗产。中国历史上贱民群体的形成,并非“物竞天择,优胜劣汰”的自然竞争结果,而更多的是内外战争和政治斗争的产物。贱民虽然从事社会最低贱的职业,处于社会的最底层,受到最残酷的奴役,但这并不等于贱民群体是中华民族的“糟粕”。恰恰相反,大量的贱民源于残酷的政治镇压,昔日万人之上的皇亲国戚和达官贵人,完全可能一夜之间变成众人唾弃的奴仆罪隶。因此,数千年的贱民制度和数量庞大的贱民阶级的长期存在,深刻地影响了中华民族的国民性,依附性、不平等、对权力的崇拜和对人格尊严的忽视成为国民性中严重的负面遗产。

    贱民政治即是奴性政治,奴性的形成与专制政治和贱籍制度有着内在的联系。鲁迅对中华民族的国民性有过极其深刻的分析和批判,他认为中国的国民性中有着浓厚的“奴性”。他说:中国人在历史上虽然经历过许多朝代,但实质上就是两个时代,即“想做奴隶而不得的时代”和“暂时做稳了奴隶的时代”。因此,“中国人向来就没有争到过‘人’的价格,至多不过是奴隶”。中国传统的专制政治环境,导致了严重的人身依附关系,使得许多人身上带有深深的奴性:“专制者的反面就是奴才,有权时无所不为,失势时即奴性十足……做主子时以一切别人为奴才,则有了主子,一定以奴才自命。”

    等级特权本来就是专制政治的内在属性,而贱民制则将等级特权从官僚阶级的价值转变成全民的价值,对等级特权的追求成为一般民众的内在精神。等级特权是官僚政治的产物,官员的权利与其官爵紧密相连。然而在中国,由于士农工商这些普通民众之下还存在着一个更低下的贱民阶级,在贱民群体面前庶民百姓也有强烈的优越感。不仅如此,贱民阶级内部还有三六九等,从而使得贱民群体自己也拥有等级意识。因而,在中国的传统国民精神中,存在着一种强烈的等级意识,使自己或自己的子孙成为高于别人的等级,成为传统中国人的普遍追求和内在激励。“吃得苦中苦,方为人上人”,成了许多人的励志语和座右铭。

    中国的传统社会是一个典型的官本主义国家。“官本主义就是以权力为本位的政治文化和社会政治形态,在这种政治文化和社会政治形态中,权力关系是最重要的社会关系。在各种类型的社会权力中,政治权力处于支配地位,是官本主义的核心要素。因此,权力本位通常也表现为官本位。在官本主义条件下,权力成为衡量人的社会价值的基本标准,也是影响人的社会地位和社会属性的决定性因素。权力支配着包括物质资源和文化资源在内的所有社会资源的配置,拥有权力意味着拥有社会资源。”传统中国的官本主义与贱民制度是一种互为增益的关系,正是政治权力催生了大量的贱民群体,贱民群体的存在本身就是政治特权的宣示。剥夺贱民的基本权利,最实质性的就是剥夺其通过科举考试或捐官的途径成为朝廷官员的权利。官本主义与贱民制度的相互增益,导致了传统中国人对政治权力无以复加的崇拜。在相当程度上可以说,在权力面前不仅贱民是奴婢,其他普通民众也同样是奴婢。

    贱民制度彻底剥夺了人的尊严,极大地遏制了中国人对尊严的追求。在现代社会,人的最高价值就是人的尊严,“人人生而自由,在尊严和权利上一律平等”成为全人类的共识。然而,在中国的传统政治文化中,尊严与权力相辅相成,权力而非德性和理性成为尊严的基础。谁拥有权力,谁就拥有尊严;谁拥有多大的权力,谁就拥有多大的尊严。皇帝拥有最高的政治权力,他也因此而成为最有尊严的人。反之,没有权力就没有尊严,处于最底层的贱民没有任何权力可言,也就没有任何尊严可言。贱民制度的长期存在,不仅彻底泯灭了贱民群体的尊严意识,也在很大程度上泯灭了普通中国人的尊严意识。即使强调德行的儒家本身,其主流观点也把最高的尊严给予了皇帝,例如朱熹就说“人主极尊严”。

    总之,数量庞大的贱民群体是中国历史上一个重要的政治存在,是士农工商四民之外一个特殊的阶级,处于中国传统社会最低贱的地位。贱籍制度是中国历史最悠久的政治制度之一,是中国绝对君主专制主义的重要制度基础。从根本上说,贱民阶级的产生,是专制政治统治的需要。贱民具有世袭性,最早的贱民群体源自俘虏和罪犯,是战争和政治镇压的产物。贱民被当作是牲口和财物,完全剥夺了基本的人权,没有起码的人身自由、人格尊严和生命保障。贱民制度是一种极端非人道的政治奴役,与人类的政治文明完全背道而驰,贱民解放的程度是中华民族政治文明进步和政治解放的重要尺度。

    本文载于《学术月刊》2025年第1期。

  • 余少祥:论社会法的本质属性[节]

    一、体现社会法本质的基本范畴

    范畴及其体系是衡量人类在一定历史时期理论发展水平的指标,也是一门学科成熟的重要标志。社会法的基本范畴是社会法的概念、性质及结构体系等内容的本质体现,这是当前学术界研究相对薄弱的环节。社会法的基本范畴经历了从社会保护、社会保障到社会促进,从生存性公平到体面性公平的演变,体现了社会法不同于其他部门法的本质特征。

    (一)国内立法史视角

    一直以来,我国社会法的基本范畴都是社会保护,主要体现为对特定弱势群体的生活救济和救助。到了近代,开始探索社会保障制度。新中国成立尤其是新时代以来,社会促进逐渐成为社会法的新追求。

    在我国古代,虽然没有系统的社会法制度体系,但很早就有关于社会救济的思想和行为记载,如《礼记·礼运》提出“使老有所终,壮有所用,幼有所长,鳏寡孤独废疾者,皆有所养”;《墨子》主张“饥者得食,寒者得衣,劳者得息”。在制度方面,《礼记·王制》言及夏、商、周各代对聋、哑等残障人士“各以其器食之”。在西周,六官中地官之下设大司徒,专门负责灾害救济。春秋战国时期,增加了“平籴、通籴”等措施。两宋之后,居养机构发展较为完善,有福田院、居养院等多种形式。此外,还有用于赈灾的名目众多的仓储体系,如汉有常平仓,唐有义仓,两宋有惠民仓、社仓,元有在京诸仓、御河诸仓,明有预备仓等。但总体上看,这些救助措施均非法定义务。统治者赈灾济困乃是一种怀柔之术,是为巩固皇权的收买人心之举,与现代意义的社会法相距甚远。

    我国真正开启社会立法的是北洋政府。清末搞得沸沸扬扬的修宪和制订法律的活动,催生了民法、刑法等一批法律法规,却没有一部关于社会救济和保障民众生活的法律。1923年,北洋政府颁布《矿工待遇规定》,首次引入“劳动保险”概念,可谓我国社会法的破壳之作。可惜,这些法令因战乱和时局动荡刚实施便很快夭折。南京国民政府建立后,先后颁布《慈善团体监督法》《救灾准备金法》《最低工资法》等。从抗日战争起,以国民政府社会部成立为标志,社会立法渐趋完备。1943年《社会救济法》颁布,奠定了民国社会法的基石。这一时期,《社会保险法原则》《职工福利社设立办法》等先后公布,为探索社会保障进行了有益尝试,社会法发展开始迈入现代化门槛。但由于内战不断、政局不稳、政令不畅,加上官僚买办资本的抵制,这些法令并没有得到有效实施。

    新中国成立后,我国实行的是计划经济体制和单位对职工生老病死全包的政策。直到20世纪80年代,民众的基本生活保障仍是由国家和集体组织承担。90年代起,随着向市场经济转型,一部分群体开始从单位人向“社会人”转变。为确保这部分民众的基本生活来源,我国开始建立社会保障制度,先后颁布《残疾人保障法》(1990)、《劳动法》(1994)、《城市居民最低生活保障条例》(1999)等社会法规。进入21世纪后,相继出台了《劳动合同法》(2007)、《社会保险法》(2010)等社会立法。新时代以来,又陆续推出《慈善法》(2016)、《法律援助法》(2021)等,加上之前的《红十字会法》(1993)、《就业促进法》(2007),社会促进逐渐成为立法的关键词。从总体上看,我国当代社会立法是制度变迁的产物,而非在市场发展中形成的,因此与西方国家有所不同。

    (二)国外立法史视角

    社会法是舶来品,深受欧美日等工业国家影响,因此探求社会法的概念、范畴与体系等,离不开对外国法制的比较观察。从总体上看,国外社会法范畴也经历了社会保护、社会保障和社会促进的演进。

    英国是世界上最早实行社会立法的国家,其目的是为脆弱群体提供社会保护。1388 年,金雀花王朝制定了一部《济贫法案》。1531年,亨利八世又颁布了一部《名副其实救济法》,规定老人和缺乏能力者可以乞讨,地方当局将根据良心从事济贫活动。这两个法案与1601年伊丽莎白《济贫法》相比,影响较小。后者诞生于“羊吃人”的圈地运动时期,旨在“将不附任何歧视性的工作给有工作能力的人”,后为很多国家效仿。1563年,英国颁布了历史上第一部《劳工法》,1802—1833年又颁布5个劳动法案,覆盖了几乎所有工业部门,确立了现代劳动保护体系及基本原则。1834年,英国政府出台《济贫法修正案》,史称“新济贫法”。这些立法孕育着社会法的丰富遗产,具有鲜明的时代性、体系性和结构性特征。此后欧洲其他工业化国家纷纷仿效英国,建立起自己的社会保护制度。

    世界上最早实行社会保险立法的是德国。19世纪中后期,俾斯麦政府采取“胡萝卜加大棒”政策,一面对工人阶级反抗实施残酷镇压,一面通过社会保险对其安抚,相继出台了《疾病保险法》(1883)、《工伤保险法》(1884)等法规。由于社会保险法适应了工业化对劳动力自由流动的需求,解决了劳动者生活的后顾之忧,在社会法体系中占有重要地位。但西方社会法真正完成的标志是1935年美国《社会保障法》施行,这是社会保障概念在世界上首次出现。之后,社会法的发展开始进入一个新的历史阶段——为社会成员提供普遍福利,其典型标志是英国“贝弗里奇计划”实施。由于该计划被逐步纳入立法,标志着英国社会法走向完备和成熟。第二次世界大战后西方各国在推行社会立法时,不同程度借鉴了《贝弗里奇报告》模式,使得西方社会法的福利化转型最终完成。

    20世纪60年代,西方国家普遍解决了生存权问题,社会促进开始成为立法的重要取向。除了传统的慈善法大量兴起外,扶贫法和反歧视法逐渐形成新的热潮。以美国为例,1964年约翰逊政府通过《经济机会法》,宣布“向贫困宣战”,此外还实施了社区行动计划、学前儿童启蒙教育计划等。其他国家如英国的《儿童扶贫法案》、法国的“扶贫计划”和德国的《联邦改善区域结构共同任务法》等在促进落后地区经济社会发展方面也起到了重要作用。在反歧视方面,美国、英国、欧盟和日本都有完备的立法。尤其是美国,仅反就业歧视法就多达十余部,且有大量判例具有重要立法价值。这一时期,日本的《反对性别歧视法》(1975)、瑞典的《男女机会均等法》(1980)等纷纷出台。这些反歧视立法依据差别待遇原则,都是为了促进国民获得实际平等地位,实现社会实质公平。

    (三)学术研究史视角

    我国社会法研究肇始于民国初期。1949年以后,又分为“大陆”和“台湾地区”两个支系,前者的探索早于后者,而且在一定程度上沿袭了民国的传统。从学术史上看,学术界在某些观点上取得了较大共识,但核心范畴略有差异。

    民国的社会保护和社会幸福说。多数民国学者认为,社会法是救济和保护社会弱者之法。如李景禧提出,社会法是“为防止经济弱者地位的日下,调整了暂时的矛盾”。陆季藩指出,社会法是“以保护劳动阶级或社会弱者为目标”的法。林东海认为,凡是“解决社会上之经济的不平等问题”的立法,都是社会法。杨智提出,社会法是“以增进及保护社会弱者之利益为目的”的法。也有学者主张,社会法包含一般社会福利。如张蔚然提出,社会法是“关于国民经济生活之法”。卢峻认为,社会法的目标是“使社会互动关系或社会连立关系”达到最高目标。黄公觉则明确提出,广义社会法“指一切关于促进社会幸福的立法”,狭义社会法仅指“为促进社会里的弱者或比较不幸者的利益或幸福之立法”。

    大陆的劳动保护与社会保障说。1993年,中国社会科学院法学研究所在一份报告中将社会法解释为“调整因维护劳动权利、救助待业者而产生的各种社会关系的法律规范的总称”。这是新中国学术界首次系统阐述这一概念。最高人民法院2002年编纂的《社会法卷》认为,“坚持社会公平、维护社会公共利益、保护弱势群体的合法权益”是“社会法的主要特点”。在学术界,多数学者将社会法定义为调整劳动与社会保障关系的法律。如张守文认为,社会法“具有突出的保障性”,主要是“防范和化解社会风险和社会危机,保障社会安全和社会秩序”;赵震江等认为,社会法是“从整个社会利益出发,保护劳动者,维护社会稳定”,包括“社会救济法、社会保障法和劳动法等”。从中国社会法学研究会历次年会讨论的情况来看,劳动法、社会保障法、慈善法属于社会法的观点已被普遍接受。

    台湾地区的社会安全和生活安全说。很多台湾学者从社会保护出发,将社会法称为社会安全法。如王泽鉴认为,社会法“系以社会安全立法为主轴所展开的”。钟秉正认为,社会法是“以社会公平与社会安全为目的之法律”,“以消除现代工业社会所产生的各种不公平现象”。也有学者明确提出社会法是生活安全法。如郝凤鸣认为,社会法是“以解决与经济生活相关之社会问题为主要目的”,“藉以安定社会并修正经济发展所造成的负面影响”;陈国钧认为,社会法旨在保护某些特殊人群的“经济生活安全”,或用以促进“社会普遍福利”,这些法规的集合被称为社会法或社会立法。总之,在台湾学术界,社会法集中指向与社会保护、社会保障和社会福利等相关的社会安全或生活安全法。

    二、决定社会法本质的要素分析

    事物的本质和发展方向是由核心要素决定的,在讨论社会法的本质之前,我们先分析决定其本质的核心要素。如前所述,社会法产生的根源是社会的结构性矛盾,尤其是市场化带来诸多社会问题,使得国家不得不运用公权力干预私人经济,达到保障民众生存权、化解社会矛盾的目的。在一定意义上,政治国家、经济社会和历史文化等要素在社会法本质形成过程中起到了决定性作用。

    (一)政治国家要素

    作为国家在干预私人领域过程中形成的全新法律门类,社会法与传统的自由权、自由市场经济体制以及民主法治国家理念存在一定冲突。正是国家职能的转变决定了社会法的内在精神和本质,使人民受益于国家的关照。

    1.从消极国家到积极国家

    在古典自由主义时期,政府主要承担“守夜人”角色。资本主义发展到垄断阶段以后,不但造成市场机制失灵,而且难以维持社会稳定。于是,社会上层开始形成一种共识,即通过国家干预,改良资本主义制度,以消除暴力革命的隐患。正如马克思和恩格斯指出,“资产阶级中的一部分人想要消除社会弊病”,“但是不要由这些条件必然产生的斗争和危险”。按照黑格尔的阐述,国家的目的在于“谋公民的幸福”,否则它“就会站不住脚的”。在这种情形下,国家这只“看得见的手”开始不断发挥作用,以平衡不同社会群体的需求,积极国家随之诞生。因此,国家干预并非理论家的发明,而是在历史进程中实际发生的,即对抗已重新采取直接的国家干涉主义形式,国家进一步成为社会秩序的干预者。

    国家干预社会生活是通过社会立法实现的,直接决定了社会法的性质和宗旨。由于国家不得不采取干涉主义的社会立法来做社会救济的工具,于是在法律上体现为,国家对于任何人都有保障其基本生活的义务。从立法宗旨来看,旨在打破弱肉强食的丛林法则,将社会贫富分化控制在一个可以承受的动态合理范围之内。比如,通过劳资立法,克服自由资本主义无节制地追求高额利润造成的社会分裂等严重后果。事实上,国家实行经济社会干预,不是否认私人利益和个人需求,而是将其重整到更高的全社会层面,即运用国家的力量实现个人的特殊利益与社会整体利益的统一。因此,社会法表面上是社会性的,实质上是政治性的,是一种典型的政治法学,它发轫于人对国家的依附性,发生于国家对共同体内每个人的幸福所负有的法律责任,使国民的生活安全得到有效保障。

    2.从社会国到福利国家

    积极国家进一步引发从消极自由到积极自由的发展。也就是说,国家不仅有保障公民基本自由不受侵犯的消极义务,更有保障公民基本生存与安全的积极义务,这也是社会发展进步的重要标志。在这一背景下,政府不再像以前一样仅仅囿于维护社会秩序,或对出现的问题进行决策干预,而是更进一步转换为保障人民具有人格尊严和最低生存条件的给付行政。通过给付行政,政府承担了涵盖广泛的计划性的行为、社会救济与社会保障等任务。尤其是在工业社会条件下,国民享有基本权利和事实自由的物质基础并不在于他们为社会作过什么贡献,而根本上依赖于政府的社会给付。正是给付行政成就了今天的社会国,即一个关照社会安全与民生福祉的国家。社会法便是为实现社会国的目标任务形成的法律体系,而社会国原则又为立法者干预私人领域提供了合法性依据。

    19世纪末20世纪初,随着垄断资本主义发展,社会本位的法理念开始取代个人本位的法思想并居于支配地位。这一时期,政治国家与市民社会的矛盾在法律上体现的结构也发生了新变化,使得国家在向国民承诺下不断增加福利范围。1942年,英国“贝弗里奇计划”首次采用福利国家称谓,通过财产重新配置,为公民提供基本生活保障。二战之后,这一思想主宰了西方的正统观念,很多国家确认促进民生幸福是公民的重要社会权利,对广泛和普遍的社会福利而言同样如此,国家承担了民众直接或间接的生活责任。可见,政治国家不但有力地推动了社会法的发展,而且决定了其福利化方向,最大限度地消除了各阶级之间的对抗冲突以及社会革命的危险,促进了社会公正公平,有效维护了社会稳定。

    (二)经济社会要素

    工业革命以后,资本主义的新信念是唯物质主义的,即只要物质财富足够多,一切社会问题都会自动消失。事实上,纯粹的市场机制无法解决社会公平、效率以及经济长期稳定等重要问题。由于市场体系造成了巨大的社会混乱,如果不深刻调整,市场机制也将被摧毁。因此,资产阶级国家被迫用法律来防止资本主义剥削过度的现象,通过社会立法去收拾资本和市场留下的烂摊子,出现了以社会法为核心、旨在对冲和矫治市场化不利后果的社会保护运动,结果连最纯正的自由主义者也承认,自由市场的存在并不排斥对政府干预的需要。正如罗斯福在1938年向国会提交的一份“建议”中指出:“我们奉行的生活方式要求政治民主和以营利为目的的私人自由经营应该互相服务、互相保护——以保证全体而不是少数人最大程度的自由。”

    经济民主理论认为,经济问题与伦理问题密切相关,人类经济生活应满足高尚、完善的伦理道德方面的欲望。社会法倡导社会保险、社会救济、劳工保护等社会权利,以解决资本主义发展中日益严峻的社会问题。一方面,要保障每个人拥有获取扩展其能力的物质条件和自我实现的机会;另一方面,要在支持扩大国家给付的理由与加重政府财政负担的结果之间进行权衡。可见,社会法的产生不单纯是对民众生活的保护,也是产业制度有效运行和社会存续的必需。因此,社会法在本质上是由资本主义的结构性矛盾决定的,是这一矛盾在法学层面的反映。因此,社会法与市民法同属资本主义的法,它不否认市场经济。

    与此同时,社会要素也深刻地影响着社会法的本质。随着工业革命深入发展,市场为社会创造了巨额财富,也制造了大量贫困。正如马克思恩格斯所说,“劳动生产了宫殿,但是给工人生产了棚舍”。1848年,《共产党宣言》发表,整个欧洲为之震动。恩格斯明确指出:平等不仅应“在国家的领域中实行”,还应当“在社会的、经济的领域中实行”。这一时期,各种社会主义思潮如德国的社会民主党运动、法国的工团社会主义、巴枯宁与蒲鲁东的无政府主义等纷纷发出社会改革的呼吁。由此看来,近现代社会实际上受到了一种双向运动支配,其一是经济自由主义原则,其二是社会保护原则,二者交互作用。应该说,社会法的产生正是对社会无序发展及其大量不良后果进行矫正的反向运动。

    从本质上看,社会保险、社会救助等均是由社会再分配决定的,其目的是使社会上的富人与穷人达成一种建立稳定秩序的合作。如德国当时的社会保险立法受到普遍赞成,资方认为可以抵消暴力革命,劳方则视其为实现社会主义的第一阶段。这一共识不断巩固和积累,成为重要的社会支持手段。美国学者卡尔多等在社会福利的基础上,还提出一种社会补偿理论,认为从受益者新增收益中拿出一部分补偿受损者,就实现了帕累托改进。总之,社会再分配是以生存权和社会公平为法理基础,这是社会法最重要的价值理念,体现了生产关系变革和社会法的发展进步。而且,社会法的发达程度是由经济社会发展水平决定的。一方面,所有的社会权利实现都依赖于经济发展指数和财政状况;另一方面,它限制资本主义的非人道压榨和剥削,却使资本家在所谓合法范围内得以充分发展。

    (三)历史文化要素

    社会是由历史事实的总和所规定的、经验地形成的人类质料,作为最具解释力的最新法理范式,社会法标志着人类政治文明、法治文明和社会现代化达到了空前高度,历史意义深远。历史法学派明确指出,法是以民族的历史传统为基础生成的事物,是从特殊角度观察的人类生活。萨维尼详细考察了德国法,认为法的素材“发源于国民自身及其历史的最内在本质”,因而受历史决定。马克思认为,历史意味着现实的个人通过生产实践活动进行物质创造,并逐渐认识世界、改造世界;而“表现在某一民族的政治、法律、道德、宗教”等“语言中的精神生产”也是“人们物质行动的直接产物”。因此,法律是历史的产物,是世世代代的人活动的结果。可见,马克思历史观的内核在于,从历史和现实出发考察法律的形成和本质,并将市民社会理解为整个历史和社会立法的基础。

    德国是现代社会法的发源地,其社会立法极大地丰富、发展和完善了现代法律体系。从实践中看,德国社会法受历史因素的影响是广泛而深远的。如1794年《普鲁士普通邦法》规定,国家有义务对那些为了共同利益而被迫牺牲其特殊权利和利益的人进行补偿。以此为源头,德国逐渐孕育出公益牺牲原则,成为社会补偿法的理论渊源。为了应对二战受害人及其遗属的供养问题,德国出台了《联邦供养法》,并逐步演变为对各类暴力行为受害人的补偿。再如,德国法律有一个苛情救济制度,主要是为恐怖和极端犯罪受害人提供人道主义款项,但受害人无法主动主张这一权利。2013年,第十八届议会提出,要制订新的受害人补偿和社会补偿法。不久,柏林恐怖袭击案发生,使得改革进程急剧加速。如今,服民役者、因接种疫苗身体受损者均被纳入社会补偿范围,使其社会法体系日臻完备。

    文化也是社会法本质形成的重要决定因素。马克思指出,“权利决不能超出社会的经济结构以及由经济结构制约的社会的文化发展”,因为文化是现代社会思想的特殊元素,奠定了一整套理解和解释人类行为的规则。社会文化决定论甚至认为,人类及社会制度的形成,由各种文化价值和社会机构决定。尤其是法律文化,决定了一国法律的内在逻辑,以及历史进程中积累下来并不断创新的群体性法律认知、价值体系、心理态势和行为模式。客观地说,很多法律特性只有通过法律文化才能得到解释,如德国、英国、美国和法国法的不同。因此,法律既存在于一个与传统相通的整体之中,又存在于一个与他物相关联而形成的民族精神的整体之中,他们共同构成了法律的文化意义的经纬。

    决定社会法本质的文化要素有法律观念、传统和制度等,如俾斯麦立法是德国留给世界最宝贵的政治遗产,是法律文化的最高层次。此外,法律理论的影响也是不言而喻的。一是社会连带理论。如社会连带主义法学提出,连带关系要求个人对其他人负有义务,每个人都依靠与他人合作才可能过上满意的生活,这成为社会保险法的理论基础。二是公民权利理论。如马歇尔提出,公民权利“是福利国家核心概念”,成为福利立法的理论基石。三是差别平等理论。这一理论认为,财富和权力的不平等,只有最终能对每个人的利益,尤其是在对地位最不利的社会成员的利益进行补偿的情况下才是正义的。这些文化元素对社会法本质形成起到了重要的决定作用。因此,如果剥离了文化要素,社会法就不是今天的样子,也不可能实现生活安全的社会化和国家化。

    三、社会法本质的理论证成

    作为独立的学科名称和专门法学术语,社会法有特定的语意内涵、独立的研究对象和独特的法律本质,应立足于中国的历史和现实文化,借鉴国外经验,构建具有中国特色的社会法理论。并非所有与社会或社会问题相关的法律都是社会法,它以为每一个社会成员提供适当的基本生活条件为使命,因此不仅仅是现代社会场域的法,也是应对现代社会的法。

    (一)社会法是弥补私法不足的法律体系

    私法和市场竞争必然孕育着贫富分化与社会危机。为了挽救资产阶级统治秩序,资本主义国家遂通过社会立法来修正某些私法原则,限制完全的自由竞争,矫正私法和自由放任的市场经济带来的负面后果。

    1.私法公法化与公法私法化

    近代私法推定法律关系发生在身份平等且充分自由的人们之间,对市场经济的保障是十分必要的,至少对于市场主体来说形成了私人平等。所谓私人平等,就是人格与资格平等、机会均等。因此,在经济交往中,只要不采取欺诈、强迫等手段,各方都可以自由地追求利益最大化,国家作为中介人和社会契约的执行者只有保护个体权利不受侵害的消极义务,没有促进个体利益的积极义务。但是,这种抽象平等忽略了人们在天赋能力、资源占有、社会地位等方面的实际差异,结果产生了事实上的不自由、不平等,不可避免地出现“贫者愈贫,富者愈富”的马太效应。正是私法调整机制的不足以及所有权绝对和个人本位法思想泛滥,导致社会弱者生存困难、劳动者生存状况不断恶化和劳资对立等严重社会后果,迫切需要对私法意思自治、形式平等、契约自由等原则进行修正。

    由于私法和市场机制不能自动解决社会贫困、失业等问题,在法律发展中出现了私法公法化和公法私法化现象,逐渐形成社会法这一以实现社会实质公平为目的、以公私法融合为特征的新型法律部门。这是因为,单纯的公法容易导致过多限制经济自由的危险,单纯的私法又无法影响经济活动的全部结构。所谓私法公法化,是国家运用公共权力调整一些原本属于私法的社会关系,使私法带有公法的色彩和性质;所谓公法私法化,是国家以私人身份出现在法律关系中,将私法手段引入公法关系,使国家成为私法的主体和当事人。这种公共权力介入私人领域的做法就是公私法融合,并随之产生与公私法并列的第三法域。按照共和主义的观点,在私人对个人基本权利产生实质性支配关系时,国家有义务帮助个人对抗这种支配,此时基本权利经由国家介入得以保全。

    2.社会法对市民法的修正

    如前所述,市民法(即民法)有益于资源有效配置与财富公正分配,但由于各主体掌握的信息、谈判能力和经济力量等不同,交易结果不一定公平。在现实中,很多人认识到法律的基本精神是有利于强者而非弱者,市民法确立的平等协商、契约自由等原则在实践中形同虚设。一方面,它忽视了个体的现实差异;另一方面,市民法上的“人”是一种超越实际存在、拟制化的抽象人,已逐渐丧失伦理性与社会正当性基础。从法史可知,对人的看法在很大程度上决定着法律的发展趋势和方向。20世纪下半叶起,新的利益前所未有地逼迫着法律,要求以社会立法的形式得到承认,法律也越来越多地确认其存在,将空前大量的权利提高到受法律保护的地位。正是源于此种法理论的立法被称为社会法,这一变化也体现了从市民法到社会法、从近代法到现代法原理的重大转换。

    与市民法不同,社会法更关注人的具象性与实力差异,由此很多学者从市民法修正角度来阐释社会法,将社会矫正思想置于自由主义的平等思想之上。如沼田稻次郎提出,社会法是以“对建立在个人法基础上的个人主义法秩序所存在弊端的反省”为特征的法。事实上,社会法对市民法的修订主要体现为生存权保障,具体而言就是对财产权绝对、契约自由、平等协商等原则的限制,一些学者称之为民法社会化或现代化,是不准确的。社会法对民法的修正是系统化的,在法律理念、原则、方法和调整的法律关系上有显著不同。总之,社会法是传统市民法不足的产物,正如马克思所说,立法者“不是在创造法律,不是在发明法律,而仅仅是在表述法律,他用有意识的实在法把精神关系的内在规律表现出来”。

    (二)社会法调整的是实质不平等的社会关系

    由于私法本身无法推动不平等的社会关系向实质平等转变,以公权力矫正不平等就成为必然选择。社会法正是通过对不平等的社会关系实行区别对待和差异化调整,增强弱者与强者抗衡的力量,实现实质意义的平等和公平。

    1.从形式平等到实质不平等

    私法的形式平等旨在确立绝对财产权和缔约自由权,使个人通过市场机制选择追逐利益最大化,并承担由此带来的后果。但是,这种平等作为近代民主政治的理念不是实质性的,而是舍弃了当事人不同经济社会地位的人格平等和机会均等,并非事实上的平等。恩格斯说:“劳动契约仿佛是由双方自愿缔结的”,这种“只是因为法律在纸面上规定双方处于平等地位而已”,“这不是一个普通的个人在对待另一个人的关系上的自由,这是资本压榨劳动者的自由”。拉德布鲁赫在《法学导论》中写道: “这种法律形式上的契约自由,不过是劳动契约中经济较强的一方——雇主的自由”,“对于经济弱者……则毫无自由可言”。因此,所谓契约自由和所有权绝对,事实上已成为压迫和榨取的工具。

    尽管私法形式正义要求按照法律规定分门别类以后的平等对待,但它并未告诉人们,应该怎样或不该怎样分类及对待,如果机械地贯彻形式平等原则,就容易产生许多弊病。一方面,总会有一些人处于强势地位,一些人居于劣势地位;另一方面,强者常常利用优势地位欺压弱者,形成实际上的不平等关系。以劳动关系为例,如果不对契约双方进行一定干预,劳动者通常被迫同意雇主的苛刻条件而建立不平等劳动关系。由于市场本身无法克服这一现象,必然带来一系列社会利益冲突,甚至导致严重的社会危机。正是自由主义无序发展导致19世纪出现垄断与无产、奢侈与赤贫、餍饫与饥馑的严重对立现象,因此必须对形式平等导致的实质不平等进行矫正,通过社会法规制,平衡各种社会矛盾和利益冲突。

    2.从实质不平等到实质平等

    为了达到实质平等,资产阶级国家开始通过社会立法适当保护社会弱者,抑制社会强者。与民法不同,社会法既有私法调整方法,也有公法调整方法,因为单靠私法规范不能达到目的,必须运用公法的强制性规范予以支持才能实现权利的真正保障。作为反思法律形式平等的必然结果,社会法主要是以社会基准法和倾斜保护的方式对平等主体间不平衡的利益关系予以适度调节,设定一些法律禁止或倡导的方面,体现了马克斯·韦伯所称“现代法的反形式主义”趋势,是一种“回应型法”或称“实质理性法”。其法理基础是,为了校正形式平等所造成的实质不平等,对个人生存和生活条件进行实际保障。当然,这种积极义务是辅助性的,只是对形式平等的缺陷和不足进行必要修正和补充,并没有取代和全面否定形式平等,正如社会法没有取代和完全否定民法一样。

    由此可见,社会法调整的乃是实质不平等的社会关系,旨在纠正市场经济所导致的必然倾斜。所谓实质平等,是国家针对不同人群的事实差异,采取适当区别的对待方式,以缩小由于形式平等造成的社会差距。为了实现这一目标,立法者一方面关注平等人格背后人们在能力、条件、资源占有等方面的不平等,并以倾斜保护方式实现人与人之间的和谐;另一方面重视为人们提供必需的基本生活保障,使得立法的目标变成了结果的平等。有鉴于此,社会法上的社会保障并非临时性救济,也不是政府“信意”为之,而是法律赋予的强制性义务。总之,社会法是近现代社会实质不平等的产物和反映,以应对私法产生的“市场失灵”和过度社会分化等问题。马克思说:“人们按照自己的物质生产率建立相应的社会关系,正是这些人又按照自己的社会关系创造了相应的原理、观念和范畴。”

    (三)社会法通过基准法机制发挥作用

    与民法不同,社会法有一个基准法机制即最低权利保障,它提供了一种在社会的基本制度中分配权利和义务的办法,即将弱者的部分权利规定为强者或国家和社会的义务,以矫正实质意义的不平等,缩小社会差距。

    1.以基准法保障底线

    所谓社会基准法,是将弱者的部分利益,抽象提升到社会层面,以法律的普遍意志代替弱者的个别意志,实现对其利益的特殊保护。具体就是,以立法形式规定过去由各方约定的某些内容,使弱者的权利从私有部门转移到公共部门,实现这部分权利法定化和基准化。比如,国家规定最低工资、最低劳动条件、最低生活保障标准等都是基准法,因其具有公法的法定性和强制性,任何团体和个人契约都不能与之相违背或通过协议改变。社会基准法在初次和再次分配中都有体现,如最低工资法属于初次分配,最低生活保障法属于再次分配。在一定程度上,社会基准法是对私法所有权绝对、等价有偿、契约自由等原则的限制和修正,通常被认为是推行某种“家长制”统治的结果,因为要实现从社会的富有阶层向贫困阶层进行资源再分配,将不可避免地侵犯到财产权的绝对性。

    社会基准法克服了弱者交易能力差、其利益常被民法意思自治方式剥夺的局限,在一定程度上改变了强弱主体力量不均衡状态。但是,它没有完全排除私法合意,即在基准法之上仍按契约自由原则,由市场和社会调节,这是社会法与其他部门法的显著不同。也就是说,当事人的约定只要不违反基准法,国家并不干预,个人和团体契约可以继续发挥作用。因此,社会法规范既有公法的强制性,也有私法的任意性,通过基准法限制某种利己主义的表达,通常被视为一种由统治权力强加于个人的必要。社会法与行政法的共同点在于,都实行强制性规范,但社会法是一种底线控制,没有完全排除契约自由。社会法与民法的共同点在于都尊重契约自由,但前者对契约自由作用有所限制,后者是当事人完全意思自治,任何外力干预都被视为违法或侵权。
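
    上述“基准之上仍由当事人约定”的机制可以用一段极简的示意代码来表达。其中最低工资数额为假设值,仅作说明,不对应任何现行标准:

```python
def effective_wage(agreed_wage, statutory_minimum=2000):
    """社会基准法的示意:约定不得低于法定基准,基准之上契约自由。
    statutory_minimum(法定最低工资)为假设数字。"""
    # 低于基准的约定无效,按基准执行;高于基准的约定照常有效。
    return max(agreed_wage, statutory_minimum)

print(effective_wage(1500))  # -> 2000:低于基准,按法定最低标准执行
print(effective_wage(3500))  # -> 3500:高于基准,尊重当事人约定
```

    可见,基准法只作“底线控制”:公法的强制性体现为下限,私法的任意性在下限之上继续发挥作用。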

    2.以义务规范体现权利

    社会基准法的另一种表现形式是,以义务规范体现权利。这也是社会法的显著特征之一,即立足于强弱分化人的真实状况,用具体的不平等的人和团体化的人重塑现代社会的法律人格,用倾斜保护方式明确相对弱势一方主体的权利,严格规定强势一方主体的义务,实现对社会弱者和民生的关怀。因此,社会法重在对私权附以社会义务,授予权利也是使相对人承担义务的手段。以社会保障法为例,社会救助、社会优抚、社会福利等主要由国家提供,社会保险则由雇主、雇员和国家共同负担,并规定为国家和社会义务,以保障民众的基本生活权利。由此,现代国家已成为新的财产来源之一,民众的生存权不再建立在民法传统意义上的私人财产所有权之上,而是立足于国家提供的生存保障与社会救济的基础之上。

    社会法上的权利义务之所以不一致,是因为社会生活中客观存在一种不对等性,法律对当事人的权利义务设定就有所不同。具体就是,通过后天弥补,以法律形式向弱者适当倾斜。因此,社会法不关心穷人对自己的困境负多大责任,赋予其社会保障权也不以承担义务为前提条件。其实质是,将民众和社会弱者的基准权利规定为国家和社会的义务,因此与一些学者所谓义务本位不同。如欧阳谿认为,社会法“在于促进社会生活之共同利益”,“必以社会为本位”。事实上,封建主义和资本主义以义务为本位的法律,只不过是多数人尽忠于少数人的义务而已。不仅如此,社会法对所有权设定义务并不以权利滥用或过错为条件,限制的也不是个体而是类权利,限制方式包括使所有权负有更多义务,向弱者适当倾斜等,与民法的禁止权利滥用原则并不相同。

    (四)社会法的根本目标是生活安全

    不同于民法维护交易安全、刑法维护人身和财产安全、行政法维护国家安全,社会法旨在维护民众的生活安全,保障其社会性生存。它基于保护社会脆弱群体而产生,形成了不同类型、内容丰富、功能互补的制度体系。

    1.社会法:维系民生之法

    社会法的内在精神是保护民生福祉,也就是保障人民的生活、群众的生计和社会安全。马克思指出:“人们为了能够‘创造历史’,必须能够生活,但是为了生活,首先就需要吃喝住穿以及其他一些东西。”从本质来看,社会法的终极目标是,确保每个公民都能过上合乎人的尊严的生活,保障民众免于匮乏的自由。其核心在于,保护某些特别需要扶助人群的经济生活安全,促进社会大众的普遍福利;其实质是,对市场经济中的失败者以及全体国民予以基本的生存权保障,以此促进整个社会的和谐稳定。笔者曾将理解社会法的关键词概括为“弱者的生活安全”“提供社会福利”“国家和社会帮助”,极言之即“生活安全”。由于社会法建立了一种弱者保护机制和利益分配的普遍正义立场,通常称为民生之法。

    社会法保障民众的生活安全有一个从部分社会到全体社会的发展过程。早期社会法仅仅是维护特殊群体的生活安全,认为社会法保护的是经济上处于从属地位的劳动者阶级这一特殊具体的主体。随着社会的发展,社会法的调整范围从弱者的生存救济拓展到普遍社会福利,实现了从部分社会到全体社会的转换。汉斯·F.察哈尔对此有过精辟总结,认为狭义社会法是“以保护处于经济劣势状况下的一群人的生活安全”;广义社会法是“以改善大众生活状况促进社会一般福利”。从功能上看,社会法有利于消融社会对抗、冲突,实现国家和社会安全,即通过保障民众的基本生存权利,扩大社会福利范围,增加公共服务数量,使每一个人都能获得某种程度的生活幸福感。

    2.社会法的最高本体和逻辑结构

    社会法主要通过行政给付保障民众的生活安全,这就要求国家直接提供诸如食品、救济金、补贴等基本条件,使人们在任何情况下都能维持起码的生活水准,这是社会法的最高本体。社会法上的给付分为间接给付和直接给付,如政府在工资、工时、工作条件等方面对企业进行规制,是一种间接给付;国家为保障民众生存而进行社会救助、社会保险、社会优抚补偿等,是直接给付。二者均指向国家积极义务所蕴含的实质平等。一方面,社会法上的给付是法定的,其依据必须是国家所颁布的实在法,而不能单纯地依靠宪法,因此无法律则无社会给付;另一方面,在社会给付法律关系中,国家事实上是给付主体和“财产的公众代理人”,这既是一种公共职能,也是一种国家义务。

    通过行政给付,社会法确认和保护民众的生存权、社会保险权与福利权等,最终形成系统化、不同类型的结构体系。一是社会保护法,即保护妇女、未成年人、残疾人、老年人、劳工等脆弱群体的法规概称。目前,国际社会普遍将社会保护的重点确定为在社会保障体系中得不到充分保护的人。二是社会保障法,即国家用来应对全体社会成员因疾病、生育、工伤、失业和年老等引起收入减少或中断后造成经济和社会困境的法规总称,包括社会保险、社会救助、社会优抚与补偿法等。三是社会促进法,即某一类社会立法,能够促进社会实质正义、社会效用和福利等普遍提升,使公民的生活更加富足、便捷、安定,如慈善法、反歧视法、扶贫法等。这是社会法的三个基本类型,都蕴含行政给付,也都以保障民众的生活安全为目标,在本质上是一致的。

    四、围绕社会法本质的体系建构

    自新中国成立尤其是改革开放后,我国社会法建设取得了很大成就,但相比之下仍然是最为落后的法律部门。由于起步较晚,研究还不充分,至今没有形成相对系统的社会法体系。如何从本质上对社会法以概念清晰、理论坚实、结构严整、逻辑缜密的方式进行体系化建构,并外化为全面有序的法规系列,是推动我国社会法实践和经济社会稳定发展必须解决的重要问题。

    (一)加强社会法科学民主立法

    参照发达国家经验,一方面,我国社会法最大的问题是基本法律缺失,本应是“四梁八柱”的社会救助法、医疗保障法、社会福利法、社会补偿法等仍不见踪影。在社会法分支领域,亦存在诸多盲点,如集体协商与集体合同法、反就业歧视法等尚未出台,涉及平台劳动者保护的法规亦鲜有问世。另一方面,一些法规存在矛盾和冲突。

    针对上述问题,宜在现有法规基础上,以保障民生和共同富裕为导向,进一步完善社会法体系。当前,我国民众在就业、养老、医疗、居住等方面仍存在很多困难,亟待通过立法解决。而且,要促进社会法规范和制度衔接。以社会救助和社会保险为例,我国和美国都实行分立模式,但美国没有社会保险的居民可以得到相应社会救助保障。在英国,1909 年的《扶贫法》要求政府在实行社会救助的同时,通过强制性社会保险使失业人员得到生活救济。在解决法规冲突方面,我国《立法法》确立了两项制度:一是直接解决机制,即“新法优于旧法”“上位法优于下位法”“特别法优于一般法”;二是间接解决机制,即将无法适用处理规则的冲突纳入送请裁决范围,区分法定和酌定情形,由有权机关裁决。此外,也可以运用利益衡量方法化解法律规范冲突,填补法律漏洞。

    同时,提高立法质量。由于种种原因,我国社会法普遍存在立法质量不高问题,主要表现为立法层级低、碎片化严重、落后于实践发展等。以社会保障法为例,除了《社会保险法》,其他都是行政法规和部门规章。由于法规权威性不足,我国社会保障发展明显受限。因此,提高立法层级,建立覆盖面广的法规体系非常重要。从《社会保险法》来看,也存在很多问题。一是占全国人口一半的农民、没有就业的城镇居民、公务员和军人等保险都是“由国务院另行规定”,没有体现全民性;二是其内容远远落后于实践,如城居保与新农合、生育保险与医疗保险已合并,机关事业单位已纳入社会保险,社会保险费明确由税务部门征收,但《社会保险法》均没有体现。由于社会法立法质量不高,不仅没有解决好贫富差距问题,而且在某种意义上使贫富差距逐渐扩大。

    要改变这种状况,必须深入推进社会法科学立法、民主立法。科学立法的核心在于根据社会发展需要,制定符合实际情况的社会法制度。事实上,一项法律只有切实可行,才会产生效力。以最低生活保障法为例,对救济款实行“一刀切”是不科学的,一些发达国家通常采用一种负所得税法,即按照被保障人收入实行差额补助,可以借鉴。所谓民主立法,就是在立法决策、活动中,坚持人民主体性地位,“要把体现人民利益、反映人民愿望、维护人民权益、增进人民福祉落实到依法治国全过程”。需要说明的是,我国社会法意在保障民众的基本生存权,将贫富分化控制在一定范围内,并非“福利赶超”或“泛福利化”,否则会“导致社会活力不足”,阻碍人们的积极性和创造性。
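
    上文提到的“负所得税”式差额补助,可以用一个简单算式说明:补助 = (保障线 − 实际收入) × 补差比例,收入超过保障线则不予补助。下面的 Python 片段只是示意,其中保障线与补差比例均为假设数字,不对应任何现行标准:

```python
def differential_subsidy(income, guarantee_line=1000.0, taper=0.5):
    """负所得税式差额补助的示意计算。
    guarantee_line(保障线)与 taper(补差比例)均为假设参数。"""
    return max(0.0, (guarantee_line - income) * taper)

print(differential_subsidy(0))     # -> 500.0:收入为零时补助最高
print(differential_subsidy(600))   # -> 200.0:收入增加,补助相应递减
print(differential_subsidy(1200))  # -> 0.0:收入超过保障线则无补助
```

    与“一刀切”相比,这种差额补助随收入平滑递减,不会在保障线附近形成“多挣一元反而少得”的断崖效应。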

    (二)提升社会法行政执法效能

    社会法行政执法分为两项:一是行政给付,二是行政监察。前者为积极执法,由政府主动履行法定义务;后者为消极执法,实行不告不理原则。在行政执法中,如果当事人违法,还会产生相应的行政、民事和刑事责任。

    1.充分发挥行政给付功能

    社会法行政执法的主要内容是行政给付,这是社会法与传统部门法最显著的区别,体现了法律思想从形式正义到实质正义的追求。但从我国行政给付情况看,重视和保障弱势群体利益的特征并不明显。党的二十届三中全会明确提出,要加强普惠性、基础性、兜底性民生建设。近年来,尽管国家采取了大量措施解决民生问题,但相对贫穷问题依然存在,民生保障还存在薄弱环节。一方面,行政给付中社会保护和社会促进支出很少;另一方面,城乡和地区之间差异较大。在经济发达地区和效益好的单位,给付标准高,在落后地区和效益不好的单位,给付标准低,形成一种反向歧视。不仅如此,有的地方仍存在“人情保”“关系保”等现象,使得法定的行政给付和社会保障功能大打折扣。

    社会法上的行政给付有一个重要特点是,社会化程度越高,保障功效越好,体现的管理制度越公平。我国正处于社会转型期,为更好防范和化解新的社会矛盾,亟待建立公平的行政给付制度体系。一是政府积极主动执法。社会法所保障的社会权利与政治权利不同,政府不积极作为就很难实现。以残疾人保障为例,他们有着特殊的生理和社会需求,需要额外帮助和政府主动作为。当然,社会保护给付并不否定NGO和私人机构的作用,因为政府也会失灵。二是建立行政给付统筹与协调制度。以社会救助为例,目前最低生活保障和临时救助由民政部门负责,特定失业群体救助由人社部门负责,教育类救助由教育部门负责,且救助给付审批程序烦琐,耗时过长,有待改进。三是坚决惩治行政给付中的腐败行为,真正建立群众满意的阳光下的给付制度。

    2.减少行政立法,加强监察职能

    我国社会法有一个重要特点是,法律条文多是原则性、指导性规定,软法性质明显,在立法中授权政府部门另行制定法规或规章的情况很常见。由此,行政部门实际上扮演了执法和立法主体的双重角色。以劳动法为例,由于没有处理好原则与规则的关系,很多规范仍以行政法规和部门规章的形式出台。以社会保险法为例,很多现行制度没有在法律中体现,而是由国务院及其部委的“决定”“通知”等规定。例如,有关养老保险费缓缴、基本养老保险待遇、工伤和医疗保险先行支付与追偿等,都是由国务院文件规定,没有法定标准。甚至一些体制性问题如社保转移接续、社保费征缴主体等都是由行政机关协调解决。

    在我国社会法执法中,应“去行政化”,使其回归监察定位。一是建立健全的监察体制。目前,劳动和社会保障监察已进入实操,但仍存在机构名称设置不规范不统一、规格不一致等问题。二是执法必严。社会法执法不严现象也应纠正,如基本养老保险全国统筹是《社会保险法》明文规定的,但至今省级统筹的目标仍未实现。为此,要大力推动执法权限和力量下沉,以适应社会法执法的实际需要。三是改进执法方式,逐步解决执法中的不作为、乱作为问题,将权力关进制度的笼子。

    (三)推进社会法司法化

    我国社会法在司法机制上仍存在很多空白,例如,社会保护和社会促进法体现的主要是宣示性权利,很少在法院适用。事实上,只有在社会权利受到法院或准司法机构保护的时候,社会法才能真正发挥稳定器的作用。

    1.社会法司法化的限度

    社会法上的诉权并非完全的权利,而是受到了一定限制。一方面,有关社会权的诉讼不可能扩展到尚未纳入法律保护的领域;另一方面,即便有些权利已经纳入法律保护,也不是完全可诉的。这也是社会法区别于其他部门法的显著特征。首先,社会权与自由权有很大区别。社会权需要国家采取积极措施才能实现,自由权只要国家不干预即能实现。其次,国家对国民的责任有一定限度。社会法上的国家责任是由法律明确规定的,是一种有限责任。再次,由司法决定行政给付有违权力分立理念。社会法的行政给付传统上都是由立法和行政机关作出裁量,如果司法过度侵入,会被认为危及民主制度和权力分工体系。最后,由立法和行政机关决定公共资源分配有现实合理性。由于社会法上的权利保护与大量资金投入有关,请求权客体(财政资源)的有限性直接决定了其诉讼的限制性。

    但是,这并不意味着社会法上的权利是不可诉的,承认一部分权利的可诉性,可以促进国家履行其承诺的积极义务。以社会保障权为例,对于公民依法享有的社会保险、社会福利等待遇,当事人可以起诉;对于基准法和约定权益受到侵犯,也可以起诉。如1970年的戈德伯格诉凯利案中,美国联邦最高法院明确指出,社会福利可以请求法院救济。在英国和法国,社会法诉讼由社会保障法庭解决,德国则设立了专门的社会法院。但是,对政府确立的给付标准、最低工资标准等不满意,则不能起诉,因其在很大程度上是由政治而非司法决定。这也是社会法与其他部门法最重要的区别之一。如在1956年日本朝日诉讼案中,原告认为每月600日元不符合宪法规定的最低生活条件,但由于被告日本政府的解释理由更充分,导致“原告的诉讼请求无疾而终”。

    2.社会法司法化的实践进路

    确立公益诉讼和诉讼担当人制度。由于社会权益被侵害的后果不限于某个当事人,而是包含不特定多数人甚至公共社会,非利害关系人亦可起诉。比如,印度建立了一种公益诉讼模式,即只要是善意的,任何人都可以为受害人起诉。在社会法诉讼中,还有诉讼担当人和集团诉讼概念,也是对民事诉讼主体资格的突破和超越。如在集体合同争议中,工会是诉讼担当人和唯一主体,其他任何组织和个人都无权起诉。诉讼担当人与民法上的委托代理人不同,当事人不能解除其担当关系。此外,集团诉讼也是社会法的另一种诉讼机制。20世纪90年代,利用集团诉讼处理劳动保护、社会保险等纠纷成为潮流。对于诉讼请求较小的当事人来说,如果起诉标的比诉讼费用少,当事人就倾向于集团诉讼。

    实行举证责任倒置制度。社会法司法机制同样体现了向弱者倾斜的理念。20世纪以来,在大量司法实践中,诞生了社会法另一个独特的司法机制——举证责任倒置。以工伤事故为例,法律明确规定由雇主承担举证责任;在欠薪案中,劳动者对未付工资的事实不负举证责任,都体现了对劳动者的特殊保护。这一点从工作场所中雇员给雇主造成损失和雇主给雇员造成损失承担责任以及举证责任的“非对等性”也可以看出。再如,就业歧视在美国等国家是违法的,当事人只要表明歧视发生时的情况即可,此后举证责任就转移到雇主那里,否则就构成歧视,在行政给付、社会保护等案例中也是如此。举证责任倒置主要是对弱者实行最大限度的司法保护,应确立为我国社会法基本的司法制度。

    设置专门法庭或适用简易程序。在司法程序上,社会法争议亦有别于一般民事诉讼。以劳动司法为例,很多国家设置了行政裁判前置程序,并确立了两项重要原则:一是缩短劳动争议审限,二是劳资同盟介入。因此,社会法司法一般审限较短,程序也简单。由于当事人的诉讼请求与生存权和健康权等息息相关,如果像债权、物权一样按照民事案件审理,期限都在半年或一年以上,这种马拉松式的诉讼显然与权利人生存的现实需要是不相容的,很可能危及其生存。对于社会法诉讼中一些耗时长、成本高的案件,为了节省社会成本和当事人的开支,应当使争议得到迅速和经济的处理。为此,可以借鉴一些国家的成功经验,设置专业裁判所或专门法庭,适用简易程序审理。

    本文转自《中国社会科学》2024年第11期

  • John D. Kelleher 《Deep Learning》

    1 Introduction to Deep Learning
    2 Conceptual Foundations 
    3 Neural Networks: The Building Blocks of Deep Learning
    4 A Brief History of Deep Learning
    5 Convolutional and Recurrent Neural Networks
    6 Learning Functions
    7 The Future of Deep Learning

    1 Introduction to Deep Learning

    Deep learning is the subfield of artificial intelligence that focuses on creating large neural network models that are capable of making accurate data-driven decisions. Deep learning is particularly suited to contexts where the data is complex and where there are large datasets available. Today most online companies and high-end consumer technologies use deep learning. Among other things, Facebook uses deep learning to analyze text in online conversations. Google, Baidu, and Microsoft all use deep learning for image search, and also for machine translation. All modern smart phones have deep learning systems running on them; for example, deep learning is now the standard technology for speech recognition, and also for face detection on digital cameras. In the healthcare sector, deep learning is used to process medical images (X-rays, CT, and MRI scans) and diagnose health conditions. Deep learning is also at the core of self-driving cars, where it is used for localization and mapping, motion planning and steering, and environment perception, as well as tracking driver state.

    Perhaps the best-known example of deep learning is DeepMind’s AlphaGo. Go is a board game similar to Chess. AlphaGo was the first computer program to beat a professional Go player. In March 2016, it beat the top Korean professional, Lee Sedol, in a match watched by more than two hundred million people. The following year, in 2017, AlphaGo beat the world’s No. 1 ranking player, China’s Ke Jie.

    In 2016 AlphaGo’s success was very surprising. At the time, most people expected that it would take many more years of research before a computer would be able to compete with top level human Go players. It had been known for a long time that programming a computer to play Go was much more difficult than programming it to play Chess. There are many more board configurations possible in Go than there are in Chess. This is because Go has a larger board and simpler rules than Chess. There are, in fact, more possible board configurations in Go than there are atoms in the universe. This massive search space and Go’s large branching factor (the number of board configurations that can be reached in one move) makes Go an incredibly challenging game for both humans and computers.

    One way of illustrating the relative difficulty Go and Chess presented to computer programs is through a historical comparison of how Go and Chess programs competed with human players. In 1967, MIT’s MacHack-6 Chess program could successfully compete with humans and had an Elo rating well above novice level, and, by May 1997, Deep Blue was capable of beating the Chess world champion Garry Kasparov. In comparison, the first complete Go program wasn’t written until 1968 and strong human players were still able to easily beat the best Go programs in 1997.

    The time lag between the development of Chess and Go computer programs reflects the difference in computational difficulty between these two games. However, a second historic comparison between Chess and Go illustrates the revolutionary impact that deep learning has had on the ability of computer programs to compete with humans at Go. It took thirty years for Chess programs to progress from human level competence in 1967 to world champion level in 1997. However, with the development of deep learning it took only seven years for computer Go programs to progress from advanced amateur to world champion; as recently as 2009 the best Go program in the world was rated at the low-end of advanced amateur. This acceleration in performance through the use of deep learning is nothing short of extraordinary, but it is also indicative of the types of progress that deep learning has enabled in a number of fields.

    AlphaGo uses deep learning to evaluate board configurations and to decide on the next move to make. The fact that AlphaGo used deep learning to decide what move to make next is a clue to understanding why deep learning is useful across so many different domains and applications. Decision-making is a crucial part of life. One way to make decisions is to base them on your “intuition” or your “gut feeling.” However, most people would agree that the best way to make decisions is to base them on the relevant data. Deep learning enables data-driven decisions by identifying and extracting patterns from large datasets that accurately map from sets of complex inputs to good decision outcomes.

    Artificial Intelligence, Machine Learning, and Deep Learning

    Deep learning has emerged from research in artificial intelligence and machine learning. Figure 1.1 illustrates the relationship between artificial intelligence, machine learning, and deep learning.

    The field of artificial intelligence was born at a workshop at Dartmouth College in the summer of 1956. Research on a number of topics was presented at the workshop including mathematical theorem proving, natural language processing, planning for games, computer programs that could learn from examples, and neural networks. The modern field of machine learning draws on the last two topics: computers that could learn from examples, and neural network research.

    Figure 1.1 The relationship between artificial intelligence, machine learning, and deep learning.

    Machine learning involves the development and evaluation of algorithms that enable a computer to extract (or learn) functions from a dataset (sets of examples). To understand what machine learning means we need to understand three terms: dataset, algorithm, and function.

    In its simplest form, a dataset is a table where each row contains the description of one example from a domain, and each column contains the information for one of the features in a domain. For example, table 1.1 illustrates an example dataset for a loan application domain. This dataset lists the details of four example loan applications. Excluding the ID feature, which is only for ease of reference, each example is described using three features: the applicant’s annual income, their current debt, and their credit solvency.

    Table 1.1. A dataset of loan applicants and their known credit solvency ratings

    ID | Annual Income | Current Debt | Credit Solvency
    1  | $150          | -$100        | 100
    2  | $250          | -$300        | -50
    3  | $450          | -$250        | 400
    4  | $200          | -$350        | -300
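
    The same dataset can also be written down directly as rows of feature values. A minimal Python rendering of the four example applications (the tuple layout is just one convenient choice) might be:

```python
# Each row is one loan application:
# (ID, annual income, current debt, credit solvency)
dataset = [
    (1, 150, -100, 100),
    (2, 250, -300, -50),
    (3, 450, -250, 400),
    (4, 200, -350, -300),
]

# Each column is a feature; e.g., collect every applicant's credit solvency.
solvency = [row[3] for row in dataset]
print(solvency)  # -> [100, -50, 400, -300]
```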

    An algorithm is a process (or recipe, or program) that a computer can follow. In the context of machine learning, an algorithm defines a process to analyze a dataset and identify recurring patterns in the data. For example, the algorithm might find a pattern that relates a person’s annual income and current debt to their credit solvency rating. In mathematics, relationships of this type are referred to as functions.

    A function is a deterministic mapping from a set of input values to one or more output values. The fact that the mapping is deterministic means that for any specific set of inputs a function will always return the same outputs. For example, addition is a deterministic mapping, and so 2+2 is always equal to 4. As we will discuss later, we can create functions for domains that are more complex than basic arithmetic, we can for example define a function that takes a person’s income and debt as inputs and returns their credit solvency rating as the output value. The concept of a function is very important to deep learning so it is worth repeating the definition for emphasis: a function is simply a mapping from inputs to outputs. In fact, the goal of machine learning is to learn functions from data. A function can be represented in many different ways: it can be as simple as an arithmetic operation (e.g., addition or subtraction are both functions that take inputs and return a single output), a sequence of if-then-else rules, or it can have a much more complex representation.
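
    A function in this sense can be written directly as code. The sketch below is a toy example: the particular scoring rule (summing income and signed debt) is an invented stand-in, not a mapping taken from the text or from any real credit model:

```python
def credit_solvency(income, debt):
    """A deterministic mapping: the same inputs always yield the same output.
    The rule itself is invented for illustration only."""
    return income + debt  # debt is negative, so it lowers the score

print(credit_solvency(150, -100))  # -> 50
print(credit_solvency(150, -100))  # -> 50 (same inputs, same output)
```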

    One way to represent a function is to use a neural network. Deep learning is the subfield of machine learning that focuses on deep neural network models. In fact, the patterns that deep learning algorithms extract from datasets are functions that are represented as neural networks. Figure 1.2 illustrates the structure of a neural network. The boxes on the left of the figure represent the memory locations where inputs are presented to the network. Each of the circles in this figure is called a neuron and each neuron implements a function: it takes a number of values as input and maps them to an output value. The arrows in the network show how the outputs of each neuron are passed as inputs to other neurons. In this network, information flows from left to right. For example, if this network were trained to predict a person’s credit solvency, based on their income and debt, it would receive the income and debt as inputs on the left of the network and output the credit solvency score through the neuron on the right.

    A neural network uses a divide-and-conquer strategy to learn a function: each neuron in the network learns a simple function, and the overall (more complex) function, defined by the network, is created by combining these simpler functions. Chapter 3 will describe how a neural network processes information.
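
    The divide-and-conquer composition can be sketched in a few lines of Python: each neuron is a simple function (a weighted sum passed through a nonlinearity), and the network is just a composition of those functions. Every weight and bias below is an arbitrary made-up number; a real network would learn them from data:

```python
import math

def neuron(inputs, weights, bias):
    """One neuron: weighted sum of inputs plus bias, squashed by a logistic."""
    s = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-s))

def tiny_network(income, debt):
    """Compose simple neuron functions into a more complex overall function.
    All weights and biases here are invented for illustration."""
    h1 = neuron([income, debt], [0.5, 0.5], 0.0)   # hidden neuron 1
    h2 = neuron([income, debt], [-0.4, 0.9], 0.1)  # hidden neuron 2
    return neuron([h1, h2], [1.2, -0.7], 0.0)      # output neuron

print(tiny_network(1.5, -1.0))  # a score between 0 and 1
```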

    Figure 1.2 Schematic illustration of a neural network.

    What Is Machine Learning?

    A machine learning algorithm is a search process designed to choose the best function, from a set of possible functions, to explain the relationships between features in a dataset. To get an intuitive understanding of what is involved in extracting, or learning, a function from data, consider a set of sample inputs to an unknown function and the outputs it returns for them. Given such examples, the task is to decide which arithmetic operation (addition, subtraction, multiplication, or division) is the best choice to explain the mapping the unknown function defines between its inputs and output.

    Most people would agree that multiplication is the best choice because it provides the best match to the observed relationship, or mapping, from the inputs to the outputs.

    In this particular instance, choosing the best function is relatively straightforward, and a human can do it without the aid of a computer. However, as the number of inputs to the unknown function increases (perhaps to hundreds or thousands of inputs), and the variety of potential functions to be considered gets larger, the task becomes much more difficult. It is in these contexts that harnessing the power of machine learning to search for the best function, to match the patterns in the dataset, becomes necessary.
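
    A toy version of this search can be written directly: try each candidate operation against a handful of example input-output pairs and keep the one that reproduces the most outputs. The example pairs below are invented for illustration (they are not the ones used in the text):

```python
import operator

# Invented example mappings for an unknown two-input function.
examples = [((2, 3), 6), ((4, 5), 20), ((1, 7), 7)]

candidates = {
    "addition": operator.add,
    "subtraction": operator.sub,
    "multiplication": operator.mul,
    "division": operator.truediv,
}

def score(fn):
    """Count how many example pairs the candidate function reproduces exactly."""
    return sum(1 for (a, b), out in examples if fn(a, b) == out)

best = max(candidates, key=lambda name: score(candidates[name]))
print(best)  # -> multiplication (matches all three examples)
```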

    Machine learning involves a two-step process: training and inference. During training, a machine learning algorithm processes a dataset and chooses the function that best matches the patterns in the data. The extracted function will be encoded in a computer program in a particular form (such as if-then-else rules or parameters of a specified equation). The encoded function is known as a model, and the analysis of the data in order to extract the function is often referred to as training the model. Essentially, models are functions encoded as computer programs. However, in machine learning the concepts of function and model are so closely related that the distinction is often skipped over and the terms may even be used interchangeably.

    In the context of deep learning, the relationship between functions and models is that the function extracted from a dataset during training is represented as a neural network model, and conversely a neural network model encodes a function as a computer program. The standard process used to train a neural network is to begin training with a neural network where the parameters of the network are randomly initialized (we will explain network parameters later; for now just think of them as values that control how the function the network encodes works). This randomly initialized network will be very inaccurate in terms of its ability to match the relationship between the various input values and target outputs for the examples in the dataset. The training process then proceeds by iterating through the examples in the dataset, and, for each example, presenting the input values to the network and then using the difference between the output returned by the network and the correct output for the example listed in the dataset to update the network’s parameters so that it matches the data more closely. Once the machine learning algorithm has found a function that is sufficiently accurate (in terms of the outputs it generates matching the correct outputs listed in the dataset) for the problem we are trying to solve, the training process is completed, and the final model is returned by the algorithm. This is the point at which the learning in machine learning stops.
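
    The iterate-and-update loop just described can be sketched with the simplest possible "network": a single linear unit with two randomly initialized parameters. The dataset, learning rate, and number of passes below are all invented for illustration:

```python
import random

random.seed(0)

# Invented training data generated by the target rule: output = 2*x1 + 1*x2.
data = [((x1, x2), 2 * x1 + x2) for x1 in range(-3, 4) for x2 in range(-3, 4)]

w1, w2 = random.random(), random.random()  # random initialization
lr = 0.01                                  # learning rate (illustrative)

for _ in range(200):                # iterate through the dataset repeatedly
    for (x1, x2), target in data:
        pred = w1 * x1 + w2 * x2    # present the inputs to the "network"
        err = pred - target         # difference from the correct output
        w1 -= lr * err * x1         # update parameters to reduce the error
        w2 -= lr * err * x2

print(round(w1, 2), round(w2, 2))   # learned parameters, close to 2 and 1
```

Because the invented data is exactly realizable by a linear unit, the repeated small corrections drive the parameters toward the rule that generated the data; this is the sense in which training "matches the data more closely" on each pass.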

    Once training has finished, the model is fixed. The second stage in machine learning is inference. This is when the model is applied to new examples—examples for which we do not know the correct output value, and therefore we want the model to generate estimates of this value for us. Most of the work in machine learning is focused on how to train accurate models (i.e., extracting an accurate function from data). This is because the skills and methods required to deploy a trained machine learning model into production, in order to do inference on new examples at scale, are different from those that a typical data scientist will possess. There is a growing recognition within the industry of the distinctive skills needed to deploy artificial intelligence systems at scale, and this is reflected in a growing interest in the field known as DevOps, a term describing the need for collaboration between development and operations teams (the operations team being the team responsible for deploying a developed system into production and ensuring that these systems are stable and scalable). The terms MLOps, for machine learning operations, and AIOps, for artificial intelligence operations, are also used to describe the challenges of deploying a trained model. The questions around model deployment are beyond the scope of this book, so we will instead focus on describing what deep learning is, what it can be used for, how it has evolved, and how we can train accurate deep learning models.

    One relevant question here is: why is extracting a function from data useful? The reason is that once a function has been extracted from a dataset it can be applied to unseen data, and the values returned by the function in response to these new inputs can provide insight into the correct decisions for these new problems (i.e., it can be used for inference). Recall that a function is simply a deterministic mapping from inputs to outputs. The simplicity of this definition, however, hides the variety that exists within the set of functions. Consider the following examples:

    • Spam filtering is a function that takes an email as input and returns a value that classifies the email as spam (or not).
    • Face recognition is a function that takes an image as input and returns a labeling of the pixels in the image that demarcates the face in the image.
    • Gene prediction is a function that takes a genomic DNA sequence as input and returns the regions of the DNA that encode a gene.
    • Speech recognition is a function that takes an audio speech signal as input and returns a textual transcription of the speech.
    • Machine translation is a function that takes a sentence in one language as input and returns the translation of that sentence in another language.

    It is because the solutions to so many problems across so many domains can be framed as functions that machine learning has become so important in recent years.
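To make this framing concrete, here is spam filtering written literally as a function from an email to a label. The keyword rule is a hand-crafted toy stand-in; in machine learning, this mapping would instead be extracted from a dataset of labeled emails:

```python
# Toy illustration of framing spam filtering as a function. The keyword list
# is invented for illustration; a learned model would replace this rule.
def spam_filter(email: str) -> bool:
    suspicious = ["winner", "free money", "act now"]
    return any(phrase in email.lower() for phrase in suspicious)

print(spam_filter("You are a WINNER! Claim your prize"))  # True
print(spam_filter("Meeting moved to 3pm"))                # False
```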

    Why Is Machine Learning Difficult?

    There are a number of factors that make the machine learning task difficult, even with the help of a computer. First, most datasets will include noise3 in the data, so searching for a function that matches the data exactly is not necessarily the best strategy to follow, as it is equivalent to learning the noise. Second, it is often the case that the set of possible functions is larger than the set of examples in the dataset. This means that machine learning is an ill-posed problem: the information given in the problem is not sufficient to find a single best solution; instead multiple possible solutions will match the data. We can use the problem of selecting the arithmetic operation (addition, subtraction, multiplication, or division) that best matches a set of example input-output mappings for an unknown function to illustrate the concept of an ill-posed problem. Here are the example mappings for this function selection problem:

    Given these examples, multiplication and division are better matches for the unknown function than addition and subtraction. However, it is not possible to decide whether the unknown function is actually multiplication or division using this sample of data, because both operations are consistent with all the examples provided. Consequently, this is an ill-posed problem: it is not possible to select a single best answer given the information provided in the problem.
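This selection problem is easy to simulate. The original example mappings did not survive the formatting of this text, so the data below is constructed to have the property described above: whenever the second input is 1, multiplication and division produce identical outputs, so both remain consistent with every example:

```python
# Illustration of an ill-posed selection problem. The data is invented so
# that multiplication and division are indistinguishable (second input is 1),
# while addition and subtraction fail to match.
examples = [(3, 1, 3), (5, 1, 5), (8, 1, 8)]  # (input 1, input 2, target)

candidates = {
    "addition": lambda a, b: a + b,
    "subtraction": lambda a, b: a - b,
    "multiplication": lambda a, b: a * b,
    "division": lambda a, b: a / b,
}

consistent = [name for name, f in candidates.items()
              if all(f(a, b) == t for a, b, t in examples)]
print(consistent)  # ['multiplication', 'division']: no single best answer
```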

    One strategy to solve an ill-posed problem is to collect more data (more examples) in the hope that the new examples will help us to discriminate between the correct underlying function and the remaining alternatives. Frequently, however, this strategy is not feasible, either because the extra data is not available or is too expensive to collect. Instead, machine learning algorithms overcome the ill-posed nature of the machine learning task by supplementing the information provided by the data with a set of assumptions about the characteristics of the best function, and use these assumptions to influence the process used by the algorithm that selects the best function (or model). These assumptions are known as the inductive bias of the algorithm because in logic a process that infers a general rule from a set of specific examples is known as inductive reasoning. For example, if all the swans that you have seen in your life are white, you might induce from these examples the general rule that all swans are white. This concept of inductive reasoning relates to machine learning because a machine learning algorithm induces (or extracts) a general rule (a function) from a set of specific examples (the dataset). Consequently, the assumptions that bias a machine learning algorithm are, in effect, biasing an inductive reasoning process, and this is why they are known as the inductive bias of the algorithm.

    So, a machine learning algorithm uses two sources of information to select the best function: one is the dataset, and the other (the inductive bias) is the assumptions that bias the algorithm to prefer some functions over others, irrespective of the patterns in the dataset. The inductive bias of a machine learning algorithm can be understood as providing the algorithm with a perspective on a dataset. However, just as in the real world, where there is no single best perspective that works in all situations, there is no single best inductive bias that works well for all datasets. This is why there are so many different machine learning algorithms: each algorithm encodes a different inductive bias. The assumptions encoded in the design of a machine learning algorithm can vary in strength. The stronger the assumptions, the less freedom the algorithm is given in selecting a function that fits the patterns in the dataset. In a sense, the dataset and inductive bias counterbalance each other: machine learning algorithms that have a strong inductive bias pay less attention to the dataset when selecting a function. For example, if a machine learning algorithm is coded to prefer a very simple function, no matter how complex the patterns in the data, then it has a very strong inductive bias.

    In chapter 2 we will explain how we can use the equation of a line as a template structure to define a function. The equation of the line is a very simple type of mathematical function. Machine learning algorithms that use the equation of a line as the template structure for the functions they fit to a dataset make the assumption that the model they generate should encode a simple linear mapping from inputs to output. This assumption is an example of an inductive bias. It is, in fact, an example of a strong inductive bias, as no matter how complex (or nonlinear) the patterns in the data are the algorithm will be restricted (or biased) to fit a linear model to it.

    One of two things can go wrong if we choose a machine learning algorithm with the wrong bias. First, if the inductive bias of a machine learning algorithm is too strong, then the algorithm will ignore important information in the data and the returned function will not capture the nuances of the true patterns in the data. In other words, the returned function will be too simple for the domain,4 and the outputs it generates will not be accurate. This outcome is known as the function underfitting the data. Alternatively, if the bias is too weak (or permissive), the algorithm is allowed too much freedom to find a function that closely fits the data. In this case, the returned function is likely to be too complex for the domain, and, more problematically, the function is likely to fit to the noise in the sample of the data that was supplied to the algorithm during training. Fitting to the noise in the training data will reduce the function’s ability to generalize to new data (data that is not in the training sample). This outcome is known as overfitting the data. Finding a machine learning algorithm that balances data and inductive bias appropriately for a given domain is the key to learning a function that neither underfits nor overfits the data, and that, therefore, generalizes successfully in that domain (i.e., that is accurate at inference, or processing new examples that were not in the training data).
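A standard way to see this trade-off in code, sketched below under invented assumptions (a quadratic "true" pattern plus random noise), is to fit one function whose bias is too strong (a constant) and one whose bias is too weak (a degree-9 polynomial that can pass through every training point):

```python
# Sketch of underfitting vs. overfitting, assuming an invented quadratic
# pattern plus noise. A degree-0 fit (very strong bias) underfits; a
# degree-9 fit (very weak bias) matches the training data, noise and all.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 10)
y = x**2 + rng.normal(0, 0.05, size=x.size)   # quadratic pattern plus noise

simple = np.polyfit(x, y, deg=0)     # strong bias: a constant function
flexible = np.polyfit(x, y, deg=9)   # weak bias: can interpolate every point

def train_error(coeffs):
    # mean squared error of a candidate function on the training data
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

print(train_error(simple))    # large: the constant ignores the pattern
print(train_error(flexible))  # near zero: the polynomial fits the noise too
```

The constant underfits (large training error); the degree-9 polynomial drives its training error to nearly zero by fitting the noise, which is exactly the behavior that harms generalization to new data.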

    However, in domains that are complex enough to warrant the use of machine learning, it is not possible to know in advance which assumptions will correctly bias the selection of a model from the data. Consequently, data scientists must use their intuition (i.e., make informed guesses) and trial-and-error experimentation in order to find the best machine learning algorithm for a given domain.

    Neural networks have a relatively weak inductive bias. As a result, generally, the danger with deep learning is that the neural network model will overfit, rather than underfit, the data. It is because neural networks pay so much attention to the data that they are best suited to contexts where there are very large datasets. The larger the dataset, the more information the data provides, and therefore it becomes more sensible to pay more attention to the data. Indeed, one of the most important factors driving the emergence of deep learning over the last decade has been the emergence of Big Data. The massive datasets that have become available through online social platforms and the proliferation of sensors have combined to provide the data necessary to train neural network models to support new applications in a range of domains. To give a sense of the scale of the big data used in deep learning research, Facebook’s face recognition software, DeepFace, was trained on a dataset of four million facial images belonging to more than four thousand identities (Taigman et al. 2014).

    The Key Ingredients of Machine Learning

    The above example of deciding which arithmetic operation best explains the relationship between inputs and outputs in a set of data illustrates the three key ingredients in machine learning:
    1. Data (a set of historical examples).
    2. A set of functions that the algorithm will search through to find the best match with the data.
    3. Some measure of fitness that can be used to evaluate how well each candidate function matches the data.

    All three of these ingredients must be correct if a machine learning project is to succeed; below we describe each of these ingredients in more detail.

    We have already introduced the concept of a dataset as a two-dimensional table (or n × m matrix),5 where each row contains the information for one example, and each column contains the information for one of the features in the domain. For example, table 1.2 illustrates how the sample inputs and outputs of the first unknown arithmetic function problem in the chapter can be represented as a dataset. This dataset contains four examples (also known as instances), and each example is represented using two input features and one output (or target) feature. Designing and selecting the features to represent the examples is a very important step in any machine learning project.

    As is so often the case in computer science, and machine learning, there is a tradeoff in feature selection. If we choose to include only a minimal number of features in the dataset, then it is likely that a very informative feature will be excluded from the data, and the function returned by the machine learning algorithm will not work well. Conversely, if we choose to include as many features as possible in the domain, then it is likely that irrelevant or redundant features will be included, and this will also likely result in the function not working well. One reason for this is that the more redundant or irrelevant features are included, the greater the probability that the machine learning algorithm will extract patterns based on spurious correlations between these features. In these cases, the algorithm gets confused between the real patterns in the data and the spurious patterns that only appear in the data due to the particular sample of examples that have been included in the dataset.

    Finding the correct set of features to include in a dataset involves engaging with experts who understand the domain, using statistical analysis of the distribution of individual features and also the correlations between pairs of features, and a trial-and-error process of building models and checking the performance of the models when particular features are included or excluded. This process of dataset design is a labor-intensive task that often takes up a significant portion of the time and effort expended on a machine learning project. It is, however, a critical task if the project is to succeed. Indeed, identifying which features are informative for a given task is frequently where the real value of machine learning projects emerges.

    The second ingredient in a machine learning project is the set of candidate functions that the algorithm will consider as the potential explanation of the patterns in the data. In the unknown arithmetic function scenario previously given, the set of considered functions was explicitly specified and restricted to four: addition, subtraction, multiplication, or division. More generally, the set of functions is implicitly defined through the inductive bias of the machine learning algorithm and the function representation (or model) that is being used. For example, a neural network model is a very flexible function representation.

    Table 1.2. A simple tabular dataset

    Input 1   Input 2   Target
    5         5         25
    2         6         12
    4         4         16
    2         2         4

    The third and final ingredient to machine learning is the measure of fitness. The measure of fitness is a function that takes the outputs from a candidate function, generated when the machine learning algorithm applies the candidate function to the data, and compares these outputs with the data, in some way. The result of this comparison is a value that describes the fitness of the candidate function relative to the data. A fitness function that would work for our unknown arithmetic function scenario is to count in how many of the examples a candidate function returns a value that exactly matches the target specified in the data. Multiplication would score four out of four on this fitness measure, addition would score one out of four, and division and subtraction would both score zero out of four. There are a large variety of fitness functions that can be used in machine learning, and the selection of the correct fitness function is crucial to the success of a machine learning project. The design of new fitness functions is a rich area of research in machine learning. Varying how the dataset is represented, and how the candidate functions and the fitness function are defined, results in three different categories of machine learning: supervised, unsupervised, and reinforcement learning.
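This counting fitness measure can be applied directly to the dataset in table 1.2 (a direct transcription of the table values, no further assumptions):

```python
# The fitness measure described in the text: count how many examples a
# candidate function reproduces exactly, using the dataset from table 1.2.
dataset = [(5, 5, 25), (2, 6, 12), (4, 4, 16), (2, 2, 4)]  # (input 1, input 2, target)

candidates = {
    "addition": lambda a, b: a + b,
    "subtraction": lambda a, b: a - b,
    "multiplication": lambda a, b: a * b,
    "division": lambda a, b: a / b,
}

fitness = {name: sum(f(a, b) == t for a, b, t in dataset)
           for name, f in candidates.items()}
print(fitness)  # multiplication scores 4, addition 1, subtraction and division 0
```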

    Supervised, Unsupervised, and Reinforcement Learning

    Supervised machine learning is the most common type of machine learning. In supervised machine learning, each example in the dataset is labeled with the expected output (or target) value. For example, if we were using the dataset in table 1.1 to learn a function that maps from the inputs of annual income and debt to a credit solvency score, the credit solvency feature in the dataset would be the target feature. In order to use supervised machine learning, our dataset must list the value of the target feature for every example in the dataset. These target feature values can sometimes be very difficult, and expensive, to collect. In some cases, we must pay human experts to label each example in a dataset with the correct target value. However, the benefit of having these target values in the dataset is that the machine learning algorithm can use these values to help the learning process. It does this by comparing the outputs a function produces with the target outputs specified in the dataset, using the difference (or error) to evaluate the fitness of the candidate function, and then using this fitness evaluation to guide the search for the best function. It is because of this feedback from the target labels in the dataset to the algorithm that this type of machine learning is considered supervised. This is the type of machine learning that was demonstrated by the example of choosing between different arithmetic functions to explain the behavior of an unknown function.

    Unsupervised machine learning is generally used for clustering data. For example, this type of data analysis is useful for customer segmentation, where a company wishes to segment its customer base into coherent groups so that it can target marketing campaigns and/or product designs to each group. In unsupervised machine learning, there are no target values in the dataset. Consequently, the algorithm cannot directly evaluate the fitness of a candidate function against the target values in the dataset. Instead, the machine learning algorithm tries to identify functions that map similar examples into clusters, such that the examples in a cluster are more similar to the other examples in the same cluster than they are to examples in other clusters. Note that the clusters are not prespecified, or at most they are initially very underspecified. For example, the data scientist might provide the algorithm with a target number of clusters, based on some intuition about the domain, without providing explicit information on relative sizes of the clusters or regarding the characteristics of examples that belong in each cluster. Unsupervised machine learning algorithms often begin by guessing an initial clustering of the examples and then iteratively adjusting the clusters (by dropping instances from one cluster and adding them to another) so as to improve the fitness of the cluster set. The fitness functions used in unsupervised machine learning generally reward candidate functions that result in higher similarity within individual clusters and, also, high diversity between clusters.
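The iterative clustering procedure described above can be sketched as follows. The data points, the choice of two clusters, and the initial centers are all invented for illustration:

```python
# Minimal sketch of iterative clustering (a k-means-style procedure):
# assign each point to the nearest cluster center, then recompute the
# centers, and repeat. Data and initial centers are invented.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]   # two obvious groups
centers = [0.0, 5.0]                      # deliberate (not random) initial guess

for _ in range(10):
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)       # assign point to its nearest center
    centers = [sum(c) / len(c) if c else centers[i]
               for i, c in enumerate(clusters)]  # recompute cluster centers

print([round(c, 2) for c in centers])  # [1.0, 9.07]
```

The fitness of the final clustering is high in the sense the text describes: points within each cluster are much closer to each other than to points in the other cluster.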

    Reinforcement learning is most relevant for online control tasks, such as robot control and game playing. In these scenarios, an agent needs to learn a policy for how it should act in an environment in order to be rewarded. In reinforcement learning, the goal of the agent is to learn a mapping from its current observation of the environment and its own internal state (its memory) to what action it should take: for instance, should the robot move forward or backward or should the computer program move the pawn or take the queen. The output of this policy (function) is the action that the agent should take next, given the current context. In these types of scenarios, it is difficult to create historic datasets, and so reinforcement learning is often carried out in situ: an agent is released into an environment where it experiments with different policies (starting with a potentially random policy) and over time updates its policy in response to the rewards it receives from the environment. If an action results in a positive reward, the mapping from the relevant observations and state to that action is reinforced in the policy, whereas if an action results in a negative reward, the mapping is weakened. Unlike in supervised and unsupervised machine learning, in reinforcement learning, the fact that learning is done in situ means that the training and inference stages are interleaved and ongoing. The agent infers what action it should do next and uses the feedback from the environment to learn how to update its policy. A distinctive aspect of reinforcement learning is that the target output of the learned function (the agent’s actions) is decoupled from the reward mechanism. The reward may be dependent on multiple actions and there may be no reward feedback, either positive or negative, available directly after an action has been performed. For example, in a chess scenario, the reward may be +1 if the agent wins the game and -1 if the agent loses. 
However, this reward feedback will not be available until the last move of the game has been completed. So, one of the challenges in reinforcement learning is designing training mechanisms that can distribute the reward appropriately back through a sequence of actions so that the policy can be updated appropriately. Google’s DeepMind Technologies generated a lot of interest by demonstrating how reinforcement learning could be used to train a deep learning model to learn control policies for seven different Atari computer games (Mnih et al. 2013). The input to the system was the raw pixel values from the screen, and the control policies specified what joystick action the agent should take at each point in the game. Computer game environments are particularly suited to reinforcement learning as the agent can be allowed to play many thousands of games against the computer game system in order to learn a successful policy, without incurring the cost of creating and labeling a large dataset of example situations with correct joystick actions. The DeepMind system got so good at the games that it outperformed all previous computer systems on six of the seven games, and outperformed human experts on three of the games.
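A minimal sketch of these ideas is given below. The four-state corridor environment, the learning rates, and the episode count are all invented for illustration; the point is to show how a reward received only at the end of a sequence of actions is propagated back so that earlier actions leading toward it are reinforced:

```python
# Toy tabular Q-learning sketch: a 4-state corridor where only reaching the
# rightmost state yields a reward (+1). Environment and rates are invented.
import random

random.seed(0)
n_states, actions = 4, [-1, +1]            # actions: step left or step right
Q = [[0.0, 0.0] for _ in range(n_states)]  # Q[state][action index] (the policy's values)
alpha, gamma, epsilon = 0.5, 0.9, 0.2

for _ in range(500):                       # episodes of in-situ experimentation
    state = 0
    while state != n_states - 1:
        if random.random() < epsilon:
            a = random.randrange(2)                         # explore
        else:
            a = max((0, 1), key=lambda i: Q[state][i])      # exploit current policy
        nxt = min(max(state + actions[a], 0), n_states - 1)
        reward = 1.0 if nxt == n_states - 1 else 0.0        # delayed: only at the goal
        # propagate the (possibly delayed) reward back through earlier actions
        Q[state][a] += alpha * (reward + gamma * max(Q[nxt]) - Q[state][a])
        state = nxt

policy = [max((0, 1), key=lambda i: Q[s][i]) for s in range(n_states - 1)]
print(policy)  # the learned policy steps right (action index 1) in every state
```

Note how training and inference are interleaved: the agent acts on its current policy while simultaneously updating that policy from the rewards it receives.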

    Deep learning can be applied to all three machine learning scenarios: supervised, unsupervised, and reinforcement. Supervised machine learning is, however, the most common type of machine learning. Consequently, the majority of this book will focus on deep learning in a supervised learning context. However, most of the deep learning concerns and principles introduced in the supervised learning context also apply to unsupervised and reinforcement learning.

    Why Is Deep Learning So Successful?

    In any data-driven process the primary determinant of success is knowing what to measure and how to measure it. This is why the processes of feature selection and feature design are so important to machine learning. As discussed above, these tasks can require domain expertise, statistical analysis of the data, and iterations of experiments building models with different feature sets. Consequently, dataset design and preparation can consume a significant portion of the time and resources expended in the project, in some cases up to 80% of the total budget of a project (Kelleher and Tierney 2018). Feature design is one task in which deep learning can have a significant advantage over traditional machine learning. In traditional machine learning, the design of features often requires a large amount of human effort. Deep learning takes a different approach to feature design, by attempting to automatically learn the features that are most useful for the task from the raw data.


    To give an example of feature design, a person’s body mass index (BMI) is the person’s weight (in kilograms) divided by the square of their height (in meters). In a medical setting, BMI is used to categorize people as underweight, normal, overweight, or obese. Categorizing people in this way can be useful in predicting the likelihood of a person developing a weight-related medical condition, such as diabetes. BMI is used for this categorization because it enables doctors to categorize people in a manner that is relevant to these weight-related medical conditions. Generally, as people get taller they also get heavier. However, most weight-related medical conditions (such as diabetes) are not affected by a person’s height but rather the amount they are overweight compared to other people of a similar stature. BMI is a useful feature to use for the medical categorization of a person’s weight because it takes the effect of height on weight into account. BMI is an example of a feature that is derived (or calculated) from raw features; in this case the raw features are weight and height. BMI is also an example of how a derived feature can be more useful in making a decision than the raw features that it is derived from. BMI is a hand-designed feature: Adolphe Quetelet designed it in the nineteenth century.
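As a sketch, deriving BMI from the raw features and mapping it onto the categories mentioned above (using the standard WHO thresholds) looks like this:

```python
# BMI as a derived feature: computed from the raw features weight (kg) and
# height (m), then mapped to the standard WHO categories.
def bmi(weight_kg: float, height_m: float) -> float:
    return weight_kg / height_m ** 2

def bmi_category(value: float) -> str:
    if value < 18.5:
        return "underweight"
    if value < 25:
        return "normal"
    if value < 30:
        return "overweight"
    return "obese"

print(round(bmi(70, 1.75), 1))      # 22.9
print(bmi_category(bmi(70, 1.75)))  # normal
```

The derived value, not the raw weight or height, is what carries the decision-relevant information.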

    As mentioned above, during a machine learning project a lot of time and effort is spent on identifying, or designing, (derived) features that are useful for the task the project is trying to solve. The advantage of deep learning is that it can learn useful derived features from data automatically (we will discuss how it does this in later chapters). Indeed, given large enough datasets, deep learning has proven to be so effective in learning features that deep learning models are now more accurate than many of the other machine learning models that use hand-engineered features. This is also why deep learning is so effective in domains where examples are described with very large numbers of features. Technically datasets that contain large numbers of features are called high-dimensional. For example, a dataset of photos with a feature for each pixel in a photo would be high-dimensional. In complex high-dimensional domains, it is extremely difficult to hand-engineer features: consider the challenges of hand-engineering features for face recognition or machine translation. So, in these complex domains, adopting a strategy whereby the features are automatically learned from a large dataset makes sense. Related to this ability to automatically learn useful features, deep learning also has the ability to learn complex nonlinear mappings between inputs and outputs; we will explain the concept of a nonlinear mapping in chapter 3, and in chapter 6 we will explain how these mappings are learned from data.

    Summary and the Road Ahead

    This chapter has focused on positioning deep learning within the broader field of machine learning. Consequently, much of this chapter has been devoted to introducing machine learning. In particular, the concept of a function as a deterministic mapping from inputs to outputs was introduced, and the goal of machine learning was explained as finding a function that matches the mappings from input features to the output features that are observed in the examples in the dataset.

    Within this machine learning context, deep learning was introduced as the subfield of machine learning that focuses on the design and evaluation of training algorithms and model architectures for modern neural networks. One of the distinctive aspects of deep learning within machine learning is the approach it takes to feature design. In most machine learning projects, feature design is a human-intensive task that can require deep domain expertise and consume a lot of time and project budget. Deep learning models, on the other hand, have the ability to learn useful features from low-level raw data, and complex nonlinear mappings from inputs to outputs. This ability is dependent on the availability of large datasets; however, when such datasets are available, deep learning can frequently outperform other machine learning approaches. Furthermore, this ability to learn useful features from large datasets is why deep learning can often generate highly accurate models for complex domains, be it in machine translation, speech processing, or image or video processing. In a sense, deep learning has unlocked the potential of big data. The most noticeable impact of this development has been the integration of deep learning models into consumer devices. However, the fact that deep learning can be used to analyze massive datasets also has implications for our individual privacy and civil liberty (Kelleher and Tierney 2018). This is why understanding what deep learning is, how it works, and what it can and can’t be used for, is so important. The road ahead is as follows:
    • Chapter 2 introduces some of the foundational concepts of deep learning, including what a model is, how the parameters of a model can be set using data, and how we can create complex models by combining simple models.
    • Chapter 3 explains what neural networks are, how they work, and what we mean by a deep neural network.
    • Chapter 4 presents a history of deep learning. This history focuses on the major conceptual and technical breakthroughs that have contributed to the development of the field of machine learning. In particular, it provides a context and explanation for why deep learning has seen such rapid development in recent years.
    • Chapter 5 describes the current state of the field, by introducing the two deep neural architectures that are the most popular today: convolutional neural networks and recurrent neural networks. Convolutional neural networks are ideally suited to processing image and video data. Recurrent neural networks are ideally suited to processing sequential data such as speech, text, or time-series data. Understanding the differences and commonalities across these two architectures will give you an awareness of how a deep neural network can be tailored to the characteristics of a specific type of data, and also an appreciation of the breadth of the design space of possible network architectures.
    • Chapter 6 explains how deep neural network models are trained, using the gradient descent and backpropagation algorithms. Understanding these two algorithms will give you a real insight into the state of artificial intelligence. For example, it will help you to understand why, given enough data, it is currently possible to train a computer to do a specific task within a well-defined domain at a level beyond human capabilities, but also why a more general form of intelligence is still an open research challenge for artificial intelligence.
    • Chapter 7 looks to the future in the field of deep learning. It reviews the major trends driving the development of deep learning at present, and how they are likely to contribute to the development of the field in the coming years. The chapter also discusses some of the challenges the field faces, in particular the challenge of understanding and interpreting how a deep neural network works.

    2 Conceptual Foundations

    This chapter introduces some of the foundational concepts that underpin deep learning. The approach taken here is to decouple the initial presentation of these concepts from the technical terminology used in deep learning, which is introduced in subsequent chapters.

    A deep learning network is a mathematical model that is (loosely) inspired by the structure of the brain. Consequently, in order to understand deep learning it is helpful to have an intuitive understanding of what a mathematical model is, how the parameters of a model can be set, how we can combine (or compose) models, and how we can use geometry to understand how a model processes information.

    What Is a Mathematical Model?

    In its simplest form, a mathematical model is an equation that describes how one or more input variables are related to an output variable. In this form a mathematical model is the same as a function: a mapping from inputs to outputs.

    In any discussion relating to models, it is important to remember the statement by George Box that all models are wrong but some are useful! For a model to be useful it must have a correspondence with the real world. This correspondence is most obvious in terms of the meaning that can be associated with a variable. For example, in isolation a value such as 78,000 has no meaning because it has no correspondence with concepts in the real world. But “yearly income = $78,000” tells us how the number describes an aspect of the real world. Once the variables in a model have a meaning, we can understand the model as describing a process through which different aspects of the world interact and cause new events. The new events are then described by the outputs of the model.

    A very simple template for a model is the equation of a line:

        y = (m × x) + c

    In this equation, y is the output variable, x is the input variable, and m and c are two parameters of the model that we can set to adjust the relationship the model defines between the input and the output.

    Imagine we have a hypothesis that yearly income affects a person’s happiness and we wish to describe the relationship between these two variables.1 Using the equation of a line, we could define a model to describe this relationship as follows:

    happiness = (m × income) + b

    This model has a meaning because the variables in the model (as distinct from the parameters of the model) have a correspondence with concepts from the real world. To complete our model, we have to set the values of the model’s parameters: m and b. Figure 2.1 illustrates how varying the values of each of these parameters changes the relationship defined by the model between income and happiness.

    One important thing to notice in this figure is that no matter what values we set the model parameters to, the relationship defined by the model between the input and the output variable can be plotted as a line. This is not surprising because we used the equation of a line as the template to define our model, and this is why mathematical models that are based on the equation of a line are known as linear models. The other important thing to notice in the figure is how changing the parameters of the model changes the relationship between income and happiness.

    Figure 2.1 Three different linear models of how income affects happiness.

    The solid steep line is a model of the world in which people with zero income have a happiness level of 1, and increases in income have a significant effect on people’s happiness. The dashed line is a model in which people with zero income also have a happiness level of 1, and increased income increases happiness, but at a slower rate compared to the world modeled by the solid line. Finally, the dotted line is a model of the world where no one is particularly unhappy (even people with zero income have a happiness of 4 out of 10), and although increases in income do affect happiness, the effect is moderate. This third model assumes that income has a relatively weak effect on happiness.

    More generally, the differences between the three models in figure 2.1 show how making changes to the parameters of a linear model changes the model. Changing b causes the line to move up and down. This is most clearly seen if we focus on the y-axis: notice that the line defined by a model always crosses (or intercepts) the y-axis at the value that b is set to. This is why the b parameter in a linear model is known as the intercept. The intercept can be understood as specifying the value of the output variable when the input variable is zero. Changing the m parameter changes the angle (or slope) of the line. The slope parameter controls how quickly changes in income translate into changes in happiness. In a sense, the slope value is a measure of how important income is to happiness. If income is very important (i.e., if small changes in income result in big changes in happiness), then the slope parameter of our model should be set to a large value. Another way of understanding this is to think of a slope parameter of a linear model as describing the importance, or weight, of the input variable in determining the value of the output.
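To make this concrete, here is a minimal sketch of the happiness model in Python. The parameter values are illustrative only, not the ones plotted in figure 2.1:

```python
# Equation of a line used as a model: output = (m * input) + b.
def linear_model(income, m, b):
    return m * income + b

# The intercept b is the output when the input is zero ...
no_income_happiness = linear_model(0, m=0.5, b=1)   # crosses the y-axis at 1

# ... and the slope m controls how quickly changes in income
# translate into changes in happiness.
steep = linear_model(10, m=0.5, b=1)    # income matters a lot
gentle = linear_model(10, m=0.1, b=4)   # income matters less
```

Varying m and b here reproduces the kind of differences shown between the three lines in figure 2.1.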

    Linear Models with Multiple Inputs

    The equation of a line can be used as a template for mathematical models that have more than one input variable. For example, imagine yourself in a scenario where you have been hired by a financial institution to act as a loan officer and your job involves deciding whether or not a loan application should be granted. From interviewing domain experts you come up with a hypothesis that a useful way to model a person’s credit solvency is to consider both their yearly income and their current debts. If we assume that there is a linear relationship between these two input variables and a person’s credit solvency, then the appropriate mathematical model, written out in English, would be:

    credit solvency = (weight₁ × income) + (weight₂ × debt) + intercept

    Notice that in this model the m parameter has been replaced by a separate weight for each input variable, with each weight representing the importance of its associated input in determining the output. In mathematical notation this model would be written as:

    y = (w₁ × x₁) + (w₂ × x₂) + b

    where y represents the credit solvency output, x₁ represents the income variable, x₂ represents the debt variable, and b represents the intercept. Using the idea of adding a new weight for each new input to the model allows us to scale the equation of a line to as many inputs as we like. All the models defined in this way are still linear within the dimensions defined by the number of inputs and the output. What this means is that a linear model with two inputs and one output defines a flat plane rather than a line, because that is what a two-dimensional line that has been extruded to three dimensions looks like.

    It can become tedious to write out a mathematical model that has a lot of inputs, so mathematicians like to write things in as compact a form as possible. With this in mind, the above equation is sometimes written in the short form:

    y = b + Σᵢ₌₁ⁿ (wᵢ × xᵢ)

    This notation tells us that to calculate the output variable y we must first go through all n inputs and multiply each input by its corresponding weight, then we should sum together the results of these n multiplications, and finally we add the b intercept parameter to the result of the summation. The Σ symbol tells us that we use addition to combine the results of the multiplications, and the index i tells us that we multiply each input by the weight with the same index. We can make our notation even more compact by treating the intercept as a weight. One way to do this is to assume an extra input x₀ that is always equal to 1 and to treat the intercept as the weight on this input, that is, w₀ = b. Doing this allows us to write out the model as follows:

    y = Σᵢ₌₀ⁿ (wᵢ × xᵢ)

    Notice that the index now starts at 0, rather than 1, because we are now assuming an extra input, x₀ = 1, and we have relabeled the intercept w₀.

    Although we can write down a linear model in a number of different ways, the core of a linear model is that the output is calculated as the sum of the n input values multiplied by their corresponding weights. Consequently, this type of model defines a calculation known as a weighted sum, because we weight each input and sum the results. Although a weighted sum is easy to calculate, it turns out to be very useful in many situations, and it is the basic calculation used in every neuron in a neural network.
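The weighted sum, and the trick of treating the intercept as a weight on a constant extra input, can be sketched in a few lines of Python. The weight and intercept values below are hypothetical:

```python
def weighted_sum(inputs, weights):
    # Multiply each input by its corresponding weight and sum the results.
    return sum(x * w for x, w in zip(inputs, weights))

# Explicit intercept ...
y1 = weighted_sum([150, -100], [3, 1]) + 10            # intercept = 10
# ... versus the intercept folded in as a weight on a constant input of 1.
y2 = weighted_sum([1, 150, -100], [10, 3, 1])
assert y1 == y2   # both forms compute the same output
```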

    Setting the Parameters of a Linear Model

    Let us return to our working scenario where we wish to create a model that enables us to calculate the credit solvency of individuals who have applied for a financial loan. For simplicity of presentation we will ignore the intercept parameter in this discussion, as it is treated the same as the other parameters (i.e., the weights on the inputs). So, dropping the intercept parameter, we have the following linear model (or weighted sum) relating a person’s income and debt to their credit solvency:

    credit solvency = (weight₁ × income) + (weight₂ × debt)

    The multiplication of inputs by weights, followed by a summation, is known as a weighted sum.

    In order to complete our model, we need to specify the parameters of the model; that is, we need to specify the value of the weight for each input. One way to do this would be to use our domain expertise to come up with values for each of the parameters.

    For example, if we assume that an increase in a person’s income has a bigger impact on their credit solvency than a similar increase in their debt, we should set the weighting for income to be larger than that of the debt. The following model encodes this assumption; in particular this model specifies that income is three times as important as debt in determining a person’s credit solvency:

    credit solvency = (3 × income) + (1 × debt)

    The drawback with using domain knowledge to set the parameters of a model is that experts often disagree. For example, you may think that weighting income as three times as important as debt is not realistic; in that case the model can be adjusted by, for example, setting both income and debt to have an equal weighting, which would be equivalent to assuming that income and debt are equally important in determining credit solvency. One way to avoid arguments between experts is to use data to set the parameters. This is where machine learning helps. The learning done by machine learning is finding the parameters (or weights) of a model using a dataset.

    Learning Model Parameters from Data

    Later in the book we will describe the standard algorithm used to learn the weights for a linear model, known as the gradient descent algorithm. However, we can give a brief preview of the algorithm here. We start with a dataset containing a set of examples for which we have both the input values (income and debt) and the output value (credit solvency). Table 2.1 illustrates such a dataset from our credit solvency scenario.2


    We then begin the process of learning the weights by guessing initial values for each weight. It is very likely that this initial, guessed, model will be a very bad model. This is not a problem, however, because we will use the dataset to iteratively update the weights so that the model gets better and better, in terms of how well it matches the data. For the purpose of the example, we will use the model described above as our initial (guessed) model:

    credit solvency = (3 × income) + (1 × debt)

    Table 2.1. A dataset of loan applications and known credit solvency rating of the applicant

    ID   Annual income   Current debt   Credit solvency
    1    $150            -$100          100
    2    $250            -$300          -100
    3    $450            -$250          400
    4    $200            -$350          -300

    The general process for improving the weights of the model is to select an example from the dataset and feed the input values from the example into the model. This allows us to calculate an estimate of the output value for the example. Once we have this estimated output, we can calculate the error of the model on the example by subtracting the estimated output from the correct output for the example listed in the dataset. Using the error of the model on the example, we can improve how well the model fits the data by updating the weights in the model using the following strategy, or learning rule:
    • If the error is 0, then we should not change the weights of the model.
    • If the error is positive, then the output of the model was too low, so we should increase the output of the model for this example by increasing the weights for all the inputs that had positive values for the example and decreasing the weights for all the inputs that had negative values for the example.
    • If the error is negative, then the output of the model was too high, so we should decrease the output of the model for this example by decreasing the weights for all the inputs that had positive values for the example and increasing the weights for all the inputs that had negative values for the example.
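The three-case rule above can be sketched directly in Python, assuming (as this chapter does for simplicity) a fixed update size of 1:

```python
def update_weights(weights, inputs, error, step=1):
    # Apply the learning rule: the sign of the error and the sign of each
    # input decide whether the corresponding weight goes up or down.
    new_weights = list(weights)
    for i, x in enumerate(inputs):
        if error == 0 or x == 0:
            continue  # zero error (or a zero input): leave the weight alone
        # Increase the weight when the error and the input have the same
        # sign; decrease it when their signs differ.
        direction = 1 if (error > 0) == (x > 0) else -1
        new_weights[i] += direction * step
    return new_weights
```

For the worked example that follows, `update_weights([3, 1], [150, -100], error=-250)` returns `[2, 2]`.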

    To illustrate the weight update process we will use example 1 from table 2.1 (income = 150, debt = -100, and solvency = 100) to test the accuracy of our guessed model and update the weights according to the resulting error.

    When the input values for the example are passed into the model, the credit solvency estimate returned by the model is 350. This is larger than the credit solvency listed for this example in the dataset, which is 100. As a result, the error of the model is negative (100 – 350 = –250); therefore, following the learning rule described above, we should decrease the output of the model for this example by decreasing the weights for positive inputs and increasing the weights for negative inputs. For this example, the income input had a positive value and the debt input had a negative value. If we decrease the weight for income by 1 and increase the weight for debt by 1, we end up with the following model:

    credit solvency = (2 × income) + (2 × debt)

    We can test if this weight update has improved the model by checking if the new model generates a better estimate for the example than the old model. The following illustrates pushing the same example through the new model:

    credit solvency = (2 × 150) + (2 × (-100)) = 300 – 200 = 100

    This time the credit solvency estimate generated by the model matches the value in the dataset, showing that the updated model fits the data more closely than the original model. In fact, this new model generates the correct output for all the examples in the dataset.

    In this example, we only needed to update the weights once in order to find a set of weights that made the behavior of the model consistent with all the examples in the dataset. Typically, however, it takes many iterations of presenting examples and updating weights to get a good model. Also, in this example, we have, for the sake of simplicity, assumed that the weights are updated by either adding or subtracting 1 from them. Generally, in machine learning, the calculation of how much to update each weight by is more complicated than this. However, these differences aside, the general process outlined here for updating the weights (or parameters) of a model in order to fit the model to a dataset is the learning process at the core of deep learning.
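The walkthrough above, from guessed model to corrected model, fits in a few lines of Python:

```python
def predict(inputs, weights):
    # Weighted sum: the model's credit solvency estimate.
    return sum(x * w for x, w in zip(inputs, weights))

inputs, target = [150, -100], 100      # example 1 from table 2.1
weights = [3, 1]                       # initial guessed model

estimate = predict(inputs, weights)    # 350: too high for this example
error = target - estimate              # 100 - 350 = -250

# Negative error: decrease the weights on positive inputs by 1 and
# increase the weights on negative inputs by 1.
weights = [w - 1 if x > 0 else w + 1 for w, x in zip(weights, inputs)]

new_estimate = predict(inputs, weights)  # now 100, matching the dataset
```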

    Combining Models

    We now understand how we can specify a linear model to estimate an applicant’s credit solvency, and how we can modify the parameters of the model in order to fit the model to a dataset. However, as a loan officer our job is not simply to calculate an applicant’s credit solvency; we have to decide whether to grant the loan application or not. In other words, we need a rule that will take a credit solvency score as input and return a decision on the loan application. For example, we might use the decision rule that a person with a credit solvency above 200 will be granted a loan. This decision rule is also a model: it maps an input variable, in this case credit solvency, to an output variable, loan decision.

    Using this decision rule we can adjudicate on a loan application by first using the model of credit solvency to convert a loan applicant’s profile (described in terms of the annual income and debt) into a credit solvency score, and then passing the resulting credit solvency score through our decision rule model to generate the loan decision. We can write this process out in a pseudomathematical shorthand as follows:

    loan decision = decision rule(credit solvency model(income, debt))

    Using this notation, the entire decision process for adjudicating the loan application for example 1 from table 2.1 is:

    loan decision = decision rule((2 × 150) + (2 × (-100))) = decision rule(100) = reject

    We are now in a position where we can use a model (composed of two simpler models, a decision rule and a weighted sum) to describe how a loan decision is made. What is more, if we use data from previous loan applications to set the parameters (i.e., the weights) of the model, our model will correspond to how we have processed previous loan applications. This is useful because we can use this model to process new applications in a way that is consistent with previous decisions. If a new loan application is submitted, we simply use our model to process the application and generate a decision. It is this ability to apply a mathematical model to new examples that makes mathematical modeling so useful.

    When we use the output of one model as the input to another model, we are creating a third model by combining two models. This strategy of building a complex model by combining smaller, simpler models is at the core of deep learning networks. As we will see, a neural network is composed of a large number of small units called neurons. Each of these neurons is a simple model in its own right that maps from a set of inputs to an output. The overall model implemented by the network is created by feeding the outputs from one group of neurons as inputs into a second group of neurons, then feeding the outputs of the second group of neurons as inputs into a third group of neurons, and so on, until the final output of the model is generated. The core idea is that feeding the outputs of some neurons as inputs to other neurons enables these subsequent neurons to learn to solve a different part of the overall problem the network is trying to solve by building on the partial solutions implemented by the earlier neurons, in a similar way to how the decision rule generates the final adjudication for a loan application by building on the calculation of the credit solvency model. We will return to this topic of model composition in subsequent chapters.
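As a sketch, the composed loan decision model looks like this in Python, using the fitted weights (2, 2) and the 200-point threshold from the running example:

```python
def solvency_model(income, debt):
    # First model: weighted sum of the applicant's profile.
    return (2 * income) + (2 * debt)

def decision_rule(solvency):
    # Second model: grant the loan if solvency is above 200.
    return "grant" if solvency > 200 else "reject"

def loan_decision(income, debt):
    # The output of one model is fed as the input to the next.
    return decision_rule(solvency_model(income, debt))

decision = loan_decision(450, -250)   # application 3: solvency 400 -> grant
```

New applications are processed by the same composed function, which is what makes the model reusable for future decisions.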

    Input Spaces, Weight Spaces, and Activation Spaces

    Although mathematical models can be written out as equations, it is often useful to understand the geometric meaning of a model. For example, the plots in figure 2.1 helped us understand how changes in the parameters of a linear model changed the relationship between the variables that the model defined. There are a number of geometric spaces that it is useful to distinguish between, and understand, when we are discussing neural networks. These are the input space, the weight space, and the activation space of a neuron. We can use the decision model for loan applications that we defined in the previous section to explain these three different types of spaces.

    We will begin by describing the concept of an input space. Our loan decision model took two inputs: the annual income and current debt of the applicant. Table 2.1 listed these input values for four example loan applications. We can plot the input space of this model by treating each of the input variables as the axis of a coordinate system. This coordinate space is referred to as the input space because each point in this space defines a possible combination of input values to the model. For example, the plot at the top-left of figure 2.2 shows the position of each of the four example loan applications within the model’s input space.

    The weight space for a model describes the universe of possible weight combinations that a model might use. We can plot the weight space for a model by defining a coordinate system with one axis per weight in the model. The loan decision model has only two weights, one weight for the annual income input, and one weight for the current debt input. Consequently, the weight space for this model has two dimensions. The plot at the top-right of figure 2.2 illustrates a portion of the weight space for this model. The location of the weight combination used by the model (income weight = 2, debt weight = 2) is highlighted in this figure. Each point within this coordinate system describes a possible set of weights for the model, and therefore corresponds to a different weighted sum function within the model. Consequently, moving from one location to another within this weight space is equivalent to changing the model because it changes the mapping from inputs to output that the model defines.

    Figure 2.2 There are four different coordinate spaces related to the processing of the loan decision model: top-left plots the input space; top-right plots the weight space; bottom-left plots the activation (or decision) space; and bottom-right plots the input space with the decision boundary plotted.

    A linear model maps a set of input values to a point in a new space by applying a weighted sum calculation to the inputs: multiply each input by a weight, and sum the results of the multiplication. In our loan decision model it is in this space that we apply our decision rule. Thus, we could call this space the decision space, but, for reasons that will become clear when we describe the structure of a neuron in the next chapter, we call this space the activation space. The axes of a model’s activation space correspond to the weighted inputs to the model. Consequently, each point in the activation space defines a set of weighted inputs. Applying a decision rule, such as our rule that a person with a credit solvency above 200 will be granted a loan, to each point in this activation space, and recording the result of the decision for each point, enables us to plot the decision boundary of the model in this space. The decision boundary divides those points in the activation space that exceed the threshold from those points in the space below the threshold. The plot in the bottom-left of figure 2.2 illustrates the activation space for our loan decision model. The positions of the four example loan applications listed in table 2.1, when they are projected into this activation space, are shown. The diagonal black line in this figure shows the decision boundary. Using this threshold, loan application number three is granted and the other loan applications are rejected. We can, if we wish, project the decision boundary back into the original input space by recording, for each location in the input space, which side of the decision boundary in the activation space it is mapped to by the weighted sum function. The plot at the bottom-right of figure 2.2 shows the decision boundary in the original input space (note the change in the values on the axes) and was generated using this process. We will return to the concepts of weight spaces and decision boundaries in the next chapter when we describe how adjusting the parameters of a neuron changes the set of input combinations that cause the neuron to output a high activation.
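With the running example’s weights (2, 2) and threshold 200, projecting the decision boundary back into the input space amounts to checking which side of the line 2 × income + 2 × debt = 200 an application falls on. A minimal sketch:

```python
def grant_side(income, debt, w_income=2, w_debt=2, threshold=200):
    # Map the input-space point through the weighted sum into the
    # activation space, then test it against the decision boundary.
    return (w_income * income) + (w_debt * debt) > threshold

# The four applications from table 2.1, classified in the input space:
applications = [(150, -100), (250, -300), (450, -250), (200, -350)]
decisions = [grant_side(income, debt) for income, debt in applications]
# Only the third application falls on the "grant" side of the boundary.
```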

    Summary

    The main idea presented in this chapter is that a linear mathematical model, be it expressed as an equation or plotted as a line, describes a relationship between a set of inputs and an output. Be aware that not all mathematical models are linear models, and we will come across nonlinear models in this book. However, the fundamental calculation of a weighted sum of inputs does define a linear model. Another big idea introduced in this chapter is that a linear model (a weighted sum) has a set of parameters, that is, the weights used in the weighted sum. By changing these parameters we can change the relationship the model describes between the inputs and the output. If we wish we could set these weights by hand using our domain expertise; however, we can also use machine learning to set the weights of the model so that the behavior of the model fits the patterns found in a dataset. The last big idea introduced in this chapter was that we can build complex models by combining simpler models. This is done by using the output from one (or more) models as input(s) to another model. We used this technique to define our composite model to make loan decisions. As we will see in the next chapter, the structure of a neuron in a neural network is very similar to the structure of this loan decision model. Just like this model, a neuron calculates a weighted sum of its inputs and then feeds the result of this calculation into a second model that decides whether the neuron activates or not.

    The focus of this chapter has been to introduce some foundational concepts before we introduce the terminology of machine learning and deep learning. To give a quick overview of how the concepts introduced in this chapter map over to machine learning terminology, our loan decision model is equivalent to a two-input neuron that uses a threshold activation function. The two financial indicators (annual income and current debt) are analogous to the inputs the neuron receives. The terms input vector or feature vector are sometimes used to refer to the set of indicators describing a single example; in this context an example is a single loan applicant, described in terms of two features: annual income and current debt. Also, just like the loan decision model, a neuron associates a weight with each input. And, again, just like the loan decision model, a neuron multiplies each input by its associated weight and sums the results of these multiplications in order to calculate an overall score for the inputs. Finally, similar to the way we applied a threshold to the credit solvency score to convert it into a decision of whether to grant or reject the loan application, a neuron applies a function (known as an activation function) to convert the overall score of the inputs into the neuron’s output. In the earliest types of neurons, these activation functions were actually threshold functions that worked in exactly the same way as the score threshold used in this credit scoring example. In more recent neural networks, different types of activation functions (for example, the logistic, tanh, or ReLU functions) are used. We will introduce these activation functions in the next chapter.

    3 Neural Networks: The Building Blocks of Deep Learning

    The term deep learning describes a family of neural network models that have multiple layers of simple information processing programs, known as neurons, in the network. The focus of this chapter is to provide a clear and comprehensive introduction to how these neurons work and are interconnected in artificial neural networks. In later chapters, we will explain how neural networks are trained using data.

    A neural network is a computational model that is inspired by the structure of the human brain. The human brain is composed of a massive number of nerve cells, called neurons. In fact, some estimates put the number of neurons in the human brain at one hundred billion (Herculano-Houzel 2009). Neurons have a simple three-part structure consisting of: a cell body, a set of fibers called dendrites, and a single long fiber called an axon. Figure 3.1 illustrates the structure of a neuron and how it connects to other neurons in the brain. The dendrites and the axon stem from the cell body, and the dendrites of one neuron are connected to the axons of other neurons. The dendrites act as input channels to the neuron and receive signals sent from other neurons along their axons. The axon acts as the output channel of a neuron, and so other neurons, whose dendrites are connected to the axon, receive the signals sent along the axon as inputs.

    Neurons work in a very simple manner. If the incoming stimuli are strong enough, the neuron transmits an electrical pulse, called an action potential, along its axon to the other neurons that are connected to it. So, a neuron acts as an all-or-none switch that takes in a set of inputs and either outputs an action potential or outputs nothing.

    This explanation of the human brain is a significant simplification of the biological reality, but it does capture the main points necessary to understand the analogy between the structure of the human brain and computational models called neural networks. These points of analogy are: (1) the brain is composed of a large number of interconnected and simple units called neurons; (2) the functioning of the brain can be understood as processing information, encoded as high or low electrical signals, or action potentials, that spread across the network of neurons; and (3) each neuron receives a set of stimuli from its neighbors and maps these inputs to either a high- or low-value output. All computational models of neural networks have these characteristics.

    Figure 3.1 The structure of a neuron in the brain.

    Artificial Neural Networks

    An artificial neural network consists of a network of simple information processing units, called neurons. The power of neural networks to model complex relationships is not the result of complex mathematical models, but rather emerges from the interactions between a large set of simple neurons.

    Figure 3.2 illustrates the structure of a neural network. It is standard to think of the neurons in a neural network as organized into layers. The depicted network has five layers: one input layer, three hidden layers, and one output layer. A hidden layer is just a layer that is neither the input nor the output layer. Deep learning networks are neural networks that have many hidden layers of neurons. The minimum number of hidden layers necessary to be considered deep is two. However, most deep learning networks have many more than two hidden layers. The important point is that the depth of a network is measured in terms of the number of hidden layers, plus the output layer.


    In figure 3.2, the squares in the input layer represent locations in memory that are used to present inputs to the network. These locations can be thought of as sensing neurons. There is no processing of information in these sensing neurons; the output of each of these neurons is simply the value of the data stored at the memory location. The circles in the figure represent the information processing neurons in the network. Each of these neurons takes a set of numeric values as input and maps them to a single output value. Each input to a processing neuron is either the output of a sensing neuron or the output of another processing neuron.

    Figure 3.2 Topological illustration of a simple neural network.

    The arrows in figure 3.2 illustrate how information flows through the network from the output of one neuron to the input of another neuron. Each connection in a network connects two neurons and each connection is directed, which means that information carried along a connection only flows in one direction. Each of the connections in a network has a weight associated with it. A connection weight is simply a number, but these weights are very important. The weight of a connection affects how a neuron processes the information it receives along the connection, and, in fact, training an artificial neural network, essentially, involves searching for the best (or optimal) set of weights.

    How an Artificial Neuron Processes Information

    The processing of information within a neuron, that is, the mapping from inputs to an output, is very similar to the loan decision model that we developed in chapter 2. Recall that the loan decision model first calculated a weighted sum over the input features (income and debt). The weights used in the weighted sum were adjusted using a dataset so that the result of the weighted sum calculation, given a loan applicant’s income and debt as inputs, was an accurate estimate of the applicant’s credit solvency score. The second stage of processing in the loan decision model involved passing the result of the weighted sum calculation (the estimated credit solvency score) through a decision rule. This decision rule was a function that mapped a credit solvency score to a decision on whether a loan application was granted or rejected.

    A neuron also implements a two-stage process to map inputs to an output. The first stage of processing involves the calculation of a weighted sum of the inputs to the neuron. Then the result of the weighted sum calculation is passed through a second function that maps the result of the weighted sum to the neuron’s final output value. When we are designing a neuron, we can use many different types of functions for this second stage of processing; it may be as simple as the decision rule we used for our loan decision model, or it may be more complex. Typically the output value of a neuron is known as its activation value, so this second function, which maps from the result of the weighted sum to the activation value of the neuron, is known as an activation function.

    Figure 3.3 illustrates how these stages of processing are reflected in the structure of an artificial neuron. In figure 3.3, the Σ symbol represents the calculation of the weighted sum, and the φ symbol represents the activation function processing the weighted sum and generating the output from the neuron.

    Figure 3.3 The structure of an artificial neuron.

    The neuron in figure 3.3 receives n inputs (x₁, …, xₙ) on n different input connections, and each connection has an associated weight (w₁, …, wₙ). The weighted sum calculation involves the multiplication of inputs by weights and the summation of the resulting values. Mathematically this calculation is written as:

    z = (x₁ × w₁) + (x₂ × w₂) + … + (xₙ × wₙ)

    This calculation can also be written in a more compact mathematical form as:

    z = Σᵢ₌₁ⁿ (xᵢ × wᵢ)

    For example, assuming a neuron received the inputs [3, 9] and had the weights [-3, 1], the weighted sum calculation would be:

    z = (3 × -3) + (9 × 1) = 0

    The second stage of processing within a neuron is to pass the result of the weighted sum, the z value, through an activation function. Figure 3.4 plots the shape of a number of possible activation functions, as the input to each function, z, ranges across an interval, either [-1, …, +1] or [-10, …, +10] depending on which interval best illustrates the shape of the function. Figure 3.4 (top) plots a threshold activation function. The decision rule we used in the loan decision model was an example of a threshold function; the threshold used in that decision rule was whether the credit solvency score was above 200. Threshold activations were common in early neural network research. Figure 3.4 (middle) plots the logistic and tanh activation functions. The units employing these activation functions were popular in multilayer networks until quite recently. Figure 3.4 (bottom) plots the rectifier (or hinge, or positive linear) activation function. This activation function is very popular in modern deep learning networks; in 2011 the rectifier activation function was shown to enable better training in deep networks (Glorot et al. 2011). In fact, as will be discussed in chapter 4, during the review of the history of deep learning, one of the trends in neural network research has been a shift from threshold activation to logistic and tanh activations, and then onto rectifier activation functions.

    Figure 3.4 Top: threshold function; middle: logistic and tanh functions; bottom: rectified linear function.
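    The four activation functions shown in figure 3.4 can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation; the threshold value of 0 and the 0/1 output values of the threshold function are assumptions for this sketch:

    ```python
    import math

    def threshold(z):
        # Outputs 1 if the input exceeds the threshold (here 0), else 0.
        return 1.0 if z > 0 else 0.0

    def logistic(z):
        # S-shaped curve that squashes any input into the range (0, 1).
        return 1.0 / (1.0 + math.exp(-z))

    def tanh(z):
        # S-shaped curve that squashes any input into the range (-1, 1).
        return math.tanh(z)

    def rectifier(z):
        # The ReLU: passes positive inputs through, clips negatives to 0.
        return max(0.0, z)
    ```

    Plotting each of these functions over an input interval reproduces the shapes in figure 3.4.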

    Returning to the example, the result of the weighted summation step was z = 0. Figure 3.4 (middle plot, solid line) plots the logistic function. Assuming that the neuron is using a logistic activation function, this plot shows how the result of the summation will be mapped to an output activation: logistic(0) = 0.5. The calculation of the output activation of this neuron can be summarized as:

    a = logistic(0) = 1 / (1 + e⁻⁰) = 0.5

    Notice that the processing of information in this neuron is nearly identical to the processing of information in the loan decision model we developed in the last chapter. The major difference is that we have replaced the decision threshold rule that mapped the weighted sum score to an accepted or rejected output with a logistic function that maps the weighted sum score to a value between 0 and 1. Depending on the location of this neuron in the network, the output activation of the neuron, in this instance 0.5, will either be passed as input to one or more neurons in the next layer in the network, or will be part of the overall output of the network. If a neuron is at the output layer, the interpretation of what its output value means would be dependent on the task that the neuron is designed to model. If a neuron is in one of the hidden layers of the network, then it may not be possible to put a meaningful interpretation on the output of the neuron apart from the general interpretation that it represents some sort of derived feature (similar to the BMI feature we discussed in chapter 1) that the network has found useful in generating its outputs. We will return to the challenge of interpreting the meaning of activations within a neural network in chapter 7.

    The key point to remember from this section is that a neuron, the fundamental building block of neural networks and deep learning, is defined by a simple two-step sequence of operations: calculating a weighted sum and then passing the result through an activation function.
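    This two-step sequence can be sketched directly in Python. The sketch below uses the inputs [3, 9] and weights [−3, 1] from the worked example, and assumes a logistic activation function (one possible choice among those in figure 3.4):

    ```python
    import math

    def neuron(inputs, weights):
        # Stage 1: calculate the weighted sum of the inputs.
        z = sum(i * w for i, w in zip(inputs, weights))
        # Stage 2: pass the weighted sum through the activation function
        # (here, the logistic function).
        return 1.0 / (1.0 + math.exp(-z))

    activation = neuron([3, 9], [-3, 1])  # z = (3 × -3) + (9 × 1) = 0
    print(activation)  # 0.5
    ```

    Swapping the logistic function for a threshold, tanh, or rectifier in stage 2 gives the other neuron types discussed in this chapter.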

    Figure 3.4 illustrates that neither the tanh nor the logistic function is a linear function. In fact, the plots of both of these functions have a distinctive s-shaped (rather than linear) profile. Not all activation functions have an s-shape (for example, the threshold and rectifier are not s-shaped), but all activation functions do apply a nonlinear mapping to the output of the weighted sum. In fact, it is the introduction of the nonlinear mapping into the processing of a neuron that is the reason why activation functions are used.

    Why Is an Activation Function Necessary?

    To understand why a nonlinear mapping is needed in a neuron, it is first necessary to understand that, essentially, all a neural network does is define a mapping from inputs to outputs, be it from a game position in Go to an evaluation of that position, or from an X-ray to a diagnosis of a patient. Neurons are the basic building blocks of neural networks, and therefore they are the basic building blocks of the mapping a network defines. The overall mapping from inputs to outputs that a network defines is composed of the mappings from inputs to outputs that each of the neurons within the network implement. The implication of this is that if all the neurons within a network were restricted to linear mappings (i.e., weighted sum calculations), the overall network would be restricted to a linear mapping from inputs to outputs. However, many of the relationships in the world that we might want to model are nonlinear, and if we attempt to model these relationships using a linear model, then the model will be very inaccurate. Attempting to model a nonlinear relationship with a linear model would be an example of the underfitting problem we discussed in chapter 1: underfitting occurs when the model used to encode the patterns in a dataset is too simple and as a result it is not accurate.
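    The claim that a stack of purely linear neurons can only ever implement a linear overall mapping can be checked numerically. In this sketch (the weights and biases are arbitrary illustrative values), two chained weighted sums collapse into one equivalent weighted sum:

    ```python
    # Two "layers" of weighted sums with no activation function between them.
    def layer1(x):
        return 2 * x + 1   # weighted sum with weight 2 and bias 1

    def layer2(h):
        return -3 * h + 4  # weighted sum with weight -3 and bias 4

    def network(x):
        return layer2(layer1(x))

    # The composition is itself linear: -3*(2x + 1) + 4 = -6x + 1.
    def collapsed(x):
        return -6 * x + 1

    # The two-layer "network" and the single linear model always agree.
    for x in [-2.0, 0.0, 1.5, 7.0]:
        assert network(x) == collapsed(x)
    ```

    No matter how many linear layers are stacked, the same collapse applies; inserting a nonlinear activation function between the layers is what breaks it.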

    A linear relationship exists between two things when an increase in one always results in an increase or decrease in the other at a constant rate. For example, if an employee is on a fixed hourly rate, which does not vary at weekends or if they do overtime, then there is a linear relationship between the number of hours they work and their pay. A plot of their hours worked versus their pay will result in a straight line; the steeper the line the higher their fixed hourly rate of pay. However, if we make the payment system for our hypothetical employee just slightly more complex, by, for example, increasing their hourly rate of pay when they do overtime or work weekends, then the relationship between the number of hours they work and their pay is no longer linear. Neural networks, and in particular deep learning networks, are typically used to model relationships that are much more complex than this employee’s pay. Modeling these relationships accurately requires that a network be able to learn and represent complex nonlinear mappings. So, in order to enable a neural network to implement such nonlinear mappings, a nonlinear step (the activation function) must be included within the processing of the neurons in the network.

    In principle, using any nonlinear function as an activation function enables a neural network to learn a nonlinear mapping from inputs to outputs. However, as we shall see later, most of the activation functions plotted in figure 3.4 have nice mathematical properties that are helpful when training a neural network, and this is why they are so popular in neural network research.

    The fact that the introduction of a nonlinearity into the processing of the neurons enables the network to learn a nonlinear mapping between input(s) and output is another illustration of the fact that the overall behavior of the network emerges from the interactions of the processing carried out by individual neurons within the network. Neural networks solve problems using a divide-and-conquer strategy: each of the neurons in a network solves one component of the larger problem, and the overall problem is solved by combining these component solutions. An important aspect of the power of neural networks is that during training, as the weights on the connections within the network are set, the network is in effect learning a decomposition of the larger problem, and the individual neurons are learning how to solve and combine solutions to the components within this problem decomposition.

    Within a neural network, some neurons may use different activation functions from other neurons in the network. Generally, however, all the neurons within a given layer of a network will be of the same type (i.e., they will all use the same activation function). Also, sometimes neurons are referred to as units, with a distinction made between units based on the activation function the units use: neurons that use a threshold activation function are known as threshold units, units that use a logistic activation function are known as logistic units, and neurons that use the rectifier activation function are known as rectified linear units, or ReLUs. For example, a network may have a layer of ReLUs connected to a layer of logistic units. The decision regarding which activation functions to use in the neurons in a network is made by the data scientist who is designing the network. To make this decision, a data scientist may run a number of experiments to test which activation functions give the best performance on a dataset. However, frequently data scientists default to using whichever activation function is popular at a given point. For example, currently ReLUs are the most popular type of unit in neural networks, but this may change as new activation functions are developed and tested. As we will discuss at the end of this chapter, the elements of a neural network that are set manually by the data scientist prior to the training process are known as hyperparameters.


    The term hyperparameter is used to describe the manually fixed parts of the model in order to distinguish them from the parameters of the model, which are the parts of the model that are set automatically, by the machine learning algorithm, during the training process. The parameters of a neural network are the weights used in the weighted sum calculations of the neurons in the network. As we touched on in chapters 1 and 2, the standard training process for setting the parameters of a neural network is to begin by initializing the parameters (the network’s weights) to random values, and during training to use the performance of the network on the dataset to slowly adjust these weights so as to improve the accuracy of the model on the data. Chapter 6 describes the two algorithms that are most commonly used to train a neural network: the gradient descent algorithm and the backpropagation algorithm. What we will focus on next is understanding how changing the parameters of a neuron affects how the neuron responds to the inputs it receives.

    How Does Changing the Parameters of a Neuron Affect Its Behavior?

    The parameters of a neuron are the weights the neuron uses in the weighted sum calculation. Although the weighted sum calculation in a neuron is the same weighted sum used in a linear model, in a neuron the relationship between the weights and the final output of the neuron is more complex because the result of the weighted sum is passed through an activation function in order to generate the final output. To understand how a neuron makes a decision on a given input, we need to understand the relationship between the neuron’s weights, the input it receives, and the output it generates in response.

    The relationship between a neuron’s weights and the output it generates for a given input is most easily understood in neurons that use a threshold activation function. A neuron using this type of activation function is equivalent to our loan decision model that used a decision rule to classify the credit solvency scores, generated by the weighted sum calculation, to reject or grant loan applications. At the end of chapter 2, we introduced the concepts of an input space, a weight space, and an activation space (see figure 2.2). The input space for our two-input loan decision model could be visualized as a two-dimensional space, with one input (annual income) plotted along the x-axis, and the other input (current debt) on the y-axis. Each point in this plot defined a potential combination of inputs to the model, and the set of points in the input space defines the set of possible inputs the model could process. The weights used in the loan decision model can be understood as dividing the input space into two regions: the first region contains all of the inputs that result in the loan application being granted, and the other region contains all the inputs that result in the loan application being rejected. In that scenario, changing the weights used by the decision model would change the set of loan applications that were accepted or rejected. Intuitively, this makes sense because it changes the weighting that we put on an applicant’s income relative to their debt when we are deciding on granting the loan or not.

    We can generalize the above analysis of the loan decision model to a neuron in a neural network. The equivalent neuron structure to the loan decision model is a two-input neuron with a threshold activation function. The input space for such a neuron has a similar structure to the input space for a loan decision model. Figure 3.5 presents three plots of the input space for a two-input neuron using a threshold function that outputs a high activation if the weighted sum result is greater than zero, and a low activation otherwise. The difference between the plots in this figure is that the neuron defines a different decision boundary in each case. In each plot, the decision boundary is marked with a black line.

    Each of the plots in figure 3.5 was created by first fixing the weights of the neuron and then for each point in the input space recording whether the neuron returned a high or low activation when the coordinates of the point were used as the inputs to the neuron. The input points for which the neuron returned a high activation are plotted in gray, and the other points are plotted in white. The only difference between the neurons used to create these plots was the weights used in calculating the weighted sum of the inputs. The arrow in each plot illustrates the weight vector used by the neuron to generate the plot. In this context, a vector describes the direction and distance of a point from the origin.1 As we shall see, interpreting the set of weights used by a neuron as defining a vector (an arrow from the origin to the coordinates of the weights) in the neuron’s input space is useful in understanding how changes in the weights change the decision boundary of the neuron.

    Figure 3.5 Decision boundaries for a two-input neuron. Top: weight vector [w1=1, w2=1]; middle: weight vector [w1=-2, w2=1]; bottom: weight vector [w1=1, w2=-2].

    The weights used to create each plot change from one plot to the next. These changes are reflected in the direction of the arrow (the weight vector) in each plot. Specifically, changing the weights rotates the weight vector around the origin. Notice that the decision boundary in each plot is sensitive to the direction of the weight vector: in all the plots, the decision boundary is orthogonal (i.e., at a right, or 90°, angle) to the weight vector. So, changing the weights not only rotates the weight vector, it also rotates the decision boundary of the neuron. This rotation changes the set of inputs that the neuron outputs a high activation in response to (the gray regions).

    To understand why this decision boundary is always orthogonal to the weight vector, we have to shift our perspective, for a moment, to linear algebra. Remember that every point in the input space defines a potential combination of input values to the neuron. Now, imagine each of these sets of input values as defining an arrow from the origin to the coordinates of the point in the input space. There is one arrow for each point in the input space. Each of these arrows is very similar to the weight vector, except that it points to the coordinates of the inputs rather than to the coordinates of the weights. When we treat a set of inputs as a vector, the weighted sum calculation is the same as multiplying two vectors, the input vector by the weight vector. In linear algebra terminology, multiplying two vectors is known as the dot product operation. For the purposes of this discussion, all we need to know about the dot product is that the result of this operation is dependent on the angle between the two vectors that are multiplied. If the angle between the two vectors is less than a right angle, then the result will be positive; otherwise, it will be negative. So, multiplying the weight vector by an input vector will return a positive value for all the input vectors at an angle less than a right angle to the weight vector, and a negative value for all the other vectors. The activation function used by this neuron returns a high activation when positive values are input and a low activation when negative values are input. Consequently, the decision boundary lies at a right angle to the weight vector because all the inputs at an angle less than a right angle to the weight vector will result in a positive input to the activation function and, therefore, trigger a high-output activation from the neuron; conversely, all the other inputs will result in a low-output activation from the neuron.
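    This geometric argument can be illustrated numerically. Using the weight vector [1, 1] from figure 3.5 (top) and a few illustrative input points, the sign of the dot product agrees with whether the angle between the input vector and the weight vector is less than a right angle:

    ```python
    import math

    def dot(v, u):
        # Dot product: the weighted sum of one vector by another.
        return sum(a * b for a, b in zip(v, u))

    def angle_deg(v, u):
        # Angle between two vectors, recovered from the dot-product formula.
        cos_theta = dot(v, u) / (math.hypot(*v) * math.hypot(*u))
        return math.degrees(math.acos(cos_theta))

    w = [1, 1]  # weight vector from figure 3.5 (top)
    for x in ([2, 1], [-1, 2], [-2, -1]):
        positive = dot(w, x) > 0
        acute = angle_deg(w, x) < 90
        # A positive weighted sum corresponds exactly to an acute angle.
        assert positive == acute
    ```

    Inputs at an acute angle to the weight vector produce a positive weighted sum and so a high activation; the boundary between the two cases lies at a right angle to the weight vector.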

    Switching back to the plots in figure 3.5, although the decision boundaries in each of the plots are at different angles, all the decision boundaries go through the point in space that the weight vectors originate from (i.e., the origin). This illustrates that changing the weights of a neuron rotates the neuron’s decision boundary but does not translate it. Translating the decision boundary means moving the decision boundary up and down the weight vector, so that the point where it meets the vector is not the origin. The restriction that all decision boundaries must pass through the origin limits the distinctions that a neuron can learn between input patterns. The standard way to overcome this limitation is to extend the weighted sum calculation so that it includes an extra element, known as the bias term. This bias term is not the same as the inductive bias we discussed in chapter 1. It is more analogous to the intercept parameter in the equation of a line, which moves the line up and down the y-axis. The purpose of this bias term is to move (or translate) the decision boundary away from the origin.

    The bias term is simply an extra value that is included in the calculation of the weighted sum. It is introduced into the neuron by adding the bias to the result of the weighted summation prior to passing it through the activation function. Here is the equation describing the processing stages in a neuron with the bias term represented by the term b:

    z = (Σᵢ₌₁ⁿ inᵢ × wᵢ) + b
    a = φ(z)

    Figure 3.6 illustrates how the value of the bias term affects the decision boundary of a neuron. When the bias term is negative, the decision boundary is moved away from the origin in the direction that the weight vector points to (as in the top and middle plots in figure 3.6); when the bias term is positive, the decision boundary is translated in the opposite direction (see the bottom plot of figure 3.6). In both cases, the decision boundary remains orthogonal to the weight vector. Also, the size of the bias term affects the amount the decision boundary is moved from the origin; the larger the value of the bias term, the more the decision boundary is moved (compare the top plot of figure 3.6 with the middle and bottom plots).

    Figure 3.6 Decision boundary plots for a two-input neuron that illustrate the effect of the bias term on the decision boundary. Top: weight vector [w1=1, w2=1] and bias equal to -1; middle: weight vector [w1=-2, w2=1] and bias equal to -2; bottom: weight vector [w1=1, w2=-2] and bias equal to 2.

    Instead of manually setting the value of the bias term, it is preferable to allow a neuron to learn the appropriate bias. The simplest way to do this is to treat the bias term as a weight and allow the neuron to learn the bias term at the same time that it is learning the rest of the weights for its inputs. All that is required to achieve this is to augment all the input vectors the neuron receives with an extra input that is always set to 1. By convention, this input is input 0 (in₀ = 1), and, consequently, the bias term is specified by weight 0 (w₀).2 Figure 3.7 illustrates the structure of an artificial neuron when the bias term has been integrated as w₀.

    When the bias term has been integrated into the weights of a neuron, the equation specifying the mapping from input(s) to output activation of the neuron can be simplified (at least from a notational perspective) as follows:

    a = φ(Σᵢ₌₀ⁿ inᵢ × wᵢ)

    Notice that in this equation the index i goes from 0 to n, so that it now includes the fixed input, in₀ = 1, and the bias term, w₀; in the earlier version of this equation, the index only went from 1 to n. This new format means that the neuron is able to learn the bias term, simply by learning the appropriate weight w₀, using the same process that is used to learn the weights for the other inputs: at the start of training, the bias term for each neuron in the network will be initialized to a random value and then adjusted, along with the weights of the network, in response to the performance of the network on the dataset.
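    The trick of folding the bias into the weights can be sketched as follows. The inputs, weights, and bias value here are illustrative; both formulations produce the same activation:

    ```python
    import math

    def neuron_with_bias(inputs, weights, bias):
        # Explicit bias: add b to the weighted sum before the activation.
        z = bias + sum(i * w for i, w in zip(inputs, weights))
        return 1.0 / (1.0 + math.exp(-z))  # logistic activation

    def neuron_bias_as_weight(inputs, weights):
        # Bias as weight 0: augment the inputs with a fixed in0 = 1,
        # so weights[0] plays the role of the bias term.
        augmented = [1.0] + list(inputs)
        z = sum(i * w for i, w in zip(augmented, weights))
        return 1.0 / (1.0 + math.exp(-z))

    a1 = neuron_with_bias([3, 9], [-3, 1], bias=-2)
    a2 = neuron_bias_as_weight([3, 9], [-2, -3, 1])
    assert a1 == a2  # identical results, but the bias is now learnable
    ```

    With the second formulation, the training algorithm adjusts the bias exactly as it adjusts any other weight.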

    Figure 3.7 An artificial neuron with a bias term included as w0.

    Accelerating Neural Network Training Using GPUs

    Merging the bias term is more than a notational convenience; it enables us to use specialized hardware to accelerate the training of neural networks. The fact that a bias term can be treated as the same as a weight means that the calculation of the weighted sum of inputs (including the addition of the bias term) can be treated as the multiplication of two vectors. As we discussed earlier, during the explanation of why the decision boundary was orthogonal to the weight vector, we can think of a set of inputs as a vector. Recognizing that much of the processing within a neural network involves vector and matrix multiplications opens up the possibility of using specialized hardware to speed up these calculations. For example, graphics processing units (GPUs) are hardware components that have specifically been designed to do extremely fast matrix multiplications.

    In a standard feedforward network, all the neurons in one layer receive all the outputs (i.e., activations) from all the neurons in the preceding layer. This means that all the neurons in a layer receive the same set of inputs. As a result, we can calculate the weighted sums for all the neurons in a layer using a single vector-by-matrix multiplication. Doing this is much faster than calculating a separate weighted sum for each neuron in the layer. To do this calculation of weighted sums for an entire layer of neurons in a single multiplication, we put the outputs from the neurons in the preceding layer into a vector and store all the weights of the connections between the two layers of neurons in a matrix. We then multiply the vector by the matrix, and the resulting vector contains the weighted sums for all the neurons.
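    The single vector-by-matrix multiplication described above can be sketched in plain Python. The activations and weights below are illustrative values, not the values in figure 3.8; each column of the matrix holds the weights coming into one neuron of the next layer:

    ```python
    def vec_mat_multiply(vec, mat):
        # Multiply a 1 x n vector by an n x m matrix, giving a 1 x m vector.
        # Element j of the result is the weighted sum for neuron j: the
        # vector multiplied element-wise by column j, then summed.
        n_cols = len(mat[0])
        return [sum(vec[i] * mat[i][j] for i in range(len(vec)))
                for j in range(n_cols)]

    # Activations of the 3 neurons in the first layer (illustrative).
    activations = [0.5, 1.0, -1.0]
    # 3 x 4 weight matrix: column j holds the weights into neuron j
    # of the second layer.
    weights = [[ 1.0, 0.0, 2.0, -1.0],
               [ 0.5, 1.0, 0.0,  1.0],
               [-1.0, 2.0, 1.0,  0.0]]

    z = vec_mat_multiply(activations, weights)  # weighted sums, 4 neurons
    print(z)  # [2.0, -1.0, 0.0, 0.5]
    ```

    A GPU performs exactly this kind of multiplication in highly parallel hardware, which is why it is so much faster than computing each neuron's weighted sum separately.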

    Figure 3.8 illustrates how the weighted summation calculations for all the neurons in a layer in a network can be calculated using a single matrix multiplication operation. This figure is composed of two separate graphics: the graphic on the left illustrates the connections between neurons in two layers of a network, and the graphic on the right illustrates the matrix operation to calculate the weighted sums for the neurons in the second layer of the network. To help maintain a correspondence between the two graphics, the connections into neuron E are highlighted in the graphic on the left, and the calculation of the weighted sum in neuron E is highlighted in the graphic on the right.

    Focusing on the graphic on the right, the 1 × 3 vector (1 row, 3 columns) on the bottom-left of this graphic stores the activations for the neurons in layer 1 of the network; note that these activations are the outputs from an activation function φ (the particular activation function is not specified—it could be a threshold function, a tanh, a logistic function, or a rectified linear unit/ReLU function). The 3 × 4 matrix (three rows and four columns), in the top-right of the graphic, holds the weights for the connections between the two layers of neurons. In this matrix, each column stores the weights for the connections coming into one of the neurons in the second layer of the network. The first column stores the weights for neuron D, the second column for neuron E, etc.3 Multiplying the 1 × 3 vector of activations from layer 1 by the 3 × 4 weight matrix results in a 1 × 4 vector corresponding to the weighted summations for the four neurons in layer 2 of the network: z_D is the weighted sum of inputs for neuron D, z_E for neuron E, and so on.

    To generate the 1 × 4 vector containing the weighted summations for the neurons in layer 2, the activation vector is multiplied by each column in the matrix in turn. This is done by multiplying the first (leftmost) element in the vector by the first (topmost) element in the column, then multiplying the second element in the vector by the element in the second row of the column, and so on, until each element in the vector has been multiplied by its corresponding column element. Once all the multiplications between the vector and the column have been completed, the results are summed together and then stored in the output vector. Figure 3.8 illustrates the multiplication of the activation vector by the second column in the weight matrix (the column containing the weights for inputs to neuron E) and the storing of the summation of these multiplications in the output vector as the value z_E.

    Figure 3.8 A graphical illustration of the topological connections of a specific neuron E in a network, and the corresponding vector by matrix multiplication that calculates the weighted summation of inputs for the neuron E, and its siblings in the same layer.5

    Indeed, the calculation implemented by an entire neural network can be represented as a chain of matrix multiplications, with an element-wise application of activation functions to the results of each multiplication. Figure 3.9 illustrates how a neural network can be represented in both graph form (on the left) and as a sequence of matrix operations (on the right). In the matrix representation, the × symbol represents standard matrix multiplication (described above) and the φ notation represents the application of an activation function to each element in the vector created by the preceding matrix multiplication. The output of this element-wise application of the activation function is a vector containing the activations for the neurons in a layer of the network. To help show the correspondence between the two representations, both show the inputs to the network, in₁ and in₂, the activations from the three hidden units, h₁, h₂, and h₃, and the overall output of the network, y.
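    A full forward pass as a chain of vector-by-matrix multiplications, with an element-wise activation after each, might be sketched like this. The sketch assumes a two-input network with three hidden units and one output, all using rectifier activations; the weight values are illustrative:

    ```python
    def vec_mat(vec, mat):
        # Multiply a 1 x n vector by an n x m matrix.
        return [sum(v * mat[i][j] for i, v in enumerate(vec))
                for j in range(len(mat[0]))]

    def relu(vec):
        # Element-wise application of the rectifier activation function.
        return [max(0.0, z) for z in vec]

    # 2 x 3 weight matrix for the input -> hidden connections (illustrative).
    W1 = [[1.0, -1.0, 0.5],
          [0.5,  1.0, -1.0]]
    # 3 x 1 weight matrix for the hidden -> output connections.
    W2 = [[1.0], [2.0], [-1.0]]

    def forward(inputs):
        hidden = relu(vec_mat(inputs, W1))  # hidden activations h1, h2, h3
        return relu(vec_mat(hidden, W2))    # overall network output

    print(forward([1.0, 2.0]))  # [4.0]
    ```

    The whole network is just this alternation of matrix multiplications and element-wise activation functions, which is what makes GPU acceleration so effective.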

    Figure 3.9 A graph representation of a neural network (left), and the same network represented as a sequence of matrix operations (right).6

    As a side note, the matrix representation provides a transparent view of the depth of a network; the network’s depth is counted as the number of layers that have a weight matrix associated with them (or equivalently, the depth of a network is the number of weight matrices required by the network). This is why the input layer is not counted when calculating the depth of a network: it does not have a weight matrix associated with it.

    As mentioned above, the fact that the majority of calculations in a neural network can be represented as a sequence of matrix operations has important computational implications for deep learning. A neural network may contain over a million neurons, and the current trend is for the size of these networks to double every two to three years.4 Furthermore, deep learning networks are trained by iteratively running a network on examples sampled from very large datasets and then updating the network parameters (i.e., the weights) to improve performance. Consequently, training a deep learning network can require very large numbers of network runs, with each network run requiring millions of calculations. This is why computational speedups, such as those that can be achieved by using GPUs to perform matrix multiplications, have been so important for the development of deep learning.

    The relationship between GPUs and deep learning is not one-way. The growth in demand for GPUs generated by deep learning has had a significant impact on GPU manufacturers. Deep learning has resulted in these companies refocusing their business. Traditionally, these companies would have focused on the computer games market, since the original motivation for developing GPU chips was to improve graphics rendering, and this had a natural application to computer games. However, in recent years these companies have focused on positioning GPUs as hardware for deep learning and artificial intelligence applications. Furthermore, GPU companies have also invested to ensure that their products support the top deep learning software frameworks.

    Summary

    The primary theme in this chapter has been that deep learning networks are composed of large numbers of simple processing units that work together to learn and implement complex mappings from large datasets. These simple units, neurons, execute a two-stage process: first, a weighted summation over the inputs to the neuron is calculated, and second, the result of the weighted summation is passed through a nonlinear function, known as an activation function. The fact that a weighted summation function can be efficiently calculated across a layer of neurons using a single matrix multiplication operation is important: it means that neural networks can be understood as a sequence of matrix operations; this has permitted the use of GPUs, hardware optimized to perform fast matrix multiplication, to speed up the training of networks, which in turn has enabled the size of networks to grow.

    The compositional nature of neural networks means that it is possible to understand at a very fundamental level how a neural network operates. Providing a comprehensive description of this level of processing has been the focus of this chapter. However, the compositional nature of neural networks also raises a raft of questions in relation to how a network should be composed to solve a given task, for example:
    • Which activation functions should the neurons in a network use?
    • How many layers should there be in a network?
    • How many neurons should there be in each layer?
    • How should the neurons be connected together?

    Unfortunately, many of these questions cannot be answered from first principles. In machine learning terminology, the types of concepts these questions are about are known as hyperparameters, as distinct from model parameters. The parameters of a neural network are the weights on the edges, and these are set by training the network using large datasets. By contrast, hyperparameters are the parameters of a model (in this case, the parameters of a neural network architecture) and/or training algorithm that cannot be directly estimated from the data but instead must be specified by the person creating the model, through the use of heuristic rules, intuition, or trial and error. Often, much of the effort that goes into the creation of a deep learning network involves experimental work to answer these questions in relation to hyperparameters, and this process is known as hyperparameter tuning. The next chapter will review the history and evolution of deep learning, and the challenges posed by many of these questions are themes running through the review. Subsequent chapters in the book will explore how answering these questions in different ways can create networks with very different characteristics, each suited to different types of tasks. For example, recurrent neural networks are best suited to processing sequential/time-series data, whereas convolutional neural networks were originally developed to process images. Both of these network types are, however, built using the same fundamental processing unit, the artificial neuron; the differences in the behavior and abilities of these networks stem from how these neurons are arranged and composed.

    4 A Brief History of Deep Learning

    The history of deep learning can be described as three major periods of excitement and innovation, interspersed with periods of disillusionment. Figure 4.1 shows a timeline of this history, which highlights these periods of major research: on threshold logic units (early 1940s to the mid 1960s), connectionism (early 1980s to mid-1990s), and deep learning (mid 2000s to the present). Figure 4.1 distinguishes some of the primary characteristics of the networks developed in each of these three periods. The changes in these network characteristics highlight some of the major themes within the evolution of deep learning, including: the shift from binary to continuous values; the move from threshold activation functions, to logistic and tanh activation, and then onto ReLU activation; and the progressive deepening of the networks, from single layer, to multiple layer, and then onto deep networks. Finally, the upper half of figure 4.1 presents some of the important conceptual breakthroughs, training algorithms, and model architectures that have contributed to the evolution of deep learning.

    Figure 4.1 provides a map of the structure of this chapter, with the sequence of concepts introduced in the chapter generally following the chronology of this timeline. The two gray rectangles in figure 4.1 represent the development of two important deep learning network architectures: convolutional neural networks (CNNs), and recurrent neural networks (RNNs). We will describe the evolution of these two network architectures in this chapter, and chapter 5 will give a more detailed explanation of how these networks work.

    Figure 4.1 History of Deep Learning.

    Early Research: Threshold Logic Units

    In some of the literature on deep learning, the early neural network research is categorized as being part of cybernetics, a field of research that is concerned with developing computational models of control and learning in biological units. However, in figure 4.1, following the terminology used in Nilsson (1965), this early work is categorized as research on threshold logic units because this term transparently describes the main characteristics of the systems developed during this period. Most of the models developed in the 1940s, ’50s, and ’60s processed Boolean inputs (true/false represented as +1/-1 or 1/0) and generated Boolean outputs. They also used threshold activation functions (introduced in chapter 3), and were restricted to single-layer networks; in other words, they were restricted to a single matrix of tunable weights. Frequently, the focus of this early research was on understanding whether computational models based on artificial neurons had the capacity to learn logical relations, such as conjunction or disjunction.

    In 1943, Walter McCulloch and Walter Pitts published an influential computational model of biological neurons in a paper entitled: “A Logical Calculus of the Ideas Immanent in Nervous Activity” (McCulloch and Pitts 1943). The paper highlighted the all-or-none characteristic of neural activity in the brain and set out to mathematically describe neural activity in terms of a calculus of propositional logic. In the McCulloch and Pitts model, all the inputs and the output to a neuron were either 0 or 1. Furthermore, each input was either excitatory (having a weight of +1) or inhibitory (having a weight of -1). A key concept introduced in the McCulloch and Pitts model was a summation of inputs followed by a threshold function being applied to the result of the summation. In the summation, if an excitatory input was on, it added 1; if an inhibitory input was on, it subtracted 1. If the result of the summation was above a preset threshold, then the output of the neuron was 1; otherwise, it output a 0. In the paper, McCulloch and Pitts demonstrated how logical operations (such as conjunction, disjunction, and negation) could be represented using this simple model. The McCulloch and Pitts model integrated the majority of the elements that are present in the artificial neurons introduced in chapter 3. In this model, however, the neuron was fixed; in other words, the weights and threshold were set by hand.
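    A McCulloch and Pitts unit is simple enough to sketch in a few lines of code; the function names and the specific threshold values below are illustrative choices, not values taken from the paper:

```python
def mcculloch_pitts(inputs, weights, threshold):
    """Fixed McCulloch-Pitts unit: each weight is +1 (excitatory) or -1
    (inhibitory), inputs are 0/1, and the unit outputs 1 only when the
    summation of the weighted inputs is above the preset threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0

# Conjunction: fires only when both excitatory inputs are on (sum = 2 > 1).
AND = lambda a, b: mcculloch_pitts([a, b], [+1, +1], threshold=1)
# Disjunction: fires when at least one input is on (sum of 1 or 2 > 0).
OR = lambda a, b: mcculloch_pitts([a, b], [+1, +1], threshold=0)
# Negation: the inhibitory input keeps the unit off (sum = -x > -1 only when x = 0).
NOT = lambda a: mcculloch_pitts([a], [-1], threshold=-1)
```

    Because the weights and thresholds are fixed by hand, the unit computes a logical operation but does not learn it; learning enters the picture with Hebb's postulate and the perceptron, discussed next.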

    In 1949, Donald O. Hebb published a book entitled The Organization of Behavior, in which he set out a neuropsychological theory (integrating psychology and the physiology of the brain) to explain general human behavior. The fundamental premise of the theory was that behavior emerged through the actions and interactions of neurons. For neural network research, the most important idea in this book was a postulate, now known as Hebb’s postulate, which explained the creation of lasting memory in animals based on a process of changes to the connections between neurons:
    When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased. (Hebb 1949, p. 62)

    This postulate was important because it asserted that information was stored in the connections between neurons (i.e., in the weights of a network), and furthermore that learning occurred by changing these connections based on repeated patterns of activation (i.e., learning can take place within a network by changing the weights of the network).

    Rosenblatt’s Perceptron Training Rule

    In the years following Hebb’s publication, a number of researchers proposed computational models of neuron activity that integrated the Boolean threshold activation units of McCulloch and Pitts, with a learning mechanism based on adjusting the weights applied to the inputs. The best known of these models was Frank Rosenblatt’s perceptron model (Rosenblatt 1958). Conceptually, the perceptron model can be understood as a neural network consisting of a single artificial neuron that uses a threshold activation unit. Importantly, a perceptron network only has a single layer of weights. The first implementation of a perceptron was a software implementation on an IBM 704 system (and this was probably the first implementation of any neural network). However, Rosenblatt always intended the perceptron to be a physical machine and it was later implemented in custom-built hardware known as the “Mark 1 perceptron.” The Mark 1 perceptron received input from a camera that generated a 400-pixel image that was passed into the machine via an array of 400 photocells that were in turn connected to the neurons. The weights on connections to the neurons were implemented using adjustable electrical resistors known as potentiometers, and weight adjustments were implemented by using electric motors to adjust the potentiometers.

    Rosenblatt proposed an error-correcting training procedure for updating the weights of a perceptron so that it could learn to distinguish between two classes of input: inputs for which the perceptron should produce the output +1, and inputs for which the perceptron should produce the output -1 (Rosenblatt 1960). The training procedure assumes a set of Boolean encoded input patterns, each with an associated target output. At the start of training, the weights in the perceptron are initialized to random values. Training then proceeds by iterating through the training examples, and after each example has been presented to the network, the weights of the network are updated based on the error between the output generated by the perceptron and the target output specified in the data. The training examples can be presented to the network in any order and examples may be presented multiple times before training is completed. A complete training pass through the set of examples is known as an iteration, and training terminates when the perceptron correctly classifies all the examples in an iteration.

    Rosenblatt defined a learning rule (known as the perceptron training rule) to update each weight in a perceptron after a training example has been processed. The strategy the rule used to update the weights is the same as the three-condition strategy we introduced in chapter 2 to adjust the weights in the loan decision model:
    1. If the output of the model for an example matches the output specified for that example in the dataset, then don’t update the weights.
    2. If the output of the model is too low for the current example, then increase the output of the model by increasing the weights for the inputs that had a positive value for the example and decreasing the weights for the inputs that had a negative value for the example.
    3. If the output of the model is too high for the current example, then reduce the output of the model by decreasing the weights for the inputs that had a positive value and increasing the weights for the inputs that had a negative value for the example.

    Written out as an equation, Rosenblatt’s learning rule updates a weight $w_i$ as:

    $$w_i^{t+1} = w_i^t + \eta \left( y_t - \hat{y}_t \right) x_{i,t}$$

    In this rule, $w_i^{t+1}$ is the value of weight $i$ after the network weights have been updated in response to the processing of example $t$; $w_i^t$ is the value of weight $i$ used during the processing of example $t$; $\eta$ is a preset positive constant (known as the learning rate, discussed below); $y_t$ is the expected output for example $t$ as specified in the training dataset; $\hat{y}_t$ is the output generated by the perceptron for example $t$; and $x_{i,t}$ is the component of input $t$ that was weighted by $w_i$ during the processing of the example.

    Although it may look complex, the perceptron training rule is in fact just a mathematical specification of the three-condition weight update strategy described above. The primary part of the equation to understand is the calculation of the difference between the expected output and what the perceptron actually predicted: $(y_t - \hat{y}_t)$. The outcome of this subtraction tells us which of the three update conditions we are in. In understanding how this subtraction works, it is important to remember that for a perceptron model the desired output is always either $+1$ or $-1$. The first condition is when $y_t = \hat{y}_t$; then the output of the perceptron is correct and the weights are not changed.

    The second weight update condition is when the output of the perceptron is too large. This condition can only occur when the correct output for example $t$ is $y_t = -1$, and so it is triggered when the perceptron outputs $\hat{y}_t = +1$. In this case the error term is negative ($y_t - \hat{y}_t = -2$) and the weight $w_i$ is updated by $w_i + \eta(-2)x_{i,t}$. Assuming, for the purpose of this explanation, that $\eta$ is set to 0.5, then this weight update simplifies to $w_i - x_{i,t}$. In other words, when the perceptron’s output is too large, the weight update rule subtracts the input values from the weights. This will decrease the weights on inputs with positive values for the example, and increase the weights on inputs with negative values for the example (subtracting a negative number is the same as adding a positive number).

    The third weight update condition is when the output of the perceptron is too small. This weight update condition is the exact opposite of the second. It can only occur when $y_t = +1$ and so is triggered when $\hat{y}_t = -1$. In this case the error term is positive ($y_t - \hat{y}_t = +2$), and the weight is updated by $w_i + \eta(2)x_{i,t}$. Again assuming that $\eta$ is set to 0.5, this update simplifies to $w_i + x_{i,t}$, which highlights that when the error of the perceptron is positive, the rule updates the weight by adding the input to the weight. This has the effect of decreasing the weights on inputs with negative values for the example and increasing the weights on inputs with positive values for the example.
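    The three conditions above collapse into the single update rule, which can be sketched in a few lines. This is a minimal illustration, with two assumptions the text does not make: the weights start at zero rather than at random values (for reproducibility), and a constant input of 1 is prepended so that the first weight plays the role of the threshold.

```python
def train_perceptron(examples, eta=0.5, max_epochs=100):
    """Rosenblatt's rule: after each example, each weight is nudged by
    eta * (target - output) * input. A constant input of 1 is prepended
    so that the first weight plays the role of the threshold."""
    n_inputs = len(examples[0][0])
    w = [0.0] * (n_inputs + 1)       # zero init here; the book uses random init
    for _ in range(max_epochs):
        errors = 0
        for x, target in examples:   # target is +1 or -1
            xs = [1.0] + list(x)
            output = 1 if sum(wi * xi for wi, xi in zip(w, xs)) >= 0 else -1
            if output != target:
                errors += 1
                w = [wi + eta * (target - output) * xi for wi, xi in zip(w, xs)]
        if errors == 0:              # a full error-free pass: training terminates
            return w
    return None                      # no convergence within max_epochs

# AND is linearly separable, so the rule converges (inputs encoded as +1/-1):
data = [((-1, -1), -1), ((-1, 1), -1), ((1, -1), -1), ((1, 1), 1)]
weights = train_perceptron(data)
```

    Note that when the output is correct, `target - output` is zero and the weights are untouched; when it is wrong, the factor is −2 or +2, reproducing the second and third conditions exactly.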

    At a number of points in the preceding paragraphs we have referred to the learning rate, $\eta$. The purpose of the learning rate is to control the size of the adjustments that are applied to a weight. The learning rate is an example of a hyperparameter that is preset before the model is trained. There is a tradeoff in setting the learning rate:
    • If the learning rate is too small, it may take a very long time for the training process to converge on an appropriate set of weights.
    • If the learning rate is too large, the network’s weights may jump around the weight space too much and the training may not converge at all.

    One strategy for setting the learning rate is to set it to a relatively small positive value (e.g., 0.01); another strategy is to initialize it to a larger value (e.g., 1.0) and then systematically reduce it as the training progresses.

    To make this discussion regarding the learning rate more concrete, imagine you are trying to solve a puzzle that requires you to get a small ball to roll into a hole. You are able to control the direction and speed of the ball by tilting the surface that the ball is rolling on. If you tilt the surface too steeply, the ball will move very fast and is likely to go past the hole, requiring you to adjust the surface again, and if you overadjust you may end up repeatedly tilting the surface. On the other hand, if you only tilt the surface a tiny bit, the ball may not start to move at all, or it may move very slowly taking a long time to reach the hole. Now, in many ways the challenge of getting the ball to roll into the hole is similar to the problem of finding the best set of weights for a network. Think of each point on the surface the ball is rolling across as a possible set of network weights. The ball’s position at each point in time specifies the current set of weights of the network. The position of the hole specifies the optimal set of network weights for the task we are training the network to complete. In this context, guiding the network to the optimal set of weights is analogous to guiding the ball to the hole. The learning rate allows us to control how quickly we move across the surface as we search for the optimal set of weights. If we set the learning rate to a high value, we move quickly across the surface: we allow large updates to the weights at each iteration, so there are big differences between the network weights in one iteration and the next. Or, using our rolling ball analogy, the ball is moving very quickly, and just like in the puzzle when the ball is rolling too fast and passes the hole, our search process may be moving so fast that it misses the optimal set of weights. 
Conversely, if we set the learning rate to a low value, we move very slowly across the surface: we only allow small updates to the weights at each iteration; or, in other words, we only allow the ball to move very slowly. With a low learning rate, we are less likely to miss the optimal set of weights, but it may take an inordinate amount of time to get to them. The strategy of starting with a high learning rate and then systematically reducing it is equivalent to steeply tilting the puzzle surface to get the ball moving and then reducing the tilt to control the ball as it approaches the hole.
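    The second strategy can be sketched as a simple decay schedule. The $1/(1 + \text{decay} \cdot t)$ form below is an illustrative assumption; the text does not prescribe a particular formula:

```python
def decayed_learning_rate(initial_eta, t, decay=0.1):
    """Shrink the step size as training progresses: start with a large
    learning rate to move quickly across the weight space, then reduce
    it so the search settles near a good set of weights. The schedule
    1 / (1 + decay * t) is an illustrative choice, not the book's."""
    return initial_eta / (1.0 + decay * t)

# Starting at 1.0, the rate halves by iteration 10 and is 0.1 by iteration 90.
```

    In the rolling-ball analogy, this is steeply tilting the surface at first to get the ball moving, then flattening the tilt as the ball nears the hole.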

    Rosenblatt proved that if a set of weights exists that enables the perceptron to classify all of the training examples correctly, the perceptron training algorithm will eventually converge on this set of weights. This finding is known as the perceptron convergence theorem (Rosenblatt 1962). The difficulty with training a perceptron, however, is that it may require a substantial number of iterations through the data before the algorithm converges. Furthermore, for many problems it is not known in advance whether an appropriate set of weights exists; consequently, if training has been going on for a long time, it is not possible to know whether the training process is simply taking a long time to converge on the weights and terminate, or whether it will never terminate.

    The Least Mean Squares Algorithm

    Around the same time that Rosenblatt was developing the perceptron, Bernard Widrow and Marcian Hoff were developing a very similar model called the ADALINE (short for adaptive linear neuron), along with a learning rule called the LMS (least mean square) algorithm (Widrow and Hoff 1960). An ADALINE network consists of a single neuron that is very similar to a perceptron; the only difference is that an ADALINE network does not use a threshold function. In fact, the output of an ADALINE network is just the weighted sum of the inputs. This is why it is known as a linear neuron: a weighted sum is a linear function (it defines a line), and so an ADALINE network implements a linear mapping from inputs to output. The LMS rule is nearly identical to the perceptron learning rule, except that the output of the perceptron for a given example $t$ is replaced by the weighted sum of the inputs:

    $$w_i^{t+1} = w_i^t + \eta \left( y_t - \sum_{j} w_j^t x_{j,t} \right) x_{i,t}$$
    The logic of the LMS update rule is the same as that of the perceptron training rule. If the output is too large, then the weights that were applied to a positive input caused the output to be larger; these weights should be decreased, and those that were applied to a negative input should be increased, thereby reducing the output the next time this input pattern is received. By the same logic, if the output is too small, then the weights that were applied to a positive input should be increased and those that were applied to a negative input should be decreased.


    One of the important aspects of Widrow and Hoff’s work was to show that the LMS rule could be used to train a network to predict a number of any value, not just a +1 or -1. The learning rule was called the least mean square algorithm because using the LMS rule to iteratively adjust the weights in a neuron is equivalent to minimizing the average squared error on the training set. Today, the LMS learning rule is sometimes called the Widrow-Hoff learning rule, after its inventors; however, it is more commonly called the delta rule because it uses the difference (or delta) between the desired output and the actual output to calculate the weight adjustments. In other words, the LMS rule specifies that a weight should be adjusted in proportion to the difference between the output of an ADALINE network and the desired output: if the neuron makes a large error, then the weights are adjusted by a large amount; if the neuron makes a small error, then the weights are adjusted by a small amount.
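    As a sketch of how the delta rule trains a linear neuron, the minimal ADALINE below (the function name, the zero initialization, and the bias-input encoding are my own choices, not Widrow and Hoff's) learns a real-valued target, something a thresholded perceptron cannot output:

```python
def train_adaline(examples, eta=0.05, epochs=500):
    """LMS (delta) rule: the ADALINE output is the raw weighted sum
    (no threshold), and each weight is adjusted in proportion to the
    delta between the target and that sum."""
    n_inputs = len(examples[0][0])
    w = [0.0] * (n_inputs + 1)              # bias weight plus one weight per input
    for _ in range(epochs):
        for x, target in examples:
            xs = [1.0] + list(x)
            output = sum(wi * xi for wi, xi in zip(w, xs))  # linear output
            delta = target - output                         # the "delta"
            w = [wi + eta * delta * xi for wi, xi in zip(w, xs)]
    return w

# Unlike a perceptron, an ADALINE can learn a real-valued target, here y = 2x + 1:
line_data = [((0,), 1), ((1,), 3), ((2,), 5), ((3,), 7)]
w = train_adaline(line_data)
```

    Because the error shrinks as the weighted sum approaches the target, the updates themselves shrink, and the weights settle near the values that minimize the mean squared error.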

    Today, the perceptron is recognized as an important milestone in the development of neural networks because it was the first neural network to be implemented. However, most modern algorithms for training neural networks are more similar to the LMS algorithm. The LMS algorithm attempts to minimize the mean squared error of the network. As will be discussed in chapter 6, technically this iterative error reduction process involves a gradient descent down an error surface; and, today, nearly all neural networks are trained using some variant of gradient descent.

    The XOR Problem

    The success of Rosenblatt, Widrow and Hoff, and others, in demonstrating that neural network models could automatically learn to distinguish between different sets of patterns, generated a lot of excitement around artificial intelligence and neural network research. However, in 1969, Marvin Minsky and Seymour Papert published a book entitled Perceptrons, which, in the annals of neural network research, is credited with single-handedly destroying this early excitement and optimism (Minsky and Papert 1969). Admittedly, throughout the 1960s neural network research had suffered from a lot of hype, and a lack of success in terms of fulfilling the correspondingly high expectations. However, Minsky and Papert’s book set out a very negative view of the representational power of neural networks, and after its publication funding for neural network research dried up.

    Minsky and Papert’s book primarily focused on single layer perceptrons. Remember that a single layer perceptron is the same as a single neuron that uses a threshold activation function, and so a single layer perceptron is restricted to implementing a linear (straight-line) decision boundary.1 This means that a single layer perceptron can only learn to distinguish between two classes of inputs if it is possible to draw a straight line in the input space that has all of the examples of one class on one side of the line and all examples of the other class on the other side of the line. Minsky and Papert highlighted this restriction as a weakness of these models.

    To understand Minsky and Papert’s criticism of single layer perceptrons, we must first understand the concept of a linearly separable function. We will use a comparison between the logical AND and OR functions with the logical XOR function to explain the concept of a linearly separable function. The AND function takes two inputs, each of which can be either TRUE or FALSE, and returns TRUE if both inputs are TRUE. The plot on the left of figure 4.2 shows the input space for the AND function and categorizes each of the four possible input combinations as either resulting in an output value of TRUE (shown in the figure by using a clear dot) or FALSE (shown in the figure by using black dots). This plot illustrates that it is possible to draw a straight line between the inputs for which the AND function returns TRUE, (T,T), and the inputs for which the function returns FALSE, {(F,F), (F,T), (T,F)}. The OR function is similar to the AND function, except that it returns TRUE if either or both inputs are TRUE. The middle plot in figure 4.2 shows that it is possible to draw a line that separates the inputs that the OR function classifies as TRUE, {(F,T), (T,F), (T,T)}, from those it classifies as FALSE, (F,F). It is because we can draw a single straight line in the input space of these functions that divides the inputs belonging to one category of output from the inputs belonging to the other output category that the AND and OR functions are linearly separable functions.

    The XOR function is also similar in structure to the AND and OR functions; however, it only returns TRUE if one (but not both) of its inputs are TRUE. The plot on the right of figure 4.2 shows the input space for the XOR function and categorizes each of the four possible input combinations as returning either TRUE (shown in the figure by using a clear dot) or FALSE (shown in the figure by using black dots). Looking at this plot you will see that it is not possible to draw a straight line between the inputs the XOR function classifies as TRUE and those that it classifies as FALSE. It is because we cannot use a single straight line to separate the inputs belonging to different categories of outputs for the XOR function that this function is said to be a nonlinearly separable function. The fact that the XOR function is nonlinearly separable does not make the function unique, or even rare—there are many functions that are nonlinearly separable.

    Figure 4.2 Illustrations of linearly separable (AND, OR) and nonlinearly separable (XOR) functions. In each figure, black dots represent inputs for which the function returns FALSE, circles represent inputs for which the function returns TRUE. (T stands for true and F stands for false.)

    The key criticism that Minsky and Papert made of single layer perceptrons was that these single layer models were unable to learn nonlinearly separable functions, such as the XOR function. The reason for this limitation is that the decision boundary of a perceptron is linear and so a single layer perceptron cannot learn to distinguish between the inputs that belong to one output category of a nonlinearly separable function from those that belong to the other category.

    It was known at the time of Minsky and Papert’s publication that it was possible to construct neural networks that defined a nonlinear decision boundary, and thus learn nonlinearly separable functions (such as the XOR function). The key to creating networks with more complex (nonlinear) decision boundaries was to extend the network to have multiple layers of neurons. For example, figure 4.3 shows a two-layer network that implements the XOR function. In this network, the logical TRUE and FALSE values are mapped to numeric values: FALSE values are represented by 0, and TRUE values are represented by 1. In this network, units activate (output 1) if the weighted sum of their inputs is ≥ 1; otherwise, they output 0. Notice that the units in the hidden layer implement the logical AND and OR functions. These can be understood as intermediate steps to solving the XOR challenge. The unit in the output layer implements the XOR by composing the outputs of these hidden units. In other words, the unit in the output layer returns TRUE only when the AND node is off (output=0) and the OR node is on (output=1). However, it wasn’t clear at the time how to train networks with multiple layers. Also, at the end of their book, Minsky and Papert argued that “in their judgment” the research on extending neural networks to multiple layers was “sterile” (Minsky and Papert 1969, sec. 13.2 page 23).
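    The figure 4.3 network can be traced directly in code. This is a sketch: the ≥ 1 threshold follows the figure caption, but the specific weight values are illustrative choices consistent with the description, not values read from the figure.

```python
def step(weighted_sum, threshold=1):
    """Threshold unit from the figure: fires (outputs 1) when the
    weighted sum of its inputs is >= 1; otherwise outputs 0."""
    return 1 if weighted_sum >= threshold else 0

def xor_network(a, b):
    """Two-layer XOR network: hidden AND and OR units feed an output
    unit that fires only when the OR unit is on and the AND unit is off."""
    and_unit = step(0.5 * a + 0.5 * b)            # on only when a = b = 1
    or_unit = step(1.0 * a + 1.0 * b)             # on when a = 1 or b = 1
    return step(-1.0 * and_unit + 1.0 * or_unit)  # on when OR on and AND off
```

    Tracing the four input combinations shows the output is 1 exactly when one input (but not both) is 1, which no single straight-line decision boundary can achieve.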

    Figure 4.3 A network that implements the XOR function. All processing units use a threshold activation function with a threshold of ≥1.

    In a somewhat ironic historical twist, contemporaneous with Minsky and Papert’s publication, Alexey Ivakhnenko, a Ukrainian researcher, proposed the group method of data handling (GMDH), and in 1971 published a paper that described how it could be used to learn a neural network with eight layers (Ivakhnenko 1971). Today Ivakhnenko’s 1971 GMDH network is credited with being the first published example of a deep network trained from data (Schmidhuber 2015). However, for many years, Ivakhnenko’s accomplishment was largely overlooked by the wider neural network community. As a consequence, very little of the current work in deep learning uses the GMDH method for training: in the intervening years other training algorithms, such as backpropagation (described below), became standard in the community. While Ivakhnenko’s accomplishment went unnoticed, Minsky and Papert’s critique was proving persuasive, and it heralded the end of the first period of significant research on neural networks.

    This first period of neural network research did, however, leave a legacy that shaped the development of the field up to the present day. The basic internal structure of an artificial neuron was defined: a weighted sum of inputs fed through an activation function. The concept of storing information within the weights of a network was developed. Furthermore, learning algorithms based on iteratively adapting weights were proposed, along with practical learning rules, such as the LMS rule. In particular, the LMS approach, of adjusting the weights of neurons in proportion to the difference between the output of the neuron and the desired output, is present in most modern training algorithms. Finally, there was recognition of the limitations of single layer networks, and an understanding that one way to address these limitations was to extend the networks to include multiple layers of neurons. At this time, however, it was unclear how to train networks with multiple layers. Updating a weight requires an understanding of how the weight affects the error of the network. For example, in the LMS rule if the output of the neuron was too large, then weights that were applied to positive inputs caused the output to increase. Therefore, decreasing the size of these weights would reduce the output and thereby reduce the error. But, in the late 1960s, the question of how to model the relationship between the weights of the inputs to neurons in the hidden layers of a network and the overall error of the network was still unanswered; and, without this estimation of the contribution of a weight to the error, it was not possible to adjust the weights in the hidden layers of a network. The problem of attributing (or assigning) an amount of error to the components in a network is sometimes referred to as the credit assignment problem, or as the blame assignment problem.

    Connectionism: Multilayer Perceptrons

    In the 1980s, people began to reevaluate the criticisms of the late 1960s as being overly severe. Two developments, in particular, reinvigorated the field: (1) Hopfield networks; and (2) the backpropagation algorithm.

    In 1982, John Hopfield published a paper where he described a network that could function as an associative memory (Hopfield 1982). During training, an associative memory learns a set of input patterns. Once the associative memory network has been trained, then, if a corrupted version of one of the input patterns is presented to the network, the network is able to regenerate the complete correct pattern. Associative memories are useful for a number of tasks, including pattern completion and error correction. Table 4.1 illustrates the tasks of pattern completion and error correction using the example of an associative memory that has been trained to store information on people’s birthdays. In a Hopfield network, the memories, or input patterns, are encoded as binary strings; and, assuming the binary patterns are relatively distinct from each other, a Hopfield network can store up to 0.138 × N of these strings, where N is the number of neurons in the network. So to store 10 distinct patterns requires a Hopfield network with 73 neurons, and to store 14 distinct patterns requires 100 neurons.

    Table 4.1. Illustration of the uses of an associative memory for pattern completion and error correction

    Training patterns:   John**12May   Kerry*03Jan   Liz***25Feb   Des***10Mar   Josef*13Dec

    Pattern completion:  Liz***?????  →  Liz***25Feb
                         ???***10Mar  →  Des***10Mar

    Error correction:    Kerry*01Apr  →  Kerry*03Jan
                         Jxsuf*13Dec  →  Josef*13Dec

    Backpropagation and Vanishing Gradients

    In 1986, a group of researchers known as the parallel distributed processing (PDP) research group published a two-volume overview of neural network research (Rumelhart et al. 1986b, 1986c). These books proved to be incredibly popular, and chapter 8 in volume one described the backpropagation algorithm (Rumelhart et al. 1986a). The backpropagation algorithm has been invented a number of times,3 but it was this chapter by Rumelhart, Hinton, and Williams that popularized its use. The backpropagation algorithm is a solution to the credit assignment problem and so it can be used to train a neural network that has hidden layers of neurons. The backpropagation algorithm is possibly the most important algorithm in deep learning. However, a clear and complete explanation of the backpropagation algorithm requires first explaining the concept of an error gradient, and then the gradient descent algorithm. Consequently, the in-depth explanation of backpropagation is postponed until chapter 6, which begins with an explanation of these necessary concepts. The general structure of the algorithm, however, can be described relatively quickly. The backpropagation algorithm starts by assigning random weights to each of the connections in the network. The algorithm then iteratively refines the weights by presenting training instances to the network and updating the weights after each instance, until the network is working as expected. The core algorithm works in a two-stage process. In the first stage (known as the forward pass), an input is presented to the network and the neuron activations are allowed to flow forward through the network until an output is generated. The second stage (known as the backward pass) begins at the output layer and works backward through the network until the input layer is reached. This backward pass begins by calculating an error for each neuron in the output layer. 
This error is then used to update the weights of these output neurons. Then the error of each output neuron is shared back (backpropagated) to the hidden neurons that connect to it, in proportion to the weights on the connections between the output neuron and the hidden neuron. Once this sharing (or blame assignment) has been completed for a hidden neuron, the total blame attributable to that hidden neuron is summed and this total is used to update the weights on that neuron. The backpropagation (or sharing back) of blame is then repeated for the neurons that have not yet had blame attributed to them. This process of blame assignment and weight updates continues back through the network until all the weights have been updated.

    A key innovation that enabled the backpropagation algorithm to work was a change in the activation functions used in the neurons. The networks that were developed in the early years of neural network research used threshold activation functions. The backpropagation algorithm does not work with threshold activation functions because backpropagation requires that the activation functions used by the neurons in the network be differentiable. Threshold activation functions are not differentiable because there is a discontinuity in the output of the function at the threshold. In other words, the slope of a threshold function at the threshold is infinite and therefore it is not possible to calculate the gradient of the function at that point. This led to the use of differentiable activation functions in multilayer neural networks, such as the logistic and tanh functions.

    There is, however, an inherent limitation with using the backpropagation algorithm to train deep networks. In the 1980s, researchers found that backpropagation worked well with relatively shallow networks (one or two layers of hidden units), but that as the networks got deeper, the networks either took an inordinate amount of time to train, or else they entirely failed to converge on a good set of weights. In 1991, Sepp Hochreiter (working with Jürgen Schmidhuber) identified the cause of this problem in his diploma thesis (Hochreiter 1991). The problem is caused by the way the algorithm backpropagates errors. Fundamentally, the backpropagation algorithm is an implementation of the chain rule from calculus. The chain rule involves the multiplication of terms, and backpropagating an error from one neuron back to another can involve multiplying the error by a number of terms with values less than 1. These multiplications by values less than 1 happen repeatedly as the error signal gets passed back through the network. This results in the error signal becoming smaller and smaller as it is backpropagated through the network. Indeed, the error signal often diminishes exponentially with respect to the distance from the output layer. The effect of this diminishing error is that the weights in the early layers of a deep network are often adjusted by only a tiny (or zero) amount during each training iteration. In other words, the early layers either train very, very slowly or do not move away from their random starting positions at all. However, the early layers in a neural network are vitally important to the success of the network, because it is the neurons in these layers that learn to detect the features in the input that the later layers of the network use as the fundamental building blocks of the representations that ultimately determine the output of the network. 
For technical reasons, which will be explained in chapter 6, the error signal that is backpropagated through the network is in fact the gradient of the error of the network, and, as a result, this problem of the error signal rapidly diminishing to near zero is known as the vanishing gradient problem.
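    The exponential diminishment of the error signal is easy to see numerically. The following sketch is our own illustration, assuming for simplicity a fixed per-layer factor of 0.25 (the maximum value of the derivative of the logistic function):

```python
# Illustration of the vanishing gradient problem: the error signal is
# multiplied by a factor < 1 at every layer it passes back through.
per_layer_factor = 0.25   # e.g., the maximum derivative of the logistic function

gradient = 1.0
gradients_by_depth = []
for layer in range(10):   # backpropagate through 10 layers
    gradient *= per_layer_factor
    gradients_by_depth.append(gradient)

# After 10 layers the signal is 0.25**10: less than one millionth of
# its original size, so the earliest layers barely train at all.
```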

    Connectionism and Local versus Distributed Representations

    Despite the vanishing gradient problem, the backpropagation algorithm opened up the possibility of training more complex (deeper) neural network architectures. This aligned with the principle of connectionism. Connectionism is the idea that intelligent behavior can emerge from the interactions between large numbers of simple processing units. Another aspect of connectionism was the idea of a distributed representation. A distinction can be made in the representations used by neural networks between localist and distributed representations. In a localist representation there is a one-to-one correspondence between concepts and neurons, whereas in a distributed representation each concept is represented by a pattern of activations across a set of neurons. Consequently, in a distributed representation each concept is represented by the activation of multiple neurons and the activation of each neuron contributes to the representation of multiple concepts.

    In a distributed representation each concept is represented by the activation of multiple neurons and the activation of each neuron contributes to the representation of multiple concepts.

    To illustrate the distinction between localist and distributed representations, consider a scenario where (for some unspecified reason) a set of neuron activations is being used to represent the absence or presence of different foods. Furthermore, each food has two properties, the country of origin of the recipe and its taste. The possible countries of origin are: Italy, Mexico, or France; and the set of possible tastes are: Sweet, Sour, or Bitter. So, in total there are nine possible types of food: Italian+Sweet, Italian+Sour, Italian+Bitter, Mexican+Sweet, etc. Using a localist representation would require nine neurons, one neuron per food type. There are, however, a number of ways to define a distributed representation of this domain. One approach is to assign a binary number to each combination. This representation would require only four neurons, with the activation pattern 0000 representing Italian+Sweet, 0001 representing Italian+Sour, 0010 representing Italian+Bitter, and so on up to 1000 representing French+Bitter. This is a very compact representation. However, notice that in this representation the activation of each neuron in isolation has no independently meaningful interpretation: the rightmost neuron would be active (***1) for Italian+Sour, Mexican+Sweet, Mexican+Bitter, and French+Sour, and without knowledge of the activation of the other neurons, it is not possible to know what country or taste is being represented. However, in a deep network the lack of semantic interpretability of the activations of hidden units is not a problem, so long as the neurons in the output layer of the network are able to combine these representations in such a way as to generate the correct output. Another, more transparent, distributed representation of this food domain is to use three neurons to represent the countries and three neurons to represent the tastes. 
In this representation, the activation pattern 100100 could represent Italian+Sweet, 001100 could represent French+Sweet, and 001001 could represent French+Bitter. In this representation, the activation of each neuron can be independently interpreted; however, the distribution of activations across the set of neurons is required in order to retrieve the full description of the food (country+taste). Notice, however, that both of these distributed representations are more compact than the localist representation. This compactness can significantly reduce the number of weights required in a network, and this in turn can result in faster training times for the network.
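    The three representations of the food domain can be enumerated in a few lines of code. This sketch is our own; the encodings follow the patterns described above, and it makes the size difference explicit: nine neurons for the localist representation, four for the binary code, and six for the more transparent two-part code:

```python
from itertools import product

countries = ["Italian", "Mexican", "French"]
tastes = ["Sweet", "Sour", "Bitter"]
foods = [f"{c}+{t}" for c, t in product(countries, tastes)]  # 9 food types

# Localist: one neuron per concept (9 neurons, one-to-one correspondence).
localist = {food: [1 if i == j else 0 for j in range(9)]
            for i, food in enumerate(foods)}

# Distributed (binary): each food is a 4-bit code (4 neurons).
binary = {food: [int(b) for b in format(i, "04b")]
          for i, food in enumerate(foods)}

# Distributed (transparent): 3 country neurons + 3 taste neurons (6 neurons).
def two_hot(food):
    c, t = food.split("+")
    return ([1 if c == x else 0 for x in countries] +
            [1 if t == x else 0 for x in tastes])

transparent = {food: two_hot(food) for food in foods}
```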

    The concept of a distributed representation is very important within deep learning. Indeed, there is a good argument that deep learning might be more appropriately named representation learning—the argument being that the neurons in the hidden layers of a network are learning distributed representations of the input that are useful intermediate representations in the mapping from inputs to outputs that the network is attempting to learn. The task of the output layer of a network is then to learn how to combine these intermediate representations so as to generate the desired outputs. Consider again the network in figure 4.3 that implements the XOR function. The hidden units in this network learn an intermediate representation of the input, which can be understood as composed of the AND and OR functions; the output layer then combines this intermediate representation to generate the required output. In a deep network with multiple hidden layers, each subsequent hidden layer can be interpreted as learning a representation that is an abstraction over the outputs of the preceding layer. It is this sequential abstraction, through learning intermediate representations, that enables deep networks to learn such complex mappings from inputs to outputs.

    Network Architectures: Convolutional and Recurrent Neural Networks

    There are a considerable number of ways in which a set of neurons can be connected together. The network examples presented so far in the book have been connected together in a relatively uncomplicated manner: neurons are organized into layers and each neuron in a layer is directly connected to all of the neurons in the next layer of the network. These networks are known as feedforward networks because there are no loops within the network connections: all the connections point forward from the input toward the output. Furthermore, all of our network examples thus far would be considered to be fully connected, because each neuron is connected to all the neurons in the next layer. It is possible, and often useful, to design and train networks that are not feedforward and/or that are not fully connected. When done correctly, tailoring network architectures can be understood as embedding into the network architecture information about the properties of the problem that the network is trying to learn to model.

    A very successful example of incorporating domain knowledge into a network by tailoring the network's architecture is the design of convolutional neural networks (CNNs) for object recognition in images. In the 1960s, Hubel and Wiesel carried out a series of experiments on the visual cortex of cats (Hubel and Wiesel 1962, 1965). These experiments used electrodes inserted into the brains of sedated cats to study the response of the brain cells as the cats were presented with different visual stimuli. Examples of the stimuli used included bright spots or lines of light appearing at a location in the visual field, or moving across a region of the visual field. The experiments found that different cells responded to different stimuli at different locations in the visual field: in effect a single cell in the visual cortex would be wired to respond to a particular type of visual stimulus occurring within a particular region of the visual field. The region of the visual field that a cell responded to was known as the receptive field of the cell. Another outcome of these experiments was the differentiation between two types of cells: “simple” and “complex.” For simple cells, the location of the stimulus is critical, with a slight displacement of the stimulus resulting in a significant reduction in the cell’s response. Complex cells, however, respond to their target stimuli regardless of where in the field of vision the stimulus occurs. Hubel and Wiesel (1965) proposed that complex cells behaved as if they received projections from a large number of simple cells, all of which respond to the same visual stimuli but differ in the position of their receptive fields. This hierarchy of simple cells feeding into complex cells results in a funneling of stimuli from large areas of the visual field, through a set of simple cells, into a single complex cell. Figure 4.4 illustrates this funneling effect. 
This figure shows a layer of simple cells each monitoring a receptive field at a different location in the visual field. The receptive field of the complex cell covers the layer of simple cells, and this complex cell activates if any of the simple cells in its receptive field activates. In this way the complex cell can respond to a visual stimulus if it occurs at any location in the visual field.

    Figure 4.4 The funneling effect of receptive fields created by the hierarchy of simple and complex cells.

    In the late 1970s and early 1980s, Kunihiko Fukushima was inspired by Hubel and Wiesel’s analysis of the visual cortex and developed a neural network architecture for visual pattern recognition that was called the neocognitron (Fukushima 1980). The design of the neocognitron was based on the observation that an image recognition network should be able to recognize if a visual feature is present in an image irrespective of location in the image—or, to put it slightly more technically, the network should be able to do spatially invariant visual feature detection. For example, a face recognition network should be able to recognize the shape of an eye no matter where in the image it occurs, similar to the way a complex cell in Hubel and Wiesel’s hierarchical model could detect the presence of a visual feature irrespective of where in the visual field it occurred.

    Fukushima realized that the functioning of the simple cells in the Hubel and Wiesel hierarchy could be replicated in a neural network using a layer of neurons that all use the same set of weights, but with each neuron receiving inputs from fixed small regions (receptive fields) at different locations in the input field. To understand the relationship between neurons sharing weights and spatially invariant visual feature detection, imagine a neuron that receives a set of pixel values, sampled from a region of an image, as its inputs. The weights that this neuron applies to these pixel values define a visual feature detection function that returns true (high activation) if a particular visual feature (pattern) occurs in the input pixels, and false otherwise. Consequently, if a set of neurons all use the same weights, they will all implement the same visual feature detector. If the receptive fields of these neurons are then organized so that together they cover the entire image, then if the visual feature occurs anywhere in the image at least one of the neurons in the group will identify it and activate.
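    The link between weight sharing and spatially invariant feature detection can be illustrated with a one-dimensional sketch (our own simplification; real networks operate on two-dimensional images, and the weights, signal, and threshold below are arbitrary). A single set of weights is applied at every position of the input, so the detector fires wherever its target pattern occurs:

```python
import numpy as np

def shared_weight_detector(signal, weights, threshold=0.9):
    """Apply the same weights at every position of the input.
    Because all positions share one set of weights, they all implement
    the same feature detector, so the feature is detected no matter
    where in the input it occurs."""
    k = len(weights)
    activations = [float(np.dot(signal[i:i + k], weights))
                   for i in range(len(signal) - k + 1)]
    return [a > threshold for a in activations]

feature = np.array([1.0, -1.0])   # shared weights: a simple "edge" detector
signal = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0])

hits = shared_weight_detector(signal, feature)
# The detector fires at both positions where the edge pattern occurs.
```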

    Fukushima also recognized that the Hubel and Wiesel funneling effect (into complex cells) could be obtained by neurons in later layers also receiving as input the outputs from a fixed set of neurons in a small region of the preceding layer. In this way, the neurons in the last layer of the network each receive inputs from across the entire input field allowing the network to identify the presence of a visual feature anywhere in the visual input.

    Some of the weights in the neocognitron were set by hand, and others were set using an unsupervised training process. In this training process, each time an example is presented to the network a single layer of neurons that share the same weights is selected from the layers that yielded large outputs in response to the input. The weights of the neurons in the selected layer are updated so as to reinforce their response to that input pattern and the weights of neurons not in the layer are not updated. In 1989 Yann LeCun developed the convolutional neural network (CNN) architecture specifically for the task of image processing (LeCun 1989). The CNN architecture shared many of the design features found in the neocognitron; however, LeCun showed how these types of networks could be trained using backpropagation. CNNs have proved to be incredibly successful in image processing and other tasks. A particularly famous CNN is the AlexNet network, which won the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) in 2012 (Krizhevsky et al. 2012). The goal of the ILSVRC competition is to identify objects in photographs. The success of AlexNet at the ILSVRC competition generated a lot of excitement about CNNs, and since AlexNet a number of other CNN architectures have won the competition. CNNs are one of the most popular types of deep neural networks, and chapter 5 will provide a more detailed explanation of them.

    Recurrent neural networks (RNNs) are another example of a neural network architecture that has been tailored to the specific characteristics of a domain. RNNs are designed to process sequential data, such as language. An RNN network processes a sequence of data (such as a sentence) one input at a time. An RNN has only a single hidden layer. However, the output from each of these hidden neurons is not only fed forward to the output neurons, it is also temporarily stored in a buffer and then fed back into all of the hidden neurons at the next input. Consequently, each time the network processes an input, each neuron in the hidden layer receives both the current input and the output the hidden layer generated in response to the previous input. In order to understand this explanation, it may at this point be helpful to briefly skip forward to figure 5.2 to see an illustration of the structure of an RNN and the flow of information through the network. This recurrent loop, of activations from the output of the hidden layer for one input being fed back into the hidden layer alongside the next input, gives an RNN a memory that enables it to process each input in the context of the previous inputs it has processed.4 RNNs are considered deep networks because, when the recurrent loop is unrolled across a sequence, the repeated application of the hidden layer makes the network effectively as deep as the sequence is long.
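    The recurrent loop can be sketched in a few lines (an illustration under our own simplifying assumptions; the weight values and layer sizes are arbitrary). Each step mixes the current input with the buffered hidden activations from the previous step:

```python
import numpy as np

def rnn_step(x, h_prev, W_in, W_rec):
    """One step of a simple recurrent layer: the hidden neurons receive
    the current input AND the hidden activations buffered from the
    previous step, giving the network a memory of its recent inputs."""
    return np.tanh(x @ W_in + h_prev @ W_rec)

rng = np.random.default_rng(1)
W_in = rng.normal(scale=0.5, size=(3, 4))    # input -> hidden weights
W_rec = rng.normal(scale=0.5, size=(4, 4))   # hidden -> hidden (recurrent) weights

sequence = rng.normal(size=(5, 3))           # a sequence of 5 inputs
h = np.zeros(4)                              # the buffer starts empty
for x in sequence:
    h = rnn_step(x, h, W_in, W_rec)          # h carries context forward
```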

    An early well-known RNN is the Elman network. In 1990, Jeffrey Locke Elman published a paper that described an RNN that had been trained to predict the endings of simple two- and three-word utterances (Elman 1990). The model was trained on a synthesized dataset of simple sentences generated using an artificial grammar. The grammar was built using a lexicon of twenty-three words, with each word assigned to a single lexical category (e.g., man=NOUN-HUM, woman=NOUN-HUM, eat=VERB-EAT, cookie=NOUN-FOOD, etc.). Using this lexicon, the grammar defined fifteen sentence generation templates (e.g., NOUN-HUM+VERB-EAT+NOUN-FOOD which would generate sentences such as man eat cookie). Once trained, the model was able to generate reasonable continuations for sentences, such as woman+eat+? = cookie. Furthermore, once the network was started, it was able to generate longer strings consisting of multiple sentences, using the context it generated itself as the input for the next word, as illustrated by this three-sentence example:

    girl eat bread dog move mouse mouse move book

    Although this sentence generation task was applied to a very simple domain, the ability of the RNN to generate plausible sentences was taken as evidence that neural networks could model linguistic productivity without requiring explicit grammatical rules. Consequently, Elman’s work had a huge impact on psycholinguistics and psychology. The following quote, from Churchland 1996, illustrates the importance that some researchers attributed to Elman’s work:
    The productivity of this network is of course a feeble subset of the vast capacity that any normal English speaker commands. But productivity is productivity, and evidently a recurrent network can possess it. Elman’s striking demonstration hardly settles the issue between the rule-centered approach to grammar and the network approach. That will be some time in working itself out. But the conflict is now an even one. I’ve made no secret where my own bets will be placed. (Churchland 1996, p. 143)5

    Although RNNs work well with sequential data, the vanishing gradient problem is particularly severe in these networks. In 1997, Sepp Hochreiter and Jürgen Schmidhuber, the researchers who in 1991 had presented an explanation of the vanishing gradient problem, proposed long short-term memory (LSTM) units as a solution to this problem in RNNs (Hochreiter and Schmidhuber 1997). The name of these units draws on a distinction between how a neural network encodes long-term memory (understood as concepts that are learned over a period of time) through training and short-term memory (understood as the response of the system to immediate stimuli). In a neural network, long-term memory is encoded through adjusting the weights of the network and once trained these weights do not change. Short-term memory is encoded in a network through the activations that flow through the network and these activation values decay quickly. LSTM units are designed to enable the short-term memory (the activations) in the network to be propagated over long periods of time (or sequences of inputs). The internal structure of an LSTM is relatively complex, and we will describe it in chapter 5. The fact that LSTMs can propagate activations over long periods enables them to process sequences that include long-distance dependencies (interactions between elements in a sequence that are separated by two or more positions). For example, the dependency between the subject and the verb in an English sentence: The dog/dogs in that house is/are aggressive. This has made LSTM networks suitable for language processing, and for a number of years they have been the default neural network architecture for many natural language processing models, including machine translation. For example, the sequence-to-sequence (seq2seq) machine translation architecture introduced in 2014 connects two LSTM networks in sequence (Sutskever et al. 2014). 
The first LSTM network, the encoder, processes the input sequence one input at a time, and generates a distributed representation of that input. The first LSTM network is called an encoder because it encodes the sequence of words into a distributed representation. The second LSTM network, the decoder, is initialized with the distributed representation of the input and is trained to generate the output sequence one element at a time using a feedback loop that feeds the most recent output element generated by the network back in as the input for the next time step. Today, this seq2seq architecture is the basis for most modern machine translation systems, and is explained in more detail in chapter 5.

    By the late 1990s, most of the conceptual requirements for deep learning were in place, including both the algorithms to train networks with multiple layers, and the network architectures that are still very popular today (CNNs and RNNs). However, the problem of the vanishing gradients still stifled the creation of deep networks. Also, from a commercial perspective, the 1990s (similar to the 1960s) experienced a wave of hype based on neural networks and unrealized promises. At the same time, a number of breakthroughs in other forms of machine learning models, such as the development of support vector machines (SVMs), redirected the focus of the machine learning research community away from neural networks: at the time SVMs were achieving similar accuracy to neural network models but were easier to train. Together these factors led to a decline in neural network research that lasted up until the emergence of deep learning.

    The Era of Deep Learning

    The first recorded use of the term deep learning is credited to Rina Dechter (1986), although in Dechter’s paper the term was not used in relation to neural networks; and the first use of the term in relation to neural networks is credited to Aizenberg et al. (2000).6 In the mid-2000s, interest in neural networks started to grow, and it was around this time that the term deep learning came to prominence to describe deep neural networks. The term deep learning is used to emphasize the fact that the networks being trained are much deeper than previous networks.

    One of the early successes of this new era of neural network research was when Geoffrey Hinton and his colleagues demonstrated that it was possible to train a deep neural network using a process known as greedy layer-wise pretraining. Greedy layer-wise pretraining begins by training a single layer of neurons that receives input directly from the raw input. There are a number of different ways that this single layer of neurons can be trained, but one popular way is to use an autoencoder. An autoencoder is a neural network with three layers: an input layer, a hidden (encoding) layer, and an output (decoding) layer. The network is trained to reconstruct the inputs it receives in the output layer; in other words, the network is trained to output the exact same values that it received as input. A very important feature in these networks is that they are designed so that it is not possible for the network to simply copy the inputs to the outputs. For example, an autoencoder may have fewer neurons in the hidden layer than in the input and output layer. Because the autoencoder is trying to reconstruct the input at the output layer, the fact that the information from the input must pass through this bottleneck in the hidden layer forces the autoencoder to learn an encoding of the input data in the hidden layer that captures only the most important features in the input, and disregards redundant or superfluous information.7
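    The bottleneck idea can be illustrated with a minimal linear autoencoder (a deliberate simplification of our own: real autoencoders typically use nonlinear activations, and this sketch is not Hinton's procedure). Four-dimensional inputs that really vary in only two directions are reconstructed through a two-unit encoding layer, so the hidden layer is forced to capture the important structure of the data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Data: 4-dimensional inputs that only vary in 2 underlying directions,
# so a 2-unit bottleneck can capture the important features.
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 4))
X = latent @ mixing

# A linear autoencoder: 4 inputs -> 2 hidden (encoding) -> 4 outputs.
W_enc = rng.normal(scale=0.1, size=(4, 2))
W_dec = rng.normal(scale=0.1, size=(2, 4))
lr = 0.01

def reconstruction_error(X, W_enc, W_dec):
    return float(np.mean((X @ W_enc @ W_dec - X) ** 2))

initial_error = reconstruction_error(X, W_enc, W_dec)
for _ in range(200):
    H = X @ W_enc                    # encode through the bottleneck
    R = H @ W_dec                    # decode: attempt to reconstruct X
    G = 2.0 * (R - X) / len(X)       # gradient of the mean squared error
    grad_dec = H.T @ G
    grad_enc = X.T @ (G @ W_dec.T)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
final_error = reconstruction_error(X, W_enc, W_dec)
```

As training proceeds, the reconstruction error falls, which is only possible because the two hidden units learn an encoding that keeps the important information and discards the redundancy.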

    Layer-Wise Pretraining Using Autoencoders

    In layer-wise pretraining, the initial autoencoder learns an encoding for the raw inputs to the network. Once this encoding has been learned, the units in the hidden encoding layer are fixed, and the output (decoding) layer is thrown away. Then a second autoencoder is trained—but this autoencoder is trained to reconstruct the representation of the data generated by passing it through the encoding layer of the initial autoencoder. In effect, this second autoencoder is stacked on top of the encoding layer of the first autoencoder. This stacking of encoding layers is considered to be a greedy process because each encoding layer is optimized independently of the later layers; in other words, each autoencoder focuses on finding the best solution for its immediate task (learning a useful encoding for the data it must reconstruct) rather than trying to find a solution to the overall problem for the network.

    Once a sufficient number8 of encoding layers have been trained, a tuning phase can be applied. In the tuning phase, a final network layer is trained to predict the target output for the network. Unlike the pretraining of the earlier layers of the network, the target output for the final layer is different from the input vector and is specified in the training dataset. The simplest tuning is where the pretrained layers are kept frozen (i.e., the weights in the pretrained layers don’t change during the tuning); however, it is also feasible to train the entire network during the tuning phase. If the entire network is trained during tuning, then the layer-wise pretraining is best understood as finding useful initial weights for the earlier layers in the network. Also, it is not necessary that the final prediction model that is trained during tuning be a neural network. It is quite possible to take the representations of the data generated by the layer-wise pretraining and use it as the input representation for a completely different type of machine learning algorithm, for example, a support vector machine or a nearest neighbor algorithm. This scenario is a very transparent example of how neural networks learn useful representations of data prior to the final prediction task being learned. Strictly speaking, the term pretraining describes only the layer-wise training of the autoencoders; however, the term is often used to refer to both the layer-wise training stage and the tuning stage of the model.
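    The option of reusing the pretrained representation with a completely different learner can be sketched as follows. This is a hypothetical illustration: the "pretrained" encoder below is a stand-in frozen projection rather than the result of actual layer-wise pretraining, and the final model is a hand-rolled 1-nearest-neighbor classifier:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for an encoding layer learned by layer-wise pretraining;
# its weights are now frozen and do not change during tuning.
W_frozen = rng.normal(size=(4, 2))

def encode(X):
    return np.maximum(0.0, X @ W_frozen)   # frozen pretrained encoder

# Labeled training data in the original 4-d input space.
X_train = rng.normal(size=(20, 4))
y_train = (X_train[:, 0] > 0).astype(int)

# Tuning option: feed the pretrained representation into a different
# type of learner entirely, here a 1-nearest-neighbor classifier.
def predict_1nn(x):
    codes = encode(X_train)
    distances = np.linalg.norm(codes - encode(x[None, :]), axis=1)
    return y_train[np.argmin(distances)]
```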

    Figure 4.5 shows the stages in layer-wise pretraining. The figure on the left illustrates the training of the initial autoencoder where an encoding layer (the black circles) of three units is attempting to learn a useful representation for the task of reconstructing an input vector of length 4. The figure in the middle of figure 4.5 shows the training of a second autoencoder stacked on top of the encoding layer of the first autoencoder. In this autoencoder, a hidden layer of two units is attempting to learn an encoding for an input vector of length 3 (which in turn is an encoding of a vector of length 4). The grey background in each figure demarcates the components in the network that are frozen during this training stage. The figure on the right shows the tuning phase where a final output layer is trained to predict the target feature for the model. For this example, in the tuning phase the pretrained layers in the network have been frozen.

    Figure 4.5 The pretraining and tuning stages in greedy layer-wise pretraining. Black circles represent the neurons whose training is the primary objective at each training stage. The gray background marks the components in the network that are frozen during each training stage.

    Layer-wise pretraining was important in the evolution of deep learning because it was the first approach to training deep networks that was widely adopted.9 However, today most deep learning networks are trained without using layer-wise pretraining. In the mid-2000s, researchers began to appreciate that the vanishing gradient problem was not a strict theoretical limit, but was instead a practical obstacle that could be overcome. The vanishing gradient problem does not cause the error gradients to disappear entirely; there are still gradients being backpropagated through the early layers of the network, it is just that they are very small. Today, there are a number of factors that have been identified as important in successfully training a deep network.

    In the mid-2000s, researchers began to appreciate that the vanishing gradient problem was not a strict theoretical limit, but was instead a practical obstacle that could be overcome.

    Weight Initialization and ReLU Activation Functions

    One factor that is important in successfully training a deep network is how the network weights are initialized. The principles controlling how weight initialization affects the training of a network are still not clear. There are, however, weight initialization procedures that have been empirically shown to help with training a deep network. Glorot initialization10 is a frequently used weight initialization procedure for deep networks. It is based on a number of assumptions but has empirical success to support its use. To get an intuitive understanding of Glorot initialization, consider the fact that there is typically a relationship between the magnitude of values in a set and the variance of the set: generally the larger the values in a set, the larger the variance of the set. So, if the variance calculated on a set of gradients propagated through a layer at one point in the network is similar to the variance for the set of gradients propagated through another layer in a network, it is likely that the magnitude of the gradients propagated through both of these layers will also be similar. Furthermore, the variance of gradients in a layer can be related to the variance of the weights in the layer, so a potential strategy to maintain gradients flowing through a network is to ensure similar variances across each of the layers in a network. Glorot initialization is designed to initialize the weights in a network in such a way that all of the layers in a network will have a similar variance in terms of both forward pass activations and the gradients propagated during the backward pass in backpropagation. 
Glorot initialization defines a heuristic rule to meet this goal that involves sampling the weights for a network using the following uniform distribution (where w is a weight on a connection between layer j and layer j+1 that is being initialized, U[-a,a] is the uniform distribution over the interval [-a,a], n_j is the number of neurons in layer j, and the notation w ~ U indicates that the value of w is sampled from distribution U)11:

w ~ U[ -√6/√(n_j + n_(j+1)), +√6/√(n_j + n_(j+1)) ]
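    The Glorot uniform rule can be implemented directly. This is our own sketch; the function and variable names are not from any particular library:

```python
import numpy as np

def glorot_uniform(n_in, n_out, rng):
    """Sample a weight matrix for a connection between a layer with n_in
    neurons and a layer with n_out neurons, using the Glorot uniform rule:
    w ~ U[-sqrt(6)/sqrt(n_in + n_out), +sqrt(6)/sqrt(n_in + n_out)]."""
    limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

rng = np.random.default_rng(0)
W = glorot_uniform(300, 100, rng)   # weights between a 300- and a 100-unit layer
```

The sampling bound shrinks as the layers get wider, which is what keeps the variance of activations and backpropagated gradients roughly comparable from layer to layer.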

    Another factor that contributes to the success or failure of training a deep network is the selection of the activation function used in the neurons. Backpropagating an error gradient through a neuron involves multiplying the gradient by the value of the derivative of the activation function at the activation value of the neuron recorded during the forward pass. The derivatives of the logistic and tanh activation functions have a number of properties that can exacerbate the vanishing gradient problem if they are used in this multiplication step. Figure 4.6 presents a plot of the logistic function and the derivative of the logistic function. The maximum value of the derivative is 0.25. Consequently, after an error gradient has been multiplied by the value of the derivative of the logistic function at the appropriate activation for the neuron, the maximum value the gradient will have is a quarter of the gradient prior to the multiplication. Another problem with using the logistic function is that there are large portions of the domain of the function where the function is saturated (returning values that are very close to 0 or 1), and the rate of change of the function in these regions is near zero; thus, the derivative of the function is near 0. This is an undesirable property when backpropagating error gradients because the error gradients will be forced to zero (or close to zero) when backpropagated through any neuron whose activation is within one of these saturated regions. In 2011 it was shown that switching to a rectified linear activation function, rectified(z) = max(0, z), improved training for deep feedforward neural networks (Glorot et al. 2011). Neurons that use a rectified linear activation function are known as rectified linear units (ReLUs). One advantage of ReLUs is that the activation function is linear for the positive portion of its domain with a derivative equal to 1. This means that gradients can flow easily through ReLUs that have positive activation. 
However, the drawback of ReLUs is that the gradient of the function for the negative part of its domain is zero, so ReLUs do not train in this portion of the domain. Although undesirable, this is not necessarily a fatal flaw for learning because when backpropagating through a layer of ReLUs the gradients can still flow through the ReLUs in the layers that have positive activation. Furthermore, there are a number of variants of the basic ReLU that introduce a gradient on the negative side of the domain, a commonly used variant being the leaky ReLU (Maas et al. 2013). Today, ReLUs (or variants of ReLUs) are the most frequently used neurons in deep learning research.
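The gradient properties described above are easy to verify numerically. A minimal sketch (the function names are our own) comparing the derivatives of the logistic and ReLU activation functions:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def logistic_deriv(x):
    s = logistic(x)
    return s * (1.0 - s)           # peaks at 0.25, when x = 0

def relu(x):
    return np.maximum(0.0, x)

def relu_deriv(x):
    return (x > 0).astype(float)   # 1 for positive activations, 0 otherwise

def leaky_relu_deriv(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)  # small nonzero slope when x <= 0

xs = np.linspace(-6, 6, 1001)
assert logistic_deriv(xs).max() <= 0.25          # gradient shrinks 4x at best
assert logistic_deriv(np.array([-6.0]))[0] < 0.01  # saturated region: gradient ~ 0
assert relu_deriv(np.array([5.0]))[0] == 1.0     # gradient passes through unchanged
```

The saturation check makes the vanishing gradient problem concrete: a logistic neuron with activation near 0 or 1 multiplies every backpropagated gradient by a value close to zero, whereas a ReLU with positive activation multiplies it by exactly 1.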

    Figure 4.6 Plots of the logistic function and the derivative of the logistic function.

    The Virtuous Cycle: Better Algorithms, Faster Hardware, Bigger Data

    Although improved weight initialization methods and new activation functions have both contributed to the growth of deep learning, in recent years the two most important factors driving deep learning have been the speedup in computer power and the massive increase in dataset sizes. From a computational perspective, a major breakthrough for deep learning occurred in the late 2000s with the adoption of graphical processing units (GPUs) by the deep learning community to speed up training. A neural network can be understood as a sequence of matrix multiplications that are interspersed with the application of nonlinear activation functions, and GPUs are optimized for very fast matrix multiplication. Consequently, GPUs are ideal hardware to speed up neural network training, and their use has made a significant contribution to the development of the field. In 2004, Oh and Jung reported a twentyfold performance increase using a GPU implementation of a neural network (Oh and Jung 2004), and two further papers soon followed that demonstrated the potential of GPUs to speed up the training of neural networks: Steinkraus et al. (2005) used GPUs to train a two-layer neural network, and Chellapilla et al. (2006) used GPUs to train a CNN. However, at that time there were significant programming challenges to using GPUs for training networks (the training algorithm had to be implemented as a sequence of graphics operations), and so the initial adoption of GPUs by neural network researchers was relatively slow. These programming challenges were significantly reduced in 2007 when NVIDIA (a GPU manufacturer) released a C-like programming interface for GPUs called CUDA (compute unified device architecture). CUDA was specifically designed to facilitate the use of GPUs for general computing tasks. In the years following the release of CUDA, the use of GPUs to speed up neural network training became standard.
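To see why GPUs map so well onto this workload, note that a feedforward pass really is just matrix multiplications with elementwise nonlinearities in between. A toy sketch (layer sizes and weight values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)

# A small two-layer network: 784 inputs -> 256 hidden neurons -> 10 outputs.
W1 = rng.standard_normal((784, 256)) * 0.05
W2 = rng.standard_normal((256, 10)) * 0.05

def forward(X):
    # The entire forward pass is two matrix multiplies and one
    # elementwise nonlinearity -- exactly the workload GPUs accelerate.
    H = np.maximum(0.0, X @ W1)   # hidden layer: matmul + ReLU
    return H @ W2                 # output layer: another matmul

X = rng.standard_normal((32, 784))   # a batch of 32 inputs
Y = forward(X)
assert Y.shape == (32, 10)
```

Processing inputs in batches turns many small vector-matrix products into a few large matrix-matrix products, which is what makes the hardware speedup so dramatic.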

    However, even with these more powerful computer processors, deep learning would not have been possible unless massive datasets had also become available. The development of the internet and social media platforms, together with the proliferation of smartphones and “internet of things” sensors, has meant that the amount of data being captured has grown at an incredible rate over the last ten years. This has made it much easier for organizations to gather large datasets. This growth in data has been incredibly important to deep learning because neural network models scale well with larger data (and in fact they can struggle with smaller datasets). It has also prompted organizations to consider how this data can be used to drive the development of new applications and innovations. This in turn has driven a need for new (more complex) computational models in order to deliver these new applications. And the combination of large data and more complex algorithms requires faster hardware in order to make the necessary computational workload tractable. Figure 4.7 illustrates the virtuous cycle between big data, algorithmic breakthroughs (e.g., better weight initialization, ReLUs, etc.), and improved hardware that is driving the deep learning revolution.

    Figure 4.7 The virtuous cycle driving deep learning. Figure inspired by figure 1.2 in Reagen et al. 2017.

    Summary

    The history of deep learning reveals a number of underlying themes. There has been a shift from simple binary inputs to more complex continuous valued inputs. This trend toward more complex inputs is set to continue because deep learning models are most useful in high-dimensional domains, such as image processing and language. Images often have thousands of pixels in them, and language processing requires the ability to represent and process hundreds of thousands of different words. This is why some of the best-known applications of deep learning are in these domains, for example, Facebook’s face-recognition software and Google’s neural machine translation system. However, there are a growing number of new domains where large and complex digital datasets are being gathered. One area where deep learning has the potential to make a significant impact within the coming years is healthcare, and another complex domain is the sensor-rich field of self-driving cars.

    Somewhat surprisingly, at the core of these powerful models are simple information processing units: neurons. The connectionist idea that useful complex behavior can emerge from the interactions between large numbers of simple processing units is still valid today. This emergent behavior arises through the sequence of layers in a network learning a hierarchical abstraction of increasingly complex features. This hierarchical abstraction is achieved by each neuron learning a simple transformation of the input it receives. The network as a whole then composes these sequences of smaller transformations in order to apply a complex (highly nonlinear) mapping to the input. The output from the model is then generated by the final layer of output neurons, based on the learned representation generated through the hierarchical abstraction. This is why depth is such an important factor in neural networks: the deeper the network, the more powerful the model becomes in terms of its ability to learn complex nonlinear mappings. In many domains, the relationship between input data and desired outputs involves just such complex nonlinear mappings, and it is in these domains that deep learning models outdo other machine learning approaches.

    An important design choice in creating a neural network is deciding which activation function to use within the neurons in a network. The activation function within each neuron in a network is how nonlinearity is introduced into the network, and as a result it is a necessary component if the network is to learn a nonlinear mapping from inputs to outputs. As networks have evolved, so too have the activation functions used in them. New activation functions have emerged throughout the history of deep learning, often driven by the need for functions with better properties for error-gradient propagation: a major factor in the shift from threshold to logistic and tanh activation functions was the need for differentiable functions in order to apply backpropagation; the more recent shift to ReLUs was, similarly, driven by the need to improve the flow of error gradients through the network. Research on activation functions is ongoing, and new functions will be developed and adopted in the coming years.

    Another important design choice in creating a neural network is to decide on the structure of the network: for example, how should the neurons in the network be connected together? In the next chapter, we will discuss two very different answers to this question: convolutional neural networks and recurrent neural networks.

    5 Convolutional and Recurrent Neural Networks

    Tailoring the structure of a network to the specific characteristics of the data from a task domain can reduce the training time of the network and improve its accuracy. Tailoring can be done in a number of ways, such as: constraining the connections between neurons in adjacent layers to subsets (rather than having fully connected layers); forcing neurons to share weights; or introducing backward connections into the network. Tailoring in these ways can be understood as building domain knowledge into the network. Another, related, perspective is that it helps the network to learn by constraining the set of possible functions that it can learn, and by so doing guides the network to find a useful solution. It is not always clear how to fit a network structure to a domain, but for some domains where the data has a very regular structure (e.g., sequential data such as text, or gridlike data such as images) there are well-known network architectures that have proved successful. This chapter will introduce two of the most popular deep learning architectures: convolutional neural networks and recurrent neural networks.

    Convolutional Neural Networks

    Convolutional neural networks (CNNs) were designed for image recognition tasks and were originally applied to the challenge of handwritten digit recognition (Fukushima 1980; LeCun 1989). The basic design goal of CNNs was to create a network where the neurons in the early layers of the network would extract local visual features, and neurons in later layers would combine these features to form higher-order features. A local visual feature is a feature whose extent is limited to a small patch, a set of neighboring pixels, in an image. For example, when applied to the task of face recognition, the neurons in the early layers of a CNN learn to activate in response to simple local features (such as lines at a particular angle, or segments of curves), neurons deeper in the network combine these low-level features into features that represent body parts (such as eyes or noses), and the neurons in the final layers of the network combine body part activations in order to be able to identify whole faces in an image.

    Using this approach, the fundamental task in image recognition is learning the feature detection functions that can robustly identify the presence, or absence, of local visual features in an image. The process of learning functions is at the core of neural networks, and is achieved by learning the appropriate set of weights for the connections in the network. CNNs learn the feature detection functions for local visual features in this way. However, a related challenge is designing the architecture of the network so that the network will identify the presence of a local visual feature in an image irrespective of where in the image it occurs. In other words, the feature detection functions must be able to work in a translation invariant manner. For example, a face recognition system should be able to recognize the shape of an eye in an image whether the eye is in the center of the image or in the top-right corner of the image. This need for translation invariance has been a primary design principle of CNNs for image processing, as Yann LeCun stated in 1989:
    It seems useful to have a set of feature detectors that can detect a particular instance of a feature anywhere on the input plane. Since the precise location of a feature is not relevant to the classification, we can afford to lose some position information in the process. (LeCun 1989, p. 14)

    CNNs achieve this translation invariance of local visual feature detection by using weight sharing between neurons. In an image recognition setting, the function implemented by a neuron can be understood as a visual feature detector. For example, neurons in the first hidden layer of the network will receive a set of pixel values as input and output a high activation if a particular pattern (local visual feature) is present in this set of pixels. The fact that the function implemented by a neuron is defined by the weights the neuron uses means that if two neurons use the same set of weights then they both implement the same function (feature detector). In chapter 4, we introduced the concept of a receptive field to describe the area that a neuron receives its input from. If two neurons share the same weights but have different receptive fields (i.e., each neuron inspects different areas of the input), then together the neurons act as a feature detector that activates if the feature occurs in either of the receptive fields. Consequently, it is possible to design a network with translation invariant feature detection by creating a set of neurons that share the same weights and that are organized so that: (1) each neuron inspects a different portion of the image; and (2) together the receptive fields of the neurons cover the entire image.

    The scenario of searching an image in a dark room with a flashlight that has a narrow beam is sometimes used to explain how a CNN searches an image for local features. At each moment you can point the flashlight at a region of the image and inspect that local region. In this flashlight metaphor, the area of the image illuminated by the flashlight at any moment is equivalent to the receptive field of a single neuron, and so pointing the flashlight at a location is equivalent to applying the feature detection function to that local region. If, however, you want to be sure you inspect the whole image, then you might decide to be more systematic in how you direct the flashlight. For example, you might begin by pointing the flashlight at the top-left corner of the image and inspecting that region. You then move the flashlight to the right, across the image, inspecting each new location as it becomes visible, until you reach the right side of the image. You then point the flashlight back to the left of the image, but just below where you began, and move across the image again. You repeat this process until you reach the bottom-right corner of the image. The process of sequentially searching across an image and at each location in the search applying the same function to the local (illuminated) region is the essence of convolving a function across an image. Within a CNN, this sequential search across an image is implemented using a set of neurons that share weights and whose union of receptive fields covers the entire image.
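The flashlight search can be sketched directly as code. The naive convolution below (a simplified illustration, not an optimized implementation) applies one shared kernel at every receptive field; the image and kernel values are made up for the example:

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide a shared kernel across the image (the 'flashlight'),
    applying the same weighted sum at every receptive field."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            patch = image[r*stride:r*stride+kh, c*stride:c*stride+kw]
            out[r, c] = np.sum(patch * kernel)   # same weights everywhere
    return out

# A tiny edge-detecting kernel fires wherever its feature occurs,
# regardless of position -- translation invariance via weight sharing.
image = np.zeros((6, 6))
image[:, 3] = 1.0                     # a vertical line at column 3
kernel = np.array([[-1.0, 1.0]])      # detects a left-to-right brightness step
fmap = convolve2d(image, kernel)
assert fmap.shape == (6, 5)
assert np.all(fmap[:, 2] == 1.0)      # strong response along the line's left edge
```

Because every output cell is computed with the identical kernel, detecting the line at column 3 works just as well if the line is drawn anywhere else in the image.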

    Figure 5.1 illustrates the different stages of processing that are often found in a CNN. The matrix on the left of the figure represents the image that is the input to the CNN. The matrix immediately to the right of the input represents a layer of neurons that together search the entire image for the presence of a particular local feature. Each neuron in this layer is connected to a different receptive field (area) in the image, and they all apply the same weight matrix to their inputs.

    The receptive field of the top-left neuron in this layer is marked with the gray square covering the corresponding area in the top-left of the input image. The dotted arrows emerging from each of the locations in this gray area represent the inputs to that neuron. The receptive field of the neighboring neuron is indicated by the square outlined in bold in the input image. Notice that the receptive fields of these two neurons overlap. The amount of overlap of receptive fields is controlled by a hyperparameter called the stride length. In this instance, the stride length is one, meaning that for each position moved in the layer the receptive field of the neuron is translated by the same amount on the input. If the stride length hyperparameter is increased, the amount of overlap between receptive fields is decreased.

    The receptive fields of both of these neurons are matrices of pixel values, and the weights used by these neurons are also matrices. In computer vision, the matrix of weights applied to an input is known as the kernel (or convolution mask); the operation of sequentially passing a kernel across an image and, within each local region, weighting each input and adding the result to its local neighbors, is known as a convolution. Notice that a convolution operation does not include a nonlinear activation function (this is applied at a later stage in processing). The kernel defines the feature detection function that all the neurons in the convolution implement. Convolving a kernel across an image is equivalent to passing a local visual feature detector across the image and recording all the locations in the image where the visual feature was present. The output from this process is a map of all the locations in the image where the relevant visual feature occurred. For this reason, the output of a convolution process is sometimes known as a feature map. As noted above, the convolution operation does not include a nonlinear activation function (it only involves a weighted summation of the inputs). Consequently, it is standard to apply a nonlinearity operation to a feature map. Frequently, this is done by applying a rectified linear function to each position in a feature map; the rectified linear activation function is defined as f(x) = max(0, x). Passing a rectified linear activation function over a feature map simply changes all negative values to 0. In figure 5.1, the process of updating a feature map by applying a rectified linear activation function to each of its elements is represented by the layer labeled Nonlinearity.

    The quote from Yann LeCun, at the start of this section, mentions that the precise location of a feature in an image may not be relevant to an image processing task. With this in mind, CNNs often discard location information in favor of generalizing the network’s ability to do image classification. Typically, this is achieved by down-sampling the updated feature map using a pooling layer. In some ways pooling is similar to the convolution operation described above, in so far as pooling involves repeatedly applying the same function across an input space. For pooling, the input space is frequently a feature map whose elements have been updated using a rectified linear function. Furthermore, each pooling operation has a receptive field on the input space—although, for pooling, the receptive fields sometimes do not overlap. There are a number of different pooling functions used; the most common is called max pooling, which returns the maximum value of any of its inputs. Calculating the average value of the inputs is also used as a pooling function.
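A minimal max-pooling sketch, using non-overlapping 2×2 windows (the helper and the feature-map values are our own, for illustration):

```python
import numpy as np

def max_pool(fmap, size=2):
    """Down-sample a feature map with non-overlapping size x size windows,
    keeping only the strongest response in each region."""
    h, w = fmap.shape
    out = np.zeros((h // size, w // size))
    for r in range(0, h - size + 1, size):
        for c in range(0, w - size + 1, size):
            out[r // size, c // size] = fmap[r:r+size, c:c+size].max()
    return out

# A 4x4 feature map shrinks to 2x2; exact positions within each window
# are discarded, but the strength of each detected feature survives.
fmap = np.array([[1.0, 2.0, 0.0, 0.0],
                 [3.0, 4.0, 0.0, 1.0],
                 [0.0, 0.0, 5.0, 6.0],
                 [0.0, 0.0, 7.0, 8.0]])
pooled = max_pool(fmap)
assert pooled.shape == (2, 2)
assert pooled[0, 0] == 4.0 and pooled[1, 1] == 8.0
```

This is exactly the trade LeCun describes: some position information is lost inside each window, in exchange for a smaller representation that generalizes better.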


    The operation sequence of applying a convolution, followed by a nonlinearity, to the feature map, and then down-sampling using pooling, is relatively standard across most CNNs. Often these three operations are together considered to define a convolutional layer in a network, and this is how they are presented in figure 5.1.

    The fact that a convolution searches an entire image means that if the visual feature (pixel pattern) that the function (defined by the shared kernel) detects occurs anywhere in the image, its presence will be recorded in the feature map (and if pooling is used, also in the subsequent output from the pooling layer). In this way, a CNN supports translation invariant visual feature detection. However, this has the limitation that the convolution can only identify a single type of feature. CNNs generalize beyond one feature by training multiple filters in parallel, with each filter learning a single kernel matrix (feature detection function). Note that the convolution layer in figure 5.1 illustrates a single filter. The outputs of multiple filters can be integrated in a variety of ways. One way to integrate information from different filters is to take the feature maps generated by the separate filters and combine them into a single multifilter feature map. A subsequent convolutional layer then takes this multifilter feature map as input. Another way to integrate information from different filters is to use a densely connected layer of neurons. The final layer in figure 5.1 illustrates a dense layer. This dense layer operates in exactly the same way as a standard layer in a fully connected feedforward network. Each neuron in the dense layer is connected to all of the elements output by each of the filters, and each neuron learns a set of weights unique to itself that it applies to the inputs. This means that each neuron in a dense layer can learn a different way to integrate information from across the different filters.
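The stacking of filter outputs into a multifilter feature map, and a dense neuron's view of it, can be sketched as follows (all sizes and values here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Each of 3 filters produces its own 5x5 feature map; stacking them
# gives a multifilter feature map for later layers to consume.
feature_maps = [rng.standard_normal((5, 5)) for _ in range(3)]
multi = np.stack(feature_maps, axis=-1)   # shape (5, 5, 3)

# A dense neuron, by contrast, sees ALL elements of every filter's
# output, and applies its own private weights to them.
flat = multi.reshape(-1)                  # 75 inputs
w = rng.standard_normal(flat.size)
dense_activation = np.maximum(0.0, flat @ w)
assert multi.shape == (5, 5, 3)
```

The contrast is the point: a subsequent convolution keeps the spatial grid of the multifilter map, while a dense neuron discards it and learns an arbitrary combination across all filters and positions.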

    Figure 5.1 Illustrations of the different stages of processing in a convolutional layer. Note in this figure the Image and Feature Map are data structures; the other stages represent operations on data.

    The AlexNet CNN, which won the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) in 2012, had five convolutional layers, followed by three dense layers. The first convolutional layer had ninety-six different kernels (or filters) and included a ReLU nonlinearity and pooling. The second convolution layer had 256 kernels and also included ReLU nonlinearity and pooling. The third, fourth, and fifth convolutional layers did not include a nonlinearity step or pooling, and had 384, 384, and 256 kernels, respectively. Following the fifth convolutional layer, the network had three dense layers with 4096 neurons each. In total, AlexNet had sixty million weights and 650,000 neurons. Although sixty million weights is a large number, the weight sharing between neurons meant that the network required far fewer weights than a fully connected network of comparable size would have. This reduction in the number of required weights is one of the advantages of CNN networks. In 2015, Microsoft Research developed a CNN network called ResNet, which won the ILSVRC 2015 challenge (He et al. 2016). The ResNet architecture extended the standard CNN architecture using skip-connections. A skip-connection takes the output from one layer in the network and feeds it directly into a layer that may be much deeper in the network. Using skip-connections it is possible to train very deep networks. In fact, the ResNet model developed by Microsoft Research had a depth of 152 layers.

    Recurrent Neural Networks

    Recurrent neural networks (RNNs) are tailored to the processing of sequential data. An RNN processes a sequence of data by processing each element in the sequence one at a time. An RNN network only has a single hidden layer, but it also has a memory buffer that stores the output of this hidden layer for one input and feeds it back into the hidden layer along with the next input from the sequence. This recurrent flow of information means that the network processes each input within the context generated by processing the previous input, which in turn was processed in the context of the input preceding it. In this way, the information that flows through the recurrent loop encodes contextual information from (potentially) all of the preceding inputs in the sequence. This allows the network to maintain a memory of what it has seen previously in the sequence to help it decide what to do with the current input. The depth of an RNN arises from the fact that the memory vector is propagated forward and evolved through each input in the sequence; as a result an RNN network is considered as deep as a sequence is long.


    Figure 5.2 illustrates the architecture of an RNN and shows how information flows through the network as it processes a sequence. At each time step, the network in this figure receives a vector containing two elements as input. The schematic on the left of figure 5.2 (time step=1.0) shows the flow of information in the network when it receives the first input in the sequence. This input vector is fed forward into the three neurons in the hidden layer of the network. At the same time these neurons also receive whatever information is stored in the memory buffer. Because this is the initial input, the memory buffer will only contain default initialization values. Each of the neurons in the hidden layer will process the input and generate an activation. The schematic in the middle of figure 5.2 (time step=1.5) shows how this activation flows on through the network: the activation of each neuron is passed to the output layer where it is processed to generate the output of the network, and it is also stored in the memory buffer (overwriting whatever information was stored there). The elements of the memory buffer simply store the information written to them; they do not transform it in any way. As a result, there are no weights on the edges going from the hidden units to the buffer. There are, however, weights on all the other edges in the network, including those from the memory buffer units to the neurons in the hidden layer. At time step 2, the network receives the next input from the sequence, and this is passed to the hidden layer neurons along with the information stored in the buffer. This time the buffer contains the activations that were generated by the hidden neurons in response to the first input.
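The information flow in figure 5.2 can be sketched as a short loop. The sizes (two inputs, three hidden neurons, one output) follow the figure, while the weight values and the tanh activation are arbitrary choices for the illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

W_xh = rng.standard_normal((2, 3)) * 0.5   # input -> hidden weights
W_hh = rng.standard_normal((3, 3)) * 0.5   # memory buffer -> hidden weights
W_hy = rng.standard_normal((3, 1)) * 0.5   # hidden -> output weights

def rnn_forward(sequence):
    h = np.zeros(3)                 # memory buffer starts at default values
    outputs = []
    for x in sequence:
        # The hidden layer sees the current input AND the buffered
        # activations from the previous time step.
        h = np.tanh(x @ W_xh + h @ W_hh)
        outputs.append(h @ W_hy)    # output for this time step;
    return outputs                  # h overwrites the buffer each step

seq = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
ys = rnn_forward(seq)
assert len(ys) == 3
```

Note that there are no weights on the path from the hidden layer into the buffer (`h` is simply stored), but there are weights (`W_hh`) on the path from the buffer back into the hidden layer, matching the description of figure 5.2.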

    Figure 5.2 The flow of information in an RNN as it processes a sequence of inputs. The arrows in bold are the active paths of information flow at each time point; the dashed arrows show connections that are not active at that time.

    Figure 5.3 shows an RNN that has been unrolled through time as it processes a sequence of inputs. Each box in this figure represents a layer of neurons. The leftmost box represents the state of the memory buffer when the network is initialized; the boxes in the middle row represent the hidden layer of the network at each time step; and the boxes in the top row represent the output layer of the network at each time step. Each of the arrows in the figure represents a set of connections between one layer and another layer. For example, the vertical arrow from the input x1 to the hidden layer above it represents the connections between the input layer and the hidden layer at time step 1. Similarly, the horizontal arrows connecting the hidden layers represent the storing of the activations from a hidden state at one time step in the memory buffer (not shown) and the propagation of these activations to the hidden layer at the next time step through the connections from the memory buffer to the hidden state. At each time step, an input from the sequence is presented to the network and is fed forward to the hidden layer. The hidden layer generates a vector of activations that is passed to the output layer and is also propagated forward to the next time step along the horizontal arrows connecting the hidden states.

    Figure 5.3 An RNN network unrolled through time as it processes a sequence of inputs [x1, x2, …, xt]

    Although RNNs can process a sequence of inputs, they struggle with the problem of vanishing gradients. This is because training an RNN to process a sequence of inputs requires the error to be backpropagated through the entire length of the sequence. For example, for the network in figure 5.3, the error calculated on the output at the final time step must be backpropagated through the entire network so that it can be used to update the weights on the connections at the first time step. This entails backpropagating the error through all the hidden layers, which in turn involves repeatedly multiplying the error by the weights on the connections feeding activations from one hidden layer forward to the next hidden layer. A particular problem with this process is that it is the same set of weights that are used on all the connections between the hidden layers: each horizontal arrow represents the same set of connections between the memory buffer and the hidden layer, and the weights on these connections are stationary through time (i.e., they don’t change from one time step to the next during the processing of a given sequence of inputs). Consequently, backpropagating an error through k time steps involves (among other multiplications) multiplying the error gradient by the same set of weights k times. This is equivalent to multiplying each error gradient by a weight raised to the power of k. If this weight is less than 1, then when it is raised to a power, it diminishes at an exponential rate; consequently, the error gradient also tends to diminish at an exponential rate with respect to the length of the sequence, and so it vanishes.
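The exponential decay is easy to demonstrate with a single recurrent weight below 1:

```python
# Backpropagating through k time steps multiplies the gradient by the same
# recurrent weight k times; any weight below 1 shrinks it exponentially.
w = 0.9
gradient = 1.0
for k in range(100):
    gradient *= w

# After 100 steps the gradient is 0.9**100, roughly 2.7e-5:
# far too small to drive meaningful weight updates at early time steps.
assert gradient < 1e-4
```

The mirror-image problem also follows from the same arithmetic: a recurrent weight above 1 raised to the power k makes the gradient explode rather than vanish.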

    Long short-term memory networks (LSTMs) are designed to reduce the effect of vanishing gradients by removing the repeated multiplication by the same weight vector during backpropagation in an RNN. At the core of an LSTM unit is a component called the cell. The cell is where the activation (the short-term memory) is stored and propagated forward. In fact, the cell often maintains a vector of activations. The propagation of the activations within the cell through time is controlled by three components called gates: the forget gate, the input gate, and the output gate. The forget gate is responsible for determining which activations in the cell should be forgotten at each time step, the input gate controls how the activations in the cell should be updated in response to the new input, and the output gate controls what activations should be used to generate the output in response to the current input. Each of the gates consists of layers of standard neurons, with one neuron in the layer per activation in the cell state.

    Figure 5.4 illustrates the internal structure of an LSTM cell. Each of the arrows in this image represents a vector of activations. The cell runs along the top of the figure from left to right. Activations in the cell can take values in the range -1 to +1. Stepping through the processing for a single input, the input vector is first concatenated with the hidden state vector that has been propagated forward from the preceding time step. Working from left to right through the processing of the gates, the forget gate takes the concatenation of the input and the hidden state and passes this vector through a layer of neurons that use a sigmoid (also known as logistic) activation function. As a result of the neurons in the forget layer using sigmoid activation functions, the output of this forget layer is a vector of values in the range 0 to 1. The cell state is then multiplied by this forget vector. The result of this multiplication is that activations in the cell state that are multiplied by components in the forget vector with values near 0 are forgotten, and activations that are multiplied by forget vector components with values near 1 are remembered. In effect, multiplying the cell state by the output of a sigmoid layer acts as a filter on the cell state.

    Next, the input gate decides what information should be added to the cell state. The processing in this step is done by the components in the middle block of figure 5.4, marked Input. This processing is broken down into two subparts. First, the gate decides which elements in the cell state should be updated, and second it decides what information should be included in the update. The decision regarding which elements in the cell state should be updated is implemented using a similar filter mechanism to the forget gate: the concatenated input plus hidden state is passed through a layer of sigmoid units to generate a vector of elements, the same width as the cell, where each element in the vector is in the range 0 to 1; values near 0 indicate that the corresponding cell element will not be updated, and values near 1 indicate that the corresponding cell element will be updated. At the same time that the filter vector is generated, the concatenated input and hidden state are also passed through a layer of tanh units (i.e., neurons that use the tanh activation function). Again, there is one tanh unit for each activation in the LSTM cell. This vector represents the information that may be added to the cell state. Tanh units are used to generate this update vector because tanh units output values in the range -1 to +1, and consequently the value of the activations in the cell elements can be both increased and decreased by an update. Once these two vectors have been generated, the final update vector is calculated by multiplying the vector output from the tanh layer by the filter vector generated from the sigmoid layer. The resulting vector is then added to the cell using vector addition.

    Figure 5.4 Schematic of the internal structure of an LSTM unit: σ represents a layer of neurons with sigmoid activations, T represents a layer of neurons with tanh activations, × represents vector multiplication, and + represents vector addition. The figure is inspired by an image by Christopher Olah available at: http://colah.github.io/posts/2015-08-Understanding-LSTMs/.

    The final stage of processing in an LSTM is to decide which elements of the cell should be output in response to the current input. This processing is done by the components in the block marked Output (on the right of figure 5.4). A candidate output vector is generated by passing the cell through a tanh layer. At the same time, the concatenated input and propagated hidden state vector are passed through a layer of sigmoid units to create another filter vector. The actual output vector is then calculated by multiplying the candidate output vector by this filter vector. The resulting vector is then passed to the output layer, and is also propagated forward to the next time step as the new hidden state.

    The fact that an LSTM unit contains multiple layers of neurons means that an LSTM is a network in itself. However, an RNN can be constructed by treating an LSTM as the hidden layer in the RNN. In this configuration, an LSTM unit receives an input at each time step and generates an output for each input. RNNs that use LSTM units are often known as LSTM networks.
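The gate computations just described can be sketched in a few lines of numpy. The weight-matrix names, zero biases, and toy dimensions below are assumptions for illustration, not any particular library's API; a trained LSTM would have learned values for all of these parameters.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, h_prev, c_prev, W_f, W_i, W_u, W_o, b_f, b_i, b_u, b_o):
    """One LSTM time step, following the gate ordering of figure 5.4."""
    z = np.concatenate([x, h_prev])   # concatenate input and previous hidden state
    f = sigmoid(W_f @ z + b_f)        # forget filter: values in (0, 1)
    i = sigmoid(W_i @ z + b_i)        # input filter: which cell elements to update
    u = np.tanh(W_u @ z + b_u)        # candidate update: values in (-1, +1)
    c = c_prev * f + i * u            # filter the old state, then add the update
    o = sigmoid(W_o @ z + b_o)        # output filter
    h = np.tanh(c) * o                # filtered candidate output = new hidden state
    return h, c

# Toy dimensions: 3-element input vector, 2-element cell.
rng = np.random.default_rng(0)
n_in, n_cell = 3, 2
Ws = [rng.normal(size=(n_cell, n_in + n_cell)) for _ in range(4)]
bs = [np.zeros(n_cell) for _ in range(4)]
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_cell), np.zeros(n_cell), *Ws, *bs)
```

Because the new hidden state is a tanh of the cell state multiplied by a sigmoid filter, its activations always stay strictly inside the range -1 to +1.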

    LSTM networks are ideally suited for natural language processing (NLP). A key challenge in using a neural network to do natural language processing is that the words in language must be converted into vectors of numbers. The word2vec models, created by Tomas Mikolov and colleagues at Google Research, are one of the most popular ways of doing this conversion (Mikolov et al. 2013). The word2vec models are based on the idea that words that appear in similar contexts have similar meanings. The definition of context here is surrounding words. So, for example, the words London and Paris are semantically similar because each of them often co-occurs with words that the other word also co-occurs with, such as: capital, city, Europe, holiday, airport, and so on. The word2vec models are neural networks that implement this idea of semantic similarity by initially assigning random vectors to each word and then using co-occurrences within a corpus to iteratively update these vectors so that semantically similar words end up with similar vectors. These vectors (known as word embeddings) are then used to represent a word when it is being input to a neural network.
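The notion of "similar vectors" is usually made concrete with cosine similarity. The tiny 4-dimensional vectors below are invented for illustration (real word2vec embeddings are learned from a corpus and typically have hundreds of dimensions), but they show the measure at work:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means similar direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hand-made toy embeddings, purely for illustration.
embeddings = {
    "london": np.array([0.9, 0.8, 0.1, 0.0]),
    "paris":  np.array([0.8, 0.9, 0.2, 0.1]),
    "banana": np.array([0.0, 0.1, 0.9, 0.8]),
}

sim_cities = cosine_similarity(embeddings["london"], embeddings["paris"])
sim_mixed = cosine_similarity(embeddings["london"], embeddings["banana"])
```

With these toy vectors, the two city words score much higher with each other than either does with an unrelated word, which is exactly the structure word2vec training induces at scale.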

    One of the areas of NLP where deep learning has had a major impact is machine translation. Figure 5.5 presents a high-level schematic of the seq2seq (or encoder-decoder) architecture for neural machine translation (Sutskever et al. 2014). This architecture is composed of two LSTM networks that have been joined together. The first LSTM network processes the input sentence in a word-by-word fashion. In this example, the source language is French. The words are entered into the system in reverse order, as it has been found that this leads to better translations. The <eos> symbol is a special end-of-sentence symbol. As each word is entered, the encoder updates the hidden state and propagates it forward to the next time step. The hidden state generated by the encoder in response to the <eos> symbol is taken to be a vector representation of the input sentence. This vector is passed as the initial input to the decoder LSTM. The decoder is trained to output the translation sentence word by word, and after each word has been generated, this word is fed back into the system as the input for the next time step. In a way, the decoder is hallucinating the translation because it uses its own output to drive its own generation process. This process continues until the decoder outputs an <eos> symbol.
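The decoder's feed-back loop can be sketched as follows. Here decoder_step is a hard-coded stand-in for a trained decoder LSTM (the output sentence is invented), so only the control flow is meaningful: each generated word becomes the next input, and generation stops at the end-of-sentence symbol.

```python
def decoder_step(token, state):
    # Toy stand-in for a trained decoder: emits a fixed sentence one word at a time.
    translation = ["life", "is", "beautiful", "<eos>"]
    return translation[state], state + 1

def greedy_decode(initial_state, max_len=20):
    """Feed each generated token back in until <eos> (or a length cap) is reached."""
    token, state = "<start>", initial_state
    output = []
    while len(output) < max_len:
        token, state = decoder_step(token, state)
        if token == "<eos>":
            break
        output.append(token)   # this token is fed back in on the next iteration
    return output

words = greedy_decode(0)
```

The max_len cap is a common safeguard: a real decoder might otherwise loop forever if it never emits the end-of-sentence symbol.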

    Figure 5.5 Schematic of the seq2seq (or encoder-decoder) architecture.

    The idea of using a vector of numbers to represent the (interlingual) meaning of a sentence is very powerful, and this concept has been extended to the idea of using vectors for intermodal/multimodal representations. For example, an exciting development in recent years has been automatic image captioning systems. These systems can take an image as input and generate a natural language description of the image. The basic structure of these systems is very similar to the neural machine translation architecture shown in figure 5.5. The main difference is that the encoder LSTM network is replaced by a CNN architecture that processes the input image and generates a vector representation that is then propagated to the decoder LSTM (Xu et al. 2015). This is another example of the power of deep learning arising from its ability to learn complex representations of information. In this instance, the system learns intermodal representations that enable information to flow from what is in an image to language. Combining CNN and RNN architectures is becoming increasingly popular because it offers the potential to integrate the advantages of both systems and enables deep learning architectures to handle very complex data.

    Irrespective of the network architecture we use, we need to find the correct weights for the network if we wish to create an accurate model. The weights of a neuron determine the transformation the neuron applies to its inputs. So, it is the weights of the network that define the fundamental building blocks of the representation the network learns. Today the standard method for finding these weights is an algorithm that came to prominence in the 1980s: backpropagation. The next chapter will present a comprehensive introduction to this algorithm.

    6 Learning Functions

    A neural network model, no matter how deep or complex, implements a function, a mapping from inputs to outputs. The function implemented by a network is determined by the weights the network uses. So, training a network (learning the function the network should implement) on data involves searching for the set of weights that best enable the network to model the patterns in the data. The most commonly used algorithm for learning patterns from data is the gradient descent algorithm. The gradient descent algorithm is very like the perceptron learning rule and the LMS algorithm described in chapter 4: it defines a rule to update the weights used in a function based on the error of the function. By itself the gradient descent algorithm can be used to train a single output neuron. However, it cannot be used to train a deep network with multiple hidden layers. This limitation is because of the credit assignment problem: how should the blame for the overall error of a network be shared out among the different neurons (including the hidden neurons) in the network? Consequently, training a deep neural network involves using both the gradient descent algorithm and the backpropagation algorithm in tandem.

    The process used to train a deep neural network can be characterized as: randomly initializing the weights of a network, and then iteratively updating the weights of the network, in response to the errors the network makes on a dataset, until the network is working as expected. Within this training framework, the backpropagation algorithm solves the credit (or blame) assignment problem, and the gradient descent algorithm defines the learning rule that actually updates the weights in the network.

    This chapter is the most mathematical chapter in the book. However, at a high level, all you need to know about the backpropagation algorithm and the gradient descent algorithm is that they can be used to train deep networks. So, if you don’t have the time to work through the details in this chapter, feel free to skim through it. If, however, you wish to get a deeper understanding of these two algorithms, then I encourage you to engage with the material. These algorithms are at the core of deep learning and understanding how they work is, possibly, the most direct way of understanding its potentials and limitations. I have attempted to present the material in this chapter in an accessible way, so if you are looking for a relatively gentle but still comprehensive introduction to these algorithms, then I believe that this will provide it for you. The chapter begins by explaining the gradient descent algorithm, and then explains how gradient descent can be used in conjunction with the backpropagation algorithm to train a neural network.

    Gradient Descent

    A very simple type of function is a linear mapping from a single input to a single output. Table 6.1 presents a dataset with a single input feature and a single output. Figure 6.1 presents a scatterplot of this data along with a plot of the line that best fits this data. This line can be used as a function to map from an input value to a prediction of the output value. For example, if x = 0.9, then the response returned by this linear function is y = 0.6746. The error (or loss) of using this line as a model for the data is shown by the dashed lines from the line to each datum.

    Table 6.1. A sample dataset with one input feature, x, and an output (target) feature, y

    x       y
    0.72    0.54
    0.45    0.56
    0.23    0.38
    0.76    0.57
    0.14    0.17
    Figure 6.1 Scatterplot of data with “best fit” line and the errors of the line on each example plotted as vertical dashed line segments. The figure also shows the mapping defined by the line for input x=0.9 to output y=0.6746.

    In chapter 2, we described how a linear function can be represented using the equation of a line:

    y = (m × x) + c

    where m is the slope of the line, and c is the y-intercept, which specifies where the line crosses the y-axis. For the line in figure 6.1, m = 0.524 and c = 0.203; this is why the function returns the value y = 0.6746 when x = 0.9, as in the following:

    y = (0.524 × 0.9) + 0.203 = 0.6746

    The slope m and the y-intercept c are the parameters of this model, and these parameters can be varied to fit the model to the data.

    The equation of a line has a close relationship with the weighted sum operation used in a neuron. This becomes apparent if we rewrite the equation of a line with the model parameters rewritten as weights (c → w0, m → w1):

    y = w0 + (w1 × x)

    Different lines (different linear models for the data) can be created by varying either of these weights (or model parameters). Figure 6.2 illustrates how a line changes as the intercept and slope of the line vary: the dashed line illustrates what happens if the y-intercept is increased, and the dotted line shows what happens if the slope is decreased. Changing the y-intercept w0 vertically translates the line, whereas modifying the slope w1 rotates the line around the point where it crosses the y-axis.

    Each of these new lines defines a different function, mapping from x to y, and each function will have a different error with respect to how well it matches the data. Looking at figure 6.2, we can see that the full line, y = 0.203 + (0.524 × x), fits the data better than the other two lines because on average it passes closer to the data points. In other words, on average the error of this line for each data point is less than those of the other two lines. The total error of a model on a dataset can be measured by summing together the error the model makes on each example in the dataset. The standard way to calculate this total error is to use an equation known as the sum of squared errors (SSE):

    SSE = 1/2 × Σ_j (t_j − y_j)²    (summed over the j = 1, …, n examples)

    Figure 6.2 Plot illustrating how a line changes as the intercept (w0) and slope (w1) are varied.

    This equation tells us how to add together the errors of a model on a dataset containing n examples. For each of the n examples in the dataset, the equation calculates the error of the model by subtracting the prediction of the target value returned by the model from the correct target value for that example, as specified in the dataset. In this equation t_j is the correct value of the target feature listed in the dataset for example j, and y_j is the estimate of the target value returned by the model for the same example. Each of these errors is then squared, and these squared errors are then summed. Squaring the errors ensures that they are all positive, and therefore in the summation the errors for examples where the function underestimated the target do not cancel out the errors on examples where it overestimated the target. The multiplication of the summation of the errors by 1/2, although not important for the current discussion, will become useful later. The lower the SSE of a function, the better the function models the data. Consequently, the sum of squared errors can be used as a fitness function to evaluate how well a candidate function (in this situation a model instantiating a line) matches the data.
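As a quick illustration (a sketch, not code from the text), the SSE of the best-fit line from figure 6.1 can be computed directly on the table 6.1 data, and compared with the SSE of a deliberately worse model:

```python
# Dataset from table 6.1.
xs = [0.72, 0.45, 0.23, 0.76, 0.14]
ts = [0.54, 0.56, 0.38, 0.57, 0.17]

def sse(w0, w1, xs, ts):
    """Half the sum of squared differences between targets and model predictions."""
    return 0.5 * sum((t - (w0 + w1 * x)) ** 2 for x, t in zip(xs, ts))

error = sse(0.203, 0.524, xs, ts)        # SSE of the best-fit line (about 0.016)
worse = sse(0.403, 0.524, xs, ts)        # translating the line up increases the SSE
```

Any change to the weights away from the best-fit values raises the SSE, which is the convexity that figure 6.3 visualizes.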

    Figure 6.3 shows how the error of a linear model varies as the parameters of the model change. These plots show the SSE of a linear model on the example single-input–single-output dataset listed in table 6.1. For each parameter there is a single best setting, and as the parameter moves away from this setting (in either direction) the error of the model increases. A consequence of this is that the error profile of the model as each parameter varies is convex (bowl-shaped). This convex shape is particularly apparent in the top and middle plots in figure 6.3, which show that the SSE of the model is minimized when w0 = 0.203 (lowest point of the curve in the top plot) and when w1 = 0.524 (lowest point of the curve in the middle plot).

    Figure 6.3 Plots of the changes in the error (SSE) of a linear model as the parameters of the model change. Top: the SSE profile of a linear model with a fixed slope w1=0.524 when w0 ranges across the interval 0.3 to 1. Middle: the SSE profile of a linear model with a y-intercept fixed at w0=0.203 when w1 ranges across the interval 0 to 1. Bottom: the error surface of the linear model when both w0 and w1 are varied.

    If we plot the error of the model as both parameters are varied, we generate a three-dimensional convex bowl-shaped surface, known as an error surface. The bowl-shaped mesh in the plot at the bottom of figure 6.3 illustrates this error surface. This error surface was created by first defining a weight space. This weight space is represented by the flat grid at the bottom of the plot. Each coordinate in this weight space defines a different line because each coordinate specifies an intercept (a w0 value) and a slope (a w1 value). Consequently, moving across this planar weight space is equivalent to moving between different models. The second step in constructing the error surface is to associate an elevation with each line (i.e., coordinate) in the weight space. The elevation associated with each weight space coordinate is the SSE of the model defined by that coordinate; or, put more directly, the height of the error surface above the weight space plane is the SSE of the corresponding linear model when it is used as a model for the dataset. The weight space coordinates that correspond with the lowest point of the error surface define the linear model that has the lowest SSE on the dataset (i.e., the linear model that best fits the data).

    The shape of the error surface in the bottom plot of figure 6.3 indicates that there is only a single best linear model for this dataset because there is a single point at the bottom of the bowl that has a lower elevation (lower error) than any other point on the surface. Moving away from this best model (by varying the weights of the model) necessarily involves moving to a model with a higher SSE. Such a move is equivalent to moving to a new coordinate in the weight space, which has a higher elevation associated with it on the error surface. A convex or bowl-shaped error surface is incredibly useful for learning a linear function to model a dataset because it means that the learning process can be framed as a search for the lowest point on the error surface. The standard algorithm used to find this lowest point is known as gradient descent.


    The gradient descent algorithm begins by creating an initial model using a randomly selected set of weights. Next the SSE of this randomly initialized model is calculated. Taken together, the guessed set of weights and the SSE of the corresponding model define the initial starting point on the error surface for the search. It is very likely that the randomly initialized model will be a bad model, so it is very likely that the search will begin at a location that has a high elevation on the error surface. This bad start, however, is not a problem, because once the search process is positioned on the error surface, the process can find a better set of weights by simply following the gradient of the error surface downhill until it reaches the bottom of the error surface (the location where moving in any direction results in an increase in SSE). This is why the algorithm is known as gradient descent: the gradient that the algorithm descends is the gradient of the error surface of the model with respect to the data.

    An important point is that the search does not progress from the starting location to the valley floor in one weight update. Instead, it moves toward the bottom of the error surface in an iterative manner, and during each iteration the current set of weights are updated so as to move to a nearby location in the weight space that has a lower SSE. Reaching the bottom of the error surface can take a large number of iterations. An intuitive way of understanding the process is to imagine a hiker who is caught on the side of a hill when a thick fog descends. Their car is parked at the bottom of the valley; however, due to the fog they can only see a few feet in any direction. Assuming that the valley has a nice convex shape to it, they can still find their way to their car, despite the fog, by repeatedly taking small steps that move down the hill following the local gradient at the position they are currently located. A single run of a gradient descent search is illustrated in the bottom plot of figure 6.3. The black curve plotted on the error surface illustrates the path the search followed down the surface, and the black line on the weight space plots the corresponding weight updates that occurred during the journey down the error surface. Technically, the gradient descent algorithm is known as an optimization algorithm because the goal of the algorithm is to find the optimal set of weights.

    The most important component of the gradient descent algorithm is the rule that defines how the weights are updated during each iteration of the algorithm. In order to understand how this rule is defined, it is first necessary to understand that the error surface is made up of multiple error gradients. For our simple example, the error surface is created by combining two error curves. One error curve is defined by the changes in the SSE as w0 changes, shown in the top plot of figure 6.3. The other error curve is defined by the changes in the SSE as w1 changes, shown in the plot in the middle of figure 6.3. Notice that the gradient of each of these curves can vary along the curve; for example, the w1 error curve has a steep gradient on the extreme left and right of the plot, but the gradient becomes somewhat shallower in the middle of the curve. Also, the gradients of two different curves can vary dramatically; in this particular example the w1 error curve generally has a much steeper gradient than the w0 error curve.

    The fact that the error surface is composed of multiple curves, each with a different gradient, is important because the gradient descent algorithm moves down the combined error surface by independently updating each weight so as to move down the error curve associated with that weight. In other words, during a single iteration of the gradient descent algorithm, w0 is updated to move down the w0 error curve and w1 is updated to move down the w1 error curve. Furthermore, the amount each weight is updated in an iteration is proportional to the steepness of the gradient of the weight's error curve, and this gradient will vary from one iteration to the next as the process moves down the error curve. For example, w1 will be updated by relatively large amounts in iterations where the search process is located high up on either side of the w1 error curve, but by smaller amounts in iterations where the search process is nearer to the bottom of the w1 error curve.

    The error curve associated with each weight is defined by how the SSE changes with respect to the change in the value of the weight. Calculus, and in particular differentiation, is the field of mathematics that deals with rates of change. For example, taking the derivative of a function, y = f(x), calculates the rate of change of y (the output) for each unit change in x (the input). Furthermore, if a function takes multiple inputs [x_1, x_2, …, x_m], then it is possible to calculate the rate of change of the output, y, with respect to changes in each of these inputs, x_i, by taking the partial derivative of the function with respect to each input. The partial derivative of a function with respect to a particular input is calculated by first assuming that all the other inputs are held constant (and so their rate of change is 0 and they disappear from the calculation) and then taking the derivative of what remains. Finally, the rate of change of a function for a given input is also known as the gradient of the function at the location on the curve (defined by the function) that is specified by the input. Consequently, the partial derivative of the SSE with respect to a weight specifies how the output of the SSE changes as that weight changes, and so it specifies the gradient of the error curve of the weight. This is exactly what is needed to define the gradient descent weight update rule: the partial derivative of the SSE with respect to a weight specifies how to calculate the gradient of the weight's error curve, and in turn this gradient specifies how the weight should be updated to reduce the error (the output of the SSE).
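A hand-derived partial derivative can always be sanity-checked numerically: nudge the weight by a tiny amount, with the other weight held constant, and measure how the SSE changes. The sketch below does this on the table 6.1 data; the weight values and step size are arbitrary illustrative choices.

```python
# Dataset from table 6.1.
xs = [0.72, 0.45, 0.23, 0.76, 0.14]
ts = [0.54, 0.56, 0.38, 0.57, 0.17]

def sse(w0, w1):
    return 0.5 * sum((t - (w0 + w1 * x)) ** 2 for x, t in zip(xs, ts))

w0, w1 = 0.4, 0.1   # an arbitrary (non-optimal) model
eps = 1e-6

# Numerical partial derivative of the SSE with respect to w1 (w0 held constant).
numeric = (sse(w0, w1 + eps) - sse(w0, w1 - eps)) / (2 * eps)

# Analytic partial derivative: sum over examples of (t_j - y_j) * (-x_j).
analytic = -sum((t - (w0 + w1 * x)) * x for x, t in zip(xs, ts))
```

The two values agree to many decimal places, which is exactly the check practitioners use to debug gradient code.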

    The partial derivative of a function with respect to a particular variable is the derivative of the function when all the other variables are held constant. As a result, there is a different partial derivative of a function with respect to each variable, because a different set of terms is considered constant in the calculation of each of the partial derivatives. Therefore, there is a different partial derivative of the SSE for each weight, although they all have a similar form. This is why each of the weights is updated independently in the gradient descent algorithm: the weight update rule is dependent on the partial derivative of the SSE for each weight, and because there is a different partial derivative for each weight, there is a separate weight update rule for each weight. Again, although the partial derivative for each weight is distinct, all of these derivatives have the same form, and so the weight update rule for each weight will also have the same form. This simplifies the definition of the gradient descent algorithm. Another simplifying factor is that the SSE is defined relative to a dataset with n examples. The relevance of this is that the only variables in the SSE are the weights; the target outputs t_j and the inputs x_j,i are all specified by the dataset for each example, and so can be considered constants. As a result, when calculating the partial derivative of the SSE with respect to a weight, many of the terms in the equation that do not include the weight can be deleted because they are considered constants.

    The relationship between the output of the SSE and each weight becomes more explicit if the SSE definition is rewritten so that the term y_j, denoting the output predicted by the model, is replaced by the structure of the model generating the prediction. For the model with a single input x and a dummy input x_0 = 1 (whose weight is the intercept), this rewritten version of the SSE is:

    SSE = 1/2 × Σ_j (t_j − ((w0 × x_j,0) + (w1 × x_j,1)))²

    This equation uses a double subscript on the inputs: the first subscript, j, identifies the example (or row in the dataset), and the second subscript specifies the feature (or column in the dataset) of the input. For example, x_j,1 represents feature 1 from example j. This definition of the SSE can be generalized to a model with m inputs:

    SSE = 1/2 × Σ_j (t_j − Σ_i (w_i × x_j,i))²

    Calculating the partial derivative of the SSE with respect to a specific weight involves the application of the chain rule from calculus and a number of standard differentiation rules. The result of this derivation is the following equation (for simplicity of presentation we switch back to the notation y_j to represent the output from the model):

    ∂SSE/∂w_i = Σ_j (t_j − y_j) × (−x_j,i)

    This partial derivative specifies how to calculate the error gradient for weight w_i on the dataset, where x_j,i is the input associated with w_i in each example j of the dataset. This calculation involves multiplying two terms: the error of the output and the rate of change of the output (i.e., the weighted sum) with respect to changes in the weight. One way of understanding this calculation is that if changing the weight changes the output of the weighted sum by a large amount, then the gradient of the error with respect to the weight is large (steep), because changing the weight will result in big changes in the error. However, this gradient is the uphill gradient, and we wish to move the weights so as to move down the error curve. So in the gradient descent weight update rule (shown below) the "−" sign in front of the input x_j,i is dropped. Using t to represent the iteration of the algorithm (an iteration involves a single pass through the n examples in the dataset), the gradient descent weight update rule is defined as:

    w_i^(t+1) = w_i^t + (η × Σ_j (t_j − y_j) × x_j,i)

    There are a number of notable factors about this weight update rule. First, the rule specifies how the weight w_i should be updated after iteration t through the dataset. This update is proportional to the gradient of the error curve for the weight for that iteration (i.e., the summation term, which actually defines the partial derivative of the SSE for that weight). Second, the weight update rule can be used to update the weights for functions with multiple inputs. This means that the gradient descent algorithm can be used to descend error surfaces with more than two weight coordinates. It is not possible to visualize these error surfaces because they will have more than three dimensions, but the basic principle of descending an error surface using the error gradient generalizes to learning functions with multiple inputs. Third, although the weight update rule has a similar structure for each weight, the rule does define a different update for each weight during each iteration because the update is dependent on the inputs in the dataset examples to which the weight is applied. Fourth, the summation in the rule indicates that, in each iteration of the gradient descent algorithm, the current model should be applied to all n of the examples in the dataset. This is one of the reasons why training a deep learning network is such a computationally expensive task. Typically for very large datasets, the dataset is split up into batches of examples sampled from the dataset, and each iteration of training is based on a batch, rather than the entire dataset. Fifth, apart from the modifications necessary to include the summation, this rule is identical to the LMS (also known as the Widrow-Hoff or delta) learning rule introduced in chapter 4, and the rule implements the same logic: if the output of the model is too large, then weights associated with positive inputs should be reduced; if the output is too small, then these weights should be increased. 
Moreover, the purpose and function of the learning rate hyperparameter (η) is the same as in the LMS rule: scale the weight adjustments to ensure that the adjustments aren’t so large that the algorithm misses (or steps over) the best set of weights. Using this weight update rule, the gradient descent algorithm can be summarized as follows:
    1. Construct a model using an initial set of weights.
    2. Repeat until the model performance is good enough.
    a. Apply the current model to the examples in the dataset.
    b. Adjust each weight using the weight update rule.
    3. Return the final model.
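The steps above can be sketched in Python on the dataset from table 6.1. The starting weights, learning rate, and iteration count below are illustrative choices, not values from the text; the point is that the loop converges to the best-fit line of figure 6.1.

```python
# Batch gradient descent using the weight update rule from the text:
# w_i <- w_i + eta * sum_j (t_j - y_j) * x_j,i
xs = [0.72, 0.45, 0.23, 0.76, 0.14]
ts = [0.54, 0.56, 0.38, 0.57, 0.17]

w0, w1 = 0.9, -0.4    # step 1: an arbitrary initial set of weights
eta = 0.1             # learning rate: scales each weight adjustment

for iteration in range(5000):              # step 2: repeat
    grad0 = grad1 = 0.0
    for x, t in zip(xs, ts):               # 2a: apply the model to every example
        y = w0 + w1 * x
        grad0 += (t - y) * 1.0             # dummy input x_j,0 = 1 for the intercept
        grad1 += (t - y) * x
    w0 += eta * grad0                      # 2b: adjust each weight independently
    w1 += eta * grad1

# w0 and w1 converge to roughly 0.203 and 0.524, the best-fit line of figure 6.1.
```

Because the error surface for this linear model is convex, the result does not depend on the starting weights; only the number of iterations needed to reach the minimum changes.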

    One consequence of the independent updating of weights, and the fact that weight updates are proportional to the local gradient on the associated error curve, is that the path the gradient descent algorithm follows to the lowest point on the error surface may not be a straight line. This is because the gradient of each of the component error curves may not be equal at each location on the error surface (the gradient for one of the weights may be steeper than the gradient for the other weight). As a result, one weight may be updated by a larger amount than another weight during a given iteration, and thus the descent to the valley floor may not follow a direct route. Figure 6.4 illustrates this phenomenon with a set of top-down views of a portion of a contour plot of an error surface. This error surface is a valley that is quite long and narrow, with steeper sides and gentler sloping ends; the steepness is reflected by the closeness of the contours. As a result, the search initially moves across the valley before turning toward the center of the valley. The plot on the left illustrates the first iteration of the gradient descent algorithm. The initial starting point is the location where the three arrows in this plot meet. The lengths of the dotted and dashed arrows represent the local gradients of the w0 and w1 error curves, respectively. The dashed arrow is longer than the dotted arrow, reflecting the fact that the local gradient of the w1 error curve is steeper than that of the w0 error curve. In each iteration, each of the weights is updated in proportion to the gradient of its error curve; so in the first iteration, the update for w1 is larger than the update for w0, and therefore the overall movement is greater across the valley than along the valley. The thick black arrow illustrates the overall movement in the underlying weight space resulting from the weight updates in this first iteration. 
Similarly, the middle plot illustrates the error gradients and overall weight update for the next iteration of gradient descent. The plot on the right shows the complete path of descent taken by the search process from initial location to the global minimum (the lowest point on the error surface).

    Figure 6.4 Top-down views of a portion of a contour plot of an error surface, illustrating the gradient descent path across the error surface. Each of the thick arrows illustrates the overall movement of the weight vector for a single iteration of the gradient descent algorithm. The length of dotted and dashed arrows represent the local gradient of the w0 and w1 error curves, respectively, for that iteration. The plot on the right shows the overall path taken to the global minimum of the error surface.

    It is relatively straightforward to map the weight update rule over to training a single neuron. In this mapping, the weight w0 is the bias term for the neuron, and the other weights are associated with the other inputs to the neuron. The derivation of the partial derivative of the SSE is dependent on the structure of the function that generates y_j. The more complex this function is, the more complex the partial derivative becomes. The fact that the function a neuron defines includes both a weighted summation and an activation function means that the partial derivative of the SSE with respect to a weight in a neuron is more complex than the partial derivative given above. The inclusion of the activation function within the neuron results in an extra term in the partial derivative of the SSE. This extra term is the derivative of the activation function with respect to the output from the weighted summation function. The derivative of the activation function is taken with respect to the output of the weighted summation function because this is the input that the activation function receives. The activation function does not receive the weight directly. Instead, changes in the weight only affect the output of the activation function indirectly, through the effect that these weight changes have on the output of the weighted summation. The main reason why the logistic function was such a popular activation function in neural networks for so long was that it has a very straightforward derivative with respect to its inputs. The gradient descent weight update rule for a neuron using the logistic function is as follows:

    w_i^(t+1) = w_i^t + (η × Σ_j (t_j − y_j) × y_j × (1 − y_j) × x_j,i)

    The fact that the weight update rule includes the derivative of the activation function means that the weight update rule will change if the activation function of the neuron is changed. However, this change will simply involve updating the derivative of the activation function; the overall structure of the rule will remain the same.

    This extended weight update rule means that the gradient descent algorithm can be used to train a single neuron. It cannot, however, be used to train neural networks with multiple layers of neurons because the definition of the error gradient for a weight depends on the error of the output of the function, the term (t − a). Although it is possible to calculate the error of the output of a neuron in the output layer of the network by directly comparing the output with the expected output, it is not possible to calculate this error term directly for the neurons in the hidden layers of the network, and as a result it is not possible to calculate the error gradients for each weight. The backpropagation algorithm is a solution to the problem of calculating error gradients for the weights in the hidden layers of the network.

    Training a Neural Network Using Backpropagation

    The term backpropagation has two different meanings. The primary meaning is that it is an algorithm that can be used to calculate, for each neuron in a network, the sensitivity (gradient/rate-of-change) of the error of the network to changes in the weights. Once the error gradient for a weight has been calculated, the weight can then be adjusted to reduce the overall error of the network using a weight update rule similar to the gradient descent weight update rule. In this sense, the backpropagation algorithm is a solution to the credit assignment problem, introduced in chapter 4. The second meaning of backpropagation is that it is a complete algorithm for training a neural network. This second meaning encompasses the first sense, but also includes a learning rule that defines how the error gradients of the weights should be used to update the weights within the network. Consequently, the algorithm described by this second meaning involves a two-step process: solve the credit assignment problem, and then use the error gradients of the weights, calculated during credit assignment, to update the weights in the network. It is useful to distinguish between these two meanings of backpropagation because there are a number of different learning rules that can be used to update the weights, once the credit assignment problem has been resolved. The learning rule that is most commonly used with backpropagation is the gradient descent algorithm introduced earlier. The description of the backpropagation algorithm given here focuses on the first meaning of backpropagation, that of the algorithm being a solution to the credit assignment problem.

    Backpropagation: The Two-Stage Algorithm

    The backpropagation algorithm begins by initializing all the weights of the network using random values. Note that even a randomly initialized network can still generate an output when an input is presented to the network, although it is likely to be an output with a large error. Once the network weights have been initialized, the network can be trained by iteratively updating the weights so as to reduce the error of the network, where the error of the network is calculated in terms of the difference between the output generated by the network in response to an input pattern, and the expected output for that input, as defined in the training dataset. A crucial step in this iterative weight adjustment process involves solving the credit assignment problem, or, in other words, calculating the error gradients for each weight in the network. The backpropagation algorithm solves this problem using a two-stage process. In the first stage, known as the forward pass, an input pattern is presented to the network, and the resulting neuron activations flow forward through the network until an output is generated. Figure 6.5 illustrates the forward pass of the backpropagation algorithm. In this figure, the weighted summation of inputs calculated at each neuron (e.g., z1 represents the weighted summation of inputs calculated for neuron 1) and the outputs (or activations, e.g., a1 represents the activation for neuron 1) of each neuron are shown. The reason for listing the z and a values for each neuron in this figure is to highlight the fact that during the forward pass both of these values, for each neuron, are stored in memory. The reason they are stored in memory is that they are used in the backward pass of the algorithm. The z value for a neuron is used to calculate the update to the weights on the input connections to the neuron. The a value for a neuron is used to calculate the update to the weights on the output connections from the neuron. The specifics of how these values are used in the backward pass will be described below.
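    A forward pass that stores both values can be sketched as follows. This is an illustrative layout (a list of layers, each a list of per-neuron weight vectors with the bias at index 0, logistic activations assumed), not the book's implementation:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward_pass(layers, inputs):
    """Propagate activations through the network.
    layers: list of layers; each layer is a list of weight vectors,
    one per neuron, with weights[0] the bias.
    Returns (zs, activations): the weighted sums and the outputs of
    every neuron, both stored for use in the backward pass."""
    zs, activations = [], []
    a = list(inputs)
    for layer in layers:
        x = [1.0] + a                      # prepend 1 for the bias weight
        z_layer = [sum(w * xi for w, xi in zip(neuron, x)) for neuron in layer]
        a = [logistic(z) for z in z_layer]
        zs.append(z_layer)                 # z: used to update input-connection weights
        activations.append(a)              # a: used to update output-connection weights
    return zs, activations

# a 2-input network: a hidden layer of two neurons and one output neuron
net = [
    [[0.1, 0.2, -0.1], [0.0, -0.3, 0.4]],  # hidden layer (bias, w1, w2)
    [[0.2, 0.5, -0.5]],                     # output layer
]
zs, acts = forward_pass(net, [1.0, 0.0])
```

    The returned zs and acts lists play exactly the roles described above: every z is later plugged into the derivative of the activation function, and every a multiplies the δ of a downstream neuron.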

    The second stage, known as the backward pass, begins by calculating an error gradient for each neuron in the output layer. These error gradients represent the sensitivity of the network error to changes in the weighted summation calculation of the neuron, and they are often denoted by the shorthand notation δ (pronounced delta) with a subscript indicating the neuron. For example, δk is the gradient of the network error with respect to small changes in the weighted summation calculation of the neuron k. It is important to recognize that there are two different error gradients calculated in the backpropagation algorithm:
    1. The first is the δ value for each neuron. The δ for each neuron is the rate of change of the error of the network with respect to changes in the weighted summation calculation of the neuron. There is one δ for each neuron. It is these δ error gradients that the algorithm backpropagates.
    2. The second is the error gradient of the network with respect to changes in the weights of the network. There is one of these error gradients for each weight in the network. These are the error gradients that are used to update the weights in the network. However, it is necessary to first calculate the δ term for each neuron (using backpropagation) in order to calculate the error gradients for the weights.

    Note there is only a single δ per neuron, but there may be many weights associated with that neuron, so the δ term for a neuron may be used in the calculation of multiple weight error gradients.

    Once the δs for the output neurons have been calculated, the δs for the neurons in the last hidden layer are then calculated. This is done by assigning a portion of the δ from each output neuron to each hidden neuron that is directly connected to it. This assignment of blame, from output neuron to hidden neuron, is dependent on the weight of the connection between the neurons, and the activation of the hidden neuron during the forward pass (this is why the activations are recorded in memory during the forward pass). Once the blame assignment, from the output layer, has been completed, the δ for each neuron in the last hidden layer is calculated by summing the portions of the δs assigned to the neuron from all of the output neurons it connects to. The same process of blame assignment and summing is then repeated to propagate the error gradient back from the last layer of hidden neurons to the neurons in the second-to-last layer, and so on, back to the input layer. It is this backward propagation of δs through the network that gives the algorithm its name. At the end of this backward pass there is a δ calculated for each neuron in the network (i.e., the credit assignment problem has been solved) and these δs can then be used to update the weights in the network (using, for example, the gradient descent algorithm introduced earlier). Figure 6.6 illustrates the backward pass of the backpropagation algorithm. In this figure, the δs get smaller and smaller as the backpropagation process gets further from the output layer. This reflects the vanishing gradient problem discussed in chapter 4 that slows down the learning rate of the early layers of the network.
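    The backward pass just described can be sketched in code. The following illustration (not the book's implementation) assumes logistic activations, the network layout from a hypothetical forward pass (layers of per-neuron weight vectors, bias at index 0, with stored zs and activations), and the output-layer error term tk − ak:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def dlogistic(z):
    # derivative of the logistic function with respect to its input
    s = logistic(z)
    return s * (1.0 - s)

def backward_pass(layers, zs, activations, targets):
    """Compute a delta for every neuron, starting at the output layer."""
    deltas = [None] * len(layers)
    # output layer: dE/da is (target - activation), propagated through dlogistic
    deltas[-1] = [(t - a) * dlogistic(z)
                  for t, a, z in zip(targets, activations[-1], zs[-1])]
    # hidden layers: dE/da is a weighted sum of the downstream deltas
    for l in range(len(layers) - 2, -1, -1):
        deltas[l] = []
        for k, z in enumerate(zs[l]):
            # weights[k + 1] because index 0 of each downstream neuron
            # is its bias, which no hidden activation feeds
            dE_da = sum(neuron[k + 1] * d
                        for neuron, d in zip(layers[l + 1], deltas[l + 1]))
            deltas[l].append(dE_da * dlogistic(z))
    return deltas

# illustrative stored values for a 2-2-1 network (numbers are made up)
net = [[[0.1, 0.2, -0.1], [0.0, -0.3, 0.4]], [[0.2, 0.5, -0.5]]]
zs = [[0.3, -0.3], [0.2744]]
acts = [[logistic(0.3), logistic(-0.3)], [logistic(0.2744)]]
deltas = backward_pass(net, zs, acts, targets=[1.0])
```

    Note how each hidden neuron's δ is assembled exactly as described: a portion of each downstream δ, weighted by the connecting weight, summed, and then multiplied by the derivative of the activation function at the stored z value.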

    Figure 6.5 The forward pass of the backpropagation algorithm.

    In summary, the main steps within each iteration of the backpropagation algorithm are as follows:
    1. Present an input to the network and allow the neuron activations to flow forward through the network until an output is generated. Record both the weighted sum and the activation of each neuron.

    Figure 6.6 The backward pass of the backpropagation algorithm.

    2. Calculate a δ (delta) error gradient for each neuron in the output layer.
    3. Backpropagate the δ error gradients to obtain a δ (delta) error gradient for each neuron in the network.
    4. Use the δ error gradients and a weight update algorithm, such as gradient descent, to calculate the error gradients for the weights and use these to update the weights in the network.

    The algorithm continues iterating through these steps until the error of the network has converged to an acceptable level.
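    The four steps can be combined into a single training iteration. The sketch below (illustrative names; logistic activations and the error term tk − ak assumed throughout, as elsewhere in this chapter) trains a small network on the logical-OR function:

```python
import math
import random

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(layers, inputs):
    zs, acts, a = [], [], list(inputs)
    for layer in layers:
        x = [1.0] + a
        z = [sum(w * xi for w, xi in zip(neuron, x)) for neuron in layer]
        a = [logistic(v) for v in z]
        zs.append(z)
        acts.append(a)
    return zs, acts

def train_step(layers, inputs, targets, eta):
    zs, acts = forward(layers, inputs)                       # step 1
    deltas = [None] * len(layers)
    # step 2: deltas for the output layer
    deltas[-1] = [(t - a) * a * (1 - a) for t, a in zip(targets, acts[-1])]
    # step 3: backpropagate deltas through the hidden layers
    for l in range(len(layers) - 2, -1, -1):
        deltas[l] = [sum(n[k + 1] * d for n, d in zip(layers[l + 1], deltas[l + 1]))
                     * acts[l][k] * (1 - acts[l][k])
                     for k in range(len(layers[l]))]
    # step 4: gradient descent update, using the stored activations
    for l, layer in enumerate(layers):
        prev = list(inputs) if l == 0 else acts[l - 1]
        x = [1.0] + prev
        for neuron, d in zip(layer, deltas[l]):
            for i in range(len(neuron)):
                neuron[i] += eta * d * x[i]

random.seed(0)
net = [[[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)],
       [[random.uniform(-0.5, 0.5) for _ in range(3)]]]
data = [((0, 0), [0]), ((0, 1), [1]), ((1, 0), [1]), ((1, 1), [1])]  # logical OR
for _ in range(3000):
    for x, t in data:
        train_step(net, x, t, eta=0.5)
```

    The random initialization at the start mirrors the first step of the algorithm described above; repeated iterations then drive the network's outputs toward the targets.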

    Backpropagation: Backpropagating the δs

    The δ term of a neuron describes the error gradient of the network with respect to changes in the weighted summation of inputs calculated by the neuron. To help make this more concrete, figure 6.7 (top) breaks open the processing stages within a neuron k and uses the term zk to denote the result of the weighted summation within the neuron. The neuron in this figure receives inputs (or activations) from three other neurons, and zk is the weighted sum of these activations. The output of the neuron, ak, is then calculated by passing zk through a nonlinear activation function, such as the logistic function. Using this notation, the δ for a neuron k is the rate of change of the error of the network with respect to small changes in the value of zk. Mathematically, this term is the partial derivative of the network's error with respect to zk:

    δk = ∂E/∂zk (where E denotes the network error)

    No matter where in a network a neuron is located (output layer or hidden layer), the δ for the neuron is calculated as the product of two terms:
    1. the rate of change of the network error in response to changes in the neuron's activation (output): ∂E/∂ak;

    Figure 6.7 Top: the forward propagation of activations through the weighted sum and activation function of a neuron. Middle: The calculation of the δ term for an output neuron (tk is the expected activation for the neuron and ak is the actual activation). Bottom: The calculation of the δ term for a hidden neuron. This figure is loosely inspired by figure 5.2 and figure 5.3 in Reed and Marks II 1999.

    2. the rate of change of the activation of the neuron with respect to changes in the weighted sum of inputs to the neuron: ∂ak/∂zk.

    Figure 6.7 (middle) illustrates how this product is calculated for neurons in the output layer of a network. The first step is to calculate the rate of change of the error of the network with respect to the output of the neuron, the term ∂E/∂ak. Intuitively, the larger the difference between the activation of a neuron, ak, and the expected activation, tk, the faster the error can be changed by changing the activation of the neuron. So the rate of change of the error of the network with respect to changes in the activation of an output neuron k can be calculated by subtracting the neuron's activation (ak) from the expected activation (tk):

    ∂E/∂ak = tk − ak

    This term connects the error of the network to the output of the neuron. The neuron's δ, however, is the rate of change of the error with respect to the input to the activation function (zk), not the output of that function (ak). Consequently, in order to calculate the δ for the neuron, the ∂E/∂ak value must be propagated back through the activation function to connect it to the input to the activation function. This is done by multiplying ∂E/∂ak by the rate of change of the activation function with respect to the input value to the function, zk. In figure 6.7, the rate of change of the activation function with respect to its input is denoted by the term ∂ak/∂zk. This term is calculated by plugging the value zk (stored from the forward pass through the network) into the equation of the derivative of the activation function with respect to zk. For example, the derivative of the logistic function with respect to its input is:

    logistic(z) × (1 − logistic(z))

    Figure 6.8 plots this function and shows that plugging a zk value into this equation will result in a value between 0 and 0.25. For example, figure 6.8 shows that if zk = 0 then ∂ak/∂zk = 0.25. This is why the weighted summation value for each neuron (zk) is stored during the forward pass of the algorithm.
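    The shape of this derivative is easy to confirm numerically. A quick sketch (function names are illustrative):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def dlogistic(z):
    # derivative of the logistic: logistic(z) * (1 - logistic(z))
    s = logistic(z)
    return s * (1.0 - s)

# the derivative peaks at 0.25 when z = 0 and falls toward 0 as |z| grows
print(dlogistic(0.0))   # 0.25
print(dlogistic(5.0))   # ~0.0066
```

    The fact that this derivative is never larger than 0.25 is one source of the vanishing gradient problem: each backpropagation step through a logistic neuron multiplies the gradient by a value no larger than 0.25.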

    The fact that the calculation of a neuron's δ involves a product that includes the derivative of the neuron's activation function makes it necessary to be able to take the derivative of the neuron's activation function. It is not possible to take the derivative of a threshold activation function because there is a discontinuity in the function at the threshold. As a result, the backpropagation algorithm does not work for networks composed of neurons that use threshold activation functions. This is one of the reasons why neural networks moved away from threshold activation functions and started to use the logistic and tanh activation functions. The logistic and tanh functions both have very simple derivatives, and this made them particularly suitable to backpropagation.

    Figure 6.8 Plots of the logistic function and the derivative of the logistic function.

    Figure 6.7 (bottom) illustrates how the δ for a neuron in a hidden layer is calculated. This involves the same product of terms as was used for neurons in the output layer. The difference is that the calculation of the ∂E/∂ak term is more complex for hidden units. For hidden neurons, it is not possible to directly connect the output of the neuron with the error of a network. The output of a hidden neuron only indirectly affects the overall error of the network through the variations that it causes in the downstream neurons that receive the output as input, and the magnitude of these variations is dependent on the weight each of these downstream neurons applies to the output. Furthermore, this indirect effect on the network error is in turn dependent on the sensitivity of the network error to these later neurons, that is, their δ values. Consequently, the sensitivity of the network error to the output of a hidden neuron k can be calculated as a weighted sum of the δ values of the neurons immediately downstream of the neuron:

    ∂E/∂ak = Σi (wk,i × δi)

    where wk,i is the weight a downstream neuron i applies to the activation of neuron k.

    As a result, the error terms (the δ values) for all the downstream neurons to which a neuron's output is passed in the forward pass must be calculated before the δ for neuron k can be calculated. This, however, is not a problem because in the backward pass the algorithm is working backward through the network and will have calculated the δ terms for the downstream neurons before it reaches neuron k.

    For hidden neurons, the other term in the δ product, ∂ak/∂zk, is calculated in the same way as it is calculated for output neurons: the zk value for the neuron (the weighted summation of inputs, stored during the forward pass through the network) is plugged into the derivative of the neuron's activation function with respect to zk.

    Backpropagation: Updating the Weights

    The fundamental principle of the backpropagation algorithm in adjusting the weights in a network is that each weight in a network should be updated in proportion to the sensitivity of the overall error of the network to changes in that weight. The intuition is that if the overall error of the network is not affected by a change in a weight, then the error of the network is independent of that weight, and, therefore, the weight did not contribute to the error. The sensitivity of the network error to a change in an individual weight is measured in terms of the rate of change of the network error in response to changes in that weight.


    The overall error of a network is a function with multiple inputs: both the inputs to the network and all the weights in the network. So, the rate of change of the error of a network in response to changes in a given network weight is calculated by taking the partial derivative of the network error with respect to that weight. In the backpropagation algorithm, the partial derivative of the network error for a given weight is calculated using the chain rule. Using the chain rule, the partial derivative of the network error with respect to a weight wj,k on the connection between a neuron j and a neuron k is calculated as the product of two terms:
    1. the first term describes the rate of change of the weighted sum of inputs in neuron k with respect to changes in the weight wj,k;
    2. and the second term describes the rate of change of the network error in response to changes in the weighted sum of inputs calculated by the neuron k. (This second term is the δ for neuron k.)

    Figure 6.9 shows how the product of these two terms connects a weight to the output error of the network. The figure shows the processing of the last two neurons (j and k) in a network with a single path of activation. Neuron j receives a single input, ai, and the output of neuron j is the sole input to neuron k. The output of neuron k is the output of the network. There are two weights in this portion of the network, wi,j and wj,k.

    The calculations shown in figure 6.9 appear complicated because they contain a number of different components. However, as we will see, by stepping through these calculations, each of the individual elements is actually easy to calculate; it’s just keeping track of all the different elements that poses a difficulty.

    Figure 6.9 An illustration of how the product of derivatives connects weights in the network to the error of the network.

    Focusing on wj,k, this weight is applied to an input of the output neuron of the network. There are two stages of processing between this weight and the network output (and error): the first is the weighted sum calculated in neuron k; the second is the nonlinear function applied to this weighted sum by the activation function of neuron k. Working backward from the output, the δk term is calculated using the calculation shown in the middle figure of figure 6.7: the difference between the target activation for the neuron and the actual activation is calculated and is multiplied by the partial derivative of the neuron's activation function with respect to its input (the weighted sum zk), ∂ak/∂zk. Assuming that the activation function used by neuron k is the logistic function, the term ∂ak/∂zk is calculated by plugging the value zk (stored during the forward pass of the algorithm) into the derivative of the logistic function:

    ∂ak/∂zk = logistic(zk) × (1 − logistic(zk))

    So the calculation of δk under the assumption that neuron k uses a logistic function is:

    δk = (tk − ak) × logistic(zk) × (1 − logistic(zk))

    The δk term connects the error of the network to the input to the activation function (the weighted sum zk). However, we wish to connect the error of the network back to the weight wj,k. This is done by multiplying the δk term by the partial derivative of the weighted summation function with respect to weight wj,k. This partial derivative describes how the output of the weighted sum function zk changes as the weight wj,k changes. The fact that the weighted summation function is a linear function of weights and activations means that, in the partial derivative with respect to a particular weight, all the terms in the function that do not involve that weight vanish (they are constants with respect to it), and the partial derivative simplifies to just the input associated with that weight, in this instance the input aj:

    ∂zk/∂wj,k = aj

    This is why the activations for each neuron in the network are stored in the forward pass. Taken together, these two terms, δk and aj, connect the weight wj,k to the network error by first connecting the weight to zk, and then connecting zk to the activation of the neuron, and thereby to the network error. So, the error gradient of the network with respect to changes in weight wj,k is calculated as:

    ∂E/∂wj,k = δk × aj

    The other weight in the figure 6.9 network, wi,j, is deeper in the network, and, consequently, there are more processing steps between it and the network output (and error). The δj term for neuron j is calculated, through backpropagation (as shown at the bottom of figure 6.7), using the following product of terms:

    δj = (wj,k × δk) × ∂aj/∂zj

    Assuming the activation function used by neuron j is the logistic function, the term ∂aj/∂zj is calculated in a similar way to ∂ak/∂zk: the value zj is plugged into the equation for the derivative of the logistic function. So, written out in long form, the calculation of δj is:

    δj = (wj,k × δk) × logistic(zj) × (1 − logistic(zj))

    However, in order to connect the weight wi,j with the error of the network, the term δj must be multiplied by the partial derivative of the weighted summation function with respect to the weight: ∂zj/∂wi,j. As described above, the partial derivative of a weighted sum function with respect to a weight reduces to the input associated with the weight wi,j (i.e., ai); and the gradient of the network's error with respect to the hidden weight wi,j is calculated by multiplying δj by ai. Consequently, the product of the terms (δj and ai) forms a chain connecting the weight wi,j to the network error. For completeness, the product of terms for ∂E/∂wi,j, assuming logistic activation functions in the neurons, is:

    ∂E/∂wi,j = (wj,k × δk) × logistic(zj) × (1 − logistic(zj)) × ai

    Although this discussion has been framed in the context of a very simple network with only a single path of connections, it generalizes to more complex networks because the calculation of the δ terms for hidden units already considers the multiple connections emanating from a neuron. Once the gradient of the network error with respect to a weight has been calculated (∂E/∂wj,k), the weight can be adjusted so as to reduce the error of the network using the gradient descent weight update rule. Here is the weight update rule, specified using the notation from backpropagation, for the weight on the connection between neuron j and neuron k during iteration t of the algorithm:

    wj,k(t+1) = wj,k(t) + (η × δk × aj)
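    The chain of products for the figure 6.9 network can be checked numerically. The following sketch walks through both weight gradients for the two-neuron chain, using made-up values and omitting bias terms to keep the example minimal (all names and numbers are illustrative):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def dlogistic(z):
    s = logistic(z)
    return s * (1.0 - s)

# a two-neuron chain: input a_i -> neuron j -> neuron k (network output)
a_i, w_ij, w_jk, target, eta = 0.8, 0.4, -0.2, 1.0, 0.5

# forward pass (no bias terms, to keep the example minimal)
z_j = w_ij * a_i
a_j = logistic(z_j)
z_k = w_jk * a_j
a_k = logistic(z_k)

# backward pass
delta_k = (target - a_k) * dlogistic(z_k)       # output neuron
delta_j = (w_jk * delta_k) * dlogistic(z_j)     # hidden neuron, via backprop

# error gradients: each delta times the input its weight is applied to
grad_jk = delta_k * a_j
grad_ij = delta_j * a_i

# gradient descent updates (the error term is target - output,
# so this book's convention adds the gradient)
w_jk_new = w_jk + eta * grad_jk
w_ij_new = w_ij + eta * grad_ij
```

    Because the target is above the network's output, δk is positive and wj,k is pushed upward, while the negative weight wj,k flips the sign of δj, so wi,j moves in the opposite direction.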

    Finally, an important caveat on training neural networks with backpropagation and gradient descent is that the error surface of a neural network is much more complex than that of a linear model. Figure 6.3 illustrated the error surface of a linear model as a smooth convex bowl with a single global minimum (a single best set of weights). However, the error surface of a neural network is more like a mountain range with multiple valleys and peaks. This is because each of the neurons in a network includes a nonlinear function in its mapping of inputs to outputs, and so the function implemented by the network is a nonlinear function. Including a nonlinearity within the neurons of a network increases the expressive power of the network in terms of its ability to learn more complex functions. However, the price paid for this is that the error surface becomes more complex and the gradient descent algorithm is no longer guaranteed to find the set of weights that define the global minimum on the error surface; instead it may get stuck in a local minimum. Fortunately, however, backpropagation and gradient descent can still often find sets of weights that define useful models, although searching for useful models may require running the training process multiple times to explore different parts of the error surface landscape.
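    The effect of a non-convex error surface is easy to demonstrate in one dimension. The sketch below uses an invented toy "error surface" with two valleys (not a real network error) to show that gradient descent settles in whichever valley it starts near, and that restarting from several points recovers the deeper minimum:

```python
def f(w):
    # a toy non-convex "error surface" with two minima
    return w**4 - 3 * w**2 + w

def df(w):
    return 4 * w**3 - 6 * w + 1

def descend(w, eta=0.01, steps=2000):
    # plain gradient descent from a given starting weight
    for _ in range(steps):
        w -= eta * df(w)
    return w

# the minimum reached depends on where descent starts
w_right = descend(2.0)    # settles in the shallower valley (w near 1.1)
w_left = descend(-2.0)    # settles in the deeper, global valley (w near -1.3)

# a simple multi-start strategy: descend from several starts, keep the best
best = min((descend(w0) for w0 in [-2.0, -0.5, 0.5, 2.0]), key=f)
```

    The multi-start loop at the end mirrors the practice described above of running the training process multiple times and keeping the model with the lowest error.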

    7 The Future of Deep Learning

    On March 27, 2019, Yoshua Bengio, Geoffrey Hinton, and Yann LeCun jointly received the ACM A.M. Turing Award. The award recognized the contributions they have made to deep learning becoming the key technology driving the modern artificial intelligence revolution. Often described as the "Nobel Prize for Computing," the ACM A.M. Turing Award carries a $1 million prize. Sometimes working together, and at other times working independently or in collaboration with others, these three researchers have, over a number of decades of work, made numerous contributions to deep learning, ranging from the popularization of backpropagation in the 1980s, to the development of convolutional neural networks, word embeddings, attention mechanisms in networks, and generative adversarial networks (to list just some examples). The announcement of the award noted the astonishing recent breakthroughs that deep learning has led to in computer vision, robotics, speech recognition, and natural language processing, as well as the profound impact that these technologies are having on society, with billions of people now using deep learning based artificial intelligence on a daily basis through smartphone applications. The announcement also highlighted how deep learning has provided scientists with powerful new tools that are resulting in scientific breakthroughs in areas as diverse as medicine and astronomy. The awarding of this prize to these researchers reflects the importance of deep learning to modern science and society. The transformative effects of deep learning on technology are set to increase over the coming decades, with the development and adoption of deep learning continuing to be driven by the virtuous cycle of ever larger datasets, the development of new algorithms, and improved hardware. These trends are not stopping, and how the deep learning community responds to them will drive growth and innovations within the field over the coming years.

    Big Data Driving Algorithmic Innovations

    Chapter 1 introduced the different types of machine learning: supervised, unsupervised, and reinforcement learning. Most of this book has focused on supervised learning, primarily because it is the most popular form of machine learning. However, a difficulty with supervised learning is that it can cost a lot of money and time to annotate the dataset with the necessary target labels. As datasets continue to grow, the data annotation cost is becoming a barrier to the development of new applications. The ImageNet dataset provides a useful example of the scale of the annotation task involved in deep learning projects. This dataset was released in 2010, and is the basis for the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC). This is the challenge that the AlexNet CNN won in 2012 and the ResNet system won in 2015. As was discussed in chapter 4, AlexNet winning the 2012 ILSVRC challenge generated a lot of excitement about deep learning models. However, the AlexNet win would not have been possible without the creation of the ImageNet dataset. This dataset contains more than fourteen million images that have been manually annotated to indicate which objects are present in each image; and more than one million of the images have also been annotated with the bounding boxes of the objects in the image. Annotating data at this scale required a significant research effort and budget, and was achieved using crowdsourcing platforms. It is not feasible to create annotated datasets of this size for every application.


    One response to this annotation challenge has been a growing interest in unsupervised learning. The autoencoder models used in Hinton’s pretraining (see chapter 4) are one neural network approach to unsupervised learning, and in recent years different types of autoencoders have been proposed. Another approach to this problem is to train generative models. Generative models attempt to learn the distribution of the data (or, to model the process that generated the data). Similar to autoencoders, generative models are often used to learn a useful representation of the data prior to training a supervised model. Generative adversarial networks (GANs) are an approach to training generative models that has received a lot of attention in recent years (Goodfellow et al. 2014). A GAN consists of two neural networks, a generative model and a discriminative model, and a sample of real data. The models are trained in an adversarial manner. The task of the discriminative model is to learn to discriminate between real data sampled from the dataset, and fake data that has been synthesized by the generator. The task of the generator is to learn to synthesize fake data that can fool the discriminative model. Generative models trained using a GAN can learn to synthesize fake images that mimic an artistic style (Elgammal et al. 2017), and also to synthesize medical images along with lesion annotations (Frid-Adar et al. 2018). Learning to synthesize medical images, along with the segmentation of the lesions in the synthesized image, opens the possibility of automatically generating massive labeled datasets that can be used for supervised learning. A more worrying application of GANs is the use of these networks to generate deep fakes: a deep fake is a fake video of a person doing something they never did that is created by swapping their face into a video of someone else. 
Deep fakes are very hard to detect, and have been used maliciously on a number of occasions to embarrass public figures, or to spread fake news stories.

    Another solution to the data labeling bottleneck is, rather than training a new model from scratch for each new application, to repurpose models that have been trained on a similar task. Transfer learning is the machine learning challenge of using information (or representations) learned on one task to aid learning on another task. For transfer learning to work, the two tasks should be from related domains. Image processing is an example of a domain where transfer learning is often used to speed up the training of models across different tasks. Transfer learning is appropriate for image processing tasks because low-level visual features, such as edges, are relatively stable and useful across nearly all visual categories. Furthermore, the fact that CNN models learn a hierarchy of visual features, with the early layers in a CNN learning functions that detect these low-level visual features in the input, makes it possible to repurpose the early layers of pretrained CNNs across multiple image processing projects. For example, imagine a scenario where a project requires an image classification model that can identify objects from specialized categories for which there are no samples in general image datasets, such as ImageNet. Rather than training a new CNN model from scratch, it is now relatively standard to first download a state-of-the-art model (such as the Microsoft ResNet model) that has been trained on ImageNet, then replace the later layers of the model with a new set of layers, and finally to train this new hybrid model on a relatively small dataset that has been labeled with the appropriate categories for the project. The later layers of the state-of-the-art (general) model are replaced because these layers contain the functions that combine the low-level features into the task-specific categories the model was originally trained to identify. The fact that the early layers of the model have already been trained to identify the low-level visual features speeds up the training and reduces the amount of data needed to train the new project-specific model.
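    The core mechanic of this recipe, freezing pretrained early layers and training only a new head on top of their outputs, can be illustrated at toy scale without any deep learning framework. In the sketch below, the "pretrained" layer is just a fixed pair of logistic neurons standing in for a feature extractor (all names and values are invented for the illustration):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def features(x, frozen_weights):
    """The frozen 'pretrained' layer: its weights are never updated."""
    out = []
    for neuron in frozen_weights:
        z = neuron[0] + sum(w * xi for w, xi in zip(neuron[1:], x))
        out.append(logistic(z))
    return out

def train_head(data, frozen_weights, eta=0.5, epochs=5000):
    """Train only a new output neuron on top of the frozen features."""
    head = [0.0] * (len(frozen_weights) + 1)   # bias + one weight per feature
    for _ in range(epochs):
        for x, t in data:
            feat = [1.0] + features(x, frozen_weights)
            a = logistic(sum(w * f for w, f in zip(head, feat)))
            d = (t - a) * a * (1 - a)
            for i in range(len(head)):
                head[i] += eta * d * feat[i]   # frozen_weights never change
    return head

# pretend these weights came from pretraining on a related task
frozen = [[-0.5, 1.0, 0.0], [-0.5, 0.0, 1.0]]
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]  # logical OR
head = train_head(data, frozen)
```

    Only the head's weights are updated during training, which is why this strategy needs far less labeled data than training the whole network from scratch.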

    The increased interest in unsupervised learning, generative models, and transfer learning can all be understood as a response to the challenge of annotating increasingly large datasets.

    The Emergence of New Models

    The rate of emergence of new deep learning models is accelerating every year. A recent example is capsule networks (Hinton et al. 2018; Sabour et al. 2017). Capsule networks are designed to address some of the limitations of CNNs. One problem with CNNs, sometimes known as the Picasso problem, is the fact that a CNN ignores the precise spatial relationships between high-level components within an object’s structure. What this means in practice is that a CNN that has been trained to identify faces may learn to identify the shapes of eyes, the nose, and the mouth, but will not learn the required spatial relationships between these parts. Consequently, the network can be fooled by an image that contains these body parts, even if they are not in the correct relative position to each other. This problem arises because of the pooling layers in CNNs that discard positional information.

    At the core of capsule networks is the intuition that the human brain learns to identify object types in a viewpoint invariant manner. Essentially, for each object type there is an object class that has a number of instantiation parameters. The object class encodes information such as the relative relationship of different object parts to each other. The instantiation parameters control how the abstract description of an object type can be mapped to the specific instance of the object that is currently in view (for example, its pose, scale, etc.).

    A capsule is a set of neurons that learns to identify whether a specific type of object or object part is present at a particular location in an image. A capsule outputs an activity vector that represents the instantiation parameters of the object instance, if one is present at the relevant location. Capsules are embedded within convolutional layers. However, capsule networks replace the pooling process, which often defines the interface between convolutional layers, with a process called dynamic routing. The idea behind dynamic routing is that each capsule in one layer in the network learns to predict which capsule in the next layer is the most relevant capsule for it to forward its output vector to.

    At the time of writing, capsule networks achieve state-of-the-art performance on the MNIST handwritten digit recognition dataset that the original CNNs were trained on. However, by today's standards this is a relatively small dataset, and capsule networks have not yet been scaled to larger ones. This is partly because the dynamic routing process slows down the training of capsule networks. However, if capsule networks are successfully scaled, they may introduce an important new form of model that extends the ability of neural networks to analyze images in a manner much closer to the way humans do.

    Another recent model that has garnered a lot of interest is the transformer model (Vaswani et al. 2017). The transformer is an example of a growing trend in deep learning toward models with sophisticated internal attention mechanisms that enable a model to dynamically select subsets of the input to focus on when generating an output. The transformer has achieved state-of-the-art performance on machine translation for some language pairs, and in the future this architecture may replace the encoder-decoder architecture described in chapter 5. The BERT (Bidirectional Encoder Representations from Transformers) model builds on the transformer architecture (Devlin et al. 2018). The BERT development is particularly interesting because at its core is the idea of transfer learning (discussed above in relation to the data annotation bottleneck). The basic approach to creating a natural language processing model with BERT is to pretrain a model for a given language on a large unlabeled dataset (the fact that the dataset is unlabeled means it is relatively cheap to create). This pretrained model can then be used as the basis for models for specific tasks in that language (such as sentiment classification or question answering) by fine-tuning it with supervised learning on a relatively small annotated dataset. The success of BERT has shown this approach to be tractable and effective in developing state-of-the-art natural language processing systems.
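The transformer's central mechanism, scaled dot-product attention, is compact enough to sketch directly. The following is a minimal NumPy rendering of the formula from Vaswani et al. (2017), applied as self-attention over three toy input positions; the random input stands in for learned token representations.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V (Vaswani et al. 2017)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query to each key
    weights = softmax(scores, axis=-1)   # each row is a focus distribution
    return weights @ V, weights

# Self-attention over three toy input positions with 4-d representations.
rng = np.random.default_rng(1)
X = rng.normal(size=(3, 4))
output, weights = scaled_dot_product_attention(X, X, X)
print(output.shape)  # (3, 4): one re-weighted representation per position
```

Each row of `weights` sums to one: it is the distribution describing how much that position attends to every position in the input, which is the "dynamic selection of subsets of the input" described above.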

    New Forms of Hardware

    Today's deep learning is powered by graphics processing units (GPUs): specialized hardware optimized for fast matrix multiplication. The adoption, in the late 2000s, of commodity GPUs to speed up neural network training was a key factor in many of the breakthroughs that built momentum behind deep learning. In the last ten years, hardware manufacturers have recognized the importance of the deep learning market and have developed and released hardware specifically designed for deep learning, with support for deep learning libraries such as TensorFlow and PyTorch. As datasets and networks continue to grow in size, the demand for faster hardware continues. At the same time, however, there is a growing recognition of the energy costs associated with deep learning, and people are beginning to look for hardware solutions with a reduced energy footprint.

    Neuromorphic computing emerged in the late 1980s from the work of Carver Mead. A neuromorphic chip is composed of a very-large-scale integrated (VLSI) circuit connecting potentially millions of low-power units known as spiking neurons. Compared with the artificial neurons used in standard deep learning systems, the design of a spiking neuron is closer to the behavior of biological neurons. In particular, a spiking neuron does not fire in response to the set of input activations propagated to it at a single time point. Instead, a spiking neuron maintains an internal state (or activation potential) that changes through time as it receives activation pulses: the potential increases when new activations are received and decays through time in their absence. The neuron fires when its activation potential surpasses a specific threshold. Because of this temporal decay, a spiking neuron fires only if it receives the requisite number of input activations within a time window (a spiking pattern). One advantage of this temporal processing is that spiking neurons do not fire on every propagation cycle, which reduces the amount of energy the network consumes.
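The time-dependent behavior described above can be illustrated with a toy leaky integrate-and-fire neuron. This is a simplified sketch, not the circuit of any particular neuromorphic chip; the decay rate, weight, and threshold are arbitrary illustrative values.

```python
def simulate(input_spikes, decay=0.9, weight=1.0, threshold=2.5):
    """Leaky integrate-and-fire: returns the time steps at which the neuron fires."""
    potential = 0.0
    fired = []
    for t, spike in enumerate(input_spikes):
        potential = decay * potential + weight * spike  # leak, then integrate
        if potential >= threshold:
            fired.append(t)
            potential = 0.0                             # reset after firing
    return fired

# Three input spikes in quick succession drive the potential over the
# threshold; the same three spikes spread out in time never do, because
# the potential decays between them.
print(simulate([1, 1, 1, 0, 0, 0, 0, 0]))  # [2]  (fires at step 2)
print(simulate([1, 0, 0, 1, 0, 0, 1, 0]))  # []   (never fires)
```

The neuron is silent in every cycle where the threshold is not crossed, which is the source of the energy savings mentioned above.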

    In comparison with traditional CPU design, neuromorphic chips have a number of distinctive characteristics, including:
    1. Basic building blocks: traditional CPUs are built using transistor based logic gates (e.g., AND, OR, NAND gates), whereas neuromorphic chips are built using spiking neurons.
    2. Analog aspect: in a traditional digital computer, information is sent in high-low electrical bursts in sync with a central clock; in a neuromorphic chip, information is sent as patterns of high-low signals that vary through time.
    3. Architecture: the architecture of traditional CPUs is based on the von Neumann architecture, which is intrinsically centralized with all the information passing through the CPU. A neuromorphic chip is designed to allow massive parallelism of information flow between the spiking neurons. Spiking neurons communicate directly with each other rather than via a central information processing hub.
    4. Information representation is distributed through time: the information signals propagated through a neuromorphic chip use a distributed representation, similar to the distributed representations discussed in chapter 4, with the distinction that in a neuromorphic chip these representations are also distributed through time. Distributed representations are more robust to information loss than local representations, and this is a useful property when passing information between hundreds of thousands, or millions, of components, some of which are likely to fail.

    Currently a number of major research projects are focused on neuromorphic computing. For example, in 2013 the European Commission allocated one billion euros in funding to the ten-year Human Brain Project. This project directly employs more than five hundred scientists and involves researchers from more than a hundred research centers across Europe. One of the project's key objectives is the development of neuromorphic computing platforms capable of running a simulation of a complete human brain. A number of commercial neuromorphic chips have also been developed. In 2014, IBM launched the TrueNorth chip, which contains just over a million neurons connected by more than 256 million synapses; the chip uses approximately 1/10,000th the power of a conventional microprocessor. In 2018, Intel Labs announced the Loihi (pronounced low-ee-hee) neuromorphic chip, with 131,072 neurons connected by 130 million synapses. Neuromorphic computing has the potential to revolutionize deep learning; however, it still faces a number of challenges, not least of which is developing the algorithms and software patterns for programming this scale of massively parallel hardware.

    Finally, on a slightly longer time horizon, quantum computing is another stream of hardware research with the potential to revolutionize deep learning. Quantum computing chips already exist; for example, Intel has created a 49-qubit quantum test chip, code-named Tangle Lake. A qubit is the quantum equivalent of a binary digit (bit) in traditional computing. A qubit can store more than one bit of information; however, it is estimated that a system of one million or more qubits will be required before quantum computing becomes useful for commercial purposes. The current estimate for scaling quantum chips to this level is around seven years.

    The Challenge of Interpretability

    Machine learning and deep learning are fundamentally about making data-driven decisions. Although deep learning provides a powerful set of algorithms and techniques for training models that can compete with (and in some cases outperform) humans on a range of decision-making tasks, there are many situations where a decision by itself is not sufficient. Frequently, it is necessary to provide not only a decision but also the reasoning behind it. This is particularly true when the decision affects a person, be it a medical diagnosis or a credit assessment. This concern is reflected in privacy and ethics regulations on the use of personal data and on algorithmic decision-making pertaining to individuals. For example, Recital 71 of the General Data Protection Regulation (GDPR) states that individuals affected by a decision made by an automated decision-making process have the right to an explanation of how the decision was reached.

    Different machine learning models provide different levels of interpretability with regard to how they reach a specific decision. Deep learning models, however, are possibly the least interpretable. At one level of description, a deep learning model is quite simple: it is composed of simple processing units (neurons) connected together into a network. However, the scale of the networks (in terms of the number of neurons and the connections between them), the distributed nature of the representations, and the successive transformations of the input data as information flows deeper into the network make it incredibly difficult to interpret, understand, and therefore explain how the network uses an input to make a decision.

    The legal status of the right to explanation within the GDPR is currently vague, and its specific implications for machine learning and deep learning will need to be worked out in the courts. This example does, however, highlight the societal need for a better understanding of how deep learning models use data. The ability to interpret and understand the inner workings of a deep learning model is also important from a technical perspective. For example, understanding how a model uses data can reveal whether the model has an unwanted bias in how it makes its decisions, and can also reveal the corner cases on which the model will fail. The deep learning and broader artificial intelligence research communities are already responding to this challenge: there are currently a number of projects and conferences focused on topics such as explainable artificial intelligence and human interpretability in machine learning.

    Chris Olah and his colleagues summarize the main techniques currently used to examine the inner workings of deep learning models as feature visualization, attribution, and dimensionality reduction (Olah et al. 2018). One way to understand how a network processes information is to understand which inputs trigger particular behaviors in a network, such as a neuron firing. Understanding the specific inputs that trigger the activation of a neuron tells us what the neuron has learned to detect in the input. The goal of feature visualization is to generate and visualize inputs that cause a specific activity within a network. It turns out that optimization techniques, such as backpropagation, can be used to generate these inputs: the process starts with a randomly generated input, which is then iteratively updated until the target behavior is triggered. Once the required input has been isolated, it can be visualized to provide a better understanding of what the network is detecting when it responds in a particular way. Attribution focuses on explaining the relationships between neurons, for example, how the output of a neuron in one layer contributes to the overall output of the network. This can be done by generating a saliency map (or heat map) for the neurons in a network that captures how much weight the network puts on the output of each neuron when making a particular decision. Finally, much of the activity within a deep learning network is based on the processing of high-dimensional vectors. Visualizing data enables us to use our powerful visual cortex to interpret the data and the relationships within it; however, it is very difficult to visualize data with a dimensionality greater than three. Consequently, visualization techniques that systematically reduce the dimensionality of high-dimensional data and visualize the results are incredibly useful tools for interpreting the flow of information within a deep network. t-SNE is a well-known technique that visualizes high-dimensional data by projecting each datapoint into a two- or three-dimensional map (van der Maaten and Hinton 2008). Research on interpreting deep learning networks is still in its infancy, but in the coming years, for both societal and technical reasons, this research is likely to become a more central concern of the broader deep learning community.
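As a small illustration of dimensionality reduction for inspecting network internals, the sketch below projects hypothetical high-dimensional "activation vectors" down to two dimensions for plotting. To stay dependency-light it uses a linear PCA projection rather than t-SNE (in practice one would reach for an implementation such as scikit-learn's `sklearn.manifold.TSNE`); the clustered data is synthetic, standing in for real activations.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for hidden-layer activations: two clusters of
# 50-dimensional vectors, e.g. activations for inputs of two classes.
cluster_a = rng.normal(loc=0.0, size=(20, 50))
cluster_b = rng.normal(loc=3.0, size=(20, 50))
H = np.vstack([cluster_a, cluster_b])

# Project to 2-d via PCA (the top two right singular vectors); t-SNE
# would instead preserve local neighborhood structure non-linearly.
H_centered = H - H.mean(axis=0)
_, _, Vt = np.linalg.svd(H_centered, full_matrices=False)
coords = H_centered @ Vt[:2].T          # one (x, y) point per activation vector

print(coords.shape)  # (40, 2): ready to scatter-plot
# The two clusters separate along the first projected axis:
print(coords[:20, 0].mean() * coords[20:, 0].mean() < 0)  # True
```

A scatter plot of `coords` would show two well-separated blobs, letting the eye do the cluster analysis that is impossible in fifty dimensions.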

    Final Thoughts

    Deep learning is ideally suited to applications involving large datasets of high-dimensional data. Consequently, deep learning is likely to make a significant contribution to some of the major scientific challenges of our age. In the last two decades, breakthroughs in biological sequencing technology have made it possible to generate high-precision DNA sequences. This genetic data has the potential to be the foundation of the next generation of personalized precision medicine. At the same time, international research projects, such as the Large Hadron Collider and Earth-orbiting telescopes, generate huge amounts of data on a daily basis. Analyzing this data can help us understand the physics of our universe at the smallest and the largest scales. In response to this flood of data, scientists are turning in ever increasing numbers to machine learning and deep learning to analyze it.

    At a more mundane level, however, deep learning already directly affects our lives. It is likely that for the last few years you have unknowingly been using deep learning models on a daily basis: a deep learning model is probably invoked every time you use an internet search engine, a machine translation system, a face recognition system on your camera or a social media website, or a speech interface to a smart device. What is potentially more worrying is that the trail of data and metadata you leave as you move through the online world is also being processed and analyzed using deep learning models. This is why it is so important to understand what deep learning is, how it works, what it is capable of, and what its current limitations are.

  • Wan Li: Rural Reform Began with Opposing "Learn from Dazhai"

    This is an excerpt from the author's conversation on October 10, 1997, with the head of the CCP Central Party History Research Office and with reporters.

    Think back to before the reforms: everything was lacking, available only by certificate or coupon — grain coupons, cloth coupons, this coupon and that coupon; even a bar of soap required one. As for fruit — bananas, oranges — you never even saw them. Everything was scarce; people called this a shortage economy. Now it has completely changed: shortage has turned into abundance, even saturation. No coupons are needed anymore except one, the renminbi; with renminbi you can buy anything. Measured by total output, many of our agricultural products rank among the world's leaders, even first in the world, but measured per capita we fall to the back of the list. That is the advantage of a large country, and also its difficulty. Ensuring that such an enormous household has enough to eat, and gradually eats a little better, is no easy matter. Contracting production to households raised the peasants' enthusiasm and made agricultural products plentiful, which was a fundamental factor in securing price stability, and in turn social and political stability. Therefore, the move from the people's communes to household contracting was not a small change but a great one: a change of system, a change of era.

    We had been "leftist" for so many years that the peasants' enthusiasm had been almost entirely beaten down. To turn that around — to implement household contracting and raise their enthusiasm again, higher than before — could of course not be easy; it required a historical process. I believe that this historical process was a process of struggle against "leftist" errors, and that correcting "leftist" errors should be considered the main thread.

    Dazhai was originally a good model; its spirit of self-reliance and hard struggle in particular deserved to be studied and carried forward. But during the "Cultural Revolution," when Chairman Mao called on the whole country to learn from Dazhai and hold up this red banner, things went to the opposite extreme. China is so large, and conditions in the countryside vary enormously; to study only one model, to recite only the single "scripture" of Dazhai, was itself unscientific and contrary to seeking truth from facts. Moreover, learning from Dazhai at that time did not mean learning how it organized agricultural production or built up its mountain land; it meant learning how to keep the string of class struggle taut, how to "spur great effort through great criticism." Dazhai itself became inflated, believing it was correct in everything, and pushed "leftist" error to its very peak, becoming a tool for the Gang of Four's ultra-"left" line.

    Why do I see it this way? Not because I held any prejudice against Dazhai, but because of what I gradually came to understand from rural realities after I went to work in Anhui.

    In June 1977, the Party Center sent me to Anhui as first secretary. I was not familiar with rural work, so as soon as I took office I went down to look at agriculture and at the peasants, spending three or four months covering most of the province. Although a cadre like me, who had long worked in cities, cannot be said to have heard nothing of rural poverty, actual contact with the countryside was still a great shock. The peasants' standard of living was so low: not enough to eat, not enough to wear, houses that hardly deserved the name. In some poor villages of Huaibei and eastern Anhui, the doors and windows were made of mud brick, and even the tables and stools were mud brick; not a single piece of wooden furniture could be found — truly nothing but bare walls. I had never imagined that decades after liberation much of the countryside would still be this poor. I could not help asking myself: What is the cause of this? Can this count as socialism? What exactly is wrong with the people's communes? Of course, the people's commune was written into the constitution, and I could not speak rashly; but in my heart I had already concluded that, proceeding from Anhui's actual conditions, the most important thing was how to arouse the peasants' enthusiasm — otherwise, if people could not even fill their stomachs, nothing else could even be discussed.

    In the year I arrived in Anhui, of the province's more than 280,000 production teams, only 10 percent could maintain subsistence; 67 percent had an annual per capita income below 60 yuan, and about 25 percent were below 40 yuan. How could I, the first secretary, not be worried? The more I looked, listened, and asked, the heavier my heart grew, and the more certain I became that another way out had to be found. Back in the provincial capital, I repeatedly exchanged views with the newly transferred Gu Zhuoxin and Zhao Shouyi to work out solutions together, and we decided to send Zhou Yueli and others from the agricultural committee to conduct special investigations and draft countermeasures. They soon produced a document, "Regulations on Several Questions of Current Rural Economic Policy" (the "Provincial Committee Six Articles"), which the standing committee discussed and approved and then sent down for comments and revision. After several rounds up and down, a formal "draft" emerged. The Six Articles stressed that all rural work must center on production. Our resolve at the time was to ignore the false, grandiose, empty shouting from above and, proceeding from Anhui's actual conditions, genuinely solve the many serious problems we faced. This won the warm support of the broad mass of peasants. But the "leftist" influence was indeed deep-rooted; some cadres, their heads full of "taking class struggle as the key link," were startled when the Six Articles were relayed to them. They said anxiously: "How can production be the center? Where has the key link gone? Aren't you afraid of being criticized again for the theory of productive forces?"

    At the beginning of 1978, the Party Center decided to convene a national on-site conference on "popularizing Dazhai-type counties." Agricultural productive forces then consisted mainly of hand tools and the peasants' two hands — and hands are directed by the mind. If the peasants were not convinced, if they had no enthusiasm, how could their hands be diligent? How could production rise? We could not follow the national line, yet we could not say so at the conference — saying so would have been useless. What were we to do? According to the notice, the provincial first secretary was supposed to attend; I found a pretext not to go and sent Secretary Zhao Shouyi in my place. I told him: go, listen and watch, but say nothing. Anhui's peasants do not support the Dazhai approach; we cannot learn from it and cannot afford to, though of course we also cannot openly oppose it. Just don't speak, don't make a sound, and when you come back there is no need to relay the conference. In short, we must be responsible to the people of our province and, within the scope of our authority, do what we ought to do and can do, and continue resolutely implementing the Six Articles. During this period, some comrades in the press got fairly deeply into the field. Xinhua and People's Daily reporters wrote internal references and dispatches for us publicizing the Six Articles, and People's Daily even published a commentary; all of this gave us powerful support. Had we not thrown off the "class struggle as the key link" approach of learning from Dazhai, it would have been impossible to propose and uphold production as the center. This was in fact the first and most important act of setting things right — one could call it the first round of rural reform.

    Reference: How Dazhai's Lies Were Exposed

    (Shanjian Tingyu), October 22, 2024, 16:17, Beijing

    In the summer of 1978, the Chinese Association of Agricultural Science Societies held its national congress in Taiyuan, Shanxi. After the meeting, the delegates were organized to visit Dazhai. Chen Yonggui, then vice premier, received them in person and gave a speech.

    According to delegates present, Chen Yonggui spoke of the importance of agricultural science from his own experience: a few years earlier, for example, Dazhai's maize had contracted some disease, and agricultural technicians told him the infected plants had to be pulled up and burned at once to prevent it from spreading. He did not believe them and refused to pull them, with the result that the entire maize crop died and not a grain was harvested; only then was he convinced — and so on.

    Chen Yonggui's candor left the assembled experts dumbfounded: a vice premier in charge of agriculture could be wholly ignorant of elementary agricultural science, yet the nation's agricultural experts were told to learn from him.

    Interestingly, while Chen spoke, a young man sat at the right corner of the stage prompting him with agricultural statistics and technical terms; the audience could hear his voice clearly through the loudspeakers.

    After the speech, the delegates were "arranged" into groups for a tour of Dazhai village. The route was fixed and each group had a guide. The delegates saw no Dazhai peasants during the tour, none in the fields, and every household's gate was shut tight, so no one could go in and look around.

    Curiously, nearly every household had a goldfish bowl in the window, with goldfish in it; and every small courtyard had a large vat planted with flowering shrubs, all in bloom.

    The delegates could plainly see this was a show staged for visitors: at the time not even the coastal cities had goldfish in every home and flowers in every yard — and Dazhai's peasants worked such long hours, where would they find such leisure?

    Reaching the highest point of the Dazhai hills they had long yearned to see, the delegates looked around and were greatly disappointed. To create its man-made terraces, Dazhai had felled its woods and planted wheat up to the hilltops, but the wheat was growing poorly: the summer harvest season had passed, yet the seedlings were only six or seven inches tall and could not put out ears. The ears that did emerge were pitifully small, each bearing only a few shriveled grains.

    As for maize, the fields of the production teams near Dazhai were all growing poorly; only the maize within Dazhai proper presented a flourishing scene. This showed that Dazhai's maize was fed from a "special kitchen" — backed by extra state supplies of fertilizer and other materials.

    The delegates talked it over among themselves. Some said that without woods or animal husbandry there could be no diversified economy; others said that if Dazhai's experience had not even spread to the neighboring production teams, what was the point of the whole country learning from Dazhai?

    Yang Xiandong, an agricultural expert and vice minister of agriculture who attended the meeting, also felt deeply that Dazhai lacked science. Back in Beijing he organized a symposium of more than sixty people and resolved to "lift the lid off Dazhai."

    In the spring of 1979, at a group session of the national CPPCC, Yang Xiandong disclosed Dazhai's falsity, saying that "mobilizing the whole country to learn from Dazhai is an enormous waste; it leads agriculture astray and pushes the peasants into a ravine of poverty."

    He also criticized Chen Yonggui directly: "He became vice premier, yet to this day he refuses to admit his serious errors."

    Yang's speech caused an uproar. A CPPCC member from Dazhai made a loud scene, saying Yang was slandering and attacking Dazhai and trying to cut down a red banner personally cultivated and raised by Chairman Mao.

    Nevertheless, Yang Xiandong won the support of the majority.

    In 1981, at a State Council meeting, the question of Dazhai was formally raised, and the lid was finally lifted completely. Dazhai's main problems were fraud and fakery and, during the Cultural Revolution, the persecution of innocent people and the creation of many unjust, false, and wrong cases.

    Dazhai's fakery was first discovered in 1964. That winter, the "Four Cleanups" work team stationed there by higher authorities found that actual grain yields per mu were lower than Chen Yonggui had reported. This amounted to declaring that Dazhai's "advanced" status was a deception, and the shock it caused can be imagined.

    Dazhai had become the national model. By 1978 the highway to Xiyang had been paved with asphalt, and an imposing guesthouse had been built in the county town, with a dining hall that could seat a thousand people at once; there visitors ate not Dazhai maize but delicacies from all over the country.

    How much money and material the center and the province funneled into Dazhai to erect this national agricultural model!

    According to the county gazetteer, from 1967 to 1979 — the thirteen years Chen Yonggui ruled Xiyang — the county completed 9,330 farmland and water-conservancy projects and added or improved 98,000 mu of farmland. In the process, 1,040 Xiyang peasants were injured or killed, 310 of them fatally.

    As for Xiyang's grain output, it grew 1.89-fold, while at the same time output was over-reported by 270 million jin, 26 percent of actual production. The consequences of the inflated figures were naturally borne by Xiyang's peasants: not one jin less was delivered to the state.

    In addition, more than two thousand people in Xiyang were struggled against, denounced, and labeled — one percent of the county's population. More than three thousand were formally dealt with through case proceedings, one out of every seventy people.

    After the new county party secretary Liu Shugang took office, a great rehabilitation began in Xiyang. In 1979 alone, the county reexamined and redressed more than seventy unjust, false, and wrong cases; punishments were rescinded for many who had been disciplined for trading livestock or grain, taking petty advantages, breaching discipline, having extramarital affairs, or "not learning from Dazhai"; and ordinary people who had been imprisoned for stealing a little grain, cursing a cadre, or saying a few "reactionary words" were released.

    In 1980, the rehabilitation reached its peak, continuing into the following year. In all, the county corrected 3,028 unjust, false, and wrong cases and restored the reputations of 2,061 people who had been labeled and struggled against during the learn-from-Dazhai campaign.

    The more than a decade of the nationwide "In Agriculture, Learn from Dazhai" campaign brought Chinese agriculture rigidity, formalism, and fraud. From the mid-1960s to the late 1970s, Dazhai received 9.6 million visitors — yet Mao Zedong never went once, nor did he ever even suggest going to have a look.