Column: 范阳 (Fan Yang)

Being more human, less perfect.

A Synthesizer for Thought | AI Has Only Just Begun to Change Software

Fan Yang · WeChat Official Account · 2024-07-01 20:13


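AI may be able to draw a great deal of inspiration from music-making, just as the personal computer once drew inspiration from the radio, the hi-fi stereo, and the Walkman.

Today's deep learning and neural networks, along with the coming breakthroughs in AI and computing, are transforming, and will keep transforming, "creative" activities such as software development, music production, game design, art, and personal computing devices. Their underlying principles will converge and the boundaries between them will blur, much as they did at the start of the personal computer revolution in the 1970s. The chatbots, text conversations, generative images, and the many co-pilot and agent applications we see today only scratch the surface of a new human-computer interface. Between the physical space people live in and the "latent space" of neural networks, a great deal remains to be explored.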

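This post continues the previous two, and keeps asking why continued progress in AI now calls for user interfaces and product experiences that did not exist before:

AI Is Eating Traditional Software for Breakfast

AI Needs Groundbreaking User Interfaces and Products: Starting with Apple, an In-Depth Interview with Daniel Gross and Nat Friedman (30,000 Chinese characters)

My own comments are in blue.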

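I was recently talking with my friend Samim about what we have each been building and researching. He joined Google's AI team in 2007 and is now exploring advanced model engineering and the possibilities of human-computer interaction. I told him that, as I see it, there is still a very wide gap between the world's smartest people in models, architecture, and systems design (top talent is scarce, though more and more young talent is being trained) and the people who understand products, human needs, and the importance of interaction and interfaces. Put differently, a "fusion" of Steve Jobs and Steve Wozniak is still far too rare.

The full 1995 Steve Jobs "Lost Interview": we worked very hard to bring the spirit of the humanities into computing.

But this is an exciting and difficult challenge, and only "real inventions" and things that move people have a chance of becoming breakthroughs. We must deeply understand and practice fast-changing technology (AI is only one layer of the stack), and we also need an artist's heart and eye. The world needs more technological artists.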

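The essay shared here comes from Linus Lee's personal blog (thesephist.com). He is an outstanding AI engineer and product manager whom I have followed for a long time, and he now leads AI product at Notion. Because Notion is mainly a tool for thinking and collaborative note-taking, he approaches the topic mostly from the angle of "writing tools". Some of the terminology may be hard going, but I really like his metaphors of a "physics of ideas" and a "synthesizer for thought", which apply equally to the future of software, games, and other creative work.

Lately we can see some software products whose interfaces look more and more like a "synthesizer".

I will keep sharing pieces on AI and the transformation of software and interaction in later posts. I hope you find this one inspiring.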

Synthesizer for thought

Author: Linus Lee

Editor: Fan Yang

Written: June 23, 2024

For most of the history of music, humans produced sounds out of natural materials: rubbing strings together, hitting things, blowing air through tubes of various lengths. Until two things happened.

We understood the physics of sound. A sound is a combination of overlapping waves, and a wave is a kind of marvelous mathematical object that we can write theorems and prove things about and deeply and fundamentally understand.

Once we started understanding sound as a mathematical object, our vocabulary for talking about sound expanded in depth and precision. Architects could design the acoustics of concert halls, and musicians could talk about different temperaments of tuning. We also deepened our understanding of how humans perceive sound.

We learned to relate our mathematical models to sounds in the real world. We built devices to record sound and decompose it into its constituent fundamental parts using the mathematical model of waves. We also invented a way to turn that mathematical model of a sound back into real notes we could hear, using electronic devices like oscillators and speakers.

This meant we could imagine new kinds of sounds as mathematical constructs and then conjure them into reality, creating entirely new kinds of sounds we could never have created with natural materials. We could also sample sounds from the real world and modulate their mathematical structure. Not only that: backed by our mathematical model of sound, we could systematically explore the space of possible sounds and filters.

The instrument that results is a synthesizer.
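The decompose, modulate, and resynthesize loop described above can be sketched in a few lines. This is a minimal illustration rather than how any real instrument is built: it assumes a naive discrete Fourier transform as the "mathematical model of waves", and the signal, its two frequencies, and every function name are made up for the example.

```python
import cmath
import math

def dft(samples):
    """Naive discrete Fourier transform: one complex coefficient per frequency bin."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def inverse_dft(coeffs):
    """Turn frequency coefficients back into real-valued samples."""
    n = len(coeffs)
    return [
        (sum(coeffs[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]

N = 64
# A stand-in for a recorded sound: a 3-cycle wave plus a quieter 10-cycle overtone.
signal = [
    math.sin(2 * math.pi * 3 * t / N) + 0.5 * math.sin(2 * math.pi * 10 * t / N)
    for t in range(N)
]

# Decompose: the amplitude of each constituent wave below the Nyquist bin.
spectrum = dft(signal)
amplitudes = [2 * abs(c) / N for c in spectrum[: N // 2]]
dominant = sorted(range(N // 2), key=lambda k: amplitudes[k], reverse=True)[:2]

# Modulate the mathematical structure: mute the overtone (bin 10 and its mirror bin).
edited = list(spectrum)
edited[10] = 0
edited[N - 10] = 0

# Resynthesize: samples that could be sent to a speaker.
resynth = inverse_dft(edited)
```

Here `dominant` recovers the two waves that were mixed into the signal, and `resynth` is the same sound with the overtone removed. A practical tool would use a fast Fourier transform and windowing instead of this quadratic-time version.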

Synthesizers

A synthesizer produces music very differently than an acoustic instrument. It produces music at the lowest level of abstraction, as mathematical models of sound waves. It begins with raw waveforms defined as oscillators, which get transmogrified through a sequence of filters and modulators before reaching our ears. It's a way of producing sound by assembling it from logical components rather than creating it wholesale by hitting or vibrating something natural.
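As a rough sketch of that signal chain, the snippet below wires one oscillator into one filter and one modulator. The sample rate, the choice of a square wave, a one-pole low-pass filter, and a tremolo, and all parameter values are assumptions picked for illustration, not a description of any particular synthesizer.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed for illustration)

def square_oscillator(freq_hz, n_samples):
    """Raw waveform: +1 for the first half of each cycle, -1 for the second."""
    out = []
    for i in range(n_samples):
        phase = (i * freq_hz / SAMPLE_RATE) % 1.0
        out.append(1.0 if phase < 0.5 else -1.0)
    return out

def low_pass(samples, alpha):
    """One-pole low-pass filter: each output moves a fraction alpha toward the input."""
    out, y = [], 0.0
    for x in samples:
        y += alpha * (x - y)
        out.append(y)
    return out

def tremolo(samples, rate_hz, depth):
    """Amplitude modulator: the gain oscillates between 1 - depth and 1."""
    out = []
    for i, x in enumerate(samples):
        wobble = 0.5 * (1.0 + math.sin(2 * math.pi * rate_hz * i / SAMPLE_RATE))
        out.append(x * (1.0 - depth * wobble))
    return out

# Assemble a sound from logical components: oscillator -> filter -> modulator.
raw = square_oscillator(220, 800)
voice = tremolo(low_pass(raw, alpha=0.2), rate_hz=5, depth=0.5)
```

Because each stage is just a function from samples to samples, stages can be reordered, swapped, or driven by any control surface, which is the property the next paragraph builds on.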

Because synthesizers are electronic, unlike traditional instruments, we can attach arbitrary human interfaces to them. This dramatically expands the design space of how humans can interact with music. Synthesizers can be connected to keyboards, sequencers, drum machines, touchscreens for continuous control, displays for visual feedback, and of course, software interfaces for automation and endlessly dynamic user interfaces.

In this way, we freed the production of music from any particular physical form.

Synthesizers enabled entirely new sounds and genres of music, like electronic pop and electronic dance music. These new sounds were easier to discover and share because a new sound did not require designing an entirely new instrument. The synthesizer organizes the space of sound into a tangible human interface, and as we discover new sounds, we can share them with others as numbers and digital files, as the mathematical objects they've always been.

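Optics, the mathematics of light and color, underlies how humans interact with visual media today. Behind every image you see on a screen is a color space such as RGB or CMYK, a mathematical model of how we perceive color. We edit photos on our devices not with chemicals in a darkroom but by passing them through mathematical functions we call filters. This mathematical model of color and light also gives us new vocabulary (saturation, hue, contrast) and new interfaces (color curves, scopes, histograms) for working with visual media.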
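To make "filters are mathematical functions" concrete, here is a small sketch of a saturation filter built on Python's standard `colorsys` module. The two-pixel image and the helper names `adjust_saturation` and `apply_filter` are invented for the example, and real photo editors typically work in more perceptually careful color spaces.

```python
import colorsys

def adjust_saturation(pixel, factor):
    """Scale the saturation of one RGB pixel; channels are floats in [0, 1]."""
    r, g, b = pixel
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    s = min(1.0, s * factor)  # clamp so the result stays a valid color
    return colorsys.hsv_to_rgb(h, s, v)

def apply_filter(image, fn):
    """A filter is a function applied to every pixel of the image."""
    return [[fn(p) for p in row] for row in image]

# A tiny 1x2 "photo": a dusty red and a muted green.
image = [[(0.8, 0.4, 0.4), (0.2, 0.6, 0.3)]]

# Saturation 0 collapses every pixel to a gray of the same brightness.
gray = apply_filter(image, lambda p: adjust_saturation(p, 0.0))
# Saturation 1.5 makes the same pixels more vivid.
vivid = apply_filter(image, lambda p: adjust_saturation(p, 1.5))
```

The word "saturation" names a single number in the model, and a slider in an editing app is just an interface onto that number.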

Only recently have we seen neural networks learn detailed mathematical models of language that seem to make sense to humans. And with a breakthrough in mathematical understanding of a medium come new tools that enable new creative forms and allow us to tackle new problems.

Instruments for thought

In Prism, a post I wrote earlier, I discussed two new interface primitives enabled by interpretable language models:

1. Detailed decomposition of concepts and styles in language. This is analogous to splitting a sound into its constituent fundamental waves. It takes a sentence like "A synthesizer produces music very differently than an acoustic instrument" and decomposes it into a list of "features" like "Technical electronics and signal processing" and "Comparison between entities".

2. Precise steering of high-level semantic edits. I can take the same synthesizer sentence, add some "Discussions about parenthood", and get "A parent often produces music differently from their children".

Fan Yang's note: the language synthesizer built in the Prism post mentioned above:

https://thesephist--prism-start-app.modal.run/f/lg-v6/2686

In other words, we can decompose writing into a mathematical model of its more fundamental, constituent parts, and reconstruct those mathematical models of ideas back into text.
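The two primitives can be pictured with a deliberately tiny toy. This is not the method used in Prism: it assumes a made-up 3-dimensional "meaning space" in which each named feature is its own axis, and the embedding values are invented. It only shows the shape of the operations: project to read features, add a direction to steer, and sum directions to rebuild the vector.

```python
# Hypothetical feature directions in a tiny 3-dimensional "meaning space".
FEATURES = {
    "technical electronics": (1.0, 0.0, 0.0),
    "comparison between entities": (0.0, 1.0, 0.0),
    "parenthood": (0.0, 0.0, 1.0),
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def decompose(embedding):
    """Activation of each named feature: projection onto its direction."""
    return {name: dot(embedding, d) for name, d in FEATURES.items()}

def steer(embedding, feature, amount):
    """Semantic edit: push the embedding along one feature direction."""
    d = FEATURES[feature]
    return tuple(x + amount * y for x, y in zip(embedding, d))

def reconstruct(activations):
    """Rebuild an embedding as a weighted sum of feature directions."""
    dims = len(next(iter(FEATURES.values())))
    return tuple(
        sum(activations[name] * FEATURES[name][i] for name in FEATURES)
        for i in range(dims)
    )

# Made-up embedding standing in for the synthesizer sentence.
sentence = (0.9, 0.7, 0.0)
acts = decompose(sentence)
edited = steer(sentence, "parenthood", 0.8)
```

The last step, turning `edited` back into a sentence, needs a model that decodes embeddings into text and is left out here. In a real model the features are learned rather than hand-named axes, and they are generally far more numerous than in this toy.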

I've spent some time imagining what kinds of wild and interesting interfaces may be possible as this nascent technology matures over time.

Fan Yang's note: this also reminds me of the "storyline" interface in Westworld.

Heatmaps turn documents into terrain maps of concepts

In data visualization, a heatmap lets the user quickly scan for regions of high or low values, which makes it easy to browse and navigate a very large area or dataset. Similarly, in the context of text documents, a heatmap can highlight the distribution of thematic elements, allowing users to quickly identify key topics and their prominence. Heatmaps are particularly useful for analyzing large corpora or very long documents, letting users find areas of interest or relevance at a glance.

For example, a user might begin with a collection of thousands or even millions of books and PDFs, turn on filters for specific features like "mention of geopolitical conflict" and "escalating rhetoric", and then quickly zoom in to the highlighted parts to find relevant sentences and paragraphs. Compared to a conventional search that flattens all the detail into a single sorted list of a few dozen items, a heatmap lets the user see detail without getting lost in it.
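A text heatmap of this kind reduces to scoring each passage against each feature and mapping the score to a shade. The sketch below fakes the features with keyword sets purely so it can run on its own. The feature names come from the example above, while the keyword lists, the shading characters, and the sample sentences are invented. A real system would take its scores from a language model's features instead.

```python
import re

# Stand-in "features": keyword sets. A real system would use model-derived features.
FEATURES = {
    "geopolitical conflict": {"border", "troops", "sanctions", "treaty"},
    "escalating rhetoric": {"threat", "warned", "retaliate", "ultimatum"},
}
SHADES = " .:#"  # low to high intensity

def score(sentence, keywords):
    """Fraction of the sentence's words that belong to the feature."""
    words = re.findall(r"[a-z]+", sentence.lower())
    if not words:
        return 0.0
    return sum(w in keywords for w in words) / len(words)

def heatmap(sentences):
    """One row of shading characters per feature, one column per sentence."""
    rows = {}
    for name, keywords in FEATURES.items():
        scores = [score(s, keywords) for s in sentences]
        peak = max(scores) or 1.0  # avoid dividing by zero when nothing matches
        rows[name] = "".join(
            SHADES[round(x / peak * (len(SHADES) - 1))] for x in scores
        )
    return rows

doc = [
    "The harvest festival drew large crowds this year.",
    "Troops gathered at the border after the treaty collapsed.",
    "The minister warned of an ultimatum and vowed to retaliate.",
]
rows = heatmap(doc)
for name, row in rows.items():
    print(f"{name:>22} |{row}|")
```

Each row reads like a strip of terrain: the second sentence lights up for conflict, the third for rhetoric, and the first stays blank for both.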

Image from Andrej Karpathy's post The Unreasonable Effectiveness of Recurrent Neural Networks.

The Unreasonable Effectiveness of Recurrent Neural Networks :

https://karpathy.github.io/2015/05/21/rnn-effectiveness/

Another way to look at a semantic heatmap is as semantic syntax highlighting. Just as we highlight program source code to help visually distinguish specific parts of a program, heatmaps and highlights could help humans quickly visually navigate complex document structures.

Spectrograms and track views reveal meaningful patterns across time

In the context of audio processing, a spectrogram visualizes the prominence of different frequency waves within a single stream of audio, and how it evolves over time. In other words, it breaks out the individual mathematically pure components of a sound wave and visualizes each as its own signal over time.

Spectrograms let you visualize sound in a way that communicates much more structure than raw waveforms, by producing a kind of thumbnail of an audio track that breaks out different components like bass lines and rising/falling melodic progressions into distinctive visual patterns.
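In its simplest form a spectrogram is just a frequency decomposition repeated over successive slices of time. The sketch below assumes non-overlapping frames and a naive transform to stay short, where real tools use overlapping, windowed frames and a fast Fourier transform. The test tone and frame size are made up.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Strength of each frequency bin (below Nyquist) within one frame."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2)
    ]

def spectrogram(samples, frame_size):
    """Split the signal into non-overlapping frames; one spectrum per frame."""
    return [
        magnitude_spectrum(samples[i : i + frame_size])
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]

FRAME = 32
# A tone that jumps from 2 cycles per frame to 6 cycles per frame halfway through.
low = [math.sin(2 * math.pi * 2 * t / FRAME) for t in range(FRAME * 2)]
high = [math.sin(2 * math.pi * 6 * t / FRAME) for t in range(FRAME * 2)]

frames = spectrogram(low + high, FRAME)
# The loudest bin in each time slice traces the "melody" of the signal.
peaks = [max(range(len(f)), key=f.__getitem__) for f in frames]
```

Reading `peaks` from left to right shows the jump in pitch that is hard to spot in the raw samples, which is the extra structure a spectrogram exposes.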

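If we apply the same idea to the experience of reading long-form writing, it may look like this. Imagine opening a story on your phone and swiping in from the scrollbar edge to reveal a vertical spectrogram, each "frequency" of the spectrogram representing the prominence of a different concept, like sentiment or narrative tension, varying over time. Swiping over a particular feature "column" could expand it to tell you what the feature is, and which part of the text that feature most correlates with.

We could also take inspiration from another kind of interface for music, the track view. Music production software like Logic Pro (shown in the figure below) lets the user assemble a song from many different tracks, each corresponding to an instrument and processed by different filters and modulators.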

Semantic diffs visualize adjacent possibilities

Take a look at this interactive widget from the Red Blob Games article Predator-Prey. As I hover over each control, the interface shows me the range of forms that the subject of my edit could take on as I vary that particular parameter.

https://www.redblobgames.com/dynamics/predator-prey/

I call this a semantic diff view. It helps visualize how some output or subject of an edit changes when the user modulates some input variable. It shows all the "diffs" between various possible points in the possibility space of outputs, anchored on a particular semantic feature.

What would a semantic diff view for text look like? Perhaps when I'm editing text, I could hover over a control for a particular style or concept feature like "Narrative voice" or "Figurative language", and my highlighted passage would fan out like a deck of cards to reveal other "adjacent" sentences I could choose instead. Or, if that is too much to read, each word could simply be highlighted to indicate whether that word would be more or less likely to appear in a sentence that was more "narrative" or more "figurative", a kind of highlight-based indicator for the direction of a semantic edit.

Icons and glyphs for new concepts discovered in neural networks

In Interpreting and Steering Features in Images, the author proposes the idea of a "feature icon", a nearly-blank image modified to strongly express a particular feature drawn from a computer vision model, such as specific lighting, colors, patterns, and subjects. They write:

"In order to quickly compare features, we also found it useful to apply the feature to a standard template to generate a reference photo for human comparison. We call this the feature expression icon, or just the icon for short. We include it as part of the human-interpretable reference to this feature."

Here are some examples of feature expression icons from that piece:

Interpreting and Steering Features in Images, original article:

https://www.lesswrong.com/posts/Quqekpvx8BGMMcaem/interpreting-and-steering-features-in-images

I found this to be the most interesting part of this particular work. Browsing through these icons felt as if we were inventing a new kind of word, or a new notation for visual concepts mediated by neural networks. This could allow us to communicate about abstract concepts and patterns found in the wild that may not correspond to any word in our dictionary today.

In Imagining better interfaces to language models, an earlier post of mine, I pointed out a major challenge in designing latent space-based information interfaces: high dimensionality.

Fan Yang's note: a latent space can be defined as an abstract multi-dimensional space that encodes a meaningful internal representation of externally observed events, such that samples that are similar in the outside world sit "close to each other" in the latent space.

"The primary interface challenge here is one of dimensionality: the 'space of meaning' that large language models construct in training is hundreds and thousands of dimensions large, and humans struggle to navigate spaces more than 3-4 dimensions deep. What visual and sensory tricks can we use to coax our visual-perceptual systems to understand and manipulate objects in higher dimensions?"

One way to solve this problem may involve inventing new notation, whether as literal iconic representations of visual ideas or as some more abstract system of symbols.
