Finding anagrams from two MySQL tables

Sylvain

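I am currently trying to implement an algorithm that finds anagrams which look like real names. I have a working solution, but some queries take too long, and I would like to know how to improve it.

I am looking for anagrams made of a forename and a surname, using a database that contains 50k forenames and 50k surnames. The database schema is as follows: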


CREATE TABLE `forename` (
  `id` int(11) NOT NULL,
  `q` varchar(32) COLLATE utf8mb4_unicode_ci NOT NULL,
  `label` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `labels` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `labels_length` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

CREATE TABLE `surname` (
  `id` int(11) NOT NULL,
  `q` varchar(32) COLLATE utf8mb4_unicode_ci NOT NULL,
  `label` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `labels` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `labels_length` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

ALTER TABLE `forename`
  ADD PRIMARY KEY (`id`),
  ADD KEY `idx_length` (`labels_length`);
ALTER TABLE `forename` ADD FULLTEXT KEY `idx_labels` (`labels`);

ALTER TABLE `surname`
  ADD PRIMARY KEY (`id`),
  ADD KEY `idx_length` (`labels_length`),
  ADD KEY `idx_labels` (`labels`);

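In each table, the columns have the following meaning:

label: the forename or surname;
labels: a slugified version of label: all characters in uppercase, sorted alphabetically;
labels_length: the number of characters in labels.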
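
To make the labels and labels_length columns concrete, here is a minimal Python sketch of the slug computation described above. The function name slugify is hypothetical, and stripping accents and dropping non-letters are assumptions; the post only says the characters are uppercased and sorted.

```python
import unicodedata

def slugify(label):
    """Return the A-Z letters of `label`, uppercased and sorted alphabetically."""
    # Assumption: accents are stripped by decomposing characters (NFKD)
    # and keeping only the plain A-Z letters that remain.
    decomposed = unicodedata.normalize("NFKD", label)
    letters = [c for c in decomposed.upper() if "A" <= c <= "Z"]
    return "".join(sorted(letters))

labels = slugify("Ada Lovelace")   # value for the `labels` column
labels_length = len(labels)        # value for the `labels_length` column
```

With this sketch, "Ada Lovelace" gives AAACDEELLOV with a length of 11, which matches the slug used in the query further down.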

I currently query this database with a query generated in PHP. For Ada Lovelace, for example, it looks like this:

select distinct A.label as surname, B.label as forename 
from forename as A, surname as B WHERE (A.labels not like '%B%' and B.labels not like '%B%') AND 
(A.labels not like '%F%' and B.labels not like '%F%') AND 
(A.labels not like '%G%' and B.labels not like '%G%') AND 
(A.labels not like '%H%' and B.labels not like '%H%') AND 
(A.labels not like '%I%' and B.labels not like '%I%') AND 
(A.labels not like '%J%' and B.labels not like '%J%') AND 
(A.labels not like '%K%' and B.labels not like '%K%') AND 
(A.labels not like '%M%' and B.labels not like '%M%') AND 
(A.labels not like '%N%' and B.labels not like '%N%') AND 
(A.labels not like '%P%' and B.labels not like '%P%') AND 
(A.labels not like '%Q%' and B.labels not like '%Q%') AND 
(A.labels not like '%R%' and B.labels not like '%R%') AND 
(A.labels not like '%S%' and B.labels not like '%S%') AND 
(A.labels not like '%T%' and B.labels not like '%T%') AND 
(A.labels not like '%U%' and B.labels not like '%U%') AND 
(A.labels not like '%W%' and B.labels not like '%W%') AND 
(A.labels not like '%X%' and B.labels not like '%X%') AND 
(A.labels not like '%Y%' and B.labels not like '%Y%') AND 
(A.labels not like '%Z%' and B.labels not like '%Z%') AND 
(A.labels like '%A%' or B.labels like '%A%') AND 
(A.labels like '%C%' or B.labels like '%C%') AND 
(A.labels like '%D%' or B.labels like '%D%') AND 
(A.labels like '%E%' or B.labels like '%E%') AND 
(A.labels like '%L%' or B.labels like '%L%') AND 
(A.labels like '%O%' or B.labels like '%O%') AND 
(A.labels like '%V%' or B.labels like '%V%') AND 
(A.labels_length + B.labels_length) = 11

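The reasoning behind this query is that Ada Lovelace slugifies to AAACDEELLOV, so I need forenames and surnames that together contain these letters and no other letter of the alphabet. I added a filter on the number of characters to try to limit the number of rows returned.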
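
For readers who want to reproduce the query generation, the rule above can be sketched as follows. The post does this in PHP and does not show the code, so this Python version and the name build_like_query are assumptions; the SELECT line is copied as-is from the query above.

```python
import string

def build_like_query(slug):
    """Build the LIKE-based candidate query for a slug such as 'AAACDEELLOV'."""
    # Only A-Z is accepted, so nothing unexpected is interpolated into the SQL.
    if not slug or any(c not in string.ascii_uppercase for c in slug):
        raise ValueError("slug must be a non-empty string of letters A-Z")
    present = set(slug)
    conditions = []
    # Letters absent from the slug must appear in neither table's labels.
    for letter in string.ascii_uppercase:
        if letter not in present:
            conditions.append(
                f"(A.labels not like '%{letter}%' and B.labels not like '%{letter}%')"
            )
    # Letters present in the slug must appear in at least one of the two.
    for letter in sorted(present):
        conditions.append(
            f"(A.labels like '%{letter}%' or B.labels like '%{letter}%')"
        )
    # The combined length must equal the length of the slug.
    conditions.append(f"(A.labels_length + B.labels_length) = {len(slug)}")
    return (
        "select distinct A.label as surname, B.label as forename\n"
        "from forename as A, surname as B WHERE "
        + " AND\n".join(conditions)
    )

query = build_like_query("AAACDEELLOV")
```

Note that every condition relies on a pattern with a leading wildcard ('%X%'), which a regular B-tree index on labels cannot serve.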

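This query gives me results that I then have to process in PHP to check that each character is used the right number of times (for Ada Lovelace, for example, a result must contain exactly 3 A's).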
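
The post-processing step is not shown in the post. As a rough illustration, here is how the exact-count check could look in Python rather than PHP. The function names and the sample rows are made up for the example, and the sketch assumes plain unaccented letters.

```python
from collections import Counter

def letters_of(text):
    """Count the A-Z letters of `text`, ignoring case, spaces and punctuation."""
    return Counter(c for c in text.upper() if "A" <= c <= "Z")

def is_exact_anagram(target, forename, surname):
    """True when forename + surname use exactly the letters of `target`."""
    return letters_of(forename + surname) == letters_of(target)

def keep_exact(target, rows):
    """Filter (forename, surname) candidate rows down to real anagrams."""
    return [row for row in rows if is_exact_anagram(target, *row)]

# Hypothetical candidate rows: both pass the letter-set and length filters,
# but only the first one has the right number of each letter.
rows = [("Dave", "Lolaace"), ("Dave", "Colalle")]
exact = keep_exact("Ada Lovelace", rows)
```

Comparing two Counter objects checks every letter count at once, so a candidate with 2 A's and 3 L's is rejected even though it uses the right set of letters and has the right total length.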

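My database currently holds about 50k surnames and 50k forenames. When I search for Ada Lovelace, I get 458 SQL rows in about 0.30 seconds (in case you are wondering, 11 of them are exact anagrams).

If I change the search to Sylvain Lovelace, I get 1774 rows in more than 10 seconds. That is over 30 times slower, and a duration that was acceptable for Ada Lovelace is now out of bounds. I tried removing the filter on the number of characters, and the duration dropped to 8 seconds, which is still too much.

I am fairly sure it should be possible to improve the indexes of my database, or the way my query is built. If anyone has an idea, I would be happy to try it!

If anyone wants to try this on real data, a dump is available in a GitHub repository.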

Sylvain

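A few months after running into this problem, I have found an approach that I consider acceptable. The solution was to change my data model by adding 26 columns to both tables, one per letter of the alphabet, each holding the number of occurrences of that letter and each with its own index.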
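
As an illustration of what goes into those 26 columns, here is a small Python sketch that computes the per-letter counts for one label. The helper name letter_counts is hypothetical, and the post does not say how the columns were actually populated.

```python
import string
from collections import Counter

def letter_counts(label):
    """Map each letter A-Z to its number of occurrences in `label`."""
    counts = Counter(c for c in label.upper() if c in string.ascii_uppercase)
    # Always return all 26 letters so every column gets a value, 0 included.
    return {letter: counts[letter] for letter in string.ascii_uppercase}

row = letter_counts("Schwartz")   # values for columns A..Z of one surname row
```

Each returned value would be stored in the column named after its letter, which is what lets the query compare A.A, B.A and so on directly.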

Based on this data model, I was able to build queries like this one:

select distinct A.label as surname, B.label as forename 
from forename as A, surname as B 
WHERE 
(A.A >= 1 or B.A >= 1) AND 
(A.B = 0 and B.B = 0) AND 
(A.C = 1 xor B.C = 1) AND 
(A.D = 0 and B.D = 0) AND 
(A.E = 0 and B.E = 0) AND 
/* ... same pattern for the letters F to Y ... */
(A.Z = 1 xor B.Z = 1) AND 
(A.labels_length = 4) AND (B.labels_length = 9)

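In this example query, I am searching for anagrams of Aaron Schwartz (letters: AAACHNORRSTWZ) whose surname contains 4 letters. I need results where at least one of the surname and the forename contains an A, because I need 3 of them; neither the forename nor the surname may contain a B, because I do not want any; and because I want exactly one C, either the forename XOR the surname may contain one.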
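
The per-letter rule described above can be sketched as a small generator for the WHERE clause. This is a Python approximation of what the PHP code presumably does, not the author's code: the function name is hypothetical, and treating a letter needed twice the same way as one needed three times is an assumption, since the post only shows the cases 0, 1 and 3.

```python
import string
from collections import Counter

def build_count_conditions(slug, len_a, len_b):
    """Build the count-based WHERE conditions for a slug and a length split."""
    if not slug or any(c not in string.ascii_uppercase for c in slug):
        raise ValueError("slug must be a non-empty string of letters A-Z")
    if len_a + len_b != len(slug):
        raise ValueError("the two lengths must add up to the slug length")
    counts = Counter(slug)
    conditions = []
    for letter in string.ascii_uppercase:
        needed = counts[letter]
        if needed == 0:
            # Letter not wanted: neither side may contain it.
            conditions.append(f"(A.{letter} = 0 and B.{letter} = 0)")
        elif needed == 1:
            # Letter wanted once: exactly one side contains it once.
            conditions.append(f"(A.{letter} = 1 xor B.{letter} = 1)")
        else:
            # Letter wanted several times: at least one side contains it.
            conditions.append(f"(A.{letter} >= 1 or B.{letter} >= 1)")
    conditions.append(f"(A.labels_length = {len_a})")
    conditions.append(f"(B.labels_length = {len_b})")
    return " AND\n".join(conditions)

where = build_count_conditions("AAACHNORRSTWZ", 4, 9)
```

For Aaron Schwartz with a 4 + 9 split, the first conditions produced are the same as in the query above, and the last per-letter condition is the XOR on Z.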

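This query does not give me exact results, but the number of rows returned is small enough that I can process them in PHP afterwards and check whether they are real anagrams.

A proof-of-concept website has been set up at http://apf.geobib.fr/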

Edited on 2021-07-31