TSL&PRF

2025-03-27 02:30:10 +08:00 · 2019-09-03 10:14:29 +08:00 · 2019-09-03 10:14:29 +08:00 · 91f9ea8205
commit 91f9ea8205
parent e491cfaa70
2 changed files with 115 additions and 108 deletions
--- a/sources/tech/20180704
+++ b/sources/tech/20180704
@@ -1,108 +0,0 @@
-wxy has applied
-BASHing data: Truncated data items
-======
-### Truncated data items
-
-**truncated** (adj.): abbreviated, abridged, curtailed, cut off, clipped, cropped, trimmed...
-
-One way to truncate a data item is to enter it into a database field that has a character limit shorter than the data item. For example, the string
-
->Yarrow Ravine Rattlesnake Habitat Area, 2 mi ENE of Yermo CA
-
-is 60 characters long. If you enter it into a "Locality" field with a 50-character limit, you get
-
->Yarrow Ravine Rattlesnake Habitat Area, 2 mi ENE #Ends with a whitespace
-
-Truncations can also be data entry errors. You meant to enter
-
->Sally Ann Hunter (aka Sally Cleveland)
-
-but you forgot the closing bracket
-
->Sally Ann Hunter (aka Sally Cleveland
-
-leaving the data user to wonder whether Sally has other aliases that were trimmed off the data item.
-
-Truncated data items are very difficult to detect. When auditing data I use three different methods to find possible truncations, but I probably miss some.
-
-**Item length distribution.** The first method catches most of the truncations I find in individual fields. I pass the field to an AWK command that tallies up data items by field width, then I use **sort** to print the tallies in reverse order of width. For example, to check field 33 in the tab-separated file  "midges":
-
-```
-awk -F"\t" 'NR>1 {a[length($33)]++} \
-END {for (i in a) print i FS a[i]}' midges | sort -nr
-```
-
-![distro1][1]
-
-The longest entries have exactly 50 characters, which is suspicious, and there's a "bulge" of data items at that width, which is even more suspicious. Inspection of those 50-character-wide items reveals truncations:
-
-![distro2][2]
-
-Other tables I've checked this way had bulges at 100, 200 and 255 characters. In each case the bulges contained apparent truncations.
-
-**Unmatched brackets**. The second method looks for data items like  "...(Sally Cleveland" above. A good starting point is a tally of all the punctuation in the data table. Here I'm checking the file "mag2":
-
-grep -o "[[:punct:]]" file | sort | uniq -c
-
-![punct][3]
-
-Note that the numbers of opening and closing round brackets in "mag2" aren't equal. To see what's going on, I use the function "unmatched", which takes three arguments and checks all fields in a data table. The first argument is the filename and the second and third are the opening and closing brackets, enclosed in quotes.
-
-```
-unmatched()
-{
-awk -F"\t" -v start="$2" -v end="$3" \
-'{for (i=1;i<=NF;i++) \
-if (split($i,a,start) != split($i,b,end)) \
-print "line "NR", field "i":\n"$i}' "$1"
-
-}
-```
-
-"unmatched" reports line number and field number if it finds a mismatch between opening and closing brackets in the field. It relies on AWK's **split** function, which returns the number of elements (including blank space) separated by the splitting character. This number will always be one more than the number of splitters:
-
-![split][4]
-
-Here "unmatched" checks the round brackets in "mag2" and finds some likely truncations:
-
-![unmatched][5]
-
-I use "unmatched" to locate unmatched round brackets (), square brackets [], curly brackets {} and arrows <>, but the function can be used for any paired punctuation characters.
-
-**Unexpected endings**. The third method looks for data items that end in a trailing space or a non-terminal punctuation mark, like a comma or a hyphen. This can be done on a single field with **cut** piped to **grep** , or in one step with AWK. Here I'm checking field 47 of the tab-separated table "herp5", and pulling out suspect data items and their line numbers:
-
-```
-cut -f47 herp5 | grep -n "[ ,;:-]$"
-
-awk -F"\t" '$47 ~ /[ ,;:-]$/ {print NR": "$47}' herp5
-```
-
-![herps5][6]
-
-The all-fields version of the AWK command for a tab-separated file is:
-
-```
-awk -F"\t" '{for (i=1;i<=NF;i++) if ($i ~ /[ ,;:-]$/) \
-print "line "NR", field "i":\n"$i}' file
-```
-
-**Cautionary thoughts**. Truncations also appear during the validation tests I do on fields. For example, I might be checking for plausible 4-digit entries in a  "Year" field, and there's a 198 that hints at 198n. Or is it 1898? Truncated data items with their lost characters are mysteries. As a data auditor I can only report (possible) character losses and suggest that the (possibly) missing characters be restored by the data compilers or managers.
-
--------------------------------------------------------------------------------
-
-via: https://www.polydesmida.info/BASHing/2018-07-04.html
-
-Author: [polydesmida][a]
-Topic selection: [lujun9972](https://github.com/lujun9972)
-Translator: [Translator ID](https://github.com/译者ID)
-Proofreader: [Proofreader ID](https://github.com/校对者ID)
-
-This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux China](https://linux.cn/)
-
-[a]:https://www.polydesmida.info/
-[1]:https://www.polydesmida.info/BASHing/img1/2018-07-04_1.png
-[2]:https://www.polydesmida.info/BASHing/img1/2018-07-04_2.png
-[3]:https://www.polydesmida.info/BASHing/img1/2018-07-04_3.png
-[4]:https://www.polydesmida.info/BASHing/img1/2018-07-04_4.png
-[5]:https://www.polydesmida.info/BASHing/img1/2018-07-04_5.png
-[6]:https://www.polydesmida.info/BASHing/img1/2018-07-04_6.png
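Reviewer's sketch (not part of the commit): the `unmatched` function from the removed source text can be tried on a small sample. The temporary file and its two records are made up for illustration. Only the function body comes from the article, reflowed here for readability.

```shell
# Hypothetical sample: two tab-separated records; the second has lost
# its closing round bracket.
sample=$(mktemp)
printf 'Sally Ann Hunter (aka Sally Cleveland)\tok\n' >  "$sample"
printf 'Sally Ann Hunter (aka Sally Cleveland\tok\n'  >> "$sample"

# unmatched FILE OPEN CLOSE
# Reports every field in which the number of OPEN characters differs from
# the number of CLOSE characters. split() returns the number of pieces,
# which is one more than the number of separators found.
unmatched() {
    awk -F"\t" -v start="$2" -v end="$3" '
    {
        for (i = 1; i <= NF; i++)
            if (split($i, a, start) != split($i, b, end))
                print "line " NR ", field " i ":\n" $i
    }' "$1"
}

report=$(unmatched "$sample" "(" ")")
printf '%s\n' "$report"
rm -f "$sample"
```

With this sample only field 1 of the second record is reported: splitting it on `(` yields two pieces, while splitting on `)` yields one.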
--- a/translated/tech/20180704
+++ b/translated/tech/20180704
@@ -0,0 +1,115 @@
+How to spot truncated data items
+======
+
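+**truncated** (adj.): abbreviated, abridged, curtailed, cut off, clipped, cropped, trimmed...
+
+One way a data item gets truncated is when it is entered into a database field whose character limit is shorter than the item. For example, the string: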
+
+```
+Yarrow Ravine Rattlesnake Habitat Area, 2 mi ENE of Yermo CA
+```
+
+is 60 characters long. If you enter it into a "Locality" field with a 50-character limit, you get:
+
+```
+Yarrow Ravine Rattlesnake Habitat Area, 2 mi ENE #Ends with a whitespace
+```
+
+Truncations can also be data entry errors. Suppose you meant to enter:
+
+```
+Sally Ann Hunter (aka Sally Cleveland)
+```
+
+but you forgot the closing bracket:
+
+```
+Sally Ann Hunter (aka Sally Cleveland
+```
+
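+leaving the data user to wonder whether Sally has other aliases that were trimmed off the data item.
+
+Truncated data items are very difficult to detect. When auditing data I use three different methods to find possible truncations, but I probably still miss some.
+
+**Item length distribution.** The first method catches most of the truncations I find in individual fields. I pass the field to an `awk` command that tallies data items by field width, then I use `sort` to print the tallies in reverse order of width. For example, to check field 33 in the tab-separated file `midges`: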
+
+```
+awk -F"\t" 'NR>1 {a[length($33)]++} \
+    END {for (i in a) print i FS a[i]}' midges | sort -nr
+```
+
+![distro1][1]
+
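+The longest entries have exactly 50 characters, which is suspicious, and there's a "bulge" of data items at that width, which is even more suspicious. Inspecting those 50-character items reveals truncations: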
+
+![distro2][2]
+
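+Other tables I've checked this way had bulges at 100, 200 and 255 characters. In each case the bulges contained apparent truncations.
+
+**Unmatched brackets.** The second method looks for data items like `...(Sally Cleveland` above. A good starting point is a tally of all the punctuation in the data table. Here I'm checking the file `mag2`: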
+
+```
+grep -o "[[:punct:]]" file | sort | uniq -c
+```
+
+![punct][3]
+
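+Note that the numbers of opening and closing round brackets in `mag2` aren't equal. To see what's going on, I use the function `unmatched`, which takes three arguments and checks all fields in a data table. The first argument is the filename, and the second and third are the opening and closing brackets, enclosed in quotes.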
+
+```
+unmatched()
+{
+    awk -F"\t" -v start="$2" -v end="$3" \
+        '{for (i=1;i<=NF;i++) \
+            if (split($i,a,start) != split($i,b,end)) \
+                print "line "NR", field "i":\n"$i}' "$1"
+}
+```
+
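+If it finds a mismatch between opening and closing brackets in a field, `unmatched` reports the line number and field number. It relies on `awk`'s `split` function, which returns the number of elements (including empty ones) separated by the splitting character. This number is always one more than the number of separators: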
+
+![split][4]
+
+Here `unmatched` checks the round brackets in `mag2` and finds some likely truncations:
+
+![unmatched][5]
+
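+I use `unmatched` to locate unmatched round brackets `()`, square brackets `[]`, curly brackets `{}` and angle brackets `<>`, but the function can be used for any paired punctuation characters.
+
+**Unexpected endings.** The third method looks for data items that end in a trailing space or a non-terminal punctuation mark, like a comma or a hyphen. This can be done on a single field with `cut` piped to `grep`, or in one step with `awk`. Here I'm checking field 47 of the tab-separated table `herp5` and pulling out suspect data items and their line numbers: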
+
+```
+cut -f47 herp5 | grep -n "[ ,;:-]$"
+# or
+awk -F"\t" '$47 ~ /[ ,;:-]$/ {print NR": "$47}' herp5
+```
+
+![herps5][6]
+
+The all-fields version of the `awk` command for a tab-separated file is:
+
+```
+awk -F"\t" '{for (i=1;i<=NF;i++) if ($i ~ /[ ,;:-]$/) \
+    print "line "NR", field "i":\n"$i}' file
+```
+
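+**Cautionary thoughts.** Truncations also turn up during the validation tests I do on fields. For example, I might be checking for plausible 4-digit entries in a "Year" field, and there's a `198` that hints at 198n. Or is it 1898? Truncated data items with their lost characters are mysteries. As a data auditor I can only report (possible) character losses and suggest that the (possibly) missing characters be restored by the data compilers or managers.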
+
+--------------------------------------------------------------------------------
+
+via: https://www.polydesmida.info/BASHing/2018-07-04.html
+
+Author: [polydesmida][a]
+Topic selection: [lujun9972](https://github.com/lujun9972)
+Translator: [wxy](https://github.com/wxy)
+Proofreader: [wxy](https://github.com/wxy)
+
+This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux China](https://linux.cn/)
+
+[a]:https://www.polydesmida.info/
+[1]:https://www.polydesmida.info/BASHing/img1/2018-07-04_1.png
+[2]:https://www.polydesmida.info/BASHing/img1/2018-07-04_2.png
+[3]:https://www.polydesmida.info/BASHing/img1/2018-07-04_3.png
+[4]:https://www.polydesmida.info/BASHing/img1/2018-07-04_4.png
+[5]:https://www.polydesmida.info/BASHing/img1/2018-07-04_5.png
+[6]:https://www.polydesmida.info/BASHing/img1/2018-07-04_6.png
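Reviewer's sketch (not part of the commit): the all-fields "unexpected endings" check from the translated article, run against a made-up sample. The file contents are hypothetical, and the square brackets printed around each reported item are an addition for illustration, so that a trailing space is visible.

```shell
# Hypothetical tab-separated sample: a header plus three records.
sample=$(mktemp)
{
    printf 'id\tlocality\n'
    printf 'A\tYarrow Ravine, 2 mi ENE \n'   # ends with a space
    printf 'B\tYermo CA\n'                   # clean
    printf 'C\tnear Barstow,\n'              # ends with a comma
} > "$sample"

# Flag any field ending in a space, comma, semicolon, colon or hyphen.
suspects=$(awk -F"\t" '{
    for (i = 1; i <= NF; i++)
        if ($i ~ /[ ,;:-]$/)
            print "line " NR ", field " i ": [" $i "]"
}' "$sample")

printf '%s\n' "$suspects"
rm -f "$sample"
```

On this sample the check flags field 2 on lines 2 and 4. Whether such an item really is truncated still has to be decided by looking at it.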