A gawk script to convert smart quotes
I manage a personal website and edit the web pages by hand. Since I don't have many pages on my site, this works well for me, letting me "scratch the itch" of getting into the site's code.
When I updated my website's design recently, I decided to turn all the plain quotes into "smart quotes," or quotes that look like those used in print material: “” instead of "".
Editing all of the quotes by hand would take too long, so I decided to automate the process of converting the quotes in all of my HTML files. But doing so via a script or program requires some intelligence. The script needs to know when to convert a plain quote to a smart quote, and which quote to use.
You can use different methods to convert quotes. Greg Pittman wrote a Python script for fixing smart quotes in text. I wrote mine in GNU awk (gawk).
To start, I wrote a simple gawk function to evaluate a single character. If that character is a quote, the function decides which smart quote to output. It looks at the previous character: if the previous character is whitespace, the function outputs a left (opening) smart quote. Otherwise, it outputs a right (closing) smart quote. The function does the same for single quotes, and prints any other character unchanged.
function smartquote (char, prevchar) {
    # print smart quotes depending on the previous character
    # otherwise just print the character as-is
    if (prevchar ~ /\s/) {
        # prev char is a space
        if (char == "'") {
            printf("‘");
        }
        else if (char == "\"") {
            printf("“");
        }
        else {
            printf("%c", char);
        }
    }
    else {
        # prev char is not a space
        if (char == "'") {
            printf("’");
        }
        else if (char == "\"") {
            printf("”");
        }
        else {
            printf("%c", char);
        }
    }
}
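The previous-character rule is easy to check in isolation. The following is a minimal sketch rather than the script itself: `smartquote_demo` is a hypothetical shell wrapper, and the function here returns the character instead of printing it so the results are easy to compare. It also uses `[ \t\n]` in place of gawk's `\s` so that it can run under other awk implementations, and writes the single quote as `\047` because the program sits inside a single-quoted shell string.

```shell
#!/bin/sh
# Hypothetical demo: show which character the rule picks for a few
# (character, previous character) pairs. Prefers gawk, falls back to awk.
AWK=$(command -v gawk || command -v awk)

smartquote_demo() {
    "$AWK" '
    # Return the character to emit for "char", given the character before it.
    function smartquote(char, prevchar) {
        if (char == "\047") {       # \047 is the ASCII single quote
            return (prevchar ~ /[ \t\n]/) ? "‘" : "’"
        }
        if (char == "\"") {
            return (prevchar ~ /[ \t\n]/) ? "“" : "”"
        }
        return char
    }
    BEGIN {
        print smartquote("\"", " ")      # double quote after a space
        print smartquote("\"", "!")      # double quote after other text
        print smartquote("\047", "\n")   # single quote at the start of a line
        print smartquote("\047", "t")    # single quote inside a word
        print smartquote("x", " ")       # anything else passes through
    }'
}

smartquote_demo
```

Run as written, this should print the left double quote, right double quote, left single quote, right single quote, and a plain `x`, one per line.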
With that function, the body of the gawk script processes the HTML input file character by character. The script prints all text verbatim when inside an HTML tag (for example, <html lang="en">). Outside any HTML tags, the script uses the smartquote() function to print text. The smartquote() function decides whether each quote becomes a left or a right smart quote.
function smartquote (char, prevchar) {
    ...
}

BEGIN {htmltag = 0}

{
    # for each line, scan one letter at a time:
    linelen = length($0);
    prev = "\n";
    for (i = 1; i <= linelen; i++) {
        char = substr($0, i, 1);
        if (char == "<") {
            htmltag = 1;
        }
        if (htmltag == 1) {
            printf("%c", char);
        }
        else {
            smartquote(char, prev);
            prev = char;
        }
        if (char == ">") {
            htmltag = 0;
        }
    }
    # add trailing newline at end of each line
    printf ("\n");
}
Here's an example:
gawk -f quotes.awk test.html > test2.html
Sample input:
<!DOCTYPE html>
<html lang="en">
<head>
<title>Test page</title>
<link rel="stylesheet" type="text/css" href="/test.css" />
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width" />
</head>
<body>
<h1><a href="/"><img src="logo.png" alt="Website logo" /></a></h1>
<p>"Hi there!"</p>
<p>It's and its.</p>
</body>
</html>
Sample output:
<!DOCTYPE html>
<html lang="en">
<head>
<title>Test page</title>
<link rel="stylesheet" type="text/css" href="/test.css" />
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width" />
</head>
<body>
<h1><a href="/"><img src="logo.png" alt="Website logo" /></a></h1>
<p>“Hi there!”</p>
<p>It’s and its.</p>
</body>
</html>
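Since the goal is to convert every HTML file on a site, the single-file command can be wrapped in a loop. This is a minimal sketch, assuming the script is saved as quotes.awk as in the command above. The function name `convert_all`, the `AWK` override variable, and the `.orig` backup suffix are hypothetical choices for illustration, not part of the original script.

```shell
#!/bin/sh
# Hypothetical helper: run an awk script over every .html file in a directory,
# keeping each untouched original next to the result with a .orig suffix.
# Defaults to gawk, as the article does; set AWK to override.
AWK=${AWK:-gawk}

convert_all() {
    script=$1       # path to the awk script, e.g. quotes.awk
    dir=${2:-.}     # directory to process, defaults to the current one

    for file in "$dir"/*.html; do
        [ -f "$file" ] || continue      # no matches: the glob stays literal
        # Write to a temporary file first so a failed run never truncates a page.
        if "$AWK" -f "$script" "$file" > "$file.tmp"; then
            mv "$file" "$file.orig"
            mv "$file.tmp" "$file"
        else
            rm -f "$file.tmp"
            echo "skipped $file" >&2
        fi
    done
}

# Usage: convert_all quotes.awk /path/to/site
```

Keeping the `.orig` copies makes it easy to compare before and after, and to roll back if a page contains a case the previous-character rule handles badly.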
via: https://opensource.com/article/18/8/gawk-script-convert-smart-quotes