Merge pull request #6 from cicirello/development
Sitemap entries sorted
cicirello authored Aug 10, 2020
2 parents be0db52 + 9f620f4 commit 1b50fa7
Showing 3 changed files with 40 additions and 21 deletions.
15 changes: 15 additions & 0 deletions Dockerfile
@@ -6,6 +6,21 @@ FROM alpine:3.10
RUN apk update
RUN apk add git

+# The base alpine find command is quite
+# limited. We need full featured find.
+RUN apk add findutils
+
+# We also need coreutils to get fuller
+# featured versions of shell commands,
+# such as sort.
+RUN apk add coreutils
+
+# We also need gawk
+RUN apk add gawk
+
+# Let's use bash
+RUN apk add bash bash-doc bash-completion
+
COPY LICENSE README.md /
COPY entrypoint.sh /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
21 changes: 14 additions & 7 deletions README.md
@@ -1,4 +1,4 @@
-# Generate Sitemap
+# generate-sitemap

[![build](https://github.com/cicirello/generate-sitemap/workflows/build/badge.svg)](https://github.com/cicirello/generate-sitemap/actions?query=workflow%3Abuild)
[![GitHub](https://img.shields.io/github/license/cicirello/generate-sitemap)](https://github.com/cicirello/generate-sitemap/blob/master/LICENSE)
@@ -11,7 +11,14 @@ html as well as pdf files in the sitemap, and has inputs to
control the included file types (defaults include both html
and pdf files in the sitemap). It skips over html files that
contain `<meta name="robots" content="noindex">`. It otherwise
-does not currently attempt to respect a robots.txt file.
+does not currently attempt to respect a robots.txt file. The
+sitemap entries are sorted in a consistent order. Specifically,
+all html pages appear prior to all URLs to pdf files (if pdfs
+are included). The html pages are then first sorted by depth
+in the directory structure (i.e., pages at the website root
+appear first, etc), and then pages at the same depth are sorted
+alphabetically. URLs to pdf files are sorted in the same manner
+as the html pages.

It is designed to be used in combination with other GitHub
Actions. For example, it does not commit and push the generated
@@ -101,7 +108,7 @@ file in the root of the repository. After completion, it then
simply echos the outputs.

```yml
-name: Generate API sitemap
+name: Generate xml sitemap
on:
push:
@@ -119,7 +126,7 @@ jobs:
fetch-depth: 0
- name: Generate the sitemap
id: sitemap
-uses: cicirello/generate-sitemap@v1.0.0
+uses: cicirello/generate-sitemap@v1.1.0
with:
base-url-path: https://THE.URL.TO.YOUR.PAGE/
- name: Output stats
@@ -155,7 +162,7 @@ jobs:
fetch-depth: 0
- name: Generate the sitemap
id: sitemap
-uses: cicirello/generate-sitemap@v1.0.0
+uses: cicirello/generate-sitemap@v1.1.0
with:
base-url-path: https://THE.URL.TO.YOUR.PAGE/
path-to-root: docs
@@ -178,7 +185,7 @@ then the `peter-evans/create-pull-request` monitors for changes, and
if the sitemap changed will create a pull request.

```yml
-name: Generate API sitemap
+name: Generate xml sitemap
on:
push:
@@ -196,7 +203,7 @@ jobs:
fetch-depth: 0
- name: Generate the sitemap
id: sitemap
-uses: cicirello/generate-sitemap@v1.0.0
+uses: cicirello/generate-sitemap@v1.1.0
with:
base-url-path: https://THE.URL.TO.YOUR.PAGE/
- name: Create Pull Request
25 changes: 11 additions & 14 deletions entrypoint.sh
@@ -1,4 +1,4 @@
-#!/bin/sh -l
+#!/bin/bash -l

websiteRoot=$1
baseUrl=$2
@@ -11,12 +11,9 @@ skipCount=0

function formatSitemapEntry {
if [ "$sitemapFormat" == "xml" ]; then
-lastModDate=${3/ /T}
-lastModDate=${lastModDate/ /}
-lastModDate="${lastModDate:0:22}:${lastModDate:22:2}"
echo "<url>" >> sitemap.xml
echo "<loc>$2${1%index.html}</loc>" >> sitemap.xml
-echo "<lastmod>$lastModDate</lastmod>" >> sitemap.xml
+echo "<lastmod>$3</lastmod>" >> sitemap.xml
echo "</url>" >> sitemap.xml
else
echo "$2${1/%\/index.html/\/}" >> sitemap.txt
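
Note on the `lastmod` change in this hunk: the removed lines reshaped the output of `git log --format=%ci` (a timestamp laid out like `2020-08-10 09:15:30 -0400`) into the `YYYY-MM-DDThh:mm:ss-hh:mm` layout by string manipulation, whereas the new code passes through a value that the later hunk obtains with `%cI`, which already has that layout. The sketch below re-expresses the removed conversion with `sed` so it can be tried outside the action. The function name `ci_to_w3c` and the sample timestamp are illustrative assumptions, not part of `entrypoint.sh`.

```shell
#!/bin/sh
# Hypothetical helper, not part of entrypoint.sh: convert a timestamp in
# git's %ci layout ("YYYY-MM-DD hh:mm:ss +hhmm") to the layout that
# %cI prints ("YYYY-MM-DDThh:mm:ss+hh:mm").
#
# The three sed expressions mirror the three removed lines:
#   1. replace the first space (between date and time) with "T"
#   2. delete the remaining space (between time and UTC offset)
#   3. insert ":" between the hours and minutes of the UTC offset
ci_to_w3c() {
  printf '%s\n' "$1" | sed -E \
    -e 's/ /T/' \
    -e 's/ //' \
    -e 's/([+-][0-9]{2})([0-9]{2})$/\1:\2/'
}

# Sample value chosen for illustration only.
ci_to_w3c '2020-08-10 09:15:30 -0400'
```

For the sample input this prints `2020-08-10T09:15:30-04:00`. Asking git for `%cI` directly removes the need for any of this post-processing, which is why the function body shrinks in this commit.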
@@ -35,20 +32,20 @@ else
fi

if [ "$includeHTML" == "true" ]; then
-for i in $(find . \( -name '*.html' -o -name '*.htm' \) -type f); do
-if [ "0" == $(grep -i -c -E "<meta*.*name*.*robots*.*content*.*noindex" $i || true) ]; then
-lastMod=$(git log -1 --format=%ci $i)
-formatSitemapEntry ${i#./} "$baseUrl" "$lastMod"
+while read file; do
+if [ "0" == $(grep -i -c -E "<meta*.*name*.*robots*.*content*.*noindex" $file || true) ]; then
+lastMod=$(git log -1 --format=%cI $file)
+formatSitemapEntry ${file#./} "$baseUrl" "$lastMod"
else
skipCount=$((skipCount+1))
fi
-done
+done < <(find . \( -name '*.html' -o -name '*.htm' \) -type f -printf '%d\0%h\0%p\n' | sort -t '\0' -n | awk -F '\0' '{print $3}')
fi
if [ "$includePDF" == "true" ]; then
-for i in $(find . -name '*.pdf' -type f); do
-lastMod=$(git log -1 --format=%ci $i)
-formatSitemapEntry ${i#./} "$baseUrl" "$lastMod"
-done
+while read file; do
+lastMod=$(git log -1 --format=%cI $file)
+formatSitemapEntry ${file#./} "$baseUrl" "$lastMod"
+done < <(find . -name '*.pdf' -type f -printf '%d\0%h\0%p\n' | sort -t '\0' -n | awk -F '\0' '{print $3}')
fi

if [ "$sitemapFormat" == "xml" ]; then
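
Note on the ordering introduced in this hunk: `find` prints each file's depth ahead of its path, `sort -n` orders the lines by that depth, and the final stage discards the sort key. The following is a minimal sketch of the same idea, assuming GNU `find` (needed for `-printf`) and paths that contain no tabs or newlines. It is a simplified variation, not the commit's code: it uses a tab separator with `cut` instead of the NUL separator and `gawk`, and it breaks ties by full path in the C locale, which may differ slightly from the commit's ordering. The name `list_by_depth` and the demo tree are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, not part of entrypoint.sh: list files under a
# directory whose names match a glob, shallowest first, then by path.
# Assumes GNU find (for -printf) and paths without tabs or newlines.
list_by_depth() (
  # The body runs in a subshell, so the cd does not affect the caller.
  cd "$1" || exit 1
  tab=$(printf '\t')
  # %d = depth below the starting point, %p = path as found.
  find . -name "$2" -type f -printf '%d\t%p\n' |
    LC_ALL=C sort -t "$tab" -k1,1n -k2 |  # numeric depth, then path
    cut -f2-                              # drop the depth column
)

# Demo on a throwaway tree.
demo=$(mktemp -d)
mkdir -p "$demo/a/b" "$demo/z"
touch "$demo/index.html" "$demo/about.html" "$demo/z/page.html" \
      "$demo/a/index.html" "$demo/a/b/deep.html"
list_by_depth "$demo" '*.html'
rm -rf "$demo"
```

The demo lists `./about.html` and `./index.html` first, then `./a/index.html` and `./z/page.html`, and finally `./a/b/deep.html`. Feeding such a listing to `while read` rather than iterating over `$(find ...)` in a `for` loop keeps the order that the sort produced.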