-
Notifications
You must be signed in to change notification settings - Fork 314
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve RSS importer with canonical_link and extract_tags option #489
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! A few comments.
"title" => item.title, | ||
} | ||
"title" => item.title, | ||
"canonical_url" => (canonical_link ? item.link : nil), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, this could create a large site with a lot of canonical_url: nil
in them since the key is present. It might be best to follow the below "tag" method and do something like:
header["canonical_url"] = item.link if canonical_link
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, will update it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @parkr, we don't have change it to item.link if canonical_link
as .compact
will take care of null values
docs/_importers/rss.md
Outdated
@@ -17,3 +17,10 @@ $ ruby -r rubygems -e 'require "jekyll-import"; | |||
{% endhighlight %} | |||
|
|||
The `source` field is required and can be either a local file or a remote one. | |||
Other optional fields are as follows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other optional fields are as follows | |
Other optional fields are as follows: |
docs/_importers/rss.md
Outdated
@@ -17,3 +17,10 @@ $ ruby -r rubygems -e 'require "jekyll-import"; | |||
{% endhighlight %} | |||
|
|||
The `source` field is required and can be either a local file or a remote one. | |||
Other optional fields are as follows | |||
* `canonical_link` add canonical to posts. Default value is `false` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* `canonical_link` add canonical to posts. Default value is `false` | |
* `canonical_link` – copy original link as `canonical_url` to post front matter. (default: `false`) |
lib/jekyll-import/importers/rss.rb
Outdated
@@ -89,6 +102,11 @@ def self.write_rss_item(item, options) | |||
end | |||
|
|||
f.puts output | |||
if canonical_link | |||
f.puts <<~HTML | |||
<p>Originally posted on <a href="#{item.link}">#{URI.parse(item.link).host}</a>.</p> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this – the user can add this if they want it since canonical_url
is in the front matter. It can be easily added to the post
layout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @parkr, that's correct, it can be easily do able by making simple change it post
layout. But I would like to have it here due to following reasons:
- it's configurable using
--canonical_link
option - I believe multiple people wants to have reference to the original post. To make every one of them to change
post
layout it's little too much according to me.
In case if it's not preferred to inter connect frontmatter header and this footer, maybe adding a new option will be better.
docs/_importers/rss.md
Outdated
@@ -17,3 +17,10 @@ $ ruby -r rubygems -e 'require "jekyll-import"; | |||
{% endhighlight %} | |||
|
|||
The `source` field is required and can be either a local file or a remote one. | |||
Other optional fields are as follows | |||
* `canonical_link` add canonical to posts. Default value is `false` | |||
* `render_audio` render <audio> element in posts. Default value is `false` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* `render_audio` render <audio> element in posts. Default value is `false` | |
* `render_audio` – render `<audio>` element in posts for the enclosure URLs (default: `false`) |
docs/_importers/rss.md
Outdated
Other optional fields are as follows | ||
* `canonical_link` add canonical to posts. Default value is `false` | ||
* `render_audio` render <audio> element in posts. Default value is `false` | ||
* `tag` add specific tag to posts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* `tag` add specific tag to posts | |
* `tag` – add a specific tag to all posts |
docs/_importers/rss.md
Outdated
* `canonical_link` add canonical to posts. Default value is `false` | ||
* `render_audio` render <audio> element in posts. Default value is `false` | ||
* `tag` add specific tag to posts | ||
* `extract_tags` extract tags from given key |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this one meant to be used? What variable/field on item
?
* `extract_tags` extract tags from given key | |
* `extract_tags` – copies tags from the given subfield on the RSS `<item>` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs some changes to be made.
lib/jekyll-import/importers/rss.rb
Outdated
c.option "render_audio", "--render_audio", "Render <audio> element as necessary" | ||
c.option "canonical_link", "--canonical_link", "Add source canonical link" | ||
c.option "extract_tags", "--extract_tags key", "Extract tag from given key" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section of code gets exposed to the user via command-line interface. So clarity and readability is important.
- Option parameter be in uppercase.
- Group boolean options together.
c.option "render_audio", "--render_audio", "Render <audio> element as necessary" | |
c.option "canonical_link", "--canonical_link", "Add source canonical link" | |
c.option "extract_tags", "--extract_tags key", "Extract tag from given key" | |
c.option "extract_tags", "--extract_tags KEY", "Extract tag from given key" | |
c.option "render_audio", "--render_audio", "Render <audio> element as necessary" | |
c.option "canonical_link", "--canonical_link", "Add source canonical link to front matter" |
lib/jekyll-import/importers/rss.rb
Outdated
|
||
header = { | ||
"layout" => "post", | ||
"title" => item.title, | ||
} | ||
"title" => item.title, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[style]: The =>
operators at same level should be aligned.
"title" => item.title, | |
"title" => item.title, |
lib/jekyll-import/importers/rss.rb
Outdated
if options["extract_tags"] | ||
tags_from_feed = item.instance_variable_get("@#{options["extract_tags"]}") | ||
unless tags_from_feed.nil? | ||
tags = tags_from_feed.map { |feed_tag| feed_tag.content.downcase } | ||
header["tag"] = tags unless tags.empty? | ||
end | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section of code is confusing.
- First we check for
options["extract_tags"]
whereoptions
is the CLI options. - Then we check for
item.instance_variable_get("@#{options["extract_tags"]}")
using sameoptions
object. So why not store thekey
in a variable for reuse? - Then we preprocess feed-tags using
#map
. That means we expect an array offeed_tag
instead of any non-nil object. So, we should explicitly test for array instead. We also downcase feed_tags. So, if the feed originally contained["holiday", "Holiday"]
, we'd end up with:["holiday", "holiday"]
. So we should filter duplicates.
if options["extract_tags"] | |
tags_from_feed = item.instance_variable_get("@#{options["extract_tags"]}") | |
unless tags_from_feed.nil? | |
tags = tags_from_feed.map { |feed_tag| feed_tag.content.downcase } | |
header["tag"] = tags unless tags.empty? | |
end | |
end | |
tag_name = options["extract_tags"] | |
if tag_name | |
tags_from_feed = item.instance_variable_get("@#{tag_name}") | |
if tags_from_feed.is_a?(Array) | |
tags = tags_from_feed.map { |feed_tag| feed_tag.content.downcase } | |
tags.uniq! | |
header["tag"] = tags unless tags.empty? | |
end | |
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, this code looks much better :). Only thing tag_name
is actually the key
mentioned by CLI options. Therefor I'll update it accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that CLI option param KEY
corresponds to the tag_name
in my suggestion. But now that you mentioned it, I wonder why the option is in plural extract_tags
and why not singular as in
--extract_tag TAG_NAME
..?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @ashmaroli, the reason of introducing --extract_tags
is to extract tag(s) from RSS feed. When someone add a post they'll have multiple tags. In this case, you'll provide the key(sub field of RSS payload) and it'll add all those tags to frontmatter.
For example, if you look into the follow RSS feed from Medium, it has multiple category (which is actually tags).
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"
xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
<channel>
<title><![CDATA[Stories by Suman Maity on Medium]]></title>
<description><![CDATA[Stories by Suman Maity on Medium]]></description>
<link>https://medium.com/@suman.maity112?source=rss-92eeb3145de9------2</link>
<image>
<url>https://cdn-images-1.medium.com/fit/c/150/150/1*rAk1Q7-7kFOM-arRqtpvMA.jpeg</url>
<title>Stories by Suman Maity on Medium</title>
<link>https://medium.com/@suman.maity112?source=rss-92eeb3145de9------2</link>
</image>
<generator>Medium</generator>
<lastBuildDate>Mon, 07 Nov 2022 19:38:06 GMT</lastBuildDate>
<atom:link href="https://medium.com/@suman.maity112/feed" rel="self" type="application/rss+xml"/>
<webMaster><![CDATA[[email protected]]]></webMaster>
<atom:link href="http://medium.superfeedr.com" rel="hub"/>
<item>
<title><![CDATA[Is it the era of mono-repo?]]></title>
<link>
https://medium.com/@suman.maity112/is-it-the-era-of-mono-repo-671e6dee387?source=rss-92eeb3145de9------2
</link>
<guid isPermaLink="false">https://medium.com/p/671e6dee387</guid>
<category><![CDATA[version-control]]></category>
<category><![CDATA[multi-repo]]></category>
<category><![CDATA[monorepo]]></category>
<dc:creator><![CDATA[Suman Maity]]></dc:creator>
<pubDate>Mon, 07 Nov 2022 08:11:46 GMT</pubDate>
<atom:updated>2022-11-07T08:11:46.975Z</atom:updated>
<content:encoded><![CDATA[dummy content]]></content:encoded>
</item>
</channel>
</rss>```
This is the main reason of having `--extract_tags` as plural instead of singular. You can find the imported RSS feed [here](https://sumanmaity112.github.io/posts/is-it-the-era-of-mono-repo/)
I'll make the required/suggested changes and update the same PR |
@sumanmaity112 This also requires tests especially because the implementation expects a particular state. Starting with some basic tests will do. |
Hi @ashmaroli, didn't got your point. Are you asking me to add some sort unit tests? |
Hello again, @sumanmaity112
|
@ashmaroli, refactored the |
Co-authored-by: Ashwin Maroli <[email protected]>
@ashmaroli, @parkr Any idea how to fix Continuous Integration / Ruby 2.5? It failing to setup the Ruby 2.5 itself |
@sumanmaity112 To fix the CI for Ruby 2.5, we will need to lock gem |
Locked. If you're comfortable rebasing, please do so! |
Sure, will rebase the changes |
Hi @parkr , I rebased the changes couple days back. Please let me know if anything else is pending from my side which is required to merge this PR. |
Hello @sumanmaity112, Could we have the CLI option description and docs mirror each other? I don't see why the concise description in the docs wouldn't be apt for the CLI context display as well. |
sure, will update it |
Hi @parkr, Could you please check if it's better now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
lib/jekyll-import/importers/rss.rb
Outdated
@@ -89,8 +93,28 @@ def self.write_rss_item(item, options) | |||
end | |||
|
|||
f.puts output | |||
if canonical_link | |||
f.puts <<~HTML | |||
<p>Originally posted on <a href="#{item.link}">#{URI.parse(item.link).host}</a>.</p> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think we should remove this before merging. What if the blog is not in English? Or if they just don't want this link? Now the user has to manually edit all of their posts. This is asking way more of the user than editing their post layout. In general, injecting our own content is not what we do. Metadata is fine but I'm not comfortable writing content to these posts.
it's configurable using
--canonical_link
option
True but then you lose the front matter value. The docs for this only says you get the front matter value, not extra content. We can trust that the user can add a link themselves if they so desire. Adding canonical_url is sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @parkr, Your point is valid in terms of user can add it if they want. But I am not sure how it can be automated?
For my case, I import all my posts from Medium to github pages. I want to add canonical_url at the end of the imported content. You can see the example here. Currently, I have a Github Workflow to import all posts from Medium.
At least I am not sure how to achieve this if the importer is not adding it automatically. Maybe you can help me here. Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool workflow! Nice way of having a backup/mirror of your site.
I mean that adding it to the post layout adds this link to every post automatically since every post used the post layout, right? Even if they don't, you could add front matter defaults so that each post uses the post layout, then add the link to the post layout.
- Post itself contains only the content the user wrote
- _layouts/post.html contains canonical link if specified
- Each post specifies that it should use the post layout, or use front matter defaults to specify the post layout should be used by all posts
I don't want this importer to modifying the user's post contents (especially when it's not documented).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@parkr Your point is valid.
Now in my mind there are couple of options
- Remove the changes which adds canonical url to the post
- Keep the changes and document it
- Add a new options (addition to
--canonical_link
) to configure this specific footer content
Please let me know whatever you think is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sumanmaity112 Since Parker doesn't want importers to modify imported content regardless of configuration, it's best that you remove the canonical_url
markup injection. Let the option to add the info to front matter remain for subsequent consumption.
Regarding your personal blog, you have two options:
- Modify the footer template in your theme to render the canonical url.
- Write a custom Ruby plugin into a
_plugins
directory in your blog source repository. The plugin can easily inject custom content into your posts at runtime via Jekyll Hooks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ashmaroli, @parkr Removed the canonical url from footer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
@jekyllbot: merge +minor |
No description provided.