Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve RSS importer with canonical_link and extract_tags option #489

Merged
merged 8 commits into from
Nov 28, 2022
Merged

Improve RSS importer with canonical_link and extract_tags option #489

merged 8 commits into from
Nov 28, 2022

Conversation

sumanmaity112
Copy link
Contributor

No description provided.

Copy link
Member

@parkr parkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! A few comments.

"title" => item.title,
}
"title" => item.title,
"canonical_url" => (canonical_link ? item.link : nil),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, this could create a large site with a lot of canonical_url: nil in them since the key is present. It might be best to follow the below "tag" method and do something like:

header["canonical_url"] = item.link if canonical_link

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, will update it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @parkr, we don't have change it to item.link if canonical_link as .compact will take care of null values

@@ -17,3 +17,10 @@ $ ruby -r rubygems -e 'require "jekyll-import";
{% endhighlight %}

The `source` field is required and can be either a local file or a remote one.
Other optional fields are as follows
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Other optional fields are as follows
Other optional fields are as follows:

@@ -17,3 +17,10 @@ $ ruby -r rubygems -e 'require "jekyll-import";
{% endhighlight %}

The `source` field is required and can be either a local file or a remote one.
Other optional fields are as follows
* `canonical_link` add canonical to posts. Default value is `false`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* `canonical_link` add canonical to posts. Default value is `false`
* `canonical_link` – copy original link as `canonical_url` to post front matter. (default: `false`)

@@ -89,6 +102,11 @@ def self.write_rss_item(item, options)
end

f.puts output
if canonical_link
f.puts <<~HTML
<p>Originally posted on <a href="#{item.link}">#{URI.parse(item.link).host}</a>.</p>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this – the user can add this if they want it since canonical_url is in the front matter. It can be easily added to the post layout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @parkr, that's correct, it can be easily do able by making simple change it post layout. But I would like to have it here due to following reasons:

  • it's configurable using --canonical_link option
  • I believe multiple people wants to have reference to the original post. To make every one of them to change post layout it's little too much according to me.

In case if it's not preferred to inter connect frontmatter header and this footer, maybe adding a new option will be better.

@@ -17,3 +17,10 @@ $ ruby -r rubygems -e 'require "jekyll-import";
{% endhighlight %}

The `source` field is required and can be either a local file or a remote one.
Other optional fields are as follows
* `canonical_link` add canonical to posts. Default value is `false`
* `render_audio` render <audio> element in posts. Default value is `false`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* `render_audio` render <audio> element in posts. Default value is `false`
* `render_audio` render `<audio>` element in posts for the enclosure URLs (default: `false`)

Other optional fields are as follows
* `canonical_link` add canonical to posts. Default value is `false`
* `render_audio` render <audio> element in posts. Default value is `false`
* `tag` add specific tag to posts
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* `tag` add specific tag to posts
* `tag` add a specific tag to all posts

* `canonical_link` add canonical to posts. Default value is `false`
* `render_audio` render <audio> element in posts. Default value is `false`
* `tag` add specific tag to posts
* `extract_tags` extract tags from given key
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this one meant to be used? What variable/field on item?

Suggested change
* `extract_tags` extract tags from given key
* `extract_tags` – copies tags from the given subfield on the RSS `<item>`

Copy link
Member

@ashmaroli ashmaroli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs some changes to be made.

Comment on lines 9 to 11
c.option "render_audio", "--render_audio", "Render <audio> element as necessary"
c.option "canonical_link", "--canonical_link", "Add source canonical link"
c.option "extract_tags", "--extract_tags key", "Extract tag from given key"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section of code gets exposed to the user via command-line interface. So clarity and readability is important.

  • Option parameter be in uppercase.
  • Group boolean options together.
Suggested change
c.option "render_audio", "--render_audio", "Render <audio> element as necessary"
c.option "canonical_link", "--canonical_link", "Add source canonical link"
c.option "extract_tags", "--extract_tags key", "Extract tag from given key"
c.option "extract_tags", "--extract_tags KEY", "Extract tag from given key"
c.option "render_audio", "--render_audio", "Render <audio> element as necessary"
c.option "canonical_link", "--canonical_link", "Add source canonical link to front matter"


header = {
"layout" => "post",
"title" => item.title,
}
"title" => item.title,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[style]: The => operators at same level should be aligned.

Suggested change
"title" => item.title,
"title" => item.title,

Comment on lines 68 to 74
if options["extract_tags"]
tags_from_feed = item.instance_variable_get("@#{options["extract_tags"]}")
unless tags_from_feed.nil?
tags = tags_from_feed.map { |feed_tag| feed_tag.content.downcase }
header["tag"] = tags unless tags.empty?
end
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section of code is confusing.

  • First we check for options["extract_tags"] where options is the CLI options.
  • Then we check for item.instance_variable_get("@#{options["extract_tags"]}") using same options object. So why not store the key in a variable for reuse?
  • Then we preprocess feed-tags using #map. That means we expect an array of feed_tag instead of any non-nil object. So, we should explicitly test for array instead. We also downcase feed_tags. So, if the feed originally contained ["holiday", "Holiday"], we'd end up with: ["holiday", "holiday"]. So we should filter duplicates.
Suggested change
if options["extract_tags"]
tags_from_feed = item.instance_variable_get("@#{options["extract_tags"]}")
unless tags_from_feed.nil?
tags = tags_from_feed.map { |feed_tag| feed_tag.content.downcase }
header["tag"] = tags unless tags.empty?
end
end
tag_name = options["extract_tags"]
if tag_name
tags_from_feed = item.instance_variable_get("@#{tag_name}")
if tags_from_feed.is_a?(Array)
tags = tags_from_feed.map { |feed_tag| feed_tag.content.downcase }
tags.uniq!
header["tag"] = tags unless tags.empty?
end
end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, this code looks much better :). Only thing tag_name is actually the key mentioned by CLI options. Therefor I'll update it accordingly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that CLI option param KEY corresponds to the tag_name in my suggestion. But now that you mentioned it, I wonder why the option is in plural extract_tags and why not singular as in
--extract_tag TAG_NAME..?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ashmaroli, the reason of introducing --extract_tags is to extract tag(s) from RSS feed. When someone add a post they'll have multiple tags. In this case, you'll provide the key(sub field of RSS payload) and it'll add all those tags to frontmatter.

For example, if you look into the follow RSS feed from Medium, it has multiple category (which is actually tags).

<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"
     xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Suman Maity on Medium]]></title>
        <description><![CDATA[Stories by Suman Maity on Medium]]></description>
        <link>https://medium.com/@suman.maity112?source=rss-92eeb3145de9------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*rAk1Q7-7kFOM-arRqtpvMA.jpeg</url>
            <title>Stories by Suman Maity on Medium</title>
            <link>https://medium.com/@suman.maity112?source=rss-92eeb3145de9------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Mon, 07 Nov 2022 19:38:06 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@suman.maity112/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[[email protected]]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Is it the era of mono-repo?]]></title>
            <link>
                https://medium.com/@suman.maity112/is-it-the-era-of-mono-repo-671e6dee387?source=rss-92eeb3145de9------2
            </link>
            <guid isPermaLink="false">https://medium.com/p/671e6dee387</guid>
            <category><![CDATA[version-control]]></category>
            <category><![CDATA[multi-repo]]></category>
            <category><![CDATA[monorepo]]></category>
            <dc:creator><![CDATA[Suman Maity]]></dc:creator>
            <pubDate>Mon, 07 Nov 2022 08:11:46 GMT</pubDate>
            <atom:updated>2022-11-07T08:11:46.975Z</atom:updated>
            <content:encoded><![CDATA[dummy content]]></content:encoded>
        </item>
    </channel>
</rss>```

This is the main reason of having `--extract_tags` as plural instead of singular. You can find the imported RSS feed [here](https://sumanmaity112.github.io/posts/is-it-the-era-of-mono-repo/)

@sumanmaity112
Copy link
Contributor Author

I'll make the required/suggested changes and update the same PR

@ashmaroli
Copy link
Member

@sumanmaity112 This also requires tests especially because the implementation expects a particular state. Starting with some basic tests will do.

@sumanmaity112
Copy link
Contributor Author

@sumanmaity112 This also requires tests especially because the implementation expects a particular state. Starting with some basic tests will do.

Hi @ashmaroli, didn't got your point. Are you asking me to add some sort unit tests?

@ashmaroli
Copy link
Member

Hello again, @sumanmaity112
There are couple of issues with the new "private" method you have defined:

  • In Ruby, private keyword only affects the instance methods. So, the self.get_tags method is still public. You will have to use the private_class_method :get_tags declaration to make it truly private.
  • Please consider refactoring the self.get_tags once more. The following lines do not seem right:
    explicit_tag = options["tag"]
    if explicit_tag
      return explicit_tag unless explicit_tag.nil? || explicit_tag.empty?
    nil is a falsy. The check for explicit_tag.nil? is redundant after the initial if explicit_tag check. Plus, what happened to the exclusivity of the two options? Why ignore options["extract_tags"] if explicit_tag is nil or undefined?
    (This is why test coverage is important)

@sumanmaity112
Copy link
Contributor Author

@ashmaroli, refactored the get_tags method. Please do have a look and let me know if I have to update something else

@sumanmaity112
Copy link
Contributor Author

@ashmaroli, @parkr Any idea how to fix Continuous Integration / Ruby 2.5? It failing to setup the Ruby 2.5 itself

@ashmaroli
Copy link
Member

@sumanmaity112 To fix the CI for Ruby 2.5, we will need to lock gem rdoc to an older version.
But that is a maintenance task and falls outside the scope of this PR. If you have the time and have the technical know-how regarding this, feel free to submit a pull request. Either ways, I will tackle it in a couple of days.

@parkr
Copy link
Member

parkr commented Nov 20, 2022

Locked. If you're comfortable rebasing, please do so!

@sumanmaity112
Copy link
Contributor Author

Sure, will rebase the changes

@sumanmaity112
Copy link
Contributor Author

Hi @parkr , I rebased the changes couple days back. Please let me know if anything else is pending from my side which is required to merge this PR.

@ashmaroli
Copy link
Member

Hello @sumanmaity112,
One final nitpick from my end:

Could we have the CLI option description and docs mirror each other?

I don't see why the concise description in the docs wouldn't be apt for the CLI context display as well.

@sumanmaity112
Copy link
Contributor Author

sure, will update it

@sumanmaity112
Copy link
Contributor Author

Hi @parkr, Could you please check if it's better now

Copy link
Member

@ashmaroli ashmaroli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@@ -89,8 +93,28 @@ def self.write_rss_item(item, options)
end

f.puts output
if canonical_link
f.puts <<~HTML
<p>Originally posted on <a href="#{item.link}">#{URI.parse(item.link).host}</a>.</p>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think we should remove this before merging. What if the blog is not in English? Or if they just don't want this link? Now the user has to manually edit all of their posts. This is asking way more of the user than editing their post layout. In general, injecting our own content is not what we do. Metadata is fine but I'm not comfortable writing content to these posts.

it's configurable using --canonical_link option

True but then you lose the front matter value. The docs for this only says you get the front matter value, not extra content. We can trust that the user can add a link themselves if they so desire. Adding canonical_url is sufficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @parkr, Your point is valid in terms of user can add it if they want. But I am not sure how it can be automated?

For my case, I import all my posts from Medium to github pages. I want to add canonical_url at the end of the imported content. You can see the example here. Currently, I have a Github Workflow to import all posts from Medium.
At least I am not sure how to achieve this if the importer is not adding it automatically. Maybe you can help me here. Thanks

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool workflow! Nice way of having a backup/mirror of your site.

I mean that adding it to the post layout adds this link to every post automatically since every post used the post layout, right? Even if they don't, you could add front matter defaults so that each post uses the post layout, then add the link to the post layout.

  1. Post itself contains only the content the user wrote
  2. _layouts/post.html contains canonical link if specified
  3. Each post specifies that it should use the post layout, or use front matter defaults to specify the post layout should be used by all posts

I don't want this importer to modifying the user's post contents (especially when it's not documented).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@parkr Your point is valid.

Now in my mind there are couple of options

  1. Remove the changes which adds canonical url to the post
  2. Keep the changes and document it
  3. Add a new options (addition to --canonical_link) to configure this specific footer content

Please let me know whatever you think is better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sumanmaity112 Since Parker doesn't want importers to modify imported content regardless of configuration, it's best that you remove the canonical_url markup injection. Let the option to add the info to front matter remain for subsequent consumption.

Regarding your personal blog, you have two options:

  1. Modify the footer template in your theme to render the canonical url.
  2. Write a custom Ruby plugin into a _plugins directory in your blog source repository. The plugin can easily inject custom content into your posts at runtime via Jekyll Hooks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ashmaroli, @parkr Removed the canonical url from footer

Copy link
Member

@parkr parkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@parkr
Copy link
Member

parkr commented Nov 28, 2022

@jekyllbot: merge +minor

@jekyllbot jekyllbot merged commit f9b6653 into jekyll:master Nov 28, 2022
jekyllbot added a commit that referenced this pull request Nov 28, 2022
@jekyll jekyll locked and limited conversation to collaborators Nov 28, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants