In a previous post, I shared my journey with Sitecore Data Exchange Framework. I was excited to use it to create a reusable importer process, but in the end I decided it was not the right tool for me and used PowerShell instead. While researching site migration tools, I discovered “Merlin”. Merlin uses YAML configuration files to crawl a site and create structured JSON representations of its pages. It would be easy to create a PowerShell script to import this data, which makes Merlin a fast way to create the basic structure of your site: items in the content tree, the correct hierarchy of pages, page titles and metadata, and basic page content. I’ll use the blogs.perficient.com site to show some examples.
Setup
Merlin is a PHP-based tool. Setup was super easy! I used Chocolatey to install PHP and Composer.
choco install php
choco install composer
I downloaded the tool from GitHub. Under the assets section, click “merlin-framework”. This file is a PHP archive (phar) and runs from the command line or PowerShell with the php.exe command.
Crawl
Merlin includes a crawler that can be used to create a list of URLs for the generate feature. The crawler can be configured (https://salsadigitalauorg.github.io/merlin-framework/docs/crawler) to cache results, include or exclude specific URLs by regex pattern, follow redirects, and ignore the robots.txt file.
To run the crawl command, navigate to the directory where you downloaded the tool. Use the php executable to run the tool. The -c option specifies your crawler config. The -o option specifies where to put the output files.
PS > C:\tools\php81\php.exe merlin-framework crawl -c .\prft\crawler_blogs_merlin.yml -o .\output\prft
You can use multiple crawl configs to crawl different sections of your site. The entity_type option sets the name of the output file: “blogs_site_structure” creates a URL list file called “crawled-urls-blogs_site_structure_default.yml”. This makes it easier to break the site into groups of pages that have similar DOM layouts (e.g. blogs, news, products) for the generate command.
In this example, I limited my results to 50 as well as limiting the crawl to specific directories. You can use the crawler_include and include options to limit what urls will end up in the output file.
---
domain: https://blogs.perficient.com
entity_type: blogs_site_structure
options:
  cache_enabled: true
  delay: 500
  maximum_total: 50
  urls:
    - /2023/06/
    - /2023/05/
    - /2023/04/
    - /2023/03/
    - /2023/03/
    - /2023/02/
    - /2023/01/
    - /2022/12/
    - /2022/11/
    - /2022/10/
    - /2022/09/
    - /2022/08/
    - /2022/07/
    - /2022/06/
  crawler_include:
    - "~/202[23]/\d{2}/\d{2}/.*~"
  include:
    - "~/202[23]/\d{2}/\d{2}/.*~"
Below is an excerpt from my output file. The include attribute prevented https://blogs.perficient.com/2023/06/ and similar pages from being added to the output.
---
urls:
  - https://blogs.perficient.com/2023/06/11/install-docker-on-an-amazon-ec2-instance-using-the-yum-package-manager/
  - https://blogs.perficient.com/2023/06/10/introduction-to-terraform-day-1/
  - https://blogs.perficient.com/2023/06/09/unlocking-digital-accessibility-exploring-the-power-of-cognitive-assistive-technologies-2-2/
  - https://blogs.perficient.com/2023/06/09/unlocking-digital-accessibility-exploring-the-power-of-cognitive-assistive-technologies-2/
  - https://blogs.perficient.com/2023/06/09/oci-gen-2-refresh-token-setup-troubleshooting/
  - https://blogs.perficient.com/2023/06/09/what-if-college-was-just-a-pit-stop-an-interview-with-luca-ranzani/
  - https://blogs.perficient.com/2023/06/09/the-ev-leadership-of-luca-ranzini-proves-automotive-has-a-bright-future/
  - https://blogs.perficient.com/2023/06/09/unleashing-creativity-through-constraints/
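To see why the month archive pages drop out, here is a rough Python approximation of the include pattern from the config. Merlin evaluates the pattern in PHP, where the leading and trailing ~ characters are regex delimiters, so the sketch strips them before compiling. The function name is_included is hypothetical and this is not Merlin's own code.

```python
import re

# The include pattern from the crawler config. In PHP the leading and
# trailing "~" are regex delimiters, so they are removed for Python.
PHP_PATTERN = r"~/202[23]/\d{2}/\d{2}/.*~"
INCLUDE = re.compile(PHP_PATTERN.strip("~"))


def is_included(url: str) -> bool:
    """Return True when the URL contains a /year/month/day/ segment."""
    return INCLUDE.search(url) is not None


urls = [
    "https://blogs.perficient.com/2023/06/",
    "https://blogs.perficient.com/2023/06/10/introduction-to-terraform-day-1/",
]
for url in urls:
    print(is_included(url), url)
```

The archive URL has only a year and a month, so it fails the pattern, while the post URL carries the extra two-digit day segment and passes.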
Generate
In order to generate your structured data, you need a list of URLs to process and a list of field mappings. The generator can be configured (https://salsadigitalauorg.github.io/merlin-framework/docs/getting-started) to cache results, read from the cache to make future processing faster, use CSS selectors or XPath selectors to map fields to the DOM, and perform post-processing on the data.
To run the generate command, navigate to the directory where you downloaded the tool. Use the php executable to run the tool. The -c option specifies your generate config. The -o option specifies where to put the output files.
PS> C:\tools\php81\php.exe merlin-framework generate -c .\prft\blogs_merlin.yml -o .\output\prft
Merlin defines several special data types to make mapping easy.
- alias – The url of the page
- link – Reads the href attribute and text of an anchor tag
- long_text – Reads multiline text and rich text fields
- meta – Reads the content attribute of a meta tag
- media – Generates a separate file for media items grouped by the specified type
- static_value – Outputs a string to the output as entered
- taxonomy_term – Generates a separate file for taxonomy terms grouped by the specified type
- text – Reads a single-line text field
The text and long_text have extra processors that can be applied to the result.
- nl2br – Changes new lines to br tags
- remove_empty_tags – Removes any empty tags (e.g. <p></p>)
- replace – Regular expression-based string replacement (NOTE: this uses PHP’s preg_replace function, which behaves differently from the regular expression engine in .NET)
- strip_tags – Removes tags not in the allowed_tags list
- whitespace – Removes extra whitespace characters
Compare the following config to the source of this page and see if you can match up the fields to the html source.
---
domain: https://blogs.perficient.com
urls:
  - /2023/05/30/perficient-included-in-two-commerce-focused-idc-market-glances/
  - /2023/05/16/getting-to-know-sitecore-search-part-4/
urls_file:
  - "..\output\prft\effective-urls-blogs_site_structure_default.yml" #relative to the location of this file
fetch_options:
  delay: 500
  ignore_ssl_errors: true
entity_type: blogs
mappings:
  - field: url
    type: alias
  - field: sitecore_root_path
    type: static_value
    options:
      value: "/sitecore/content/tenant/site/home/blogs"
  - field: sitecore_template_id
    type: static_value
    options:
      value: "{B69EDBAD-2FDF-4120-B13C-CDDF4B127B9F}"
  - field: meta_title
    type: text
    selector: //title
  - field: meta_keywords
    type: meta
    options:
      value: keywords
      attr: name
  - field: meta_description
    type: meta
    options:
      value: description
      attr: name
  - field: meta_og_title
    type: meta
    options:
      value: og:title
      attr: property
  - field: meta_og_description
    type: meta
    options:
      value: og:description
      attr: property
  - field: meta_og_sitename
    type: meta
    options:
      value: og:site_name
      attr: property
  - field: meta_og_sitename
    type: meta
    options:
      value: og:site_name
      attr: property
  - field: meta_og_type
    type: meta
    options:
      value: og:type
      attr: property
  - field: meta_og_image
    type: meta
    options:
      value: og:image
      attr: property
  - field: meta_published_time
    type: meta
    options:
      value: article:published_time
      attr: property
  - field: featured_image
    type: media
    selector: div.story-two-header-content-img img
    options:
      file: src
      alt: alt
      type: featured_images
  - field: title
    selector: h1:first-of-type
    type: text
    processors:
      - processor: nl2br
  - field: primary_category
    selector: p.eyebrow-header-eyebrow
    type: text
  - field: author
    selector: h4.byline span.author a
    type: text
  - field: date
    selector: h4.byline span.date
    type: text
  - field: content
    selector: div.entry
    type: long_text
    processors:
      - processor: nl2br
      - processor: remove_empty_tags
      - processor: whitespace
  - field: content_images
    type: media
    selector: div.entry img
    options:
      file: src
      alt: alt
      type: content_images
  - field: author_page
    type: link
    selector: div.author-avatar-and-name-avatar a
    options:
      link: href
  - field: author_image
    type: media
    selector: div.author-avatar-and-name-avatar img
    options:
      file: src
      alt: alt
      type: author_images
  - field: author_bio
    selector: div.author-avatar-and-name-description p:first-of-type
    type: text
    processors:
      - processor: replace
        pattern: "More from this Author"
  - field: categories
    selector: //div[@class="widget"]//ul/li #Taxonomy_term only works with xpath selector
    type: taxonomy_term
    vocab: category
    children:
      - field: uuid
        type: uuid
        selector: a
      - field: name
        type: text
        selector: a
  - field: tags
    selector: div.tags-author-info a
    type: text
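To make the meta mappings concrete, here is a small Python sketch of what a mapping with attr and value options selects: the content attribute of the meta tag whose name or property attribute equals the given value. This only illustrates the idea and is not how Merlin is implemented; MetaReader, read_meta, and the sample HTML fragment are all made up for the example.

```python
from html.parser import HTMLParser


class MetaReader(HTMLParser):
    """Collect the content of <meta> tags, keyed by (attribute, value)."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        content = attrs.get("content")
        for attr in ("name", "property"):
            if attr in attrs and content is not None:
                self.found[(attr, attrs[attr])] = content


def read_meta(html: str, attr: str, value: str):
    """Mimic a meta mapping: return the content of the matching tag."""
    reader = MetaReader()
    reader.feed(html)
    return reader.found.get((attr, value))


# Hypothetical page fragment, not copied from the real site.
sample = """
<head>
  <meta name="description" content="Dig into the weeds of managing your search sources.">
  <meta property="og:type" content="article">
</head>
"""
print(read_meta(sample, "name", "description"))
print(read_meta(sample, "property", "og:type"))
```

A tag that is missing from the page simply yields no value, which is a useful thing to check for when a mapped field never shows up in the output.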
The output file is in JSON format with your field names as the keys and the content of your selectors as the value. I included two fields that will help make it easier to import this data into Sitecore.
- sitecore_root_path – A static_value that contains the path to use as the parent when creating the new page in Sitecore
- sitecore_template_id – A static_value that contains the id of the template to use when creating the new page in Sitecore
[
  {
    "url": "/2023/05/16/getting-to-know-sitecore-search-part-4/",
    "sitecore_root_path": "/sitecore/content/tenant/site/home/blogs",
    "sitecore_template_id": "{B69EDBAD-2FDF-4120-B13C-CDDF4B127B9F}",
    "meta_title": "Getting to know Sitecore Search – Part 4 / Blogs / Perficient",
    "meta_description": "Dig into the weeds of managing your search sources. Learn about triggers, extractors, attributes, excluding urls, and scan frequency.",
    "meta_og_title": "Getting to know Sitecore Search – Part 4 / Blogs / Perficient",
    "meta_og_description": "Dig into the weeds of managing your search sources. Learn about triggers, extractors, attributes, excluding urls, and scan frequency.",
    "meta_og_sitename": "Perficient Blogs",
    "meta_og_type": "article",
    "meta_og_image": "https://blogs.perficient.com/files/forest-simon-TX0ufDSCV4-unsplash-scaled.jpg",
    "meta_published_time": "2023-05-16T13:30:04+00:00",
    "featured_image": [
      "55789139-9512-33f6-bf85-cd1897eb36fa"
    ],
    "title": "Getting to know Sitecore Search – Part 4",
    "primary_category": "Sitecore",
    "author": "Eric Sanner",
    "date": "May 16th, 2023",
    "content": {
      "format": "rich_text",
      "value": "<div id=\"bsf_rt_marker\"><p>Welcome back to getting to know Sitecore search</p>shortened for brevity<p>In the next post, we’ll build a simple UI and connect to the api to get our first real results!</p></div>"
    },
    "content_images": [
      "b1b335ec-95a7-3609-89d0-65805abd3c68",
      "2eeacd44-f2d4-337c-86eb-ee1249bcf10b",
      "974e0ccb-ce39-3fc2-bfd7-299dbfd1f9b1",
      "379645ef-b0d8-3fa9-8088-49974bc4312e"
    ],
    "author_page": [
      {
        "link": "https://blogs.perficient.com/author/esanner/",
        "text": ""
      }
    ],
    "author_image": [
      "43d4f615-da07-3a01-ad1f-5f53d4b11835"
    ],
    "author_bio": "",
    "categories": [
      "205dd6d7-887c-3501-b20a-3a2137437a47",
      "00ba851e-b649-30d3-902a-3a32d230110f",
      "8492492a-25c4-3c45-95a1-59c3d6b59620"
    ],
    "tags": [
      "Sitecore",
      "Sitecore.Search"
    ]
  },
  {
    "url": "/2023/05/23/the-dialogue-element-modals-made-simple/",
    "sitecore_root_path": "/sitecore/content/tenant/site/home/blogs",
    "sitecore_template_id": "{B69EDBAD-2FDF-4120-B13C-CDDF4B127B9F}",
    "meta_title": "The Dialogue Element: Modals Made Simple / Blogs / Perficient",
    "meta_description": "The new dialogue element makes modals simple. Learn to create user-friendly modals with ease using the new HTML dialogue element.",
    "meta_og_title": "The Dialogue Element: Modals Made Simple / Blogs / Perficient",
    "meta_og_description": "The new dialogue element makes modals simple. Learn to create user-friendly modals with ease using the new HTML dialogue element.",
    "meta_og_sitename": "Perficient Blogs",
    "meta_og_type": "article",
    "meta_og_image": "https://blogs.perficient.com/files/Group-of-People-Holding-Speech-bubbles-scaled.jpg",
    "meta_published_time": "2023-05-23T13:17:20+00:00",
    "featured_image": [
      "f8a33d23-893e-3625-8665-76bf657c8e72"
    ],
    "title": "The Dialogue Element: Modals Made Simple",
    "primary_category": "Accessibility",
    "author": "Drew Taylor",
    "date": "May 23rd, 2023",
    "content": {
      "format": "rich_text",
      "value": "<div id=\"bsf_rt_marker\"><p>Any front-end developer has likely experienced the pain of covering an exhaustive list of accessibility and UI edge cases while implementing modals. Well guess what? Not any longer.</p>shortened for brevity<p>The dialog is just another HTML element. Style it with CSS just as any other HTML element.</p></div>"
    },
    "content_images": [
      "add188a4-2d87-32ae-9a63-77bbfbaef6fb"
    ],
    "author_page": [
      {
        "link": "https://blogs.perficient.com/author/drewtaylor/",
        "text": ""
      }
    ],
    "author_image": [
      "3cbd7fd0-687e-3856-b212-38848e781de0"
    ],
    "author_bio": "Drew is a technical consultant at Perficient. He enjoys writing code and books, talking AI, and advocating accessibility.",
    "categories": [
      "f47d9f1b-dc4f-31fd-b896-3fa00fe4d304"
    ],
    "tags": [
      "accessibility",
      "AI",
      "modal",
      "UX"
    ]
  }
]
The taxonomy_term field type creates a guid value for the category and outputs the mapping in a separate file. These items could be imported into Sitecore first so the content can be mapped to the correct category.
{
  "data": [
    { "uuid": "205dd6d7-887c-3501-b20a-3a2137437a47", "name": "Technical" },
    { "uuid": "00ba851e-b649-30d3-902a-3a32d230110f", "name": "Development" },
    { "uuid": "8492492a-25c4-3c45-95a1-59c3d6b59620", "name": "Sitecore" },
    { "uuid": "f47d9f1b-dc4f-31fd-b896-3fa00fe4d304", "name": "Accessibility" }
  ]
}
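As a rough idea of how an import script could consume these two files, the Python sketch below resolves a page's category uuids to names and builds a target path in the content tree. It uses trimmed copies of the sample output above. Using the last URL segment as the item name is my assumption, and category_names and item_path are hypothetical helpers, not part of Merlin or Sitecore.

```python
import json

# Trimmed copies of the sample output above; a real script would
# json.load() the generated files instead.
pages = json.loads("""
[
  {
    "url": "/2023/05/16/getting-to-know-sitecore-search-part-4/",
    "sitecore_root_path": "/sitecore/content/tenant/site/home/blogs",
    "sitecore_template_id": "{B69EDBAD-2FDF-4120-B13C-CDDF4B127B9F}",
    "categories": [
      "205dd6d7-887c-3501-b20a-3a2137437a47",
      "8492492a-25c4-3c45-95a1-59c3d6b59620"
    ]
  }
]
""")
taxonomy = json.loads("""
{ "data": [
  { "uuid": "205dd6d7-887c-3501-b20a-3a2137437a47", "name": "Technical" },
  { "uuid": "8492492a-25c4-3c45-95a1-59c3d6b59620", "name": "Sitecore" }
] }
""")


def category_names(page, taxonomy):
    """Swap the generated category uuids for their names."""
    names_by_uuid = {term["uuid"]: term["name"] for term in taxonomy["data"]}
    return [names_by_uuid[uuid] for uuid in page["categories"]]


def item_path(page):
    """Build a content tree path: root path plus the last URL segment."""
    slug = page["url"].strip("/").split("/")[-1]
    return page["sitecore_root_path"] + "/" + slug


for page in pages:
    print(item_path(page))
    print(page["sitecore_template_id"], category_names(page, taxonomy))
```

An actual import would hand the resulting path and template id to whatever creates the item in Sitecore, after the category items themselves have been created.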
Conclusion
Merlin is a really neat tool! I’m excited to use it next time I’m on a site migration project. I believe it has the potential to save a lot of time migrating content. How happy would content authors be if they didn’t have to create each new page and then set the name, display name, title, and metadata manually? Creating the configuration files takes some effort, since you need to tweak them as you add more fields and adjust selectors to get exactly the right output. The caching feature helps you avoid overloading the source website with requests while you do that. Once you have an idea of how the field types work, it becomes easier to create new configurations.