Importing data into CMS with a scheduled job

This is a short guide on how to create and update pages in the CMS programmatically, intended for developers who are new to the CMS.

Why import data and create pages?

The most common reason is a requirement to import data from an external source and show it on the site: users, documents from SharePoint, press releases, open positions at the company, or similar.
This can be done in two ways: either fetch the external data with a scheduled job and create pages in the CMS for it, or have a single page in the CMS that loads the relevant data from the external data source on every request. I often prefer creating pages in the CMS because it gives good performance at all times and the site keeps running even if the external data source is down for a few minutes. This blog post is about that use case.
Avoid storing external data in the CMS if you have 10k+ items. In that case I would probably use an EF database solution instead.

How?

Create a scheduled job that will run the import.
Use the ScheduledPlugIn attribute and inherit from ScheduledJobBase. Set a unique GUID (you can generate one online) and a name and description for the administrators. The GUID makes it possible to change the class name and namespace later if you need to, so don't forget it.
```
[ScheduledPlugIn(DisplayName = "Import data", Description = "Imports great data about ....", GUID = "9d074410-05c1-4125-a09d-1170dd531234")]
public class ImportJob : ScheduledJobBase
{
   ...
}
```
Create a separate root page in the content tree that will contain the created pages.
Add a setting on start page that points to the root page for the import.
Use this setting in the scheduled job to find the root page where items should be imported.
Get the items from the data source and create a separate page type with the relevant properties that need to be stored. Think of it like modelling a table in a database. Try to avoid storing more than one piece of information per field if possible. One property for first name and one for last name is better than a single Name property that merges them together. If you have more than one type of object, create a second page type to store the second object in instead of adding separate fields to the first. Use inheritance between the page types if it makes sense.
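As a rough sketch of the modelling advice above, the page types could look something like this. The class names, property names, and the GUID are placeholders made up for illustration, not taken from a real project; only `PageData`, `ContentType`, and `Display` come from the CMS and .NET.

```csharp
using System.ComponentModel.DataAnnotations;
using EPiServer.Core;
using EPiServer.DataAnnotations;

// Hypothetical shared base: every imported page keeps the key from the external system.
public abstract class ImportedPageBase : PageData
{
    [Display(Name = "External id", Order = 10)]
    public virtual string ExternalId { get; set; }
}

// Hypothetical page type for one kind of imported object.
// The GUID is a placeholder; generate your own.
[ContentType(DisplayName = "Imported person", GUID = "11111111-1111-1111-1111-111111111111")]
public class ImportedPersonPage : ImportedPageBase
{
    // One piece of information per property, like columns in a table.
    [Display(Name = "First name", Order = 20)]
    public virtual string FirstName { get; set; }

    [Display(Name = "Last name", Order = 30)]
    public virtual string LastName { get; set; }
}
```

A second kind of object would get its own class inheriting from the same base, rather than extra fields on `ImportedPersonPage`.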

Clear the old import if needed using the .Delete method on the contentRepository.

```
var children = contentRepository.GetChildren<PageData>(siteSettings.ParentPage, new System.Globalization.CultureInfo("sv"));
if (children.Any())
{
    foreach (var child in children)
    {
        // true = delete permanently instead of moving to the trash
        contentRepository.Delete(child.ContentLink, true);
    }
}
```

Use the IContentRepository to save the pages:
```
 var pageToImport = contentRepository.GetDefault<ImportedPageType>(siteSettings.ParentPage, new System.Globalization.CultureInfo("sv"));
 pageToImport.Name = data.Name;
 pageToImport.CustomDataToImport = data.ImportantData;
 contentRepository.Save(pageToImport, EPiServer.DataAccess.SaveAction.Publish | EPiServer.DataAccess.SaveAction.SkipValidation, EPiServer.Security.AccessLevel.NoAccess);
```
To get a new page to store the data in, use the GetDefault method and specify the content type you wish to use and the parent page and language branch.
Fill those properties with data from the external data source.
Use the .Save method on the contentRepository to store it in the CMS. If you don't pass AccessLevel.NoAccess, which skips the access check, it's likely the scheduled job won't work when running automatically, since a scheduled job runs as an anonymous user by default. It's also possible to set PrincipalInfo.CurrentPrincipal if you need to run the scheduled job as a different user. If the job works when started manually but fails when running automatically, this is normally the cure.
```
if (HttpContext.Current == null)
{
     PrincipalInfo.CurrentPrincipal = new GenericPrincipal(
       new GenericIdentity("Scheduled job service account"),
       new[] { "Administrators" });
}
```
Also notice the SkipValidation flag in the .Save call earlier. It is not mandatory, but when migrating you often want to bring data across even if it doesn't pass validation, to be sure you have everything. If you need to skip validation for that page type when creating pages automatically, this is the flag to use.

The SaveAction.Publish flag makes sure the new pages are visible on the site. If you want an editor to review them first, use SaveAction.CheckIn or SaveAction.RequestApproval instead. The latter is only relevant if you are using approval workflows on that content, which is pretty rare but does happen.
Try to avoid having more than 100 pages below a single parent. Edit mode doesn't work well above that. Depending on what type of pages you have, structure them with additional folders by date, category, or alphabetically to avoid this.
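One way to pick the folder for each item is to bucket by first letter before creating any pages. This is only a sketch of the alphabetical variant; `FolderBuckets`, `BucketFor`, and `GroupByBucket` are hypothetical names, and creating the actual container pages in the CMS is left out.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class FolderBuckets
{
    // Returns the container name for an item: its first letter upper-cased,
    // or "#" for names that are empty or do not start with a letter.
    public static string BucketFor(string name)
    {
        if (string.IsNullOrWhiteSpace(name)) return "#";
        char first = char.ToUpperInvariant(name.Trim()[0]);
        return char.IsLetter(first) ? first.ToString() : "#";
    }

    // Groups names per bucket so each group can be created under its own container page.
    public static Dictionary<string, List<string>> GroupByBucket(IEnumerable<string> names)
    {
        return names
            .GroupBy(n => BucketFor(n))
            .ToDictionary(g => g.Key, g => g.ToList());
    }
}
```

If a single letter still ends up with far more than 100 items, the same idea can be applied one level down, for example on the first two letters.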
Log everything! Importing data from an external source can be tricky to debug. Add plenty of logging from the start to save some time bug hunting later.
```
  private ILogger importLog = LogManager.GetLogger(typeof(ImportJob));
```
```
 importLog.Information($"GetAllData webservice returned {instructions.Count()} items");
```
Make sure the logging itself can't throw an error. For instance, instructions in the call above must never be null when Count() is called on it. If it is, the logging line throws, the job fails, and the log entry that would have told you why is never written, which makes the problem difficult to find. This happened to me recently.
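A small null-safe helper is one way to guard against this. The sketch below assumes the web service result is an `IEnumerable<T>` that may come back as null; `ImportLogHelpers`, `SafeCount`, and `DescribeResult` are hypothetical names, not part of the CMS.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ImportLogHelpers
{
    // Returns 0 instead of throwing when the collection is null.
    public static int SafeCount<T>(IEnumerable<T> items)
    {
        return items?.Count() ?? 0;
    }

    // Builds the log message without ever throwing on a null collection,
    // and makes the null case visible in the log.
    public static string DescribeResult<T>(string serviceName, IEnumerable<T> items)
    {
        return items == null
            ? $"{serviceName} webservice returned null"
            : $"{serviceName} webservice returned {items.Count()} items";
    }
}
```

The log call then becomes `importLog.Information(ImportLogHelpers.DescribeResult("GetAllData", instructions));`, which writes a useful line whether or not the service returned anything.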
For larger migrations it's usually good to limit the blast radius. If possible, start with a subset of the data that affects fewer end users and launch that first. When that is stable, continue with the rest.
Remember to test performance with realistic amounts of data early. Autogenerating fake content with a scheduled job like the one above is a good idea. Measure, improve, and measure again until it's fast enough.
For large numbers of children, remember this method:
```
 contentRepository.GetBySegment(parentLink, "id-of-item", new System.Globalization.CultureInfo("sv"));
```
As long as you know the id of the item, it's pretty fast to get it by its URL segment (this assumes the page's URL segment was set to the item's id when it was imported). If possible, avoid using GetChildren() if there are 100+ children.
Work with a Dictionary<TKey, TValue> instead of a List<T> if you are loading all items and doing lookups by id in large collections. Cache it!
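To illustrate the dictionary advice, here is a minimal sketch that indexes the external items by id once, so each later lookup is a hash lookup instead of a scan through the list. `ExternalItem` and `ImportIndex` are made-up names for the example.

```csharp
using System.Collections.Generic;

// Hypothetical shape of an item from the external source.
public class ExternalItem
{
    public string Id { get; set; }
    public string Name { get; set; }
}

public static class ImportIndex
{
    // Builds the lookup once in O(n). Later lookups are O(1) on average,
    // versus O(n) for a search through a List<T> on every call.
    public static Dictionary<string, ExternalItem> BuildById(IEnumerable<ExternalItem> items)
    {
        var byId = new Dictionary<string, ExternalItem>();
        foreach (var item in items)
        {
            // Indexer assignment: the last item wins on duplicate ids instead of throwing.
            byId[item.Id] = item;
        }
        return byId;
    }
}
```

Lookups then use `byId.TryGetValue(id, out var item)`, which also handles ids that are missing from the source without throwing.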
Test run it and show off your new stable solution to the customer!

I hope this post helps someone looking to do their first import job to the CMS. Leave a comment if it does, or if you want me to add something you feel is missing.

Happy coding!

Jun 15, 2022
